CogView3Plus DiT #9570


Merged: 41 commits into huggingface:main on Oct 14, 2024
Conversation

zRzRzRzRzRzRzR
Contributor

This is a draft of CogView3-Plus and is not yet complete. The remaining work includes:

  • SAT2diffusers link
  • VAE implementation and pipeline integration
  • Automatic documentation and proofreading

Expected to be completed by October 7th.
Please keep in touch; looking forward to the community's help.
@a-r-r-o-w @yiyixuxu

@zRzRzRzRzRzRzR
Contributor Author

The current version produces the same output shape, and the model conversion script works as expected. However, there is still a lot of work to do.

  • Layer-by-layer verification of the transformer has not been done yet. The pos embed implementation is the same, but the resulting tensors differ; this is still being verified (see the hook-based comparison sketch after this list).
  • @yiyixuxu mentioned that the VAE part can be directly converted using AutoEncoderKL, so I directly used the solution she provided.
  • Regarding the scheduler, I started with DDPM, but this likely needs adjustment, because I did not see any handling of shift_scale.
  • The pipeline has not been implemented yet; I will try to get it running tomorrow.
  • The documentation is incomplete.
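Not part of this PR; just a minimal sketch of how the layer-by-layer comparison could be done with forward hooks. The model instances and the layer-name mapping below are hypothetical placeholders, not identifiers from this codebase:

```python
import torch
import torch.nn as nn


def collect_activations(model: nn.Module, inputs: dict) -> dict:
    """Run one forward pass and record every submodule's tensor output by name."""
    activations, hooks = {}, []

    def make_hook(name):
        def hook(module, args, output):
            if isinstance(output, torch.Tensor):
                activations[name] = output.detach()
        return hook

    for name, module in model.named_modules():
        hooks.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(**inputs)
    finally:
        for h in hooks:
            h.remove()
    return activations


# Hypothetical usage: `sat_model`, `diffusers_model`, their inputs, and `layer_name_mapping`
# are placeholders for illustration only.
# acts_sat = collect_activations(sat_model, sat_inputs)
# acts_diff = collect_activations(diffusers_model, diffusers_inputs)
# for sat_name, diff_name in layer_name_mapping.items():
#     delta = (acts_sat[sat_name].float() - acts_diff[diff_name].float()).abs().max()
#     print(f"{sat_name} vs {diff_name}: max abs diff = {delta:.6f}")
```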

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@a-r-r-o-w
Member

Testing script for transformer implementation:

import torch

from omegaconf import DictConfig
from sgm.modules.diffusionmodules.dit import DiffusionTransformer
from diffusers import CogView3PlusTransformer2DModel

@torch.no_grad()
def main():
    config = DictConfig({
        "in_channels": 16,
        "out_channels": 16,
        "hidden_size": 2560,
        "num_layers": 30,
        "patch_size": 2,
        "block_size": 16,
        "num_attention_heads": 64,
        "text_length": 224,
        "time_embed_dim": 512,
        "num_classes": "sequential",
        "adm_in_channels": 1536,
        "modules": {
            "pos_embed_config": {
                "target": "sgm.modules.diffusionmodules.dit.PositionEmbeddingMixin",
                "params": {
                    "max_height": 128,
                    "max_width": 128,
                    "max_length": 4096
                }
            },
            "patch_embed_config": {
                "target": "sgm.modules.diffusionmodules.dit.ImagePatchEmbeddingMixin",
                "params": {
                    "text_hidden_size": 4096
                }
            },
            "attention_config": {
                "target": "sgm.modules.diffusionmodules.dit.AdalnAttentionMixin",
                "params": {
                    "qk_ln": True
                }
            },
            "final_layer_config": {
                "target": "sgm.modules.diffusionmodules.dit.FinalLayerMixin"
            }
        },
    })

    transformer = DiffusionTransformer(**config)

    # Load the original SAT checkpoint and strip the "model.diffusion_model." prefix from parameter names
    ckpt_path_cogview3_plus = "/raid/aryan/CogView3-SAT/cogview3plus_3b/1/mp_rank_00_model_states.pt"
    state_dict = torch.load(ckpt_path_cogview3_plus)["module"]
    state_dict = {k.replace("model.diffusion_model.", ""): v for k, v in state_dict.items()}
    transformer.load_state_dict(state_dict, strict=False)
    transformer = transformer.to("cuda", dtype=torch.bfloat16)

    # Load the converted diffusers implementation
    transformer_diffusers = CogView3PlusTransformer2DModel.from_pretrained("/raid/aryan/CogView3Plus-trial/", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")

    # Parameter counts should match between the two implementations
    print(sum(p.numel() for p in transformer.parameters() if p.requires_grad))
    print(sum(p.numel() for p in transformer_diffusers.parameters() if p.requires_grad))

    # Dummy inputs with the expected shapes for a 2-sample batch
    x = torch.ones((2, 16, 128, 128), device="cuda", dtype=torch.bfloat16)
    timesteps = torch.ones((2,), device="cuda", dtype=torch.bfloat16)
    context = torch.ones((2, 224, 4096), device="cuda", dtype=torch.bfloat16)
    y = torch.ones((2, 1536), device="cuda", dtype=torch.bfloat16)

    breakpoint()
    kwargs = {'target_size': [(1024, 1024)], "idx": timesteps}
    output = transformer(x, timesteps, context, y, **kwargs)
    output_diffusers = transformer_diffusers(x, context, y, timesteps)[0]

    # Report the maximum and total absolute difference between the two outputs
    print((output - output_diffusers).abs().max(), (output - output_diffusers).abs().sum())

main()

> @yiyixuxu mentioned that the VAE part can be directly converted using AutoEncoderKL, so I directly used the solution she provided.

Based on her testing script, I've updated the conversion script. I see similar outputs between both implementations, but I think it would be good to verify this on your end as well.

> Regarding the scheduler, I first used DDPM, but this should need adjustment, because I did not see any related operations for shift_scale.

cc @yiyixuxu. I think we could support the shift_scale parameter if it makes sense in the schedulers, unless we need a different scheduler implementation like CogVideoX. A hedged sketch of one way such a shift can enter a schedule is below.
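Purely for illustration (not from this PR): one common way a shift parameter enters a noise schedule is the resolution-dependent time shift used by flow-matching schedulers such as SD3's; whether CogView3 needs this exact form is precisely the open question above.

```python
import torch


def apply_shift(sigmas: torch.Tensor, shift: float = 3.0) -> torch.Tensor:
    """Shift a (0, 1] sigma schedule toward higher noise levels (SD3-style time shift)."""
    return shift * sigmas / (1.0 + (shift - 1.0) * sigmas)


# Example: a uniform 10-step schedule before and after shifting
sigmas = torch.linspace(1.0, 1e-3, steps=10)
print(apply_shift(sigmas, shift=3.0))
```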

> The pipeline has not been implemented today, so I will try to run it in tomorrow's work.

For the label embeddings of shape [B, 1536], I think we will have to pass target_size_as_tuple, crop_coords_top_left and original_size_as_tuple to the transformer.

{'target_size_as_tuple': tensor([[1024, 1024]], device='cuda:0'), 'txt': ['an astronaut riding a horse in space'], 'crop_coords_top_left': tensor([[0, 0]], device='cuda:0'), 'original_size_as_tuple': tensor([[1024, 1024]], device='cuda:0')}

But I don't think that's a clean approach. Maybe it would be better to prepare the sinusoidal embeddings in the pipeline and then pass those; a sketch of that idea follows below. @yiyixuxu WDYT?
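Not from this PR; a minimal sketch of what "preparing the sinusoidal embeddings in the pipeline" could look like, assuming an SDXL-style scheme where six micro-conditioning values are each encoded with a 256-dim sinusoidal embedding (6 × 256 = 1536). The function name and embedding settings are assumptions for illustration, not the pipeline's actual API:

```python
import torch
from diffusers.models.embeddings import get_timestep_embedding


def build_condition_embedding(original_size, crop_coords_top_left, target_size, dtype):
    # Pack (orig_h, orig_w, crop_top, crop_left, target_h, target_w) into a [B, 6] tensor
    add_ids = torch.tensor(
        [list(original_size) + list(crop_coords_top_left) + list(target_size)], dtype=torch.long
    )
    # Sinusoidal embedding of each value (settings here mirror SDXL, not necessarily CogView3)
    emb = get_timestep_embedding(
        add_ids.flatten(), embedding_dim=256, flip_sin_to_cos=True, downscale_freq_shift=0
    )  # [B * 6, 256]
    return emb.reshape(add_ids.shape[0], -1).to(dtype)  # [B, 1536]


y = build_condition_embedding((1024, 1024), (0, 0), (1024, 1024), torch.bfloat16)
print(y.shape)  # torch.Size([1, 1536])
```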



def main(args):
if args.dtype == "fp16":
Collaborator

I think it makes sense to have dtype default to None and, by default, keep the original dtype (we have had many occasions where we accidentally upcast a model during conversion, which is very much undesired).
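A minimal sketch of that suggestion for the conversion script; only the --dtype argument comes from the snippet above, everything else (including the `transformer` variable) is illustrative:

```python
import argparse

import torch

parser = argparse.ArgumentParser()
# Default to None so the converted model keeps the checkpoint's original dtype
parser.add_argument("--dtype", type=str, default=None, choices=["fp16", "bf16", "fp32"])
args = parser.parse_args()

dtype_map = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
dtype = dtype_map[args.dtype] if args.dtype is not None else None

# Later, after the model is built (hypothetical `transformer` variable):
# if dtype is not None:
#     transformer = transformer.to(dtype=dtype)  # cast only when explicitly requested
```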

Member

Oh okay, will update

Member

Let's make sure to incorporate this change!

@yiyixuxu
Collaborator

yiyixuxu commented Oct 7, 2024

> For the label embeddings of shape [B, 1536], I think we will have to pass target_size_as_tuple, crop_coords_top_left and original_size_as_tuple to the transformer.
>
> {'target_size_as_tuple': tensor([[1024, 1024]], device='cuda:0'), 'txt': ['an astronaut riding a horse in space'], 'crop_coords_top_left': tensor([[0, 0]], device='cuda:0'), 'original_size_as_tuple': tensor([[1024, 1024]], device='cuda:0')}

What are you talking about there? @a-r-r-o-w

@a-r-r-o-w
Member

I think this is ready for review of the code parts. The additional embeddings used in the timestep conditioning are similar to SDXL, so I can refactor them out the way we do for SDXL. Apart from that, if there are any changes you'd like, please let me know. The pipeline works and inference runs fine. The outputs are a bit oversaturated and worse in quality, and we also need to validate multiple images per batch - both of which I or Yuxuan will look at tomorrow.

@a-r-r-o-w
Member

Some results:

(Image grid: six sample generations from the pipeline.)

"""


# Similar to diffusers.pipelines.hunyuandit.pipeline_hunyuandit.get_resize_crop_region_for_grid
Collaborator

we did not use this, no?


self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor)

def _get_t5_prompt_embeds(
Collaborator

Is this copied from anywhere?

)

if do_classifier_free_guidance and negative_prompt_embeds is None:
negative_prompt = negative_prompt or ""
Collaborator

I think the encode_prompt for negative prompts is different - they use zeros for empty prompts: https://github.com/THUDM/CogView3/blob/f80f1001a3bd276a7825bff30d910abeab7e593f/sat/sample_dit.py#L172

Did you look into whether it causes a difference?

Member

@a-r-r-o-w commented Oct 10, 2024

Oh yes, sorry, I had it in mind but forgot. I'll update it to something like:

if negative_prompt is None:
    negative_prompt_embeds = torch.zeros(<shape>, device=device, dtype=dtype)
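For illustration, a hedged sketch of that idea using the positive embeds as the shape reference; the helper name and variable names are assumptions, not the pipeline's final code:

```python
import torch


def default_negative_embeds(prompt_embeds: torch.Tensor, negative_prompt=None, negative_prompt_embeds=None):
    # If no negative prompt is given, use zero embeddings for the unconditional branch,
    # matching the positive embeds' shape, dtype, and device.
    if negative_prompt is None and negative_prompt_embeds is None:
        negative_prompt_embeds = torch.zeros_like(prompt_embeds)
    return negative_prompt_embeds
```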

@yiyixuxu
Collaborator

yiyixuxu commented Oct 10, 2024

Follow-up on this comment: #9570 (comment)

So I ran one example to compare using our encode_prompt vs. using the prompt embeds directly - I think there is a difference, and the original one is better.

(Image comparison: output using our encode_prompt to process the prompt | output using encoded prompt embeds from the cogview3 codebase | output from the cogview3 codebase)

Generating using the prompt:

import torch
device = torch.device("cuda:2") 
dtype = torch.bfloat16

from diffusers import CogView3PlusPipeline
pipe = CogView3PlusPipeline.from_pretrained("/raid/yiyi/cogview3_diffusers", torch_dtype=torch.bfloat16)
pipe.to(device)

latents = torch.load("/raid/yiyi/CogView3/sat/randn.pt").to(device).to(dtype)
prompt = "Portrait of a young woman with dark skin, bright violet eyes, and braided hair adorned with beads, standing in a mystical forest with glowing fireflies."

image = pipe(prompt, guidance_scale=5, latents=latents).images[0]
image.save("yiyi_test_10_out.png")

Using encoded prompt embeds from the original codebase:

import torch
device = torch.device("cuda:2") 
dtype = torch.bfloat16

from diffusers import CogView3PlusPipeline
pipe = CogView3PlusPipeline.from_pretrained("/raid/yiyi/cogview3_diffusers", torch_dtype=torch.bfloat16)
pipe.to(device)

latents = torch.load("/raid/yiyi/CogView3/sat/randn.pt").to(device).to(dtype)
prompt_embeds = torch.load("/raid/yiyi/CogView3/sat/cond.pt").to(device).to(dtype)    
negative_prompt_embeds = torch.load("/raid/yiyi/CogView3/sat/uc.pt").to(device).to(dtype)
image = pipe(prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, guidance_scale=5, latents=latents).images[0]
image.save("yiyi_test_11_out.png")

@a-r-r-o-w
Member

Test model available here: https://huggingface.co/ZP2HF/CogView-3-Plus/. Once the PR is approved, it can be moved to the THUDM org for release. A minimal usage sketch is below.
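For reference, a hedged usage snippet against the test repo linked above, assuming it follows the standard diffusers layout; the prompt and guidance scale are borrowed from earlier in this thread:

```python
import torch

from diffusers import CogView3PlusPipeline

pipe = CogView3PlusPipeline.from_pretrained("ZP2HF/CogView-3-Plus", torch_dtype=torch.bfloat16)
pipe.to("cuda")

image = pipe("an astronaut riding a horse in space", guidance_scale=5).images[0]
image.save("cogview3_plus_sample.png")
```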

@zRzRzRzRzRzRzR
Contributor Author

Does this code have issues when running in FP16? I encountered a black-image issue: there were no errors, but it output a completely black image. The logs are below.

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.02it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:06<00:00,  1.37s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:20<00:00,  2.45it/s]
/share/home/zyx/Code/diffusers/src/diffusers/image_processor.py:111: RuntimeWarning: invalid value encountered in cast
  images = (images * 255).round().astype("uint8")

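Not a fix from this thread; just a hedged debugging sketch. The RuntimeWarning above usually means NaN/Inf values appeared somewhere in the half-precision forward pass, which can be narrowed down like this:

```python
import torch


def report_non_finite(name: str, tensor: torch.Tensor) -> None:
    """Print how many NaN/Inf entries a tensor contains, if any."""
    n_nan = torch.isnan(tensor).sum().item()
    n_inf = torch.isinf(tensor).sum().item()
    if n_nan or n_inf:
        print(f"{name}: {n_nan} NaN, {n_inf} Inf out of {tensor.numel()} values")


# Hypothetical usage on intermediate tensors, e.g. inside the denoising loop:
# report_non_finite("latents", latents)
# report_non_finite("decoded image", image_tensor)
```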

Collaborator

@yiyixuxu left a comment

thanks!

num_inference_steps: int = 50,
timesteps: Optional[List[int]] = None,
guidance_scale: float = 6,
use_dynamic_cfg: bool = False,
Collaborator

Still a question: does this work? Maybe @a-r-r-o-w, you can test it out too.

@zRzRzRzRzRzRzR
Contributor Author

use_dynamic_cfg is not used in this pipeline with DDIM, I believe.

@zRzRzRzRzRzRzR
Contributor Author

zRzRzRzRzRzRzR commented Oct 14, 2024

I believe this version is now ready; my colleagues and I have verified it, and it can run the model properly. Looking forward to the merge.

@a-r-r-o-w merged commit 8d81564 into huggingface:main Oct 14, 2024
15 checks passed
sayakpaul pushed a commit that referenced this pull request Dec 23, 2024
* merge 9588

* max_shard_size="5GB" for colab running

* conversion script updates; modeling test; refactor transformer

* make fix-copies

* Update convert_cogview3_to_diffusers.py

* initial pipeline draft

* make style

* fight bugs 🐛🪳

* add example

* add tests; refactor

* make style

* make fix-copies

* add co-author

YiYi Xu <[email protected]>

* remove files

* add docs

* add co-author

Co-Authored-By: YiYi Xu <[email protected]>

* fight docs

* address reviews

* make style

* make model work

* remove qkv fusion

* remove qkv fusion tets

* address review comments

* fix make fix-copies error

* remove None and TODO

* for FP16(draft)

* make style

* remove dynamic cfg

* remove pooled_projection_dim as a parameter

* fix tests

---------

Co-authored-by: Aryan <[email protected]>
Co-authored-by: YiYi Xu <[email protected]>
@zRzRzRzRzRzRzR deleted the cogview3-plus branch January 14, 2025 06:47