
Summary of CogVideoX-5B-I2V-v1.5 inference and fine-tuning issues: vae_scaling_factor_image and vertical video #761


JunyaoHu opened this issue Apr 13, 2025 · 1 comment

JunyaoHu commented Apr 13, 2025

Dear model researchers,

Hello! I ran into some problems while using the CogVideoX-5B-I2V-v1.5 model. By searching the issues in this repository and in related repositories I found some preliminary solutions, but after summarizing them I still have the following questions and hope they can be resolved.

  • Differences between the SAT model and the diffusers model

  • CogVideoX 1.5 diffusers LoRA fine-tuning issue

    • Question 2: if the fix described above is correct, how should LoRA fine-tuning training and inference be carried out? (A sketch contrasting the two options appears after this list.)

      • Option 1

        • For LoRA fine-tuning training: manually edit the original LoRA training code to remove the part that multiplies the image latent by self.vae_scaling_factor_image

        • For LoRA fine-tuned inference: use 1.0 * image_latents

      • Option 2

        • For LoRA fine-tuning training: keep the original LoRA training code unchanged, i.e. keep multiplying the image latent by self.vae_scaling_factor_image

        • For LoRA fine-tuned inference: keep the original pipeline_cogvideox_image2video.py unchanged and use 1 / self.vae_scaling_factor_image * image_latents

    • Background: in the fine-tuning code, every LoRA fine-tuning script already multiplies the image latent by vae_scaling_factor, so the scaling was not forgotten there; that is why the latent later has to be divided by the same factor, which means the effective factor during fine-tuning is 1.0. (Did the official team simply not apply the factor when pre-training the model?)

    • Reference code

  • CogVideoX 1.5 I2V cannot run inference for vertical videos

    • Question 3: is the solution below correct?

    • Other related issues

    • Symptom: vertical videos cannot be generated (width 480 x height 720 runs, but the results are poor); other aspect ratios fail with errors, for example:

      • --width 768 --height 1360: fails with RuntimeError: Sizes of tensors must match except in dimension 3. Expected size 85 but got size 48 for tensor number 1 in the list.
      • --width 768 --height 1080: fails with the same kind of error: RuntimeError: Sizes of tensors must match except in dimension 3. Expected size 67 but got size 48 for tensor number 1 in the list.
      • --width 768 --height 960: fails with the same kind of error: RuntimeError: Sizes of tensors must match except in dimension 3. Expected size 60 but got size 48 for tensor number 1 in the list.
    • Cause: the RoPE (rotary position embedding) logic assumes sample_width is larger than sample_height; the defaults are 170 and 96 respectively.

      • Solution: edit the transformer configuration at CogVideoX1.5-5B-I2V/transformer/config.json. To generate vertical videos, set the following (a small helper script for this edit is sketched after the list):

        {
          ...,
          "sample_height": 170,
          "sample_width": 96,
          ...,
        }
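
For reference, here is a minimal sketch contrasting the two options at inference time. The attribute names follow pipeline_cogvideox_image2video.py, but the surrounding code is assumed rather than copied from the repository, so this only restates the two options and is not a verified fix.

    # Hypothetical sketch of a prepare_latents-style step in the I2V pipeline.
    # Names mirror pipeline_cogvideox_image2video.py; surrounding code is assumed.
    image_latents = self.vae.encode(image).latent_dist.sample()

    # Option 1: the LoRA was trained WITHOUT the self.vae_scaling_factor_image
    # multiply, so inference also leaves the image latents unscaled.
    image_latents = 1.0 * image_latents

    # Option 2: the LoRA was trained with the original code (multiply kept);
    # keep the stock pipeline and divide the factor back out at inference:
    # image_latents = (1.0 / self.vae_scaling_factor_image) * image_latents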
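
If editing the JSON by hand is inconvenient, a small throwaway script along the following lines could apply the swap. The path and key names come from the config snippet above; everything else is an assumption and untested.

    import json

    # Hypothetical helper: swap sample_height / sample_width in the transformer
    # config so portrait (vertical) resolutions pass the RoPE size assumption.
    # Adjust the path to your local checkpoint location.
    config_path = "CogVideoX1.5-5B-I2V/transformer/config.json"

    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    # Defaults are sample_height=96, sample_width=170 (landscape); for vertical
    # videos the workaround above sets sample_height=170, sample_width=96.
    config["sample_height"] = 170
    config["sample_width"] = 96

    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
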
@OwalnutO

For the first question, it seems that for the I2V model the input image condition should not be multiplied by the scale. So during training the video latent should be multiplied by the scale, but the image condition should not. I'm not fully sure and will try it.
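
A rough sketch of what that would look like in the training-time encode step (variable names are placeholders and this is untested):

    # Hypothetical training-time encode: scale the video latent, but leave the
    # image-condition latent unscaled. `vae`, `video`, and `image` are
    # placeholders; scaling_factor is presumably the same quantity as
    # vae_scaling_factor_image in the pipeline.
    scaling_factor = vae.config.scaling_factor

    video_latents = vae.encode(video).latent_dist.sample() * scaling_factor
    image_latents = vae.encode(image).latent_dist.sample()  # no scaling applied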
