CogView3 & CogView-3Plus

Read this in Chinese

📚 Check out the paper

👋 Join our WeChat

📍 Visit Qingyan and API Platform for larger-scale commercial video generation models.

Project Updates

🔥 2024/9/29: We have open-sourced CogView3 and CogView-3Plus-3B. CogView3 is a text-to-image system based on cascaded diffusion, using a relay diffusion framework. CogView-3Plus is a new series of text-to-image models based on Diffusion Transformers.

Model Introduction

CogView-3-Plus builds upon CogView3 (ECCV'24) by introducing the latest DiT framework for further overall performance improvements. CogView-3-Plus uses the Zero-SNR diffusion noise scheduling and incorporates a joint text-image attention mechanism. Compared to the commonly used MMDiT structure, it effectively reduces training and inference costs while maintaining the model's basic capabilities. CogView-3Plus utilizes a VAE with a latent dimension of 16.

The table below shows the list of text-to-image models we currently offer along with their basic information. At present, all models are only available in the SAT version, but we are participating in the development of the diffusers version.

Model Name	CogView3-Base-3B	CogView3-Base-3B-distill	CogView3-Plus-3B
Model Description	The base and relay stage models of CogView3, supporting 512x512 text-to-image generation and 2x super-resolution generation.	The distilled version of CogView3, with 4 and 1 step sampling in two stages (or 8 and 2 steps).	The DiT version image generation model, supporting image generation ranging from 512 to 2048.
Resolution	512 * 512		512 <= H, W <= 2048 H * W <= 2^{21} H, W \mod 32 = 0
Inference Precision	FP16 (recommended), BF16, FP32		*BF16 (recommended)**, FP16, FP32
Memory Usage (bs = 4)	17G	64G	30G (2048 * 2048) 20G (1024 * 1024)
Prompt Language	English*
Maximum Prompt Length	225 Tokens		224 Tokens
Download Link (SAT)	SAT

Data Explanation

All inference tests were conducted on a single A100 GPU with a batch size of 4, using PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to save memory.
The models only support English input. Other languages can be translated into English when refining with large models.
This test environment uses the SAT framework. Many optimization points are not yet complete, and we will work with the community to create a version of the model for the diffusers library. Once the diffusers repository is supported, we will test using diffusers. The release is expected in November 2024.

Quick Start

Prompt Optimization

Although CogView3 series models are trained with long image descriptions, we highly recommend rewriting prompts using large language models (LLMs) before generating text-to-image, as this will significantly improve generation quality.

We provide an example script. We suggest running this script to refine the prompt:

python prompt_optimize.py --api_key "Zhipu AI API Key" --prompt {your prompt} --base_url "https://open.bigmodel.cn/api/paas/v4" --model "glm-4-plus"

Inference Model (SAT)

Please check the sat tutorial for step-by-step instructions on model inference.

Open Source Plan

Since the project is in its early stages, we are working on the following:

SAT version model fine-tuning, including SFT and Lora fine-tuning
Diffuser library version model reasoning, fine-tuning

CogView3 (ECCV'24)

Official paper repository: CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

CogView3 is a novel text-to-image generation system using relay diffusion. It breaks down the process of generating high-resolution images into multiple stages. Through the relay super-resolution process, Gaussian noise is added to low-resolution generation results, and the diffusion process begins from these noisy images. Our results show that CogView3 outperforms SDXL with a winning rate of 77.0%. Additionally, through progressive distillation of the diffusion model, CogView3 can generate comparable results while reducing inference time to only 1/10th of SDXL's.

Comparison results from human evaluations:

Citation

🌟 If you find our work helpful, feel free to cite our paper and leave a star.

@article{zheng2024cogview3,
  title={Cogview3: Finer and faster text-to-image generation via relay diffusion},
  author={Zheng, Wendi and Teng, Jiayan and Yang, Zhuoyi and Wang, Weihan and Chen, Jidong and Gu, Xiaotao and Dong, Yuxiao and Ding, Ming and Tang, Jie},
  journal={arXiv preprint arXiv:2403.05121},
  year={2024}
}

We welcome your contributions! Click here for more information.

Model License

This codebase is released under the Apache 2.0 License.

The CogView3-Base, CogView3-Relay, and CogView3-Plus models (including the UNet module, Transformers module, and VAE module) are released under the Apache 2.0 License.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.github		.github
resources		resources
sat		sat
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
prompt_optimize.py		prompt_optimize.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CogView3 & CogView-3Plus

Project Updates

Model Introduction