Summary of the Research Paper "Goku: Flow-based Joint Image and Video Generative Foundation Models"
A family of state-of-the-art joint image-and-video generation models
Introduction
The field of video generation has seen rapid advances, driven by the development of sophisticated generative algorithms, scalable model architectures, the availability of vast amounts of internet-sourced data, and the ongoing expansion of computing capabilities. These advances have led to models capable of generating high-quality videos, with diverse applications in media content creation, advertising, video games, and world-model simulation.
This article provides an overview of Goku, a family of state-of-the-art joint image-and-video generation models that leverage rectified flow Transformers to achieve industry-leading performance. We will explore the key components of Goku, including its data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training.
Data Curation Pipeline
Goku's success is attributed in part to its comprehensive data processing pipeline, designed to construct large-scale, high-quality image and video-text datasets. The pipeline integrates multiple advanced techniques, including video and image filtering based on aesthetic scores, OCR-driven content analysis, and subjective evaluations, to ensure exceptional visual and contextual quality.
The pipeline also employs multimodal large language models (MLLMs) to generate dense and contextually aligned captions, which are subsequently refined using an additional large language model (LLM) to enhance their accuracy, fluency, and descriptive richness. This process results in a robust training dataset comprising approximately 36M video-text pairs and 160M image-text pairs, sufficient for training industry-level generative models.
Model Architecture Design
Goku employs a 3D joint image-video variational autoencoder (VAE) to compress image and video inputs into a shared latent space, facilitating unified representation. This shared latent space is coupled with a full-attention mechanism, enabling seamless joint training on images and videos. This architecture delivers high-quality, coherent outputs across both modalities, establishing a unified framework for visual generation tasks.
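The key idea is that an image can be treated as a single-frame video, so both modalities map into one latent space. The sketch below illustrates this with shape arithmetic only; the specific compression strides and latent channel count are assumptions for illustration, not values taken from the paper.

```python
# Illustrative compression ratios for a 3D joint image-video VAE.
# The paper does not specify these values here, so the 4x temporal /
# 8x spatial strides and 16 latent channels are assumptions.
T_STRIDE, S_STRIDE, LATENT_C = 4, 8, 16

def latent_shape(frames: int, height: int, width: int) -> tuple:
    """Shape of the shared latent such a VAE might produce.

    An image is handled as a single-frame video (frames == 1), so
    images and videos land in the same latent space.
    """
    t = max(1, frames // T_STRIDE)
    return (t, height // S_STRIDE, width // S_STRIDE, LATENT_C)

print(latent_shape(1, 256, 256))   # image  -> (1, 32, 32, 16)
print(latent_shape(64, 256, 256))  # video  -> (16, 32, 32, 16)
```

Because both inputs yield latents of the same layout, a single Transformer with full attention can be trained on a mix of image and video latents without modality-specific branches.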
The core of the Goku framework is a Transformer architecture, which effectively models complex temporal and spatial dependencies. The design of the Goku Transformer block builds upon GenTron, an extension of the class-conditioned diffusion transformer for text-to-image/video tasks. It includes a self-attention module for capturing inter-token correlations, a cross-attention layer to integrate textual conditional embeddings, a feed-forward network (FFN) for feature projection, and a layer-wise adaLN-Zero block that incorporates timestep information to guide feature transformations.
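The block structure described above (self-attention, cross-attention to text, an FFN, and adaLN-Zero modulation from the timestep) can be sketched as follows. This is a minimal single-head numpy sketch, not the paper's implementation: all parameter names in `p`, the layer widths, and the activation choice are illustrative assumptions.

```python
import numpy as np

def attention(q_in, kv_in, wq, wk, wv):
    """Minimal single-head scaled dot-product attention (no output projection)."""
    q, k, v = q_in @ wq, kv_in @ wk, kv_in @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

def goku_block(x, text_emb, t_emb, p):
    """One Transformer block in the style described in the text:
    self-attention, cross-attention to text embeddings, an FFN, and
    adaLN-Zero scale/shift/gate derived from the timestep embedding.
    Parameter names in `p` are hypothetical."""
    # adaLN-Zero: the timestep embedding produces per-block modulation.
    # (In real adaLN-Zero implementations the gate is initialized to zero.)
    shift, scale, gate = np.split(t_emb @ p["ada"], 3, axis=-1)

    def mod(h):  # layer norm followed by learned scale and shift
        h = (h - h.mean(-1, keepdims=True)) / (h.std(-1, keepdims=True) + 1e-6)
        return h * (1 + scale) + shift

    x = x + gate * attention(mod(x), mod(x), p["wq"], p["wk"], p["wv"])
    x = x + attention(x, text_emb, p["cq"], p["ck"], p["cv"])  # cross-attn
    x = x + np.tanh(x @ p["w1"]) @ p["w2"]                     # FFN
    return x

# Tiny illustrative forward pass with random weights.
rng = np.random.default_rng(0)
d = 8
p = {k: 0.1 * rng.standard_normal((d, d))
     for k in ("wq", "wk", "wv", "cq", "ck", "cv")}
p["ada"] = 0.1 * rng.standard_normal((d, 3 * d))
p["w1"] = 0.1 * rng.standard_normal((d, 4 * d))
p["w2"] = 0.1 * rng.standard_normal((4 * d, d))
out = goku_block(rng.standard_normal((4, d)),   # 4 visual tokens
                 rng.standard_normal((3, d)),   # 3 text tokens
                 rng.standard_normal(d), p)     # timestep embedding
```

The residual connections around each sub-layer mirror the standard pre-norm Transformer layout; only the self-attention branch is gated here for brevity.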
Flow Formulation
Goku takes a pioneering step by applying rectified flow formulation for joint image and video generation. Rectified flow is a generative modeling algorithm where a sample is progressively transformed from a prior distribution, such as a standard normal distribution, to the target data distribution. This transformation is achieved by defining the forward process as a series of linear interpolations between the prior and target distributions.
By establishing a direct, linear interpolation between data and noise, rectified flow simplifies the modeling process, providing improved theoretical properties, conceptual clarity, and faster convergence across data distributions. Goku's adoption of rectified flow demonstrates its rapid convergence in comparison to traditional diffusion-based models.
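The linear interpolation described above can be written down directly: a training point is x_t = (1 - t)·x0 + t·x1, where x0 is noise and x1 is data, and the regression target for the velocity field is the constant difference x1 - x0. The sketch below (plain numpy, not the paper's code) shows the sampling step and verifies the defining property that a perfect velocity prediction carries x_t to the data in one Euler step.

```python
import numpy as np

def rectified_flow_pair(x1, rng):
    """Sample a rectified-flow training example: interpolate linearly
    between noise x0 ~ N(0, I) and data x1. The velocity target is the
    constant difference x1 - x0."""
    x0 = rng.standard_normal(x1.shape)                        # prior sample
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1                              # linear path
    target = x1 - x0                                          # velocity
    return xt, t, target

rng = np.random.default_rng(0)
x1 = rng.standard_normal((2, 4))          # stand-in for data latents
xt, t, v = rectified_flow_pair(x1, rng)

# With the exact velocity, one Euler step from xt reaches the data:
# xt + (1 - t) * (x1 - x0) = x1.
reconstructed = xt + (1.0 - t) * v
print(np.allclose(reconstructed, x1))     # True
```

A model trained to predict `target` from `(xt, t)` thus learns straight transport paths, which is the intuition behind the fast convergence and few-step sampling attributed to rectified flow.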
Training Infrastructure Optimization
To support the training of Goku at scale, the researchers have developed a robust infrastructure tailored for large-scale model training. This infrastructure incorporates advanced parallelism strategies to manage memory efficiently during long-context training. Additionally, it employs ByteCheckpoint for high-performance checkpointing and integrates fault-tolerant mechanisms from MegaScale to ensure stability and scalability across large GPU clusters. These optimizations enable Goku to handle the computational and data challenges of generative modeling with exceptional efficiency and reliability.
Evaluation and Results
Goku has been evaluated on both text-to-image and text-to-video benchmarks, demonstrating its competitive advantages. For text-to-image generation, Goku-T2I demonstrates strong performance across multiple benchmarks, including T2I-CompBench, GenEval, and DPG-Bench, excelling in both visual quality and text-image alignment.
In text-to-video benchmarks, Goku-T2V achieves state-of-the-art performance on the UCF-101 zero-shot generation task. Additionally, Goku-T2V attains an impressive score of 84.85 on VBench, securing the top position on the leaderboard and surpassing several leading commercial text-to-video models.
Additional Technical Details
The Goku model family comprises Transformer architectures with 2B and 8B parameters. The models are trained using a multi-stage training strategy that progressively enhances their capabilities, ensuring effective and robust learning across both image and video modalities.
The training process also incorporates a cascaded resolution strategy, starting with low-resolution image and video data and progressively increasing the resolution to refine the model's understanding of intricate details and improve overall image fidelity.
Goku can also be extended to image-to-video generation by employing a widely used strategy: the first frame of each clip serves as the reference image. The corresponding image tokens are broadcast across time and concatenated with the paired noised video tokens along the channel dimension.
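The broadcast-and-concatenate conditioning described above amounts to simple tensor manipulation. The sketch below uses illustrative (T, H, W, C) latent shapes; the actual tensor layout in Goku is not specified in this summary.

```python
import numpy as np

def condition_on_first_frame(noised_video, first_frame_latent):
    """Sketch of first-frame conditioning for image-to-video generation:
    the reference frame's tokens are broadcast across the time axis and
    concatenated with the noised video tokens along the channel axis.
    Shapes (T, H, W, C) are illustrative assumptions."""
    T = noised_video.shape[0]
    ref = np.broadcast_to(first_frame_latent[None],
                          (T,) + first_frame_latent.shape)
    return np.concatenate([noised_video, ref], axis=-1)  # channels double

video = np.zeros((8, 16, 16, 4))  # noised video latents, T=8
frame = np.ones((16, 16, 4))      # reference-image latents
out = condition_on_first_frame(video, frame)
print(out.shape)                  # (8, 16, 16, 8)
```

Because conditioning only widens the channel dimension, the rest of the network needs no structural change beyond accepting the larger input channel count at the first projection.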
Future Directions
The researchers plan to further explore the potential of Goku by investigating its capabilities in generating longer videos with more complex scenes and actions. They also aim to improve the model's efficiency and reduce its computational cost, making it more accessible to a wider range of users.
Overall Assessment
Goku's comprehensive approach to data curation, model architecture design, flow formulation, and training infrastructure optimization has resulted in a family of models that achieve industry-leading performance in joint image and video generation. The models' ability to generate high-quality, coherent outputs across both images and videos establishes a unified framework for visual generation tasks.