MMaDA is a new class of multimodal diffusion foundation models, released openly on Hugging Face, designed for tasks like text-to-image generation, multimodal understanding, and textual reasoning
The model features a unified diffusion architecture that is modality-agnostic, allowing seamless integration and processing of different data types without the need for modality-specific components
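As a rough illustration of what "modality-agnostic" means in a discrete diffusion setting, the toy sketch below treats text tokens and VQ image tokens as ids in one shared vocabulary and trains a single masked-token denoiser over the interleaved sequence. The vocabulary sizes, the tiny transformer, and the masking schedule are placeholders, not MMaDA's actual components.

```python
# Conceptual toy: text tokens and quantized image tokens share one vocabulary,
# so a single masked-token (discrete diffusion) objective covers both modalities.
# All sizes and the TinyDenoiser network are placeholders, not MMaDA's components.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 32_000, 8_192
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + 1          # shared vocabulary + one [MASK] id
MASK_ID = VOCAB - 1

class TinyDenoiser(nn.Module):
    """Stand-in for the unified diffusion transformer: token ids in, logits out."""
    def __init__(self, dim=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        return self.head(self.body(self.embed(tokens)))

def masked_diffusion_loss(model, tokens, mask_ratio):
    """One training step: mask a fraction of tokens, predict the originals."""
    noisy = tokens.clone()
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    noisy[mask] = MASK_ID                                  # corrupt the sequence
    logits = model(noisy)                                  # denoise / reconstruct
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

# One interleaved sequence: a text prompt followed by image codebook indices,
# handled identically because both are just ids in the shared vocabulary.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (1, 64))
sequence = torch.cat([text_ids, image_ids], dim=1)
loss = masked_diffusion_loss(TinyDenoiser(), sequence, mask_ratio=0.5)
print(loss.item())
```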
MMaDA uses a mixed long chain-of-thought (CoT) fine-tuning strategy that serializes reasoning traces from textual and visual tasks into one shared format, aligning reasoning processes across modalities to improve problem-solving and to provide cold-start data for the subsequent RL stage (an illustrative data-mixing sketch follows below)
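As a purely illustrative sketch of what mixing long-CoT data across modalities could look like, the snippet below serializes text-only and image-grounded reasoning samples into one shared template and draws fine-tuning batches from both pools. The `<think>`/`<answer>` delimiters, the `<image>` wrapper, and the mixing ratio are hypothetical and do not reproduce MMaDA's actual CoT format.

```python
# Hypothetical mixed long-CoT data construction; delimiters and field names are
# placeholders, not MMaDA's real template.
import random

def format_sample(question, reasoning, answer, image_tokens=None):
    """Serialize one CoT example; image tokens (if any) are prepended as ids."""
    prefix = f"<image>{image_tokens}</image>\n" if image_tokens is not None else ""
    return f"{prefix}{question}\n<think>{reasoning}</think>\n<answer>{answer}</answer>"

text_cot = format_sample(
    "What is 12 * 7?",
    "12 * 7 = 12 * (10 - 3) = 120 - 36 = 84.",
    "84")
visual_cot = format_sample(
    "How many apples are on the table?",
    "The image shows a table with three red apples and no other fruit.",
    "3",
    image_tokens=[501, 2044, 777])        # placeholder VQ token ids

def mixed_batch(text_pool, visual_pool, size=8, text_fraction=0.5):
    """Sample a fine-tuning batch that interleaves both kinds of CoT data."""
    batch = []
    for _ in range(size):
        pool = text_pool if random.random() < text_fraction else visual_pool
        batch.append(random.choice(pool))
    return batch

print(mixed_batch([text_cot], [visual_cot])[0])
```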
MMaDA also introduces UniGRPO, a unified policy-gradient-based RL algorithm tailored for diffusion foundation models; by leveraging diversified reward modeling, it unifies post-training across reasoning and generation tasks and delivers consistent performance gains (a GRPO-style sketch of the group-relative update is given below)
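To make the group-relative, policy-gradient part concrete, here is a minimal, self-contained PyTorch sketch of a GRPO-style loss: several responses are sampled per prompt, their rewards are normalized within the group to form advantages, and a clipped surrogate objective is optimized. The reward functions, masking schedule, and diffusion-specific likelihood estimation that UniGRPO actually uses are not reproduced; the rewards and log-probabilities below are toy placeholders.

```python
# Minimal GRPO-style group-relative policy-gradient loss (toy sketch, not UniGRPO).
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped surrogate loss with group-relative advantages.

    logp_new, logp_old: (G,) summed log-probs of G sampled responses to one prompt
    rewards:            (G,) scalar rewards (e.g. answer correctness, image score)
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(logp_new - logp_old)            # importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()      # maximize the surrogate

# Toy usage with made-up numbers: 4 sampled responses for one prompt.
logp_old = torch.tensor([-12.0, -15.0, -11.5, -13.0])
logp_new = logp_old + torch.tensor([0.1, -0.2, 0.05, 0.0])   # after one update step
rewards  = torch.tensor([1.0, 0.0, 1.0, 0.0])                # e.g. correctness reward
print(grpo_loss(logp_new, logp_old, rewards))
```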
Experimental results show that MMaDA-8B outperforms LLaMA-3-7B and Qwen2-7B in textual reasoning, Show-o and SEED-X in multimodal understanding, and SDXL and Janus in text-to-image generation
MMaDA bridges the gap between pretraining and post-training in unified diffusion systems, demonstrating strong generalization capabilities
MMaDA-8B-Base is open-source, pretrained on ImageNet, image-text datasets, and text instruction data, and supports text generation, image generation, image captioning, and basic (thinking-style) reasoning; a hedged loading sketch is shown below
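For readers who want to try the open checkpoint, the sketch below pulls it from the Hugging Face Hub. The repo id `Gen-Verse/MMaDA-8B-Base` and the ability to resolve the custom architecture via `AutoModel` with `trust_remote_code=True` are assumptions; the project's own model classes and sampling scripts may be required for actual text or image generation.

```python
# Hedged loading sketch; repo id and AutoModel/trust_remote_code resolution are
# assumptions, not confirmed by the source summary.
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModel

local_dir = snapshot_download("Gen-Verse/MMaDA-8B-Base")   # fetch weights + configs

tokenizer = AutoTokenizer.from_pretrained(local_dir, trust_remote_code=True)
model = AutoModel.from_pretrained(local_dir, trust_remote_code=True)

# Generation is not standard autoregressive decoding: MMaDA samples by iteratively
# denoising masked tokens (text and image tokens share one vocabulary), so use the
# project's inference utilities rather than model.generate().
prompt = tokenizer("Describe a sunset over the ocean.", return_tensors="pt")
print(prompt["input_ids"].shape)
```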