Ming-Omni: A Unified Multimodal Model for Perception and Generation
GITHUB | 📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-lite-omni is a light version of Ming-omni, derived from Ling-lite, with 2.8 billion activated parameters. It is a unified multimodal model capable of processing images, text, audio, and video, and it demonstrates strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from each modality, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers.
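To illustrate the routing idea, here is a minimal sketch of modality-specific MoE routing in plain Python. This is a hypothetical toy, not the actual Ming-lite-omni implementation: each modality gets its own router that scores a shared pool of experts, and each token is processed by the top-k experts chosen by its modality's router. The expert/router names and fixed scores below are invented for illustration; real routers are learned layers over token embeddings.

```python
# Hypothetical sketch of modality-specific MoE routing (NOT the actual
# Ming-lite-omni code). A shared pool of "experts" is scored by a
# per-modality router; the top-k experts process each token.

from typing import Callable, Dict, List

def make_expert(scale: float) -> Callable[[float], float]:
    # Toy "expert": a scalar transform standing in for an expert FFN.
    return lambda x: x * scale

# Shared expert pool, reused across all modalities.
EXPERTS: List[Callable[[float], float]] = [make_expert(s) for s in (0.5, 1.0, 2.0, 3.0)]

# One router per modality; scores here are fixed toy values, whereas
# real routers compute them from the token's embedding.
ROUTERS: Dict[str, List[float]] = {
    "text":  [0.1, 0.7, 0.1, 0.1],
    "image": [0.6, 0.1, 0.2, 0.1],
    "audio": [0.1, 0.1, 0.1, 0.7],
}

def route(token: float, modality: str, top_k: int = 2) -> float:
    scores = ROUTERS[modality]
    # Select the top-k experts for this modality and mix their outputs,
    # weighted by the router scores renormalized over the selected set.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in top)
    return sum(scores[i] / total * EXPERTS[i](token) for i in top)

# The same token value is transformed differently depending on which
# modality's router dispatched it.
print(route(1.0, "text"))
print(route(1.0, "image"))
print(route(1.0, "audio"))
```

The key design point this toy captures is that the expert pool is shared (one backbone), while the dispatch decision is modality-specific, letting tokens from different modalities specialize without maintaining separate backbones.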