
Ming-Omni-TTS: Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

· 27 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

The Introduction Video of Ming-Omni-TTS

Ming-Omni-TTS is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5 Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (an LLM frame rate of 3.1 Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.

  • 🔊 Fine-grained Vocal Control: Enables precise control over speech rate, pitch, volume, emotion, and dialects via simple instructions. It achieves 93% accuracy for Cantonese and 46.7% for emotional control, outperforming CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the Instruct-TTS-Eval-zh benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry's first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • ⚡ High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1 Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail (a sketch of the idea follows this list).
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
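To make the frame-rate arithmetic concrete, here is a minimal, hypothetical sketch of what patch-by-patch compression could look like. The post only states the two rates (a 12.5 Hz tokenizer feeding a ~3.1 Hz LLM), which implies merging roughly four consecutive frame embeddings into one LLM-side token (12.5 / 4 = 3.125 Hz). The module and dimension names below are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

PATCH_SIZE = 4  # frames per patch; 12.5 Hz / 4 ≈ 3.1 Hz (assumed ratio)

class PatchCompressor(nn.Module):
    """Hypothetical sketch: concatenate PATCH_SIZE consecutive 12.5 Hz frame
    embeddings and project them to a single LLM-rate token (~3.1 Hz)."""

    def __init__(self, frame_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim * PATCH_SIZE, llm_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, frame_dim) at 12.5 Hz
        b, t, d = frames.shape
        t = t - t % PATCH_SIZE  # drop the ragged tail for simplicity
        patches = frames[:, :t].reshape(b, t // PATCH_SIZE, d * PATCH_SIZE)
        return self.proj(patches)  # (batch, T // PATCH_SIZE, llm_dim)

frames = torch.randn(1, 100, 256)  # 8 s of audio at 12.5 Hz
print(PatchCompressor(256, 1024)(frames).shape)  # torch.Size([1, 25, 1024])
```

In the real model, the DiT head presumably recovers the fine-grained audio detail downstream of these low-rate tokens; that part is omitted here.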

Model Structure

Ming-Omni-TTS is a unified audio language model for generating speech, music, and sound, built on a unified continuous audio tokenizer.

Unified Continuous Audio Tokenizer.

Diagram of the Unified Continuous Audio Tokenizer architecture.

Unified Audio Language Model for Speech, Music and Sound Generation.

Diagram of the Unified Audio Language Model for Speech, Music and Sound Generation architecture.

Benchmark Evaluations

Table showing benchmark evaluation results.

Voice Control – Support Structured and Natural Command Control

Basic Attributes Control: Speed, Volume and Pitch Control for Voice Generating

| Input Prompt | Target Text | Instruction 1 | TTS Result | Instruction 2 | TTS Result |
| --- | --- | --- | --- | --- | --- |
| (audio) | 导航开始,全程二十五公里,预计需要十二分钟。 | 语速:慢速 | (audio) | 语速:快速 | (audio) |
| (audio) | 烟雨弥漫下,山环绕着水耸立着,水环绕着山流淌着。 | 语速慢一点 | (audio) | 语速快一点 | (audio) |
| (audio) | 目前共享出行市场处于高速增长阶段。 | 音量:低 | (audio) | 音量:高 | (audio) |
| (audio) | 北京在出行规模,城市影响力方面表现优异。 | 音量尽量低一点 | (audio) | 音量尽量高一点 | (audio) |
| (audio) | 他们脱掉笨重的冬衣,走起路来腰杆挺直步履轻盈。 | 基频:低 | (audio) | 基频:高 | (audio) |
| (audio) | 自动驾驶将大幅提升出行安全,效率。 | 基频低一点 | (audio) | 基频高一点 | (audio) |

Same Dialect/Cross-Dialect Control: Generating Cantonese and Sichuanese from Mandarin and Native Prompts

| Instruction | Input Prompt | Conversion Type | Target Text | TTS Result |
| --- | --- | --- | --- | --- |
| 方言:广粤话 | (audio) | 广粤话 -> 广粤话 | 佢系头大冇脑脑大生草种 | (audio) |
| 方言:广粤话 | (audio) | 广粤话 -> 广粤话 | 今个周末全场货品低至五折,数量有限,卖晒就冇喇。 | (audio) |
| 请用广粤话表达 | (audio) | 广粤话 -> 广粤话 | 我觉得社会企业同个人都有责任 | (audio) |
| 用广粤语说,越地道越好。 | (audio) | 普通话 -> 广粤话 | 你嚟探我,我真系好感动,好耐冇见你啦! | (audio) |
| 以广粤话的口语风格来表达。 | (audio) | 普通话 -> 广粤话 | 快啲啦,唔好再拖拖拉拉,大家都等紧你开会呀 | (audio) |
| 方言:川渝话 | (audio) | 川渝话 -> 川渝话 | 你要自己打扮,不穿咋个晓得穿起漂不漂亮嘛?看我们这新款多时尚。 | (audio) |
| 方言:川渝话 | (audio) | 川渝话 -> 川渝话 | 赛尔号那个时候,才出来的时候,还是他那个机制,还是特别好耍的。 | (audio) |
| 请用川渝话表达 | (audio) | 川渝话 -> 川渝话 | 哎,刚刚晚上想吃点啥子?煮点火锅要得。 | (audio) |
| 模仿川渝话的语气来表达 | (audio) | 普通话 -> 川渝话 | 你晓不晓得?你啥我都喜欢,嗯,就是有一点不喜欢装。 | (audio) |
| 挑战一下用川渝话的味儿来朗读 | (audio) | 普通话 -> 川渝话 | 你那哈屋头还有电脑,那时候就已经先进了。 | (audio) |

Cross-Emotion Control: Cross-Emotion Synthesis Using a Single Neutral Prompt

| Instruction | Input Prompt | Conversion Type | Target Text | TTS Result |
| --- | --- | --- | --- | --- |
| 情感: 高兴 | (audio) | 中性 -> 高兴 | If these examinations are held orally, they may be known colloquially as "orals". | (audio) |
| 情感: 愤怒 | (audio) | 中性 -> 愤怒 | I'm done arguing with you. You're not worth my time! | (audio) |
| 情感: 愤怒 | (audio) | 中性 -> 愤怒 | In cities, driving speeds are set by which lane a driver is in. | (audio) |
| 情感: 悲伤 | (audio) | 中性 -> 悲伤 | Everything has changed. The promises and dreams we once had are shattered. How should I face this? | (audio) |
| 情感: 高兴 | (audio) | 中性 -> 高兴 | But it does not allow for adding new members to interfaces. | (audio) |
| 情感: 愤怒 | (audio) | 愤怒 -> 愤怒 | 港湾道是每年农历新年举行的香港新春花车巡游的路线之一。 | (audio) |
| 情感: 悲伤 | (audio) | 悲伤 -> 悲伤 | 我觉得自己好像在黑暗中迷失了,再也找不到出口了。 | (audio) |
| 情感: 高兴 | (audio) | 中性 -> 高兴 | 我竟然抢到了陈奕迅的演唱会门票!太棒了!终于可以现场听一听他的歌声了! | (audio) |
| 情感: 悲伤 | (audio) | 悲伤 -> 悲伤 | 我们俩从一开始就君子之交,都说好啦,背信弃义出尔反尔的是她,我告诉你这件事我是受害者。 | (audio) |
| 表达时要悲伤一点。 | (audio) | 悲伤 -> 悲伤 | 有些软体开发者也注意到软体度量已成为软体开发过程中的一部份。 | (audio) |
| 把这件事说得高兴一点。 | (audio) | 高兴 -> 高兴 | I bought my first mountain bike with my own earnings, a Merida Warrior 500! Go me! | (audio) |
| 表达时,请务必流露出高兴的情感。 | (audio) | 中性 -> 高兴 | I ran into a teacher I hadn't seen in years at the coffee shop today. He still remembered me, and we talked about so many fun memories. | (audio) |

Built-in Premium Sounds: Over 100 Built-in, High-Quality Timbres

| Instruction | Description | Target Text | TTS Result |
| --- | --- | --- | --- |
| 克隆一下灵小甄的说话腔调。 | 销售、直播带货: 声音明亮清脆,语速轻快且充满活力,语气中带有强烈的推荐感和亲和力,典型的带货主播风格。 | 这款产品的名字,叫变态坑爹牛肉丸。 | (audio) |
| 模仿灵梦的风格。 | 虚拟恋人: 充满糖分的高甜少女音,语气娇憨任性,完美演绎了想要人陪伴时的撒娇状态。 | 认为在中文歌曲里,夹杂几句英文就很时髦。 | (audio) |
| 麻烦学一下灵岩的口音 | 新闻、客服: 声音清晰正式且专业 | 届时会按照原定计划,与国防部签署相关以地换地协议。 | (audio) |
| 克隆一下灵娇的说话腔调。 | 邻家女孩、女大学生、Vlog博主: 清甜明亮的少女音,语感轻快活泼,在讲述生活趣事时充满画面感与青春朝气,极具感染力。 | 总裁问,刚才皮皮鲁唱的歌是谁的词谁的曲,大手笔呀。 | (audio) |
| 克隆一下妩媚妲己的说话腔调。 | 妩媚角色: 声音甜美清脆,语调轻盈上扬,表现性感妩媚 | 新娘是一位俄国公主,坐着六只驯鹿拉的雪车,从芬兰一路而来。 | (audio) |
| 克隆一下灵绮木的说话腔调。 | 透着刻薄与傲慢的冷艳御姐音 | 这就是它第二个特色——灵活的音色设计能力,你可以直接用文字描述,比如"知性女主播的声音",它就能给你生成。要是懒得想,它还内置了一百多种精品音色,什么动漫角色、短视频配音统统搞定! | (audio) |
| 克隆一下灵若虚的说话腔调。 | 老奶奶形象,声音饱含岁月的温暖与慈爱,语速舒缓,透着对生活细节的满足感,极具治愈力。 | 这就是它第二个特色——灵活的音色设计能力,你可以直接用文字描述,比如"知性女主播的声音",它就能给你生成。要是懒得想,它还内置了一百多种精品音色,什么动漫角色、短视频配音统统搞定! | (audio) |
| 克隆一下花小呗的说话腔调。 | 儿童角色,声音清脆甜美,带有明显的幼态特征,语调轻快活泼 | 这就是它第二个特色——灵活的音色设计能力,你可以直接用文字描述,比如"知性女主播的声音",它就能给你生成。要是懒得想,它还内置了一百多种精品音色,什么动漫角色、短视频配音统统搞定! | (audio) |
| 克隆一下灵浅忧的说话腔调。 | 小男孩,声音清脆明亮,充满元气 | 今天天气不错,要出去玩了。 | (audio) |

Voice Design: Zero-Shot Synthesis of Custom Vocal Identities via Natural Language Descriptions

| Instruction | Target Text | TTS Result |
| --- | --- | --- |
| 性别: 女童声音. 音高: 音高尖锐,持续偏高. 语速: 语速迅捷,语气急促. 音量: 音量响亮,情绪饱满. 年龄: 学龄儿童. 清晰度: 吐字清晰,发音用力. 流畅度: 表达流畅,伴强调性重复. 口音: 标准普通话. 音色质感: 童声清亮,略显尖锐. 情绪: 激动委屈,带有抗议. 语调: 声调高昂,语势急切. 性格: 急躁率真,不甘示弱. | 人家从那走过,他们就说我故意偷听,还说我是小广播,我偏要广播,偏要广播偏。 | (audio) |
| 性别: 男性. 音高: 男性沉稳中低音. 语速: 语速舒缓,有自然停顿. 音量: 正常谈话音量. 年龄: 中老年男性. 清晰度: 吐字清晰,发音标准. 流畅度: 言语连贯,表达自然. 口音: 标准普通话. 音色质感: 音质温和,略显沧桑. 情绪: 饱含不舍与怀念,转为平静嘱托. 语调: 前段感叹意味,后段请求意味. 性格: 念旧重情,温和坦诚. | 这就是天望娃娃送给我的我一直舍不得丢掉它,你替我上交了吧。 | (audio) |
| 性别: 男性语音特征. 音高: 男性中低音域,初始疑问时音调上扬. 语速: 整体偏快,表述急切清晰. 音量: 正常交谈音量,偶有强调加重. 年龄: 青年至中年男性. 清晰度: 吐字清晰,发音标准. 流畅度: 叙述流畅,偶有为强调而设的短暂停顿. 口音: 带有北方地区特征的普通话. 音色质感: 声音较为浑厚,略带一丝沙哑质感. 情绪: 从关切疑问过渡到解释性陈述,略显急切. 语调: 初始疑问扬起,后转为肯定叙述语调. 性格: 显得坦率直接,急于说明情况. | 没有欺负这孩子呢,报告团长没人欺负他,不是怎么的,他本来是给他师父小杨上门的,回来,就说鬼鬼的鬼。 | (audio) |
| 性别: 女性. 音高: 女性高音,句末随情绪上扬. 语速: 语速偏缓,充满恳切感. 音量: 音量正常,激动处略有提高. 年龄: 中年女性. 清晰度: 吐字清晰,略带哭腔. 流畅度: 整体流畅,因情绪略显迟缓. 口音: 标准普通话. 音色质感: 音色略显沙哑,蕴含悲伤. 情绪: 悲伤焦虑,带有不解与恳求. 语调: 起伏较大,表达焦急质问. 性格: 情感浓烈,忧心忡忡. | 我们家好容易恢复成这个样子,你明知有危险,为什么还一定要拉着杉杉? | (audio) |
| 用活泼的童声带着喜悦和兴奋不间断地讲述一个有趣的故事。 | 我有个大哥叫小王,能吃饭也能喝汤,别看他手里没武器啊,说话赛过歪白的机关枪。 | (audio) |
| 这是一个粤语地区长辈的声音,是一种带有地域特色的创意风格。他使用粤语(广东话),年长男性声音沉厚,语速较慢。语气在说教时显得严肃,但言语间仍透露出对家人的关心。 | 做人呢,最紧要就係开心。 | (audio) |
| 这是一个粤语地区长辈的声音,是一种带有地域特色的创意风格。他使用粤语(广东话),年长男性声音沉厚,语速较慢。语气在说教时显得严肃,但言语间仍透露出对家人的关心。 | 你睇你,成日挂住玩,书又唔读。 | (audio) |
| 是一个粗犷豪放的东北大哥的声音,是一种极具地域辨识度的创意与特殊风格。他使用带有浓郁东北口音的普通话,中年男性声音洪亮,嗓门大。说话直来直去,语速快,语气中充满了幽默感和不拘小节的豪爽。 | 哎呀我的妈呀,这嘎冷的天儿,你穿这点儿? | (audio) |
| 这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。 | 放松……现在……闭上你的眼睛…… | (audio) |
| 这是一种ASMR耳语,属于一种旨在引发特殊感官体验的创意风格。这个女性使用轻柔的普通话进行耳语,声音气音成分重。音量极低,紧贴麦克风,语速极慢,旨在制造触发听者颅内快感的声学刺激。 | 听……这个声音……是不是……很舒服…… | (audio) |
| 这是一个体育赛事激情解说员的声音,是极具感染力的创意与特殊风格。他使用高亢的普通话,中年男性声音沙哑(因长时间呐喊)。语速快如机枪,在关键时刻会瞬间爆发,语调充满了紧张、激动和不可思议的情绪。 | 球进了!进了进了进了!伟大的胜利! | (audio) |
| 这是一个宫斗剧中的威严皇后的声音,展现了充满张力的戏剧叙事风格。她使用雍容华贵的普通话,中年女性声音沉稳。语速雍容和缓,但每个字都掷地有声,语气表面波澜不惊,实则暗藏锋芒和久居上位的威压。 | 妹这话,是说给本宫听的吗? | (audio) |
| 这是一个宫斗剧中的威严皇后的声音,展现了充满张力的戏剧叙事风格。她使用雍容华贵的普通话,中年女性声音沉稳。语速雍容和缓,但每个字都掷地有声,语气表面波澜不惊,实则暗藏锋芒和久居上位的威压。 | 放肆!在本宫面前,岂容你如此喧哗? | (audio) |
| 这是一个古装剧中的腹黑反派的声音,充满了戏剧性的叙事张力。他使用华丽而阴柔的普通话,青年男性声音说话时语速慢条斯理,语气看似温和,却在句尾带着一丝不易察觉的冷笑和威胁,让人不寒而栗。 | 呵呵,看来,你还是不太明白自己的处境啊。 | (audio) |

Podcast: Multi-person Conversation

Input Speaker1 Prompt | Input Speaker2 Prompt | Target Text | TTS Result
speaker_1: 你可以说一下,就大概说一下,可能虽然我也不知道,我看过那部电影没有。
speaker_2: 就是那个叫什么,变相一节课的嘛。
speaker_1: 嗯。
speaker_2: 一部搞笑的电影。
speaker_1: 一部搞笑的。
speaker_1: 所以你想成功的话,就推荐你看这些书。
speaker_2: 我会有时间去看一看的。
speaker_1: 要是像我看的话,我就会感觉特别的。
speaker_2: 枯燥。
speaker_1: 对枯燥无聊毕竟是古文也看不懂除非那些。
speaker_1: 知道家长在考虑什么让家长也知道孩子们在考虑什么。
speaker_2: 对。
speaker_1: 减少矛盾。
speaker_2: 对,就是感觉其实出这些电影或者电视剧,也是挺好的让彼此更加了解一下,我感觉如果是一个家长和一个小孩儿,去看电视剧的话,收获也是蛮多的。
speaker_1: 那你还有什么比较好的电影介绍给我呢。
speaker_1: 上个厕所,然后那有专门的人给你,就是你上厕所之前,专门有个人给你递纸了。
speaker_2: 对,上个厕所会出来给你递毛巾。
speaker_1: 啊对,让你去擦手这些什么的。
speaker_2: 是的。
speaker_1: 服务,服务非常周到,不过也有少数人就说,这个服务实在太久了,就是,就,就是像那种,就是那个。
speaker_1: 什么东西啊?
speaker_2: 叫那个的哪吒的那个。
speaker_1: 啊,那个哪吒,但是我没有去看一看嘛。
speaker_2: 我也没看过。
speaker_1: 我当时好像是本来是要去看的。
speaker_1: 啊,我吃过。
speaker_2: 是不是。
speaker_1: 因为我之前去过山东一次吃过人家那杂粮煎饼。
speaker_2: 反正跟咱们这儿,不一样是吧,正宗的人家那是正宗的。
speaker_1: 本地的。
speaker_1: 那就之前的妆都毁掉了。
speaker_2: 嗯,是是是。
speaker_1: 然后之后就是睫毛。
speaker_2: 哦,对,那睫毛涂睫毛膏。
speaker_1: 画睫呃涂睫毛的时候,先夹一下睫毛,夹。
speaker_1: 嗯哪三个字。
speaker_2: 足力健。
speaker_1: 哦听说过。
speaker_2: 那你给我讲讲。
speaker_1: 我听说这个足力健对老年人的脚底有好处,而且边走路都能健身是吗。
speaker_1: 就这样子,嗯,一般男生都是看什么电影啊? 推理的吗? 还是什么。
speaker_2: 也不是吧,就是看那种,嗯,具体也说不出哪种类型嘛。
speaker_1: 具体也说不出。
speaker_2: 嗯。
speaker_1: 就是都有看一点。
speaker_1: 是了,只有你,化化起妆了才能充实呢,自信心呃然后才,感觉自己的心情是美美哒的。
speaker_2: 你想化妆是,呃那就从眉毛开始说不是从打底开始说吧。
speaker_1: 嗯说,好想听呢。
speaker_2: 洁面以后就是拍水乳,水乳霜。
speaker_1: 嗯。

Music Generation

| Instruction | TTS Result |
| --- | --- |
| Genre: 迪斯科. Mood: 活力四射 / 精力充沛. Instrument: 电吉他. Theme: 运动. Duration: 30s | (audio) |
| Genre: 当代古典音乐. Mood: 温暖 / 友善. Instrument: 合成拨弦. Theme: 节日. Duration: 60s. | (audio) |
| Genre: 电子舞曲. Mood: 自信 / 坚定. Instrument: 架子鼓. Theme: 节日. Duration: 47s. | (audio) |
| Genre: 独立民谣. Mood: 鼓舞人心 / 充满希望. Instrument: 合成铜管乐器. Theme: 节日. Duration: 63s. | (audio) |
| Genre: 流行摇滚. Mood: 温暖 / 友善. Instrument: 低音鼓. Theme: 旅行. Duration: 76s. | (audio) |
| Genre: 电子舞曲. Mood: 快乐. Instrument: 定音鼓. Theme: 好时光. Duration: 61s. | (audio) |
| Genre: 流行乐. Mood: 温暖 / 友善. Instrument: 合成铜管乐器. Theme: 庆典与喜悦. Duration: 41s. | (audio) |
| Genre: 当代古典音乐. Mood: 鼓舞人心 / 充满希望. Instrument: 合成拨弦. Theme: 庆典与喜悦. Duration: 45s. | (audio) |
| Genre: 电子舞曲. Mood: 鼓舞人心 / 充满希望. Instrument: 电吉他. Theme: 运动. Duration: 94s. | (audio) |

Speech/Music Mono Generation: Single-Channel Generation of Speech and Music

| Instruction | Input Prompt | Target Text | TTS Result |
| --- | --- | --- | --- |
| Genre: 电子舞曲. Mood: 活力四射. Instrument: 合成铜管乐器. Theme: 运动. SNR: 5.0dB. | (audio) | 全神贯注,跟上这强劲的节奏,冲向终点吧! | (audio) |
| Genre: 流行摇滚. Mood: 快乐. Instrument: 电吉他. Theme: 旅行. SNR: 5.0dB. | (audio) | 阳光洒满公路,带上行囊,出发去远方! | (audio) |
| Genre: 迪斯科. Mood: 兴奋. Instrument: 架子鼓. Theme: 生日. SNR: 5.0dB. | (audio) | 派对时刻到!让我们在鼓点中祝你生日快乐! | (audio) |
| Genre: 电子舞曲. Mood: 兴奋. Instrument: 合成铜管乐器. Theme: 运动. SNR: 5.0dB. | (audio) | 汗水在燃烧,感受这股能量,你就是最强的! | (audio) |
| Genre: 流行摇滚. Mood: 活力四射. Instrument: 架子鼓. Theme: 旅行. SNR: 5.0dB. | (audio) | 踏上未知的旅程,每一步都充满未知的惊喜! | (audio) |
| Genre: 迪斯科. Mood: 快乐. Instrument: 电吉他. Theme: 生日. SNR: 5.0dB. | (audio) | 吹灭蜡烛前,先跟着旋律尽情摇摆吧! | (audio) |
| Genre: 电子舞曲. Mood: 快乐. Instrument: 合成铜管乐器. Theme: 生日. SNR: 5.0dB. | (audio) | 这是属于你的闪耀时刻,生日派对正式开始! | (audio) |
| Genre: 流行摇滚. Mood: 兴奋. Instrument: 电吉他. Theme: 运动. SNR: 5.0dB. | (audio) | 超越极限,感受心跳的轰鸣,永不言弃! | (audio) |
| Genre: 迪斯科. Mood: 活力四射. Instrument: 架子鼓. Theme: 旅行. SNR: 5.0dB. | (audio) | 在霓虹闪烁的异国街头,找寻失落的快乐! | (audio) |
| Genre: 流行摇滚. Mood: 快乐. Instrument: 合成铜管乐器. Theme: 运动. SNR: 5.0dB. | (audio) | 运动让生活更有趣,让我们一起快乐出发! | (audio) |

Sound Generation (TTA)

| Instruction | TTS Result |
| --- | --- |
| A motor is revving and changing gears | (audio) |
| Thunder and a gentle rain | (audio) |
| Continuous snoring of a person | (audio) |
| Nature sounds with a frog croaking | (audio) |
| A man talking as a stream of water trickles in the background | (audio) |

Speech/Sound Mono Generation: Single-Channel Generation of Speech and Sound

| Instruction | Input Prompt | Target Text | TTS Result |
| --- | --- | --- | --- |
| Birds chirping | (audio) | 副主任及以上号别就诊人次,为二百零八点二万。 | (audio) |
| Light rain | (audio) | 其中又有大部分百分之四十一点九认为,由该品牌影楼拍摄。 | (audio) |
| Keyboard typing | (audio) | 本次有害昆虫科普展,是一场专门为孩子准备的科普教育活动。 | (audio) |
| Fire engine siren | (audio) | 他陪舅舅到简阳一所学校,考察捐资改建事宜。 | (audio) |
| Rainstorm | (audio) | 请语音留言,告诉电话精灵您没有达到父母的哪些要求。 | (audio) |

Ming-flash-omni-Preview: A Sparse, Unified Architecture for Multimodal Perception and Generation

· 7 min read
inclusionAI
Ant Group

GITHUB | ARXIV | 🤗 Hugging Face | 🤖 ModelScope

Omnimodal Ming-omni series update! Ming-flash-omni-Preview is the first open-source omnimodal large model to reach the hundred-billion-parameter scale. Based on Ling 2.0's sparse MoE architecture, Ming-flash-omni-Preview has 103B total parameters with 9B activated. Compared to the previous version, Ming-lite-omni-1.5, it improves both omnimodal understanding and generation. Its overall performance across modalities is leading among open-source omnimodal models, with particularly strong results in controllable image generation, streaming video understanding, and speech recognition.


Capability Overview

Controllable Image Generation

For image generation, Ming-flash-omni-Preview pioneers the Generative Segmentation Paradigm, reframing "image segmentation" as a semantic-preserving editing task (Generative Segmentation-as-Editing), achieving fine-grained spatial semantic control. Ming-flash-omni-Preview achieved a score of 0.90 on the GenEval benchmark, surpassing all non-reinforcement learning generation methods and demonstrating exceptional controllability.

Streaming Video Understanding

Users often want to hold a continuous dialogue with an AI about real-world scenes and have the AI interpret those scenes for them. Ming-flash-omni-Preview fulfills these needs effectively. As shown in the video below, it achieves fine-grained understanding of streaming video, recognizing objects and interactions in the video and providing relevant explanations in real time to support users in practical scenarios.

Speech and Dialect Understanding

Ming-flash-omni-Preview supports Context-Aware Speech Recognition (ContextASR) and dialect recognition, achieving SOTA across all 12 ContextASR subtasks. Its understanding of 15 Chinese dialects, including Hunanese, Minnanese, and Cantonese, is significantly enhanced, effectively providing translation and real-time comprehension support for users faced with an unfamiliar dialect.

Voice Cloning

Ming-flash-omni-Preview's speech generation has been upgraded from discrete tokenizers to continuous tokenizers, significantly enhancing its voice cloning capabilities. It is highly stable on mixed Chinese-English pronunciation and can effectively clone the voice from an original conversation into newly generated dialogue. Its WER on seed-tts-zh is 0.99, surpassing Qwen3-omni and seed-tts.

Model Architecture and Capability Introduction

Model structure diagram of Ming-flash-omni-Preview:


Compared to Ming-lite-omni-1.5, Ming-flash-omni-Preview primarily features the following technical optimizations:

Omnimodal Training Based on Sparse Expert Architecture

Ming-flash-omni-Preview extends the Ling-Flash-2.0 sparse MoE architecture to the omni-modality. It models the distribution and routing strategy of each modality based on the modality-level routing proposed by Ming-lite-omni, achieving "large capacity, small activation" for each modality. By introducing VideoRoPE in the Attention layer, it enhances spatiotemporal modeling for long videos, improving video interaction capability. Additionally, in terms of training strategy:

  1. Stable Sparse Training: Utilizes a mixed expert-balancing scheme (combining an auxiliary load-balancing loss with router bias updates) to ensure uniform expert activation and stable convergence of omnimodal training under the sparse MoE architecture (a sketch of the balancing step follows this list);
  2. Context-Aware ASR Training Paradigm: For speech training tasks, task/domain information input is used as the decoding condition, significantly improving proper noun recognition and transcription consistency. It also introduces high-quality dialect training corpora, leading to a significant increase in recognition accuracy for 15 Chinese dialects, including Hunanese, Minnanese, and Cantonese.
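As a rough illustration of the mixed expert-balancing scheme named in point 1, the sketch below combines the standard auxiliary load-balancing loss with a per-expert router-bias update in the spirit of aux-loss-free balancing. The constants, shapes, and update rule are illustrative assumptions, not the released training code.

```python
import torch

def load_balance_step(router_logits, bias, top_k=2, bias_lr=1e-3):
    # router_logits: (tokens, n_experts); bias: (n_experts,) selection-only offset
    n_tokens, n_experts = router_logits.shape
    probs = torch.softmax(router_logits, dim=-1)
    # The bias shifts which experts get *selected*, not the mixing weights.
    _, top_idx = (router_logits + bias).topk(top_k, dim=-1)

    # f: fraction of routed tokens per expert; p: mean router probability.
    counts = torch.zeros(n_experts).scatter_add_(
        0, top_idx.flatten(), torch.ones(n_tokens * top_k))
    f = counts / (n_tokens * top_k)
    p = probs.mean(dim=0)
    aux_loss = n_experts * (f * p).sum()  # standard auxiliary balancing loss

    with torch.no_grad():  # nudge routing away from overloaded experts
        bias -= bias_lr * torch.sign(f - 1.0 / n_experts)
    return aux_loss, bias

aux_loss, bias = load_balance_step(torch.randn(128, 8), torch.zeros(8))
print(aux_loss.item())
```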

Unified Generative Segmentation and Editing

The core challenge in building a unified multimodal model lies in how to efficiently integrate image understanding and generation capabilities. Our Ming-lite-omni-1.5 achieved this by freezing the language pathway and injecting hierarchical semantics using multi-scale QueryTokens, thereby allowing the generation objective to better integrate with the understanding task while preserving understanding performance. Although this training strategy improved stability, the fundamental differences between the learning objectives of understanding and generation mean that even with the introduction of hierarchical semantics, fine-grained visual knowledge (such as object attributes and spatial relationships) remains difficult to efficiently transfer to high-precision generation and editing tasks, thus limiting the improvement in model generation quality and controllability.

To overcome this bottleneck, Ming-flash-omni-Preview proposes the "Generative Segmentation-as-Editing" collaborative training paradigm, which reframes image segmentation as a semantic-preserving editing task (e.g., "paint the banana purple"). The key benefit of this design is that it forcibly unifies the understanding and generation objectives: successful editing must rely on a precise understanding of the object's outline, and editing quality directly provides a supervision signal for understanding. This paradigm directly enhances the model's fine-grained spatial semantic control and indirectly addresses the compositionality problem in pure text-to-image generation.

On the GenEval benchmark, Ming-flash-omni-Preview achieved a score of 0.90, surpassing all leading non-reinforcement-learning (non-RL) methods. On the GEdit benchmark, the average score on precise editing tasks such as object deletion and object replacement improved from 6.9 to 7.9. Together, these results show that the fine-grained spatial semantic control gained through "Generative Segmentation-as-Editing" training not only significantly improves precise editing but also generalizes effectively to purely text-driven image generation.

Efficient Omnimodal Training Architecture

Training omnimodal foundation models faces two major challenges: data heterogeneity (varied shapes of multi-modal inputs) and model heterogeneity (difficulty in parallelizing modality-specific encoders). These issues lead to load imbalance, memory fragmentation, and pipeline bubbles, severely slowing down the training speed. To address these problems, we made two key optimizations based on the Megatron-LM framework when training the Ming-flash-omni-Preview model:

  1. Sequence Packing: Solves data heterogeneity. Variable-length sequences are densely packed into fixed-length batches, significantly improving memory utilization and computational density (a toy packing sketch follows this list);
  2. Flexible Encoder Sharding: Solves model heterogeneity. We extend Megatron-LM to support fine-grained sharding of modality encoders across DP/PP/TP, eliminating pipeline bubbles and achieving load balance.

Together, these optimizations doubled the training throughput of Ming-flash-omni-Preview compared to the baseline.
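A toy illustration of the sequence-packing idea from point 1: greedily pack variable-length samples into fixed-length rows so that little of the batch is padding. Real Megatron-LM-style packing also builds per-segment attention masks and position ids, which this sketch omits.

```python
def pack_sequences(lengths, max_len):
    """Greedy first-fit-decreasing packing; returns sample indices per row."""
    rows, room = [], []          # room[i] = tokens still free in rows[i]
    for idx, n in sorted(enumerate(lengths), key=lambda x: -x[1]):
        for i, free in enumerate(room):
            if n <= free:        # reuse the first row with enough space
                rows[i].append(idx)
                room[i] -= n
                break
        else:                    # no row fits: open a new one
            rows.append([idx])
            room.append(max_len - n)
    return rows

lengths = [900, 300, 512, 128, 2048, 700]
print(pack_sequences(lengths, max_len=2048))
# [[4], [0, 5, 1, 3], [2]] -> 3 packed rows instead of 6 padded ones
# (~75% of positions carry real tokens vs ~37% with per-sample padding)
```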

Getting Started with Ming-flash-omni-Preview

Our model and code are open source. We welcome everyone to try them out, provide feedback, and exchange ideas.

Future Plan

This release, Ming-flash-omni-Preview, is a preview version and still has some imperfections:

  1. Visual-Text Understanding Capability: Although Ming-flash-omni-Preview's overall performance is leading among omnimodal models, there is still a gap compared to SOTA dedicated VL large models. We will continue to explore the performance upper limit of omnimodal models.
  2. Speech Capability: Overall performance in speech recognition and speech synthesis is leading. Multi-turn speech dialogue and high-quality voice cloning are our next optimization priorities.
  3. Image Generation Capability: The model achieved a score of 0.90 on the GenEval benchmark, demonstrating good controllability, and already possesses text generation and editing capabilities. However, there is still room for improvement in rendering and editing text with complex layouts, as well as generating specific IP characters.

We are continuously optimizing the user experience of Ming-flash-omni-Preview. We welcome you to provide feedback via community discussion or issues. The official version will be released soon.

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

· 17 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

The Introduction Video of Ming-UniAudio

Audio Edit Demo

Editing Tasks Video demos

🚀 Technical Highlights

  1. First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio is a continuous speech tokenizer built on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed loop with LLMs, making it suitable for both understanding and generation tasks (a minimal sketch follows this list).
  2. First Speech LLM with unified continuous tokenizer for both understanding and generation: Ming-UniAudio is an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
  3. First universal free-form speech editing model for semantic and acoustic tasks without explicit temporal regions: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
  4. First benchmark for free-form speech editing: We propose Audio-Edit-Benchmark, the first open-source free-form evaluation set comprising editing tasks of four semantic and five acoustic types, to evaluate the model's editing performance.
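As a rough sketch of what a VAE-based continuous speech tokenizer with a causal Transformer (highlight 1) can look like: a causally-masked encoder maps mel frames to per-frame Gaussian latents, with no vector-quantization codebook anywhere. All layer sizes and names are illustrative assumptions, not the MingTok-Audio implementation.

```python
import torch
import torch.nn as nn

class ContinuousSpeechTokenizer(nn.Module):
    def __init__(self, n_mels=80, d_model=512, latent_dim=64, n_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)

    def forward(self, mels: torch.Tensor):
        # mels: (batch, T, n_mels); a causal mask keeps the encoder streamable
        t = mels.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.encoder(self.in_proj(mels), mask=mask)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar  # continuous tokens: no codebook lookup

z, mu, logvar = ContinuousSpeechTokenizer()(torch.randn(1, 50, 80))
print(z.shape)  # torch.Size([1, 50, 64])
```

Training would add a reconstruction decoder and a KL term; here only the encoding path that yields continuous tokens for the LLM is shown.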

Instruction-Guided Free-Form Speech Editing

Semantic Editing - Insert

| Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- | --- |
| insert '简直' after the character or word at index 8. | 真是个浪漫的邂逅可以说是英雄救美了 | 真是个浪漫的邂逅简直可以说是英雄救美了 | (audio) | (audio) |
| insert '真正' before the character or word '好'. | 就有道而正焉可谓好学也已 | 就有道而正焉可谓真正好学也已 | (audio) | (audio) |
| insert 'clearly' before the character or word at index 8. | Its legal status in Trinidad was insufficient to preserve its ecological status. | Its legal status in Trinidad was insufficient clearly to preserve its ecological status. | (audio) | (audio) |
| insert 'successfully' after the character or word 'profession'. | Previously an attorney Korona left the profession to pursue a career in music. | Previously an attorney Korona left the profession successfully to pursue a career in music. | (audio) | (audio) |

Semantic Editing - Substitute

| Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- | --- |
| substitute '妈妈' with '爸爸'. | 我想对于妈妈来说会比任何礼物都要温暖 | 我想对于爸爸来说会比任何礼物都要温暖 | (audio) | (audio) |
| substitute the characters or words from index 8 to index 10 with '五万元'. | 当时我想等筹齐两万元聘礼就送她妈回家 | 当时我想等筹齐五万元聘礼就送她妈回家 | (audio) | (audio) |
| substitute 'get pictures off' with 'transfer photos from'. | I'm trying to explain to my mother how to get pictures off her phone. | I'm trying to explain to my mother how to transfer photos from her phone. | (audio) | (audio) |
| substitute the words from index 8 to index 9 with 'could become'. | Considering the growth of human population insects might be the food of the future. | Considering the growth of human population insects could become the food of the future. | (audio) | (audio) |

Semantic Editing - Delete

| Instruction | Transcription | Target Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- | --- |
| delete '比普通的茶叶要'. | 花草茶的口味一般比普通的茶叶要苦一些 | 花草茶的口味一般苦一些 | (audio) | (audio) |
| delete the characters or words from index 11 to index 15. | 我吃了点燕麦片煎鸡蛋还喝了点橙汁 | 我吃了点燕麦片煎鸡蛋汁 | (audio) | (audio) |
| delete 'times'. | The classification of this gibbon has changed several times in the past few years. | The classification of this gibbon has changed several in the past few years. | (audio) | (audio) |
| delete the characters or words from index 2 to index 6. | On the second day the boy climbed to the top of a cliff near the camp | On climbed to the top of a cliff near the camp | (audio) | (audio) |

Acoustic Editing - Dialect Conversion

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| Change the accent of the speech to Dongbei. | 之后,他考取导游证,成为拱北口岸中旅的导游。 | (audio) | (audio) |
| Change the accent of the speech to Chengdu. | 只有当科技为本地社群创造价值的时候,才能真正有意义。 | (audio) | (audio) |
| Change the accent of the speech to Chengdu. | 我得用回想与幻想补充我所缺少的饮食,安慰我所得到的痛苦。 | (audio) | (audio) |
| Change the accent of the speech to Guangxi. | 全国恶性肿瘤发病,及死亡第一位的是肺癌。 | (audio) | (audio) |

Acoustic Editing - Speed

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| adjusts the speed to 0.5. | 我用胸抵住车把,掌握方向,速度一点也不比别人慢。 | (audio) | (audio) |
| adjusts the speed to 0.7. | There is a growing body of case law on Bayh-Dole. | (audio) | (audio) |
| adjusts the speed to 1.3. | Cribb was born near Bristol but moved to London before starting professional fighting. | (audio) | (audio) |
| adjusts the speed to 2. | 切实帮助困难群众解决生产生活中,遇到的困难和问题。 | (audio) | (audio) |

Acoustic Editing - Pitch

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| shifts the pitch by 3 steps. | 因为外面有战争,家里又有战争带来的悲伤和匮乏。 | (audio) | (audio) |
| shifts the pitch by 5 steps. | 自动驾驶将大幅提升出行安全,效率。 | (audio) | (audio) |
| shifts the pitch by -1 steps. | The heart of the campus has a number of historic buildings. | (audio) | (audio) |
| shifts the pitch by -1 steps. | Stevenson is also the director of music ministries at Angeles Mesa Presbyterian Church. | (audio) | (audio) |

Acoustic Editing - Volume

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| adjusts the volume to 1.4. | A woman sits as she shows the designs she has made in the floor. | (audio) | (audio) |
| adjusts the volume to 1.6. | For example, they both consist of predominately older, hence redder, stars. | (audio) | (audio) |
| adjusts the volume to 0.9. | 伏羲的儿孙们看见伏羲捉来了鱼,也都欢欢喜喜跑来问长问短。 | (audio) | (audio) |
| adjusts the volume to 0.3. | 他们还告诉巨人,那座城市里群英荟萃。 | (audio) | (audio) |

Acoustic Editing - Denoise

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| denoise the audio. | Be shape of example,before deriving this formula we explained what we mean by problems of this kind we now generalize these ideas for general binomial experiments. | (audio) | (audio) |
| denoise the audio. | Summoned to himself with firmness no surrender his superiors had also preached this saying it was the way of eternal honor his comrades were old. | (audio) | (audio) |
| denoise the audio. | There are people who travel long distances to assure my continued existence we have also seen the power of faith at work among us it was muscular but it wasn't symmetrical. | (audio) | (audio) |
| denoise the audio. | Theory eventually proved inexact the heavens refused to give up their weeping but what has been happening recently might be described as creeping mannerism clever. | (audio) | (audio) |

Acoustic Editing - Background Music

| Instruction | Before Edit | Speechedit Result |
| --- | --- | --- |
| add rain to audio. | (audio) | (audio) |
| add car sound to audio. | (audio) | (audio) |
| add carefree music to audio. | (audio) | (audio) |
| add groovy music to audio. | (audio) | (audio) |

Acoustic Editing - Emotion Conversion

| Instruction | Transcription | Before Edit | Speechedit Result |
| --- | --- | --- | --- |
| change the emotion to happy mood. | 比尔想再看小主人一眼然后走进森林安静地死去。 | (audio) | (audio) |
| change the emotion to happy mood. | 世界爱眼日是每年十月的第二个星期四。 | (audio) | (audio) |
| change the emotion to happy mood. | 我会玩很多游戏呢听说多喝水能治百病。 | (audio) | (audio) |
| change the emotion to happy mood. | 建议戴口罩空气质量轻度污染。 | (audio) | (audio) |

Audio Understanding

Chinese and English ASR

| Input | Transcription |
| --- | --- |
| (audio) | 呃很久没有看到看过如此不带价值判断的电影 |
| (audio) | 桃花庄人塔俱乐部是位于杭州市德清县的一个俱乐部 |
| (audio) | he was excited and at the same time uneasy maybe the girl had already forgotten him |
| (audio) | it's true that everything has its destiny but one day that destiny will be realized |

Dialect Understanding

| Input | Transcription |
| --- | --- |
| (audio) | [方言-粤语] 你做乜嘢啊系咪唔想倾偈啊。 |
| (audio) | [方言-上海话] 阿拉考试还没定下来唻。 |
| (audio) | [方言-闽南语] 宝贝较早休困晚安。 |
| (audio) | [方言-川渝方言] 我难受得很别个都睡了。 |

Context ASR

| Input | Prompt | Transcription |
| --- | --- | --- |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about Banking. This audio may contains the following words or phrases:Zelle,daily A C H transfer limit,cashier's checks,transaction memos,F D I C regulations,cryptocurrency wallet,K Y C requirements. | Hey Chris, you won't believe what happened when I tried sending rent through Zelle yesterday. I hit some daily ACH transfer limit! My landlord's insisting on cashier's checks now. Remember how Sarah's Venmo payment got flagged last month? The bank's fraud detection system kept asking about transaction memos and 'source of funds' verification. Honestly, these FDIC regulations around peer-to-peer payments are getting ridiculous. I had to provide three months of bank statements just to increase my wire transfer threshold. Oh, and don't even get me started on cryptocurrency wallet KYC requirements. |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about Banking. This audio may contains the following words or phrases:Priority Pass lounges,T S A Pre Check,rewards structure,bonus miles,Citibank's Prestige Card,Visa Infinite,E M V chip security protocols,dynamic currency conversion. | So listen, I finally canceled my Chase Sapphire Reserve last week. Remember how they touted those Priority Pass lounges and Luxury Hotel Collection benefits? Turns out I only used the T S A Pre Check credit once this whole year! The annual fee jumped to five hundred fifty dollars, plus they started requiring eighteen thousand points to waive it. My Amex Platinum isn't any better that seven hundred dollar fee just hit, and their new rewards structure requires thirty thousand in annual spending for bonus miles. Oh, and get this Citibank's Prestige Card now charges two hundred bucks for authorized users! Honestly, these Visa Infinite perks like concierge services and purchase protection sound fancy, but when do regular people actually use E M V chip security protocols or dynamic currency conversion? |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about 酒店常旅客计划. This audio may contains the following words or phrases:至悦大使,重庆来福士洲际,酒廊待遇,万豪旅享家,钛金会员. | 诶?小李,我最近在研究IHG的会员体系,这个‘至悦大使’的达标条件也太苛刻了吧!‘三百权益’里,洲际的认可房晚才给三十晚。你说,他们家的‘先行者任务’算不算‘里程碑奖励’啊?对了,我之前用积分兑换重庆来福士洲际的行政套房,礼宾部居然没给酒廊待遇,反而现金订房的客人能拿到双早。万豪旅享家的‘钛金会员’都能自动匹配套房升级券,IHG这个动态定价系统真是让人头大! |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about 汽车行业. This audio may contains the following words or phrases:汽车之家曹雷,矩阵式 L E D 大灯,四十八伏轻混系统,可变气门升程技术,M B U X 超联屏,Sportback,Allroad. | 嘿,老李,你看到‘汽车之家’曹雷发的文章没?说新款奥迪A3加长到四米六了。昨儿我去4S店试驾,销售说这车配了啥矩阵式LED大灯,还有四十八伏轻混系统。不过,宝马1系那个B48发动机也改了‘可变气门升程技术’,奔驰A级更夸张,直接把MBUX超联屏塞进紧凑车里!要我说啊,现在车企搞细分市场真够拼的!听说奥迪还要出Sportback、Allroad等四个版本呢,连自适应巡航都标配了! |

Audio Generation

Voice Clone

| Input Prompt | Target Text | TTS Result |
| --- | --- | --- |
| (audio) | 全球每年有超过一百三十五万人,因交通事故而死亡。 | (audio) |
| (audio) | The stained glass offered a hypnotic atmosphere. | (audio) |

Multi-lingual Synthesis

| Input Prompt Text | Input Prompt Audio | Target Text | TTS Result |
| --- | --- | --- | --- |
| We asked over twenty different people, and they all said it was his. | (audio) | The stained glass offered a hypnotic atmosphere. | (audio) |
| The wedding was photographed by celebrity wedding photographer Kid Chan. | (audio) | Bender also conducted extensive research on autism. | (audio) |
| 关于不少万达广场的注册资本金更改。 | (audio) | 哎,这些情况在北京这样的大都市,是无法避免的。 | (audio) |
| 长春周二之前晴天多云五月七日是晴天。 | (audio) | 两人一直对婚变封口,使传闻闹得热烘烘。 | (audio) |

Ming-UniVision: Joint Image Understanding and Generation via a Unified Continuous Tokenizer

· 7 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

🚀 Technical Highlights

  1. First Continuous Unified Tokenizer for Vision: MingTok seamlessly supports both image understanding and generation within a single continuous latent space—eliminating quantization and bridging modalities.
  2. First NTP-style Autoregressive MLLM with Unified Continuous Visual Tokens: By building on MingTok, Ming-UniVision unifies vision and language under a shared next-token prediction framework, enabling end-to-end autoregressive modeling of diverse vision tasks.
  3. Reduced Representational Competition → 3.5× Faster Convergence: The unified continuous representation aligns semantic understanding and generative dynamics, significantly accelerating joint training without performance trade-offs.
  4. Multi-Round In-Context Learning in a Single Feature Space: All operations—understanding, generation, and editing—occur in the same continuous space, eliminating costly cross-space conversions and enabling simpler, more efficient training and inference.

The Challenge: The Inverse Nature of Seeing and Drawing

Autoregression—the powerful paradigm of modeling the world by “predicting the next token”—has already unified diverse modalities like language and audio. The next frontier is to bring visual understanding (seeing) and visual generation (drawing) into this unified sequence‑to‑sequence framework.

However, this ambition encounters a deep challenge: in many respects, understanding and generation are inverse tasks.

  • Understanding: Pixels → high‑dimensional, abstract semantic concepts
  • Generation: Concepts → fine‑grained, high‑fidelity pixels

These tasks have drastically different—and often competing—preferences for their underlying visual representation.

Why Previous Approaches Fell Short

Existing models attempt unification via two limited strategies:

  1. Asymmetric Designs: Use different, heterogeneous feature spaces for each task. During multi‑turn interactions, this forces inefficient “round‑trips” between spaces, causing latency and complexity.
  2. Shared Discrete Tokens: Unify the token space but introduce quantization errors. This hurts image fidelity and degrades understanding capability.

Our Solution: Ming-UniVision and MingTok

To break this impasse, we introduce Ming-UniVision, a new generation of autoregressive vision‑language model built on a foundational innovation: MingTok.

MingTok is the first visual tokenizer based on a continuous latent space. It delivers a truly unified and efficient representation that serves as the bedrock for Ming‑UniVision’s unified NTP (Next‑Token Prediction) framework—harmonizing image understanding, generation, and editing in one in‑context multimodal loop.

The Core Design: A Three-Stage Architecture to Reconcile Competition

At the heart of Ming-UniVision is the MingTok tokenizer, a three-stage sequential architecture elegantly designed to reconcile the competing representational demands of understanding and generation within a single framework.

Figure 1: (a) Existing models use separate visual representations. (b) MingTok, the engine of Ming-UniVision, uses a unified scheme for both semantic and generative representations. (c) This unified approach leads to over 3.5x faster training convergence.

  1. Low-level Encoder: Maps an input image into a sequence of compact, continuous latent codes, optimized for high-quality and efficient autoregressive generation.
  2. Semantic Decoder: Autoregressively "refines" the compact latent codes into high-dimensional, rich semantic features aligned with top-tier understanding models like CLIP.
  3. Pixel Decoder: Serves as a quality-assurance module, ensuring the original image can be reconstructed from the compact latents with high fidelity (a schematic sketch of the three stages follows this list).
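The data flow through the three stages can be sketched schematically as below. Each stage is collapsed to a single linear layer purely to show the plumbing (pixels → compact latents → semantic features, plus a reconstruction path); the real MingTok stages are transformer networks, and every dimension here is an assumption.

```python
import torch
import torch.nn as nn

class MingTokSketch(nn.Module):
    """Schematic only: three-stage tokenizer data flow, not the real model."""

    def __init__(self, patch_dim=768, latent_dim=32, sem_dim=1024):
        super().__init__()
        self.low_level_encoder = nn.Linear(patch_dim, latent_dim)  # stage 1
        self.semantic_decoder = nn.Linear(latent_dim, sem_dim)     # stage 2
        self.pixel_decoder = nn.Linear(latent_dim, patch_dim)      # stage 3

    def forward(self, patches: torch.Tensor):
        z = self.low_level_encoder(patches)  # compact latents for AR generation
        sem = self.semantic_decoder(z)       # rich features for understanding
        recon = self.pixel_decoder(z)        # reconstruction for fidelity
        return z, sem, recon

z, sem, recon = MingTokSketch()(torch.randn(1, 256, 768))  # 256 image patches
print(z.shape, sem.shape, recon.shape)
```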

The Key Innovation: MingTok creates a unified, differentiable interface. The high-level features for understanding can be directly fed as conditional input for the next round of generation or editing. This completely eliminates the costly detour through pixel space.

The Breakthrough: A Fundamental Leap in Efficiency

By integrating MingTok, Ming-UniVision achieves competitive results on both understanding and generation tasks. The shared continuous latent space unlocks two fundamental layers of efficiency, resolving bottlenecks that have plagued previous architectures.

Figure 2: On general recognition tasks, our method approaches the performance of models with separated representations and significantly outperforms other unified-representation models. For generation, our model shows a clear advantage on fine-grained tasks.

1. A Revolution in Training: >3.5x Faster Convergence

Traditional approaches expend massive resources aligning heterogeneous representations, creating an intrinsic "task competition" that slows learning. MingTok solves this at its root.

  • Synergistic Enhancement: Our ablation studies show that using MingTok for both tasks fosters a synergy where understanding and generation capabilities enhance each other, rather than competing.
  • >3.5x Speedup: By avoiding inefficient alignment, the model focuses its energy on learning, reaching the same performance level in a fraction of the time compared to traditional schemes.

Figure 3: The performance drop between generation-only training and joint training is minimal with MingTok, proving the advantage of our unified approach.

2. A Revolution in Interaction: Goodbye to the "Pixel Round-Trip"

The efficiency of multi-turn interactions (e.g., generate → edit → re-generate) depends on the "understanding-generation" loop. This is precisely where traditional architectures falter.

| Architecture Type | Multi-turn Capability | Core Bottleneck | Interaction Path | Efficiency & Fidelity |
| --- | --- | --- | --- | --- |
| DiT-based Models | ❌ Not Natively Supported | Non-autoregressive, stateless | N/A (full process restart) | Low |
| Hybrid Architectures | ⚠️ Supported, but Inefficient | Dual-branch, un-unified spaces | Latent → Pixel → Feature | Low, complex, lossy |
| Unified AR | ⚠️ Supported, but Inefficient | Heterogeneous spaces | Latent → Pixel → Feature | Low, lossy |
| Ming-UniVision | ✅ Native & Highly Efficient | Unified Continuous Space | Feature → Feature | High & High-Fidelity |

As the table shows, any architecture with separated spaces is doomed to the inefficient Latent → Pixel → Feature round-trip. This "pixel detour" introduces massive latency and causes contextual information to decay.

Ming-UniVision achieves a direct Feature → Feature closed loop. High-level features from an understanding task can be directly consumed by the next generation task, unlocking truly coherent multimodal sequence modeling. This enables tasks that once required multiple specialized models to emerge naturally within a single, unified framework:

  • Iterative Image Enhancement: Perform super-resolution, then directly continue with colorization or denoising.
  • Generative Chain-of-Thought: Perform an understanding task (e.g., "segment the car"), then directly apply an editing command to that region.

Figure 4: Multi-turn tasks like "Super-resolution → Colorization" and "Segmentation → Editing" are now part of a seamless flow.

Understanding, generation, and editing are no longer isolated pipelines but are woven into a continuous visual conversation.


Conclusion and The Road Ahead

We believe that a unified and continuous visual representation like MingTok opens up new possibilities for building more flexible and intuitive multimodal interactive systems.

We know this is just one step in a long journey. We have open-sourced our code and initial model weights, hoping to provide a useful foundation for the community and to inspire more discussion around unified representations. We look forward to collaborating with our peers to collectively advance the future of multimodal AI.

Get Involved

Try out our open-source model Ming-UniVision and MingTok-Vision on our GitHub Page / Demo Page. Please star our repo if you like it!

Segmentation-as-Editing for Unified Multimodal AI

· 8 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

The Hype and the Hidden Question

The multimodal AI world has been thriving.

From the debut of Qwen-Image to the interactive editing hype sparked by Nano Banana, image editing has rapidly become the next battlefield for generative AI.

Editing fundamentally requires two distinct skill sets:

  • Know where, what, and how to change (understanding the image)
  • Produce the change with high visual quality (generating the image)

Its rich gameplay and strong interactivity have pulled in users, developers, and creators alike.

But behind the noise, few are asking:

Beneath this prosperity, how close are we to a truly unified “understanding + generation” AI?

Understanding and Generation: Two Hands, Often Out of Sync

For years, we’ve chased an ambitious goal:

Build a unified multimodal model that understands the world like a scientist (e.g., image segmentation) while creating it like an artist (e.g., image editing).

In theory, these abilities should be mutually reinforcing:

“The deeper the understanding, the better the creation; the more the creation, the deeper the understanding.”

Reality is messier.

In AI today:

  • Understanding = the left hand: precise abstractions, semantic reasoning, boundaries.
  • Generation = the right hand: coherent pixels, style, aesthetics.

But training a model to recognize 10,000 cat photos doesn’t magically make it capable of painting cats, and painting cats repeatedly doesn’t make it understand cats better.

Worse, in multitask training, the two often compete for resources — optimizations for understanding can hurt generation, and vice versa.

We’re missing a catalyst: a task that forces the left and right hands to evolve together.


The Struggle: 16% Segmentation and Out-of-Control Generation

Before finding our solution, our unified model was struggling with generative segmentation:

Given an instruction like “segment the banana in the upper-right corner”, we wanted the model to output a segmentation mask directly.

The results were painful.

Struggling with Segmentation

On RefCOCO-val, our cIoU plateaued at ~16%.

The root cause is the distribution gap.

Generative models thrive on natural, continuous image distributions. Segmentation masks, however, are synthetic, abstract, binary maps — as unnatural as it gets for an image generator.

It was like asking a painter to draw an X-ray: doable, but far from their artistic instincts.

Here, generation wasn’t helping segmentation — it was tripping it up.

We needed a new task that:

  1. Met the precision demands of understanding.
  2. Played to the strengths of generation.

The “Aha” Moment: Dressing Segmentation in Color

Here’s the analogy that unlocked it for us:

If you want a child to mark an object, is it easier to have them draw a tight outline with a pencil, or fill it in with bright colors?

Obviously, the latter.

Instead of forcing our model to output abstract black-and-white masks, we turned the segmentation task into a color-editing task.

Example:

  • Instruction: segment the banana in the upper-right
  • Old way: Output a mask ❌
  • New way: Directly edit the image: “paint the banana purple”, “make the banana red”, etc. ✅

Segmentation as Editing

This brought the task’s data distribution back to the realm of natural images — where generative models shine.

Why This Works: The Hidden Catalyst

That small twist turned out to be exactly the catalyst we’d been searching for.

  • Boosting Understanding: To color the banana without bleeding outside the boundary, the model must internally nail pixel-perfect segmentation. The segmentation step became an implicit prerequisite to editing.

  • Unleashing Generation: No more awkward synthetic masks — the model is doing what it knows best: image-to-image editing. All its strengths in shading, texture, and edge blending go into making the change look natural.

For the first time, the left hand and right hand weren’t fighting — they were helping each other.


The Numbers: From 16% to 72.4% — and Beyond

1. SOTA-level Segmentation

The cIoU score didn’t just improve — it soared from 16% to 72.4% on RefCOCO-val, a relative gain of over 350%.

Qualitatively, the model outperformed competitors in pinpointing and segmenting targets, even in reasoning-heavy cases.

Against Qwen-Image and Nano Banana, our model:

  • Located small or occluded targets more reliably.
  • Produced boundaries that were visually and semantically aligned with instructions.

Segmentation Comparison 1 Our model (right) accurately locates and segments the target subject. Qwen-Image (second from left) fails to locate the correct target, while Nano-banana (third from left) fails to accurately segment the man's head and has loose boundary lines.

Segmentation Comparison 2 For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.

During evaluation, thanks to our model's high consistency in non-edited regions, we can derive the segmentation mask directly by computing the per-pixel difference between the edited result and the original image.
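A minimal sketch of that mask-extraction step, assuming the edit recolors only the target region: threshold the per-pixel difference between the original and edited images. The threshold value is an arbitrary assumption.

```python
import numpy as np

def mask_from_edit(original: np.ndarray, edited: np.ndarray, thresh=25):
    """original, edited: (H, W, 3) uint8 images of identical size.
    Returns a binary mask where 1 marks pixels that changed in the edit."""
    diff = np.abs(original.astype(np.int16) - edited.astype(np.int16))
    return (diff.max(axis=-1) > thresh).astype(np.uint8)

original = np.zeros((64, 64, 3), dtype=np.uint8)
edited = original.copy()
edited[16:48, 16:48] = (128, 0, 128)  # pretend the model painted a region purple
print(mask_from_edit(original, edited).sum())  # 1024 pixels inside the mask
```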

Calculating difference on Ming-Lite-Omni1.5, Qwen-Image-Edit, Nano-banana

The results show that our model's performance on segmentation is now on par with specialized vision models.

| Model Category | Model Name | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) |
| --- | --- | --- | --- | --- |
| Vision Specialist Models | VLT | 67.5 | 56.3 | 55.0 |
| | CRIS | 70.5 | 62.3 | 59.9 |
| | LAVT | 72.7 | 62.1 | 61.2 |
| | PolyFormer-B | 74.8 | 67.6 | 67.8 |
| MLLM + Specialist (SAM) | LISA-7B | 74.1 | 62.4 | 66.4 |
| | PixelLM-7B | 73.0 | 66.3 | 69.3 |
| Generative Models | Nano-banana* | 15.7 | 13.9 | 14.9 |
| | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 |
| | Ming-Lite-Omni1.5 | 72.4 | 62.8 | 64.3 |

For each test set, Nano-banana and Qwen-Image-Edit were evaluated on a randomly sampled subset of 500 images to reduce computational cost while preserving the key statistical trends. We observed that Nano-banana frequently fails to grasp the image-segmentation intent during inference, which explains its comparatively low scores. This may be attributed to differences in training objectives and data emphasis.

2. Sharper, More Controllable Editing

The beauty of this method is that it not only fixed the segmentation weakness but also dramatically enhanced the model's general editing capabilities.

Because the model has learned an unprecedented "respect for boundaries" through thousands of "precise coloring" exercises, this "muscle memory" for fine-grained control has transferred to all editing tasks. Our edit controllability score saw a significant jump from 7.69 to 8.12 across sub-tasks like background, color, and material changes.

Editing Controllability Comparison Prompt: "remove the bow tie of the man on the far right." Our model (right) precisely removes only the target bow tie while maintaining background consistency. Qwen (second from left) incorrectly removes multiple bow ties and introduces inconsistencies. Nano-banana (third from left) also struggles with consistency.

3. Stronger ID Consistency

A core challenge in portrait editing is maintaining identity. Our model excels here as well. Whether changing a hairstyle or adjusting an expression, the model skillfully preserves the person's core features.

ID Consistency Comparison Top Row (Turn head): Our model (right) maintains ID and background consistency, unlike competitors. Middle Row (Smile): Our model (right) correctly follows the prompt while preserving ID, avoiding distortions seen in others. Bottom Row (Change background): Our model (right) excels at preserving the subject's ID and appearance during a background swap.

See More Editing Consistency in Action:


An Honest Look: Where We Can Still Improve

Despite the leap forward, challenges remain:

  • Large pose changes (e.g., standing → running) need more reliability.
  • Multi-step or compound instructions require better parsing and execution.
  • Instruction diversity support needs expansion.

These are our next milestones.

Takeaway: The Next Catalysts Are Out There

From 16% to 72.4% — this wasn’t driven by a massive architecture overhaul or billion-image datasets.

It came from one change in task design.

The lesson: Instead of gluing capabilities together after the fact, find naturally cooperative tasks — where solving the problem requires multiple abilities to mesh seamlessly.

“Segmentation-as-editing” is just the first example.

We suspect 3D understanding, video generation, and other domains have their own hidden catalysts, waiting to be discovered.

At last, AI’s left and right hands have learned to high-five.

And this is only the overture.

Try out our open-source model Ming-lite-omni 1.5 on our GitHub Page / Demo Page. Please star our repo if you like it!

Introducing Ring-lite-2507

· 7 min read
inclusionAI
Ant Group

📖 Technical Report | 🤗 Hugging Face | 🤖 ModelScope

Overview

We present Ring-lite-2507, an upgraded version of our previously released lightweight reasoning model, Ring-lite (2506). Built upon a 16.8B Mixture-of-Experts (MoE) large language model with 2.75B activated parameters, Ring-lite-2507 further advances its reasoning capabilities while demonstrating superior performance across a comprehensive range of LLM benchmarks, including general text understanding, alignment, coding, logical reasoning, and agentic tasks. Thanks to our innovative and robust reinforcement learning training pipeline, Ring-lite-2507 distinguishes itself from the latest public dense models under 10B parameters by offering competitive performance across various tasks while activating only about one-third as many parameters.

To address the optimization instability of MoE RL training, we propose a novel approach, Constrained Contextual Computation Policy Optimization (C3PO), which enhances training stability and improves computational throughput via algorithm-system co-design. Additionally, we systematically investigate the dynamic relationship between long chain-of-thought SFT and RL training. Rather than relying solely on validation metrics, we explore optimal strategies for selecting the fine-tuned model best suited for RL scaling, yielding superior performance-efficiency trade-offs in our RL training pipeline. Lastly, we develop a two-stage training paradigm to harmonize multi-domain data integration, enhancing reasoning ability while effectively improving performance across various downstream general tasks.

Highlights

  • 🚀 Superior performance across tasks: Ring-lite-2507 demonstrates outstanding performance across both reasoning and general tasks;
  • 🔥 Only 2.75B activated parameters: Ring-lite-2507 is built upon a Mixture-of-Experts (MoE)-based large language model with only 2.75 billion activated parameters;
  • ⛓️‍💥 Algorithm-system co-design: We propose the novel C3PO approach and employ token efficiency to improve training stability and effectiveness;
  • 🔍 Publicly available: We fully release our training recipe and model weights.

Evaluation

We conduct a comprehensive evaluation of our models across two main domains: reasoning and general. We utilize a diverse set of public benchmarks, organized according to the specific aspects they measure.

Knowledge Understanding

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| MMLU-Pro (EM) | 72.50 | 63.44 | 72.56 |
| GPQA-Diamond (Pass@1) | 69.35 | 63.51 | 62.00 |
| SuperGPQA (EM) | 40.05 | 13.97 | 40.36 |
| Phybench (Pass@1) | 28.51 | 29.19 | 22.14 |

Math

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| MATH-500 (Pass@1) | 97.95 | 96.80 | 97.30 |
| CNMO 2024 (Pass@1) | 75.09 | 77.26 | 74.57 |
| AIME 2024 (Pass@1) | 79.79 | 79.00 | 74.90 |
| AIME 2025 (Pass@1) | 72.92 | 69.50 | 67.19 |
| LiveMathBench (Pass@1) | 83.37 | 85.08 | 81.90 |
| TheoremQA (Pass@1) | 70.00 | 70.19 | 68.81 |
| OlympiadBench (math) (Pass@1) | 80.64 | 82.86 | 80.20 |

Coding

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| LiveCodeBench (2408-2505) (Pass@1) | 60.35 | 59.53 | 55.12 |
| Codeforces (Rating) | 1830 | 1673 | 1580 |
| Codeforces (Percentile) | 92.16 | 88.00 | 79.44 |

Reasoning & Agentic

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| DROP (zero-shot F1) | 89.27 | 60.21 | 87.13 |
| BBH (EM) | 88.65 | 50.84 | 87.30 |
| ARCPrize (Pass@1) | 19.00 | 3.12 | 3.88 |
| MuSR (EM) | 77.19 | 66.77 | 76.92 |
| BFCL_Live (Pass@1) | 74.81 | 66.76 | 75.99 |

Alignment

| Benchmark | Ring-lite-2507 | Ring-lite-2506 | Qwen3-8B-Thinking |
| --- | --- | --- | --- |
| IFEval (Prompt Strict) | 84.66 | 54.34 | 85.40 |
| AlignBench v1.1 (gpt-4.1) | 80.90 | 69.60 | 74.70 |
| FoFo (gpt-4-turbo) | 85.02 | 67.81 | 81.93 |
| ArenaHard (gpt-4.1) | 88.85 | 56.12 | 86.14 |

Constrained Contextual Computation Policy Optimization (C3PO)

We introduce Constrained Contextual Computation Policy Optimization (C3PO), an innovative token-level optimization framework designed to mitigate training instability while enhancing throughput consistency. Unlike sample-level filtering, C3PO operates at the token level: it samples tokens to form a token-level global batch, so each training step feeds a consistent number of tokens to the optimizer. This reduces gradient variance and consequently yields stable optimization.
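A toy illustration of the fixed-token-budget idea (not the actual C3PO algorithm, which also involves clipping and advantage computation): sample a constant number of tokens from each step's rollouts so the optimizer always sees the same batch size. The budget and sampling rule here are illustrative assumptions.

```python
import random

TOKEN_BUDGET = 4096  # assumed fixed number of tokens optimized per step

def token_level_batch(rollouts):
    """rollouts: list of (sample_id, response_length) pairs from this step."""
    pool = [(sid, pos) for sid, length in rollouts for pos in range(length)]
    random.shuffle(pool)
    return pool[:TOKEN_BUDGET]  # constant-size batch -> lower gradient variance

step_a = [("r0", 3000), ("r1", 1200)]  # 4200 tokens generated this step
step_b = [("r2", 7000), ("r3", 2500)]  # 9500 tokens generated this step
print(len(token_level_batch(step_a)), len(token_level_batch(step_b)))  # 4096 4096
```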


C3PO

Balancing Token Efficiency between Distillation and RL

While distillation is effective, we find that it requires more training tokens than RL training to achieve comparable performance. Furthermore, we observe that the number of training epochs for the distilled model significantly influences the trend of the entropy loss, and thereby the exploration scope for RL. Our experiments show that increasing the number of SFT epochs leads to a rapid collapse in entropy, whereas insufficient SFT training inevitably results in inferior performance. To systematically quantify the choice of the optimal SFT epoch, we employ token efficiency to determine the checkpoint best suited for RL scaling.

Training Data

To ensure a high-quality training dataset for reinforcement learning, we established a comprehensive and meticulous data curation pipeline. This pipeline encompasses several key stages, such as data cleansing, answer verification, and data annotation, all designed to thoroughly decontaminate the data and ensure it is both suitable and informative for RL training.


Data Pipeline

Training Pipeline


Training Pipeline

Reasoning RL

Compared to our previously released Ring-lite-2506, we expanded our reasoning dataset by incorporating more challenging math, coding, and STEM data. Specifically, we adopted 67K math problems, 32K coding problems, and 9.9K scientific problems for reasoning RL training. In addition, we augmented the dataset with more than 19K logical games, such as ARC-AGI, Countdown, Sudoku, and AlphaMaze. For each type of problem, we designed a suitable reward function to ensure our training examples are verifiable.
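For a flavor of what such a verifiable reward function can look like, here is an assumed example for a Countdown-style puzzle; the post does not publish the actual reward implementations, so this is purely illustrative.

```python
import ast

def countdown_reward(expr: str, numbers: list[int], target: int) -> float:
    """Reward 1.0 only if expr uses exactly the given numbers and hits target."""
    try:
        tree = ast.parse(expr, mode="eval")
        used = [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]
        if sorted(used) != sorted(numbers):  # must use each number exactly once
            return 0.0
        return 1.0 if eval(compile(tree, "<expr>", "eval")) == target else 0.0
    except (SyntaxError, ZeroDivisionError):
        return 0.0  # malformed or invalid expressions earn nothing

print(countdown_reward("(100 - 4) * 25 / 10", [100, 4, 25, 10], 240))  # 1.0
print(countdown_reward("100 + 4", [100, 4, 25, 10], 240))              # 0.0
```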

General RL

Apart from reasoning tasks, our Ring-lite-2507 has significantly expanded the collection of general datasets for RL training. Our general RL training does not compromise performance on reasoning tasks; instead, it enhances overall text understanding across a broad range of general benchmarks.

Our general RL training incorporates a variety of tasks, including instruction following, question answering, text summarization, and more. For open-ended questions, we employ a robust reward model to assign appropriate scores. Additionally, we have integrated a rule-based verifier to handle problems that can be easily validated, such as instruction-following tasks.

Citation

@misc{lingteam2025ringlitescalablereasoningc3postabilized,
title={Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs},
author={Ling Team and Bin Hu and Cai Chen and Deng Zhao and Ding Liu and Dingnan Jin and Feng Zhu and Hao Dai and Hongzhi Luan and Jia Guo and Jiaming Liu and Jiewei Wu and Jun Mei and Jun Zhou and Junbo Zhao and Junwu Xiong and Kaihong Zhang and Kuan Xu and Lei Liang and Liang Jiang and Liangcheng Fu and Longfei Zheng and Qiang Gao and Qing Cui and Quan Wan and Shaomian Zheng and Shuaicheng Li and Tongkai Yang and Wang Ren and Xiaodong Yan and Xiaopei Wan and Xiaoyun Feng and Xin Zhao and Xinxing Yang and Xinyu Kong and Xuemin Yang and Yang Li and Yingting Wu and Yongkang Liu and Zhankai Xu and Zhenduo Zhang and Zhenglei Zhou and Zhenyu Huang and Zhiqiang Zhang and Zihao Wang and Zujie Wen},
year={2025},
eprint={2506.14731},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.14731},
}

Introducing Ming-Lite-Omni V1.5

· 6 min read
inclusionAI
Ant Group

GITHUB | 🤗 Hugging Face | 🤖 ModelScope

Overview

Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of Ming-lite-omni (GitHub). It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon Ling-lite-1.5, Ming-lite-omni v1.5 has a total of 20.3 billion parameters, with 3 billion active parameters in its MoE (Mixture-of-Experts) section. It demonstrates highly competitive results on various modal benchmarks compared to industry-leading models.


Performance Comparison

Introducing Ming-lite-omni v1.5

Controllable Image Generation: Pixel-Level Control, Infinite Creativity

Ming-lite-omni v1.5 significantly optimizes Scene Consistency and ID Consistency (Character / Style Consistency) in image editing. When editing human figures, it demonstrates a clear advantage in maintaining scene and character ID. Furthermore, it expands support for perceptual tasks such as generative segmentation, depth prediction, object detection, and edge contour generation.

Generative Image Segmentation
User
Given the following instructions: little girl, pink, your monitors colors off friend p pink shirt girl; please perform referring segmentation on this image.
Ming-lite-omni v1.5
User
Please segment different classes in this image.
Ming-lite-omni v1.5
User
Please segment different instances in this image.
Ming-lite-omni v1.5

Depth and Edge Detection

Original Image | Generated Depth Map | Generated Bounding Boxes | Generated Edge Contours

Audio-Video Interactive Understanding

Experience Alignment Optimization: Smarter Interaction That Understands You Better

Visual Dialogue
User
This picture looks very design-oriented. Can you describe its various design elements in detail?
Ming-Lite-Omni
This image shows a high-end gaming mouse with a modern and technological design. The mouse body is black with a smooth metallic finish. There is a blue glowing logo on the top of the mouse, and below the logo is the mouse wheel and left and right button area. The side of the mouse has a blue light decoration, which increases the visual appeal. The bottom of the mouse has a non-slip texture design to ensure stability during use. The overall design is simple and futuristic, suitable for gamers.
Ming-Lite-Omni V1.5
This image showcases a gaming mouse with a striking design, featuring the following design elements:
1. **Color and Material**: - The main body of the mouse is black, with a smooth surface and a metallic finish. - The edges and button areas of the mouse have blue LED light strips, adding a high-tech feel and visual appeal.
2. **Shape and Structure**: - The mouse has an overall streamlined design, with a slightly sloped top that allows for natural hand placement. - The sides of the mouse feature pronounced curves, designed ergonomically to provide a comfortable grip.
3. **Buttons and Scroll Wheel**: - There is a scroll wheel on the left side of the mouse, with a blue LED light strip above it. - There are additional buttons on the right side of the mouse, which also have blue LED light strips, enhancing both functionality and aesthetics.
4. **Brand Logo**: - In the center of the mouse's palm rest, there is an illuminated "G" logo, which is a signature design element of the brand.
5. **Lighting Effects**: - The blue LED light strips on the mouse stand out prominently against the dark background, enhancing its high-tech and futuristic feel.
6. **Texture and Details**: - The top of the mouse has a textured design, likely for an anti-slip effect. - The sides of the mouse feature fine, elongated textured patterns, which add a sense of visual depth.
Overall, the design of this mouse combines a modern, high-tech aesthetic with ergonomic design, making it both beautiful and practical, and well-suited for gamers.
User
Why is the calligraphy in the picture good?
Ming-Lite-Omni
This calligraphy work showcases superb calligraphic skills and profound cultural heritage. The brushwork is fluid and natural, the character structures are rigorous, and the layout is well-proportioned, reflecting the calligrapher's deep skill and artistic cultivation.
Ming-Lite-Omni V1.5
This piece of calligraphy displays several notable characteristics that make it an excellent work of art:
1. **Exquisite Brushwork**: The calligraphy's brushwork is fluid and natural, with rich variations in line thickness, demonstrating the calligrapher's profound skill and proficient technique.
2. **Rigorous Structure**: The structure of each character is meticulously designed, with natural connections between strokes, maintaining a good balance and harmony between characters.
3. **Composition and Layout**: The overall composition of the piece is well-arranged, with even line and character spacing, creating a sense of order and neatness.
4. **Ink Tone Variation**: The variations in ink tone, from dark to light, are natural, adding layers and a three-dimensional feel to the work.
5. **Elegant Content**: The content of the calligraphy is a beautiful classical text, rich in philosophical and literary value, which enhances the cultural depth of the work.
6. **Seal Embellishment**: The red seals on the work not only add a decorative element but also serve to embellish and balance the composition.
Together, these features constitute the high artistic value of this calligraphy piece, making it a masterpiece worthy of appreciation and study.
User
Is the phone in the picture suitable to buy for parents?
Ming-Lite-Omni
This phone has a simple and stylish appearance, suitable for parents who prefer a minimalist style. The camera configuration is also quite advanced, making it suitable for users who enjoy taking photos.
Ming-Lite-Omni V1.5
This phone is suitable for buying for parents for the following reasons:
1. **Appearance Design**: The phone features a simple and stylish design. The white body looks clean and neat, aligning with modern aesthetics.
2. **Camera Configuration**: The rear camera has a unique and neat arrangement, which may provide a good photography experience.
3. **Brand**: The Honor (荣耀) brand has a good reputation in the smartphone market, offering stable and reliable products.
4. **Features**: Phones of this type are typically equipped with a large-capacity battery and long battery life, suitable for daily use.
5. **Ease of Use**: It may offer a simple user interface and easy-to-operate functions, suitable for users who are not familiar with complex operations.
6. **Value for Money**: Honor phones are generally known for their high cost-performance ratio, making them suitable for users on a limited budget who still want a good experience.
Of course, whether it is truly suitable depends on the actual needs and preferences of the parents.
User
How to buy a plane ticket using the software in the picture? Please list the specific steps
Ming-Lite-Omni
The specific steps to buy a plane ticket using the software in the picture are as follows:
1. Open the Ctrip app. 2. Click on the "Flights" option. 3. On the flight page, select the departure city, destination, departure date, and return date. 4. Click the search button to view available flight information. 5. After selecting a suitable flight, click the "Book" button. 6. Fill in passenger information and choose a payment method. 7. Confirm the order details and submit the payment to complete the purchase.
Ming-Lite-Omni V1.5
The specific steps to purchase a plane ticket using the Ctrip app are as follows:
1. **Download and Install the Ctrip App**: - If you haven't installed the Ctrip App, you can search for "Ctrip Travel" (携程旅行) in the app store (like the Apple App Store or Android app markets) and download it.
2. **Open the Ctrip App**: - After installation is complete, open the Ctrip app.
3. **Select the "Flights" Option**: - On the homepage navigation bar, tap the "Flights" option.
4. **Enter Departure and Destination Cities**: - On the flight booking page, enter your departure and destination cities.
5. **Select Departure and Return Dates**: - Choose your departure date and return date (if booking a round trip).
6. **Select Passenger Information**: - Enter the number of passengers and their types (adult, child, infant).
7. **Choose a Flight**: - Browse and select your preferred flight.
8. **Fill in Passenger Information**: - Enter the passenger's name, ID number, contact information, etc.
9. **Select a Payment Method**: - Choose your preferred payment method (e.g., Alipay, WeChat Pay, bank card, etc.).
10. **Confirm Order and Pay**: - After confirming that the order information is correct, tap the "Pay" button to complete the payment.
11. **Receive Your E-ticket**: - After successful payment, you will receive an e-ticket, which you can view or download from the "My Orders" section within the app.
By following these steps, you can easily purchase plane tickets using the Ctrip App.
User
Based on the chart, under which primer is the Δct difference between samples the smallest? (Image source: publicly available online, 舒桐科技公众号)
Ming-Lite-Omni
Based on the chart, the Δct difference between samples is smallest under the cGAS-qF+qR primer.
Ming-Lite-Omni V1.5
The difference in Δct between samples is smallest under the **EGFP-cGAS monoclonal** primer.

**Calculation Process:**
**1. Understand the meaning of Δct:** Δct is the difference in relative expression between the target gene (cGAS-qF+qR or EGFP-cGAS) and the reference gene (cTubulin). A smaller Δct indicates a smaller difference in expression between the target and reference genes.
**2. Compare Δct under different primers:**
- **cGAS-qF+qR primer:** - Wild type: Δct = 8.06 - EGFP-cGAS monoclonal: Δct = -0.15 - Difference: 8.06 - (-0.15) = 8.21
- **EGFP-cGAS monoclonal primer:** - Wild type: Δct = 10.44 - EGFP-cGAS monoclonal: Δct = -2.9 - Difference: 10.44 - (-2.9) = 13.34
- **EGFP-cGAS-qF+qR primer:** - Wild type: Δct = 21.43 - EGFP-cGAS monoclonal: Δct = -13.45 - Difference: 21.43 - (-13.45) = 34.88
**3. Conclusion:** Under the **EGFP-cGAS monoclonal** primer, the difference in Δct between samples is the smallest (Δct difference is 13.34), indicating the smallest difference in expression between samples under this primer.

**Summary:** The difference in Δct between samples is smallest under the **EGFP-cGAS monoclonal** primer.

Model Architecture Upgrade and Capability Evaluation

The Ming-lite-omni v1.5 model architecture is outlined below. The core design references the structure of Ming-lite-omni V1. However, a key distinction is the upgrade of the Vision head to support reference image feature input, specifically to enhance character and scene consistency in image editing.

Diagram of the Ming-lite-omni v1.5 model architecture.

The model's capabilities have been significantly optimized and upgraded across three key areas: enhanced Omni-modal comprehension, precise visual editing control, and improved user experience.

Enhanced Omni-Modal Comprehension

Thanks to optimized data quality, Ming-lite-omni v1.5 shows significant improvements in tasks such as vision-text comprehension (including image-text, document, and video understanding) and speech understanding. It has reached an industry-leading level for models of comparable scale.

Vision-text Comprehension

| Task Type | Dataset | Qwen2.5-VL-7B | Ming-lite-omni | Ming-lite-omni v1.5 |
| --- | --- | --- | --- | --- |
| Image-text Understanding | AI2D | 84.36 | 83.1 | 84.91 |
| | HallusionBench | 55.77 | 55.0 | 54.59 |
| | MMBench_TEST_V11 | 82.75 | 80.8 | 80.73 |
| | MMMU | 56.56 | 56.3 | 54.33 |
| | MMStar | 65.27 | 64.7 | 65.07 |
| | MMVet | 71.61 | 71.3 | 73.99 |
| | MathVista | 68.10 | 71.6 | 72.00 |
| | OCRBench | 87.80 | 88.4 | 88.90 |
| | Average | 71.5 | 71.4 | 71.8 |
| Video Understanding | VideoMME (w/o subs) | 65.10 | 63.4 | 67.07 |
| | VideoMME (w/ subs) | 71.60 | 66.01 | 72.59 |
| | VideoMME (avg) | 68.35 | 67.7 | 69.83 |
| | MVBench | 69.60 | 67.7 | 69.43 |
| | LongVideoBench | 56.00 | 56.6 | 59.54 |
| | OvOBench | 51.10 | 48.48 | 52.17 |
| | Average | 61.26 | 58.89 | 62.74 |
| Document Understanding | ChartQA_test | 87.24 | 85.1 | 88.84 |
| | DocVQA_test | 95.57 | 93 | 93.68 |
| | TextVQA_val | 85.06 | 82.8 | 82.27 |
| | OCRBench | 87.8 | 88.4 | 88.9 |
| | Average | 88.91 | 87.32 | 88.42 |

Speech Understanding

| Model | Average (Open-ended QA) | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni v1.5 | 4.474 | 4.648 | 4.3 | 61.164 | 45.777 | 65.934 | 55.599 | 98.076 |
| Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
| MiniCPM-o | 4.285 | 4.42 | 4.15 | 50.72 | 54.78 | 78.02 | 49.25 | 97.69 |
| Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |

Precise Visual Editing Control

Ming-lite-omni v1.5 employs the following optimization strategies to address the issues of character ID and scene ID consistency during image editing:

  1. ID and Scene Consistency Loss: This is achieved by increasing the weight of the edited region in the target image and the reference strength of the non-edited region in the reference image, while simultaneously decreasing the reference strength of the edited region in the reference image. This approach enhances image editing consistency (see the sketch after this list).
  2. Incorporating Generative Detection and Segmentation Tasks to Boost Perceptual Capabilities: By supporting generative segmentation and keypoint detection, the model's understanding of image details and spatial relationships is improved. This enhances the structural controllability of the editing and generation processes, leading to significant increases in evaluation metrics related to position, structure, and quantity.
  3. Multi-Task Collaborative Learning Strategy: Through a joint training pipeline, generation and editing mutually reinforce each other. Segmentation tasks are transformed into colorization editing tasks, which significantly improves segmentation metrics and the precision and controllability of local image editing, resulting in smoother edges for edited regions.
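To make point 1 concrete, here is a minimal, hypothetical sketch of such a region-weighted consistency loss in PyTorch. The weight values and the mask convention are illustrative assumptions, not the exact loss used in Ming-lite-omni v1.5.

import torch
import torch.nn.functional as F

def consistency_weighted_loss(pred: torch.Tensor,
                              target: torch.Tensor,
                              reference: torch.Tensor,
                              edit_mask: torch.Tensor,
                              w_edit: float = 2.0,
                              w_ref_keep: float = 1.0,
                              w_ref_edit: float = 0.1) -> torch.Tensor:
    """Hypothetical region-weighted loss in the spirit of point 1.

    pred / target / reference: (B, C, H, W) images or latents.
    edit_mask: (B, 1, H, W); 1 inside the edited region, 0 outside.
    """
    per_pixel_target = F.mse_loss(pred, target, reduction="none")
    per_pixel_ref = F.mse_loss(pred, reference, reduction="none")

    # Up-weight reconstruction of the edited region in the target image.
    edit_term = (w_edit * edit_mask * per_pixel_target).mean()

    # Anchor the non-edited region to the reference image (scene/ID
    # consistency) while down-weighting the reference inside the edit.
    keep_weights = w_ref_keep * (1.0 - edit_mask) + w_ref_edit * edit_mask
    ref_term = (keep_weights * per_pixel_ref).mean()

    return edit_term + ref_term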

Based on these optimizations, Ming-lite-omni v1.5 shows a significant improvement in image editing capabilities, achieving a GenEval score of 0.87.

| Model | 1-Obj | 2-Obj | Counting | Colors | Position | Color Attr | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ming-lite-omni | 0.99 | 0.77 | 0.68 | 0.78 | 0.46 | 0.42 | 0.64 |
| Ming-lite-omni v1.5 | 0.99 | 0.93 | 0.86 | 0.87 | 0.90 | 0.66 | 0.87 |

Optimized User Experience

Thanks to the construction of high-quality alignment preference data, Ming-lite-omni v1.5 demonstrates an advantage over leading models in the correctness, relevance, formatting aesthetics, and fluency of its image-text Q&A. It achieved a win rate of 87.07% against Ming-lite-omni V1 on internal adversarial evaluation sets, indicating a markedly improved user experience.

| Evaluation Dimension | Qwen2.5-VL-7B | Ming-lite-omni V1.5 |
| --- | --- | --- |
| Relevance | 4.308 | 4.5 |
| Fluency | 4.765 | 4.91 |
| Richness of Content | 3.828 | 3.69 |
| Format Aesthetics | 4.727 | 4.8 |
| Correctness | 3.741 | 3.92 |
| Average Score | 4.274 | 4.365 |

Get Started with Ming-lite-omni v1.5

The model and code for Ming-lite-omni v1.5 are now open-source, and we invite everyone to try it out, share feedback, and join the discussion. Looking ahead, we're excited to announce that a quantized and accelerated version is on the way. This future release will not only further enhance omni-modal performance but also make the model even more lightweight, all while strengthening its multimodal reasoning and generation capabilities. Stay tuned for more updates!

M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning

· 6 min read
inclusionAI
Ant Group

📖 Technical Report | 🤗 Hugging Face| 🤖 ModelScope

Introduction

We introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data sources, and task-specific rewards that deliver tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.


Key Features

  • A High-quality Data Construction Pipeline: We design and implement a multi-stage data synthesis and curation pipeline that generates vast amounts of reasoning data.
  • A Dynamic Multi-Task Training Strategy: We propose a sophisticated training strategy that effectively handles data heterogeneity. It features step-wise dynamic optimization to mitigate conflicts between different data sources and a task-specific reward formulation to provide tailored incentive signals (see the sketch after this list).
  • Unified General and Spatial Reasoning Model: We propose M2-Reasoning-7B, an MLLM uniquely engineered for both abstract and spatial reasoning. Extensive evaluations on 8 distinct benchmarks demonstrate that, by leveraging our custom data and training pipelines, M2-Reasoning establishes new state-of-the-art (SOTA) results across both general and spatial reasoning domains.
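As a toy sketch of the second feature, the snippet below routes each rollout to a task-specific reward and shifts data-source sampling weights over training steps. The schedule and the exact-match rewards are placeholder assumptions, far simpler than the paper's actual formulation.

import random
from typing import Callable

# Placeholder task-specific rewards (exact match); the real formulations
# are richer and task-dependent.
TASK_REWARDS: dict[str, Callable[[str, str], float]] = {
    "general": lambda pred, gold: float(pred.strip() == gold.strip()),
    "spatial": lambda pred, gold: float(pred.strip() == gold.strip()),
}

def sampling_weights(step: int, total_steps: int) -> dict[str, float]:
    """Toy step-wise schedule: shift sampling mass between the two data
    sources over training instead of mixing them at a fixed ratio."""
    frac = step / max(total_steps, 1)
    return {"general": 1.0 - 0.4 * frac, "spatial": 0.6 + 0.4 * frac}

def sample_task(step: int, total_steps: int) -> str:
    """Pick which data source the next batch is drawn from."""
    weights = sampling_weights(step, total_steps)
    tasks = list(weights)
    return random.choices(tasks, weights=[weights[t] for t in tasks], k=1)[0]

def reward(task: str, pred: str, gold: str) -> float:
    """Route each rollout to its task-specific reward."""
    return TASK_REWARDS[task](pred, gold)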

Evaluation

We conduct a comprehensive evaluation of our models across two key domains: general and spatial reasoning. Our evaluation utilizes a diverse set of public benchmarks, grouped by the primary capability they measure:

  • General Reasoning (Mathematical & Logical): To evaluate this capability, we employ six benchmarks: MathVista, MathVision, MathVerse, DynaMath, WeMath, and LogicVista.
| Models | MathVista | MathVision | MathVerse | DynaMath | WeMath | LogicVista | Avg. (Δ) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| *Base-Scale General Models* | | | | | | | |
| InternVL3-8B | 70.5 | 30.0 | 38.5 | 25.7 | 39.5 | 44.5 | 41.4 |
| InternVL3-9B | 69.0 | 29.3 | 37.9 | 25.1 | 34.8 | 49.0 | 40.8 |
| Qwen2.5-VL-7B | 68.1 | 25.4 | 41.1 | 21.8 | 36.2 | 47.9 | 40.1 |
| MUG-U-7B | 74.8 | 26.1 | 35.4 | 17.2 | 26.5 | 39.8 | 36.6 |
| SAIL-VL-1.6-8B | 74.2 | 23.2 | 33.4 | 14.0 | 29.6 | 41.4 | 36.0 |
| *Base-Scale Reasoning Models* | | | | | | | |
| WeThink-VL-7B | 71.6 | 26.0 | 44.2 | 24.8 | 48.0 | 51.2 | 44.3 (+4.2) |
| Taichu-VLR-7B | 72.3 | 27.1 | 46.7 | 23.0 | 44.0 | 48.3 | 43.6 |
| VLAA-Thinker-7B | 68.0 | 26.4 | 48.2 | 22.4 | 41.5 | 48.5 | 42.5 (+2.4) |
| URSA-8B-PS-GRPO | 67.8 | 31.8 | 41.5 | 22.4 | 38.3 | 44.7 | 41.1 (+8.2) |
| Ovis2-8B | 71.8 | 25.9 | 42.3 | 20.4 | 27.2 | 39.4 | 37.8 |
| *Our Models* | | | | | | | |
| Base Model | 70.2 | 25.9 | 30.5 | 20.2 | 27.2 | 37.8 | 35.5 |
| M2-Reasoning-CI-7B | 71.7 | 29.2 | 42.1 | 25.0 | 42.8 | 46.8 | 42.9 (+7.4) |
| M2-Reasoning-7B | 75.0 | 31.5 | 44.7 | 26.8 | 41.8 | 50.0 | 45.0 (+9.5) |
  • Spatial Reasoning: We assess this skill using 2 benchmarks: CV-Bench and VSI-Bench.

    • CV-Bench:

| Models | Count | Relation | Depth | Distance | Avg. |
| --- | --- | --- | --- | --- | --- |
| *Large-Scale Models* | | | | | |
| GPT-4O | 65.9 | 85.7 | 87.8 | 78.2 | 78.9 |
| Gemini-1.5-pro | 70.4 | 85.2 | 82.4 | 72.8 | 77.4 |
| *Base-Scale Models* | | | | | |
| InternVL3-8B | 74.0 | 90.6 | 84.3 | 81.0 | 82.0 |
| Qwen2.5-VL-7B-Instruct | 65.2 | 86.6 | 70.6 | 79.8 | 75.0 |
| LLava-NEXT-Video-7B | 59.3 | 77.0 | 71.3 | 54.7 | 65.2 |
| *Our Models* | | | | | |
| M2-Reasoning-7B | 66.6 | 92.8 | 89.3 | 84.3 | 82.3 |

    • VSI-Bench:

| Models | OC | AD | OS | RS | RDs | RDr | RP | AO | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Large-Scale Models* | | | | | | | | | |
| Gemini-1.5-pro | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 | 45.4 |
| GPT-4O | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 | 34.0 |
| *Base-Scale Models* | | | | | | | | | |
| InternVL3-8B | 68.1 | 39.0 | 48.4 | 33.6 | 48.3 | 36.4 | 27.3 | 35.4 | 42.1 |
| Video-R1-7B | - | - | - | - | - | - | - | - | 37.1 |
| Qwen2.5-VL-7B-Instruct | 37.7 | 20.1 | 49.7 | 37.4 | 38.5 | 40.4 | 31.4 | 32.0 | 35.9 |
| LLava-NeXT-Video-7B | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 | 35.6 |
| *Our Models* | | | | | | | | | |
| M2-Reasoning-7B | 41.0 | 34.0 | 60.9 | 55.4 | 40.7 | 47.3 | 29.9 | 28.8 | 42.3 |

Model Downloads

You can download the model from both Hugging Face and ModelScope.

If you're in mainland China, we strongly recommend downloading the model from ModelScope.
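For a quick start, the snippet below shows one way to fetch the weights programmatically. It uses the standard huggingface_hub and modelscope download APIs rather than anything specific to this repo; the repo id matches the default used in the example script below.

# Download the M2-Reasoning weights to a local directory.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="inclusionAI/M2-Reasoning")
print(local_dir)

# From mainland China, the ModelScope mirror is usually faster:
# from modelscope import snapshot_download
# local_dir = snapshot_download("inclusionAI/M2-Reasoning")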

Example Usage

The basic environment is python=3.10, torch=2.6.0+cu124, and transformers=4.49.0.

We provide a small example of how to use this repo.

import os
import torch

from transformers import (
    AutoProcessor,
    AutoTokenizer,
)

import warnings
import argparse
from modeling_bailing_qwen2_5 import Bailing_qwen2_5NativeForConditionalGeneration
from processing_bailing_qwen2_5 import Bailing_qwen2_5Processor

warnings.filterwarnings("ignore")


class BailingMMInfer:
    def __init__(self,
                 model_name_or_path,
                 device="cuda",
                 max_pixels=None,
                 min_pixels=None,
                 video_max_pixels=768 * 28 * 28,
                 video_min_pixels=128 * 28 * 28,
                 generation_config=None):
        super().__init__()
        self.model_name_or_path = model_name_or_path
        self.device = device
        self.device_map = device

        self.video_max_pixels = video_max_pixels if video_max_pixels is not None else 768 * 28 * 28
        self.video_min_pixels = video_min_pixels if video_min_pixels is not None else 128 * 28 * 28

        self.model, self.tokenizer, self.processor = self.load_model_processor()
        if max_pixels is not None:
            self.processor.max_pixels = max_pixels
        if min_pixels is not None:
            self.processor.min_pixels = min_pixels
        if generation_config is None:
            generation_config = {
                "num_beams": 1,
                "do_sample": True,
                "temperature": 0.9,
            }
        self.generation_config = generation_config

    def load_model_processor(self):
        # Load the model in bfloat16 with FlashAttention 2 for efficient inference.
        model = Bailing_qwen2_5NativeForConditionalGeneration.from_pretrained(
            self.model_name_or_path,
            torch_dtype=torch.bfloat16,
            device_map=self.device_map,
            _attn_implementation="flash_attention_2",
        ).eval()

        tokenizer = AutoTokenizer.from_pretrained(self.model_name_or_path, add_bos_token=True, trust_remote_code=True)
        processor = Bailing_qwen2_5Processor.from_pretrained(self.model_name_or_path, trust_remote_code=True)

        return model, tokenizer, processor

    def generate(self, messages, max_new_tokens=512):
        # Render the chat template and extract image/video inputs from the messages.
        text = self.processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True, use_system=True
        )
        image_inputs, video_inputs = self.processor.process_vision_info(messages)

        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            return_tensors="pt",
        )
        # print(inputs)
        print(self.tokenizer.decode(inputs['input_ids'][0]))

        inputs = inputs.to(self.device)

        # Vision tensors must match the model's bfloat16 weights.
        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)

        with torch.no_grad():
            generated_ids = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                eos_token_id=self.processor.tokenizer.eos_token_id,
                **self.generation_config,
            )

        # Strip the prompt tokens so only newly generated tokens remain.
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]

        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
        )[0]

        return output_text


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--model_name_or_path', type=str, default="inclusionAI/M2-Reasoning")
    parser.add_argument('--max_pixels', type=int, default=401408)
    parser.add_argument('--min_pixels', type=int, default=401408)
    parser.add_argument('--max_new_tokens', type=int, default=4096)

    args = parser.parse_args()

    device = "cuda" if torch.cuda.is_available() else "cpu"
    # model_name_or_path = os.path.join(args.input_dir, args.model_name_or_path)
    bailing2 = BailingMMInfer(
        args.model_name_or_path,
        device=device,
        max_pixels=args.max_pixels,
        min_pixels=args.min_pixels,
    )

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <think>...</think> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}."}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./assets/example1.png"},
                {"type": "text", "text": "\nQuestion:\n\nRhombus $QRST$ has an area of 137.9 square meters. If $RT$ is 12.2 meters, find $QS$.\nA. 11.3\nB. 22.4\nC. 22.6\nD. 25.6"},
            ],
        },
    ]
    output_text = bailing2.generate(messages, max_new_tokens=args.max_new_tokens)
    print(output_text)



'''
[Output]:

<think>
To find the length of \( QS \) in the rhombus \( QRST \), we can use the formula for the area of a rhombus, which is given by:

\[
\text{Area} = \frac{1}{2} \times d_1 \times d_2
\]

where \( d_1 \) and \( d_2 \) are the lengths of the diagonals. In this problem, we are given:
- The area of the rhombus is 137.9 square meters.
- One of the diagonals,

ABench: An Evolving Open-Source Benchmark

· 2 min read
inclusionAI
Ant Group
GITHUB

🌟 Overview

ABench is an evolving open-source benchmark suite designed to rigorously evaluate and enhance Large Language Models (LLMs) on complex cross-domain tasks. By targeting current model weaknesses, ABench provides systematic challenges in high-difficulty specialized domains, including physics, actuarial science, logical reasoning, law, and psychology.

🎯 Core Objectives

  1. Address Evaluation Gaps: Design high-differentiation assessment tasks targeting underperforming question types
  2. Establish Unified Standards: Create reliable, comparable benchmarks for multi-domain LLM evaluation
  3. Expand Capability Boundaries: Drive continuous optimization of knowledge systems and reasoning mechanisms through challenging innovative problems

📊 Dataset Release Status

| Domain | Description | Status |
| --- | --- | --- |
| Physics | 500 university/competition-level physics problems (400 static + 100 dynamic parametric variants) covering 10+ fields from classical mechanics to modern physics | ✅ Released |
| Actuary | Curated actuarial exam problems covering core topics: probability statistics, financial mathematics, life/non-life insurance, actuarial models, and risk management | ✅ Released |
| Logic | High-differentiation logical reasoning problems from authoritative tests (LSAT/GMAT/GRE/SBI/Chinese Civil Service Exam) | 🔄 In Preparation |
| Psychology | Psychological case studies and research questions (objective/subjective) evaluating understanding of human behavior and theories | 🔄 In Preparation |
| Law | Authoritative judicial exam materials covering core legal domains: criminal/civil/administrative/procedural/international law | 🔄 In Preparation |
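To illustrate what a "dynamic parametric variant" can look like in practice, here is a minimal sketch (our assumption, not ABench's actual generator): resample a static problem's numeric parameters, then recompute the ground-truth answer so memorized constants no longer help.

import random

def projectile_variant(seed: int) -> dict:
    """Resample a static problem's parameters and recompute the answer."""
    rng = random.Random(seed)
    v0 = rng.uniform(5.0, 50.0)  # initial speed in m/s
    g = 9.8                      # gravitational acceleration in m/s^2
    question = (f"A ball is thrown straight up at {v0:.1f} m/s. "
                f"Ignoring air resistance, what maximum height does it reach?")
    answer_m = v0 ** 2 / (2 * g)  # from v^2 = v0^2 - 2*g*h at the apex
    return {"question": question, "answer_m": round(answer_m, 2)}

print(projectile_variant(0))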

AWorld: The Agent Runtime for Self-Improvement

· 8 min read
inclusionAI
Ant Group

"Self-awareness: the hardest problem isn't solving within limits, it's discovering the own limitations" Twitter Follow WeChat QR Code Discord License: MIT DeepWiki

Table of Contents

  • News — Latest updates and announcements.
  • Introduction — Overview and purpose of the project.
  • Installation — Step-by-step setup instructions.
  • Quick Start — Get started with usage examples.
  • Architecture — Explore the multi-agent system design.
  • Demo — See the project in action with demonstrations.
  • Contributing — How to get involved and contribute.
  • License — Project licensing details.

News

  • 🦤 [2025/07/07] AWorld, as a runtime, is now ready for agentic training. See the Self-Improvement section for details. We have updated our score on the GAIA test to 77.08. Learn how to construct a GAIA runtime in the Demo section.
  • 🦩 [2025/06/19] We have updated our score to 72.43 on the GAIA test. Additionally, we have introduced a new local running mode. See ./README-local.md for detailed instructions.
  • 🐳 [2025/05/22] For quick GAIA evaluation, MCP tools, AWorld, and models are now available in a single Docker image. See ./README-docker.md for instructions and youtube video for demo.
  • 🥳 [2025/05/13] AWorld has updated its state management for browser use and enhanced the video processing MCP server, achieving a score of 77.58 on GAIA validation (Pass@1 = 61.8) and maintaining its position as the top-ranked open-source framework. Learn more: GAIA leaderboard
  • ✨ [2025/04/23] AWorld ranks 3rd on GAIA benchmark (69.7 avg) with impressive Pass@1 = 58.8, 1st among open-source frameworks. Reproduce with python examples/gaia/run.py

Introduction

AWorld (Agent World) is a multi-agent playground that enables agents to collaborate and self-improve. The framework supports a wide range of applications, including but not limited to product prototype verification, foundation model training, and Multi-Agent System (MAS) design meta-learning.

Runtime Key Features

1. Agent Construction
  • ✅ Support for various model services
  • ✅ Integration with MCP tools
  • ✅ Custom tool support
2. Topology Orchestration
  • ✅ Protocol encapsulation between models and tools
  • ✅ Protocol encapsulation among agents
3. Environments
  • ✅ Runtime state management
  • ✅ State tracing support
  • ✅ Distributed, high-concurrency environments for training

Self-Improvement with Diverse Runtimes

By constructing diverse runtime environments (with tools, agents, or models in them), AWorld aims to discover a model's limitations and push intelligence forward. Here we record some of our work demonstrating the effectiveness of this approach.

| Category | Runtime | Performance | Key Information |
| --- | --- | --- | --- |
| Tool Use | Function call runtime (to be released) | Competitive on the BFCL benchmark | Agent Framework, Dataset, Model, Paper, Blog, Code |
| Deep Search | Search runtime (to be released) | SOTA on the HotpotQA benchmark | Agent Framework, Dataset, Model, Paper, Code |

Demo of GAIA Agent-Runtime

GAIA Agent Runtime Demo

Here we first introduce the GAIA runtime, which can be constructed on your local computer. It can be used for:

  • Product prototype verification
  • Self-improvement training (See training pipeline for details)

Follow the instructions in ./examples/gaia/README.md to initialize the GAIA agent runtime and run the demo shown above.

Want to build your own multi-agent system? Check out the detailed tutorials below to get started! ⬇️⬇️⬇️

Installation

Python>=3.11:

git clone https://github.com/inclusionAI/AWorld
cd AWorld
python setup.py install

Quick Start

Here's a quick start guide to: (1) create your first agent; (2) equip it with a MCP tool; (3) assign a teammate; and (4) answer a user query through teamwork.

from aworld.config.conf import AgentConfig
from aworld.agents.llm_agent import Agent
from aworld.runner import Runners
from aworld.core.agent.swarm import Swarm

if __name__ == '__main__':
    agent_config = AgentConfig(
        llm_provider="openai",
        llm_model_name="gpt-4o",
        # Set via environment variable or direct configuration
        # llm_api_key="YOUR_API_KEY",
        # llm_base_url="https://api.openai.com/v1"
    )

    # Register the MCP tool here, or create a separate configuration file.
    mcp_config = {
        "mcpServers": {
            "amap-amap-sse": {
                "type": "sse",
                "url": "https://mcp.amap.com/sse?key=YOUR_API_KEY",
                "timeout": 5,
                "sse_read_timeout": 300,
            }
        }
    }

    # Create your first agent, equipped with an MCP tool
    search = Agent(
        conf=agent_config,
        name="search_agent",
        system_prompt="You are a helpful agent.",
        mcp_servers=["amap-amap-sse"],  # MCP server name for the agent to use
        mcp_config=mcp_config,
    )

    # Add a new teammate to the agent
    summary = Agent(
        conf=agent_config,
        name="summary_agent",
        system_prompt="You are a helpful summary agent.",
    )

    # Collaborate as a team; the default topology is a static workflow
    swarm = Swarm(search, summary)

    # Run the agent team
    res = Runners.sync_run(
        input="Hotels within 1 kilometer of West Lake in Hangzhou",
        swarm=swarm,
    )
    print(res)

Architecture

AWorld is designed to achieve two primary objectives: (1) provide an efficient forward process, and (2) facilitate diverse backward processes, including but not limited to foundation model training and system design meta-learning.

Forward

An illustration of the runtime, showing the message workflow when Agent1 receives a query from a user.

Backward

An action-state rollout demonstration during training, using AWorld's distributed environments.

Demo

Running Pre-defined Agents (e.g., see demo code). Below are demonstration videos showcasing AWorld's capabilities across various agent configurations and environments.

| Mode | Type | Demo |
| --- | --- | --- |
| Single Agent | Browser use | ▶️ Watch Browser Demo on YouTube |
| Single Agent | Phone use | ▶️ Watch Mobile Demo on YouTube |
| Multi Agent | Cooperative Teams | ▶️ Watch Travel Demo on YouTube |
| Multi Agent | Competitive Teams | ▶️ Watch Debate Arena on YouTube |
| Multi Agent | Mixed of both Teams | Coming Soon 🚀 |

Contributing

We warmly welcome developers to join us in building and improving AWorld! Whether you're interested in enhancing the framework, fixing bugs, or adding new features, your contributions are valuable to us.

For academic citations, or if you wish to contact us, please use the following BibTeX entry:

@software{aworld2025,
  author = {Agent Team at InclusionAI},
  title = {AWorld: Enabling Agent Self-Improvement through Interactive Experience with Dynamic Runtime},
  year = {2025},
  url = {https://github.com/inclusionAI/AWorld},
  version = {0.1.0},
  publisher = {GitHub},
  email = {chenyi.zcy at antgroup.com}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.
