We are excited to introduce Ming-lite-omni V1.5, a comprehensive upgrade that significantly enhances the omni-modal capabilities of the original Ming-lite-omni model (find it on 🤗Hugging Face). This new version delivers remarkable improvements across a wide range of tasks, including image and text understanding, document analysis, video comprehension, speech understanding and synthesis, as well as image generation and editing.
Built on the Ling-lite-1.5 architecture, Ming-lite-omni V1.5 features a total of 20.3B parameters, with 3B active parameters in its Mixture of Experts (MoE) configuration. It has achieved impressive results across various multimodal benchmarks. Below, we highlight the key advancements in performance metrics and model architecture from this latest update.

Performance Comparison

Framework
Introduction
To achieve these advancements, we combined our in-house innovations with recent developments from academia and the open-source community. The key areas we explored, and what we found in each, are detailed below.
Image and Speech Generation
- Dual-Branch Decoupling for Image Generation: We implemented a dual-branch decoupling strategy that processes the reference image and the noise image with separate weights before they enter the Diffusion Transformer (DiT). This approach effectively increases the model’s capacity for learning from the reference. To further refine the output, we added two extra transformer layers.
- ID & Scene Consistency Loss: To maintain character and scene identity during image editing, we introduced a new ID & Scene Consistency Loss. This function strategically increases focus on the edited region in the target image, boosts the influence of unedited areas from the reference image, and diminishes the influence of the edited areas from the reference.
- Perceptual Enhancement Strategy: We strengthened the model’s ability to perceive structure by training on tasks such as segmentation and keypoint detection. This provides greater structural control during generation and editing, leading to significant score improvements on metrics related to object position, structure, and count (see the GenEval table in the Image Generation demo for details).
- Multi-task Collaborative Learning: A Multi-task Collaborative Learning approach allows generation and editing tasks to synergistically improve one another. By treating segmentation as a colorization-based editing task within a unified training pipeline, we achieved significant gains in segmentation accuracy and enhanced the precision of local edits, resulting in smoother and more natural transitions at the edges of edited areas.
- New Audio Decoder: On the speech generation front, we developed a novel audio decoder that is fully context-aware, directly processing output features from the LLM.
- Efficient Speech Generation: To improve prosody and real-time generation capabilities, we applied BPE encoding to the discrete audio codec tokens, which successfully reduced the audio frame rate by 35%.
- Comprehensive Data Upgrades:
  - Sourced high-quality portrait data, filtered by resolution, facial detail, and face size.
  - Collected and curated a high-quality celebrity dataset with processed face crops.
  - Expanded the model’s capabilities by building new training subsets for edge maps, segmentation maps, text-in-image, and facial expressions.
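The BPE step in the Efficient Speech Generation bullet can be illustrated with a toy merge loop. This is a minimal sketch and not the released tokenizer: the codec token IDs, the number of merges, and the helper names (`merge_pair`, `learn_bpe_merges`, `apply_merges`, `decode`) are all hypothetical. It only shows the mechanism: repeatedly replacing the most frequent adjacent pair of discrete codec tokens with a new ID shortens the sequence the LLM has to emit, which lowers the effective frame rate.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe_merges(seq, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges on one token sequence."""
    merges, seq = [], list(seq)
    for step in range(num_merges):
        counts = Counter(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:  # nothing left worth merging
            break
        new_id = first_new_id + step
        merges.append((pair, new_id))
        seq = merge_pair(seq, pair, new_id)
    return merges, seq


def apply_merges(seq, merges):
    """Encode a sequence with previously learned merges, in order."""
    seq = list(seq)
    for pair, new_id in merges:
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(seq, merges):
    """Expand merged IDs back into the original codec tokens."""
    table = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in table:
            first, second = table[tok]
            stack.append(second)
            stack.append(first)
        else:
            out.append(tok)
    return out


# Made-up codec token IDs, for illustration only.
codec_tokens = [7, 3, 7, 3, 5, 7, 3, 5, 9, 7, 3]
merges, compressed = learn_bpe_merges(codec_tokens, num_merges=2, first_new_id=1024)
reduction = 1 - len(compressed) / len(codec_tokens)
print(f"{len(codec_tokens)} -> {len(compressed)} tokens ({reduction:.0%} shorter)")
```

The reduction on this toy sequence says nothing about the 35% figure above, which depends on the real codec, vocabulary size, and training data.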
Image, Text, Video, and Speech Understanding
- MRoPE for Spatiotemporal Awareness: We integrated MRoPE (Multi-dimensional RoPE), which applies positional encoding across three dimensions (time, height, and width). This equips the model with a native understanding of space and time, enabling highly efficient cross-modal modeling and significantly boosting comprehension accuracy for videos and complex images.
- Efficient Full-Parameter Training: By optimizing learning rates and the multimodal data mixture, we upgraded our pre-training from a phased approach of freezing/unfreezing LLM layers to a streamlined, efficient full-parameter training regimen. This shortened the training cycle by 26.5% with no loss in performance.
- Video Understanding via Curriculum Learning: We adopted a curriculum learning strategy for video understanding, progressively increasing the complexity from short-form to long-form videos to enhance the model’s capabilities.
- Complex Document Understanding with CoT: For complex document analysis, we incorporated a Chain-of-Thought (CoT) strategy to build structured reasoning paths, effectively improving the model’s problem-solving abilities on intricate queries.
- Comprehensive Data Upgrades:
  - Pre-training:
    - Added structured text entity data to fill knowledge graph gaps.
    - Expanded with high-quality product data to enhance general knowledge.
  - Instruction Fine-Tuning:
    - Improved data precision for fine-grained visual perception (e.g., object counting, color, and scene recognition).
    - Increased data depth for specialized domain recognition (e.g., flora/fauna, vehicles, and ingredients).
    - Enhanced cross-disciplinary reasoning in complex visual-text tasks through targeted data optimization.
    - For speech understanding, we infused instruction text with metadata like domain, topic, and language (including dialects). This boosted model performance and enabled comprehensive support for Mandarin, English, Cantonese, Sichuanese, Shanghainese, and Minnan.
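To make the three-axis positional encoding in the MRoPE bullet more concrete, here is a minimal sketch of how (time, height, width) position IDs might be assigned to a mixed text-and-video sequence. The layout is an assumption borrowed from common M-RoPE-style schemes (text tokens share one index across all three axes, visual patches get separate frame, row, and column indices); the post does not specify the exact scheme, and the function name `mrope_position_ids` and the segment format are hypothetical.

```python
def mrope_position_ids(segments):
    """Return one (t, h, w) position triple per token.

    segments: list of ("text", n_tokens) or ("vision", frames, rows, cols).
    Assumed layout, not taken from the model code:
      - text tokens use the same index on all three axes;
      - vision patches use frame / row / column offsets from the segment start;
      - the next segment starts after the largest index used so far.
    """
    ids, next_pos = [], 0
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            _, n_tokens = seg
            for k in range(n_tokens):
                p = next_pos + k
                ids.append((p, p, p))
            next_pos += n_tokens
        elif kind == "vision":
            _, frames, rows, cols = seg
            for t in range(frames):
                for h in range(rows):
                    for w in range(cols):
                        ids.append((next_pos + t, next_pos + h, next_pos + w))
            next_pos += max(frames, rows, cols)
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
    return ids


# Two prompt tokens, a 2-frame clip of 2x2 patches, then one more text token.
position_ids = mrope_position_ids([("text", 2), ("vision", 2, 2, 2), ("text", 1)])
for triple in position_ids:
    print(triple)
```

Each axis of the triple would then drive its own slice of the rotary embedding, so two patches in the same frame differ only in their height and width components.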
Aligning with User Preferences
To ensure that improvements on standard benchmarks translate directly to a superior real-world user experience, we created our own human evaluation benchmark and conducted internal, head-to-head evaluations against other models. Thanks to our high-quality preference alignment data, Ming-lite-omni V1.5 demonstrates a distinct advantage over leading models in the accuracy (low hallucination rate), relevance, formatting, and fluency of its visual question-answering responses. On our internal adversarial test set, Ming-lite-omni V1.5 achieved an 87.07% win rate against Ming-lite-omni V1, marking a significant enhancement in user experience.
Benchmark Dimension | Qwen2.5-VL-7B | Ming-lite-omni V1.5 |
---|---|---|
Relevance | 4.308 | 4.5 |
Fluency | 4.765 | 4.91 |
Richness | 3.828 | 3.69 |
Formatting | 4.727 | 4.8 |
Accuracy | 3.741 | 3.92 |
Average | 4.274 | 4.365 |
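The per-dimension gaps and the averages in the table can be rechecked with a few lines of arithmetic. This sketch only uses the numbers printed above; the variable names are ours. Note that averaging the rounded V1.5 scores gives 4.364 rather than the reported 4.365, a gap that presumably comes from the per-dimension scores being rounded before publication.

```python
# (Qwen2.5-VL-7B, Ming-lite-omni V1.5) scores copied from the table above.
scores = {
    "Relevance": (4.308, 4.5),
    "Fluency": (4.765, 4.91),
    "Richness": (3.828, 3.69),
    "Formatting": (4.727, 4.8),
    "Accuracy": (3.741, 3.92),
}

# Positive delta means V1.5 scored higher on that dimension.
deltas = {dim: round(v15 - qwen, 3) for dim, (qwen, v15) in scores.items()}

mean_qwen = round(sum(q for q, _ in scores.values()) / len(scores), 3)
mean_v15 = round(sum(v for _, v in scores.values()) / len(scores), 3)

for dim, delta in deltas.items():
    print(f"{dim:<11} {delta:+.3f}")
print(f"mean: {mean_qwen} vs {mean_v15}")
```

Richness is the only dimension where the delta is negative, which matches the text: the claimed advantage covers accuracy, relevance, formatting, and fluency.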
Demos
Image Editing
To address the challenge of maintaining character and scene identity during image editing, we’ve introduced the ID & Scene Consistency Loss. This function strategically increases the weight of the edited region in the target image, boosts the influence of unedited areas from the reference image, and reduces the influence of the edited areas within that same reference.
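One way to read this weighting scheme is as a masked, per-pixel loss. The sketch below is an illustrative interpretation only, not the training code: the squared-error form, the three weights (`w_edit`, `w_keep`, `w_ref_edit`), and the function name are all assumptions. Inside the edit mask the target term is up-weighted and the reference term is suppressed; outside the mask the prediction is additionally pulled toward the reference.

```python
def id_scene_consistency_loss(pred, target, reference, edit_mask,
                              w_edit=2.0, w_keep=1.0, w_ref_edit=0.0):
    """Mean per-pixel loss over flat lists of pixel values.

    edit_mask[i] is 1 inside the edited region and 0 elsewhere.
    All three weights are hypothetical illustration values.
    """
    total = 0.0
    for p, t, r, m in zip(pred, target, reference, edit_mask):
        if m:
            # Edited region: emphasize the target, down-weight the reference.
            total += w_edit * (p - t) ** 2
            total += w_ref_edit * (p - r) ** 2
        else:
            # Unedited region: match the target and stay close to the reference.
            total += (p - t) ** 2
            total += w_keep * (p - r) ** 2
    return total / len(pred)


# Four-pixel toy example; only pixel 1 is edited.
target = [0.2, 0.8, 0.5, 0.5]
reference = [0.2, 0.1, 0.5, 0.5]
edit_mask = [0, 1, 0, 0]

faithful_edit = id_scene_consistency_loss(target, target, reference, edit_mask)
ignored_edit = id_scene_consistency_loss(reference, target, reference, edit_mask)
print(faithful_edit, ignored_edit)
```

A prediction that applies the edit and preserves everything else incurs no loss, while one that simply copies the reference is penalized only through the up-weighted edited pixel.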





Image Generation
We introduced a Perceptual Enhancement Strategy by optimizing the model’s structural awareness through tasks like segmentation and keypoint detection. This improves its understanding of image details and spatial relationships, giving us greater structural control over the generation and editing process. The result is a significant boost in scores on the GenEval benchmark, particularly for metrics related to position, structure, and count:
Model | 1-Obj | 2-Obj | Counting | Colors | Position | Color Attr | Avg. |
---|---|---|---|---|---|---|---|
Ming-lite-omni | 0.99 | 0.77 | 0.68 | 0.78 | 0.46 | 0.42 | 0.64 |
Ming-lite-omni V1.5 | 0.99 | 0.93 | 0.86 | 0.87 | 0.90 | 0.66 | 0.87 |
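To see where the improvement is concentrated, the per-category gains can be computed directly from the two rows above. This is plain arithmetic on the published numbers; the variable names are ours.

```python
metrics = ["1-Obj", "2-Obj", "Counting", "Colors", "Position", "Color Attr"]
v1_scores = [0.99, 0.77, 0.68, 0.78, 0.46, 0.42]   # Ming-lite-omni
v15_scores = [0.99, 0.93, 0.86, 0.87, 0.90, 0.66]  # Ming-lite-omni V1.5

# Absolute gain per GenEval category, rounded to the table's precision.
gains = {m: round(new - old, 2) for m, old, new in zip(metrics, v1_scores, v15_scores)}
largest = max(gains, key=gains.get)

for metric, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{metric:<11} {gain:+.2f}")
print("largest gain:", largest)
```

The largest jump is in Position, followed by Color Attr and Counting, which is consistent with the structural-control emphasis described above.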
[Figure: original image alongside the generated depth map, generated bounding boxes, and generated edge map]

1. Color and Material: - The main body of the mouse is black, with a smooth surface and a metallic finish. - The edges and button areas of the mouse have blue LED light strips, adding a high-tech feel and visual appeal.
2. Shape and Structure: - The mouse has an overall streamlined design, with a slightly sloped top that allows for natural hand placement. - The sides of the mouse feature pronounced curves, designed ergonomically to provide a comfortable grip.
3. Buttons and Scroll Wheel: - There is a scroll wheel on the left side of the mouse, with a blue LED light strip above it. - There are additional buttons on the right side of the mouse, which also have blue LED light strips, enhancing both functionality and aesthetics.
4. Brand Logo: - In the center of the mouse’s palm rest, there is an illuminated “G” logo, which is a signature design element of the brand.
5. Lighting Effects: - The blue LED light strips on the mouse stand out prominently against the dark background, enhancing its high-tech and futuristic feel.
6. Texture and Details: - The top of the mouse has a textured design, likely for an anti-slip effect. - The sides of the mouse feature fine, elongated textured patterns, which add a sense of visual depth.
Overall, the design of this mouse combines a modern, high-tech aesthetic with ergonomic design, making it both beautiful and practical, and well-suited for gamers.

1. Exquisite Brushwork: The calligraphy’s brushwork is fluid and natural, with rich variations in line thickness, demonstrating the calligrapher’s profound skill and proficient technique.
2. Rigorous Structure: The structure of each character is meticulously designed, with natural connections between strokes, maintaining a good balance and harmony between characters.
3. Composition and Layout: The overall composition of the piece is well-arranged, with even line and character spacing, creating a sense of order and neatness.
4. Ink Tone Variation: The variations in ink tone, from dark to light, are natural, adding layers and a three-dimensional feel to the work.
5. Elegant Content: The content of the calligraphy is a beautiful classical text, rich in philosophical and literary value, which enhances the cultural depth of the work.
6. Seal Embellishment: The red seals on the work not only add a decorative element but also serve to embellish and balance the composition.
Together, these features constitute the high artistic value of this calligraphy piece, making it a masterpiece worthy of appreciation and study.

1. Appearance Design: The phone features a simple and stylish design. The white body looks clean and neat, aligning with modern aesthetics.
2. Camera Configuration: The rear camera has a unique and neat arrangement, which may provide a good photography experience.
3. Brand: The Honor (荣耀) brand has a good reputation in the smartphone market, offering stable and reliable products.
4. Features: Phones of this type are typically equipped with a large-capacity battery and long battery life, suitable for daily use.
5. Ease of Use: It may offer a simple user interface and easy-to-operate functions, suitable for users who are not familiar with complex operations.
6. Value for Money: Honor phones are generally known for their high cost-performance ratio, making them suitable for users on a limited budget who still want a good experience.
Of course, whether it is truly suitable depends on the actual needs and preferences of the parents.

1. Open the Ctrip app. 2. Click on the “Flights” option. 3. On the flight page, select the departure city, destination, departure date, and return date. 4. Click the search button to view available flight information. 5. After selecting a suitable flight, click the “Book” button. 6. Fill in passenger information and choose a payment method. 7. Confirm the order details and submit the payment to complete the purchase.
1. Download and Install the Ctrip App: - If you haven’t installed the Ctrip App, you can search for “Ctrip Travel” (携程旅行) in the app store (like the Apple App Store or Android app markets) and download it.
2. Open the Ctrip App: - After installation is complete, open the Ctrip app.
3. Select the “Flights” Option: - On the homepage navigation bar, tap the “Flights” option.
4. Enter Departure and Destination Cities: - On the flight booking page, enter your departure and destination cities.
5. Select Departure and Return Dates: - Choose your departure date and return date (if booking a round trip).
6. Select Passenger Information: - Enter the number of passengers and their types (adult, child, infant).
7. Choose a Flight: - Browse and select your preferred flight.
8. Fill in Passenger Information: - Enter the passenger’s name, ID number, contact information, etc.
9. Select a Payment Method: - Choose your preferred payment method (e.g., Alipay, WeChat Pay, bank card, etc.).
10. Confirm Order and Pay: - After confirming that the order information is correct, tap the “Pay” button to complete the payment.
11. Receive Your E-ticket: - After successful payment, you will receive an e-ticket, which you can view or download from the “My Orders” section within the app.
By following these steps, you can easily purchase plane tickets using the Ctrip App.

Calculation Process:
1. Understand the meaning of Δct: Δct is the difference in relative expression between the target gene (cGAS-qF+qR or EGFP-cGAS) and the reference gene (cTubulin). A smaller Δct indicates a smaller difference in expression between the target and reference genes.
2. Compare Δct under different primers:
- cGAS-qF+qR primer: - Wild type: Δct = 8.06 - EGFP-cGAS monoclonal: Δct = -0.15 - Difference: 8.06 - (-0.15) = 8.21
- EGFP-cGAS monoclonal primer: - Wild type: Δct = 10.44 - EGFP-cGAS monoclonal: Δct = -2.9 - Difference: 10.44 - (-2.9) = 13.34
- EGFP-cGAS-qF+qR primer: - Wild type: Δct = 21.43 - EGFP-cGAS monoclonal: Δct = -13.45 - Difference: 21.43 - (-13.45) = 34.88
3. Conclusion: Under the EGFP-cGAS monoclonal primer, the difference in Δct between samples is the smallest (Δct difference is 13.34), indicating the smallest difference in expression between samples under this primer.
Summary: The difference in Δct between samples is smallest under the EGFP-cGAS monoclonal primer.
Speech Generation
Text Inputs | Ming-lite-omni | Ming-lite-omni V1.5 |
---|---|---|
Tyler, Lucy, Michelle, we’re going to space! | ||
what is the forecast for California for rain | ||
Television can make you dumb, but it can also be good education. | ||





The value for Lamb is 103.7 and for Corn is 103.13. To find the difference, subtract the smaller value from the larger value: 103.7 - 103.13 = 0.57.
[/think]
\boxed{0.57}
Video Understanding
User | What is the little girl doing in the video? |
---|---|
Ming-lite-omni | The scene in the video takes place in a store, possibly a supermarket or large retail store, with shelves filled with various products including snacks and household items. The little girl in the video is pulling a display rack, apparently trying to move or play with it. Price tags are clearly visible, with one prominently displaying '18,690' in blue and white. The store is well-lit with fluorescent lights illuminating the entire area. If you need more detailed analysis or other assistance, please let me know! |
Ming-lite-omni V1.5 | The little girl is sleeping in a shopping cart. |

User | Describe the protagonist's clothing in the video. |
---|---|
Ming-lite-omni | In the video, the protagonist is wearing a red top and carrying a bag with colorful patterns on their shoulder. |
Ming-lite-omni V1.5 | The protagonist in the video is wearing a red turtleneck sweater. |

User | Guess the time and location where this video was filmed. |
---|---|
Ming-lite-omni | Judging from the scenes in the video, this was likely filmed in New York's Central Park. The video shows iconic buildings and landscapes of Central Park, such as its lakes, bridges, and statues. Additionally, there are tourists walking, taking photos, and enjoying outdoor activities in the park, which further confirms that the filming location is New York's Central Park. |
Ming-lite-omni V1.5 | From the scenes in the video, the filming location appears to be New York's Central Park. The video showcases multiple attractions within the park, including lakes, bridges, statues, and fountains. The filming time seems to be during the daytime, with bright sunshine and clear weather, making it suitable for outdoor activities. |
Get Started with Ming-lite-omni V1.5
The model and code for Ming-lite-omni V1.5 are now open source, and we invite everyone to try it out, share feedback, and join the discussion. A quantized and accelerated version is on the way. That release aims to further improve omni-modal performance and make the model more lightweight, while strengthening its multimodal reasoning and generation capabilities. Stay tuned for more updates!