Introducing Ming-Lite-Omni V1.5


We are excited to introduce Ming-lite-omni V1.5, a comprehensive upgrade that significantly enhances the omni-modal capabilities of the original Ming-lite-omni model (find it on 🤗Hugging Face). This new version delivers remarkable improvements across a wide range of tasks, including image and text understanding, document analysis, video comprehension, speech understanding and synthesis, as well as image generation and editing.

Built on the Ling-lite-1.5 architecture, Ming-lite-omni V1.5 features a total of 20.3B parameters, with 3B active parameters in its Mixture of Experts (MoE) configuration. It has achieved impressive results across various multimodal benchmarks. Below, we highlight the key advancements in performance metrics and model architecture from this latest update.

Performance Comparison

Framework

Introduction

To achieve these advancements, we integrated our in-house innovations with cutting-edge developments from both academia and the open-source community. We explored several key areas, leading to significant breakthroughs and key findings detailed below.

Image and Speech Generation

Dual-Branch Decoupling for Image Generation: We implemented a dual-branch decoupling strategy that processes the reference image and the noise image with separate weights before they enter the Diffusion Transformer (DiT). This approach effectively increases the model’s capacity for learning from the reference. To further refine the output, we added two extra transformer layers.
ID & Scene Consistency Loss: To maintain character and scene identity during image editing, we introduced a new ID & Scene Consistency Loss. This function strategically increases focus on the edited region in the target image, boosts the influence of unedited areas from the reference image, and diminishes the influence of the edited areas from the reference.
Perceptual Enhancement Strategy: Our Perceptual Enhancement Strategy improves the model’s ability to perceive structure through auxiliary tasks such as segmentation and keypoint detection. This provides greater structural control during generation and editing, leading to significant score improvements on metrics related to object position, structure, and count (see the GenEval table in the Image Generation demo for details).
Multi-task Collaborative Learning: A Multi-task Collaborative Learning approach allows generation and editing tasks to synergistically improve one another. By treating segmentation as a colorization-based editing task within a unified training pipeline, we achieved significant gains in segmentation accuracy and enhanced the precision of local edits, resulting in smoother and more natural transitions at the edges of edited areas.
New Audio Decoder: On the speech generation front, we developed a novel audio decoder that is fully context-aware, directly processing output features from the LLM.
Efficient Speech Generation: To improve prosody and real-time generation capabilities, we applied BPE encoding to the discrete audio codec tokens, which successfully reduced the audio frame rate by 35%.
Comprehensive Data Upgrades:
- Sourced high-quality portrait data, filtered by resolution, facial detail, and face size.
- Collected and curated a high-quality celebrity dataset with processed face crops.
- Expanded the model’s capabilities by building new training subsets for edge maps, segmentation maps, text-in-image, and facial expressions.
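
To make the speech-efficiency point above concrete, here is a minimal sketch of byte-pair encoding applied to a stream of discrete audio codec tokens. It only illustrates why merging frequent adjacent tokens shortens the sequence the model has to generate. The token values, the starting ID for merged tokens, and the function names (`learn_bpe`, `apply_merge`, `encode`, `decode`) are illustrative assumptions, not the tokenizer used by Ming-lite-omni, and the toy data does not reproduce the 35% figure.

```python
from collections import Counter


def apply_merge(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges over integer token sequences.

    `first_new_id` must be larger than any raw codec token ID (assumption:
    the raw codebook occupies IDs below it).
    """
    seqs = [list(s) for s in sequences]
    merges = []  # ordered list of ((left, right), new_id)
    next_id = first_new_id
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in seqs:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so merging gains nothing
            break
        merges.append((pair, next_id))
        seqs = [apply_merge(seq, pair, next_id) for seq in seqs]
        next_id += 1
    return merges, seqs


def encode(seq, merges):
    """Apply learned merges, in order, to a new token sequence."""
    for pair, new_id in merges:
        seq = apply_merge(seq, pair, new_id)
    return seq


def decode(seq, merges):
    """Expand merged tokens back into the original codec tokens."""
    table = {new_id: pair for pair, new_id in merges}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in table:
            left, right = table[tok]
            stack.append(right)
            stack.append(left)
        else:
            out.append(tok)
    return out
```

For example, with a toy stream `[1, 2, 3, 1, 2, 3, 1, 2, 4]` and `first_new_id=8`, two merges are learned and the stream shrinks from 9 tokens to 4, while `decode` recovers the original tokens exactly. Fewer tokens per second of audio means fewer autoregressive steps, which is the mechanism behind the lower effective frame rate.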

Image, Text, Video, and Speech Understanding

MRoPE for Spatiotemporal Awareness: We integrated MRoPE (Multi-dimensional RoPE), which applies positional encoding across three dimensions (time, height, and width). This equips the model with a native understanding of space and time, enabling highly efficient cross-modal modeling and significantly boosting comprehension accuracy for videos and complex images.
Efficient Full-Parameter Training: By optimizing learning rates and the multimodal data mixture, we upgraded our pre-training from a phased approach of freezing/unfreezing LLM layers to a streamlined, efficient full-parameter training regimen. This shortened the training cycle by 26.5% with no loss in performance.
Video Understanding via Curriculum Learning: We adopted a curriculum learning strategy for video understanding, progressively increasing the complexity from short-form to long-form videos to enhance the model’s capabilities.
Complex Document Understanding with CoT: For complex document analysis, we incorporated a Chain-of-Thought (CoT) strategy to build structured reasoning paths, effectively improving the model’s problem-solving abilities on intricate queries.
Comprehensive Data Upgrades:
- Pre-training:
  - Added structured text entity data to fill knowledge graph gaps.
  - Expanded with high-quality product data to enhance general knowledge.
- Instruction Fine-Tuning:
  - Improved data precision for fine-grained visual perception (e.g., object counting, color, and scene recognition).
  - Increased data depth for specialized domain recognition (e.g., flora/fauna, vehicles, and ingredients).
  - Enhanced cross-disciplinary reasoning in complex visual-text tasks through targeted data optimization.
  - For speech understanding, we infused instruction text with metadata like domain, topic, and language (including dialects). This boosted model performance and enabled comprehensive support for Mandarin, English, Cantonese, Sichuanese, Shanghainese, and Minnan.
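
The three-axis positional encoding described under MRoPE above can be sketched as follows. Each token receives a (time, height, width) position triple, and the rotary frequencies of an attention head are split into three groups, one per axis. This is a minimal sketch under assumptions: the position-assignment convention (text tokens share one index on all three axes, and the next segment starts after the largest index used so far) follows the scheme commonly described for multimodal RoPE, and the function names, `sections` split, and `base` value are hypothetical rather than taken from the Ming-lite-omni code.

```python
def text_positions(start, length):
    """Text tokens advance all three axes together, like ordinary 1D RoPE."""
    return [(start + i, start + i, start + i) for i in range(length)]


def vision_positions(start, frames, height, width):
    """Image/video patches get separate time, height, and width indices."""
    return [
        (start + t, start + h, start + w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


def build_positions(segments):
    """segments: list of ("text", n_tokens) or ("vision", (frames, height, width))."""
    positions, next_pos = [], 0
    for kind, spec in segments:
        if kind == "text":
            chunk = text_positions(next_pos, spec)
        else:
            chunk = vision_positions(next_pos, *spec)
        if chunk:
            positions.extend(chunk)
            # Assumed convention: continue after the largest index used so far.
            next_pos = max(max(p) for p in chunk) + 1
    return positions


def rotary_angles(position, head_dim, sections, base=10000.0):
    """Rotation angle per frequency; `sections` says how many frequencies each axis owns."""
    assert sum(sections) == head_dim // 2
    axis_of_freq = [axis for axis, n in enumerate(sections) for _ in range(n)]
    return [
        position[axis] * base ** (-2.0 * i / head_dim)
        for i, axis in enumerate(axis_of_freq)
    ]
```

With this layout, a text token at position `(p, p, p)` gets exactly the angles of standard 1D RoPE, so text-only behavior is unchanged, while two patches in the same frame differ only in the height and width frequency groups.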

Aligning with User Preferences

To ensure that improvements on standard benchmarks translate directly to a superior real-world user experience, we created our own human evaluation benchmark and conducted internal, head-to-head evaluations against other models. Thanks to our high-quality preference alignment data, Ming-lite-omni V1.5 demonstrates a distinct advantage over leading models in the accuracy (low hallucination rate), relevance, formatting, and fluency of its visual question-answering responses. On our internal adversarial test set, Ming-lite-omni V1.5 achieved an 87.07% win rate against Ming-lite-omni V1, marking a significant enhancement in user experience.

Benchmark Dimension	Qwen2.5-VL-7B	Ming-lite-omni V1.5
Relevance	4.308	4.5
Fluency	4.765	4.91
Richness	3.828	3.69
Formatting	4.727	4.8
Accuracy	3.741	3.92
Average	4.274	4.365

Demos

Image Editing

To address the challenge of maintaining character and scene identity during image editing, we’ve introduced the ID & Scene Consistency Loss. This function strategically increases the weight of the edited region in the target image, boosts the influence of unedited areas from the reference image, and reduces the influence of the edited areas within that same reference.
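
The post does not give the formula for this loss, so the following is only one possible reading of the description above, written as a region-weighted reconstruction loss. Edited pixels are supervised by the target image with a larger weight, unedited pixels are pulled toward the reference image, and the reference's edited pixels do not enter the loss at all. The function name, the flat-list image representation, and the weights `w_edit` and `w_keep` are all hypothetical.

```python
def id_scene_consistency_loss(pred, target, reference, edit_mask,
                              w_edit=2.0, w_keep=1.0):
    """Illustrative region-weighted loss (not the published formulation).

    pred, target, reference: flat lists of pixel values of equal length.
    edit_mask: 1 where the pixel belongs to the edited region, else 0.
    """
    edit_err = keep_err = 0.0
    n_edit = n_keep = 0
    for p, t, r, m in zip(pred, target, reference, edit_mask):
        if m:
            # Edited region: follow the target; the reference is ignored here.
            edit_err += (p - t) ** 2
            n_edit += 1
        else:
            # Unedited region: stay close to the reference (identity and scene).
            keep_err += (p - r) ** 2
            n_keep += 1
    edit_term = edit_err / n_edit if n_edit else 0.0
    keep_term = keep_err / n_keep if n_keep else 0.0
    return w_edit * edit_term + w_keep * keep_term
```

Raising `w_edit` relative to `w_keep` puts more emphasis on getting the edit right, while the second term penalizes any drift in the parts of the image that were supposed to stay the same.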

Generative Image Segmentation

User

Given the following instructions: little girl, pink, your monitors colors off friend p pink shirt girl; please perform referring segmentation on this image.

Ming-Lite-Omni V1.5

User

Please segment different classes in this image.

Ming-Lite-Omni V1.5

User

Please segment different instances in this image.

Ming-Lite-Omni V1.5

Image Generation

We introduced a Perceptual Enhancement Strategy by optimizing the model’s structural awareness through tasks like segmentation and keypoint detection. This improves its understanding of image details and spatial relationships, giving us greater structural control over the generation and editing process. The result is a significant boost in scores on the GenEval benchmark, particularly for metrics related to position, structure, and count:

Model	1-Obj	2-Obj	Counting	Colors	Position	Color Attr	Avg.
Ming-lite-omni	0.99	0.77	0.68	0.78	0.46	0.42	0.64
Ming-lite-omni V1.5	0.99	0.93	0.86	0.87	0.90	0.66	0.87
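
To see where the improvement is concentrated, the per-category gains can be computed directly from the two rows above. This is a small arithmetic sketch: the scores are copied from the table, while the variable and function names are our own.

```python
METRICS = ["1-Obj", "2-Obj", "Counting", "Colors", "Position", "Color Attr"]
V1_SCORES = [0.99, 0.77, 0.68, 0.78, 0.46, 0.42]   # Ming-lite-omni
V15_SCORES = [0.99, 0.93, 0.86, 0.87, 0.90, 0.66]  # Ming-lite-omni V1.5


def per_metric_gains(metrics, before, after):
    """Return (metric, gain) pairs sorted from largest to smallest gain."""
    gains = [(m, round(a - b, 2)) for m, b, a in zip(metrics, before, after)]
    return sorted(gains, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for metric, gain in per_metric_gains(METRICS, V1_SCORES, V15_SCORES):
        print(f"{metric:<10} {gain:+.2f}")
```

Position shows the largest gain (+0.44), followed by Color Attr (+0.24), Counting (+0.18), and 2-Obj (+0.16), which matches the emphasis on position and count in the paragraph above.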

Original Image	Generated Depth Map	Generated Bounding Boxes	Generated Edge Map

Visual Dialogue

User

This picture looks very design-oriented. Can you describe its various design elements in detail?

Ming-Lite-Omni

This image shows a high-end gaming mouse with a modern and technological design. The mouse body is black with a smooth metallic finish. There is a blue glowing logo on the top of the mouse, and below the logo is the mouse wheel and left and right button area. The side of the mouse has a blue light decoration, which increases the visual appeal. The bottom of the mouse has a non-slip texture design to ensure stability during use. The overall design is simple and futuristic, suitable for gamers.

Ming-Lite-Omni V1.5

This image showcases a gaming mouse with a striking design, featuring the following design elements:
1. Color and Material: - The main body of the mouse is black, with a smooth surface and a metallic finish. - The edges and button areas of the mouse have blue LED light strips, adding a high-tech feel and visual appeal.
2. Shape and Structure: - The mouse has an overall streamlined design, with a slightly sloped top that allows for natural hand placement. - The sides of the mouse feature pronounced curves, designed ergonomically to provide a comfortable grip.
3. Buttons and Scroll Wheel: - There is a scroll wheel on the left side of the mouse, with a blue LED light strip above it. - There are additional buttons on the right side of the mouse, which also have blue LED light strips, enhancing both functionality and aesthetics.
4. Brand Logo: - In the center of the mouse’s palm rest, there is an illuminated “G” logo, which is a signature design element of the brand.
5. Lighting Effects: - The blue LED light strips on the mouse stand out prominently against the dark background, enhancing its high-tech and futuristic feel.
6. Texture and Details: - The top of the mouse has a textured design, likely for an anti-slip effect. - The sides of the mouse feature fine, elongated textured patterns, which add a sense of visual depth.
Overall, the design of this mouse combines a modern, high-tech aesthetic with ergonomic design, making it both beautiful and practical, and well-suited for gamers.

User

Why is the calligraphy in the picture good?

Ming-Lite-Omni

This calligraphy work showcases superb calligraphic skills and profound cultural heritage. The brushwork is fluid and natural, the character structures are rigorous, and the layout is well-proportioned, reflecting the calligrapher’s deep skill and artistic cultivation.

Ming-Lite-Omni V1.5

This piece of calligraphy displays several notable characteristics that make it an excellent work of art:
1. Exquisite Brushwork: The calligraphy’s brushwork is fluid and natural, with rich variations in line thickness, demonstrating the calligrapher’s profound skill and proficient technique.
2. Rigorous Structure: The structure of each character is meticulously designed, with natural connections between strokes, maintaining a good balance and harmony between characters.
3. Composition and Layout: The overall composition of the piece is well-arranged, with even line and character spacing, creating a sense of order and neatness.
4. Ink Tone Variation: The variations in ink tone, from dark to light, are natural, adding layers and a three-dimensional feel to the work.
5. Elegant Content: The content of the calligraphy is a beautiful classical text, rich in philosophical and literary value, which enhances the cultural depth of the work.
6. Seal Embellishment: The red seals on the work not only add a decorative element but also serve to embellish and balance the composition.
Together, these features constitute the high artistic value of this calligraphy piece, making it a masterpiece worthy of appreciation and study.

User

Is the phone in the picture suitable to buy for parents?

Ming-Lite-Omni

This phone has a simple and stylish appearance, suitable for parents who prefer a minimalist style. The camera configuration is also quite advanced, making it suitable for users who enjoy taking photos.

Ming-Lite-Omni V1.5

This phone is suitable for buying for parents for the following reasons:
1. Appearance Design: The phone features a simple and stylish design. The white body looks clean and neat, aligning with modern aesthetics.
2. Camera Configuration: The rear camera has a unique and neat arrangement, which may provide a good photography experience.
3. Brand: The Honor (荣耀) brand has a good reputation in the smartphone market, offering stable and reliable products.
4. Features: Phones of this type are typically equipped with a large-capacity battery and long battery life, suitable for daily use.
5. Ease of Use: It may offer a simple user interface and easy-to-operate functions, suitable for users who are not familiar with complex operations.
6. Value for Money: Honor phones are generally known for their high cost-performance ratio, making them suitable for users on a limited budget who still want a good experience.
Of course, whether it is truly suitable depends on the actual needs and preferences of the parents.

User

How to buy a plane ticket using the software in the picture? Please list the specific steps

Ming-Lite-Omni

The specific steps to buy a plane ticket using the software in the picture are as follows:
1. Open the Ctrip app.
2. Click on the “Flights” option.
3. On the flight page, select the departure city, destination, departure date, and return date.
4. Click the search button to view available flight information.
5. After selecting a suitable flight, click the “Book” button.
6. Fill in passenger information and choose a payment method.
7. Confirm the order details and submit the payment to complete the purchase.

Ming-Lite-Omni V1.5

The specific steps to purchase a plane ticket using the Ctrip app are as follows:
1. Download and Install the Ctrip App: - If you haven’t installed the Ctrip App, you can search for “Ctrip Travel” (携程旅行) in the app store (like the Apple App Store or Android app markets) and download it.
2. Open the Ctrip App: - After installation is complete, open the Ctrip app.
3. Select the “Flights” Option: - On the homepage navigation bar, tap the “Flights” option.
4. Enter Departure and Destination Cities: - On the flight booking page, enter your departure and destination cities.
5. Select Departure and Return Dates: - Choose your departure date and return date (if booking a round trip).
6. Select Passenger Information: - Enter the number of passengers and their types (adult, child, infant).
7. Choose a Flight: - Browse and select your preferred flight.
8. Fill in Passenger Information: - Enter the passenger’s name, ID number, contact information, etc.
9. Select a Payment Method: - Choose your preferred payment method (e.g., Alipay, WeChat Pay, bank card, etc.).
10. Confirm Order and Pay: - After confirming that the order information is correct, tap the “Pay” button to complete the payment.
11. Receive Your E-ticket: - After successful payment, you will receive an e-ticket, which you can view or download from the “My Orders” section within the app.
By following these steps, you can easily purchase plane tickets using the Ctrip App.

User

Based on the chart, under which primer is the Δct difference between samples the smallest? (Image source: publicly available online, 舒桐科技 WeChat official account)

Ming-Lite-Omni

Based on the chart, the Δct difference between samples is smallest under the cGAS-qF+qR primer.

Ming-Lite-Omni V1.5

The difference in Δct between samples is smallest under the EGFP-cGAS monoclonal primer.

Calculation Process:
1. Understand the meaning of Δct: Δct is the difference in relative expression between the target gene (cGAS-qF+qR or EGFP-cGAS) and the reference gene (cTubulin). A smaller Δct indicates a smaller difference in expression between the target and reference genes.
2. Compare Δct under different primers:
- cGAS-qF+qR primer: - Wild type: Δct = 8.06 - EGFP-cGAS monoclonal: Δct = -0.15 - Difference: 8.06 - (-0.15) = 8.21
- EGFP-cGAS monoclonal primer: - Wild type: Δct = 10.44 - EGFP-cGAS monoclonal: Δct = -2.9 - Difference: 10.44 - (-2.9) = 13.34
- EGFP-cGAS-qF+qR primer: - Wild type: Δct = 21.43 - EGFP-cGAS monoclonal: Δct = -13.45 - Difference: 21.43 - (-13.45) = 34.88
3. Conclusion: Under the EGFP-cGAS monoclonal primer, the difference in Δct between samples is the smallest (Δct difference is 13.34), indicating the smallest difference in expression between samples under this primer.

Summary: The difference in Δct between samples is smallest under the EGFP-cGAS monoclonal primer.

Speech Generation

Text Inputs	Ming-lite-omni	Ming-lite-omni V1.5
Tyler, Lucy, Michelle, we’re going to space!
what is the forecast for California for rain
Television can make you dumb, but it can also be good education.

OCR

User

In true/false question 95, what is the candidate’s handwritten answer? If the handwritten answer is √, output True. If the handwritten answer is ×, output False. Please output True or False

Ming-Lite-Omni

True

Ming-Lite-Omni V1.5

False

User

Output the content of the CAPTCHA directly. Characters are only numbers and uppercase/lowercase letters.

Ming-Lite-Omni

9743

Ming-Lite-Omni V1.5

9742

User

What is the Latex tag for mathematical expression in images?

Ming-Lite-Omni

A C = C B \therefore C D / / B E

Ming-Lite-Omni V1.5

A C = C B \because C D / / B E

User

Extract from the image: date

Ming-Lite-Omni

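Date: December 3, 2019, 05:08 Extract from the image: mileage Mileage: 23.9 km Extract from the image: amount Amount: 52.6 yuan Extract from the image: status Status: Extract from the image: ticket printing Ticket printing: December 2019, 900,000 rolls printed Extract from the image: number Number: from 00000001 to 90000000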

Ming-Lite-Omni V1.5

Date: March 5, 2021

User

What is the difference in value between Lamb and Corn? You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within [think] [/think] tags. The final answer MUST BE put in \boxed{}.

Ming-Lite-Omni

0.37

Ming-Lite-Omni V1.5

[think]
The value for Lamb is 103.7 and for Corn is 103.13. To find the difference, subtract the smaller value from the larger value: 103.7 - 103.13 = 0.57.
[/think]
\boxed{0.57}

Video Understanding

Users	What is the little girl doing in the video?
Ming-Lite-Omni	The scene in the video takes place in a store, possibly a supermarket or large retail store, with shelves filled with various products including snacks and household items. The little girl in the video is pulling a display rack, apparently trying to move or play with it. Price tags are clearly visible, with one prominently displaying '18,690' in blue and white. The store is well-lit with fluorescent lights illuminating the entire area. If you need more detailed analysis or other assistance, please let me know!
Ming-Lite-Omni V1.5	The little girl is sleeping in a shopping cart.
Users	Describe the protagonist's clothing in the video.
Ming-Lite-Omni	In the video, the protagonist is wearing a red top and carrying a bag with colorful patterns on their shoulder.
Ming-Lite-Omni V1.5	The protagonist in the video is wearing a red turtleneck sweater.
Users	Guess the time and location where this video was filmed.
Ming-Lite-Omni	Judging from the scenes in the video, this was likely filmed in New York's Central Park. The video shows iconic buildings and landscapes of Central Park, such as its lakes, bridges, and statues. Additionally, there are tourists walking, taking photos, and enjoying outdoor activities in the park, which further confirms that the filming location is New York's Central Park.
Ming-Lite-Omni V1.5	From the scenes in the video, the filming location appears to be New York's Central Park. The video showcases multiple attractions within the park, including lakes, bridges, statues, and fountains. The filming time seems to be during the daytime, with bright sunshine and clear weather, making it suitable for outdoor activities.

Get Started with Ming-lite-omni V1.5

The model and code for Ming-lite-omni V1.5 are now open source, and we invite everyone to try them out, share feedback, and join the discussion. A quantized and accelerated version is on the way. That release will make the model more lightweight while further improving omni-modal performance and strengthening multimodal reasoning and generation. Stay tuned for more updates!

GitHub: https://github.com/inclusionAI/Ming
Hugging Face: https://huggingface.co/inclusionAI/Ming-Lite-Omni-1.5
ModelScope: https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni-1.5
