GITHUB 🤗 Hugging Face| 🤖 ModelScope

Overview

Ming-lite-omni v1.5 is a comprehensive upgrade to the full-modal capabilities of Ming-lite-omni(Github). It significantly improves performance across tasks including image-text understanding, document understanding, video understanding, speech understanding and synthesis, and image generation and editing. Built upon Ling-lite-1.5, Ming-lite-omni v1.5 has a total of 20.3 billion parameters, with 3 billion active parameters in its MoE (Mixture-of-Experts) section. It demonstrates highly competitive results in various modal benchmarks compared to industry-leading models.

Image description

Performance Comparison

Introduce Ming-lite-omni v1.5

Controllable Image Generation: Pixel-Level Control, Infinite Creativity

Ming-lite-omni v1.5 significantly optimizes Scene Consistency and ID Consistency (Character / Style Consistency) in image editing. When editing human figures, it demonstrates a clear advantage in maintaining scene and character ID. Furthermore, it expands support for perceptual tasks such as generative segmentation, depth prediction, object detection, and edge contour generation.

Image description
Generative Image Segmentation Next
User
Given the following instructions: little girl, pink, your monitors colors off friend p pink shirt girl; please perform referring segmentation on this image.
Ming-lite-omni v1.5
User
Please segment different classes in this image.
Ming-lite-omni v1.5
User
Please segment different instances in this image.
Ming-lite-omni v1.5

** Depth and Edge Detection**

Original ImageGenerated Depth MapGenerated Bounding BoxesGenerated Edge Contours

Audio-Video Interactive Understanding

Experience Alignment Optimization: Smarter Interaction That Understands You Better

Visual Dialogue Next
User
This picture looks very design-oriented. Can you describe its various design elements in detail?
Ming-Lite-Omni
This image shows a high-end gaming mouse with a modern and technological design. The mouse body is black with a smooth metallic finish. There is a blue glowing logo on the top of the mouse, and below the logo is the mouse wheel and left and right button area. The side of the mouse has a blue light decoration, which increases the visual appeal. The bottom of the mouse has a non-slip texture design to ensure stability during use. The overall design is simple and futuristic, suitable for gamers.
Ming-Lite-Omni V1.5
This image showcases a gaming mouse with a striking design, featuring the following design elements:
1. Color and Material: - The main body of the mouse is black, with a smooth surface and a metallic finish. - The edges and button areas of the mouse have blue LED light strips, adding a high-tech feel and visual appeal.
2. Shape and Structure: - The mouse has an overall streamlined design, with a slightly sloped top that allows for natural hand placement. - The sides of the mouse feature pronounced curves, designed ergonomically to provide a comfortable grip.
3. Buttons and Scroll Wheel: - There is a scroll wheel on the left side of the mouse, with a blue LED light strip above it. - There are additional buttons on the right side of the mouse, which also have blue LED light strips, enhancing both functionality and aesthetics.
4. Brand Logo: - In the center of the mouse’s palm rest, there is an illuminated “G” logo, which is a signature design element of the brand.
5. Lighting Effects: - The blue LED light strips on the mouse stand out prominently against the dark background, enhancing its high-tech and futuristic feel.
6. Texture and Details: - The top of the mouse has a textured design, likely for an anti-slip effect. - The sides of the mouse feature fine, elongated textured patterns, which add a sense of visual depth.
Overall, the design of this mouse combines a modern, high-tech aesthetic with ergonomic design, making it both beautiful and practical, and well-suited for gamers.
User
Why is the calligraphy in the picture good?
Ming-Lite-Omni
This calligraphy work showcases superb calligraphic skills and profound cultural heritage. The brushwork is fluid and natural, the character structures are rigorous, and the layout is well-proportioned, reflecting the calligrapher’s deep skill and artistic cultivation.
Ming-Lite-Omni V1.5
This piece of calligraphy displays several notable characteristics that make it an excellent work of art:
1. Exquisite Brushwork: The calligraphy’s brushwork is fluid and natural, with rich variations in line thickness, demonstrating the calligrapher’s profound skill and proficient technique.
2. Rigorous Structure: The structure of each character is meticulously designed, with natural connections between strokes, maintaining a good balance and harmony between characters.
3. Composition and Layout: The overall composition of the piece is well-arranged, with even line and character spacing, creating a sense of order and neatness.
4. Ink Tone Variation: The variations in ink tone, from dark to light, are natural, adding layers and a three-dimensional feel to the work.
5. Elegant Content: The content of the calligraphy is a beautiful classical text, rich in philosophical and literary value, which enhances the cultural depth of the work.
6. Seal Embellishment: The red seals on the work not only add a decorative element but also serve to embellish and balance the composition.
Together, these features constitute the high artistic value of this calligraphy piece, making it a masterpiece worthy of appreciation and study.
User
Is the phone in the picture suitable to buy for parents?
Ming-Lite-Omni
This phone has a simple and stylish appearance, suitable for parents who prefer a minimalist style. The camera configuration is also quite advanced, making it suitable for users who enjoy taking photos.
Ming-Lite-Omni V1.5
This phone is suitable for buying for parents for the following reasons:
1. Appearance Design: The phone features a simple and stylish design. The white body looks clean and neat, aligning with modern aesthetics.
2. Camera Configuration: The rear camera has a unique and neat arrangement, which may provide a good photography experience.
3. Brand: The Honor (荣耀) brand has a good reputation in the smartphone market, offering stable and reliable products.
4. Features: Phones of this type are typically equipped with a large-capacity battery and long battery life, suitable for daily use.
5. Ease of Use: It may offer a simple user interface and easy-to-operate functions, suitable for users who are not familiar with complex operations.
6. Value for Money: Honor phones are generally known for their high cost-performance ratio, making them suitable for users on a limited budget who still want a good experience.
Of course, whether it is truly suitable depends on the actual needs and preferences of the parents.
User
How to buy a plane ticket using the software in the picture? Please list the specific steps
Ming-Lite-Omni
The specific steps to buy a plane ticket using the software in the picture are as follows:
1. Open the Ctrip app. 2. Click on the “Flights” option. 3. On the flight page, select the departure city, destination, departure date, and return date. 4. Click the search button to view available flight information. 5. After selecting a suitable flight, click the “Book” button. 6. Fill in passenger information and choose a payment method. 7. Confirm the order details and submit the payment to complete the purchase.
Ming-Lite-Omni V1.5
The specific steps to purchase a plane ticket using the Ctrip app are as follows:
1. Download and Install the Ctrip App: - If you haven’t installed the Ctrip App, you can search for “Ctrip Travel” (携程旅行) in the app store (like the Apple App Store or Android app markets) and download it.
2. Open the Ctrip App: - After installation is complete, open the Ctrip app.
3. Select the “Flights” Option: - On the homepage navigation bar, tap the “Flights” option.
4. Enter Departure and Destination Cities: - On the flight booking page, enter your departure and destination cities.
5. Select Departure and Return Dates: - Choose your departure date and return date (if booking a round trip).
6. Select Passenger Information: - Enter the number of passengers and their types (adult, child, infant).
7. Choose a Flight: - Browse and select your preferred flight.
8. Fill in Passenger Information: - Enter the passenger’s name, ID number, contact information, etc.
9. Select a Payment Method: - Choose your preferred payment method (e.g., Alipay, WeChat Pay, bank card, etc.).
10. Confirm Order and Pay: - After confirming that the order information is correct, tap the “Pay” button to complete the payment.
11. Receive Your E-ticket: - After successful payment, you will receive an e-ticket, which you can view or download from the “My Orders” section within the app.
By following these steps, you can easily purchase plane tickets using the Ctrip App.
User
Based on the chart, under which primer is the Δct difference between samples the smallest? | Based on the chart, the Δct difference between samples is smallest under the cGAS-qF+qR primer. (Image source: Publicly available online - 舒桐科技公众号)
Ming-Lite-Omni
Based on the chart, the Δct difference between samples is smallest under the cGAS-qF+qR primer.
Ming-Lite-Omni V1.5
The difference in Δct between samples is smallest under the EGFP-cGAS monoclonal primer.

Calculation Process:
1. Understand the meaning of Δct: Δct is the difference in relative expression between the target gene (cGAS-qF+qR or EGFP-cGAS) and the reference gene (cTubulin). A smaller Δct indicates a smaller difference in expression between the target and reference genes.
2. Compare Δct under different primers:
- cGAS-qF+qR primer: - Wild type: Δct = 8.06 - EGFP-cGAS monoclonal: Δct = -0.15 - Difference: 8.06 - (-0.15) = 8.21
- EGFP-cGAS monoclonal primer: - Wild type: Δct = 10.44 - EGFP-cGAS monoclonal: Δct = -2.9 - Difference: 10.44 - (-2.9) = 13.34
- EGFP-cGAS-qF+qR primer: - Wild type: Δct = 21.43 - EGFP-cGAS monoclonal: Δct = -13.45 - Difference: 21.43 - (-13.45) = 34.88
3. Conclusion: Under the EGFP-cGAS monoclonal primer, the difference in Δct between samples is the smallest (Δct difference is 13.34), indicating the smallest difference in expression between samples under this primer.

Summary: The difference in Δct between samples is smallest under the EGFP-cGAS monoclonal primer.

Model Architecture Upgrade and Capability Evaluation

The Ming-lite-omni v1.5 model architecture is outlined below. The core design references the structure of Ming-lite-omni V1. However, a key distinction is the upgrade of the Vision head to support reference image feature input, specifically to enhance character and scene consistency in image editing.

Image description

Mode Architecture

The model’s capabilities have been significantly optimized and upgraded across three key areas: enhanced Omni-modal comprehension, precise visual editing control, and improved user experience.

Enhanced Omni-Modal Comprehension

Thanks to optimized data quality, Ming-lite-omni v1.5 shows significant improvements in tasks such as vision-text comprehension (including image-text, document, and video understanding) and speech understanding. It has reached an industry-leading level for models of comparable scale.

Vision-text Comprehension

Task TypeDatasetQwen2.5-VL-7BMing-lite-omniMing-lite-omni v1.5
Image-text UnderstandingAI2D84.3683.184.91
HallusionBench55.7755.054.59
MMBench_TEST_V1182.7580.880.73
MMMU56.5656.354.33
MMStar65.2764.765.07
MMVet71.6171.373.99
MathVista68.1071.672.00
OCRBench87.8088.488.90
Average71.571.471.8
Video UnderstandingVideoMME(w/o subs)65.1063.467.07
VideoMME(w/ subs)71.6066.0172.59
VideoMME(avg)68.3567.769.83
MVBench69.6067.769.43
LongVideoBench56.0056.659.54
OvOBench51.1048.4852.17
Average61.2658.8962.74
Document UnderstandingChartQA_test87.2485.188.84
DocVQA_test95.579393.68
TextVQA_val85.0682.882.27
OCRBench87.888.488.9
Average88.9187.3288.42

Speech Understanding

ModelAverage(Open-ended QA)AlpacaEvalCommonEvalSD-QAMMSUOpenBookQAIFEvalAdvBench
Ming-lite-omni v1.54.4744.6484.361.1645.7765.93455.59998.076
Ming-lite-omni4.344.634.0658.8447.5361.9858.3699.04
MiniCPM-o4.2854.424.1550.7254.7878.0249.2597.69
Kimi-Audio4.2154.463.9763.1262.1783.5261.10100.00
Qwen2.5-Omni4.214.493.9355.7161.3281.1052.8799.42
GLM-4-Voice3.774.063.4843.3140.1152.9724.9188.08

Precise Visual Editing Control

Ming-lite-omni v1.5 employs the following optimization strategies to address the issues of character ID and scene ID consistency during image editing:

  1. ID and Scene Consistency Loss: This is achieved by increasing the weight of the edited region in the target image and the reference strength of the non-edited region in the reference image, while simultaneously decreasing the reference strength of the edited region in the reference image. This approach enhances image editing consistency.
  2. Incorporating Generative Detection and Segmentation Tasks to Boost Perceptual Capabilities: By supporting generative segmentation and keypoint detection, the model’s understanding of image details and spatial relationships is improved. This enhances the structural controllability of the editing and generation processes, leading to significant increases in evaluation metrics related to position, structure, and quantity.
  3. Multi-Task Collaborative Learning Strategy: Through a joint training pipeline, generation and editing mutually reinforce each other. Segmentation tasks are transformed into colorization editing tasks, which significantly improves segmentation metrics and the precision and controllability of local image editing, resulting in smoother edges for edited regions.

Based on these optimizations, Ming-lite-omni v1.5 shows a significant improvement in image editing capabilities, achieving a GenEval score of 0.87.

1-Obj2-ObjCountingColorsPositionColor AttrAvg.
Ming-lite-omni0.990.770.680.780.460.420.64
Ming-lite-omni v1.50.990.930.860.870.900.660.87

Optimized User Experience

Thanks to the construction of high-quality alignment preference data, Ming-lite-omni v1.5 demonstrates a certain advantage over leading models in terms of correctness, relevance, format aesthetics, and fluency of expression for image-text Q&A. Ming-lite-omni v1.5 achieved a win rate of 87.07% against Ming-lite-omni V1 on internal adversarial evaluation sets, indicating a significant optimization in user experience.

Evaluation DimensionQwen2.5-VL-7BMing-lite-omni V1.5
Relevance4.3084.5
Fluency4.7654.91
Richness of Content3.8283.69
Format aesthetics4.7274.8
Correctness3.7413.92
Average Score4.2744.365

Get Started with Ming-lite-omni v1.5

The model and code for Ming-lite-omni v1.5 are now open-source, and we invite everyone to try it out, share feedback, and join the discussion. Looking ahead, we’re excited to announce that a quantized and accelerated version is on the way. This future release will not only further enhance omni-modal performance but also make the model even more lightweight, all while strengthening its multimodal reasoning and generation capabilities. Stay tuned for more updates!