GitHub | 📑 Paper | 🤗 Hugging Face | 🤖 ModelScope

Introduction

Ming-Lite-Uni is an open-source multimodal framework that features a newly designed unified visual generator and a native multimodal autoregressive model built to integrate vision and language.

This project provides an open-source implementation of the integrated MetaQueries and M2-omni framework, and introduces the novel multi-scale learnable tokens and multi-scale representation alignment strategy. Ming-Lite-Uni pairs a frozen MLLM with a learnable diffusion model, enabling native multimodal AR models to perform text-to-image generation and instruction-based image editing in addition to visual understanding. Our experiments demonstrate the strong performance of Ming-Lite-Uni and the remarkable fluidity of its interactive process. Ming-Lite-Uni is currently in the alpha stage and will continue to be refined.

Thank you for your continued support and interest! We appreciate your patience as we progressively improve our solutions and model performance. We are making steady progress and seeing encouraging results, with more updates on the way. Stay tuned!

📌 Updates

Why It Matters

Ming-Lite-Uni’s unified architecture overcomes fundamental limitations of conventional approaches:

| Conventional Methods | Ming-Lite-Uni’s Advantages |
|---|---|
| Modular Pipelines<br>(CLIP/SigLIP + Diffusion Models) | End-to-End Unified Model<br>Seamless understanding-generation integration |
| Discrete Token AR<br>(Limited visual grounding) | Continuous Token Space<br>Native support for fine-grained visual concepts |
| Fixed-Resolution Processing<br>(Artifacts in upscaling) | Multi-Scale Adaptation<br>Consistent quality across resolutions |
| Separate Editing Workflows<br>(Manual alignment required) | Dialog-Driven Control<br>Natural-language-guided pixel-level editing |
| Understanding Bottlenecks<br>(Visual-semantic mismatch) | Joint Representation Learning<br>Mutually enhanced comprehension and generation |

Key Enhancements

  • Unified Visual Understanding & Generation Architecture. Ming-Lite-Uni achieves an average understanding score of 69.7 on the OpenCompass leaderboard, surpassing DeepSeek-VL2 (66.4). At the same time, it achieves an image generation score of 0.62 on the GenEval benchmark, outperforming SDXL (0.55).
  • Multi-Scale Learnable Tokens. We employ a novel mechanism to establish feature correlations across 4×/8×/16× resolutions. By introducing hierarchical tokens, the model captures global layout (low-res), object structures (mid-res), and fine textures (high-res), improving GenEval by 3.5%.
  • Multi-Scale Representation Alignment. We introduce a novel scale-wise consistency loss that aligns hierarchical representations with the final outputs through native-resolution optimization. This strategy directly improves high-resolution reconstruction quality (>2 dB PSNR) and boosts GenEval by 1.5% (see the sketch after this list).
  • AGI-Capable System. Our model supports complex chained operations, such as “generate castle → add sunset → adjust perspective”, with a response time of under 1 second (benchmarked on an RTX 4090). The system handles instruction-driven generation and editing and is aligned with ChatGPT-4o’s image generation capability (the industry milestone of March 2025).
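
To make the multi-scale representation alignment concrete, below is a minimal PyTorch sketch of a scale-wise consistency loss. The tensor shapes, the scale_wise_consistency_loss helper, and the average-pooling scheme are illustrative assumptions, not the actual Ming-Lite-Uni training code.

# Minimal, illustrative sketch of a scale-wise consistency loss.
# Names, shapes, and the pooling scheme are assumptions for exposition only.
import torch
import torch.nn.functional as F

def scale_wise_consistency_loss(scale_features, final_features):
    """Align hierarchical (4x/8x/16x) token features with the final
    native-resolution representation via a simple MSE term."""
    loss = final_features.new_zeros(())
    for factor, feats in scale_features.items():  # e.g. {4: ..., 8: ..., 16: ...}
        # Downsample the native-resolution features to the coarser grid so
        # both sides are compared at the same spatial resolution.
        target = F.adaptive_avg_pool2d(final_features, feats.shape[-2:])
        loss = loss + F.mse_loss(feats, target)
    return loss / max(len(scale_features), 1)

# Toy usage: B x C x H x W feature maps at three coarser scales plus the final map.
final = torch.randn(2, 256, 64, 64)
scales = {4: torch.randn(2, 256, 16, 16),
          8: torch.randn(2, 256, 8, 8),
          16: torch.randn(2, 256, 4, 4)}
print(scale_wise_consistency_loss(scales, final))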

Empowering Multimodal Interaction with Ming-Lite-Uni

Ming-Lite-Uni acts as a unified multimodal model that extends beyond traditional NLP tasks and multimodal comprehension to enable interactive multimodal generation, including image generation, image editing, and style transfer.

Model Structure

Ming-Lite-Uni is a unified multimodal model designed for both image understanding and high-fidelity image generation. It achieves this by compressing image representations into continuous visual tokens, which are processed alongside discrete text tokens using a scaled auto-regressive Transformer. The generation capability is powered by an externally trained diffusion model (SANA), conditioned on tokens produced by the Transformer.
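
As a rough mental model of this data flow, here is a hedged PyTorch sketch. Every class, module, and dimension below (UnifiedARSketch, the 768-dim visual features, the learnable query tokens, etc.) is a placeholder assumption standing in for the components described above; the real AR Transformer and the SANA diffusion decoder are not reproduced here.

# Conceptual sketch only: placeholder modules that mimic the described data flow,
# not the real Ming-Lite-Uni implementation.
import torch
import torch.nn as nn

class UnifiedARSketch(nn.Module):
    def __init__(self, vocab_size=32000, dim=1024, n_query=64):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)   # discrete text tokens
        self.visual_proj = nn.Linear(768, dim)            # continuous visual tokens
        self.backbone = nn.TransformerEncoder(            # stand-in for the scaled AR Transformer
            nn.TransformerEncoderLayer(dim, nhead=16, batch_first=True), num_layers=4)
        self.query_tokens = nn.Parameter(torch.randn(1, n_query, dim))  # learnable generation queries

    def forward(self, text_ids, visual_feats):
        # Process discrete text tokens and continuous visual tokens in one sequence,
        # with learnable queries appended for image generation.
        batch = text_ids.size(0)
        seq = torch.cat([self.text_embed(text_ids),
                         self.visual_proj(visual_feats),
                         self.query_tokens.expand(batch, -1, -1)], dim=1)
        hidden = self.backbone(seq)
        # The hidden states at the query positions would condition an external
        # diffusion decoder (SANA in the paper) that renders the final image.
        return hidden[:, -self.query_tokens.size(1):]

cond = UnifiedARSketch()(torch.randint(0, 32000, (1, 12)), torch.randn(1, 256, 768))
print(cond.shape)  # conditioning tokens handed to the diffusion decoder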


Benchmark Evaluations

We conduct separate quantitative evaluations of Ming-Lite-Uni on multimodal understanding and text-to-image generation using public benchmarks. For multimodal understanding, we compare against traditional models that take images and text as input and output text, as well as against recent models with visual generative capabilities. For multimodal generation, we evaluate text-to-image performance on GenEval. Please refer to our TechReport for details.

Multimodal Understanding

| Type | Model | Avg. | MMB | MMS | MMMU | MathV | Hall | AI2D | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|
| Und. Only | LLaVA-72B | 68.0 | 84.5 | 65.8 | 56.6 | 68.4 | 47.9 | 86.2 | 60.6 |
| Und. Only | Qwen2.5-VL-7B | 76.2 | 87.8 | 71.1 | 67.9 | 70.8 | 58.8 | 88.2 | 76.7 |
| Und. Only | Emu3-Chat | - | 58.5 | - | 31.6 | - | - | - | 37.2 |
| Und. Only | InternVL2.5-78B | 75.2 | 87.5 | 69.5 | 70 | 71.4 | 57.4 | 89.1 | 71.8 |
| Und. Only | DeepSeek-VL2 | 66.4 | 81.2 | 61.0 | 50.7 | 59.4 | 51.5 | 84.5 | 60.0 |
| Und. Only | GPT-4o-20241120 (closed) | 72.0 | 84.3 | 65.1 | 70.7 | 59.9 | 56.2 | 84.9 | 74.5 |
| Und. Only | Step-1o (closed) | 77.7 | 87.3 | 69.3 | 69.9 | 74.7 | 55.8 | 89.1 | 82.8 |
| Und. and Gen. | TokenFlow-XL | - | 68.9 | - | 38.7 | - | - | - | 40.7 |
| Und. and Gen. | Janus-Pro-7B | - | 79.2 | - | 41.0 | - | - | - | 50.0 |
| Und. and Gen. | Ours (Ming-Lite-Uni) | 69.7 | 80.7 | 60.5 | 51.2 | 68.3 | 51.8 | 84.5 | 72.3 |

Image Generation

| Type | Method | Single Obj. | Two Obj. | Counting | Colors | Position | Color Attri. | Overall |
|---|---|---|---|---|---|---|---|---|
| Gen. Only | LlamaGen | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | 0.32 |
| Gen. Only | SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 |
| Gen. Only | Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 |
| Gen. Only | SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 |
| Gen. Only | DALL-E 3 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.67 |
| Gen. Only | SD3-Medium | 0.99 | 0.94 | 0.72 | 0.89 | 0.33 | 0.60 | 0.74 |
| Und. and Gen. | Show-o | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | 0.53 |
| Und. and Gen. | TokenFlow-XL | 0.95 | 0.60 | 0.41 | 0.81 | 0.16 | 0.24 | 0.55 |
| Und. and Gen. | Janus-Pro-1B | 0.98 | 0.82 | 0.51 | 0.89 | 0.65 | 0.56 | 0.73 |
| Und. and Gen. | Ours (Ming-Lite-Uni) | 0.99 | 0.76 | 0.53 | 0.87 | 0.26 | 0.30 | 0.62 |

Example Usage

System Requirements

  • Python: >= 3.8
  • PyTorch: >= 2.4.1+cu12.2 (CUDA 12.2 compatible)
  • flash-attn: >= 2.6.3
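
As an optional aid (not part of the official codebase), a small script along these lines can confirm that the local environment roughly matches the requirements above:

# Optional sanity check for the environment; illustrative only.
import sys
import torch

assert sys.version_info >= (3, 8), "Python >= 3.8 is required"
print("python :", sys.version.split()[0])
print("torch  :", torch.__version__, "| CUDA build:", torch.version.cuda)
print("GPU    :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none detected")

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn: not installed (>= 2.6.3 required)")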

Installation

We recommend setting up your environment with pip using the pinned dependencies in requirements.txt:

pip install -r requirements.txt

Usage Guide

Below is an example of how to load and use the model:

import os
import torch
from Ming_Uni.MingUniInference import Ming_Uni_Inference
from Ming_Uni.process import MyProcessor

# Select the current CUDA device.
device = torch.device(torch.cuda.current_device())

# Load the model in bfloat16 and switch to evaluation mode.
model_path = '../Ming-Lite-Uni/'
model = Ming_Uni_Inference(model_path)
model.to(torch.bfloat16)
model.to(device)
model.eval()

# Build the processor from the bundled Qwen2.5 LLM assets.
llm_model = os.path.join(model_path, 'qwen2_5_llm')
my_proc = MyProcessor(llm_model)

# Prepare an image-editing request: a source image plus a text instruction.
image_file = "tests/cake.jpg"
prompt = "add a candle on top of the cake"
inputs = my_proc.process(image_file=image_file, prompt=prompt, device=device)

# Generate a 512x512 result with 30 diffusion steps, CFG scale 5.0, and a fixed seed.
result = model.image_gen_generate(inputs, steps=30, seed=42, cfg=5.0, height=512, width=512)[1]
result.save("result.png")

For more advanced usage, such as fine-tuning or generating images, refer to the documentation.

Acknowledgments

The project is currently in its early stages. While some preliminary results have been promising, substantial progress is needed to achieve seamless integration of understanding and generation. Both the code and models require further refinement and optimization, which is why we have chosen to open-source the project. We invite contributions from the community to help enhance and develop it collaboratively. If you have any suggestions or identify issues within the code, please contribute via Pull Requests. Thank you for your support and interest!

Open Collaboration

We’re open-sourcing Ming-Lite-Uni to accelerate progress toward AGI, featuring:

  • đź“‚ Full model weights & test code
  • đź§© Modular architecture for easy extension
  • 📊 Comprehensive benchmarks (vs GPT-4V, SDXL, etc.)

“The simultaneous release of GPT-4o’s image generation in March 2025 confirms our vision of unified multimodal AI as the next paradigm.”

Contact Information

If you require assistance or encounter troubles while utilizing our project, please open a GitHub issue.

Ming is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project’s root directory.

Citation

If you find our work helpful, please consider citing it:

@article{Mingunify2025,
    title   = {Ming-Lite-Uni: Advancements in Unified Architecture for Natural Multimodal Interaction},
    author  = {Inclusion AI, Ant Group},
    journal = {arXiv preprint},
    year    = {2025}
}