SGLang dLLM Inference Guide
===========================
Overview
--------
The dLLM (diffusion language model) paradigm is rapidly evolving, and inference ecosystems are still maturing. This guide provides a practical approach to launching dLLM inference services using SGLang.
Installation
------------
Install the SGLang build with dLLM support (pinned to the pull request that adds it):

.. code-block:: bash

   pip install "git+https://github.com/sgl-project/sglang.git@refs/pull/12588/head"
Server Launch
-------------
Launch the SGLang inference server with dLLM-specific parameters:

.. code-block:: bash

   python3 -m sglang.launch_server \
     --model-path /path/to/LLaDA2.0-flash-preview/ \
     --host 127.0.0.1 \
     --port 8188 \
     --trust-remote-code \
     --disable-cuda-graph \
     --disable-radix-cache \
     --mem-fraction-static 0.9 \
     --attention-backend flashinfer \
     --diffusion-algorithm "LowConfidence" \
     --diffusion-block-size 32 \
     --tp-size 4 \
     --max-running-requests 1
Key Parameters
~~~~~~~~~~~~~~
- ``--diffusion-algorithm``: Set to ``"LowConfidence"`` to enable the dLLM decoding path
- ``--diffusion-block-size``: Block size for diffusion generation (default: 32)
- ``--attention-backend``: Use ``flashinfer`` for optimal performance
- ``--tp-size``: Tensor-parallelism degree (adjust to match your GPU count)
- ``--max-running-requests``: Limit on concurrent requests; keep low for stability
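
The flags above can be kept in one place with a small launcher helper. This is a hypothetical convenience wrapper (not part of SGLang); the flag names and values are taken verbatim from the launch command in this guide.

.. code-block:: python

   import subprocess

   # Hypothetical helper: assemble the sglang.launch_server argv from the
   # dLLM-specific parameters used in this guide. Not part of SGLang itself.
   def build_launch_cmd(model_path: str, tp_size: int = 4, port: int = 8188,
                        block_size: int = 32, max_running: int = 1) -> list[str]:
       return [
           "python3", "-m", "sglang.launch_server",
           "--model-path", model_path,
           "--host", "127.0.0.1",
           "--port", str(port),
           "--trust-remote-code",
           "--disable-cuda-graph",
           "--disable-radix-cache",
           "--mem-fraction-static", "0.9",
           "--attention-backend", "flashinfer",
           "--diffusion-algorithm", "LowConfidence",
           "--diffusion-block-size", str(block_size),
           "--tp-size", str(tp_size),
           "--max-running-requests", str(max_running),
       ]

   # To actually launch (requires SGLang installed and GPUs available):
   # subprocess.Popen(build_launch_cmd("/path/to/LLaDA2.0-flash-preview/"))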
API Usage
---------
Send requests to the inference endpoint:

.. code-block:: bash

   curl -X POST "http://127.0.0.1:8188/generate" \
     -H "Content-Type: application/json" \
     -d '{
       "text": "SYSTEMdetailed thinking off<|role_end|>HUMANWhy did Camus say that Sisyphus was happy?<|role_end|>ASSISTANT",
       "stream": true,
       "sampling_params": {
         "temperature": 0,
         "max_new_tokens": 1024
       }
     }'
Request Format
~~~~~~~~~~~~~~
- ``text``: Input text with the model's role-based prompt formatting
- ``stream``: Enable streaming responses
- ``sampling_params``: Generation parameters

  - ``temperature``: Sampling temperature (0 for deterministic output)
  - ``max_new_tokens``: Maximum tokens to generate
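
The same request can be issued from Python. The sketch below builds the payload from the fields above and parses one streamed line; the ``data: {...}`` / ``data: [DONE]`` server-sent-events framing is an assumption about the streaming response and should be checked against your server's actual output.

.. code-block:: python

   import json

   # Build the JSON payload for the /generate endpoint, mirroring the
   # curl example above.
   def build_generate_payload(prompt: str, temperature: float = 0.0,
                              max_new_tokens: int = 1024,
                              stream: bool = True) -> dict:
       return {
           "text": prompt,
           "stream": stream,
           "sampling_params": {
               "temperature": temperature,
               "max_new_tokens": max_new_tokens,
           },
       }

   # Parse one streamed line, assuming SSE framing: "data: {...json...}",
   # terminated by "data: [DONE]". Returns the decoded chunk or None.
   def parse_sse_line(line: str):
       line = line.strip()
       if not line.startswith("data:"):
           return None
       body = line[len("data:"):].strip()
       if body == "[DONE]":
           return None
       return json.loads(body)

   # Usage against a running server (requires the `requests` package):
   # import requests
   # with requests.post("http://127.0.0.1:8188/generate",
   #                    json=build_generate_payload("Hello"), stream=True) as r:
   #     for raw in r.iter_lines(decode_unicode=True):
   #         chunk = parse_sse_line(raw)
   #         if chunk is not None:
   #             print(chunk.get("text", ""), end="", flush=True)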
Additional Resources
--------------------
For detailed implementation discussion and RFC, see:
https://github.com/sgl-project/sglang/issues/12766