Quickstart: SFT upon dFactory

Last updated: 2025-11-04

This quickstart guide provides comprehensive VeOmni best practices for training, including installation, configuration, and advanced usage patterns.

VeOmni Best Practices

Usage

Run Example Script

Verify training startup (need to download the dataset first):

sh train.sh tasks/train_llada2_bd.py configs/sft/llada2_mini_bd_sft.yaml

Create Custom Task Directory

train_torch.py can be used for most pre-training and post-training tasks. You can just modify the train config to complete your task. However, if you want to create a new task, you can copy the train_torch.py file from the tasks directory and modify it, like tasks/omni/train_qwen2_vl.py.

mkdir tasks/your_task
cp tasks/train_torch.py tasks/your_task/train.py

Launch Custom Training

You can overwrite the default arguments in train yaml by passing them to the script.

bash train.sh tasks/your_task/train.py \
    $CONFIG.yaml \
    --model.model_path your_path_to_model \
    --data.train_path your_path_to_dataset \
    --train.output_dir your_path_to_save_checkpoints \
    --train.wandb_project your_project_name \
    --train.wandb_name your_experiment_name

Arguments

Default Parameter Access

VeOmni offers a unified argument management system that can be easily extended to support custom arguments. For default arguments explanation, refer to Config arguments Explanation.

Source code: veomni/utils/arguments.py.

from dataclasses import dataclass, field
from veomni.utils.arguments import DataArguments, ModelArguments, TrainingArguments, parse_args

@dataclass
class Arguments:
    model: "ModelArguments" = field(default_factory=ModelArguments)
    data: "DataArguments" = field(default_factory=DataArguments)
    train: "TrainingArguments" = field(default_factory=TrainingArguments)

if __name__ == "__main__":
    args = parse_args(Arguments)
    print(args.train.lr)  # Access default arguments

Custom Parameter Extension

You can extend the default arguments by creating a new class that inherits from the existing class.

@dataclass
class CustomTrainingArguments(TrainingArguments):
    enable_xxx: bool = field(
        default=False,
        metadata={"help": "Enable me if necessary."},
    )

@dataclass
class Arguments:
    model: "ModelArguments" = field(default_factory=ModelArguments)
    data: "DataArguments" = field(default_factory=DataArguments)
    train: "CustomTrainingArguments" = field(default_factory=CustomTrainingArguments)

Parallel State

VeOmni uses torch device mesh to manage all parallel states, which is useful for multi-dimensional parallelism (i.e., 3-D parallel) where parallelism composability is required. You can create the parallel state by calling the init_parallel_state function and get the parallel state by calling the get_parallel_state function.

For more details about torch device mesh, refer to Getting Started with DeviceMesh.

Source code: veomni/distributed/parallel_state.py.

Note

The parallel state system provides a unified interface for managing different types of parallelism including data parallel, tensor parallel, expert parallel, and pipeline parallel.

from veomni.distributed.parallel_state import get_parallel_state, init_parallel_state

init_parallel_state(
    dp_size=args.train.data_parallel_size,  # data parallel size
    dp_replicate_size=args.train.data_parallel_replicate_size,  # data parallel replicate size
    dp_shard_size=args.train.data_parallel_shard_size,  # data parallel shard degree
    tp_size=args.train.tensor_parallel_size,  # tensor parallel size
    ep_size=args.train.expert_parallel_size,  # expert parallel size
    pp_size=args.train.pipeline_parallel_size,  # pipeline parallel size, not supported now
    cp_size=args.train.context_parallel_size,  # context parallel size, not supported now
    ulysses_size=args.train.ulysses_parallel_size,  # ulysses parallel size
    dp_mode=args.train.data_parallel_mode,  # data parallel mode, can be "ddp", "fsdp1", "fsdp2"
)

parallel_state = get_parallel_state()

# Access dp state
dp_mesh = parallel_state.dp_mesh
dp_group = parallel_state.dp_group

# Access sp state
sp_group = parallel_state.sp_group
sp_rank = parallel_state.sp_rank

# Access tp state
tp_group = parallel_state.tp_group
tp_mesh = parallel_state.tp_mesh

Dataset

VeOmni supports two types of datasets by default:

Source code: veomni/data/dataset.py

Dataset Types

IterativeDataset (recommended for large datasets)
MappingDataset (default for small datasets)

from veomni.data import (
    build_iterative_dataset,
    build_mapping_dataset,
)

if args.data.datasets_type == "iterable":
    train_dataset = build_iterative_dataset(args.data.train_path, transform=transform, seed=args.train.seed)
    args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size)
elif args.data.datasets_type == "mapping":
    train_dataset = build_mapping_dataset(args.data.train_path, transform=transform)
    args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))

Important

Training Steps Calculation

Iterable datasets: Add data.train_size (tokens to consume) to config. Train steps ≈ train_size / (global_batch_size * max_seq_len)
Mapping datasets: Pass len(train_dataset) to compute correct train steps

Custom Datasets

VeOmni is a flexible framework that supports custom datasets. You can implement your own dataset function and use it with VeOmni.

def build_custom_dataset(data_path, transform) -> Dataset:
    # Implement your custom dataset logic
    pass

elif args.data.datasets_type == "custom":
    logger.info_rank0("Start building custom dataset")
    train_dataset = build_custom_dataset(args.data.train_path, transform=transform)
    # For iterable datasets, remove len(train_dataset)
    args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))

Data Transform (Preprocess)

VeOmni supports two types of transforms by default:

Source code: veomni/data/data_transform.py

Transform Types

process_pretrain_example (recommended for pretrain task)
process_sft_example (recommended for sft task)

Pretrain Example

from functools import partial
from veomni.data.data_transform import process_pretrain_example
from veomni.models import build_tokenizer

tokenizer = build_tokenizer(args.model.tokenizer_path)
# To use AutoTokenizer, replace the line above with the following:
# tokenizer = AutoTokenizer.from_pretrained(args.model.tokenizer_path)

transform = partial(
    process_pretrain_example,
    tokenizer=tokenizer,
    max_seq_len=args.data.max_seq_len,
)

SFT Example

from functools import partial
from veomni.data.chat_template import build_chat_template
from veomni.data.data_transform import process_sft_example

chat_template = build_chat_template(args.data.chat_template, tokenizer)
transform = partial(
    process_sft_example,
    chat_template=chat_template,
    max_seq_len=args.data.max_seq_len,
)

Chat Template

VeOmni supports several chat templates by default and you can add your custom chat template by implementing the ChatTemplate class.

Source code: veomni/data/chat_template.py

Custom Template Implementation

from collections.abc import Sequence
from veomni.data.chat_template import ChatTemplate

class CustomTemplate(ChatTemplate):
    def encode_messages(self, messages: Sequence[dict[str, str]], max_seq_len: int = 8192) -> dict[str, list[int]]:
        # Implement encoding logic
        pass

    def get_jinja_template(self) -> str:
        return ""  # Jinja template string

DataLoader

VeOmni offers a flexible and powerful dataloader implementation that supports:

Both padding and remove padding (packing) strategy
Dynamic batching strategy

Source code: veomni/data/data_loader.py

Basic Usage

from veomni.data import build_dataloader, build_mapping_dataset

transform = YOUR_TRANSFORM_FUNCTION

train_dataset = build_mapping_dataset(
    data_path=args.data.train_path,
    transform=transform,
)

args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))

train_dataloader = build_dataloader(
    dataset=train_dataset,
    micro_batch_size=args.train.micro_batch_size,
    global_batch_size=args.train.global_batch_size,
    dataloader_batch_size=args.train.dataloader_batch_size,
    seed=args.train.seed,
    max_seq_len=args.data.max_seq_len,
    collate_fn=None,
    train_steps=args.train.train_steps,
    rmpad=args.train.rmpad,
    rmpad_with_pos_ids=args.train.rmpad_with_pos_ids,
    bsz_warmup_ratio=args.train.bsz_warmup_ratio,
    bsz_warmup_init_mbtoken=args.train.bsz_warmup_init_mbtoken,
    dyn_bsz_margin=args.train.dyn_bsz_margin,
    dyn_bsz_buffer_size=args.train.dyn_bsz_buffer_size,
    num_workers=args.data.num_workers,
    drop_last=args.data.drop_last,
    pin_memory=args.data.pin_memory,
    prefetch_factor=args.data.prefetch_factor,
)

Collate Function

VeOmni supports three types of collate functions for text tasks by default:

Text Tasks:

DataCollatorWithPadding (enabled when rmpad is False and rmpad_with_pos_ids is False)
DataCollatorWithPacking (enabled when rmpad is True and rmpad_with_pos_ids is False)
DataCollatorWithPositionIDs (enabled when rmpad is False and rmpad_with_pos_ids is True)

Omni Model Tasks:

OmniDataCollatorWithPacking (for when rmpad_with_pos_ids is True)
OmniDataCollatorWithPadding (for when rmpad is False and rmpad_with_pos_ids is False)

Source code: veomni/data/data_collator.py

Omni model details: veomni/data/multimodal/data_collator.py and usage in train_omni_model.py”

Model and Optimizer

Model Initialization

build_foundation_model implements model initialization with config and weights path:

Meta device initialization
Initialize model from model config or weights path

Source code: veomni/models/auto.py

from veomni.models import build_foundation_model

model = build_foundation_model(
    config_path=args.model.config_path,  # model config path, can be None if weights_path is not None
    weights_path=args.model.model_path,  # model weights path, can be None if config_path is not None
    init_device=args.train.init_device,  # model init device
)

# You can replace with the following code if you want to use AutoModelForCausalLM from transformers
# model = AutoModelForCausalLM.from_pretrained(args.model.model_path)

Parallelize Your Model

Source code: veomni/distributed/torch_parallelize.py

from veomni.distributed.torch_parallelize import build_parallelize_model

model = build_foundation_model(...)

model = build_parallelize_model(
    model,
    enable_full_shard=args.train.enable_full_shard,
    enable_mixed_precision=args.train.enable_mixed_precision,
    enable_gradient_checkpointing=args.train.enable_gradient_checkpointing,
    init_device=args.train.init_device,
    enable_fsdp_offload=args.train.enable_fsdp_offload,
    basic_modules=model._no_split_modules + args.model.basic_modules,
)

Optimizer and LR Scheduler

Source code: veomni/optim

from veomni.optim import build_lr_scheduler, build_optimizer

optimizer = build_optimizer(
    model,
    lr=args.train.lr,
    weight_decay=args.train.weight_decay,
    # ... other parameters
)

lr_scheduler = build_lr_scheduler(
    optimizer,
    train_steps=args.train.train_steps * args.train.num_train_epochs,
    # ... other parameters
)

Train Loop

After the parallel_state, model, optimizer, and dataloader are initialized, you can start the training loop.

Basic Training Loop

for epoch in range(args.train.num_train_epochs):
    data_iterator = iter(train_dataloader)
    for _ in range(args.train.train_steps):
        micro_batches = next(data_iterator)
        for micro_batch in micro_batches:
            loss = model(**micro_batch).loss / len(micro_batches)
            loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

Custom Loss Function

import torch

loss_fct = torch.nn.CrossEntropyLoss()

def loss_func(logits, labels):
    return loss_fct(logits, labels)

# In train loop:
output = model(**micro_batch)
logits = output.logits
loss = loss_func(logits, labels) / len(micro_batches)

Prerequisites

The latest version of veomni and its dependencies installed following the installation guide
A compatible GPU with sufficient memory (e.g., NVIDIA A100 with 40GB or higher)