Quickstart: SFT upon dFactory

Last updated: 2025-11-04

This quickstart guide provides comprehensive VeOmni best practices for training, including installation, configuration, and advanced usage patterns.

VeOmni Best Practices

Usage

Run Example Script

Verify training startup (need to download the dataset first):

sh train.sh tasks/train_llada2_bd.py configs/sft/llada2_mini_bd_sft.yaml

Create Custom Task Directory

train_torch.py can be used for most pre-training and post-training tasks. You can just modify the train config to complete your task. However, if you want to create a new task, you can copy the train_torch.py file from the tasks directory and modify it, like tasks/omni/train_qwen2_vl.py.

mkdir tasks/your_task
cp tasks/train_torch.py tasks/your_task/train.py

Launch Custom Training

You can overwrite the default arguments in train yaml by passing them to the script.

bash train.sh tasks/your_task/train.py \
    $CONFIG.yaml \
    --model.model_path your_path_to_model \
    --data.train_path your_path_to_dataset \
    --train.output_dir your_path_to_save_checkpoints \
    --train.wandb_project your_project_name \
    --train.wandb_name your_experiment_name

Arguments

Default Parameter Access

VeOmni offers a unified argument management system that can be easily extended to support custom arguments. For default arguments explanation, refer to Config arguments Explanation.

Source code: veomni/utils/arguments.py.

from dataclasses import dataclass, field
from veomni.utils.arguments import DataArguments, ModelArguments, TrainingArguments, parse_args

@dataclass
class Arguments:
    model: "ModelArguments" = field(default_factory=ModelArguments)
    data: "DataArguments" = field(default_factory=DataArguments)
    train: "TrainingArguments" = field(default_factory=TrainingArguments)

if __name__ == "__main__":
    args = parse_args(Arguments)
    print(args.train.lr)  # Access default arguments

Custom Parameter Extension

You can extend the default arguments by creating a new class that inherits from the existing class.

@dataclass
class CustomTrainingArguments(TrainingArguments):
    enable_xxx: bool = field(
        default=False,
        metadata={"help": "Enable me if necessary."},
    )

@dataclass
class Arguments:
    model: "ModelArguments" = field(default_factory=ModelArguments)
    data: "DataArguments" = field(default_factory=DataArguments)
    train: "CustomTrainingArguments" = field(default_factory=CustomTrainingArguments)

Parallel State

VeOmni uses torch device mesh to manage all parallel states, which is useful for multi-dimensional parallelism (i.e., 3-D parallel) where parallelism composability is required. You can create the parallel state by calling the init_parallel_state function and get the parallel state by calling the get_parallel_state function.

For more details about torch device mesh, refer to Getting Started with DeviceMesh.

Source code: veomni/distributed/parallel_state.py.

Note

The parallel state system provides a unified interface for managing different types of parallelism including data parallel, tensor parallel, expert parallel, and pipeline parallel.

from veomni.distributed.parallel_state import get_parallel_state, init_parallel_state

init_parallel_state(
    dp_size=args.train.data_parallel_size,  # data parallel size
    dp_replicate_size=args.train.data_parallel_replicate_size,  # data parallel replicate size
    dp_shard_size=args.train.data_parallel_shard_size,  # data parallel shard degree
    tp_size=args.train.tensor_parallel_size,  # tensor parallel size
    ep_size=args.train.expert_parallel_size,  # expert parallel size
    pp_size=args.train.pipeline_parallel_size,  # pipeline parallel size, not supported now
    cp_size=args.train.context_parallel_size,  # context parallel size, not supported now
    ulysses_size=args.train.ulysses_parallel_size,  # ulysses parallel size
    dp_mode=args.train.data_parallel_mode,  # data parallel mode, can be "ddp", "fsdp1", "fsdp2"
)

parallel_state = get_parallel_state()

# Access dp state
dp_mesh = parallel_state.dp_mesh
dp_group = parallel_state.dp_group

# Access sp state
sp_group = parallel_state.sp_group
sp_rank = parallel_state.sp_rank

# Access tp state
tp_group = parallel_state.tp_group
tp_mesh = parallel_state.tp_mesh

Dataset

VeOmni supports two types of datasets by default:

Source code: veomni/data/dataset.py

Dataset Types

  1. IterativeDataset (recommended for large datasets)

  2. MappingDataset (default for small datasets)

from veomni.data import (
    build_iterative_dataset,
    build_mapping_dataset,
)

if args.data.datasets_type == "iterable":
    train_dataset = build_iterative_dataset(args.data.train_path, transform=transform, seed=args.train.seed)
    args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size)
elif args.data.datasets_type == "mapping":
    train_dataset = build_mapping_dataset(args.data.train_path, transform=transform)
    args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))

Important

Training Steps Calculation

  • Iterable datasets: Add data.train_size (tokens to consume) to config. Train steps ≈ train_size / (global_batch_size * max_seq_len)

  • Mapping datasets: Pass len(train_dataset) to compute correct train steps

Custom Datasets

VeOmni is a flexible framework that supports custom datasets. You can implement your own dataset function and use it with VeOmni.

def build_custom_dataset(data_path, transform) -> Dataset:
    # Implement your custom dataset logic
    pass

elif args.data.datasets_type == "custom":
    logger.info_rank0("Start building custom dataset")
    train_dataset = build_custom_dataset(args.data.train_path, transform=transform)
    # For iterable datasets, remove len(train_dataset)
    args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))

Data Transform (Preprocess)

VeOmni supports two types of transforms by default:

Source code: veomni/data/data_transform.py

Transform Types
  1. process_pretrain_example (recommended for pretrain task)

  2. process_sft_example (recommended for sft task)

Pretrain Example
from functools import partial
from veomni.data.data_transform import process_pretrain_example
from veomni.models import build_tokenizer

tokenizer = build_tokenizer(args.model.tokenizer_path)
# To use AutoTokenizer, replace the line above with the following:
# tokenizer = AutoTokenizer.from_pretrained(args.model.tokenizer_path)

transform = partial(
    process_pretrain_example,
    tokenizer=tokenizer,
    max_seq_len=args.data.max_seq_len,
)
SFT Example
from functools import partial
from veomni.data.chat_template import build_chat_template
from veomni.data.data_transform import process_sft_example

chat_template = build_chat_template(args.data.chat_template, tokenizer)
transform = partial(
    process_sft_example,
    chat_template=chat_template,
    max_seq_len=args.data.max_seq_len,
)

Chat Template

VeOmni supports several chat templates by default and you can add your custom chat template by implementing the ChatTemplate class.

Source code: veomni/data/chat_template.py

Custom Template Implementation
from collections.abc import Sequence
from veomni.data.chat_template import ChatTemplate

class CustomTemplate(ChatTemplate):
    def encode_messages(self, messages: Sequence[dict[str, str]], max_seq_len: int = 8192) -> dict[str, list[int]]:
        # Implement encoding logic
        pass

    def get_jinja_template(self) -> str:
        return ""  # Jinja template string

DataLoader

VeOmni offers a flexible and powerful dataloader implementation that supports:

  • Both padding and remove padding (packing) strategy

  • Dynamic batching strategy

Source code: veomni/data/data_loader.py

Basic Usage

from veomni.data import build_dataloader, build_mapping_dataset

transform = YOUR_TRANSFORM_FUNCTION

train_dataset = build_mapping_dataset(
    data_path=args.data.train_path,
    transform=transform,
)

args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))

train_dataloader = build_dataloader(
    dataset=train_dataset,
    micro_batch_size=args.train.micro_batch_size,
    global_batch_size=args.train.global_batch_size,
    dataloader_batch_size=args.train.dataloader_batch_size,
    seed=args.train.seed,
    max_seq_len=args.data.max_seq_len,
    collate_fn=None,
    train_steps=args.train.train_steps,
    rmpad=args.train.rmpad,
    rmpad_with_pos_ids=args.train.rmpad_with_pos_ids,
    bsz_warmup_ratio=args.train.bsz_warmup_ratio,
    bsz_warmup_init_mbtoken=args.train.bsz_warmup_init_mbtoken,
    dyn_bsz_margin=args.train.dyn_bsz_margin,
    dyn_bsz_buffer_size=args.train.dyn_bsz_buffer_size,
    num_workers=args.data.num_workers,
    drop_last=args.data.drop_last,
    pin_memory=args.data.pin_memory,
    prefetch_factor=args.data.prefetch_factor,
)

Collate Function

VeOmni supports three types of collate functions for text tasks by default:

Text Tasks:

  • DataCollatorWithPadding (enabled when rmpad is False and rmpad_with_pos_ids is False)

  • DataCollatorWithPacking (enabled when rmpad is True and rmpad_with_pos_ids is False)

  • DataCollatorWithPositionIDs (enabled when rmpad is False and rmpad_with_pos_ids is True)

Omni Model Tasks:

  • OmniDataCollatorWithPacking (for when rmpad_with_pos_ids is True)

  • OmniDataCollatorWithPadding (for when rmpad is False and rmpad_with_pos_ids is False)

Source code: veomni/data/data_collator.py

Omni model details: veomni/data/multimodal/data_collator.py and usage in train_omni_model.py

Model and Optimizer

Model Initialization

build_foundation_model implements model initialization with config and weights path:

  • Meta device initialization

  • Initialize model from model config or weights path

Source code: veomni/models/auto.py

from veomni.models import build_foundation_model

model = build_foundation_model(
    config_path=args.model.config_path,  # model config path, can be None if weights_path is not None
    weights_path=args.model.model_path,  # model weights path, can be None if config_path is not None
    init_device=args.train.init_device,  # model init device
)

# You can replace with the following code if you want to use AutoModelForCausalLM from transformers
# model = AutoModelForCausalLM.from_pretrained(args.model.model_path)

Parallelize Your Model

Source code: veomni/distributed/torch_parallelize.py

from veomni.distributed.torch_parallelize import build_parallelize_model

model = build_foundation_model(...)

model = build_parallelize_model(
    model,
    enable_full_shard=args.train.enable_full_shard,
    enable_mixed_precision=args.train.enable_mixed_precision,
    enable_gradient_checkpointing=args.train.enable_gradient_checkpointing,
    init_device=args.train.init_device,
    enable_fsdp_offload=args.train.enable_fsdp_offload,
    basic_modules=model._no_split_modules + args.model.basic_modules,
)

Optimizer and LR Scheduler

Source code: veomni/optim

from veomni.optim import build_lr_scheduler, build_optimizer

optimizer = build_optimizer(
    model,
    lr=args.train.lr,
    weight_decay=args.train.weight_decay,
    # ... other parameters
)

lr_scheduler = build_lr_scheduler(
    optimizer,
    train_steps=args.train.train_steps * args.train.num_train_epochs,
    # ... other parameters
)

Train Loop

After the parallel_state, model, optimizer, and dataloader are initialized, you can start the training loop.

Basic Training Loop

for epoch in range(args.train.num_train_epochs):
    data_iterator = iter(train_dataloader)
    for _ in range(args.train.train_steps):
        micro_batches = next(data_iterator)
        for micro_batch in micro_batches:
            loss = model(**micro_batch).loss / len(micro_batches)
            loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()

Custom Loss Function

import torch

loss_fct = torch.nn.CrossEntropyLoss()

def loss_func(logits, labels):
    return loss_fct(logits, labels)

# In train loop:
output = model(**micro_batch)
logits = output.logits
loss = loss_func(logits, labels) / len(micro_batches)

Prerequisites

  • The latest version of veomni and its dependencies installed following the installation guide

  • A compatible GPU with sufficient memory (e.g., NVIDIA A100 with 40GB or higher)

Dataset Introduction