Quickstart: SFT upon dFactory
Last updated: 2025-11-04
This quickstart guide provides comprehensive VeOmni best practices for training, including installation, configuration, and advanced usage patterns.
VeOmni Best Practices
Usage
Run Example Script
Verify training startup (need to download the dataset first):
sh train.sh tasks/train_llada2_bd.py configs/sft/llada2_mini_bd_sft.yaml
Create Custom Task Directory
train_torch.py can be used for most pre-training and post-training tasks. You can just modify the train config to complete your task. However, if you want to create a new task, you can copy the train_torch.py file from the tasks directory and modify it, like tasks/omni/train_qwen2_vl.py.
mkdir tasks/your_task
cp tasks/train_torch.py tasks/your_task/train.py
Launch Custom Training
You can overwrite the default arguments in train yaml by passing them to the script.
bash train.sh tasks/your_task/train.py \
$CONFIG.yaml \
--model.model_path your_path_to_model \
--data.train_path your_path_to_dataset \
--train.output_dir your_path_to_save_checkpoints \
--train.wandb_project your_project_name \
--train.wandb_name your_experiment_name
Arguments
Default Parameter Access
VeOmni offers a unified argument management system that can be easily extended to support custom arguments. For default arguments explanation, refer to Config arguments Explanation.
Source code: veomni/utils/arguments.py.
from dataclasses import dataclass, field
from veomni.utils.arguments import DataArguments, ModelArguments, TrainingArguments, parse_args
@dataclass
class Arguments:
model: "ModelArguments" = field(default_factory=ModelArguments)
data: "DataArguments" = field(default_factory=DataArguments)
train: "TrainingArguments" = field(default_factory=TrainingArguments)
if __name__ == "__main__":
args = parse_args(Arguments)
print(args.train.lr) # Access default arguments
Custom Parameter Extension
You can extend the default arguments by creating a new class that inherits from the existing class.
@dataclass
class CustomTrainingArguments(TrainingArguments):
enable_xxx: bool = field(
default=False,
metadata={"help": "Enable me if necessary."},
)
@dataclass
class Arguments:
model: "ModelArguments" = field(default_factory=ModelArguments)
data: "DataArguments" = field(default_factory=DataArguments)
train: "CustomTrainingArguments" = field(default_factory=CustomTrainingArguments)
Parallel State
VeOmni uses torch device mesh to manage all parallel states, which is useful for multi-dimensional parallelism (i.e., 3-D parallel) where parallelism composability is required. You can create the parallel state by calling the init_parallel_state function and get the parallel state by calling the get_parallel_state function.
For more details about torch device mesh, refer to Getting Started with DeviceMesh.
Source code: veomni/distributed/parallel_state.py.
Note
The parallel state system provides a unified interface for managing different types of parallelism including data parallel, tensor parallel, expert parallel, and pipeline parallel.
from veomni.distributed.parallel_state import get_parallel_state, init_parallel_state
init_parallel_state(
dp_size=args.train.data_parallel_size, # data parallel size
dp_replicate_size=args.train.data_parallel_replicate_size, # data parallel replicate size
dp_shard_size=args.train.data_parallel_shard_size, # data parallel shard degree
tp_size=args.train.tensor_parallel_size, # tensor parallel size
ep_size=args.train.expert_parallel_size, # expert parallel size
pp_size=args.train.pipeline_parallel_size, # pipeline parallel size, not supported now
cp_size=args.train.context_parallel_size, # context parallel size, not supported now
ulysses_size=args.train.ulysses_parallel_size, # ulysses parallel size
dp_mode=args.train.data_parallel_mode, # data parallel mode, can be "ddp", "fsdp1", "fsdp2"
)
parallel_state = get_parallel_state()
# Access dp state
dp_mesh = parallel_state.dp_mesh
dp_group = parallel_state.dp_group
# Access sp state
sp_group = parallel_state.sp_group
sp_rank = parallel_state.sp_rank
# Access tp state
tp_group = parallel_state.tp_group
tp_mesh = parallel_state.tp_mesh
Dataset
VeOmni supports two types of datasets by default:
Source code: veomni/data/dataset.py
Dataset Types
IterativeDataset (recommended for large datasets)
MappingDataset (default for small datasets)
from veomni.data import (
build_iterative_dataset,
build_mapping_dataset,
)
if args.data.datasets_type == "iterable":
train_dataset = build_iterative_dataset(args.data.train_path, transform=transform, seed=args.train.seed)
args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size)
elif args.data.datasets_type == "mapping":
train_dataset = build_mapping_dataset(args.data.train_path, transform=transform)
args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))
Important
Training Steps Calculation
Iterable datasets: Add
data.train_size(tokens to consume) to config. Train steps ≈train_size / (global_batch_size * max_seq_len)Mapping datasets: Pass
len(train_dataset)to compute correct train steps
Custom Datasets
VeOmni is a flexible framework that supports custom datasets. You can implement your own dataset function and use it with VeOmni.
def build_custom_dataset(data_path, transform) -> Dataset:
# Implement your custom dataset logic
pass
elif args.data.datasets_type == "custom":
logger.info_rank0("Start building custom dataset")
train_dataset = build_custom_dataset(args.data.train_path, transform=transform)
# For iterable datasets, remove len(train_dataset)
args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))
Data Transform (Preprocess)
VeOmni supports two types of transforms by default:
Source code: veomni/data/data_transform.py
Transform Types
process_pretrain_example (recommended for pretrain task)
process_sft_example (recommended for sft task)
Pretrain Example
from functools import partial
from veomni.data.data_transform import process_pretrain_example
from veomni.models import build_tokenizer
tokenizer = build_tokenizer(args.model.tokenizer_path)
# To use AutoTokenizer, replace the line above with the following:
# tokenizer = AutoTokenizer.from_pretrained(args.model.tokenizer_path)
transform = partial(
process_pretrain_example,
tokenizer=tokenizer,
max_seq_len=args.data.max_seq_len,
)
SFT Example
from functools import partial
from veomni.data.chat_template import build_chat_template
from veomni.data.data_transform import process_sft_example
chat_template = build_chat_template(args.data.chat_template, tokenizer)
transform = partial(
process_sft_example,
chat_template=chat_template,
max_seq_len=args.data.max_seq_len,
)
Chat Template
VeOmni supports several chat templates by default and you can add your custom chat template by implementing the ChatTemplate class.
Source code: veomni/data/chat_template.py
Custom Template Implementation
from collections.abc import Sequence
from veomni.data.chat_template import ChatTemplate
class CustomTemplate(ChatTemplate):
def encode_messages(self, messages: Sequence[dict[str, str]], max_seq_len: int = 8192) -> dict[str, list[int]]:
# Implement encoding logic
pass
def get_jinja_template(self) -> str:
return "" # Jinja template string
DataLoader
VeOmni offers a flexible and powerful dataloader implementation that supports:
Both padding and remove padding (packing) strategy
Dynamic batching strategy
Source code: veomni/data/data_loader.py
Basic Usage
from veomni.data import build_dataloader, build_mapping_dataset
transform = YOUR_TRANSFORM_FUNCTION
train_dataset = build_mapping_dataset(
data_path=args.data.train_path,
transform=transform,
)
args.train.compute_train_steps(args.data.max_seq_len, args.data.train_size, len(train_dataset))
train_dataloader = build_dataloader(
dataset=train_dataset,
micro_batch_size=args.train.micro_batch_size,
global_batch_size=args.train.global_batch_size,
dataloader_batch_size=args.train.dataloader_batch_size,
seed=args.train.seed,
max_seq_len=args.data.max_seq_len,
collate_fn=None,
train_steps=args.train.train_steps,
rmpad=args.train.rmpad,
rmpad_with_pos_ids=args.train.rmpad_with_pos_ids,
bsz_warmup_ratio=args.train.bsz_warmup_ratio,
bsz_warmup_init_mbtoken=args.train.bsz_warmup_init_mbtoken,
dyn_bsz_margin=args.train.dyn_bsz_margin,
dyn_bsz_buffer_size=args.train.dyn_bsz_buffer_size,
num_workers=args.data.num_workers,
drop_last=args.data.drop_last,
pin_memory=args.data.pin_memory,
prefetch_factor=args.data.prefetch_factor,
)
Collate Function
VeOmni supports three types of collate functions for text tasks by default:
Text Tasks:
DataCollatorWithPadding(enabled whenrmpadis False andrmpad_with_pos_idsis False)DataCollatorWithPacking(enabled whenrmpadis True andrmpad_with_pos_idsis False)DataCollatorWithPositionIDs(enabled whenrmpadis False andrmpad_with_pos_idsis True)
Omni Model Tasks:
OmniDataCollatorWithPacking(for whenrmpad_with_pos_idsis True)OmniDataCollatorWithPadding(for whenrmpadis False andrmpad_with_pos_idsis False)
Source code: veomni/data/data_collator.py
Omni model details: veomni/data/multimodal/data_collator.py and usage in train_omni_model.py”
Model and Optimizer
Model Initialization
build_foundation_model implements model initialization with config and weights path:
Meta device initialization
Initialize model from model config or weights path
Source code: veomni/models/auto.py
from veomni.models import build_foundation_model
model = build_foundation_model(
config_path=args.model.config_path, # model config path, can be None if weights_path is not None
weights_path=args.model.model_path, # model weights path, can be None if config_path is not None
init_device=args.train.init_device, # model init device
)
# You can replace with the following code if you want to use AutoModelForCausalLM from transformers
# model = AutoModelForCausalLM.from_pretrained(args.model.model_path)
Parallelize Your Model
Source code: veomni/distributed/torch_parallelize.py
from veomni.distributed.torch_parallelize import build_parallelize_model
model = build_foundation_model(...)
model = build_parallelize_model(
model,
enable_full_shard=args.train.enable_full_shard,
enable_mixed_precision=args.train.enable_mixed_precision,
enable_gradient_checkpointing=args.train.enable_gradient_checkpointing,
init_device=args.train.init_device,
enable_fsdp_offload=args.train.enable_fsdp_offload,
basic_modules=model._no_split_modules + args.model.basic_modules,
)
Optimizer and LR Scheduler
Source code: veomni/optim
from veomni.optim import build_lr_scheduler, build_optimizer
optimizer = build_optimizer(
model,
lr=args.train.lr,
weight_decay=args.train.weight_decay,
# ... other parameters
)
lr_scheduler = build_lr_scheduler(
optimizer,
train_steps=args.train.train_steps * args.train.num_train_epochs,
# ... other parameters
)
Train Loop
After the parallel_state, model, optimizer, and dataloader are initialized, you can start the training loop.
Basic Training Loop
for epoch in range(args.train.num_train_epochs):
data_iterator = iter(train_dataloader)
for _ in range(args.train.train_steps):
micro_batches = next(data_iterator)
for micro_batch in micro_batches:
loss = model(**micro_batch).loss / len(micro_batches)
loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
Custom Loss Function
import torch
loss_fct = torch.nn.CrossEntropyLoss()
def loss_func(logits, labels):
return loss_fct(logits, labels)
# In train loop:
output = model(**micro_batch)
logits = output.logits
loss = loss_func(logits, labels) / len(micro_batches)
Prerequisites
The latest version of
veomniand its dependencies installed following the installation guideA compatible GPU with sufficient memory (e.g., NVIDIA A100 with 40GB or higher)