Configuration Design Guide
YAML Configuration Structure
Configuration files use hierarchical YAML structure with three main sections:
model:
  # Model-specific settings
data:
  # Data loading and preprocessing
train:
  # Training hyperparameters and setup
Model Configuration
Model Section
Required fields:
model:
  config_path: "./configs/model_configs/[model_name]"
  model_path: "./path/to/model"
  tokenizer_path: "./path/to/tokenizer"
  attn_implementation: "sdpa"  # sdpa|eager|flex_attention
  moe_implementation: "fused"  # fused|standard
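Since all three path fields must resolve before a run starts, a quick existence check can catch typos early. This is a hypothetical helper (not a framework API), keyed on the field names above.

```python
from pathlib import Path

# Path fields in the model section, per the required fields above.
PATH_KEYS = ("config_path", "model_path", "tokenizer_path")

def missing_paths(model_cfg: dict) -> list:
    """Return configured paths that do not exist on disk."""
    return [model_cfg[k] for k in PATH_KEYS
            if k in model_cfg and not Path(model_cfg[k]).exists()]
```

Running it on the model section of a loaded config surfaces every broken path at once instead of failing on the first one.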
Warning
Flash Attention Limitation: The FlashAttention backend supports only the full_attention and causal_attention modes; it cannot handle the custom attention patterns used in LLaDA2.0 models. Do not use FlashAttention (flash_attn2/flash_attn) with Block Diffusion Mode models. See Block Diffusion for a detailed explanation of block diffusion training.
Data Configuration
Data Section
Template for conversation data:
data:
  train_path: "./datasets/train.jsonl"
  data_type: "conversation"  # conversation|plain|instruction
  datasets_type: "mapping"   # mapping|streaming
  dataloader_type: "native"  # native|custom
  max_seq_len: 2048
  text_keys: "messages"      # field name in JSON
  noise_range_low: 0.3       # diffusion noise lower bound
  noise_range_high: 0.8      # diffusion noise upper bound
  num_workers: 16
- Supports multiple data formats (JSONL, Parquet)
- Configurable noise range for diffusion training
- Flexible text field mapping via text_keys
- Worker count scaled to available CPU cores
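The noise range settings above bound the corruption level applied per example. A minimal sketch, assuming the trainer draws each example's noise level uniformly from the configured range (the actual sampling distribution may differ):

```python
import random

def sample_noise_level(low: float = 0.3, high: float = 0.8) -> float:
    """Draw one per-example noise level from the configured range,
    mirroring noise_range_low / noise_range_high above."""
    return random.uniform(low, high)

random.seed(0)
print(all(0.3 <= sample_noise_level() <= 0.8 for _ in range(1000)))  # True
```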
Training Configuration
Training Section
Distributed training setup:
train:
  output_dir: "./outputs/experiment_name"

  # Parallel configuration
  data_parallel_mode: "fsdp2"  # fsdp2
  tensor_parallel_size: 1      # model parallel
  ulysses_parallel_size: 1     # sequence parallel
  expert_parallel_size: 1      # MoE parallel

  # Batch configuration
  global_batch_size: 16        # total batch across all GPUs
  micro_batch_size: 1          # batch per GPU

  # Training schedule
  num_train_epochs: 1
  save_epochs: 1               # checkpoint frequency
  log_steps: 1                 # logging frequency
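The batch settings above imply a gradient-accumulation count: the global batch must be divisible by the per-GPU micro batch times the data-parallel world size. A small sketch of that arithmetic (the function name is illustrative):

```python
def accumulation_steps(global_batch: int, micro_batch: int, dp_ranks: int) -> int:
    """Gradient-accumulation steps implied by the batch settings."""
    per_step = micro_batch * dp_ranks
    if global_batch % per_step:
        raise ValueError("global_batch_size must be divisible by "
                         "micro_batch_size * data-parallel world size")
    return global_batch // per_step

# With the values above (global 16, micro 1) on 8 data-parallel ranks:
print(accumulation_steps(16, 1, 8))  # 2
```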
Optimization parameters:
train:
  optimizer: "adamw"
  beta1: 0.9
  beta2: 0.999
  lr: 1.0e-5                # learning rate
  lr_warmup_ratio: 0.03     # warmup steps ratio
  lr_decay_style: "cosine"  # cosine|linear|constant
  weight_decay: 0.1
  max_grad_norm: 1.0
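A sketch of the learning-rate schedule these settings describe: linear warmup over the first lr_warmup_ratio fraction of steps, then decay per lr_decay_style. This is an assumption about the scheduler's exact shape (warmup from zero, cosine to zero), not the framework's verified implementation.

```python
import math

def lr_at_step(step, total_steps, base_lr=1.0e-5,
               warmup_ratio=0.03, decay_style="cosine"):
    """Learning rate at a given step: linear warmup, then decay."""
    warmup = max(1, int(total_steps * warmup_ratio))
    if step < warmup:
        return base_lr * step / warmup          # linear warmup from 0
    if decay_style == "constant":
        return base_lr
    t = (step - warmup) / max(1, total_steps - warmup)
    if decay_style == "linear":
        return base_lr * (1.0 - t)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))  # cosine

print(lr_at_step(0, 1000))    # 0.0
print(lr_at_step(30, 1000))   # 1e-05 (warmup complete, peak LR)
```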
Memory optimization:
train:
  enable_mixed_precision: true
  enable_gradient_checkpointing: true
  enable_full_shard: true    # FSDP parameter sharding
  enable_fsdp_offload: true  # CPU offloading
  empty_cache_steps: 500     # GPU memory cleanup
Configuration Patterns
Model Scaling
Small model template:
train:
  global_batch_size: 8  # Reduce for smaller models
  micro_batch_size: 1
Large model template:
train:
  global_batch_size: 64     # Increase for larger models
  micro_batch_size: 1
  tensor_parallel_size: 2   # Enable model parallelism
  expert_parallel_size: 2   # Distribute experts
Dataset Adaptation
For large datasets:
data:
  datasets_type: "streaming"  # Memory-efficient loading
  num_workers: 32             # Increase workers
For small datasets:
data:
  datasets_type: "mapping"  # Full dataset in memory
  num_workers: 8            # Reduce overhead
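The worker counts above can be derived from available CPU cores rather than hard-coded. A heuristic sketch (the caps of 32 and 8 follow the templates above; the "leave one core free" rule is an assumption, not a framework policy):

```python
import os

def pick_num_workers(streaming: bool) -> int:
    """Heuristic: scale DataLoader workers with CPU cores,
    keeping one core free; streaming runs allow a higher cap."""
    cores = os.cpu_count() or 1
    cap = 32 if streaming else 8
    return max(1, min(cores - 1, cap))

w = pick_num_workers(streaming=True)
print(1 <= w <= 32)  # True
```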
Hardware Adaptation
Single GPU Setup
train:
  data_parallel_mode: "fsdp2"
  tensor_parallel_size: 1
  expert_parallel_size: 1
  global_batch_size: 4        # Fit single GPU
  micro_batch_size: 1
  enable_fsdp_offload: false  # Disable offloading
Multi-GPU Setup
train:
  data_parallel_mode: "fsdp2"
  tensor_parallel_size: 1
  expert_parallel_size: 2     # Distribute experts
  global_batch_size: 32       # Scale with GPU count
  micro_batch_size: 1
  enable_fsdp_offload: false  # Faster training
Memory-Constrained Setup
train:
  enable_gradient_checkpointing: true
  enable_full_shard: true
  enable_fsdp_offload: true        # Enable CPU offloading
  enable_activation_offload: true  # Reduce GPU memory
  micro_batch_size: 1              # Minimal per-GPU batch
Best Practices
- Path Management
  - Use relative paths for configs
  - Store absolute paths in environment variables
  - Create separate output directories per experiment
- Parameter Tuning
  - Start with conservative batch sizes
  - Increase the learning rate for larger batches
  - Adjust the warmup ratio based on dataset size
- Monitoring
  - Enable W&B for experiment tracking
  - Set an appropriate logging frequency
  - Monitor gradient norms and loss curves
- Reproducibility
  - Fix random seeds in training scripts
  - Document configuration changes
  - Version-control configuration files