Configuration Design Guide

YAML Configuration Structure

Configuration files use a hierarchical YAML structure with three main sections:

model:
  # Model-specific settings
data:
  # Data loading and preprocessing
train:
  # Training hyperparameters and setup

Model Configuration

Model Section

Required fields:

model:
  config_path: "./configs/model_configs/[model_name]"
  model_path: "./path/to/model"
  tokenizer_path: "./path/to/tokenizer"
  attn_implementation: "sdpa"        # sdpa|eager|flex_attention
  moe_implementation: "fused"        # fused|standard

Warning

Flash Attention Limitation: The flash attention backend can only be used with full_attention or causal_attention modes; it cannot adapt to the custom attention types used in LLaDA2.0 models. Do not use FlashAttention (flash_attn2/flash_attn) for Block Diffusion Mode models. See Block Diffusion for a detailed explanation of block diffusion training.

Data Configuration

Data Section

Template for conversation data:

data:
  train_path: "./datasets/train.jsonl"
  data_type: "conversation"          # conversation|plain|instruction
  datasets_type: "mapping"           # mapping|streaming
  dataloader_type: "native"          # native|custom
  max_seq_len: 2048
  text_keys: "messages"              # field name in JSON
  noise_range_low: 0.3               # diffusion noise lower bound
  noise_range_high: 0.8              # diffusion noise upper bound
  num_workers: 16

The data section supports:

• Multiple data formats (JSONL, Parquet)
• Configurable noise ranges for diffusion training
• Flexible text field mapping
• Worker count based on available CPU cores
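
For data_type: "conversation" with text_keys: "messages", each line of train.jsonl is a single JSON object whose "messages" field holds the dialogue. The exact message schema depends on your chat template; a minimal sketch, assuming a common role/content layout:

import json

# Hypothetical example record: only the top-level "messages" key comes from
# text_keys above; the role/content schema is an assumption.
record = {
    "messages": [
        {"role": "user", "content": "Summarize block diffusion in one sentence."},
        {"role": "assistant", "content": "Block diffusion denoises blocks of tokens "
                                         "rather than the whole sequence at once."},
    ]
}

with open("./datasets/train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")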

Training Configuration

Training Section

Distributed training setup:

train:
  output_dir: "./outputs/experiment_name"

  # Parallel configuration
  data_parallel_mode: "fsdp2"        # fsdp2
  tensor_parallel_size: 1            # model parallel
  ulysses_parallel_size: 1           # sequence parallel
  expert_parallel_size: 1            # MoE parallel

  # Batch configuration
  global_batch_size: 16              # total batch across all GPUs
  micro_batch_size: 1                # batch per GPU

  # Training schedule
  num_train_epochs: 1
  save_epochs: 1                     # checkpoint frequency
  log_steps: 1                       # logging frequency
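
global_batch_size is the effective batch per optimizer step across all GPUs, while micro_batch_size is what each GPU processes per forward pass; gradient accumulation covers the gap. A minimal sketch of that relationship, assuming the framework derives accumulation steps this way (the helper name is hypothetical):

# Hypothetical helper for illustration, not the framework's API.
def grad_accum_steps(global_batch_size: int, micro_batch_size: int, dp_world_size: int) -> int:
    per_step = micro_batch_size * dp_world_size
    assert global_batch_size % per_step == 0, "global batch must divide evenly"
    return global_batch_size // per_step

# 16 total / (1 per GPU * 8 data-parallel ranks) = 2 accumulation steps
print(grad_accum_steps(16, 1, 8))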

Optimization parameters:

train:
  optimizer: "adamw"
  beta1: 0.9
  beta2: 0.999
  lr: 1.0e-5                        # learning rate
  lr_warmup_ratio: 0.03             # warmup steps ratio
  lr_decay_style: "cosine"          # cosine|linear|constant
  weight_decay: 0.1
  max_grad_norm: 1.0
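
lr_warmup_ratio and lr_decay_style together define the schedule: warmup covers the first warmup-ratio fraction of total steps, then the chosen decay runs to the end. A minimal sketch of the cosine variant, assuming linear warmup from zero and decay toward zero; the framework's scheduler may use different endpoints:

import math

# Hypothetical re-implementation for illustration, not the framework's scheduler.
def lr_at_step(step: int, total_steps: int, base_lr: float = 1.0e-5,
               warmup_ratio: float = 0.03) -> float:
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps                       # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

print(lr_at_step(150, 10_000))    # still inside warmup (300 steps)
print(lr_at_step(5_000, 10_000))  # roughly mid-decay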

Memory optimization:

train:
  enable_mixed_precision: true
  enable_gradient_checkpointing: true
  enable_full_shard: true           # FSDP parameter sharding
  enable_fsdp_offload: true         # CPU offloading
  empty_cache_steps: 500            # GPU memory cleanup
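
empty_cache_steps periodically releases cached GPU memory back to the driver. A minimal sketch of what this looks like inside a training loop, assuming it maps to torch.cuda.empty_cache(); the framework's actual hook may differ:

import torch

empty_cache_steps = 500   # from the config above
total_steps = 2_000       # placeholder value

for step in range(1, total_steps + 1):
    # ... forward / backward / optimizer step would go here ...
    if step % empty_cache_steps == 0 and torch.cuda.is_available():
        torch.cuda.empty_cache()  # frees cached blocks; reduces fragmentation at the cost of a sync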

Configuration Patterns

Model Scaling

Small model template:

train:
  global_batch_size: 8               # Reduce for smaller models
  micro_batch_size: 1

Large model template:

train:
  global_batch_size: 64              # Increase for larger models
  micro_batch_size: 1
  tensor_parallel_size: 2            # Enable model parallelism
  expert_parallel_size: 2            # Distribute experts

Dataset Adaptation

For large datasets:

data:
  datasets_type: "streaming"        # Memory-efficient loading
  num_workers: 32                   # Increase workers

For small datasets:

data:
  datasets_type: "mapping"          # Full dataset in memory
  num_workers: 8                    # Reduce overhead
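
The practical difference: mapping materializes the whole dataset up front (fast random access, memory proportional to dataset size), while streaming yields records lazily. A minimal sketch of the distinction using Hugging Face datasets, assuming that is the backing loader; the framework may wrap this differently:

from datasets import load_dataset

# mapping: the full dataset is loaded and indexable
mapped = load_dataset("json", data_files="./datasets/train.jsonl", split="train")
print(mapped[0])             # random access works

# streaming: records are read lazily, no full copy kept in memory
streamed = load_dataset("json", data_files="./datasets/train.jsonl",
                        split="train", streaming=True)
print(next(iter(streamed)))  # sequential iteration only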

Hardware Adaptation

Single GPU Setup

train:
  data_parallel_mode: "fsdp2"
  tensor_parallel_size: 1
  expert_parallel_size: 1
  global_batch_size: 4            # Fit single GPU
  micro_batch_size: 1
  enable_fsdp_offload: false      # Disable offloading

Multi-GPU Setup

train:
  data_parallel_mode: "fsdp2"
  tensor_parallel_size: 1
  expert_parallel_size: 2         # Distribute experts
  global_batch_size: 32           # Scale with GPU count
  micro_batch_size: 1
  enable_fsdp_offload: false      # Faster training

Memory-Constrained Setup

train:
  enable_gradient_checkpointing: true
  enable_full_shard: true
  enable_fsdp_offload: true       # Enable CPU offloading
  enable_activation_offload: true # Reduce GPU memory
  micro_batch_size: 1             # Minimal per-GPU batch

Best Practices

Path Management

• Use relative paths for configs
• Store absolute paths in environment variables
• Create separate output directories per experiment

Parameter Tuning

• Start with conservative batch sizes
• Increase learning rate for larger batches
• Adjust warmup ratio based on dataset size

Monitoring

• Enable W&B for experiment tracking
• Set appropriate logging frequency
• Monitor gradient norms and loss curves

Reproducibility

• Fix random seeds in training scripts (see the sketch below)
• Document configuration changes
• Version control configuration files
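
A minimal sketch of seed fixing, assuming a PyTorch-based training script; the framework may expose its own seed option:

import random
import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Seed every RNG the training stack typically touches.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)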