Dataset#
This guide provides detailed examples of how to create custom datasets in AReaL for model training.
Define Your Dataset#
Create a new file under realhf/impl/dataset/, for example, my_custom_dataset.py. Your Dataset must implement the torch.utils.data.Dataset interface and follow the framework's conventions.
from typing import Callable, Dict, List, Optional

import torch.utils.data

from realhf.api.core import data_api


class MyCustomDataset(torch.utils.data.Dataset):
    def __init__(
        self,
        util: data_api.DatasetUtility,
        max_length: Optional[int] = None,
        dataset_path: Optional[str] = None,
        dataset_builder: Optional[Callable[[], List[Dict]]] = None,
        # Your custom parameters
        custom_param: float = 1.0,
    ):
        """Custom dataset initialization.

        Args:
            util: Dataset utility class containing tokenizer, seed, distributed info, etc.
            max_length: Maximum sequence length
            dataset_path: Path to dataset file (optional)
            dataset_builder: Data construction function (optional, alternative to dataset_path)
            custom_param: Your custom parameter
        """
        self._util = util
        self.max_length = max_length

        # Load and split dataset
        data = data_api.load_shuffle_split_dataset(util, dataset_path, dataset_builder)

        # Your custom data processing logic (the methods below read self.data_samples)
        ...
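To make the dataset_path / dataset_builder contract concrete, here is a minimal standard-library sketch of a "load, shuffle, split" step. It is not AReaL's load_shuffle_split_dataset: the helper names load_rows and shuffle_and_shard, the seeded shuffle, and the even split across ranks are all assumptions made for illustration.

```python
import json
import random
from typing import Callable, Dict, List, Optional


def load_rows(
    dataset_path: Optional[str] = None,
    dataset_builder: Optional[Callable[[], List[Dict]]] = None,
) -> List[Dict]:
    """Hypothetical loader: read a JSONL file, or fall back to a builder callable."""
    if dataset_path is not None:
        with open(dataset_path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if dataset_builder is not None:
        return dataset_builder()
    raise ValueError("Provide either dataset_path or dataset_builder")


def shuffle_and_shard(rows: List[Dict], seed: int, rank: int, world_size: int) -> List[Dict]:
    """Hypothetical split: seeded shuffle, then an equal slice for each rank."""
    per_rank = len(rows) // world_size  # leftover rows are dropped in this sketch
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed gives the same order on every rank
    start = rank * per_rank
    return [rows[i] for i in indices[start:start + per_rank]]


# A builder is a zero-argument callable that returns a list of dicts.
rows = load_rows(dataset_builder=lambda: [{"qid": str(i)} for i in range(10)])
shard = shuffle_and_shard(rows, seed=1, rank=0, world_size=3)
print(len(shard))
```

Seeding the shuffle identically on every rank is what keeps the slices disjoint in this sketch; each rank then only materializes its own portion.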
Implement Core Methods#
Every dataset class must implement the following two core methods:
1. __len__ Method#
Returns the size of the dataset:
def __len__(self):
    return len(self.data_samples)
2. __getitem__ Method#
Returns the sample at the specified index. It must return a SequenceSample object:
def __getitem__(self, idx):
    # Get raw data
    sample = self.data_samples[idx]
    # Process data; the result is referred to as processed_data below
    ...
    # Return SequenceSample object
    return data_api.SequenceSample.from_default(
        ids=[sample["id"]],
        seqlens=[len(processed_data["input_ids"])],
        data=dict(
            packed_prompts=torch.tensor(processed_data["input_ids"], dtype=torch.long),
            # Other necessary data fields
        ),
    )
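The two methods above form the map-style dataset protocol. The following plain-Python stand-in shows that protocol without torch or AReaL installed. It is only a sketch: ToyPromptDataset is a hypothetical class, the "tokenizer" is a whitespace split that uses word lengths as fake token ids, and a dict stands in for the SequenceSample object.

```python
from typing import Dict, List


class ToyPromptDataset:
    """Hypothetical stand-in that mirrors the __len__ / __getitem__ contract."""

    def __init__(self, rows: List[Dict[str, str]]):
        self.data_samples = rows

    def __len__(self) -> int:
        return len(self.data_samples)

    def __getitem__(self, idx: int) -> Dict:
        sample = self.data_samples[idx]
        # Fake tokenization (assumption): one "token id" per whitespace-separated word.
        input_ids = [len(word) for word in sample["prompt"].split()]
        return {
            "ids": [sample["qid"]],
            "seqlens": [len(input_ids)],  # must match the length of the packed data
            "packed_prompts": input_ids,
        }


toy = ToyPromptDataset(
    [
        {"qid": "a", "prompt": "one two three"},
        {"qid": "b", "prompt": "hello"},
    ]
)
for i in range(len(toy)):
    print(toy[i])
```

The point to carry over is the invariant between seqlens and the packed data: the declared length of each sequence has to equal the number of tokens actually stored for it.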
Dataset Examples#
We provide several example datasets under realhf/impl/dataset/:
For SFT, refer to prompt_answer_dataset.py.
For reward model training, refer to rw_paired_dataset.py.
For RL training, refer to math_code_dataset.py.
Data Format Requirements#
JSONL File Format#
Your data file should be in JSONL format, with one JSON object per line. If you are using our PromptDataset implementation, each record should look like one of the following:
Math Data
{"qid": "sample_1", "prompt": "Solve this math problem: 2+2=", "solutions": ["\\boxed{4}"]}
Code Data
{"qid": "sample_2", "prompt": "Code problem", "input_output": "{\"inputs\": [\"5\\n2 3 5 10 12\\n\"], \"outputs\": [\"17\\n\"]}"}
qid: Unique identifier for the sample
prompt: Input prompt text
task: Task type, used to determine how the reward is calculated. ("math" and "code" are currently supported.)
Note: There is no format restriction for a customized dataset as long as it can be loaded by your custom code.
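For the JSONL layout shown above, a quick pre-flight check can catch malformed lines before training starts. The sketch below uses only the standard library. The helper parse_jsonl and the choice to require just qid and prompt are assumptions, not AReaL behavior. Note that in the code example the input_output value is itself a JSON-encoded string, so it needs a second json.loads to inspect; the sketch decodes it purely for inspection.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("qid", "prompt")  # assumption: the minimum this sketch insists on


def parse_jsonl(text: str) -> List[Dict]:
    """Parse JSONL text, validate required keys, and decode nested input_output."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing keys {missing}")
        if isinstance(record.get("input_output"), str):
            # The test cases are stored as a JSON string inside the JSON object.
            record["input_output"] = json.loads(record["input_output"])
        records.append(record)
    return records


SAMPLE = "\n".join(
    [
        r'{"qid": "sample_1", "prompt": "Solve this math problem: 2+2=", "solutions": ["\\boxed{4}"]}',
        r'{"qid": "sample_2", "prompt": "Code problem", "input_output": "{\"inputs\": [\"5\\n2 3 5 10 12\\n\"], \"outputs\": [\"17\\n\"]}"}',
    ]
)
records = parse_jsonl(SAMPLE)
print(records[1]["input_output"]["outputs"])
```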
Registration and Configuration#
Register Dataset#
Register your dataset at the end of your dataset file:
# in realhf/impl/dataset/my_custom_dataset.py
data_api.register_dataset("my-custom", MyCustomDataset)
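Conceptually, registration maps a string name to a dataset class so that a configuration can refer to the class by name. The sketch below illustrates that pattern in plain Python. It is not AReaL's implementation: the registry dict, the make_dataset helper, and the duplicate-name check are assumptions made for illustration, and the stand-in class omits the real constructor arguments.

```python
from typing import Callable, Dict

_DATASET_REGISTRY: Dict[str, Callable] = {}  # hypothetical name -> class table


def register_dataset(name: str, dataset_cls: Callable) -> None:
    """Record a dataset class under a unique name."""
    if name in _DATASET_REGISTRY:
        raise ValueError(f"Dataset {name!r} is already registered")
    _DATASET_REGISTRY[name] = dataset_cls


def make_dataset(name: str, **kwargs):
    """Look up a registered class by name and construct it with keyword arguments."""
    if name not in _DATASET_REGISTRY:
        raise KeyError(f"Unknown dataset {name!r}")
    return _DATASET_REGISTRY[name](**kwargs)


class MyCustomDataset:
    """Simplified stand-in for the dataset class defined earlier."""

    def __init__(self, custom_param: float = 1.0):
        self.custom_param = custom_param


register_dataset("my-custom", MyCustomDataset)
dataset = make_dataset("my-custom", custom_param=0.5)
print(type(dataset).__name__, dataset.custom_param)
```

With a pattern like this, the registration call only takes effect once the module containing it has been imported, which is why the call sits at module level in the dataset file.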
Modify Experiment Configuration#
Use your new dataset in the experiment configuration (refer to realhf/experiments/common/*_exp.py):
# in your experiment config file
@property
def datasets(self) -> List[DatasetAbstraction]:
    return [
        DatasetAbstraction(
            "my-custom",  # Your registered name
            args=dict(
                dataset_path=self.dataset_path,
                max_length=self.max_length,
                custom_param=self.custom_param,
                # Other initialization parameters
            ),
        )
    ]
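The keys in args above match the constructor parameters of the dataset class. Assuming they are forwarded to the constructor as keyword arguments (an assumption here, not something this guide states), a misspelled key would only fail once the dataset is built. The sketch below shows one way to check the keys up front with inspect.signature. The helper unknown_args is hypothetical, and the class is a simplified stand-in.

```python
import inspect
from typing import Any, Dict, List


def unknown_args(dataset_cls: type, args: Dict[str, Any]) -> List[str]:
    """Return the keys in args that the class constructor does not accept."""
    params = inspect.signature(dataset_cls).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return []  # a **kwargs constructor accepts any key
    return sorted(key for key in args if key not in params)


class MyCustomDataset:
    """Simplified stand-in with the same parameter names as the earlier example."""

    def __init__(self, util, max_length=None, dataset_path=None,
                 dataset_builder=None, custom_param=1.0):
        self.max_length = max_length
        self.custom_param = custom_param


good = dict(dataset_path="data.jsonl", max_length=512, custom_param=2.0)
typo = dict(dataset_path="data.jsonl", max_len=512)
print(unknown_args(MyCustomDataset, good))
print(unknown_args(MyCustomDataset, typo))
```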