Training Data Export for GRPO
This guide explains how to export episodic memories as training data for Group Relative Policy Optimization (GRPO) using the Daydreams AI core package.
What is GRPO Training?
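GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm designed to improve reasoning in large language models. It needs less memory than PPO because it estimates advantages by comparing a group of sampled completions against each other rather than training a separate value model. It is particularly effective for tasks requiring complex problem-solving, such as: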
- Mathematical reasoning
- Decision-making scenarios
- Step-by-step problem solving
- Game-based learning environments
Key Benefits of GRPO:
- Improves reasoning capabilities beyond standard fine-tuning
- Uses less memory than traditional PPO, since no separate value model is trained
- Particularly effective for complex problem-solving tasks
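The "group relative" part can be illustrated with a short sketch. As GRPO is commonly described, several completions are sampled for the same prompt, each one is scored, and every reward is normalized against the group's mean and standard deviation. The function below is an illustration only and is not part of Daydreams: the name `groupRelativeAdvantages`, the use of the population standard deviation, and the reward values are assumptions, and real trainers differ in such details.

```typescript
// Score each completion relative to the other completions sampled
// for the same prompt: (reward - group mean) / group standard deviation.
function groupRelativeAdvantages(rewards: number[]): number[] {
  if (rewards.length === 0) return [];
  const mean = rewards.reduce((sum, r) => sum + r, 0) / rewards.length;
  const variance =
    rewards.reduce((sum, r) => sum + (r - mean) ** 2, 0) / rewards.length;
  const std = Math.sqrt(variance);
  // If every completion earned the same reward there is no signal.
  if (std === 0) return rewards.map(() => 0);
  return rewards.map((r) => (r - mean) / std);
}

// Four completions for one prompt, rewarded 1 when correct and 0 otherwise.
const advantages = groupRelativeAdvantages([1, 0, 0, 1]);
```

Completions that beat the group average get a positive advantage and the rest a negative one, which is why varied, well-scored reasoning traces matter for the exported data.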
Workflow Overview
Your Daydreams agent can build reasoning traces for GRPO training by following this structured workflow:
- Define Prompt Sources - Use static datasets or interactive environments
- Generate Reasoning Traces - Create completions that include thought processes
- Store and Save Data - Export in JSONL format compatible with training tools
Enabling Automatic Export
You can configure Daydreams to automatically export training data after each episode:
Note: If you don't specify trainingDataPath, Daydreams will save the data to ./training-data.jsonl in your project root.
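A sketch of how such a configuration could be modeled. Only `trainingDataPath` and its default value come from this guide. The `exportTrainingData` flag name, the `ExportOptions` type, and the `resolveExportOptions` helper are assumptions made for illustration and should be checked against the Daydreams API reference.

```typescript
// Hypothetical shape of the export-related agent options.
// Only `trainingDataPath` and its default are named in this guide;
// `exportTrainingData` is an assumed flag name.
interface ExportOptions {
  exportTrainingData?: boolean;
  trainingDataPath?: string;
}

const DEFAULT_TRAINING_DATA_PATH = "./training-data.jsonl";

// Fill in the documented default path when none is given.
function resolveExportOptions(options: ExportOptions): Required<ExportOptions> {
  return {
    exportTrainingData: options.exportTrainingData ?? false,
    trainingDataPath: options.trainingDataPath ?? DEFAULT_TRAINING_DATA_PATH,
  };
}

// Export enabled, default path used.
const withDefaultPath = resolveExportOptions({ exportTrainingData: true });

// Export enabled, custom path used.
const withCustomPath = resolveExportOptions({
  exportTrainingData: true,
  trainingDataPath: "./grpo-training-data.jsonl",
});
```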
Manual Export
You can manually export all episodes as training data:
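A minimal sketch of what that export amounts to, assuming a simplified, hypothetical `Episode` shape with `observation`, `thoughts`, and `result` fields. The real episodic memory type and the export method on the agent may look different. The helper `exportEpisodesAsJsonl` is not a Daydreams function.

```typescript
// Hypothetical, simplified episode shape; the real Daydreams episodic
// memory type may carry different or additional fields.
interface Episode {
  observation: string;
  thoughts: string;
  result: string;
}

// Convert every episode into one JSON object per line (JSONL).
function exportEpisodesAsJsonl(episodes: Episode[]): string {
  return episodes
    .map((episode) =>
      JSON.stringify({
        prompt: episode.observation,
        completion: `${episode.thoughts}\n${episode.result}`,
      })
    )
    .join("\n");
}

const jsonl = exportEpisodesAsJsonl([
  {
    observation: "What is 12 * 4?",
    thoughts: "12 * 4 = 48.",
    result: "Answer: 48",
  },
  {
    observation: "Is 7 prime?",
    thoughts: "7 has no divisors besides 1 and 7.",
    result: "Answer: yes",
  },
]);
// `jsonl` can then be written to disk, for example with Node's fs.writeFileSync.
```

Because `JSON.stringify` escapes the newline inside each completion, every record still occupies exactly one line of the output.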
Understanding the Data Format for GRPO
Daydreams exports training data in JSONL (JSON Lines) format, with one JSON object per line. Each object has two fields:
- prompt: The observation or context provided to the agent
- completion: The agent's reasoning process and action results
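To make the two fields concrete, the sketch below parses one JSONL line and checks that both are present. The `TrainingPair` type, the `parseTrainingLine` helper, and the sample line are illustrative and are not taken from the Daydreams package.

```typescript
interface TrainingPair {
  prompt: string;
  completion: string;
}

// Parse one JSONL line and check that it has the two expected string fields.
function parseTrainingLine(line: string): TrainingPair {
  const value: unknown = JSON.parse(line);
  if (typeof value !== "object" || value === null) {
    throw new Error("Expected a JSON object");
  }
  const record = value as Record<string, unknown>;
  const prompt = record["prompt"];
  const completion = record["completion"];
  if (typeof prompt !== "string" || typeof completion !== "string") {
    throw new Error("Expected string fields `prompt` and `completion`");
  }
  return { prompt, completion };
}

// Illustrative line; the content is made up for this example.
const exampleLine =
  '{"prompt":"You are in a dark room. Exits: north, east.","completion":"North leads toward the light. Action: go north"}';

const pair = parseTrainingLine(exampleLine);
```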
For interactive environments, ensure completions include both reasoning and an explicit action statement:
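One way to guarantee that shape is to build every completion through a single helper. The sketch below is an assumption about how you might do this in your own code. `formatCompletion` is a hypothetical name, and the "Action: " prefix follows the pattern recommended later in this guide.

```typescript
// Join free-form reasoning with a final, explicit action line.
function formatCompletion(reasoning: string, action: string): string {
  return `${reasoning.trim()}\nAction: ${action.trim()}`;
}

const completion = formatCompletion(
  "The door is locked and the key is on the table, so I need the key first.",
  "pick up key"
);
```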
Creating Custom Training Pairs for GRPO
For advanced use cases, you can create custom training data pairs specifically designed for GRPO:
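The sketch below shows one possible approach: describe each scenario as structured data and render it into a prompt and completion with a fixed template. The `Scenario` type, the `buildTrainingPair` helper, and the "Task:" and "State:" labels are all hypothetical choices, not part of the Daydreams API.

```typescript
interface TrainingPair {
  prompt: string;
  completion: string;
}

// Hypothetical description of one hand-authored scenario.
interface Scenario {
  task: string;
  state: string;
  steps: string[]; // reasoning steps, in order
  action: string;
}

// Render a scenario into a prompt/completion pair with numbered reasoning
// steps followed by an explicit action line.
function buildTrainingPair(scenario: Scenario): TrainingPair {
  const numberedSteps = scenario.steps.map(
    (step, index) => `${index + 1}. ${step}`
  );
  return {
    prompt: `Task: ${scenario.task}\nState: ${scenario.state}`,
    completion: [...numberedSteps, `Action: ${scenario.action}`].join("\n"),
  };
}

const customPair = buildTrainingPair({
  task: "Reach the exit of the maze",
  state: "Position (2, 3); walls to the north and west",
  steps: ["North and west are blocked.", "The exit lies to the east."],
  action: "move east",
});
```

Keeping the template in one function means every pair shares the same layout, which makes the data easier to score and parse later.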
Optimizing Data for GRPO Training
To maximize the effectiveness of your GRPO training data:
- Include diverse scenarios - Ensure your agent encounters a variety of situations
- Capture step-by-step reasoning - The completion should show the agent's thought process
- Format actions consistently - Use patterns like "Action: [action]" for easy parsing
- Balance task difficulty - Include both simple and complex reasoning challenges
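Some of these points can be checked mechanically before training. The sketch below is a rough audit under stated assumptions: completions mark their action on a line starting with "Action:", and counting distinct prompts is used as a crude proxy for scenario diversity. `extractAction`, `auditDataset`, and `AuditReport` are hypothetical names.

```typescript
interface TrainingPair {
  prompt: string;
  completion: string;
}

interface AuditReport {
  total: number;
  missingAction: number;
  distinctPrompts: number;
}

// Return the action from the last line of the form "Action: ...", or null.
function extractAction(completion: string): string | null {
  const lines = completion.split("\n").reverse();
  for (const line of lines) {
    const match = /^Action:\s*(.+)$/.exec(line.trim());
    if (match) return match[1].trim();
  }
  return null;
}

// Count records without a parsable action and the number of distinct prompts.
function auditDataset(pairs: TrainingPair[]): AuditReport {
  return {
    total: pairs.length,
    missingAction: pairs.filter((p) => extractAction(p.completion) === null)
      .length,
    distinctPrompts: new Set(pairs.map((p) => p.prompt)).size,
  };
}

const report = auditDataset([
  { prompt: "Room A", completion: "The door is open.\nAction: go north" },
  { prompt: "Room A", completion: "I am not sure what to do." },
  { prompt: "Room B", completion: "The chest is locked.\nAction: use key" },
]);
```

A high `missingAction` count or a low `distinctPrompts` count suggests the dataset needs more consistent formatting or more varied scenarios.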
Customizing the Export Format
If you need a different format for your specific GRPO training framework:
- Create your own formatter function based on the Daydreams utilities
- Process the episodic memories to match your required format
- Save the data using your preferred file structure
Example use case: You might need to add metadata fields such as task difficulty or domain type to help organize your training data.
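A sketch of such a custom formatter, assuming the base records have `prompt` and `completion` fields. The `difficulty` and `domain` field names, their allowed values, and the helpers `annotate` and `toJsonl` are hypothetical. Whether your training framework accepts or ignores extra fields depends on the framework.

```typescript
interface TrainingPair {
  prompt: string;
  completion: string;
}

// Hypothetical metadata fields; names and allowed values are assumptions.
type Difficulty = "easy" | "medium" | "hard";

interface AnnotatedPair extends TrainingPair {
  difficulty: Difficulty;
  domain: string;
}

// Attach metadata to an existing prompt/completion pair.
function annotate(
  pair: TrainingPair,
  difficulty: Difficulty,
  domain: string
): AnnotatedPair {
  return { ...pair, difficulty, domain };
}

// Serialize annotated records as JSONL, one JSON object per line.
function toJsonl(records: AnnotatedPair[]): string {
  return records.map((record) => JSON.stringify(record)).join("\n");
}

const annotatedJsonl = toJsonl([
  annotate(
    { prompt: "What is 17 * 23?", completion: "17 * 23 = 391.\nAction: answer 391" },
    "medium",
    "arithmetic"
  ),
  annotate(
    { prompt: "You see a locked door.", completion: "I need a key.\nAction: search room" },
    "easy",
    "text-adventure"
  ),
]);
```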