Training Data Export for GRPO

This guide explains how to export episodic memories as training data for Group Relative Policy Optimization (GRPO) using the Daydreams AI core package.

What is GRPO Training?

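GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm designed to improve reasoning in large language models. It scores each sampled completion relative to the other completions generated for the same prompt, which removes the need for the separate value (critic) model that PPO trains and so lowers memory use. It is particularly effective for tasks requiring complex problem-solving, such as: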

Mathematical reasoning
Decision-making scenarios
Step-by-step problem solving
Game-based learning environments

Key Benefits of GRPO:

Improves reasoning capabilities beyond standard fine-tuning
Uses less memory than traditional PPO, because no separate value model is trained
Particularly effective for complex problem-solving tasks
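
The memory saving over PPO comes from how GRPO builds its baseline: rewards for a group of completions sampled from the same prompt are normalized against the group's own mean and standard deviation. The sketch below illustrates that calculation only. It is not part of Daydreams, and the function name and sample rewards are hypothetical.

```typescript
// Group-relative advantage: score each sampled completion against the
// other completions drawn for the same prompt, so no learned value
// model (critic) is needed as a baseline.
function groupRelativeAdvantages(rewards: number[]): number[] {
  if (rewards.length === 0) return [];
  const mean = rewards.reduce((sum, r) => sum + r, 0) / rewards.length;
  const variance =
    rewards.reduce((sum, r) => sum + (r - mean) ** 2, 0) / rewards.length;
  const std = Math.sqrt(variance);
  // If every completion earned the same reward, there is nothing to rank.
  if (std === 0) return rewards.map(() => 0);
  return rewards.map((r) => (r - mean) / std);
}

// Hypothetical rewards for four completions of one prompt.
const advantages = groupRelativeAdvantages([1, 0, 0, 1]);
console.log(advantages); // [ 1, -1, -1, 1 ]
```

Training frameworks often add a small epsilon to the standard deviation instead of special-casing zero; either choice avoids dividing by zero.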

Workflow Overview

Your Daydreams agent can build reasoning traces for GRPO training by following this structured workflow:

Define Prompt Sources - Use static datasets or interactive environments
Generate Reasoning Traces - Create completions that include thought processes
Store and Save Data - Export in JSONL format compatible with training tools

Enabling Automatic Export

You can configure Daydreams to automatically export training data after each episode:

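import { createDreams } from "@daydreamsai/core";
import { openai } from "@ai-sdk/openai"; // assumes openai() comes from the Vercel AI SDK OpenAI provider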
 
const agent = createDreams({
  model: openai("gpt-4-turbo"),
  exportTrainingData: true,
  trainingDataPath: "./grpo-training-data.jsonl", // Optional, defaults to "./training-data.jsonl"
  // ... other configuration options
});

Note: If you don't specify trainingDataPath, Daydreams will save the data to ./training-data.jsonl in your project root.

Manual Export

You can manually export all episodes as training data:

// Export using the default path from your agent configuration
await agent.exportAllTrainingData();
 
// Or specify a custom path
await agent.exportAllTrainingData("./custom-path/grpo-training-data.jsonl");

Understanding the Data Format for GRPO

Daydreams exports training data in JSONL (JSON Lines) format, optimized for GRPO training. Each line is a standalone JSON object with a prompt field and a completion field:

{
  "prompt": "You are in a dark room with a door to the north.",
  "completion": "I need to find a way out. I should check if the door is locked.\n\nI found the door was unlocked and was able to exit the room."
}

The format includes:

prompt: The observation or context provided to the agent
completion: The agent's reasoning process and action results
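
Before handing an exported file to a trainer, it can help to confirm that every line really has both fields. The following is a minimal sketch of such a check, assuming the file contents have already been read into a string. The TrainingPair type and parseTrainingJsonl function are illustrative names, not Daydreams exports.

```typescript
// Shape of one exported line, taken from the two fields described above.
interface TrainingPair {
  prompt: string;
  completion: string;
}

// Parse JSONL text into records, rejecting entries that lack either field.
function parseTrainingJsonl(text: string): TrainingPair[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "") // skip blank lines, e.g. a trailing newline
    .map((line, index) => {
      const record = JSON.parse(line) as Partial<TrainingPair>;
      if (
        typeof record.prompt !== "string" ||
        typeof record.completion !== "string"
      ) {
        throw new Error(`Entry ${index + 1} is missing "prompt" or "completion"`);
      }
      return { prompt: record.prompt, completion: record.completion };
    });
}

const sample =
  '{"prompt":"You are in a dark room.","completion":"I should look for a door."}\n';
const pairs = parseTrainingJsonl(sample);
console.log(pairs.length); // 1
```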

For interactive environments, ensure completions include both reasoning and an explicit action statement:

{
  "prompt": "You are in a dark room with a door to the north.",
  "completion": "I need to find a way out. I should check if the door is locked.\n\nAction: try opening the door"
}
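
A consistent "Action:" line lets a reward function or environment wrapper separate the reasoning from the action. The sketch below shows one way to do that split, assuming the action always sits on its own line beginning with "Action:". The parseCompletion helper is a hypothetical name, not a Daydreams API.

```typescript
// Result of splitting a completion into reasoning text and explicit action.
interface ParsedCompletion {
  reasoning: string;
  action: string | null;
}

function parseCompletion(completion: string): ParsedCompletion {
  const lines = completion.split("\n");
  // Search from the end so the final "Action:" line wins if there are several.
  for (let i = lines.length - 1; i >= 0; i--) {
    const match = /^Action:\s*(.+)$/.exec(lines[i].trim());
    if (match) {
      return {
        reasoning: lines.slice(0, i).join("\n").trim(),
        action: match[1].trim(),
      };
    }
  }
  // No action line found: treat the whole completion as reasoning.
  return { reasoning: completion.trim(), action: null };
}

const parsed = parseCompletion(
  "I need to find a way out. I should check if the door is locked.\n\nAction: try opening the door"
);
console.log(parsed.action); // try opening the door
```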

Creating Custom Training Pairs for GRPO

For advanced use cases, you can create custom training data pairs specifically designed for GRPO.
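
As a rough illustration, the sketch below turns episode-like records into prompt/completion pairs and serializes them as JSONL. The EpisodeLike shape (observation, thoughts, result) is an assumption made for this example, so check the episode type in your Daydreams version for the real field names. toTrainingPair and toJsonl are likewise hypothetical helpers.

```typescript
// Hypothetical episode shape; the real Daydreams fields may differ.
interface EpisodeLike {
  observation: string;
  thoughts: string;
  result: string;
}

interface TrainingPair {
  prompt: string;
  completion: string;
}

// The observation becomes the prompt; reasoning plus outcome become the completion.
function toTrainingPair(episode: EpisodeLike): TrainingPair {
  return {
    prompt: episode.observation,
    completion: `${episode.thoughts}\n\n${episode.result}`,
  };
}

// Serialize pairs as JSONL: one JSON object per line.
function toJsonl(pairs: TrainingPair[]): string {
  return pairs.map((pair) => JSON.stringify(pair) + "\n").join("");
}

const episodes: EpisodeLike[] = [
  {
    observation: "You are in a dark room with a door to the north.",
    thoughts: "I need to find a way out. I should check if the door is locked.",
    result: "I found the door was unlocked and was able to exit the room.",
  },
];

const jsonl = toJsonl(episodes.map(toTrainingPair));
console.log(jsonl);
```

The resulting string can be written to any path your training tool reads from.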

Optimizing Data for GRPO Training

To maximize the effectiveness of your GRPO training data:

Include diverse scenarios - Ensure your agent encounters a variety of situations
Capture step-by-step reasoning - The completion should show the agent's thought process
Format actions consistently - Use patterns like "Action: [action]" for easy parsing
Balance task difficulty - Include both simple and complex reasoning challenges
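
Two of these tips can be checked mechanically. The sketch below counts distinct prompts as a crude diversity signal and counts completions that lack an "Action:" line. It is an illustrative audit with hypothetical names, and it does not attempt to measure reasoning quality or task difficulty.

```typescript
interface TrainingPair {
  prompt: string;
  completion: string;
}

interface DatasetReport {
  total: number;
  uniquePrompts: number; // few unique prompts suggests low scenario diversity
  missingAction: number; // completions without an "Action: ..." line
}

function auditDataset(pairs: TrainingPair[]): DatasetReport {
  const uniquePrompts = new Set(pairs.map((p) => p.prompt.trim())).size;
  const missingAction = pairs.filter(
    (p) => !/^Action:[ \t]*\S/m.test(p.completion)
  ).length;
  return { total: pairs.length, uniquePrompts, missingAction };
}

const report = auditDataset([
  { prompt: "Room A", completion: "The door looks open.\n\nAction: go north" },
  { prompt: "Room A", completion: "I will wait and listen." },
  { prompt: "Room B", completion: "A key is on the floor.\n\nAction: take key" },
]);
console.log(report); // { total: 3, uniquePrompts: 2, missingAction: 1 }
```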

Customizing the Export Format

If you need a different format for your specific GRPO training framework:

Create your own formatter function based on the Daydreams utilities
Process the episodic memories to match your required format
Save the data using your preferred file structure

Example use case: You might need to add additional metadata fields like task difficulty or domain type to help with training organization.
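
A formatter along those lines might look like the sketch below, which merges caller-supplied metadata into each record before writing JSONL. The difficulty and domain fields, the PairMetadata type, and formatWithMetadata are all hypothetical; rename them to match whatever your training framework expects.

```typescript
interface TrainingPair {
  prompt: string;
  completion: string;
}

// Hypothetical metadata fields used only for this example.
interface PairMetadata {
  difficulty: "easy" | "medium" | "hard";
  domain: string;
}

// Merge metadata from a caller-supplied function into each record,
// then emit one JSON object per line.
function formatWithMetadata(
  pairs: TrainingPair[],
  describe: (pair: TrainingPair) => PairMetadata
): string {
  return pairs
    .map((pair) => JSON.stringify({ ...pair, ...describe(pair) }) + "\n")
    .join("");
}

const output = formatWithMetadata(
  [{ prompt: "What is 2 + 2?", completion: "2 + 2 = 4.\n\nAction: answer 4" }],
  // Placeholder labelling; replace with your own classification logic.
  () => ({ difficulty: "easy", domain: "arithmetic" })
);
console.log(output);
```

Passing the labelling logic in as a function keeps the formatter independent of how difficulty or domain is decided.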
