infinite - a rubric-driven prioritized replay to maximise continual learning

August 20, 2025 · research · 6 min read · continual-learning, reinforcement-learning, evaluation

"an infinite game is played for the purpose of continuing the play. not for winning or achieving a specific end." — James P. Carse

abstract

continual learning systems face a fundamental challenge: how to efficiently retain and build upon previously learned knowledge while adapting to new information. traditional training methods often suffer from catastrophic forgetting and inefficient resource utilization.

infinite introduces a rubric-driven prioritized replay mechanism that transforms how continual learning systems select, prioritize, and replay experiences. by implementing a diverse and adaptive evaluation framework, infinite aims to ensure that the most educationally valuable experiences are replayed with optimal frequency.

key innovations:

curriculum-based domain selection: dynamically prioritizes training domains based on performance bands (low/medium/high) and staleness metrics
on-policy training with fresh rollouts: maintains policy freshness without storing old trajectories, using distributed state tracking across domains
contamination detection: pre-training validation ensures evaluation data hasn't leaked into training sets
upgrade mode: enhances post-trained models with new capabilities while preserving prior skills via KL anchoring
mixed/single batch alternation: alternates between focused single-domain and cross-domain mixed batches for optimal generalization

this approach addresses both (1) minimizing forgetfulness across multiple domains over long horizons, and (2) upgrading post-trained models when original training data is unavailable.

understanding infinite replay

imagine you're training an AI system to master multiple domains. these domains are not necessarily related to each other, and they may vary in complexity. you want a single base model that can learn from all of them, and that can keep learning from new domains as they arrive. currently we don't have anything that tackles this effectively. my question is: how do we do this with what we already have?

there are many approaches to this problem, most of them architecture-level changes. i want to explore whether this is already possible given the right training methodology.

the intuition

the idea borrows from spaced repetition mechanisms - replay the most important experiences with highest frequency while maintaining minimal coverage of stable areas.

concrete mechanisms:

performance band assignment:

convert rubric grades (1-4 scale) to pass/fail indicators (pass ≥ 3)
track exponential moving average (EMA) of pass rates per domain
assign performance bands: low (<0.4), medium (0.4-0.8), high (>0.8)
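
the three steps above can be sketched in a few lines. this is a minimal illustration, not the project's code: the function names and the smoothing factor `EMA_ALPHA` are assumptions, and placing the 0.4 and 0.8 boundaries in the medium band follows the thresholds as written.

```python
PASS_GRADE = 3    # from the text: a rubric grade of 3 or 4 counts as a pass
EMA_ALPHA = 0.1   # assumption: smoothing factor, not specified in the post


def grade_to_pass(grade: int) -> float:
    """map a 1-4 rubric grade to a pass (1.0) or fail (0.0) indicator."""
    if not 1 <= grade <= 4:
        raise ValueError(f"rubric grade must be in 1..4, got {grade}")
    return 1.0 if grade >= PASS_GRADE else 0.0


def update_ema(acc_ema: float, grades: list[int], alpha: float = EMA_ALPHA) -> float:
    """fold a batch of grades into the running pass-rate EMA, one grade at a time."""
    for grade in grades:
        acc_ema = (1.0 - alpha) * acc_ema + alpha * grade_to_pass(grade)
    return acc_ema


def assign_band(acc_ema: float) -> str:
    """thresholds from the text: low (<0.4), medium (0.4-0.8), high (>0.8)."""
    if acc_ema < 0.4:
        return "low"
    if acc_ema <= 0.8:
        return "medium"
    return "high"
```

a smaller `alpha` makes the band react more slowly to a run of bad grades, which trades responsiveness for stability.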

adaptive scheduling priorities:

low performers: 60% of training capacity (frequent practice)
medium performers: 30% capacity (regular practice)
high performers: 10% capacity (occasional refresh)
staleness boost: domains not seen recently get priority increase
uncertainty factor: high variance in recent grades indicates exploration value
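
one way to read the 60/30/10 split is as a capacity budget per band, shared evenly by the domains inside that band. the sketch below assumes exactly that, plus one more assumption the text does not state: when a band is empty, its capacity is redistributed proportionally to the non-empty bands. the staleness and uncertainty terms are left out here on purpose.

```python
BAND_CAPACITY = {"low": 0.6, "medium": 0.3, "high": 0.1}  # shares from the text


def domain_shares_from_bands(bands: dict[str, str]) -> dict[str, float]:
    """split each band's capacity evenly among the domains in that band.

    assumption: capacity of empty bands is redistributed proportionally
    among the non-empty ones, so the shares always sum to 1.
    """
    counts: dict[str, int] = {}
    for band in bands.values():
        counts[band] = counts.get(band, 0) + 1
    total = sum(BAND_CAPACITY[band] for band in counts)
    return {
        domain: BAND_CAPACITY[band] / total / counts[band]
        for domain, band in bands.items()
    }
```

for example, two low domains, one medium and one high would get roughly 0.3, 0.3, 0.3 and 0.1 of the batch.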

anti-forgetting feedback loop:

domain performance drops → low band assignment → increased sampling priority → 
more training → performance recovery → higher band → reduced sampling

distributed state tracking: each domain maintains acc_ema, performance_band, last_seen_step, and grade_uncertainty.

this creates a self-regulating system where struggling domains automatically receive more attention while stable domains are maintained with minimal overhead.

visual flow

INFINITE: Rubric-Driven Prioritized Replay for Continual Learning
═══════════════════════════════════════════════════════════════════

┌─────────────────────────────────────────┐
│              INPUT DOMAINS              │
│                                         │
│  ┌─────────────┐ ┌─────────────┐        │
│  │   Domain 1  │ │   Domain 2  │        │
│  │   (Math)    │ │  (Language) │        │
│  └─────────────┘ └─────────────┘        │
│                                         │
│  ┌─────────────┐ ┌─────────────┐        │
│  │   Domain 3  │ │   Domain N  │        │
│  │  (Science)  │ │  (New Task) │        │
│  └─────────────┘ └─────────────┘        │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│         RUBRIC EVALUATION MODULE       │
│                                         │
│  ├─ Performance Assessment              │
│  ├─ Cross-Domain Scoring                │
│  └─ Task-Specific Metrics              │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│       SCORE DRIFT DETECTION MODULE     │
│                                         │
│  ├─ Value Function Tracking             │
│  ├─ Performance Change Analysis         │
│  └─ Forgetting Detection Algorithm      │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│      PRIORITIZED REPLAY SCHEDULER      │
│                                         │
│  ├─ Adaptive Curriculum Generation      │
│  ├─ Spaced Repetition Algorithm         │
│  └─ Priority Queue Management          │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│           REPLAY EXECUTION             │
│                                         │
│  ├─ High Priority: Slipping Domains     │
│  ├─ Medium Priority: New Learning       │
│  └─ Low Priority: Stable Domains       │
└─────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────┐
│            CONTINUAL LEARNING           │
│             OBJECTIVES                  │
│                                         │
│  ✓ Knowledge Retention                  │
│  ✓ New Domain Acquisition               │
│  ✓ Catastrophic Forgetting Prevention   │
│  ✓ Adaptive Learning Rate               │
└─────────────────────────────────────────┘

PROCESS FLOW: Domains → Evaluation → Detection → Scheduling → Execution → Learning

contamination detection and data integrity

before any training begins, infinite implements comprehensive contamination detection to ensure evaluation data hasn't leaked into training sets:

pre-training validation protocol:

sample representative subsets from each training domain
compute semantic similarity (cosine similarity of embeddings) between training and evaluation prompts
flag matches exceeding configurable threshold (default: 0.95 cosine similarity)
generate detailed contamination audit report with exact matches and near-duplicates
either automatically remove contaminated samples or halt with error if contamination exceeds tolerance
persist contamination logs for reproducibility and compliance
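
the protocol above can be sketched as follows. this is a pure-python illustration under stated assumptions: embeddings are taken as already computed (the embedding model is out of scope), the function names are hypothetical, and `tolerance` is an assumed parameter for the maximum fraction of contaminated training samples before halting. only the 0.95 threshold comes from the text.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def find_contamination(train_emb, eval_emb, threshold=0.95):
    """return (train_index, eval_index, similarity) for every pair above threshold."""
    flagged = []
    for i, t in enumerate(train_emb):
        for j, e in enumerate(eval_emb):
            sim = cosine_similarity(t, e)
            if sim > threshold:
                flagged.append((i, j, sim))
    return flagged


def audit(train_emb, eval_emb, threshold=0.95, tolerance=0.01):
    """drop flagged training samples, or halt if too many are contaminated.

    returns (indices of clean training samples, flagged pairs for the report).
    """
    flagged = find_contamination(train_emb, eval_emb, threshold)
    contaminated = {i for i, _, _ in flagged}
    rate = len(contaminated) / len(train_emb) if train_emb else 0.0
    if rate > tolerance:
        raise RuntimeError(f"contamination rate {rate:.2%} exceeds tolerance {tolerance:.2%}")
    clean = [i for i in range(len(train_emb)) if i not in contaminated]
    return clean, flagged
```

the all-pairs loop is quadratic, which is why the protocol samples representative subsets first; at scale an approximate nearest-neighbour index would be the usual substitute.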

why this matters: contaminated evaluation data leads to inflated performance metrics and false confidence in model capabilities. this validation step ensures legitimate measurement of continual learning progress.

implementation architecture

core training loop (every step k):

domain state tracking
- maintain domain statistics (performance, staleness, uncertainty)
- synchronize across training nodes when distributed
domain health assessment
- calculate performance bands from acc_ema thresholds
- compute staleness (steps since domain last trained)
- measure uncertainty from recent grade variance
batch composition strategy
- every 10th step: single-domain batch (focused learning)
- other steps: mixed-domain batch (cross-domain transfer)
- research suggests 12% better generalization from alternation

priority-driven domain selection

priority = band_weight + staleness_factor + uncertainty_factor + base_weight
domain_shares = (1 - anti_starvation_epsilon) * softmax(priorities) + anti_starvation_epsilon / num_domains

rollout execution and grading
- generate model responses for selected prompts
- evaluate using domain-specific rubrics (1-4 scale)
- update acc_ema with pass/fail indicators
GRPO gradient updates with KL regularization
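
a rough sketch of the selection and batch-composition steps in the loop above. one caveat drives the design: adding the same constant to every priority does not change a softmax, so this sketch assumes the anti-starvation term is meant as a floor mixed in after the softmax. the helper names, the default `epsilon`, and the choice to treat steps divisible by 10 as the single-domain steps are all assumptions.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def priority(band_weight, staleness_factor, uncertainty_factor, base_weight=0.0):
    """additive priority, as in the formula from the post."""
    return band_weight + staleness_factor + uncertainty_factor + base_weight


def domain_shares(priorities: dict[str, float], epsilon: float = 0.05) -> dict[str, float]:
    """softmax over priorities, mixed with a uniform floor of epsilon / n per domain."""
    if not priorities:
        return {}
    names = list(priorities)
    probs = softmax([priorities[name] for name in names])
    n = len(names)
    return {name: (1 - epsilon) * p + epsilon / n for name, p in zip(names, probs)}


def batch_kind(step: int, single_every: int = 10) -> str:
    """every 10th step is a single-domain batch, all others are mixed."""
    return "single" if step % single_every == 0 else "mixed"
```

with this floor, every domain keeps at least `epsilon / n` of the sampling mass no matter how low its priority falls.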

key components to build:

InfiniteGRPOTrainer: extends base GRPO with curriculum scheduling
TriageStateManager: persistent distributed state storage
ContaminationDetector: pre-training validation (0.95 cosine similarity threshold)
TriageSampler: priority calculation and batch assembly
RubricEvaluator: 4-point grading framework

domain state structure:

from dataclasses import dataclass

@dataclass
class TriageState:
    acc_ema: float           # exponential moving average of pass rates
    performance_band: str    # "low", "medium", or "high"
    last_seen_step: int      # last step this domain was trained on (staleness tracking)
    grade_uncertainty: float # variance in recent rubric scores
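
the two derived quantities, staleness and grade uncertainty, could be computed along these lines. `GradeWindow` and its window size are hypothetical helpers for illustration; the text only says that staleness is the number of steps since the domain was last trained and that uncertainty comes from the variance of recent grades.

```python
from collections import deque
from statistics import pvariance


def staleness(current_step: int, last_seen_step: int) -> int:
    """steps since the domain was last trained (never negative)."""
    return max(0, current_step - last_seen_step)


class GradeWindow:
    """hypothetical helper: keeps the most recent rubric grades for one domain."""

    def __init__(self, size: int = 32):
        # assumption: a fixed-size sliding window; the size is not given in the post
        self.grades = deque(maxlen=size)

    def add(self, grade: int) -> None:
        self.grades.append(grade)

    def uncertainty(self) -> float:
        """population variance of the recent grades; 0.0 while the window is empty."""
        return float(pvariance(self.grades)) if self.grades else 0.0
```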

upgrade mode: enhancing post-trained models

upgrade mode addresses a critical real-world scenario: you have a well-trained model M0, but original training data is unavailable. you want to add new capabilities without losing existing skills.

initialization workflow:

contamination check: validate new training data doesn't overlap with evaluation suites
baseline establishment: evaluate frozen model M0 on all domain evaluation suites
state initialization: set initial acc_ema values for each domain based on M0's performance
anchor setup: configure KL divergence penalties toward M0 to prevent forgetting

training modifications:

conservative scheduling: allocate 70% batch capacity to new domains, 30% to prior domain maintenance
stronger KL regularization: apply higher penalties (coefficient ≥0.1) to maintain similarity to M0
anti-starvation guarantees: ensure prior domains get minimum sampling even when performing well
gradual capability transfer: start with lower learning rates on new domains
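
to make the KL anchor concrete, here is a toy version over small categorical distributions. it is a sketch under assumptions: the direction KL(policy || M0) is assumed, both distributions are assumed to share a support on which M0 assigns non-zero probability, and the function names are illustrative. only the 0.1 coefficient comes from the text.

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two categorical distributions over the same support.

    assumption: q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def anchored_loss(policy_loss: float, policy_probs: list[float],
                  anchor_probs: list[float], kl_coef: float = 0.1) -> float:
    """policy loss plus a KL penalty pulling the policy toward the frozen model M0."""
    return policy_loss + kl_coef * kl_divergence(policy_probs, anchor_probs)
```

the penalty is zero when the policy matches M0 and grows as the two diverge, so a larger `kl_coef` trades new-domain progress for retention.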

safety mechanisms:

regression monitoring: track performance on prior domains at every evaluation
multi-tier alerts: escalating responses when prior domain performance drops
automatic rollback: return to previous checkpoint if regression exceeds thresholds
human-in-the-loop gating: require manual review after repeated safety violations

evaluation metrics and success criteria

forgetting and retention tracking:

backward transfer (BWT): performance change on earlier domains after learning new ones
forward transfer (FWT): zero-shot performance gains on unseen domains
average accuracy (ACC): macro-average across all domains over time
area under retention curve (AURC): long-term stability per domain
time-to-decay: steps before performance degrades without practice
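
a small sketch of how some of these metrics could be computed from logged accuracies. the definitions are assumptions: BWT follows the common formulation (final accuracy on each earlier domain minus accuracy right after that domain was learned), ACC is taken at the final stage only, and AURC is approximated here as a normalized trapezoidal area, since the post does not pin down a formula.

```python
def backward_transfer(acc: list[list[float]]) -> float:
    """acc[i][j] = accuracy on domain j after finishing training stage i."""
    stages = len(acc)
    if stages < 2:
        raise ValueError("need at least two training stages")
    return sum(acc[stages - 1][j] - acc[j][j] for j in range(stages - 1)) / (stages - 1)


def average_accuracy(acc: list[list[float]]) -> float:
    """macro-average over all domains after the final stage."""
    final = acc[-1]
    return sum(final) / len(final)


def retention_auc(curve: list[float]) -> float:
    """trapezoidal area under one domain's accuracy curve, normalized by its length."""
    if len(curve) < 2:
        raise ValueError("need at least two points")
    area = sum((a + b) / 2 for a, b in zip(curve, curve[1:]))
    return area / (len(curve) - 1)
```

a negative BWT indicates forgetting; a value near zero means earlier domains held steady while new ones were learned.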

continual learning benchmarks:

# stability curves: per-domain pass-rate EMA recorded over training steps
stability_curve = {}  # domain name -> list of (step, acc_ema)

# ablation ladder compared against the baseline; each row adds one component
ablations = {
    "baseline_grpo": "standard GRPO without scheduling",
    "infinite_a1": "curriculum domain selection only",
    "infinite_a2": "a1 + staleness priority boosting",
    "infinite_a3": "a2 + uncertainty factors",
    "infinite_full": "a3 + mixed/single batch alternation",
}

success criteria:

curriculum scheduling improves AURC by ≥25% over baseline GRPO
mixed/single alternation shows measurable generalization benefit
upgrade mode: new domain improvement ≥5 points, prior domain drop ≤1 point
contamination detection catches known overlaps with 95%+ accuracy

safety gating policy for upgrade mode:

first alert: increase domain bucket weight to boost sampling
second alert: strengthen KL penalty toward baseline model M0
third alert: reduce new domain sampling temporarily
final gate: halt training and require human review after H failed evaluations, where H is a configurable limit
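
the escalation policy above can be read as a small state machine. the sketch below is one possible reading, with several assumptions: a failure is a prior-domain drop larger than `threshold` points (1.0 is borrowed from the success criteria), the counter resets after a clean evaluation, and `max_failures` stands in for H with an arbitrary default of 4. the action names are placeholders.

```python
from dataclasses import dataclass

# escalating responses, in the order listed in the policy
ACTIONS = ["boost_domain_weight", "strengthen_kl", "reduce_new_domain_sampling"]


@dataclass
class GateState:
    consecutive_failures: int = 0


def gate(state: GateState, regression: float,
         threshold: float = 1.0, max_failures: int = 4) -> str:
    """return the action for one evaluation.

    regression: drop in prior-domain score, in points (positive means worse).
    """
    if regression <= threshold:
        state.consecutive_failures = 0  # assumption: a clean evaluation resets the ladder
        return "ok"
    state.consecutive_failures += 1
    if state.consecutive_failures >= max_failures:
        return "halt_for_human_review"
    # stay on the last action if more alerts arrive before the final gate
    index = min(state.consecutive_failures, len(ACTIONS)) - 1
    return ACTIONS[index]
```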

planning the details

open questions to settle before implementation:

choice of base model?
which domains to use?
how to detect catastrophic forgetting in standard RL training for the base model?
what to measure for each domain?
expected challenges with reward hacking?
known works that tackle this or something similar?
rough timeline?
what can people with various backgrounds contribute?

division by contribution areas

broadly, there are six areas of contribution with plenty of work to be done:

contamination check scripts - to test the base/instruct model on the domains we pick
collecting small datasets, evals, and RL environments for math/code/creative language tasks
contributing to code based on already decided algorithms (scheduling the replay, how to weight the domains, any other policy gradient design decisions)
contributing to improving the algorithms based on some identified disadvantage
compute/running experiments
miscellaneous (any software-level or uncategorised feedback/improvement)

this is a work in progress. all updates will be posted here.

join the discussion:

e/Xperiments discord server
github repo: https://github.com/tokenbender/infinite

reach out if you think it is cool and can contribute in any way - collaboration, compute or sponsorship.