
From Autoresearch to Autorefining: Karpathy's 630 Lines That Could Change How AI Improves Itself
Karpathy's autoresearch runs hundreds of AI experiments overnight on one GPU. The autorefining pattern it demonstrates could transform any system with a feedback loop.
Andrej Karpathy just open-sourced a 630-line script that lets AI agents run hundreds of experiments overnight on a single GPU. The implications go far beyond training language models.
By The Global Federation AI Lab | March 15, 2026 | Category: AI Lab | Read Time: 8 min
On March 8, 2026, Andrej Karpathy -- former Tesla AI lead, OpenAI founding member, and arguably the most influential AI educator alive -- quietly posted a repository on GitHub called autoresearch. It is 630 lines of Python. It requires one GPU. It has no complex dependencies beyond PyTorch. And it may represent the most important architectural pattern to emerge in AI development this year.
The premise is deceptively simple: give an AI agent access to a training script, let it read the code, form a hypothesis, modify the code, run a five-minute training experiment, evaluate the result, and repeat. Keep improvements, discard failures. Do this while you sleep.
Over two days pointed at nanochat -- Karpathy's already well-tuned GPT-2 training codebase -- the agent ran approximately 700 experiments, found roughly 20 genuine improvements, and stacked them to reduce time-to-GPT-2 from 2.02 hours to 1.80 hours. An 11% efficiency gain, found autonomously, on code that was already optimised by one of the best in the field.
Shopify CEO Tobi Lutke reportedly pointed autoresearch at an internal 0.8 billion parameter model. After 37 experiments over a single overnight run, the agent achieved a 19% improvement in model quality.
These are not theoretical numbers. These are production results.
The Autoresearch Loop
The architecture is elegant in its minimalism. Three files:
prepare.py -- fixed constants, data preparation, tokeniser, dataloader, evaluation. The agent does not touch this file. It is the controlled variable.
train.py -- the model architecture, optimiser, hyperparameters, and training loop. Everything in this file is fair game. The agent can change the learning rate, the depth, the attention mechanism, the batch size, the optimiser -- anything.
program.md -- the agent's instructions. A system prompt that tells it: read the code, form a hypothesis, make one change, run the experiment, evaluate.
The metric is val_bpb (validation bits per byte) -- lower is better, vocabulary-size-independent so architectural changes are fairly compared. Training runs for exactly five minutes of wall clock time, regardless of hardware. This means approximately 12 experiments per hour, approximately 100 experiments per night.
The agent reads its own experimental history. It knows what has been tried, what worked, what failed. It builds on successes and avoids repeated failures. It is, in the most literal sense, a scientist -- forming hypotheses, testing them, recording results, and iterating.
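In outline, that loop reduces to a few lines of Python. Everything below is our own illustration -- `autorefine`, `propose`, `evaluate`, and the toy learning-rate example are hypothetical stand-ins, not code from the autoresearch repository:

```python
import math
import random

def autorefine(propose, evaluate, rounds, history=None):
    """Generic keep-or-discard refinement loop, per the description above.

    propose(history) returns one candidate change (an atomic hypothesis);
    evaluate(change) returns the metric, lower is better. Only changes that
    beat the best-so-far are kept, and the full history is fed back so the
    proposer can build on successes and avoid repeated failures."""
    history = [] if history is None else history
    best = math.inf
    kept = []
    for _ in range(rounds):
        change = propose(history)            # one hypothesis, one change
        score = evaluate(change)             # fixed evaluation budget
        improved = score < best              # the keep-or-discard gate
        if improved:
            best = score
            kept.append(change)
        history.append({"change": change, "score": score, "kept": improved})
    return kept, best, history

# Toy usage: "experiments" nudge a learning rate, and a synthetic loss with
# a sweet spot at 0.3 stands in for val_bpb. Purely illustrative.
random.seed(0)

def propose(history):
    kept_changes = [h["change"] for h in history if h["kept"]]
    base = kept_changes[-1] if kept_changes else 0.1
    return base + random.uniform(-0.05, 0.05)

def evaluate(lr):
    return (lr - 0.3) ** 2

kept, best, history = autorefine(propose, evaluate, rounds=50)
```

In the real system the proposer is an LLM reading train.py and the evaluator is a five-minute training run; here both are toy functions, but the control flow is the same.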
Why This Matters Beyond Model Training
Here is where most commentary on autoresearch stops: a neat tool for training better language models. But the pattern Karpathy has demonstrated is far more general than LLM training. What he has built is a generalised autonomous refinement loop. And that pattern -- which we might call autorefining -- has implications that extend into every domain where iterative improvement against a measurable metric is possible.
Consider what the autoresearch loop actually abstracts:
- A system with modifiable parameters (the code)
- A fixed evaluation framework (the metric)
- An agent capable of reading, reasoning, and modifying (the LLM)
- A keep-or-discard gate (did the metric improve?)
- A history of experiments (context for the next hypothesis)
This is not specific to neural network training. This is the structure of any optimisation problem with a feedback loop. And that means the pattern can be lifted out of its ML context and applied to domains that have never had access to autonomous improvement.
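One way to make that abstraction concrete is a minimal interface with one method per ingredient (the experiment history lives in the loop itself). The `Refinable` protocol and the toy `CacheConfig` below are our own naming, not anything from autoresearch:

```python
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class Refinable(Protocol):
    """The minimal surface an autorefining loop needs to drive a system."""
    def read_state(self) -> str: ...            # the modifiable parameters
    def apply(self, change: Any) -> None: ...   # make one atomic modification
    def revert(self, change: Any) -> None: ...  # the discard half of the gate
    def evaluate(self) -> float: ...            # the fixed metric, lower is better

@dataclass
class CacheConfig:
    """Toy Refinable: one cache-size knob with a synthetic cost curve."""
    size_mb: int = 64

    def read_state(self) -> str:
        return f"size_mb={self.size_mb}"

    def apply(self, change: int) -> None:
        self.size_mb += change

    def revert(self, change: int) -> None:
        self.size_mb -= change

    def evaluate(self) -> float:
        return abs(self.size_mb - 256)  # pretend 256 MB is the sweet spot
```

Anything that can implement these four methods -- a training script, a cache layer, a build pipeline -- can, in principle, sit inside the same loop.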
Autorefining: The Generalised Pattern
Autorefining takes the autoresearch loop and applies it to any system where:
- The output quality can be measured objectively
- The system's configuration can be modified programmatically
- The modification space is large enough that human exploration is impractical
- The evaluation cycle is short enough for rapid iteration
Here are domains where this pattern could be applied immediately:
Software Performance Optimisation
Point an autorefining agent at a web application's configuration: database query strategies, caching policies, connection pool sizes, compression algorithms, CDN rules. Define the metric as p99 latency or throughput. Let the agent run 100 configuration experiments overnight against a staging environment. By morning, you have a performance-tuned system that would have taken a human engineer weeks of profiling and intuition.
Prompt Engineering
This is perhaps the most immediately accessible application. Define a prompt template with modifiable sections. Define an evaluation metric -- accuracy on a test set, judge-model scoring, user preference ratings. Let an autorefining agent iterate through prompt variations, testing each against the metric. The agent does not need a GPU. It needs an API key and a scoring function.
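A deliberately degenerate sketch of that idea, with the agent replaced by an exhaustive pass over a fixed variant list and the judge replaced by a toy scoring function -- all names here are ours, and a real agent would generate new variants from the history rather than exhaust a list:

```python
def refine_prompt(variants, score):
    """Score each prompt variant and keep the best: the keep-or-discard
    gate applied to a template instead of a training script."""
    best_prompt, best_score, history = None, float("-inf"), []
    for candidate in variants:
        s = score(candidate)          # test-set accuracy, judge model, etc.
        history.append((candidate, s))
        if s > best_score:
            best_prompt, best_score = candidate, s
    return best_prompt, best_score, history

# Toy usage: a stand-in judge that happens to reward step-by-step prompting.
templates = [
    "Answer: {q}",
    "Think step by step, then answer: {q}",
    "You are an expert. {q}",
]

def toy_score(template):
    return float("step by step" in template) + 0.1 * ("expert" in template)

best_prompt, best_score, history = refine_prompt(templates, toy_score)
```

Swap `toy_score` for a real evaluation call and the variant list for LLM-generated hypotheses, and this is the whole application: no GPU, just an API key and a scoring function.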
CI/CD Pipeline Optimisation
Build pipelines have dozens of configurable parameters: parallelism levels, cache strategies, test ordering, resource allocation, timeout values. The metric is pipeline duration and reliability. An autorefining agent could optimise a 45-minute CI pipeline to 20 minutes by discovering non-obvious parallelisation opportunities and cache configurations.
Manufacturing and Process Control
Any industrial process with sensor feedback and adjustable parameters is a candidate. Temperature profiles in semiconductor fabrication. Pressure curves in injection moulding. Fermentation parameters in bioprocessing. The autoresearch pattern maps directly: modify parameters, run a short production cycle, measure quality, keep or discard.
Content and Design Optimisation
A/B testing is already a crude version of autorefining. But current A/B testing is human-paced -- one or two variants tested over days or weeks. An autorefining agent could generate and test hundreds of design variations against engagement metrics in a single night. Not random variations, but hypothesis-driven modifications informed by the experimental history.
What Makes Autorefining Different from AutoML
The machine learning community has had AutoML tools for years -- systems that search hyperparameter spaces, try different architectures, and optimise model performance. How is autorefining different?
The critical distinction is the agent's ability to reason about the code itself. AutoML tools search predefined parameter spaces. Autorefining agents read the actual implementation, understand what it does, form a theory about why a change might help, and make targeted modifications. This is the difference between grid search and scientific reasoning.
Karpathy's agent did not just try random learning rates. It read the optimiser code, understood the interaction between Muon and AdamW, and hypothesised that a specific momentum schedule might improve convergence. It understood architecture, not just parameters.
This reasoning capability -- powered by the same large language models that the system is training -- creates a qualitatively different kind of optimisation. One that can discover structural improvements, not just parametric ones.
The Constraints That Make It Work
Karpathy's design choices reveal a deep understanding of what makes autonomous systems reliable:
Fixed time budget. Every experiment runs for exactly five minutes. This prevents the agent from gaming the metric by training longer. It also makes experiments directly comparable regardless of what architectural changes the agent makes.
Single metric. val_bpb. One number. No multi-objective optimisation, no Pareto frontiers, no subjective quality judgments. The agent knows exactly what "better" means.
Atomic changes. One hypothesis, one modification, one experiment. The agent does not try to change five things at once. This makes successes attributable and failures diagnosable.
Full history. The agent has access to every previous experiment -- what was tried, what the hypothesis was, what the result was. This prevents cycling and enables the agent to build on accumulated understanding.
These constraints are not limitations. They are the engineering that makes the system work. Any autorefining implementation should preserve them: fixed evaluation budget, single clear metric, atomic modifications, full experimental history.
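Three of those constraints translate almost directly into code; the fixed time budget lives in whatever runs the experiment. This sketch, with our own names and types rather than anything from the repository, shows a frozen experiment record and a single-metric gate:

```python
from dataclasses import dataclass

BUDGET_SECONDS = 300  # fixed time budget: every run gets exactly five minutes

@dataclass(frozen=True)
class Experiment:
    """One atomic experiment: one hypothesis, one change, one number out.
    Frozen, so the accumulated history cannot be rewritten after the fact."""
    hypothesis: str   # what the agent believed and why
    diff: str         # the single modification it made
    val_bpb: float    # the single metric, lower is better
    kept: bool        # the keep-or-discard decision

def keep(val_bpb, history):
    """An experiment survives only if it beats the best kept result in the
    full history -- one number, no multi-objective judgment calls."""
    best = min((e.val_bpb for e in history if e.kept), default=float("inf"))
    return val_bpb < best
```

Note that the gate compares against kept results only, so a discarded failure never becomes the baseline for the next hypothesis.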
The Open Questions
Autorefining is not a solved problem. Several hard questions remain:
Metric design. The system is only as good as its metric. val_bpb is an excellent metric for language model training because it directly measures what we care about. But many domains lack such clean metrics. How do you define a single number that captures "code quality" or "user experience" or "manufacturing reliability"? Poor metrics will produce Goodhart's Law failures -- systems that optimise the metric while degrading the thing the metric was supposed to measure.
Search space boundaries. Karpathy constrains the agent to modifying train.py. What happens when the agent has access to a larger codebase? The search space explodes, and the ratio of useful to harmful modifications drops. Autorefining implementations need carefully designed boundaries around what the agent can and cannot change.
Transferability. Karpathy's agent found improvements on a small model that transferred to larger ones. This is not guaranteed in general. An autorefining agent optimising a staging environment might find configurations that fail in production. Validation of transferred improvements is essential.
Safety and alignment. An agent that autonomously modifies code and evaluates its own output is exactly the kind of system that alignment researchers worry about. The autoresearch pattern is safe because the modifications are small, the evaluation is external, and the human reviews the accumulated changes. Scaling autorefining to larger systems will require proportionally stronger safety guarantees.
A Starting Point, Not a Destination
Karpathy has done what he does best: demonstrated a powerful idea in the simplest possible form. Autoresearch is not a product. It is a proof of concept. A 630-line invitation to a much larger conversation about how AI agents can autonomously improve the systems they operate within.
The leap from autoresearch to autorefining -- from "AI improves AI training" to "AI improves any system with a feedback loop" -- is conceptually small but practically enormous. It requires adapting the pattern to domains with messier metrics, larger search spaces, and higher stakes. It requires building the safety infrastructure to let these agents operate with appropriate autonomy and appropriate constraints.
But the core insight is now public, open-source, and MIT-licensed. The loop is simple: read, hypothesise, modify, evaluate, iterate. The rest is engineering.
And if there is one thing the AI community has proven it can do at scale, it is engineering.
The AI Lab is TGF's technology research section, exploring how emerging technologies intersect with governance, society, and human progress. We cover developments that may shape the world our federation will govern.