Money Is The Root Of All Evals
A dollar earned is a compressed record of so many things going right.
For money to move, an agent had to do more than generate an impressive output. It had to find something worth doing, push through friction, and make something a person would actually pay for. That is a much higher bar than a benchmark score.
For a long time, AI evals looked like school.
We gave models exams, problem sets, and coding tasks. This made sense when the product was a chatbot and the goal was to show it answering expert-level questions. If the system produces text, then a test with a known answer is a reasonable place to start.
Public benchmarks also had their moment. Chatbot Arena was useful when the frontier was response quality, SWE-bench when coding became the first serious wedge, and METR's evaluations when agents started doing longer tasks.
As systems graduate from proving PhD-level capability, the limits of this paradigm start to appear. Most benchmarks assume the hard part has already been done. The task has been chosen, the context packaged, the success condition defined. The mess of the world has already been reduced to something scorable.
That is not how useful work usually begins. In the real world, the first task is often deciding what the problem is. A capable agent has to notice an opportunity, pick a plan, act through tools, recover from its own mistakes, and end up with something another person values. Someone can ace SWE-bench and still not know which bug is worth fixing, or pass the bar and still not land a client. That kind of work is different from solving a hard task. It is closer to operating.
This is why economic feedback matters. A dollar earned is not a perfect measure of value. Markets can be wrong. Some valuable work is unpaid and some paid work is harmful. Revenue can be delayed, subsidized or faked. If we use money as an eval, we have to account for all of that.
But payment has one property most public benchmarks lack. It comes from outside the lab. When a customer pays, even a small amount, it means the system has crossed a threshold no static benchmark can fully capture. Someone with their own goals, constraints and alternatives decided the model output was useful enough to buy.
That makes revenue a compressed signal, encompassing problem selection, timing, usefulness, trust, delivery, pricing and customer satisfaction. Agents perform well on many of these elements in isolation with human guidance. But the harder question is whether an agent can put them all together.
This doesn’t mean public benchmarks should go away. We still need them to debug and improve model capabilities in vertical-specific workstreams. The mistake is treating any of these benchmarks as the final target.
For agents, the natural endpoint is useful action in the world, and for a large class of those actions, the cleanest external signal is economic. Did the system create something someone would pay for?
We believe the next generation of evals should look less like exams and more like markets. OpenAI’s GDPval, SWE-Lancer, and Andon’s Vending-Bench are early signs of this shift.
A few of the questions we are excited to explore:
- Can an AI system find demand?
- Can it make an offer?
- Can it acquire a customer?
- Can it deliver the product or service?
- Can it handle support?
- Can it improve after failure?
- Ultimately, can it turn funding into revenue?
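One way to make that last question concrete is to count only money that came from outside the lab and stayed there. The sketch below is illustrative, not a description of any existing eval: the `Payment` fields, the two function names, and the sample numbers are all assumptions made up for this example. The filters are rough proxies for the caveats above about delayed, subsidized, or faked revenue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Payment:
    """A hypothetical payment record; every field here is an assumption."""
    amount: float             # dollars paid
    settled: bool = True      # False while the revenue is still pending
    subsidized: bool = False  # True if the lab or a partner covered it
    refunded: bool = False    # True if the customer took the money back


def external_revenue(payments):
    """Sum only payments that settled, were not subsidized, and were not refunded."""
    return sum(
        p.amount
        for p in payments
        if p.settled and not p.subsidized and not p.refunded
    )


def revenue_per_funded_dollar(payments, funding):
    """Dollars of external revenue earned per dollar of starting funding."""
    if funding <= 0:
        raise ValueError("funding must be positive")
    return external_revenue(payments) / funding


# Made-up example run: five payments against $200 of funding.
payments = [
    Payment(40.0),                   # counts
    Payment(25.0, subsidized=True),  # excluded: paid from inside the lab
    Payment(15.0, refunded=True),    # excluded: the customer reversed it
    Payment(60.0, settled=False),    # excluded: not yet settled
    Payment(10.0),                   # counts
]
score = revenue_per_funded_dollar(payments, funding=200.0)
print(score)
```

In this made-up run only two of the five payments survive the filters. A single ratio like this still hides everything the earlier questions ask about, so it is better read as a final checkpoint than as a full scorecard.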
Interested in this design space? We’re hiring.