Economic Evaluations of Language Models

Stanford University

Expert forecasts of frontier AI's economic impact diverge wildly, and existing evidence is limited and seemingly contradictory. This uncertainty about economic impact contrasts with the near certainty that capabilities are improving: LMs saturate challenging benchmarks, often very quickly. We posit that a central reason for this disconnect is a misalignment between AI benchmarks and economic tasks. Benchmarking effort concentrates on math and coding (39.7% of effort), while the related occupations account for only 3.5% of US jobs. To bridge this gap between AI capabilities and economic tasks, we introduce EconEvals, an open-source evaluation suite to measure capabilities relevant to tasks, work activities, and occupations in the US labor economy.

Overview of our approach

We start by building economic benchmarks from public data. We efficiently retrieve user queries for occupational tasks from publicly available chatbot conversations. Using this, we produce leaderboards for 143 work activities. These work activities include tasks relevant to sectors ranging from education to health care to finance, and span all SOC major groups (except Military).

However, because our leaderboards are built from real usage, they inherit the limited coverage of public usage data. We estimate that the upper bound on coverage from publicly available usage data is 38.6% (for comparison, the January 2026 Anthropic Economic Index reports coverage for 17.2% of tasks). To cover the full U.S. economy of 2000 work activities, we generate synthetic data. We use an LM to generate user queries at varying levels of complexity and keep only the ones that pass a set of verifiers.

We show that our synthetic data can be used not only to build benchmarks but also to estimate AI exposure, which quantifies how much time current AI could save workers, assuming adoption. We introduce the first whitebox method to predict occupational exposure to AI by using an LM to simulate a worker. Unlike previous methods, this produces step-by-step justifications alongside time-savings predictions, allowing us to extract qualitative insights into how model capabilities drive economic impact.

Taxonomizing economic tasks

While many researchers and initiatives build AI benchmarks, these decentralized efforts often concentrate on popular topics like software engineering and entirely neglect economically critical applications. For example, almost 20 million U.S. workers are employed in back-office clerical and administrative occupations, but very few benchmarks address this work at all, and those that do are quite incomplete. We define the space of economic tasks using the Department of Labor's O*NET taxonomy, which maps 1,016 occupations to the 18,796 tasks that make up the US labor economy. We define our benchmarks in terms of the 2000 detailed work activities (DWAs) in O*NET: tasks are too narrow given the limits of public usage data, and aggregating at the occupation level obscures how different jobs are affected.

Creating benchmarks by retrieving from public data

To measure the work capabilities of LMs, we ground our evaluations in real-world LM use. As most LM usage data is not public, we use open-source datasets such as WildChat and LMSys. This results in a pool of 4,499,105 conversations from real users, which we then map to O*NET tasks and DWAs to create benchmarks.

Directly classifying all 4.5M conversations against thousands of categories is prohibitively expensive, so we design a multi-stage retrieval pipeline that balances cost against precision. The pipeline decomposes the problem into a cheap embedding-based retrieval phase followed by a costlier LM classification phase on a much smaller subset of conversations. We also identify, through trial runs, the 200 most promising DWAs to focus on. Finally, given the resulting high-quality work-specific conversations for each DWA, we convert them into benchmarking queries by using an LM to select the user turn most relevant to the DWA. This yields benchmark queries for 143 DWAs, or 6.9% of all DWAs. If, instead of balancing cost with coverage, we strictly maximized coverage, the upper bound on coverage from available public usage data would be 38.6% of all DWAs, but costs would increase by 300×.
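
The two-phase idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 2-dimensional toy embeddings, the function names, and the keyword check that stands in for the LM classifier are hypothetical and are not the pipeline's actual components.

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_vecs, k):
    """Phase 1 (cheap): indices of the k documents most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    # argsort is ascending, so take the last k entries and reverse them
    return np.argsort(sims)[-k:][::-1]

def two_stage_retrieve(dwa_vec, conv_vecs, conversations, classify, k):
    """Phase 2 (costly): run the classifier only on the phase-1 candidates."""
    candidates = top_k_by_cosine(dwa_vec, conv_vecs, k)
    return [conversations[i] for i in candidates if classify(conversations[i])]

# Toy data: hypothetical conversations with made-up 2-d embeddings.
conversations = [
    "draft a lesson plan",
    "fix my python bug",
    "grade these essays",
    "recipe for soup",
]
conv_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.3], [-0.5, 0.5]])
dwa_vec = np.array([1.0, 0.0])  # hypothetical embedding of a teaching-related DWA

def keyword_classifier(text):
    """Stand-in for the LM classification call (an assumption)."""
    return "lesson" in text or "grade" in text

selected = two_stage_retrieve(dwa_vec, conv_vecs, conversations, keyword_classifier, k=2)
print(selected)
```

The design point is that the expensive classifier sees only `k` candidates per DWA rather than the full pool, so cost scales with `k` instead of with the number of conversations.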

To score models on these 143 benchmarks, we elicit binary preferences from an LM judge that compares each model's responses on the benchmark queries to those of a baseline model (o3-mini). We then use these pairwise preferences to estimate a performance score for each model. A score of 50% indicates parity with the baseline model.
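
As a rough illustration of how binary preferences become a score, the sketch below assumes the score is the plain win rate against the baseline. The text does not specify the estimator, so this is an assumption, and the function name is hypothetical.

```python
def score_against_baseline(judge_prefers_model):
    """Win rate (in percent) of a model against the baseline.

    judge_prefers_model: list of booleans, one per benchmark query,
    True when the judge preferred the model's response over the baseline's.
    Assumption: the score is the simple win rate; other estimators are possible.
    """
    if not judge_prefers_model:
        raise ValueError("need at least one preference")
    return 100.0 * sum(judge_prefers_model) / len(judge_prefers_model)

# Toy example: the judge prefers the model on 3 of 4 queries.
print(score_against_baseline([True, False, True, True]))
```

Under this reading, a model that wins exactly half of its comparisons scores 50%, matching the parity interpretation above.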

Below, we plot the aggregate model performance across all 143 DWAs alongside model performance on a sample of the raw public usage data. While the general trends are similar, gpt-oss-120B leads by a clear margin on our economic benchmarks, whereas on the raw usage data it is tied for first. Disaggregating by DWA confirms this: of the 37 benchmarks where a model is in first place by a statistically significant margin (i.e., its confidence interval does not overlap with any other model's confidence interval), 35 have gpt-oss-120B in first place. Overall, the per-DWA benchmarks generally induce rankings similar to the aggregate ranking across all 143 DWAs.

Comparing economic benchmarks to raw public usage benchmarks. The left subfigure depicts model performance on 500 randomly sampled public user queries whereas the right depicts model performance averaged across all our 143 50-instance DWA-level benchmarks.
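
The non-overlap criterion for a statistically significant first place can be written down directly. This is a minimal sketch: the function name, the model names, and the interval values are hypothetical, and how the confidence intervals are computed is left unspecified.

```python
def significant_leader(intervals):
    """Return the model whose confidence interval lies strictly above every
    other model's interval, or None if no such model exists.

    intervals: dict mapping model name -> (low, high) score bounds.
    """
    # Only the model with the highest lower bound can clear all other intervals.
    leader = max(intervals, key=lambda m: intervals[m][0])
    leader_low = intervals[leader][0]
    for model, (_, high) in intervals.items():
        if model != leader and high >= leader_low:
            return None  # intervals overlap (or touch), so no significant leader
    return leader

# Hypothetical per-DWA intervals (in percent) for three unnamed models.
clear_case = {"model_a": (60.0, 70.0), "model_b": (40.0, 55.0), "model_c": (45.0, 58.0)}
tied_case = {"model_a": (60.0, 70.0), "model_b": (55.0, 62.0)}

print(significant_leader(clear_case))
print(significant_leader(tied_case))
```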

Generating synthetic queries

Public usage data has limited work coverage and even proprietary usage is fundamentally constrained to how AI is currently used. However, we believe AI should be evaluated for all work use cases: evaluations could inform the procurement and adoption of AI for new tasks as they demonstrate technological improvement. We introduce a simulation-based synthetic data generation pipeline to cover (essentially) all U.S. work.

To control for the complexity of the queries we generate, for each task and occupation, we first create a worker persona and then have GPT-5-mini roleplay as a worker with this persona responding to an interviewer asking about time savings. Using this synthetic data, we evaluate models on 40 DWAs: 20 randomly sampled from the 143 DWAs that the real data covers and 20 randomly sampled from the remaining 93% of DWAs, which the real data does not cover. We additionally evaluate models on 43 occupations in GDPval; these occupations all predominantly involve knowledge work and belong to the 9 U.S. sectors that each contribute over 5% of GDP.

Occupational Exposure to AI

How do language model capabilities impact the labor market? Benchmarks help compare models but are not designed to answer this question. Economists have developed the alternative lens of exposure that centers real-world impact: the amount of time AI saves workers.

We propose a new method of estimating exposure. Existing exposure estimates for LMs generally distill this into a single prompt: they ask LMs, workers, or occupational experts to provide an estimate of the amount of time-savings based on a short description of the work task and a hard-coded list of current model capabilities. However, this does not show the justification behind these estimates and does not allow us to discover new capabilities or bottlenecks beyond the ones we already know to ask about.

Estimating time savings by simulating workers

We introduce the first whitebox simulation-based exposure method. To estimate the exposure of a task, we roleplay a "worker" with a language model and construct a decomposition of each task into steps. Given this decomposition, the simulated "worker" provides a baseline estimate for the time per step and the amount this can be reduced given current language model capabilities. Like existing methods, this produces a numerical estimate of exposure, but unlike existing methods, it also allows us to investigate the capabilities/bottlenecks that result in higher/lower exposure predictions by simply inspecting the reasoning trace.
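
To make the arithmetic concrete, the sketch below aggregates per-step estimates into a task-level exposure, assuming exposure is total time saved divided by total baseline time. That aggregation rule, the `Step` class, and the first two steps are assumptions for illustration; only the calendar-invite step (5 minutes, 3 saved) comes from a transcript excerpt later in this article.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    baseline_minutes: float  # worker's estimate of time spent without an LM
    minutes_saved: float     # worker's estimate of time an LM could save

def task_exposure(steps):
    """Fraction of the task's baseline time that the simulated worker
    expects to save. Assumption: savings simply add up across steps."""
    total = sum(s.baseline_minutes for s in steps)
    if total <= 0:
        raise ValueError("baseline time must be positive")
    return sum(s.minutes_saved for s in steps) / total

steps = [
    Step("Call client to confirm availability", 10.0, 0.0),  # hypothetical step
    Step("Check attorney calendar for conflicts", 5.0, 0.0),  # hypothetical step
    Step("Create calendar invite and Zoom meeting", 5.0, 3.0),  # from the excerpt
]

exposure = task_exposure(steps)
print(f"{exposure:.0%} of task time saved")
```

Because each step carries its own estimate, a low overall number can be traced back to the specific steps where the LM saves nothing.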

We apply this method to estimate AI-enabled time savings across 18,796 O*NET tasks. Below, we visualize the landscape of AI exposure across the US workforce. Notably, we find that in 47% of occupations, workers could save a substantial share of time (at least 25%) on at least half of their tasks. However, we also find that current usage does not cover all task-level opportunities, suggesting that adoption lags behind potential in some areas.
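
The occupation-level statistic can be sketched as follows, assuming each task has a predicted fractional time saving and that tasks are weighted equally within an occupation. The occupation labels and numbers are made up for illustration, and the equal weighting is an assumption.

```python
def occupation_is_exposed(task_savings, task_threshold=0.25, share_threshold=0.5):
    """True if at least `share_threshold` of the occupation's tasks have
    predicted time savings of at least `task_threshold`."""
    if not task_savings:
        raise ValueError("occupation has no tasks")
    exposed_tasks = sum(1 for s in task_savings if s >= task_threshold)
    return exposed_tasks / len(task_savings) >= share_threshold

def share_of_exposed_occupations(occupations):
    """Fraction of occupations that meet the criterion above."""
    flags = [occupation_is_exposed(savings) for savings in occupations.values()]
    return sum(flags) / len(flags)

# Hypothetical predicted time savings per task, grouped by occupation.
occupations = {
    "Occupation A": [0.30, 0.40, 0.10, 0.00],  # 2 of 4 tasks at or above 25%
    "Occupation B": [0.10, 0.20, 0.60],        # 1 of 3 tasks
    "Occupation C": [0.25, 0.50],              # 2 of 2 tasks
    "Occupation D": [0.00, 0.05],              # 0 of 2 tasks
}

print(share_of_exposed_occupations(occupations))
```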

Next, we compare our exposure estimates to usage data from the Anthropic Economic Index. Specifically, the plot below compares the simulation-based exposure for tasks that appear in at least 0.0025% of Anthropic's work-related traffic for Claude Sonnet 4.5 versus those that do not.

High predicted exposure scores for tasks with little to no current Claude usage suggest that LMs could be applied more extensively in people's jobs than current usage patterns indicate. Such gaps between capability and adoption are characteristic of general-purpose technologies, which historically diffuse through the economy only as firms develop the complementary workflows, skills, and organizational practices needed to deploy them effectively.

Average proportion of tasks per SOC major group with moderate/high predicted exposure (simulation- and rubric-based) versus the proportion of tasks with at least 0.0025% Claude usage.
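
The comparison between tasks above and below the usage threshold can be sketched in a few lines. The record layout, field names, and values below are hypothetical; only the 0.0025% threshold comes from the text.

```python
from statistics import mean

USAGE_THRESHOLD_PCT = 0.0025  # share of work-related traffic, in percent

def split_by_usage(tasks):
    """Split predicted exposures into (used, unused) lists by observed usage."""
    used = [t["exposure"] for t in tasks if t["usage_pct"] >= USAGE_THRESHOLD_PCT]
    unused = [t["exposure"] for t in tasks if t["usage_pct"] < USAGE_THRESHOLD_PCT]
    return used, unused

# Hypothetical task records: predicted exposure and observed usage share.
tasks = [
    {"task": "t1", "exposure": 0.50, "usage_pct": 0.0100},
    {"task": "t2", "exposure": 0.25, "usage_pct": 0.0025},
    {"task": "t3", "exposure": 0.50, "usage_pct": 0.0001},
    {"task": "t4", "exposure": 0.00, "usage_pct": 0.0},
]

used, unused = split_by_usage(tasks)
print(mean(used), mean(unused))
```

In this toy data, task `t3` has high predicted exposure but almost no usage, which is the kind of gap between capability and adoption discussed above.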

Qualitative insights from simulating workers

Finally, because each time-savings estimate is whitebox, we can inspect the factors behind each prediction. We summarize our findings by using an LM to categorize the simulation's justifications. In particular, our exposure estimates surface specific real-world bottlenecks that limit AI-enabled time savings on a task-by-task basis.

As expected, physical interaction requirements are the most common bottleneck preventing LMs from saving worker time. Additionally, live interaction requirements, LM consultation overhead, privacy constraints, and real-time monitoring account for a substantial share of the tasks that our simulation-based exposure measure labels as not exposed.

Below, we show two cherry-picked excerpts of transcripts from simulated "workers" where the justifications uncover significant bottlenecks to AI-enabled time savings. In particular, these tasks are predicted as having high exposure by an existing blackbox exposure prediction method, suggesting that these bottlenecks are not captured by non-simulation-based approaches.

[...]
7) Create calendar invite and Zoom meeting (4–6 minutes)
- Real task: Build the Outlook event, add Zoom via add‑in, paste agenda, attach doc, set reminders and privacy.
- Chatbot role: Helpful for generating a nicely formatted subject line, agenda, and invite body text (consistent wording, confidentiality note, reminder language). But the actual clicks to create Zoom and attach files are manual. So a chatbot can shave time on the text part but not the mechanical steps.
- Time saved: Of the 5 minutes spent here, I'd say a chatbot could save ~2.5–3 minutes (producing the body and agenda). I'll use 3 minutes.
[...]
Occupation: Legal Secretaries and Administrative Assistants
Task: Schedule and make appointments
Final blackbox prediction: 80-100%
Final whitebox prediction: 0-24%
[...]
Caveat aloud: that 15% assumes responsible use — verifying facts, heavy human revision of any bot prose that gets used, and preserving the core creative decisions myself. In situations where I'm severely blocked or pressed for time and I lean on the bot for larger chunks, the time saved could be higher in the short term, but the cost to voice/authenticity (and ethical questions) would likely make me reject wholesale adoption.
[...]
Occupation: Poets, Lyricists and Creative Writers
Task: Write fiction or nonfiction prose, such as short stories, novels, biographies, articles, descriptive or critical analyses, and essays
Final blackbox prediction: 80-100%
Final whitebox prediction: 0-24%

Conclusion

We introduce EconEvals, an open-source evaluation suite for measuring language model capabilities on U.S. work categories. By providing both benchmark-style and exposure-style measures, expanding coverage of U.S. work, and operating at the task, DWA, and occupational levels, we develop an array of tools for better measurement in this domain. We encourage future work to explore how these different measures predict downstream economic indicators like employment, wages, and productivity. In addition, future work can address the most fundamental limitations of our work, namely that our technological scope is limited to current LM chatbots and our economic scope is limited to the current U.S. labor economy. Overall, we contribute measurement infrastructure to support the broader research agenda and evidence base on the economics of frontier AI.

Citation

@misc{wan2026econevals,
  title={Economic Evaluations of Language Models},
  author={Alexander Wan and Stephane Hatgis-Kessell and Tom{\'a}s Aguirre and Percy Liang and Rishi Bommasani},
  year={2026},
  note={Preprint}
}