The Annotation Underground: Who Trains AI for So Little

Every frontier model you have used (GPT-5, Gemini, Claude) learned to be useful in part from people who were paid less than two dollars an hour to teach it. The fluency that reads as machine intelligence is, underneath, a record of human judgment: which answer is better, which image is safe, which sentence is toxic, which label is correct. That judgment was supplied at scale by a workforce spread across Kenya, the Philippines, Venezuela, Colombia, and India, hired through layers of subcontractors, and chosen in large part because the countries they live in have labor protections weak enough to make the arrangement cheap.

This is not a secret, exactly. It has been reported in pieces, by TIME, the Washington Post, MIT Technology Review, the Guardian, and Rest of World, usually one company or one country at a time. What follows is the whole structure assembled in one place: who does the work, who pays for it, what the work pays, what it does to the people who do it, and what happens when they organize. The argument is simple. The annotation layer is not a footnote to the AI industry. It is the foundation the entire industry stands on, and the industry has built that foundation in the places where it is cheapest to ignore the people holding it up.

What the work actually is

Data annotation covers several distinct jobs that get grouped under one quiet label. The first is supervised labeling: drawing boxes around pedestrians for self-driving systems, transcribing audio, tagging objects, marking which parts of an image contain a face. The second is preference ranking, the human core of reinforcement learning from human feedback, or RLHF. A worker sees two model outputs and picks the better one, thousands of times, and those choices become the reward signal that shapes how a model talks. The third is red-teaming and safety evaluation: deliberately trying to make a model produce harmful content so its guardrails can be trained against real attacks. The fourth, adjacent and often conflated, is content moderation, reviewing the worst material on the internet so it can be filtered out of training sets and platforms.

These jobs sit on a spectrum of harm. Bounding boxes are tedious. Preference ranking is cognitively demanding and poorly paid. Red-teaming and moderation expose workers to descriptions and images of child sexual abuse, torture, bestiality, and graphic violence, day after day, as a job function. The same supply chain delivers all of it, and the same wage logic applies across the spectrum.

How a two-dollar judgment becomes a model’s reward

Preference ranking is load-bearing rather than cosmetic because of how the training works. In RLHF, labelers compare two model outputs and pick the better one. Those pairwise choices train a separate reward model, a network whose only job is to predict which response a human would prefer. The reward model is, in effect, a compression of tens of thousands of human judgments into a single scoring function. The main model is then optimized, typically with a method like PPO, to produce outputs the reward model scores highly. A simpler alternative, DPO, skips the separate reward model and trains the main model directly on the same preference pairs, so the human judgments shape the weights either way. The human is not present at inference time. The human is baked into the training signal that shaped the weights.
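The step from pairwise choices to a scoring function can be sketched in a few lines. This is a minimal illustration, not any lab's actual pipeline. It assumes a Bradley-Terry model, in which each response has one scalar score and the probability that a labeler prefers one response over another is a sigmoid of the score difference. The function name fit_scores and the toy comparisons are hypothetical, and a real reward model is a neural network that scores text rather than a lookup table of three numbers.

```python
import math

def fit_scores(comparisons, n_items, lr=0.1, epochs=500):
    """Fit one scalar score per item from (winner, loser) index pairs.

    Assumes a Bradley-Terry model: P(winner preferred) = sigmoid(s_w - s_l).
    Plain gradient ascent on the log-likelihood; illustrative only.
    """
    scores = [0.0] * n_items
    for _ in range(epochs):
        grads = [0.0] * n_items
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(scores[winner] - scores[loser])))
            # Gradient of log(p_win) with respect to each of the two scores.
            grads[winner] += 1.0 - p_win
            grads[loser] -= 1.0 - p_win
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores

# Hypothetical toy data: three responses, with labelers mostly (not always)
# preferring response 0 over 1, and 1 over 2.
comparisons = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1), (0, 2), (0, 2)]
scores = fit_scores(comparisons, n_items=3)
ranking = sorted(range(3), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Every number the fitted scores contain comes from the comparison list. Change a handful of those pairs and the scores, and anything optimized against them, move with them.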

That architecture makes label quality decisive. If the comparisons are rushed, inconsistent, or made by workers penalized for taking too long, the reward model learns a noisy approximation of human preference, and the policy optimizes toward that noisy target. Models are also skilled at reward hacking, finding outputs that score well without being good, and a weakly calibrated reward model is exactly what gets exploited. The fluency a user experiences is downstream of whether a labeler in Nairobi or Manila had the time and the working conditions to make a careful call.
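A rough way to see why rushed labels matter is to treat a fraction of them as recorded wrong. The sketch below assumes the simplest possible noise model, in which each label is flipped independently with a fixed probability. That is a simplification, since real labeling errors are correlated with task difficulty and fatigue. The function names and the 90 percent figure are hypothetical.

```python
def observed_preference(true_pref, flip_rate):
    """Probability that the recorded label favors response A.

    true_pref: probability a careful labeler favors A (hypothetical input).
    flip_rate: fraction of labels recorded wrong, assumed flipped at random.
    """
    return true_pref * (1.0 - flip_rate) + (1.0 - true_pref) * flip_rate

def margin(pref):
    """Distance from a coin flip, scaled so 0 means no signal and 1 means unanimous."""
    return 2.0 * pref - 1.0

true_pref = 0.9  # hypothetical: careful labelers prefer A nine times in ten
for flip_rate in (0.0, 0.1, 0.25):
    seen = observed_preference(true_pref, flip_rate)
    print(f"flip_rate={flip_rate:.2f}  observed={seen:.2f}  margin={margin(seen):.2f}")
```

Under this assumption the margin is scaled by (1 - 2 × flip_rate), so a 10 percent error rate removes a fifth of the signal and a 50 percent error rate removes all of it. The reward model then needs proportionally more comparisons to recover the same preference.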

Red-teaming carries the same dependence. Safety guardrails are trained against attacks that humans actually discovered. A model’s refusals are only as broad as the jailbreaks its red team thought to attempt, which means the skill and breadth of low-paid safety workers sets the ceiling on how well a model resists misuse once it ships.

The subcontractor chain, by company

The structure that matters is the distance between the AI company and the worker. A frontier lab rarely employs labelers directly. It contracts a data vendor, which operates a platform, which recruits through regional intermediaries, which pay the worker. Each layer takes a margin, and each layer adds deniability. Here is the documented chain.

OpenAI and Sama. In January 2023, TIME documented that OpenAI used the San Francisco firm Sama to label toxic content for a system designed to detect it. Sama employed workers in Nairobi who were paid take-home wages between roughly 1.32 and 2 dollars an hour to read and tag descriptions of child sexual abuse, murder, suicide, and torture. One worker described the job to TIME as torture. OpenAI signed three contracts with Sama worth about 200,000 dollars in late 2021. Sama canceled the work early, in February 2022. TIME also reported that OpenAI paid Sama 787.50 dollars for a batch of 1,400 images that included categories Sama labeled C4 and C3, denoting sexual abuse of children and material depicting rape.

Scale AI and Remotasks. Scale AI is the largest pure-play data vendor, serving OpenAI, Meta, Microsoft, Google, and the Pentagon. Its founder, Alexandr Wang, became the youngest self-made billionaire on the strength of the business. Scale operates a crowdwork platform, Remotasks, that the Washington Post investigated in 2023. In the Philippines, more than 10,000 workers labeled data, and the Post documented payments that were delayed or withheld and tasks that paid pennies. When Scale expanded recruitment into India and Venezuela, the resulting competition drove some task rates toward a single cent. In March 2024, Rest of World reported that Scale abruptly shut down Remotasks operations in Kenya, Nigeria, and Pakistan, notifying workers by email hours before cutting them off. Workers said it felt like the rug was pulled from under them.

Venezuela, the template. The clearest illustration of how the labor market works comes from Venezuela. MIT Technology Review reported that by mid-2018, around 200,000 Venezuelans had registered on the platforms Hive Micro and Spare5, making up roughly three quarters of those platforms’ workforces. The country’s economic collapse had stranded educated people at home with few options, and crowdwork filled the gap. Families took turns on a single shared computer. Scale built a Venezuela-specific Remotasks landing page in 2020, framed as helping Venezuelans through hardship. Then the platforms expanded into India and other lower-cost markets, and pay for Venezuelan workers fell. The pattern is the point: recruit where desperation is high, then add cheaper supply elsewhere to push rates down.

Google, Appen, and Accenture. Google’s labeling sits one layer up the wage scale and is still contested. After years of pressure and lobbying by the Alphabet Workers Union, search quality raters working through Appen won a raise to 14 dollars an hour. Roughly a year later, in 2024, Google ended its 82.8 million dollar contract with Appen. The Guardian reported that raters who train Gemini through Accenture work for around 14 dollars an hour under what workers described as grueling deadlines and burnout. Google’s DeepMind published best practices for data enrichment in 2022, but the conditions reported through its contractors tell a different story from the one in the document.

Meta and Sama. Meta used Sama for content moderation in Nairobi from 2019 to 2023, with reported pay between 1.46 and 3.74 dollars an hour. The human cost is documented in clinical terms. In December 2024, Dr. Ian Kanyanya of Kenyatta National Hospital assessed 144 former Sama moderators and found that 81 percent met the criteria for severe post-traumatic stress disorder. In 2026, Swedish outlets SvD and GP reported that Sama in Kenya was annotating data for Meta’s smart glasses, extending the same arrangement into new hardware.

Anthropic and Claude, because skipping it would be dishonest

An investigation that names every company except the one that made the model writing the investigation is not journalism. So: Claude is trained, in part, on human feedback supplied through data vendors, and Surge AI is one of them. Reporting by Reuters and Surge’s own customer list place Anthropic among its clients, alongside OpenAI, Meta, Microsoft, and Google. What these vendors supply is RLHF preference ranking, red-teaming, and domain-expert evaluation, the human feedback that shapes whether a model like Claude comes across as helpful, harmless, and honest.

Surge AI is the subject of a class action filed in May 2025 by the Clarkson Law Firm in California Superior Court in San Francisco. The named plaintiff, Dominique DonJuan Cavalier II, alleges that Surge misclassified workers as independent contractors, required unpaid training, imposed time limits that docked pay when exceeded, and failed to pay minimum wage or keep proper payroll records. Court filings and reporting by Reuters put Surge at roughly 50,000 contractors globally and more than a billion dollars in annual revenue drawn from about a dozen customers, with workers performing ranking, red-teaming, and safety evaluation for Anthropic, OpenAI, and Meta. The firm’s partner Glenn Danas framed it as an industry being built on the backs of the workers who train the models, while the multi-billion-dollar companies on top put the technology ahead of their workers’ livelihoods. The allegations have not been proven in court, and Surge disputes the framing.

The deeper point cuts against a common assumption. Anthropic’s research direction, Constitutional AI and reinforcement learning from AI feedback, or RLAIF, is often read as a way to remove humans from the loop by having a model critique its own outputs against a written set of principles. It reduces the volume of human labeling required. It does not eliminate it. Anthropic’s own work on RLHF acknowledges that human annotators disagree with one another on a sizable share of preference comparisons, which means human judgment, noisy as it is, remains the anchor the automated feedback is calibrated against. Synthetic feedback rides on top of human feedback. It does not replace the floor.
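Annotator disagreement is measurable, and the usual first check is simple. The sketch below computes raw agreement and Cohen's kappa, which corrects raw agreement for what two annotators would match on by chance. The label lists are hypothetical, and the code is a generic illustration rather than a description of how Anthropic or any vendor measures agreement.

```python
from collections import Counter

def agreement_rate(labels_a, labels_b):
    """Share of items on which two annotators chose the same label."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def cohens_kappa(labels_a, labels_b):
    """Agreement corrected for chance: (p_observed - p_chance) / (1 - p_chance).

    Undefined (division by zero) when both annotators always give the
    same single label; this sketch does not handle that edge case.
    """
    n = len(labels_a)
    p_observed = agreement_rate(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical data: which of two responses each annotator preferred on ten items.
annotator_1 = ["A", "A", "B", "A", "B", "B", "A", "A", "B", "A"]
annotator_2 = ["A", "B", "B", "A", "A", "B", "A", "B", "B", "A"]

raw = agreement_rate(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"raw agreement={raw:.2f}  kappa={kappa:.2f}")
```

In this toy case the two annotators match on seven items out of ten, but about half of that would be expected by chance, which is why the chance-corrected figure is much lower than the raw one.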

Why the geography is the mechanism

The wage numbers are not an accident of where talent happens to be. They are the product of deliberate site selection. The work is digital and can be done anywhere with a connection, so vendors route it to wherever labor is cheapest and protections are thinnest, then add new low-cost regions whenever rates threaten to rise. This is the same logic that drove physical manufacturing offshore in the twentieth century, applied to cognitive work that leaves no factory and no obvious trace.

The consequence is a built-in race to the bottom. When Kenyan workers organized, supply moved. When Venezuelan rates were established, India was added and rates fell. The structure rewards fragmentation: keep the workforce dispersed across jurisdictions, employed through intermediaries, classified as contractors, and individually replaceable, and no single group has the bargaining power to set a price. The agentic systems and model architectures that get written up in detail, from KV-cache compression to hybrid attention, rest on a labor layer that almost never gets the same scrutiny.

The unit economics of a label

Follow a single task payment down the chain. A lab pays its data vendor a negotiated rate per labeled item or per hour of expert time. The vendor runs the platform, takes its margin, and routes the task through a regional operation that recruits and pays the worker. Each layer keeps a cut, and the worker receives what is left at the end, often a few cents per task or a low single-digit hourly rate. The gap between what the lab pays and what the worker takes home is not waste. It is the product the vendor sells, the difference between a frontier lab’s willingness to pay and a dispersed contractor workforce’s inability to negotiate.
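The arithmetic of the chain is easy to make concrete. In the sketch below every figure is an illustrative assumption, not a reported contract term: the hourly rate the lab pays, the number of layers, and the share each layer keeps are all hypothetical. The point is only how quickly stacked margins shrink what reaches the worker.

```python
def take_home(lab_rate, layer_margins):
    """Return what reaches the worker after each layer keeps its cut.

    lab_rate: what the lab pays per hour of labeling (hypothetical).
    layer_margins: fraction each intermediary keeps, in chain order.
    """
    remaining = lab_rate
    for margin in layer_margins:
        remaining *= 1.0 - margin
    return remaining

# All figures are illustrative assumptions, not reported contract terms.
lab_rate = 12.00               # dollars per hour paid by the lab
layer_margins = [0.60, 0.50]   # data vendor and platform, then regional intermediary

worker_rate = take_home(lab_rate, layer_margins)
spread = lab_rate - worker_rate
print(f"worker receives ${worker_rate:.2f}/hour, chain keeps ${spread:.2f}/hour")
```

Because the cuts multiply rather than add, two layers that keep 60 and 50 percent leave the worker with a fifth of what the lab paid. Adding a third intermediary at any margin pushes the take-home lower still.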

At frontier scale that spread compounds. Training and aligning a single large model can consume millions of individual human judgments, and the major labs run this process continuously rather than once. A vendor reported to draw over a billion dollars a year from roughly a dozen customers is monetizing exactly this arithmetic. The numbers only work because the bottom of the chain is kept cheap, replaceable, and far from the jurisdictions where the value is booked. Raise the floor, through law or disclosure or organizing, and the spread that funds the model narrows.

Researchers have a name for this. The anthropologist Mary L. Gray and the computer scientist Siddharth Suri called it ghost work, labor deliberately made invisible so that software appears more automated than it is. The annotation economy is ghost work at industrial scale, and the invisibility is a feature of the product rather than an accident of reporting. A user who could see the wage chain behind a chatbot’s answer would read that answer differently, which is part of why the chain stays out of view.

The market splits in two

The vendor layer itself was reshaped in mid-2025, and the reshaping clarifies where the money goes. In June 2025, Meta paid 14.3 billion dollars for a 49 percent stake in Scale AI, and Scale’s founder Alexandr Wang left to run Meta’s superintelligence effort. Competing labs reacted the way you would expect when a rival buys their data supplier: Google and OpenAI pulled back work, and the spend flowed to Scale’s competitors, Surge among them, along with a newer firm called Mercor. One deal turned the quiet vendor layer into contested strategic ground, and it told everyone watching that frontier labs now consider the human-feedback supply chain valuable enough to buy outright.

Mercor is the clearest sign of the second development: the top of the labeling market is professionalizing fast. Founded in 2023 as a hiring platform, it pivoted to supplying AI labs with domain experts, scientists, doctors, lawyers, and former bankers who evaluate and train models in their specialties. TIME reported its advertised average rate at just over 80 dollars an hour, rising past 200 for senior domain experts, and the firm reached a 10 billion dollar valuation in October 2025 on a revenue run rate that had passed half a billion dollars. By the research firm Sacra’s estimate, it now pays its network of roughly 30,000 experts more than 2 million dollars a day. Its customer list overlaps almost exactly with Surge’s: OpenAI, Anthropic, Google, Meta, Microsoft, Amazon.

It would be convenient to read the expert tier as the industry cleaning itself up. It is not that. The two tiers serve different functions and do not compete for the same workers. Expert evaluation buys the kind of judgment that commodity labeling can no longer supply: a model that already writes competent prose gets no signal from a rushed generalist comparison, but it gets real signal from a cardiologist grading its clinical reasoning. The economics that built the two-dollar tier are untouched by the 80-dollar tier, because the bulk tasks, moderation queues, basic preference data, and safety triage still flow to wherever labor is cheapest. The market did not raise its floor. It added a penthouse. And the same critique follows the experts upward in softened form. They are still contractors, still paid per hour without benefits, and still training the systems positioned to absorb parts of their own professions. The wage is one to two orders of magnitude higher. The structure is the same structure.
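The size of the gap between the tiers can be checked from hourly figures quoted in this piece: 1.32 to 2 dollars for the Sama labelers, and roughly 80 to 200 dollars for Mercor's experts. The sketch below is plain arithmetic. The only assumption is rounding "just over 80" and "past 200" down to 80 and 200.

```python
import math

def orders_of_magnitude(high, low):
    """Base-10 logarithm of the ratio between two hourly rates."""
    return math.log10(high / low)

# Hourly rates quoted in this piece, in dollars (expert figures rounded down).
commodity_low, commodity_high = 1.32, 2.00
expert_average, expert_senior = 80.0, 200.0

narrowest = orders_of_magnitude(expert_average, commodity_high)  # ratio of 40
widest = orders_of_magnitude(expert_senior, commodity_low)       # ratio of about 152
print(f"gap spans {narrowest:.1f} to {widest:.1f} orders of magnitude")
```

Comparing the average expert rate with the top of the commodity range gives a ratio of 40. Comparing the senior expert rate with the bottom of the commodity range gives a ratio above 150.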

Why it matters, beyond ethics

There is a self-interested reason for technical readers to care about this, separate from the moral one. Annotation quality is a direct input to model quality. Underpaid workers racing against time limits, penalized for going slow, and exposed to traumatic material produce noisier labels. Burned-out raters disagree more. High turnover destroys the institutional knowledge that makes a labeling guideline produce consistent results. A model trained on degraded human feedback inherits that degradation, and no amount of architecture compensates for a corrupted reward signal.

The dependency also concentrates risk. A handful of vendors sit between the entire frontier industry and the data it needs. When one of them, like Surge, faces a labor lawsuit, or another, like Scale, exits a country overnight, the disruption flows upstream to every lab that depends on it. The supply chain that looks invisible is also fragile, and fragility in a foundational input is a business problem, not only a human one.

The organizing, and the response to it

Workers have not been passive. In 2023, around 150 moderators formed the African Content Moderators Union, the first of its kind on the continent. Meta responded, according to multiple reports, by laying off roughly 300 Kenya-based moderators. In May 2024, 97 workers signed an open letter to then-President Biden describing their conditions as modern-day slavery. In Kenya, 184 Sama moderators filed suit. The legal theory across these actions is consistent: that the work is real employment dressed as contracting, that the harm is foreseeable and uncompensated, and that the companies at the top of the chain cannot contract their way out of responsibility for conditions they set the price for.

The outcomes are unsettled. Kenyan courts have allowed some claims against foreign principals to proceed, a meaningful crack in the deniability that the subcontractor structure was built to provide. The Surge case in California tests the contractor classification directly in a jurisdiction with strong labor law. If either line of cases succeeds, the cost of the cheapest version of this work rises, and the economics that produced two-dollar wages start to change.

The honest complications

The story is not as clean as advocacy versions make it. Several complications are real and worth stating plainly.

First, local context matters. A wage that is exploitative against a San Francisco cost of living can be, in some local markets, competitive with alternatives, and some workers have said the flexibility and remote nature of the work are genuine benefits. That does not excuse the conditions, but it means the right comparison is not always the US minimum wage, and reporting that ignores this is less credible for it.

Second, the companies have responses. Sama has said it raised wages and improved conditions over time and disputes characterizations of its pay. Scale has defended its platform and its contributions to local economies. Surge disputes the lawsuit’s framing. These positions deserve to be on the record even where the documented evidence cuts against them.

Third, the trend line on human labeling is uncertain. RLAIF and synthetic data genuinely reduce the per-model demand for human annotation for some tasks. It is possible the industry’s appetite for the lowest-wage labeling shrinks over time. It is also possible that demand simply shifts toward higher-skill annotation, expert evaluation, specialized red-teaming, while the bottom of the market persists for everything synthetic methods cannot yet cover. Which of these dominates is not settled, and anyone who tells you it is, in either direction, is guessing.

Fourth, sourcing limits. Much of what is known comes from journalism and litigation, not from company disclosure, because companies do not publish their wage chains. That means parts of this picture are incomplete by design, and the absence of documentation about a given lab is not evidence that its chain is clean.

What happens next

Three forces will decide how this evolves. The first is law. The contractor-classification cases and the cross-border liability cases are the pressure points, and a win in either raises the floor industry-wide. The second is disclosure. Pressure is building for AI companies to document their data supply chains the way apparel companies were eventually pushed to document factory conditions, and a few procurement standards or regulatory requirements could force what voluntary transparency has not. The third is technical substitution: how far synthetic feedback and automated evaluation actually go in displacing the lowest-paid human work, as opposed to merely moving it.

None of these removes the underlying fact. For now, and for the foreseeable future, the models that increasingly mediate work, search, and conversation are calibrated by human beings whose names do not appear anywhere near the product, who are paid at the bottom of the global wage scale, and whose organizing is resisted by the same companies whose agentic systems and governance frameworks are debated in public. The intelligence is real. So is the labor underneath it. The industry has spent years keeping the second fact quieter than the first. It is past time the foundation got the same attention as the building.
