Corporate America's new data gold rush
AI's next breakthrough won't come from scraping the web. Companies are racing to unlock new sources of training data, from personal data to drone imagery and corporate archives

Photo by Sean Gallup/Getty Images
The era of free AI training data is over. Reddit $RDDT charges millions for API access. The New York Times sued OpenAI. Publishers are blocking scrapers. Even if AI companies could still vacuum up the public internet, they're running into a bigger problem: the next leap in capability requires entirely different kinds of data.
Large language models were built by scraping text and images from the web. But as AI systems move beyond chatbots, they need training data that was never publicly available in the first place. Data that's locked away, or scattered, or doesn't even exist yet.
New markets are emerging to unlock these sources. Here are three.
Your digital exhaust, monetized
Most people think of personal data as Social Security numbers and health records. But nearly everything you do online generates data that platforms collect and use — your Spotify $SPOT listening history, your email patterns, the documents you write in Google $GOOGL Docs, your conversations with ChatGPT.
When you download your Instagram data, for example, the company doesn't just give you your photos. You get everything Instagram has inferred about you based on your browsing behavior: hundreds of data points ranging from innocuous labels like "interested in nature" to psychological assessments like whether you have depression.
None of it is publicly scrapeable. All of it is legally yours.
"If you park your car in a parking lot, the parking lot doesn't own your car," says Anna Kazlauskas, CEO of Vana, a company building infrastructure for individuals to contribute their platform data to AI training. The same principle applies to data: you own it, even if it lives on someone else's server.
The scale is massive. A version of Common Crawl, the dataset used to train Meta $META's Llama 3, contains about 15 trillion tokens scraped from the public internet. If 100 million people each contributed data exports from just five platforms, that would yield 450 trillion tokens, roughly 30 times the size of any existing dataset.
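For the back-of-the-envelope math, here is a rough sketch in Python using the article's own figures; the roughly 900,000 tokens per export is implied by those totals, not a number Vana has published.

```python
# Back-of-the-envelope check of the numbers above. The totals come from the
# article; the per-export token count is back-calculated, not a published figure.
common_crawl_tokens = 15e12      # ~15 trillion tokens, the Llama 3-scale dataset
contributors = 100e6             # 100 million people
platforms_each = 5               # data exports from five platforms per person
claimed_total = 450e12           # 450 trillion tokens

exports = contributors * platforms_each           # 500 million exports
tokens_per_export = claimed_total / exports       # ~900,000 tokens per export
multiple = claimed_total / common_crawl_tokens    # 30x the web-scraped dataset

print(f"{tokens_per_export:,.0f} tokens per export, {multiple:.0f}x Common Crawl")
```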
This type of data could unlock personalized AI that understands your music taste, or health models trained on real sleep and fitness data, neither of which can be built from scraped web content. Kazlauskas says that paying people for data only they can provide could also reshape the broader AI debate.
"A lot of the fear around AI comes from the lack of proper attribution and economics," Kazlauskas says. "If you teach AI how to do your job, you should actually own that AI model."
Mapping the physical world
Text models could be trained on scraped web data. But the next generation of AI needs accurate, consistent information about the physical world. Robots navigating cities, autonomous vehicles, and augmented reality systems all need high-fidelity digital maps to ground their decisions in.
The problem is that existing aerial data is fragmented. It comes from various contractors using different sensors at different accuracies, which makes it nearly impossible to train reliable geospatial models. Satellite imagery covers most of the planet but lacks the resolution these models require. The data layer AI companies need simply doesn't exist yet.
Spexi is trying to build it using gig workers and drones. The company has more than 10,000 pilots fly standardized missions at an altitude of 80 meters. In the past 18 months, they've covered more than 6 million acres across 300 North American cities at higher resolution than satellites or traditional aerial imagery, says Bill Lakeland, Spexi's CEO.
Spexi is working with companies like Niantic to train large geospatial models for augmented reality and robotics. Unlike language models trained on a static snapshot of the web, these need constant updating as buildings rise and roads change. It's a version of the same problem plaguing ChatGPT and other LLMs: how to keep a model current without retraining it from scratch. Lakeland's team is working on algorithms to predict when and where updates are needed, but it remains an unsolved research challenge.
Big data's second chance
One of the world's largest PC manufacturers had been collecting telemetry data for seven years. No one had looked at it. When Sachin Dharmapurikar's team at The Modern Data Company finally analyzed it, they discovered that two of its 70 fields had been collected incorrectly the entire time.
Dharmapurikar's company helps enterprises transform legacy data into structured, contextualized datasets designed for specific business questions rather than general storage. A decade ago, companies began tracking everything and storing it in the cloud, assuming that collecting data would eventually yield insights. Instead, it created expensive, siloed, unmanaged data landscapes.
When ChatGPT exploded in popularity, many executives thought they'd finally found an easy solution. Just feed all that stored data into an LLM and watch the magic happen. Dharmapurikar calls this the "ChatGPT curse."
The reality is more complex. Companies need four things: data quality at scale, the ability to trace lineage and explain how conclusions were reached, governance to prevent AI hallucination, and semantic metadata that contextualizes data in business terms. The lifetime value of a retail customer, for example, is defined differently from that of an enterprise customer. Without that context, models will infer incorrectly.
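As a hypothetical illustration of what that semantic layer might look like, here is a sketch in Python; the field names and structure are invented for this example, not The Modern Data Company's actual schema.

```python
# Hypothetical semantic-metadata entries for "customer lifetime value."
# Everything here is illustrative; no vendor's real schema is implied.
semantic_metadata = {
    "retail_customer_ltv": {
        "definition": "Projected revenue from an individual shopper over three years",
        "source_tables": ["pos_transactions", "ecommerce_orders"],
        "owner": "retail_analytics",
        "lineage": "pos_transactions -> daily_rollup -> ltv_model_retail",
    },
    "enterprise_customer_ltv": {
        "definition": "Projected contract value over an account's lifetime, including renewals",
        "source_tables": ["crm_accounts", "contracts"],
        "owner": "b2b_revenue_ops",
        "lineage": "crm_accounts -> contract_history -> ltv_model_enterprise",
    },
}

# A model asked about "lifetime value" can check the definition and lineage
# before blending two metrics that only sound alike.
```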
Even when the data exists, it's often trapped. Sales, manufacturing, and web teams collect data in silos, and moving it between departments means wading through layers of bureaucracy. AI needs information from across an organization, but the reality is fragmented systems that don't talk to each other.
Dharmapurikar says the industry is finally getting realistic. "People are now more calculated, more rational and pragmatic about this stuff," he says. "The reality is kicking in big time that there is no easy solution."