Studying people and computers (https://www.nickmvincent.com/) Blogging about data and steering AI (https://dataleverage.substack.com/)
Why the AI evaluation crisis could force a reckoning on dataset provenance, attribution, and consent.
A proposal for interoperable attestation objects that connect training data, evaluation labor, and AI-generated outputs across the AI supply chain.
Reacting to a wide-ranging set of policy ideas from OpenAI.
AI progress means the "polish" of a figure or website no longer proxies for quality. Can we try to turn this into a good thing for curation, attention allocation, and even AI progress itself?
Making an "if you like X, you might want to support Y" argument for data-focused policy
Back to the basics of data leverage.
How we can understand, and react to, the complicated impacts of AI systems on online communities and knowledge commons
On user data control, coding agents as retrievers, and the value of your coding transcripts
Sharing an early reaction to recent coding agent discourse and two relevant projects
Discussion for the data-leverage-blogs project
Another recap post for the Data Leverage newsletter! (and a test of using Leaflet for blogging)
In fact, anyone who doesn't think they will be a "big winner" long term benefits from clear rules, even if it means training data costs more in the short term.
Another recap post for the Data Leverage newsletter!
New model releases keep (re)sparking discussions about training data. What can we assume is upstream in the data river, and what do we want to see happen?
This post was written by Aditya Karan, with support from Nick Vincent and Karrie Karahalios to accompany a FAccT 2025 paper. It was originally published on Jun 19, 2025 via the Crowd Dynamics Lab blog…
Reacting to a fresh wave of discussion about AI's impact on the economy and power concentration, and reiterating the potential role of collective bargaining.
Connecting evaluation and dataset documentation via the lens of "AI as ranking".
It's ranking information all the way down.
Google and others solve our attentional problem by ranking discrete bundles of information, whereas ChatGPT ranks more granular chunks. This lens can help us reason about AI policy.
Commenting on recent coverage of, and discussion about, Meta's arguments about training data value quantification.
A consortium of Public AI labs can substantially improve data pricing, which may also help to concretize debates about the ethics and legality of training practices.
Research agents and increasingly general reasoning models open the door for immense "evaluation data leverage".
Our AI design choices in 2024 could preclude "Powerful AI" in 2030.
There's still incredible tension in the current data paradigm, but sharing "data protection" technologies, like those used by OpenAI to accuse DeepSeek of model theft, can help cut a path forward.
There's deep tension in the current ask-for-forgiveness, free-for-all approach to acquiring data for model training. Will "open" models cause this tension to reach a breaking point?
The race to produce premier AI products with high price tags might change the standards around data disclosure.
The idea that data-dependent AI systems are ready and willing to crush any leverage from knowledge workers is unlikely to make the AI industry look good to the public.
Examining the Meta CEO's claim that the "individual work of most creators isn’t valuable enough for it to matter" in the context of AI training.
Interacting with many models and harnessing the power of `diff`
Focusing on feedback loops -- connecting modern AI to early cybernetics-style thinking -- could help solve looming challenges and support democratic inputs to AI.
How can we start thinking about how opt-out decisions by content-producing organizations will affect LLMs?
The New York Times is trying to remove its content from OpenAI models, surfacing tensions around copyright, economic harms, privacy, and the distribution of AI benefits.
Could Upcoming Data Legislation Enable a "Right to Data Strike"?
Once again, we’ve had an eventful few weeks in the space of data-dependent computing!
The Last Three Months in Review: What's New and What's Next
The plants in the Gardens by the Bay evoke a sense of flourishing-by-design; photo by Victor from Unsplash.
Measuring the Alignment of AI Systems Based on their Data Pipelines
Much of my work is in pursuit of “data dignity”, an idea that stems in part from scholars arguing that we should sometimes think of “data as labor”.
The public debate over AI has seriously heated up in the wake of new advances in the design and deployment of large generative AI models.
More on why you're an expert language model trainer
Background
The much-celebrated GPT-3 that can answer questions, write poems, and more wouldn’t be possible without content written by millions of people around the world. Shouldn’t they get some credit?