Nick Vincent

Studying people and computers (https://www.nickmvincent.com/). Blogging about data and steering AI (https://dataleverage.substack.com/).

375 followers·457 following·54 stories

Longform Stories

The AI "Evaluation Crisis" Is an Opportunity to Get Data Flow Right

Why the AI evaluation crisis could force a reckoning on dataset provenance, attribution, and consent.

Apr 30·15 min read·2978 words

April 2026 small points

Apr 28·1 min read·78 words

Attestation across the AI Supply Chain

A proposal for interoperable attestation objects that connect training data, evaluation labor, and AI-generated outputs across the AI supply chain.

Apr 9·37 min read·7250 words

"People First" Policy Ideas that Complement Each Other (through better data flow)

Reacting to a wide-ranging set of policy ideas from OpenAI.

Apr 6·2 min read·292 words

AI is driving the cost of polish down; some musings on fancy versus terse artifacts

AI progress means the "polish" of a figure or website no longer proxies for quality. Can we try to turn this into a good thing for curation, attention allocation, and even AI progress itself?

Apr 1·2 min read·290 words

Two natural allies of a "Data Transparency" agenda: capabilities forecasters and social simulators

Making an "if you like X, you might want to support Y" argument for data-focused policy

Mar 9·5 min read·814 words

A Short Guide to Data Strikes and Conscious Data Contribution in the Context of 2026 Frontier AI

Back to the basics of data leverage.

Mar 3·8 min read·1578 words

The Paradox of Reuse in 2026: A Case of Quasi-Enclosure, or "Subsidized Club Goods that Sort of Look Like Public Goods"

How we can understand, and react to, the complicated impacts of AI systems on online communities and knowledge commons

Feb 17·14 min read·2631 words

The Coding Agent Data Deal

On user data control, coding agents as retrievers, and the value of your coding transcripts

Jan 12·17 min read·3335 words

Coding agents are (1) a big deal, (2) very relevant to data leverage, and (3) able to help build tools that support data leverage!

Sharing an early reaction to recent coding agent discourse and two relevant projects

Jan 5·8 min read·1416 words

talks Discussions

Jan 3·1 min read·15 words

ranking-book Discussions

Jan 3·1 min read·15 words

courses Discussions

Jan 3·1 min read·15 words

cb4i Discussions

Jan 3·1 min read·15 words

paidf-mini-book Discussions

Dec 25·1 min read·15 words

data-counterfactuals Discussions

Dec 25·1 min read·15 words

personal-website Discussions

Dec 25·1 min read·15 words

shared-references Discussions

Dec 25·1 min read·15 words

data-napkin-math Discussions

Dec 25·1 min read·15 words

data-licenses Discussions

Dec 25·1 min read·15 words

data-leverage-blogs Discussions

Dec 24·1 min read·15 words

data-leverage-blogs Discussions

Discussion for the data-leverage-blogs project

Dec 24·1 min read·25 words

How collective bargaining for information, public AI, and HCI research all fit together

Another recap post for the Data Leverage newsletter! (and a test of using Leaflet for blogging)

Dec 23·2 min read·367 words

Almost Everybody -- Including Both Data Creators and AI Companies -- Stands to Benefit from Clearer "Data Rules".

In fact, anyone who doesn't think they will be a "big winner" long term benefits from clear rules, even if it means training data costs more in the short term.

Nov 26·32 min read·6242 words

How collective bargaining for information, public AI, and HCI research all fit together

Another recap post for the Data Leverage newsletter!

Oct 11·2 min read·359 words

Which datasets should we assume are "in all the AI models"?

New model releases keep (re)sparking discussions about training data. What can we assume is upstream in the data river, and what do we want to see happen?

Sep 24·8 min read·1493 words

Algorithmic Collective Action With Two Collectives [crosspost]

This post was written by Aditya Karan, with support from Nick Vincent and Karrie Karahalios to accompany a FAccT 2025 paper. It was originally published on Jun 19, 2025 via the Crowd Dynamics Lab blog…

Jun 20·5 min read·821 words

On AI-driven Job Apocalypses and Collective Bargaining for Information

Reacting to a fresh wave of discussion about AI's impact on the economy and power concentration, and reiterating the potential role of collective bargaining.

Jun 5·4 min read·683 words

How do we know our AI output is good? Double checks, bar charts, vibes, and training data.

Connecting evaluation and dataset documentation via the lens of "AI as ranking".

May 30·10 min read·1952 words

Each Instance of "AI Utility" Stems from Some Human Act(s) of Information Recording and Ranking

It's ranking information all the way down.

May 28·6 min read·1179 words

Google and TikTok rank bundles of information; ChatGPT ranks grains.

Google and others solve our attentional problem by ranking discrete bundles of information, whereas ChatGPT ranks more granular chunks. This lens can help us reason about AI policy.

May 27·6 min read·1128 words

[microblog] One book is worth "0.06%" benchmark points to AI; is "no different from noise". What gives?

Commenting on recent coverage of, and discussion about, Meta's arguments about training data value quantification.

Apr 21·3 min read·428 words

Public AI, Data Appraisal, and Data Debates

A consortium of Public AI labs can substantially improve data pricing, which may also help to concretize debates about the ethics and legality of training practices.

Apr 3·14 min read·2661 words

Evaluation Data Leverage: Advances like "Deep Research" Highlight a Looming Opportunity for Bargaining Power

Research agents and increasingly general reasoning models open the door for immense "evaluation data leverage".

Mar 2·9 min read·1764 words

Tipping Points for Content Ecosystems

Our AI design choices in 2024 could preclude "Powerful AI" in 2030.

Feb 12·12 min read·2225 words

AI Labs Should Open Source Data Protection Technologies

There's still incredible tension in the current data paradigm, but sharing "data protection" technologies, like those used by OpenAI to accuse DeepSeek of model theft, can help cut a path forward.

Jan 31·7 min read·1257 words

Live by the free-content-for-training sword, die by the free-content-for-training sword

There's deep tension in the current ask-for-forgiveness-free-for-all approach to acquiring data for model training. Will "open" models cause this tension to reach a breaking point?

Jan 28·6 min read·1186 words

Selling AGI like AG1: Will Consumers Push Back Against Proprietary Blends of Herbs and of Data?

The race to produce premiere AI products with high price tags might change the standards around data disclosure.

Dec 12·6 min read·1134 words

Perplexity CEO's Interaction with Striking New York Times Workers Does Not Reflect Well on the AI Industry

The idea that data-dependent AI systems are ready and willing to crush any leverage from knowledge workers is unlikely to make the AI industry look good to the public.

Nov 9·2 min read·221 words

Is Zuckerberg right to say that your specific creative work has no value to AI?

Examining the Meta CEO's claim that the "individual work of most creators isn’t valuable enough for it to matter" in the context of AI training.

Sep 28·1 min read·143 words

"Many Models" and "Track Changes" for AI: Some Thoughts on LLM Interfaces

Interacting with many models and harnessing the power of `diff`

Aug 8·7 min read·1388 words

Building a Data Pipeworks for Democratic AI: From Human Knowledge to Records to AI Systems

Focusing on feedback loops -- connecting modern AI to early cybernetics-style thinking -- could help solve looming challenges and support democratic inputs to AI.

Nov 13·17 min read·3276 words

Will the New York Times Data Strike Have a Large Impact on ChatGPT?

How can we start thinking about how opt-out decisions by content-producing organizations will affect LLMs?

Sep 28·13 min read·2573 words

A Harbinger of the Future of Content? The New York Times Starts a Data Strike

The New York Times is trying to remove its content from OpenAI models, surfacing tensions around copyright, economic harms, privacy, and the distribution of AI benefits.

Aug 25·7 min read·1360 words

The WGA Strike is a Canary in the Coal Mine for AI Labor Concerns

Could Upcoming Data Legislation Enable a "Right to Data Strike"?

May 5·1 min read·99 words

Reddit, StackOverflow, and Europe: All Trending Towards Data Dignity

Once again, we’ve had an eventful few weeks in the space of data-dependent computing!

May 1·8 min read·1505 words

Data Leverage Recap: December 2022 - April 2023

The Last Three Months in Review: What's New and What's Next

Apr 18·1 min read·83 words

Bing Rewards for the AI Age

The plants in the Gardens by the Bay evoke a sense of flourishing-by-design; photo by Victor from Unsplash.

Mar 30·17 min read·3319 words

Plural AI Data Alignment

Measuring the Alignment of AI Systems Based on their Data Pipelines

Mar 2·10 min read·1991 words

AI Technologies are System Maps, and You are a Cartographer

Much of my work is in pursuit of “data dignity”, an idea that stems in part from scholars arguing that we should sometimes think of “data as labor”.

Feb 3·8 min read·1539 words

AI Artist or AI Art Thief? Innovation, Public Mandates, and the Case for Talking in Terms of Leverage

The public debate over AI has seriously heated up in the wake of new advances in the design and deployment of large generative AI models.

Dec 16·5 min read·872 words

ChatGPT is Awesome and Scary: You Deserve Credit for the Good Parts (and Might Help Fix the Bad Parts)

More on why you're an expert language model trainer

Dec 4·9 min read·1753 words

The Paradox of Reuse, Language Models Edition

Background

Dec 2·6 min read·1081 words

Don’t give OpenAI all the credit for GPT-3: You might have helped create the latest “astonishing” advance in AI too

The much-celebrated GPT-3 that can answer questions, write poems, and more wouldn’t be possible without content written by millions of people around the world. Shouldn’t they get some credit?

Sep 22·4 min read·707 words