{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreidomgrgie6ezipr4xtzhn6afnqkbuncsa3svhuq35t2gbaukuu3h4",
    "uri": "at://did:plc:4rgrdigiftglskeax4wvmsev/app.bsky.feed.post/3mjmprrxpmpf2"
  },
  "path": "/2026/04/machine-learning-and-complexity.html",
  "publishedAt": "2026-04-16T12:45:00.000Z",
  "site": "https://blog.computationalcomplexity.org",
  "tags": [
    "How Does Machine Learning Manage Complexity?",
    "softmax function",
    "cumulative distribution function",
    "Dijkstra",
    "connection between AI and nonuniformity",
    "cryptographic pseudorandom generator",
    "paper"
  ],
  "textContent": "At Oxford I focused my research and discussions on how we can use the tools of computational complexity to help us understand the power and limitations of machine learning. Last week I posted my paper How Does Machine Learning Manage Complexity?, a first step in this direction. Let me give a broad overview of the paper. Please refer to the paper for more technical details.\n\nInstead of focusing on the machine learning concepts like neural nets, transformers, etc., I wanted to abstract out a model defined in terms of complexity, as time-efficient non-uniform computable distributions with minimum probabilities. Let's unpack this abstraction.\n\nNeural nets as next token predictors don't always give the same next token, rather they give a distribution of tokens, which leads to a distribution on text of length up to the size of the context window. The way probabilities are computed via a softmax function guarantee that every output can occur, with at least an exponentially small probability, the \"minimum probability\" in the abstraction.\n\nIn computational complexity, we study two main kinds of distributions. A distribution is  _sampleable_ if there is an algorithm that takes in uniformly random bits and outputs text according to that distribution. A distribution is _computable_ if we can compute not only the probability that a piece of text occurs, but the cumulative distribution function, the sum of the probabilities of all outputs lexicographically less than that text. Every efficiently computable distribution is efficiently sampleable but not likely the other way around.\n\nNeural nets as next token predictors turn out to be equivalent to computable distributions. We need these kinds of distributions both on how neural nets are trained but also on how we use them. Computable distributions allow for conditional sampling which lets us use a large-language model to answer questions or have a conversation. You can't get conditional sampling from a sampleable distribution.\n\nNeural nets have a limited number of layers, typically about a hundred layers deep which prevents them from handling full time-efficient (polynomial-time) computations. They can get around this restriction either with reasoning models that allow a model to talk to itself, or by directly writing and executing code which run efficient algorithms of any kind.\n\nMost of the algorithms we use, think Dijkstra, or matching, are uniform, that is the same algorithm on graphs of twenty nodes as the algorithm on twenty million. But neural nets change their weights based on the distributions they train on. These weights encode a compressed view of that data and the extra computation needed to process that data. So we have different algorithms as our technology and data sources improve. I wrote more about this connection between AI and nonuniformity last fall.\n\nWhat does this abstraction buy us?\n\nRestricting to computable distributions forces machine learning algorithms to treat complex behavior as random behavior, much like we treat a coin flip as a random event because it is too complicated to work out all the physical interactions that would determine which side it lands on.\n\nWe illustrate this point by our main result. Let D be the distribution of outputs from a cryptographic pseudorandom generator and U be the uniform distribution of words of the same length. If \\\\(\\mu\\\\) is a time-efficient non-uniform computable distribution with minimum probabilities and \\\\(\\mu\\\\) does as good or a better job than U of modeling D then \\\\(\\mu\\\\) and U are essentially the same distribution. Machine learning models the complex PRG as a uniform distribution, simplifying what we can't directly compute by using randomness.\n\nMore in the paper and future blog posts.\n\nBy Lance Fortnow",
  "title": "Machine Learning and Complexity"
}