Intro
Building a house on an unfinished foundation doesn't seem like a good idea, yet that's exactly what startups are doing today when they build products on top of AI systems. Indeed, AI is not done yet. It's a moving target that, while bringing fast-paced innovation to the table, can't be considered fully mature or stable yet.
Thriving in such an environment requires navigating ambiguity at a whole new level.
Users don't know what's possible with AI, creators don't know what users should expect, and no one fully understands what AI is capable of.
But there is a useful concept that can help: the statistical distribution.
This post draws from my experience at Novelab, Google, ClipDrop, and Jasper, where I became accustomed to building experiences, POCs, and products on top of numerous immature technologies, including various forms of AI.
Understanding statistical distributions is key to unlocking AI's power; let me explain why in the rest of this post.
Clarification
The following contains deliberate simplifications that might trouble purists. If you want a deeper understanding, this video is an excellent resource. A mental model doesn't have to be perfectly accurate to be useful: what follows is sufficient for making good decisions.
Explicit programming vs Deep Learning
Most traditional algorithms are deterministic. That’s why humanity loved computation in the first place: it guaranteed that calculations could always produce the same, correct result. Computation makes it possible to process billions of transactions per second with perfect accuracy.
Modern AI, based on deep learning, doesn’t work like that. Instead, its internal rules and decision-making processes are learned from data rather than explicitly programmed.
The main consequence is that an AI-based system can’t be formally proven correct the way a traditional algorithm can. At best, we can measure its performance in terms of the percentage of correct answers it gives under testing conditions.
How Deep Learning works
Let’s forget the details for now. What you need to know in 2025 is that AGI is still a myth—and the hype around it doesn’t help.
AI is, at its core, an input–output system.
During training, it learns to associate outputs with inputs. If you train it on enough input/output pairs, it will usually produce the right output for a given input.
But this only works if the input is reasonably similar to what the system has seen during training. For example, if you train a system to generate images from prompts, it will only work well for the kinds of prompts it was exposed to during training, and it will generate only the kinds of images it learned to associate with them.
This core mechanism becomes surprisingly powerful when AI is combined with external tools. The ability to search, read, write, or call APIs makes AI immensely capable. But the fundamental limitation remains: it works best when the input/output patterns were present in the training data.
Of course, reality is more complex—but the basic principle holds. In 2025, AI interpolates; it does not truly extrapolate.
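To make that interpolation point concrete, here is a minimal sketch in Python. It is a toy curve fit, not a neural network: we fit points sampled from a known function over a limited range, then query the fit inside and outside that range. All the values in it are illustrative.

```python
# Toy sketch of interpolation vs. extrapolation (a polynomial fit standing in
# for a trained model; not an actual deep-learning setup).
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, 200)      # the "training distribution": x in [0, 2π]
y_train = np.sin(x_train)

model = np.poly1d(np.polyfit(x_train, y_train, deg=7))   # stand-in for "training"

x_inside, x_outside = 1.5, 12.0                # in-distribution vs. out-of-distribution query
print(f"inside : prediction={model(x_inside):+.3f}  truth={np.sin(x_inside):+.3f}")
print(f"outside: prediction={model(x_outside):+.3f}  truth={np.sin(x_outside):+.3f}")
# The inside prediction lands close to the truth; the outside one diverges badly.
```

The same failure mode is what you see when a prompt or input falls outside what a model was trained on.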
Distribution explained with a fruit basket
Imagine you have a big fruit basket.
- If it’s full of apples and bananas, and just a few oranges, then the distribution of fruit is: “mostly apples and bananas, rarely oranges.”
- If you pull out a fruit at random, chances are high it’ll be an apple or banana, and low it’ll be an orange.
Now, if I train an AI on this basket:
- It will get very good at recognizing apples and bananas.
- It might be able to spot oranges, but not very reliably.
- If you suddenly add a pineapple (which it has never seen before), the AI won’t know what to do with it — because pineapples weren’t part of its distribution.
In other words, AI is only as good as the “distribution” of data it has learned from. A distribution simply means the patterns or categories of examples the model has been exposed to.
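As a small sketch of the analogy (with a made-up frequency count standing in for a trained model), the snippet below can only "recognize" fruits in proportion to how often it saw them during training, and has nothing useful to say about a pineapple it never saw:

```python
# Toy illustration of the fruit-basket analogy: a frequency-count "model"
# standing in for a real classifier.
import random
from collections import Counter

basket = ["apple"] * 45 + ["banana"] * 45 + ["orange"] * 10   # the training distribution
training_counts = Counter(basket)

def recognize(fruit: str) -> str:
    """Report how reliable our toy 'model' is, based only on training exposure."""
    seen = training_counts.get(fruit, 0)
    if seen == 0:
        return "never seen: the model has no idea what this is"
    return f"seen {seen} times: recognized with {'high' if seen > 20 else 'low'} reliability"

random.seed(0)
print("random draw:", random.choice(basket))      # almost always an apple or a banana
for fruit in ["apple", "orange", "pineapple"]:
    print(f"{fruit:>9}: {recognize(fruit)}")
```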
That's why Large Language Models have been trained on the full breadth of information available on the internet. This covers the broadest possible distribution. The training set is so large that on the surface, an LLM seems capable of novel tasks. However, the general rule still applies: if a request falls within the types of examples the AI has seen before, the results can be excellent. But if the request lies outside that range, the output quickly loses quality.
Why distributions matter for products
From a product perspective, distributions map directly to use cases. For instance, at a broad level, coding is one distribution, marketing text is another, and image creation yet another. Within each, there are narrower distributions: in coding, tasks like refactoring or bug fixing; in writing, tasks like blog posts versus SEO snippets.
An AI system may excel at one but underperform at another, depending entirely on whether those patterns were well represented in training.
When designing AI products, success depends on matching the intended use case with the distribution the model has mastered. Performance is strongest when requests stay close to the familiar patterns, weaker at the edges, and unreliable outside them.
This explains why specialized AI products — for example, tools designed only for contract review or for bug fixing — often outperform more general models. By focusing training and optimization on a narrower distribution, they align much more closely with the real needs of users.
If you are building and training models, the most important step is to identify the distributions that matter to your users. In practice, this means understanding the exact types of tasks and inputs they will bring to the system, and ensuring your training data reflects those patterns. Aligning the training set with the user’s distribution of needs is the surest path to reliable performance. When selecting between existing models, this same principle applies. First, identify which distribution your product depends on. This clarity will help you evaluate which model will best serve your users' needs.
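As a hedged sketch of what "identify the distributions that matter" can look like in practice: tag a sample of user requests by task type and look at the resulting frequencies. The requests and categories below are invented for illustration.

```python
# Invented example: estimate the distribution of user needs by tagging a
# sample of requests with a task type and counting frequencies.
from collections import Counter

tagged_requests = [
    ("write a product description for a running shoe", "marketing copy"),
    ("fix the null-pointer bug in this function",      "bug fixing"),
    ("summarize this contract's termination clause",   "contract review"),
    ("write a product description for a coffee maker", "marketing copy"),
    ("refactor this class into smaller functions",     "refactoring"),
]

distribution = Counter(task for _, task in tagged_requests)
for task, count in distribution.most_common():
    print(f"{task:>16}: {count / len(tagged_requests):.0%} of requests")
# Whatever dominates this distribution is what your training data or your
# model selection needs to cover first.
```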
Real-world example
This story occurred in 2023, while I was working for ClipDrop (a company I co-founded, sold to stability.ai in 2024).
We developed a mobile app that, among other features, removed backgrounds from images.
Our model was initially trained on open-source data; then we incorporated user data until it became highly effective.
When we launched a web version of ClipDrop using the same algorithm, users were dissatisfied. We discovered why: the distribution of images from mobile differed significantly from those on the web. On mobile, users typically photographed simple objects like their hands or mugs—the easiest subjects available. On desktop, however, they uploaded images and deliberately chose complex subjects to test the system, particularly portraits with intricate hair.
After retraining our model to handle complex hair, our users were satisfied again.
What is a good model?
This experience with ClipDrop was a wake-up call. There's no such thing as a universally "good model": only models that perform well for specific distributions. Having the broadest possible distribution doesn't guarantee coverage of your specific needs. Furthermore, the broader the distribution, the larger the model needs to be, which increases latency and cost while potentially decreasing quality for specific targeted use cases.
Test sets to the rescue!
AI models can "memorize" training examples instead of truly learning.
It's like teaching a child 2+2=4 and then testing them with 2+3. If they succeed, they've learned the rule, not just memorized the answer.
To check for true learning, researchers use a test set: data from the same distribution as training, but with different examples. The larger and more representative the test set, the more accurate the evaluation.
This is how AI creators measure model quality — and it's how everyone should measure quality too! Even when you test a model randomly, your test is effectively a test set with its own distribution. This explains why the "vibe check" of a model can feel different from benchmark results: often, the benchmark distribution doesn't match the user's real-world usage distribution.
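For illustration, here is a minimal sketch of that idea using scikit-learn's small digits dataset (any dataset would do): the model is scored on held-out examples drawn from the same distribution, not on the examples it trained on.

```python
# Minimal sketch of measuring quality on a test set rather than the training set.
# Uses scikit-learn's toy digits dataset purely for illustration.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0   # held-out examples from the same distribution
)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

print("accuracy on training data:", model.score(X_train, y_train))  # flattered by memorization
print("accuracy on test data:    ", model.score(X_test, y_test))    # the number that matters
```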
Here's the key actionable insight from this post: use test sets as specification and communication tools. In other words: whatever you want to do with AI, create the test set first.
If you need to create a model: create the test set first. This helps you understand the problem and confirm with all parties that the model will solve the right problem. It defines the needed distribution for the training set and makes its creation faster and cheaper.
If you're creating a product that needs AI: create the test set first. This dramatically improves your evaluation of model candidates: a handful of tests is never enough; proper evaluation requires testing against a full distribution of inputs.
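Here is a sketch of what "test set as specification" can look like in code. The cases and the `candidate_model` callable are placeholders I made up; the point is that the expected behaviour is written down as data before any model exists.

```python
# Hypothetical example: the test set doubles as a specification. Each case
# records an input and the behaviour we expect; any candidate model can then
# be scored against the same distribution of cases.
test_set = [
    {"input": "Remove the background from this product photo", "expected": "background_removal"},
    {"input": "Cut out the person with the curly hair",        "expected": "background_removal"},
    {"input": "Make this image twice as large",                "expected": "upscaling"},
    # ...in practice, hundreds of cases sampled to match real user requests
]

def evaluate(candidate_model, cases) -> float:
    """Fraction of cases where the candidate produces the expected behaviour."""
    correct = sum(candidate_model(case["input"]) == case["expected"] for case in cases)
    return correct / len(cases)

# Usage: score every candidate against the same test set before committing.
# score_a = evaluate(model_a, test_set)
# score_b = evaluate(model_b, test_set)
```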
Conclusion
Test sets should always be the first thing to consider when you are in charge of an AI project. They are the clearest way to communicate intent unambiguously across teams.
In classical organizations, it's not clear who should be responsible for the distribution of the test-set. Depending on the task, it could be PMs, researchers, engineers, or designers. I have no doubt that as AI-first companies develop, they will need new roles—including one dedicated to managing test sets and ensuring their distribution matches company objectives.