What LLMs Truly Are
Let's dive in!
How Are LLMs Trained?
To truly understand the nature of a Large Language Model (LLM), you must first appreciate the monumental task of pre-training. This is the initial, most computationally expensive stage of development, and it fundamentally defines the model's knowledge.
The core training objective—predicting the next token in a sequence—is deceptively simple. To perform this task well across trillions of tokens of diverse human text (code, history, science, dialogue), the model must learn the underlying rules of the world that generated that text.
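The objective can be illustrated with a toy count-based bigram model. This is a deliberate simplification, not how an LLM works internally (real models use neural networks over subword tokens), but the task is the same: given a context, assign probabilities to the next token.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; real pre-training uses trillions of tokens.
corpus = "the capital of france is paris . the capital of italy is rome .".split()

# Count how often each token follows each other token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    """Return the most probable next token and its estimated probability."""
    following = counts[token]
    total = sum(following.values())
    best = max(following, key=following.get)
    return best, following[best] / total

print(predict_next("capital"))  # ('of', 1.0)
```

Even this crude model has "learned" a statistical regularity of its training data. Scaling the same objective up to a deep network and internet-scale text is what forces the richer structures described above to emerge.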
For example, to predict that "The capital of France is Paris," the model must have internalized the concepts of "France," "capital," and "Paris," and the relationship between them. To predict the next line of Python code, it must understand syntax, variable scope, and execution flow.
The best analogy for the result of pre-training is that the model's parameters—the billions of weights stored in a massive file (e.g., 140 GB for Llama 2 70B)—represent a lossy compression of a vast chunk of the internet.
The Power of Lossy Compression: Generalization
Crucially, the lossy nature of this compression is what gives the LLM its generalized intelligence. If the model were merely a lossless archive, it could only regurgitate exact documents. Instead, the compression process forces the neural network to discard noise and redundancy, compelling it to identify and internalize the underlying statistical structure, grammar, logic, and world model embedded within the text. This generalization is what allows the model to synthesize new, coherent text, answer novel questions, and perform tasks like translation or summarization—capabilities that go far beyond simple lookup.
The Consequences of Lossy Compression
This lossy nature, however, also explains several key behaviors:
Vague Recollection: The model's knowledge is not precise or guaranteed. It's a "vague recollection" of the internet. It knows the form of an ISBN but might hallucinate the digits, or it might know a fact in one direction ("Tom Cruise's mother is Mary Lee Pfeiffer") but fail the reverse query ("Who is Mary Lee Pfeiffer's son?"), a phenomenon known as the Reversal Curse.
Knowledge Cutoff: Since pre-training is extremely costly and infrequent, the model's internal knowledge has a fixed knowledge cutoff (often months or a year old). To overcome this temporal and factual limitation, modern LLMs must be augmented with external tools, such as web search or Retrieval-Augmented Generation (RAG). These tools allow the model to pause generation, fetch current, verifiable information, and insert it directly into its working memory (the context window) before formulating an answer.
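The RAG pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the naive keyword `retrieve` stands in for a real search index or vector store, and the prompt template is hypothetical.

```python
def retrieve(query, docs):
    """Naive keyword retriever: return docs sharing any word with the query.
    A real system would use a search engine or embedding similarity."""
    words = set(query.lower().split())
    return [d for d in docs if words & set(d.lower().split())]

def build_prompt(query, docs):
    """Prepend retrieved text so it lands in the model's context window."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = ["The 2024 Olympics were held in Paris.",
        "Pandas eat bamboo shoots."]
print(build_prompt("Where were the 2024 Olympics held?", docs))
```

The key idea is that retrieved text bypasses the model's stale internal weights entirely: the fresh facts sit in working memory (the context window), where the model can attend to them directly when generating its answer.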
Hallucination: This is the most visible consequence of lossy compression. When the model is asked a factual question, it relies on its vague, compressed memory. If it cannot recall the exact fact, it defaults to generating a statistically plausible answer that looks correct based on the patterns of its training data, rather than admitting uncertainty. This is the model mimicking the confident tone of its training data even when its internal knowledge is fuzzy.
Conclusion
The LLM, viewed as a lossy zip file, is a powerful but fundamentally statistical engine. It is not a perfect oracle or a conscious entity, but a highly optimized artifact resulting from massive computational effort applied to vast amounts of data. Understanding its nature as a probabilistic imitator—one whose knowledge is a vague recollection and whose behavior is shaped by fine-tuning—is essential for effective interaction. The future of LLMs lies in augmenting this powerful core with external tools and advanced reasoning techniques, moving beyond simple imitation toward becoming the kernel of a new, intelligent operating system.
This post was initially written by an LLM with access to transcripts of Andrej Karpathy's videos, then edited.
