Embeddings: How AI Knows Two Things Mean the Same
Here is a small thing that should not work, and does.
You type “my laptop keeps freezing” into a search box. Back comes a help article titled “system hangs on boot.” Read those two phrases again. They share no words. Not “laptop,” not “freezing,” nothing. An old-fashioned search, the kind that matches letters, would have shrugged and returned nothing. Yet the modern one found the exact right answer, because somewhere underneath, the machine understood that a freezing laptop and a hanging system are the same idea wearing different clothes.
That quiet understanding is the single most useful trick in modern AI, and it has a name: embeddings. Almost everything you find impressive (semantic search, retrieval-augmented generation (RAG), recommendations, an assistant that remembers what you meant three messages ago) is standing on this one idea. So I want to build it up with you slowly, from the first honest intuition all the way to the parts that still make senior engineers smile. No prior math needed to start. By the end you will understand something real.
The core move: turn meaning into a place
Computers cannot compare meanings. They can only compare numbers. So the whole game is a single audacious move: take a piece of text and turn its meaning into a list of numbers, chosen so that similar meanings get similar numbers.
That list of numbers is called a vector, or an embedding. Think of the numbers as coordinates. Two coordinates, and you have a point on a map. Three, and it is a point in a room. An embedding just uses many coordinates, hundreds or thousands, to place a piece of meaning somewhere in a vast space. You cannot picture that space, and that is fine, nobody can. The point is what it buys you: once meaning is a location, “these two things are similar” becomes “these two points are close together.” A fuzzy human question turns into plain geometry.
Seeing the space (a flat version of a huge idea)
Real embedding space has hundreds of dimensions, which is unpicturable. But the behaviour survives if we squash it down to two, so let me show you a flat cartoon of the real thing. Watch where the words land:
The famous demonstration of how structured this space is: take the vector for “king,” subtract “man,” add “woman,” and the word nearest to the point you land on is “queen.” The space encodes not just topics but relationships: direction itself carries meaning (“the male-to-female direction,” “the singular-to-plural direction”). That is not a party trick someone hard-coded. It fell out of learning, and the first time you see it work you feel a small jolt. I still do.
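To make the arithmetic concrete, here is a minimal sketch using hand-made two-dimensional vectors. Everything in it is an assumption for illustration: the two axes ("royalty" and "maleness"), the numbers, and the helper name `analogy` are mine, not the output of any trained model. Real embeddings learn their directions; these are just rigged so you can follow the subtraction by eye.

```python
# Hand-made 2-D toy vectors: [royalty, maleness]. Not learned by any model.
toy = {
    "king":   [1.0,  1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0,  1.0],
    "woman":  [0.0, -1.0],
    "banana": [-1.0, 0.0],
}

def analogy(a, b, c, vectors):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]

    def squared_distance(v):
        return sum((p - q) ** 2 for p, q in zip(v, target))

    # The input words are excluded, as is usual when this demo is run.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: squared_distance(vectors[w]))

print(analogy("king", "man", "woman", toy))  # queen
```

Subtracting “man” removes the maleness coordinate, adding “woman” pushes it the other way, and the royalty coordinate is left untouched. That is what “direction carries meaning” looks like in miniature.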
Measuring “close”: the angle, not the distance
So similar meanings land nearby. How do we measure nearby? Your instinct says “distance between the points,” and that is reasonable, but the tool everyone actually reaches for measures the angle between the two vectors instead. It is called cosine similarity, and it is worth understanding why angle beats distance.
Picture each embedding as an arrow from the origin out to its point. Two arrows pointing the same direction mean the same thing, regardless of how long the arrows are. Cosine similarity reads that angle and hands you one number:
[Figure: cosine similarity scores for the pairs "king" · "queen", "king" · "crown", and "king" · "banana"]
The mechanics, in plain words: multiply the two vectors dimension by dimension and add it all up (that is the dot product), then divide by the product of the two arrows’ lengths. Dividing by the lengths is the important part, because it throws the lengths away and leaves only direction. The result runs from 1 (same direction) through 0 (unrelated directions) to -1 (opposite directions).
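Those mechanics fit in a few lines. This is a minimal sketch in plain Python, with made-up two-dimensional vectors standing in for real embeddings; production code would normally use a vectorised library instead. Note that the formula is undefined for an all-zero vector, which this sketch does not guard against.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

short = [1.0, 2.0]
stretched = [10.0, 20.0]   # same direction as `short`, ten times the length
sideways = [-2.0, 1.0]     # perpendicular to `short`

print(cosine_similarity(short, stretched))  # very close to 1.0: same direction
print(cosine_similarity(short, sideways))   # 0.0: perpendicular
```

The first comparison is the point of the whole exercise: stretching an arrow tenfold changes nothing, because only direction survives the division.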
Why deliberately ignore length? Two reasons, one practical and one deep. The practical one: a long document about laptops should not rank as “more about laptops” than a short sentence about laptops just because it has more words. Meaning is about which way you point, not how far. The deep one is a genuine senior-level fact worth carrying: in very high-dimensional spaces, plain straight-line distances between points all start to look eerily similar, and everything drifts toward being equally far apart. This is one face of the curse of dimensionality, and it quietly weakens comparison by raw distance. Comparing direction does not make the problem vanish, but it removes vector length as a source of noise, and once vectors are scaled to unit length, ranking by angle and ranking by distance give the same order anyway. That, more than habit, is why cosine is the default.
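You can watch distances bunch together with a small experiment. This sketch is my own illustration, not something from a particular library: it scatters uniformly random points (an assumption; real embeddings are not uniform noise) and measures how much farther the farthest point is than the nearest one, relative to the nearest. The function name `relative_contrast` and the point counts are hypothetical choices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)  # seeded so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))  # straight-line distance
    return (max(distances) - min(distances)) / min(distances)

for dim in (2, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

In two dimensions the nearest point is many times closer than the farthest, so the contrast is large. In a thousand dimensions the contrast shrinks to a small fraction, which is the “everything is equally far apart” effect in numbers.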
Where do the numbers come from? (the honest mechanics)
Fair question: who decides that “king” is [0.21, -0.44, ...]? Nobody hand-writes these. A neural network learns them. Here is the pipeline without the hand-waving, and it holds from a fresher’s mental model to what actually ships.
Two words there deserve a beat, because they separate “I sort of get it” from “I actually get it.”
Dense, not sparse. An older approach gave each word in the dictionary its very own slot, so a vector was tens of thousands of numbers, almost all zero, with one slot lit up per word present. That is sparse, and it is why old search thought “laptop freezing” and “system hanging” were total strangers: different slots, zero overlap. Modern embeddings are dense: a few hundred to a few thousand numbers where nearly every value means something, and meaning is spread across all of them. That density is exactly what lets differently worded phrases land on nearby meanings.
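Here is the sparse failure in miniature, as a sketch. It builds a tiny word-count vector (one slot per vocabulary word) for the two phrases from the opening. The vocabulary is just the eight words in those phrases, an assumption to keep the output readable; a real sparse index would have tens of thousands of slots.

```python
def bag_of_words(text, vocabulary):
    """Sparse vector: one slot per vocabulary word, holding that word's count."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

query = "my laptop keeps freezing"
article = "system hangs on boot"

# Slots in alphabetical order: boot, freezing, hangs, keeps, laptop, my, on, system
vocabulary = sorted(set(query.split()) | set(article.split()))
query_vec = bag_of_words(query, vocabulary)
article_vec = bag_of_words(article, vocabulary)

# Dot product: nonzero only when both vectors light up the same slot.
overlap = sum(q * a for q, a in zip(query_vec, article_vec))

print(query_vec)    # [0, 1, 0, 1, 1, 1, 0, 0]
print(article_vec)  # [1, 0, 1, 0, 0, 0, 1, 1]
print(overlap)      # 0
```

No slot is lit in both vectors, so the dot product is zero and the two phrases look completely unrelated. A dense embedding has no per-word slots to miss.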
Trained by contrast. How does the model learn to place similar things together? You show it pairs that should be close (a question and its correct answer) and let everything else in the batch count as things that should be far. The model nudges the close pairs together and shoves the rest apart, over and over, across billions of examples. This is contrastive learning, and it is why “laptop freezing” and “system hangs” drift into the same neighbourhood despite sharing no words. They kept appearing in similar roles, so the model learned to point them the same way.
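One common way to write that objective down is an InfoNCE-style loss with in-batch negatives. The text above does not say which loss any particular model uses, so treat this as a sketch of the general shape under that assumption. The function name, the `temperature` value of 0.1, and the two-dimensional toy vectors are all mine. A real training loop would compute this on model outputs and backpropagate; here we only evaluate the loss to see which arrangement it rewards.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(queries, answers, temperature=0.1):
    """Mean InfoNCE-style loss. answers[i] is the positive for queries[i];
    every other answer in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, answer) / temperature for answer in answers]
        # log-sum-exp, shifted by the max for numerical stability
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_sum - logits[i]  # equals -log(softmax(logits)[i])
    return total / len(queries)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each answer points the same way as its query
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each answer points the way of the wrong query

print(in_batch_contrastive_loss(queries, aligned))  # near 0: pairs already close
print(in_batch_contrastive_loss(queries, swapped))  # about 10: pairs badly placed
```

Training is the process of moving vectors from the second situation toward the first: the loss falls only when each query scores its own answer above every other answer in the batch.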
The thing this unlocks: semantic search
Now the payoff, and it is beautifully simple once the pieces are in place. Searching by meaning instead of by matching letters is four steps:
And this is exactly why our opening magic trick worked. “my laptop keeps freezing” and “system hangs on boot” get embedded into arrows pointing almost the same direction. High cosine similarity. The letters never mattered; the meanings pointed the same way.
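As a sketch, the whole search loop looks like this, reading it as: embed the documents, store the vectors, embed the query, rank by cosine similarity. The `embed` function here is a hypothetical stand-in that looks up hand-picked three-dimensional vectors. In a real system it would call an embedding model, and the stored vectors would live in a vector index rather than a Python list.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Stand-in for a real embedding model: hand-picked toy vectors (assumption).
TOY_EMBEDDINGS = {
    "system hangs on boot":       [0.9, 0.1, 0.0],
    "how to reset your password": [0.1, 0.9, 0.1],
    "banana bread recipe":        [0.0, 0.1, 0.9],
    "my laptop keeps freezing":   [0.8, 0.2, 0.1],
}

def embed(text):
    """Hypothetical embed(): returns a toy vector instead of calling a model."""
    return TOY_EMBEDDINGS[text]

def search(query, documents, top_k=2):
    index = [(doc, embed(doc)) for doc in documents]  # embed and store the documents
    query_vec = embed(query)                          # embed the query
    scored = [(cosine_similarity(query_vec, vec), doc) for doc, vec in index]
    return sorted(scored, reverse=True)[:top_k]       # nearest first

documents = ["system hangs on boot", "how to reset your password", "banana bread recipe"]
for score, doc in search("my laptop keeps freezing", documents):
    print(round(score, 2), doc)
```

With these toy vectors the boot article comes back first by a wide margin, even though the query and the title share no words. In practice the document vectors are computed once and reused, so only the query is embedded at search time.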
Here is that same idea as scores, so you can feel the gradient between a strong match and a weak one:
Where you’ll actually meet embeddings
This is not a lab curiosity. It is quietly running under most useful AI you touch:
| Use case | What gets embedded | Why it works |
|---|---|---|
| Semantic search | Your documents and the query | Finds meaning, not keywords, so synonyms and paraphrases still match |
| RAG | Chunks of your knowledge base | Retrieves the passages nearest the question to feed the model real context |
| Recommendations | Items, and a user's history | "More like this" becomes "nearest neighbours in item-space" |
| Deduplication / clustering | Records, tickets, reviews | Near-identical meanings cluster together even when worded differently |
| Agent memory | Past notes and decisions | Recall what's *relevant* to now, by nearness, instead of scrolling everything |
| Classification | The input text | Similar inputs sit near labelled examples, so the nearest ones vote |
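The classification row is the easiest one to sketch, because “the nearest ones vote” is almost the whole algorithm (k-nearest neighbours). The labelled vectors below are toy values I made up, and the labels `hardware` and `billing` are hypothetical. A real setup would embed labelled texts with a model first.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy (embedding, label) pairs standing in for embedded labelled examples.
labelled = [
    ([0.9, 0.1], "hardware"),
    ([0.8, 0.3], "hardware"),
    ([0.1, 0.9], "billing"),
    ([0.2, 0.8], "billing"),
    ([0.3, 0.7], "billing"),
]

def classify(vector, examples, k=3):
    """Label a vector by majority vote among its k most similar examples."""
    nearest = sorted(
        examples,
        key=lambda example: cosine_similarity(vector, example[0]),
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify([0.85, 0.2], labelled))  # hardware
```

The same nearest-neighbour lookup, with the voting step removed, is what drives the recommendation, deduplication, and memory rows.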
I have leaned on this myself. A tool I built keeps a working memory across sessions, and the way it decides what past context is relevant to your current task is exactly this: embed the notes, embed the moment, pull the nearest. It is a small idea doing enormous work.
One more, for the senior in the room: nested embeddings
Here is a recent, genuinely elegant twist that rewards understanding the basics. A 3072-dimension embedding is powerful but heavy: more storage, slower comparisons, and at scale that is real money. You would love to use a shorter vector when you can afford lower precision, and a longer one when you need the best. Normally that means training a whole separate model per size. Annoying.
The fix is a training trick called Matryoshka representation learning, named for the Russian nesting dolls, and the name is the whole idea. The model is trained so that the most important meaning is packed into the earliest dimensions, and every prefix of the vector is a complete, usable embedding on its own.
Why it matters in practice: you can do a fast, cheap first pass with the short prefix to narrow millions of candidates down to a few hundred, then rerank just those with the full vector for precision. Best of both. The remarkable part is that a well-trained short prefix can come close to, and sometimes match, a much longer vector from an ordinary model: real quality at a fraction of the cost. That is the kind of design that makes you sit back a little.
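The two-pass idea can be sketched in a few lines. The four-dimensional vectors, the document names, and the function `two_stage_search` are all hypothetical. Slicing off a prefix like this is only meaningful for a model actually trained the Matryoshka way; truncating an ordinary embedding gives no such guarantee.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy 4-D vectors; assume the earliest dimensions carry the most meaning.
docs = {
    "a": [0.9, 0.1, 0.3, 0.1],
    "b": [0.8, 0.2, -0.4, 0.3],
    "c": [-0.7, 0.6, 0.1, 0.2],
    "d": [0.1, -0.9, 0.2, 0.1],
}

def two_stage_search(query, docs, prefix_dims, shortlist_size):
    # Pass 1: cheap scoring on the first `prefix_dims` coordinates only.
    q_short = query[:prefix_dims]
    coarse = sorted(
        docs,
        key=lambda name: cosine_similarity(q_short, docs[name][:prefix_dims]),
        reverse=True,
    )
    shortlist = coarse[:shortlist_size]
    # Pass 2: rerank only the shortlist using the full vectors.
    reranked = sorted(
        shortlist,
        key=lambda name: cosine_similarity(query, docs[name]),
        reverse=True,
    )
    return shortlist, reranked

shortlist, reranked = two_stage_search([1.0, 0.1, 0.3, 0.0], docs, prefix_dims=2, shortlist_size=2)
print(shortlist)  # ['a', 'b']
print(reranked)   # ['a', 'b']
```

Here the prefix alone can barely separate “a” from “b”, and it is the full vector that pulls them clearly apart. The saving comes from pass 1 touching half the numbers for every document, while pass 2 pays full price only for the shortlist.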
The one sentence to keep
Strip everything away and embeddings are this: meaning becomes a place, and similarity becomes closeness. Turn text into a point in a space built so that alike things sit close, and suddenly a computer that only knows how to compare numbers can answer questions about meaning: find the right doc with none of the right words, remember what’s relevant, group what belongs together.
The next time a search understands you better than the words you typed, or an assistant surfaces exactly the note you needed, you will know what happened underneath. Your meaning was turned into an arrow, and somewhere in a space too big to picture, it pointed at the answer.