What Tokens Actually Are (And Why They Cost You)

Here is a question that should be easy, and is not.

How many times does the letter “r” appear in the word “strawberry”?

You counted three. A child counts three. But for a long time, if you asked the most advanced AI models on earth, many of them said two. Confidently. The same models that can write working code and explain quantum physics could not count the letters in a fruit.

This is not a bug someone forgot to fix. It is a window into how these models actually read, and it all comes down to one idea that almost nobody outside the field talks about: the token.

Once you understand tokens, three things suddenly make sense. Why your AI bill is shaped the way it is. Why the model has a memory limit. And why it fumbles “strawberry.” So let me walk you through it the way I wish someone had walked me through it.

The model does not read letters. Or words.

When you type a sentence, you see letters grouped into words. The model never sees that. Before your text reaches the model, it gets chopped into pieces called tokens, and a token is usually a chunk of a word, not a whole word and definitely not a single letter.

[Interactive demo: type any text and watch it break into pieces, with live counts of characters, words, and approximate tokens.]

This is a close approximation, not a real tokenizer. The exact split differs by model. But the shape of the result, chunks of words plus spaces and punctuation, is how it really works.

Notice that common words stay whole, rare words get split, and the spaces and punctuation become pieces too.

A few things you probably just noticed, and they are all real:

Common words like “the” or “are” stay in one piece. The model saw them so many times that they earned their own token.
Rarer or longer words get split. “Tokenization” might become “Token” plus “ization.”
Spaces and punctuation are not free. They are part of tokens too.

This chunking is usually done by a method called Byte Pair Encoding, or BPE. The short version: the tokenizer learned, from a mountain of text, which letter sequences show up together so often that it is worth giving them a single number. Frequent chunks get merged into one token. Rare ones stay split into many.
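To make that merging idea concrete, here is a minimal sketch of BPE training on a tiny word list. It is a toy, not any provider's real tokenizer: the function names and the sample words are my own, and production tokenizers work on bytes over enormous corpora with many more details.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` inside a tuple of symbols."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    # Every word starts out as single characters, weighted by how often it occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# "the" is common in this toy corpus, "zebra" is rare.
sample_words = ["the"] * 10 + ["then"] * 3 + ["zebra"]
merges = learn_bpe_merges(sample_words, num_merges=2)
print(tokenize("the", merges))    # one chunk
print(tokenize("zebra", merges))  # still five single letters
```

After only two merges, the frequent word has earned a single token while the rare one is still in pieces, which is the same effect you see at scale.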

So a token is not a unit of meaning you chose. It is a unit of frequency the machine decided on, long before it ever met you.

Now the strawberry makes sense

Back to our fruit. To you, “strawberry” is ten letters. To the model, it is something like three tokens.

strawberry: what you see (ten letters, three r's)

st | raw | berry: what the model sees (three chunks, no loose letters)

The model was handed something like "st", "raw", "berry" (the exact split depends on the tokenizer). The individual r's are baked inside those chunks, invisible. Asking it to count letters is like asking you to count the bricks in a house by looking at a photo of the front door.

The model never received the letters as separate things. It got three chunks, each one just a number to it. There is no “r” sitting in its view to count. So when it guesses an answer, it is pattern-matching on what people usually say, not actually counting, because it has nothing to count.

There is even a trick that proves the point. Type “s t r a w b e r r y” with spaces between every letter, and many models suddenly get it right. Why? Because the spaces force the tokenizer to break it into single letters. Now the model can finally see each one. Same word, same model, but you changed what it was allowed to look at.
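Here is a small sketch of why the spaced-out version helps. The vocabularies and ID numbers below are made up for illustration, and a real model does not literally run a count over IDs. The only point is what information is present in the input it receives.

```python
# Hypothetical vocabulary: these chunk IDs are invented for this example.
CHUNK_VOCAB = {"st": 101, "raw": 102, "berry": 103}

# Hypothetical single-letter vocabulary: a -> 200, b -> 201, and so on.
LETTER_VOCAB = {ch: 200 + i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}


def encode(pieces, vocab):
    """Turn a list of text pieces into the ID numbers the model would receive."""
    return [vocab[piece] for piece in pieces]


# The word as three chunks: the model gets three opaque numbers.
chunk_ids = encode(["st", "raw", "berry"], CHUNK_VOCAB)

# The word spelled out letter by letter: the model gets one number per letter.
letter_ids = encode(list("strawberry"), LETTER_VOCAB)

r_id = LETTER_VOCAB["r"]
print(chunk_ids.count(r_id))   # 0: no "r" is visible among the chunk IDs
print(letter_ids.count(r_id))  # 3: every "r" is its own visible item
```

Same word both times. In the first encoding the letter simply is not there to be found.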

I find this genuinely clarifying. The model is not stupid here. It is doing exactly what it can do with what it was given, and counting letters was never in the deal.

Why this is also your bill

Here is where it gets practical, and a little expensive.

Nearly every one of these AI services charges you by the token when you use it through an API. Not by the word, not by the question, by the token. And it counts the tokens you send in plus the tokens it sends back. Both directions, and the tokens coming back are usually priced higher than the ones going in. So a chatty model that pads its answers is quietly costing you more.
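The arithmetic behind that is simple enough to sketch. The prices below are placeholders I made up, not any provider's real rates, so swap in the numbers from the pricing page you actually use.

```python
def call_cost(input_tokens, output_tokens,
              input_price_per_million, output_price_per_million):
    """Cost of one request, given per-million-token prices for each direction."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


# Hypothetical prices in dollars per million tokens (assumptions, not real rates).
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

# Same 1,000-token prompt, answered tersely and then with padding.
terse = call_cost(1_000, 500, INPUT_PRICE, OUTPUT_PRICE)
padded = call_cost(1_000, 1_500, INPUT_PRICE, OUTPUT_PRICE)

print(f"terse answer:  ${terse:.4f}")
print(f"padded answer: ${padded:.4f}")
```

With these made-up prices, the padded answer costs more than twice as much for the same question. Multiply that by every request your product makes.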

There is a rough rule worth keeping in your head. On average, for ordinary English, one token works out to about four characters, which is roughly three quarters of a word. Flip that around and a hundred tokens is somewhere near seventy-five words. So a page of writing, call it seven hundred and fifty words, lands around a thousand tokens. These are not exact figures. They are the kind of estimate you do on the back of a napkin, but they are close enough to sanity-check a bill or a prompt length in your head.

1 token ≈ 4 characters ≈ ¾ of a word
100 tokens ≈ 75 words

A handy back-of-the-napkin conversion. Good enough to estimate a bill or check if your prompt will fit. Not exact, and never exact for code or other languages, which we will get to.

That rule is your friend for quick maths. Need to know if a long document fits in a model’s limit? Count the words, multiply by about 1.3, and you have a decent token estimate. Want to guess what a feature will cost at scale? Same trick.
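If you would rather not do the napkin maths by hand, it fits in a few lines. This is a sketch of the rule of thumb only, for ordinary English prose. The function names are mine, and the multiplier is 4/3 (about 1.33), which comes straight from "one token is three quarters of a word."

```python
def tokens_from_chars(text):
    """Estimate tokens from raw text: about four characters per token."""
    return round(len(text) / 4)


def tokens_from_words(word_count):
    """Estimate tokens from a word count: about three quarters of a word per token."""
    return round(word_count * 4 / 3)


def fits_in_window(word_count, window_tokens):
    """Rough check: does a document of this many words fit in the context window?"""
    return tokens_from_words(word_count) <= window_tokens


print(tokens_from_words(75))            # 100
print(tokens_from_words(750))           # 1000
print(fits_in_window(120_000, 200_000)) # True: about 160,000 estimated tokens
```

Treat the result as a first pass. If the estimate lands anywhere near the limit, that is your cue to measure with a real tokenizer.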

But notice the rule keeps saying “English.” That word is doing a lot of work, and it is where most people get surprised.

Not all text costs the same

The four-characters rule holds for ordinary English prose. The moment you leave that lane, it falls apart, and your costs change with it.

Code is heavier. Brackets, indentation, and odd variable names are not common in the way plain English words are, so code splits into more tokens per character.
Numbers and symbols are heavier. A long number can become a surprising number of tokens.
Other languages are much heavier. This one matters to a lot of us. Text in scripts the tokenizer saw less of, including many Indian languages, can run close to one token per character. Against four characters per token for English, that is roughly four times the tokens for the same amount of text.

Sit with that last point for a second, because it is not just trivia. It means that, today, asking a question in Hindi or Tamil can cost several times more than asking the same question in English, purely because of how the tokenizer was built. The unfairness is baked into the unit itself. I think that is worth knowing, and worth pushing on.
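One piece of the explanation is visible without any tokenizer at all. Assuming a byte-level tokenizer (OpenAI's tiktoken encodings work on UTF-8 bytes, and other providers may differ), a script that earned fewer merges falls back toward raw bytes, and Devanagari characters take three bytes each where English letters take one. This sketch only counts bytes. It does not count real tokens.

```python
def utf8_profile(text):
    """Return how many code points and how many UTF-8 bytes a string takes."""
    return {"code_points": len(text), "utf8_bytes": len(text.encode("utf-8"))}


english = "hello"
# "namaste" in Devanagari, written with escapes so the code points are explicit.
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(utf8_profile(english))  # {'code_points': 5, 'utf8_bytes': 5}
print(utf8_profile(hindi))    # {'code_points': 6, 'utf8_bytes': 18}
```

So before the tokenizer has merged anything, the Hindi greeting starts from more than three times the raw material of the English one. How much of that gap survives depends on how much text in that script the tokenizer was trained on.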

The part almost nobody mentions: everyone counts differently

You would think a token is a token. It is not. Each company built its own tokenizer, on its own data, with its own vocabulary. So the exact same sentence can cost a different number of tokens depending on who you send it to.

[Figure: the same sentence counted by the OpenAI, Claude, and Gemini tokenizers (~16 tokens).]

Illustrative, not a benchmark: the same sentence can land on different token counts across providers. For everyday prose the gap is usually small, a few percent. For code it can stretch to ten or twenty percent. The point is that there is no single true count.

Here is the real, checkable picture of how the big three differ:

OpenAI uses a library called tiktoken. Older models use an encoding called cl100k_base with a vocabulary around 100,000 chunks. Newer ones use o200k_base, roughly 200,000 chunks, with extra merges that handle code better. A bigger vocabulary means more text fits per token.
Anthropic (Claude) uses its own tokenizer and, notably, does not publish it. The supported way to know your real cost is to ask their count-tokens endpoint before you send the full request.
Google (Gemini) uses a different toolkit again, called SentencePiece, with an even larger vocabulary.

So when you read that a model has a “200,000 token context window,” that number does not mean the same amount of text everywhere. Two hundred thousand of Claude’s tokens and two hundred thousand of OpenAI’s tokens are not the same number of words. The unit looks standard and is not.
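You can see the effect with two toy vocabularies. Everything here is invented for illustration: the vocabularies are tiny, and the greedy longest-match splitting is a simplification (real BPE applies its learned merges in order). What carries over is that a different vocabulary gives a different count for the same text.

```python
def greedy_tokenize(text, vocab):
    """Split text by always taking the longest vocabulary piece that matches.

    Single characters are always allowed as a fallback, so any text can be split.
    """
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Two hypothetical providers with different (made-up) vocabularies.
VOCAB_A = {"to", "ken", " are", " not", " wor", "ds"}
VOCAB_B = {"tokens", " are", " not", " words"}

sentence = "tokens are not words"
tokens_a = greedy_tokenize(sentence, VOCAB_A)
tokens_b = greedy_tokenize(sentence, VOCAB_B)

print(len(tokens_a), tokens_a)
print(len(tokens_b), tokens_b)
```

One sentence, seven tokens under the first vocabulary and four under the second. Stretch that across a 200,000 token window and the two limits hold noticeably different amounts of text.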

This is exactly the kind of quiet, real headache you hit when you build a tool that talks to more than one provider, and it is part of why I built llmswap. You stop being able to assume “a token” means one fixed thing.

How to actually count them, when it matters

For a rough estimate, the four-characters rule is plenty. When you need the real number, do not guess. Use the tool the provider gives you:

For OpenAI models, the tiktoken library counts exactly, offline, in your own code.
For Claude, call the count-tokens endpoint, which returns the true input token total before you spend anything on the real call.
For Gemini, there is a matching count-tokens call.
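For the OpenAI case, a sketch might look like the following. It uses tiktoken's `get_encoding` and `encode`, and falls back to the four-characters estimate when tiktoken is not installed or the encoding cannot be loaded (tiktoken fetches the encoding file the first time it is used). The wrapper function, its return shape, and the fallback behaviour are my own choices, not part of the library.

```python
def count_tokens(text, encoding_name="o200k_base"):
    """Return (token_count, method), where method is 'tiktoken' or 'estimate'."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text)), "tiktoken"
    except Exception:
        # tiktoken is missing, the encoding name is unknown to this version,
        # or the encoding file could not be loaded. Fall back to the rough rule.
        return round(len(text) / 4), "estimate"


count, method = count_tokens("The quick brown fox jumps over the lazy dog.")
print(f"{count} tokens ({method})")

# The older encoding can give a different count for the same text.
older_count, older_method = count_tokens(
    "The quick brown fox jumps over the lazy dog.", encoding_name="cl100k_base"
)
print(f"{older_count} tokens ({older_method})")
```

The `method` value is there so you never mistake an estimate for a measurement. For Claude and Gemini there is no offline equivalent in this sketch, so the count has to come from each provider's own count-tokens call.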

The pattern is the same everywhere. Estimate freely while you are sketching, and measure precisely before anything ships or anything gets billed.

What I want you to take away

A token is a chunk of text, decided by frequency, that the model treats as a single number. It does not line up with letters, which is why “strawberry” trips it. It does not line up with words, which is why your bill is in this strange unit. And it does not line up across companies, which is why the same sentence has no single price.

None of this is magic, and none of it is arbitrary once you see the shape of it. It is just a side effect of teaching a machine to read by feeding it patterns instead of meaning.

The next time a model says something brilliant and then miscounts the r’s in a fruit, you will not be confused. You will know it never saw the letters at all. It was only ever working with chunks, doing its best, the way it always is.