Hi HN! I was looking for a tokenizer that would accurately(!) count tokens in the browser, and I couldn't find one. So I thought "how hard can it be", and here we are 2 weeks later...
I think it would take quite a bit of work for someone to grab that BPE implementation and make it work for LLaMA. Less work than rewriting the whole tokenizer from scratch, for sure, but a non-trivial amount of work anyway.
If we wanted to map this string (" grabbed") to tokens by greedily going from left to right, choosing tokens from the vocabulary so as to minimize the number of tokens, our algorithm would be very simple. We would end up with the following tokenization: [17229, 2580] == [" grab", "bed"]
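To make the contrast concrete, here is a minimal sketch of such a greedy left-to-right tokenizer (illustrative only, assuming a plain set of vocabulary strings; as explained below, this is not what LLaMA actually does):

```typescript
// Minimal sketch of a greedy left-to-right tokenizer: at each position, take
// the longest substring that exists in the vocabulary.
function greedyTokenize(text: string, vocab: Set<string>, maxTokenLen = 32): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < text.length) {
    let match = text[i]; // fall back to a single character if nothing longer matches
    for (let len = Math.min(maxTokenLen, text.length - i); len > 1; len--) {
      const candidate = text.slice(i, i + len);
      if (vocab.has(candidate)) {
        match = candidate;
        break;
      }
    }
    tokens.push(match);
    i += match.length;
  }
  return tokens;
}
```

With " grab" and "bed" in the vocabulary and no longer match available, greedyTokenize(" grabbed", vocab) would return [" grab", "bed"].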
Surprisingly, the LLaMA tokenizer does not work this way. It actually finds a "worse" tokenization for this input string: [2646, 1327, 287] == [" gra", "bb", "ed"]
The tokenizer arrives at this 3 token output by applying "merges" in a priority order. For example, this is a merge: [" g", "r"] -> " gr". The trained data contains tens of thousands of these merges. When we apply the merges in the priority order, we end up with 3 tokens.
Now you might be thinking: that's easy, we'll just iterate the list of merges and see if any of them apply. The only problem with that approach is that applying a merge can open up a new opportunity to merge something else that wasn't possible before. This right here is the key thing that makes the problem complicated. We can solve it by re-checking all possible merges from the beginning every time we apply a merge. That produces the correct result. The only problem is that our algorithm is now very slow and takes minutes to run...
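For illustration, a sketch of that slow-but-correct approach might look like this (the data shapes here are assumptions, not the actual code):

```typescript
// Slow but correct: after every merge, re-scan all adjacent pairs from the
// beginning and apply the highest-priority merge that is still possible.
// `merges` maps a pair key (left + "\u0000" + right) to a priority number,
// where a lower number means higher priority.
const pairKey = (left: string, right: string) => left + "\u0000" + right;

function naiveBpe(chars: string[], merges: Map<string, number>): string[] {
  const tokens = [...chars];
  while (true) {
    let bestIdx = -1;
    let bestPriority = Infinity;
    for (let i = 0; i < tokens.length - 1; i++) {
      const p = merges.get(pairKey(tokens[i], tokens[i + 1]));
      if (p !== undefined && p < bestPriority) {
        bestPriority = p;
        bestIdx = i;
      }
    }
    if (bestIdx === -1) break; // no merge applies anywhere, so we are done
    const merged = tokens[bestIdx] + tokens[bestIdx + 1];
    tokens.splice(bestIdx, 2, merged); // apply the best merge, then start over
  }
  return tokens;
}
```

Every applied merge triggers another full scan of the sequence, which is exactly what makes this approach so slow on long inputs.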
To a non-programmer like me, this suggests the use of a trie? But could you very roughly sketch what you did, or at least give me a keyword or two to look up? Thanks a million.
That's quite a question coming from a self-proclaimed non-programmer (are you sure you're not secretly a programmer?)
I didn't use a Trie. The main data structures used are a Linked List and a Priority Queue.
The tokenization begins by transforming each character into a token. After that, merges are applied in priority order for as long as any merges are possible.
The current state of tokenization is represented as a linked list where each node (token) points to the previous node and next node.
Nodes of the Linked List are added to a Priority Queue _if_ a merge with their corresponding "next node" is possible according to the vocabulary (we use a Hash Map to perform these checks fast). Even though we are technically adding nodes of a linked list into the Priority Queue, you can think of the Priority Queue as holding "potential merges" in the order of their priority (priority according to the trained tokenizer merge data).
When we poll a node from the priority queue, we first check that the merge is still possible (because the situation may have changed between the time the node was added to the priority queue and the time it was polled from the queue). If the merge is still possible, it is guaranteed to be the highest-priority merge, so we apply it. At that point we need to do several things: mark the two nodes that were merged as "deleted", create a new node representing the merged token, update the pointers of the adjacent nodes in the linked list, and check whether new merges have become possible, adding the corresponding nodes to the priority queue if they have.
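Here is a rough TypeScript sketch of that scheme (the names, the pair-key encoding, and the sorted-array queue are my own simplifications, not the library's actual code):

```typescript
// Sketch of the linked-list + priority-queue approach (simplified; a real
// implementation would use a proper binary heap instead of a sorted array).
// `mergePriority` maps a pair key (left + "\u0000" + right) to its priority,
// where a lower number means the merge is applied earlier.

interface TokenNode {
  token: string;
  prev: TokenNode | null;
  next: TokenNode | null;
  deleted: boolean;
}

const pairKey = (left: string, right: string) => left + "\u0000" + right;

function tokenize(text: string, mergePriority: Map<string, number>): string[] {
  // 1. One node per character, doubly linked behind a sentinel head.
  const head: TokenNode = { token: "", prev: null, next: null, deleted: false };
  let tail = head;
  for (const ch of text) {
    const node: TokenNode = { token: ch, prev: tail, next: null, deleted: false };
    tail.next = node;
    tail = node;
  }

  // Priority queue stand-in: entries are [priority, node], kept sorted.
  const queue: [number, TokenNode][] = [];
  const maybeQueue = (node: TokenNode) => {
    if (node === head || node.next === null) return;
    const p = mergePriority.get(pairKey(node.token, node.next.token));
    if (p !== undefined) {
      queue.push([p, node]);
      queue.sort((a, b) => a[0] - b[0]);
    }
  };
  for (let n = head.next; n !== null; n = n.next) maybeQueue(n);

  // 2. Keep applying the highest-priority merge that is still valid.
  while (queue.length > 0) {
    const [p, node] = queue.shift()!;
    // The situation may have changed since this entry was queued, so re-check.
    if (node.deleted || node.next === null) continue;
    if (mergePriority.get(pairKey(node.token, node.next.token)) !== p) continue;

    const right = node.next;
    const merged: TokenNode = {
      token: node.token + right.token,
      prev: node.prev,
      next: right.next,
      deleted: false,
    };
    node.deleted = true;
    right.deleted = true;
    if (merged.prev !== null) merged.prev.next = merged;
    if (merged.next !== null) merged.next.prev = merged;

    // The new token may enable merges with its left and right neighbours.
    if (merged.prev !== null) maybeQueue(merged.prev);
    maybeQueue(merged);
  }

  // 3. Walk the list and collect the surviving tokens.
  const out: string[] = [];
  for (let n = head.next; n !== null; n = n.next) out.push(n.token);
  return out;
}
```

The key difference from the naive version is that we never re-scan the whole sequence after a merge: only the neighbours of a newly merged node get re-examined.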
I take it if you do greedily go from left to right, you end up with different tokenization and the model doesn't work properly because that isn't what it was trained with?
Did the model designers choose this 'difficult' tokenization because it performs better?
> I take it if you do greedily go from left to right, you end up with different tokenization and the model doesn't work properly because that isn't what it was trained with?
The way that the LLaMA tokenizer works is, a word is tokenized one way within a certain context, and a different way within a different context. If you go greedily from left to right, you will find mostly the same tokenizations for words, but occasionally the tokenization will be different. It will still be a legitimate tokenization for the word, though, so the model will not spew nonsense if you feed these tokens into it. I assume the outputs of the model would be slightly suboptimal, but the difference might be so small that nobody notices anything.
Anyway, I don't expect anyone to use this library in a manner where they will turn text to tokens with this library, and then feed the tokens to the model. I expect people to use this library to count tokens, dynamically iterate on the prompt to fit within the token limit, and then eventually send raw text to the model, which performs the tokenization by itself. If we think about this use case, and we consider what would happen if we just greedily tokenized from left to right, it would be mostly fine, because our token counts would be pretty close to the real token count. But occasionally it wouldn't be fine: occasionally our greedy tokenizer would say that we are below the token limit, when actually we're beyond the token limit. In this case the model would either return an error, or it would automatically trim some part of the prompt. If the prompt is automatically trimmed, it might be trimmed from the wrong place. For example, oobabooga trims the prompt from the beginning, when you typically would want to trim the prompt from the middle.
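As a hypothetical sketch of that "count and trim" flow, assuming you have some exact countTokens function available (a placeholder name, not necessarily this library's actual API):

```typescript
// Hypothetical "count and trim" flow: drop messages from the middle of the
// conversation until the prompt fits under the model's token limit.
// `countTokens` is a placeholder for an exact token-counting function.
declare function countTokens(text: string): number;

function fitPrompt(messages: string[], limit: number): string[] {
  const kept = [...messages];
  // Keep at least the first message (e.g. the system prompt) and the last one.
  while (kept.length > 2 && countTokens(kept.join("\n")) > limit) {
    kept.splice(Math.floor(kept.length / 2), 1); // trim from the middle
  }
  return kept;
}
```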
> Did the model designers choose this 'difficult' tokenization because it performs better?
I'm not sure. LLaMA is using a SentencePiece BPE tokenizer, and the most-often touted benefit of this tokenizer is that it can fluently deal with multiple different languages. So it's possible that they chose this to support multiple different languages, not to maximize the quality of output in the English language.
Tokenizers seem to be a massive pain in the neck if you are just calling into an API to use your model. The algorithm itself is non-trivial, and they need pretty sizable data to function: the vocabulary and the merges, which just sit there, using memory. I'm writing https://github.com/ryszard/agency in Go, and while there's a good library for the OpenAI tokenization, the best I could find for the HF models was a library that calls into HF's Rust implementation, which makes it horrible for distribution.
However, at some point I realized that I didn't really need the tokens, just the token count, as my most important use was implementing a Token Buffer Memory (trim messages from the beginning in such a way that you never exceed the context size in tokens). And in order to do that I don't need it to be exactly right, just mostly right, if I am ok with slightly suboptimal efficiency (keeping slightly fewer tokens than the model supports). So, I took files from Project Gutenberg and compared the ratio between the token counts I get using a proper tokenizer and just calling `strings.Split`, and it seems to be remarkably stable for a given model and language (multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin).
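The estimate itself is tiny, something like this (the ratios are my empirical numbers from the Gutenberg comparison, not guarantees):

```typescript
// Word-count based token estimate: split on whitespace and multiply by an
// empirical per-model ratio (e.g. ~1.55 for OpenAI, ~1.7 for Claude, per the
// measurements described above).
function estimateTokens(text: string, ratio: number): number {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  return Math.ceil(words.length * ratio);
}
```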
I'm not throwing shade at this project – just being able to call the tokenizer would've saved me a lot of time. But I hope that if I'm wrong about the estimates being good enough, some good person will point out the error of my ways :)
> if I am ok with slightly suboptimal efficiency (keeping slightly fewer tokens than the model supports) ... multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude
This sounds reasonable to me. You might also want to consider estimates based on the number of characters. And you also need a fallback for what to do when the user provides some weird input that doesn't fall inside your safety margin, but instead causes the OpenAI API to return an error (maybe in that case you aggressively trim the input and retry?).
> I get using a proper tokenizer and just calling `strings.Split`, and it seems to be remarkably stable for a given model and language (multiply the length of the result of splitting on spaces by 1.55 for OpenAI and 1.7 for Claude, which leaves a tiny safety margin).
One time I suggested this, got downvoted to hell.
To be fair to the downvoters, I quoted OpenAI's figure of 7 tokens per word (from their tutorial page).
Seems incredibly unrealistic in hindsight, but at the time, things were fresh. Also, I think most people wanted something more robust than a linear calculation.
Somewhat tangentially, are there any open source attempts to compete with OpenAI's embeddings?
I know Word2Vec is a thing, but I believe that works on a word-by-word basis and doesn't capture the semantic meaning of whole sentences and paragraphs.
They charge so little for embeddings that I secretly hope they do open source it, because if for some reason the service is discontinued, any search functionality or the like that relies upon the API would cease to function.
I've been wondering how to use two spaces as a stop token with the LLaMA models for months. Reading the source of this finally clued me in: "▁▁". Nice. This is significantly easier to comprehend than SentencePiece.