A lightweight Rust library for training GPT-style BPE tokenizers. The tiktoken library is excellent for inference but doesn't support training. The HuggingFace tokenizers library supports training but ...
Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you a) whether the string is too long for a text ...