dmatth1

This is my personal blog.

quicktok: a faster tokenizer

quicktok is a fast, exact BPE tokenizer written in C++. Token ids are byte-identical to tiktoken, and encoding runs 2–3.5× faster than bpe-openai (the fastest alternative I know of) and 4–11× faster than tiktoken itself. I believe it’s the fastest exact CPU tokenizer available today for these encodings. It ships cl100k, o200k, GPT-OSS (o200k_harmony), Llama-3, and Qwen2.5/3, all byte-exact, plus bring-your-own Llama-4. This is useful for anyone doing large amounts of CPU-bound data processing — search indexing, ingesting corpora, token counting/billing — and can significantly reduce the time and cost of data ingestion. It can also be used for online request serving, such as CPU-bound inference paths (token counting, embedding serving). ...

arrow-cpp, arrow-rs, and Velox still ship the scalar Parquet bloom probe

arrow-go shipped AVX2/SSE4/NEON split-block Bloom filter (SBBF) probes in 18.3.0 (PR #336). arrow-cpp and Velox ship the same scalar reference line-for-line; arrow-rs is the same algorithm in Rust. Between them, those three cover the C-family Parquet ecosystem: DuckDB, ClickHouse, Polars, DataFusion, Trino, Presto, Spark via Gluten, StarRocks, Doris, pyarrow → pandas, Apache Drill. A C++ port of the arrow-go approach is bit-identical on 167M (query, filter) pairs, 3–5× faster in-cache on the probe microbenchmark, and 1.5× faster out-of-L3 with a 4-way bulk path across row-group filters. ...
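For readers who have not looked at the probe itself, here is a minimal sketch of a scalar split-block Bloom filter insert and probe, written from my reading of the Parquet Bloom filter specification. It is not the arrow-cpp, Velox, or arrow-go source. The names (Block, BlockIndex, BitFor, Insert, Probe) are hypothetical, and the caller is assumed to supply the 64-bit hash of the value (the spec uses xxHash64).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Salt constants as given in the Parquet split-block Bloom filter spec.
constexpr std::array<std::uint32_t, 8> kSalt = {
    0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 0xa2b7289dU,
    0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};

// One 256-bit block: eight 32-bit words (hypothetical type name).
using Block = std::array<std::uint32_t, 8>;

// Select a block from the upper 32 bits of the hash.
// Assumes num_blocks fits in 32 bits, so the product cannot overflow.
inline std::size_t BlockIndex(std::uint64_t hash, std::size_t num_blocks) {
  return static_cast<std::size_t>(((hash >> 32) * num_blocks) >> 32);
}

// The single bit to set or test in word i, derived from the lower 32 bits.
inline std::uint32_t BitFor(std::uint32_t key, std::size_t i) {
  // Unsigned multiply wraps mod 2^32; the top 5 bits pick a bit position 0..31.
  return std::uint32_t{1} << ((key * kSalt[i]) >> 27);
}

void Insert(std::vector<Block>& filter, std::uint64_t hash) {
  Block& block = filter[BlockIndex(hash, filter.size())];
  const std::uint32_t key = static_cast<std::uint32_t>(hash);
  for (std::size_t i = 0; i < 8; ++i) block[i] |= BitFor(key, i);
}

// Returns false if the value is definitely absent, true if it may be present.
bool Probe(const std::vector<Block>& filter, std::uint64_t hash) {
  const Block& block = filter[BlockIndex(hash, filter.size())];
  const std::uint32_t key = static_cast<std::uint32_t>(hash);
  for (std::size_t i = 0; i < 8; ++i) {
    if ((block[i] & BitFor(key, i)) == 0) return false;
  }
  return true;
}
```

The eight multiply-shift-test steps are independent and each works on a 32-bit lane, so the whole loop fits a single 256-bit vector register. That is presumably what makes an AVX2 version of this probe attractive.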