Designing a fast CLI join tool with Rust

(or how Rust helped me match the speed of GNU join, written by Mike Haertel)


The goal: join two files on the command line on common join fields, as fast as possible. GNU join supports only one join field, and I wanted to make it more generic by supporting an arbitrary number of them.

Initial idea

It turns out that developing fast CLI tools is quite tricky. Even a dumb-looking command like yes actually has pretty sophisticated code behind it.

At the beginning I had practically zero experience with systems programming, but I knew some theory. First of all, do only what is necessary. In Mike Haertel’s words:

The key to making programs fast is to make them do practically nothing.

Second, amortize repeated heap allocations. And third, cache is king.

With these in mind, I imagined a reusable input buffer where the parser only indexes the positions of fields and records in the buffer.

The tricky part is that most of the time the last record will not fit in the buffer entirely, and therefore the buffer cannot be fully consumed. We also can’t reload the input buffer with new data, otherwise we would overwrite the unconsumed part. The solution is to roll the partial record to the beginning of the buffer and load new data after it.

But what if the partial record is large and only a few bytes short of completion? We have already parsed most of its fields, and it would be a pity to re-parse it from the beginning after rolling the buffer. The solution is to offset the index entries covering the partial record and resume parsing from the last incomplete field.
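The roll idea can be sketched in a few lines (an illustration, not the actual rollbuf code): `consumed` marks where the last complete record ends, and the returned shift is what every saved index entry must be offset by.

```rust
// Sketch of the roll: move the unconsumed tail (the partial record)
// to the front of the buffer so new data can be appended after it.
// `consumed` is the end of the last complete record.
fn roll(buf: &mut Vec<u8>, consumed: usize) -> usize {
    let partial_len = buf.len() - consumed;
    // Copy the partial record to the beginning of the buffer...
    buf.copy_within(consumed.., 0);
    buf.truncate(partial_len);
    // ...and report how far every index entry pointing into the
    // partial record must be shifted left, so parsing can resume
    // at the last incomplete field instead of starting over.
    consumed
}
```

For example, rolling `aaa,bbb\ncc` with 8 bytes consumed leaves `cc` at the start of the buffer and returns a shift of 8.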

However, this requires delicate housekeeping to get right. Fortunately, Rust makes it very easy and convenient to test and protects me from accessing the input buffer out of bounds. Doing this in C requires a hell of a lot of concentration and experience. Granted, I pay the price of bounds checks, but the cost is acceptable.

Now let’s review the various parts in more detail.

Input buffer

Unfortunately I could not find any library providing the roll feature described above. There are ring or circular buffers, but none offers what I needed, so I made my own. You can find the code in a separate crate - rollbuf. The API is quite similar to std::io::BufReader, adding just:

pub fn roll(&mut self) {...}


Parser

This is the most sophisticated part. It is based on Y. Li, N. R. Katsipoulakis, B. Chandramouli, J. Goldstein, and D. Kossmann. Mison: a fast JSON parser for data analytics. In VLDB, 2017:

Unlike traditional FSM based parsers, it converts control flow into data flow, thereby largely eliminating inherently unpredictable branches in the program and exploiting the parallelism available in modern processors.

Implementing it from scratch was beyond my capability, so I shamelessly copied pikkr’s implementation and tailored it to my needs.
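The control-flow-to-data-flow idea can be sketched with a much simplified scalar version (the real implementation uses SIMD instructions to build these masks for 16-32 bytes at a time): instead of branching on every byte, we turn each comparison into a bit and then walk the set bits.

```rust
// Build a bitmask of separator positions in an 8-byte word.
// The comparison `b == sep` becomes a 0/1 value, so the loop body
// has no data-dependent branch - the "decision" lives in the mask.
fn separator_mask(word: &[u8; 8], sep: u8) -> u8 {
    let mut mask = 0u8;
    for (i, &b) in word.iter().enumerate() {
        mask |= ((b == sep) as u8) << i;
    }
    mask
}
```

The field boundaries are then recovered by iterating the set bits of the mask (e.g. with `trailing_zeros`), which is cheap and predictable for the branch predictor.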

The whole input buffer is parsed at once; this way we avoid a context switch for every record. When done, the parser has built two indices (heap-allocated, but reused and thus amortized):

  • fields: containing the position of each field in the input buffer
  • records: containing the position of each record in fields index

The indices are kept synchronized with the input buffer: each index entry pointing into the rolled part of the buffer is properly offset. This way we avoid parsing the same bytes twice.
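To make the two-level layout concrete, here is a toy index builder (an illustration, not the csvroll code) for an input using `,` as field separator and `\n` as record terminator:

```rust
// `fields` holds the position in the buffer where each field ends;
// `records` holds, for each record, the position one past its last
// entry in the `fields` index.
fn build_indices(buf: &[u8], sep: u8, term: u8) -> (Vec<usize>, Vec<usize>) {
    let (mut fields, mut records) = (Vec::new(), Vec::new());
    for (i, &b) in buf.iter().enumerate() {
        if b == sep {
            fields.push(i);
        } else if b == term {
            fields.push(i);
            records.push(fields.len());
        }
    }
    (fields, records)
}
```

For the input `a,b\nc,d\n` this yields fields = [1, 3, 5, 7] and records = [2, 4]: the first record’s two fields occupy fields[0..2], the second record’s fields occupy fields[2..4].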

The parser’s code is split into two parts:

  • csvroll crate - the generic building block, which takes care of synchronizing rollbuf with the index builder. It builds the fields and records indices. You can think of it as an analogue of BurntSushi’s csv-core, and it lives in a separate crate for the same reason. Currently it handles only field separators and record terminators - no quotes, escape characters, or other advanced CSV features - so for now it’s not really an alternative to csv-core.

  • the custom part, which takes care of grouping the records and is part of the rjoin crate. A group contains the positions of records with the same key (join fields). This is necessary because joining multiple records with the same key results in a Cartesian product. It handles all the complexity of calling the parser and provides a simple, iterator-like interface:

pub fn next_group(&mut self) -> Result<Option<_>, Box<Error>> {...}

Note: I’ll replace Box<Error> with failure::Error as soon as I have time.
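The grouping itself can be illustrated with a toy version that works on already-parsed, key-sorted rows (an assumption for brevity; the real code works on the indices and returns a Result):

```rust
// Returns the next run of consecutive rows sharing the same key,
// together with the position where the following group starts.
// This is what one `next_group` call conceptually yields.
fn next_group<'a>(
    rows: &'a [(&'a str, &'a str)],
    start: usize,
) -> Option<(usize, &'a [(&'a str, &'a str)])> {
    if start >= rows.len() {
        return None; // input exhausted
    }
    let key = rows[start].0;
    let mut end = start + 1;
    while end < rows.len() && rows[end].0 == key {
        end += 1;
    }
    Some((end, &rows[start..end]))
}
```

Every record in the returned group shares one key, so when the other input has a matching group, the join emits the Cartesian product of the two.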


Join

The join itself is really a simple loop which calls next_group on the two inputs, advancing each side based on whether a match occurred and which join options are active.
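That loop can be sketched like this (my reading of the approach, not rjoin’s actual code). Both inputs are assumed sorted by key, and a row is a (key, rest-of-record) pair for brevity:

```rust
use std::cmp::Ordering;

// Merge-join over two key-sorted inputs: advance the smaller side,
// and on a match emit the Cartesian product of both key groups.
fn join(left: &[(&str, &str)], right: &[(&str, &str)], out: &mut Vec<String>) {
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        match left[i].0.cmp(right[j].0) {
            // Unpaired records are simply skipped here; the real tool
            // would consult the active join options instead.
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                let key = left[i].0;
                let (li, rj) = (i, j);
                // Collect both groups sharing this key...
                while i < left.len() && left[i].0 == key { i += 1; }
                while j < right.len() && right[j].0 == key { j += 1; }
                // ...then emit their Cartesian product.
                for l in &left[li..i] {
                    for r in &right[rj..j] {
                        out.push(format!("{},{},{}", key, l.1, r.1));
                    }
                }
            }
        }
    }
}
```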


Output

The output part takes care of formatting the results of the join. Since there are many possible output formats and serialization protocols, I made it generic using a Print trait.

pub trait Print<W: io::Write> {
    fn print_left(&mut self, w: &mut W, ...) -> Result<(), Box<Error>>;
    fn print_right(&mut self, w: &mut W, ...) -> Result<(), Box<Error>>;
    fn print_both(&mut self, w: &mut W, ...) -> Result<(), Box<Error>>;
}

Currently there is only one struct implementing Print. It prints the join fields followed by the remaining fields from the left and right files, if present and/or requested.
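A minimal implementation might look like this (a sketch under assumptions: the post elides the parameter lists, so the signature here is hypothetical, the trait is reduced to one method, and io::Result stands in for Result<(), Box<Error>>):

```rust
use std::io;

// Hypothetical, reduced variant of the Print trait for illustration.
trait Print<W: io::Write> {
    fn print_both(&mut self, w: &mut W, key: &str, left: &[&str], right: &[&str])
        -> io::Result<()>;
}

struct CsvPrinter {
    sep: u8,
}

impl<W: io::Write> Print<W> for CsvPrinter {
    // Writes the join key, then the remaining left and right fields,
    // separated by `sep` and terminated by a newline.
    fn print_both(&mut self, w: &mut W, key: &str, left: &[&str], right: &[&str])
        -> io::Result<()> {
        w.write_all(key.as_bytes())?;
        for f in left.iter().chain(right) {
            w.write_all(&[self.sep])?;
            w.write_all(f.as_bytes())?;
        }
        w.write_all(b"\n")
    }
}
```

Making the writer a type parameter keeps the printer testable against an in-memory Vec<u8> while the real tool hands it a locked, buffered stdout.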

So, how fast is it?

Generally, rjoin is slightly faster when the output is small compared to the input records, thanks to rjoin’s more efficient parser. If the output is large but mostly unique, then GNU join is slightly faster: rjoin pays a penalty for supporting an arbitrary number of join fields - it requires one more indirection, which slows down output printing a bit.

If, however, the large output is non-unique (i.e. it must perform a lot of Cartesian products), then they are on par.

Detailed benchmarks can be found here.

Further optimizations

While rjoin is quite fast, it can still be faster. There are two pieces of low-hanging fruit:

  • specialization: when there is only one join field, I could get rid of the vector holding the join fields. This should speed up output printing. In the best case, where the input and output separators are the same and the first field is the (only) join field, I could even avoid parsing individual fields and reuse the fields and separators from the input.

  • threads: the parser could comfortably parse the input buffer in parallel using multiple threads. Rust should make this quite easy and safe.

Some notes on Rust

The language and tooling felt very ergonomic. For me the principal hindrance was the lack of support for non-lexical lifetimes. Knowing the code is correct but having the compiler complain is quite annoying. Luckily, the implementation has already landed in nightly, so no more frustration on that front.

First-class support for SIMD would also be very welcome. Currently it’s quite complicated to do manually. There are crates like stdsimd or faster which aim to simplify it, but the API is still experimental.


Conclusion

Writing a fast CLI tool was a pleasant and educational experience. Rust let me concentrate on the code’s logic rather than on avoiding shooting myself in the foot, or worse. That way I had more time to study algorithms and programming techniques, which in turn made me the Fire Mario.