Alex Nord

Research

Release of HSI, a Featherweight Sequence Indexing Toolkit

2021, August 21st


Following the recommendation of friend-of-the-site and master software engineer George Lesica, the HSI toolkit has finally flown the nest (i.e., been given its own GitHub repository)!

I originally developed the Hexi Sequence Index (or Hexxxi, if you're feeling saucy) as a temporary stand-in for Sean Eddy's Easel tools during a stretch of Mirage2 development when I was repeatedly downloading, compiling, and destroying Mirage2. Easel provides a much more robust library of sequence parsing and analysis tools than I have ever needed, so to cut down on compile time I wanted a pared-down version of the library containing only the functions required for Mirage2. Specifically, I needed to be able to get metadata about sequence files (e.g., the names and lengths of the sequences in a file) and then extract either entire sequences (typically proteins) or portions of sequences (typically chunks of genomic sequence) that I could then stunt on science-style.
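For anyone who wants a concrete picture of those two jobs (collect names and lengths, then pull out whole sequences or sub-ranges), here is a minimal Python sketch. To be loud about it: this is not HSI's code or interface. The function names `build_index` and `extract`, the in-memory dictionary, and the 1-based inclusive coordinates are all assumptions made up for illustration.

```python
def build_index(fasta_path):
    """Map each sequence name to (byte offset of its first sequence line, length).

    Hypothetical helper for illustration only; not part of HSI.
    """
    index = {}
    name = None
    with open(fasta_path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Assume the name is the first whitespace-delimited token after '>'.
                fields = line[1:].split()
                name = fields[0].decode() if fields else ""
                # fh.tell() now points just past the header line.
                index[name] = [fh.tell(), 0]
            elif name is not None:
                index[name][1] += len(line.strip())
    return {key: tuple(value) for key, value in index.items()}


def extract(fasta_path, index, name, start=1, end=None):
    """Return residues start..end (1-based, inclusive) of the named sequence.

    With the default arguments the entire sequence is returned.
    """
    offset, length = index[name]
    end = length if end is None else end
    if not (1 <= start <= end <= length):
        raise ValueError("requested range is outside the sequence")
    chunks = []
    seen = 0  # number of residues read so far
    with open(fasta_path, "rb") as fh:
        fh.seek(offset)  # jump straight to the sequence, skipping earlier records
        for line in fh:
            if line.startswith(b">") or seen >= end:
                break
            chunks.append(line.strip())
            seen += len(chunks[-1])
    return b"".join(chunks)[start - 1:end].decode()
```

Calling `extract(path, index, name)` returns a whole sequence (the protein case), while passing `start` and `end` returns a chunk (the genomic case). A real toolkit would presumably save the index to disk and seek directly to the requested range instead of reading from the start of the record.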

Unsurprisingly, after a week-ish of development, I was so proud of HSI that I simply had to let it play in the major leagues (i.e., permanently incorporate it into Mirage2). Benchmarking showed that HSI performs as well as Easel on the functions the two share, and HSI's significantly reduced scope made it much faster to install -- initial mission accomplished!

I won't bore you with the details, but if you find yourself frequently extracting biological sequences from FASTA files, then I'll humbly suggest giving my sweet, precious child HSI a try!