umi-checker
High-performance UMI collision detection tool built with Rust and SIMD
umi-checker is a command-line tool for detecting collisions in Unique Molecular Identifiers (UMIs) within high-throughput sequencing data. Its core job is computing Hamming distances (the number of positions at which two equal-length sequences differ) across very large UMI sets, so that UMIs too similar to be reliably told apart can be flagged before they compromise downstream analysis.
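To make the idea concrete, here is a minimal sketch of Hamming-distance collision detection. It is an illustration only, not the code used in umi-checker: the function names `hamming` and `find_collisions`, the `max_dist` threshold, and the all-pairs loop are assumptions made for this example.

```rust
/// Hamming distance between two equal-length UMIs.
/// Returns None when the lengths differ, since the distance is undefined.
/// (Hypothetical helper for illustration.)
fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}

/// Index pairs (i, j) with i < j whose UMIs differ in at most `max_dist`
/// positions. This is a naive O(n^2) scan, shown for clarity rather than speed.
/// (Hypothetical helper for illustration.)
fn find_collisions(umis: &[&[u8]], max_dist: usize) -> Vec<(usize, usize)> {
    let mut pairs = Vec::new();
    for i in 0..umis.len() {
        for j in (i + 1)..umis.len() {
            if let Some(d) = hamming(umis[i], umis[j]) {
                if d <= max_dist {
                    pairs.push((i, j));
                }
            }
        }
    }
    pairs
}
```

With the UMIs `ACGT`, `ACGA`, and `TTTT` and `max_dist = 1`, only the first two are reported as a colliding pair. The quadratic number of comparisons is what makes the per-comparison cost matter on large datasets.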
This project marks my transition into Rust development. I created umi-checker to move beyond the performance limitations of interpreted languages when handling raw sequencing data. It served as a deep dive into systems programming, allowing me to explore low-level memory management and hardware-specific optimizations.
To increase throughput, I implemented SIMD (Single Instruction, Multiple Data) comparisons, which broadened my knowledge of:
- Parallel computing: Leveraging data parallelism to process multiple UMI sequences simultaneously.
- Rust ownership model: Ensuring memory safety without garbage collection overhead.
- Benchmarking: Rigorously comparing scalar versus vectorized implementations to validate performance gains.
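The scalar-versus-vectorized contrast can be sketched as follows. This is not the SIMD code from umi-checker: as a portable stand-in it compares eight bytes at a time with plain `u64` arithmetic (sometimes called SWAR) instead of hardware SIMD intrinsics. The names `hamming_scalar` and `hamming_wide` are hypothetical.

```rust
/// Scalar reference: compare one byte at a time.
fn hamming_scalar(a: &[u8], b: &[u8]) -> usize {
    assert_eq!(a.len(), b.len(), "UMIs must have equal length");
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Word-parallel variant: compare eight bytes per step using u64 arithmetic.
/// A stand-in for real SIMD, which would process wider registers.
fn hamming_wide(a: &[u8], b: &[u8]) -> usize {
    assert_eq!(a.len(), b.len(), "UMIs must have equal length");
    // One set bit at the bottom of every byte lane.
    const LOW_BITS: u64 = 0x0101_0101_0101_0101;

    let mut dist = 0usize;
    let mut chunks_a = a.chunks_exact(8);
    let mut chunks_b = b.chunks_exact(8);

    for (x, y) in chunks_a.by_ref().zip(chunks_b.by_ref()) {
        let mut buf_a = [0u8; 8];
        let mut buf_b = [0u8; 8];
        buf_a.copy_from_slice(x);
        buf_b.copy_from_slice(y);

        // XOR leaves a non-zero byte wherever the two inputs differ.
        let mut diff = u64::from_le_bytes(buf_a) ^ u64::from_le_bytes(buf_b);
        // Fold each byte's bits down into its lowest bit. The shifts total
        // 7 bits, so no bit from a neighbouring byte reaches that lowest bit.
        diff |= diff >> 4;
        diff |= diff >> 2;
        diff |= diff >> 1;
        // Count the byte lanes that were non-zero.
        dist += (diff & LOW_BITS).count_ones() as usize;
    }

    // Handle the trailing bytes that did not fill a whole 8-byte chunk.
    dist + hamming_scalar(chunks_a.remainder(), chunks_b.remainder())
}
```

Checking that both functions return the same distance on the same inputs is a useful correctness gate before benchmarking them against each other; any speed difference should be measured rather than assumed, since the compiler may auto-vectorize the scalar loop.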
The result is a fast utility that shortens the time required to quality-check sequencing libraries.