Austin Saunders

I came across an interesting read a few months ago about how the BRCA genes can have a larger impact on men than previously thought. For those of you who don't know, the BRCA1 and BRCA2 genes are responsible for repairing damaged DNA, effectively acting as tumor suppressors. If you have a mutation in one of these genes, it can create a heightened risk for breast and ovarian cancer in women. It was thought that the effect of these mutations was primarily isolated to women, but this article points to recent research showing that BRCA mutations can also increase the risk of prostate, pancreatic and a host of other cancers affecting both women and men. Naturally, I was curious to see whether I had any matches for known BRCA mutations.

A few years ago, I was gifted a 23andMe ancestry/health kit for Christmas. After I took the test and sent it off to the lab I waited patiently for my results. When I received my results, there were a few common genetic health risks that were included in the report. However, for a fee, you could access way more health reports tailored just for you. If I remember correctly, these specialized reports were paywalled behind a hefty fee of almost $500. At the time, I was a bit annoyed that I didn't have access to all of the health reports, but I wasn't going to fork over that much money for a classic marketing upsell.

Fast forward to now: I was curious to see whether any new (or previously paywalled) health risks had been added to my health report since I first viewed it. I logged into my account and navigated to the health reports page, where I saw that the BRCA gene report was available, but it looked like it was only for the known prostate cancer risk variants. Overall, it didn't seem like anything had really changed. On my report dashboard, I also saw that they were still trying to charge for access to these reports. Instead of a one-time fee, you could now purchase a premium health subscription for almost $200 per year! (Checking again at the time of writing, it looks like it has since fallen to $69 per year.) However, after digging around the 23andMe website a bit, it turned out that you can actually download your raw genetic data for free.

$200 per year for premium health reports???

Thinks in Software Engineer... Yeah, not gonna happen. I'll just waste two weeks of my life and figure it out myself.

Understanding the Problem

The first major challenge wasn't the code – it was understanding enough about genetics to build something useful. This knowledge was crucial for ensuring the tool would make accurate comparisons between the datasets.

I started my journey by diving headfirst into genetics. I spent countless hours poring over resources from the National Human Genome Research Institute and the European Bioinformatics Institute, and researching genetic testing companies. I found myself in a pretty deep rabbit hole learning about everything from DNA sequencing methods to the standardization of genetic variant notation. I'm extremely grateful for all the researchers and educators who publish such high-quality learning resources and documentation!

Next came finding a reliable data source for genetic variants. Multiple public genetic variant databases exist, but from my research it seemed like the ClinVar database was the most comprehensive. ClinVar also conveniently provides a downloadable version of their database, so I decided to pull it down and see if I could come up with a way to cross-check my 23andMe data against it.

The ClinVar data follows the VCF format, which provides the following information for a given genetic variant: a chromosome, position, reference allele and alternate allele, plus other metadata about the variant. That metadata includes the clinical significance (is this genetic variation benign, pathogenic, of uncertain significance, or a drug response?) and any known diseases the variant is associated with, among other items.
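To make the record layout concrete, here is a minimal sketch of parsing one VCF data line in Elixir. It is not the actual clinvar-checker code: the module name, the map keys and the sample values are made up for illustration, and it assumes the standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) with ClinVar's clinical significance and disease names stored under the CLNSIG and CLNDN keys of the INFO column.

```elixir
defmodule VcfSketch do
  # Hypothetical helper: parses one line of a VCF file into a map.
  # Header and metadata lines start with "#" and are skipped.
  def parse_line("#" <> _header), do: :skip

  def parse_line(line) do
    case String.split(String.trim_trailing(line), "\t") do
      [chrom, pos, id, ref, alt, _qual, _filter, info | _rest] ->
        info = parse_info(info)

        {:ok,
         %{
           chromosome: chrom,
           position: String.to_integer(pos),
           id: id,
           ref: ref,
           alt: alt,
           clinical_significance: Map.get(info, "CLNSIG"),
           disease_names: Map.get(info, "CLNDN")
         }}

      _other ->
        :skip
    end
  end

  # INFO is a semicolon-separated list of KEY=VALUE pairs (flags have no "=").
  defp parse_info(info) do
    info
    |> String.split(";", trim: true)
    |> Map.new(fn pair ->
      case String.split(pair, "=", parts: 2) do
        [key, value] -> {key, value}
        [flag] -> {flag, true}
      end
    end)
  end
end

# Made-up sample line, not a real ClinVar record.
sample = "1\t12345\t999999\tG\tA\t.\t.\tCLNSIG=Pathogenic;CLNDN=Example_condition\n"
{:ok, variant} = VcfSketch.parse_line(sample)
IO.inspect(variant)
```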

The 23andMe dataset, on the other hand, provides genotype information for specific positions (SNPs - Single Nucleotide Polymorphisms) across chromosomes, rather than a complete sequence of each chromosome. These positions are carefully selected based on known genetic variants of interest, using microarray technology to test for specific alleles at each position. This means we won't be able to use the ClinVar data to its fullest extent, but we should at least be able to access genetic variants gated by the paid subscription. In this dataset, for each SNP we have access to the following data: rsid, chromosome, position and genotype.

At a cursory glance, we can see that the two datasets have chromosome and position in common, but they differ when it comes to genotyping. So what is the difference between the two? In the context of the 23andMe data, the genotype is a specific combination of alleles (typically 2) made up of the base nucleotides A, T, G or C at a specific location. When comparing against a known single-nucleotide variant (SNV), the order of the 23andMe genotype doesn't matter - for our purpose a genotype of AG is the same as GA.
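As a rough illustration (again, a sketch rather than the tool's real code), parsing one line of the 23andMe export could look like the following. It assumes the raw data file is tab-separated with comment lines starting with "#". The module name and the sample rsid are hypothetical, and sorting the alleles is just one simple way to make AG and GA compare equal.

```elixir
defmodule RawDataSketch do
  # Hypothetical helper: parses one line of a 23andMe raw data export.
  # Data lines are tab-separated: rsid, chromosome, position, genotype.
  def parse_line("#" <> _comment), do: :skip

  def parse_line(line) do
    case String.split(String.trim(line), "\t") do
      [rsid, chromosome, position, genotype] ->
        {:ok,
         %{
           rsid: rsid,
           chromosome: chromosome,
           position: String.to_integer(position),
           # Sort the alleles so that "GA" and "AG" normalize to the same value.
           genotype: genotype |> String.graphemes() |> Enum.sort() |> Enum.join()
         }}

      _other ->
        :skip
    end
  end
end

# Made-up sample line, not real data.
{:ok, snp} = RawDataSketch.parse_line("rs0000001\t1\t12345\tGA\n")
IO.inspect(snp)
```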

So how do we determine whether or not the 23andMe data has any positive ClinVar SNV matches?

When the 23andMe genotype has two of the same allele (e.g. AA), this is known as a homozygous genotype (i.e. the alleles of the genotype are the same). A genotype with only one allele (e.g. T on the X chromosome) is strictly speaking hemizygous, but for matching purposes we treat it the same way. In both cases, we check whether the 23andMe allele matches any alternate allele at that chromosome and position in the ClinVar data. Since the reference allele is effectively a wildcard here, a single genotype can produce multiple positive variant matches.

When the 23andMe genotype has two different alleles (e.g. AT), this is known as a heterozygous genotype (i.e. the alleles of the genotype are different). In this case, we check whether the two 23andMe alleles match the reference allele and alternate allele of a variant at that chromosome and position in the ClinVar data. Remember how I said before that the order of the genotype doesn't matter? This is where that comes into play. We have to perform 2 lookups, one for each order of the genotype (e.g. lookup 1: ref=A and alt=T, and lookup 2: ref=T and alt=A), and either one can produce a positive variant match.
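The two cases above can be sketched in a few lines of Elixir. This is only an illustration of the matching rules as described here, using a plain list of maps as a stand-in for the ClinVar data; the module and function names are hypothetical and the real tool's data structures differ.

```elixir
defmodule MatchSketch do
  # `variants` is a list of maps with :chromosome, :position, :ref and :alt
  # keys (a stand-in for the ClinVar lookup, not the tool's real storage).
  def matches(variants, chrom, pos, genotype) do
    site = Enum.filter(variants, &(&1.chromosome == chrom and &1.position == pos))

    case String.graphemes(genotype) do
      # Single allele (e.g. "T"): the reference allele is a wildcard, so any
      # variant at this site whose alternate allele matches counts.
      [a] ->
        Enum.filter(site, &(&1.alt == a))

      # Homozygous (e.g. "AA"): same rule as the single-allele case.
      [a, a] ->
        Enum.filter(site, &(&1.alt == a))

      # Heterozygous (e.g. "AT"): two lookups, one per ordering of ref and alt.
      [a, b] ->
        Enum.filter(site, fn v ->
          {v.ref, v.alt} == {a, b} or {v.ref, v.alt} == {b, a}
        end)

      # Anything else (for example a no-call) is not matched.
      _other ->
        []
    end
  end
end

# Made-up variants for illustration.
variants = [
  %{chromosome: "1", position: 100, ref: "G", alt: "A"},
  %{chromosome: "1", position: 100, ref: "C", alt: "A"},
  %{chromosome: "1", position: 100, ref: "A", alt: "T"}
]

IO.inspect(MatchSketch.matches(variants, "1", 100, "AA"))
IO.inspect(MatchSketch.matches(variants, "1", 100, "TA"))
```

With this sample data, the homozygous genotype AA matches both variants whose alternate allele is A, while the heterozygous genotype TA only matches the variant with ref=A and alt=T.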

It's worth noting that there are other variants that are not SNVs, but rather multi-nucleotide variants (MNVs), which involve two or more nucleotides. A limitation of the 23andMe data is that it does not lend itself to analyzing MNVs, so clinvar-checker will not be able to detect these types of variants. If you are interested in seeing the logic table for how SNV matches are determined, you can check the reference logic table here.

Building clinvar-checker

After gaining a working knowledge of genetic variants, then came the engineering challenges. The sheer size difference between the datasets was striking: ClinVar's 1.2GB database with almost 3 million records versus 23andMe's relatively modest 20MB export. This disparity shaped the entire architecture of the tool. Since this would be running on users' local machines rather than some beefy cloud server, every optimization mattered.

The solution architecture evolved naturally from these constraints. I reached for Erlang Term Storage (ETS) as the backbone of the system. ETS tables are perfect for this use case – they provide O(log n) lookups when configured as ordered sets, and they support concurrent access patterns that would become crucial for performance. By enabling both read and write concurrency, we could load and query the data in parallel without creating bottlenecks.
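For readers who haven't used ETS before, here is a small sketch of what such a table could look like. The table name, the `{chromosome, position, ref, alt}` key shape and the sample rows are assumptions for illustration, not necessarily what clinvar-checker uses.

```elixir
# Create an ordered_set table with both concurrency options enabled.
# :public lets processes other than the owner read and write it.
table =
  :ets.new(:clinvar_sketch, [
    :ordered_set,
    :public,
    read_concurrency: true,
    write_concurrency: true
  ])

# Made-up rows: the key is {chromosome, position, ref, alt} and the value is
# the clinical significance.
:ets.insert(table, {{"1", 100, "G", "A"}, "Pathogenic"})
:ets.insert(table, {{"1", 100, "C", "A"}, "Benign"})
:ets.insert(table, {{"1", 200, "T", "C"}, "Benign"})

# Exact key lookup, when both ref and alt are known.
exact = :ets.lookup(table, {"1", 100, "G", "A"})

# Wildcard on the reference allele: :_ matches anything in that slot.
wildcard = :ets.match_object(table, {{"1", 100, :_, "A"}, :_})

IO.inspect(exact)
IO.inspect(wildcard)
```

The exact lookup corresponds to the heterozygous case, where both alleles are known, and the wildcard match corresponds to the homozygous case, where only the alternate allele is constrained.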

For processing this mountain of genetic data, I turned to Flow, a library that has fascinated me since I first encountered it in José Valim's 2016 ElixirConf keynote, which as fate would have it was my entry point into Elixir. Flow adopts the MapReduce paradigm while leveraging the BEAM's concurrency model, allowing us to utilize all available CPU cores for both data loading and processing. This library was a game-changer for performance.
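To give a feel for the map-then-reduce shape without pulling in a dependency, here is a sketch that uses `Task.async_stream/3` from the standard library as a stand-in. This is not Flow and not how clinvar-checker is written; it only illustrates the same idea of mapping over chunks of lines on all cores and then merging the partial results. The module name and the per-chromosome count are made up for the example.

```elixir
defmodule ParallelSketch do
  # Counts genotype calls per chromosome, processing chunks of lines in
  # parallel (map step) and merging the partial counts (reduce step).
  def count_by_chromosome(lines) do
    lines
    |> Stream.chunk_every(1_000)
    |> Task.async_stream(&count_chunk/1,
      max_concurrency: System.schedulers_online(),
      ordered: false
    )
    |> Enum.reduce(%{}, fn {:ok, partial}, acc ->
      Map.merge(acc, partial, fn _chrom, a, b -> a + b end)
    end)
  end

  # Map step: count the well-formed data lines in one chunk by chromosome.
  defp count_chunk(lines) do
    Enum.reduce(lines, %{}, fn line, acc ->
      case String.split(line, "\t") do
        [_rsid, chrom, _pos, _genotype] -> Map.update(acc, chrom, 1, &(&1 + 1))
        _other -> acc
      end
    end)
  end
end

# Made-up sample lines.
lines = ["rs1\t1\t100\tAG", "rs2\t1\t200\tCC", "rs3\tX\t300\tT", "# comment"]
IO.inspect(ParallelSketch.count_by_chromosome(lines))
```

Flow builds on the same principle but adds partitioning and back-pressure on top, which is what makes it a better fit for millions of records.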

The optimization journey didn't stop there. I then turned to MapSets for lightning-fast clinical significance filtering and leaned heavily into pattern matching to optimize Flow's throughput. Each small optimization contributed to the final result: a tool that can process and cross-check these datasets in about 8 seconds on a standard laptop.
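Here is a minimal sketch of what MapSet-based filtering combined with pattern matching can look like. The module name, function name and map shape are hypothetical, and the significance labels are just example values in the style of ClinVar's CLNSIG field.

```elixir
defmodule SignificanceFilterSketch do
  # Keeps a variant only if its clinical significance is in the wanted set.
  # Pattern matching in the function head pulls out the field and rejects
  # variants without a significance string before any set lookup happens.
  def keep?(%{clinical_significance: sig}, %MapSet{} = wanted) when is_binary(sig) do
    MapSet.member?(wanted, sig)
  end

  def keep?(_variant, _wanted), do: false
end

# Example labels the user might want reported (assumed values).
wanted = MapSet.new(["Pathogenic", "Likely_pathogenic"])

# Made-up variants for illustration.
variants = [
  %{id: "a", clinical_significance: "Pathogenic"},
  %{id: "b", clinical_significance: "Benign"},
  %{id: "c", clinical_significance: nil}
]

kept = Enum.filter(variants, &SignificanceFilterSketch.keep?(&1, wanted))
IO.inspect(kept)
```

Membership checks against a MapSet avoid scanning a list for every variant, which adds up when the filter runs millions of times.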

But perhaps the most satisfying aspect was making the tool accessible to others. Using Burrito to "wrap" everything into a single executable meant anyone could use it without needing to understand Elixir or deal with dependencies. This transformed what could have been a personal script into a tool that others could easily use to explore their own genetic data.

Reflection

This project reinforced something I've always believed about software engineering: sometimes the best solutions come from refusing to accept artificial limitations. Instead of paying for a subscription service, I ended up creating something that not only solved my immediate need but also pushed me to learn about an entirely new domain.

The performance achievement (processing almost 3 million genetic variants in seconds on a local machine) showcases one aspect of how Elixir truly shines. The combination of the BEAM's concurrency model, Erlang's battle-tested tools like ETS, and elegant abstractions like Flow made it possible to build something that would have been much more challenging or verbose in other languages.

If you're interested in exploring the code or trying it out yourself, you can find clinvar-checker at github.com/ssaunderss/clinvar-checker. Whether you're curious about your own genetic data or just interested in seeing how Elixir can handle large-scale data processing, I hope this project inspires you to explore the intersection of biology and computer science.

And hey, saving $200 a year isn't bad either.