GrinGene Bioinformatics

Have you ever wondered what makes someone tick?

In this post, I will attempt to explain one of my core evangelistic, philosophical world views. This is not about Christianity, nor bioinformatics, nor painting. Those are just the picture hooks that I'm using to hang things on, so that I can translate my own world view into terms that other people are more likely to understand.

Painting

But why painting?

Because I feel that painting is the quickest way to explain this approach to tackling the world's problems. There's a video that I really like, which I think demonstrates this well: a video of someone speed painting in one hour using a Free/Libre program called Krita:

Speedpainting Timelapse, Krita 2.8 [sped up about 25 times]

I love a lot of things about this video. I'm not going to list them all, but I'll highlight a few that I think are key to the points I want to get across.

1. Everything is created out of something that already works

It could be said that the painting starts with a blank slate, but that's not quite correct, because there are things that have been prepared in advance of the painting. If nothing else, the computer program exists as a pre-created environment that many other people have devoted time and effort to improving. The "blank slate" is an already working thing: a featureless image... that happens to be given a quick slate-coloured wash.

2. The working thing is changed in small steps

I love that the painting starts out with broad strokes. These are simple painted lines that I imagine I could create myself. Even if the end product is beyond my own expertise, I can see how it is created by laying small changes over the top of the existing painting.

3. Changes can break things

About 11 seconds into the painting, the artist realises that a darkened blur is the wrong size. They make it smaller, and it breaks the painting, creating something that looks worse overall, but better in the area that the artist is working on. We realise later on that this darkened blur is the main subject of the painting, so it makes sense that the artist cares a lot about getting this bit right.

4. Broken things can be fixed

The broken painting is not a large issue, because the artist understands how to recover from a broken product. They create additional strokes to improve the painting at regions where there are issues, and once it looks okay overall, they get back to improving other areas of the painting.

5. Improvements can always be made

The artist is limited by the one hour they have to make the painting, so there is a fixed end-point for their work. But you might notice that the thumbnail image for this video actually has additions: text and a speech bubble. The general shape of the image seems to me to be there after about 35s, and the artist is happy enough to save a snapshot after about 1m05s (after clipping and smearing the edge). But the artist doesn't stop there; they keep adding until it's good enough.

Interlude - GrinGene Bioinformatics

I (David Eccles) am the owner/operator of a sole-trader business - GrinGene Bioinformatics - and can provide more specific advice about research workflow optimisation, including data analysis and result presentation.

If you'd like to see more posts like this, please let me know in the comments.

If you have a research project that you'd like help with, feel free to contact me for advice. See here for more details:

https://www.gringene.org/services.html

Bioinformatics

But why bioinformatics?

Because I am not a painter; bioinformatics is my working life at the moment. I see bioinformatics as the process, or art, of converting biological research outputs into something that can be better understood by other researchers - research outputs that typically, but not always, involve very large datasets. One of my favourite explanations of bioinformatics is that it’s a bit like surfing: chasing waves of information from an ocean of data, and presenting them in an interesting way before they reach the shore of public knowledge.

I have found myself frequently applying these painting ideas to the coding work that I'm doing as part of bioinformatics projects:

  1. Start with something that works
  2. Change the code to tell it what I want it to do
  3. These changes frequently break the code
  4. Fix and debug the code so that it works again
  5. If the code still does what I want, or if I've had enough, stop. Otherwise, return to step 2

This approach applies all over the place in the things that I do. I've mentioned it a couple of times on Twitter, for creating a wind turbine, and for creating a microfuge tube earring. Those aren't exactly bioinformatics, but the more tangible nature of 3D models makes it easier for me to explain this process of gradually building something that works. But here... I'm going to dig a bit deeper and talk about a small bioinformatics task I've been working on.

1. Start with something that works

If I start with something that is close to what I want to end up with, and it already "works" (whatever that means), then the effort required to create the thing that I actually want is substantially less. I have some code that generates a plot of repetitive information in a DNA sequence. Explaining in detail what the plot represents takes a while, so interested people can have a peek at my presentation on the topic.

In any case, I have an image. This is how I check that my code still works, and it's one of the ways that I find out what is broken:

[image: REPAVER plot of an assembled haplotype from a chimpanzee (Pan troglodytes); the sequence was created as part of the Vertebrate Genomes Project, combining sequence information from PacBio, ONT, 10x, Bionano, Dovetail, and Illumina reads]

2. Change the code to tell it what I want it to do

In this case, instead of changing the output, what I wanted the code to do was run faster. It was taking over five minutes to process the sequence and generate the above image, and I expected that only a few little tweaks would be needed to fix that problem.

More specifically, I had code that did something like this for each kmer:

  1. Start with a hash result of 0
  2. Convert the next base in the kmer to a 64-bit hash
  3. Shift that base hash a position dependent on the base location within the kmer
  4. XOR the shifted base hash with the current result
  5. If all the bases are processed, stop. Otherwise, return to step 2

And wanted it to do something like this for all kmers:

  1. Generate a hash for the first kmer
  2. Start with the hash result of the previous kmer, shifted one position
  3. Remove the value of the base that is no longer seen
  4. Add in the value of the new base
  5. If all the kmers are processed, stop. Otherwise, return to step 2

[see an explanation of the algorithm here]

This changed an operation that worked on every base within a kmer into an operation that only works on the first and last bases of each kmer. When the algorithm is running on hundreds of millions of locations within a chromosome, and the kmer size is moderately large (125bp in my test case), small changes like that can make a big difference in the run time of a program.
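
To make that difference more concrete, here is a minimal sketch of the two approaches in Python. The seed values, the use of bit rotation as the "shift", and the function names are illustrative choices for this post (in the spirit of ntHash-style rolling hashes), rather than a copy of the real REPAVER code:

  MASK64 = (1 << 64) - 1

  # Arbitrary 64-bit seed values for each base (hypothetical constants).
  SEED = {
      'A': 0x3C8BFBB395C60474,
      'C': 0x3193C18562A02B4C,
      'G': 0x20323ED082572324,
      'T': 0x295549F54BE24456,
  }

  def rotl64(value, shift):
      """Rotate a 64-bit value left by `shift` bits."""
      shift %= 64
      return ((value << shift) | (value >> (64 - shift))) & MASK64

  def slow_hash(kmer):
      """Hash one kmer by visiting every base (the first list above)."""
      result = 0
      k = len(kmer)
      for i, base in enumerate(kmer):
          # Shift (here: rotate) the base hash by an amount that depends on
          # the base's position within the kmer, then XOR it into the result.
          result ^= rotl64(SEED[base], k - 1 - i)
      return result

  def rolling_hashes(seq, k):
      """Hash every kmer in seq by updating the previous kmer's hash (the second list above)."""
      result = slow_hash(seq[:k])        # generate a hash for the first kmer
      yield result
      for pos in range(1, len(seq) - k + 1):
          result = rotl64(result, 1)                # shift the previous result one position
          result ^= rotl64(SEED[seq[pos - 1]], k)   # remove the base that is no longer seen
          result ^= SEED[seq[pos + k - 1]]          # add in the value of the new base
          yield result

  # Sanity check: every rolling hash should match the per-kmer hash.
  seq = "ACGTACGGTTCAAGT"
  k = 5
  assert list(rolling_hashes(seq, k)) == [slow_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]

With this structure, each new kmer costs a fixed handful of operations rather than one per base, which is where the speedup comes from when the kmer size is something like 125.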

3. These changes frequently break the code

... and that's okay.

[image: git diff of the initial implementation of fast hashing, including additional debug output to show when the fast hash doesn't match the slow hash]

In this case, I encountered situations where my attempts at creating a fast hash led to broken code: code that didn't produce the correct output. The diff above represents the state after I made a few tweaks, and is technically correct (i.e. it produces the correct output), but it takes even longer than the initial implementation, because it compares the fast hash process with the slow hash process and uses the slow value if they differ. This was not ideal.
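
Reusing the hypothetical rotl64, SEED and slow_hash from the earlier sketch, that kind of cross-check might look something like this; it keeps the output correct while flagging where the fast path goes wrong, at the cost of doing both calculations:

  def checked_rolling_hash(prev_hash, outgoing, incoming, kmer):
      """Fast (rolling) hash for a kmer, cross-checked against the slow hash;
      falls back to the slow value if they disagree (a debugging aid, not a fix)."""
      k = len(kmer)
      fast = rotl64(prev_hash, 1) ^ rotl64(SEED[outgoing], k) ^ SEED[incoming]
      slow = slow_hash(kmer)
      if fast != slow:
          print(f"fast/slow hash mismatch for kmer {kmer}: {fast:#x} vs {slow:#x}")
          return slow  # trust the slow hash until the fast path is fixed
      return fast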

4. Fix and debug the code so that it works again

The point of this step is to return the code (or the thing) to a state that means it is once again usable. Ideally, a state that is better than the original state, but not necessarily the same as the goal. In my case, that meant working through the bugs enough that the initial forward repeat hashing was complete, and sufficiently fast (i.e. taking under a minute to complete), but the remaining code was still slow:

[image: command line output from running REPAVER on Pan troglodytes chromosome 1, indicating that the forward repeat finding process took 59.917 seconds]

5. If the code still does what I want, or if I've had enough, stop. Otherwise, return to step 2

This is an iterative process. In the process of fixing things, I often encounter new bugs. I might discover that the speedup is not actually as fast as I had expected, so I need to hunt around for other solutions, like a fast hashmap library which does the even lower-level stuff a bit quicker. Eventually, after many iterations, I got to a stop point; I'd had enough.

[image: console output from REPAVER showing forward repeat hashing taking 32.867 seconds, and other repeat hashing taking 59.771 seconds]

There are more things that I can do with this code, but I'm okay with its current state. I successfully reduced the processing time from "It'll be done when I get back from my break" to "It'll be done after I check a couple of emails".

Summary

This "painting" approach of iterative development, accepting temporary failure as a necessary part of the process of improvement, extends into many different areas of my life. As long as I can keep little pockets of success along the path to my goals, the setbacks along the way can be weathered.

[header image: final panel of xkcd.com/349, a common experience in which attempts to improve or change something can get you into even worse trouble, and where just getting back to the state at which you started becomes an arduous or even impossible task.]


Consider the following scenario: your research lab has been around for a few years, and you are wondering whether all these sequencing services you are paying for are worth it - should you invest in buying your own medium-scale DNA sequencer?

Prelude - Nanopore Sequencing

Disclaimer: I am a fan of Oxford Nanopore Technologies (ONT). ONT has produced a sequencer that can do something no other commercial sequencing company can do: carry out [relatively] model-free discovery of novel polymers (e.g. DNA & RNA with modifications). Combined with their relatively public disclosure of information (especially with regards to pricing), I've been willing to work around numerous technical and operational issues because their "move fast, release early" philosophy works well with my preferred way of doing research.

If you're just starting out, don't have much money, or just want a taste of sequencing, then there's only one realistic option for an in-house sequencer: Oxford Nanopore's MinION. A MinION Mk1b Starter Pack with a 24-sample Rapid Barcoding Kit is available from Oxford Nanopore for $2,000 USD ($3,300 NZD). For another $1,460 USD ($2,400 NZD), you can get a Flongle adapter.

The Flongle adapter gives you access to Flongle flow cells, which are the cheapest per-run sequencing flow cells available at $810 USD ($1,350 NZD) for a pack of 12. Depending on the use case, the per-sample consumable cost can be quite low on Flongle, down to $8.50 NZD per sample when running 24 samples using the Rapid Barcoding Kit (which is competitive vs Sanger sequencing for anything more than a couple of 1kb reads per sample).
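
As a rough back-of-envelope check (the library preparation figure here is an assumption chosen to land near that per-sample cost, not a quoted ONT price), the per-sample consumable cost breaks down something like this:

  # Rough per-sample consumable cost on Flongle (all figures NZD).
  flongle_pack_cost = 1350          # 12-pack of Flongle flow cells (see above)
  flow_cells_per_pack = 12
  samples_per_flow_cell = 24        # 24-plex Rapid Barcoding run

  flow_cell_cost_per_sample = flongle_pack_cost / flow_cells_per_pack / samples_per_flow_cell
  print(f"flow cell: ${flow_cell_cost_per_sample:.2f} per sample")          # ~$4.69

  # Assumed Rapid Barcoding library prep cost per sample (hypothetical figure).
  assumed_library_prep_per_sample = 3.80
  print(f"total: ~${flow_cell_cost_per_sample + assumed_library_prep_per_sample:.2f} per sample")  # ~$8.49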

But this post isn't about the MinION (a device with a yield of 5-15 GB for MinION flow cells, and 0.2-1 GB for Flongle flow cells); it's about sequencing on a device that can realistically and comfortably sequence a human genome in a single run.

Aside - GrinGene Bioinformatics

I (David Eccles) am the owner/operator of a sole-trader business - GrinGene Bioinformatics - and can provide more specific advice about high-throughput sequencing projects, from sequencing library preparation through to downstream data analysis and result presentation.

If you'd like to see more posts like this, please let me know in the comments.

If you have a research project that you'd like help with, feel free to contact me for advice. See here for more details:

https://www.gringene.org/services.html

Cost Summary

There are three popular players in the low-end large-scale sequencing market: Illumina, Pacific Biosciences (PacBio), and Oxford Nanopore (ONT). Here is an approximate breakdown of the direct costs of getting a human-genome-scale system up and running from each company (costs are approximate, in NZD, mostly converted from USD list prices, and don't include additional external costs):

Illumina NextSeq 1000

  • Initial system capital cost: $350,000
  • Cost per run: $7,400 (P2, 600 cycle, 400M read kit)
  • Maintenance: $35,000 / year (estimate, assuming 10% of system cost)
  • Yield per run (using the above kit): 240 GB
  • Marginal kit cost per GB: $31
  • Read length: 600bp
  • Read accuracy: Q30 (99.9%)
  • Sequencing run time: 34 hours

PacBio Revio

  • Initial system capital cost: $1,300,000
  • Cost per run: $1,650
  • Maintenance: $75,000 / year (estimate, based on reported Sequel maintenance costs)
  • Yield per run: 90 GB
  • Marginal kit cost per GB: $18
  • Read length: up to 20 kb
  • Read accuracy: Q33 (99.95%)
  • Sequencing run time: 24 hours

ONT P2 Solo

  • Initial system capital cost: $38,000 (CapEx version)
  • Cost per run: $1,650
  • Maintenance: $1,650 / year
  • Yield per run: 50-150 GB
  • Marginal kit cost per GB: $11-$33
  • Read length: up to 2 Mb
  • Read accuracy: Q28 (99.8%)
  • Sequencing run time: 5 mins to 72 hours
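
For anyone wanting to check or adapt these figures, the marginal kit cost per GB is just the per-run kit cost divided by the per-run yield; a small sketch using the numbers quoted above (NZD):

  # Marginal kit cost per GB = kit cost per run / yield per run (figures from above, NZD).
  platforms = {
      "Illumina NextSeq 1000 (P2, 600 cycle)": (7400, 240),
      "PacBio Revio": (1650, 90),
      "ONT P2 Solo (low yield)": (1650, 50),
      "ONT P2 Solo (high yield)": (1650, 150),
  }

  for name, (cost_per_run, yield_gb) in platforms.items():
      print(f"{name}: ${cost_per_run / yield_gb:.0f} per GB")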

Capital Cost Recovery

From my perspective, the marginal per-run costs for the platforms are similar. In other words, a service centre with good cost recovery for capital expenses and maintenance could choose any one of these systems and get similarly cheap throughput.

However, most research labs are not service centres, so the maintenance and capital costs should dominate purchasing decisions. Here are the approximate maintenance costs taken from the above summaries:

  • Illumina NextSeq 1000: $35,000 / year
  • PacBio Revio: $75,000 / year
  • P2 Solo: $1,650 / year

Assuming that none of the sequencing runs are done for internal use, and that other people are charged double the reagent cost, here is the approximate number of gigabases that would need to be sequenced in order to recover the ongoing maintenance cost (and the number of flow cells required):

  • Illumina NextSeq 1000 (P2, 600 Cycles): 2300 Gb (10 flow cells / runs)
  • Nanopore P2 Solo (PromethION): 100 - 300 Gb (2 flow cells; 1-2 runs)
  • PacBio Revio: 8200 Gb (91 flow cells; 23-91 runs)

On the flip side, this is the minimum amount of sequencing that would need to be sold as a service to other people (with overheads taking up 50% of the service charge) over the course of a year before it would make financial sense to consider buying one of these sequencers for internal use.

Given that the service market for sequencing is somewhat competitive, that 50% service charge assumption may be invalid. If the service charge were similar to (or lower than) the in-house reagent cost (which happens for Illumina sequencing due to steep volume discounting), it may not be possible to charge substantially more than reagent costs and still get sufficient demand for cost recovery.
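
For reference, the recovery figures in the list above are roughly what you get if you assume that about half of the marginal kit cost per gigabase ends up available as margin for maintenance recovery; a simplified sketch of that arithmetic (not the full cost model):

  # Approximate gigabases of external sequencing per year needed to recover maintenance,
  # assuming the available margin is about half of the marginal kit cost per GB
  # (a simplifying assumption chosen to match the figures above; NZD).
  recovery_inputs = {
      "Illumina NextSeq 1000": (35000, 31),    # (maintenance per year, kit cost per GB)
      "PacBio Revio": (75000, 18),
      "ONT P2 Solo (best case)": (1650, 11),
  }

  for name, (maintenance, kit_cost_per_gb) in recovery_inputs.items():
      gb_needed = maintenance / (kit_cost_per_gb / 2)
      print(f"{name}: ~{gb_needed:.0f} GB of external sequencing per year")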

Summary

For small volume sequencing (less than one sequencing run per month), a purchase of the Nanopore P2 Solo makes financial sense, as recovery of maintenance costs happens quickly. Due to the high maintenance cost of PacBio Revio, it doesn't appear to make financial sense as a purchase for a small lab in Aotearoa at any scale.

There are some nuances depending on application, but my general recommendation is the following:

  • If you only want to do short-read sequencing (under 1000bp), then don't buy a sequencer; just continue to get sequencing done as a service from an established large-scale sequencing centre.
  • If you want to do long-read sequencing (or a mix of short-read and long-read sequencing), then consider getting a P2 Solo. The ongoing maintenance costs are low enough that it can sit idle for almost the entire year and still be a financially-viable tool for discovery.

[header image: Illumina HiScanSQ and ABI SOLiD 4 sequencers at the Max-Planck-Institut für molekulare Biomedizin, Germany; stitched image from photos taken by David Eccles in 2011]
