Have you ever wondered what makes someone tick?
In this post, I will attempt to explain one of my core evangelistic, philosophical world views. This is not about Christianity, nor is it about bioinformatics, nor painting. Those are just the picture hooks that I'm using to hang things on, so that I can translate my own world view to other things that other people are more likely to understand.
Painting
But why painting?
Because I feel that painting is the quickest way to explain this approach to tackling the world's problems. There's a video that I really like, which I think demonstrates this really well. It's a video of someone speed painting for one hour using a Free/Libre program called Krita:
Speedpainting Timelapse, Krita 2.8 [sped up about 25 times]
I love a lot of things about this video. I'm not going to list them all, but I'll highlight a few that I think are key to the points I want to get across.
1. Everything is created out of something that already works
It could be said that the painting starts with a blank slate, but that's not quite correct because there are things that have been prepared in advance of the painting. If nothing else, the computer program exists as a pre-created environment that many other people have devoted time and effort into improving. The "blank slate" is an already working thing: a featureless image... that happens to be given a quick slate-coloured wash.
2. The working thing is changed in small steps
I love that the painting starts out with broad strokes. These are simple painted lines that I imagine I could create myself. Even if the end product is beyond my own expertise, I can see how it is created by laying small changes over the top of the existing painting.
3. Changes can break things
About 11 seconds into the painting, the artist realises that a darkened blur is the wrong size. They make it smaller, and it breaks the painting, creating something that looks worse overall, but better in the area that the artist is working on. We realise later on that this darkened blur is the main subject of the painting, so it makes sense that the artist cares a lot about getting this bit right.
4. Broken things can be fixed
The broken painting is not a large issue, because the artist understands how to recover from a broken product. They create additional strokes to improve the painting at regions where there are issues, and once it looks okay overall, they get back to improving other areas of the painting.
5. Improvements can always be made
The artist is limited by the one hour they have to make the painting, so there is a fixed end-point for their work. But you might notice that the thumbnail image for this video actually has additions: text and a speech bubble. The general shape of the image seems to me to be there after about 35s, and the artist is happy enough to save a snapshot after about 1m05s (after clipping and smearing the edge). But the artist doesn't stop there; they keep adding until it's good enough.
Interlude - GrinGene Bioinformatics
I (David Eccles) am the owner/operator of a sole-trader business - GrinGene Bioinformatics - and can provide more specific advice about research workflow optimisation, including data analysis and result presentation.
If you'd like to see more posts like this, please let me know in the comments.
If you have a research project that you'd like help with, feel free to contact me for advice. See here for more details:
https://www.gringene.org/services.html
Bioinformatics
But why bioinformatics?
Because I am not a painter; bioinformatics is my working life at the moment. I see bioinformatics as the process, or art, of converting biological research outputs into something that can be better understood by other researchers - research outputs that typically, but not always, involve very large datasets. One of my favourite explanations of bioinformatics is that it’s a bit like surfing: chasing waves of information from an ocean of data, and presenting them in an interesting way before they reach the shore of public knowledge.
I have found myself frequently applying these painting ideas to the coding work that I'm doing as part of bioinformatics projects:
1. Start with something that works
2. Change the code to tell it what I want it to do
3. These changes frequently break the code
4. Fix and debug the code so that it works again
5. If the code still does what I want, or if I've had enough, stop. Otherwise, return to step 2
This approach applies all over the place in the things that I do. I've mentioned it a couple of times on Twitter, for creating a wind turbine, and for creating a microfuge tube earring. Those aren't exactly bioinformatics, but the more physical representation of 3D models makes it easier for me to explain this process of gradually building code that works. But here... I'm going to dig a bit deeper and talk about a small bioinformatics task I've been working on.
1. Start with something that works
If I start with something that is close to what I want to end up with, and it already "works" (whatever that means), then the effort required to create the thing that I actually want is substantially less. I have some code that generates a plot of repetitive information in a DNA sequence. Explaining in detail what the plot represents takes a while, so interested people can have a peek at my presentation on the topic.
In any case, I have an image. This is the way I check to make sure that my code still works, or one of the ways that I check to find out what is broken:
[REPAVER plot of an assembled haplotype from a chimpanzee (Pan troglodytes); sequence was created as part of the Vertebrate Genomes Project, combining sequence information from PacBio, ONT, 10x, Bionano, Dovetail, and Illumina reads]
2. Change the code to tell it what I want it to do
In this case, I didn't want to change the output; I wanted the code to run faster. It was taking over five minutes to process the sequence and generate the above image, and I expected that my code only needed a few little tweaks to fix that problem.
More specifically, I had code that did something like this for each kmer:
1. Start with a hash result of 0
2. Convert the next base in the kmer to a 64-bit hash
3. Shift that base hash by an amount that depends on the base's location within the kmer
4. XOR the shifted base hash with the current result
5. If all the bases are processed, stop. Otherwise, return to step 2
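As a rough illustration of the per-kmer steps above, here is a minimal Python sketch. It is not the actual REPAVER code: the per-base constants in `BASE_HASH` are arbitrary values picked for the example, the names `rotl64` and `slow_hash` are made up, and I'm assuming that the "shift" wraps around within 64 bits (a rotation), so that bases near the start of a long kmer aren't pushed off the end.

```python
MASK64 = (1 << 64) - 1

# Arbitrary 64-bit constants, one per base (illustrative values only).
BASE_HASH = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xC2B2AE3D27D4EB4F,
    "G": 0x165667B19E3779F9,
    "T": 0xD6E8FEB86659FD93,
}


def rotl64(value, amount):
    """Rotate a 64-bit value left by `amount` bits (assumed wrap-around shift)."""
    amount %= 64
    return ((value << amount) | (value >> (64 - amount))) & MASK64


def slow_hash(kmer):
    """Hash one kmer by visiting every base in it."""
    result = 0                                # start with a hash result of 0
    k = len(kmer)
    for i, base in enumerate(kmer):
        base_hash = BASE_HASH[base]           # convert the base to a 64-bit hash
        shifted = rotl64(base_hash, k - 1 - i)  # shift depends on the location
        result ^= shifted                     # XOR into the current result
    return result


if __name__ == "__main__":
    print(f"{slow_hash('ACGTACGT'):#018x}")
```

The cost is the thing to notice: every kmer needs one pass over all of its bases, so hashing every kmer in a sequence does roughly (kmer length) × (number of kmers) base operations.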
And I wanted it to do something like this for all kmers:
1. Generate a hash for the first kmer
2. Start with the hash result of the previous kmer, shifted one position
3. Remove the value of the base that is no longer seen
4. Add in the value of the new base
5. If all the kmers are processed, stop. Otherwise, return to step 2
[see an explanation of the algorithm here]
This was changing an operation that worked on all bases within a kmer into an operation that only works on two bases per step: the base that has just dropped off the start of the kmer, and the base that has just been added to the end. When the algorithm is running on hundreds of millions of locations within a chromosome, and the kmer size is moderately large (125bp in my test case), small changes like that can make a big difference in the run time of a program.
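Here is a hedged Python sketch of that rolling update, to show why only two bases are needed per step. Again, this is an illustration rather than the REPAVER implementation: the constants in `BASE_HASH` are arbitrary, the function names are made up for the example, and the "shift" is assumed to be a 64-bit rotation so that the outgoing base can be cancelled exactly with an XOR.

```python
MASK64 = (1 << 64) - 1

# Arbitrary 64-bit constants, one per base (illustrative values only).
BASE_HASH = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xC2B2AE3D27D4EB4F,
    "G": 0x165667B19E3779F9,
    "T": 0xD6E8FEB86659FD93,
}


def rotl64(value, amount):
    """Rotate a 64-bit value left by `amount` bits."""
    amount %= 64
    return ((value << amount) | (value >> (64 - amount))) & MASK64


def slow_hash(kmer):
    """Reference hash that visits every base in the kmer."""
    result = 0
    k = len(kmer)
    for i, base in enumerate(kmer):
        result ^= rotl64(BASE_HASH[base], k - 1 - i)
    return result


def rolling_hashes(sequence, k):
    """Yield the hash of every kmer in `sequence`, updating incrementally."""
    if k <= 0 or len(sequence) < k:
        return
    current = slow_hash(sequence[:k])      # full hash for the first kmer only
    yield current
    for i in range(k, len(sequence)):
        outgoing = sequence[i - k]         # base that is no longer seen
        incoming = sequence[i]             # new base at the end of the kmer
        current = rotl64(current, 1)       # previous hash, shifted one position
        current ^= rotl64(BASE_HASH[outgoing], k)  # remove the outgoing base
        current ^= BASE_HASH[incoming]     # add in the new base
        yield current


if __name__ == "__main__":
    seq = "ACGTTGCAAGCT" * 30
    fast = list(rolling_hashes(seq, 125))
    slow = [slow_hash(seq[i:i + 125]) for i in range(len(seq) - 125 + 1)]
    print("hashes agree:", fast == slow)
```

With this arrangement, each step after the first kmer costs a fixed handful of operations regardless of the kmer size, which is where the expected saving comes from at 125bp.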
3. These changes frequently break the code
... and that's okay.
[git diff of the initial implementation of fast hashing, including additional debug output to show when the fast hash doesn't match the slow hash]
In this case, I encountered situations where my attempts at creating a fast hash led to broken code (in other words, code that didn't produce the correct output). The code above represents the state after I made a few tweaks and is technically correct (i.e. it produces the correct output), but it takes even longer than the initial implementation because it compares the fast hash process with the slow hash process, and uses the slow value if they differ. This was not ideal.
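The "check the fast value against the slow value, and fall back if they differ" idea can be sketched in Python like this. It's a generic illustration rather than the code from the diff: `checked_hashes`, `toy_slow` and `toy_fast_buggy` are hypothetical names, and the toy hash is a plain sum of character codes with a deliberately planted bug in its fast version, so that the debug output has something to report.

```python
import sys


def checked_hashes(sequence, k, fast_hashes, slow_hash, log=sys.stderr):
    """Return (hashes, mismatch_count), using the slow value on any mismatch."""
    results = []
    mismatches = 0
    for pos, fast in enumerate(fast_hashes(sequence, k)):
        slow = slow_hash(sequence[pos:pos + k])   # recompute the trusted value
        if fast != slow:
            mismatches += 1
            print(f"mismatch at {pos}: fast={fast} slow={slow}", file=log)
            fast = slow                           # fall back to the slow value
        results.append(fast)
    return results, mismatches


def toy_slow(kmer):
    """Toy reference hash: sum of the character codes in the kmer."""
    return sum(ord(base) for base in kmer)


def toy_fast_buggy(sequence, k):
    """Toy rolling sum with a planted bug: the outgoing base is never removed."""
    total = sum(ord(base) for base in sequence[:k])
    yield total
    for i in range(k, len(sequence)):
        total += ord(sequence[i])   # bug: should also subtract ord(sequence[i - k])
        yield total


if __name__ == "__main__":
    hashes, bad = checked_hashes("ACGTAC", 3, toy_fast_buggy, toy_slow)
    print(f"{bad} of {len(hashes)} fast hashes needed the slow fallback")
```

The output is always correct because the slow value wins every disagreement, but the slow hash is still computed for every kmer, so a harness like this can only ever be slower than the original code. It's scaffolding for finding bugs, to be removed once the mismatch count reaches zero.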
4. Fix and debug the code so that it works again
The point of this step is to return the code (or the thing) to a state that means it is once again usable. Ideally, a state that is better than the original state, but not necessarily the same as the goal. In my case, that meant working through the bugs enough that the initial forward repeat hashing was complete, and sufficiently fast (i.e. taking under a minute to complete), but the remaining code was still slow:
[Command line output for running REPAVER on the Pan troglodytes chromosome 1. The output indicates that the forward repeat finding process took 59.917 seconds.]
5. If the code still does what I want, or if I've had enough, stop. Otherwise, return to step 2
This is an iterative process. In the process of fixing things, I often encounter new bugs. I might discover that the speedup is not actually as large as I had expected, so I need to hunt around for other solutions, like a fast hashmap library which does the even lower-level stuff a bit quicker. Eventually, after many iterations, I got to a stop point; I'd had enough.
[Console output from REPAVER showing forward repeat hashing taking 32.867 seconds, and other repeat hashing taking 59.771 seconds.]
There are more things that I can do with this code, but I'm okay with its current state. I successfully reduced the processing time from "It'll be done when I get back from my break" to "It'll be done after I check a couple of emails".
Summary
This "painting" approach of iterative development, accepting temporary failure as a necessary part of the process of improvement, extends into many different areas of my life. As long as I can keep little pockets of success along the path to my goals, the setbacks along the way can be weathered.
[header image: final panel of xkcd.com/349, a common experience in which attempts to improve or change something can get you into even worse trouble, and where just getting back to the state at which you started becomes an arduous or even impossible task.]