Tuesday, August 15, 2017

DNA vs. the Machine

Last week's news contained a story sure to raise eyebrows.  A group of computer security researchers from the University of Washington claimed to have demonstrated that they could hijack a computer by sequencing a carefully constructed DNA fragment.  Visions of NextSeqs rampaging through the streets immediately sprang to mind.  The paper is interesting and has some useful warnings for the bioinformatics community, but the news coverage has certainly been strong on hype and alarmism.
In my prior dispatch lamenting my shallow mathematical knowledge and reminiscing about how I got there, I mentioned my childhood skimming of Gödel, Escher, Bach.  While much of the book plays with abstract math, philosophy and music, there are a few Achilles-Tortoise dialogues which are very explicitly about biology and which impressed themselves on my young brain.  One draws a charming analogy between gene finding and the reader's lament that a physical book hints at its conclusion by its very thickness.  Another is an attempt by one of the friends to obtain a phonograph which can perfectly play any record in existence.  The other friend (I forget who took which role) puckishly sees this as a challenge, and makes a recording whose playing destroys the record player.  Undeterred, the audiophile builds a pre-scanner into the phonograph to detect pathogenic records, which is countered with a scanner-destroying tune placed upstream of the killer sound.  And so on -- an amusing introduction to host-pathogen arms races.

GEB was published in 1979, when computers were essentially unheard of in ordinary households (one of multiple measures by which ours wasn't ordinary) and research computing was still in the innocent days before the Morris worm.  It was also before automated DNA sequencing.  But substitute a sequencer for the phonograph and a DNA sample for the record, and that is essentially the scenario the Washington team claims to have demonstrated.

Okay, a bit more preamble.  I'm not a computer security researcher, so while I'm making a game attempt to apply my 30+ years of computer experience to this paper, I may misinterpret.  I'll also be looking askance at some of their molecular technology descriptions and claims; I'll try to do so with respect and humility.  There's also an ethical issue in discussing any dangerous technology: when does pointing out mistakes in such a paper enable more dangerous attempts?  I will point out a few cases in which the authors overestimate some difficulties, but only using well-known tricks which are in the open literature.  And in any case, as you'll see, I don't see their attack as remotely realistic to worry about; it is difficult to imagine an ethical issue with suggesting tweaks on an ineffective technique.

The general idea pursued in the paper is this: can a DNA sequence encode actual computer code, and can a computer be tricked into running that code?  Underlying the attack is a key mechanism: the buffer overflow.  In most modern languages, a string is an object that knows its allotted size, and attempts to store larger amounts of information typically result in the string object being resized for the new information.  But in C and C++, a string is stored as data in a run of memory (a buffer), with responsibility for staying in bounds lying with the programmer.  So if a buffer is allotted 100 bytes and the program goes ahead and tries to write 120 bytes, those last 20 bytes will overwrite whatever was after the buffer.  If that overwrites code that should have been executed, then the overflow is executed instead.  Now if those 20 bytes are just innocent data, the program will probably crash.  But insidiously crafted code can redirect program flow, hijacking the computer.
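
To make that concrete, here is a minimal sketch in C of the unsafe pattern (hypothetical code of my own, not taken from the paper or any audited tool): a fixed-size buffer plus an unchecked copy of untrusted input, followed by a bounded version that avoids the overflow.

    #include <stdio.h>
    #include <string.h>

    /* Unsafe: seq comes from untrusted input, e.g. a FASTQ record. */
    void parse_read(const char *seq)
    {
        char buffer[100];        /* fixed allocation of 100 bytes        */
        strcpy(buffer, seq);     /* no bounds check: a 120-byte sequence */
                                 /* writes 20 bytes past the buffer,     */
                                 /* clobbering whatever lies beyond it   */
        printf("read: %s\n", buffer);
    }

    /* Safe: the copy is bounded by the buffer's actual size. */
    void parse_read_safely(const char *seq)
    {
        char buffer[100];
        strncpy(buffer, seq, sizeof(buffer) - 1);
        buffer[sizeof(buffer) - 1] = '\0';  /* strncpy may not terminate */
        printf("read: %s\n", buffer);
    }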

A typical trick is for the overflow portion to redirect control flow into the preceding area, where a long and devious payload can reside.  To maximize the probability of the key payload instructions being executed, they are preceded by a "NOP sled" (aka "NOP slide"), a long stretch of No OPeration commands -- machine instructions that do nothing.  The genetic code lacks a NOP, but most machine codes have one.  To evade detection, it is also possible to use sets of commands which are effectively NOPs -- adding and then subtracting constants from a register or such.  Entry of program flow at any spot on the NOP sled results in the commands after it being executed.  So imagine my 20-byte overflow above was 16 NOPs followed by 4 key instructions to redirect flow (I haven't programmed in machine code in three decades; the example probably has flaws) -- entry of program flow at any of the NOPs or that first instruction will result in the hijack working.  Similarly, a NOP sled can also be used upstream of the real payload so that small errors in the jump instruction (the distance to jump) still result in the payload executing.  That's a point the authors hint at but don't make very clear: since real DNA sequencing has errors, it is valuable for the jump distance required for success to be fuzzy.
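
As a purely schematic illustration (my own toy bytes, not the paper's actual payload): on x86 the NOP opcode is 0x90, so the 20-byte overflow above might be laid out like the array below, where program flow landing anywhere in the first 16 bytes slides forward into the final 4 redirect bytes.

    /* Toy NOP-sled layout for the hypothetical 20-byte overflow above.
       0x90 is the x86 NOP opcode; the last four bytes are placeholders
       standing in for a short control-flow redirect, not real code. */
    unsigned char overflow[20] = {
        0x90, 0x90, 0x90, 0x90,   /* 16 NOPs: execution entering     */
        0x90, 0x90, 0x90, 0x90,   /* anywhere in this stretch simply */
        0x90, 0x90, 0x90, 0x90,   /* slides forward, one NOP at a    */
        0x90, 0x90, 0x90, 0x90,   /* time, until it reaches...       */
        0xde, 0xad, 0xbe, 0xef    /* ...these 4 redirect bytes       */
    };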

Okay, so that's the plan.  Now how to execute it.  What the authors show is that a large number of open source bioinformatics programs use potentially risky standard C library string functions which can result in buffer overflows.  In one case, they even found a program that should be defending against such overflows but fails: the program first checks for excessively long data by comparing against one constant and then allocates buffers using a second constant.  Unfortunately, the first constant is 10X bigger than the second one!
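
A sketch of how such a mismatch plays out (the constant names and values are my own invention, keeping only the 10X ratio; the paper doesn't give the exact code): input that passes the length check can still be far larger than the buffer it is copied into.

    #include <string.h>

    #define MAX_READ_LEN 1000   /* hypothetical: constant used in the check */
    #define READ_BUF_LEN  100   /* hypothetical: constant sizing the buffer */

    int store_read(const char *seq)
    {
        char buf[READ_BUF_LEN];
        if (strlen(seq) >= MAX_READ_LEN)   /* checks against the wrong bound */
            return -1;
        strcpy(buf, seq);   /* a "validated" 100-999 byte read still overflows */
        return 0;
    }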

This does point to why such programming practices need to change, far more than worrying about time bombs synthesized into DNA fragments.  On the Illumina platforms sequences come out at very predictable read lengths, so fixed-size sequence buffers are plausible.  But note that those fixed lengths change over time, so by carelessly hard-coding them one risks building a future failure into the code.  No other major research platform has a truly fixed read length, and now that Oxford Nanopore reads are shooting for 1 megabase in size, it really isn't tenable to set a fixed buffer size.  Worse, real ONT datafiles are likely to have bogus reads even longer than that.  Furthermore, if your tool has fixed sequence sizes, it won't work with merged reads or assemblies, and even if you think your tool is only applicable to raw data, it never hurts to prepare for the case in which someone finds the tool more generally useful.  Plus, it is silly not to check, given that no performance hit will be entailed; reading data is generally limited by I/O speed anyway.
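
For what it's worth, size-agnostic input handling need not be painful; here is a minimal sketch using POSIX getline() (my own example, not code from any of the audited tools), which grows its buffer to fit whatever arrives.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    int main(void)
    {
        char   *line = NULL;   /* getline() allocates and resizes this */
        size_t  cap  = 0;
        ssize_t len;

        /* Each line may be any length: a 100-base Illumina read or a
           megabase nanopore read both fit, with no fixed buffer. */
        while ((len = getline(&line, &cap, stdin)) != -1) {
            /* ...process the record line here... */
        }
        free(line);
        return 0;
    }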

For the exploit, the University of Washington group designed a payload of 170 bases -- longer than any standard Illumina read.  So they had the group running the sequencer use a non-standard read length.  This is hardly the first time anyone has done that; standard Illumina paired-end kits specify "300 cycles" or "200 cycles", not 2x150 or 2x100.  Obviously they could also have run a MiSeq in the 2x250 or 2x300 configurations and covered the entire exploit, but they had a NextSeq, so they made do.

Furthermore, they engineered a vulnerability into a specific FASTQ compressor.  Let that sink in.  This is very much a proof-of-concept paper: in the most ideal situation possible, can something happen at all?  Indeed, not only that, but they turned off a number of security features in the operating system.  Modern operating systems have a number of features to enforce a degree of separation between code and data, since executing data is the route to so many security hacks.  So the paper is akin to saying I can rob a bank if the staff are sent home, the security systems are turned off and the vault door is left ajar.

Since the Washington researchers wanted to start with DNA, they needed to design their exploit.  Here they bumped up against both imagined and real issues with molecular synthesis.  For an example of an imagined problem, they posit that any exploit must have ends to which PCR primers can be designed; there are a number of ways to have primers that are not part of the final product, such as degradable primers (RNA- or dU-containing) or Type IIS restriction enzyme release.  On the other hand, they did butt up against real problems caused by the repetitive nature of the NOP sled (apparently a run of GCAA repeats) and extremes in G+C content caused by their mapping of nucleotides to bits.  Apparently they also had a 13-base homopolymer, which is a challenging synthesis in any case.
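
A two-bits-per-base encoding makes it obvious why machine code yields such awkward molecules.  The sketch below assumes the simplest mapping, A=00, C=01, G=10, T=11 -- my assumption, though it is at least consistent with a GCAA sled, since the x86 NOP opcode 0x90 encodes as GCAA under it -- and shows how byte runs common in machine code become homopolymers.

    #include <stdio.h>

    /* Hypothetical 2-bits-per-base table: A=00, C=01, G=10, T=11.
       (My assumption; the paper's actual mapping may differ.) */
    static const char BASES[4] = {'A', 'C', 'G', 'T'};

    /* Encode one byte as four bases, most significant bits first. */
    void byte_to_dna(unsigned char b, char out[5])
    {
        for (int i = 3; i >= 0; --i) {
            out[i] = BASES[b & 0x3];   /* low two bits -> one base */
            b >>= 2;
        }
        out[4] = '\0';
    }

    int main(void)
    {
        char dna[5];
        byte_to_dna(0x90, dna);        /* x86 NOP opcode              */
        printf("0x90 -> %s\n", dna);   /* prints "0x90 -> GCAA"       */
        byte_to_dna(0x00, dna);        /* a run of zero bytes...      */
        printf("0x00 -> %s\n", dna);   /* ...becomes an A homopolymer */
        return 0;
    }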

They also discuss the randomness of their results; the exploit works only if a sufficiently accurate read in the correct orientation shows up as the first read in a file.  I presume this is because every sequence will cause the buffer overflow in their hand-crippled program, and so the program will tend to crash unless the hijack fires on the first attempt.  So up front, half the hijack attempts will fail due to read orientation alone.  Of course, there are solutions to that -- such as handing the sequencing facility a sequencing-ready amplicon library.

Looking through their data, the Washington researchers found that 37.4% of their reads would have triggered the exploit.  In their actual run, one of the four datafiles (one per lane) correctly triggered the hack.

It is interesting to think about this exploit in the context of other platforms.  Given the homopolymers, it probably wouldn't work well on Ion Torrent -- and tools to process Ion Torrent reads would be very unwise to use fixed buffers, given that platform's inherent read length variation.  PacBio reads could potentially cover longer exploits, but the start of a read isn't easily predictable on that platform -- and again, the error rates get in the way.  Oxford Nanopore reads, as noted before, can be essentially any length, so tools to handle them must dynamically allocate memory.

To convince the bioinformatics community to really worry about this, computer security researchers need to show a less contrived example.  The Washington group has performed a public service by pointing out the widespread use of risky techniques, but those techniques are really more of a concern (in my opinion) for possible failure during ordinary use.  I would, however, be a bit nervous if I were running a cloud bioinformatics service, which, as the authors point out, is typically working on user-supplied data that could be crafted for any purpose.

Getting the threat level right is particularly important given one of their suggestions: screening synthetic DNA orders for possible exploits.  Worse, the authors note that such code can be hidden in numerous ways.  That's a whole level of expense and expertise to throw at a hypothetical problem, and it makes no sense if that problem is so remote as to be essentially impossible.  It would be particularly unpleasant to have a DNA synthesis order held up because one had inadvertently generated something that triggered these filters.

It is also interesting to see a well-known problem in sequencing -- the misassignment of barcodes -- looked at through paranoid eyes.  In large sequencing runs, particularly on the Illumina platform, multiple parties' samples are combined, distinguished by short DNA index sequences.  Because those indexes can be misread, or because the molecular biology of the cluster generation step can cross-breed indexes, some data will come out in the wrong bin.  So the security researchers envision both the problem of sensitive data leaking into another sample, allowing private knowledge to be revealed, and the possibility that this could be used to deliberately corrupt another sample's genotype calls.  Of course, given that misassignment rates are generally in the low single digits as a percentage, it would take a lot of effort to interfere with many genotype calls.  They also raise the possibility of using a deliberately bad library to try to spoil an entire run.

Note that this last is essentially an Illumina-specific problem, driven by the diagnostics and QC early in a run that could be thrown off if a large portion of a sample is identical (the "low complexity problem").  Unfortunately, the authors are under the impression that this is true of "most next-generation sequencers".  But in a really large Illumina run the contribution of any one sample is quite small, so the ability to foul a run with one bad sample would seem quite minimal.  And on other platforms multiplexing is not as common (and certainly not multiplexing samples from different submitters), and such run-level QC is not typical, as each sequence is truly independent of the others.  There are, of course, much easier ways to foul up a pooled run -- put some evil contaminant in your sample.

Overall conclusion?  Read as a warning about software design choices that carry risks of failing now or in the future, the paper is a valuable wake-up call.  The risks cited for cloud analysis services are worth heeding as well.  I'm skeptical that one could corrupt other samples in a pooled run, and such an attempt would be hard to pull off in a targeted fashion, as one rarely knows how a facility will group samples.  But an emerging DNA sequence hijacking the sequencer in the real world?  At this point that remains science fiction.

[ 2017-08-16 update -- fixed my buffer example, which had suffered from a change-in-plan so some of it was talking 110 bytes and some 120 bytes]

1 comment:

tenacitus said...

Buffer overflow attacks were kinda old by 2002. This kinda reminds me of what I did for my MS. As you rightly point out, the bigger issue is that the bad programming practice that went into building the libraries needs to be fixed -- but who has time? I am sure if I read the article I would find things to like about it. But your summary and analysis was helpful and interesting.