Monday, September 14, 2009

Codon Optimization is Not Bunk?


In a previous post I asked "Is Codon Optimization Bunk?", reflecting on a paper which showed that the typical rules for codon optimization appeared not to be highly predictive of the expression of GFP constructs. A paper released in PLoS One sheds new light on this question.

A quick review. To a first approximation, the genetic code reads in three-nucleotide units called codons; there are 64 possible codons. Twenty amino acids plus stop are specified by these codons (again, a first approximation). So either a lot of codons are never used, or at least some codons mean the same thing. In the coding system used by the vast majority of organisms, two amino acids (Met and Trp) are encoded by a single codon whereas all the others have 2, 3, 4 or 6 codons apiece (and stop gets 3). For amino acids with 2, 3 or 4 codons, it is the third position that makes the difference; the three with 6 (Leu, Ser and Arg) each have one block of 4 which follows this pattern plus a pair which differ from each other in the third position. For Leu and Arg, the two blocks are adjacent in the code table, so you can think of the change between the blocks as a change in the first position; Ser is very strange in that its two blocks (TCN and AGT/AGC) are quite unlike each other, differing in both of the first two positions. For amino acids with two codons, the third position is either a purine (A, G) or a pyrimidine (C, T).

For a given amino acid, these codons are not used equally by a given organism; the pattern of codon usage bias is quite distinctive for an organism and its close cousins, and it has wide effects on the genome (and vice versa). For example, for some Streptomyces I have the codon bias pattern pretty much memorized: use G or C in the third position and you'll almost always pick a frequent codon; use A or T and you've picked a rarity. Other organisms skew much the other way; they like A or T in the third position.
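(For the computationally inclined: here's a minimal Python sketch of tallying per-amino-acid codon usage and third-position GC content (GC3) from a coding sequence. It assumes Biopython is installed; the toy CDS is made up.)

```python
from collections import Counter, defaultdict
from Bio.Data import CodonTable

# Standard genetic code: maps each of the 61 sense codons to its amino acid.
TABLE = CodonTable.unambiguous_dna_by_name["Standard"].forward_table

def codon_usage(cds):
    """Per-amino-acid codon frequencies plus GC3 for an in-frame, non-empty CDS.
    Codons unseen in the CDS show up with frequency 0.0."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(codons)
    by_aa = defaultdict(dict)
    for codon, aa in TABLE.items():
        family = sum(counts[c] for c, a in TABLE.items() if a == aa)
        if family:
            by_aa[aa][codon] = counts[codon] / family
    gc3 = sum(c[2] in "GC" for c in codons) / len(codons)
    return dict(by_aa), gc3

usage, gc3 = codon_usage("ATGCTGCTGTTAAAAAAG")  # toy CDS: Met, 3x Leu, 2x Lys
print(usage["L"], gc3)  # Leu family: CTG at 2/3, TTA at 1/3; GC3 = 4/6
```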

Furthermore, within a species the genes can be divided further. In E. coli, for example, there are roughly three classes of genes, each with a distinctive codon usage signature. One class is rich in important proteins which the cell probably needs a lot of, the second seems to hold many proteins which see only a soupçon of expression, and the third is rich in proteins likely to have been recently acquired from other species.

So it was natural to infer that this mattered for protein expression, particularly when expressing a protein from one species in another. Some species seemed to care more than others. E. coli has a reputation for being finicky and is one of the best-studied systems. Not only did shifting the codon usage over to a more E. coli-like scheme seem to help some proteins, but truly rare codons (used less than 5% of the time, though that is an arbitrary threshold) could cause all sorts of trouble.
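A minimal sketch of flagging rare codons against such a threshold; the usage values below are invented placeholders, not real E. coli frequencies:

```python
# Hypothetical within-family codon frequencies (fragment only, for illustration).
USAGE = {"AGG": 0.02, "AGA": 0.04, "CGT": 0.38, "CGC": 0.40}

def rare_codons(cds, usage, threshold=0.05):
    """Return (codon_index, codon) pairs whose frequency falls below threshold."""
    hits = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[i:i + 3]
        if usage.get(codon, 1.0) < threshold:
            hits.append((i // 3, codon))
    return hits

print(rare_codons("ATGAGGCGTAGA", USAGE))  # -> [(1, 'AGG'), (3, 'AGA')]
```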

However, the question remained how to optimize. Given all those interchangeable codons, a synthetic gene could have many permutations. Several major camps emerged, with many variants, particularly amongst the gene synthesis companies. One school of thought said "maximize, maximize, maximize" -- pick the most frequently used codons in the target species. A second school said "context matters" -- and set out to maximize codon-pair usage. A third school said "match the source!", meaning make the codon usage of the new coding sequence in the new species resemble the codon usage of the old coding region in the old species; this hedged against possible requirements for rare codons to ensure proper folding. Yet another school (the one I belonged to) urged "balance", and chose to make the new coding region resemble a "typical" target-species gene by sampling the codons based on their frequencies, throwing out the truly rare ones (sketched below). The logic here was that hammering the same codon -- and thereby the same tRNA -- over and over would make that codon as good as rare.
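A minimal sketch of that sampling approach, with an invented two-amino-acid usage table standing in for a real organism's:

```python
import random

# Hypothetical usage frequencies; a real table would cover all 20 amino acids.
USAGE = {
    "L": {"CTG": 0.50, "CTC": 0.20, "TTG": 0.15, "CTT": 0.10, "CTA": 0.03, "TTA": 0.02},
    "K": {"AAA": 0.75, "AAG": 0.25},
}

def sample_codons(protein, usage, rare_cutoff=0.05, rng=random.Random(0)):
    """Back-translate by sampling codons at their observed frequencies,
    excluding codons below rare_cutoff and renormalizing the remainder."""
    out = []
    for aa in protein:
        freqs = {c: f for c, f in usage[aa].items() if f >= rare_cutoff}
        total = sum(freqs.values())
        codons = list(freqs)
        weights = [freqs[c] / total for c in codons]
        out.append(rng.choices(codons, weights=weights)[0])
    return "".join(out)

print(sample_codons("LKL", USAGE))  # e.g. 'CTGAAACTC'; CTA and TTA never appear
```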

The new work offers crumbs for many of these camps, but only crumbs; it suggests much was wrong with each -- or perhaps that the same thing was wrong with each. The problem is that even with these schemes some proteins just didn't express well, leaving everyone scratching their heads. The GFP work seemed to suggest that the effects of codon usage were unpredictable, if present at all, and that in any case other factors, such as secondary structure near the ribosome binding site, were what counted.

What the new work did is synthesize a modest number (40) of versions of two very different proteins (a single-chain antibody and an enzyme), each version specifying the same protein sequence but with a different set of codons. Within each protein, expression varied over two logs; clearly something matters. Furthermore, they divided some of the best and worst expressors into thirds and made chimaeras: head of good and tail of bad (and vice versa). Some chimaeras had expression resembling that of the head-end parent, but others seemed to inherit from the tail-end parent. So the GFP-based "secondary structure near the ribosome binding site is what matters" hypothesis did not fare well in these tests.
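For flavor, here's a minimal sketch of such a head/tail swap on a codon boundary; the toy sequences and breakpoint are illustrative, not the paper's actual constructs:

```python
def chimaera(head_donor, tail_donor, n_codons):
    """First n_codons codons from head_donor, the remainder from tail_donor.
    Both inputs must be synonymous encodings of the same protein."""
    assert len(head_donor) == len(tail_donor) and len(head_donor) % 3 == 0
    cut = 3 * n_codons
    return head_donor[:cut] + tail_donor[cut:]

good = "ATGCTGAAA"  # toy "good expressor": Met-Leu-Lys
bad = "ATGTTAAAG"   # synonymous toy "bad expressor"
print(chimaera(good, bad, 2))  # -> 'ATGCTGAAG': good head, bad tail, same protein
```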

After some computational slicing-and-dicing, what they did come up with is that codon usage matters. The twist is that it isn't matching the most frequently used codons (i.e., maximizing the Codon Adaptation Index, CAI) that's important, as shown in the figure at the top, which I'm fair-using. The codons that matter aren't necessarily the most used codons; but cross-reference with data on which codons are most sensitive to starvation conditions and the jackpot lights come on. When you use these as your guide, as shown below, the predictive ability is quite striking. In retrospect, this makes total sense: expressing a single protein at very high levels is probably going to deplete a number of amino acids. Indeed, this was the logic of the sampling approach -- but I don't believe any proponent of that approach ever predicted this.
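The paper's actual model and parameters aren't reproduced here, but in spirit it amounts to scoring a coding sequence against per-codon weights; a minimal sketch with invented placeholder weights:

```python
# Invented weights: higher = codon fares better under the starvation-derived model.
STARVATION_WEIGHT = {"CTG": 0.9, "TTA": 0.2, "AAA": 0.8, "AAG": 0.4}

def starvation_score(cds, weights):
    """Mean weight over the codons the table covers; higher should predict
    better expression under this (hypothetical) weighting."""
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    scored = [weights[c] for c in codons if c in weights]
    return sum(scored) / len(scored) if scored else 0.0

print(starvation_score("CTGTTAAAA", STARVATION_WEIGHT))  # (0.9 + 0.2 + 0.8) / 3
```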


Furthermore, not only does this work on the training set, but new coding regions were synthesized to test the model, and these new versions had expression levels consistent with its predictions.

What of secondary structure near the ribosome binding site? In some of the single-chain antibody constructs an effect could be seen, but the codon usage effect appears to be dominant. In conversations with the authors (more on this below), they mentioned that GFP's sequence makes it easy to end up encoding secondary structure near the ribosome binding site; this is just an interesting interaction of the genetic code with the amino acid sequence of GFP. Since it is easy in this case to stumble on secondary structure, that effect shows up in that dataset.

This is all very interesting, but it is also practical. On the pure biology side, it suggests that studying starvation is applicable to studying high-level protein expression, which should enable further studies of this important problem. On the protein expression side, it suggests a new approach to optimizing expression of synthetic constructs. One catch, however: this work was run by DNA2.0, and they have filed for patents, at least some of which have issued (e.g. US 7561972 and US 7561973). I mention this only to note that it is so and to give some starting points for further reading; clearly I have neither the expertise nor the responsibility to interpret the legal meaning of patents.

Which brings us to one final note: this paper represents my first embargo! A representative of DNA2.0 contacted me back when my "bunk" post was written to mention that this work was coming, and finally last week the curtain was lifted. Obviously they know how to keep a geek in suspense! They sent me the manuscript and engaged in a teleconference, with the only proviso being that I keep silent until the paper issued. I'm not sure I would have caught this paper otherwise, so I'm glad they alerted me, though clearly both the paper and this post are not bad press for DNA2.0. Good luck to them! Now that I'm on the other side of the fence, I'll buy my synthetic genes from anyone with a good price and a good design rationale.

Mark Welch, Sridhar Govindarajan, Jon E. Ness, Alan Villalobos, Austin Gurney, Jeremy Minshull, Claes Gustafsson (2009). Design Parameters to Control Synthetic Gene Expression in Escherichia coli. PLoS ONE 4(9): e7002. doi:10.1371/journal.pone.0007002

9 comments:

Nash said...

While it is an interesting summary of previous findings, I am not sure what the pragmatic implications of codon usage are for molecular biologists who clone and sequence genes as bread-and-butter work.
There is usually an attempt to optimize codon usage when using a construct for protein expression or transient transfection. However, optimizing for secondary structure is most often not trivial, and detecting which amino acid is being depleted requires combinatorial experiments. In the end, it's not so much about optimizing expression as about getting the protein in hand. One can always simply use more runs of purification, more cells for a blot, or stronger excitation for imaging, and achieve the same results as high expression.
I can understand, of course, that this kind of work might be useful to those who produce protein-based products commercially.

PS: Your link to the PLoS ONE article is non-functional.

Keith Robison said...

Well, it all depends on your tolerance for inconsistent expression, particularly when dealing with a lot of proteins. Two logs is quite a spread, wouldn't you agree? It will be important for producing individual proteins, but I would see an even greater utility when screening lots of proteins for some interesting enzymatic activity, particularly when trying to assemble a whole pathway of activities.

If you are already having the genes synthesized, in general the combinatorial space is so huge you can layer on quite a few constraints or optimizations and not run out of choices, or at a minimum use all those to rank choices.
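To put a rough number on "huge": the count of synonymous encodings is the product of each residue's codon-family size (standard code); a minimal sketch:

```python
# Codon-family sizes in the standard genetic code (Met and Trp have one codon;
# Leu, Ser and Arg have six; the rest have 2, 3 or 4).
FAMILY_SIZE = {
    "M": 1, "W": 1, "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2,
    "N": 2, "Q": 2, "Y": 2, "I": 3, "A": 4, "G": 4, "P": 4, "T": 4,
    "V": 4, "L": 6, "R": 6, "S": 6,
}

def n_encodings(protein):
    """Count the distinct DNA sequences encoding the given protein."""
    n = 1
    for aa in protein:
        n *= FAMILY_SIZE[aa]
    return n

print(n_encodings("MKLVSR"))  # 1*2*6*4*6*6 = 1728, even for this tiny peptide
```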

Optimizing against secondary structure will, as shown by the GFP example, depend on the protein being encoded. However, it isn't hard to optimize this in silico. A high probability of forming secondary structure (due to inverted repeats) is also potentially troublesome for synthesis, so most companies will be knocking it out anyway.
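One crude in silico check is simply to scan for inverted repeats near the start of the message; a minimal sketch, with illustrative stem and loop limits (real RNA folding programs do far better):

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    return s.translate(COMP)[::-1]

def find_hairpins(seq, stem=6, max_loop=8, window=60):
    """Return (i, j) pairs where seq[i:i+stem] can pair with a downstream
    reverse complement within max_loop bases, scanning the first window bases."""
    hits = []
    region = seq[:window]
    for i in range(len(region) - stem):
        arm = revcomp(region[i:i + stem])
        for j in range(i + stem, min(i + stem + max_loop, len(region) - stem) + 1):
            if region[j:j + stem] == arm:
                hits.append((i, j))
    return hits

print(find_hairpins("ATGGGGGGGAAAACCCCCCATGATG"))  # flags the G-run / C-run stem
```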

Curious about the PLoS link -- it works for me. Perhaps it was something transient.

The Welch et al. work suggests a path for predicting which tRNAs are the most critical, but as you point out it still may not be the whole story. I may write some more on that tonight.

Victor Stevko said...

Thanks for the pointer, Keith. The link to the article doesn't work for me either -- it looks like the Blogger software thinks it's an internal link. The URL is in the title property, which may be relevant.

Keith Robison said...

Thanks for the concurrence on having trouble with the link -- strangely, it works for me on two different computers running two different browsers, which underscores how hard these glitches can be to track down.

I've tried another form of the link that doesn't have any escaped characters; with luck that will work.

Guy said...

Try this link

Anonymous said...

Does it mean that if E. coli is grown in minimal media (to mimic starvation conditions), protein expression would be better? Thanks.
sam.

Keith Robison said...

I doubt you'd actually want to use minimal medium -- protein production requires a lot of resources & you don't want the cell to have a hard time getting them.

What the result does suggest is that high level protein production does induce, at least in part, severe stresses for some nutrients. Further supplementation with those nutrients might work, but it may be that the import systems for these can't work any harder.

A very crude idea that now occurs to me is to take one of the poorly expressing coding schemes and use it as a reporter system to evolve better strains -- with targeted mutation aimed at the production & transport of some of the amino acids which seem most sensitive in the analysis.

türkiye ve hayata dair herşey said...

http://postgenomic.com/index.php

Anonymous said...

Given the amount of work that has been put into codon optimization in the past - and the many factors that have been found to influence expression levels - it would greatly surprise me if in the end the "solution" turned out to be this simple; and indeed, I fear that there are some technical issues with this study. In particular: can one really deduce such "hard" generalizations about an enormously vast sequence space by testing a mere 40 genes? The use of partial least squares as a dimension-reduction method (which I think is what the authors did) and verification by means of a test set should reduce the problem of overtraining somewhat, but I would be very cautious in interpreting these results regardless. Possibly even more so because this paper appears to be sponsored by people with a direct financial interest in a spectacular outcome.