Not having had any serious biological training, I have to go to Wikipedia and Google to learn the basics. And I’m often surprised to find that concepts everyone uses don’t have good consensus among scientists. When reading the Wikipedia entry for “gene”, it occurs to me that if the concept didn’t predate the discovery of DNA, it would not exist.

At the very least, it would look very different from “a locatable region of genomic sequence, corresponding to a unit of inheritance” (call this the “standard definition”).

Gerstein’s definition, “a union of genomic sequences encoding a coherent set of potentially overlapping functional products,” while more accurate, is not really a useful definition. It says only that there is structure to the information in a DNA sequence which corresponds to higher-level function in the cell or organism. But we knew that already; it tells us nothing about that structure or how it relates to function.

The standard definition is a stronger claim, but it’s harder to reconcile with the evidence. Simply read the rest of the Wikipedia page to see the contradictions. Locatable regions of sequential base pairs only partially correspond to identifiable function. And as far as I can tell (somebody please explain if I’m wrong here), these locatable regions are not units of inheritance. Inheritance involves a very accurate copying of the entire genomic sequence, and in the case of sexual reproduction a very preservative recombination operation called crossover. The heterogeneity conferred by mutation (and crossover) acts either on single points in the sequence or on sequential regions, but these are mostly random occurrences, not limited to supposed gene boundaries. So if, during inheritance, a gene can be chopped in two, partially deleted, inserted into, and so on, in what sense is a gene a unit of inheritance? You could argue that an individual base pair is a unit of inheritance, and you could argue that the entire genomic sequence (modulo a few mutations and crossover) is a unit of inheritance. But not a gene.
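The boundary argument above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a model of real recombination: the genome is a plain string, crossover is a single cut at a uniformly random position, and the gene intervals (`GENES`), the genome length, and all function names are made up for illustration.

```python
import random

# Hypothetical annotation: half-open (start, end) intervals on a 1000-base genome.
GENOME_LENGTH = 1000
GENES = [(100, 200), (450, 600), (800, 850)]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover at a uniformly random cut (a simplifying assumption)."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:], cut


def cut_splits_gene(cut, genes):
    """True if the cut falls strictly inside a gene, so the gene is inherited in two pieces."""
    return any(start < cut < end for start, end in genes)


def fraction_of_cuts_inside_genes(trials, seed=0):
    """Estimate how often a random crossover cut lands inside an annotated gene."""
    rng = random.Random(seed)
    parent_a = "A" * GENOME_LENGTH
    parent_b = "B" * GENOME_LENGTH
    hits = 0
    for _ in range(trials):
        _child, cut = crossover(parent_a, parent_b, rng)
        hits += cut_splits_gene(cut, GENES)
    return hits / trials


if __name__ == "__main__":
    print(fraction_of_cuts_inside_genes(10_000))
```

With these made-up intervals, 297 of the 999 possible cut positions fall inside a gene, so the estimate should land near 0.3. Nothing in the cut mechanism knows where the annotated boundaries are, which is the sense in which the toy "gene" fails to behave as a unit.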

Then there’s the mystery of “junk” DNA, the portions of the genome which don’t directly code for identifiable products like proteins. In many species (including humans) the non-coding portion of DNA comprises over 98% of the genome. Wayt Gibbs, in a Scientific American article, points out that certain segments code for RNA which doesn’t get turned into protein, but which is actively functional in a number of ways not previously appreciated:

To avoid confusion, says Claes Wahlestedt of the Karolinska Institute in Sweden, “we tend not to talk about ‘genes’ anymore; we just refer to any segment that is transcribed [to RNA] as a ‘transcriptional unit.’”

Still, these RNA-only genes are only likely to double the number of identified functional units in the genome, leaving in humans 96% or more of the genome unaccounted for. Recently, the ENCODE project made a startling revelation:

“The majority of the genome is copied, or transcribed, into RNA, which is the active molecule in our cells, relaying information from the archival DNA to the cellular machinery,” said Tim Hubbard of the Wellcome Trust’s Sanger Institute…. [From COSMOS article, emphasis mine]

The pilot project posits:

Perhaps the genome encodes a network of transcripts, many of which are linked to protein-coding transcripts and to the majority of which we cannot (yet) assign a biological role. Our perspective of transcription and genes may have to evolve and also poses some interesting mechanistic questions. [From Nature article]

In other words, most of the so-called non-coding DNA does code after all (just not directly into proteins), and there is good evidence that these genetic products are not junk, but rather constitute nodes in a multi-level genetic, epi-genetic, proteomic and metabolic network. Sole et al. outline both theoretical and empirical bases for this line of thinking, which not only agrees with the above findings but also gives a plausible explanation for one of the biggest open questions about the gene model, namely robustness. More generally, Sole et al. represent a shift in thinking towards systems biology which is long overdue.
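The robustness point can be made concrete with a toy example. This is only an illustrative sketch of redundancy in a network, not the model Sole et al. propose; the node names and the `toy_network` wiring are invented for illustration.

```python
from collections import deque


def reachable(graph, source, target, removed=frozenset()):
    """Breadth-first search: can `target` still be reached from `source`
    once the nodes in `removed` have been deleted from the network?"""
    if source in removed or target in removed:
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbour in graph.get(node, ()):
            if neighbour not in removed and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return False


# Hypothetical wiring: two interchangeable elements both lead to the same product.
toy_network = {
    "signal": ["element_a", "element_b"],
    "element_a": ["product"],
    "element_b": ["product"],
    "product": [],
}

if __name__ == "__main__":
    for removed in [set(), {"element_a"}, {"element_a", "element_b"}]:
        print(sorted(removed), reachable(toy_network, "signal", "product", removed))
```

In this wiring, deleting either intermediate element alone leaves the product reachable; only deleting both cuts it off. An element can therefore matter to the network as a whole while its individual deletion shows no visible effect.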

With all this in mind, we turn to a real genetic headscratcher about recent experiments in which ultraconserved (and thus presumed critical) portions of mouse DNA were deleted but to no apparent effect. While the network theory could in principle solve the mystery, it’s worth going through the comments on the blog because (a) plausibility doesn’t imply veracity, and (b) there could easily be more than one relevant cause/dynamic. I’ve summarized what I feel are the main arguments and referenced the corresponding comments by number:



  • The deleted genes serve functions that are latent (21, 26, 32, 35, 52, 55, 63, 78, 82)

  • Gene-level information interacts with other types of information in a complex, indirect way (71, 86, 87) [ precursor to the network theory ]

  • Genes are selfish and look out for their own preservation; the deletable sections could have been introduced by random mutation, viral or recombinatory injection (33, 59, 62, 72, 74, 76)

  • They are useful in recombination, but not alone (4, 75, 76, 77)

  • They play an important role in the physical structure of the chromosome (4, 54, 71)

  • Redundancy is achieved by a form of “checksum”, possibly probabilistic, or other cryptographic mechanism (6, 12, 76)

  • Certain genes if mutated properly could be harmful, but if deleted entirely have little effect (59, 76, 84)

  • The ultraconserved segments piggyback on the critical genes they are next to (29, 33, 54)

  • Some of the genome acts as a latent heterogeneity reserve which evolution thrives off of (53)

  • They protect critical genes by reducing the chance such genes will be mutated (75, 76)

  • The ultrapreservation has to do with negative selection (37) [ I have to admit to not understanding this argument (or whether it's a counterargument for negative selection), perhaps someone can explain it ]

Notwithstanding the ultraconserved parts, it is worth pointing out that there doesn’t necessarily need to be any function for non-protein-coding DNA; it could just be junk after all. Alternatively, we are often just looking at an evolutionary snapshot (as suggested by some of the comments above) and it’s hard to say what is functional without looking at the longer context and looking at the evolutionary system as a whole. After all, nothing in biology makes sense except in the light of evolution.

So, why does all this matter, and why am I picking on the gene model even though we all know that it has its flaws? For one, because we don’t all know that it is so terribly flawed. I certainly didn’t until I looked into it. But more importantly, even if we admit at some level that a “gene” is a quaint concept (accurate to describe only a small portion of the genome), by continuing to use the term we (a) propagate misunderstanding to the vast majority of the population, and (b) continually reinforce flawed thinking and logical fallacies in our own minds that block better understanding and insidiously undermine fruitful new ways of thinking about the problems. Ultimately we keep having to narrow the gene definition, add caveats to apologize for its poor explanatory power, and come up with post hoc explanations for why empirical results don’t fit the model.

Perhaps it is time to stop using the term “gene” entirely and come up with a lexicon for the elements and processes of the genome which incorporates and integrates models for the informational content beyond that of protein coding, including chromosomal structure, epi-genetic information, and biomolecular and cellular networks.

[Original article appeared here; followup is Beyond The Gene]