A science and medicine blog
Structure over sequence: new ways of assessing the evolution and biological role of non-coding RNAs and junk DNA

In his book “The Selfish Gene”, Richard Dawkins argues that if we accept the idea that natural selection acts upon the genes and not upon the individual, we gain an explanatory framework that accounts for the paradox of altruistic behavior: that an individual would willingly sacrifice themselves to save their kin.

Here, I will attempt to illustrate a solution to another paradox in biology: the conservation of genomic and non-genomic structures that are independent of the underlying sequence. This paradox can be explained by taking the idea of gene selection one tiny reductionist step further. In order to do this, we will have to merge the idea of gene selection with “it from bit” philosophy.

AN: Readers of the aforementioned works by Richard Dawkins are probably familiar with the fact that biology has continuously reduced the scale at which natural selection is thought to act. Initially it was thought that natural selection acted upon “species”, and then later upon populations. Later still, it was concluded that natural selection acts upon individuals and then lastly, circa Dawkins, upon the genes. Information itself is the next logical step.

The basic idea of “it from bit” philosophy relevant to our discussion is that one can view the things we commonly study in biology, such as hormones, action potentials, and genes, as the more abstract concept of information. Stripped of the excess details of individual systems such as molecules, cells, organisms, and ecosystems, all of biology can be viewed as merely the exchange and replication of information. Organisms, cells, proteins, RNA, and DNA are just the physical substrates of abstract information that is continuously replicating and interacting with other information. The replication of and interaction between different sets of information increases the entropy of the universe. Therefore, these processes are continuously being driven and optimized by the 2nd law of thermodynamics and natural selection.

Information selection or “bit selection” includes structures whose underlying DNA sequence is not conserved, such as functional RNAs, 3D chromatin conformations, prions, and any other biomolecule that, by its very nature, contains information that is conserved and transmitted, aiding in its own differential reproductive success.

AN: Those of you who have already read the aforementioned works by Richard Dawkins are probably aware that he leaves the definition of “genes” a pragmatic one: ambiguous and contextual. As such, we are really just extending the definition of “genes” to a broader one that includes anything that is conserved and aids in its own replication.

Proteins, non-protein coding RNAs, and “Junk” DNA

Proteins are incredible molecules that carry out an enormous range of biological functions. It is easy for a scientist to purify, isolate, and study proteins. Because of their abundance and unique chemistry, we have been able to study them and understand their role better than any other molecule in biology. Even decades after their discovery, we continue to overemphasize the role of proteins in how we think about living things. Linus Pauling, for example, once hypothesized, quite reasonably, that the genetic material of the cell would be a protein. This single statement led to delays in the discovery of DNA as the genetic material of the cell, as others were hesitant to disagree with a prominent two-time Nobel laureate. One cannot be too harsh on Pauling for making this statement, as proteins carry out a lot of functions and it was a reasonable bet to make. However, the evidence ran to the contrary.

Even today, we still carry with us this “protein-centric” view of biology, which has led us to miss out on other major discoveries and to place unnecessary conceptual roadblocks in the way of understanding living things.

One of the major conceptual hurdles that biologists encounter when investigating the role of non-coding RNA or “Junk” DNA is the tendency to apply the same framework of specialization and conservation that we apply to proteins and protein-coding DNA, despite the clear differences in their underlying chemistry.

A critical question that needs to be asked when thinking about evolution is: what substrates are actually being conserved vs what is merely a byproduct of that conservation?

Breaking the Central Dogma

Proteins are the largest and most amplified substrates of the information in our genome. As the final end point of the central dogma, these incredible molecules carry out an enormous range of biological functions. As such, their monomers (the 20+ proteinogenic amino acids) have great variation in their underlying chemistry (or, to phrase this as “it from bit”, a large difference in their information).

Amino acids contain R groups that may be hydrophobic, aromatic, acidic, basic, and so on. As such, there is greater variation among the final 3D structures and functions of proteins than among those of any other biomolecule.

There is very strong selective pressure not to alter the monomer units (amino acids) that are critical to a protein’s structure and functionality. Proteins are arguably as good as they are going to get in terms of evolution. To give one example of many, the incredibly well conserved catalytic triad of protease enzymes would lose its ability to do its job if it were to suffer an amino acid substitution changing its aspartate to isoleucine. This is because of the significant differences in the two amino acids’ chemical properties: the former contains a carboxylate in its R group and the latter a chiral center with two alkyl groups branching off.


However, mutations can, and often do, arise that induce changes in a protein’s amino acid sequence that are of no consequence. A base substitution that changes leucine to isoleucine probably won’t cause significant changes to the overall 3D structure or function.

This is the key point: leucine and isoleucine have very similar chemical properties. Thus it is not necessarily the amino acid sequence that is conserved. Sequence conservation is merely a byproduct of the conservation of the final 3D structure and the associated information that it contains. The shape and the ability to do its job are really the only things that matter.
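To make the leucine/isoleucine versus aspartate/isoleucine contrast concrete, here is a minimal Python sketch. The class table is a coarse textbook grouping that I am assuming purely for illustration, and the names `SIDE_CHAIN_CLASS` and `is_conservative` are hypothetical. It is not a structure predictor; it only asks whether two residues share broad side-chain chemistry.

```python
# Coarse side-chain classes for a handful of amino acids.
# Assumption: a simplified textbook grouping, used only for illustration.
SIDE_CHAIN_CLASS = {
    "Asp": "acidic",
    "Glu": "acidic",
    "Lys": "basic",
    "Arg": "basic",
    "Leu": "hydrophobic",
    "Ile": "hydrophobic",
    "Val": "hydrophobic",
    "Ser": "polar",
    "Thr": "polar",
}


def is_conservative(original: str, replacement: str) -> bool:
    """Return True when both residues fall in the same coarse class."""
    return SIDE_CHAIN_CLASS[original] == SIDE_CHAIN_CLASS[replacement]


# Leucine -> isoleucine stays within the hydrophobic class.
print(is_conservative("Leu", "Ile"))  # True
# Aspartate -> isoleucine swaps an acidic side chain for a hydrophobic one.
print(is_conservative("Asp", "Ile"))  # False
```

A real analysis would use a substitution matrix and the structural context of the residue, but even this crude check captures why the first change is usually tolerated and the second is not.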

Prions: replication and propagation without genes

Prions, such as the ones responsible for mad cow disease, which plagued the 1990s, and which may be implicated in the emerging Alzheimer’s epidemic, are another example of a situation where the final 3D structure and the associated information that it contains is what is being selected for, NOT the genes/DNA. Prions follow the same rules of Darwinian natural selection that viruses (another non-living “microbe”) do. Prions are capable of self-replication and propagation and are constantly changing in ways that increase their differential reproductive success. What makes them so remarkable is that they are capable of replication and propagation without the use of DNA or any “genetic material”. Thus, it is not the genes that natural selection acts upon in the evolution of prions. It is only the final 3D information itself.

Correlation vs Causation

Spurious correlations are always a comical and entertaining way of showing how easy it is to mistake a correlation for a cause and effect relationship. When the two events that are linked together are something like “Films Nicolas Cage Appears in” and “Drowning by falling into a pool”, one can clearly see that the observations are just flukes of data, merely because the idea of one event causing the other is comically absurd. The correlation between teacher salary and alcohol sales, though, might be more causative than the authors realized.

Even under apparent “causative” relationships, the mechanism and details can be difficult to determine. When a change in amino acid sequence coincides with a loss of functionality, we have to consider ALL effects of the DNA/RNA base changes and rule out ALL alternative causal models to fully determine cause and effect.

What may be perceived as a minor change in amino acid sequence that results in a major loss of protein functionality may actually be due to changes in the conditions that dictate how alternative RNA splicing is carried out. Focusing solely on amino acid substitutions may also lead us to overlook other changes in the RNA’s untranslated regions (UTRs), which are critical for regulation.

Both scenarios, changes in RNA splicing and changes in UTR regulation, may lead one to incorrectly conclude that the causative agent of the loss of protein functionality is the amino acid sequence change. Unless there is a significant change in the final 3D structure and the proteins fail a functionality test in vitro, we should always err on the side of caution before claiming a causative relationship. Thus the only thing that really matters is how the information in the final 3D structure manifests itself and whether this information contains all the necessary components to carry out a biological function.

Understanding the evolution of Non-Coding RNAs

Compared with the amino acids of proteins, the RNA monomers show significantly less variation:


All RNA monomers (a whopping total of 4: A, U, C, and G) have a striking degree of conformity: their bases are all cyclic, planar, polar, and conjugated, and their only real interactions seem to be non-covalent intermolecular forces. Proteins, by contrast, can form covalent disulfide bridges and electrostatic interactions among their R groups in addition to these intermolecular interactions.

Remember: a change in the amino acid sequence from leucine to isoleucine is probably not going to have a significant effect on the final 3D shape of the protein, because these two monomers have incredibly similar chemical properties.

Thus, when we look for evolutionary conservation and biological roles among functional DNAs and RNAs, we need to consider that alterations in monomer sequence probably won’t have significant effects on their final 3D structure, because there is less chemical variation among the monomers of nucleic acids (all 4 of them) than there is among the monomers of proteins (all 20+ of them).

In some circumstances the sequence itself is the primary functional information that is conserved. One example is messenger RNA, which must serve as an information carrier from the nucleus to the cytoplasm. Another example is ribosomal RNA, which requires sequence-specific binding in order to interact with messenger RNA. Just like the hyperconserved catalytic triad of proteases, the sequence of rRNA is conserved where it is necessary for structure and function, i.e. catalytic activity and a relatively long biological half-life.

But even then, there is variation between prokaryotic and eukaryotic rRNA. When we consider that peptide bonds are all the same regardless of whether they are made by prokaryotic or eukaryotic cells, this falls in line with the central thesis: the structure is what is conserved, NOT the sequence. In fact, if it were not for bacteria constantly trying to poison each other with their own antibiotics, there would not have been any selective pressure for the deviation. I mention this because, unlike rRNA, many functional RNAs work in regulating gene expression and, to the fullest extent of my knowledge, their selection was favored by making gene regulation more precise, and thus in the direction of optimizing their ability to perform a function, NOT by avoiding destruction by antibiotic compounds.

When we examine “functional” RNA that serves purposes other than carrying information from the nucleus to the cytoplasm (what one might call the more “3D” RNA molecules, such as RNA enzymes and regulatory RNAs), we need to seriously consider the fact that these molecules are not proteins. As such, the conservation of sequence does not matter as much, since variation among the monomers is significantly reduced.

Functional RNAs that fold into regulatory 3D molecules, or even single-stranded coils, are not compelled to maintain the same level of sequence conservation as proteins, simply because they do not have as much chemical variation among their monomers. As a consequence, their ability to perform their function is not severely compromised by a simple base substitution.

In the end, it is the information of the STRUCTURE itself that is conserved, NOT the sequence.
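One way to picture structure being conserved while sequence drifts is a hairpin whose stem bases have all changed, yet every paired position can still form a base pair. The sketch below is a minimal illustration under my own assumptions: the two sequences are made up for the example, the function names are hypothetical, and the check only tests whether a sequence is compatible with a given dot-bracket structure (Watson-Crick and G-U wobble pairs). It does not predict folding or stability.

```python
# Base pairs allowed in RNA helices: Watson-Crick plus the G-U wobble.
ALLOWED_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def paired_positions(structure):
    """Convert dot-bracket notation into a list of (i, j) index pairs."""
    stack, pairs = [], []
    for i, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            pairs.append((stack.pop(), i))
    return pairs


def can_adopt(sequence, structure):
    """True if every paired position in the structure holds an allowed pair."""
    if len(sequence) != len(structure):
        return False
    return all(
        (sequence[i], sequence[j]) in ALLOWED_PAIRS
        for i, j in paired_positions(structure)
    )


def identity(seq_a, seq_b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


HAIRPIN = "((((....))))"      # 4-base-pair stem, 4-base loop
seq_a = "GGCAGAAAUGCC"        # made-up example sequence
seq_b = "CAUGGAAACAUG"        # every stem base differs from seq_a
seq_c = "AGCAGAAAUGCC"        # seq_a with one uncompensated stem change

print(can_adopt(seq_a, HAIRPIN))   # True
print(can_adopt(seq_b, HAIRPIN))   # True
print(can_adopt(seq_c, HAIRPIN))   # False: A cannot pair with C
print(identity(seq_a, seq_b) < 0.5)  # True: the two sequences mostly disagree
```

The two compatible sequences share only their loop, so an alignment would score them as poorly conserved even though they can adopt the same shape. The third sequence shows the limit of the idea: a single change that breaks a pair without a compensating change on the other strand is not structurally neutral.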


Centromeres are the key defining feature of a chromosome. One centromere equals one chromosome, regardless of how many chromatids are present. These stretches of DNA serve as key binding sites for a specialized set of proteins, including motor proteins such as kinesins. Collectively this protein complex is referred to as the kinetochore, which is critical for the process of cell division.

Cytogenetic technologists actually look for these regions under a microscope for diagnostic purposes. When a physician suspects Down Syndrome, they order a rapid FISH test, which uses fluorescent probes that hybridize with the centromere DNA of human chromosome 21. If the technologists see 3 glowing centromeres for chromosome 21 under a microscope, that means the patient has 3 copies of chromosome 21, and thus has Down Syndrome.

Given how ancient and vital centromeres are, and given that all forms of life, from yeast to plants, fish, frogs, birds, and humans, must undergo cell division to survive, one would expect their DNA sequence to be conserved across species. However, this is one of the more striking examples of structure over sequence conservation. There is significant variation in centromeric DNA sequence from species to species, which is odd given that centromeres all perform the same function.

This paradox illustrates the importance of structural conservation over sequence conservation. When DNA is replicated, it is not left as “naked DNA”. The template strands are used for more than just copying information through base pairing. The chemical makeup of the template strand can also be used to replicate epigenetic features such as histone modifications, CpG methylation, and (among many other things) centromeres.

Thus centromere DNA sequence is not conserved, because it serves as nothing more than a set of binding sites for the numerous epigenetic proteins and modifications that come together to make a centromere a centromere. It is the final 3D chromatin structure that is conserved, because it is the final 3D chromatin structure that is necessary to form the kinetochore, NOT the sequence.

Latent Enhancers and other Distal Control Elements

Distal control elements such as enhancers or repressors are additional cases where there is poor conservation of sequence but the functional 3D information remains intact. In some cases enhancers can be constructed at seemingly random sites in the genome, long after differentiation. These so-called latent enhancers, which lack the histone marks characteristic of enhancers but acquire these features in response to stimulation, seem to act without regard to the specific sequences underlying them. The key authors of the original paper that put forth the idea “suggest that stimulus-specific expansion of the cis-regulatory repertoire provides an epigenomic memory of the exposure to environmental agents.” It can be suggested that this model of 3D information over “genes” allows organisms (vehicles) to mount faster and more robust responses without having to wait for signals to trickle down the central dogma.

Genomic Imprinting and Parent-of-Origin Effects

There are circumstances where, even when the DNA sequence is kept constant, differences in information can manifest themselves in the 3D structure of DNA. Some forms of Prader-Willi Syndrome and Angelman Syndrome can emerge from a phenomenon known as uniparental disomy. In this bizarre circumstance, a child receives both copies of a chromosome (say, chromosome 15) from the same parent (dad, for example) and zero copies of that chromosome from the other parent (no copies of chromosome 15 from mom). Even though all the information is there (all 46 chromosomes are present and accounted for), the differences in maternal and paternal epigenetic patterns on these chromosomes lead to the development of these two unique and devastating developmental diseases.

There are numerous other examples of circumstances where structure is conserved over sequence, in the genome and elsewhere. By taking the concept of selfish genes (that natural selection acts on the genes and not on the individual) one step further and recognizing that it is the information itself that is conserved, we have a more inclusive explanatory framework.