The feature article in the November-December 2008 issue of American Scientist alerts us to what many have suspected for a long time. “Genomics Confounds Gene Classification” by Seringhaus and Gerstein points to large-scale genomic studies that question the prevailing hypothesis in molecular biology: that genes are distinct parts of the DNA molecule, each of which operates by producing an mRNA transcript that is translated into a polypeptide, which in turn folds to form a protein with a specific function within the organism. This one gene – one protein view is at the core of our current understanding of biological processes. Nevertheless, scientists have long suspected that although the idea is fundamental, it is too simplistic. As I noted in “Evolution and the Future of Humanity” [Hart, 2008]:
“Those parts of the DNA molecule that do code directly for proteins are termed exons, and those parts which do not are called introns. The intron regions of the chromosome molecule are often referred to as junk sequences. These sequences are probably important in controlling the development of traits in some way or another because chromosome duplication processes are far too precise to allow replication of useless materials.”
Seringhaus and Gerstein note that as “high-throughput genomics is generating data on thousands of gene products … biology’s basic unit, it is clear, is not nearly so uniform nor as discrete as once thought”. Although the basic concept of one gene – one protein still stands, there is a need for an enhanced taxonomy of genes that can improve our ability to classify and interpret the molecular products of the DNA – RNA genomes. Current analytical methods that simultaneously examine the relationships among millions of bases along the genome are showing that “creating an RNA transcript from a DNA region is more complex” than once assumed: transcription can extend beyond the known boundaries of a specific gene, often into areas of the genome thought to be relic genes harbored in the introns. In the Eukaryotes, introns were previously thought to be spliced out prior to protein production. It is now seen that intron sequence can be retained in the final transcript, and so in the protein, while exons can be discarded. This complicates the work of the systematist and demands a new taxonomy to allow a more rigorous and comprehensive classification system.
Significantly, the authors note that “understanding of gene regulation is also changing”. The classical idea that repressors, operators and promoters are located in close proximity to one another, as exemplified by the classic lac operon in bacteria, is again too simplistic: “in mammalian systems and other higher eukaryotes … genes can be regulated very far upstream by enhancers over 50,000 base pairs away, even beyond adjacent genes”. This is made possible by the ability of DNA to fold, which brings a distant enhancer into contact with the gene it regulates. Moreover, we have known for a while that gene activity can be modulated by epigenetic effects such as the addition of methyl groups.
Genomics is at a developmental stage that many other natural sciences have passed through. My own early interests were in the taxonomy and classification of microfossils, and in that field it was recognized early on that progress could ensue only with a rigorously enforced and standardized nomenclature integrated into a well thought out taxonomic framework. Seringhaus and Gerstein imply that this is what is needed for gene classification if genomics is to progress. In Neontology and Palaeontology nomenclature is standardized through International Codes, e.g. the International Code of Botanical Nomenclature, and the authors point out the need for such a code for genes. Nomenclatural standardization is necessary for unambiguous communication, but it has an added value: a standardized naming system, when viewed within a classification, also holds global knowledge about each object classified. The classification structure itself becomes a knowledge web that can be queried as a massive bio-database. Seringhaus and Gerstein use the semantic web of the internet, and the ability of Google to extract information, as an example of a rich classification scheme. However, I would urge caution about any direct approach in this direction, because the web as illuminated by Google will produce a classification in which systematic anarchy prevails. Only with a rigorously enforced and standardized nomenclature within a well thought out [multiple] taxonomic framework would a semantic web produce what is needed. Anyone who uses the web for non-trivial scientific research is aware that the opinion of a single professional is worth more than those of a thousand amateurs [anyone know who said that first?].
It is important to remember the following distinctions: “Taxonomy pertains to a system devised for dividing up things into different types and how they are arranged one to another. Classification pertains to an actual classification that is set up for a group of things. Systematics is the actual classification of individual things within a taxonomic framework.” [Hart, 1996]. In palaeontology the route to a more stable palaeo-species classification lay in the simple move from a morphospecies definition based on measurable traits to one in which the evolutionary [temporal] acquisition of traits is important. This led to a significant advance in the inherent knowledge content of fossil classifications.
Now that we realize that the DNA sequence in a single region does not necessarily define a gene, we can incorporate into gene classification the biochemical effect [developed by transcription] on the functional phenotype. Moreover, the external and internal selection pressures operating during transcription need to be more fully understood and incorporated into the classification. Clearly, the phenotypic effect does not necessarily capture the function of the gene at the molecular level, and to understand the genotype we need to know how biochemical products affect the metabolic pathways and the resulting biological system. All of these aspects need to be incorporated into an improved gene classification.
The taxonomy for genes needs to be non-hierarchical, i.e. a multi-level taxonomy rather than a single tree. The authors hint at a classification based on gene ontology that uses a directed acyclic graph [DAG] structure, within which multiple classifications exist, all pointing towards a single gene. What is interesting in the DAG approach is that the multiple classifications, each built for its own purpose and each pointing to a single object, i.e. a gene, allow a web to be built that can be interpreted in semantic terms. It is a system that would lend itself to an AI approach to gene systematics. Recently, I have been looking at the SOAR cognitive architecture and its rule language as a system for understanding complex relationships, like those within a genome or a cultural gamodeme, and perhaps this is the direction in which to develop a useful classification of genes. SOAR can allow a single gene to have multiple functions, and multiple genes to have a single function, within a semantic framework. This area I hope to explore in the future [as soon as I learn how to use SOAR correctly!]. Such a system could result in a greater understanding of evolution and of relationships among living systems.
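As a rough illustration of the DAG idea, here is a minimal sketch in Python. It is not the scheme proposed by Seringhaus and Gerstein, it is not the Gene Ontology itself, and it has nothing to do with SOAR. The class name ClassificationDAG, the term names and the genes geneA and geneB are all hypothetical, chosen only to show how a term can sit under more than one parent, how one gene can carry several functions, and how several genes can share one function.

```python
from collections import defaultdict


class ClassificationDAG:
    """Toy multiple classification: terms form a DAG, genes attach to terms."""

    def __init__(self):
        self.parents = {}                    # term -> set of broader terms
        self.gene_terms = defaultdict(set)   # gene -> directly assigned terms

    def ancestors(self, term):
        """Every broader term reachable from `term` by following parent links."""
        seen = set()
        stack = list(self.parents.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def add_term(self, term, parents=()):
        """Add a term under zero or more parents, refusing links that form a cycle."""
        for parent in parents:
            if parent == term or term in self.ancestors(parent):
                raise ValueError(f"{term!r} -> {parent!r} would create a cycle")
        self.parents.setdefault(term, set()).update(parents)
        for parent in parents:
            self.parents.setdefault(parent, set())

    def annotate(self, gene, term):
        """Attach a gene to a term that already exists in the graph."""
        if term not in self.parents:
            raise KeyError(f"unknown term {term!r}")
        self.gene_terms[gene].add(term)

    def terms_for(self, gene):
        """All terms that apply to a gene: its direct terms plus their ancestors."""
        result = set()
        for term in self.gene_terms.get(gene, ()):
            result.add(term)
            result |= self.ancestors(term)
        return result

    def genes_for(self, term):
        """All genes classified under a term, directly or through a narrower term."""
        return {gene for gene in self.gene_terms if term in self.terms_for(gene)}


def build_example():
    """Hypothetical terms and genes, for illustration only."""
    dag = ClassificationDAG()
    dag.add_term("molecular function")
    dag.add_term("biological process")
    dag.add_term("catalytic activity", ["molecular function"])
    dag.add_term("kinase activity", ["catalytic activity"])
    dag.add_term("signal transduction", ["biological process"])
    # Two parents: this is what makes the structure a DAG rather than a tree.
    dag.add_term("kinase signalling", ["kinase activity", "signal transduction"])
    dag.annotate("geneA", "kinase signalling")
    dag.annotate("geneB", "kinase activity")
    return dag


if __name__ == "__main__":
    dag = build_example()
    print(sorted(dag.terms_for("geneA")))
    print(sorted(dag.genes_for("catalytic activity")))
```

In this sketch geneA is reached from both the "molecular function" and the "biological process" branches, while "catalytic activity" gathers both geneA and geneB. A query in either direction is simply a walk along the parent links, which is the sense in which the classification itself can act as a queryable knowledge web.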
References:
Hart G. F. 1996 http://www.geol.lsu.edu/hart/NOTES/taxonomy.htm
Hart G. F. 2008 Evolution and the Future of Humanity: Homo sapiens’ galactic future. SCI& Publications, Boulder, Colorado. www.ScienceAnd.com ISBN-13 978-0-9818642-0-4.
Seringhaus M. and Gerstein M. 2008 Genomics Confounds Gene Classification. American Scientist, 96[6]:466-473.
George F. Hart.
Monday, November 10th, 2008.