How complex is life at the molecular level?  Human cells, for example, possess genes for an estimated 35,000 proteins and RNA molecules.  In questioning how a cell could evolve so many different functional molecules, it becomes obvious that many of these genes are similar.  A small number of rRNA and tRNA genes may together be present in more than 2,000 copies.  It turns out that gene duplication is quite common and most of our genome has resulted from the duplication of a small number of ancestral genes.  Some of the duplicated genes change gradually over time and adopt new functions so that they can no longer be considered duplicates of each other, but rather related genes of a gene family.  There are hundreds of genes belonging to each of the largest gene superfamilies such as the G-protein coupled receptor superfamily, the kinases, the immunoglobulin/fibronectin superfamily, homeodomain transcription factors, and the zinc finger family of transcription factors.  There are a large number of smaller families such as the ABC/helicase superfamily, the voltage regulated ion channel family, the cadherins, the globins, and so on.  Thus a very large percentage of the human genome has resulted from duplicating and modifying the genes of simpler ancestral genomes. The majority of known proteins in modern organisms are members of fewer than 1,000 protein families (Chothia, 1992; Sankoff, Holmes).

     What genes were present in the first cells?  This question cannot easily be answered since unmodified descendants of the first cells do not exist.  Just as it would be unreasonable to suggest that the ancestors of modern Celts or Navajo or Bantu had electricity simply because their modern descendants do, it is unreasonable to suggest that the first cells had anything which even approached the diverse set of proteins found in even the simplest of their descendants alive today.  Could the first cells have possessed the complex proteins that modern cells require for life?  No.  In fact, the first cells might have lacked protein altogether, performing necessary tasks with RNA molecules.  Later descendants seem to have incorporated proteins simply as stabilizers of the 3 dimensional shape of these functional RNAs.  Even when we reach the level of proteins, it is evident that many proteins are not as complex as they might appear at first glance.


1)      Gene Duplications: Globins, Hox Genes, and Opsins

     Not all humans have the same number of genes: some people have duplicated copies of genes of which others have only one copy (per haploid set).  Duplications can occur through different mechanisms but a common cause is unequal crossing over during recombination (recombination is a normal aspect of gamete formation).  Often, duplicated genes will then exist as a cluster of genes in tandem on a single chromosome.  There are many such examples of gene clusters in the human genome.  Other chromosomal changes, such as translocation, can move one chromosome segment to another chromosome.  When these mutations occur during the formation of gametes, they can affect the number of gene copies present in the offspring.
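The effect of unequal crossing over can be sketched in a few lines of Python. This is a deliberately simplified model, not a simulation of real recombination: chromosomes are treated as lists of gene names, the gene names are hypothetical, and the function name unequal_crossover is introduced here only for illustration. When the two breakpoints are misaligned, one recombinant gains a tandem duplicate and the other loses the gene.

```python
def unequal_crossover(chrom_a, chrom_b, break_a, break_b):
    """Recombine two chromosomes (lists of gene names) at the given cut points.

    break_a and break_b are the list indices at which each chromosome is cut.
    If break_a != break_b, the homologs were misaligned and the exchange
    is unequal.
    """
    recombinant_1 = chrom_a[:break_a] + chrom_b[break_b:]
    recombinant_2 = chrom_b[:break_b] + chrom_a[break_a:]
    return recombinant_1, recombinant_2


# Two homologous chromosomes carrying the same three hypothetical genes.
homolog_1 = ["geneA", "geneB", "geneC"]
homolog_2 = ["geneA", "geneB", "geneC"]

# Misaligned pairing: homolog_1 is cut after geneB, homolog_2 before geneB.
duplicated, deleted = unequal_crossover(homolog_1, homolog_2, 2, 1)
print(duplicated)  # ['geneA', 'geneB', 'geneB', 'geneC']
print(deleted)     # ['geneA', 'geneC']
```

The first recombinant carries two adjacent copies of geneB, which is the kind of tandem cluster described above; the second has lost the gene entirely.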

Yeast possess about 6,000 genes, Drosophila about 12,000 and the nematode C. elegans about 14,000.  Gene number increased in animals compared to simpler eukaryotes.  The first non-vertebrate deuterostome whose genome was characterized, a sea urchin, was found to have about 25,000 genes.  Not only did this sea urchin possess many of the genes found in more primitive animals, but these genes were organized into many of the same linkage groups (such as the PA-ANK-FGFR-ADR-ECR-VMAT/VACHT-LPL group, which is thought to have existed in primitive bilaterian animals) (Pebusque, 1998).  Not only did individual gene duplications continue to increase the repertoire of genes present in the vertebrate lineage, it appears that two large scale gene duplications (perhaps whole genome duplications) occurred early in the evolution of the vertebrate lineage.  One of these duplications seems to have predated the evolution of the first fish and the second to have predated the evolution of jawed fish (Stadler, 2004; Hahn, 1998; Escriva, 2002; Hoyle, 1998).  A pattern frequently observed in comparative genomics is that vertebrates have four homologs of a single invertebrate gene.
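The four-to-one pattern follows directly from two rounds of duplication, as the following minimal sketch shows. It assumes the idealized case in which every gene is doubled in each round and no copies are lost afterward; the family names and the function duplicate_genome are hypothetical.

```python
def duplicate_genome(copy_counts):
    """Return new copy counts after one whole genome duplication (no gene loss)."""
    return {family: 2 * copies for family, copies in copy_counts.items()}


# A hypothetical invertebrate ancestor with one copy of each gene family.
ancestor = {"familyX": 1, "familyY": 1}

after_first_round = duplicate_genome(ancestor)
after_second_round = duplicate_genome(after_first_round)
print(after_second_round)  # {'familyX': 4, 'familyY': 4}
```

In real genomes, some duplicates are lost after each round, so many families have fewer than four vertebrate members.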


There are many examples of recent duplications of genes in human genomes.  For example, duplications of a cluster of genes on chromosome 1p36.2 have included the pseudogene MSPL.  Human genomes can vary in the number of copies of this pseudogene they contain per haploid set, from 4 to 7 or more (van der Drift, 1999).  The following groups of human genes are examples of gene family members which have arisen through tandem duplications.  In the globin and opsin clusters, additional duplications are occurring, producing more family members in some humans than in others.

1) Red and Green Opsins

Most non-primate mammals and more primitive primates (prosimians and New World monkeys) have one opsin for long wavelengths of light.  Old World monkeys and apes have two genes resulting from gene duplication: red and green.  Some people have multiple copies of red (up to four) and green (up to seven) opsin genes.


5’-----zeta-------zeta pseudogene-------alpha pseudogene------alpha 2------alpha 1-----theta----3’


5’---------epsilon----------gamma G---------gamma A----------eta------------delta----------beta---------3’

There are variations in the number of globin genes from the alpha and beta clusters.  There are 3 copies of zeta genes in some Melanesians and Polynesians.  A common variation of the zeta pseudogene makes it a functional gene.  While most adults have 2 alpha hemoglobin genes, common variants include the possession of 1 or 3 alpha genes.  Melanesians only have 1 alpha gene and, among African-Americans, chromosomes with one alpha gene are about as common as those with two.  The function of the theta gene isn’t known although it is expressed in red blood cell lineages in fetal mammals (including humans at about 5 weeks).  There have been cases where it has been deleted but no effects of this are known.  Most people have two gamma genes (gamma G and gamma A) but some people have 3 or even 5 (there was even one family that had 4 copies on another chromosome). Delta hemoglobin can complex with alpha hemoglobin in adults and it can be considered to be a second beta hemoglobin gene. There are examples of fusion hemoglobin genes in which the delta and beta genes have fused into one gene (and even a case of a delta-beta-delta fusion gene).


The importance of HOX clusters in embryonic development cannot be over-emphasized.  These gene clusters appear to be multiple tandem duplications.  In vertebrates, the original cluster was duplicated to produce four clusters, from which a few members were lost.





Evidence indicates that a region of human chromosome 1q22 was duplicated in the ancestors of anthropoid primates. At least two new genes were generated, including a transcription cofactor expressed in the testis which might affect anthropoid speciation from ancestral forms. In order to compensate for increased dosage of the MSTO gene, different anthropoids have evolved different deleterious mutations in one of the two genes (Kuryshev, 2006).

In comparing the chemosensory genes of two nematodes of the genus Caenorhabditis, it is evident that the lineage leading to C. elegans has expanded its serpentine receptor class ab (srab) family through tandem duplication for a total of 11 new genes (Chen, 2005). The TKDP gene family is an unusual example of an amplification in a set of ten genes specific to a group of mammals (ungulates) which lack homologs in other mammals. One region of the gene seems to have resulted from the duplication of one exon with a neighboring intron (Chakrabarty, 2006).

1)      Not all parts of a protein are created equal: Protein Folds

     Does a protein have to have a precise amino acid order to maintain its function?  The answer is obviously no.  If you survey the amino acid sequences of a certain protein in any one species (such as humans) or across species, it is evident that some parts of the protein are not as critical to the overall function as others (and thus more free to change). A protein may require as few as 7 amino acids to determine its tertiary structure (Reader, 2002).  Many of the essential portions of proteins form a specific protein fold and it is this part of the protein that performs an essential function such as binding DNA, binding ATP, forming the active site of the enzyme, adding a phosphate group to a protein, etc.   For example, the zinc finger fold binds DNA and is a requirement for all the zinc finger transcription factors, allowing them to bind DNA.  The original zinc finger proteins have been duplicated hundreds of times to produce a superfamily of proteins which bind DNA.  Variations between different members of the superfamily allow them to bind to specific regions of DNA while retaining the zinc finger protein fold as the essential part of the protein. 

     How many protein folds are there?  Not as many as one might think.  Only several hundred protein folds have been identified from all known gene sequences, and it is thought that the total number of protein folds throughout the kingdoms of life may be about 1,000.  These folds may be central elements of different proteins: the average fold is known to be incorporated into over 100 different proteins, but some (such as the TIM barrel, the immunoglobulin fold, the Rossmann fold, the ferredoxin fold, and the helix-turn-helix bundle) are incorporated into thousands of different proteins each.  The twenty-five most abundant folds are parts of 61% of proteins with structural homologues throughout all groups of life (Gerstein, 1997).

      Although each major group of organisms possesses varying numbers of these folds in their genomes (for example, immunoglobulins for intercellular communication and zinc fingers for gene regulation are among the ten most abundant folds in animals but not plants or eubacteria), there are many folds which are shared.  Of 229 protein folds identified in eukaryotes, 156 were shared with bacteria.  Of 194 protein folds identified in animals (metazoans), 132 were shared with other eukaryotes.  Of 181 protein folds identified in chordates, 131 were present in non-chordate animals.  Thus, the functional portions of many human proteins are domains which evolved very early in the history of life and were subsequently duplicated and modified in more complex species (Gerstein, 1997).  
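The fold counts quoted from Gerstein (1997) can be turned into percentages with a short calculation. The sketch below uses only the numbers given in the text; the dictionary labels and the helper name percent_shared are introduced here for illustration.

```python
# (shared folds, total folds) as quoted in the text from Gerstein (1997).
fold_counts = {
    "eukaryote folds shared with bacteria": (156, 229),
    "animal folds shared with other eukaryotes": (132, 194),
    "chordate folds shared with non-chordate animals": (131, 181),
}


def percent_shared(shared, total):
    """Percentage of a group's folds that are also found in the comparison group."""
    return 100.0 * shared / total


for label, (shared, total) in fold_counts.items():
    print(f"{label}: {shared}/{total} = {percent_shared(shared, total):.1f}%")
# eukaryote folds shared with bacteria: 156/229 = 68.1%
# animal folds shared with other eukaryotes: 132/194 = 68.0%
# chordate folds shared with non-chordate animals: 131/181 = 72.4%
```

In each comparison, roughly 68 to 72 percent of the folds are shared with the broader group.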

     There are only a few thousand protein domains known in living organisms.  Only 7% of vertebrate protein domains are unique to vertebrates (Liu, 2001; International Human Genome Sequencing Consortium, 2001). There are 21 small-molecule-binding-domains (SMBDs) which bind to small intracellular molecules and are shared by at least 2 of the three main groups of organisms (eubacteria, archaea, and eukaryotes).  These small domains have been incorporated into a number of unrelated proteins.  For example, the T-OB domain has been incorporated into some ABC transporters where it regulates the uptake of the substrate (Anantharaman, 2001). The majority of eukaryotic genes are multi-domain proteins whose original forms originated as the fusion of small domains. After the fusion of domains, additional changes such as additions and deletions often occur, often on either side of the fused domains (Bjorklund, 2005). There are a large number of multidomain proteins in the human genome in which a small number of ancestral domains have been shuffled and spliced to produce a diversity of proteins.


     Exons can share homology over a number of diverse genes.  For example, sequence similarities suggest a common origin of the X exon of human α-1 (II) collagen and the second exon of rat mannose-binding protein A; the 4th exon of human serum albumin and the 7th exon of human K6b epidermal keratin; the first exons of human apolipoprotein B-100 and EGF receptor; the 5th exon of mouse α-2 type IV collagen and first exon of human complement C1q B-chain; and the second exon of human TNF-β and 3rd exon of rat asialoglycoprotein receptor.  All known exons may be descended from only 1,000-7,000 ancestral exons (Dorit, 1990). New genes can be created through exon shuffling. For example, an mRNA of alcohol dehydrogenase was inserted into an intron of the yellow emperor gene in Drosophila by retroposition. Its expression pattern and the rate of synonymous changes suggest it has acquired a new function (Long, 2003). The prolactin-induced protein arose by the partial gene duplication of a gene in the alpha-2-macroglobulin family in ancestral amniotes (Kitano, 2006).

     Since individual genes come in pieces, a cell can shuffle these pieces to produce a diversity of different proteins from one gene.  In the following diagram, one pre-mRNA containing introns (magenta) and exons (various colors) can be spliced in different ways to produce a variety of mRNAs which are composed of different sets of exons.  These mRNAs would then encode different protein sequences despite their origin from the same gene.   Alternative splicing of the original RNA transcript allows higher eukaryotes to generate a diverse repertoire of proteins.  One of the reasons that the early estimates of the number of genes in the human genome (about 100,000) were so high was that so many genes produce alternate transcripts.
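How quickly alternative splicing multiplies the products of one gene can be illustrated with a small enumeration. The sketch below is a simplification under stated assumptions: exons keep their original order, some exons are always retained (constitutive), and every other exon may independently be included or skipped. The exon names and the function splice_variants are hypothetical, and real splicing is regulated rather than exhaustive.

```python
from itertools import combinations


def splice_variants(exons, constitutive):
    """List every mRNA (as an ordered tuple of exons) that keeps all constitutive exons."""
    optional = [exon for exon in exons if exon not in constitutive]
    variants = []
    for number_kept in range(len(optional) + 1):
        for kept in combinations(optional, number_kept):
            retained = set(kept) | set(constitutive)
            # Preserve the genomic order of the exons in the spliced product.
            variants.append(tuple(exon for exon in exons if exon in retained))
    return variants


pre_mrna_exons = ["E1", "E2", "E3", "E4"]
variants = splice_variants(pre_mrna_exons, constitutive={"E1", "E4"})
for variant in variants:
    print("-".join(variant))
# E1-E4
# E1-E2-E4
# E1-E3-E4
# E1-E2-E3-E4
```

With n optional exons this model yields 2 to the power n possible mRNAs, so even a handful of skippable exons lets a single gene encode many distinct proteins.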



    The LDLR family of receptors share a number of molecular components.  The green areas represent LDLR repeats (with some variation in the number of repeats) and the red areas represent EGF-precursor domains which occur in proteins outside the LDLR family.


2)      Small proteins functioning in larger protein complexes

      Many of the protein structures of a cell are truly complex—those that replicate DNA, initiate transcription, transport substances across the cell membrane, form part of the electron transport chain in mitochondria, etc.  However, these proteins are actually complexes made of smaller subunits.  The replication of DNA is not accomplished by one protein but by a number of them.

The F0 and F1 particles of mitochondria are found in bacteria as well.  Note that the F0 pump is composed of multiple b and c subunits and the F1 pump is composed of multiple α and β subunits.
More subunits can be added throughout evolution for modification of the original function.  In the illustrations above, bacterial F0 particles have 3 subunits (a, b, and c) while eukaryotes have these subunits plus 2-5 more.  Bacterial F1 particles are composed of 5 subunits. While eukaryotic bc1 complexes may involve 9-11 subunits, bacterial versions only require 3 subunits to function: cytochrome b, cytochrome c1, and the Rieske iron sulfur protein  (Darnell, p. 599; Baltscheffsky, p. 223). 

3) Creating larger genes through fusion of smaller ones

Human hexokinase enzymes resulted from the fusion of two replicas of an ancestral hexokinase.  One of the subunits of the resultant enzyme was modified over time to serve a regulatory function.


The Tre2 oncogene only occurs in hominoid primates and seems to have resulted from the fusion of two genes: the highly conserved USP32 and the TBC1D3 gene which has undergone amplification in primates (Paulding, 2003).


Collagen is the most abundant protein in the human body.  It is composed of primordial units: one primordial unit is 6 replicas of an amino acid triplet:

Three ABC transporters: the HisQMP2 transporter of E.coli made of 4 subunits encoded by 3 genes; the Drosophila eye pigment transporter formed by the products of 2 genes (each encoding half a transporter), and the chloride ion transporter responsible for cystic fibrosis (one gene)

It has often been observed that proteins which are structurally unrelated can perform equivalent functions. There are more than ten protein superfamilies which perform the function of binding a nucleotide to a free hydroxyl group. While the proteins within a superfamily are homologous, there is little if any homology between the superfamilies. For example, there are at least 5 superfamilies of DNA polymerases, two related DNA ligase families, two families of RNA polymerases, two families of polyA polymerases, enzymes which cap RNA, two families of CCA-adding enzymes, and a variety of viral proteins (Aravind, 1999).

The DNA polymerase β superfamily includes groups of enzymes which are known in all three domains (DNA polymerase X), bacteria and archaea (CBS and cNMP-binding domain containing nucleotidyltransferases and “minimal” nucleotidyltransferases), eukaryotes and eubacteria (PolyA Polymerase CCA adding enzyme), specific to eukaryotes (PolyA polymerase, TRF4/5, 2’-5’ A polymerase), specific to archaea (CCA adding enzyme), and a variety which are only known from eubacteria (terminal deoxynucleotidyl transferase, uridylyl transferase, adenylyl transferase, kanamycin nucleotidyltransferases, and proteobacterial adenylyl cyclases). The proteobacterial adenylyl cyclases represent an interesting example of how enzymes can be modified to new functions. The “minimal” nucleotidyltransferases represent a simple structure which is thought to be comparable to the ancestral members of this gene superfamily (Aravind, 1999).


There are about 35 million single nucleotide changes which distinguish the chimpanzee genome from the human genome. In addition, there are about 5 million insertion/deletion events which are specific to one of the lineages and are responsible for 1.5% of the differences in euchromatin sequences. There have also been several chromosomal rearrangements including 9 pericentric inversions and the fusion of two ancestral chromosomes to produce human chromosome 2 (The Chimpanzee Sequencing and Analysis Consortium, 2005).

Retroviruses have made a number of lineage-specific insertions. The only retrovirus still active in the human genome, HERV-K, made 73 human-specific insertions and 34 chimp-specific insertions since the lineages separated. Two new retroviruses have been incorporated into the chimp genome, the larger of which has produced 200 copies. The human lineage has incorporated 7,000 new Alu sequences compared to only 2,300 in chimps. This is in part due to the activity of two new Alu subfamilies in humans. Given that the activity of Alu insertions is even higher in baboons than in humans, chimps may have undergone a reduction in Alu activity. Since the separation of human and chimp lineages, about 200 retrotransposed genes have been introduced into the human genome and about 300 into the chimp genome. While ribosomal protein genes are the most common retropositions in both, chimps, unlike humans, have added numerous copies of C2H2 zinc finger genes (The Chimpanzee Sequencing and Analysis Consortium, 2005).

About 29% of human proteins are identical in their amino acid sequence to those of chimps. The average protein has two amino acid differences between humans and chimps, one change in each lineage. The average gene possesses 2 synonymous changes and 3 nonsynonymous changes when the two genomes are compared. Although 5% of the genes have undergone an insertion/deletion, the median case is the addition/deletion of one codon. About 585 genes seem to have undergone positive selection when comparing humans and chimps. The X chromosome is the most similar between humans and chimps while the Y chromosome is the most different (The Chimpanzee Sequencing and Analysis Consortium, 2005). The mitochondrial DNA difference between humans and chimps is less than double that which separates mammoths from Asian elephants (Krause, 2006).
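The distinction between synonymous and nonsynonymous changes can be made concrete with a short sketch. A synonymous change alters a codon without altering the amino acid it encodes; a nonsynonymous change alters the amino acid. The table below is the standard genetic code written in the usual T, C, A, G order; the function name classify_change and the example codons are illustrative choices and are not taken from the cited study.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon_before, codon_after):
    """Label a codon substitution as synonymous or nonsynonymous."""
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "nonsynonymous"


print(classify_change("CTT", "CTC"))  # leucine -> leucine: synonymous
print(classify_change("GAA", "GTA"))  # glutamate -> valine: nonsynonymous
```

Because many amino acids are encoded by several codons, a substantial share of single-nucleotide changes in coding sequence leave the protein unchanged, which is why the two classes are counted separately.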

The overwhelming majority of dog genes, more than 19,000, have homologs in the human genome. Humans have more genes than dogs (current estimates of the human genome are about 22,000 genes) because of the numerous gene duplications which occurred in the primate lineage (Lindblad-Toh, 2005).

A dog DNA sequence differs by one nucleotide, on average, every 420 base pairs when compared to a coyote, every 580 base pairs when compared to a wolf, and every 900 base pairs when compared to other breeds of dog. This is the Single Nucleotide Polymorphism rate or SNP rate. In a typical dog breed, about 73% of the SNPs were polymorphic; that is, multiple alleles existed at that site. It has been an estimated 9 thousand dog generations since dog and wolf lineages separated compared to an estimated 4 thousand generations since human and chimp lineages separated (Lindblad-Toh, 2005).
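The spacings quoted above can be converted into differences per thousand base pairs, which makes the three comparisons easier to set side by side. The sketch uses only the figures in the text; the helper name snps_per_kb is introduced here for illustration.

```python
# Average number of base pairs between differences, as quoted in the text.
spacing_bp = {"coyote": 420, "wolf": 580, "other dog breeds": 900}


def snps_per_kb(bp_per_difference):
    """Convert 'one difference every N base pairs' into differences per 1,000 bp."""
    return 1000.0 / bp_per_difference


for comparison, spacing in spacing_bp.items():
    print(f"dog vs {comparison}: {snps_per_kb(spacing):.2f} differences per kb")
# dog vs coyote: 2.38 differences per kb
# dog vs wolf: 1.72 differences per kb
# dog vs other dog breeds: 1.11 differences per kb
```

On this scale, a dog sequence differs from a coyote sequence a little more than twice as often as it differs from another dog breed.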

The genes which have undergone positive selection in the human lineage since its divergence from the lineage of chimps include genes for olfactory receptors, skeletal development (TLL2, ALPL, BMP4, SDC2, MMP20, and MGP), the development of the nervous system (NLGN3, SEMA3B, PLXNC1, NTF3, WNT2, WIF1, EPHB6, NEUROG1, and SIM2), and homeodomain transcription factors (CDX4, HOXA5, HOXD4, MEOX2, POU2F3, MIXL1, and PHTF). A gene required for speech, forkhead-box P2, and a gene involved in hearing, tectorin, have also undergone selection in the human lineage (Clark, 2003).

Major Mendelian disorders are more likely to occur in genes which have undergone positive selection in the human lineage (Clark, 2003).


One genetic analysis of dog breeds indicated that 27% of the total genetic variation was caused by variation between breeds. This degree of genetic variation is much greater than that found between diverse human populations (usually 5-10%) (Parker, 2004). Alaskan malamutes and Siberian huskies are the dog breeds which are most closely related to wolves (Parker, 2004).


Almost all human genes whose protein function is known (more than 97%) are not completely unique; at least part of their sequence is a duplicate of a sequence contained elsewhere in the genome. Most genes (about 80%) are composite genes which include sequences that match those contained by multiple genes elsewhere in the genome. Analysis of several invertebrate genomes results in similar conclusions (Britten, 2006).

Sequences in proteins can be duplicated through several processes. Duplication results from the insertion of mRNA copies into chromosomes, an error in crossing over during synapsis, and the duplication of chromosomal regions through segmental duplications, aneuploidy, and polyploidy (Britten, 2006).

Duplicate genes are not retained in genomes at constant rates. Mouse genomes have retained more duplicate genes than human genomes (Shiu, 2006).



     The phosphoglycerate kinase in trypanosomes has acquired a unique amino acid sequence by the capture of an intron.  No new function has yet been acquired by this sequence (Golding, 1994).


The glycine receptor is homologous to the GABA receptor. Amino acid changes at residues 159 and 161 convert the wild type receptor, which responds to glycine, into a mutant channel which responds to GABA (Schmieden, 1993).

An olfactory receptor gene is expressed in the notochord of the developing chick embryo, suggesting that it has been adapted for a novel signaling function (Nef, 1997).

A group of proteins once classified as FGFs are now classified as fibroblast growth factor homologous factors (FHFs) because they are not secreted and do not interact with FGF receptors. They have been modified to interact with the intracellular region of certain sodium channels and with Islet-brain2, a neuronal MAP kinase scaffold protein (Goldfarb, 2005).

The serpin superfamily has undergone a great diversification in vertebrates to produce proteins ranging from ovalbumin in chicken egg whites to proteins which regulate inflammation and embryological development. Although most function as serine protease inhibitors, some have evolved modified functions such as signaling (e.g. angiotensinogen) and energy storage (ovalbumin). Serpins are ancient molecules and are known from all three domains of life (Benarafa, 2005).

The brain synthesizes a number of steroid hormones such as dehydroepiandrosterone (DHEA), androstenedione, pregnenolone, progesterone, and deoxycorticosterone. Cytochrome P450 enzymes and other enzymes are required for their production from cholesterol. These neurosteroids are capable of interacting with cell membrane receptors such as GABA receptors, sigma receptors, NMDA receptors, glycine activated chloride channels, ACh receptors, and specific calcium channels (Dubovsky, 2005).

Vertebrates utilize connexin proteins to establish gap junctions. Invertebrates utilize innexin proteins which are structurally similar but evolutionarily unrelated to connexins (EMBL-EBI, 2006).

The topoisomerase V found only in Methanopyrus kandleri is unlike those used by other organisms and may have originated from a virus (Forterre, 2006).

Thymine can be synthesized in several different ways by different organisms. The enzyme thymidylate synthase (ThyA) can methylate deoxyuridine 5’-monophosphate (dUMP), thymidine kinase can modify components of nutrient media, and ThyX can convert uracil into thymine (Myllykallio, 2002).

Lampreys and hagfish possess lymphocyte receptors which are modified in development to produce a considerable diversity of proteins. These receptors are not homologous to the antibodies and TCRs of jawed vertebrates and instead represent an independent mechanism for generating a diversity of leukocyte receptors (Pancer, 2005).

Although both Drosophila pseudoobscura and Drosophila melanogaster possess Y chromosomes, they are not homologous, indicating that the Y chromosome of D. pseudoobscura has evolved separately (Carvalho, 2005).

Although archaebacteria may possess complete glycolysis pathways, they can perform individual steps by using proteins which are completely unrelated to those of eubacteria (such as the steps performed by hexokinase, aldolase, PFK, and TPI) (Martin, 2002a).

Archaebacterial cell membranes are composed of lipids which are very different from those found in eubacteria, and the biochemical pathways which produce these lipids are equally divergent (Martin, 2002a).

Mice and rats use carbonic anhydrase in olfactory neurons to detect carbon dioxide, which is odorless to humans (Hu, 2007).

Although the primary role of beta defensin is that of an immune protein, mutations can also result in black pigmentation in dogs (Candille, 2007).

Caseins compose more than 80% of the protein in milk. Most caseins interact with calcium and are important not only as a source of amino acids for the newborn, but also for the calcium they carry. κ-casein is not homologous to other caseins and seems to have originated from a modified duplicate of a fibrinogen gene. All therian mammals express κ-casein and at least one of the calcium-sensitive caseins. Mice with mutations in their κ-casein gene fail to lactate (Shekar, 2006).


Three genetic mechanisms are known by which monoallelic expression of a gene is achieved: imprinting, X chromosome inactivation in females, and a third system which randomly inactivates either paternal or maternal copies. This third system is known to affect about 300 genes including olfactory receptors, T cell receptors, interleukins, and natural killer receptors (Gimelbrant, 2007).