What was the shape of the original chromosomes?  This is not clear since, in general, prokaryotes possess circular chromosomes while eukaryotes possess linear chromosomes.  The shape of chromosomes can even vary within a genus:  some members of Chlamydomonas have circular mitochondrial chromosomes, while others possess linear ones (Gray, 1992).  While most bacteria possess circular chromosomes, a few species are known to possess linear chromosomes with telomeres which utilize reverse transcriptase (Lue, 2004).

    Many feel that the compact, circular chromosome of modern bacteria is a modification of an ancestral condition of mini-chromosomes, one which has been selected for efficiency in rapidly dividing bacterial cells.  Small linear chromosomes would have offered a number of advantages to the earliest cells.  They could be replicated without the risk of the damaging errors that affect large chromosomes and would not have required enzymes to control the degree of twisting in the DNA (topoisomerases).  Multiple copies of a chromosome would have allowed function to continue if one copy suffered a mutation.  Multiple copies of small chromosomes would also have facilitated the evolution of greater complexity through lateral transfer, gene duplication, and gene diversification following duplication.  Before eukaryotes evolved a mitotic cycle which could control the partitioning of these chromosomes into daughter cells, mini-chromosomes would have been distributed randomly when the cell divided (Woese, 1998).



     What did the original genes look like?  Most prokaryotic genes are continuous coding units while most eukaryotic genes are divided structures, split into segments known as exons which include the coding sequences for proteins and intervening sequences known as introns which do not determine the sequence of a protein.


     Although some bacterial introns are known, they are not found in most bacterial genes (Gray, 1992).  Did introns exist in the original genes or are they an adaptation which evolved in eukaryotes?  Could both of these conditions be true?



      In the process of converting a DNA message into protein, the DNA is copied into RNA.  The introns of the RNA must be removed and the exons joined before the RNA can leave the nucleus and be translated into protein.  How are introns removed?  There are two general mechanisms: some introns require a complex known as a spliceosome (discussed shortly) and others remove themselves.  There are different mechanisms through which introns can remove themselves; the two main groups of self-splicing introns (group I and group II) differ in internal organization and are not related to each other.  Group I and II introns are found in both organelles and bacteria; group I introns are also known in the nuclei of lower eukaryotes.  Catalytic RNA introns are thus classified as ribozymes, and ribozymes are thought to have been the primary catalysts of the RNA world.  Splicing the genetic message to produce functional molecules would have been an essential process for the first cells in the RNA world.

      While some RNA introns are capable of catalytic activity and splice themselves out of the genetic message, others are excised by a structure called the spliceosome.  The spliceosome is a complex composed of a number of proteins and RNAs called small nuclear RNAs (snRNAs).  The snRNAs exist in all eukaryotes and combine with proteins into small nuclear ribonucleoproteins (snRNPs), which form the spliceosome and control the splicing of pre-mRNAs to produce mRNAs.  The snRNA is critical in this process: U2 and U6 can begin splicing even without the protein, and mutations in the RNA sequences affect the specificity of the splicing (Maniatis, 2002).

      Introns typically begin with the sequence GT and end with AG (the GT-AG rule).  Splice sites are generic and not tissue specific.  In principle, any GT site can interact with an AG site, allowing alternative splicing possibilities.  Some sites bind snRNPs less well than others.  About 1 in 10,000 introns begin with AT and end with AC, and these utilize different snRNAs.  The spliceosome can frequently produce different versions of the same original transcript through alternative splicing.  For example, the human gene RBP-MS can produce at least 12 different transcripts (OMIM; Sharp, 1985; Zhou, 2002).
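
The GT-AG rule can be illustrated with a minimal sketch. It assumes a made-up toy sequence and treats every GT...AG span as a candidate intron, ignoring everything else that real splice-site recognition depends on (such as how well a site binds snRNPs). The function names, the `min_len` parameter, and the sequence are hypothetical and chosen only for illustration.

```python
import re

def candidate_introns(seq, min_len=4):
    """List (start, end) pairs such that seq[start:end] begins with GT and ends with AG."""
    seq = seq.upper()
    # Lookahead patterns so that overlapping sites are not missed.
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", seq)]
    # Any GT may pair with any AG that lies far enough downstream of it.
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]

def splice(seq, intron):
    """Remove one candidate intron and join the flanking sequence."""
    start, end = intron
    return seq[:start] + seq[end:]

toy = "CCGTAAAGCCGTTTAGCC"  # hypothetical sequence, not a real gene
for intron in candidate_introns(toy):
    print(intron, splice(toy, intron))
```

With two GT sites and two AG sites, the toy sequence yields three candidate introns and therefore three different spliced products, which is the sense in which generic splice sites allow alternative splicing.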

     Additional small RNAs, called small nucleolar RNAs (snoRNAs), are located in the nucleolus within the nucleus.  As was previously discussed, many of the snoRNA genes are located in the introns of other genes, leading some to suggest that introns were originally the coding structures of genes that produced functional RNAs.  The exons which separated them were used to produce random amino acid chains which were subsequently selected for specific functions.


     Since genes come in pieces, a cell can shuffle these pieces to produce a diversity of different proteins.  In the following diagram, one pre-mRNA containing introns (magenta) and exons (various colors) can be spliced in different ways to produce a variety of mRNAs which are composed of different sets of exons.  These mRNAs would then encode different protein sequences despite their origin from the same gene.
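
The combinatorial effect of exon choice can be sketched in a few lines. This is a simplified model, assuming that the first and last exons are always retained, that each internal exon is independently kept or skipped, and that exon order never changes; real genes are more constrained. The exon labels and the function name are hypothetical.

```python
from itertools import combinations

def isoforms(exons):
    """Yield each ordered exon set that keeps the first and last exon."""
    first, *middle, last = exons
    for k in range(len(middle) + 1):
        # combinations() preserves the original order of the internal exons.
        for kept in combinations(middle, k):
            yield [first, *kept, last]

exons = ["E1", "E2", "E3", "E4"]  # hypothetical exon labels
for iso in isoforms(exons):
    print("-".join(iso))
```

Under these assumptions a gene with n internal exons can give 2^n distinct mRNAs, so even the four-exon example produces four transcripts from a single gene.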


     Given that catalytic RNAs are responsible for this splicing of pre-mRNAs, it is possible that RNA splicing predated the use of proteins in precells.  If the ancestors of cells depended on functional RNA molecules, RNA splicing would have been crucial for the production of functional RNAs.  The exon theory of genes proposes that the original genes were small RNAs, each coding for 15-20 amino acids.  The splicing performed by introns was thus vital in processing the first proteins (DiGiulio, 2001).



     One of the ways that higher eukaryotes generate such a diverse repertoire of proteins is through alternative splicing of original transcripts.  While much of this splicing occurs by joining exons separated by introns, there are also cryptic splice sites which lack introns, but which are recognized by the spliceosome.  In comparing homologous genes from different species, the cryptic splice sites of one species can correspond to the positions of introns in other species.  It seems that the process of splicing a transcript at a certain site can introduce an intron at that site (Sadusky, 2004).  There are specific splice sites where new introns can non-randomly insert.  Introns usually arise de novo instead of originating from the duplication of existing introns (Sverlov, 2003).

       While the presence of introns is shared by all eukaryotes, the use of introns and exon shuffling seems to have increased markedly in animals, contributing to their success (Patthy, 1999; Muller, 2002). 



     When introns are analyzed, do they appear to be old regions shared between distantly related organisms, or are they relatively new and found only in closely related groups?  A comparison of more than 680 groups of genes indicates that many introns are conserved throughout eukaryotes.  About a third of the introns in these genes found in a protist (the malaria parasite) also exist in at least one of the groups of higher invertebrates, indicating that many introns have been retained over 1 billion years.  In a number of genes, such as p38 and JNK kinases, the introns are highly conserved from sponges through protostomes (such as flies) and deuterostomes (such as humans).  New introns are also known which are specific to some lineages, especially animals and plants (Rogozin, 2003; Muller, 2002).  For example, an intron has been inserted in the SRY gene of dasyurid marsupials, apparently without altering its function (O’Neill, 1998).  Some of the introns in higher animals originated after the evolution of sponges, such as the introns of tyrosine kinase and crystallin genes (Gamulin, 1997; DiMaro, 2002; Muller, 1999).  Comparisons of closely related organisms (such as two genomes) demonstrate that some introns have a recent origin and that the reverse splicing of existing introns is one mechanism for the generation of novel introns (Coghlan, 2004).  More than 120 introns in the C. elegans genome have been shown to be of recent origin.  Many are duplicates of introns in the same genome and some are duplicates of introns in the same gene (Logsdon, 2004).  Several bacteria are known to have introns in 23S rRNA genes which originated from eukaryotes (Nesbo, 2003).

     Genome comparisons have demonstrated that some introns are variable in homologous genes in different organisms (suggesting that these introns evolved late) while others are consistently found in the same position in all animals and even in more primitive eukaryotes (suggesting an early origin of many introns; Sverlov, 2003).  Although some forms of alternative splicing have been conserved between humans and mice, the majority seem to have evolved since the divergence of these lineages (Yeo, 2005).

Silent mutations can affect splicing and thus are not truly neutral.  In the CFTR gene, such mutations have resulted in exon skipping and the formation of nonfunctional proteins (Pagani, 2005).


     Does a protein have to have a precise amino acid order to maintain its function?  The answer is obviously no.  If one surveys the amino acid sequences of a certain protein in any one species (such as humans) or across different species, it is evident that some parts of the protein are not as critical to the overall function as others (and thus more free to change). A protein may require as few as 7 amino acids to determine its tertiary structure with most of the amino acids being less critical to the overall protein shape (Reader, 2002).

     Many of the essential portions of proteins form a specific protein fold called a domain.  Different protein domains can perform different functions: one binds DNA, another binds ATP, another forms the active site of the enzyme, etc.  Organisms seem to have increased their complexity by the diversification of the proteins containing existing domains rather than the evolution of new domains.  For example, the zinc finger fold binds DNA and is a requirement for all the zinc finger transcription factors, allowing them to bind DNA.  The original zinc finger proteins have been duplicated hundreds of times to produce a superfamily of proteins which bind DNA.  Variations between different members of the superfamily allow them to bind to specific regions of DNA while retaining the zinc finger protein fold as the essential part of the protein.

     How many protein folds (domains) are there?  Not as many as one might think.  Only several hundred protein folds have been identified from all known gene sequences in modern organisms, and it is thought that the total number of protein folds throughout the kingdoms of life may be about 1,000.  These folds may be central elements of different proteins; the average fold is known to be incorporated into over 100 different proteins, but some (such as the TIM barrel, the immunoglobulin fold, the Rossmann fold, the ferredoxin fold, and the helix-turn-helix bundle) are incorporated into thousands of different proteins each.  The twenty-five most abundant folds are parts of 61% of proteins with structural homologues throughout all groups of life (Gerstein, 1997).


About 75% of the known protein domains are shared between at least two of the three domains of life (Kurland, 2007).  The distribution of domains in eukaryotic cells may approximate that of the early cells, and those of the two bacterial domains may have undergone significant reduction for greater efficiency (Kurland, 2007).  Although each major group of organisms has a different distribution of these folds (for example, immunoglobulins for intercellular communication and zinc fingers for gene regulation are among the ten most abundant folds in animals but not plants or eubacteria), there are many folds which are shared.  Of 229 protein folds identified in eukaryotes, 156 were shared with bacteria.  Of 194 protein folds identified in animals (metazoans), 132 were shared with other eukaryotes.  Of 181 protein folds identified in chordates, 131 were present in non-chordate animals (Gerstein, 1997).
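
The fold counts quoted from Gerstein (1997) are easier to compare as percentages. The short calculation below uses only the numbers given above; the dictionary labels and the function name are introduced here for illustration.

```python
# (shared folds, total folds identified) as quoted in the text
fold_counts = {
    "eukaryotes, shared with bacteria": (156, 229),
    "animals, shared with other eukaryotes": (132, 194),
    "chordates, shared with non-chordate animals": (131, 181),
}

def shared_percent(shared, total):
    """Percentage of a group's folds that also occur in the comparison group."""
    return round(100 * shared / total, 1)

for label, (shared, total) in fold_counts.items():
    print(f"{label}: {shared_percent(shared, total)}%")
```

The three comparisons give roughly 68%, 68%, and 72%, so about two thirds or more of the folds in each group are shared with the comparison group.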

     There are only a few thousand protein domains known in living organisms.  Only 7% are unique to vertebrates (Liu, 2001; International Human Genome Sequencing Consortium, 2001).

There are 21 small-molecule-binding domains (SMBDs) which bind to small intracellular molecules and are shared by at least two of the three main groups of organisms (eubacteria, archaea, and eukaryotes).  These small domains have been incorporated into a number of unrelated proteins.  For example, the T-OB domain has been incorporated into some ABC transporters where it regulates the uptake of the substrate (Anantharaman, 2001).

     The duplication of domains can increase diversity through two separate mechanisms.  The first is the duplication of individual genes followed by modification of at least one of the copies, which creates related members of a gene family that share functional characteristics because of a common domain.  The second is the production of multidomain proteins through exon shuffling.  A great deal of exon shuffling has occurred in eukaryotes, especially in vertebrates.  Exon borders often coincide with the borders of a domain (Liu, 2004d).  Domains which are coded by exons whose borders lie close to the borders of the domain have been amplified throughout the genome to a greater extent than other types of domains (Liu, 2005).

Multidomain proteins which possess the functional domains of unrelated gene families are known in every group of animals, although it seems that multidomain proteins have made more limited contributions to more primitive organisms.  The majority of the multidomain animal proteins function on the cell membrane or are extracellular, such as receptor protein kinases, cell-cell adhesion proteins, and plasma proteins like clotting factors.  The majority of extracellular proteins seem to have been produced through exon shuffling (Patthy, 1999).

     The following group of receptors provides a good example of multidomain proteins.  There are many molecules (such as the CAMs, contactin, nephrin, myomesin, MERTK, PUNC, TIE2, ROBO1, CRLF1, contactin3) which contain both fibronectin and immunoglobulin domains (depicted in green below).


The members of the LDLR family of receptors share a number of molecular components.  The green areas represent LDLR repeats (with some variation in the number of repeats) and the red areas represent EGF-precursor domains which also occur in proteins outside the LDLR family.  It is evident that molecular diversity can be generated by the shuffling of domains which originated in separate ancestral proteins.


     The following depiction of the units of ABC transporters shows that they are homologous despite their being coded by 3 genes in bacteria, 2 genes in flies, and 1 gene in humans.  Domains which represented separate genes in bacteria were fused to produce a gene encoding multiple domains.


Three ABC transporters: the HisQMP2 transporter of E. coli, made of 4 subunits encoded by 3 genes; the Drosophila eye pigment transporter, formed by the products of 2 genes (each encoding half a transporter); and the chloride ion transporter responsible for cystic fibrosis (one gene).



     In bacteria, genes which share a similar function are often located adjacent to one another and regulated by the same control elements.  Such an organization is called an operon.  Archaebacteria have an operon-like organization of genes, similar to that of eubacteria (in both operon organization and the order of genes in the operons) (Gray, 1992).  Only a few examples of operons are known in eukaryotes, although there are groups of genes with a related function which are located near each other (such as the various loci of the MHC complexes in humans).

Operons exist in C. elegans, including some in which genes with similar functions are grouped together (Zorlo, 1994).




    In eukaryotes, DNA is coiled around histone proteins.  The histone fold domain is present in a large number of proteins including transcription factors and enzymes in organisms ranging from archaea to mammals.   Although histones were thought to be unique to eukaryotes, archaea are known to possess histone-like proteins (HMf, HMt) which form dimers and allow the supercoiling of DNA (Arents, 1995).


A number of genes are imprinted in both humans and mice, suggesting that ancestral inheritance patterns have been conserved. These genes include insulin2, insulin-like growth factor2, the zinc finger protein PEG3, and a number of snoRNAs and other noncoding RNAs (Morison, 2005).