NONCODING SEQUENCES

NONCODING SEQUENCES AND OTHER COMPONENTS OF THE GENOME

Transposable elements were considered to be an odd occurrence when they were first discovered. It is now known that they are the most common component of eukaryotic genomes, representing almost half the human genome and more than 70% of the genomes of some grasses (Wessler, 2006).

Not only can they form new alleles by inserting into coding areas, they can also alter expression rates as a result of their insertion (Wessler, 2006). Chromosomal mutations may result from their activity (Wessler, 2006).

ENDOGENOUS VIRUSES

     It is estimated that 97% of the human genome is noncoding DNA that does not produce protein or functional RNA.  Surveys of the intergenic regions in the genomes of humans, flies, worms and yeasts indicate that they contain regions which seem to be sequences of ancestral genes which have long ago lost their function and degenerated.  These pseudomotif regions include zinc finger, leucine zipper, nucleotide-binding regions and EGF domains (Zhang, 2002a).  

     Hundreds of genes appear to have resulted from horizontal transfer from bacteria and dozens of genes originated from transposable elements (International Human Genome Sequencing Consortium, 2001).  Transposons and LTR retroposons seem to have been inactivated in the human lineage, even though they compose half the human genome (International Human Genome Sequencing Consortium, 2001).  There are more than 70 copies of HMG17 in the genome, most of which are retropseudogenes (making this the largest known family of retropseudogenes) (OMIM).

Viral DNA may compose about 5% of eukaryotic genomes (Bamford, 2005). Retroviruses and retrotransposons compose about 8% of the human genome. Most of the 22 separate human endogenous retrovirus (HERV) families that have entered the genome, such as HERV-R, originated between 10 and 50 million years ago, prior to the separation of the hominid lineage from that of non-human primates. Often these human sequences are conserved at specific homologous sites in the chromosomes of other species. While some HERV group members exist as a single copy, others form thousands of copies (Kim, 2006). Some HERVs are expressed and can affect infection and cancer. It even appears that human host cells have adopted some of the sequences in the development and function of normal tissues (Kim, 2006).

      About one percent of the human genome appears to have originated in viruses which infected human ancestors.  Endogenous retroviruses (ERVs) may be partially excised from the genome, leaving LTRs as a remnant.  It is estimated that 8% of the human genome is composed of LTRs, with perhaps half a billion copies.  There are 22 known families of human ERVs, at least one of which is capable of forming complete viruses.  The many remnant LTRs indicate that numerous HERVs have been partially excised.  ERVs can enter the genome through infection or through replication of existing elements.  ERVs in birds and reptiles have primarily been inherited through the germline.  In mammals, however, ERVs seem to have entered the genome as viruses adapted to new hosts.  ERVs entered the genomes of our ancestors throughout the span from early mammals to the human lineage.  These viruses shuffle DNA and add new promoters and new genes, but they can also cause disease and add to the DNA which has to be replicated (Bromham, 2002).  Genomes of a number of gnathostomes possess copies of retroviruses which are homologous to the HERV-I retrovirus sequences in the human genome (Martin, 1997).  Many retroviral sequences of catarrhine primates share a common ancestry from an ancestral infection (Tristem, 2000).

     There are a number of different families of these human endogenous retroviruses (HERVs): HERV-E (30 to 50 copies per genome), HERV-R (1 copy), HERV-H (50-100 with a provirus structure, thousands with a transposon structure), HERV-I (2-25 copies), HERV-P (10-20 copies), and HERV-K (20-50 copies).  While most are not known to produce protein, the Env protein is produced from HERV-R and Gag, cORF, protease, polymerase, and Env proteins are produced from HERV-K (Lower, 1996).

 

LTR class I endogenous retrovirus (ERV) retroelements frequently include binding sites for p53 and are apparently part of the regulatory network for p53 action (Wang, 2007).

Human endogenous retroviruses result from viruses inserting into the chromosomes of cells in the germline, and they compose 8% of the genome. Usually, they are not located in genes because their insertion there is often detrimental (Wang, 2007).

     There is a danger in the transplantation of animal organs into humans that endogenous viruses in the animal might replicate in humans.  The methylation of DNA helps to inhibit the activity of endogenous viruses.  The hybridization of two wallaby species led to a large-scale amplification of ERVs in this marsupial, which added extensive regions onto the centromeric regions of the chromosomes.  Cultured pig cells can produce retroviruses that can infect human cells, but this has not been observed in xenotransplants (Bromham, 2002).

Anthropoid primates have far more ERV-L elements than prosimians.  It appears the ancestor of anthropoids underwent a major burst of replication.

In the mouse genus Mus, two bursts of ERV-L amplification have occurred and the number of LTRs present varies from 364 to 1275.  The number of viral pol sequences varies from 40 to 155 in mice of the genus Mus, while humans have about 194 (plus/minus 8) and chimps about 215 (plus/minus 21).  HERV-L has about 200 copies in the human genome; similar sequences have been found in mice (Benit, 1999).

 

     There are about 1.2 million Alu sequences in the human genome.  Some have mobilized during recent human evolution and their presence is polymorphic in modern populations (Devor, 2003).

Alu sequences are the most common SINES in primate genomes. Although most of the amplification of these sequences occurred between 40 and 60 million years ago, individual primate lineages continue to experience changes in their Alu sequence repertoire. Recombination events between Alu sites have been identified as being responsible for more than 50 human disorders, and Alu-caused mutations may be the cause of 0.1% of human diseases, including cancer. Alu mutation events can also cause deletions, and evidence indicates more than three thousand of these events in primate lineages (Callinan, 2005). SINES and retroviruses have been less common in the ancestors of dogs than in the ancestors of modern primates (Lindblad-Toh, 2005).

RETROTRANSPOSONS

Retrotransposition is one of the primary mechanisms of gene duplication (Managadze, 2007).

     Retropseudogenes, which have arisen from the reverse transcription of mRNA into DNA, are known in the genomes of bacteria, plants, insects and vertebrates.  The greatest number of them (at least 20,000) are known in mammals (Devor, 2003).  About 3,600 gene transcripts have been retrotranscribed back into the genome. More than 1,100 have been inserted into the introns of other genes and at least 460 are transcribed. Some of these gene fusions occurred prior to the human lineage separating from that of non-human primates. At least 50 have evolved a function in the testis, perhaps as a way of compensating for the single X chromosome in males (Vinckenbosch, 2006).

Genome comparisons of mammals and reptiles suggest that the CR1 and MIR retroelements were probably the dominant repetitive structures in the genomes of ancestral amniotes (Shedlock, 2007).

Retrotransposon-mediated sequence transduction occurs when a retrotransposon is reintroduced into a chromosome along with the sequence which flanked its original position. The SVA retrotransposon family is the youngest family known in primates, and its activity in the ancestors of African apes and humans resulted in three duplicated copies of the AMAC gene, which flanked an insertion site of an SVA retrotransposon. There are about 3,000 copies of SVA in the human genome, about 10% of which copied flanking regions when they were amplified (Xing, 2006).

L1 activity has been shown to cause exon shuffling in both genome studies (two examples known in the human genome) and in cell culture (Xing, 2006).

Plant genomes also include retrotransposon families. While large plant genomes typically have multiple families with more than ten thousand copies each, smaller plant genomes may have few families of a thousand copies each, if any (Vitte, 2006).

The POTE gene family is known only in primates. One of the family members in humans is a composite gene which has gained an actin sequence as a result of retroposon insertion (Lee, 2006).

 

The majority of the human genome which is under purifying selection (98% of sequences under purifying selection) does not code for protein. Most of these sequences are unique to human genomes and difficult to identify from studies of other organisms. There are a number of conserved noncoding elements (CNEs) in the human genome which encode neither protein nor RNA. For example, there are more than 900 copies of the MER121 repeats. Although they are unknown in bird genomes, their positions are conserved in the genomes of placental and marsupial mammals. They may function as regulatory elements.

Almost 800 groups of transposon fossils in the human genome owe their radiation to the time before the branching of mammalian lineages. While most are nonfunctional, some seem to have acquired new functions and experience purifying selection (Kamal, 2006).

TRANSPOSONS

Only two types of mobile elements are known in eukaryotes: retrotransposons and DNA transposons. Retrotransposons spread when their mRNAs are reverse transcribed and inserted as DNA into new chromosomal sites. This process is performed by the reverse transcriptase and endonuclease domains of a polyprotein encoded by a given retrotransposon or by others in the genome (Kapitonov, 2006).

DNA transposons code for proteins that excise the existing transposon and integrate it into a novel region of the genome. Unlike retrotransposons, DNA transposons also require the DNA replication machinery of the host cell. Some transposons are autonomous because they code for all of the proteins needed for their transfer. Others are classified as non-autonomous because they lack a complete set of proteins and rely on some (or all) of the proteins produced by other autonomous transposons (Kapitonov, 2006).

One class of transposons named Polintons probably originated more than a billion years ago given its distribution in diverse groups of eukaryotes. The proteins encoded in this transposon include DNA polymerase B, retroviral integrase, cysteine protease, and ATPase (Kapitonov, 2006).

 

Thousands of human proteins include amino acid sequences which support the conclusion that transposable elements have been incorporated into coding regions (Britten, 2006).

 

A small percentage of L1 retrotransposons are active in modern humans and comparisons of different human ethnic groups show variable positions for these “hot” L1s. Mutations can change the retrotransposition potential of these sequences by a factor of about 400% and some alleles result in the conversion of a “hot” L1 to a “cool” one which is less likely to mobilize (del Carmen Seleme, 2006).

L1 retrotransposons compose about 17% of the human genome. They cause variation by the mutations they generate upon insertion in new regions and by their ability to move other sequences such as SINES, pseudogenes, and exons (resulting in exon shuffling) (del Carmen Seleme, 2006).

The average person’s genome possesses 80 to 100 L1s which are capable of retrotransposition and 12 which are “hot” with a high potential for mobilization (del Carmen Seleme, 2006).

 

Nearly 11,000 transposable element positions vary between human and chimp genomes. More than 95% of these insertions are classified in three families: Alu, L1, and SVA (Mills, 2006).

In ancestral anthropoid primates, a transposon inserted downstream of a SET histone methyltransferase gene. This created a chimeric gene with transposase and an exon composed of previously noncoding sequences (Cordaux, 2006; Zhang, 2006).

While an enormous amount of the DNA in the human genome (about 44%) resulted from transposons and elements similar to transposons, less than 0.05% of these elements are currently active (Mills, 2007).

Transposable elements can promote chromosomal rearrangements.

Although a large proportion (44%) of the human genome is occupied by transposons and transposon-like repetitive elements, only a small proportion (<0.05%) of these elements remain active today. Recent evidence indicates that approximately 35–40 subfamilies of Alu, L1 and SVA elements (and possibly HERV-K elements) remain actively mobile in the human genome. These active transposons are of great interest because they continue to produce genetic diversity in human populations and also cause human diseases by integrating into genes (Mills, 2007).

It is estimated that between 1 and 10 percent of human births possess a new insertion of a transposable element (Mills, 2007).

 

About 4 million identified elements have been classified into 848 families and subfamilies of elements (Mills, 2007).
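The counts in this section can be combined into a quick back-of-the-envelope calculation. The sketch below is only illustrative: it assumes that the "<0.05%" active figure quoted earlier applies to the number of element copies (the source could equally mean a share of sequence), and the function names are hypothetical.

```python
# Back-of-the-envelope figures from the counts cited in the text (Mills, 2007).
IDENTIFIED_ELEMENTS = 4_000_000   # "about 4 million identified elements"
FAMILIES = 848                    # families and subfamilies
ACTIVE_PER_10_000 = 5             # ASSUMPTION: "<0.05%" read as a share of copies


def max_active_elements(total=IDENTIFIED_ELEMENTS, per_10_000=ACTIVE_PER_10_000):
    """Upper bound on active copies, using integer arithmetic to avoid rounding noise."""
    return total * per_10_000 // 10_000


def mean_copies_per_family(total=IDENTIFIED_ELEMENTS, families=FAMILIES):
    """Average number of copies per family or subfamily, rounded to a whole copy."""
    return round(total / families)


print("Upper bound on active elements:", max_active_elements())
print("Mean copies per family:", mean_copies_per_family())
```

Under these assumptions the active copies would number at most a couple of thousand, while the average family would hold several thousand copies. The average hides very large differences between families such as Alu and much smaller ones.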

Amplification of these groups has occurred at different times. For example, the SVA-A, -B, -C and -D families diverged prior to the common ancestor of the African apes and humans while the SVA-E and -F groups diverged in humans.

Although Alu and LINE-1 (L1) elements have been known since the 1960s, the ability of a small number of them to actively undergo transposition was not recognized until the 1980s. The large number of inactive copies (whose genes are mutated or which lack vital sequences) inhibited the discovery of active forms until an L1 transposon mutation resulted in hemophilia (Mills, 2007).

 

LINE-1 (long interspersed element-1) or L1-mediated retrotransposition is known to have generated the mutations responsible for more than 50 human diseases including cystic fibrosis and hemophilia (Chen, 2008).

Although almost all eukaryotes host transposons, they are more abundant in some lineages. While transposons can compose less than 3% of fish genomes, modern and fossil transposons can compose 45% of mammalian genomes. Mammals host the L1 retroposon family and SINES (Abrusan, 2006).

Comparisons of the genomes of diverse mammals indicate that about 5-6% of the genome has undergone purifying selection. Interestingly, only 1-2% of these regions code for proteins. The rest are conserved noncoding elements (CNEs). One CNE family with more than 900 sequences underwent an expansion in basal mammals. Although there are many remnants of transposons in the human genome resulting from duplications prior to the radiation of mammals, some seem to have evolved a function (Kamal, 2006).

The descendants of ancient transposons compose about one fourth of the human genome and almost 800 classes of sequences are known (Kamal, 2006).

SINES

ALU

The model for Alu amplification is the following:

First, two proteins, SRP9p and SRP14p (components of the signal recognition particle), transport the Alu RNA to a ribosome. If a cell is producing the proteins encoded by L1 transposons, the Alu RNA can attach to the ORF2 protein as it is translated from the L1 transposon and substitute itself for the L1 mRNA. As a result, Alu amplification utilizes the cellular machinery of L1 transposons (Mills, 2007).

It is thought that half of human genes undergo alternative splicing. Alternative splicing can result in a premature stop codon and cellular mechanisms recognize and degrade these degenerate mRNAs (Mola, 2007).

Repetitive sequences, such as Alu sequences, are known to change introns, provide additional possible splice sites, and affect enhancer and silencer sequences. Repetitive elements can even contribute sequences which are incorporated into exons after alternate splicing (Mola, 2007).

Alu sequences are about 280 base pairs long and are composed of a pair of similar sequences followed by a variable polyA tail, and are bordered by sequence repeats 5 to 20 base pairs long (Mola, 2007).
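As a rough consistency check, the Alu length given here can be combined with the copy number cited earlier in this text (about 1.2 million) to estimate how much of the genome Alu elements occupy. The sketch below assumes a haploid genome size of about 3.2 billion base pairs, a figure that is not stated in this text, and it treats every copy as full length, so the result is an upper-end approximation rather than a measured value.

```python
# Rough estimate of the fraction of the human genome occupied by Alu elements.
ALU_COPIES = 1_200_000            # copy number cited in the text (Devor, 2003)
ALU_LENGTH_BP = 280               # approximate Alu length (Mola, 2007)
GENOME_SIZE_BP = 3_200_000_000    # ASSUMPTION: haploid genome size, not from the text


def alu_genome_fraction(copies=ALU_COPIES, length_bp=ALU_LENGTH_BP,
                        genome_bp=GENOME_SIZE_BP):
    """Return the share of the genome covered if every Alu copy were full length."""
    return copies * length_bp / genome_bp


fraction = alu_genome_fraction()
print(f"Alu elements occupy roughly {fraction:.1%} of the genome")
```

With these inputs the estimate comes out at about one tenth of the genome. Many Alu copies are truncated, so the true share depends on how fragments are counted.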

Within an Alu sequence, the length of the polyA tail determines how frequently it could undergo retrotransposition using the L1 machinery. Shorter tails result in an element being less active (Dewannieux, 2005).
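To make the idea of tail length concrete, the following minimal sketch measures the uninterrupted run of A bases at the 3' end of a sequence. It is a simplification: real Alu tails are often interrupted by other bases, the function name is hypothetical, and the example sequences are invented toy strings rather than real Alu elements. It does not model retrotransposition activity itself.

```python
def poly_a_tail_length(sequence: str) -> int:
    """Count the uninterrupted run of 'A' bases at the 3' end of a sequence."""
    seq = sequence.upper()
    # Stripping trailing A's and comparing lengths gives the size of the final run.
    return len(seq) - len(seq.rstrip("A"))


# HYPOTHETICAL toy sequences: an arbitrary body followed by tails of different lengths.
body = "GGCCGGGCGCGGTGGCTC"
examples = {
    "long_tail": body + "A" * 25,
    "short_tail": body + "A" * 5,
}

for name, seq in examples.items():
    print(name, poly_a_tail_length(seq))
```

Comparing tail lengths in this way only ranks elements by one feature. According to the text, the element with the shorter tail would be expected to be the less active of the two.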


Alu sequences can affect cells by contributing amino acids to coding sequences, providing 3’ polyadenylation sites, influencing mRNA stability, and changing the splice sites of alternate transcripts (Smalheiser, 2006).

A number of individuals have been diagnosed with a genetic disease caused by an Alu sequence changing the amino acid sequence of a protein (Makalowski, 1994).

 

Short interspersed elements (SINEs) result through the retroposition of RNAs such as 7SL RNA or tRNA whose RNA polymerase III promoters are located within the sequence. SINES have been shown to cause some diseases and can even be translated in the correct position (Kriegs, 2007).

7SL RNA-derived SINES are specific to the mammalian clade Euarchontoglires, which includes primates, rodents, rabbits, tree shrews, and flying lemurs. A sequence was deleted in one group of SINES in ancestral primates (Kriegs, 2007).

Chimp genomes have almost twice the number of Alu sequences as humans.

Insertions and deletions have had a greater impact on the sequence divergence between humans and chimps than have substitutions (Sakate, 2007).

 

 

SVA elements actually combine sequences from three other classes of transposable element: SINE-R, VNTR and Alu (Mills, 2007).

NONCODING

The sequences of the genome which code for protein represent 1/3 of the sequences undergoing purifying selection. Many are transposable elements (Lowe, 2007).

 

Although less than 1.5% of our genome codes for proteins, almost half is transcribed into RNA (other than that which functions as mRNA, tRNA, and rRNA). As genomes of diverse organisms are compared, the biological complexity increases as the number of noncoding RNA molecules increases (Szell, 2008).

 

An area named HAR1 (human accelerated region 1) is important in establishing the six layers of the human cerebral cortex (Szell, 2008).

In comparing human to mouse genomes or fly to mosquito, the sequences of ncRNAs undergo a higher rate of evolution than do protein coding regions.

In the regions which have undergone the most rapid change comparing human and chimp genomes (HARs, human accelerated regions), few areas code for protein. Many represent areas adjacent to genes, especially those involved in the nervous system and development. Differences are evident between primate and mouse expression of HARs and one HAR is located near the neural gene reelin (Szell, 2008).

It is not yet known how many ncRNAs there are. These molecules can form complexes with other molecules of RNA, DNA, or protein to regulate translation and to perform RNA editing, imprinting, and RNA interference (Szell, 2008).

Increased expression of BC200 is correlated with Alzheimer's disease and cancer.

Increased expression of BIC ncRNA (800–1700 nucleotides) is involved in leukemia and lymphoma. PCGEM1 and DD3 are ncRNAs whose expression is increased in prostate cancer. MALAT-1 (Metastasis Associated in Lung Adenocarcinoma Transcript 1) is overexpressed in a number of cancers. PRINS (psoriasis susceptibility-related RNA gene induced by stress), is overexpressed in psoriasis (Szell, 2008).

LINES and SINES are transcribed. Some Alu elements are regulated and affect stress responses, packing of chromosomes, and development (Szell, 2008).

 

Tunicates do not possess large quantities of noncoding DNA: although they have half the number of genes as does the human genome, tunicate genomes consist of only 5% of the DNA found in that of humans (Sherwood, 2005).

INTRONS

While 1.1 to 1.4% of the genome codes for protein, about 24% of the human genome consists of introns (Managadze, 2007). Class II introns code for enzymes known as maturases (similar to reverse transcriptases) which can cleave sequences and perform reverse transcription, thus promoting the dispersal of introns (Managadze, 2007).

 

Evidence indicates that some of the difference between humans and chimps is in the 3’ and 5’ untranslated regions, including alternate transcription start sites and changes in translational efficacy (Sakate, 2007).

 

 

DISEASES

Human T-lymphotropic viruses (HTLVs) currently infect about 22 million people. These viruses also infect other primates and seem to have entered human populations through contact with non-human primates (Wolfe, 2005).

Ten species of Plasmodium are known to infect primates, including several species which can cause human malaria. One of the species of Plasmodium which causes human malaria, Plasmodium falciparum, is most closely related to the chimpanzee pathogen Plasmodium reichenowi. The modern populations of Plasmodium vivax, the major agent of human malaria, may have originated from a species infecting macaques in Southeast Asia between 45,000 and 81,000 years ago (Escalante, 2005). The primary cause of human malaria, Plasmodium vivax, is most similar to species which infect monkeys and apes in South Asia. It is thought that malaria originated there about 2 million years ago and spread to Europe and Africa. This species was probably a serious pathogen to Africans in the past, given the high percentage of Africans who lack functional Duffy proteins, the proteins through which P. vivax enters red blood cells. Today, however, the majority of malaria cases in Africa are caused by Plasmodium falciparum (Carter, 2003).

During the last Ice Age (100,000 to 20,000 years ago), malaria caused by P. vivax could have been an important disease of Neanderthals (Carter, 2003).