New Frontiers in Computational Genomics: Identification of Novel Functional Elements and Genome-scale Pattern Analysis

Laura Elnitski
Lonnie R. Welch

Abstract

Genomics researchers produce vast quantities of data that require detailed analysis. The amount of information makes it impossible to manually analyze the data. Thus, many bioinformatics software tools have been developed for the purpose of analyzing large-scale data distributed across numerous public data repositories. The discipline of bioinformatics, like the field of genomics, is in its infancy. This editorial illustrates this point by highlighting recent findings of the international ENCyclopedia Of DNA Elements initiative (referred to as the ENCODE project) [1], and by presenting exciting opportunities for computational genomics research.

Startling Outcomes of the ENCODE Project

The following quotation expresses the excitement that exists in the genomics community about the research area of identifying functional elements:

The ENCODE project has set out to identify all the functional elements in the human genome. With the genome sequence now established, the next challenge is to discover how the cell actually uses it as an instruction manual.[1]

The metaphor of an instruction manual is useful for visualizing the challenges that exist. Below is a commentary on the recently discovered magnitude of the genomic instruction manual.

Researchers of the ENCODE consortium have analysed 1% of the human genome. Their findings bring us a step closer to understanding the role of the vast amount of obscure DNA that does not function as genes.

We usually think of the functional sequences in the genome solely in terms of genes, the sequences transcribed to messenger RNA to generate proteins. This perception is really the result of effective publicity by the genes, who take all of the credit even though their function is basically limited to communicating genomic information to the outside world. They have even managed to have the entire DNA sequence referred to as the 'genome', as if the collective importance of genes is all you need to know about the DNA in a cell.

We should have guessed that this was merely prima-donna behaviour on the part of narcissist genes when the sequencing of the human genome revealed that they comprise only a small percentage of the DNA. And our confidence should have been shaken when some sequences located far from any genes were found to be strikingly conserved, indicating that they have some important function. Now, the ENCODE Project Consortium shows through the analysis of 1% of the human genome that the humble, unpretentious non-gene sequences have essential regulatory roles…

The aim of the ENCODE (encyclopaedia of DNA elements) project is exactly that—to identify every sequence with functional properties in the human genome. The results of the pilot phase of this project, which involved an analysis of 1% (30 megabases) of the human genome, are not good news for genes, which will no longer be able to hog the limelight. Even this preliminary study reveals that the genome is much more than a mere vehicle for genes, and sheds light on the extensive molecular decision-making that takes place before a gene is expressed.[2]

First, it was discovered that much functional information is not conserved across organisms. (In the language of computer science, this means that functions and classes in different organisms have different source code implementations.) With 5% of the human genome estimated to be under selective constraint [2], a large amount of genetic material has no attributed function. Roughly 50% of the genome consists of repetitive elements (considered nonfunctional), which leaves about 45% (100% − 5% − 50%) uncharacterized. To address the potential importance of these sequences, new approaches are necessary to identify functional elements. Such knowledge will increase our ability to understand and address problems that have genetic causes.

A second noteworthy result confirmed by the ENCODE project is that epigenetic marks predict the presence and activity of functional regions [3]. Histone proteins assemble into octameric complexes around which the DNA wraps, forming nucleosomes. The attractive force between the positively charged amino-terminal tails of the histones and the negatively charged phosphate backbone of the DNA creates a tightly associated structure. Modifications of the histone tails, such as acetylation, neutralize the positive charge, loosen the association with the DNA, and enable access by the proteins that regulate transcription. These modifications are epigenetic in nature, meaning they are not encoded in the DNA sequence itself. Of the ~100 modifications that are possible [4], a subset of 3 distinguishes sites of active promoter regions [3]. An overlapping yet distinctive set of modifications occurs in enhancer elements [5]. These types of large-scale analyses support the approach of de novo prediction of promoter and enhancer regions based on their epigenetic properties [6]. Notably, some regions carrying these modifications exhibit functional activity, yet lack sequence conservation between species [1]. This observation suggests that epigenetic modification may be a better indicator of functional elements than sequence conservation. Given the diversity of epigenetic events that are possible, knowledge of additional functional regions is likely to emerge from studies of other combinations of epigenetic changes. Large-scale experimental analyses will require computational assessment to define the correlations between these modifications and their functional purposes.

A third surprising outcome of the ENCODE consortium is the amount of the genome that is expressed as RNA. Protein-coding genes, pseudogenes, noncoding RNA, and functional RNA, such as rRNA, tRNA, snoRNA and miRNA, all contribute to this category. Novel observations show that transcripts can routinely span multiple gene loci, creating fusions of transcripts from neighboring genes [7]. These unique products have properties of both of the individual genes. Noncoding RNA transcripts have recently been found traversing intergenic regions, which were previously thought to be nontranscribed. Termed transcriptionally active regions (i.e., TARs or transfrags), these unexplained transcripts have been identified in large numbers [8]. Remarkably, up to 93% of bases in the ENCODE regions are transcribed. The noncoding transcripts are perhaps the best evidence that the genome still holds many elusive secrets.

The definition of a gene is evolving from an annotated open reading frame in a genomic sequence into a broader description that includes expressed sequences. The new definition can be summarized as a union of genomic sequences encoding a coherent set of potentially overlapping functional products [7]. To be compatible with the former definition of genes, this definition encompasses the conventional aspect of individual transcripts from unique locations that encode protein or functional RNA. It now extends to include the union of sets of overlapping regions that produce similar yet distinct products. The definition furthermore excludes alternative untranslated regions (5' and 3' UTRs), which are not represented in the final protein product of a gene but may be alternatively utilized by various transcripts from the same gene. The requirement of a functional product also precludes most transfrags from this definition, simply because we have not identified their role in a majority of cases.
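
The "union" in this definition can be pictured as the merging of overlapping coordinate ranges. The Python sketch below is illustrative only and is not the procedure of [7]: it assumes half-open (start, end) coordinates on a single chromosome and strand, and the function name and example coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome and strand


def union_of_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or abutting intervals into their union."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous block, so extend that block.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical exon coordinates pooled from overlapping transcripts of one locus.
transcript_exons = [(100, 250), (200, 400), (900, 1000), (950, 1200)]
print(union_of_intervals(transcript_exons))  # [(100, 400), (900, 1200)]
```

Merging coordinates captures only the positional part of the definition; deciding whether the products form a coherent functional set still requires biological evidence.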

A fourth outcome is that regulatory binding sites (motifs) extend both upstream and downstream of transcription start sites (TSSs); this implies a need to redefine what constitutes a promoter region. Binding studies indicate that regulatory sequences surround TSSs and are symmetrically distributed, with no bias towards upstream regions. Thus, analyses of regulatory regions should also include sequences downstream of the TSSs. Furthermore, we can deduce that the downstream protein interaction sites are present in the 5' UTR sequences and first introns of many genes. As expected, these studies confirm that components of the basal transcription machinery, such as RNA polymerase II (POLII) and TATA-binding protein-associated factor 1 (TAF1), occupy active promoters and are depleted in inactive regions. It was previously thought that E2F1 was a specialized transcription factor (TF) (i.e., that it bound to transcription factor binding sites (TFBSs) found in only a small set of promoters). However, studies of these widespread patterns of promoter occupancy unexpectedly found E2F1 at transcription start sites at a frequency similar to that of general TFs such as TAF1 [9]. Thus, as more information becomes available, our rudimentary understanding of the process of transcription initiation will give way to increasingly refined models of macromolecular protein complexes that assemble on specific promoter sequences and produce molecular responses to environmental conditions.
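
In practice, including downstream sequence means extracting a window on both sides of each TSS while respecting strand. The following Python sketch shows one way to do this; the 500-base flanks, the 0-based half-open coordinate convention, and the function name are assumptions made for illustration and are not taken from the ENCODE analyses.

```python
from typing import Tuple


def tss_window(tss: int, strand: str,
               upstream: int = 500, downstream: int = 500) -> Tuple[int, int]:
    """Return a half-open [start, end) window around a 0-based TSS position.

    The flank sizes are illustrative defaults, not values from the text.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream + 1
    elif strand == "-":
        # On the minus strand, "upstream" lies at higher genomic coordinates.
        start, end = tss - downstream, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(start, 0), end  # clip at the start of the chromosome


print(tss_window(1000, "+"))                              # (500, 1501)
print(tss_window(1000, "-", upstream=200, downstream=50))  # (950, 1201)
```

Setting the two flanks to the same size reflects the symmetric distribution described above; an upstream-only window would miss sites in the 5' UTR and first intron.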

Research Frontiers

An important implication of the outcomes of the ENCODE project is that the existing models for computational genomics are inadequate. This section discusses specific research challenges that must be addressed to assimilate ENCODE's findings into computational genomics research and technology.

Exciting prospects. One percent of the genome was selected for the initial phase of the ENCODE Consortium analysis. The successful outcome of the project, providing a catalogue of functional elements and insight into a broader understanding of the content and organization of the human genome, paves the way for a larger, more comprehensive analysis stage. In scaling the ENCODE project from 1% of the genome to 100%, all aspects of the project must expand. For a timely conclusion, experimental analyses need high-throughput approaches. Computational data storage, querying and retrieval must be scaled to larger sizes than ever before. Data analysis will require speed and capacity to handle immense amounts of data. The plethora of biological elements identified by the next phase of ENCODE will provide opportunities to study biological phenomena that have not yet been discovered. Data availability is only one component. Important metadata, including the conditions used for data collection, are necessary for the correct interpretation of the data by external parties. Furthermore, the intrinsic noise produced by high-throughput analyses must be recognized in order to minimize false-positive predictions and the postulation of misguided biological hypotheses.

Integrative data analysis approaches. Integrative data analysis is emerging as a major theme for computational genomics. Characteristic features are present in classes of functional data, and combining them creates a broader picture of how the individual parts form a functional entity. Recognition of these features is useful for supervised learning approaches. Features are often measured one experiment at a time. All sets of features can then be compared through an intersection of their overlapping positions in the human genome. Converting raw sequences into their genomic coordinates requires the use of alignment tools, such as BLAST or BLAT. Resources that were established to help with comparison of high-throughput datasets include the UCSC Human Genome Table Browser [10], Galaxy [11], and ENCODEdb [12]. This interconnected network of resources allows each one to specialize in a specific area, while providing portability of data for comparison. One of the newer databases in this field, ENCODEdb, serves as a prototype for the storage and retrieval of microarray-based data, which have historically been difficult to handle in search and retrieval operations. The ENCODEdb portal converts microarray data (either expression data or ChIP-chip data) into genomic coordinates for evaluation against other functional elements discovered in the genome.
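
The intersection of overlapping positions can be illustrated with a small Python sketch. It is a minimal example under stated assumptions, not the implementation used by any of the resources named above: features are (chromosome, start, end) tuples with half-open coordinates, and the feature names and values are hypothetical.

```python
from typing import Dict, List, Tuple

Feature = Tuple[str, int, int]  # (chromosome, start, end), half-open


def intersect_features(a: List[Feature], b: List[Feature]) -> List[Feature]:
    """Return the genomic segments shared by feature sets a and b."""
    # Index the second set by chromosome so only same-chromosome pairs are compared.
    by_chrom: Dict[str, List[Tuple[int, int]]] = {}
    for chrom, start, end in b:
        by_chrom.setdefault(chrom, []).append((start, end))

    shared: List[Feature] = []
    for chrom, start, end in a:
        for b_start, b_end in by_chrom.get(chrom, []):
            lo, hi = max(start, b_start), min(end, b_end)
            if lo < hi:  # keep only non-empty overlaps
                shared.append((chrom, lo, hi))
    return sorted(shared)


# Hypothetical results from two separate experiments.
histone_peaks = [("chr7", 100, 300), ("chr7", 1000, 1200)]
binding_sites = [("chr7", 250, 400), ("chr8", 1000, 1100)]
print(intersect_features(histone_peaks, binding_sites))  # [('chr7', 250, 300)]
```

This pairwise comparison is quadratic within each chromosome, which is acceptable for a sketch; genome-scale datasets would call for a sorted sweep or an interval index.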

Regulatory genomics. The question of what is regulatory in the genome remains unanswered. The earlier estimate that 5% of human sequence is under selective constraint may be too low if the neutral rate of substitution has been miscalculated [13]. The elements most commonly used as a model for the neutral rate of evolution are repetitive elements. These elements have long been considered junk or parasitic DNA. Their apparent lack of functional activity in the genome has led to the assumption that these sequences diverge according to a neutral model. However, some elements have been adapted for functional purposes and are under purifying selection [14, 15]. Sequence conservation present in this category of elements could contribute to an inaccurate assessment of the neutral rate. The prevalence of noncoding transcripts in the genome provides an avenue for speculating that a larger fraction of the genome is functional than previously thought. An upper estimate of 20% can be supported by quantitative studies sprinkled with some speculation [13]. Some researchers conjecture that the majority of the genome is functional [16]. Computational approaches [17] are helping to explore this issue.

On a smaller scale, studies of promoter regions in the human genome continue to reveal new information. For example, the co-occurrence of transcription factor binding sites corresponds to signals for tissue-specific expression, yet only a few patterns are known [18]. One controversial hypothesis, that small RNA molecules activate promoter regions, seems plausible yet is heavily scrutinized [19]. Another area that is re-emerging is the study of bidirectional promoters [20]. Positioned between two genes, these regulatory elements control two genes rather than one. This association implies that mutations in a bidirectional promoter could negatively affect both genes. Quite often these promoters regulate transcription of DNA repair genes, several of which are known tumor suppressor genes [21]. This finding suggests that bidirectional promoters should be further examined for a role in cancer when their function is compromised.
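
Candidate bidirectional promoters can be sought computationally by looking for neighboring genes that are transcribed away from each other. The Python sketch below is a simplified illustration rather than the method of [20] or [21]: it assumes one TSS per gene, the 1,000-base maximum separation is an illustrative threshold that is not taken from the text, and the gene names are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str  # '+' or '-'
    tss: int     # transcription start site coordinate


def bidirectional_pairs(genes: List[Gene], max_gap: int = 1000) -> List[Tuple[str, str]]:
    """Return adjacent head-to-head gene pairs whose TSSs lie within max_gap bases.

    max_gap is an assumed, illustrative threshold.
    """
    ordered = sorted(genes, key=lambda g: (g.chrom, g.tss))
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        divergent = left.strand == "-" and right.strand == "+"
        if left.chrom == right.chrom and divergent and right.tss - left.tss <= max_gap:
            pairs.append((left.name, right.name))
    return pairs


# Hypothetical annotation records.
genes = [
    Gene("geneA", "chr1", "-", 5000),
    Gene("geneB", "chr1", "+", 5400),
    Gene("geneC", "chr1", "+", 9000),
    Gene("geneD", "chr2", "-", 100),
    Gene("geneE", "chr2", "+", 5000),
]
print(bidirectional_pairs(genes))  # [('geneA', 'geneB')]
```

A pair reported this way is only a candidate; shared regulation would still need to be confirmed experimentally.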

Identification of ALL functional elements, interactions, and 2D/3D structural aspects of the human genome. The ENCODE project identifies 5 types of regulatory elements: promoters, enhancers, silencers, insulators and locus control regions. Additional categories that have yet to be fully elucidated include replication origins, structural features, and noncoding and anti-sense transcripts. One problem that looms well beyond the identification of regulatory elements is the determination of which genes they regulate. Regulatory elements can act over long genomic distances through mechanisms that are not fully elucidated, which may include looping out of intervening regions. Such long-distance interactions are difficult to predict with current techniques. The seemingly random placement of regulatory elements complicates the interpretation of which gene receives the activity; however, identification of domains of active and repressed DNA may help to narrow the search [22]. Distant regulatory elements have been predicted through the use of sequence conservation or tandem arrangements of protein binding sites. The collection of high-throughput data representing epigenetic modifications may thrust the study of regulatory regions past the annotation phase and into interpretation of biologically relevant effects. Managing this tremendous amount of data will require foresight and planning.

Obviously, there is a need for new computer models and algorithms that incorporate the findings of ENCODE for the discovery of new genomic elements.

Another interesting computational problem relates to determining specific functions of genomic elements. Function is specific for each cell type; therefore the possibility exists to define signatures for active sequences representing every cell type. Such analyses would capitalize on computational machine learning approaches.

Numerous additional challenges provide prospects for future research. For example, identifying genome-wide patterns that reveal clues about the language of the genome will require new software. Ultimately, the syntax and semantics of genomes need to be discovered. Thus, there is a need for new computational approaches that address questions such as the following:

  1. Which portions of genomes are likely to have biological meaning?
  2. What are the meanings?
  3. Is there a vocabulary of the DNA? If so, what are the biological words, phrases, grammar, semantics, etc?

The answers to these questions will have significant impact on the future of biological research. They will contribute to a more complete understanding of the purpose of genomes and their composite elements. Subsequent analyses will provide datasets for wet-bench scientists to use in studying the functions of newly discovered genomic elements. Benefits also include gaining enough knowledge to predict which nucleotide-level perturbations contribute to genetic disease.
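
A common first step towards the third question, whether DNA has a vocabulary, is to count fixed-length "words" (k-mers) in a sequence. The Python sketch below illustrates the idea only; it is not the algorithm of any tool cited in this editorial, and the function name, example sequence, and choice of k are assumptions.

```python
from collections import Counter


def count_words(sequence: str, k: int) -> Counter:
    """Count every overlapping k-letter word made up only of A, C, G and T."""
    seq = sequence.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):  # skip windows containing N or other symbols
            counts[word] += 1
    return counts


# Hypothetical sequence fragment.
words = count_words("ACGTACGTNACG", k=4)
print(words.most_common(1))  # [('ACGT', 2)]
```

Raw counts alone do not establish meaning. Words become candidates for biological significance only when their frequency departs from what a background model of the sequence would predict.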

Given that we know so little about the composition of genomic sequences and that much more of the genome appears to be functional than was previously recognized, it is likely that new kinds of functional elements and associated mechanisms remain to be discovered.

Another aspect of functional elements is the role of secondary structure. For example, RNA molecules often fold into structures determined by base pairing between complementary stretches of the single-stranded molecule. DNA also adopts secondary structures, for example during the recombination events of DNA repair. Even promoter regions, such as the MYC promoter, form higher-order structures involved in regulatory function [23].

As structural features, physical associations, and protein affinities are interwoven into the story of gene regulation, we must also consider that linear order may be indispensable for proper genomic performance. Components that insulate elements from spurious interactions may not be detectable outside of their genomic context. Experiments creating rearrangements of sequences within their endogenous locations, termed chromosomal engineering, are simple only in conceptual terms; yet they offer the promise of accentuating the subtleties behind complex genomic regulation. Sequences that function to create the appropriate distance between genomic regions may perform a passive, albeit essential, function in the genome.

Yet another challenge is to extend such analyses to genomes other than human. The continual production of new genomic sequences provides opportunities to explore composition of sequences built on a similar model, yet designed to fulfill a different biological niche. The likeness of the sequences can be further studied through comparative genomics, which aims to discover similarities and differences among the patterns, syntax, semantics, and functional elements of various genomes. Approaches for discovering and visualizing functional genomic elements are emerging [17, 24, 25, 26], but many challenges remain.

Personalized medicine. Identifying regulatory regions in the human genome serves a central goal in biology: the prediction of sequence changes that correlate with disease. Two complementary approaches may be necessary to accomplish this task. The first aims at finding all sequence changes that are likely to be involved in disease. The second aims at finding all regulatory regions whose disruption may lead to disease. Patterns associated with each approach could boost its effectiveness by narrowing the amount of data that must be evaluated. The PhenCode database provides one example of linking genotype (DNA sequence) and phenotype (characteristics) by mapping alterations in amino acids back to the DNA sequence of the human genome. This approach aims to record the patterns of change that result in disease [27]. As genomic sequencing becomes more cost-effective, applying the technology to each person's genome could become a reality. Such an era of personalized medicine will embrace computational analysis as an integral part of the process.

In summary, the current generation of bioinformaticians and molecular biologists will map the elements of the genomes, and will make discoveries that will help to cure many of the ills of mankind. Our success will be enabled by continuing to refine our paradigms to encompass outcomes of initiatives such as the ENCODE project, and by developing a new generation of computational genomics software tools that can help biologists to deal with the complexity of the new paradigms.

Acknowledgments

LE is supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health.

LW's research is supported by several programs of Ohio University: 1804 Research Fund, Stocker Research Fund, Graduate Education and Research Board, Research Priorities Fund, and Stuckey Professorship.

 

References

[1] Birney E., Stamatoyannopoulos J. A., Dutta A., Guigo R., Gingeras T. R., Margulies E. H., Weng Z., Snyder M., Dermitzakis E. T., Thurman, R. E., et al. (2007) Nature 447, 799–816.

[2] Chiaromonte F., Weber R. J., Roskin K. M., Diekhans M., Kent W. J. & Haussler D. (2003)

[3] Koch C. M., Andrews R. M., Flicek P., Dillon S. C., Karaoz U., Clelland G. K., Wilcox S., Beare D. M., Fowler J. C., Couttet P., et al. (2007) Genome Res 17, 691–707.

[4] Bernstein B. E., Meissner A. & Lander E. S. (2007) Cell 128, 669–81.

[5] Heintzman N. D., Stuart R. K., Hon G., Fu Y., Ching C. W., Hawkins R. D., Barrera L. O., Van Calcar S., Qu C., Ching K. A., et al. (2007) Nat Genet 39, 311–318.

[6] Roh T. Y., Wei G., Farrell C. M. & Zhao K. (2007) Genome Res 17, 74–81.

[7] Gerstein M. B., Bruce C., Rozowsky J. S., Zheng D., Du J., Korbel J. O., Emanuelsson O., Zhang Z. D., Weissman S. & Snyder M. (2007) Genome Res 17, 669–81.

[8] Kapranov P., Willingham A. T. & Gingeras T. R. (2007) Nat Rev Genet 8, 413–23.

[9] Bieda M., Xu X., Singer M. A., Green R. & Farnham P. J. (2006) Genome Res 16, 595–605.

[10] Karolchik D., Hinrichs A. S., Furey T. S., Roskin K. M., Sugnet C. W., Haussler D. & Kent W. J. (2004) Nucleic Acids Res 32, D493-6.

[11] Giardine B., Riemer C., Hardison R. C., Burhans R., Elnitski L., Shah P., Zhang Y., Blankenberg D., Albert I., Taylor J., et al. (2005) Genome Res 15, 1451–5.

[12] Elnitski L. L., Shah P., Moreland R. T., Umayam L., Wolfsberg T. G. & Baxevanis A. D. (2007) Genome Res 17, 954–9.

[13] Pheasant M. & Mattick J. S. (2007) Genome Res 17, 1245–53.

[14] Krull M., Petrusma M., Makalowski W., Brosius J. & Schmitz J. (2007) Genome Res 17, 1139–45.

[15] Bejerano G., Lowe C. B., Ahituv N., King B., Siepel A., Salama S. R., Rubin E. M., Kent W. J. & Haussler D. (2006) Nature 441, 87–90.

[16] Slack F. (2006) Genome Biology 7, 328. (A report from the meeting Regulatory RNAs, the 71st Cold Spring Harbor Symposium on Quantitative Biology, Cold Spring Harbor, USA, 31 May-5 June 2006.)

[17] Rigoutsos I., Huynh T., Miranda K., Tsirigos A., McHardy A., & Platt D. (2006) Proceedings of the National Academy of Sciences 103(17), 6605–6610.

[18] Pennacchio L. A., Loots G. G., Nobrega M. A. & Ovcharenko I. (2007) Genome Res 17, 201–11.

[19] Check E. (2007) Nature 448, 855–858.

[20] Lin J. M., Collins P. J., Trinklein N. D., Fu Y., Xi H., Myers R. M. & Weng Z. (2007) Genome Res 17, 818–27.

[21] Yang M. Q., Koehly L. M. & Elnitski L. L. (2007) PLoS Comput Biol 3, e72.

[22] Thurman R. E., Day N., Noble W. S. & Stamatoyannopoulos J. A. (2007) Genome Res 17, 917–27.

[23] Belotserkovskii B. P., di Silva E., Tornaletti S., Wang G., Vasquez K. M. & Hanawalt P. C. (2007) J. Biol Chem, (in press).

[24] Meynert A. & Birney E. (2006) Cell 125(5), 836–838.

[25] Xie X., Mikkelsen T. S., Gnirke A., Lindblad-Toh K., Kellis M., & Lander E. S. (2007) Proceedings of the National Academy of Sciences 104(17), 7145–7150.

[26] Gu D., Lichtenberg J., Petri E., Welch J. D., Alam M., Nelson C., Ecker K., George N., Seaman J., Zhang H., Liang V., Wyatt S., & Welch L. R. (2007), http://wordseeker.ath.cx (The WordSeeker Functional Genomics Toolkit).

[27] Giardine B., Riemer C., Hefferon T., Thomas D., Hsu F., Zielenski J., Sang Y., Elnitski L., Cutting G., Trumbower H., et al. (2007) Hum Mutat 0, 1–9.

Laura Elnitski
Genomic Functional Analysis Section
National Human Genome Research Institute, NIH
elnitski@mail.nih.gov

Lonnie R. Welch
School of Electrical Engineering and Computer Science
Biomedical Engineering Program
and Molecular and Cellular Biology Program, Ohio University
welch@ohio.edu
