UCSC Genome Browser Project History

Genome Browser Overview

The Genomic Microscope
Genomic Tools

Human Genome Project — The Race

New challenger, Celera Genomics
Push to the Finish Line

The ENCODE Project

UC Santa Cruz's Role

UCSC Genome Research Primer

Comparative Genomics
Possibilities for Health

Genome Browser Overview

The UCSC Genome Browser is a web-based tool serving as a multi-powered microscope that allows researchers to view all 23 chromosomes of the human genome at any scale from a full chromosome down to an individual nucleotide. The browser integrates the work of countless scientists in laboratories worldwide, including work generated at UCSC, in an interactive, graphical display. The Browser also affords access to the genomes of more than one hundred other organisms.

The Genomic Microscope

Zoomed out, the coarse-level view of the genome shows early chromosome maps as determined by electron microscopy, then the browser drills down to levels of increasing detail, focusing first on chromosome bands. The next level of detail zooms in on gene clusters, showing known and predicted genes near one another on the chromosome. Zooming in further to view a single gene shows the components of genes, the introns and exons. Finally, the browser allows researchers to view the nucleotides — the As, Cs, Gs, and Ts that make up the genome alphabet. Not only does the browser show the genome sequence, but it also delineates known areas of the genome and offers supplementary information about the genes — in effect, providing the word breaks and punctuation.

Genome sequences are difficult to read because they consist of letter strings with no breaks or punctuation. The example below contains 7 different letters (genomes contain only 4). Can you understand what it is saying? (Line borrowed from the movie, Charly.)

THATTHATISISTHATTHATISNOTISNOTISTHATITITIS

With word breaks and punctuation, it starts to make sense:

THAT THAT IS, IS. THAT THAT IS NOT, IS NOT. IS THAT IT? IT IS!

The UCSC Genome Browser group played a pivotal role in bringing this extraordinary life script into the light of science. The browser presents both experimentally validated and computer-predicted genes along with dozens of lines of evidence that help scientists recognize the key features of genes and predict their function. The databases for the Genome Browser are updated nightly with new information generated by researchers throughout the world.

Genomic Tools

When directed to focus on a particular segment of the genome, the browser displays a range of data that are stacked vertically. At the top, it shows the chromosome number and the current position on the chromosome. Underneath, it shows several rows of data about genes that have been found experimentally or have been predicted by a number of different methods. Below those are lines of information about gene expression and regulation, followed by comparisons with the genomes of other species and other information, such as single-nucleotide polymorphisms (SNPs).

Far from simply displaying the genetic code, the UCSC browser brings the code to life by aligning relevant areas with experimental and computational data and images. It also links to international databases, giving researchers instant access to deeper information about the genome. An experienced user can form a hypothesis and verify it in minutes using this tool. Together this information represents an extremely comprehensive view of the genome, helping scientists recognize important features of the sequence and providing strong evidence of function. For instance, the Genome Browser helps unravel the varied splicing patterns whereby one gene can make many different proteins. This process of alternative splicing is thought to explain how a human can be so complex, yet have only about twice as many genes as a roundworm.

The UCSC Genome Browser group continues to add functions to the Genome Browser, such as the Track Collection Builder, which allows multiple continuous-value graphing tracks to be copied and grouped into one composite track or "collection." Once the tracks are inside of a collection, the Track Collection Builder tool allows you to sort by similarity and magnitude, as well as alter the aggregate/overlay graphing view options to compare results. By merging experimental results from multiple sources, this powerful tool allows researchers to better understand how genes function.

Today, the UCSC Genome Browser group continues to make genome sequences even more useful for science and medicine by facilitating the visualization of aggregate data so that it is easily accessible to researchers. This process of discovery and categorization is a critical step toward fully understanding the workings of the human genome, a project that will occupy science and medicine for many years. The browser platform has multiple potential uses that can aid in disease prevention, diagnostics, and the search for cures. The usefulness of the UCSC Genome Browser lead to spin-offs, or Genome Browser mirrors, such as the following:

Human Genome Project — The Race

In December 1999, the International Human Genome Project (IHGP) came to UC Santa Cruz when Eric Lander, the director of the Whitehead sequencing center (Whitehead Institute/MIT Center for Genome Research), invited David Haussler to help annotate the human genome. In particular, Lander wanted help in discovering the locations of the genes, which make up only approximately 1.5% of the sequence. Haussler had previously applied a mathematical technique known as hidden Markov models (HMMs) to the task of computer gene-finding. This application of HMMs had quickly become the dominant gene-finding methodology and was used successfully on the Drosophila melanogaster (fruit fly) genome. But the process requires intact long-range genomic sequence to find the genes and the human genome sequence was in many small pieces.

At the time UCSC entered the International Human Genome Project (IHGP), the IHGP was assembling the sequence one piece (or, in the jargon of molecular biology, one "clone") at a time, and intended to string the pieces together based on a precisely constructed clone map. This approach had been shown to work very well with Caenorhabditis elegans (a roundworm) and human chromosome 22. But the process of making sure every last part of the sequence is read and put together properly is quite labor-intensive.

Haussler enlisted Jim Kent, then a graduate student at UCSC's Department of Molecular, Cell, & Developmental Biology, along with systems engineer Patrick Gavin, and graduate students Terrence Furey and David Kulp (who had led the gene-finding effort on the Drosophila genome). This was the birth of the UCSC Genome Browser group.

Jim Kent in his garage sitting next to the computer where he wrote the 10,000 lines of computer code to assemble the first draft assembly of the human genome.

Jim Kent, David Haussler, Patrick Gavin and Scot Free Kennedy at UCSC.

New challenger, Celera Genomics

It was a crucial time for the international project. A private company, Celera Genomics, had announced its intention to assemble the human genome sequence well in advance of the public effort, raising the fear that the sequence would be protected by patents and thus not be freely available to scientists. Celera Genomics was using an alternative approach, a so-called whole-genome "shotgun" method, where small bits of the sequence are read at random from the genome, and then a computer program uses sequence overlap of these bits to assemble an approximation of the genome as a whole. Using this approach, Celera's assembly would still have numerous gaps and ambiguities, but the entire project from start to finish could be done in less than half the time the IHGP planned for their effort. A further complication was the fact that Celera had access to the fruits of the public project, while keeping their own results private.

An approach resulting in numerous gaps and ambiguities was necessary if the IHGP's draft sequence was to have similar utility to Celera's sequence, and in particular to prevent Celera and its clients from locking up significant portions of the human genome under patents. A number of groups within the IHGP were working on the second stage of assembly that would merge the approximately 400,000 pieces into larger pieces and order them along the human chromosomes so that research groups could find the human genes. However, the process was slow and arduous. Even with the outstanding mapping information provided by Bob Waterston's group at Washington University, the second-stage assembly turned out to be like an extremely difficult jigsaw puzzle, with many layers of conflicting evidence arising from similar-looking, non-contiguous pieces caused by repeats scattered throughout the genome.

At least partly in response to competition from Celera, the IHGP changed its focus from producing finished clones to producing draft clones. To sequence a clone, the IHGP adopted a shotgun approach in miniature. Bits of a clone were read at random, and the bits were stitched together by a computer program into pieces called "contigs." After the shotgun phase, a clone was typically in 5-50 contigs, but the relative order of the contigs was not known. This was the state of the genome when David Haussler first attempted to locate the genes computationally, and he quickly discovered that computational gene-finding was nearly impossible, because the average size of a contig was considerably smaller than the average size of a human gene.

Push to the Finish Line

In May of 2000, motivated to prevent Celera and its clients from locking up significant portions of the human genome in patents, Jim Kent dropped his other work to focus on the assembly problem. In a remarkable display of energy and talent, Kent developed within four weeks a 10,000-line computer program that assembled the working draft of the human genome. The program, called GigAssembler, constructed the first working draft of the human genome to be shown outside of the IHGP on June 22, 2000, just days before Celera completed its first assembly. The IHGP working draft combined anonymous genomic information from human volunteers of diverse backgrounds, accepted on a first-come, first-taken basis. The Celera sequence was of a single individual. Since the public consortium finished the genome ahead of the private company, the genome and the information it contains are available free to researchers worldwide. Kent's assembly was celebrated at a White House ceremony on June 26, 2000, announcing the completion of the first drafts of the human genome by the IHGP and Celera.

Copy of first draft of the human genome on a CD

Copy of first draft of the human genome sequence presented to President Clinton and deposited in the Smithsonian.

On July 7, 2000, after further examination by the principal scientists of the public genome project, and to facilitate the annotation process, the UCSC Genome Browser group released this first working draft on the web at https://genome.ucsc.edu. In the first 24 hours of free and unrestricted access to the human genome, the scientific community downloaded one-half trillion bytes of information from the assembled blueprint of our human species. The initial assembled human genome sequence was referred to as a working draft because there remained gaps where DNA sequence was missing, due either to a lack of raw sequence data or ambiguities in the positions of the fragments. With the assembly 90% complete, the assembled genome was published along with the findings of hundreds of researchers worldwide in the February 15, 2001 issue of Nature, which was largely devoted to the human genome. In the months following the release of the working draft, the UCSC team worked with other researchers worldwide to fill in the gaps. The resulting sequence made its debut in April of 2003. It encompasses 99% of the gene-containing regions of the human genome and is 99.99% accurate. UCSC was designated as the official repository of the early human genome assembly iterations. Eventually, the National Center for Biotechnology Information (NCBI) and then the Genome Reference Consortium (GRC) would take over the assembly and official release of improvements on the genome assembly.

David Haussler, a tall man with a penchant for Hawaiian shirts, had managed to wrangle 100 Dell Pentium III processor workstations for the project. About 30 had been purchased, but others had been intended for student use-until Dean of Engineering Patrick Mantey and Chancellor M.R.C. Greenwood agreed Haussler could "break in" the machines before they went into classrooms. Each of the computers was less powerful than one of today's smartphones; nevertheless, the UC Santa Cruz group was able to link them together for parallel processing, creating a makeshift "supercomputer" for the project.

The UCSC team was a key part of the Hard Core Analysis Group that published in the Feb 15, 2001 issue of Nature. We linked the genome sequence to previous genetic, cytogenetic, and radiation hybrid maps, and to the new physical clone map. We did this both to refine and validate the sequence assembly, and to explore phenomena such as positional and gender variation in recombination rate, regional isochore structure and repeat structure at the single base resolution for the first time. David Kulp performed the mapping of STS markers, messenger RNAs and ESTs, Terry Furey mapped the chromosome band positions, cytogenetic markers (~8,000 gene regions mapped by Fluorescence In-Situ Hybridization) and isochores, and integrated these data with the radiation hybrid and genetic maps.

The genome sequence at the time of release, however, was simply a few billion characters of Gs, As, Ts and Cs, many of them assigned to chromosomes. As indicated above, however, without landmarks it is unintelligible. Before his work on the draft assembly, Kent was also working on a computer program that would allow him to view genes of C. elegans and show via a web interface which parts of the genes are ultimately used by the cell to encode proteins. The process of "splicing" removes sequence called introns and was visualizable using Jim's program, The Intronerator.

The Intronerator evolved into Genome Browser and ultimately became a tool to provide information about the functional significance of many other parts of the genome sequence. The process of annotation, as it is called, identifies sequences that represent not only the genes and which parts of the genes encode proteins, but also the control sequences that tell cells when and where to activate genes, which regions of the genome are conserved through evolution and can be found in other animals, and many other significant regions. Essentially, in the Browser the genome became a coordinate system upon which to hang any functionally significant annotation.

Once the human genome sequence became available and the Browser built to visualize it, other genome browsers also came online, most notably those at NCBI and at the European Bioinformatics Institute (EBI). Reciprocal links provided on each of the three browsers allow researchers to jump from any place in the human genome to the same region on either of the other two browsers.

The ENCODE Project

The human genome contains vast amounts of information, and all of the functions of a human cell are implicitly coded in the human genome. With the molecular sequence known, researchers have been mining it for clues as to how the body works in health and in disease, ultimately laying out the plan for the complex pathways of molecular interactions that the sequence orchestrate. The UCSC Genome Browser aids the worldwide scientific community in its challenge to understand the genome, to probe it with new experimental and informatics methodologies, and to decode the genetic program of the cell.

After the sequence of the genome was first available, a researcher's ability to decode that sequence and tap into the wealth of information it holds was still quite limited. The next step beyond viewing the genome is gaining an understanding of the instructions encoded in it. Toward this end, the UCSC Genome Browser group participated as the Data Coordination Center for the ENCyclopedia Of DNA Elements (ENCODE) project, an international endeavor to generate a comprehensive parts list of all the functional components in the human genome.

ENCODE is a scientific reconnaissance mission aimed at discovering all regions of the human genome crucial to biological function. Before ENCODE, scientists focused on finding the genes, or protein-coding regions, in DNA sequences; but these account for only about 1.5% of the genetic material of humans and other mammals. Non-protein-coding regions of the genome have important functions serving as the instruction set for when and in which tissues genes are turned on and off. The ENCODE project is developing a comprehensive "parts list" by identifying and precisely locating all functional elements in the human genome. This project, sponsored by the National Human Genome Research Institute (NHGRI), involves an international consortium of scientists from government, industry, and academia.

UC Santa Cruz's Role

UC Santa Cruz developed and ran the Data Coordination Center for the ENCODE project from its inception in 2003 through the end of the first production phase in 2012. During that time, the UCSC Genome Browser group, directed by Jim Kent with technical management by Kate Rosenbloom, provided the database and web interface for all sequence-related data to the ENCODE project. This included integrating the data into the UCSC Human Genome Browser (where it continues to reside) on specialized tracks, and providing further in-depth information on detail pages. UC Santa Cruz also developed, performed, and presented computational and comparative analyses to glean further genomic and functional information from the collective data.

UC Santa Cruz worked closely with labs producing data for the ENCODE project and with data analysis groups to define data and metadata reporting standards for a broad range of genomic assays. They implemented data submission and validation pipelines, created and maintained the encodeproject.org website, developed user access tools for ENCODE data, exported all ENCODE data to repositories at the National Center for Biotechnology Information (NCBI), and provided outreach and tutorial support for the project.

Kate Rosenbloom while working on the ENCODE Project (2003).

The ENCODE data coordination was passed on to the Michael Cherry laboratory at Stanford University in late 2012. UC Santa Cruz, however, continues to support existing ENCODE data and resources on the UCSC Genome Browser website. Researchers can still select ENCODE from the portal at Stanford and export a track hub to the Genome Browser for visualization. Newer ENCODE data of broad interest, particularly integrative and summary data, will continue to be incorporated into the browser.

The following paper describes ENCODE resources at UC Santa Cruz:

Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG et al. ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013 Jan;41(Database issue):D56-63. PMID: 23193274; PMC: PMC3531152

More about the ENCODE Project

UCSC Genome Research Primer

Comparative Genomics

Besides developing, supporting, and continuing to improve the Genome Browser, the UCSC Genome Browser group conducts research into the functional elements of the human genome that have evolved under natural selection. Since the first assembly of the human genome, a growing number of species have been added to the UCSC Genome Browser, including roundworm, pufferfish, chicken, mouse, and chimpanzee. In 2018, the UCSC Genome Browser surpassed 200 assemblies for the various species hosted on the browser. Interspecies alignments allow researchers to compare human genes to similar genes in other species. The UCSC Genome Browser allows rapid comparisons between species, which can lead to many different types of new discoveries:

New gene discoveries can result from searching the human genome for sequences that match those with known functions in other organisms. The molecular genetics behind disease development and progression in model organisms can be leveraged to discover potential disease-related genes in humans, moving us closer to diagnostic advances and targeted treatments.
We can reconstruct the evolutionary history of the human genome by identifying the origins of interspecies differences and of short segments in the human genome that have been extremely well-conserved over millions of years of evolution.
By searching for the highly conserved segments in the human genome—those that are unchanged from like segments in the genomes of other organisms—we can begin to understand the essential elements of the blueprint for life. Researchers believe that these highly conserved elements must be essential to function; there has been too much time for the sequences to diverge from the last common ancestor for any other explanation to make sense. Genes make up only a small percentage of the unchanged elements, suggesting that other unknown regulatory elements in the genome are also important for function.
Searching for genes that have evolved with unusual speed from one organism to another will give clues to essential interspecies differences, such as differences between the human and chimpanzee brain.

Possibilities for Health

As we begin to better understand the molecular mechanisms responsible for human disease, entirely new avenues for treatment have opened up. We are only now getting a first glimmer of the molecular functions of a healthy human cell or organ, and we are still a long way from understanding the often subtle and complex ways that these can go awry. Yet knowledge of the human genome puts us on the brink of a revolution in medicine.

Rather than relying on trial and error to design and test new drugs, researchers will increasingly use their knowledge of the molecular causes of diseases to design new, targeted therapies. Research based on genome studies and new experimental methods such as CRISPR, all viewable on the UCSC Genome Browser, will also form the basis for new diagnoses and therapies for human disease that will transform the practice of medicine in this century.

The UCSC Genome Browser supports the latest endeavor of the National Human Genome Research Institute (NHGRI), a medical sequencing project intended to amass data relating genes to health conditions. This project sets the stage for the time when it becomes affordable for any individual's genome to be sequenced in a medical context. The information obtained will allow estimates of future disease risk and improve the prevention, diagnosis, and treatment of disease. The project focuses on rare Mendelian disorders, complex disorders, and normal human variation.

The practice of medicine will become much more individualized, with therapies tailored to be most effective given an individual's genetic makeup. Medical tests are already available to identify individual genetic variations that affect a patient's response to commonly used medications. These tests can allow doctors to avoid adverse reactions and choose medications appropriate for specific individuals. Someday we may even be able to repair or replace the disease-causing genes, re-orchestrating the molecular pathways needed for health. Already, a small clinical trial is showing great promise in reversing the effects of sickle-sell anemia.