Charting the coastline of the mammalian promoterome, and keelhauling some dragons on the way

Published April 2009

In this text, I will describe recent studies aimed at exploring the landscape of promoters in mammalian genomes, which have changed our understanding of how promoters work, and some of the underlying biology.

First, what is a promoter and why is it interesting? The promoter is, loosely, the region around the start of a gene in the genome, which holds many of the regulatory elements that govern the transcriptional output of the gene. Like much of molecular biology, the promoter terminology is borrowed from "simpler" organisms like bacteria, where it is sufficient, but in mammals the promoter definition is imprecise in terms of exactly where it starts and ends relative to the gene. As we will find out, the definition will not become easier in the future - rather the contrary. In this text I will focus on the part of the promoter that lies just around the start of the gene, which is often called the core promoter.

It is necessary to relate the new findings to the "text-book" conception of the core promoter. The pioneers of experimental promoter analysis focused on a few highly expressed genes expressed in one or a few tissues. Almost all of these genes had a distinct TA-rich pattern at a fixed distance of 30 nucleotides upstream of the start of the gene - popularized as the TATA-box. This made for an attractively simple model: the TATA-box tells the cellular machinery where to start transcribing the gene, and this architecture is the same for all genes (Figure 1A). Indeed, there are parts of the proteins of the transcription machinery that bind specifically to the TATA-box and even bend the DNA at a 45-degree angle to start transcription. However, as the number of studied genes grew, it turned out that not all of them had a TATA-box. In fact, the fraction of genes having a TATA-box has decreased with the size of the studies - the most comprehensive studies so far estimate that only 10-20% of promoters have a clear TATA-box. Despite this, the TATA model is generally presented as the typical promoter in many textbooks - perhaps because it is easily understandable.
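
To make the textbook model concrete, here is a minimal Python sketch of how one might scan for a TATA-like pattern in the expected window roughly 25-35 nucleotides upstream of a presumed start site. The degenerate consensus (TATAWAW) and the window boundaries are illustrative assumptions, not a validated predictor.

import re

# Degenerate TATA-box consensus (TATAWAW, W = A or T) -- a common textbook
# approximation, used here purely for illustration.
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")
MOTIF_LENGTH = 7

def has_tata_box(sequence, tss_index, window=(25, 35)):
    """Look for a TATA-like motif 25-35 nucleotides upstream of a start site.

    sequence  : genomic sequence (string of A/C/G/T), sense strand
    tss_index : 0-based position of the transcription start site in the string
    window    : range of distances upstream of the TSS where the motif may begin
    """
    start = max(0, tss_index - window[1])
    end = max(0, tss_index - window[0] + MOTIF_LENGTH)
    return TATA_PATTERN.search(sequence[start:end]) is not None

# Constructed toy example: a TATA-like box 32 nucleotides upstream of the TSS
seq = "G" * 28 + "TATAAAA" + "C" * 25 + "AGGGGG"
print(has_tata_box(seq, tss_index=60))  # True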

Figure 1: Core promoters - models and real data.
A) A typical text-book model of the core promoter: the transcription start site is perceived as a single nucleotide, governed by a TATA box 30 nucleotides upstream of it.
B) Example from Carninci et al 2006 of a core promoter detected by CAGE that adheres to the textbook model. The number of tags mapped to the genome sequence is shown on the y axis. The x axis represents the genome (the upper panel shows mouse data, the lower panel human data). Note that both the genome sequence and the CAGE tag distributions are close to identical.
C) A core promoter not adhering to the textbook model: the CAGE data shows a complex distribution of start sites, which is similar between human and mouse.

Detailed studies of a small number of genes uncovered a few other patterns with similar function to the TATA-box, including the INR pattern, which can occur with or without a TATA-box. Since the detection of genes and promoters at this time took a lot of time and resources, this spurred the development of computer programs to predict gene starts from genomic sequence. These were generally based on the patterns above, and on the finding that many promoters are enriched for CG dinucleotides - so-called CpG islands. This proved to be a challenging problem: the diversity of promoters and the lack of strong patterns only permitted an accuracy of around 50% - which (perhaps not) coincidentally was the fraction of promoters having CpG islands.
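
As an illustration of the CpG island idea, here is a minimal Python sketch computing the two quantities usually used to call a CpG island: the G+C fraction and the observed/expected CpG ratio. The thresholds follow the commonly used Gardiner-Garden and Frommer style criteria (GC above 50%, observed/expected CpG above 0.6), which should be treated as an assumption here rather than as part of the studies discussed.

def cpg_stats(seq):
    """Compute G+C fraction and observed/expected CpG ratio for a sequence.

    The observed/expected ratio compares the number of CG dinucleotides to
    what would be expected from the C and G content alone.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    expected_cg = (c * g) / n if n else 0.0
    obs_exp = cg / expected_cg if expected_cg else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_gc=0.5, min_obs_exp=0.6):
    # Thresholds in the style of Gardiner-Garden & Frommer -- an assumption
    # made for this sketch, not a statement about the studies in the text.
    gc, oe = cpg_stats(seq)
    return gc > min_gc and oe > min_obs_exp

print(looks_like_cpg_island("CGCGGGCCGCGATCGCGCGGGCGC"))  # True (toy example)
print(looks_like_cpg_island("ATATATTTTAATATATATATTATA"))  # False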

In the late 90s, when the first eukaryotic genomes were sequenced, it was clear that the computational methods were not accurate enough to replace experimental methods, which on the other hand were too time-consuming and expensive. In many ways, this mirrored the search for other biological features in the genome, such as transcripts, enhancers and transcription factor binding sites, where computational methods had some, but not total, success. This motivated the development of high-throughput methods to find promoters and other elements, which all had in common that they relied on a completed genome to "map" against. The methods with the highest resolution were based on sequencing the starts of the actual RNAs transcribed from a given gene, and then mapping these sequences back to the genome (since the RNA sequence is complementary to the genomic DNA). The most well-known methods are CAGE (Cap analysis of gene expression) and oligo-capping, which use the same conceptual approach but different chemistry to be certain to capture the true end of the RNA molecule (Figure 2 shows a conceptual protocol).

Figure 2: Simplified protocol for transcription start site tagging
A) A gene on the genome is transcribed, spliced and processed into mature mRNA. A cap structure is added at the start of the RNA by the cell
B) The RNA is converted into double-stranded, complementary DNA by the use of reverse transcriptase. Only those cases where the transcriptase reaches the cap are retained. An adapter replaces the cap.
C) The complementary DNA is cut 20 nucleotides downstream of the adapter - the remaining DNA is degraded
D) Each captured RNA will result in a tag. These tags are sequenced, and mapped back to the genome, where they will indicate the start of the RNA that was transcribed in A)
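
As an illustration of the final mapping step in D), here is a minimal Python sketch that records where each tag matches a toy genome sequence exactly. Real pipelines use indexed short-read aligners, allow mismatches and search both strands; the map_tags function and the sequences below are invented for this example.

def map_tags(genome, tags):
    """Exact-match mapping of short tags to a genome string.

    Returns a dict: tag -> list of 0-based start positions in the genome.
    Only meant to illustrate the idea; real mapping is far more involved.
    """
    hits = {}
    for tag in tags:
        positions = []
        pos = genome.find(tag)
        while pos != -1:
            positions.append(pos)
            pos = genome.find(tag, pos + 1)
        hits[tag] = positions
    return hits

genome = "TTGACCATGCAGTTAACCGGATCCATGCAGTTAACC"
tags = ["ATGCAGTTAACC", "GGGGGG"]
print(map_tags(genome, tags))
# {'ATGCAGTTAACC': [6, 24], 'GGGGGG': []}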

Another difference is the length of the sequence: if we are only interested in finding the end of a molecule, it is not necessary to sequence the whole of it - it suffices to sequence a small part. Theoretically, with a genome as large as the human, the necessary length is 18 nucleotides, but for practical reasons slightly longer sequences are needed, from 21 nucleotides and upwards (genomes contain a lot of repetitive sequence, and sequencing machines introduce errors). These short sequences are referred to as "tags". CAGE tags are around 21 nucleotides, and oligo-capping tags are substantially longer.
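
As a back-of-the-envelope sketch of the reasoning about tag length: the expected number of purely random matches of a tag of length k in a genome of N base pairs (counting both strands) is roughly 2N / 4^k. The Python snippet below evaluates this for a few tag lengths, assuming a uniform base composition that real genomes do not have - which is part of why longer tags are needed in practice.

def expected_random_hits(genome_size, tag_length, both_strands=True):
    # Expected number of chance matches of a random k-mer in a genome of
    # the given size, under a uniform base composition (a simplification).
    strands = 2 if both_strands else 1
    return strands * genome_size / 4 ** tag_length

HUMAN_GENOME = 3_000_000_000  # ~3 Gb, approximate

for k in (15, 16, 17, 18, 21):
    print(k, round(expected_random_hits(HUMAN_GENOME, k), 3))
# 15 5.588
# 16 1.397
# 17 0.349
# 18 0.087
# 21 0.001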

Both CAGE and oligo-capping were designed to be scalable, so that they could be applied en masse to chart the human and mouse promoteromes. The initial CAGE and oligo-capping libraries could yield from several hundred thousand up to a million tags. Since oligo-capping tags are longer, and therefore more costly to sequence per tag, the CAGE libraries contained on average many more tags per library.

An important difference between these methods and "classical" techniques is that they are "blind" in terms of targets. Small-scale experiments are almost always targeted towards a certain region of the genome, which can lead to ascertainment bias: in other words, we look for biological elements at places where other researchers have reported them to be, in order to maximize the chance of success. This creates a negative spiral, since we will then establish a "truth" which might not be representative - much like the TATA-box model described above. A blind method is less biased.

The blindness, together with the throughput, gave several advantages. Tags mapped to the genome could give both the location of the starts of genes - transcription start sites - and their relative usage in the cell, since we are randomly sequencing a small part of the actual RNA population. This means that highly expressed genes will have many RNA molecules present, which in turn will be captured more often as tags, so the number of tags mapping to the same location is indicative of the expression, or really the usage, of that transcription start site. Since we can sample RNAs from different tissues, we can also use these techniques to find out which transcription start sites are used in most, or only a few, tissues. This means that the tags can also be used as expression measurements, much like microarray experiments, with the difference that the entities studied are not genes but actual promoters (as we will see, this makes a real difference).
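
To make the expression interpretation concrete, here is a minimal Python sketch (with invented promoter coordinates and tag positions) that counts how many tag 5' ends fall within each promoter region and normalizes to tags per million, so that libraries of different sizes can be compared.

from collections import Counter

def promoter_expression(tag_positions, promoters, library_size=None):
    """Turn mapped tag 5' positions into per-promoter expression values.

    tag_positions : iterable of genomic positions (one per sequenced tag)
    promoters     : dict name -> (start, end) genomic interval
    Returns tags-per-million-like values per promoter.
    """
    positions = list(tag_positions)
    library_size = library_size or len(positions)
    counts = Counter()
    for pos in positions:
        for name, (start, end) in promoters.items():
            if start <= pos < end:
                counts[name] += 1
    return {name: 1e6 * counts[name] / library_size for name in promoters}

# Invented toy data: two promoters, ten tags
promoters = {"promoterA": (100, 160), "promoterB": (5000, 5080)}
tags = [105, 110, 110, 112, 150, 5001, 5001, 5001, 5050, 9999]
print(promoter_expression(tags, promoters))
# {'promoterA': 500000.0, 'promoterB': 400000.0}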

Expectations and surprises - one promoter gives a forest of start sites

The team behind the CAGE method had read the same textbooks as the rest of us, and expected that most of their tags would map to one or a few very distinct nucleotides within each promoter, with a TATA-box upstream. A surprising finding was that this was only true for a minority of promoters - only around 25% of the core promoters had virtually all their tags located at a single nucleotide, and only about 25% of these had a clear TATA-box (Figure 1B shows one such example). Most core promoters had a wide distribution of tags - a forest of transcription start sites (Figure 1C). This was first thought to be caused by methodological noise, but there are many indications that the observations reflect real biology. Perhaps the most convincing argument is that human and mouse tag distributions for the same promoters are often very similar, even though different tissues were sampled (Figure 1B-C). Another intuitive argument is that if the method were noisy, there should be no promoters where all the tags map to the same location - you would always expect a noisy distribution around it, and this is clearly not the case (see Figure 1B). In retrospect, the team also found a handful of detailed studies of single promoters showing the same phenomenon - so, in a way, the results were not entirely novel; what the study really did was to show that most promoters look like this, rather than the other way around. Other tag-based studies have also confirmed it.

Sequence analysis showed that the promoters with a single peak were enriched for the TATA-box - many of these are bona fide textbook-model promoters, which shows that the old model is not incorrect, just not very representative. These promoters are generally more conserved over evolution than the promoters with broader distributions. Conversely, the promoters with a forest of transcription start sites rarely have TATA-boxes, but more often overlap CpG islands. They are also often expressed in many tissues, while the single-peak promoters are often tissue-restricted. This goes hand in hand with the older observation that housekeeping genes are CpG-enriched. There was also a clear trend in both types of promoters for the actual start sites to use certain dinucleotides - a pyrimidine (C/T) followed by a purine (A/G). Comparison between human and mouse promoters showed that, on average, mutations which disrupt this consensus lead to less usage of the start site, and vice versa.
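
As an illustration of the dinucleotide preference, here is a minimal Python sketch that, for a set of observed start sites, records the bases at positions -1 and +1 and the fraction of sites matching the pyrimidine-purine consensus; the sequence and start positions are made up.

from collections import Counter

PYRIMIDINES = set("CT")
PURINES = set("AG")

def initiator_dinucleotides(genome, tss_positions):
    """Tally the -1/+1 dinucleotide at each start site and the PyPu fraction.

    genome        : genomic sequence (string), same strand as transcription
    tss_positions : 0-based positions of the first transcribed nucleotide
    """
    dinucs = Counter()
    pypu = 0
    for pos in tss_positions:
        if pos < 1:
            continue
        minus1, plus1 = genome[pos - 1], genome[pos]
        dinucs[minus1 + plus1] += 1
        if minus1 in PYRIMIDINES and plus1 in PURINES:
            pypu += 1
    total = sum(dinucs.values())
    return dinucs, (pypu / total if total else 0.0)

# Made-up example sequence and hypothetical observed start sites
genome = "AATTCAGGTTCATTTTCGAA"
starts = [5, 8, 11, 17]
counts, fraction = initiator_dinucleotides(genome, starts)
print(counts, fraction)
# Counter({'CA': 2, 'GT': 1, 'CG': 1}) 0.75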

Why do promoters have wide distributions of start sites?

This is a hard question to answer with certainty. The most promising explanation so far is based on evolutionary pressure on promoters. The function of a promoter is to recruit the transcription machinery with a certain efficiency if the right cues are present (such as the binding of certain transcription factors, the chromatin state, etc.). The usage rate (the "expression") of a promoter is linked to the RNA and protein content of the cell and will have an effect on the fitness of the organism. Severe changes in efficiency due to mutations will likely cause problems for the organism, or could even be lethal when important genes are affected.

If we consider the single-peak promoter, with a TATA-box upstream of the start site, and introduce randomly located mutations in this promoter, the effects will likely be binary - either severe (mutations disrupting the TATA-box, for instance) or insignificant. Conversely, random mutations in promoters with many start sites will likely affect the promoter, but not as much (even if a whole start site is rendered useless by the mutation, this removes only one part of the total output). Therefore, the evolutionary advantage of having many TSSs in a promoter is that it is more resilient to mutations, but also that the expression of such a promoter is easier to fine-tune. There may also be other reasons - it could be that the two types of promoters are subject to partially different mechanisms in terms of the physical structure of the promoter - that they have different chromatin marks, for instance. This is currently under investigation.
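
To make the robustness intuition slightly more concrete, here is a deliberately crude toy model in Python (not taken from the studies discussed): each promoter is represented as a set of start sites with relative usage weights, a mutation is assumed to disable exactly one site, and we compare how much of the total output is lost for a single-peak versus a broad promoter.

def loss_if_site_disrupted(site_weights):
    """Fraction of total promoter output lost if one start site is disabled.

    Returns (average loss, worst-case loss) over all sites, assuming a
    mutation disables exactly one site -- a deliberately crude toy model.
    """
    total = sum(site_weights)
    losses = [w / total for w in site_weights]
    return sum(losses) / len(losses), max(losses)

single_peak = [100]          # one dominant start site (TATA-like promoter)
broad = [10] * 10            # ten equally used start sites (broad promoter)
print(loss_if_site_disrupted(single_peak))  # (1.0, 1.0): all output lost
print(loss_if_site_disrupted(broad))        # ~(0.1, 0.1): a tenth lost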

How can the cell determine the start site distribution?

The mutation study described above implies that at least part of the information the cell needs to know which start sites to use (and how much to use them) is embedded in the local DNA sequence. In a subsequent study, we could show that the local DNA information is sufficient to distinguish between start sites that differ strongly in usage. In other words, the DNA sequence can predict whether one start site is "better" than another within the same promoter. On the other hand, the DNA sequence had almost no predictive value in terms of total expression - that is, the actual number of tags. This implies that the actual distribution of the tags, interpreted as a probability distribution summing to 1, is embedded in the core promoter, while the usage of the core promoter (the number of times it is activated) is governed by other effects - probably involving distal elements like enhancers, but also epigenetic effects. This breaks the promoter concept into two parts: the opening and stabilization of chromatin, which enables sampling of the underlying probability distribution of start sites, which in turn is embedded in the DNA sequence.
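
Here is a minimal Python sketch of the decomposition described above: the "shape" of a promoter is its tag count per position normalized into a probability distribution summing to 1, while the "level" is simply the total tag count; the numbers are invented.

def split_shape_and_level(tag_counts_per_position):
    """Split a promoter's tag data into total expression and shape.

    tag_counts_per_position : dict position -> number of tags starting there
    Returns (total tag count, dict position -> fraction of tags), where the
    fractions form a probability distribution summing to 1.
    """
    total = sum(tag_counts_per_position.values())
    shape = {pos: count / total for pos, count in tag_counts_per_position.items()}
    return total, shape

# Invented example: a broad promoter with five observed start positions
counts = {1000: 4, 1003: 10, 1004: 20, 1007: 10, 1012: 6}
total, shape = split_shape_and_level(counts)
print(total)   # 50
print(shape)   # {1000: 0.08, 1003: 0.2, 1004: 0.4, 1007: 0.2, 1012: 0.12}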

This type of reasoning might explain why earlier computational predictions of promoters were not successful: these methods implicitly assumed that the start site is distinct, that all DNA is accessible, and that the promoter is either active or inactive. The model we propose indirectly says that all nucleotides are potential start sites, although some are never accessible.

One gene, many distinct promoters

If we zoom out from single core promoters and look at whole known genes, it quickly becomes apparent that most genes have more than one promoter; most often two, but sometimes up to seven distinct alternative promoters. This is interesting for several reasons - to start with, having multiple core promoters allows a gene to have multiple regulatory programs that are distinct and can evolve independently of each other. For example, it is possible to have one promoter that is used in liver and one that is used in brain. Indeed, many alternative promoters are highly biased towards certain tissues. An extreme example, the Dlgap1 gene, has four major promoters, each used primarily in one brain tissue (Figure 3). Depending on the placement of the alternative promoter, the choice of promoter will also affect the final protein product, as part of the RNA will be skipped. There are many observed cases where entire functional protein domains are missing in one tissue because of this, which has obvious biological implications. This works in parallel with alternative splicing, and in the end generates a large diversity of RNA and protein products from the same gene locus. It also shows that measuring gene and promoter expression might not yield the same results, and that it might be more relevant to talk about "tissue specificity" for promoters than for genes.

Figure 3: Example of usage of different alternative promoters by different tissues in the mouse Dlgap1 gene. The histogram shows the number of tags mapping to the genome, but we have zoomed out to see the whole gene locus. Below the histogram, known isoforms of the gene are shown (mRNAs), which coincide with the major alternative promoters. Each such promoter is biased to one brain tissue.
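
To illustrate why promoter-level and gene-level expression measurements can disagree, here is a small invented example in Python: two alternative promoters of a hypothetical gene are each strongly biased towards one tissue, yet their summed, gene-level profile looks far less tissue-specific.

# Invented tag counts for two alternative promoters of one hypothetical gene
per_promoter_counts = {
    "promoter1": {"liver": 95, "brain": 5, "kidney": 10},
    "promoter2": {"liver": 5, "brain": 90, "kidney": 15},
}

def gene_level(per_promoter):
    """Sum promoter-level counts into a single gene-level profile."""
    gene = {}
    for counts in per_promoter.values():
        for tissue, n in counts.items():
            gene[tissue] = gene.get(tissue, 0) + n
    return gene

print(gene_level(per_promoter_counts))
# {'liver': 100, 'brain': 95, 'kidney': 25}
# Each promoter looks strongly tissue-biased, but the summed gene-level
# profile is much less so: "tissue specificity" is a promoter property here.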

What happens now?

A very clear trend in molecular biology is that sequencing technologies - in particular so-called ultra-high-throughput sequencers such as 454 or Solexa - are gaining ground over hybridization-based methods such as microarrays and tiling arrays. The tag methods were originally designed for Sanger-based sequencing, but are now being adapted for use with the various new sequencers. This brings new opportunities and challenges. One important point is that the sequencing depth can be increased significantly; this is necessary if we want to achieve close to full coverage of all the promoters used in a complex tissue. Another research direction that is gaining importance is to minimize the number of cells needed to perform these experiments. In the studies described above, whole tissues had to be used, sometimes from several individual animals, which gives a mixture of different cells. This is problematic, as a weak promoter signal might be due either to an overall low activity of the promoter in all the sampled cells, or to a combination of very high activity in a subset of cells and no activity at all in the remaining cells.

The start site tagging technologies are complementary to the emerging RNA-seq technology, which, instead of sequencing the starts of transcripts, breaks the transcripts down randomly to make tags that span the gene. Initial results show that RNA-seq is more accurate in terms of expression measurements, but on the other hand it cannot reliably capture the actual starts of genes. So, for the time being, the methods are complementary, and to some degree overlapping. This also highlights an in-built weakness of the start site tagging technologies, namely that information about the rest of the transcript is lost. To a large degree, known transcripts can be used for inference (e.g. if a tag hits a known transcript, it can be assumed to belong to a variant of the same gene), but this is next to impossible when tags hit regions of the genome where no annotation exists. Combinations of RNA-seq and start site tagging might be able to resolve this.

Another challenge is to find clinical settings where these methods can be applied - these might involve the effect of genetic polymorphisms on the promoterome. From a computational perspective, tag sequencing opens up many new avenues for analysis, as many of the problems with hybridization methods simply disappear. On the other hand, the field as a whole has not yet reached the stage where we know the full strengths and weaknesses of the new sequencing technologies and the associated experimental methods. This means that a combination of enthusiasm, creativity and a critical mindset is needed to bring these new technologies to bear on the biological problems of the future.

Further reading

These reviews and articles are a reasonable starting point for those wishing to learn more about core promoters, tagging technologies and research on these themes. 

S. T. Smale and J. T. Kadonaga: The RNA polymerase II core promoter. Annu Rev Biochem 2003, 72:449-79

M. Harbers and P. Carninci: Tag-based approaches for transcriptome research and genome annotation. Nat Methods 2005, 2:495-502

A. Sandelin, et al: Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet 2007, 8:424-36

M. C. Frith, et al: A code for transcription initiation in mammalian genomes. Genome Res 2008, 18:1-12

E. Valen, et al: Genome-wide detection and analysis of hippocampus core promoters using DeepCAGE. Genome Res 2009, 19 (February issue, preprint)