Impact of Functional Genomics on Biotechnology

Published July 2002

Abstract:

Functional genomics encompasses the "new biology" that follows genome sequencing, such as assigning gene functions, understanding protein expression and processing, and modeling reactions in metabolism, signal transduction, and other networks.

At the conference, researchers showed how they use comparative analysis of all the available genomes, reconstruction of the cell's metabolic network, and structural mapping of the unassigned genes in order to increase the annotation rate and find more new genes.

Several talks from universities participating in the large Danish Biotechnology Instrument Center (DABIC) grant demonstrated the impressive technique development and buildup of instrumentation that it has made possible. In the proteomics field in particular, impressive scientific accomplishments were also presented.

A major achievement for industrial biotechnology was the genome sequencing of A. niger, which opens many possibilities for finding new industrial enzyme products as well as opportunities for optimizing recombinant protein production in this industrially useful host.

Led by the genome sequencing programs, a massive gathering of information has characterized the field of biotechnology over the last several years. This has made necessary a whole new set of scientific approaches to store and communicate the data, to analyse its significance, and, importantly, to put it into a relevant biological context. "Functional genomics" is the best common descriptor of all the systematic, information-intensive approaches for analysis of this data, ranging from gene sequencing to mathematical models of entire cells.

In May, a record crowd of 187 gathered in Munkebjerg, Vejle for a two-day conference on the "Impact of Functional Genomics on Biotechnology". This, the 8th annual "Danish Biotechnology Conference", is aimed at scientists working in Denmark and the surrounding regions. It is organized by the Danish Biotechnology Forum (DBF), this year with support from DABIC.

Genome analysis:

The scientific program was kicked off by Niels Larsen from Integrated Genomics, USA, a company which focuses on sequencing, expression analysis, and gene annotation. With 60 prokaryotic genomes sequenced to date, he predicted that the sequencing rate would increase to 100-200 genomes per year by 2010, even in the absence of technological leaps such as nanopore sequencing. With so much information available, he proposed that data analysis would be the most fruitful (and profitable) area to focus on.

In a thought-provoking slide, he showed the well-known "Boehringer chart" of metabolic pathways including 800 reactions, and explained that for 22% of the reactions no gene is known, while for 17% only a prokaryotic gene is known. The high frequency of unknown genes underscores the gap between the high level of genome sequence information and the relatively limited understanding of cellular processes. There are still many fundamentally important gene functions to be discovered by innovative assignment algorithms. The high proportion of genes specific to prokaryotes represents an interesting set of targets for novel antibiotics, which, because they act on prokaryotic genes or gene products, would be selective for the infectious agent while harmless to the patient.

Having set the stage for the importance of assignment, Niels Larsen illustrated several clever ways to exploit the availability of many genomes to extract annotations that would not be possible from mere sequence homology analysis. These included incorporating the strain's biochemistry and data on the clustering of open reading frames into the genome analysis.

For instance, Integrated Genomics incorporates general metabolic network information in order to identify which pathways are represented in a given genome; pathway genes that cannot readily be identified in the genome will most likely be found in the unannotated fraction. Further, they link the genome analysis to metabolite information specific to that organism, as genes for proteins acting on similar metabolites are often found close together in the genome. In the case of Thermotoga maritima, a thermophilic bacterium which is the source of many interesting leads for industrial enzymes, the application of these methods led to functional assignment of an additional 10% of the genes compared to simple sequence homology searches (Nucl. Acids Res. 2000, 28, 123-125).

Lars Juhl Jensen from Søren Brunak's group at DTU, Lyngby gave a concise and assertive talk, in which he highlighted the group's achievements in sequence-based functional assignment. A simple property like gene length is selectively associated with the group of transport and binding proteins, which could be predicted at a sensitivity of 0.9 with only 10% false positives. However, the approach came up short for prokaryotes, which in my opinion underlined the need to move beyond single-gene analysis and include gene-specific biochemical information and genome localisation for multiple genomes, as was highlighted in the first talk.
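To make the quoted performance figures concrete, the sketch below computes sensitivity and false positive rate from a confusion matrix. The gene counts are invented for illustration, and it assumes that "10% false positives" means the false positive rate (false positives as a fraction of all true negatives); the talk summary does not say which definition was used.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that the predictor recovers."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of real negatives that are wrongly predicted positive."""
    return fp / (fp + tn)


def precision(tp, fp):
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)


# Hypothetical genome of 1000 genes, 100 of which truly encode
# transport and binding proteins (counts are assumptions, not data).
TP, FN = 90, 10    # 100 real transport/binding genes
FP, TN = 90, 810   # 900 other genes

print("sensitivity:", sensitivity(TP, FN))
print("false positive rate:", false_positive_rate(FP, TN))
print("precision:", precision(TP, FP))
```

With these assumed counts, the same predictor that reaches a sensitivity of 0.9 at a 10% false positive rate is right in only half of its positive calls, because the negative class is nine times larger. The prevalence of the functional class therefore matters when reading such figures.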

Julian Gough from the MRC Laboratory of Molecular Biology in Cambridge, UK brought the audience into the three-dimensional world of structural genomics. His aim is to use protein sequence information to predict which structural superfamily a protein belongs to. In many cases, this information leads to implied functional annotation of otherwise unannotated genes, as proteins from the same superfamily usually have identical or related functions.

His approach involves building a library of so-called Hidden Markov Models (HMMs) representing each of the more than 1000 superfamilies of protein structures classified in the SCOP database. All available protein structures were used to build the models. A sequence of interest can thus be matched to the superfamily to which it belongs by searching for similarity to the underlying HMMs rather than by simple homology searching against the members of the superfamilies. This improves the number of hits, especially among the members of a superfamily which are more distantly related to the other members.
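The advantage of scoring against a family model rather than against individual members can be illustrated with a much simplified stand-in for an HMM: a position-specific log-odds profile built from an ungapped alignment. This is only a sketch and not the method presented in the talk. It has no insert or delete states, it assumes a uniform amino acid background, and the sequences are invented.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, pseudocount=1.0):
    """Per-column log-odds scores versus a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, seq):
    """Sum of column scores; positive means 'more family-like than random'."""
    return sum(column[aa] for column, aa in zip(profile, seq))


def best_pairwise_identity(members, seq):
    """Highest fraction of identical positions to any single member."""
    return max(
        sum(a == b for a, b in zip(member, seq)) / len(seq)
        for member in members
    )


# Invented toy family (ungapped, equal length) and two query sequences.
FAMILY = ["MKVLAG", "MRILSG", "MKLVTG", "MRVIAG"]
DISTANT_MEMBER = "MRLLTG"   # mixes residues seen in different members
UNRELATED = "WPDEHC"        # shares no residue with any column

profile = build_profile(FAMILY)
for name, query in [("distant", DISTANT_MEMBER), ("unrelated", UNRELATED)]:
    print(name,
          "profile score:", round(profile_score(profile, query), 2),
          "best identity:", round(best_pairwise_identity(FAMILY, query), 2))
```

The profile pools what is tolerated at each position across the whole family, so a sequence that combines residues from different members still scores well even when it matches no single member closely. A real profile HMM adds position-specific insertion and deletion probabilities on top of this idea.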

The method is fast enough to analyse entire genomes, leading to classification of close to half the genes of a genome into the appropriate SCOP superfamily. At the structural level, this showed that eukaryotes contain genes belonging to around 600 different superfamilies, with a 97% overlap between the superfamilies represented in the human genome and those in yeast or plant genomes! Bacteria, predictably, use fewer of the superfamily motifs for their proteins.

Focusing on the unannotated parts of the genomes, he showed that this method could identify the superfamily relationship for several hundred of the unannotated genes in prokaryotic genomes and likewise for several thousand unidentified genes in eukaryotic genomes. This is equivalent to ~15% of the genes in each genome, which were formerly designated "unknown function" but can now at least be assigned to a superfamily fold. Of these novel assignments, approximately half (more for eukaryotes than for prokaryotes) could have been obtained using BLAST searches against the SCOP domains as published. Hence the HMM approach significantly increases the assignment level, especially in prokaryotic genomes. The method is available, including submission of sequences for analysis, at www.supfam.org.

Annotation of genes to different functional classes is central to the use of genome information; however, there is surprisingly little consistency in the way genes are annotated in different genomes. Research groups choose different functional classification schemes, and even with the same set of categories, groups may have different criteria for whether the match to a category is significant enough to allow assignment. This inconsistency makes life more difficult for companies like Integrated Genomics, which makes a living from extracting information from cross-genome comparisons, as highlighted by Niels Larsen. In contrast, Gough's method provides a consistent classification system, which is computationally efficient enough to allow classification of all public genomes to be carried out for a single publication (J. Mol. Biol., 2001, 313, 903-919). It will be interesting to see what can be achieved by combining the approaches taken by these two speakers.

Danish Biotechnology Instrument Center (DABIC)

Established in January 2001, this 135M DKK government initiative links five universities (DTU, SDU, AU, KU, KVL) to strengthen centers of excellence in biotechnology instrumentation. The facilities include structural analysis, nucleotide sequence analysis, proteomics, bioinformatics, microelectronics, pathway analysis, animal genetics, and bioimaging. The build-up will be complete with the inauguration next year of the last of the five Cassiopeia beamlines for macromolecular structure determination at the synchrotron at the MAX2-lab in Lund, an aspect of DABIC for which Sine Larsen (University of Copenhagen) is responsible. In general, the DABIC facilities are available at low cost to industry and academics (www.dabic.dk), providing fertile ground for the instrumentation-intensive functional genomics studies.

Several talks highlighted development of technologies and facilities under the DABIC grant. Sine Larsen and Flemming Poulsen (University of Copenhagen) spoke about structure solving by X-ray diffraction and NMR spectroscopy respectively, and Jens Nielsen (DTU) explained his group's ambitious "systems biology" approach to model stoichiometrically the metabolism of the entire cell, represented by 1200 reactions, 700 genes, and 700 metabolites in 3 distinct compartments: the cytosol, the mitochondria and the extracellular space. The model has been used for solving several problems in central metabolism and in amino acid sensing; however, there are still many, many unknown aspects, as 20% of the reactions are not yet associated with a gene, and the model does not represent reaction kinetics at all, only stoichiometry. Perhaps the most fascinating contributions from the DABIC-funded centers came from the proteomics experts at the University of Southern Denmark in Odense: Ole Nørgaard Jensen elegantly and efficiently combined gel electrophoresis with liquid chromatography and tandem MS for identification of phosphorylated peptides and proteins in signal transduction events, while Stephen Fey told a fascinating story about identifying crucial post-translational modifications in the transfected cells involved in type 1 diabetes.
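What "only stoichiometry" means in the metabolic model described by Jens Nielsen can be sketched with a toy network. In a stoichiometric model, each internal metabolite must be produced and consumed at equal rates at steady state, which constrains the reaction fluxes without any kinetic parameters. The three-reaction network, the metabolite names, and the flux values below are invented for illustration and are not taken from the talk.

```python
def net_production(stoich, fluxes):
    """Net production rate of each metabolite: one row of S times v each."""
    return [sum(coef * v for coef, v in zip(row, fluxes)) for row in stoich]


def is_steady_state(stoich, fluxes, tol=1e-9):
    """True if no internal metabolite accumulates or is depleted."""
    return all(abs(rate) <= tol for rate in net_production(stoich, fluxes))


# Hypothetical toy network with internal metabolites A and B:
#   v1: uptake     (extracellular) -> A
#   v2: conversion A -> B
#   v3: secretion  B -> (extracellular)
# Rows are metabolites, columns are reactions v1..v3.
STOICH = [
    [1, -1, 0],   # A: made by v1, consumed by v2
    [0, 1, -1],   # B: made by v2, consumed by v3
]

balanced = [2.0, 2.0, 2.0]     # every flux carries the same rate
unbalanced = [2.0, 1.0, 1.0]   # A is taken up faster than it is converted

print(net_production(STOICH, balanced), is_steady_state(STOICH, balanced))
print(net_production(STOICH, unbalanced), is_steady_state(STOICH, unbalanced))
```

A genome-scale model applies the same balance to every internal metabolite across all reactions and compartments. Because there are usually more reactions than metabolites, the balance alone leaves many flux distributions possible, which is one reason such models are combined with measurements or an optimization objective.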

In his introduction to DABIC, Peter Roepstorff ironically noted that now that the investments are almost completed, the 3-year grant period is coming to an end, leaving it up to the individual investigators' fundraising capabilities to maintain the instrument park. Anybody who has negotiated a service contract on major instruments will know that this will put a heavy drain on grant support over the next years. Roepstorff suggested a longer horizon on grants for this type of equipment, to secure full scientific value for the investment.

Aspergillus niger genome sequence

In one of the scientific highlights of the conference (especially for those of us with vested interests in producing large amounts of protein in fungi), Han de Winde from DSM in Delft disclosed some of the insights gained from having sequenced the Aspergillus niger genome. This filamentous fungus has for years been both a major source of new enzyme products and a widely used host for recombinant production of heterologous proteins. It is of central importance to the Life Sciences division of the DSM industry group, which produces biomass, enzymes, and beta-lactams using only three microbial workhorses: Saccharomyces cerevisiae, Penicillium chrysogenum, and Aspergillus niger.

The sequencing was carried out in a consortium with Qiagen (sequencing) and Biomax (bioinformatics). Using the Bacterial Artificial Chromosome (BAC) technique, sequencing of the 34 Mbase genome was completed in just 15 months. In contrast to the popular shotgun sequencing, the BAC approach is based upon individual sequencing and assembly of hundreds of BACs, each containing 50-100 kb. The genome contains 14,400 genes, of which 45% could be assigned, leaving a major information trove for new discoveries. Even among the assigned half of the genome, close to 400 proteases and carbohydrases were found. These enzyme classes are of major interest as products for food production, for animal feed, and in industrial applications; however, the most interesting applications of the genome information may be for optimization of heterologous protein production. Such optimization is, to a large extent, based upon accumulating beneficial mutations through successive rounds of random mutagenesis and screening. Identifying where the genome is actually mutated in the improved strains is central to understanding the mechanisms of improved protein production. This may be achieved by genome-wide expression analysis of the different mutants, a technique which also has large potential in the subsequent optimization of inoculation and fermentation conditions for the selected strain, but which requires a comprehensive array of markers for the host organism's genes. For that use, Han de Winde revealed that Affymetrix will soon make available a gene chip based on the A. niger sequence information. In an admirable initiative, the sequence information is made available to the academic community, provided that each group enters into a specific disclosure agreement.

In the final talk of the conference, Kristian Almstrup of Copenhagen University Hospital had been selected to present his poster on spermatogenesis. Inspired by the fact that 40% of Danish conscripts to military service have sperm counts that indicate they may have fertility problems, he set out to identify affected genes by a set of different functional genomics approaches. Using differential display competitive PCR, he randomly amplified a large number of gene pieces from cDNA libraries and picked the ones which were differentially expressed in testicular tissues from low-sperm-count mice. These genes were arranged on glass slides for microarray analysis of gene expression during development of these cells, and also used as in-situ hybridization probes. The microarray data were in good agreement with the differential display; however, the in-situ studies showed that an apparent down-regulation in a tissue slide can mask up-regulation in certain cell types which constitute only a minor part of the tissue. Nevertheless, the differentially expressed gene set proved useful for analysing the effects of adding endocrine-disrupting chemicals suspected of causing reduced sperm quality.

This presentation implicitly carried a simple but important message about using several overlapping approaches to address a biological problem. This was appropriate advice for an audience that for two days had been discussing the most specialized and refined methodologies. Users of such exquisite tools are inherently at risk of becoming too reliant on the technological advances of their own particular tool to take up other, supplementary approaches. While the "new biology" is driven forward by technological wizardry, one must remember the virtues of traditional scientific methodology when trying to find useful answers to biological problems.

DBC IX:

With a new topic for each year's Danish Biotechnology Conference, a regular participant will appreciate the breadth of biotechnology research practiced in this region. The first few conferences focused on the industrial and food applications of biotechnology; however, in recent years the congress program has been increasingly open to technology topics, including their application in medicine. DBC IX will be held May 22-23, 2003 with the topic "Eukaryotic Cell Factory and Protein Expression".