INTRODUCTION — Even a complete knowledge of the entire three billion base pairs that make up the human genome (as provided by the Human Genome Project) is insufficient to understand the genetic basis of disease [1]. The sequence data are far more informative if they can be correlated with functional information. This correlation is often accomplished through the use of model systems.
This topic summarizes some of the most useful online sources of genomic information and analysis, and presents selected examples of the use of model systems to explore the relationship between genetic constitution and function. (See "Basic genetics concepts: DNA regulation and gene expression".)
BASIC CONCEPTS — The genome is the full complement of genetic information encoded on a complete, haploid set of chromosomes. The human DNA sequence, as defined by the Human Genome Project, represents a composite compiled by studying DNA obtained from many individuals. While these data include considerable information regarding sequence variation among individuals, inter-individual variation remains incompletely cataloged.
Efforts to define variation between individuals and to relate such variation to disease risk remain central tasks for human geneticists. (See "Basic genetics concepts: DNA regulation and gene expression", section on 'Genetic variation'.)
Much of this investigation is focused upon single nucleotide polymorphisms (SNPs), sequence variations which occur roughly once in several hundred bases [2].
The high density of SNPs and development of methods to efficiently screen for them allow use of SNPs to identify functional genes. Such genes contribute to a wide array of traits, including susceptibility to polygenic diseases and response to pharmaceutical agents [3-5].
Two complementary approaches are being pursued to accomplish the goal of relating gene composition to function, or genotype to phenotype:
●Epidemiologic studies seek to establish associations between genes and traits.
●Functional studies seek to elucidate the mechanisms by which sequence variation leads to phenotypic differences.
Structural and functional conservation of the genome over the course of evolution makes genomics a fruitful approach to studying human physiology and pathology. Genes and their protein products are highly conserved. As an example, approximately 30 percent of human genes have an equivalent in yeast (ie, a homolog). The percentage of homologous genes increases as the evolutionary distance between humans and model organisms decreases. The genome sequence of man's closest relative, the chimpanzee, shares the highest degree of sequence conservation with the human genome, with only approximately 1.2 percent divergence at the single nucleotide level.
Another type of evolutionary relationship is evident in the organization of the genome: groups of genes physically clustered together on one chromosome in one organism are often clustered together in other divergent species (so-called syntenic conservation) [6-9].
GENOME-BASED EXPERIMENTAL DESIGNS
Genome-wide association studies (GWAS) — A genome-wide association study (GWAS) is a gene mapping study that assesses evidence of association between genetic variants and heritable traits across the genome. The performance of GWAS is feasible because of the availability of dense SNP coverage across the human genome and genotyping platforms that allow one million genotypes to be assessed in a single experiment. Typical studies consist of genotyping hundreds of thousands of common SNPs, using DNA microarrays in large case-control populations, with the goal of identifying specific risk alleles that are more prevalent in cases than in controls. (See "Genetic association and GWAS studies: Principles and applications".)
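The case-control comparison described above can be sketched for a single SNP as a test on a 2x2 table of allele counts. This is a minimal illustration, not the method of any cited study: the function name `allelic_association` and the example counts are hypothetical, and real analyses typically also adjust for covariates, population structure, and the very large number of SNPs tested.

```python
import math


def allelic_association(case_alt, case_ref, control_alt, control_ref):
    """Return (odds_ratio, chi_square, p_value) for one SNP.

    Inputs are allele counts (two alleles per person) in cases and
    controls. All counts are assumed to be greater than zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    total = case_alt + case_ref + control_alt + control_ref
    row_sums = [case_alt + case_ref, control_alt + control_ref]
    col_sums = [case_alt + control_alt, case_ref + control_ref]

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
    chi_square = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi_square += (table[i][j] - expected) ** 2 / expected

    # Odds of carrying the alternate allele in cases relative to controls.
    odds_ratio = (case_alt * control_ref) / (case_ref * control_alt)

    # With 1 degree of freedom, P(X > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return odds_ratio, chi_square, p_value


if __name__ == "__main__":
    # Hypothetical counts: 1000 cases and 1000 controls (2000 alleles each).
    result = allelic_association(600, 1400, 500, 1500)
    print("odds ratio, chi-square, p-value:", result)
```

An odds ratio close to 1 indicates that the allele is about equally frequent in cases and controls. Because hundreds of thousands of SNPs are tested in one study, the p-value for any single SNP has to be judged against a far stricter threshold than the conventional 0.05.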
GWAS investigations of at least 50 traits or diseases have been reported. While these studies have succeeded in identifying variants in linkage disequilibrium with clinical disorders, the contributions of the allele-specific effects have been only modest [10]. As an example, one study of stature identified three alleles affecting height, but the most powerful of these had an estimated effect size of less than half a centimeter [11].
Several databases of GWAS investigations are available. (See 'Online sources of genetic information' below.)
Mendelian randomization — GWAS has allowed identification of many SNPs associated with continuously varying traits, in the best cases accounting for a substantial fraction of the genetic trait variance.
Mendelian randomization refers to another method of analyzing epidemiologic data to determine possible causal associations between gene variants and clinical phenotypes. As an example, suppose one wishes to determine whether a trait A contributes to a second trait B, and GWAS has been done for trait A. The Mendelian randomization study design tests whether there is a causal relationship between trait A and trait B [12]. A subset of SNPs associated with trait A serves as the "genetic instrument" for trait A. Because alleles are randomly assorted at meiosis and genotype is fixed at conception, prior to phenotype emergence, genotype serves as a means to randomize the population, mimicking the randomization of participants in a traditional randomized trial, with genotype serving as a proxy for trait A. (See "Mendelian randomization".)
Mendelian randomization offers an attractive opportunity to make causal inferences in observational studies as in randomized trials, but it is essential to recognize that it relies on three strong assumptions, each of which can be questioned in most clinical settings [13]:
●The SNPs must be associated with trait A and the included SNPs should be representative of trait A's total genetic variation. In practice, the SNPs used as genetic instruments in a Mendelian randomization study typically account for a small percentage of trait A's variance.
●The influence of the chosen SNPs must be mediated via trait A, rather than via any alternative pathway. This is commonly referred to as the "no pleiotropy" assumption.
●The chosen SNPs must be independent of other potential confounding variables.
In practice, verifying these assumptions is challenging and mandates that Mendelian randomization be interpreted with appropriate caution. Moreover, negative findings may be more reliable than positive findings. For example, Mendelian randomization findings are inconsistent with C-reactive protein (CRP) levels contributing to atherosclerotic disease risk directly, and instead suggest that the observed association between CRP and atherosclerotic disease is mediated by confounding factors such as excess body weight [14]. In contrast, various putative causal links between CRP and cardiovascular disease outcomes have been questioned [15]. (See "C-reactive protein in cardiovascular disease".)
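One common way to turn the genetic instrument into an estimate of the effect of trait A on trait B is sketched below. It assumes that summary statistics are available for each SNP: the per-allele effect on trait A, the per-allele effect on trait B, and the standard error of the latter. The Wald ratio and the inverse-variance weighted (IVW) combination are standard estimators in this field, but the source does not say which estimator the cited studies used. The function names and numbers are hypothetical.

```python
import math


def wald_ratio(beta_exposure, beta_outcome):
    """Causal estimate from one SNP: effect on trait B per unit of trait A."""
    return beta_outcome / beta_exposure


def ivw_estimate(snps):
    """Inverse-variance weighted estimate across several SNPs.

    `snps` is an iterable of (beta_exposure, beta_outcome, se_outcome)
    tuples. Returns (estimate, standard_error). The result is only
    interpretable as causal if the three assumptions listed above hold.
    """
    snps = list(snps)  # allow generators, since the data are read twice
    numerator = sum(bx * by / se ** 2 for bx, by, se in snps)
    denominator = sum(bx ** 2 / se ** 2 for bx, by, se in snps)
    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error


if __name__ == "__main__":
    # Hypothetical summary statistics for three instrument SNPs.
    instruments = [
        (0.10, 0.050, 0.01),
        (0.20, 0.100, 0.02),
        (0.05, 0.025, 0.01),
    ]
    for bx, by, _ in instruments:
        print("Wald ratio:", wald_ratio(bx, by))
    print("IVW estimate and SE:", ivw_estimate(instruments))
```

Neither estimator can detect a violation of the "no pleiotropy" assumption on its own. Marked disagreement among the per-SNP Wald ratios is one informal warning sign that some instruments act through another pathway.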
Expression quantitative trait loci (eQTL) analysis — The pattern of gene expression within a tissue can be treated as a series of quantitative traits, analogous to stature. The expression of many genes can be measured simultaneously using microarrays. Either linkage mapping (expression quantitative trait loci mapping, eQTL mapping) or GWAS approaches can be used to map and identify sequence variants that account for differences in the levels of expression of specific genes [16]. Results from such studies can be used to prioritize variants (ie, SNPs) for possible functional impact on human disease. Variants that confer changes in gene expression are more likely to lead to phenotypic variation at the cellular or clinical level.
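Treating expression as a quantitative trait can be illustrated with a simple regression of a gene's expression level on genotype at one SNP. This is a minimal sketch under stated assumptions: genotype is coded as the number of alternate alleles (0, 1, or 2), the effect is additive, and no covariates are included. The function name `eqtl_regression` and the data are hypothetical, and published eQTL analyses generally use more elaborate models.

```python
def eqtl_regression(dosages, expression):
    """Least-squares fit of expression on allele dosage for one SNP-gene pair.

    Returns (slope, intercept, r_squared). The slope is the estimated
    change in expression per additional copy of the alternate allele.
    Both inputs must vary across samples.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched dosage and expression values")

    mean_g = sum(dosages) / n
    mean_e = sum(expression) / n

    # Sums of squares and cross-products about the means.
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    syy = sum((e - mean_e) ** 2 for e in expression)
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(dosages, expression))

    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Hypothetical data: six individuals, two per genotype class.
    dosages = [0, 0, 1, 1, 2, 2]
    expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
    print(eqtl_regression(dosages, expression))
```

Repeating this fit for every SNP-gene pair of interest gives the kind of ranking that can be used to prioritize variants, with the same multiple-testing caveats that apply to GWAS.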
Deep sequencing — Technical advances in DNA sequencing, assembly of sequencing reads, and statistical analysis of sequence data have dramatically reduced the cost of sequencing. This has enabled "deep sequencing" (also called "next-generation sequencing") projects in which the whole exome (protein-encoding portion of the genome) or the whole genome is sequenced to seek biologically-significant sequence variation [17-19]. (See "Next-generation DNA sequencing (NGS): Principles and clinical applications".)
Deep sequencing studies have revealed the contribution of rare variants to disease [20,21]. In polygenic disorders, deep sequencing has the potential to account for the "missing heritability" that is encountered in GWAS designs [22,23].
RNA-seq — Advances in sequencing technology and statistical analysis have made it possible for many laboratories to sequence the mRNA of a sample, hence the name RNA-seq [24]. This approach is now often preferred to gene expression arrays as a means of characterizing the transcriptome (set of expressed genes) of a sample. RNA-seq offers two important advantages over array-based methods:
●It is better able to characterize alternative splicing as a regulatory mechanism
●It provides absolute as well as relative quantification of gene expression
Technical advances have also made it possible to undertake RNA-seq using single cells as samples [25].
Immunoprecipitation techniques (ChIP on Chip, ChIP-Seq) — Immunoprecipitation (IP), commonly used to identify protein-protein interactions, has more recently been used to identify protein-DNA interactions, including genome-wide binding patterns of transcription factors and other DNA-binding elements. Results from these studies can be used to map transcription factor binding sites to each of the genes whose expression they control and to define the functional noncoding regions of the genome sequence. Many such annotations are available through the UCSC Human Genome Browser.
Chromatin immunoprecipitation (ChIP) is performed by immunoprecipitation (IP) of fragmented but unpurified DNA, using a transcription factor or other DNA binding protein as the target of the IP. Following the precipitation, the DNA is purified and assayed. Assays are performed genome-wide through the use of microarrays ("gene chips," hence the term "ChIP on Chip") [26]. ChIP-seq is a related methodology in which the immunoprecipitated DNA is sequenced instead of being hybridized to a microarray. This approach requires access to high-throughput sequencing and supporting bioinformatics resources. The resulting data constitute a map of the antigen's DNA binding sites. (See 'Deep sequencing' above.)
Epigenomics — An individual's genome is essentially constant across tissues, but the pattern of gene expression among tissues varies greatly. This observation points to the existence of regulatory mechanisms beyond DNA sequence variation. Transcriptional activity of a genomic region is determined in part by the methylation status of the DNA and how the DNA is packaged into chromatin, which in turn is dependent in part on the modification of histone proteins. These epigenetic modifications underlie tissue-specific differences in gene expression, X-chromosome inactivation, and imprinting. (See "Principles of epigenetics" and "Genetics: Glossary of terms", section on 'X-inactivation' and "Inheritance patterns of monogenic disorders (Mendelian and non-Mendelian)", section on 'Parent-of-origin effects (imprinting)'.)
DNA methylation can be measured on a genome-wide scale, and such experiments are routine in many laboratories [27]. Chromatin immunoprecipitation (ChIP) is a robust approach to identifying modified histones, enabling genome-wide assessment of specific histone modifications [28]. (See 'Immunoprecipitation techniques (ChIP on Chip, ChIP-Seq)' above.)
Gene editing — The value of model organisms has been enhanced by the development and widespread use of gene editing technology [29]. (See "Overview of gene therapy, gene editing, and gene silencing".)
Gene editing has improved the accuracy and speed with which specific variants can be introduced into the genome. Just as gene editing is being used to correct molecular lesions in a therapeutic context, it can be used to engineer specific mutations for study in model organisms [30-34]. Gene editing technology has had a greater impact on multicellular organisms, where isolating and maintaining germ cells is challenging. For unicellular organisms, targeted mutagenesis has long been a standard method.
EXAMPLES OF MODEL SYSTEMS — Although not comprehensive, a few examples of model systems are presented here to illustrate the use of appropriately chosen model systems to relate genetic and functional data. The value of model systems depends upon establishing a proper balance between the simplification they offer and the extent to which they mirror human physiology. These factors usually are inversely related, as summarized in the table comparing the different systems (table 1).
Model systems allow investigators to relate raw DNA sequence data with functional information. Models are useful insofar as they offer the technical means to accomplish studies that could not be undertaken in human subjects; model organisms permit a host of functional studies that could never be undertaken in a clinical setting.
Prokaryotes and yeast — Prokaryotic and yeast models derive much of their value from the ability to investigate large experimental populations in search of rare individuals. Bacteriophages (phages) are a type of virus that infects bacteria. Yeast are single-celled eukaryotes that contain many well-conserved genes, many of which have human homologues.
Standardized protocols can be used for selecting and/or screening for specific mutations. By selection, only individual organisms possessing a specific metabolic property survive to be examined. Screening employs identification of a specific metabolic property in some organisms that results in an easily scored difference in phenotype, allowing such organisms to be reliably identified, isolated, and recovered for further study.
Bacteriophage lambda — The bacteriophage lambda model has advanced general understanding of transcriptional changes in response to environmental stimuli. Lambda is a temperate phage that can grow in either a lytic or lysogenic life cycle. Lysogeny is manifested morphologically by production of turbid plaques, in which surviving infected bacteria are present.
There are various lambda mutants that produce clear plaques in which survival of infected bacteria is much reduced or that behave aberrantly with regard to superinfection. These properties are understood at a detailed molecular level.
Interactions of human transcription factors with their cognate DNA response elements and the resulting modulation of transcription are more complex variations on the themes first defined in prokaryotes.
Yeast — Saccharomyces cerevisiae, a budding yeast, is one of the most fruitful model systems. Yeast have many properties that make them ideal model organisms: the possibility of targeted gene replacement, growth as both a haploid and a diploid organism, and an array of selection and screening schemes. The sequence of the entire S. cerevisiae genome has been available for decades.
An example of the application of these features was the 1999 report of the effects of systematic deletion of over 2000 yeast genes under a variety of growth conditions [35]. A significant insight arising from studies in yeast is the discovery of the sirtuin family of proteins, now recognized as critical determinants of longevity [36].
The ease of experimental manipulation in yeast makes this model system particularly useful for performing experiments in which many genotypes are screened for function. As an example, the yeast 2-hybrid system allows a systematic search for interactions among proteins [37]. This strategy was applied in a study in which approximately 1000 interacting protein pairs were detected [38]. Importantly, the method is not restricted to yeast proteins, so that interacting proteins from any species can be studied in yeast [39].
Protein-protein interactions are important features of virtually every cellular process. At the most basic level, it is important to recall that protein function depends on each protein assuming its proper conformation during synthesis; misfolded proteins lose function. So-called "chaperone" proteins interact with nascent peptides to assist in their correct folding during protein synthesis. While nascent protein-chaperone interactions can be studied individually and in other models, the yeast model system is particularly well-suited to global functional analysis of these interactions [40].
This experimental strategy has been adapted to seek interactions between proteins from other organisms. As noted above, approximately one-third of human genes have yeast homologs and understanding of human biology has been greatly advanced by study of these in the more easily manipulated yeast system.
Data about yeast are readily available at the Saccharomyces genome database (www.yeastgenome.org) [41].
Nematode — The nematode, Caenorhabditis elegans, is used extensively as a model system [42-50]. Use in developmental biology derives from the ability to trace the cell lineages of every cell [51,52]. C. elegans is a popular model in neurology, as its complete neural circuitry is known, and in studies of aging, as a number of lifespan mutants have been discovered [42-50]. The C. elegans system has also provided important insights regarding the role of molecular chaperones in protecting against neurodegenerative diseases [53].
C. elegans data are available through WormBase and the Sanger Centre's Worm Genome page [54,55].
Fruit fly — The fruit fly, Drosophila melanogaster, the first organism in which detailed linkage maps were constructed, remains an extremely useful model.
It has been particularly informative in studies of how spatial information is established during development. The Wnt signaling pathway occupies a central role in mediating multiple aspects of mammalian development and physiology, and it was first found through its homology to the Drosophila wingless gene. Wnt signaling is critical in establishing the body plan and regulating cell polarity and proliferation, including regulating skeletal mass [56,57]. (See "Normal skeletal development and regulation of bone formation and resorption".)
An excellent online resource is Flybase [58].
Zebrafish — The zebrafish is useful because its embryos develop freely and are transparent, and because genetic manipulations can be readily performed. Zebrafish are particularly well-suited to studying environmental factors such as drugs and thermal stress because of their small size, aquatic environment, and poikilothermic metabolism. They are also easily amenable to in vivo RNA interference (RNAi), a molecular technique for targeted silencing of specific genes, allowing a wide variety of developmental features to be studied.
As an example, RNAi-mediated suppression of HSP90 (90 kD heat shock protein) across several strains of zebrafish resulted in variable ocular development, with some strains demonstrating severe abnormalities and others a milder phenotype [59]. The issues addressed in this study fully exploit the advantages of zebrafish as a model system.
The Zebrafish Information Network provides further information and additional links [60].
Mouse and rat — Mice are commonly used to study aspects of mammalian physiology that cannot be modeled in invertebrates or non-mammalian vertebrates. Mice are easy to breed, have been extensively studied genetically, and are readily available. There are hundreds of inbred strains, as well as congenic, recombinant inbred, and recombinant congenic strains. The numerous mouse mutants available involve all aspects of physiology.
Mice are also used to study the effects of mutating or complementing specific genes, through the use of transgenic technology. This allows study of the consequences of alterations in candidate genes. With the use of Cre-loxP systems, it is possible to engineer specific mutations and to restrict genetic lesions to specific tissues. Specific "reporter" mice are widely used to validate these manipulations [61] and to assess the activity of specific pathways [62]. These capabilities are being exploited by the International Mouse Phenotyping Consortium, which seeks to generate floxed knockout alleles for every mouse gene, phenotype these animals, annotate the findings, and make these resources widely available to investigators [63].
Humanized mice have allowed investigators to engraft human hematopoietic immune response cells into immunodeficient mice that can be manipulated experimentally. (See "Tools for genetics and genomics: Specially bred and genetically engineered mice".)
The Jackson Laboratory's Mouse Genome Informatics site provides access to information about genes, mutants, inbred strains, mapping and developmental data, and homology to genes in other organisms [64-66]. Data on the rat genome can be accessed at the Rat Genome Database [67].
ONLINE SOURCES OF GENETIC INFORMATION
●OMIM – The most clinically oriented genome-based database is OMIM (Online Mendelian Inheritance in Man). OMIM is a manually curated compendium of genes and genetic phenotypes, regularly updated by reviewers who cull the published literature. Each entry is a full-text overview, with references hyperlinked to PubMed entries. OMIM's emphasis is the relationship among gene locus, genetic variation, and their phenotypic consequences. OMIM also provides the historical context in which understanding about various genes and diseases has emerged.
●GTR – For clinicians needing guidance in management of patients with genetic conditions, the Genetic Testing Registry (GTR) is a database that provides information on genetic conditions and laboratories that offer testing for pathogenic variants in disease genes.
●GeneCards – The Weizmann Institute's GeneCards site is a well-organized gene-centered database, providing genomic, proteomic, and functional information on all known and predicted human genes [68]. The included information is primarily geared toward researchers, but includes links to diseases, drugs, and drug candidates.
●National Institutes of Health (NIH) – The National Center for Biotechnology Information (NCBI), operated by the National Library of Medicine, consists of a series of interconnected databases (the full list is given at www.ncbi.nlm.nih.gov/sites/gquery?itool=toolbar), and serves as a repository for all publicly submitted genomic data. The site includes nucleic acid sequence generated through federally-funded genome projects, genetic sequence variants, and results from gene expression profiling studies and genome-wide association studies. The Entrez browser system is one possible entry point (www.ncbi.nlm.nih.gov/gquery/gquery.fcgi). Many readers are already familiar with this site's medical literature search system, PubMed.
An Entrez portal (the genotypes and phenotypes database, dbGaP www.ncbi.nlm.nih.gov/sites/entrez?Db=gap) allows users to search by study or disease. The National Human Genome Research Institute (NHGRI) and the European Molecular Biology Laboratory (EMBL) also collaborate to produce a database of published GWAS at http://www.ebi.ac.uk/gwas/.
The Entrez system also allows users to conduct a variety of analyses. For example, the Basic Local Alignment Search Tool (BLAST) enables searching of all nucleotide (DNA) or protein databases for sequence homologies to a query sequence [69-75]. BLAST reports include all sequence matches and provide metrics of the strength of alignment, confidence scores for the alignment, and graphical depictions of the alignments.
While Entrez is extremely flexible and comprehensive, its interface is not as easy to use as that of several other resources that are more graphics-based. Two easily used graphics-based genome browsers are Ensembl (www.ensembl.org/index.html) [76] and the UCSC Genome Browser [77]. These browsers provide graphical representation of the genome with sequence annotation of genes, variants, functional elements, and other types of genomic data. Much of the information presented is derived from the NCBI database, and daily cross-references and updates of NCBI, Ensembl, and the UCSC browser ensure similar query retrievals regardless of which interface is used for the query.
●Broad Institute – The Broad Institute hosts a suite of tools for analyzing sequence data, the Genome Analysis Toolkit (GATK, https://gatk.broadinstitute.org/hc/en-us). In addition to software for undertaking analysis of next-generation sequencing data, the site also features extensive documentation that helps users to follow best practices in seeking sequence variants.
The genome aggregation database (gnomAD) is one such tool, integrating reference sequence data collected across various experimental platforms and facilitating variant interpretation [78]. The Broad Institute also hosts cancer program scientific tools (https://www.broadinstitute.org/cancer/cancer-program-scientific-tools-and-resources) and disorder-focused knowledge portals (https://hugeamp.org) for several disease groups, including type 1 and type 2 diabetes, cardiovascular disease, cerebrovascular disease, sleep disorders, amyotrophic lateral sclerosis, lung disease, musculoskeletal disorders, and lipid droplet biology [79].
SUMMARY
●Rationale – Understanding the genetic basis for human disease involves knowledge both of genome sequences and of gene function. Knowledge of variations in gene sequence is largely dependent on the use of single nucleotide polymorphisms (SNPs) in a variety of study designs. Gene function studies are enabled by the use of model systems. (See 'Basic concepts' above.)
●Experimental approaches to genome analysis – Genome-based experimental designs include genome-wide association studies (GWAS), quantitative analysis of gene expression (eQTL analysis), and studies involving immunoprecipitation techniques. Although GWAS have identified multiple variants associated with clinical disorders, the physiologic effect of the GWAS-identified individual variant alleles has been relatively modest. (See 'Genome-based experimental designs' above.)
●Animal model systems – Multiple model systems, chosen appropriately for their unique properties, can give insight into the relationship between genetic composition and function within the species. The value of model systems depends upon establishing a proper balance between the simplification they offer and the extent to which they mirror human physiology. Advantages and disadvantages of different model systems are summarized in the table (table 1). (See 'Examples of model systems' above.)
●Online resources – Links to online resources including OMIM, GTR, NIH sites (National Library of Medicine, Entrez), and the Broad Institute, are listed above. (See 'Online sources of genetic information' above.)