Data Analysis and BioInformatics in real-time qPCR (1)

What is Bioinformatics?
In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes of several species. In fact, to date, several bacterial genomes, as well as those of some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast)  and more complex eukaryotes (C. elegans and Drosophila) have been sequenced in full. The Human Genome Project, designed to sequence all 24 of the human chromosomes, is also progressing and a rough draft was completed in the spring of 2000.

Types of data

Analysis and Interpretation of Data

The various types of data:
Many different types of data are collected and stored in databases to facilitate retrieval. Depicted here are amino acid sequences, protein domain cartoons, different renderings of three-dimensional structures, and protein hydrophobicity data. Databases consisting of data derived experimentally such as nucleotide sequences and three-dimensional structures are known as primary databases. Those data that are derived from the analysis or treatment of primary data such as secondary structures, hydrophobicity plots, and domains are stored in secondary databases. A protein database consisting of the conceptual translation of nucleotide sequences would also be considered a secondary database.

The analysis and interpretation of various data types:  Illustrated here are various ways in which individual entries in sequence and structure databases can be compiled to reveal patterns and trends in biology. For example, sequence families or neighborhoods can be defined and annotated based on the similarity of each sequence to other members of the family. Common sequence features in sequence families can be identified in multiple alignments. These motifs may provide clues to the biochemical function of members of the family. Clustering of sequences into trees that reflect the degree of similarity between each sequence and all of the others in the family reveals evolutionary relationships. Finally, identification of homologs to each gene in well-characterized metabolic pathways provides information about the prevalence of that pathway in other organisms.

qPCR software applications:

Molecular Biology Freeware for Windows


Good places to start are Genamics SoftwareSeek, BioExchange and eBioinfogen. For general software see Winsite. The following sites are arranged in the order in which I discovered them. At some point they will be clustered by preference:

A. DNA, RNA and genomic analysis
B. Plasmid graphic packages
C. Primer design
D. Protein analysis
E. Viewing three dimensional structures
F. Alignments
G. Phylogeny
H. Miscellaneous

Statistical power calculations
R. V. Lenth
Department of Statistics and Actuarial Science, University of Iowa, Iowa City 52242

ABSTRACT: This article focuses on how to do meaningful power calculations and sample-size determination for common study designs. There are 3 important guiding principles. First, certain types of retrospective power calculations should be avoided, because they add no new information to an analysis. Second, effect size should be specified on the actual scale of measurement, not on a standardized scale. Third, rarely can a definitive study be done without first doing a pilot study. Some simple examples as well as a complex example are given. Power calculations are illustrated using Java applets developed by the author.

Java applets for power and sample size

This software is intended to be useful in planning statistical studies.  It is not intended to be used for analysis of data that have already been collected. Each selection provides a graphical interface for studying the power of one or more tests.  They include sliders (convertible to number-entry fields) for varying parameters, and a simple provision for graphing one variable against another.
Each dialog window also offers a Help menu.  Please read the Help menus before contacting me with questions.
The "Balanced ANOVA" selection provides another dialog with a list of several popular experimental designs, plus a provision for specifying your own model.

Note: The dialogs open in separate windows. If you're running this on an Apple Macintosh, the applets' menus are added to the screen menubar -- so, for example, you'll have two "Help" menus there!

You may also download this software to run it on your own PC.

Power Calculator

    Written in PHP by Arno Ouwehand, using the DSTPLAN distribution by Barry Brown et al. These calculators extend the functionality of the old Xlisp-Stat based Power Calculator: they not only compute the power for a given sample size, or the sample size for a given power, but also compute the other available items when specified.
Further statistical calculators are provided by the UCLA Department of Statistics.
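
The relationship between power and sample size that such calculators work with can be sketched in a few lines of Python. This is only a minimal illustration, not the method used by any of the tools listed here: it assumes a two-sided comparison of two group means with equal group sizes and a common standard deviation, and it uses a normal approximation (a t-based calculation gives slightly larger sample sizes). The function names are hypothetical. In line with the guidance above, the effect size `delta` and the standard deviation `sigma` are given on the actual measurement scale (for qPCR, for example, in Cq cycles).

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def sample_size_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate n per group needed to detect a mean difference `delta`.

    Assumptions (illustrative only): two-sided test, two equal-sized groups,
    common standard deviation `sigma`, normal approximation.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


def power_for_sample_size(n, delta, sigma, alpha=0.05):
    """Approximate power for `n` observations per group.

    The negligible contribution of the opposite rejection tail is ignored.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    standard_error = sigma * sqrt(2 / n)
    return _STD_NORMAL.cdf(delta / standard_error - z_alpha)
```

For example, detecting a difference of one cycle when the standard deviation is also one cycle gives `sample_size_per_group(1.0, 1.0)` equal to 16 per group under these assumptions. A pilot study is still needed to obtain a realistic value for `sigma`.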

URI Genomics & Sequencing Center
Calculator for determining the number of copies of a template
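
A calculation of this kind can be sketched as follows. This is a commonly used approximation and not necessarily the exact formula behind the calculator above: it assumes a double-stranded DNA template and an average mass of 650 g/mol per base pair. The function name and the constant names are hypothetical.

```python
AVOGADRO = 6.022e23          # molecules per mole
AVG_BP_MASS_G_PER_MOL = 650  # assumed average mass of one dsDNA base pair


def template_copies(mass_ng, length_bp, bp_mass=AVG_BP_MASS_G_PER_MOL):
    """Estimate the number of template molecules in `mass_ng` nanograms.

    Assumes double-stranded DNA of `length_bp` base pairs; for single-stranded
    templates a different per-base mass would have to be supplied.
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be positive")
    grams = mass_ng * 1e-9
    molar_mass = length_bp * bp_mass  # g/mol for the whole template
    return grams * AVOGADRO / molar_mass
```

Under these assumptions, 1 ng of a 1,000 bp template corresponds to roughly 9.3 x 10^8 copies.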

qPCR-DAMS:  a Database Tool to Analyze, Manage, and Store Both Relative and Absolute Quantitative Real-Time PCR data.
Jin N, He K, Liu L.
Physiol Genomics. 2006
Physiological Sciences, Oklahoma State University, Stillwater, OK, USA.

Quantitative real-time PCR is an important high throughput method in biomedical sciences. However, existing software has limitations in handling both relative and absolute quantification. We designed qPCR-DAMS (Quantitative PCR Data Analysis and Management System), a database tool based on Access 2003, to deal with such shortcomings by the addition of integrated mathematical procedures. qPCR-DAMS allows a user to choose among four methods for data processing within a single software package: (I) Ratio relative quantification, (II) Absolute level, (III) Normalized absolute expression, and (IV) Ratio absolute quantification. qPCR-DAMS also provides a tool for multiple reference gene normalization. qPCR-DAMS has three quality control steps and a data display system to monitor data variation. In summary, qPCR-DAMS is a handy tool for real-time PCR users.
Availability:  This software is free for academic use and downloadable at
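
Absolute quantification of the kind mentioned in the abstract is usually based on a standard curve. The sketch below is not taken from qPCR-DAMS; it is a generic illustration that assumes a dilution series with known copy numbers and a linear relationship between Cq and log10(copies). All function names are hypothetical.

```python
from math import log10


def fit_standard_curve(copies, cq_values):
    """Least-squares fit of Cq = slope * log10(copies) + intercept."""
    x = [log10(c) for c in copies]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(cq_values) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, cq_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def efficiency_from_slope(slope):
    """Amplification factor per cycle; 2.0 corresponds to perfect doubling."""
    return 10 ** (-1 / slope)


def copies_from_cq(cq, slope, intercept):
    """Read an unknown sample's copy number off the fitted standard curve."""
    return 10 ** ((cq - intercept) / slope)
```

A slope close to -3.32 corresponds to an amplification factor close to 2.0; unknown samples should fall within the range covered by the standards.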

FastPCR - 1998-2005 v.3.6 (for Windows)
PCR primer design, DNA and protein tools, repeats and own database searches

FastPCR is free software for Microsoft Windows. It is based on a new approach to the design of PCR primers for standard and long PCR, inverse PCR, degenerate PCR designed directly from amino acid sequences, multiplex PCR and in silico PCR; it also supports sequence alignment, clustering and searching for any kind of repeat sequence.

At present the program runs only on Microsoft Windows, but C# .NET versions for Linux and Mac are in preparation.

FastPCR can work with multiple nucleic acid or protein sequences simultaneously (up to 1,000,000). Multiplex PCR primer design and "in silico" PCR are also supported. The program is well suited to homology searches against personal databases, using an approach similar to the basic local alignment search tool (BLAST) algorithm (a segment-to-segment alignment principle similar to DIALIGN). It includes various bioinformatics tools and supports the clustering of sequences. A new repeat-search approach was developed and applied in the program, which makes searches for all types of DNA repeats fast and powerful.

FastPCR software has several specific, ready-to-use templates for many PCR and sequencing applications:

  • Standard, inverse and long PCR - Locates optimal primers for PCR, hybridisation, or sequencing.
  • Multiplex PCR primer design - fast primer design with a cross-dimer test for highly sensitive multiplex PCR.
  • Design group-specific PCR primers.
  • Degenerate PCR: primers are designed directly on an amino acid sequence.
  • In Silico PCR - prediction of probable PCR products and search for mismatched primer locations.
  • Primer secondary structures - self-dimer and cross-dimer primer analyses; primer alignment and melting temperature calculation.
  • False priming - checking primers for multiple annealing sites using sequence alignment algorithms.
  • Primer quality - a unique way for PCR efficiency determination.
  • Comprehensive primer report - comprehensive pairs and individual primers analysis.

The software supports several file formats: FASTA, text and Excel files.


  • Primer tests and dimer detection;
  • Powerful Repeats Search: Invert, Direct, Simple and others;
  • Clustering Sequences;
  • Make complement, reverse complement and inverted strand;
  • Search the sequence with the universal degenerate code, with alignment;
  • Extract the sequence from selected sites;
  • Protein/DNA translation;
  • Calculation of the PCR annealing temperature when the PCR product is unknown;
  • Database tools;
  • Restriction analysis.
  • Each application document contains customisable search settings, based on the latest published primer selection criteria for those applications.
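
To illustrate what a search with the degenerate code involves, here is a minimal Python sketch. It is not FastPCR's algorithm: it assumes exact matching of a primer written in standard IUPAC nucleotide codes against one strand of a sequence, with no alignment and no mismatch tolerance. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def degenerate_to_regex(primer):
    """Translate a degenerate primer into a regular-expression pattern."""
    return "".join(IUPAC[base] for base in primer.upper())


def find_degenerate(primer, sequence):
    """Return 0-based start positions of all matches on the given strand.

    A lookahead is used so that overlapping matches are reported as well.
    """
    pattern = re.compile("(?=(" + degenerate_to_regex(primer) + "))")
    return [match.start() for match in pattern.finditer(sequence.upper())]
```

For example, `find_degenerate("ATRG", "CCATAGGATGGAT")` reports matches at positions 2 and 7, because R stands for either A or G.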

Bioinformatics analysis of alternative splicing
Christopher Lee & Qi Wang
Briefings in Bioinformatics      Volume: 6 Number: 1 Page: 23 -- 33

Over the past few years, the analysis of alternative splicing using bioinformatics has emerged as an important new field, and has significantly changed our view of genome function. One exciting front has been the analysis of microarray data to measure alternative splicing genome-wide. Pioneering studies of both human and mouse data have produced algorithms for discerning evidence of alternative splicing and clustering genes and samples by their alternative splicing patterns. Moreover, these data indicate the presence of alternative splice forms in up to 80 per cent of human genes. Comparative genomics studies in both mammals and insects have demonstrated that alternative splicing can in some cases be predicted directly from comparisons of genome sequences, based on heightened sequence conservation and exon length. Such studies have also provided new insights into the connection between alternative splicing and a variety of evolutionary processes such as Alu-based exonisation, exon creation and loss. A number of groups have used a combination of bioinformatics, comparative genomics and experimental validation to identify new motifs for splice regulatory factors, analyse the balance of factors that regulate alternative splicing, and propose a new mechanism for regulation based on the interaction of alternative splicing and nonsense-mediated decay. Bioinformatics studies of the functional impact of alternative splicing have revealed a wide range of regulatory mechanisms, from NAGNAG sites that add a single amino acid; to short peptide segments that can play surprisingly complex roles in switching protein conformation and function (as in the Piccolo C2A domain); to events that entirely remove a specific protein interaction domain or membrane anchoring domain. Common to many bioinformatics studies is a new emphasis on graph representations of alternative splicing structures, which have many advantages for analysis.

Comparison of different melting temperature calculation methods for short DNA sequences.
Alejandro Panjkovich & Francisco Melo
Bioinformatics (21,6):  711 -- 722

Motivation: The overall performance of several molecular biology techniques involving DNA/DNA hybridization depends on the accurate prediction of the experimental value of a critical parameter: the melting temperature Tm. To date, many computer programs based on different methods and/or parameterizations are available for the theoretical estimation of the experimental Tm value of any given short oligonucleotide sequence. However, in most cases, large and significant differences in the estimated Tm are obtained when different methods are used. Thus, it is difficult to decide which Tm value is the accurate one. In addition, it seems that most people who use these methods are unaware of their limitations, which are well described in the literature but are neither stated properly nor enforced as input restrictions by most of the web servers and standalone software programs that implement them.
Results: A quantitative comparison of the similarities and differences among some of the published DNA/DNA Tm calculation methods is reported. The comparison was carried out for a large set of short oligonucleotide sequences ranging from 16 to 30 nt long, which span the whole range of CG-content. The results showed significant differences among all the methods, which in some cases depend on the oligonucleotide length and CG-content in a non-trivial manner. Based on these results, the regions of consensus and disagreement for the methods in the oligonucleotide feature space were reported. Owing to the lack of sufficient experimental data, a fair and complete assessment of accuracy for the different methods is not yet possible. In spite of this limitation, a consensus Tm with minimal error probability was calculated by averaging the values obtained from two or more methods that exhibit similar behavior for each particular combination of oligonucleotide length and CG-content class. Using a total of 348 DNA sequences in the size range between 16mer and 30mer, for which the experimental Tm data are available, we demonstrated that the consensus Tm is a robust and accurate measure. It is expected that the results of this work will constitute a useful set of guidelines to be followed for the successful experimental implementation of various molecular biology techniques, such as quantitative PCR, multiplex PCR and the design of optimal DNA microarrays.
Availability: A binary software distribution to calculate the consensus Tm described in this work for thousands of oligonucleotides simultaneously for the LINUX operating system is freely available upon request to the authors or from our website
Supplementary information: The large set of oligonucleotides, the detailed results of the comparative and accuracy benchmarks, and hundreds of comparative graphs generated during this work are available at our website
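
The idea of averaging estimates from more than one method, and of checking how far the methods disagree, can be sketched as follows. This is not the authors' software and does not use the methods benchmarked in the paper: as stand-ins it assumes two simple textbook formulas (the Wallace rule and a basic GC-content formula), and the agreement threshold `max_spread` is an arbitrary illustrative value. All function names are hypothetical.

```python
def tm_wallace(seq):
    """Wallace rule: 2 degrees C per A/T and 4 degrees C per G/C."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2.0 * at + 4.0 * gc


def tm_gc_content(seq):
    """Basic GC-content formula: 64.9 + 41 * (G + C - 16.4) / N."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(s)


def consensus_tm(seq, methods=(tm_wallace, tm_gc_content), max_spread=5.0):
    """Average several Tm estimates and flag whether they agree.

    Returns (mean Tm, True if the estimates lie within `max_spread` degrees).
    """
    values = [method(seq) for method in methods]
    spread = max(values) - min(values)
    return sum(values) / len(values), spread <= max_spread
```

When the flag is False, the methods disagree for that oligonucleotide and the averaged value should be treated with caution.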

A data-driven clustering method for time course gene expression data.

Ma P, Castillo-Davis CI, Zhong W, Liu JS.
Nucleic Acids Res. 2006 Mar 1;34(4):1261-9. Print 2006.
Department of Statistics, Harvard University, Cambridge, MA 02138, USA.

Gene expression over time is, biologically, a continuous process and can thus be represented by a continuous function, i.e. a curve. Individual genes often share similar expression patterns (functional forms). However, the shape of each function, the number of such functions, and the genes that share similar functional forms are typically unknown. Here we introduce an approach that allows direct discovery of related patterns of gene expression and their underlying functions (curves) from data without a priori specification of either cluster number or functional form. Smoothing spline clustering (SSC) models natural properties of gene expression over time, taking into account natural differences in gene expression within a cluster of similarly expressed genes, the effects of experimental measurement error, and missing data. Furthermore, SSC provides a visual summary of each cluster's gene expression function and goodness-of-fit by way of a 'mean curve' construct and its associated confidence bands. We apply this method to gene expression data over the life-cycle of Drosophila melanogaster and Caenorhabditis elegans to discover 17 and 16 unique patterns of gene expression in each species, respectively. New and previously described expression patterns in both species are discovered, the majority of which are biologically meaningful and exhibit statistically significant gene function enrichment.

Distribution-insensitive cluster analysis in SAS on real-time PCR gene expression data of steadily expressed genes.

Tichopad A, Pecen L, Pfaffl MW.

Comput Methods Programs Biomed. 2006 Apr;82(1):44-50. Epub 2006

Cluster analysis is a tool often employed in microarray techniques but used less in real-time PCR. Herein we present core SAS code that, instead of Euclidean distances, takes the correlation coefficient as a dissimilarity measure. The dissimilarity measure is made robust using a rank-order correlation coefficient rather than a parametric one. There is no need for an overall probability adjustment like in scoring methods based on repeated pair-wise comparisons. The rank-order correlation matrix gives a good base for the clustering procedure of gene expression data obtained by real-time RT-PCR as it disregards the different expression levels. Associated with each cluster is a linear combination of the variables in the cluster, which is the first principal component. A large set of variables can then be replaced by the set of cluster components with little loss of information. In this way, distinct clusters containing unregulated housekeeping genes along with other steadily expressed genes can be disclosed and utilized for standardization purposes. Simulated data in parallel with the data from a biological experiment were taken to validate the SAS macro. For both cases, good intuitive results were obtained.
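
The core idea of a rank-order correlation used as a dissimilarity measure can be illustrated in Python. This is not the authors' SAS macro: the sketch assumes that dissimilarity is defined as 1 minus the Spearman correlation (the abstract does not state the exact transformation), it stops at the dissimilarity matrix, and it leaves the clustering step to whichever procedure is preferred. All function names are hypothetical.

```python
from math import sqrt


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_dissimilarity(x, y):
    """Assumed definition: 1 - Spearman rho (0 = same ordering, 2 = reversed)."""
    return 1.0 - pearson(ranks(x), ranks(y))


def dissimilarity_matrix(genes):
    """`genes` maps a gene name to its expression values across samples."""
    names = list(genes)
    return {(a, b): spearman_dissimilarity(genes[a], genes[b])
            for a in names for b in names}
```

Because only ranks enter the calculation, two genes that rise and fall together across samples get a dissimilarity of 0 even when their absolute expression levels differ widely.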

Real-time RT-PCR: Neue Ansätze zur exakten mRNA Quantifizierung (Real-time RT-PCR: new approaches to exact mRNA quantification)
BioSpektrum 1/2004  (in German)

The molecular technologies genomics, transcriptomics and proteomics are increasingly conquering the classical research fields of the life sciences. The enormous flood of data and results obtained is of disproportionate benefit to molecular diagnostics and physiology as well as to functional genomics. Ever newer and more sophisticated methods and applications are therefore needed to describe complex physiological processes. Since we are only at the beginning of this molecular era, it is necessary to optimize these techniques and to understand them completely. One of these technically sophisticated methods for the reliable and exact quantification of specific mRNA is real-time RT-PCR. This article essentially describes efficiency-corrected relative quantification, the normalization of expression results against a non-regulated housekeeping gene, the calculation of real-time PCR efficiency, and the processing and statistical evaluation of the expression results. All of the topics described can be looked up in detail in internationally published original papers on the corresponding website.
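
Efficiency-corrected relative quantification with normalization to a reference gene can be written as a short calculation. The sketch below uses the widely cited form of the ratio; the article itself is only summarized here, so treat the formula as an assumption about what it describes. Efficiencies are expressed as amplification factors per cycle (2.0 means perfect doubling), and the function and parameter names are hypothetical.

```python
def relative_expression_ratio(e_target, cq_target_control, cq_target_sample,
                              e_ref, cq_ref_control, cq_ref_sample):
    """Expression of the target gene in the sample relative to the control.

    ratio = e_target ** (Cq_control - Cq_sample of the target gene)
            / e_ref  ** (Cq_control - Cq_sample of the reference gene)
    """
    delta_target = cq_target_control - cq_target_sample
    delta_ref = cq_ref_control - cq_ref_sample
    return (e_target ** delta_target) / (e_ref ** delta_ref)
```

With both efficiencies equal to 2.0, a target gene that appears three cycles earlier in the sample than in the control, and a reference gene that does not change, the ratio is 8, that is, an eight-fold up-regulation.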

Nucleic Acids Research - Recent Hot Papers

Nucleic Acids Research 2005 vol 33 (Database issue)
The 2005 Database Issue of Nucleic Acids Research is the twelfth in a series dedicated to factual databases in the field of molecular biology. Such databases are an essential resource for working biologists and this compilation provides descriptions and updates of the most important of these databases and serves to introduce newly ...

Database Categories List
  1. Nucleotide Sequence Databases
  2. RNA sequence databases
  3. Protein sequence databases
  4. Structure Databases
  5. Genomics Databases (non-vertebrate)
  6. Metabolic and Signaling Pathways
  7. Human and other Vertebrate Genomes
  8. Human Genes and Diseases
  9. Microarray Data and other Gene Expression Databases
  10. Proteomics Resources
  11. Other Molecular Biology Databases
  12. Organelle databases
  13. Plant databases
  14. Immunological databases

Nucleic Acids Research 2004 vol 32 (Web Server issue)
Last year Nucleic Acids Research published a special issue devoted to web servers. This issue complemented the annual Database Issue, which has now appeared in 11 successive years. The Web Server Issue highlights the many servers that are available on the web to perform useful computations on DNA, RNA and protein sequences and structures. Between them, the two issues provide an unparalleled array of useful computational services. The new Web Server Issue aims to provide a repository in which authors of web servers can highlight their offerings and readers can find out what is available.
In the current issue there are reports of 137 web servers that run the gamut from BLAST services to three-dimensional protein structure prediction. The servers described have all been subjected to rigorous peer review, are available free of charge and provide invaluable resources to the scientific community. The scientists and programmers who have provided these resources deserve our immense thanks. They illustrate the very best of the scientific spirit that transcends national boundaries and promotes cooperation and the sharing of resources.

A web server for performing electronic PCR
Kirill Rotmistrovsky, Wonhee Jang and Gregory D. Schuler
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

‘Electronic PCR’ (e-PCR) refers to a computational procedure that is used to search DNA sequences for sequence tagged sites (STSs), each of which is defined by a pair of primer sequences and an expected PCR product size. To gain speed, our implementation extracts short ‘words’ from the 3' end of each primer and stores them in a sorted hash table that can be accessed efficiently during the search. One recent improvement is the use of overlapping discontinuous words to allow matches to be found despite the presence of a mismatch. Moreover, it is possible to allow gaps in the alignment between the primer and the sequence. The effect of these changes is to improve sensitivity without significantly affecting specificity. The new software provides a search mode using a query STS against a sequence database to augment the previously available mode using a query sequence against an STS database. Finally, e-PCR may now be used through a web service, with search results linked to other web resources such as the UniSTS database and the MapViewer genome browser. The e-PCR web server may be found at

Sequence Mapping by Electronic PCR
Gregory D. Schuler
Genome Research
Vol. 7, No. 5, pp. 541-550, May 1997
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894

The highly specific and sensitive PCR provides the basis for sequence-tagged sites (STSs), unique landmarks that have been used widely in the construction of genetic and physical maps of the human genome. Electronic PCR (e-PCR) refers to the process of recovering these unique sites in DNA sequences by searching for subsequences that closely match the PCR primers and have the correct order, orientation, and spacing that they could plausibly prime the amplification of a PCR product of the correct molecular weight. A software tool was developed to provide an efficient implementation of this search strategy and allow the sort of en masse searching that is required for modern genome analysis. Some sample searches were performed to demonstrate a number of factors that can affect the likelihood of obtaining a match. Analysis of one large sequence database record revealed the presence of several microsatellite and gene-based markers and allowed the exact base-pair distances among them to be calculated. This example provides a demonstration of how e-PCR can be used to integrate the growing body of genomic sequence data with existing maps, reveal relationships among markers that existed previously on different maps, and correlate genetic distances with physical distances.
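
The search described above, for primer sites with the correct order, orientation and spacing, can be sketched in its simplest form. This is not the e-PCR implementation: the sketch assumes exact primer matches only (no mismatches, gaps or hashed words), searches only the orientation in which the forward primer matches the given strand, and uses hypothetical function names.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_all(text, word):
    """All 0-based start positions of `word` in `text`, overlaps included."""
    hits = []
    start = text.find(word)
    while start != -1:
        hits.append(start)
        start = text.find(word, start + 1)
    return hits


def in_silico_pcr(template, forward, reverse, max_product=2000):
    """Return predicted products as (start, end, size) on the given strand.

    The reverse primer anneals to the given strand, so its site appears in
    the template as the reverse complement of the primer sequence.
    """
    template = template.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    products = []
    for f in find_all(template, forward):
        for r in find_all(template, reverse_site):
            end = r + len(reverse_site)
            size = end - f
            # The reverse site must lie downstream of the forward primer
            # and the product must not exceed the size limit.
            if r >= f + len(forward) and size <= max_product:
                products.append((f, end, size))
    return products
```

A fuller implementation would also search the opposite orientation and tolerate mismatches away from the 3' end of each primer, as the e-PCR abstracts describe.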

iPCR = Virtual PCR

In silico PCR: In silico simulation of molecular biology experiments
In silico experiments with complete genomes

This site has been developed by Dr. Joseba Bikandi, Dr. Rosario San Millán and co-workers in the Department of Immunology, Microbiology and Parasitology, Faculty of Pharmacy, in the University of the Basque Country.
Some tools included in this site, or their prior versions, were primarily developed to obtain theoretical PCR results with Salmonella by the research group of Dr. Javier Garaizar and Dr. Aitor Rementeria. Later they were adapted for use with any bacterial species sequenced to date. The list of genomes is updated shortly after they become available at NCBI, and the number of tools available will also increase in the near future. Additional databases used by these tools have been obtained from NCBI, and in some cases a link will redirect users to NCBI in order to obtain specific information.

UCSC In-Silico PCR

In-Silico PCR searches a sequence database with a pair of PCR primers, using an indexing strategy for fast performance.

Configuration Options

  • Genome and Assembly - The sequence database to search.
  • Forward Primer - Must be at least 15 bases in length.
  • Reverse Primer - On the opposite strand from the forward primer. Minimum length of 15 bases.
  • Max Product Size - Maximum size of amplified region.
  • Min Perfect Match - Number of bases that match exactly on 3' end of primers. Minimum match size is 15.
  • Min Good Match - Number of bases on 3' end of primers where at least 2 out of 3 bases match.
  • Flip Reverse Primer - Invert the sequence order of the reverse primer and complement it.

New real-time PCR primer and probe databases:

Publication: PATTYN, F., SPELEMAN, F., DE PAEPE A. & VANDESOMPELE, J. (2003). RTPrimerDB: the Real-Time PCR primer and probe database. Nucleic Acids Research, 31(1): 122-123.
Publication: Xiaowei Wang and Brian Seed (2003). A PCR primer bank for quantitative gene expression analysis. Nucleic Acids Research, 31(24): e154; pp. 1-8.
  • The Quantitative PCR Primer Database (QPPD) provides information about primers and probes that can be used to quantitate human and mouse mRNA by reverse transcription polymerase chain reaction (RT–PCR) assays. All data has been gathered from published articles, cited in PubMed.