PCAGO helps you analyze your RNA-Seq read counts with principal component analysis (PCA). We also included other helpful features such as read count normalization, downloading annotations and GO terms for your genes, and a tool to find a gene variance cut-off for PCA.
Upload your read count data or choose one of our example data sets. If your data is not normalized, you can apply read count normalization with DESeq2 or TPM directly within the app. You can view and export all generated data, as well as reports about the processing.
You can filter the genes by scaffold, biotype or GO terms. Just upload an annotation or use the integrated feature to extract data from online databases.
Create interactive PCA and hierarchical clustering plots of your samples and use the animation feature to analyze how including more and more of the top variant genes in the calculation changes the outcome. We also included highly flexible customization and coloring options directly within the web application.
PCAGO is a tool to analyze RNA-Seq results with principal component analysis. You can use it to …
PCAGO provides additional features that make your tasks easier, like …
Analyze page
Sidebar > Data > Import read counts or choose example data
Sidebar > Data > Import samples annotation or choose example data
The source code of PCAGO is available at GitHub. If you want to use PCAGO as a desktop application, you can use PCAGO-Electron or run a Docker container (recommended). See GitHub for further instructions.
Working with PCAGO can be categorized into three sections:
First, read count data, sample and gene annotations need to be imported. The absolute minimum consists of the read count data and the sample annotation that defines the conditions of each sample. For other tasks like filtering or normalization, additional sample annotations or gene annotations are required.
As a second step, PCAGO offers additional processing: removing genes with a variance of zero, read count normalization, and filtering of the read count data so that only genes with specific properties are included. Finally, principal component analysis (PCA) is applied to the data.
Lastly, output plots and files are created for further use.
This help page will cover all required information about each step and will use certain terms to refer to specific user interface elements.
The user interface of the main application is separated into three parts:
The content area displays the currently selected output. This output consists of plots, tables and additional information about the currently displayed data. The content navigation allows access to all available outputs. The sidebar contains all controls, including plot settings and the controls for importing or processing data. With the exception of Plot, the categories in the sidebar are independent of the currently selected content. Each category is marked with an icon that indicates if user input is expected, recommended or required:
This page will guide you through all steps to process and visualize a data set.
It uses the Monocytes data set, which you can find here.
You need:
Open the Data > Import read counts section of the sidebar and select the option to upload a file. Select the read count file and click Submit.
Check if the read counts are loaded correctly. PCAGO will automatically show the data when it is available.
You can either upload a file or generate the sample annotation from sample names.
You can exclude any sample condition from further calculation steps by clicking the X button next to the condition.
You can again upload a gene annotation file or query data from online databases like Ensembl BioMart. Please note that the resources of our server are limited, which can cause issues with large annotations. In this case, run PCAGO from your computer.
We offer normalization with TPM or DESeq2. Please keep in mind that this process can take a long time for larger data sets.
You can filter genes by their gene annotation. This includes GO terms, chromosome and other properties. Just click into the filter area to show the list of available filters.
You can also filter genes by their variance. Just drag the slider to select the top variant genes. This filter also supports animation: Just click the play button.
At any time, you can switch to the PCA samples plot that displays PC1 and PC2 by default.
To change the displayed principal component axes or switch to a 3D PCA plot, change the displayed axes in the Axes setting.
PCAGO allows you to change many plot settings directly within the application. You can for example assign a color and/or shape to a condition or change the name displayed in the plot.
To provide read counts, go to Sidebar > Data > Import read counts and choose an example data set or upload a file. You can find more information about the upload widget in the Importing > Upload widget help section.
Read counts are provided in tabular form, where a column represents a sample and a row represents a gene. The first row contains the sample names that are referred to in the sample annotation. The first column contains the names of the genes.
ID | Sample1 | Sample2 | Sample3 | … |
---|---|---|---|---|
Gene1 | Read count of Gene1 in Sample1 | Read count of Gene1 in Sample2 | … | … |
Gene2 | Read count of Gene2 in Sample1 | … | … | … |
Gene3 | … | … | … | … |
… | … | … | … | … |
PCAGO can import read count data from CSV files.
See Importing > File formats for more information about file formats.
Read count samples are annotated with two different types of annotations: the sample conditions and the optional sample info annotation. The sample conditions assign a set of conditions to each sample and are used for visual representation and normalization. The sample info annotation currently only contains the mean fragment length that is needed by TPM normalization.
PCAGO represents the sample conditions as a boolean table that contains logical values. The rows represent the samples, while the columns represent the available conditions. Similar to the read count table, the first row and the first column are reserved for labeling. All other cells determine if a sample of a given row has a condition as given by the column.
 | Condition1 | Condition2 | Condition3 | … |
---|---|---|---|---|
Sample1 | TRUE or FALSE | … | … | … |
Sample2 | … | … | … | … |
Sample3 | … | … | … | … |
… | … | … | … | … |
Example
A valid sample condition table looks like the following:
Sample | Vitamin_C | Vitamin_Control | Infection_EColi | Infection_Control |
---|---|---|---|---|
S1 | TRUE | FALSE | TRUE | FALSE |
S2 | TRUE | FALSE | FALSE | TRUE |
S3 | FALSE | TRUE | TRUE | FALSE |
Sample S1 is infected by E. coli and treated with Vitamin C. Sample S2 is not infected, but treated with Vitamin C. Sample S3 is only infected with E. coli.
PCAGO can also convert a second format for sample conditions, the factor table, into its internal representation. A factor table is easier for humans to read and represents the sample conditions as a choice from specified categories of treatments. Here, the columns represent the categories.
Sample | Category of treatments 1 | Category of treatments 2 | … |
---|---|---|---|
Sample1 | Specific treatment of given category | … | … |
Sample2 | … | … | … |
… | … | … | … |
Example
The following factor table represents the same conditions as the boolean table example:
Sample | Vitamin | Infection |
---|---|---|
S1 | C | E. coli |
S2 | C | Control |
S3 | Control | E. coli |
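The relationship between the two formats can be illustrated with a minimal Python sketch. It is not PCAGO's own converter: the function name `factor_to_boolean` and the `Category_Treatment` naming of the generated condition columns are assumptions made for illustration only.

```python
def factor_to_boolean(factor_table):
    """Convert {sample: {category: treatment}} into {sample: {condition: bool}}.

    Condition names are built as "<category>_<treatment>" (an assumed naming scheme).
    """
    # Collect every category/treatment pair, keeping first-seen order.
    conditions = []
    for treatments in factor_table.values():
        for category, treatment in treatments.items():
            name = f"{category}_{treatment}"
            if name not in conditions:
                conditions.append(name)

    boolean_table = {}
    for sample, treatments in factor_table.items():
        present = {f"{c}_{t}" for c, t in treatments.items()}
        boolean_table[sample] = {name: name in present for name in conditions}
    return boolean_table


factors = {
    "S1": {"Vitamin": "C", "Infection": "E. coli"},
    "S2": {"Vitamin": "C", "Infection": "Control"},
    "S3": {"Vitamin": "Control", "Infection": "E. coli"},
}
boolean_table = factor_to_boolean(factors)
```

Each sample ends up with exactly one `True` entry per category, which matches the boolean table example further above.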
PCAGO supports importing sample conditions from a boolean table or a factor table. The tables must be provided in CSV format. See Importing > File formats for more information about file formats.
The sample conditions can also be built from the names of the samples in the read count table. The generator splits the name of each sample at a user-defined character and adds each of the resulting substrings to the set of conditions.
If, for example, the name of a sample is Infection_EColi_N1 and the generator is set to split at the _ character, this sample has the conditions Infection, EColi and N1.
If the character does not occur in the sample name or the split character is empty, the whole sample name is assumed to be a condition.
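The splitting rule can be sketched in a few lines of Python. The function name `conditions_from_name` is hypothetical and only mirrors the behavior described above; it is not the generator's actual code.

```python
def conditions_from_name(sample_name, split_character):
    """Derive a list of conditions from a sample name."""
    # An empty split character, or one that does not occur in the name,
    # makes the whole sample name a single condition.
    if not split_character or split_character not in sample_name:
        return [sample_name]
    return sample_name.split(split_character)


example = conditions_from_name("Infection_EColi_N1", "_")
```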
Other components like the visual editor use the condition table to determine cell properties. The order of the conditions within the table may change the output depending on the component. Additionally, the generator can generate unwanted conditions that would affect normalization (an example might be N1).
To solve this, the Rearrange conditions control below the samples annotation importer allows rearrangement and disabling of specific conditions.
The sample info currently only contains the mean fragment length used by TPM normalization and, like the condition annotation, is stored as a table. Again, the rows represent the samples. The columns represent the currently supported annotation values. To allow extension of PCAGO, both the first row and the first column are reserved for labeling of the data.
ID | meanfragmentlength |
---|---|
Sample1 | … |
Sample2 | … |
… | … |
The table must be provided in CSV format.
See Importing > File formats for more information about file formats.
You can upload multiple sample annotations per upload widget (see Importing > Upload widget for more information).
The data will be integrated into a final sample annotation. You can upload multiple data sets of the same type. The sample info annotation will be merged, while the conditions are completely overwritten. Newer data overwrites old data.
The gene annotation can be used to select only a specific set of genes. Other annotation data is needed for read count normalization.
A gene has the following annotations:
You can upload multiple gene annotations per upload widget (see Importing > Upload widget for more information).
The data will be integrated into a final annotation. Older data will be overwritten by newer data.
PCAGO supports two file formats: GFF files downloaded from Ensembl (Ensembl GFF) and a tabular format internally used by PCAGO (PCAGO gene annotation table). Please note that Ensembl GFF files do not support GO terms. The tables exported from Content Navigation > Data > Genes > Annotation are consistent with the PCAGO gene annotation table format and can be imported.
The uploader widget allows selection of specific annotations to be imported. Deselect a checkbox to exclude the associated annotation data from being imported.
ID | scaffold | start_position | end_position | length | exonlength | biotype | go_ids | custom |
---|---|---|---|---|---|---|---|---|
Gene1 | … | … | … | … | … | … | term1\|term2\|… | custom1\|custom2\|… |
… | … | … | … | … | … | … | … | … |
PCAGO includes access to following sources of gene annotations:
Following parameters are required:
Following parameters are required:
The general upload widget handles all importing of your data. This includes uploading files, loading samples, generating data from online resources, as well as integrating multiple data sources.
This widget has two modes:
The widget imports one set of data. Examples are a read count table or a visual style definition. If you choose a different data source, the existing data will be overwritten when you click Submit.
Depending on the data that can be imported, you have the following choices for the source of the data:
Below the data source selection, there are always additional parameters that belong to the selected data source. If uploaded file is selected, an upload widget will appear. If manual input is selected, a text area allows insertion of the data. If you want to load a sample data set, a selection box will appear below the data source setting.
Below the data source specific parameters, there will be at least one additional parameter called Importer or Generator. As there are often multiple file formats or online databases, the upload widget might offer multiple options. You can find specific information about the supported file formats in the help page sections that correspond to the sidebar.
If data cannot be entirely obtained with one uploaded file or database query, the upload widget is set to the integration mode. This mode allows you to combine data from multiple files, online resources and samples that are automatically integrated into the final data set.
The integrating upload widget works like the widget in single data mode, but additionally has
Use the list of imported data to keep overview and individually delete data sets from the uploader.
The integration function will return a callback that contains information about the integrated data, i.e. how many genes have an annotation or which information is missing.
This page gives you a brief overview about common file formats.
Character Separated Value (CSV) files represent tables that are read from plain text files. A line in the text file represents a row, while the columns are determined by splitting each line at a separator.
If a separating character is valid data (like 1,000 in case of comma separated files), the value is put into quotes.
Example
Following table
ID | A | B | C |
---|---|---|---|
G1 | 4,54 | 242 | 121 |
G2 | 122 | | 454 |
can be written as follows in CSV format if the separator is a comma:
ID,A,B,C
G1,"4,54",242,121
G2,122,,454
If the separator is a whitespace character, it can be written as follows:
ID A B C
G1 4,54 242 121
G2 122 "" 454
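To see how quoting and empty cells behave when such files are parsed, here is a small sketch using Python's standard `csv` module. It only illustrates the file format and is not how PCAGO reads files; the variable names are made up for this example.

```python
import csv
import io

# Comma-separated: the value 4,54 contains the separator and must be quoted.
comma_text = 'ID,A,B,C\nG1,"4,54",242,121\nG2,122,,454\n'
# Whitespace-separated: the comma is ordinary data, the empty cell is written as "".
space_text = 'ID A B C\nG1 4,54 242 121\nG2 122 "" 454\n'

comma_rows = list(csv.reader(io.StringIO(comma_text), delimiter=","))
space_rows = list(csv.reader(io.StringIO(space_text), delimiter=" "))

# Both texts describe the same table, including the empty cell in row G2.
same_table = comma_rows == space_rows
```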
You can apply additional processing to your read count table:
If enabled, all genes with zero read counts are omitted from the table.
You can use this option if you provided a transposed table (where the genes are the columns and the samples are the rows).
If your read counts are not normalized, it is advised to apply normalization. Currently, you can choose between library size normalization with the DESeq2 package and within-sample normalization with TPM.
Note that you can only normalize integer read counts.
Currently the DESeq2 normalization tool supports normalization. It uses the sample conditions (see the Data > Sample annotation help page for more information) as the basis to build the design matrix.
TPM needs the following information:
The feature length / exon length is provided by the gene annotation, while the mean fragment length is provided by the sample annotation. See the respective help pages Data > Gene annotation and Data > Sample annotation for more information.
We use the following formula to calculate the TPM value for one feature (gene) $i$:
$$\mathrm{TPM}_i = \frac{c_i}{l_i} \cdot \frac{10^6}{\sum_{j=1}^{N} \frac{c_j}{l_j}}$$
where $c_i$ is the raw read count of gene $i$, $l_i$ is the effective length of gene $i$ and $N$ is the number of all genes in the given annotation. The effective length of a gene $i$ is calculated as:
feature_length - mean_fragment_length + 1
if a mean fragment length is provided within the sample annotation, otherwise the effective length is simply the length of the feature obtained from the gene annotation.
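The calculation can be sketched in Python as follows. This is a minimal illustration assuming the standard TPM definition (each read count divided by its effective length, normalized by the sum of these rates over all genes, and scaled to one million). The function names are hypothetical, and the sketch does not handle genes whose effective length is zero or negative.

```python
def effective_length(feature_length, mean_fragment_length=None):
    """feature_length - mean_fragment_length + 1, or the plain feature length."""
    if mean_fragment_length is None:
        return feature_length
    return feature_length - mean_fragment_length + 1


def tpm(read_counts, feature_lengths, mean_fragment_length=None):
    """Compute TPM values for one sample.

    read_counts and feature_lengths are dicts keyed by gene ID.
    """
    # Read count per unit of effective length for every gene.
    rates = {
        gene: count / effective_length(feature_lengths[gene], mean_fragment_length)
        for gene, count in read_counts.items()
    }
    total = sum(rates.values())
    # Scale so that all values of the sample sum to one million.
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


counts = {"Gene1": 100, "Gene2": 300}
lengths = {"Gene1": 1000, "Gene2": 3000}
tpm_values = tpm(counts, lengths)
```

In this toy example both genes have the same read density, so they receive the same TPM value even though their raw counts differ.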
This TPM equation was initially defined by Li et al., 2010 and applied, for example, in the publications of Klassert et al., 2017 and Riege et al., 2017, from which one of our example data sets (Monocytes) is derived.
If enabled, all genes with constant read counts are omitted from the table. This is needed if you want to apply variance scaling to the data before PCA as genes with constant read counts have a variance of zero.
You can filter your read counts by restricting them to a specific set of genes. The first available filter mechanism is to use the data from the gene annotation.
Click on the large selection field to list all available criteria. Type into the field to search for criteria.
Currently, following criteria are available:
Note: Don't forget to remove the All (*) element from the list or all genes are selected.
You can change how the criteria are matched. You can either select all genes that have at least one matching criterion (OR) or only genes where all criteria in the selection field have to apply (AND). Additionally, you can invert the selection.
You have genes A, B, C and D.
Gene | Scaffold | Biotype |
---|---|---|
A | X | protein_coding |
B | X | miRNA |
C | 1 | protein_coding |
D | 2 | lncRNA |
Depending on the filter settings, different genes are selected.
Selected criteria | Operation | Invert | Selected genes |
---|---|---|---|
X, protein_coding | OR | False | A,B,C |
X, protein_coding | AND | False | A |
X, protein_coding | AND | True | B,C,D |
X, protein_coding | OR | True | D |
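The matching logic behind this table can be reproduced with a short Python sketch. The function `filter_genes` and the set-based data layout are assumptions for illustration, not PCAGO's internal implementation.

```python
# Each gene maps to the set of annotation values it carries.
genes = {
    "A": {"X", "protein_coding"},
    "B": {"X", "miRNA"},
    "C": {"1", "protein_coding"},
    "D": {"2", "lncRNA"},
}


def filter_genes(annotations, criteria, operation="OR", invert=False):
    """Return the sorted gene names selected by the criteria."""
    criteria = set(criteria)
    selected = []
    for gene, properties in annotations.items():
        if operation == "AND":
            matches = criteria <= properties       # all criteria must apply
        else:
            matches = bool(criteria & properties)  # at least one criterion applies
        # Inverting the selection keeps exactly the genes that do not match.
        if matches != invert:
            selected.append(gene)
    return sorted(selected)


or_selection = filter_genes(genes, ["X", "protein_coding"], "OR")
```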
A gene variance cut-off can be applied to genes filtered by the gene annotation. Use the slider or the numeric input within the gene variance filtering settings to change the cut-off.
Below the slider you can find buttons that increase/decrease the gene variance cut-off by 1 or the currently selected animation step. Use the play/pause button to enable or disable animation of the gene variance cut-off within the selected range (parameters From and To). The speed of the animation and the increase of the gene variance cut-off can be changed by the Animation step and Animation speed (ms) parameters.
Please note that the actual animation speed depends on the speed of your connection and the hardware that runs PCAGO. In certain plots, an Export .mp4 feature allows export of the animation as video file.
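Conceptually, the cut-off keeps only the most variant genes. The following Python sketch assumes that the cut-off is the number of top variant genes to keep and that the sample variance is used for ranking; both are assumptions for illustration, and `top_variant_genes` is a hypothetical name.

```python
from statistics import variance


def top_variant_genes(read_counts, cutoff):
    """Return the names of the `cutoff` genes with the highest variance.

    read_counts maps a gene ID to its list of read counts across samples.
    """
    variances = {gene: variance(values) for gene, values in read_counts.items()}
    ranked = sorted(variances, key=variances.get, reverse=True)
    return ranked[:cutoff]


counts = {
    "G1": [1, 1, 1],    # constant, variance 0
    "G2": [0, 10, 20],  # variance 100
    "G3": [4, 5, 6],    # variance 1
}
top_two = top_variant_genes(counts, 2)
```

Animating the cut-off then corresponds to repeating the downstream analysis for increasing values of `cutoff`.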
Due to the large number of Gene Ontology (GO) terms, the gene annotation based filtering requires pre-selecting the GO terms considered for filtering. The GO Browser widget helps with this task by letting you search the set of GO terms and browse their hierarchical structure.
By default, the GO browser displays the root terms of the three main categories of GO terms. To browse the hierarchical structure, select the term of interest and click List subterms. The list will then display all child terms of the selected one.
This will also update the bread crumb bar on top of the list, which tracks the position in the hierarchical structure. To quickly switch within the hierarchy, click the items in the bar.
To search for terms, type into the Search GO term field. This will hide the bread crumb bar.
Tip: You can list subterms from search results
To include one or multiple GO terms into the filter selection, select the terms in the list and click Add to filter. To remove items, click Remove from filter within the More … menu. Navigating and searching GO terms does not disturb the list of considered terms.
To show more information about one or more GO terms, select them in the list and click Show details. This will open a dialog that contains information like a definition of the term and a link to the matching AmiGO 2 entry.
You can export your selected GO terms by clicking on Export *.csv. The table will contain the GO terms and additional info like the ID and a definition.
An imported table has the following format:
goid | … |
---|---|
GOID1 | … |
GOID2 | … |
GOID3 | … |
… | … |
The importer looks only for the goid column, which should contain valid GO accessions (usually in the format GO:xxxxxxx). The other columns are ignored and can contain additional data.
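A reader for such a table could look like the following Python sketch. It assumes a comma-separated file and silently skips entries that do not look like `GO:` followed by seven digits; both choices, as well as the name `read_go_ids`, are assumptions and not documented PCAGO behavior.

```python
import csv
import io
import re

GO_PATTERN = re.compile(r"^GO:\d{7}$")


def read_go_ids(text):
    """Return the GO accessions found in the goid column; other columns are ignored."""
    reader = csv.DictReader(io.StringIO(text))
    go_ids = []
    for row in reader:
        value = (row.get("goid") or "").strip()
        if GO_PATTERN.match(value):
            go_ids.append(value)
    return go_ids


table = "goid,term\nGO:0008150,biological_process\nnot-a-term,ignored\nGO:0003674,molecular_function\n"
go_ids = read_go_ids(table)
```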
There are two types of PCA parameters:
If this parameter is enabled, the dimensions (each row in the read count table) will be transformed, so the mean is zero. This is a recommended option as PCA assumes zero-mean data.
If this parameter is enabled, the dimensions (each row in the read count table) will be transformed, so the variance is 1.
Allows scaling of the transformed data to a relative space.
NEWVALUE = (VALUE - MIN_DIMENSION) / (MAX_DIMENSION - MIN_DIMENSION)
NEWVALUE = (VALUE - MIN_GLOBAL) / (MAX_GLOBAL - MIN_GLOBAL)
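The difference between the two variants can be shown with a small Python sketch. It assumes that MIN_DIMENSION and MAX_DIMENSION are taken separately for each dimension (column) of the transformed data, while MIN_GLOBAL and MAX_GLOBAL are taken over all values; the function names are hypothetical.

```python
def rescale(value, minimum, maximum):
    """NEWVALUE = (VALUE - MIN) / (MAX - MIN)"""
    return (value - minimum) / (maximum - minimum)


def rescale_per_dimension(points):
    """Scale every dimension (column) to [0, 1] independently."""
    bounds = [(min(column), max(column)) for column in zip(*points)]
    return [
        [rescale(v, lo, hi) for v, (lo, hi) in zip(point, bounds)]
        for point in points
    ]


def rescale_global(points):
    """Scale all values with one shared minimum and maximum."""
    flat = [v for point in points for v in point]
    lo, hi = min(flat), max(flat)
    return [[rescale(v, lo, hi) for v in point] for point in points]


# Three samples with two transformed coordinates each.
points = [[-2.0, 1.0], [0.0, 3.0], [2.0, 2.0]]
per_dimension = rescale_per_dimension(points)
global_scaled = rescale_global(points)
```

Per-dimension scaling stretches every axis to the full range, whereas global scaling preserves the relative extent of the axes.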
For many plots you can change the colors, shape or name of the data points. This is handled by the “Visuals editor” widget in the sidebar.
The available conditions are listed in the “Available conditions” section. The conditions are colored by the current color of the selected condition. A bar at the left indicates that the condition sets a shape.
A sample with the conditions atra, asp and eco gets the shape from atra and the color from asp. The eco condition is ignored, because both color and shape are already set.
Following importers are available:
Condition | color | shape | name |
---|---|---|---|
Condition1 | Leave empty for no color, otherwise a valid R color | -1 for no shape, otherwise a valid R pch | Leave empty for no name |
Condition2 | … | … | … |
Condition3 | … | … | … |
… | … | … | … |
Plots like heatmaps require a method to convert a numerical value to a color. The included gradient editor allows full customization of this mapping.
The interface contains a list of gradient stops that act as reference points for the linear interpolation. By selecting a stop, you can change its color and the associated value that should be mapped to the color.
The values that are set within the gradient stops must be ordered ascending. If this is not the case, the plot will not render.
You can add additional stops by clicking the Add stop button. If you have more than two stops, you can remove the current selected stop by clicking the Remove this stop button.
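The mapping from a value to a color via gradient stops can be sketched in Python as follows. Colors are given as RGB tuples here for simplicity (PCAGO itself works with R colors), and clamping values outside the range of the stops to the first or last color is an assumption of this sketch; `interpolate_color` is a hypothetical name.

```python
def interpolate_color(value, stops):
    """Linearly interpolate an RGB color for `value`.

    stops is a list of (value, (r, g, b)) pairs ordered ascending by value.
    """
    values = [v for v, _ in stops]
    if values != sorted(values):
        raise ValueError("gradient stops must be ordered ascending")
    # Assumption: values outside the stop range take the nearest end color.
    if value <= values[0]:
        return stops[0][1]
    if value >= values[-1]:
        return stops[-1][1]
    for (v0, c0), (v1, c1) in zip(stops, stops[1:]):
        if v0 <= value <= v1:
            t = (value - v0) / (v1 - v0)
            return tuple(round(a + t * (b - a)) for a, b in zip(c0, c1))


stops = [(0, (0, 0, 200)), (10, (200, 200, 200)), (20, (200, 0, 0))]
halfway_low = interpolate_color(5, stops)
halfway_high = interpolate_color(15, stops)
```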
Following importers are available:
value | color |
---|---|
value1 | Valid R color |
value2 | … |
value3 | … |
… | … |
The processing view widget shows the processing steps that have been applied to generate the data of the current view.
Click on Show details to show more info about parameters and intermediate results of a processing step.
You can export all content as HTML report by clicking the Export report button. The report will include all information (with details) about each processing step.
Martin H
Bioinformatics/High-Throughput Analysis
Faculty of Mathematics and Computer Science
Friedrich Schiller University Jena
Leutragraben 1
07743 Jena
Room: 08N05
Phone: +49-3641-9-46480
Personal data (usually referred to just as “data” below) will only be processed by us to the extent necessary and for the purpose of providing a functional and user-friendly website, including its contents, and the services offered there.
Per Art. 4 No. 1 of Regulation (EU) 2016/679, i.e. the General Data Protection Regulation (hereinafter referred to as the “GDPR”), “processing” refers to any operation or set of operations such as collection, recording, organization, structuring, storage, adaptation, alteration, retrieval, consultation, use, disclosure by transmission, dissemination, or otherwise making available, alignment, or combination, restriction, erasure, or destruction performed on personal data, whether by automated means or not.
The following privacy policy is intended to inform you in particular about the type, scope, purpose, duration, and legal basis for the processing of such data either under our own control or in conjunction with others. We also inform you below about the third-party components we use to optimize our website and improve the user experience which may result in said third parties also processing data they collect and control.
Our privacy policy is structured as follows:
The party responsible for this website (the “controller”) for purposes of data protection law is:
Martin H
Bioinformatics/High-Throughput Analysis
Faculty of Mathematics and Computer Science
Friedrich Schiller University Jena
Leutragraben 1
07743 Jena
Room: 08N05
Phone: +49-3641-9-46480
The controller's data protection officer is:
Martin H
With regard to the data processing to be described in more detail below, users and data subjects have the right
In addition, the controller is obliged to inform all recipients to whom it discloses data of any such corrections, deletions, or restrictions placed on processing the same per Art. 16, 17 Para. 1, 18 GDPR. However, this obligation does not apply if such notification is impossible or involves a disproportionate effort. Nevertheless, users have a right to information about these recipients.
Likewise, under Art. 21 GDPR, users and data subjects have the right to object to the controller's future processing of their data pursuant to Art. 6 Para. 1 lit. f) GDPR. In particular, an objection to data processing for the purpose of direct advertising is permissible.
Your data processed when using our website will be deleted or blocked as soon as the purpose for its storage ceases to apply, provided the deletion of the same is not in breach of any statutory storage obligations or unless otherwise stipulated below.
For technical reasons, the following data sent by your internet browser to us or to our server provider will be collected, especially to ensure a secure and stable website: These server log files record the type and version of your browser, operating system, the website from which you came (referrer URL), the webpages on our site visited, the date and time of your visit, as well as the IP address from which you visited our site.
The data thus collected will be temporarily stored, but not in association with any other of your data.
The basis for this storage is Art. 6 Para. 1 lit. f) GDPR. Our legitimate interest lies in the improvement, stability, functionality, and security of our website.
The data will be deleted within no more than seven days, unless continued storage is required for evidentiary purposes. In that case, all or part of the data will be excluded from deletion until the investigation of the relevant incident is finally resolved.
We use YouTube on our website. This is a video portal operated by YouTube LLC, 901 Cherry Ave, 94066 San Bruno, CA, USA, hereinafter referred to as “YouTube”.
YouTube is a subsidiary of Google LLC, 1600 Amphitheatre Parkway, Mountain View, CA 94043 USA, hereinafter referred to as “Google”.
Through certification according to the EU-US Privacy Shield (https://www.privacyshield.gov/participant?id=a2zt000000001L5AAI&status=Active), Google and its subsidiary YouTube guarantee that they will follow the EU's data protection regulations when processing data in the United States.
We use YouTube in its advanced privacy mode to show you videos. The legal basis is Art. 6 Para. 1 lit. f) GDPR. Our legitimate interest lies in improving the quality of our website. According to YouTube, the advanced privacy mode means that the data specified below will only be transmitted to the YouTube server if you actually start a video.
Without this mode, a connection to the YouTube server in the USA will be established as soon as you access any of our webpages on which a YouTube video is embedded.
This connection is required in order to be able to display the respective video on our website within your browser. YouTube will record and process at a minimum your IP address, the date and time the video was displayed, as well as the website you visited. In addition, a connection to the DoubleClick advertising network of Google is established.
If you are logged in to YouTube when you access our site, YouTube will assign the connection information to your YouTube account. To prevent this, you must either log out of YouTube before visiting our site or make the appropriate settings in your YouTube account.
For the purpose of functionality and analysis of usage behavior, YouTube permanently stores cookies on your device via your browser. If you do not agree to this processing, you have the option of preventing the installation of cookies by making the appropriate settings in your browser. Further details can be found in the section about cookies above.
Further information about the collection and use of data, as well as your rights and protection options, can be found in Google's privacy policy at https://policies.google.com/privacy
Model Data Protection Statement for Anwaltskanzlei Wei
PCAGO is licensed under the GNU General Public License Version 3 and uses various software packages provided by different authors. The source code, as well as a list of all software packages used, is available on GitHub.