Using the table furnished by Kontxi Martinez, Samples IDs are replaced by the experiment name in the read courn table. The experiment name follow the current pattern:
The corrected read count table is imported. Only value of the BRAUN pair are kept. The homemade function LoadRawReadCounts() is used for this.
Read count are normalized using the ReadCountNormalisation() homemade function
The first step consist in removing features, here genes, with no or low reads in all samples. The sequencing detection limit is estimated at 10. This means that genes with less than 10 reads in all samples are considered as not expressed (or expressed under the detection limit) and are thus removed from the analyis. Here, 0 genes have been rejected. The filtering is based on the the filterByExpr from the edgeR packgage. The figure shows the density of gene read counts. Note the vertical line indicating the threshold used to sort out none expressed genes.
Barplot shows the whole number of read counts for each sample. Reads which are not associated to an annotated gene, are not considered in the calculation.
A filtering is also applied to identify and remove genes with less than 10 reads (see the Low read filtering tab). Reads associated to these filtered out genes are rejected.
This step consist in removing the compositions bias using the TMM (Trimed mean of M-value) Robinson and Oshlack 2010. In short, samples with different library size cannot be compared with each other. The process consist in normalise the read count such like each sample have artificially the same total number of reads.
The calcNormFactors from the edgeR package has been used to perform this process.
boxplot represent the distribution of genes log2(cpm). cpm are count per milion of read.
Voom transform of counts : removing heteroscedascity from count data
The basic principle is to remove a bias related to the relationship between gene expression variance and gene expression mean. The more condition/replicate we have the better it is. voom aim at prepare the data for a further linear fit see chapter 6.2 of RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR. This operation is performed using the voom function from the limma package.
For a given data set with multiple samples, it is expected the average read count associated to a genes should not depend on its variance. More simply, the variability among samples should be similar for every genes and not depends on their average expression level. For RNAseq data, this relationship between variability and expression is unfortunaltely true (figure right). Lower will be global number of reads associated to a gene, higher will be the variability among samples. This may have an impact on p-value when comparing the expression of gene between two samples.
Voom transform attempt to correct for this bias. From the comparison of the standard deviation against the average read count of each gene, a model is built (right panel, red line) and used to correct read counts value (left panel).
This step might not be essential since all samples are from the same batch. This can be interesting if sequencing files come from different runs or lanes. We will use the SVA package to infer the batch effects. Very quickly sva estimate the number of covariate (surrogate variable) and then adjust value to remove the noise associated to such covariate. Then adjust for the batch effects using Surrogate variables. Here the strategy consist in let the algorithm estimate the potential batch variables rather than feeding it with known source of batch effect.
Different visualisation can be used to observe the correction of the batch effect. The first is the PCA. PCA plot is a scatter where each dot is a sample. Ideally, all replicates from a given condition should appears close from each other, indicating a good reproducibility. Our strategy based on the estimation of variables inducing noise may reduced the distance between replicates on the PCA plot.
The second visualisation is the density plot. The is basically the density of gene count relative to the read count range (or log2(cpm)). The plot confirm the quality of the whole normalisation since all samples show similar density.
The last representation is the correlation matrix. A correlation coefficient is calculated between each samples combination. The resulting matrix of correlation coeficients is then visualised using a heatmap. A layer of Hierarchical clustering analysis has been performed on this matrix. This enables to sort rows and columns depending on the similarity of their correlation coefficients. Ideally the Heatmap should show samples from similar conditions close from each other. Note the improvement of the clustering after the batch effect correction. Note also that samples from similar experimental conditions cluster together, especially depending on the harvesting time.
V1 | CEC3617_0 V CEC3605_0 | CEC3617_3 V CEC3605_3 | CEC3617_6 V CEC3605_6 |
---|---|---|---|
Down | 0 | 0 | 35 |
NotSig | 14147 | 14147 | 14064 |
Up | 0 | 0 | 48 |
Differential gene expression is performed by calculating the expression of in the CEC3617 condition relative to the CEC3605 condition. The result is a log2(fold change) matrix indicating the expression of each genes relative to the reference condition. The various step described below use functions from the limma package.
The first step consist in build a linear model to average read counts from each biological replicates. This operation is performed using the lmFit() function. The makeContrasts() function is then used to build a comparison matrix. Values produce by lmFit() and makeContrasts() are combined in another object using the contrasts.fit(). This procedure lead to the propagation of the statistics from the linear model. Finally the ebayes() function is used to extract from combined the object the relevant statistics of differential expression by empirical Bayes moderation of the standard errors towards a common value.
To identify genes differentially expressed, the decideTests() and the summary() functions have been used. The first, classify the series of related t-statistics as up, down or not significant, while the second summaries the result of test. Differentially expressed genes has been identified using0.05 and 0.5 as cutoff for pvalue and log2(FC) respectively. The tables presented in this slide show the summary of up and down regulated genes using these parameters.
one two
adsbvpashdvbpasuvbda
blalb
[1] "No GO emerge from that list of gene"
blabla
[1] "No GO emerge from that list of gene"
blabla
blabla
blabla