Two of the five lanes sequenced for sample F (090723_NUTELLA_429NWAAXX.7 and 090723_NUTELLA_429NWAAXX.8) were mapped using 75 nucleotides instead of 50 nucleotides. We believe this has a negligible impact on our results, for the following reasons.
Metrics of sample F including the 75 nt reads (% genome covered, correlation to other samples, % genes covered, etc.) are similar to those of other samples. Indeed, the Spearman’s rank correlation (for genes covered in both sets, since the reduced set has fewer reads) between genic FPKM values from sample F including the 75 nt reads and genic FPKM values from sample F excluding the 75 nt reads (‘reduced’ sample F) is 0.996. This correlation was also 0.996 when only considering genes that were classifiable. When the sample E/F pairwise correlation, among the highest in Online Supplementary Table 2, was recalculated by removing genes not expressed in both sets, it came to 0.982, still lower than 0.996. These datasets are thus similar to one another. As these data were mapped to identical splice sites, we need to consider only the small difference in mappability of 2x50 nt reads versus 2x75 nt reads in the transcriptome. For transcript-level values, the correlation of sample F with the ‘reduced’ sample F was 0.992. For transcripts used in the alternative splicing analysis, the correlation of sample F with the ‘reduced’ sample F was 0.995, suggesting the transcript-level alternative splicing analysis would be affected as little as the gene-level analysis.
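As an illustration of the comparison described above, the following is a minimal sketch of a Spearman rank correlation restricted to genes covered in both sets. The toy FPKM values, the function names, and the choice to treat "covered" as FPKM > 0 are assumptions made for this example; this is not the pipeline used in the analysis.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_shared(fpkm_a, fpkm_b):
    """Spearman correlation over genes with FPKM > 0 in both dictionaries.

    Assumption: 'covered in both sets' is approximated by FPKM > 0 in both.
    Returns (rho, number_of_shared_genes).
    """
    shared = sorted(
        g for g in fpkm_a
        if g in fpkm_b and fpkm_a[g] > 0 and fpkm_b[g] > 0
    )
    a = [fpkm_a[g] for g in shared]
    b = [fpkm_b[g] for g in shared]
    return pearson(average_ranks(a), average_ranks(b)), len(shared)


# Hypothetical toy values (NOT from the study): 'full' stands in for sample F
# with the 75 nt lanes, 'reduced' for sample F without them.
full = {"g1": 0.5, "g2": 3.0, "g3": 12.0, "g4": 40.0, "g5": 0.2}
reduced = {"g1": 0.4, "g2": 3.5, "g3": 10.0, "g4": 45.0, "g5": 0.0}

rho, n_shared = spearman_shared(full, reduced)
print(f"Spearman rho over {n_shared} shared genes: {rho:.3f}")
```

Restricting to genes present in both sets before ranking matters here because the reduced set has fewer reads, so some low-expression genes drop out of it entirely.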
Most importantly, all patterning analyses were based on the classifiers. Even if there had been an effect on the genic level, we would expect the classifier metrics to be lowered accordingly.
To obtain a subset of all genes for classification, we filtered on the basis of confidence intervals as reported by cufflinks. These quantification estimates were previously validated to be accurate even below 1 FPKM using spike-ins and in silico simulation. However, in our experiments, the amount of total sequence gathered per library varied by as much as twofold (Supplemental Experimental Procedures). Hence, even for equivalent levels of ‘true’ normalized FPKM expression, the tightness of confidence intervals will vary among libraries.
Indeed, genes that never attain an FPKM>1.0 in any sample are 49% more likely to be called as patterned than the Allen Mouse Brain Atlas (AMBA) curated genes at similar levels of expression (p=0.002; hypergeometric test). This was the only expression bin in which a significant increase was observed (Figure S3). For most bins, by contrast, classifiable genes were significantly less likely than the AMBA curated genes at equivalent expression levels to be predicted as patterned. This, however, is expected: the AMBA genes were prioritized for manual curation using a computational algorithm intended to identify genes likely to be patterned, based on in situ hybridizations (see here).
So while genes at most expression levels are significantly less likely to be patterned than the AMBA curated genes, only the genes expressed below 1 FPKM in every sample are significantly more likely to be patterned than AMBA curations in that expression range.
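The hypergeometric tests quoted here can be sketched as a one-sided upper-tail probability. The counts below are hypothetical placeholders, since the underlying tables are not reproduced on this page, and the framing of population, successes, and draws is an assumption about how such a test might be set up rather than a record of the actual analysis.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when `draws` genes are sampled without replacement
    from `population` genes, of which `successes` are called patterned."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(observed, 0), upper + 1)
    )
    return favourable / total


# Hypothetical counts (NOT from the study): 1000 genes in total, 300 of them
# called patterned, and an expression bin of 100 genes containing 45 of the
# patterned calls (30 would be expected by chance).
p_value = hypergeom_upper_tail(1000, 300, 100, 45)
print(f"one-sided p = {p_value:.3g}")
```

The exact integer arithmetic from `math.comb` avoids the rounding issues that factorial-based floating-point formulas can run into at these population sizes.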
Genes are 35% more likely (p<10^-11; hypergeometric test) to be called as patterned in this bin than in all the other bins combined. So if we make the simplifying assumption that all bins should have the same ratio of patterned to unpatterned genes, we might estimate that 1-1/1.35≈25% of these predicted patterned genes <1.0 FPKM are false positives. Since this bin accounts for 740 predicted patterned genes (11% of the total predicted patterned genes), that indicates 185 false positives, or 2.75% of the total patterned genes. So fewer than ~3% of all predicted patterned genes are false positives not already accounted for in our current classifier measures of false positives. Hence, this does not substantially affect the larger set of all patterned genes.
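The arithmetic behind this estimate can be retraced in a few lines. The numbers are those given in the paragraph above; the variable and function names are illustrative only, and the second call simply carries the unrounded excess through as a check.

```python
enrichment = 1.35     # bin is 35% more likely to be called patterned
bin_patterned = 740   # predicted patterned genes in the <1.0 FPKM bin
bin_share = 0.11      # that bin's share of all predicted patterned genes


def false_positive_estimate(excess_fraction):
    """Return (false positives in the bin, fraction of all patterned genes)."""
    in_bin = excess_fraction * bin_patterned
    of_total = excess_fraction * bin_share
    return in_bin, of_total


# Excess under the equal-ratio assumption, before any rounding.
exact_excess = 1 - 1 / enrichment

rounded = false_positive_estimate(0.25)            # figure used in the text
unrounded = false_positive_estimate(exact_excess)  # no intermediate rounding

for label, (in_bin, of_total) in (("rounded", rounded), ("unrounded", unrounded)):
    print(f"{label}: {in_bin:.0f} genes, {of_total:.2%} of all patterned genes")
```

With the rounded 25% this reproduces the 185 genes and 2.75% quoted above. Carrying the unrounded excess (about 0.259) through gives roughly 192 genes and about 2.9% of the total, which still falls under the ~3% bound.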
There is a good explanation for why the cufflinks confidence interval-based filtering may not have removed these false positives. The sequencing libraries differed considerably in their numbers of reads, which means the confidence intervals will be consistently tighter in the deeper libraries. For genes at very low levels of expression (which may not be overlapped by reads in every library), this could exclude 'no expression' from the 95% confidence interval of a gene in one library, while the same gene might not have a single overlapping read in another library (there are many examples of this), even if the gene were actually expressed at the same level in both. Accepting the well-validated cufflinks confidence intervals, this seems the most parsimonious theoretical explanation.
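The depth effect described above can be illustrated with a deliberately simplified model. The sketch below assumes read counts for a gene follow a Poisson distribution whose mean scales with library depth; this is an assumption for illustration and is not how cufflinks computes its confidence intervals. The expected read counts are hypothetical.

```python
from math import exp


def prob_zero_reads(expected_reads):
    """Poisson probability of observing no reads given the expected count."""
    return exp(-expected_reads)


# Hypothetical gene with the same true expression in both libraries, so its
# expected read count scales with depth (a twofold difference is assumed).
deep_expected = 3.0
shallow_expected = deep_expected / 2

for label, lam in (("deeper", deep_expected), ("shallower", shallow_expected)):
    print(f"{label} library: P(no reads) = {prob_zero_reads(lam):.3f}")
```

Under these assumptions the shallower library is several times more likely to show no overlapping reads for the same gene, which is the kind of discrepancy that could make a lowly expressed gene appear patterned.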
Hence, we suggest that the 89% of total predicted patterned genes >1 FPKM may be the most fruitful for experimental follow-up.