DEXUS: Identifying Differential Expression in RNA-Seq Studies with Unknown Conditions

Detection of differential expression in RNA-Seq data is currently limited to studies in which two or more sample conditions are known a priori. However, these biological conditions are typically unknown in cohort, cross-sectional, and non-randomized controlled studies such as the HapMap, the ENCODE, or the 1000 Genomes project. We present DEXUS for detecting differential expression in RNA-Seq data for which the sample conditions are unknown. DEXUS models read counts as a finite mixture of negative binomial distributions in which each mixture component corresponds to a condition. A transcript is considered differentially expressed if modeling of its read counts requires more than one condition. DEXUS decomposes read count variation into variation due to noise and variation due to differential expression. Evidence of differential expression is measured by the informative/non-informative (I/NI) value, which allows differentially expressed transcripts to be extracted at a desired specificity (significance level) or sensitivity (power). DEXUS performed excellently in identifying differentially expressed transcripts in data with unknown conditions. On 2,400 simulated data sets, I/NI value thresholds of 0.025, 0.05, and 0.1 yielded average specificities of 92%, 97%, and 99% at sensitivities of 76%, 61%, and 38% respectively. On real-world data sets, DEXUS was able to detect differentially expressed transcripts related to sex, species, tissue, structural variants, or eQTLs.
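
The mixture idea described above can be illustrated with a small sketch. This is not the DEXUS implementation and it does not compute the I/NI value: it is a simplified maximum-likelihood EM fit of a negative binomial mixture to the read counts of one transcript. The function names (nb_logpmf, fit_nb_mixture), the fixed size parameter r, the initialisation, and the example counts are all assumptions made for illustration only.

```python
import math


def nb_logpmf(k, mu, r):
    """Log-probability of count k under a negative binomial with mean mu and size r."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))


def fit_nb_mixture(counts, n_components=2, r=10.0, n_iter=50):
    """Fit mixture weights and component means by plain EM (size r held fixed).

    Simplifying assumption: r is shared and not estimated, so the M-step for
    each mean reduces to a responsibility-weighted average of the counts.
    """
    lo, hi = min(counts), max(counts)
    # Spread the initial means evenly between the smallest and largest count.
    means = [max(lo + (hi - lo) * (i + 0.5) / n_components, 1e-3)
             for i in range(n_components)]
    weights = [1.0 / n_components] * n_components

    for _ in range(n_iter):
        # E-step: posterior probability that each sample belongs to each component.
        resp = []
        for k in counts:
            logs = [math.log(w) + nb_logpmf(k, m, r)
                    for w, m in zip(weights, means)]
            top = max(logs)  # subtract the maximum for numerical stability
            probs = [math.exp(v - top) for v in logs]
            total = sum(probs)
            resp.append([p / total for p in probs])

        # M-step: update mixing weights and component means.
        for i in range(n_components):
            mass = sum(row[i] for row in resp)
            weights[i] = max(mass / len(counts), 1e-12)  # keep log(weight) finite
            if mass > 0:
                weighted_sum = sum(row[i] * k for row, k in zip(resp, counts))
                means[i] = max(weighted_sum / mass, 1e-3)

    return weights, means


# Hypothetical read counts of one transcript across twelve samples.
example_counts = [5, 7, 6, 4, 8, 5, 6, 50, 55, 48, 60, 52]
example_weights, example_means = fit_nb_mixture(example_counts)
print(example_weights, example_means)
```

In this sketch, two components with clearly separated means and non-negligible weights suggest that the counts are better described by more than one condition. DEXUS itself quantifies this evidence with the I/NI value, which is described in the paper and not reproduced here.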

Please cite:

Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. "DEXUS: Identifying Differential Expression in RNA-Seq Studies with Unknown Conditions." Nucleic Acids Research 41(21), e198, 2013. doi:10.1093/nar/gkt834.

Paper:

Dexus.pdf

Supplementary Notes:

DexusSupplement.pdf

Citation:

dexus.bib

Official Link & DOI:

Official Link: http://nar.oxfordjournals.org/content/early/2013/09/17/nar.gkt834.full
DOI: 10.1093/nar/gkt834

Download the R package:

Available at Bioconductor: dexus R package

Datasets and R scripts:
The benchmarking data sets used in our publication can be downloaded below.

Simulated data with two known conditions: supervisedDataSets.zip (492 MB)
Simulated data with multiple known conditions: multiclassDataSets.zip (649 MB)
Simulated data with unknown conditions: unsupervisedDataSets.zip (1.5 GB)
"Nigerian HapMap" data set: Pickrell.zip (2 MB)
"European HapMap" data set: Montgomery.zip (2 MB)
"Primate Liver" data set: Gilad.zip (2 MB)
"Maize leaves" data set: Li.zip (1 MB)
"Mice strains" data set: Bottomly.zip (2 MB)

Additional functions and packages:
Functions for running the methods and helper functions: runMethods.R
ROCR package for evaluation: ROCR_1.0.4.tar.gz
R package we used for calculations in the paper: mixnb_0.2.tar.gz