MixupMapper

Harm-Jan Westra1, Ritsert Jansen2, Rudolf Fehrmann1, Gerard te Meerman1, David van Heel3,§, Cisca Wijmenga1,§, Lude Franke1,3,§

§ contributed equally
1 Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands
2 Groningen Bioinformatics Center, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, The Netherlands.
3 Blizard Institute of Cell and Molecular Science, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK.

Abstract

Sample mix-ups can arise during sample collection, handling, genotyping or data management. It is unclear how often sample mix-ups occur in genome-wide studies, as there currently are no post-hoc methods that can identify these mix-ups in unrelated samples. We have therefore developed an algorithm (MixupMapper) that can both detect and correct sample mix-ups in genome-wide studies that study gene expression levels.
We applied MixupMapper to five publicly available human genetical genomics datasets. On average, 3% of all analyzed samples had been assigned incorrect expression phenotypes: in one of the datasets 23% of the samples had incorrect expression phenotypes. The consequences of sample mix-ups are substantial: when we corrected these sample mix-ups, we identified on average 15% more significant cis-expression quantitative trait loci (cis-eQTLs). In one dataset, we identified three times as many significant cis-eQTLs after correction. Furthermore, we show through simulations that sample mix-ups can lead to an underestimation of the explained heritability of complex traits in genome-wide association datasets.

For more details on the method, please refer to the Methods section of our paper.

Download

We supply our program in two formats. The first is a NetBeans project, which includes the Java source code and the libraries required to review the code, compile it and run the program. The second is a binary JAR file together with all required libraries.

Usage

MixupMapper is written in Java and therefore runs on Mac OS X, Windows and Linux. The program requires JRE 1.6 to be installed on your machine. If you are running Windows or Linux, you can obtain Java from Oracle. On Mac OS X, Java comes preinstalled with the operating system. Run the program from the command line as follows:

java -Xmx2g -jar MixupMapper.jar settingsfile.xml

Here, -Xmx2g sets the maximum Java heap size to 2 GB. The settingsfile.xml file is an XML file containing all configuration settings required to run the program. It is similar to the settingsfile.xml used by AssociationGG. The table below describes the settings available for MixupMapper:

| Section | Option | Description |
|---|---|---|
| settings.defaults.mixups | eqtlfile | String: Location of the file containing significant eQTLs determined in an initial eQTL mapping. More information on the format of this file can be found on the AssociationGG page. |
| settings.defaults.mixups | outputdirectory | String: Directory where the results of the analysis will be stored. |
| settings.defaults.mixups | familyinformation | String: Optional tab-separated text file describing family relationships between samples. |
| settings.defaults.mixups | numeqtls | Integer: Maximum number of eQTLs to use in the sample mix-up analysis. |
| settings.defaults.mixups | mode | String: One of three settings: loadeqtls, generateeqtls and determinefdrthreshold. The loadeqtls mode loads the eQTLs defined with the eqtlfile option and performs sample mix-up identification. The generateeqtls mode performs the initial cis-eQTL mapping. The determinefdrthreshold mode determines the significance threshold of the mix-up analysis. The default is generateeqtls. |
| settings.defaults.mixups | threads | Integer: Number of threads the program is allowed to use. Defaults to the number of processors in the machine. |
| settings.defaults.multipletesting | threshold | Double: Threshold used for false discovery rate (FDR) multiple testing. Defaults to 0.05. |
| settings.defaults.multipletesting | confidence | Double: Confidence for determining the FDR at the threshold. Defaults to 0.95. |
| settings.defaults.multipletesting | permutations | Integer: Number of permutations the program should run. Defaults to 0. |
| settings.datasets.dataset | name | String: Name of the dataset. |
| settings.datasets.dataset | location | String: Folder containing the genotypematrix.dat file. |
| settings.datasets.dataset | genometoexpressioncoupling | String: Tab-separated file describing the link between genotype samples and gene expression samples. Format: genotype\tgene expression\n |
| settings.datasets.dataset | expressiondata | String: Location of the file containing gene expression data. Data should be in TriTyper format. If not defined, the program looks for settings.datasets.dataset.location/Expressiondata.txt |
| settings.datasets.dataset | quantilenormalize | Boolean: Whether the dataset should be quantile normalized. Defaults to false. |
| settings.datasets.dataset | logtranform | Boolean: Whether the dataset should be log transformed. Defaults to false. |
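
The genometoexpressioncoupling file pairs each genotype sample with a gene expression sample, one tab-separated pair per line. The Java sketch below shows one way such a file could be checked before a run. It is not part of MixupMapper: the class name CouplingFileSketch, the method parseCouplings and the sample names are hypothetical, and the choices to skip blank lines and to reject duplicate genotype samples are assumptions made for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of MixupMapper: parses coupling lines of the
// form "genotypeSample<TAB>expressionSample".
class CouplingFileSketch {

    // Returns a map from genotype sample name to gene expression sample name,
    // preserving the order of the input lines.
    static Map<String, String> parseCouplings(List<String> lines) {
        Map<String, String> couplings = new LinkedHashMap<String, String>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines carry no meaning
            }
            String[] fields = line.split("\t");
            if (fields.length != 2) {
                throw new IllegalArgumentException(
                        "Expected 2 tab-separated fields: " + line);
            }
            String genotype = fields[0].trim();
            String expression = fields[1].trim();
            // assumption: each genotype sample should appear only once
            if (couplings.containsKey(genotype)) {
                throw new IllegalArgumentException(
                        "Duplicate genotype sample: " + genotype);
            }
            couplings.put(genotype, expression);
        }
        return couplings;
    }

    public static void main(String[] args) {
        // Made-up sample names, for illustration only.
        List<String> lines = Arrays.asList("ind1\tarray7", "ind2\tarray3");
        System.out.println(parseCouplings(lines)); // prints {ind1=array7, ind2=array3}
    }
}
```

In practice the lines would be read from the file named in the settings; the list of strings keeps the example self-contained.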

Output

The output of the program consists of several files. Which files are generated depends on the selected mode.

| File | Mode | Description |
|---|---|---|
| Distribution-Counts.pdf | loadeqtls | Distribution of overall Z-score values, using the raw counts of Z-scores per bin. The Z-scores are plotted on the X-axis and the counts on the Y-axis. |
| Distribution-Frequency.pdf | loadeqtls | Frequency distribution of overall Z-scores. The Z-scores are plotted on the X-axis and the frequencies on the Y-axis. |
| [datasetname]-ExpressionCorrelationMatrix.txt | loadeqtls | Correlation matrix of the gene expression data, used to calculate the principal components (PCA). The PCA scores are visualized in Heatmap.pdf. [datasetname] is replaced by the name of the dataset as supplied in the settings file. |
| [datasetname]-GenotypeCorrelationMatrix.txt | loadeqtls | Correlation matrix of the genotype data, used to calculate the principal components (PCA). The correlations with the PCA scores for the genotype data are visualized in Heatmap.pdf. [datasetname] is replaced by the name of the dataset as supplied in the settings file. |
| Heatmap.pdf | loadeqtls | Visualisation of the overall Z-score for each assessed pair of samples. The genotyped samples are plotted on the X-axis and the gene expression samples on the Y-axis. The brightness of each box corresponds to the value of the overall Z-score, with lower values having brighter colors. The grey bars next to the sample names indicate the correlation of the sample with the first principal component, which is an indicator of sample quality. Samples are sorted alphabetically on both axes. |
| PermutationDistribution.txt | determinefdrthreshold | Distribution of overall Z-scores, used to determine the confidence interval for FDR multiple testing. |
| PermutedEQTLsPermutationRound[num].txt.gz | generateeqtls, determinefdrthreshold | eQTLs determined from permuted data. These files are used to calculate the FDR distribution. |
| ROC.pdf | loadeqtls | Receiver operating characteristic (ROC) curve, describing the separation between the overall Z-scores on the diagonal of the heatmap (where the genotype sample name equals the gene expression sample name) and those of all other assessed pairs. Several lines are plotted, each corresponding to an increasing a priori chance of mixed-up samples. |
| ROC.txt | loadeqtls | Tab-separated text representation of the ROC curve. |
| SampleMixups.txt | loadeqtls | Tab-separated text file describing the best-matching gene expression sample for each genotyped sample. |
| ScoresMatrix.txt | loadeqtls | Tab-separated text matrix of the overall Z-scores calculated by the method for each possible pair of genotyped and gene expression samples. |
| eQTLProbesFDR[threshold].txt | generateeqtls, determinefdrthreshold | The strongest significant cis-eQTL effect for each probe. [threshold] is replaced by the FDR threshold value supplied in the settings file. |
| eQTLs.txt.gz | generateeqtls, determinefdrthreshold | The top 150000 cis-eQTL results, sorted by p-value. |
| eQTLsFDR[threshold].txt | generateeqtls, determinefdrthreshold | All significant cis-eQTL effects. [threshold] is replaced by the FDR threshold value supplied in the settings file. |
| eQTLsSelected.txt | determinefdrthreshold | The eQTLs that were selected for sample mix-up analysis. |
| SuggestedCouplings.txt | loadeqtls | Identical to SampleMixups.txt, except that it has no header and omits the samples that were removed after sample mix-up correction (duplicate arrays, poor-quality arrays, etc.). |
| log.txt | loadeqtls, generateeqtls, determinefdrthreshold | General log file. |
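
To illustrate how ScoresMatrix.txt relates to SampleMixups.txt, the Java sketch below picks, for each genotyped sample, the gene expression sample with the lowest overall Z-score and flags pairs whose names differ. It is an illustration only, not MixupMapper's implementation: the class BestMatchSketch, the method bestMatches, the matrix orientation and the toy numbers are hypothetical. It assumes that a lower overall Z-score indicates a better match, which this page does not state explicitly, and it applies no FDR threshold.

```java
// Hypothetical helper, not MixupMapper code: for each genotyped sample, find
// the gene expression sample with the lowest overall Z-score.
class BestMatchSketch {

    // scores[g][e] is the overall Z-score for genotype sample g paired with
    // gene expression sample e (orientation is an assumption of this sketch).
    // Returns, per genotype sample, the index of the expression sample with
    // the lowest score (assumption: lower means a better match).
    static int[] bestMatches(double[][] scores) {
        int[] best = new int[scores.length];
        for (int g = 0; g < scores.length; g++) {
            int bestIndex = 0;
            for (int e = 1; e < scores[g].length; e++) {
                if (scores[g][e] < scores[g][bestIndex]) {
                    bestIndex = e;
                }
            }
            best[g] = bestIndex;
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data, invented for illustration only.
        String[] genotypeSamples = {"ind1", "ind2", "ind3"};
        String[] expressionSamples = {"ind1", "ind2", "ind3"};
        double[][] scores = {
            {-4.1,  0.3,  0.8},
            { 0.5,  0.2, -3.7}, // ind2 matches expression sample ind3 best
            { 0.1, -3.9,  0.6}  // ind3 matches expression sample ind2 best
        };
        int[] best = bestMatches(scores);
        for (int g = 0; g < genotypeSamples.length; g++) {
            String match = expressionSamples[best[g]];
            boolean differs = !match.equals(genotypeSamples[g]);
            System.out.println(genotypeSamples[g] + "\t" + match + "\t"
                    + (differs ? "possible mix-up" : "ok"));
        }
    }
}
```

With this toy matrix, ind2 and ind3 each match the other's expression sample best, the pattern expected when two samples have been swapped.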

Additional data to the manuscript

Below is a list of ZIP archives, containing the output from the software for each of the datasets analysed in our manuscript.

| Dataset | GEO ID/URL | TriTyper ZIP archive |
|---|---|---|
| Wolfs et al. | GSE22070 | Unpublished Liver |
| Choy et al. | GSE11582 | Choy CHB+JPT, Choy CEU, Choy YRI |
| Stranger et al. | GSE6536 | Stranger CHB+JPT, Stranger CEU, Stranger YRI |
| Zhang et al. | GSE9703 | Zhang CEU, Zhang YRI |
| Heinzen et al. | SNPExpress | Heinzen Brain, Heinzen PBMC |
| Webster et al. | GSE15222 | Webster Brain |

Contact

For any questions regarding the software or the method, please contact Harm-Jan Westra or Lude Franke.