Harm-Jan Westra1, Ritsert Jansen2, Rudolf Fehrmann1, Gerard te Meerman1, David van Heel3,§, Cisca Wijmenga1,§, Lude Franke1,3,§
§ contributed equally
1 Department of Genetics, University Medical Center Groningen, Groningen, The Netherlands
2 Groningen Bioinformatics Center, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, The Netherlands.
3 Blizard Institute of Cell and Molecular Science, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK.
Abstract
Sample mix-ups can arise during sample collection, handling, genotyping or data management. It is unclear how often sample mix-ups occur in genome-wide studies, as there currently are no post-hoc methods that can identify these mix-ups in unrelated samples. We have therefore developed an algorithm (MixupMapper) that can both detect and correct sample mix-ups in genome-wide studies that study gene expression levels.
We applied MixupMapper to five publicly available human genetical genomics datasets. On average, 3% of all analyzed samples had been assigned incorrect expression phenotypes: in one of the datasets 23% of the samples had incorrect expression phenotypes. The consequences of sample mix-ups are substantial: when we corrected these sample mix-ups, we identified on average 15% more significant cis-expression quantitative trait loci (cis-eQTLs). In one dataset, we identified three times as many significant cis-eQTLs after correction. Furthermore, we show through simulations that sample mix-ups can lead to an underestimation of the explained heritability of complex traits in genome-wide association datasets.
For more details on the method, please refer to the Methods section of our paper.
Download
We supply our program in two formats. The first is a NetBeans project, which includes the Java source code and the libraries required to review, compile and run the program. The second is a binary JAR file together with all required libraries.
Usage
MixupMapper is written in Java and therefore runs on Mac OS X, Windows and Linux, provided that JRE 1.6 is installed on your machine. If you are running Windows or Linux, you can obtain a version of Java from Oracle. On Mac OS X, Java comes preinstalled with the operating system. Start the program from the command line as follows:
java -Xmx2g -jar MixupMapper.jar settingsfile.xml
The settingsfile.xml file is an XML file containing all configuration settings required to run the program. The file is similar to the settingsfile.xml used by AssociationGG. The table below describes the settings available for MixupMapper:
| Section | Option | Description |
| settings.defaults.mixups | eqtlfile | String: Location of the file containing significant eQTLs determined in an initial eQTL mapping. More information on the format of this file can be found on the AssociationGG page. |
| | outputdirectory | String: Directory where the results of the analysis will be stored. |
| | familyinformation | String: Optional tab-separated text file describing family relationships between samples. |
| | numeqtls | Integer: When performing sample mix-up analysis, limit the analysis to this number of eQTLs. |
| | mode | String: This option has three possible settings: loadeqtls, generateeqtls and determinefdrthreshold. The loadeqtls mode loads the eQTLs defined with the eqtlfile option in order to perform sample mix-up identification. The generateeqtls mode performs the initial cis-eQTL mapping. The determinefdrthreshold mode determines the significance threshold of the mix-up analysis. The default is generateeqtls. |
| | threads | Integer: Number of threads the program is allowed to use. Defaults to the number of processors in the machine. |
| settings.defaults.multipletesting | threshold | Double: Threshold used for false discovery rate (FDR) multiple testing. Defaults to 0.05. |
| | confidence | Double: Confidence used for determining the FDR at the threshold. Defaults to 0.95. |
| | permutations | Integer: Number of permutations the program should run. Defaults to 0. |
| settings.datasets.dataset | name | String: Name of the dataset. |
| | location | String: Folder location of the genotypematrix.dat file. |
| | genometoexpressioncoupling | String: Tab-separated file describing the link between genotype samples and gene expression samples. Format: genotype\tgene expression\n |
| | expressiondata | String: Location of the file containing gene expression data. Data should be in TriTyper format. If not defined, the program looks for settings.datasets.dataset.location/Expressiondata.txt |
| | quantilenormalize | Boolean: Defines whether the dataset should be quantile normalized. Defaults to false. |
| | logtranform | Boolean: Defines whether the dataset should be log transformed. Defaults to false. |
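The genometoexpressioncoupling file is described above as one tab-separated pair per line (genotype sample, then gene expression sample). The following is a minimal Java sketch of reading that format, useful for checking a coupling file before running the program. It is not taken from the MixupMapper source: the class name `CouplingFileSketch` and the method `parse` are hypothetical, and the assumption that blank lines may be skipped is ours.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MixupMapper.
class CouplingFileSketch {

    /**
     * Parses coupling text with one "genotypeId<TAB>expressionId" pair per line.
     * Blank lines are skipped (an assumption); any other line that does not
     * contain exactly two tab-separated fields is rejected.
     */
    static Map<String, String> parse(String text) {
        Map<String, String> coupling = new LinkedHashMap<String, String>();
        for (String line : text.split("\\r?\\n")) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] fields = line.split("\t", -1);
            if (fields.length != 2) {
                throw new IllegalArgumentException(
                        "Expected 2 tab-separated fields: " + line);
            }
            coupling.put(fields[0], fields[1]);
        }
        return coupling;
    }

    public static void main(String[] args) {
        // Two made-up sample identifiers per line, for illustration only.
        Map<String, String> coupling = parse("geno1\texpr1\ngeno2\texpr2\n");
        System.out.println(coupling);
    }
}
```

In practice the text would be read from the file named in the genometoexpressioncoupling option; the sketch takes a string so that it stays self-contained.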
Output
The output of the program consists of several files. Not all files are generated, depending on the mode selected.
| File | Mode | Description |
| Distribution-Counts.pdf | loadeqtls | Distribution of overall Z-score values. The distribution uses the raw counts of Z-scores per bin. The Z-scores are plotted on the X-axis and the counts on the Y-axis. |
| Distribution-Frequency.pdf | loadeqtls | Frequency distribution of overall Z-scores. The Z-scores are plotted on the X-axis and the frequencies on the Y-axis. |
| [datasetname]-ExpressionCorrelationMatrix.txt | loadeqtls | Correlation matrix on the gene expression data that is used for calculation of the principal components (PCA). The PCA scores are visualized in the Heatmap.pdf. [datasetname] will be replaced by the name of the dataset as supplied in the settings file. |
| [datasetname]-GenotypeCorrelationMatrix.txt | loadeqtls | Correlation matrix on the genotype data that is used for calculation of the principal components (PCA). The correlations with the PCA scores for the genotype data are visualized in the Heatmap.pdf. [datasetname] will be replaced by the name of the dataset as supplied in the settings file. |
| Heatmap.pdf | loadeqtls | Visualisation of overall Z-scores per assessed pair of samples. The genotyped samples are plotted on the X-axis, and the gene expression samples are plotted on the Y-axis. The brightness of each box corresponds to the value of the overall Z-score, with lower values having brighter colors. The grey bars next to the sample names indicate the correlation of the sample with the first principal component, which is an indicator of sample quality. Samples are sorted alphabetically on both axes. |
| PermutationDistribution.txt | determinefdrthreshold | This file contains the distribution of overall Z-scores, which is used to determine the confidence interval for FDR multiple testing. |
| PermutedEQTLsPermutationRound[num].txt.gz | generateeqtls, determinefdrthreshold | This file contains eQTLs as determined based upon permuted data. These files are used to calculate the FDR distribution. |
| ROC.pdf | loadeqtls | Receiver operating characteristic (ROC) curve, describing the distance between the overall Z-scores on the diagonal of the heatmap (where the genotype sample name equals the gene expression sample name) and those of all other assessed pairs. Several lines are plotted, each corresponding to an increasing a priori chance of mixed-up samples. |
| ROC.txt | loadeqtls | A tab-separated text representation of the ROC curve. |
| SampleMixups.txt | loadeqtls | A tab-separated text file describing the best matched (gene expression) sample for each genotyped sample. |
| ScoresMatrix.txt | loadeqtls | A tab-separated text matrix of overall Z-scores as calculated by the method for each of the possible pairs of genotyped and gene expression samples. |
| eQTLProbesFDR[threshold].txt | generateeqtls, determinefdrthreshold | This file contains the strongest significant cis-eQTL effect for each probe. [threshold] is replaced by the FDR threshold value supplied in the settings file. |
| eQTLs.txt.gz | generateeqtls, determinefdrthreshold | This file contains the top 150000 cis-eQTL results, sorted by p-value. |
| eQTLsFDR[threshold].txt | generateeqtls, determinefdrthreshold | This file contains all significant cis-eQTL effects. [threshold] is replaced by the FDR threshold value supplied in the settings file. |
| eQTLsSelected.txt | determinefdrthreshold | File containing the eQTLs that were selected for sample mix-up analysis. |
| SuggestedCouplings.txt | loadeqtls | This file is identical to SampleMixups.txt, except that it has no header and omits the samples that were removed after sample mix-up correction (duplicate arrays, bad quality arrays, etc.). |
| log.txt | loadeqtls, generateeqtls, determinefdrthreshold | General log file. |
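ScoresMatrix.txt holds one overall Z-score per pair of genotyped and gene expression samples, so the best match for a sample can be looked up directly from it. The Java sketch below illustrates this under several assumptions that are not confirmed by the description above: that the first line is a header with column sample names, that each following line starts with a row sample name, and that the lowest score marks the best match. The class `ScoresMatrixSketch` and the method `bestColumnPerRow` are hypothetical names and are not part of MixupMapper; SampleMixups.txt remains the authoritative result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of MixupMapper.
class ScoresMatrixSketch {

    /**
     * For every row of a tab-separated score matrix, returns the name of the
     * column with the lowest score. Assumed layout: a header line whose first
     * cell is a label, then one line per row sample (name, then scores).
     */
    static Map<String, String> bestColumnPerRow(String matrixText) {
        String[] lines = matrixText.split("\\r?\\n");
        String[] header = lines[0].split("\t", -1);
        Map<String, String> best = new LinkedHashMap<String, String>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].trim().isEmpty()) {
                continue;
            }
            String[] fields = lines[i].split("\t", -1);
            int bestCol = -1;
            double bestScore = Double.POSITIVE_INFINITY;
            for (int j = 1; j < fields.length && j < header.length; j++) {
                double score = Double.parseDouble(fields[j]);
                // Assumption: a lower overall Z-score means a better match.
                if (score < bestScore) {
                    bestScore = score;
                    bestCol = j;
                }
            }
            if (bestCol > 0) {
                best.put(fields[0], header[bestCol]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up scores: G2 and G3 appear to be swapped with each other.
        String matrix = "-\tE1\tE2\tE3\n"
                + "G1\t-5.0\t0.3\t0.1\n"
                + "G2\t0.2\t0.4\t-4.5\n"
                + "G3\t0.0\t-4.8\t0.6\n";
        System.out.println(bestColumnPerRow(matrix));
    }
}
```

A row whose best-scoring column has a different name than the row itself is a candidate mix-up, which is the pattern the off-diagonal bright boxes in Heatmap.pdf show visually.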
Additional data to the manuscript
Below is a list of ZIP archives, containing the output from the software for each of the datasets analysed in our manuscript.
| Dataset | GEO Id/URL | TriTyper | ZIP-archive |
| Wolfs et al | GSE22070 | Unpublished | Liver |
| Choy et al | GSE11582 | Choy | CHB+JPT |
| | | Choy | CEU |
| | | Choy | YRI |
| Stranger et al | GSE6536 | Stranger | CHB+JPT |
| | | Stranger | CEU |
| | | Stranger | YRI |
| Zhang et al | GSE9703 | Zhang | CEU |
| | | Zhang | YRI |
| Heinzen et al | SNPExpress | Heinzen | Brain |
| | | Heinzen | PBMC |
| Webster et al | GSE15222 | Webster | Brain |
Contact
For any questions regarding the software or the method, please contact Harm-Jan Westra or Lude Franke.
