Prof. Itay Mayrose Lab - Plant Evolution, bioinformatics, & comparative genomics

Clumpak - Cluster Markov Packager Across K


This is a beta version.

We are working on improving CLUMPAK for enhanced flexibility and a better user interface. If you run into problems please email evolseq@tauex.tau.ac.il and specify the job ID.
If you haven't received any results within a few hours, please contact us and re-submit. See Known Issues. Thank you!


CLUMPAK aids users in automating the process of analyzing the results of genotype clustering programs such as STRUCTURE. CLUMPAK separates groups of runs representing distinct solutions, and identifies an optimal cluster label alignment across different values of K, simplifying the comparison of clustering results across K. In addition, CLUMPAK implements a method for the identification of a preferred choice of K, and a comparison test for solutions obtained by different programs, models, or subsets of data.


1. Introduction


Overview

Clustering individuals into populations, based on multi-locus genotypes, has become a critical step in population genetics studies. Many different programs have been developed in order to face the challenge of dividing individuals into a predefined number of populations, K. The most widely used of these programs is STRUCTURE (Pritchard et al. 2000; Falush et al. 2003; Falush et al. 2007; Hubisz et al. 2009). We refer to these methods as STRUCTURE-like (Weiss and Long 2009). The result of a single cluster analysis is typically given as a matrix, [Q]ik, which contains for each individual i the membership coefficients of that individual in each of the k clusters. These coefficients can be interpreted as membership probabilities or as the fraction of the genome with membership in the cluster.

Many of the STRUCTURE-like programs are stochastic and tend to produce different outcomes for replicate runs, even when the same model and parameters are used. For this reason, users often conduct multiple runs for the same model and parameters. Distinct solutions can result from multimodality in the solution space, or from label switching between clusters. In addition, since the user needs to define the number of clusters, a range of K values is often used, each with multiple independent runs. Thus, the user is faced with the challenge of summarizing and comparing hundreds, and sometimes thousands, of runs, within and across K values.
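
To illustrate label switching, the following minimal sketch takes two toy Q-matrices that describe the same clustering with permuted cluster labels and finds the column permutation that brings them into agreement. It is a brute-force illustration only, practical for small K, and is not the algorithm used by CLUMPP or CLUMPAK; the function name, the toy data, and the squared-difference criterion are assumptions made for this example.

```python
from itertools import permutations


def best_relabeling(q_ref, q_run):
    """Return (perm, aligned): the column order of q_run closest to q_ref."""
    k = len(q_ref[0])

    def distance(perm):
        # Sum of squared differences after reordering q_run's columns by perm.
        return sum(
            (row_ref[j] - row_run[perm[j]]) ** 2
            for row_ref, row_run in zip(q_ref, q_run)
            for j in range(k)
        )

    # Brute force: try every relabeling of the K clusters (K! candidates).
    perm = min(permutations(range(k)), key=distance)
    aligned = [[row[perm[j]] for j in range(k)] for row in q_run]
    return perm, aligned


# Two toy runs for K = 3: same clustering, different cluster labels.
run_a = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
run_b = [[0.0, 0.9, 0.1], [0.1, 0.1, 0.8], [0.8, 0.0, 0.2]]

perm, aligned_b = best_relabeling(run_a, run_b)
print(perm)                # column of run_b matched to each column of run_a
print(aligned_b == run_a)  # the runs agree once labels are aligned
```

If the best permutation still leaves a large distance, the two runs likely represent genuinely different solutions (multimodality) rather than label switching.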

CLUMPAK - Cluster Markov Packager Across K - was developed to aid users in analyzing the results of STRUCTURE-like programs. The software offers a few alternative modes of action, with the main one offering a full pipeline for the summarization and graphical illustration of the results obtained by STRUCTURE or other STRUCTURE-like programs. The input to CLUMPAK is STRUCTURE outputs or Q-matrices obtained by STRUCTURE or by other STRUCTURE-like programs, properly formatted to match one of the three input formats supported by CLUMPAK. Additional features allow comparison of programs or models and selection of the preferred value of K according to the method of Evanno et al. (Evanno et al. 2005).


Aims

CLUMPAK was designed to aid users in four main objectives:
(1) Separate distinct solutions obtained from STRUCTURE-like programs.
(2) Compare and align solutions obtained for different K values.
(3) Compare results obtained using different models/data subsets/programs.
(4) Indicate the preferred value of K according to Evanno et al.


2. Getting started


Standalone version

If you would like to install CLUMPAK on a local Linux machine, download the current version. Detailed instructions are provided in the CLUMPAK Documentation. The following text refers mainly to the online version, but the online and standalone versions are almost identical and have the same functionality.


3. Input files and optional parameters


The input to all of CLUMPAK’s features includes the result files as obtained through the STRUCTURE (or a STRUCTURE-like) program. CLUMPAK supports three formats for result files:

(1) Full STRUCTURE results (example: structure_output.txt).
(2) A shortened/truncated STRUCTURE format (hereafter referred to as STRUCTURE-Q, example: truncated_structure_output.txt).
(3) Simple Q-matrix (example: simple_Q_output.txt).

Here is an explanation of the two Q-matrix formats (2 & 3 above): if NUM_INDS is the number of individuals in your data, and K is the number of predefined populations (i.e. clusters), then the simple Q-matrix (format 3) should contain NUM_INDS rows with K columns per row. Each row shows the membership coefficients for one individual. See Simple_Q.txt for an example. A STRUCTURE-Q (format 2) differs from the simple Q-matrix in that it contains five additional columns before the membership coefficients, with the fourth one being the population ID for each individual. CLUMPAK uses this population ID for graphical purposes. The other additional columns are ignored.
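
The relationship between the two Q-matrix formats can be sketched as follows. This is a minimal illustration that assumes whitespace-separated columns and follows the description above (five leading columns, the fourth being the population ID, followed by K membership coefficients). The function name and the two example rows are hypothetical and are not taken from the CLUMPAK example files.

```python
def structure_q_to_simple_q(lines):
    """Split STRUCTURE-Q rows into simple Q-matrix rows and population IDs."""
    q_rows = []
    population_ids = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Fourth of the five leading columns is the population ID.
        population_ids.append(int(fields[3]))
        # Everything after the five leading columns is a membership coefficient.
        q_rows.append([float(value) for value in fields[5:]])
    # Every individual must have the same number of clusters, K.
    if len({len(row) for row in q_rows}) > 1:
        raise ValueError("rows have different numbers of membership coefficients")
    return q_rows, population_ids


# Hypothetical STRUCTURE-Q rows for two individuals and K = 2.
example_rows = [
    "  1   ind1   (0)   1 :  0.995 0.005",
    "  2   ind2   (3)   2 :  0.250 0.750",
]
simple_q, pops = structure_q_to_simple_q(example_rows)
print(simple_q)
print(pops)
```

The first returned value corresponds to the content of a simple Q-matrix, and the second to the content of a populations_file, one entry per individual in the same order.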

If simple Q-matrices are provided, an additional file is required that contains the population IDs for the individuals in the data – a populations_file (see toy_data_populations_file.txt). This file should have the same number of lines as the Q-matrix file (i.e. NUM_INDS rows), where each line contains an integer that codes for the population ID. The order of individuals in the result files (i.e. Q-matrices) and the populations_file should match.
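
Before uploading, it may help to check that a simple Q-matrix and its populations_file are consistent. The sketch below is one possible check based only on the requirements stated above (NUM_INDS rows of K columns, and one integer population ID per individual). The function name and the toy data are assumptions, and CLUMPAK itself may perform different or additional validation.

```python
def check_simple_q_inputs(q_lines, population_lines):
    """Return (NUM_INDS, K) if the inputs are consistent, else raise ValueError."""
    q_rows = [line.split() for line in q_lines if line.strip()]
    pop_rows = [line.strip() for line in population_lines if line.strip()]
    if not q_rows:
        raise ValueError("the Q-matrix is empty")

    k = len(q_rows[0])
    for number, row in enumerate(q_rows, start=1):
        if len(row) != k:
            raise ValueError(f"Q-matrix row {number} has {len(row)} columns, expected {k}")
        for value in row:
            float(value)  # raises ValueError if a coefficient is not numeric

    if len(pop_rows) != len(q_rows):
        raise ValueError(
            f"populations_file has {len(pop_rows)} lines, Q-matrix has {len(q_rows)} rows"
        )
    for value in pop_rows:
        int(value)  # raises ValueError if a population ID is not an integer

    return len(q_rows), k


# Toy data: three individuals, K = 2, two population codes.
q_lines = ["0.90 0.10", "0.20 0.80", "0.55 0.45"]
population_lines = ["1", "1", "2"]
shape = check_simple_q_inputs(q_lines, population_lines)
print(shape)
```

In practice the two lists would be read from the files with something like open(path).readlines().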

How to zip your result files: On Linux, you can use the ‘zip’ command: zip my_data.zip file1 file2 file3 (and so on). On Windows, you can use WinRAR with the ‘zip’ option.

‘Advanced options’ files

For the main pipeline, ‘DISTRUCT for many K’s’, and ‘Compare’ features, there are a number of other optional input files:
(1) labels_file - contains text labels for populations (see toy_data_labels.txt). This file is optional, and affects only the graphical representation of the results. If provided, the order of populations in the produced figure will reflect their order in the labels_file, and the labels will be used below the figure. If it is not provided, population codes will be extracted from the result files.
(2) colors_file - contains colors to be used in the produced figures (see colors_file.txt). Colors recognized by CLUMPAK are those recognized by DISTRUCT; please consult the DISTRUCT manual for a full list of colors. Colors should be numbered and ordered, and the number of colors should be equal to or larger than the largest K value in the result files.
(3) drawparams_file - conforms to the format of DISTRUCT’s drawparams file (see drawparams.txt). Please consult the DISTRUCT manual for additional details.

‘Advanced options’ parameters

CLUMPAK calls CLUMPP (Jakobsson and Rosenberg 2007) to determine the similarity of runs within a single K value, both in the main pipeline and in the ‘Compare’ feature. The LargeKGreedy algorithm of CLUMPP is used, with 2000 repeats by default (see the CLUMPP manual for further details). You can change this default value through the website, but if your chosen value is likely to lead to excessively long running times on our server, we will reset it to the default.

CLUMPAK uses the MCL program (Enright et al. 2002; Van Dongen 2008) in order to separate distinct solutions of runs within a single K value, both in the main pipeline and in the ‘Compare’ feature. CLUMPAK explores a range of thresholds for the inclusion of edges in the graph, and the threshold is determined independently for each K value. Alternatively, a fixed threshold can be determined by the user, by choosing the ‘User defined’ option on the website. The value of the threshold should be equal to or larger than 0, and smaller than 1, but small values are not recommended, as they may prevent the separation of distinct modes.
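
The effect of the threshold can be illustrated with a toy example. The sketch below keeps an edge between two runs only if their similarity reaches the threshold, and then groups runs by connected components. Connected components are used here purely as a simple stand-in for MCL, which is a different and more refined clustering algorithm; the function name, the similarity values, and the "greater than or equal to" inclusion rule are assumptions made for this illustration.

```python
def modes_at_threshold(similarity, threshold):
    """Group runs into connected components of the thresholded similarity graph."""
    if not 0 <= threshold < 1:
        raise ValueError("threshold must satisfy 0 <= threshold < 1")

    unvisited = set(range(len(similarity)))
    groups = []
    while unvisited:
        start = min(unvisited)
        unvisited.remove(start)
        stack = [start]
        group = [start]
        while stack:
            i = stack.pop()
            # An edge i-j is included only if the similarity reaches the threshold.
            neighbors = [j for j in unvisited if similarity[i][j] >= threshold]
            for j in neighbors:
                unvisited.remove(j)
                stack.append(j)
                group.append(j)
        groups.append(sorted(group))
    return groups


# Toy pairwise similarities for four runs: runs 0-1 agree, runs 2-3 agree.
similarity = [
    [1.00, 0.98, 0.60, 0.60],
    [0.98, 1.00, 0.60, 0.60],
    [0.60, 0.60, 1.00, 0.97],
    [0.60, 0.60, 0.97, 1.00],
]
high = modes_at_threshold(similarity, 0.9)
low = modes_at_threshold(similarity, 0.5)
print(high)  # two groups of runs
print(low)   # one group: the low threshold no longer separates the modes
```

With the low threshold every pair of runs is connected, which shows why small threshold values may prevent the separation of distinct modes.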


4. Usage options


CLUMPAK offers four modes of action – the main pipeline, DISTRUCT for many K’s, Compare, and Best K by Evanno.


Main pipeline

The main pipeline of CLUMPAK aims to help users through the entire process of summarizing and presenting the results of STRUCTURE-like programs. The input for the main pipeline is a set of STRUCTURE runs, or Q-matrices obtained from STRUCTURE-like programs, produced for the same data set for a range of K values. For example, the input might consist of 10 runs for each K value, with K ranging from 2 to 10.

CLUMPAK expects the result files, regardless of their format, to be zipped in one of the following ways:
A. Zip together all your results to one zip file, without separating different K values. Note that you should zip the files – and not a directory containing the files.
B. First, for each K value, zip together all the runs; second, the zip files for different K values should be zipped together.
Whether you choose to follow A. or B., you need to upload one zip file which contains all the results you previously obtained from STRUCTURE or an alternative STRUCTURE-like program. If your input is in the simple Q-matrix format (i.e. ADMIXTURE format), you are required to upload an additional file - a populations_file, which will identify the population code for each individual (see toy_data_populations_file.txt).
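
The two zipping layouts can be sketched with Python's standard zipfile module. This is a minimal illustration, not part of CLUMPAK: the function names, the archive names, and the placeholder result files are hypothetical. Each file is stored under its base name only, so the archive contains the files themselves and not a directory.

```python
import os
import tempfile
import zipfile


def zip_flat(zip_path, files):
    """Layout A: put the given files at the top level of one zip file."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            # arcname drops the directory part, so no folder ends up in the zip.
            archive.write(path, arcname=os.path.basename(path))


def zip_per_k(zip_path, files_by_k, work_dir):
    """Layout B: one zip per K value, then a zip of those zip files."""
    inner_paths = []
    for k, files in sorted(files_by_k.items()):
        inner = os.path.join(work_dir, f"K{k}.zip")  # hypothetical naming
        zip_flat(inner, files)
        inner_paths.append(inner)
    zip_flat(zip_path, inner_paths)


# Demonstration on placeholder files: two runs for each of K = 2 and K = 3.
with tempfile.TemporaryDirectory() as tmp:
    files_by_k = {}
    for k in (2, 3):
        files_by_k[k] = []
        for run in (1, 2):
            path = os.path.join(tmp, f"K{k}_run{run}.txt")
            with open(path, "w") as handle:
                handle.write("placeholder result file\n")
            files_by_k[k].append(path)

    all_files = [p for paths in files_by_k.values() for p in paths]
    zip_flat(os.path.join(tmp, "layout_a.zip"), all_files)
    zip_per_k(os.path.join(tmp, "layout_b.zip"), files_by_k, tmp)

    with zipfile.ZipFile(os.path.join(tmp, "layout_a.zip")) as archive:
        layout_a_names = sorted(archive.namelist())
    with zipfile.ZipFile(os.path.join(tmp, "layout_b.zip")) as archive:
        layout_b_names = sorted(archive.namelist())

print(layout_a_names)
print(layout_b_names)
```

Listing the archive contents, as done at the end, is a quick way to confirm that no directory entry was included before uploading.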

Optional input files are the labels_file, colors_file and drawparams_file (see toy_data_labels.txt, colors_file.txt, and drawparams.txt).


DISTRUCT for many K’s

The ‘DISTRUCT for many K’s’ feature aims to help users align single results obtained for different K values. These single results might be individual runs or averages obtained for multiple independent runs. The required input is a zip file which contains a single result for each K value. For example, if K ranges from 2 to 10, then 9 result files should be zipped together. Result files can be provided in any one of the three supported formats. If your input is in the simple Q-matrix format, you are required to upload an additional file - a populations_file, which will identify the population code for each individual (see toy_data_populations_file.txt).

CLUMPAK expects the result files, regardless of their format, to be zipped in the following manner: Zip together all your results to one zip file, without separating different K values. Note that you should zip the files – and not a directory containing the files.

Optional input files are the labels_file, colors_file and drawparams_file (see toy_data_labels.txt, colors_file.txt, and drawparams.txt).


Compare

The ‘Compare’ feature aims at helping users determine whether the results obtained for a single K value (using the same set of individuals) with different programs, modeling assumptions, or different subsets of markers, are significantly different. The required input is a zip file which contains two zip files, each holding the result files obtained from one model/program/subset of data. The two sets of results are required to have the same number and order of individuals as well as the same number of clusters assumed (K). Result files can be provided in any one of the three supported formats. If your input is in the simple Q-matrix format, you are required to upload an additional file - a populations_file, which will identify the population code for each individual (see toy_data_populations_file.txt).

CLUMPAK expects the result files, regardless of their format, to be zipped in the following manner: Zip together all your results from the first model to one zip file, without separating different K values. Note that you should zip the files – and not a directory containing the files. Repeat for the second model. Now zip together the two zip files. Note that you should zip the zip files – and not a directory containing the zip files.

Optional input files are the labels_file, colors_file and drawparams_file (see toy_data_labels.txt, colors_file.txt, and drawparams.txt).


Best K by Evanno

The required input is the result files of STRUCTURE runs, which were produced for the same data set for a range of K values. For example, the input might be composed of 10 runs for each K value, with K ranging from 2 to 10. This feature is currently supported only for STRUCTURE outputs; the Q-matrix formats are not supported, since likelihood (or log-likelihood) values are required. The results should be zipped as explained under the Main pipeline section above.
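
For orientation, the Delta K statistic of Evanno et al. (2005) can be sketched as follows. This is a minimal illustration that assumes the log-likelihood of each run has already been extracted from the STRUCTURE outputs; it computes the absolute second-order difference of the per-K mean log-likelihood, divided by the standard deviation of the log-likelihood across runs at that K. The function name and the toy values are assumptions, and the sketch is not necessarily identical to CLUMPAK's own implementation.

```python
from statistics import mean, stdev


def evanno_delta_k(log_likelihoods):
    """Map each interior K to Delta K, given {K: [log-likelihood per run]}."""
    means = {k: mean(values) for k, values in log_likelihoods.items()}
    delta = {}
    for k in sorted(log_likelihoods):
        if k - 1 not in means or k + 1 not in means:
            continue  # Delta K needs both neighbors, so the ends are undefined
        sd = stdev(log_likelihoods[k])
        if sd == 0:
            continue  # avoid dividing by zero when all runs agree exactly
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta[k] = second_diff / sd
    return delta


# Toy log-likelihoods: three runs for each K from 1 to 5 (made-up values).
toy = {
    1: [-5000, -5002, -4998],
    2: [-4500, -4501, -4499],
    3: [-4000, -4002, -3998],
    4: [-3990, -3995, -3985],
    5: [-3985, -3975, -3995],
}
delta_k = evanno_delta_k(toy)
best_k = max(delta_k, key=delta_k.get)
print(delta_k)
print(best_k)
```

Because Delta K relies on the neighboring K values, it cannot be computed for the smallest or largest K in the tested range, and at least two runs per K are needed to estimate the standard deviation.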


5. How to cite this program


"CLUMPAK: a program for identifying clustering modes and packaging population structure inferences across K".

Kopelman, Naama M; Mayzel, Jonathan; Jakobsson, Mattias; Rosenberg, Noah A; Mayrose, Itay.

Molecular Ecology Resources 15(5): 1179-1191, doi: 10.1111/1755-0998.12387


6. References


Alexander, D. H., J. Novembre and K. Lange, 2009 Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19: 1655-1664.

Alexander, D. H., and K. Lange, 2011 Enhancements to the ADMIXTURE algorithm for individual ancestry estimation. BMC Bioinformatics 12: 246.

Enright, A. J., S. Van Dongen and C. A. Ouzounis, 2002 An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30: 1575-1584.

Evanno, G., S. Regnaut and J. Goudet, 2005 Detecting the number of clusters of individuals using the software structure: a simulation study. Mol. Ecol. 14: 2611-2620.

Falush, D., M. Stephens and J. K. Pritchard, 2003 Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164: 1567-1587.

Falush, D., M. Stephens and J. K. Pritchard, 2007 Inference of population structure using multilocus genotype data: dominant markers and null alleles. Mol. Ecol. Notes 7: 574-578.

Hubisz, M. J., D. Falush, M. Stephens and J. K. Pritchard, 2009 Inferring weak population structure with the assistance of sample group information. Mol. Ecol. Resour. 9: 1322-1332.

Jakobsson, M., and N. A. Rosenberg, 2007 CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23: 1801-1806.

Pritchard, J. K., M. Stephens and P. Donnelly, 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945-959.

Rosenberg, N. A., 2004 DISTRUCT: a program for the graphical display of population structure. Mol. Ecol. Notes 4: 137-138.

Van Dongen, S., 2008 Graph clustering via a discrete uncoupling process. SIAM J. Matrix Anal. Appl. 30: 121-141.

Weiss, K. M., and J. C. Long, 2009 Non-Darwinian estimation: my ancestors, my genes' ancestors. Genome Res. 19: 703-710.