PrOCoil Data Repository (V1)

Contents

  1. Introduction
  2. All Data at a Glance
  3. PDB Data Set
  4. Clustering
  5. BLAST Augmentation
  6. Model Selection and Training
  7. Script for Computing Kernel Matrices

Introduction

The PrOCoil Web service and the R package procoil are based on the same support vector machine model that was trained according to the model selection procedure described in the following paper:
C. C. Mahrenholz, I. G. Abfalter, U. Bodenhofer, R. Volkmer, and S. Hochreiter. Complex networks govern coiled coil oligomerization - predicting and profiling by means of a machine learning approach. Mol. Cell. Proteomics 10(5):M110.004994, 2011. DOI: 10.1074/mcp.M110.004994
The purpose of this page is to make available all data that were used for evaluating the computational approach and for training the PrOCoil model. The data available on this page were used to train the original PrOCoil models, which were used in versions 1.x.y of the R package and by the PrOCoil Web service before May 7, 2016. The data that were used to train the updated models are available here.

All Data at a Glance

More details on these files can be found in the sections below.

PDB Data Set

The following file contains the dimeric and trimeric coiled coil segments we extracted from the PDB - The RCSB Protein Data Bank (version as of April 2007). This data set was created as follows:
  1. We scanned the whole PDB for coiled coils using SOCKET, from which only parallel dimeric and trimeric coiled coils were selected.
  2. We filtered out duplicates and exact sub-sequences.
  3. If a coiled coil segment had heptad irregularities, we only included the longest sub-sequence whose heptad register is regular.
In this way, we obtained the above file as the maximal, non-redundant set of parallel dimers and trimers without heptad irregularities.
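Step 2 above (removing duplicates and exact sub-sequences) can be sketched as follows. This is a minimal illustration, not the original extraction code, and the example sequences are hypothetical:

```python
def filter_redundant(seqs):
    """Keep only sequences that are not duplicates and do not occur
    verbatim inside a longer sequence."""
    unique = sorted(set(seqs), key=len, reverse=True)  # longest first
    kept = []
    for s in unique:
        # drop s if it is an exact sub-sequence of an already kept sequence
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept

# hypothetical coiled coil segments
segments = ["MKQLEDKVEELLSK", "LEDKVEELLSK", "MKQLEDKVEELLSK", "AVKQLEDKIEELLSK"]
print(filter_redundant(segments))  # → ['AVKQLEDKIEELLSK', 'MKQLEDKVEELLSK']
```

The duplicate and the exact sub-sequence are both removed, leaving the maximal non-redundant set.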

The following file provides a back-mapping of the above sequences to coiled coil segments in the PDB as identified by SOCKET:

Clustering

As described in the paper (see reference above), sequence clustering was performed to correct for sequence clusters that are over-represented in the PDB (e.g. GCN4 mutants). Single linkage clustering according to a gap-free, heptad-specific alignment was used to ensure that no pair of sequences from two different clusters matches to a degree of 60% or higher (the percentage computed as the number of matching positions relative to the length of the shorter sequence). The following data set provides the final grouping of samples in our PDB data set:
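The clustering criterion can be sketched as follows. This is a simplified illustration, not the original code: it assumes both sequences are already aligned in the same heptad register (gap-free), and the example sequences are hypothetical:

```python
def similarity(a, b):
    """Fraction of matching positions relative to the length of the
    shorter sequence (gap-free alignment, same heptad register assumed)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / min(len(a), len(b))

def single_linkage(seqs, threshold=0.6):
    """Single linkage clustering via union-find: merge clusters whenever
    ANY pair of sequences across them reaches the similarity threshold."""
    parent = list(range(len(seqs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(seqs[i])
    return list(clusters.values())

groups = single_linkage(["MKQLEDK", "MKQLEDR", "AVSWPNA"])
print(len(groups))  # → 2: the two near-identical sequences merge
```

Because the linkage is single (not average or complete), one sufficiently similar pair is enough to merge two clusters, which is exactly what guarantees that cross-cluster similarity stays below the threshold.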

BLAST Augmentation

As described in the paper (see reference above), the PrOCoil PDB Data Set was augmented by putative coiled coils that were determined by masked BLAST searches and Marcoil (the exact procedure is described in the paper). The following file provides the set of additional coiled coil sequences (the "BLAST Data Set"):

The following file provides the mapping between the PDB Data Set and the BLAST Data Set, i.e. by which coiled coil sequences from the BLAST Data Set each coiled coil sequence from the PDB Data Set can be augmented:

The total number of coiled coil sequences (PDB + BLAST Data Set) is 477 + 2357 = 2834. If we join the PDB Data Set and the BLAST Data Set, where the label of each coiled coil sequence in the BLAST Data Set is chosen according to which coiled coil sequence from the PDB Data Set it augments, the following data set is obtained:
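The join of the two data sets can be sketched as follows. The sequence identifiers and labels are hypothetical placeholders; the actual mapping comes from the file above:

```python
# hypothetical oligomerization labels (e.g. 2 = dimer, 3 = trimer) per PDB sequence
pdb_labels = {"SEQ_PDB_1": 2, "SEQ_PDB_2": 3}

# hypothetical mapping: PDB sequence -> BLAST sequences that augment it
augmentation = {
    "SEQ_PDB_1": ["SEQ_BLAST_A", "SEQ_BLAST_B"],
    "SEQ_PDB_2": ["SEQ_BLAST_C"],
}

# each BLAST sequence inherits the label of the PDB sequence it augments
joined = dict(pdb_labels)
for pdb_seq, blast_seqs in augmentation.items():
    for b in blast_seqs:
        joined[b] = pdb_labels[pdb_seq]

print(len(joined))  # → 5 (2 PDB + 3 BLAST sequences)
```

With the real data, the same join yields the 477 + 2357 = 2834 labeled sequences mentioned above.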

Model Selection and Training

For model selection, the Clustered PDB Data Set was randomly split into 10 folds. More specifically, every fold is the union of a random selection of clusters, so no cluster was split over different folds. Thereby, we ensured that no coiled coil sequences belonging to different folds have a sequence similarity of more than 60% (according to the definition above). The following archive contains all folds; they serve, in particular, as test sets for the cross validation runs:
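A cluster-preserving split can be sketched as follows. This is a minimal illustration, not the original code; distributing clusters round-robin after shuffling is an assumed balancing strategy:

```python
import random

def assign_folds(clusters, n_folds=10, seed=0):
    """Distribute whole clusters over folds so that no cluster is split
    across folds (round-robin over a shuffled cluster order)."""
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)
    folds = [[] for _ in range(n_folds)]
    for k, ci in enumerate(order):
        folds[k % n_folds].extend(clusters[ci])  # whole cluster into one fold
    return folds

# hypothetical clusters of sequence identifiers
demo_clusters = [["seq1", "seq2"], ["seq3"], ["seq4", "seq5", "seq6"]]
folds = assign_folds(demo_clusters, n_folds=2)
print(sum(len(f) for f in folds))  # → 6: every sequence lands in exactly one fold
```

Since similar sequences are, by construction of the clustering, always in the same cluster, keeping clusters intact automatically keeps cross-fold similarity below the 60% threshold.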

As described in the supplement of the paper (see reference above), we first used nested cross validation to validate our model selection procedure. In the outer cross validation loop, we withheld one fold as test set and applied 9-fold cross validation on the remaining 9 folds in the inner cross validation loop. The following archive contains all 45 + 45 training sets for the inner cross validation loops, each of which, as a consequence, contains 8 folds:
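The count of distinct inner-loop training sets can be verified as follows: each such set omits two of the 10 folds (the outer test fold and the inner validation fold), and the same 8-fold training set arises for both orderings of the omitted pair, so there are C(10, 2) = 45 distinct sets. This sketch only illustrates the counting; it does not reproduce the archive contents:

```python
from itertools import combinations

folds = list(range(10))  # fold indices 0..9

# one training set per unordered pair of omitted folds
training_sets = [
    [f for f in folds if f not in omitted]
    for omitted in combinations(folds, 2)
]

print(len(training_sets))     # → 45 distinct training sets
print(len(training_sets[0]))  # → 8 folds each
```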

In the outer cross validation loop, models were trained on all 9 training folds. For the final parameter selection, we used regular 10-fold cross validation. The following archive provides 10 + 10 training sets with only one test fold left out:

All files in any of these archives have the following format:

Script for Computing Kernel Matrices

The following Perl script can be used to compute coiled coil kernel matrices that can be supplied to support vector machine software. Example:
perl CoiledCoilKernel.pl -m=7 -norm=1 -output=LIBSVM PrOCoil_PDB_X07.csv > PrOCoil_PDB_X07.train
svm-train -t 4 -c 8 PrOCoil_PDB_X07.train PrOCoil_PDB_X07.model
perl CoiledCoilKernel.pl -m=7 -norm=1 -output=LIBSVM PrOCoil_PDB_X07.csv PrOCoil_PDB_F07.csv > PrOCoil_PDB_F07.test
svm-predict PrOCoil_PDB_F07.test PrOCoil_PDB_X07.model PrOCoil_PDB_F07.out
The first command takes the non-augmented cross validation training set without fold no. 7 and computes the normalized coiled coil kernel matrix with m=7. The second command trains a support vector machine with cost parameter C=8. The third command computes the kernel matrix of fold no. 7 as test set versus the training set. The fourth command performs SVM prediction. Note that you need Perl and LIBSVM to execute these commands.
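The `-output=LIBSVM` option presumably produces LIBSVM's precomputed-kernel format, which `svm-train -t 4` expects: one line per sample of the form `<label> 0:<serial> 1:K(i,1) ... n:K(i,n)`, with 1-based serial numbers. A minimal sketch of writing a kernel matrix in this format (hypothetical labels and kernel values):

```python
import io

def write_precomputed(labels, K, out):
    """Write a kernel matrix in LIBSVM precomputed-kernel format (-t 4):
    '<label> 0:<serial> 1:K(i,1) ... n:K(i,n)', serial numbers 1-based."""
    for i, (y, row) in enumerate(zip(labels, K), start=1):
        feats = " ".join(f"{j}:{v:g}" for j, v in enumerate(row, start=1))
        out.write(f"{y} 0:{i} {feats}\n")

# hypothetical 2x2 normalized kernel matrix with labels +1/-1
buf = io.StringIO()
write_precomputed([1, -1], [[1.0, 0.5], [0.5, 1.0]], buf)
print(buf.getvalue(), end="")
# → 1 0:1 1:1 2:0.5
#   -1 0:2 1:0.5 2:1
```

For a test set (the third command above), each row holds the kernel values of a test sample against all training samples, so the matrix is rectangular rather than square.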