PrOCoil Data Repository (V1)

Contents

  1. Introduction
  2. All Data at a Glance
  3. PDB Data Set
  4. Clustering
  5. BLAST Augmentation
  6. Model Selection and Training
  7. Script for Computing Kernel Matrices

Introduction

The PrOCoil Web service and the R package procoil are based on the same support vector machine model that was trained according to the model selection procedure described in the following paper:
C. C. Mahrenholz, I. G. Abfalter, U. Bodenhofer, R. Volkmer, and S. Hochreiter. Complex networks govern coiled coil oligomerization - predicting and profiling by means of a machine learning approach. Mol. Cell. Proteomics 10(5):M110.004994, 2011. DOI: 10.1074/mcp.M110.004994
The purpose of this page is to make available all data that were used for evaluating the computational approach and for training the PrOCoil model. The data available on this page were used to train the original PrOCoil models, which were used in versions 1.x.y of the R package and by the PrOCoil Web service before May 7, 2016. The data that were used to train the updated models are available here.

All Data at a Glance

More details on these files can be found in the sections below.

PDB Data Set

The following file contains the dimeric and trimeric coiled coil segments we extracted from the PDB - The RCSB Protein Data Bank (version as of April 2007). This data set was created as follows:
  1. We scanned the whole PDB for coiled coils using SOCKET, from which only parallel dimeric and trimeric coiled coils were selected.
  2. We filtered out duplicates and exact sub-sequences.
  3. If a coiled coil segment had heptad irregularities, we only included the longest sub-sequence whose heptad register is regular.
In this way, we obtained the above file as the maximal, non-redundant set of parallel dimers and trimers without heptad irregularities.
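Step 2 above (removing duplicates and exact sub-sequences) can be sketched as follows. This is a minimal illustration, not the original extraction code, and the example sequences are hypothetical:

```python
def filter_redundant(seqs):
    """Keep only sequences that are not duplicates and do not occur
    verbatim inside a longer sequence."""
    unique = sorted(set(seqs), key=len, reverse=True)  # longest first
    kept = []
    for s in unique:
        # drop s if it is an exact sub-sequence of an already kept sequence
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept

# hypothetical coiled coil segments
segments = ["MKQLEDKVEELLSK", "LEDKVEELLSK", "MKQLEDKVEELLSK", "AVKQLEDKIEELLSK"]
print(filter_redundant(segments))  # → ['AVKQLEDKIEELLSK', 'MKQLEDKVEELLSK']
```

The duplicate and the exact sub-sequence are both removed, leaving the maximal non-redundant set.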

The following file provides a back-mapping of the above sequences to coiled coil segments in the PDB as identified by SOCKET:

Clustering

As described in the paper (see reference above), sequence clustering was performed to correct for sequence clusters that are over-represented in the PDB (e.g. GCN4 mutants). Single linkage clustering according to a gap-free, heptad-specific alignment was used to ensure that no pair of sequences from two different clusters matches to a degree of 60% or higher (the percentage computed as the number of matching positions relative to the length of the shorter sequence). The following data set provides the final grouping of samples in our PDB data set:
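The clustering criterion can be sketched as follows. This is a simplified illustration, not the original code: it assumes both sequences are already aligned in the same heptad register (gap-free), and the example sequences are hypothetical:

```python
def similarity(a, b):
    """Fraction of matching positions relative to the length of the
    shorter sequence (gap-free alignment, same heptad register assumed)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / min(len(a), len(b))

def single_linkage(seqs, threshold=0.6):
    """Single linkage clustering via union-find: merge clusters whenever
    ANY pair of sequences across them reaches the similarity threshold."""
    parent = list(range(len(seqs)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if similarity(seqs[i], seqs[j]) >= threshold:
                parent[find(i)] = find(j)
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(seqs[i])
    return list(clusters.values())

groups = single_linkage(["MKQLEDK", "MKQLEDR", "AVSWPNA"])
print(len(groups))  # → 2: the two near-identical sequences merge
```

Because the linkage is single (not average or complete), one sufficiently similar pair is enough to merge two clusters, which is exactly what guarantees that cross-cluster similarity stays below the threshold.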

BLAST Augmentation

As described in the paper (see reference above), the PrOCoil PDB Data Set was augmented by putative coiled coils that were determined by masked BLAST searches and Marcoil (the exact procedure is described in the paper). The following file provides the set of additional coiled coil sequences (the "BLAST Data Set"):

The following file provides the mapping between the PDB Data Set and the BLAST Data Set, i.e. by which coiled coil sequences from the BLAST Data Set each coiled coil sequence from the PDB Data Set can be augmented:

The total number of coiled coil sequences (PDB + BLAST Data Set) is 477 + 2357 = 2834. If we join the PDB Data Set and the BLAST Data Set, where the label of each coiled coil sequence in the BLAST Data Set is chosen according to which coiled coil sequence from the PDB Data Set it augments, the following data set is obtained:
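The join of the two data sets can be sketched as follows. The sequence identifiers and labels are hypothetical placeholders; the actual mapping comes from the file above:

```python
# hypothetical oligomerization labels (e.g. 2 = dimer, 3 = trimer) per PDB sequence
pdb_labels = {"SEQ_PDB_1": 2, "SEQ_PDB_2": 3}

# hypothetical mapping: PDB sequence -> BLAST sequences that augment it
augmentation = {
    "SEQ_PDB_1": ["SEQ_BLAST_A", "SEQ_BLAST_B"],
    "SEQ_PDB_2": ["SEQ_BLAST_C"],
}

# each BLAST sequence inherits the label of the PDB sequence it augments
joined = dict(pdb_labels)
for pdb_seq, blast_seqs in augmentation.items():
    for b in blast_seqs:
        joined[b] = pdb_labels[pdb_seq]

print(len(joined))  # → 5 (2 PDB + 3 BLAST sequences)
```

With the real data, the same join yields the 477 + 2357 = 2834 labeled sequences mentioned above.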

Model Selection and Training

For model selection, the Clustered PDB Data Set was randomly split into 10 folds. More specifically, every fold is the union of a random selection of clusters, so no cluster was split over different folds. Thereby, we ensured that no coiled coil sequences belonging to different folds have a sequence similarity of more than 60% (according to the definition above). The following archive contains all folds; they serve, in particular, as test sets for the cross validation runs:
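A cluster-preserving split can be sketched as follows. This is a minimal illustration, not the original code; distributing clusters round-robin after shuffling is an assumed balancing strategy:

```python
import random

def assign_folds(clusters, n_folds=10, seed=0):
    """Distribute whole clusters over folds so that no cluster is split
    across folds (round-robin over a shuffled cluster order)."""
    rng = random.Random(seed)
    order = list(range(len(clusters)))
    rng.shuffle(order)
    folds = [[] for _ in range(n_folds)]
    for k, ci in enumerate(order):
        folds[k % n_folds].extend(clusters[ci])  # whole cluster into one fold
    return folds

# hypothetical clusters of sequence identifiers
demo_clusters = [["seq1", "seq2"], ["seq3"], ["seq4", "seq5", "seq6"]]
folds = assign_folds(demo_clusters, n_folds=2)
print(sum(len(f) for f in folds))  # → 6: every sequence lands in exactly one fold
```

Since similar sequences are, by construction of the clustering, always in the same cluster, keeping clusters intact automatically keeps cross-fold similarity below the 60% threshold.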

As described in the supplement of the paper (see reference above), we first used nested cross validation to validate our model selection procedure. In the outer cross validation loop, we withheld one fold as test set and applied 9-fold cross validation on the remaining 9 folds in the inner cross validation loop. The following archive contains all 45 + 45 training sets for the inner cross validation loops, each of which, as a consequence, contains 8 folds:
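The count of distinct inner-loop training sets can be verified as follows: each such set omits two of the 10 folds (the outer test fold and the inner validation fold), and the same 8-fold training set arises for both orderings of the omitted pair, so there are C(10, 2) = 45 distinct sets. This sketch only illustrates the counting; it does not reproduce the archive contents:

```python
from itertools import combinations

folds = list(range(10))  # fold indices 0..9

# one training set per unordered pair of omitted folds
training_sets = [
    [f for f in folds if f not in omitted]
    for omitted in combinations(folds, 2)
]

print(len(training_sets))     # → 45 distinct training sets
print(len(training_sets[0]))  # → 8 folds each
```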

In the outer cross validation loop, models were trained on all 9 training folds. For the final parameter selection, we used regular 10-fold cross validation. The following archive provides 10 + 10 training sets with only one test fold left out:

All files in any of these archives have the following format:

Script for Computing Kernel Matrices

The following Perl script can be used to compute coiled coil kernel matrices that can be supplied to support vector machine software. Example:
perl CoiledCoilKernel.pl -m=7 -norm=1 -output=LIBSVM PrOCoil_PDB_X07.csv > PrOCoil_PDB_X07.train
svm-train -t 4 -c 8 PrOCoil_PDB_X07.train PrOCoil_PDB_X07.model
perl CoiledCoilKernel.pl -m=7 -norm=1 -output=LIBSVM PrOCoil_PDB_X07.csv PrOCoil_PDB_F07.csv > PrOCoil_PDB_F07.test
svm-predict PrOCoil_PDB_F07.test PrOCoil_PDB_X07.model PrOCoil_PDB_F07.out
The first command takes the non-augmented cross validation training set without fold no. 7 and computes the normalized coiled coil kernel matrix with m=7. The second command trains a support vector machine with cost parameter C=8. The third command computes the kernel matrix of fold no. 7 as test set versus the training set. The fourth command performs SVM prediction. Note that you need Perl and LIBSVM to execute these commands.
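The `-output=LIBSVM` option presumably produces LIBSVM's precomputed-kernel format, which `svm-train -t 4` expects: one line per sample of the form `<label> 0:<serial> 1:K(i,1) ... n:K(i,n)`, with 1-based serial numbers. A minimal sketch of writing a kernel matrix in this format (hypothetical labels and kernel values):

```python
import io

def write_precomputed(labels, K, out):
    """Write a kernel matrix in LIBSVM precomputed-kernel format (-t 4):
    '<label> 0:<serial> 1:K(i,1) ... n:K(i,n)', serial numbers 1-based."""
    for i, (y, row) in enumerate(zip(labels, K), start=1):
        feats = " ".join(f"{j}:{v:g}" for j, v in enumerate(row, start=1))
        out.write(f"{y} 0:{i} {feats}\n")

# hypothetical 2x2 normalized kernel matrix with labels +1/-1
buf = io.StringIO()
write_precomputed([1, -1], [[1.0, 0.5], [0.5, 1.0]], buf)
print(buf.getvalue(), end="")
# → 1 0:1 1:1 2:0.5
#   -1 0:2 1:0.5 2:1
```

For a test set (the third command above), each row holds the kernel values of a test sample against all training samples, so the matrix is rectangular rather than square.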