Large-scale comparison of machine learning methods for drug target prediction on ChEMBL

Directory structure

genDirStructure.cpp creates the following directory structure:

`SampleIdTable.txt`	sample names corresponding to binary data stored in subdirectories
`chemFeatures`
`chemFeatures/cl`	contains the clustering file `cl1.info` that was used; further results from clusterMinFull.cpp are stored in the subdirectory `clusterMinFull`
`chemFeatures/d`	one subdirectory for each dense, real-valued data matrix (csv)
`chemFeatures/s`	one subdirectory for each sparse matrix (fpf)
`train`	contains the file `tocompute.info`, which describes the targets (assays) to consider, and the file `train.info`, which describes compound-assay relations (duplicate entries may exist)
`run`	results from the C/C++ pipeline are stored in subdirectories here

Furthermore, the Python pipeline assumes the following directories:

`dataPython`	all data stored in Python format
`dataPythonReduced`	only the compounds considered, in Python format (reduces main memory consumption)
`resPython`	deep learning results, stored in subdirectories
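
As a quick sanity check before running either pipeline, the layout described above can be tested with a few lines of Python. This is only a sketch: the helper name `missing_dirs` is hypothetical, and it assumes that all directories live under one common base directory, which this README does not state explicitly.

```python
from pathlib import Path

# Directory names as listed in this README.
# Assumption: they all sit under one common base directory.
PIPELINE_DIRS = [
    "chemFeatures/cl",
    "chemFeatures/d",
    "chemFeatures/s",
    "train",
    "run",
    "dataPython",
    "dataPythonReduced",
    "resPython",
]


def missing_dirs(base):
    """Return the expected directories that do not exist under `base`."""
    base = Path(base).expanduser()
    return [name for name in PIPELINE_DIRS if not (base / name).is_dir()]
```

Calling `missing_dirs("~/jkuLSCData")` returns an empty list if every directory is present, and otherwise the relative paths that still have to be created or downloaded.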

You can download the data listed below with the following commands:
wget https://raw.githubusercontent.com/ml-jku/lsc/master/download.sh
chmod u+x download.sh
./download.sh ~/jkuLSCData

Data `SampleIdTable.txt`

SampleIdTable.txt (MD5: df7e773de4ce0272d8ed8207c4ef6a6f)

Data `chemFeatures/cl`

clusterMinFull.zip (MD5: 039c96f160dc66b326c9bd220327528f)
cl1.info (MD5: 7187541a8706bdfbc32d53c6f936266c)

Data `chemFeatures/d`

dense.zip (MD5: 947a5ecf3aef70a4a20ae7925e9dd4d5)
semisparse.zip (MD5: b3e8068e3c20d4a5dde6cdb1e09159dd)
toxicophores.zip (MD5: 19b5a2879ba601e0fba7db237606f537)

Data `chemFeatures/s`

ECFC4.zip (MD5: 7f2d170469eb8cb09bc65d17349e8ad7)
ECFC6_ES.zip (MD5: b13e686c6c01357ab371e47a8c8835ff)
DFS8_ES.zip (MD5: fbd79243d2774d1d2f7ccc9ffad266c3)
semisparse.zip (MD5: af609242aacb3f1ed364e907d9167ccb)
toxicophores.zip (MD5: 673acdc5628898ef340fac35f3495adb)

Data `train`

train.zip (MD5: 5cb9ccd3486651a4bcd90a7c583dcbc8)
tocompute.info (MD5: e5ee2c1c729f1af7190e1d220241a780)

Data `dataPython`

dataPython.zip (MD5: 1e3166eba4406b00f490a0a7c021bd3e, does not include data for GC, Weave and SmilesLSTM)

Data `dataPythonReduced`

dataPythonReduced.zip (MD5: c665bb9f4acbfce94adc6c4ab240bac0, does not include data for GC, Weave and SmilesLSTM)
chembl20LSTM.pckl (MD5: 357f80595f6b081daef839c7f642fb30)
chembl20Smiles.pckl (MD5: bb3027dbc41163fc55cdff9b5fedbb50)
chembl20MACCS.pckl (MD5: 3a79efc3230cb4eed512d6795b124fd0)
chembl20Deepchem.pckl (MD5: 255684411c7ff4aea899fa0db294869e)
chembl20Conv.pckl (MD5: 4ddc2d1f49b379c63e518996fe1b7377)
chembl20Weave.pckl (MD5: e9394029b2a697d0757adff3fadfb6f0)
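
The MD5 checksums listed above can be used to verify a download. The following is a minimal sketch using only the Python standard library; the function names `md5sum` and `verify` are illustrative and not part of the project's code.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks.

    Chunked reading keeps memory use low for the larger archives.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_md5):
    """Return True if the file at `path` has the expected MD5 checksum."""
    return md5sum(path) == expected_md5.strip().lower()
```

For example, `verify(path_to_sample_id_table, "df7e773de4ce0272d8ed8207c4ef6a6f")` should return `True` for an intact copy of `SampleIdTable.txt`, where `path_to_sample_id_table` stands for wherever the file was saved locally.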

Contact: Andreas Mayr (mayr@bioinf.jku.at)
