Large-scale comparison of machine learning methods for drug target prediction on ChEMBL



Directory structure

genDirStructure.cpp creates the following directory structure:
SampleIdTable.txt: sample names corresponding to the binary data stored in the subdirectories
chemFeatures
chemFeatures/cl: the clustering file used, cl1.info, is stored here; further results from clusterMinFull.cpp are stored in a subdirectory clusterMinFull
chemFeatures/d: one subdirectory for each dense, real-valued data matrix (csv)
chemFeatures/s: one subdirectory for each sparse matrix (fpf)
train: the file tocompute.info, which describes the targets (assays) to consider, and the file train.info, which describes compound-assay relations (duplicate entries may exist), are in this directory
run: results from the C/C++ pipeline are stored in subdirectories here

In addition, the Python pipeline assumes the following directories:
dataPython: all data stored in Python format
dataPythonReduced: only the compounds considered, in Python format (reduces main memory consumption)
resPython: deep learning results, stored in subdirectories
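
As a rough illustration of the layout listed above, the following shell sketch creates the same directory names by hand. It is not a replacement for genDirStructure.cpp, whose actual behavior may differ; the variable LSC_ROOT and its default value ./lscData are assumptions made only for this example.

```shell
#!/bin/sh
# Hypothetical root directory (an assumption); override by exporting LSC_ROOT.
LSC_ROOT="${LSC_ROOT:-./lscData}"

# Directories listed for the C/C++ pipeline.
for d in chemFeatures/cl/clusterMinFull chemFeatures/d chemFeatures/s train run; do
    mkdir -p "$LSC_ROOT/$d"
done

# Directories the Python pipeline assumes.
for d in dataPython dataPythonReduced resPython; do
    mkdir -p "$LSC_ROOT/$d"
done

# Show the resulting directory tree.
find "$LSC_ROOT" -type d | sort
```

SampleIdTable.txt and the files in train are data files rather than directories, so the sketch does not create them.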



You can download the data listed below with the following commands:
wget https://raw.githubusercontent.com/ml-jku/lsc/master/download.sh
chmod u+x download.sh
./download.sh ~/jkuLSCData
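
After the download finishes, it can be useful to check that the expected entries are present. The sketch below assumes that download.sh places the data directly under the target directory using the names from the directory structure described above; this has not been verified against the script, and the function name check_layout is hypothetical.

```shell
#!/bin/sh
# check_layout ROOT: print each expected entry that is missing under ROOT.
# Returns 0 if every entry is present, 1 otherwise.
# The list of entries is an assumption taken from the directory description.
check_layout() {
    root="$1"
    missing=0
    for entry in SampleIdTable.txt chemFeatures/cl chemFeatures/d \
                 chemFeatures/s train dataPython dataPythonReduced; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    return "$missing"
}
```

For the download command above, the function would be called as: check_layout ~/jkuLSCData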


Data SampleIdTable.txt

Data chemFeatures/cl

Data chemFeatures/d

Data chemFeatures/s

Data train

Data dataPython

Data dataPythonReduced



Contact: Andreas Mayr (mayr@bioinf.jku.at)