============ SED package ==============

Predicting Bioactivities of Ligand Molecules Targeting G Protein-coupled Receptors by Merging Sparse Screening of Extended Connectivity Fingerprints and Deep Neural Nets

Environment Configuration:

1. Configure the "RDKit" environment variables.
   You need to install Python 2.7 first ("Anaconda" is convenient):
   https://www.anaconda.com/download/
   a) Put "RDKit_2016_03_1" on drive C.
   b) Create a new system environment variable named "RDBASE" with the value "C:\RDKit_2016_03_1".
   c) Create a new system environment variable named "PYTHONPATH" with the value "%RDBASE%".
   d) Add "%RDBASE%\lib" to PATH.
   You can test whether your RDKit works by typing "from rdkit import Chem" in your Python console. You can then generate ECFPs on your computer.

If you want to use a GPU for the DNN, follow steps 2-4:

2. Install "VS2015".
   Make sure that when you installed Visual Studio you selected the C++ compiler; it is not installed by default. If you didn't, re-run the Visual Studio installer and select the C++ compiler.

3. Install "cudamat": https://github.com/cudamat/cudamat
   Download "cudamat" and compile and install it:
       python setup.py install
   Then move all files in the "cudamat" folder to your Python site-packages path; for example, mine is:
   C:\Users\deepn\AppData\Roaming\Python\Python27\site-packages

4. Install gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html

Run the Code:
- Run the MATLAB script "start_DPC" first.
- Run the MATLAB script "demo_new.m". You need to input the 'GPCR name', 'Length of ECFPs' and 'radius of ECFPs', for example:
  'P08908' 1024 6

Tips:
1) Never forget the quotation marks.
2) The ECFP number refers to the diameter, which is twice the radius: if you want ECFP12, the radius must be 6.
3) The DNN works best with a GPU. A CPU can run the code as well; in that case you may want to change the "epochs" parameter to a smaller value.
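For reference, here is a minimal Python sketch of generating one ECFP with RDKit, using the same settings as the example above (length 1024, radius 6, i.e. ECFP12). The SMILES string is a placeholder; substitute your own ligand structures.

    from rdkit import Chem
    from rdkit.Chem import AllChem

    # Parse a molecule from SMILES (ethanol here, purely as a placeholder).
    mol = Chem.MolFromSmiles("CCO")

    # Morgan/circular fingerprint with radius 6, folded to 1024 bits (ECFP12).
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 6, nBits=1024)
    bits = list(fp)  # 1024-entry 0/1 list, ready to be written out as a row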
============ DeepNeuralNet_QSAR Documentation ==============

Before running the code, you need to install cudamat and gnumpy.

Start a command-line window (in Windows) or a terminal (in Linux), and run the Python scripts:

python DeepNeuralNetTrain.py --seed=0 --hid=4000 --hid=2000 --hid=1000 --hid=1000 --dropouts=0_0.25_0.25_0.25_0.1 --epochs=200 --data=Q9H228 models/Q9H228

python DeepNeuralNetPredict.py --seed=0 --label=1 --rep=10 --data=Q9H228 --model=models/Q9H228 --result=predictions/Q9H228

============ Detail info of DNN ============

System requirements: Python 2.7+

Required Python Modules:
- Python modules installed by default: sys, os, argparse, itertools, gzip, time
- General Python modules: numpy, scipy.sparse
- Special Python modules: gnumpy, plus cudamat (if using a GPU) or npmat (if using a multi-core CPU)
- CUDA toolkit: a prerequisite of the cudamat Python module

Installation of Special Python Modules:
* gnumpy: http://www.cs.toronto.edu/~tijmen/gnumpy.html
* npmat: http://www.cs.toronto.edu/~ilya/npmat.py
* cudamat: https://github.com/cudamat/cudamat

Note: The modules "gnumpy" and "npmat" are also provided in this distribution. If you do not have a GPU card, or have problems installing the cudamat module, the npmat.py module will use a multi-core CPU to simulate the GPU computing.

Create a directory for this DeepNeuralNet_QSAR module, and keep all the Python scripts in that directory.

Usage: Start a command-line window (in Windows) or a terminal (in Linux), and run the Python scripts. Please refer to the details below.
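A quick way to verify that the special modules work is the smoke test below (a minimal sketch; when cudamat cannot be imported, gnumpy reports that it is falling back to npmat and runs on the CPU):

    import gnumpy as gnp

    # A tiny matrix product, computed on the GPU via cudamat when available,
    # or on a multi-core CPU via npmat in simulation mode.
    a = gnp.garray([[1., 2.], [3., 4.]])
    b = gnp.dot(a, a.T)
    print(b.as_numpy_array())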
============ Brief explanation of all python files ============

All the files are listed in alphabetical order, not ordered by importance. Please find more detailed comments on all individual functions inside each Python file.

[activationFunctions.py] Defines several classes of common activation functions, such as ReLU/Linear/Sigmoid, along with their derivatives or error functions (if used for the output layer). Used by [dnn.py].

[counter.py] Uses sys.stderr to produce a progress bar for each training epoch. Includes several different classes of progress bar, but only "Progress" and "DummyProgBar" are used. Used by [dnn.py].

[DeepNeuralNetPredict.py] Makes predictions for new compound structures with a single-task/multi-task DNN trained by DeepNeuralNetTrain.py or DeepNeuralNetTrain_dense.py.

[DeepNeuralNetTrain.py] Trains a multi-task/single-task DNN on sparse QSAR dataset(s); accepts raw csv datasets or processed npz datasets.

[DeepNeuralNetTrain_dense.py] Trains a multi-task DNN on dense QSAR dataset(s); accepts raw csv datasets or processed npz datasets.

[dnn.py] Key components of a simple feed-forward neural network. Used by [DeepNeuralNetTrain.py], [DeepNeuralNetPredict.py], [DeepNeuralNetTrain_multi.py] and [DeepNeuralNetPredict_multi.py].

[DNNSharedFunc.py] A group of helper functions, such as calculating R-squared and writing predictions to file. Used by many other files in the package.

[gnumpy.py] A simple Python module for GPU computing, the "GPU version" of the numpy module.

[npmat.py] A simple Python module required by gnumpy.py for the simulation mode. If cudamat fails to import, npmat (CPU computing) is used instead.

[processData_sparse.py], [processData_dense.py] Pre-process a group of raw csv QSAR datasets (either sparse or dense) into a sparse-matrix Python file format (saved as *.npz) to facilitate later use. Contain many data-manipulation functions used by other files in the package.

============ How to use - Example scripts ============

Prepare input datasets

[sparse datasets] Arrange all the datasets as in the examples in the "data_sparse" folder.

Example #1 (a subset of three tasks from the 15 Kaggle datasets):
Folder name: data_sparse
Contains several datasets, each with a training set and a test set:
  METAB_training.csv  METAB_test.csv
  OX1_training.csv    OX1_test.csv
  TDI_training.csv    TDI_test.csv

Example #2 (a single task selected from the Kaggle datasets):
Folder name: data_sparse_single
Contains one pair of training set and test set:
  METAB_training.csv  METAB_test.csv

[dense datasets] Arrange all the datasets as in the example in the "data_dense_raw" folder.

Example (a subsample from the CYP datasets, which has 3 tasks):
Folder name: data_dense
Contains two datasets, one training set and one test set:
  training.csv  test.csv

Pre-process data (optional; this step can be skipped)

Preprocess sparse-format datasets. Create a new folder "data_sparse_processed" under the working directory to hold the processed data:

python processData_sparse.py data_sparse data_sparse_processed

Preprocess dense-format datasets. Create a new folder "data_dense_processed" under the working directory to hold the processed data. You also need to tell the script how many tasks there are in the dense dataset, e.g. "3" for the example datasets:

python processData_dense.py data_dense data_dense_processed 3

Train a single-task DNN for one QSAR task

Defaults: the transformation of the inputs is log, the activation function is ReLU, the minibatch size is 128, ...

The key parameters that need to be specified by the user:
  seed: random seed for the program. Optional, but better to set it for reproducibility.
  CV: (optional) proportion of the training set that is randomly sampled as a cross-validation subset.
  test: (optional) whether to use the corresponding external test set for checking performance on the test set during training.
  hid: DNN structure; specifies the number of nodes at each layer.
  dropouts: the dropout probability for each layer, to prevent over-fitting.
  epochs: number of epochs for training.
  data: path to the folder containing the data of a single QSAR task; it may contain a raw csv file or a processed npz file.
  the last argument: where you want to save the trained model; if the folder doesn't exist, it will be created automatically.

Example: use raw .csv data to train a single-task DNN for METAB; each corresponding processed .npz file will automatically be saved to the input data path.

python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single

Example: use processed .npz data to train a single-task DNN for METAB (recommended; loading is faster than from raw data). The parameters are the same as above; the processed datasets in the folder "data_sparse_single" were created in the previous step.

python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single

Example: without the optional 'CV' and 'test' arguments.

python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse_single models/METAB_single

Prediction with a single-task DNN

The key parameters that need to be specified by the user:
  model: path to the previously trained model folder, e.g. "models/METAB_single" from the training step above.
  data: path to the folder containing the data of a single QSAR task; it may contain a raw csv file or a processed npz file.
  label: whether the "test" dataset has true labels. Default is 0, but in this example it has true labels.
  rep: (optional) number of dropout prediction rounds. Default is 0, meaning no dropout prediction is performed.
  seed: random seed for the program, useful for dropout prediction. Optional, but better to set it for reproducibility.
  result: (optional) where to save the prediction results. Default is the model folder.

Example: use the previously trained single-task DNN model for METAB to make predictions on its test data.

python DeepNeuralNetPredict.py --seed=0 --label=1 --rep=10 --data=data_sparse_single --model=models/METAB_single --result=predictions/METAB_single

Example: without the optional 'rep' and 'result' arguments:

python DeepNeuralNetPredict.py --label=1 --data=data_sparse_single --model=models/METAB_single
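With --label=1 the test set carries true labels, so the quality of the predictions can be checked. [DNNSharedFunc.py] contains the package's own R-squared calculation; purely as a point of reference, a common choice in QSAR work is the squared Pearson correlation, sketched here with numpy (an illustration, not necessarily the package's exact formula):

    import numpy as np

    def r_squared(y_true, y_pred):
        # Squared Pearson correlation between observed and predicted values.
        r = np.corrcoef(y_true, y_pred)[0, 1]
        return r * r

    # Toy usage with made-up activity values:
    print(r_squared(np.array([5.1, 6.2, 7.3, 6.8]),
                    np.array([5.0, 6.0, 7.5, 6.6])))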
Train a multi-task DNN for the sparse datasets

This requires the processed datasets, not the raw datasets.

Parameters that differ from the single-task DNN:
  data: path to the data folder that stores all the QSAR datasets.
  (The following are optional)
  mbsz: the minibatch size. Default is 20, but for multi-task it may need to be modified to achieve better results.
  keep: the datasets to keep in the model, if you don't want to include all datasets in the 'data' folder.
  watch: when an internal cross-validation set or external test set is used, chooses the task whose MSE and R-squared are monitored.
  reducelearnRateVis: sometimes reducing the learning rate of the first layer helps the training process converge better.

Example: a multi-task DNN modelling all three sparse datasets: METAB, OX1, TDI.

python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=5 --data=data_sparse models/multi_sparse_1

Example: load the previously trained model and continue the training process for more epochs.

python DeepNeuralNetTrain.py --seed=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse --loadModel=models/multi_sparse_1 models/multi_sparse_continue

Example: with more optional parameters; keep only the METAB and OX1 tasks and monitor the performance of the OX1 task.

python DeepNeuralNetTrain.py --seed=0 --CV=0.4 --test --mbsz=30 --keep=METAB --keep=OX1 --watch=OX1 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_sparse models/multi_sparse_2

Prediction with a multi-task DNN for the sparse datasets

The parameter settings are the same as for the single-task DNN; see "Prediction with a single-task DNN" above. The only difference:
  data: path to the data folder that stores all the processed datasets (including the test datasets).

Example: predictions for all three sparse datasets with the model trained in the previous step, saving the results to the model folder:

python DeepNeuralNetPredict.py --label=1 --data=data_sparse --model=models/multi_sparse_1

Example: prediction with the METAB/OX1 model trained in the previous step, with dropout prediction, saving the results to another folder.

python DeepNeuralNetPredict.py --label=1 --seed=0 --rep=10 --data=data_sparse --model=models/multi_sparse_2 --result=predictions/multi_sparse_2

Train a multi-task DNN for the dense datasets

Most of the parameter settings are the same as for the multi-task DNN on sparse datasets. The difference: use integer indices for the 'keep' and 'watch' arguments.

The key parameter that needs to be specified by the user:
  numberOfOutputs: number of QSAR task output columns in the raw training set (.csv).

Example: keep only the first two output tasks and monitor the first output during the training process, with an internal cross-validation set and an external test set, using raw data.

python DeepNeuralNetTrain_dense.py --numberOfOutputs=3 --CV=0.4 --test --keep=0_1 --watch=0 --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense models/multi_dense_1

Example: without the optional arguments, using pre-processed data. Note: for processed data, there is no need to specify "--numberOfOutputs=3".

python DeepNeuralNetTrain_dense.py --hid=2000 --hid=1000 --dropouts=0_0.25_0.1 --epochs=10 --data=data_dense_processed models/multi_dense_2

Prediction with a multi-task DNN for the dense datasets

The parameter settings are the same as for prediction on the sparse datasets.

Example: prediction using the trained DNN from the previous step.

python DeepNeuralNetPredict.py --label=1 --dense --data=data_dense --model=models/multi_dense_1 --result=predictions/multi_dense_1
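Several of the prediction examples above pass --rep to request multiple dropout prediction rounds. How the package combines those rounds is defined in its own code; purely as an illustration of the usual idea, the sketch below averages hypothetical per-round predictions and treats their spread as a rough uncertainty cue (all numbers are made up):

    import numpy as np

    rng = np.random.RandomState(0)
    # Hypothetical stand-in for 10 dropout rounds over 5 compounds; in
    # practice these values would come from DeepNeuralNetPredict.py --rep=10.
    preds = rng.normal(loc=6.5, scale=0.1, size=(10, 5))

    mean_pred = preds.mean(axis=0)  # combined prediction per compound
    std_pred = preds.std(axis=0)    # variability across dropout rounds
    print(mean_pred)
    print(std_pred)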