Predicting drug–disease associations by network embedding and biomedical data integration

Xiaomei Wei (College of Informatics, Huazhong Agricultural University, Wuhan, China)
Yaliang Zhang (College of Informatics, Huazhong Agricultural University, Wuhan, China)
Yu Huang (College of Informatics, Huazhong Agricultural University, Wuhan, China)
Yaping Fang (College of Informatics, Huazhong Agricultural University, Wuhan, China)

Data Technologies and Applications

ISSN: 2514-9288

Publication date: 1 April 2019



The traditional drug development process is costly, time consuming and risky. Using computational methods to discover drug repositioning opportunities is a promising and efficient strategy in the era of big data. The explosive growth of large-scale genomic, phenotypic data and all kinds of “omics” data brings opportunities for developing new computational drug repositioning methods based on big data. The paper aims to discuss this issue.


Here, a new computational strategy is proposed for inferring drug–disease associations from rich biomedical resources toward drug repositioning. First, the network embedding (NE) algorithm is adopted to learn the latent feature representation of drugs from multiple biomedical resources. Furthermore, on the basis of the latent vectors of drugs from the NE module, a binary support vector machine classifier is trained to divide unknown drug–disease pairs into positive and negative instances. Finally, this model is validated on a well-established drug–disease association data set with tenfold cross-validation.


This model obtains the performance of an area under the receiver operating characteristic curve of 90.3 percent, which is comparable to those of similar systems. The authors also analyze the performance of the model and validate its effect on predicting the new indications of old drugs.


This study shows that the authors’ method is predictive, identifying novel drug–disease interactions for drug discovery. The new feature learning methods also positively contribute to the heterogeneous data integration.



Wei, X., Zhang, Y., Huang, Y. and Fang, Y. (2019), "Predicting drug–disease associations by network embedding and biomedical data integration", Data Technologies and Applications, Vol. 53 No. 2, pp. 217-229.



Emerald Publishing Limited

Copyright © 2019, Emerald Publishing Limited

1. Introduction

Traditional drug development is highly resource-intensive, expensive, and prone to failure. The de novo drug discovery process requires the investment of billions of dollars and an average of about 9–12 years to bring an experimental drug to the market, and failures are common across all drug development pipelines (Dickson and Gagnon, 2004). Over the past decade, with the explosive growth of large-scale genomic and phenotypic data, as well as the improvement of systematic approaches, computational solutions have proven reasonable and feasible for discovering new indications of old drugs in the age of big data.

Using computational approaches combined with biomedical data resources to infer novel indications for existing drugs offers great advantages in speeding up drug development with decreased risk (Nagaraj et al., 2018; Khatoon and Govardhan, 2018). Drug repositioning can reduce the lag of drug discovery and development time from 10–17 years to potentially 3–12 years (Hurle et al., 2013). In addition, repurposed drugs accounted for 20 percent of the 84 drug products introduced to the market in 2013 (Graul et al., 2014). Many failed drugs and existing drugs have been investigated and successfully approved for new indications. Drug repositioning is playing an increasingly important role in the drug development and precision medicine paradigm (Shameer et al., 2015) on account of the widespread attention from pharmaceutical companies, government agencies and academic institutes. Many computational approaches have been introduced to integrate heterogeneous data sources to predict new drug indications. The biomedical data sources involved in these methods usually contain chemical structures, protein targets or phenotypic information (e.g. side-effect profiles and gene expression profiles) (Yang and Agarwal, 2011; Sirota et al., 2011).

Based on the various data resources, data mining and machine learning methods have been applied to study the underlying systems for predicting novel associations between drugs and diseases. Most computational strategies formulate the prediction problem as a binary classification task, which aims to predict whether a drug–disease association is present or not (Wang, Guo and Wang, 2017). Feature selection is a key issue of the machine learning process; substantial effort has been devoted to extracting various features to build classification models. The availability of large-scale biological interaction networks provides an opportunity to learn the topological features of the nodes that are predictive of unknown drug–disease associations. Intuitively, the drugs/diseases with similar topological properties in the network are likely to be functionally correlated.

Recently, several techniques for generating features from graphs/networks have been proposed (Tang and Liu, 2011; Henderson et al., 2011). For example, network embedding (NE)/network representation learning (NRL) is an excellent technique for learning the latent features from natural networks. NE was originally used to vectorize the vertices of real-world social networks and social media (e.g. Wikipedia). Because biomedical interaction networks share the same node-and-edge structure, the NE algorithm can be transferred to them.

2. Related works

To date, many computational approaches have been proposed to infer novel statistical associations between drugs and diseases. Of these approaches, some effective methods integrate biomedical resources into computational models for predicting drug–disease associations (Gottlieb et al., 2011; Wang et al., 2013). The approach PREDICT (Gottlieb et al., 2011) heuristically adopted multiple drug–drug and disease–disease similarity measures to obtain values for manually defined features and then learned a logistic regression classifier to extract drug–disease associations automatically. The system PreDR (Wang et al., 2013) integrated biomedical resources to extract the features of drugs and defined a kernel function to correlate drugs with diseases. Then, this system trained a support vector machine (SVM) classifier based on the gold-standard training set used in PREDICT to predict novel drug–disease interactions. The system SCMFDD adopted the method of constrained matrix factorization to uncover the latent drug–disease associations with the help of constraints from drug feature-based similarity and disease semantic similarity (Zhang et al., 2018). However, the aforementioned methods focus on pre-defined features that need much manual work or computation before the best-performing features can be selected (Gottlieb et al., 2011; Wang et al., 2013; Wei et al., 2015). Furthermore, the computational methods often require complex network integration methods to exploit biomedical resources (Pletscher-Frankild et al., 2015; Wei, 2018). Therefore, much room exists for improving the system performance by simplifying data integration methods and enhancing the accuracy of prediction.

NE is a promising technique for learning the topology features of nodes in a network. In the real world, the vertices in a network are connected to one another and hence usually have rich information, especially structural information. Once the vector representation of the vertices is learned, network mining tasks, such as node classification (Sen et al., 2008), node clustering (Wang, Cui, Wang, Pei, Zhu and Yang, 2017), link prediction (Zhang et al., 2014; Ou et al., 2016) and social recommendation (Krakan et al., 2018), can be readily solved by various machine learning algorithms. In light of this promising way of representing vertices, many NE models have been proposed, such as DeepWalk (Perozzi et al., 2014), node2vec (Grover and Leskovec, 2016) and LINE (Tang et al., 2015). In most biomedical databases, the relationships among entities are stored in 2D tables and can be treated as naturally formed networks/graphs. Accordingly, the data can be fed directly into the NE module to obtain the latent features of biomedical entities, which can be used in further computation.

Here we present an approach that employs the NE method to learn the latent feature representations of drugs from biomedical resources. On the basis of these features, we use the SVM model to train a binary classifier to divide the drug–disease pairs into positives and negatives. The tenfold cross-validation on the gold-standard data shows that our method obtains an area under the receiver operating characteristic (ROC) curve (AUC) of 90.3 percent, which is comparable to those of other outstanding works. Our main contribution here is to introduce the method of NE into drug–disease association inference. In our strategy, the computation of all kinds of similarity measures and the laborious step of feature selection are avoided. Our method is also effective for predicting the new indications of old drugs.

The remainder of this paper is organized as follows. Section 3 introduces the data sets and describes our framework for predicting drug–disease associations. Section 4 presents the experimental details and analyzes the results we obtained. Conclusions are summarized in Section 5.

3. Methodology

In this section, we present a novel framework to predict drug–disease interactions by classifying the drug–disease pairs into positives and negatives. To learn the classifier, we use the latent features of drugs from the NE module and heterogeneous biomedical sources. The system workflow is as follows; the framework is shown in Figure 1:

  • collecting and preprocessing the drug/disease data from the biomedical database;

  • learning the latent features of drugs through the NE algorithm;

  • using the SVM classifier to divide the drug–disease pairs into positives and negatives; and

  • analyzing the experimental results and exploring the new drug indications.

3.1 Data sets

In recent computational repositioning strategies, many studies have indicated that chemical structures, target proteins and side effects can provide rich information for assessing the similarities among drugs, diseases and drug–disease associations (Gottlieb et al., 2011; Fernandez-Alvarez et al., 2018). These approaches usually employ the characteristics among biomedical resources, such as DrugBank, SIDER, and Online Mendelian Inheritance in Man (OMIM), combined with all kinds of similarity measures to obtain the features for machine learning models. To fully exploit the biological prior knowledge and known network topology information, our experiment uses the NE model to learn the latent representations from biomedical data, including chemical structures, side effects, target proteins/genes and drug indications.

3.1.1 Drug chemical structure

The drug chemical structure has been proven to be highly effective for characterizing a particular drug (Dudley et al., 2011; Zhang et al., 2013; Yang et al., 2014). Drugs with similar chemical structures are believed to have common therapeutic properties and may thus treat common diseases (Wang et al., 2013). In our experiment, we download the FDA-approved drugs and their canonical simplified molecular-input line-entry system (SMILES) strings (Weininger, 1988) from the DrugBank database (Law et al., 2013). To make the drugs’ chemical structures computable, a given drug is usually represented as a molecular fingerprint defined in PubChem. PubChem (Bolton et al., 2008) is a public repository for information on chemical substances and their biological activities. A total of 881 chemical substructures are defined in the PubChem database. The fingerprint is an 881-dimensional binary vector that uses 1 or 0 to represent the presence or absence of the corresponding chemical substructure in the given drug, so it formally represents the structure of that drug. We obtain the molecular fingerprints of drugs by using the PaDEL-Descriptor software (Yap, 2011), which is based on the Chemistry Development Kit (Steinbeck et al., 2006). We have 2,137 FDA-approved drugs with canonical SMILES in total; accordingly, a 2,137×881 adjacency matrix is obtained to show the relationships between the drugs and chemical substructures. This adjacency matrix is taken as the input of the NRL module to obtain the feature representations on chemical structures.
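The step from a binary fingerprint matrix to a drug–substructure network can be sketched as follows. This is a minimal illustration on toy data, not the authors' code: the function name `fingerprints_to_edges`, the `sub_` prefix and the drug IDs are hypothetical, and the real matrix would be 2,137×881.

```python
import numpy as np

def fingerprints_to_edges(fingerprints, drug_ids):
    """Turn a binary drug x substructure matrix into bipartite edges."""
    fp = np.asarray(fingerprints)
    rows, cols = np.nonzero(fp)  # positions where a substructure is present
    # Prefix substructure indices so they cannot collide with drug IDs.
    return [(drug_ids[r], f"sub_{c}") for r, c in zip(rows, cols)]

# Toy data: 3 hypothetical drugs and 5 substructure bits.
toy_fp = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
])
toy_ids = ["drug_A", "drug_B", "drug_C"]

edges = fingerprints_to_edges(toy_fp, toy_ids)
for drug, substructure in edges:
    print(drug, substructure)
```

Each 1 in the matrix becomes one edge, so the edge list carries the same information as the matrix in a form that a graph-based embedding tool could read.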

3.1.2 Drug–target data

Drug–target interaction data are also effective in drug repositioning systems. Our drug–target interaction data are from DrugBank. There are 5,878 drugs, 3,757 proteins/genes and 19,906 drug–target interaction pairs in total. We use all the pairs to learn the feature representations of drugs on the drug–target interaction.

3.1.3 Drug side effect and indication

The drug side effect data are downloaded from the SIDER database (Kuhn et al., 2015), version SIDER4.1, released on October 21, 2015. SIDER contains information on marketed medicines and their recorded adverse drug reactions. From the SIDER database, we obtain the data for 1,430 marketed drugs, the corresponding 5,868 recorded adverse drug reactions and 139,756 drug–side effect pairs. The drugs are identified by STITCH compound IDs that are derived from PubChem compound identifiers. Thus, the drugs in SIDER can be mapped to the DrugBank database based on the STITCH compound ID. All the 139,756 adverse drug reaction pairs are fed into the NRL model to learn the latent representations of the corresponding drugs on their side effects.

We also perform the same process on the drug indication data in the SIDER database to obtain the embedding features. However, we remove the drug–indication pairs that overlap with the 1,933 gold-standard associations.
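The overlap removal described above amounts to a set difference, presumably so that the indication embeddings do not already contain the labels to be predicted. A minimal sketch, with hypothetical pair identifiers and a hypothetical helper name `drop_gold_overlap`:

```python
def drop_gold_overlap(indication_pairs, gold_pairs):
    """Keep only the indication pairs that are absent from the gold standard."""
    gold = set(gold_pairs)
    return [pair for pair in indication_pairs if pair not in gold]

# Toy (drug, disease) pairs; the identifiers are made up for illustration.
sider_pairs = [
    ("drug_A", "disease_1"),
    ("drug_A", "disease_2"),
    ("drug_B", "disease_1"),
]
gold_pairs = [("drug_A", "disease_2")]

kept = drop_gold_overlap(sider_pairs, gold_pairs)
print(kept)
```

In practice the two sources would first need a shared identifier scheme for drugs and diseases, which this sketch assumes is already in place.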

3.1.4 Online Mendelian Inheritance in Man

OMIM is a comprehensive, authoritative compendium of human genes and genetic disorders (Hamosh et al., 2005). OMIM phenotypes are the standard data set in the extraction and evaluation of the associations related to genetic disease associations (Karni et al., 2009; Vanunu et al., 2010; Singh-Blom et al., 2013). Furthermore, OMIM contains various textual information types, including patient symptoms and signs (Hoehndorf et al., 2015), which are available for achieving phenotypic semantic similarity measures. Van Driel et al. (2006) constructed the data set mimMiner by extracting disease–phenotype associations from OMIM through text mining and calculating pairwise disease similarities. In our experiment, we utilize mimMiner data as our disease features directly. The phenotype similarity data can be downloaded from the web address

3.2 NE algorithm

The NE model was originally proposed to solve problems in recommendation systems for social media. Usually, the source networks include social networks, biological networks and information networks (Cui et al., 2018; Castillo-Zúñiga et al., 2016). In our work, the entities (such as drugs and diseases) and their associations come from public relational databases that can be viewed as networks. The entities are treated as the vertices/nodes and the associations as the edges of an unweighted network. The adjacency matrix of a network is a matrix in which the rows and columns represent the vertices/nodes, and each entry is 1 or 0, indicating whether the two nodes are adjacent. The NE algorithm is used to learn low-dimensional representations of nodes in the network (shown in Figure 2); these representations can capture network properties, including the topological and structural characteristics of a node. The learned latent representation lies in a continuous vector space, so the relationships between nodes can be measured by computing the distances between their vectors (Cui et al., 2018).
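As a small illustration of the two ideas in this paragraph, the sketch below builds the 0/1 adjacency matrix of a toy unweighted network and compares two node vectors. Cosine similarity is only one common way to compare vectors and is an assumption here; the node names and helper functions are hypothetical.

```python
import numpy as np

def adjacency_matrix(edges, nodes):
    """Build the symmetric 0/1 adjacency matrix of an unweighted network."""
    index = {node: i for i, node in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)), dtype=int)
    for u, v in edges:
        A[index[u], index[v]] = 1
        A[index[v], index[u]] = 1  # undirected edge
    return A

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

nodes = ["drug_A", "drug_B", "target_1"]
edges = [("drug_A", "target_1"), ("drug_B", "target_1")]

A = adjacency_matrix(edges, nodes)
# Both drugs are linked to the same target, so their rows are identical.
sim = cosine_similarity(A[0], A[1])
print(A)
print(sim)
```

Here the raw adjacency rows stand in for node vectors; with learned embeddings the same comparison would be applied to the low-dimensional vectors instead.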

Here we introduce three typical network-embedding models into our experiment. DeepWalk is the first NE learning method; it is inspired by SkipGram (Mikolov et al., 2013), the widely used word representation learning model in Natural Language Processing. However, DeepWalk employs node sequences sampled by a random walk algorithm as the input of SkipGram, whereas the word embedding method takes sentence sequences as input. node2vec holds that DeepWalk ignores the diversity of connectivity patterns in a network. Based on the diverse neighborhoods of nodes in a network, node2vec designed a biased random walk procedure to sample the neighborhood nodes, and demonstrated that combining breadth-first sampling and depth-first sampling can learn an enriched representation. LINE (Tang et al., 2015) has been proposed for large-scale NE and is suitable for any type of information network. Based on the notion that two nodes sharing similar neighbors should have close vector representations despite the lack of a direct link in the network, LINE used the second-order proximity, instead of the commonly used first-order proximity, to sample nodes.
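The sampling idea behind DeepWalk can be sketched as uniform random walks over the network; each walk is then treated like a sentence and passed to a SkipGram model (not shown). This is a simplified illustration under assumed parameter names (`walks_per_node`, `walk_length`) and toy data, not the OpenNE implementation. node2vec would replace the uniform choice with a biased one.

```python
import random
from collections import defaultdict

def build_neighbors(edges):
    """Map each node to the list of its neighbors (undirected network)."""
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)
    return nbrs

def random_walks(edges, walks_per_node=2, walk_length=5, seed=0):
    """Sample fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)
    nbrs = build_neighbors(edges)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(nbrs)
        rng.shuffle(nodes)  # visit start nodes in random order
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                # Step to a uniformly chosen neighbor of the current node.
                walk.append(rng.choice(nbrs[walk[-1]]))
            walks.append(walk)
    return walks

toy_edges = [
    ("drug_A", "target_1"),
    ("drug_B", "target_1"),
    ("drug_B", "target_2"),
]
walks = random_walks(toy_edges)
for walk in walks[:3]:
    print(" -> ".join(walk))
```

Nodes that often co-occur within a walk end up with similar vectors after SkipGram training, which is how the walks carry topology into the embedding.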

3.3 From network to latent vector representations

The network can be stored as an adjacency matrix, which is the traditional network representation. However, this kind of representation is high-dimensional and sparse and potentially contains noise or redundant information. The embedding procedure can reduce the noise and redundant information and preserve the intrinsic information to obtain dense and continuous representations of nodes in a low-dimensional space (Cui et al., 2018).

Most of the data in biomedical databases are stored in relational tables that can be treated as networks. From the network structure, the NE model can learn the embedding representations of nodes, which are useful for further statistical computation. For example, in Figure 3, the relationships between drugs and their chemical substructures are represented as an adjacency matrix shown in the upper table, in which 1 represents the presence of a certain chemical substructure in a given drug and 0 represents its absence. A total of 881 chemical substructures are defined in PubChem, so each drug can be formalized as an 881-dimensional binary vector. The lower table in Figure 3 shows the learned embedding representation of drugs in a 128-dimensional space, which is denser than the original form. Here, 128 is the parameter assigned in our experiment for the dimension of the latent representations. Because the embedding representation captures the intrinsic information of the network and encodes node relations in a low-dimensional continuous vector space, this representation is easily exploited by further statistical models.
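To make the change of shape concrete, the sketch below maps a sparse binary matrix to a dense low-dimensional one. Truncated SVD is swapped in here purely as a stand-in to show the sparse-to-dense transformation; it is not the NE method used in the paper, and the matrix sizes and function name are toy assumptions.

```python
import numpy as np

def dense_embedding(binary_matrix, dim):
    """Project rows of a binary matrix onto their top `dim` singular directions."""
    X = np.asarray(binary_matrix, dtype=float)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :dim] * S[:dim]  # one dense row vector per original row

gen = np.random.default_rng(0)
# Sparse 20 x 50 binary matrix standing in for the drug x substructure table.
toy = (gen.random((20, 50)) < 0.1).astype(int)

emb = dense_embedding(toy, dim=8)
print(toy.shape, "->", emb.shape)
```

The paper's setting would correspond to 881 input columns and 128 output dimensions, with the embedding learned by DeepWalk, node2vec or LINE rather than by SVD.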

3.4 Classification

After the NRL module has learned drug representations from the biomedical resources, we concatenate all kinds of latent representations to construct the feature vectors of the drugs. We then concatenate the drug features with the disease features from mimMiner. With the combined features, the drug–disease association prediction task is treated as a binary classification problem. We employ the SVM model to train the classifier for the prediction task. The training set is from PREDICT (Gottlieb et al., 2011), which includes drug–disease associations among 593 drugs from the DrugBank database and 313 diseases from the OMIM database. In the training set, the 1,933 true drug–disease associations originate from data integration and manual work; the negative pairs are randomly selected from all other possible associations between the 593 drugs and 313 diseases (excluding the 1,933 positives), at twice the number of positive pairs. Similarly, our experiment considers the known drug–disease pairs as the gold-standard positives and the others as negatives.
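The construction of the training set can be sketched in two steps: sample twice as many unknown pairs as known positives, then concatenate the drug and disease vectors of each pair. The vectors, identifiers and helper names below are hypothetical toy values; in the paper the drug vectors come from the NE module and the disease vectors from mimMiner.

```python
import random
import numpy as np

def sample_negatives(drugs, diseases, positives, ratio=2, seed=0):
    """Randomly pick ratio * len(positives) pairs that are not known positives."""
    rng = random.Random(seed)
    known = set(positives)
    candidates = [(d, s) for d in drugs for s in diseases if (d, s) not in known]
    return rng.sample(candidates, ratio * len(known))

def pair_features(pair, drug_vecs, disease_vecs):
    """Concatenate the drug vector and the disease vector of one pair."""
    drug, disease = pair
    return np.concatenate([drug_vecs[drug], disease_vecs[disease]])

# Toy stand-ins: 4-dim drug vectors and 3-dim disease vectors.
drugs = ["drug_A", "drug_B", "drug_C"]
diseases = ["dis_1", "dis_2", "dis_3"]
gen = np.random.default_rng(0)
drug_vecs = {d: gen.random(4) for d in drugs}
disease_vecs = {s: gen.random(3) for s in diseases}

positives = [("drug_A", "dis_1"), ("drug_B", "dis_3")]
negatives = sample_negatives(drugs, diseases, positives)

pairs = positives + negatives
X = np.vstack([pair_features(p, drug_vecs, disease_vecs) for p in pairs])
y = np.array([1] * len(positives) + [0] * len(negatives))
print(X.shape, y.tolist())
```

Note that the sampled negatives are unknown pairs rather than confirmed non-associations, which is why some of them can later surface as candidate new indications.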

4. Result analysis and discussion

4.1 Data sets and setting

We adopt the data set from the PREDICT system as the gold-standard data in our experiment. Multiple biomedical data sources, including DrugBank and SIDER, are employed to learn the embedding representations of drugs. Only the PREDICT entities for which embeddings are available can be used to train the classification model. Consequently, 971 of the 1,933 gold-standard drug–disease associations in PREDICT are preserved as the positives, plus a randomly generated set of negative drug–disease pairs that is twice as large as the positive set. For the classification task, the SVM model is employed to train the binary classifier to predict drug–disease associations. The hyperparameters C and γ of the SVM classifier are optimized through a grid search with tenfold cross-validation. The performance of our prediction is measured and visualized by the ROC curve (Gribskov and Robinson, 1996), which shows the trade-off between the true-positive rate (correctly predicted interactions) and the false-positive rate (wrongly predicted interactions). Additionally, other evaluation measures, including precision and F-measure, are used in our experiment.
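A grid search over C and γ with tenfold cross-validation could look like the following scikit-learn sketch. The RBF kernel is an assumption suggested by the γ parameter, the grid values are illustrative, and the data are synthetic with a 2:1 negative-to-positive ratio rather than the paper's features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in for the drug-disease pair features (2:1 class ratio).
X, y = make_classification(
    n_samples=150, n_features=10, weights=[2 / 3, 1 / 3], random_state=0
)

param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}  # illustrative grid
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Score each (C, gamma) pair by the mean AUC over the ten folds.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=cv)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Stratified folds keep the class ratio roughly constant in every fold, which matters when negatives outnumber positives.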

4.2 Performance of the method

4.2.1 Effects of data sources

The predictive ability of each data source matters for the system performance. In this section, we evaluate the effect of each single data source on inferring drug–disease associations. The data involved include chemical structure (CH), side effect (SE), target protein (TG) and indication (IND). They are used to characterize drugs by the latent representations learned from the NE model. Given the latent feature vectors of each data source, the SVM classifier with tenfold cross-validation is applied to our data set. The predicted results are shown as ROC curves in Figure 4. The figure reveals that the prediction performances based on CH, SE, TG and IND reach AUCs of 0.828, 0.867, 0.870 and 0.896, respectively. All the ROC curves lie above the diagonal, indicating that every data source contributes positively to the prediction performance.

4.2.2 Effects of NE models

In this section, we verify the effect of three typical NE models named DeepWalk, LINE and node2vec in prediction tasks. The three NE models are respectively used to learn the latent feature representations of drugs on the basis of the above-mentioned four data sets. Correspondingly, each NE model attains a group of four features that are concatenated to form the feature vectors of drugs. Based on the features, three SVM classifiers with tenfold cross-validation corresponding to the three NE models are trained to predict drug–disease associations.

The three NE models are implemented by the OpenNE software, an open source toolkit for NE. This software can be downloaded from In the implementation, we assess the effect of different dimensions on the results and find almost no difference. Thus, we choose the default parameter 128 as the representation vector dimension. The other parameters in the NE models are set as default.

We use the SVM model with tenfold cross-validation to train and test the classifiers based on the output features of the NE models. The optimized parameter pairs (C, γ) of the SVM classifiers are set as (100, 0.01) for DeepWalk, (10, 0.01) for node2vec and (10, 0.1) for LINE. The prediction results are shown in Figure 5. As presented, DeepWalk outperforms the other two models, although its margin over node2vec is moderate. By comparison, LINE shows the worst performance, which can be explained by its inability to reuse samples, whereas reusability is easily achieved through the random walk methods (Cui et al., 2018) employed by DeepWalk.

Additionally, we assess the results on other evaluation criteria, such as precision, F-measure and accuracy (shown in Table I). DeepWalk obtains an accuracy of 0.853, a precision of 0.778 and an F-measure of 0.779. The results confirm that DeepWalk is more suitable for our data than the other models. Given the evaluation results, we choose DeepWalk as our latent vector generator in further experiments. PreDR reported that its method of data integration achieved a maximum F-measure of 0.822 and a corresponding accuracy of 0.823. By comparison, we obtain an accuracy of 0.885 and a precision of 0.86 when the F-measure reaches its maximum of 0.82.
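For reference, accuracy, precision and F-measure can all be derived from the four confusion counts. The sketch below uses the standard definitions on made-up labels; the function name and the toy vectors are illustrative only.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Toy labels with a 2:1 negative-to-positive ratio.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With imbalanced classes, accuracy can stay high even when positives are missed, which is one reason to report precision and F-measure alongside it.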

4.2.3 Comparison with the benchmarks

In this section, we use the SVM classifier with tenfold cross-validation to predict drug–disease associations. We perform ten independent cross-validation runs and take the average score as the performance of our method. In each run, the training set is randomly divided into ten parts. The drug features are the concatenation of the latent representation vectors of CH, SE, TG and IND learned by DeepWalk. Then, the drug features are concatenated with the disease features from the mimMiner database to form the features of drug–disease pairs. Finally, we obtain a robust AUC estimate of 0.903.
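Ten independent tenfold runs with an averaged AUC can be sketched with scikit-learn's repeated stratified splitter. The data are synthetic and the RBF kernel is an assumption; (C, γ) = (100, 0.01) is the pair reported above for DeepWalk.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the concatenated drug-disease pair features.
X, y = make_classification(
    n_samples=150, n_features=10, weights=[2 / 3, 1 / 3], random_state=0
)

# Ten repeats of tenfold cross-validation, each with a fresh random split.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
clf = SVC(kernel="rbf", C=100, gamma=0.01)
scores = cross_val_score(clf, X, y, scoring="roc_auc", cv=cv)

mean_auc = scores.mean()
print(len(scores), round(mean_auc, 3))
```

Averaging over repeated random partitions reduces the dependence of the estimate on any single split of the data.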

In comparison, PREDICT obtains an AUC of 0.9 in predicting drug indications, and PreDR achieves an AUC of 0.902 with the composite features. Although our AUC of 0.903 is almost the same as those of the two benchmarks, our strategy achieves this competitive performance without laborious manual feature selection or extensive similarity measures. In contrast, the PREDICT strategy applied multiple similarity measures to calculate the drug–drug similarities and disease–disease similarities. The similarity scores were used as classification features. Furthermore, the authors adopted feature selection methods in the experiment and some additional data, such as GO annotations, to assess the genetic-based disease similarity. The other system, PreDR, used a kernel function and the SVM algorithm to train the classification model. Similar to PREDICT, PreDR involves multiple similarity measures, including the Jaccard score, the Smith-Waterman sequence alignment score (Weibull, 1951), and so on.

4.3 Novel predictions

Our method performs favorably in predicting known drug–disease associations, and we expect it to detect biomedically valuable predictions. In this section, our purpose is to predict new drug–disease associations. The training set consists of all 1,933 gold-standard drug–disease associations and a randomly generated set, twice as large, of drug–disease pairs that are not known to be associated (i.e. associations that do not appear in our drug indication gold standard). The test set comprises all the remaining negative drug–disease pairs (not in the training set). We employ the DeepWalk model as the vector generator and use the SVM model to train the classifier, which is then used to label the test set. Finally, 9,956 samples in the test set are labeled as positives with classification probabilities. All the predicted positive samples are potential new indications. Because the full result list is too long, it is given in the supplementary material. Here we only take the drug–disease candidate pairs whose probabilities are in the top 30 as study cases (shown in Table II).
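The labeling and ranking step can be sketched as follows: train an SVM that also outputs probabilities, label the unseen pairs, and sort the predicted positives by probability. The data are synthetic, the kernel and parameters are assumptions, and the top 5 are taken here instead of the top 30.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in: first 150 rows play the training set, the rest the test set.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# probability=True enables predict_proba (fitted internally by scikit-learn).
clf = SVC(kernel="rbf", C=100, gamma=0.01, probability=True, random_state=0)
clf.fit(X_train, y_train)

labels = clf.predict(X_test)
pos_col = list(clf.classes_).index(1)          # column of the positive class
proba = clf.predict_proba(X_test)[:, pos_col]

# Keep predicted positives and rank them by descending probability.
positive_idx = np.flatnonzero(labels == 1)
ranked = positive_idx[np.argsort(-proba[positive_idx])]
top = ranked[:5]
for i in top:
    print(int(i), round(float(proba[i]), 3))
```

Each index in `top` would correspond to one candidate drug–disease pair in the real test set.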

To test whether our predictions are in accordance with current experimental knowledge, we check the extent to which they appear in current clinical trials. Taking the drug–disease pair (“Risperidone,” “Obsessive-Compulsive Disorder; Ocd”) as an example, we find that it is detected as a new association by PREDICT with a confidence score of 0.933 and evaluated by our method with a score of 0.988. Furthermore, this pair appears in current clinical trials with the clinical trial identifier NCT00389493.

In another example, the drug–disease pair (“Indomethacin,” “Renal Failure, Progressive, With Hypertension”) is in our prediction list in Table II, with a confidence score of 0.983. This pair also appears in current clinical trials with the clinical trial identifier NCT00389493. However, the pair is not in the PREDICT result list.

We also validate the top 30 predicted drug–disease associations by analyzing their co-occurrence in publications in the PubMed database. Intuitively, if a drug and a disease occur together in the literature, they are likely to be associated. In the validation experiment, each drug–disease pair is used as keywords for a PubMed search. In practice, most searches return more than one result. To a certain extent, this supports our predicted new indications in another way. These encouraging instances further suggest that our method can successfully predict new drug indications.

5. Conclusions

In many biomedical studies, the predictive models are required to be built with information represented in different formats. In this paper, our main contribution is the method of entity representation and its application. We present a new scheme for inferring the novel drug indications of existing drugs. In the study, the NE models are used to learn the structural features of drugs from heterogeneous data resources which can combine different types of data into a vector space. Then the SVM classifier with tenfold cross-validation is applied to verify the performance of our scheme.

Our method attains high rates of specificity and sensitivity in cross-validation (AUC=0.903) that are competitive with those of existing methods. Furthermore, the experiments show that the novel drug indication predictions are supported by clinical trial cases, which reveals the valuable prospects of our method in drug repositioning. Unlike other state-of-the-art computational approaches, our method performs well without the aid of laborious feature selection algorithms. This scheme also allows the easy integration of heterogeneous biomedical resources on diseases and drugs. In summary, topology is the most notable characteristic of networks; thus, structure-preserving network methods and advanced feature learning methods exhibit great potential for future research and applications. In addition, the application of biomedical data sources plays a key role in inferring new drug indications. In the future, with the increasing accumulation of applicable biomedical resources and the improvement of data integration methods, our approach may add further value to drug indication predictions. We intend to study the compositional embeddings of drugs/diseases. Methodologically, we plan to introduce recommendation system models to predict drug–disease associations.


Figure 1. System framework

Figure 2. From network to latent vector representations

Figure 3. From the network to the low-dimensional representation

Figure 4. Effect of single data (CONC: data integration, CH: chemical structure, SE: side effect, TG: target protein, IND: indication)

Figure 5. Effect of NE models

Table I. Performance comparison for different NE models

| Models | AUC | Acc. | Prec. | F-measure |
|---|---|---|---|---|
| DeepWalk | 0.903 | 0.853 | 0.778 | 0.779 |
| node2vec | 0.895 | 0.837 | 0.760 | 0.753 |
| LINE | 0.870 | 0.815 | 0.713 | 0.727 |

Table II. Top 30 new indications predicted by our method

| Drug name | Disease name | Score | PREDICT |
|---|---|---|---|
| Cyproheptadine | Obsessive-Compulsive Disorder; Ocd | 0.996 | 0.809 |
| Cyproheptadine | Insensitivity To Pain With Hyperplastic Myelinopathy | 0.996 | 0.984 |
| Clonazepam | Obsessive-Compulsive Disorder; Ocd | 0.990 | |
| Prochlorperazine | Insensitivity To Pain With Hyperplastic Myelinopathy | 0.990 | 0.954 |
| Cyproheptadine | Sensory Ataxic Neuropathy, Dysarthria, And Ophthalmoparesis; Sando | 0.989 | 0.699 |
| Citalopram | Schizophrenia; Sczd | 0.989 | |
| Phenobarbital | Hyperphosphatemia, Polyuria, And Seizures | 0.988 | 0.783 |
| Risperidone | Choreoathetosis/Spasticity, Episodic; Cse | 0.988 | 0.611 |
| Promethazine | Insensitivity To Pain With Hyperplastic Myelinopathy | 0.988 | 0.951 |
| Risperidone | Obsessive-Compulsive Disorder; Ocd | 0.988 | 0.933 |
| Citalopram | Gilles De La Tourette Syndrome; Gts | 0.987 | 0.823 |
| Vincristine | Gastroesophageal Reflux | 0.986 | |
| Cyproheptadine | Hyperthermia, Cutaneous, With Headaches And Nausea | 0.985 | 0.632 |
| Citalopram | Tremor, Hereditary Essential, 1; Etm1 | 0.985 | |
| Citalopram | Choreoathetosis/Spasticity, Episodic; Cse | 0.985 | |
| Diphenhydramine | Insensitivity To Pain With Hyperplastic Myelinopathy | 0.984 | 0.953 |
| Guanidine | Insensitivity To Pain With Hyperplastic Myelinopathy | 0.984 | |
| Gabapentin | Obsessive-Compulsive Disorder; Ocd | 0.983 | |
| Indomethacin | Renal Failure, Progressive, With Hypertension | 0.983 | |
| Oxcarbazepine | Obsessive-Compulsive Disorder; Ocd | 0.982 | |
| Phenobarbital | Panic Disorder 1; Pand1 | 0.982 | 0.616 |
| Tretinoin | Hyperhidrosis Palmaris Et Plantaris | 0.982 | |
| Prochlorperazine | Obsessive-Compulsive Disorder; Ocd | 0.981 | 0.840 |
| Guanidine | Motor Neuropathy, Peripheral, With Dysautonomia | 0.981 | |
| Hydroxyurea | Gastroesophageal Reflux | 0.980 | |
| Hydrochlorothiazide | Osteolysis, Hereditary, Of Carpal Bones With Or Without Nephropathy | 0.980 | 0.656 |
| Clonazepam | Tremor, Hereditary Essential, 1; Etm1 | 0.979 | 0.573 |
| Cyproheptadine | Choreoathetosis/Spasticity, Episodic; Cse | 0.979 | 0.978 |
| Citalopram | Insensitivity To Pain With Hyperplastic Myelinopathy | 0.978 | |
| Cyproheptadine | Gilles De La Tourette Syndrome; Gts | 0.978 | 0.784 |


Bolton, E.E., Wang, Y., Thiessen, P.A. and Bryant, S.H. (2008), “Chapter 12 – PubChem: integrated platform of small molecules and biological activities”, in Wheeler, R.A. and Spellmeyer, D.C. (Eds), Annual Reports in Computational Chemistry, Vol. 4, Elsevier, pp. 217-241.

Castillo-Zúñiga, I., Luna-Rosas, F.-J., Muñoz-Arteaga, J. and López-Veyna, J.-I. (2016), “Combination of techniques of big data analytics and semantic web for the detection of vocabulary of harassment school in internet”, DYNA Ingeniería E Industria, Vol. 92 No. 3, pp. 141-142.

Cui, P., Wang, X., Pei, J. and Zhu, W. (2018), “A survey on network embedding”, IEEE Transactions on Knowledge and Data Engineering, Vol. 31 No. 5, pp. 833-852.

Dickson, M. and Gagnon, J.P. (2004), “Key factors in the rising cost of new drug discovery and development”, Nature Reviews Drug Discovery, Vol. 3 No. 5, pp. 417-429.

Dudley, J.T., Deshpande, T. and Butte, A.J. (2011), “Exploiting drug-disease relationships for computational drug repositioning”, Briefings in Bioinformatics, Vol. 12 No. 4, pp. 303-311.

Fernandez-Alvarez, J., Fernandez-Alvarez, H. and Castonguay, L.G. (2018), “Summarizing novel efforts to integrate practice and research from a practice oriented research perspective”, Revista Argentina De Clinica Psicologica, Vol. 27 No. 2, pp. 353-362.

Gottlieb, A., Stein, G.Y., Ruppin, E. and Sharan, R. (2011), “PREDICT: a method for inferring novel drug indications with application to personalized medicine”, Molecular System Biology, Vol. 7, doi: 10.1038/msb.2011.26.

Graul, A.I., Cruces, E. and Stringer, M. (2014), “The year’s new drugs & biologics, 2013: part I”, Drugs of Today, Vol. 50 No. 1, pp. 51-100.

Gribskov, M. and Robinson, N.L. (1996), “Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching”, Computers & Chemistry, Vol. 20 No. 1, pp. 25-33.

Grover, A. and Leskovec, J. (2016), “node2vec: scalable feature learning for networks”, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 855-864.

Hamosh, A., Scott, A.F., Amberger, J.S., Bocchini, C.A. and McKusick, V.A. (2005), “Online Mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders”, Nucleic Acids Research, Vol. 33 No. S1, pp. D514-D517.

Henderson, K., Gallagher, B., Li, L., Akoglu, L., Eliassi-Rad, T., Tong, H. and Faloutsos, C. (2011), “It’s who you know: graph mining using recursive structural features”, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 663-671.

Hoehndorf, R., Schofield, P.N. and Gkoutos, G.V. (2015), “Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases”, Scientific Reports, Vol. 5, doi: 10.1038/srep10888.

Hurle, M.R., Yang, L., Xie, Q., Rajpal, D.K., Sanseau, P. and Agarwal, P. (2013), “Computational drug repositioning: from data to therapeutics”, Clinical Pharmacology & Therapeutics, Vol. 93 No. 4, pp. 335-341.

Karni, S., Soreq, H. and Sharan, R. (2009), “A network-based method for predicting disease-causing genes”, Journal of Computational Biology, Vol. 16 No. 2, pp. 181-189.

Khatoon, T. and Govardhan, A. (2018), “Query expansion with enhanced-BM25 approach for improving the search query performance on clustered biomedical literature retrieval”, Journal of Digital Information Management, Vol. 16 No. 2, pp. 85-98.

Krakan, S., Humski, L. and Skočir, Z. (2018), “Determination of friendship intensity between online social network users based on their interaction”, Tehnički Vjesnik, Vol. 25 No. 3, pp. 655-662.

Kuhn, M., Letunic, I., Jensen, L.J. and Bork, P. (2015), “The SIDER database of drugs and side effects”, Nucleic Acids Research, Vol. 44 No. D1, pp. D1075-D1079.

Law, V., Knox, C., Djoumbou, Y., Jewison, T., Guo, A.C., Liu, Y., Maciejewski, A., Arndt, D., Wilson, M. and Neveu, V. (2013), “DrugBank 4.0: shedding new light on drug metabolism”, Nucleic Acids Research, Vol. 42 No. D1, pp. D1091-D1097.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S. and Dean, J. (2013), “Distributed representations of words and phrases and their compositionality”, Proceedings of the Neural Information Processing Systems Conference, NIPS, pp. 3111-3119.

Nagaraj, A., Wang, Q., Joseph, P., Zheng, C., Chen, Y., Kovalenko, O., Singh, S., Armstrong, A., Resnick, K. and Zanotti, K. (2018), “Using a novel computational drug-repositioning approach (DrugPredict) to rapidly identify potent drug candidates for cancer treatment”, Oncogene, Vol. 37 No. 3, pp. 403-414.

Ou, M., Cui, P., Pei, J., Zhang, Z. and Zhu, W. (2016), “Asymmetric transitivity preserving graph embedding”, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 1105-1114.

Perozzi, B., Al-Rfou, R. and Skiena, S. (2014), “DeepWalk: online learning of social representations”, Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 701-710.

Pletscher-Frankild, S., Palleja, A., Tsafou, K., Binder, J.X. and Jensen, L.J. (2015), “DISEASES: text mining and data integration of disease-gene associations”, Methods, Vol. 74 No. 3, pp. 83-89.

Sen, P., Namata, G., Bilgic, M., Getoor, L., Galligher, B. and Eliassi-Rad, T. (2008), “Collective classification in network data”, AI Magazine, Vol. 29 No. 3, pp. 93-106.

Shameer, K., Readhead, B. and Dudley, J.T. (2015), “Computational and experimental advances in drug repositioning for accelerated therapeutic stratification”, Current Topics in Medicinal Chemistry, Vol. 15 No. 1, pp. 5-20.

Singh-Blom, U.M., Natarajan, N., Tewari, A., Woods, J.O., Dhillon, I.S. and Marcotte, E.M. (2013), “Prediction and validation of gene-disease associations using methods inspired by social network analyses”, PloS One, Vol. 8 No. 5, e58977.

Sirota, M., Dudley, J.T., Kim, J., Chiang, A.P., Morgan, A.A., Sweet-Cordero, A., Sage, J. and Butte, A.J. (2011), “Discovery and preclinical validation of drug indications using compendia of public gene expression data”, Science Translational Medicine, Vol. 3 No. 96, pp. 96ra77-96ra77.

Steinbeck, C., Hoppe, C., Kuhn, S., Floris, M., Guha, R. and Willighagen, E.L. (2006), “Recent developments of the Chemistry Development Kit (CDK) – an open-source java library for chemo-and bioinformatics”, Current Pharmaceutical Design, Vol. 12 No. 17, pp. 2111-2120.

Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J. and Mei, Q. (2015), “Line: large-scale information network embedding”, Proceedings of the 24th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 1067-1077.

Tang, L. and Liu, H. (2011), “Leveraging social media networks for classification”, Data Mining and Knowledge Discovery, Vol. 23 No. 3, pp. 447-478.

Van Driel, M.A., Bruggeman, J., Vriend, G., Brunner, H.G. and Leunissen, J.A. (2006), “A text-mining analysis of the human phenome”, European Journal of Human Genetics, Vol. 14 No. 5, pp. 535-542.

Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. and Sharan, R. (2010), “Associating genes and protein complexes with disease via network propagation”, PLoS Computational Biology, Vol. 6 No. 1, p. e1000641, doi: 10.1371/journal.pcbi.1000641.

Wang, X., Guo, Y. and Wang, Z. (2017), “Multi-view discriminative manifold embedding for pattern classification”, Journal of Intelligent Computing Volume, Vol. 8 No. 2, pp. 58-63.

Wang, X., Cui, P., Wang, J., Pei, J., Zhu, W. and Yang, S. (2017), “Community preserving network embedding”, 31st AAAI Conference on Artificial Intelligence, AAAI, pp. 203-209.

Wang, Y., Chen, S., Deng, N. and Wang, Y. (2013), “Drug repositioning by kernel-based integration of molecular structure, molecular activity, and phenotype data”, PloS One, Vol. 8 No. 11, p. e78518.

Wei, X. (2018), “Using computational method to extract drug-disease associations from multiple biomedical databases”, Proceedings of the 13th International Conference on Digital Information Management (ICDIM 2018), ICDIM, pp. 286-295.

Wei, X., Huang, Y., Lyu, C. and Ji, D. (2015), “Extracting nested biomedical entity relations by tagging dependency chains”, Journal of Engineering Science & Technology Review, Vol. 8 No. 4, pp. 51-55.

Weibull, W. (1951), “A statistical distribution function of wide applicability”, Journal of Applied Mechanics, Vol. 18 No. 3, pp. 293-297.

Weininger, D. (1988), “SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules”, Journal of Chemical Information and Computer Sciences, Vol. 28 No. 1, pp. 31-36.

Yang, J., Li, Z., Fan, X. and Cheng, Y. (2014), “Drug-disease association and drug-repositioning predictions in complex diseases using causal inference–probabilistic matrix factorization”, Journal of Chemical Information and Modeling, Vol. 54 No. 9, pp. 2562-2569.

Yang, L. and Agarwal, P. (2011), “Systematic drug repositioning based on clinical side-effects”, Plos One, Vol. 6 No. 12, e28025.

Yap, C.W. (2011), “PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints”, Journal of Computational Chemistry, Vol. 32 No. 7, pp. 1466-1474.

Zhang, J., Yu, P.S. and Zhou, Z.-H. (2014), “Meta-path based multi-network collective link prediction”, Proceedings of the 20th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, ACM, pp. 1286-1295.

Zhang, P., Agarwal, P. and Obradovic, Z. (2013), “Computational drug repositioning by ranking and integrating multiple data sources”, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp. 579-594.

Zhang, W., Yue, X., Lin, W., Wu, W., Liu, R., Huang, F. and Liu, F. (2018), “Predicting drug-disease associations by using similarity constrained matrix factorization”, BMC Bioinformatics, Vol. 19 No. 1, doi: 10.1186/s12859-018-2220-4.


This work was supported in part by the National Natural Science Foundation of China (No. 31501076).

Corresponding author

Xiaomei Wei can be contacted at: