Ranking potentially harmful Tor hidden services: Illicit drugs perspective

Cryptomarkets on the dark web have emerged as a hub for the sale of illicit drugs. They have made it easier for the customers to get access to illicit drugs online while ensuring their anonymity. The easy availability of potentially harmful drugs has resulted in a significant impact on public health. Consequently, law enforcement agencies put a lot of effort and resources into shutting down online markets on the dark web. A lot of research work has also been conducted to understand the working of customers and vendors involved in the cryptomarkets that may help the law enforcement agencies. In this research, we present a ranking methodology to identify and rank top markets dealing in harmful illicit drugs. Using named entity recognition, a harm score of a drug market is calculated to indicate the degree of threat, followed by the ranking of drug markets. The top-ranked markets are the ones selling the most harmful drugs. The rankings thus obtained can be helpful to law enforcement agencies by locating specific markets selling harmful illicit drugs and their further monitoring.


Introduction
The proliferation of the Internet and communications technology has paved the way for a variety of services to the general public. This has also made it easier to carry out a range of illicit activities. The dark web, an anonymous platform, is one such place where illegitimate activities are prevalent, such as the sale of illegal drugs and the distribution of child abuse, violent and hate content [1][2][3]. The Onion Router (Tor) is the most common method to access the dark web; other methods like Freenet and I2P exist but are used by fewer users [4]. A website on the Tor dark web is known as a hidden service (HS). Among all the goods and services present on these online platforms, illicit drugs are the most popular product [5]. These online platforms are generally referred to as cryptomarkets or darknet markets (DNM). The easy availability of these illicit drugs may significantly harm consumers through drug overdoses [6], which requires continuous surveillance of DNM by the law enforcement agencies.
Much research has been done to help the law enforcement agencies understand the modus operandi of DNM [7][8][9]. However, this work does not provide methods to quantify the overall threat or harm posed by online illicit drug platforms in terms of health concerns. There exist key players in the online ecosystem in terms of health impact, and their activity requires immediate investigation by the law enforcement agencies. They are characterized in the dark web by the broad range of illicit drugs they sell.
Therefore, this study tries to fill this gap by proposing a content-based ranking methodology that can help the law enforcement agencies in identifying the most harmful dark markets dealing with drugs. Consequently, the concerned agencies can put more significant effort into shutting down those markets. This does not imply that the other DNM should be ignored; rather, the law enforcement agencies may prioritize their efforts towards the DNM that require immediate investigation and action. The proposed methodology is evaluated with standard metrics by conducting experiments on a Tor dark web dataset. In the rest of the paper, the term HS refers to the DNM and websites on the Tor dark web that deal with illicit drugs.
The rest of the paper is organized as follows: Section 2 describes the related work. Section 3 elaborates on the proposed ranking methodology. Section 4 provides the experiment settings followed by results and discussion. Finally, Section 5 draws the conclusion and possible future work.

Related work
The dark web markets have witnessed a surge in their size and significance after the launch of Silk Road, with the estimated annual drug trade of USD 170 million [10]. Several studies have managed to reveal the properties, working and impacts of DNM [11,12]. One of the first studies on cryptomarkets was focused on the Silk Road market, where the author collected and analyzed the data from the live market on a regular basis. The author found that drug-related products were the most popular [9]. The qualitative analysis of the discussion forums of the DNM has uncovered the user experiences regarding the purity, effects, potency of drugs and the methods to evaluate the quality of drugs [13].
The quantitative method generally uncovers the geographical reach of vendors and their lifespan, customer retention methods, trade statistics and market challenges. A descriptive study on DNM has found that the largest number of vendors operate from the United States followed by European countries. However, The Netherlands has the highest number of vendors per 100 thousands of the population [14]. In an attempt to maintain a healthy customer base and to inflate the product reviews on the DNM, the vendors often resort to giving out free samples of their major drug products to the customers [15]. The analysis of the data collected from the AlphaBay cryptomarket indicates a highly competitive ecosystem among the vendors where only a small number of vendors manage to sustain their business through aggressive advertising [16]. A study focused on the sale of psychoactive substances has found that the vendors involved in the business of psychoactive substances have a short lifespan. However, the number of vendors on DNM shows an increasing trend [17].
The crawling mechanism for the surface web is simple given the easy availability of the website addresses on the Internet. In contrast, there does not exist any system that contains the addresses of the Tor hidden services. This renders the crawling of hidden services a slow and challenging task. Moreover, a complete scan of the entire Tor web would be impractical as there would be $32^{16}$ different addresses to be crawled (the address of a Tor HS is composed of 16 characters) [18]. The researcher has used some of the publicly available listings of Tor HS addresses on the surface web for further exploration of the Tor dark web. The crawling process is also hampered by the requirement of login credentials in the case of DNM. Because DNM try to protect their privacy, there is also a risk of the user account getting blocked if the DNM suspects crawling activity. A systematic methodology was proposed for scraping the DNM data where automated spiders were used to overcome the crawling challenges [19].
A recent study [20] has proposed a link-based ranking technique called ToRank to identify influential Tor HS. As per the authors, the HS that is ranked higher by ToRank is more popular HS among others in the Tor dark web ecosystem. Identification of popular HS may help the law enforcement agencies in getting clues about the working of DNM. However, being a link analysis algorithm, ToRank purely relies on the hyperlinks between the HS and does not take into account the content of the HS while ranking them. Other studies have focused on dark web forums to identify key users [21,22,34].
The User Rank algorithm based on Page Rank was put forth to identify influential users through message content analysis and the level of attentiveness between the users [23].
The existing work explores the general working of the markets by collecting and analyzing customer feedback and comments, product ratings, vendor profiles, etc. The analysis of the DNM data can also be leveraged to quantitatively calculate the negative implications of the DNM resulting from the usage of illicit drugs. A hyperlink-based approach may not be efficient in ranking the influential HS, because most of the web pages in a Tor HS are difficult to find as they are linked by very few other pages, and a Tor HS has very few outgoing links to other services [24]. Therefore, we propose a content-based approach to rank HS trading in harmful illicit drugs. The proposed approach can be utilized to proactively monitor and investigate such HS. Moreover, the proposed approach also recognizes isolated HS (those with no incoming and outgoing hyperlinks) if they offer illicit drugs, which are otherwise not detected by conventional link analysis algorithms.

Proposed method
The proposed algorithm is specifically designed to work in the domain of illicit drug trading in the dark web. The computational time of the algorithm can be reduced if it is fed with the pre-identified dataset of drug-related HS. The existing work that focused on classifying suspicious activities on the dark web using machine learning can be used for this purpose [25]. The proposed algorithm can then be applied to retrieve top dangerous services from the bunch of HS. Figure 1 shows the proposed ranking technique. The design of the proposed ranking technique consists of the four main components: data preprocessing, illicit drugs name extraction, harm score calculation and rankings of HS.

Data preprocessing
The dataset used in this study consists of one HTML file representing each of the HS. A custom-made Python script was employed to extract the available product listings from the HTML file for each HS in the dataset. Since a DNM may sell different types of products, the extracted set of listings may contain several products other than drugs and also from different vendors. After extracting all the product listings, the textual content with HTML tags removed is obtained and stored in a plain text file for each of the HS in the dataset. The plain text file is then processed to remove all the irrelevant content like scripts, hyperlinks, punctuation and white spaces. This is followed by converting all text to lowercase and then removing stop-words and duplicates. A parser using regular expressions was employed to identify and remove numbers having either a single digit or more than three digits. The reason for doing this is discussed in Subsection 3.2. The obtained data is then tokenized, that is, long strings of text are broken into smaller pieces called tokens. In our case, tokens are single words and numbers of two and three digits only. After the tokenization process, the tokens are stored in the form of a list in a text file for further processing.
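The preprocessing steps described above can be sketched as follows. This is a minimal illustration and not the script used in the study: the function name `preprocess` and the small `STOP_WORDS` set are hypothetical (the study uses the NLTK stop-word list), and the order of the steps only roughly follows the description.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the study reports using NLTK for stop-word removal.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "for", "to", "in", "on", "with"}


def preprocess(text):
    """Return a list of unique lowercase tokens from raw listing text.

    Numeric tokens are kept only if they have two or three digits,
    because some drugs are referred to by such numbers.
    """
    text = text.lower()
    # Remove hyperlinks before punctuation is stripped.
    text = re.sub(r"https?://\S+", " ", text)
    # Replace every character that is not a letter, digit or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)

    tokens = []
    seen = set()
    for tok in text.split():
        if tok in STOP_WORDS or tok in seen:
            continue  # drop stop-words and duplicates
        if tok.isdigit() and len(tok) not in (2, 3):
            continue  # drop single-digit and 4+ digit numbers
        seen.add(tok)
        tokens.append(tok)
    return tokens


print(preprocess("Buy the BEST Cocaine, 5 grams! 77 or 501 pills; order 12345 cocaine"))
```

Splitting on whitespace after punctuation removal is the simplest possible tokenizer; a library tokenizer could be substituted without changing the rest of the pipeline.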

Illicit drugs name extraction
The preprocessed data in the form of tokens received from the first step contains other words and product names along with the names of illicit drugs. Drug name recognition (DNR) is applied to recognize the names of drug-related products. DNR is a particular type of Named Entity Recognition (NER) task that extracts names of drugs from unstructured text. DNR is both important and challenging in our task of extracting illicit drug names. To confuse the law enforcement agencies, consumers and vendors trading in illicit drugs use common street names or slang words for illicit drugs, with some of the names being common English phrases and words used in day-to-day life. For example, pastas may refer to the class of drugs called amphetamines. Moreover, they also use the brand names of the drugs instead of their generic names. The drugs may also be referred to in numeric form, like 77, 501, etc.; for this reason, specific numeric tokens were kept at the preprocessing stage. Bigrams and trigrams are generated from the list of tokens obtained after preprocessing. The purpose behind their creation is to identify drug names and slang terms composed of two or three words. A dictionary-based DNR approach is used to identify illicit drug names and their slang terms, followed by putting them into the appropriate class of drugs. Dictionary-based approaches require a drug dictionary to match against the text document. The dictionary-based approach is utilized in this study due to the context of the problem. As discussed above, drug vendors mostly use slang terms for trading illicit drugs on the DNM. These slang terms bear no relation to the original names of the drugs; also, there is no naming convention or nomenclature used for generating slang terms. In this context, a rule-based approach could not be effectively applied; on the other hand, a comprehensive dictionary of common and slang drug terms can be used for exact matching of the text.
However, the drug dictionary should be updated regularly given the ever-changing ecosystem of the illicit drug trade. In our work, we have used the dictionary of slang and code words of drugs by the US Drugs Enforcement Administration [26].
In our approach, we use DNR to identify eighteen different types of controlled and prohibited drugs identified in a study that evaluates the harmful effects of such drugs [27]. The study is discussed in Section 3.3. The description of the drugs is given in Table 1.
The final list of tokens containing the single tokens, bigrams and trigrams for each HS in the dataset is obtained. An associative array for each of the HS is created that records the drug type and its frequency in a key-value format. The final list of tokens is then matched against the drug dictionary. There may be typos present in the listings which can affect the search mechanism in the drug dictionary. To overcome this, we use the Levenshtein distance to measure the similarity of a token to the entries of the drug dictionary in case of typos. The Levenshtein distance measures the number of single-character edits needed to convert one string into the other. If an exact match is found for a token or the Levenshtein distance of the token is less than 25 percent, then the corresponding drug type of that token from Table 1 is added in the key field of the associative array of that particular HS. After that, the matched token is searched in the set of extracted product listings for the HS. Since there may be multiple listings of a single drug type from various vendors in an HS, the number of listings matched with the token is counted and stored in the value field of the corresponding drug type in the associative array; this is the frequency of the drug type. The matched listings are removed from the extracted set of listings. This procedure is repeated for each of the tokens that get matched with the drug dictionary. If a token is matched against a drug type that already exists in the associative array, then the frequency of that drug type is updated accordingly. Once all the tokens are matched, the remaining listings left in the set are non-drug products and these listings are discarded. The associative array containing the drug types recognized along with their frequency for each of the HS is passed on to the next stage for the calculation of the harm score.
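The matching procedure above can be illustrated with a short sketch. Several details are assumptions made for the example and are not confirmed by the text: the entries of `DRUG_DICT` are a toy stand-in for the DEA dictionary [26], all function names are hypothetical, and the "less than 25 percent" criterion is interpreted as the edit distance divided by the length of the longer string.

```python
from collections import Counter

# Toy dictionary mapping slang terms to drug types (hypothetical entries).
DRUG_DICT = {
    "cocaine": "cocaine",
    "coke": "cocaine",
    "acid": "LSD",
    "mary jane": "cannabis",
    "weed": "cannabis",
}


def ngrams(tokens, n):
    """Join every run of n consecutive tokens into one candidate term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def match_drug_type(term, threshold=0.25):
    """Exact dictionary lookup first, then a fuzzy match for typos.

    Assumption: the distance is normalized by the longer string length.
    """
    if term in DRUG_DICT:
        return DRUG_DICT[term]
    for entry, drug_type in DRUG_DICT.items():
        if levenshtein(term, entry) / max(len(term), len(entry)) < threshold:
            return drug_type
    return None


def drug_frequencies(tokens, listings):
    """Build the associative array {drug type: number of matched listings}."""
    candidates = tokens + ngrams(tokens, 2) + ngrams(tokens, 3)
    remaining = list(listings)
    freq = Counter()
    for term in candidates:
        drug_type = match_drug_type(term)
        if drug_type is None:
            continue
        matched = [item for item in remaining if term in item.lower()]
        freq[drug_type] += len(matched)
        # Matched listings are removed so they are not counted twice.
        remaining = [item for item in remaining if item not in matched]
    return dict(freq)
```

For example, calling `drug_frequencies(["cocaine", "pure", "mary", "jane", "cocane"], listings)` on a handful of listings counts the misspelled "cocane" listing under cocaine and the bigram "mary jane" under cannabis, while unmatched listings are simply left over and discarded.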
An example of the final associative array of a HS offering cocaine, LSD, cannabis and ecstasy is shown in Figure 2.

Measuring harm score of HS
Drug abuse has become a significant health concern for individuals and society. Therefore, the use of such drugs is controlled and prohibited by the policy-making bodies. The abuse of drugs can cause harm in multiple ways, from individual harm to environmental and economic damage. Hence, it is necessary to estimate the harm of each of the controlled drugs across the multitude of ways in which it has an effect. An existing study has proposed a method to solve this problem, where the authors' aim was to aid policymakers in the field of health and investigation by evaluating the harms caused by drug abuse [27]. A committee of drug experts from the United Kingdom was formed to assess the harm of 20 drugs based on 16 criteria using multi-criteria decision analysis (MCDA). The drugs were given an overall score on a scale from 0 to 100, with 0 indicating the least harmful and 100 the most harmful drug on all 16 criteria. The names of the illicit drugs and their corresponding scores are shown in Table 2. The criteria were: Drug-specific mortality, Drug-related mortality, Drug-specific damage, Drug-related damage, Dependence, Drug-specific impairment of mental functioning, Drug-related impairment of mental functioning, Loss of tangibles, Loss of relationships, Injury (to others), Crime, Environmental damage, Family adversities, International damage, Economic cost, and Community. After some criticism of the applied methodology, the authors came up with a follow-up study to assess the harm of drugs on a broader scope. However, the scores obtained in the follow-up were very similar to the previous ones, with a correlation of 0.993 [28].
To calculate the overall harm score of an HS, we shall be adopting the scoring of the illicit drugs indicating their cumulative harm from the study discussed above [27]. It should be noted that alcohol and tobacco were included in the 20 drugs that were assessed in the study; however, our work is in the context of illicit drugs hence we did not consider alcohol and tobacco as they are not controlled substances.
Let $L_i$ ($i = 1, 2, \ldots, M$) be the associative array of the $i$-th HS $H_i$ in the dataset. A drug vector $V_i$ of length $n$ ($n = 18$) is created for $H_i$. The elements $x_i^k$ ($k = 1, 2, \ldots, 18$) of $V_i$ are obtained using Eq. (1), where $d_k$ and $t(d_k)$ are the drug type and its individual harm score obtained from Table 2, and $f(d_k)$ is the frequency of drug type $d_k$ in the associative array $L_i$.
A harm score $\tau(H_i)$ is assigned to $H_i$, given by Eq. (2).
As the HS may sell a number of different products with multiple listings, $|V_i|$ may get a very large numeric value. Therefore, the logarithm function is used to calculate the harm score to conveniently express the large values. One is added in the logarithm function to avoid getting zero in the argument (when $|V_i| = 0$), for which the log function is undefined. An ethical HS that does not deal in illicit drugs shall have $|V_i| = 0$ and subsequently a harm score of zero, indicating that it does not pose any ill effects on its users.
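The following sketch shows one way the harm score could be computed from an associative array. Eqs. (1) and (2) are not reproduced in this text, so the forms used here are assumptions inferred from the surrounding description: each element is taken as the individual harm score times the frequency, the magnitude of the drug vector is taken as the sum of its elements, and the natural logarithm is used. The `HARM` table holds only two entries: the value 54 for crack cocaine is quoted in the Ranking section, and 27 for cocaine is the value implied by the tie example there under these assumptions. The remaining drug types should be filled in from Table 2.

```python
import math

# Individual harm scores t(d_k); only two of the eighteen drug types are
# listed here. Remaining entries should be taken from Table 2.
HARM = {"crack cocaine": 54, "cocaine": 27}


def harm_score(freq):
    """Harm score of one HS from its {drug type: frequency} array.

    Assumed forms (the equations are not shown in the text):
      x_k  = t(d_k) * f(d_k)
      |V|  = sum of x_k
      tau  = log(1 + |V|)
    """
    magnitude = sum(HARM[drug] * count for drug, count in freq.items())
    return math.log(1 + magnitude)


print(harm_score({}))                    # HS with no illicit drug listings
print(harm_score({"crack cocaine": 1}))  # one listing of crack cocaine
```

Under these assumptions an HS without drug listings receives a score of exactly zero, and the score grows slowly with the number of listings because of the logarithm.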

Ranking
Some HS trade in less harmful drugs while others may offer potentially more harmful drugs, so in order to identify the most severe HS, we need to rank them. The overall ranking depends on the harm score of the HS. The HS are ranked in descending order of their harm score, so that the HS with the highest harm score gets the top rank. In the case of a tie, the HS offering the drug with the highest individual harm score is placed higher in the rankings. For example, let X and Y be two HS, where X contains a single listing of crack cocaine and Y contains two listings of cocaine. Both X and Y have the same harm score, but X is placed above Y in the ranking because of the presence of crack cocaine (with individual harm score 54).
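The ranking rule with its tie-break can be sketched as a sort on two keys. As before, this is an illustration under assumptions: the harm score is taken as the logarithm of one plus the sum of harm score times frequency, the `HARM` table is restricted to the two drugs of the example (with 27 for cocaine implied by the tie), and the function name `rank_services` is hypothetical.

```python
import math

# Two illustrative entries; the remaining values come from Table 2.
HARM = {"crack cocaine": 54, "cocaine": 27}


def rank_services(services):
    """Order HS names from most to least harmful.

    services maps an HS name to its {drug type: frequency} array.
    Primary key: harm score (descending).
    Tie-break:   highest individual harm score among the drugs offered.
    """
    def sort_key(name):
        freq = services[name]
        magnitude = sum(HARM[d] * f for d, f in freq.items())
        top_drug = max((HARM[d] for d, f in freq.items() if f > 0), default=0)
        return (-math.log(1 + magnitude), -top_drug)

    return sorted(services, key=sort_key)


example = {
    "Y": {"cocaine": 2},        # two listings of cocaine
    "X": {"crack cocaine": 1},  # one listing of crack cocaine
    "Z": {},                    # no illicit drugs
}
print(rank_services(example))
```

In this example X and Y have equal harm scores, and X is placed first because crack cocaine has the higher individual harm score.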

Experiments
The proposed ranking methodology attempts to uncover potentially harmful HS using content analysis technique. The rankings generated by the proposed technique need to be evaluated on the standard metrics for ranking problems. However, to the best of authors' knowledge, there is no gold standard or ground truth on the content-based ordering of the HS against which the generated rankings can be evaluated. Hence, to evaluate the accuracy of our proposed ranking technique, the rankings of the HS in the dataset from the three experts are considered as the ground truth.

Dataset
We have used the DUTA-10K dataset in our work. DUTA-10K has been used in an existing study to test the ToRank algorithm [20]. The dataset contains 10,367 labeled samples spread across 28 categories. Each sample represents a hidden service from the Tor dark web and contains the root page and the first-level subpages of an HS in a single HTML file. In our case, we were only interested in the HS related to illicit drug trade; therefore, we have taken the 255 English-language HS samples dealing in illicit drugs. The DUTA-10K dataset is publicly available for download at [29].

Ground truth generation
The dataset was presented to three experts, who ranked the HS to obtain the ground truth. One expert was a professional medical doctor, another was a psychologist and the last one was from academia. All three experts were asked to independently rank each of the HS in the dataset based on the availability of illicit drug listings on the HS, the severity of the available drugs and the frequency of listings. This generates three rankings, one from each of the independent experts. Since the problem of ranking is very subjective and each expert's perception of drug harms may vary, the evaluation may be biased if we consider the ranking from any single expert. To tackle this problem, we use the aggregation method called rank-based aggregation (RBA) [30] to create the final rankings from the three experts. In this method, the individual ranks of each HS from the three experts are combined, and then the final rankings are generated. This method ensures that any noise introduced by one expert is compensated by the other experts during aggregation. The final list thus generated is used as ground truth for evaluating the accuracy of the rankings generated by our proposed method.
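To make the idea of combining expert ranks concrete, the sketch below sums the rank positions given by each expert and sorts by the total. This is a simple Borda-style stand-in chosen for illustration; it is an assumption and not necessarily the RBA procedure of [30], whose details are not given here. The function name `aggregate_ranks` is hypothetical.

```python
def aggregate_ranks(rankings):
    """Combine several rankings of the same items into one.

    rankings is a list of lists, each ordered from most to least harmful.
    Items are ordered by the sum of their positions (lower is better);
    ties are broken by name so the output is deterministic.
    """
    totals = {}
    for ranking in rankings:
        for position, item in enumerate(ranking, start=1):
            totals[item] = totals.get(item, 0) + position
    return sorted(totals, key=lambda item: (totals[item], item))


experts = [
    ["A", "B", "C"],  # expert 1
    ["B", "A", "C"],  # expert 2
    ["A", "C", "B"],  # expert 3
]
print(aggregate_ranks(experts))
```

A disagreement from a single expert, such as expert 2 placing B first, shifts the totals only slightly, which is the noise-compensating effect described above.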

Evaluation metrics
The correctness of the rankings obtained from the proposed ranking methodology is evaluated by the Kendall's tau [31] metric commonly used in the field of information retrieval. Kendall's tau has been chosen given its wide use in the literature [23,35,36,37,38], and it has been shown to be a more robust and efficient metric than the others [39]. We have also used the rank-biased overlap (RBO) metric, which, similar to the weighted Kendall's tau [40], gives more importance to the top of the ranked list, as our work is focused on identifying the top-ranked HS. Moreover, RBO can efficiently handle non-conjointness in the rankings as compared to the other metrics [32].

Results
Python v3.6 [33] is used to implement the proposed ranking algorithm using harm score. The preprocessing of the dataset, including the removal of stop-words, was performed by the NLTK package. The harm score for each of the HS is obtained using Eqs. (1) and (2).
The top ten ranked HS retrieved by the proposed method and from the ground truth are shown in Table 3. For simplicity, each HS is denoted by a letter instead of its onion domain. The HS with the highest ranking obtained by the proposed algorithm is the same as the one ranked highest by the experts. This HS is found to be selling eleven different illicit drugs from Table 1. Kendall's tau is used for comparing the rankings generated by the proposed methodology to the ground truth rankings to assess the correctness of the ranks assigned to each of the HS.
Table 3. Top ten ranked HS obtained from the proposed algorithm and the ground truth.
For each pair of ranks $(p_i, q_i), (p_j, q_j), \ldots, (p_n, q_n)$ in the lists $P$ and $Q$, let $n_c$ and $n_d$ be the number of concordant pairs (if $p_i < p_j$ and $q_i < q_j$, or $p_i > p_j$ and $q_i > q_j$) and discordant pairs (if $p_i < p_j$ and $q_i > q_j$, or $p_i > p_j$ and $q_i < q_j$), respectively. The Kendall's tau of the two ranking lists $P$ and $Q$ of size $n$ is given by Eq. (3):
$$R(P, Q) = \frac{n_c - n_d}{\frac{1}{2} n (n - 1)} \qquad (3)$$
Table 4 shows Kendall's tau for the proposed ranking algorithm for different values of $k$, where $k$ represents the number of pairs in the rankings. The maximum value of $k$ is set to $k = 50$, i.e. we examine the closeness between the two rankings up to the top 50 ranked HS. From Table 4, it is evident that the proposed algorithm can accurately predict the top ten harmful HS when compared to the ground truth. The Kendall's tau value close to one in Table 4 indicates that the two rankings are highly related. However, Kendall's tau slightly decreases when $k$ increases. The HS that have been allotted the top ranks are potentially the most harmful and hold a key position in the illicit drug trade. The law enforcement agencies may allocate more of their resources to shutting down these top-ranked HS.
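Kendall's tau as defined in Eq. (3) can be computed directly by counting concordant and discordant pairs. The sketch below is a plain implementation for illustration; the function name `kendall_tau` is hypothetical, ties are counted as neither concordant nor discordant, and at least two items are assumed.

```python
from itertools import combinations


def kendall_tau(p, q):
    """Kendall's tau between two rank lists of equal length n (n >= 2).

    p[i] and q[i] are the ranks assigned to item i by the two rankings.
    """
    n = len(p)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (p[i] - p[j]) * (q[i] - q[j])
        if product > 0:
            concordant += 1   # both rankings order i and j the same way
        elif product < 0:
            discordant += 1   # the rankings disagree on i and j
    return (concordant - discordant) / (0.5 * n * (n - 1))


proposed = [1, 2, 3, 4]
ground_truth = [1, 3, 2, 4]
print(kendall_tau(proposed, ground_truth))
```

With one swapped pair out of six, five pairs are concordant and one is discordant, so the value is 4/6. Identical rankings give 1 and fully reversed rankings give -1.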
To further check the accuracy of our ranking methodology, we have obtained eight random samples of the rankings from the entire list and computed the Kendall's tau of the random samples and the ground truth. Table 5 shows the Kendall's tau of the different samples; the size of each sample is 25. The high value of Kendall's tau for the random samples of ranks shows that the proposed ranking methodology is close to the ground truth data.
Table 4. Effectiveness of the proposed algorithm using Kendall's tau.
Since the top-ranked HS are of greater importance to the law enforcement agencies in our work, the accuracy of the proposed method in the high ranks is examined. Rank-biased overlap (RBO) is used to measure the accuracy in high ranks by giving different weights to different ranks and allotting higher weights to high ranks. A higher RBO value indicates greater similarity and correlation between the two rankings. The RBO of two ranking lists $P$ and $Q$ is given by Eq. (4),

where $A(P, Q, d)$, given by Eq. (5), is the overlap value between two rankings $P$ and $Q$ up to rank $d$, $r$ is the number of unique ranks and $p$ is a configurable parameter in (0,1) such that a smaller value of $p$ implies that the metric is more top-weighted.
$$A(P, Q, d) = \frac{|P_{1:d} \cap Q_{1:d}|}{|P_{1:d} \cup Q_{1:d}|} \qquad (5)$$
Figure 3 shows the value of RBO between the two rankings at different values of $p$. The proposed algorithm can be seen to produce rankings that agree closely with the ground truth for the high ranks.
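The overlap in Eq. (5) and a rank-biased overlap built on it can be sketched as follows. Eq. (4) is not reproduced in this text, so the form of `rbo` is an assumption: it uses the commonly cited geometric weighting $(1 - p) \sum_{d=1}^{r} p^{d-1} A(P, Q, d)$, truncated at the length of the longer list. The function names are hypothetical, and both lists are assumed to be non-empty.

```python
def overlap(p_list, q_list, d):
    """Overlap of Eq. (5): shared items in the top d of both lists,
    divided by the number of distinct items in either top d."""
    top_p, top_q = set(p_list[:d]), set(q_list[:d])
    return len(top_p & top_q) / len(top_p | top_q)


def rbo(p_list, q_list, p=0.9):
    """Truncated rank-biased overlap (assumed form, see lead-in).

    p lies in (0, 1); smaller p puts more weight on the top ranks.
    """
    depth = max(len(p_list), len(q_list))
    total = sum(p ** (d - 1) * overlap(p_list, q_list, d)
                for d in range(1, depth + 1))
    return (1 - p) * total


proposed = ["A", "B", "C"]
ground_truth = ["A", "B", "C"]
print(rbo(proposed, ground_truth, p=0.5))
```

Because the sum is truncated, identical finite lists score below one in this sketch (the weights only sum to $1 - p^r$); whether and how the score is normalized is a design choice that the text does not specify.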

Conclusion
In this work, a methodology based on content analysis for ranking harmful Tor hidden services dealing with illicit drugs is proposed. A metric is defined to calculate the harm score for the HS based on the different types of illicit drugs present on the HS. The harm score is then used to generate the overall rankings of the HS on the Tor dark web dataset. In order to assess the accuracy and correctness of the proposed ranking methodology, we created the ground truth for the dataset with the help of experts. The standard metrics used for evaluation indicate the good performance of the proposed method in ranking highly harmful HS. The top-ranked HS can then be put under greater monitoring by the law enforcement agencies. In future work, the proposed methodology can be strengthened by using more sophisticated NLP methods, such as the one introduced in Ref. [41], for identifying code words for illicit drugs. Moreover, other factors, like the trustworthiness and usability of HS, can also be quantified to assess the impact of HS.