Research on comment target extracting in Chinese online shopping platform

Purpose – This paper aims to extract the comment targets in reviews on Chinese online shopping platforms. Design/methodology/approach – The authors first collect the comment texts, perform word segmentation and part-of-speech (POS) tagging, and extract feature words twice. They then cluster the evaluation sentences, mine the association rules between the evaluation words and the evaluation objects, and build an association rule table. The evaluation object of a comment sentence can then be mined from its evaluation words and the association rule table. Finally, they obtain comment data from Taobao and demonstrate by experiment that the proposed method is effective. Findings – The comment target extraction method proposed in this paper is effective. Research limitations/implications – First, implicit features are extracted from individual review clauses without considering context information, which may affect the accuracy of feature mining to a certain degree. Second, low-frequency feature words are not considered when extracting feature words, although some of them also contain useful information. Practical implications – Because of the mass of online review data, reading every comment one by one is impossible. It is therefore important to study how to process product comments and present the useful or interesting ones to customers. Originality/value – The comment target extraction method proposed in this paper is effective.


Introduction
People can buy goods without leaving home via websites. However, because people cannot see the physical product, there is no guarantee that the quality of the items is in line with their expectations. In most cases, people glance over the product reviews before they confirm an order. Product reviews not only help people make purchase decisions, but also help merchants understand customers' attitudes toward the product, so that merchants can improve their goods and services in a targeted way (Chen et al., 2015). However, because of the mass of online review data, reading every comment one by one is impossible.
Research on processing product comments and presenting the useful or interesting ones to customers is therefore significant.
One comment sentence represents one opinion. At present, research on opinion mining mainly focuses on two aspects. One is mining the emotional orientation (Pak and Paroubek, 2015; Yi et al., 2003; Brody and Elhadad, 2010; Pang et al., 2016), that is, identifying the customer's attitude toward the product, which is generally negative, positive or neutral (Pang et al., 2002). Positive comments are good for the sale of products, while negative comments can inhibit it (Luo, 2009; Herr et al., 1991; Qiu et al., 2015). The other is mining the opinion object (Schouten and Frasincar, 2014; Popescu and Etzioni, 2005; Jakob and Gurevych, 2010; Scaffidi et al., 2007; Wang et al., 2013). A product sold online usually has many properties. For example, a T-shirt has the characteristics of comfort, quality, color, logistics, etc., and a telephone has the characteristics of cost, appearance, service, etc. In addition, different people value different characteristics. The opinion objects of a comment sentence are usually one or several characteristics of the product. Mining the comment object of each comment sentence enables people to quickly find, in the massive review data, the reviews related to the feature of a commodity that they care about. The customers' time is thereby saved.
In this paper, we concentrate on the second aspect, namely mining the opinion targets in online commodity comments. Opinion objects include explicit features and implicit features (Qiu et al., 2015). An explicit feature is a word or phrase that describes a property or characteristic of the product and appears directly in the comment text. For example, in the comment "这款手机价格便宜。" (this mobile phone's price is cheap), the opinion object is "价格" (price), and this word appears in the comment directly. Hence, we call the comment an explicit comment sentence and "价格" (price) its explicit feature. An implicit feature is an opinion object that does not appear in the comment text but can be deduced from the context or from idiomatic usage. For example, from the comment "这款手机便宜。" (this mobile phone is cheap), we can deduce that the comment object is "价格" (price). Hence, "price" is the implicit feature and this comment is called an implicit comment sentence.
In this paper, we propose an opinion object extraction method for online shopping platform comments based on association rules. In Section 2, we summarize previous research. In Section 3, the comment sentences are segmented and part-of-speech tagged, and TF-IDF is used to select feature words. We then use the particle swarm algorithm based on simulated annealing (SA-PSO) to select feature words once again and obtain the feature word set. In Section 4, we use an improved FCM algorithm based on SA (SA-FCM) to cluster the explicit comment sentences. In Section 5, we mine association rules among explicit features, opinion words and categories, and establish an association rule table. According to the opinion words and the association rule table, the evaluation objects of the comment sentences can be identified. An experiment in Section 6 shows that the comment object extraction method proposed in this paper is meaningful. Section 7 concludes the paper.

Literature review
In research on the extraction of comment objects, many scholars hold that the evaluation object is always a noun or noun phrase and the evaluation word is always an adjective (Liu et al., 2016; Zhang et al., 2010; Liu et al., 2015). Some scholars extract the evaluation words and the evaluation objects at the same time (Liu et al., 2016; Chen et al., 2016; Liao et al., 2017). Liu et al. (2010) studied comment target extraction and the corresponding sentiment classification. First, word segmentation, part-of-speech annotation and syntactic analysis are carried out on the given text library. Second, nouns and noun phrases are extracted as candidate comment targets. Then, a word frequency filter, a PMI filter and a noun pruning algorithm are used to filter the comment targets. Finally, the undirected method is used to judge the orientation of the evaluation object. The algorithm is simple, but it ignores implicit features. Wang et al. (2013) proposed a hybrid association rule mining method for implicit feature extraction in Chinese comments. Jiang et al. (2014) used a modified collocation extraction algorithm to mine the basic association rules, and then used a semi-supervised LDA topic model to extract new rules that extend the basic association rule library, so as to improve the recognition of implicit features. Zhang and Zhu (2013) adopted a method based on co-occurrence association rules, which mainly consists of four steps: (1) Determining the co-occurrence matrix.
Hai et al. (2011) also adopted a co-occurrence association rule mining method to identify implicit features. The first stage generates an association rule table between opinion words and explicit features, and the second stage applies the association rules. Xu et al. (2015) proposed an implicit feature extraction method based on an SVM classifier. The algorithm proposed in this paper can mine not only explicit features but also implicit features in the comment texts.

Data preprocessing
The computer cannot read natural language directly. Therefore, before the comment objects are extracted, the natural language must be processed into a form that computers can understand.

Representation of documents
After the comment texts are collected, word segmentation and part-of-speech (POS) tagging are the first steps. For example, the word segmentation and POS tagging result of the comment sentence "手机价格真便宜。" (the mobile phone's price is so cheap) is "手机/n 价格/n 真/d 便宜/a 。/wp" (the mobile phone/n price/n is so/d cheap/a), where n, v, d and a denote nouns, verbs, adverbs and adjectives, respectively. Then, words that are insignificant in the text, such as "because" and "er", are removed. Such words provide no useful information about the text content and would increase the time cost of the algorithm if retained.
To distinguish the importance of different feature words, it is necessary to calculate their weights. In this paper, the TF-IDF method (Joeran et al., 2017) is used to calculate the weight. In formula 1, a is the frequency of the feature in the document set, t is the total number of occurrences of all features in the document set, b is the number of documents in the document set, and c is the number of documents that contain the feature. Suppose $z_i$ is the ith comment sentence, and each comment sentence consists of several feature words. The comment sentence can then be represented by its feature words $t_k$ and their weights $w_k$, where $w_k$ is the TF-IDF value of $t_k$.

According to the TF-IDF values, we select the n features with the largest TF-IDF values as the candidate feature set. The vector representation of each feature word can then be calculated.
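As a minimal sketch of this weighting and selection step, the Python snippet below computes a weight from the four quantities a, t, b and c defined above. It assumes the common form (a / t) * log(b / c); the exact formula used in this paper is not reproduced in the text, so this form, the function names and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return {feature: weight}, assuming weight = (a / t) * log(b / c).

    a: occurrences of the feature in the document set
    t: total occurrences of all features in the document set
    b: number of documents
    c: number of documents that contain the feature
    """
    term_counts = Counter(w for doc in documents for w in doc)
    doc_counts = Counter(w for doc in documents for w in set(doc))
    t = sum(term_counts.values())
    b = len(documents)
    return {w: (a / t) * math.log(b / doc_counts[w])
            for w, a in term_counts.items()}

def top_n_features(documents, n):
    """Select the n features with the largest weights."""
    weights = tf_idf_weights(documents)
    return sorted(weights, key=weights.get, reverse=True)[:n]

# Hypothetical segmented comment clauses (stop words already removed).
docs = [["价格", "便宜"], ["价格", "实惠"], ["外形", "美观"], ["价格", "便宜"]]
weights = tf_idf_weights(docs)
candidates = top_n_features(docs, 4)
```

Under this assumed form, a word that occurs in most clauses (here "价格") receives a low weight, because log(b / c) approaches zero as c approaches b.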

Feature selection
After TF-IDF is used to select feature words, the dimension of the feature word vector space is still relatively high, so features are selected again to reduce the running time of the algorithm.
Particle swarm optimization (PSO) is an intelligent optimization algorithm that simulates the foraging behavior of birds. The traditional PSO algorithm suffers from premature convergence. We therefore introduce the SA algorithm into PSO, so that a worse solution can be accepted with a certain probability when the particles update the current solution. This keeps the particles from searching only within a local scope and lets them gradually find the global optimal solution. In this paper, we use the SA-PSO algorithm proposed by Zhang and Chen (2015) to select feature words for the second time.
The feature words are binary encoded. Each value can only be 1 or 0: a value of 1 indicates that the feature occurs in the sentence, and a value of 0 indicates that it does not.
The fitness is calculated by formula 2, where $f(x_i, x_0)$ denotes the sum of the distances between all feature words and their initial position, $x_i$ is the position coordinate of the ith feature, $x_0$ is a given initial position coordinate, and n is the number of features.
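The binary encoding described above can be illustrated with a short sketch. The function name, the feature list and the example sentence are hypothetical; the snippet only shows the 0/1 representation, not the SA-PSO search itself.

```python
def encode_sentence(sentence_words, feature_words):
    """Return a 0/1 list: 1 if the feature word occurs in the sentence, else 0."""
    present = set(sentence_words)
    return [1 if feature in present else 0 for feature in feature_words]

# Hypothetical feature word set and segmented sentence.
features = ["价格", "便宜", "外形", "美观"]
vector = encode_sentence(["手机", "价格", "真", "便宜"], features)
```

Here `vector` marks "价格" and "便宜" as present and the other two features as absent.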

Cluster analysis
In this paper, we first cluster the explicit comment sentences and then mine the correlation between the opinion words and the categories. Because the number of online reviews is huge and the dimension of the review words is high, we use the fuzzy C-means (FCM) clustering algorithm, which is applicable to high-dimensional data, to cluster the comment sentences. At the same time, we combine the FCM clustering algorithm with the SA algorithm to keep it from falling into a local optimum.

Fuzzy C-means algorithm
The fuzzy C-means (FCM) method (Bezdek, 1981) is a clustering algorithm based on an objective function. A sample is not assigned to exactly one category; instead, a degree of membership determines the extent to which each sample belongs to each category.
The objective function of the FCM algorithm, in its standard form (Bezdek, 1981), is:
$$J(U, V) = \sum_{k=1}^{n} \sum_{i=1}^{c} u_{ik}^{m} d_{ik}^{2}$$
where c is the number of categories; n is the number of sentences; $U = (u_{ik})$ is the membership matrix, with $u_{ik}$ the membership of the kth sentence in the ith class; $m > 1$ is the fuzzy weighting exponent; $V = (v_1, v_2, \ldots, v_c)$ denotes the cluster center of each category; and $d_{ik}$ is the Euclidean distance of the kth sentence to the cluster center of the ith class. $J(U, V)$ represents the sum of the weighted squared distances between all sentences and the cluster centers. The clustering criterion of the FCM algorithm is to minimize the objective function under some conditions by iteratively updating the membership matrix and the cluster centers.
When the objective function reaches its minimum value, the FCM algorithm outputs c cluster centers and one membership matrix. The category of each comment sentence is then determined according to the principle of maximum membership.
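The iteration can be sketched in Python as follows. This is a minimal illustration of the standard FCM updates (Bezdek, 1981), not the MATLAB program used in this paper; the function names, the fuzzy exponent m = 2, the stopping tolerance, the deterministic initial centers and the two-dimensional toy points are all assumptions.

```python
import math

def fcm(points, initial_centers, m=2.0, max_iter=100, tol=1e-6):
    """Standard FCM updates; returns (centers, memberships).

    memberships[k][i] is the degree to which point k belongs to cluster i.
    """
    centers = [list(v) for v in initial_centers]
    c = len(centers)
    u = []
    for _ in range(max_iter):
        # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (m - 1))
        new_u = []
        for p in points:
            d = [max(math.dist(p, v), 1e-12) for v in centers]
            new_u.append([
                1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
                for i in range(c)
            ])
        # Center update: v_i = sum_k u_ik^m x_k / sum_k u_ik^m
        for i in range(c):
            w = [row[i] ** m for row in new_u]
            total = sum(w)
            centers[i] = [
                sum(wk * p[dim] for wk, p in zip(w, points)) / total
                for dim in range(len(points[0]))
            ]
        # Largest change in any membership since the previous iteration.
        change = max(
            (abs(a - b) for old, new in zip(u, new_u) for a, b in zip(old, new)),
            default=float("inf"),
        )
        u = new_u
        if change < tol:
            break
    return centers, u

def assign_by_max_membership(u):
    """Principle of maximum membership: pick the cluster with the largest degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in u]

# Hypothetical toy data: two well-separated groups of sentence vectors.
points = [(0.0, 0.2), (0.3, 0.0), (0.1, 0.1), (5.0, 5.4), (5.2, 5.0), (4.9, 5.1)]
centers, u = fcm(points, initial_centers=[points[0], points[-1]])
labels = assign_by_max_membership(u)
```

Each row of `u` sums to one, and `labels` gives the category index of each point under the maximum membership principle.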

Simulated annealing-fuzzy C-means algorithm
The FCM algorithm is likely to fall into a local optimum during the iteration. This paper introduces the simulated annealing (SA) algorithm into the FCM algorithm to let it escape local optima. During the iteration, a new solution is not always accepted; it is accepted with a certain probability.
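The text does not state the acceptance rule. A common choice in simulated annealing is the Metropolis criterion, sketched below as an assumption: an improved solution is always accepted, and a worse one is accepted with probability exp(-Δ/T). The function names and the geometric cooling rate are hypothetical.

```python
import math
import random

def accept(current_cost, candidate_cost, temperature, u):
    """Metropolis criterion (assumed here, not taken from the paper).

    u is a uniform random number in [0, 1) supplied by the caller.
    """
    if candidate_cost <= current_cost:
        return True
    return u < math.exp(-(candidate_cost - current_cost) / temperature)

def cool(temperature, rate=0.95):
    """Geometric cooling schedule; the rate is an illustrative assumption."""
    return temperature * rate

temperature = 1.0
decision = accept(current_cost=1.0, candidate_cost=1.2,
                  temperature=temperature, u=random.random())
temperature = cool(temperature)
```

As the temperature falls, worse solutions are accepted less often, so the search behaves more and more like the plain iteration.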

Extract the comment target based on association rules
In this section, we first look for the association rules among explicit features and opinion words in each category. We then build an association rule table and use it as the basic rule set when mining the evaluation objects. The implicit features can then be mined by matching the opinion words of the implicit comment sentences against the association rule table.

Association rules
To establish an association between opinion words and opinion objects, the first step is to mine the frequent item sets of opinion words and evaluation objects. The association rules between opinion words and targets are then established according to the frequent item sets. Finally, a classifier is built using the association rule table.
An association rule (Agrawal et al., 1993) is an important kind of correlation between data. Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of feature words and $T = \{t_1, t_2, \ldots, t_n\}$ a set of texts, where each text $t_i$ consists of multiple feature words, so that $t_i$ is a set of feature words and $t_i \subseteq I$. An association rule can then be described as an implication $X \rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$. There are several important definitions for association rules: Definition 1: Support. Support reflects the probability that the feature word set {X, Y} appears in the text set T, that is, the number of texts containing both X and Y divided by n. In formula 5, n denotes the number of texts. Definition 2: Confidence.
Confidence reflects, among the item sets containing X, the probability of also containing Y, that is, the number of texts containing both X and Y divided by the number of texts containing X. Definition 3: Frequent item sets. Usually, a minimum support min_sup and a minimum confidence min_conf need to be set to limit the rules. If $\mathrm{Support}(X) \geq$ min_sup, X is called a frequent item set; otherwise it is an infrequent item set.

Definition 4: Strong association rules.
If $\mathrm{Support}(X \rightarrow Y) \geq$ min_sup and $\mathrm{Confidence}(X \rightarrow Y) \geq$ min_conf, the association rule $X \rightarrow Y$ is called a strong association rule.
Support, confidence and frequent item sets are the prerequisites and conditions for association rule mining. Strong association rules are the result of mining association rules.
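The four definitions can be checked on a toy example. The sketch below assumes that support is the fraction of texts containing both X and Y and that confidence is that count divided by the number of texts containing X, which matches the supCount / n and supCount / condCount ratios used in the algorithm later in this paper. The function names and the sample texts are hypothetical.

```python
def support(texts, x, y=frozenset()):
    """Fraction of texts that contain every word in x and in y."""
    items = set(x) | set(y)
    return sum(1 for t in texts if items <= set(t)) / len(texts)

def confidence(texts, x, y):
    """Among the texts containing x, the fraction that also contain y."""
    return support(texts, x, y) / support(texts, x)

def is_strong_rule(texts, x, y, min_sup, min_conf):
    """A rule x -> y is strong if it meets both thresholds."""
    return (support(texts, x, y) >= min_sup
            and confidence(texts, x, y) >= min_conf)

# Hypothetical texts, each a set of feature words.
texts = [{"服务", "不错"}, {"服务", "不错"}, {"服务", "热情"}, {"价格", "便宜"}]
rule_support = support(texts, {"服务"}, {"不错"})
rule_confidence = confidence(texts, {"服务"}, {"不错"})
```

In this toy set the rule 服务 → 不错 appears in two of four texts and in two of the three texts that contain 服务. Note that `confidence` divides by the support of X, so it should only be called for an X that occurs in at least one text.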

Mining association rules based on apriori algorithm
The Apriori algorithm (Agrawal and Srikant, 1994) uses a layer-by-layer iterative search: it finds the frequent k-item sets that satisfy the minimum support, then looks for the frequent (k + 1)-item sets, and continues layer by layer until no frequent (k + 1)-item set can be found. The Apriori algorithm is simple and easy to implement, so in this paper we mine the association rules with it.
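The layer-by-layer search can be sketched as follows. This is a minimal, unoptimized illustration of frequent item set mining in the Apriori style, not the implementation used in this paper; the function name and the sample transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frozenset(item set): support} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Keep only the candidates whose support reaches the threshold.
        result = {}
        for cand in candidates:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= min_sup:
                result[cand] = sup
        return result

    level = frequent({frozenset([i]) for t in transactions for i in t})
    all_frequent = dict(level)
    k = 2
    while level:
        prev = list(level)
        # Join step: unite pairs of frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            cand for cand in candidates
            if all(frozenset(s) in level for s in combinations(cand, k - 1))
        }
        level = frequent(candidates)
        all_frequent.update(level)
        k += 1
    return all_frequent

# Hypothetical transactions of feature words.
sample = [{"服务", "不错", "热情"}, {"服务", "不错"}, {"服务", "热情"}, {"不错"}]
frequent_sets = apriori(sample, min_sup=0.5)
```

The loop stops as soon as one level yields no frequent item set, which is the termination condition described above.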
After the first step of mining, the number of association rules that satisfy the given constraints may be relatively large, and it is necessary to prune the rules so as to improve their ability to distinguish the categories. This paper uses the confidence threshold method to prune the rules. That is, a minimum confidence level is set; when the confidence of a rule is smaller than the minimum confidence level, the rule is deleted, otherwise it is retained. The algorithm is as follows (Wu and Kumar, 2013):
    ... is the feature set, made up of 0s and 1s;
    F_1 ← {i ∈ I_1 | i.supCount / n ≥ min_sup}; // frequent 1-item sets; supCount is the number of item sets in which X and Y co-occur; min_sup is the minimum support
    R_1 ← {f | f ∈ F_1, f.supCount / f.condCount ≥ min_conf}; // rule set that satisfies the condition; condCount is the number of item sets containing X; min_conf is the minimum confidence
    for (k = 2; F_{k-1} ≠ ∅; k++) do
        I_k ← candidate-gen(F_{k-1}); // the function that generates the candidates
        for each transaction t ∈ T do
            for each candidate i ∈ ...
    ...
    for all f_1, f_2 ∈ F_{k-1}, f_1 = {i_1, ..., i_{k-2}, i_{k-1}}, f_2 = {i_1, ..., i_{k-2}, i'_{k-1}} and i_{k-1} < i'_{k-1} do
        i ← {i_1, ..., i_{k-2}, i_{k-1}, i'_{k-1}};
        ...
        delete i from I_k;
    return I_k

Experiment evaluation

Data preprocessing
The comment data of the HUAWEI mobile phone (model MATE S) are grabbed from the Taobao platform using the data grabber tool Octopus. In total, 2265 valid comments are crawled. Each comment consists of multiple clauses and contains one or more evaluation objects. First, the language technology platform LTP is used to split the comment sentences, which yields 9362 clauses. The comment clauses are then segmented and tagged, and the insignificant words are removed. Next, the TF-IDF method described in Section 3 is used to select the features for the first time. Finally, the SA-based particle swarm optimization algorithm is used to select the feature words for the second time to obtain the feature word set.

Experiment procedure
First, the comment objects of each comment are marked and divided manually into nine categories: customer service, performance, appearance, battery, pixel, sound quality, price, logistics and quality. These categories are distinct from each other and cover almost all comment objects. The comment set is also divided into an explicit comment sentence set and an implicit comment sentence set. The explicit comments are then clustered into nine classes using the SA-FCM algorithm proposed in Section 4. Each class of texts is put into one document set, and each document set corresponds to a comment object. After that, the association rules between opinion words and objects are extracted. Finally, the implicit features of the implicit comment sentences are mined according to the association rules.
6.2.1 Clustering the explicit comments. The SA-FCM method is used to cluster the explicit sentences. Because the comment objects are divided into nine categories by hand, the number of clusters is set to nine. The SA-FCM program is written in MATLAB. After the algorithm has iterated, we obtain the membership matrix of the explicit sentences. Part of the degrees of membership is shown in Table I, where the row represents the category and the column represents the features.
According to the principle of maximum membership, the category of a sentence is the one with the highest degree of membership in the membership matrix. For example, for the first sentence, the 4th degree of membership is the largest. Therefore, sentence 1 belongs to the 4th category.
After the category of every explicit sentence is obtained, sentences of the same category are put into the same folder, which gives nine folders. Each comment sentence is marked with its category, explicit features and opinion words in the form $z_i = \langle o_i, q_i, c_j \rangle$, where $z_i$ is a comment sentence, $o_i$ is the explicit feature in the sentence, $q_i$ is the opinion word and $c_j$ is the category. For example, for the explicit comment "样式非常美观" (the style is very beautiful), the result of the labeling is "<样式, 美观, 外形>" (style, beautiful, appearance). At the same time, synonymous explicit features are put into a set $O_i$ ($i = 1, 2, \ldots, 9$) as the final feature set.
6.2.2 Extract the association rules in explicit sentence. Association rules are mined for these nine document sets. The feature words differ between document sets. The minimum support is set to 0.01 and the minimum confidence to 0.3, and the association rules are mined by the algorithm described in Section 5. The resulting association rules are shown in Table II.
In Table II, "服务→不错 (service → pretty good) (5.9289 per cent, 40.3587 per cent)" describes the rule from the feature word "service" to the feature word "pretty good": the first value is its support and the second value is its confidence. At the same time, the feature words "service", "attitude" and "customer service" all refer to the "customer service" category. In this category, <service, pretty good> is used as an association rule for the category. The complete set of association rules is shown in Table III.
6.2.3 Extract the implicit features. After the association rules of each category are mined, the implicit features of the implicit comments can be extracted according to the opinion words of the sentences and the association rule table. For example, for the sentence "手机很好看" (the mobile phone looks good), it is easy to find the association rule <appearance, good-looking> in the association rule table. Therefore, we can determine that the implicit feature is appearance.
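The lookup described above can be sketched as a dictionary match. The rule table contents, the segmentation of the example sentence and the function name are hypothetical; the real table is the one built from the mined rules (Table III).

```python
def extract_implicit_features(sentence_words, rule_table):
    """Return the categories whose rules match an opinion word in the sentence."""
    matches = []
    for word in sentence_words:
        for category in rule_table.get(word, []):
            if category not in matches:
                matches.append(category)
    return matches

# Hypothetical rule table: opinion word -> categories it is associated with.
rule_table = {"好看": ["外形"], "便宜": ["价格"], "不错": ["客服"]}
features = extract_implicit_features(["手机", "很", "好看"], rule_table)
```

For the segmented sentence "手机 / 很 / 好看", the opinion word "好看" matches the appearance category, so the implicit feature is appearance. A sentence with no matching opinion word yields an empty list.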

Experimental results
Precision and recall are commonly used to evaluate the performance of an extraction method. Precision is the ratio of the number of correct documents retrieved to the total number of documents retrieved. Recall is the ratio of the number of correct documents retrieved to the number of correct documents in the document library.
Precision = number of correct items extracted / number of items retrieved; Recall = number of correct items extracted / number of items in the sample.
However, precision and recall can conflict in some cases: an increase in precision may lead to a decrease in recall, and an increase in recall may lead to a decrease in precision. Therefore, it is usually necessary to introduce an F-value that combines the two to measure the feature extraction effect. In this paper, the accuracy of the manual annotation is taken to be 100 per cent. The precision, recall and F-value of the feature extraction algorithm proposed in this paper are shown in Table IV. From Table IV, we can see that after mining association rules, the precision and F-value improve considerably. At the same time, the association rule table established in this paper can be used not only to extract the implicit features, but also to extract the explicit features according to the opinion words. Therefore, the algorithm proposed in this paper has a certain practical value.
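The three measures can be computed as in the sketch below. The F-value formula is not reproduced in this text, so the snippet assumes the usual harmonic mean F = 2PR / (P + R); the function name and the sample labels are hypothetical and are not the data behind Table IV.

```python
def precision_recall_f(extracted, gold):
    """extracted, gold: dicts mapping a sentence id to its comment object."""
    correct = sum(1 for sid, obj in extracted.items() if gold.get(sid) == obj)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    # Assumed F-value: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical example: three extracted objects, four manually annotated ones.
extracted = {1: "价格", 2: "外形", 3: "客服"}
gold = {1: "价格", 2: "外形", 3: "物流", 4: "质量"}
p, r, f = precision_recall_f(extracted, gold)
```

In this example two of the three extracted objects are correct, and two of the four annotated objects are recovered.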

Conclusion
Online product reviews enable users to learn about a product more comprehensively from the real experience of users who have purchased it. They can help users make purchase decisions, and businesses can also improve the product based on users' feedback. This paper conducts opinion mining research on online user reviews; it mainly studies the features in the reviews and takes the implicit features into account, to achieve a more complete extraction of review features. This paper grabs user review data from an e-commerce platform and carries out operations such as word segmentation and part-of-speech tagging on the captured data. The feature words are then extracted twice. An FCM clustering algorithm based on the SA algorithm is proposed to cluster explicit comment sentences. Based on the classified document sets of explicit sentences, a text association rule mining algorithm is used to extract rules and establish a rule library for each category, from which implicit features are extracted. Through an experiment, we compared the precision, recall and F-value obtained with and without the association rules, and found that the values obtained with association rules are higher. Therefore, we can conclude that the method proposed in this paper, which first applies SA-FCM clustering and then uses the association rule table to mine comment objects, is effective. At the same time, this paper has several deficiencies. First, implicit features are extracted from individual review clauses without considering context information, which may affect the accuracy of feature mining to a certain degree. Second, low-frequency feature words are not considered when extracting feature words, although some of them also contain useful information. Considering these two points in later work would greatly improve the accuracy of implicit feature mining.