Detecting the Internet Water Army via Comprehensive Behavioral Features Using Large-scale E-commerce Reviews

Bo Guo, Hao Wang and Zhaojun Yu
Business Intelligence Department, MEIZU Inc. Email: [email protected]

Yu Sun
Computer Science Department, California State Polytechnic University, Pomona

Abstract—Online reviews play a crucial role in helping consumers make purchase decisions. However, a severe problem, the Internet Water Army (large numbers of paid posters who write inauthentic reviews), has recently emerged on many E-commerce websites and dramatically undermines the value of user reviews. Although the term Internet Water Army originated in China, other countries suffer from this problem as well. Many organized underground paid-poster groups have found it extremely profitable to mislead consumers by writing fake reviews, and it has become increasingly challenging to accurately detect water army members, who can alter their writing style. In this paper, we design a comprehensive set of features to compare paid posters against normal users along different dimensions. We then build an ensemble detection model from seven different algorithms. Our model reaches 0.726 AUC and 0.683 F1 on the JD dataset, and 0.926 AUC and 0.871 F1 on the Amazon dataset, which outperforms previous studies. Our work provides practical solutions and guidance for this severe problem across the E-commerce industry.

I. INTRODUCTION

An official report by the China Internet Network Information Center (CNNIC) reported that there were around 731 million Internet users in China, approximately 53% of its total population [1]. Given this huge Internet user base, the E-commerce industry in China has grown rapidly in recent years. Since review pages are valuable resources for attracting potential consumers, they have incubated a new under-the-table business: paid posters. The synonymous phrase Internet Water Army has also become popular in this context [2]. Paid posters mainly provide two kinds of service: promoting a specific product, company, person or message; and smearing a competitor or adversary, or the competitor's products or services. A related concept is electronic spamming, the use of electronic messaging systems to send unsolicited messages [3]. As the business process illustrated above shows, electronic spamming is a different concept from paid posting. The purpose of spammers is to send a huge amount of junk messages in a short time; spammers do not pay much attention to content quality but care much more about the spread scope. For paid posters, however, the main goal is to attract more customers, so they must focus on the content and quality of their reviews. They need to

disguise themselves and behave like normal users, choosing commendatory words very carefully. Although the water army is a severe problem, it has received little attention from academic researchers studying the e-commerce industry. Some previous work studied strategies for fighting spammers, but because spammers differ greatly from paid posters, these strategies cannot be applied directly to the water army problem. Two questions remain unanswered:
1) How can we detect whether a user is a paid poster?
2) What can we do to restrict the prevalence of paid posters?
Our study contributes to the existing literature in the following aspects:
1) We design a comprehensive set of features to compare paid posters against normal users along different dimensions.
2) We utilize the review content, the meta information of each review and the related product's information, which was neglected in previous studies.
3) Seven classification algorithms are combined to build an ensemble classification model.
4) Previous studies mainly focused on datasets from the US. As the paid poster problem is quite severe in China, to test the general applicability of our model we use two datasets, one from China and one from the US. The first is crawled by ourselves from JD.com; the other is from Amazon.
5) Based on the evaluation results on both datasets, our model has strong distinguishing power and outperforms previous studies.

II. RELATED WORK

A variety of detection approaches have been proposed since the pioneering work of Jindal et al. [4]. Most studies used supervised learning algorithms and designed behavioral features, such as temporal patterns and textual features. Jindal et al. [4] studied the trustworthiness of online opinions in the context of product reviews and showed that fake reviews are quite different from traditional email spam. But their work relied heavily on duplicate reviews, which

were gradually eliminated by the websites' regulation policies. Other research by Jindal et al. [5] analyzed the unexpectedness of reviews and proposed a domain-independent technique to find groups with a high probability of writing untruthful reviews. Fei et al. [6] focused on the bursty nature of reviews to identify review spammers; bursts of reviews can be due either to the sudden popularity of a product or to spam attacks. More advanced text mining and semantic analysis techniques have been used in this field, such as sentiment analysis and opinion extraction [7], [8]. Through sentiment analysis, each review is treated as a record associated with a sentiment label (positive, negative or neutral), and the problem is then solved by traditional classification algorithms. A text mining model and a semantic language model are used in Lau et al. [9] for spammer detection. Semi-supervised methods have also been used to alleviate the need for large amounts of training data [10]. Working with Dianping, the largest Chinese review hosting site, Li et al. [11] used a model that learns from positive and unlabeled examples and then proposed a collective classification algorithm.

The key challenge in our study is that paid posters may alter their writing style. The existing literature has addressed this problem with text similarity analysis [4], user group analysis [12], or temporal features [6]. Their crucial shortcoming is incompleteness: previous studies built their detection frameworks on one specific aspect, so paid posters who look normal in that aspect but strange in others would be missed, which weakens the applicability of those methods.

III. DETECTION FRAMEWORK

In this section, we describe the general business process of E-commerce, the feature system, the datasets and the classification algorithms in detail. Motivated by the crucial shortcomings of the existing literature, we propose a comprehensive detection framework targeting three major aspects [13]:
Linguistic: These features focus on the semantic and sentimental characteristics of the reviews posted by each consumer.
Behavioral: These features mainly exploit the meta information of the reviews, such as the review time, the purchase time and the account profile.
Product: These features focus on the product information. Although writing styles can be easily altered, few companies will change their brand name, and the inconsistency between product information and review text does show some distinguishing power.

A. Feature Selection

The key notations used in this section are listed in Table I.

TABLE I
SYMBOL DEFINITION

Notation          Definition
i                 the subscript i always refers to a consumer
C_i               the set of reviews created by consumer i
n_i               the number of reviews created by consumer i
C                 the total set of all the reviews
c_{i,j}           a specific review j created by consumer i
c_{i,j}(t_cmt)    the time when this review was created
c_{i,j}(t_reg)    the registration time of the consumer
c_{i,j}(b)        the brand of the product

1) Internal Text Similarity (ITS): Paid posters have strong incentives to minimize the time cost of fabricating reviews, so they often copy the same text across different products. The average text similarity among all the reviews created by the same user can serve as a feature targeting this phenomenon, defined as follows, where cos(c_{i,j}, c_{i,k}) is the cosine similarity of reviews c_{i,j} and c_{i,k}. The feature measures the text similarity within one user's own reviews, hence the name internal.

ITS_i = \frac{2}{n_i(n_i - 1)} \sum_{j=1}^{n_i} \sum_{k > j}^{n_i} \cos(c_{i,j}, c_{i,k})
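As an illustration (not part of the original paper), ITS could be computed with TF-IDF vectors and pairwise cosine similarity; the function name and the choice of scikit-learn's TfidfVectorizer are our own assumptions.

```python
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def internal_text_similarity(review_texts):
    """Average pairwise cosine similarity among one user's reviews (ITS)."""
    n = len(review_texts)
    if n < 2:
        return 0.0  # a single review has no internal pair to compare
    vectors = TfidfVectorizer().fit_transform(review_texts)
    sims = cosine_similarity(vectors)
    # Average over the n*(n-1)/2 unordered pairs (j < k), matching the 2/(n(n-1)) factor.
    total = sum(sims[j, k] for j, k in combinations(range(n), 2))
    return 2.0 * total / (n * (n - 1))

# Example: two near-duplicate reviews push the ITS score up.
print(internal_text_similarity([
    "great phone, fast delivery, highly recommend",
    "great phone, fast delivery, highly recommend!",
    "battery life is disappointing",
]))
```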

2) Comment Latency (CL): Due to the time cost of the delivery process, several days usually pass between purchase time and review time. It also takes some time for the consumer to actually experience the product; only after a reasonable time can they tell whether the product is of high or low quality. So there is usually a reasonable latency between the purchase time and the review time. For paid posters, however, there is no need to wait for delivery or to experience the product, and they write reviews shortly after the order. The review latency can therefore be used as a feature, measured by the following equation.

CL = \frac{1}{n_i} \sum_{j=1}^{n_i} \left( c_{i,j}(t_{cmt}) - c_{i,j}(t_{pur}) \right)

3) Comment Time Interval (CTI): It is common sense that people have little incentive to write reviews frequently. Paid posters, however, are paid to write fake reviews, so they have strong incentives to write reviews so frequently that their behavior differs wildly from normal users. A feature aimed at this phenomenon is defined as follows,

CTI = \frac{1}{n_i - 1} \sum_{j=2}^{n_i} \left( c_{i,j}(t_{cmt}) - c_{i,j-1}(t_{cmt}) \right)

where c_{i,1}, c_{i,2}, \cdots, c_{i,j}, \cdots, c_{i,n_i} are sorted by t_{cmt}.
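A minimal sketch of how CL and CTI might be derived from review timestamps follows; the record layout (dicts with illustrative field names, Unix timestamps in seconds, results reported in days) is our own assumption, not specified by the paper.

```python
def comment_latency(reviews):
    """CL: average gap (in days) between purchase time and review time.

    Each review is assumed to be a dict with 't_cmt' and 't_pur' given as
    Unix timestamps in seconds; the field names are illustrative.
    """
    gaps = [(r["t_cmt"] - r["t_pur"]) / 86400.0 for r in reviews]
    return sum(gaps) / len(gaps)

def comment_time_interval(reviews):
    """CTI: average gap (in days) between consecutive reviews by the same user."""
    times = sorted(r["t_cmt"] for r in reviews)  # sort by comment time, as in the definition
    if len(times) < 2:
        return float("inf")  # undefined for users with a single review
    diffs = [(b - a) / 86400.0 for a, b in zip(times, times[1:])]
    return sum(diffs) / len(diffs)
```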

4) Emotional Words and Product Feature Words: Trustworthy reviews shed more light on product features; without really experiencing the product, it is harder to fabricate an honest review. Fake reviews may instead use as many positive emotional words as possible to flatter the product. How a user chooses words reflects their own characteristics. As the products used in this research are all cellphones, we consulted an expert in the mobile phone industry to compile a domain-specific dictionary. Three features are built on it:
POS: average number of positive emotional words per review
NEG: average number of negative emotional words per review
FEATURE: average number of product feature words per review

5) Brand Concentration (BC): Spammers may be hired by a shady organization to puff or smear a specific brand, so these paid posters focus on that brand only; in other words, they concentrate on a small fraction of brands. We propose a feature to characterize this behavior, based on the Herfindahl index [14]:

BC = \sum_{k=1}^{m} \eta_k^2

where \eta_k is the share of the user's reviews that belong to brand k. This feature is the sum of each brand's squared review share for the given consumer. If n brands have equal review shares, BC = \sum_{k=1}^{n} (1/n)^2 = 1/n; if one brand captures the whole share, BC = 1. A higher BC value therefore indicates a more concentrated user, who is more likely to be a paid poster.
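Since BC is simply a Herfindahl index over the brands a user has reviewed, it can be sketched as follows; the function name and input format are illustrative.

```python
from collections import Counter

def brand_concentration(brands):
    """BC: sum of squared review shares per brand (Herfindahl index)."""
    counts = Counter(brands)  # reviews per brand for one user
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# A user reviewing only one brand scores 1.0; four brands with equal shares score 0.25.
print(brand_concentration(["BrandA"] * 5))        # 1.0
print(brand_concentration(["A", "B", "C", "D"]))  # 0.25
```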

6) Text Length (TL): Paid posters often try their best to influence consumers' behavior and to imitate normal reviews, but normal reviewers have no strong incentive to write long and complicated reviews. So the average review length of a reviewer can be used to distinguish paid posters from normal reviewers:

TL = \frac{1}{n_i} \sum_{j=1}^{n_i} \lVert c_{i,j}(text) \rVert

where \lVert c_{i,j}(text) \rVert is the text length of c_{i,j}.

7) External Text Similarity (ETS): Paid posters often post reviews in a copy-and-paste manner, and many paid posters share similar templates, whereas it is quite rare for two unrelated reviewers to post exactly the same content. So the similarity between different users' reviews can also be used as a feature:

ETS = \frac{1}{n_i} \sum_{j=1}^{n_i} \lVert \{ c \mid c \in C,\ c(text) = c_{i,j}(text) \} \rVert
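A possible way to compute TL and ETS, assuming plain review strings and counting exact text duplicates across the whole corpus as the definition above suggests; the function names are illustrative.

```python
from collections import Counter

def text_length(review_texts):
    """TL: average character length of a user's reviews."""
    return sum(len(t) for t in review_texts) / len(review_texts)

def external_text_similarity(review_texts, all_texts):
    """ETS: average number of reviews in the whole corpus whose text exactly
    matches one of this user's reviews (the user's own copies count as well)."""
    corpus_counts = Counter(all_texts)  # precomputable once for the whole dataset
    return sum(corpus_counts[t] for t in review_texts) / len(review_texts)
```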

B. Dataset

To obtain a comprehensive and unbiased performance evaluation of our detection framework, we use the two datasets described below.
JD: JingDong (JD.com, NASDAQ:JD) is one of the biggest B2C online retailers in China, covering 1500 different categories and more than 20 million products. This dataset was crawled by ourselves, and we put much effort into preprocessing and data cleaning.
Amazon: This dataset was provided by [4]. It covers a very wide range of products and contains 5.8 million reviews, 2.14 million users and 6.7 million products.

We manually labeled the potential paid posters by reading the contents of their reviews as well as the other meta information (most of their reviews were meaningless or contradicting). We use the word potential to avoid fruitless disputes over the ground truth behind the scenes: any absolute claim is not possible unless a paid poster admits to it or his employer discloses it, both of which are unlikely to happen. The discussion of a clear boundary between normal users and paid posters is beyond the technical scope of this paper. The sample size is shown in Table II.

TABLE II
SAMPLE SIZE OF TWO DATASETS

DataSet   Normal Users   Paid Posters   Reviews
JD        485            451            32386
Amazon    642            645            260711

The purchase time for each review is not available in the Amazon dataset, so the CL feature cannot be calculated there. Although the brand of each product is provided in the Amazon dataset, there are too many missing values, so the BC feature is not available for the Amazon dataset either.

C. Classification Methods

We develop an ensemble model that aggregates seven classification algorithms to distinguish paid posters from normal users, shown in Figure 1. The algorithms are aggregated by a majority voting mechanism. Ensemble learning is primarily used to improve the performance of a model or to reduce the likelihood of an unfortunate selection of a particularly poorly performing classifier. Combining classifiers may not necessarily beat the performance of the best classifier in the ensemble, but it certainly reduces the overall risk of making a particularly poor selection.

Fig. 1. Ensemble Model: reviews collected by the web crawler system are labeled and fed to seven base classifiers (AdaBoost, Neural Network, Logistic Regression, SVM, Naive Bayes, Random Forest, Gradient Boosting), whose majority vote forms the ensemble prediction.
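One way to realize such a majority-voting ensemble is scikit-learn's VotingClassifier with hard voting, as in the sketch below; the hyperparameters are library defaults chosen for illustration, not the settings used in the paper.

```python
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Hard (majority) voting over the seven base classifiers named in the paper.
ensemble = VotingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier()),
        ("mlp", MLPClassifier(max_iter=1000)),
        ("gbdt", GradientBoostingClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("rf", RandomForestClassifier()),
        ("svm", SVC()),
    ],
    voting="hard",
)
# Typical use: ensemble.fit(X_train, y_train); ensemble.predict(X_test)
```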

IV. EVALUATION

This section describes the evaluation process in detail. We evaluate our framework's performance with several metrics, and tools such as the ROC curve are used to examine its distinguishing power.

A. Evaluation Metrics

The following metrics are used to measure model performance.
Precision: the percentage of all positive predictions that are correct.
Recall: the percentage of all real positive observations that are correctly identified.
F1 Measure: the harmonic mean of precision and recall.
ROC Curve and AUC: the area under the ROC curve (often referred to simply as the AUC) equals the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
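For reference, all four metrics can be computed with scikit-learn as sketched below; the labels and scores are placeholder values, not results from the paper.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve)

# y_true: ground-truth labels (1 = paid poster), y_score: predicted probability
# of the positive class, y_pred: hard predictions at a 0.5 threshold.
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("F1:       ", f1_score(y_true, y_pred))
print("AUC:      ", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
```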


TABLE III
COMPARISON OF ALGORITHMS' PERFORMANCE ON THE JD DATASET

Algorithm            Accuracy   AUC     F1      Precision   Recall
AdaBoost             0.630      0.680   0.629   0.632       0.628
Neural Network       0.636      0.703   0.628   0.658       0.583
Gradient Boosting    0.666      0.720   0.662   0.669       0.646
Logistic Regression  0.645      0.698   0.622   0.664       0.588
Naive Bayes          0.589      0.668   0.399   0.742       0.274
Random Forest        0.651      0.730   0.645   0.665       0.659
SVM                  0.606      0.679   0.471   0.722       0.363
Ensemble             0.630      0.704   0.571   0.698       0.455

Fig. 2. Confusion Matrix: (a) JD, (b) Amazon

TABLE IV
COMPARISON OF ALGORITHMS' PERFORMANCE ON THE AMAZON DATASET

Algorithm            Accuracy   AUC     F1      Precision   Recall
AdaBoost             0.853      0.923   0.857   0.831       0.886
Neural Network       0.836      0.917   0.851   0.788       0.931
Gradient Boosting    0.857      0.928   0.862   0.837       0.888
Logistic Regression  0.827      0.903   0.846   0.766       0.949
Naive Bayes          0.811      0.886   0.837   0.738       0.969
Random Forest        0.866      0.926   0.871   0.840       0.897
SVM                  0.772      0.898   0.815   0.692       0.994
Ensemble             0.836      0.923   0.857   0.783       0.964

All seven algorithms as well as the ensemble model are evaluated by five-fold cross-validation on the whole dataset, and we use the average over all rounds of the cross-validation as the final measure. The classification results are shown in Table III and Table IV. From these results, random forest performs better than the other classification algorithms, with higher F1 and AUC. Naive Bayes and SVM perform well on precision but much worse on recall; in other words, these two algorithms miss many paid posters. Due to the manual labeling process, only a moderate amount of sample data is available, so algorithms that need a large training set, such as the neural network, also do not perform well here. The ensemble method gains little from the voting mechanism, so we choose random forest as the final classifier. Our detection framework reaches 0.726 AUC and 0.683 F1 on the JD dataset, and 0.926 AUC and 0.871 F1 on the Amazon dataset.
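The evaluation protocol described above can be approximated as in the following sketch; the synthetic feature matrix merely stands in for the per-user features and manual labels described earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Placeholder feature matrix and labels; in the paper these come from the
# per-user features (ITS, CL, CTI, POS, NEG, FEATURE, BC, TL, ETS) and manual labels.
X, y = make_classification(n_samples=936, n_features=9, random_state=0)

scores = cross_validate(RandomForestClassifier(random_state=0), X, y, cv=5,
                        scoring=["accuracy", "roc_auc", "f1", "precision", "recall"])
for name in ["accuracy", "roc_auc", "f1", "precision", "recall"]:
    print(name, scores[f"test_{name}"].mean())  # average over the five folds
```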

B. Confusion Matrix

Using random forest as the final classifier and taking 20% of the data as the test set, the confusion matrix is plotted in Figure 2. For normal users, the detection system performs well in separating the two groups; for paid posters, its precision drops. This is partly due to the inconsistency of our annotation process: no ground truth could be used as a yardstick to calibrate the annotation work, and different people may judge spamming behavior differently. One review may be viewed as trustworthy in one person's eyes and as fake nonsense in another's. This inconsistency does weaken our distinguishing power.

C. ROC Curve

Fig. 3. ROC Curve of JD Dataset

Fig. 4. ROC Curve of Amazon Dataset

The ROC curves show that the classifier does have strong distinguishing power. The point with the maximum distance from the diagonal line is near the center of the curve, so choosing 0.5 as the prediction threshold is a reasonable choice.

D. Comparison With Previous Work

We compare our results with previous studies that also used the Amazon dataset. In Jindal et al.'s work [4], the AUC ranges from 63% to 78% with different feature sets. In another paper by Jindal et al. [15], the AUC based on ten-fold cross-validation is 78%. From the figures in Wang et al. [16], their precision decreases quickly once the top-N sample size grows beyond 300: they reach 94.7% precision in the top 100, 88.5% in the top 200, and 80.8% in the top 300. Our detection framework outperforms Wang et al. [16] when the sample size gets larger.

V. CONCLUSION AND FUTURE WORK

In this paper, we study the detection of paid posters and propose a comprehensive set of features to characterize their behavior. The performance of our detection framework is carefully evaluated through four scoring measures on two datasets. Our model reaches 0.726 AUC and 0.683 F1 on the JD dataset, and 0.926 AUC and 0.871 F1 on the Amazon dataset, which outperforms previous studies. Our research provides a practical solution to the paid poster problem from the technical perspective.

A. Practical Implications

Stricter registration limitations would be a reasonable solution. Paid posters often register many accounts as cloaks to disguise themselves. Normal users use their real name, phone number and address to receive the ordered goods, but paid posters may collude with the seller to avoid the delivery cost by pretending they have already received the goods. To be stricter, real-name rules could be applied; normal users are quite likely to accept a real-name system, but it would be a heavy burden for paid posters.

Thanks to the development of logistics technology, E-commerce websites can now track all the details of the delivery process. The logistics information could be combined with the review system to limit the activities of paid posters: a consumer may write a review only after the logistics information has confirmed delivery. Our data has shown that paid posters often work in a copy-and-paste manner. Disallowing users from writing duplicate reviews may push them to put more effort into writing honest and meaningful reviews; it is more valuable to accumulate meaningful reviews than many meaningless and contradicting junk reviews.

B. Limitations and Further Research

1) In the future, we will try more unsupervised learning algorithms, which could reduce the cost of human labeling.
2) We will continue to improve our detection accuracy. Once we reach an adequately high level of accuracy, we will estimate the percentage of paid posters among all the reviews on Chinese websites.

REFERENCES

[1] CNNIC, "The 39th statistical report on internet development in China," China Internet Network Information Center (CNNIC), China, 2017.
[2] C. Chen, K. Wu, V. Srinivasan, and X. Zhang, "Battling the internet water army: Detection of hidden paid posters," in Advances in Social Networks Analysis and Mining (ASONAM), 2013 IEEE/ACM International Conference on. IEEE, 2013, pp. 116-120.
[3] P. P. Chan, C. Yang, D. S. Yeung, and W. W. Ng, "Spam filtering for short messages in adversarial environment," Neurocomputing, vol. 155, pp. 167-176, 2015.
[4] N. Jindal and B. Liu, "Opinion spam and analysis," in Proceedings of the 2008 International Conference on Web Search and Data Mining. ACM, 2008, pp. 219-230.
[5] N. Jindal, B. Liu, and E.-P. Lim, "Finding unusual review patterns using unexpected rules," in Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 2010, pp. 1549-1552.
[6] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh, "Exploiting burstiness in reviews for review spammer detection," in ICWSM, vol. 13, 2013, pp. 175-184.
[7] F. Wu and B. A. Huberman, "Opinion formation under costly expression," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 1, no. 1, p. 5, 2010.
[8] F. Li, M. Huang, and X. Zhu, "Sentiment analysis with global topics and local dependency," in AAAI, vol. 10, 2010, pp. 1371-1376.
[9] R. Y. Lau, S. Liao, R. C.-W. Kwok, K. Xu, Y. Xia, and Y. Li, "Text mining and probabilistic language modeling for online review spam detection," ACM Transactions on Management Information Systems (TMIS), vol. 2, no. 4, p. 25, 2011.
[10] F. Li, M. Huang, Y. Yang, and X. Zhu, "Learning to identify review spam," in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, 2011, p. 2488.
[11] H. Li, Z. Chen, B. Liu, X. Wei, and J. Shao, "Spotting fake reviews via collective positive-unlabeled learning," in 2014 IEEE International Conference on Data Mining. IEEE, 2014, pp. 899-904.
[12] A. Mukherjee, B. Liu, and N. Glance, "Spotting fake reviewer groups in consumer reviews," in Proceedings of the 21st International Conference on World Wide Web. ACM, 2012, pp. 191-200.
[13] L. Zhao, Z. Chen, Y. Hu, G. Min, and Z. Jiang, "Distributed feature selection for efficient economic big data analysis," IEEE Transactions on Big Data, 2016.
[14] A. O. Hirschman, "The paternity of an index," The American Economic Review, vol. 54, no. 5, pp. 761-762, 1964.
[15] N. Jindal and B. Liu, "Review spam detection," in Proceedings of the 16th International Conference on World Wide Web. ACM, 2007, pp. 1189-1190.
[16] Z. Wang, T. Hou, D. Song, Z. Li, and T. Kong, "Detecting review spammer groups via bipartite graph projection," The Computer Journal, vol. 59, no. 6, pp. 861-874, 2016.
