Product Attribute Extraction Based Real-Time C2C Matching of Microblogging Messages

M.R. Mohamed Rilfi, H.M.N. Dilum Bandara, and Surangika Ranathunga
Department of Computer Science and Engineering
University of Moratuwa, Sri Lanka
{rilfi, dilumb, surangika}@cse.mrt.ac.lk

Abstract—We describe a solution for real-time matching of microblogging messages related to product selling or buying. C2C buy/sell interest matching in real time is nontrivial due to the complexities of interpreting social media messages, the sheer number of messages, and the diversity of products/services. Therefore, we adopt a combination of techniques from natural language processing, complex event processing, and distributed systems. First, we convert each message into semantics using named-entity recognition based on Conditional Random Fields (CRF) and text classification based on Logistic Regression. The extracted data are then matched using a complex event processor. Moreover, a NoSQL database and in-memory computing are used to enhance scalability and performance. The proposed solution shows high accuracy, with the classification and CRF models recording accuracies of 98.5% and 82.07%, respectively, on a real-world dataset. Low latencies were also observed, with information extraction, in-memory data manipulation, and complex event processing taking 0.5 ms, 5 ms, and 3.6 ms, respectively.

Keywords—C2C; complex event processing; information extraction; named entity recognition; stream processing
I. INTRODUCTION

As social media has become a way of life, people have started to use it to buy, sell, or consume products and services. This is especially useful in the Consumer-to-Consumer (C2C) domain, as users can post casual messages looking for potential sellers or buyers for their items or services. However, these messages are buried among other messages such as users' thoughts and experiences, needs, ad campaigns, and events. Therefore, potential buyers and sellers may not notice them. Automating the matching of potential sellers and buyers based on their social media messages is nontrivial due to the complexities of interpreting such messages (typically written in informal language and tending to be short and cryptic), the diversity of products/services, the sheer number of messages, and the computational and memory requirements. Moreover, to be effective, such matching should be carried out in near real time.

II. CHALLENGES

To match potential buyers and sellers, the social media messages they post first need to be interpreted to determine whether a message expresses a buying or selling intent and what item or service is being offered. In our use case, we need to identify product attributes such as product name, brand, and model. When converting natural language to semantics, such microblogging messages contain only a few named entities. Therefore, it is difficult
to apply existing solutions, which extract named entities such as location, organization, and person, with good accuracy. The next challenge is handling high-velocity social media streams in real time. This requires real-time information extraction as well as matching between newly arrived messages and those that appeared in the recent past.

III. METHODOLOGY

Fig. 1 shows the high-level architecture of the proposed solution. Initially, we convert the social media messages written in natural language into semantics. In our solution, matching takes place based on product attributes, namely product group, product name, brand, model, and selling status (buying/selling).

Figure 1. Overall architecture of the proposed solution.

A subset of these attributes, set1 = {product_name, brand, model}, is extracted using Named Entity Recognition (NER), while the remaining attributes, set2 = {product_group, selling_status}, are extracted using text classification. For NER-based information extraction, no fitting solution exists for product-attribute-related named entities, so they are treated as custom named entities, for which we must train a sequence classification model. Conditional Random Fields (CRF) [1] is among the best-performing sequence classification algorithms for such models. Therefore, we created three CRF models, one for each product attribute in set1. For text classification, we used Logistic Regression to create the machine-learning model that classifies the product attributes in set2. Both extraction paths are sketched below.
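The paper does not publish its feature templates or training code; the following minimal Python sketch shows the two extraction paths under our own assumptions, using sklearn-crfsuite for the set1 CRF models (illustrated for brand) and scikit-learn's Logistic Regression for the set2 classifiers (illustrated for selling_status). The feature set and toy data are illustrative only.

import sklearn_crfsuite                      # pip install sklearn-crfsuite
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def token_features(tokens, i):
    # Simple per-token features (surface form, shape, and one token of context).
    w = tokens[i]
    return {
        'lower': w.lower(),
        'is_digit': w.isdigit(),
        'is_title': w.istitle(),
        'prev': tokens[i - 1].lower() if i > 0 else '<BOS>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '<EOS>',
    }

# Toy BIO-labelled sequences for the brand model (one CRF per set1 attribute).
sents = [['selling', 'my', 'apple', 'iphone', '6s'],
         ['want', 'to', 'buy', 'a', 'samsung', 'tv']]
tags = [['O', 'O', 'B-BRAND', 'O', 'O'],
        ['O', 'O', 'O', 'O', 'B-BRAND', 'O']]
X = [[token_features(s, i) for i in range(len(s))] for s in sents]
brand_crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=100)
brand_crf.fit(X, tags)

# Logistic Regression text classifier for selling_status (set2).
status_clf = Pipeline([('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                       ('lr', LogisticRegression(max_iter=1000))])
status_clf.fit(['selling my apple iphone 6s', 'want to buy a samsung tv'],
               ['selling', 'buying'])
print(status_clf.predict(['anyone selling a used nokia phone?']))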
TABLE I. ACCURACY AND PERFORMANCE MEASURES

Module Name                  | Accuracy | Recall | Precision | F1    | Latency (ms) | Parallel Instances | Training Set Size
Brand NER                    | 0.821    | 0.932  | 0.873     | 0.901 | 0.333        | 12                 | 203,851
Product NER                  | 0.84     | 0.922  | 0.904     | 0.913 | 0.644        | 10                 | 883,101
Status Classification        | 0.985    | 0.993  | 0.974     | 0.983 | 0.533        | 10                 | 910,951
Product Group Classification | 0.948    | 0.944  | 0.96      | 0.952 | 0.402        | 10                 | -
In-memory data manipulation  | -        | -      | -         | -     | 5.0          | -                  | -
CEP-based matching           | -        | -      | -         | -     | 3.6          | -                  | -
Even though our product attribute extraction works with high accuracy, it alone is insufficient to fulfill the performance requirements. As the velocity of the incoming social media message stream is very high, a single-threaded solution cannot handle it. Therefore, we used the Apache Storm distributed real-time stream processing framework to extract the semantics from the social media messages in near real time. We built a five-node Storm cluster to distribute the product attribute extraction process, as well as other related processes. It follows a master-slave architecture [2]. Two nodes were allocated as the master and coordination nodes, and the remaining three nodes were allocated as worker nodes, which are responsible for executing our software modules (the topology). The topology consists of a set of modules (a spout and several bolts), each running at a different level of parallelism according to its workload, as sketched below.
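The authors do not state how the topology is implemented; as a minimal sketch, assuming the streamparse Python bindings for Storm, the extraction stage could look as follows. The spout, the extract_attributes() helper, and the parallelism hint are hypothetical (par=12 mirrors the Brand NER instance count in Table I).

from streamparse import Bolt, Spout, Topology

class TweetSpout(Spout):
    outputs = ['message']

    def next_tuple(self):
        # In the real system this would consume the incoming social media stream.
        self.emit(['selling my apple iphone 6s'])

class AttributeExtractionBolt(Bolt):
    outputs = ['product_name', 'brand', 'model', 'product_group', 'selling_status']

    def process(self, tup):
        message = tup.values[0]
        sem = extract_attributes(message)  # hypothetical wrapper around the CRF/LR models
        self.emit([sem['product_name'], sem['brand'], sem['model'],
                   sem['product_group'], sem['selling_status']])

class C2CTopology(Topology):
    tweets = TweetSpout.spec()
    extract = AttributeExtractionBolt.spec(inputs=[tweets], par=12)  # cf. Table I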
With the product attribute extraction in place, we now have a well-structured stream of product-attribute semantics; the next step is to match the relevant buying and selling messages. We perform two types of matching: time-based matching and content-based matching. Time-based matching can be further classified into matching among real-time messages and matching between real-time and recent messages. Content-based matching can be further classified into complete matching and partial matching. All the matching processes are implemented using the WSO2 complex event processor [3], which is known for its high throughput and low latency. To match real-time messages against recent ones, we needed to store a very large number of messages (i.e., their semantics) as recent messages. To manage them, we used Apache Cassandra [4], a decentralized, distributed NoSQL database, and created an optimized data model to decrease latency and boost read performance, as sketched below.
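The optimized data model itself is not given in the paper; the following is a minimal sketch, assuming a query-driven Cassandra design in the spirit of [4], written against the Python cassandra-driver. The keyspace, table, and key layout are our assumptions.

from cassandra.cluster import Cluster    # pip install cassandra-driver

cluster = Cluster(['127.0.0.1'])         # contact points are deployment-specific
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS c2c
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}""")
# Partitioning by (product_group, brand) lets one single-partition read fetch
# the recent candidates for an incoming message; clustering by posted_at DESC
# keeps the newest messages first.
session.execute("""
    CREATE TABLE IF NOT EXISTS c2c.recent_messages (
        product_group text, brand text, posted_at timestamp, message_id uuid,
        product_name text, model text, selling_status text,
        PRIMARY KEY ((product_group, brand), posted_at, message_id))
    WITH CLUSTERING ORDER BY (posted_at DESC, message_id ASC)""")
rows = session.execute(
    "SELECT * FROM c2c.recent_messages WHERE product_group = %s AND brand = %s LIMIT 100",
    ('phone', 'apple'))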
However, when matching real-time and recent messages, the recent messages must be fetched at a latency that keeps pace with the velocity of the real-time stream, which a NoSQL database alone cannot achieve. Therefore, to overcome this issue, we used the Apache Spark in-memory computing framework.

IV. SYSTEM ARCHITECTURE

The overall architecture is illustrated in Fig. 1. The outcome of the information extraction is passed into three modules, namely (a) the NoSQL database for data persistence, (b) in-memory computing for data manipulation, and (c) the complex event processor for matching. In-memory computing takes the real-time semantics as a query parameter and delivers the related recent semantics to the Complex Event Processor (CEP). As illustrated in Fig. 2, the CEP execution plan was implemented using window-based matching between the messages: we cache the real-time messages in a time window for a few seconds while loading the recent messages into a length window. We then compare the messages from the time window and the length window, and whenever two messages match, we join them into a matched stream. Finally, the matched messages produced by the CEP are notified to the potential buyers and sellers of the C2C business model. The sketch below models this windowing logic.

Figure 2. Window-based operations on matching using complex event processing.
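As a language-neutral illustration of the execution plan in Fig. 2, and not the actual WSO2 CEP/Siddhi deployment, the following Python sketch joins a time window of real-time semantics against a length window of recent semantics. The window sizes and the complete-match key are assumptions.

import time
from collections import deque

TIME_WINDOW_SEC = 10          # assumption: the "few seconds" time window
LENGTH_WINDOW = 10000         # assumption: capacity of the length window

time_window = deque()                         # (timestamp, semantics) of real-time events
length_window = deque(maxlen=LENGTH_WINDOW)   # recent semantics from the in-memory layer

def match_key(sem):
    # Complete matching on the set1 attributes; partial matching could drop 'model'.
    return (sem['product_name'], sem['brand'], sem['model'])

def on_realtime(sem, now=None):
    # Cache the real-time event, expire stale ones, and join against recent messages
    # of the opposite selling status.
    now = time.time() if now is None else now
    time_window.append((now, sem))
    while time_window and now - time_window[0][0] > TIME_WINDOW_SEC:
        time_window.popleft()
    return [(sem, recent) for recent in length_window
            if match_key(recent) == match_key(sem)
            and recent['selling_status'] != sem['selling_status']]

length_window.append({'product_name': 'iphone', 'brand': 'apple',
                      'model': '6s', 'selling_status': 'selling'})
print(on_realtime({'product_name': 'iphone', 'brand': 'apple',
                   'model': '6s', 'selling_status': 'buying'}))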
V. PERFORMANCE EVALUATION

Supervised machine learning plays a major role in our solution. Training dataset preparation included noise filtering, normalization, and string-similarity matching, and we used linked data techniques to label the NER dataset. Our primary dataset comprises tweets and e-commerce product titles collected from real users. Table I lists all the modules in our solution together with their accuracy and performance measures. Overall, accuracy ranges from 82% to 98.5%, while latency remains low, between 0.33 ms and 5 ms. A sketch of string-similarity-based labelling is given below.
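The labelling pipeline is not detailed in the paper; the following minimal sketch shows how string similarity against a known-brand list (e.g., mined from e-commerce product titles) could bootstrap NER labels. The brand list, similarity measure, and threshold are our assumptions.

from difflib import SequenceMatcher

KNOWN_BRANDS = {'apple', 'samsung', 'nokia'}   # e.g., mined from product titles

def label_tokens(tokens, threshold=0.85):
    # Tag a token as a brand mention when it is sufficiently similar to a known brand.
    labels = []
    for tok in tokens:
        best = max(SequenceMatcher(None, tok.lower(), b).ratio() for b in KNOWN_BRANDS)
        labels.append('B-BRAND' if best >= threshold else 'O')
    return labels

print(label_tokens(['selling', 'my', 'aple', 'iphone']))  # fuzzy match labels 'aple'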
REFERENCES

[1] D. Putthividhya and J. Hu, "Bootstrapped Named Entity Recognition for Product Attribute Extraction," in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Stroudsburg, PA, USA, 2011, pp. 1557–1567.
[2] H. Chen, R. H. Chiang, and V. C. Storey, "Business Intelligence and Analytics: From Big Data to Big Impact," MIS Quarterly, vol. 36, no. 4, pp. 1165–1188, 2012.
[3] S. Suhothayan, K. Gajasinghe, I. Loku Narangoda, S. Chaturanga, S. Perera, and V. Nanayakkara, "Siddhi: A Second Look at Complex Event Processing Architectures," in Proceedings of the 2011 ACM Workshop on Gateway Computing Environments, New York, NY, USA, 2011, pp. 43–50.
[4] A. Chebotko, A. Kashlev, and S. Lu, "A Big Data Modeling Methodology for Apache Cassandra," in Proceedings of the 2015 IEEE International Congress on Big Data, 2015, pp. 238–245.