Semantic Information Fusion of Linked Open Data and Social Big Data for the Creation of an Extended Corporate CRM Database
Ana I. Torre-Bastida, Esther Villar-Rodriguez, Javier Del Ser, and Sergio Gil-Lopez
Abstract. The amount of open information available online from heterogeneous sources and domains is growing at an extremely fast pace, and constitutes an important knowledge base for industries and companies. In this context, two relevant data providers can be highlighted: the “Linked Open Data” and “Social Media” paradigms. The fusion of these data sources (the former structured, the latter raw), along with the information contained in structured corporate databases within the organizations themselves, may unveil significant business opportunities and competitive advantage for those who are able to understand and leverage their value. In this paper, we present a use case that represents the creation of a knowledge base of existing and potential customers, exploiting social and linked open data, from which any given organization might infer valuable information to support decision making. To achieve this, a solution based on the synergy of big data and semantic technologies is designed and developed. The former are used to implement the tasks of collection and initial data fusion based on natural language processing techniques, whereas the latter perform semantic aggregation, persistence, reasoning and retrieval of information, as well as the triggering of alerts over the semantized information.
Keywords: Big Data, Social Media, Linked Open Data, business intelligence, information fusion, ontology management, information modelling.
1 Introduction and Motivation
Nowadays, organizations need to gather valuable information that will allow them to improve their business processes and optimize their decision making. In this context, business intelligence [1] is the set of strategies, relevant aspects and key technologies for creating knowledge from the data environment, through the analysis of the data and their context, with the ultimate aim of facilitating business decision making. However, the principal obstacle to this task is the vast amount of available data and the difficulty of efficiently extracting useful information from huge repositories. The problems associated with data volume are captured by the concept known as Big Data, where the collection of data sets is so large and complex that processing them with traditional data management tools becomes computationally unaffordable. Two of the most notable data providers in Big Data are Linked Open Data (LOD [2]) and social big data. Social media is becoming an important context-rich information source for organizations, and many business executives therefore consider incorporating this user-generated information into their decision-making chain an essential challenge. The goal is for businesses to profit from social platforms such as Wikipedia (DBpedia in the LOD), Facebook or Twitter. Because the received digitized data are heterogeneous, follow non-standard schemas and have low accuracy and reliability, a great human effort is needed to extract, format and assimilate them. This raises the second major problem: removing noise from the data content before using it. A third problem arising therefrom is how to merge these datasets with traditional business data, such as relational databases or corporate knowledge systems. Many projects follow these business intelligence research lines in areas such as brand recognition, competitor analysis or benchmarking [3–5]. However, few studies apply them to a specific matter such as customer relationship management, whether for identifying potential customers or for improving and enriching information about existing clients.

Ana I. Torre-Bastida · Esther Villar-Rodriguez · Javier Del Ser · Sergio Gil-Lopez
TECNALIA, OPTIMA Unit, E-48160 Derio, Spain
e-mail: {isabel.torre,esther.villar,javier.delser,sergio.gil}@tecnalia.com
© Springer International Publishing Switzerland 2015
D. Camacho et al. (eds.), Intelligent Distributed Computing VIII, Studies in Computational Intelligence 570, DOI: 10.1007/978-3-319-10422-5_23
Aimed at filling this gap, our approach defines a system capable of 1) implementing the generation and management of an extended corporate CRM database; 2) solving several related analytics problems (knowledge discovery and aggregation/fusion) stemming therefrom; and 3) exploiting emerging data sources by using the semantic and big-data technology stack. Technically, our system follows a semantic aggregation approach that takes advantage of the structure of the LOD datasets to enhance our solution. In detail, the main contributions of our scheme are the following:
• Analysis of the particular business problem of discovering and improving an organization's customer database.
• Exploitation of new data sources (social media, LOD).
• Use of the semantic and big-data technology stack for data collection and aggregation tasks.
In the rest of this paper we first introduce the main concepts related to our approach (Social Media and Linked Open Data) and the closest related work. Then our scheme and its core processing steps are described in detail. Finally, we illustrate a use case to evaluate our system prototype.
2 Background
The web has recently undergone a transformation in the amount and type of available content, giving rise to a new paradigm called Big Data. This term is used to describe the exponential growth and availability of data, both structured and unstructured. In our approach we use two clear examples of these kinds of datasets: social big data (unstructured) and Linked Open Data (semantically structured). Nowadays social media platforms store enormous amounts of data that have not previously been analyzed automatically and that could reveal critical information. The reason is that the user's role has shifted from mere consumer to content provider. Social media is defined by Kaplan and Haenlein [7] as “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content”. For this reason it can be considered a context-rich source of big data and is usually referred to as Social Big Data. To better explain this definition we must introduce two concepts: Web 2.0 and User Generated Content (UGC). Web 2.0 describes a new way in which software developers and end users collaborate on the World Wide Web; that is, content and applications are no longer statically published by an individual, but are instead continuously modified by a collaborative community of users. UGC describes the various forms of media content that are publicly available and created by end users.
On the other hand, Linked Data refers to a set of best practices for publishing and interlinking structured data on the Web. With this definition, Bizer et al. [6] characterized the linked data paradigm and provided a mechanism for building the Web of Data, which is based on Semantic Web technologies and may be considered a simplified version of the Semantic Web. The data model for representing interlinked data is RDF [3], where data is represented as node-and-edge-labeled directed graphs.
Some published Linked Data sets contain billions of triples, whose cardinality is steadily increasing to yield the Linked Data Cloud, i.e. a group of data sets available on the Web as Linked Data with links pointing at other Linked Data sets.
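To make the graph view of RDF concrete, the following minimal sketch stores a few triples as plain Python tuples and follows their links. It only illustrates the data model: the identifiers (`ex:ScottishPower`, `ex:parentCompany`, and so on) are hypothetical and are not taken from any real dataset or from the system described in this paper.

```python
from collections import defaultdict

# Each RDF statement is a (subject, predicate, object) triple.
# The identifiers below are hypothetical, chosen only for illustration.
TRIPLES = [
    ("ex:ScottishPower", "rdf:type", "ex:EnergyCompany"),
    ("ex:ScottishPower", "ex:parentCompany", "ex:Iberdrola"),
    ("ex:Iberdrola", "owl:sameAs", "dbpedia:Iberdrola"),
]


def build_graph(triples):
    """Index triples as a directed graph: node -> list of (edge label, target node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every node reachable from `start` by following one or more edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


graph = build_graph(TRIPLES)
links = reachable(graph, "ex:ScottishPower")
```

Following the `owl:sameAs` edge is what lets a local entity reach its counterpart in another data set, which is the interlinking idea behind the Linked Data Cloud.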
2.1 Related Work
In business intelligence, and especially in the area of competitive intelligence (CI), the data gathering process can involve a large number of research areas regarding technologies and strategies, which has triggered intense activity in the related literature. Our approach deals with social big data, which has been broadly adopted to feed data analytics systems. Studies such as the one by Rappaport [8] introduce the essential role that exploiting social media can play in the field of business intelligence, presenting case studies in which social media data is turned into business advantages. In [9] a preliminary study on using text-mining techniques for the task of collaborative intelligence information gathering is presented. The main difference with our approach lies in the techniques used (our work resorts to semantic fusion and big data technologies) and the application domain, which in our case is the specific example of creating a knowledge base of customers. Another work related to our scheme is the one presented by Shroff et al. [10], which describes a framework for the fusion of business intelligence in various industries such as manufacturing, retail or insurance. Once again, and unlike our proposal, this contribution takes a global, general case-based approach without concentrating on a specific problem. Furthermore, the artificial intelligence techniques used in their work are the blackboard architecture and locality-sensitive hashing, which are far from our semantic fusion approach. Another interesting and more specific work is presented by Agichtein et al. [11], which elaborates on a scheme for gathering high-quality social media information, but only handles data posted on the Yahoo! Answers social platform.
Other investigations also discuss the advantages of data fusion applied to information collected from social media, as in [12], where multiple features in the social media environment (textual, visual and user information) are fused and later used in a retrieval algorithm for large social media data (Flickr). Likewise, in [13] a use case of a shared online calendar is presented and enhanced with events generated from users' social networks and location data using fusion techniques. Further, we highlight the work presented by Kim et al. [14] because it is the only one that uses semantic fusion techniques. However, its purpose is outside the scope of business intelligence and it does not provide enough technical details; its methodology and assessment are deemed insufficient for a fair comparison with the technical proposal presented next. Finally, we analyze the work by Hoang et al. [15], which presents a survey of technologies and applications of Linked Data mashups as well as the challenges of building them. That paper presents a use case close to our approach, since both use semantic technologies for integration. The main difference lies in the architecture (they use semantic web pipes, whereas our approach uses ontology mapping/alignment techniques for the semantic integration) and the datasets (they exploit Freebase and do not use social media data sources).
3 Information Collection from the Web
Our system collects external information, such as company-related tweets, customer feedback (comments) or business-related open data, and merges it with traditional enterprise databases, with the aim of storing this aggregated information following an adequate business semantic model. This section delves into the first of these tasks, the collection of information from two different sources available online, Social Media and the Linked Open Data Cloud, as well as into the subsequent filters used to retain only the relevant content:
1. Social Big Data: at this point the data collected from two streaming social platforms is selected.
   a. Facebook posts from specific user-ids are the considered data, extracting the comments generated by these users.
   b. Twitter feeds containing certain keywords or from specific user-ids. These keywords are extracted using a TF/IDF approach from the corporation documents and website.
   The technologies used to perform these tasks are the Facebook¹ and Twitter² streaming application programming interfaces (APIs).
2. Linked Open Data Cloud: there are several datasets related to the business domain, such as DBpedia, CrunchBase or Freebase, which can be queried through the SPARQL query language or web services. From these datasets, structured information about customers is obtained, which is later mapped to the semantic model of our system.
The data collection is detailed in Figure 1. This task is composed of three subprocesses: data collection and noise reduction, extraction of disambiguated entities, and harvesting of information about related entities that is available as open data. In a first step, the different social media data streams are captured using the aforementioned APIs. Next, the posts (from the Twitter or Facebook platforms) are preprocessed. At this stage we use the Freeling API³ to carry out the language analysis, calculating the corresponding synsets of each post (a synset is a group of data elements that are considered to be semantically equivalent, represented by an identifier). The collection of pairs formed by each post and its synsets are the input events to a set of rules that allow deducing whether the post (tweet/comment) can be considered within the business domain. This stage is what we have coined the NOISE filter. This filter, composed of a set of rules, is implemented with the Esper CEP engine⁴ and is built from a set of synsets that construct a context describing and modeling the business domain, for example concept: business; synsets: 08056231, 08058098.
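The idea of filtering (post, synsets) events against a business context can be sketched in a few lines. This is a plain-Python illustration under stated assumptions, not the Esper CEP rules used in the system: it keeps a post when at least one of its synsets belongs to the context. The function names and the sample posts are hypothetical; only the two `business` synset identifiers come from the text.

```python
# Business context: concept name -> synset identifiers.
# The "business" synsets are the ones quoted in the text; the rest is illustrative.
BUSINESS_CONTEXT = {
    "business": {"08056231", "08058098"},
}


def is_pertinent(post_synsets, context=BUSINESS_CONTEXT):
    """Return True when at least one synset of the post belongs to the context."""
    context_synsets = set().union(*context.values())
    return not context_synsets.isdisjoint(post_synsets)


def noise_filter(events, context=BUSINESS_CONTEXT):
    """Keep only the (post, synsets) events that match the business context."""
    return [(post, synsets) for post, synsets in events
            if is_pertinent(synsets, context)]


# Hypothetical input events: each post is paired with synsets that a prior
# language-analysis step is assumed to have produced.
events = [
    ("Company X wins a supply contract", {"08056231", "06520944"}),
    ("So much to do, so little energy", {"11452218"}),
]
kept = noise_filter(events)
```

A set intersection is enough here because the rule is a simple membership test; a CEP engine becomes useful when the same test must run continuously over a stream of incoming events.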
If any of the synsets belonging to an incoming post can be matched to any of the synsets that form the context, the rule is activated and the post is retained as pertinent. Otherwise the post is discarded, since it is assumed that its content is not related to the business world. Once posts have gone through the noise filter, the result is deemed valuable, since it is likely to provide meaningful information about the domain, and is then fed to the subprocess in charge of entity extraction. Named Entity Recognition (NER) refers to the module or function in charge of detecting any kind of entities, such as cities, organizations or people, and is mostly utilized by NLP utilities as a contributor of semantic information. In our case, the filtered posts can contain any named entity corresponding to an already existing customer, a potential client or even a competitor working in the same market sectors. For this purpose, the Daedalus Topic Extraction API has been selected and integrated into a Map-Reduce framework to parallelize the algorithm responsible for extracting entities. The output obtained from the Map-Reduce job is a set of entities grouped by post. Finally, for each of the previously extracted entities, we collect the information available in the Linked Open Data sets (Freebase, DBpedia) and other open data sets such as CrunchBase. This information is merged and aggregated with the existing data from the corporation's relational databases, with the final aim of feeding the semantic model.

Fig. 1 Data collection process flow. [Diagram: tweets by keyword, tweets by user-id and Facebook posts by user-id enter the NOISE filter (CEP); the filtered posts (tweet + keyword, or tweet/fb_post + customer) pass to NER entity extraction (Map/Reduce); the extracted entities (entities + keyword, or entities + customer) feed the open data collection step (SPARQL/REST) over the LOD (DBpedia, Freebase, CrunchBase), which outputs entities and related information.]

¹ https://developers.facebook.com/docs/graph-api
² https://dev.twitter.com/docs/api/streaming
³ http://nlp.lsi.upc.edu/freeling/
⁴ http://esper.codehaus.org/
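The map and reduce roles in the entity extraction step can be illustrated with a small, sequential sketch. It is an assumption-laden toy: `extract_entities` is a hypothetical stand-in that treats @mentions as entities and does not call the Daedalus Topic Extraction API, and the code runs in a single process rather than on a Map-Reduce cluster. The sample posts are illustrative.

```python
import re
from collections import defaultdict
from functools import reduce


def extract_entities(text):
    """Toy stand-in for an entity extraction service: treats @mentions as entities."""
    return re.findall(r"@(\w+)", text)


def map_phase(post):
    """Map step: emit one (post_id, entity) pair per entity found in the post."""
    post_id, text = post
    return [(post_id, entity) for entity in extract_entities(text)]


def reduce_phase(grouped, pair):
    """Reduce step: collect the entities emitted for the same post id."""
    post_id, entity = pair
    grouped[post_id].append(entity)
    return grouped


def entities_by_post(posts):
    """Run the map step over every post, then fold the pairs into groups."""
    pairs = [pair for post in posts for pair in map_phase(post)]
    return dict(reduce(reduce_phase, pairs, defaultdict(list)))


# Illustrative posts, identified by (source:number) ids.
posts = [
    ("TweetsC:1", "New plan with @Innobasque @tecnalia @iberdrola"),
    ("FB:2", "No mentions here"),
]
result = entities_by_post(posts)
```

Because each post is mapped independently, the map step is the part that a real framework would distribute across workers; the reduce step only needs the pairs that share a post id.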
4 Semantic Fusion: Aggregation, Model and Interlinking
The semantic aggregation process has two main goals: to improve the existing information about customers of the organization and to discover new potential customers. The entire process is detailed in Figure 2. First of all, a classification process is applied to each post to determine whether its contents relate to any entity existing in the semantic data model. Depending on the result of this classification, the system follows one of two alternative flows. In the positive case, the semantic model is updated with the new information about the customer and its partnerships/relationships. Otherwise, the data gathered from the Linked Open Data Cloud is mapped into a new instance within the semantic model. These processes are supported by a set of previously computed semantic links between our model and the vocabularies of the LOD datasets, which are calculated following the ontology alignment process proposed by the authors in [16].

Fig. 2 Semantic data aggregation. [Diagram elements: entities by post; classification; existing customer information; potential entities information; LOD & relational information mapping; corporation RDBMS; RDF repository; semantic model updating; update customer & relationships; new customer & relationships; semantic model.]

With regard to the definition of our model schema, well-known semantic vocabularies are reused to promote interoperability with other RDF repositories or datasets. Our ontology model is based on the combination of the schema.org ontology with the one used in DBpedia, and vocabularies such as SKOS to specify semantic relationships and links. New classes or properties are also modeled when existing vocabularies do not provide their definition. Finally, the new instances of the semantic data model are stored in the Virtuoso OpenLink RDF repository⁵.
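The two alternative flows of the aggregation process (update an existing customer, or create a new instance from LOD data) can be sketched as follows. This is a simplified illustration, not the system's implementation: classification is reduced to a name lookup, the semantic model is a plain dictionary, and the alignment table is hard-coded instead of being computed by ontology matching. The model properties `d:employees` and `d:country` are hypothetical.

```python
# Hypothetical alignment between LOD vocabulary terms and properties of the model.
# In the paper these links come from an ontology alignment process; here they
# are hard-coded for illustration.
ALIGNMENT = {
    "foaf:homepage": "d:url",
    "dbpedia-owl:numberOfEmployees": "d:employees",
    "dbpedia-owl:country": "d:country",
}


def map_to_model(lod_properties, alignment=ALIGNMENT):
    """Translate LOD properties into model properties, dropping unaligned ones."""
    return {alignment[p]: v for p, v in lod_properties.items() if p in alignment}


def aggregate(model, entity, lod_properties):
    """Update an existing customer or create a new instance; return the flow taken."""
    mapped = map_to_model(lod_properties)
    if entity in model:          # classification (simplified): entity already known
        model[entity].update(mapped)
        return "updated"
    model[entity] = mapped       # otherwise: new potential customer
    return "created"


model = {"ScottishPower": {"d:url": "http://www.scottishpower.com/"}}
flow_1 = aggregate(model, "ScottishPower", {"dbpedia-owl:numberOfEmployees": 9953})
flow_2 = aggregate(model, "Energa",
                   {"dbpedia-owl:country": "dbpedia:Poland", "ex:unmapped": 1})
```

Keeping the vocabulary mapping in a separate table means that adding a new LOD source only requires new alignment entries, not changes to the update logic.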
5 Information Retrieval, Inference and Alert Generation
Once the information has been converted to RDF format following our semantic model and saved in the RDF repository, some value-added operations can be implemented over the stored data, such as the following features:
• Information retrieval: In our case SPARQL, the current W3C recommendation for querying RDF data, is selected to allow users to perform selective queries. In our system the SPARQL endpoint provided by the Virtuoso OpenLink RDF repository and the JENA API⁶ are the tools chosen for implementing this module.
• Inference: Based on the information stored in the repository, semantic inference processes (RDFS and OWL) can be performed with the aim of discovering new relationships. This task can be accomplished by semantic reasoners like Pellet [17], in combination with the JENA API. This process also allows for the definition of specific business semantic rules implemented using SWRL (Semantic Web Rule Language, which combines OWL and RuleML).
• Alert generation: Finally, an alert generation module is responsible for monitoring the data and triggering events that indicate that the conditions specified in an alert have been fulfilled. For its implementation, a listener is used during the loading and inference processes to detect whether alert conditions have been met.

⁵ http://virtuoso.openlinksw.com/
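The listener-based alert generation described in the last item can be sketched with a small in-memory store. This is only an illustration of the pattern under stated assumptions: the `TripleStore` and `Alert` classes are hypothetical and do not reproduce the Virtuoso or JENA APIs, and the alert condition (two energy companies loaded) is invented for the example.

```python
class TripleStore:
    """Minimal in-memory triple store that notifies listeners on every new triple."""

    def __init__(self):
        self.triples = set()
        self.listeners = []

    def add_listener(self, listener):
        self.listeners.append(listener)

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        if triple not in self.triples:      # notify only for genuinely new statements
            self.triples.add(triple)
            for listener in self.listeners:
                listener(triple)


class Alert:
    """Records an event each time the count of matching triples is at or above the threshold."""

    def __init__(self, name, condition, threshold=1):
        self.name = name
        self.condition = condition
        self.threshold = threshold
        self.matches = 0
        self.fired = []

    def __call__(self, triple):
        if self.condition(triple):
            self.matches += 1
            if self.matches >= self.threshold:
                self.fired.append((self.name, triple))


store = TripleStore()
new_energy = Alert(
    "new-energy-company",
    lambda t: t[1] == "rdf:type" and t[2] == "d:energycompany",
    threshold=2,
)
store.add_listener(new_energy)
store.add("org1", "rdf:type", "d:energycompany")
store.add("org1", "rdfs:label", "ScottishPower")
store.add("org2", "rdf:type", "d:energycompany")
```

Attaching the listener to the store, rather than polling it, means the same alert logic would also see triples produced later by an inference step, provided they are written through the same `add` method.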
6 Use Case
This section describes in detail an illustrative example of the process followed by our system, from the moment the data is collected until the information is retrieved by a SPARQL query. The data collection process is shown in Figure 3. The first input is real data retrieved from Twitter and Facebook. Tweets and posts are preprocessed to transform them into synsets, as explained in Section 3. These synsets are filtered (a noise filter for irrelevant data) using a macro that consists of a set of synsets representing the business domain (business context in the figure). The filtered tweets and posts are then subjected to a named entity recognition procedure aimed at extracting the entities, so as to collect the information about them that is available in the LOD.

Fig. 3 Data collection example. Content of the figure:

Keywords (TF/IDF): "automotive", "energy", "IT", …

Tweets by keywords:
1. @Gamesa_Official wins contract to supply 20 MW to Energa in #Poland http://t.co/RjIG1Hup12
2. China #Automotive ABS Market @Bigmarketreport http://t.co/IFR0pW76Ey
3. Applying the energy of today's Taurus New Moon Eclipse empower... More for Virgo http://t.co/y4bAKcHKCd
4. Soooo much to do sooo little energy #cantbearsed
[…]

User-ids extracted from the RDB: "iberdrola", …

Tweets by user-id:
1. Nuevo Plan Ciencia y Tecnología con @Innobasque @AlianzaIK4 @tecnalia @iberdrola @jakiunde @idom …
2. Calcula y obtén un gráfico de la rentabilidad de tu inversión en @Iberdrola https://t.co/hRIi7bdycv
3. Trabajamos junto a @BSC_CNS este este proyecto para diseñar instalaciones eólicas …
4. Consulta la actualidad de nuestra filial brasileña, Elektro http://t.co/KOjCk2W6cs
[…]

FB posts by user-id:
1. Hoy en el blog podéis ..de nuestra filial escocesa ScottishPower. …
2. Hoy se celebra el Día de la #Tierra. En Iberdrola …diferentes políticas ….la estrategia …
3. Iberdrola Ingeniería … construir la subestación Votkinskaya, …RusHydro
[…]

NLP pre-processing and NOISE filter. Business context: organization -> 8056231, 08058098; business -> 08061042; …

Filtered tweets:
TweetsK:1, keyword, @Gamesa_Official wins contract to supply 20 MW to Energa in #Poland http://t.co/RjIG1Hup12
FB:1, Iberdrola, Hoy en el blog podéis ..de nuestra filial escocesa ScottishPower.
FB:3, Iberdrola, Iberdrola Ingeniería … construir la subestación Votkinskaya, …RusHydro…
TweetsC:1, Iberdrola, Nuevo Plan Ciencia y Tecnología con @Innobasque @AlianzaIK4 @tecnalia @iberdrola @jakiunde @idom …
TweetsC:3, Iberdrola, Trabajamos junto a @BSC_CNS este este proyecto para diseñar instalaciones eólicas ….

NER processing. Entities extracted:
TweetsK:1, keyword, Gamesa_Official, Energa
FB:1, Iberdrola, ScottishPower
FB:3, Iberdrola, RusHydro
TweetsC:1, Iberdrola, Innobasque, AlianzaIK4, Tecnalia, Jakiunde, Idom
TweetsC:3, Iberdrola, …

Open data collection (Freebase, DBpedia, CrunchBase). DBpedia SPARQL example:
SELECT ?thing WHERE { ?thing rdfs:label ?name. FILTER(regex(str(?name), "Iberdrola", "i")) }

Entities + open information: {ScottishPower [foaf:homepage http://www.scottishpower.com/; dbpedia-owl:numberOfEmployees 9953 …]; Energa [dbpedia-owl:country dbpedia:Poland; …]

⁶ https://jena.apache.org/documentation/query/

Fig. 4 Generated semantic data model and SPARQL execution example. Content of the figure:

Input: the entities + open information listed at the end of Figure 3, which go through classification, LOD & relational information mapping and semantic model updating before being stored in the RDF repository.

Generated RDF data:
PREFIX d: http://datafusion.org/ontology/
org1 a d:energycompany.
web1 a d:website.
org1 rdfs:label "ScottishPower".
web1 d:url http://www.scottishpower.com/.
org1 d:contact web1.
[…]
org2 a d:energycompany.
org2 rdfs:label "Energa".
[…]

SPARQL query:
PREFIX d: <http://datafusion.org/ontology/>
SELECT ?org ?name ?subj ?name2 WHERE { ?org a d:organization. ?org rdfs:label ?name. ?subj a d:subject. ?subj rdfs:label ?name2. ?org d:relatedTo ?subj. }

Query results (?org, ?name, ?subj, ?name2):
org1 "ScottishPower" subj1 "energy"
org2 "Energa" subj1 "energy"
org3 "Gamesa" subj1 "energy"
org4 "RusHydro" subj1 "energy"
org5 "AlianzaIK4" subj1 "energy"
org5 "AlianzaIK4" subj2 "industry"
org5 "AlianzaIK4" subj3 "IT"
[…]

Finally, the data model and instances generated by the semantic aggregation process, together with an example of information retrieval using a SPARQL query, are shown in Figure 4. As shown in the figure, the query returns a list of all organizations and their related subjects. In this context it must be noted that although ScottishPower is annotated as energycompany, this entity is also returned by the query, because in the ontological model (see Figure 2) an energycompany is categorized as a subclass of organization. This unveils one of the advantages of using a semantic model for information retrieval.
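The subclass behaviour noted in this section can be reproduced with a small sketch that contrasts retrieval with and without RDFS-style subclass inference. It is written in plain Python over tuples and is only an assumption-based illustration: it does not use Pellet, JENA or a SPARQL engine, and the helper names are hypothetical. The instance data mirrors the triples listed for Figure 4, and the subclass axiom is the one stated in the text.

```python
RDF_TYPE, SUBCLASS_OF = "rdf:type", "rdfs:subClassOf"

# Instance data follows the use case; the subclass axiom is the one described in the text.
TRIPLES = {
    ("d:energycompany", SUBCLASS_OF, "d:organization"),
    ("org1", RDF_TYPE, "d:energycompany"),
    ("org1", "rdfs:label", "ScottishPower"),
    ("org2", RDF_TYPE, "d:energycompany"),
    ("org2", "rdfs:label", "Energa"),
    ("web1", RDF_TYPE, "d:website"),
}


def superclasses(triples, cls):
    """Return `cls` plus every class reachable through rdfs:subClassOf."""
    found, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def instances_of(triples, cls, infer=True):
    """Labels of instances typed as `cls`, optionally using subclass inference."""
    labels = {s: o for s, p, o in triples if p == "rdfs:label"}
    result = set()
    for s, p, o in triples:
        if p != RDF_TYPE:
            continue
        types = superclasses(triples, o) if infer else {o}
        if cls in types and s in labels:
            result.add(labels[s])
    return result


with_inference = instances_of(TRIPLES, "d:organization")
without_inference = instances_of(TRIPLES, "d:organization", infer=False)
```

Without inference, a query for `d:organization` finds nothing because no instance carries that type explicitly; with the subclass closure, both energy companies are returned, which is the advantage the text attributes to the semantic model.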
7 Concluding Remarks and Future Research
This manuscript has addressed the problem of automatically creating and managing a customer database from a novel perspective: semantic aggregation. Input data comes from new sources such as social media and Linked Open Data. Furthermore, different modules have been implemented leveraging the Big Data (Map-Reduce, Complex Event Processing) and semantic web (RDF repository, reasoner, SWRL) technology stacks. A use case exemplifies the multiple possibilities and the potential offered to a corporation by our approach, ranging from the discovery of new customers to the expansion of the knowledge base about traditional clients. This yields profitable advantages in the business domain, where decision making is a critical process and the collection of customer information is a key factor. Future work will be devoted to the study of new applications and to enlarging the technical scope of this semantic aggregation so as to, e.g., also include projects referencing entities, business concepts or places, and properties that can be matched to relationships in the model, inferred from the posts by developing new algorithms that use NLP and classification techniques. Furthermore, multilingual features will also be considered for inclusion in the platform.
References
1. Moss, L.T., Atre, S.: Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Addison-Wesley (2003)
2. Bizer, C., Heath, T., Idehen, K., Berners-Lee, T.: Linked data on the web (LDOW2008). In: Proceedings of the 17th International Conference on World Wide Web, pp. 1265–1266 (2008)
3. Hoffman, D.L., Fodor, M.: Can you measure the ROI of your social media marketing. MIT Sloan Management Review 52(1), 41–49 (2010)
4. Vuori, V.: Social media changing the competitive intelligence process: elicitation of employees’ competitive knowledge. Tampereen teknillinen yliopisto. Julkaisu - Tampere University of Technology. Publication; 1001 (2011)
5. Bingham, T., Conner, M.: The new social learning: A guide to transforming organizations through social media. Berrett-Koehler Publishers (2010)
6. Bizer, C., Heath, T., Berners-Lee, T.: Linked data - the story so far. International Journal on Semantic Web and Information Systems 5(3), 1–22 (2009)
7. Kaplan, A.M., Haenlein, M.: Users of the world, unite! The challenges and opportunities of Social Media. Business Horizons 53(1), 59–68 (2010)
8. Rappaport, S.D.: Listen First!: Turning Social Media Conversations Into Business Advantage. John Wiley and Sons (2011)
9. Dey, L., Haque, S.M., Khurdiya, A., Shroff, G.: Acquiring competitive intelligence from social media. In: Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data, p. 3. ACM (2011)
10. Shroff, G., Agarwal, P., Dey, L.: Enterprise information fusion for real-time business intelligence. In: IEEE International Conference on Information Fusion (FUSION), pp. 1–8 (2011)
11. Agichtein, E., Castillo, C., Donato, D., Gionis, A., Mishne, G.: Finding high-quality content in social media. In: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 183–194. ACM (2008)
12. Cui, B., Tung, A.K., Zhang, C., Zhao, Z.: Multiple feature fusion for social media applications. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pp. 435–446. ACM (2010)
13. Lovett, T., O’Neill, E., Irwin, J., Pollington, D.: The calendar as a sensor: analysis and improvement using data fusion with social networks and location. In: Proceedings of the 12th ACM International Conference on Ubiquitous Computing, pp. 3–12. ACM (2010)
14. Kim, H., Son, J., Jang, K.: Semantic Data Fusion: from Open Data to Linked Data. In: Proceedings of the European Semantic Web Conference (2013)
15. Hanh, H.H., Tai, N.C., Duy, K.T., Dosam, H., Jason, J.J.: Semantic Information Integration with Linked Data Mashups Approaches. International Journal of Distributed Sensor Networks 2014, Article ID 813875 (2014)
16. Torre-Bastida, A.I., Villar-Rodriguez, E., Del Ser, J., Camacho, D., Gonzalez-Rodriguez, M.: On Interlinking Linked Data Sources by Using Ontology Matching Techniques and the Map-Reduce Framework. In: Corchado, E., Yin, H. (eds.) IDEAL 2014. LNCS, vol. 8669, pp. 53–60. Springer, Heidelberg (2014)
17. Sirin, E., Parsia, B., Grau, B.C., Kalyanpur, A., Katz, Y.: Pellet: A practical OWL-DL reasoner. Web Semantics: Science, Services and Agents on the World Wide Web 5(2), 51–53 (2007)