FuhSen: A Federated Hybrid Search Engine

Diego Collarana, Christoph Lange, Sören Auer, Maria-Esther Vidal, Irlán Grangel-González
University of Bonn & Fraunhofer IAIS
{collaran,langec,auer,vidal,grangel}@cs.uni-bonn.de

Abstract. A vast amount of information about various types of entities is spread across several parts of the Web, e.g., persons or organizations in the Social Web, in the Web of Documents, or in the Dark Web. End users, for example law enforcement institutions searching for traces of organized crime, require federated search engines that integrate distributed and heterogeneous information. This paper presents FuhSen, a keyword-based federated search engine that integrates and summarizes information about entities from existing Webs. The RDF-based, federated hybrid search engine implements a multi-layered architectural pattern that makes it adaptable to different domains and use cases. FuhSen interactively queries a diverse set of Web APIs to collect up-to-date data. Linked Data wrappers around these Web APIs enable on-demand, semantic integration of search results collected from different sources. The ease of integrating such wrappers into the federated architecture and the modularity of the OntoFuhSen vocabulary make FuhSen a flexible and reusable software resource. User evaluation results suggest that FuhSen is easy to learn and use, and that users could solve search tasks that could not be finished with traditional Web search engines during the evaluation study.

Keywords: Federated Search, Integration on Demand, Linked Data, RDF

1 Introduction

The more information is available on the Web, the more important efficient and effective querying, exploration, and retrieval become. For information available as plain text, Information Retrieval is a long-established research field, and a vast number of mature commercial and open implementations, such as Apache Solr, now drive large-scale applications. In the area of the Semantic Web, a number of approaches, techniques, and platforms have also been developed (e.g., [1,8]) which unify search across unstructured data (Web documents) and structured data (in RDF). However, for many applications, heterogeneous information represented in different modalities (structured, semi-structured, or unstructured) and spread across distributed data sources has to be made searchable and explorable for end users in an integrated way.

We briefly describe such a distributed and heterogeneous search application scenario in the context of crime investigation. During a crime investigation process, collecting and analyzing information from different sources is a key step performed by investigators. Although scene analysis is always required, a crime investigation process can benefit greatly from searching for information about people, products, and

Figure 1. Example. Joaquín ‘El Chapo’ Guzmán on different silos on the Web. The figure shows three profiles of the same person from different sources: "Joaquin Chapo Guzman Loera" (Born: La Tuna; Gender: Male; Description: Drug lord...), "El Chapo" (Alias: Chapo; Location: Mexico; Joined: September 2012...), and "Joaquin Archivaldo Guzman" (Works at: Illegal drug trade; Studied at: Uni. of Sinaloa; From: Los Mochis, Sinaloa...).

organizations on the Web. Typically, data collected from the following sources is used to enhance crime analysis processes: (1) The Social Web encompasses user-generated content and personal profiles. (2) The Deep Web advertises products and services offered by organizations, e.g., the eBay e-commerce platform. (3) The Web of Data includes billions of machine-comprehensible facts, which can serve as background knowledge for collecting information about different types of entities. (4) The Dark Web refers to sites accessible only with specific software, and to the restricted trading of goods through so-called darknet markets.

We propose FuhSen, a keyword-based and faceted search engine able to collect, integrate, and visualize data from all these sources on demand; thus, FuhSen can serve as a key tool for crime analysis. Figure 1 shows how data about a suspect of drug dealing, Joaquín ‘El Chapo’ Guzmán, is spread across different data sources. FuhSen exploits REST APIs provided by Web data sources to search, aggregate, and summarize information about an entity (e.g., a suspect). Using Linked Data as the core technology, the FuhSen engine is able to: (1) integrate, on demand, heterogeneous data extracted from APIs into a unified data schema using the OntoFuhSen vocabulary; (2) apply data analytics algorithms to the integrated data to measure entity similarity and to find entity links; and (3) exploit the semantics of the integrated entities for summarizing and ranking.

We have conducted a user evaluation study to analyze the efficiency and usability of FuhSen. The results suggest that FuhSen is able to speed up keyword-based search, and that it provides answers to queries that existing engines were not able to answer within the time of the study. Furthermore, users found FuhSen easy to learn and to use.
Challenges. FuhSen addresses the following design and development challenges: (i) On-demand integration: Data collected from heterogeneous sources needs to be integrated during keyword query execution time. (ii) Heterogeneity of data: Collected data in different formats, data schemas, sizes, and accessibility restrictions need to be integrated. (iii) Extensible by design: Reusability should be ensured in different use cases and scenarios, and new data sources should be added with minimal effort. (iv) Efficiency and performance: Data collection and analysis should be executed in an acceptable time. (v) Provenance: Information about the origins of collected data should be tracked. Data provenance management is especially important in the crime investigation domain. (vi) Learnable user interfaces: End users, e.g., police detectives with a low degree of IT knowledge, should be able to learn and use the system easily.

2 The FuhSen Architecture

FuhSen implements a three-layer architecture: (a) a faceted browsing UI, (b) an RDF-based search engine, and (c) an RDF wrapper layer. FuhSen uses the OntoFuhSen

Figure 2. The FuhSen Architecture. High-level architecture comprising the faceted browsing interface, an RDF-based search engine, and wrappers for distributed information sources. In the Faceted Browsing User Interface, users enter a keyword query such as the name of a person (e.g., Joaquin Chapo Guzman) and explore and filter the results. The RDF-Based Search Engine comprises five components: Expanded Search Generation generates keyword subqueries from the initial keyword query (e.g., {Joaquin Chapo Guzman, Joaquin Chapo, Chapo Guzman}); Federated Query Execution executes the subqueries against data sources, employing wrappers to Linked Data APIs (e.g., http://server/api/v1/search?query=Chapo+Guzman&rows=100&offset=0); Vocabulary-based Aggregation aggregates all subquery results using the global OntoFuhSen schema (e.g., with Apache Jena's Model.add(Model)); Entity Summarization generates a human-readable summary of each entity by combining triples (e.g., rdfs:label, rdfs:comment, and foaf:img of fs:result1); and Semantic Ranking adds a ranking score to every result item based on three ranking criteria (e.g., fs:result1 fs:rank 8.15). The RDF wrappers layer is a collection of API wrappers (for the Web of Documents, the Social Web, the Deep Web, and the Dark Web) that retrieve data using the APIs and transform the results into OntoFuhSen (e.g., fs:result1 rdf:type fs:SearchableEntity; fs:result1 rdf:type foaf:Person; fs:result1 foaf:name Joaquin Guzman).

vocabulary¹ as the core data model, which allows FuhSen to deal effectively with the heterogeneity of source data, aggregate the results in a knowledge graph to find relations between entities, and link the search results with other sources. OntoFuhSen serves as a common language between the UI, the federated search engine, and the wrapper layers. The vocabulary serves three purposes: (1) visualization of the results and facets; (2) a unified data schema on which semantic algorithms are applied to enhance the completeness of keyword-based query answers; and (3) a response format for exchanging data collected by the wrappers with the FuhSen engine. Figure 2 depicts the FuhSen architecture; its layers are described as follows.

¹ https://w3id.org/eis/vocabs/fuhsen#

Faceted Browsing UI. FuhSen users pose keyword-based queries and explore query answers using a multi-faceted browsing user interface. We chose facets as the user interface exploration pattern because they are a user-friendly mechanism for exploring and filtering a large number of search results [6]. In [2], we presented a demo of the user interface, comprising the following elements: a text box for the search query, a result list, entity summaries, and a faceted navigation component. Our choice of JSON-LD, the standard JSON encoding of Linked Data, as the messaging format avoids unnecessary data transformations for the UI components, as they use JSON natively.

RDF-based Search Engine. This layer orchestrates data extraction processes at the RDF wrapper layer, and integrates the results in an in-memory knowledge graph. A series of micro-task services are applied to aggregate and semantically enhance the graph of results. By following a micro-task service architecture, the services are loosely coupled, and each of them may evolve independently. Furthermore, new services can be plugged into the pipeline easily. FuhSen provides the following micro-task services:
(1) Sub-queries Generator: Receives an initial query string, containing one or more keywords separated by spaces, and produces a list of sequential sub-queries. By generating sub-queries, FuhSen is able to enhance the completeness of query answers without the intervention of end users. To avoid noisy results, sub-queries are restricted to a minimum of two keywords, and the ranking service identifies the most relevant query answers.
(2) Federated Query Executor: Orchestrates, in an asynchronous manner, the data extraction task in the RDF wrapper layer. Requests to the RDF wrappers are created based on the corresponding Web APIs² of the data sources, which are described in terms of OntoFuhSen, i.e., with information such as the URL of a service, its parameters, and the user's secure API key. Once a result is received, a request to aggregate it in the results knowledge graph is created and sent to the Vocabulary-based Aggregator component.
(3) Vocabulary-based Aggregator: Creates an in-memory RDF knowledge graph where all responses produced by the RDF wrappers are aggregated and described using OntoFuhSen. A vocabulary-based approach keeps the data aggregation task relatively simple.
(4) Entity Summarization: Enhances query answers with triples containing images and human-understandable textual descriptions for every entity. OntoFuhSen states the properties to be summarized for each entity according to the entity type, i.e., rdf:type.
(5) Semantic Ranking: Using the predicate fs:rank, an RDF triple with a ranking score is related to each entity in the result. Scores are calculated from three factors: (a) exact match of a keyword in the entity, (b) number of properties and relations of the entity, and (c) data source trustworthiness as described in the OntoFuhSen vocabulary.
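To illustrate how the first three micro-task services fit together, the following Python sketch may help. It is a minimal sketch under stated assumptions, not the FuhSen implementation: sub-queries are assumed to be the contiguous keyword sequences of at least two keywords (which reproduces the example {Joaquin Chapo Guzman, Joaquin Chapo, Chapo Guzman} shown in Figure 2), the wrappers are hypothetical stubs that return hand-made triples, the knowledge graph is modeled as a plain set of triples, and the names generate_subqueries, make_stub_wrapper, and federated_search are invented for this example.

```python
import asyncio
from typing import Awaitable, Callable

Triple = tuple[str, str, str]
# Assumption: a wrapper maps one keyword sub-query to a set of triples.
Wrapper = Callable[[str], Awaitable[set[Triple]]]


def generate_subqueries(query: str, min_keywords: int = 2) -> list[str]:
    """Return contiguous keyword sequences of the query, longest first."""
    keywords = query.split()
    if len(keywords) < min_keywords:
        # Assumption: a query shorter than the minimum is passed on unchanged.
        return [" ".join(keywords)] if keywords else []
    subqueries = []
    for size in range(len(keywords), min_keywords - 1, -1):
        for start in range(len(keywords) - size + 1):
            subqueries.append(" ".join(keywords[start:start + size]))
    return subqueries


def make_stub_wrapper(source: str) -> Wrapper:
    """Build a hypothetical wrapper that fabricates two triples per sub-query."""
    async def wrapper(subquery: str) -> set[Triple]:
        await asyncio.sleep(0)  # stands in for the network round trip
        entity = f"fs:{source}/{subquery.replace(' ', '_')}"
        return {
            (entity, "rdf:type", "foaf:Person"),
            (entity, "foaf:name", subquery),
        }
    return wrapper


async def federated_search(query: str, wrappers: list[Wrapper]) -> set[Triple]:
    """Send every sub-query to every wrapper and merge the answers."""
    requests = [w(sq) for w in wrappers for sq in generate_subqueries(query)]
    graph: set[Triple] = set()
    # Aggregate each response as soon as it arrives, in completion order.
    for finished in asyncio.as_completed(requests):
        graph |= await finished
    return graph


wrappers = [make_stub_wrapper("social"), make_stub_wrapper("deep")]
graph = asyncio.run(federated_search("Joaquin Chapo Guzman", wrappers))
print(generate_subqueries("Joaquin Chapo Guzman"))
print(len(graph))  # 2 wrappers x 3 sub-queries x 2 triples = 12
```

Modeling the graph as a set makes aggregation a simple union, which mirrors the remark that a vocabulary-based approach keeps aggregation relatively simple; a real deployment would use an RDF library and the OntoFuhSen terms instead.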
RDF Wrappers. This layer comprises a collection of wrappers around the data sources' APIs to exchange and extract data. The OntoFuhSen vocabulary allows for the description of the APIs (e.g., service URL address, parameters, or secure API key). An RDF wrapper implementation utilizes this description to transform a keyword query into a specific API request, and returns query answers in terms of the OntoFuhSen vocabulary. A keyword search service in the API is a prerequisite for integrating a data source into FuhSen; thus, a keyword query is simply passed on to the search API service.

The OntoFuhSen Vocabulary. OntoFuhSen³ allows for the description of user search activities, data sources, and entities in the federation (cf. Figure 3). The vocabulary is divided into the following three modules:
(1) Search engine metadata: comprises classes modeling a user search activity (e.g., fs:Search, fs:SearchableEntity). This module has been designed taking into account the provenance of resources. To enable the tracking of provenance, the PROV⁴ vocabulary is used. PROV classes have been extended to model the provenance of the information related to user search activities during a search process.

² Example of an RDF-wrapper request: https://wrapper-url/ldw/twitter/search?query=Joaquin+Guzman&numresults=100
³ https://w3id.org/eis/vocabs/fuhsen#
⁴ http://www.w3.org/ns/prov

Figure 3. An Overview of the OntoFuhSen vocabulary. The three modules of the OntoFuhSen vocabulary are depicted in different colors; main classes of each module are presented. Modules: (1) search engine metadata, (2) data sources metadata, (3) domain-specific metadata (e.g., Organized Crime). Classes and properties shown in the figure include fs:Search (fs:uid, fs:queryDate, fs:keyWord), fs:SearchableEntity (fs:title, fs:excerpt, fs:image), prov:Activity (prov:startedAtTime, prov:endedAtTime), prov:Entity (prov:generatedAtTime); fs:InformationSource, fs:SocialPlatform, fs:KnowledgeBase, fs:RelationalData, fs:Professional, fs:Media, fs:Generic, fs:PhotoSharing, fs:VideoSharing, fs:API, fs:Operation, fs:Parameter; gr:ProductOrService, fs:IllegalPoS, fs:LegalPoS, fs:Counterfeit, fs:Drug, foaf:Agent, foaf:Person (foaf:knows), fs:Suspect, fs:Victim, org:Organization, fs:Bank, and fs:SocialNetworkPage.

(2) Data sources metadata: contains classes describing the data sources' API services and access points (e.g., fs:API, fs:Parameter, fs:Operation). These classes model the APIs and services from which the data is extracted (e.g., Facebook or Twitter).
(3) Domain-specific metadata: includes classes for describing the results collected by FuhSen during keyword query processing. For the crime domain, concepts include gr:ProductOrService and org:Organization.

Reusing existing terms is considered a best practice in vocabulary engineering [3]. Based on this principle, we built some of the concepts of the FuhSen vocabulary by utilizing existing well-known ontologies, e.g., terms from FOAF, GoodRelations, and the Organization Ontology⁵.

Reusability. Each layer of FuhSen can be replaced with domain-specific implementations, i.e., FuhSen can be tailored for different domains. Similarly, micro-task services inside the engine layer can be adjusted, modified, or extended to improve the final results. Typical reuse scenarios include the replacement of the user interface layer, or of micro-task services in the RDF-based search engine. The FuhSen GitHub project page⁶ provides video tutorials as well as class and interaction diagrams to facilitate the understanding of the FuhSen architecture. A new data source is added to the federation in two steps:
Step 1: Extending the vocabulary. (a) Add a new subclass of fs:SearchableEntity to the domain-specific module. (b) Describe a Web API service using the OntoFuhSen vocabulary, and relate this service description to the new class.
Step 2: Adding a new wrapper. (a) Create a new RDF wrapper class. (b) Implement the interfaces RestApiWrapperTrait and SilkTransformableTrait to plug the wrapper into the federation. (c) Define the transformation mappings.
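The request-building part of an RDF wrapper, i.e., transforming a keyword query into a specific API request from a description of the API, might be sketched in Python as follows. This is a hedged illustration only: the class ApiDescription and the function build_request_url are hypothetical names, the description is a plain data class rather than an OntoFuhSen RDF description, the interfaces RestApiWrapperTrait and SilkTransformableTrait are not modeled, and the URL and parameter names are taken from the example wrapper request given in the footnote.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class ApiDescription:
    """Hypothetical stand-in for an API description expressed in OntoFuhSen."""
    service_url: str   # URL of the keyword search service
    query_param: str   # name of the parameter carrying the keyword query
    # Further parameters, e.g. result limits or an API key.
    fixed_params: dict = field(default_factory=dict)


def build_request_url(api: ApiDescription, keyword_query: str) -> str:
    """Turn a keyword query into a concrete request URL for one API."""
    params = {api.query_param: keyword_query, **api.fixed_params}
    # urlencode escapes reserved characters and encodes spaces as '+'.
    return f"{api.service_url}?{urlencode(params)}"


twitter_wrapper = ApiDescription(
    service_url="https://wrapper-url/ldw/twitter/search",
    query_param="query",
    fixed_params={"numresults": 100},
)
print(build_request_url(twitter_wrapper, "Joaquin Guzman"))
```

Keeping the service URL and parameters in a description object, rather than hard-coding them in the wrapper, reflects the idea that adding a data source mainly requires describing its API and defining the transformation mappings.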

3 Evaluation

We conducted a user evaluation study to validate the following hypotheses, phrased as questions: (1) Are end users able to execute keyword-based queries more efficiently using FuhSen rather than

⁵ http://xmlns.com/foaf/spec/, http://purl.org/goodrelations/v1, http://www.w3.org/ns/org
⁶ https://github.com/LiDaKrA/FuhSen-reactive

Figure 4. User Evaluation results. (a) Task execution time (average in secs.); (b) comparison of execution time on FuhSen and traditional engines; (c) average of scores of usability tests. The time panels plot time in seconds for Task 1, Task 2, and Task 3, with series for conventional search engines and FuhSen and per-user values (User 1 to User 5). Panel (c) rates the statements "I liked the user interface", "Easy to find the information…", "Overall system satisfaction", "Easy to learn", "Simple to use", and "I could become productive…" on a scale from 1 (strongly agree) to 7 (strongly disagree).

universal search engines, e.g., Google? (2) Is the FuhSen user interface simpler and more pleasant to use than the interfaces of universal search engines? We used a formative evaluation technique and a usability evaluation in a controlled environment. We selected 10 users with high expertise in Web search engines. A moderator introduced the experiments to the participants, controlled the task execution time, and provided a usability survey to be filled out anonymously.

Formative evaluation: Method. To measure the quality of FuhSen and validate our research hypotheses, we measured the execution time of three tasks; ten users were included in the evaluation, as suggested by Xu and Mease [9]. Task 1: Find a person named “John Smith Allegro”, who is 33 years old and lives in Bonn, Germany. Task 2: Find yourself. Task 3: Find offers of a used Nexus 4 in the United States. We instructed users to stop when they considered that they had invested enough effort to accomplish the task. In the longest case, a participant took 5 minutes to complete a task.

Results. Five users used search tools such as Google or Bing, while the others used FuhSen. The gold standard for Task 1 was built from the Google+ account of John Smith Allegro, while an eBay offer for a Nexus 4 was created and used as the ground truth of Task 3. Information about the evaluation participants themselves was used as the gold standard of Task 2. Figure 4 (a) reports the average task execution time (in secs.) during our user evaluation.

Discussion. We observed that no user was able to complete Task 1 using a conventional search engine, whereas all participants were able to find John Smith Allegro with FuhSen. We hypothesize that the ranking algorithms used by conventional search engines prevented the completion of Task 1 in time. Only one person could not complete Task 2 using FuhSen; from the post-study questionnaire, we learned that this person did not have an account with any of the information sources that are part of our current prototype.
Similarly, only one person could not complete Task 3 using FuhSen. A possible explanation might be that the participant employed several tricks learned from the use of conventional search engines. Figure 4 (b) shows the time participants needed to complete the tasks using FuhSen. The maximum time to complete Task 1 using FuhSen was one minute, which seems an acceptable value. Participants completed Task 2 faster using conventional search engines. An explanation could be that the participants knew beforehand exactly which keyword combinations would lead them to the expected results. The results of Task 3 suggest that FuhSen helps to find specific information about entities faster than conventional search engines.

Usability Evaluation: Method. This evaluation was performed with those participants who used FuhSen. Two techniques were used during this evaluation: think-aloud protocols and a Post-Study System Usability Questionnaire (PSSUQ) [5].

Results. Figure 4 (c) summarizes the results, which indicate that the FuhSen user interface received high scores in all aspects. This suggests that the design decisions for the user interaction implemented in FuhSen were sound.

Discussion. One of the main usability problems found concerns the filters: users did not realize they were filters at first use, but only after exploring the interface. Another relevant observation is that users tend to apply keyword search tricks learned from using conventional search engines, such as searching for John Smith Allegro Facebook or Bonn Germany John Smith. This practice should be taken into consideration for further improvements of the user interface and query expansion.

4 Related Software

Law enforcement organizations demand more intelligent software to support their work. Therefore, efforts are being made both in academia and in industry to build innovative crime analysis software. In [7], the DIG system builds a knowledge graph to combat human trafficking by crawling web sites with escort ads. In [4], a crime investigation tool is presented that focuses solely on online social networks. Maltego⁷ is an open source forensics application. It offers mining of information as well as visualization tools to determine the relationships between entities such as people, companies, or websites. Finally, Poderopedia⁸ is an initiative to promote transparency of power control in South America. It builds a knowledge graph of people and the power they have in the continent by registering their relations with organizations and other people. Journalists and contributors manually add entities and relations to the knowledge graph. In contrast to these tools, FuhSen creates a knowledge graph on demand when a keyword query is entered; results are built by integrating data collected from search APIs (e.g., Facebook or Twitter). In addition, the obtained results are enriched with semantic metadata.

5 Conclusions

In this article, we presented the foundations of a novel federated, RDF-based hybrid search engine, from the challenges to a comprehensive architecture and evaluation. The implementation of FuhSen is open source and is designed for reuse and extensibility. Our federated, vocabulary-based hybrid search concept employs a novel architectural pattern, incorporating elements from universal search, semantic integration, and multi-modal search and retrieval. Although we initially focus on the criminal investigation domain, we believe that there are numerous further use cases, e.g., in e-commerce (such as price comparison). Consequently, FuhSen is designed in a generic, modular, and flexible way so that it can be adapted to scenarios beyond crime investigation. Through its clear interface definitions, the flexible vocabulary-based integration models, and the modular architecture, new information sources can be plugged in with minimal effort and the platform can be tailored towards new application domains. When applied more widely, this federated search approach can contribute to realizing novel applications and business models that were previously prevented by the prohibitive cost of full data integration. We have shown the relevance of this work in crime analysis, where FuhSen is constantly being evaluated by domain experts to maximize its applicability. Based on our evaluation, we conclude that the platform has promising potential to help criminal investigators, especially in the search for information.

⁷ https://www.paterva.com/
⁸ http://www.poderopedia.org/

References

1. Bhagdev, R., Chapman, S., Ciravegna, F., Lanfranchi, V., Petrelli, D.: Hybrid search: Effectively combining keywords and semantic searches. In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC. Lecture Notes in Computer Science, vol. 5021, pp. 554–568. Springer (2008)
2. Collarana, D., Lange, C., Auer, S.: FuhSen: A platform for federated, RDF-based hybrid search. In: Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11-15, 2016, Companion Volume. pp. 171–174 (2016)
3. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology, Morgan & Claypool, San Rafael, CA, 1 edn. (2011), http://linkeddatabook.com
4. Huber, M.: Social snapshot framework: Crime investigation on online social networks. ERCIM News 2012(90) (2012)
5. Lewis, J.R.: IBM computer usability satisfaction questionnaires: psychometric evaluation and instructions for use. International Journal of Human-Computer Interaction 7(1), 57–78 (1995)
6. Simonini, G., Zhu, S.: Big data exploration with faceted browsing. In: Int. Conf. on High Performance Computing & Simulation (HPCS). IEEE (2015)
7. Szekely, P.A., Knoblock, C.A., Slepicka, J., et al.: Building and using a knowledge graph to combat human trafficking. In: The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II. pp. 205–221 (2015)
8. Usbeck, R., Ngonga Ngomo, A.C., Bühmann, L., Unger, C.: HAWK – hybrid Question Answering over linked data. In: Gandon, F., Sabou, M., Sack, H., d’Amato, C., Cudré-Mauroux, P., Zimmermann, A. (eds.) ESWC. No. 9088 in LNCS, Springer Verlag, Heidelberg (2015), http://svn.aksw.org/papers/2015/ESWC_HAWK/public.pdf
9. Xu, Y., Mease, D.: Evaluating web search using task completion time. In: ACM SIGIR conference on Research and development in information retrieval. pp. 676–677. ACM (2009)