Communication, Management and Information Technology – Sampaio de Alencar (Ed.) © 2017 Taylor & Francis Group, London, ISBN 978-1-138-02972-9
Towards an approach based on Hadoop to improve and organize online search results in Big Data environment

K. Aoulad Abdelouarit
Information Technology and Modeling Systems Research Unit, Computer Science, Operational Research and Applied Statistics Laboratory, Abdelmalek Essaadi University, Tetuan, Morocco

B. Sbihi
Information Technology and Modeling Systems Research Unit, Computer Science, Operational Research and Applied Statistics Laboratory, I-School ESI, Rabat, Morocco

N. Aknin
Information Technology and Modeling Systems Research Unit, Computer Science, Operational Research and Applied Statistics Laboratory, Abdelmalek Essaadi University, Tetuan, Morocco
ABSTRACT: In this article we study the technical specifications required for the proper conduct of the online search process in a Big Data environment, with the intention of evaluating the consistency of collected data and identifying opportunities to improve and organize search results through a conceptual model that can make them well presentable and their information easily consumable in the future. The volume of online data has increased dramatically, but the quality of the information carried by these data and its form of presentation have clearly deteriorated. This is mainly due to the fact that the majority of the data generated on the Web is represented in an unstructured form, which prevents traditional search engines from effectively meeting the information needs expressed by users or applications. It is in this context that we propose to design a technique that processes massive and unstructured data to improve and organize online search results. Our solution is based on the combination of three systems: Hadoop, Lucene and Solr. With this solution, massive and unstructured data can be taken from the Big Data layer and structured by the Hadoop technique, indexed by the Lucene engine, and finally organized so that their information is accessible for online search through the Solr framework.

Keywords: Big Data, Online Search, Unstructured Data, Hadoop

1 INTRODUCTION
With the emergence of Web 2.0, a new vision of the Web was created by considering the user as a potential producer of information and not just a consumer (Sbihi et al. 2010). This radical change has significantly increased the amount of Internet data, a phenomenon known as Big Data. The data of the Big Data phenomenon represent the largest portion of data on the Internet. This mass of data that occupies our daily life does not cease to increase and requires advanced ways to capture, communicate, aggregate, store and analyze it (Matei 2014). Blogs, social networks, wikis, etc., are among the reasons for the large amount of data on the Internet. This directly impacts online search systems: every user launches a query to obtain a particular result, but since the data come from multiple sources, the result becomes large and rich (Gayathri et al. 2013). The Big Data phenomenon has made possible the development of highly capable online search engines. The web pages generated by search engines are based on search terms and require sophisticated algorithms and the ability to handle a huge number of requests (Lakhani et al. 2015). In their Internet search, users often rely on the top items of the results page returned by search engines and rarely go beyond the first few pages, even though the results shown first are not necessarily more relevant than results on later pages. Currently, search engines offer automated query suggestions to assist users in their information search, but the
suggested queries are often out of context and based on popular searches rather than on the specific information needs of the user (Leeder et al. 2016). This leads us to ask: how do we customize search results in such a way that users can gain more benefit while searching online for information? On the other hand, the massive data returned by online search engines are not necessarily textual and do not always come in a structured format, which makes them difficult for users and applications to use and prevents them from benefiting from the wealth of their information (Aoulad Abdelouarit et al. 2015). So we can make our question more specific: how do we deal with the massive and unstructured data returned by search engines to improve their presentation and consumption by users? Also, is there any concrete model to represent this kind of data, which is not necessarily in text format? This article follows up on the problem of the massiveness and lack of structure of the data generated by the Big Data phenomenon, which we exposed in our previous work (Aoulad Abdelouarit et al. 2015). As we said, this problem negatively impacts the online search process. Thus, we study the technical specifications required for the success of this process in the Big Data environment. Our solution is based on the integration of three systems:
− a system for processing massive and heterogeneous data generated from the Big Data layer;
− a framework that offers searching on massive data;
− a tool serving as an indexing engine for the data.
Section 2 presents the state of online search in the Big Data environment by exposing the problem of the massive and unstructured data of the Web and its impact on online search results; Section 3 presents the concepts and approaches of existing solutions for processing massive and unstructured data. We then present in Section 4 the proposed solution based on the Hadoop technique integrated with the Solr search system, which tends towards the customization of search results for better organization and improved use. The last section presents a general conclusion putting forward a series of perspectives.
2 THE ONLINE SEARCH IN BIG DATA ENVIRONMENT

2.1 Executing the online search process in the Big Data environment

The scope and volume of information on the Web require good search skills, such as the ability to formulate relevant keywords to find the information. However, most users are unable to narrow down the subjects of their search and are overwhelmed by the amount of results provided by search engines, especially when they do not have the skills or resources to access and manage this information intelligently. Furthermore, online users use keywords and very simple terms for their search, and they assume that search engines will understand their queries. The majority of online search users do not behave strategically in their search; they wait for the search engine to find the answer for them rather than following a strategy of their own (Leeder et al. 2016). Indeed, search engines have developed greatly since the advent of the Big Data phenomenon. The effectiveness of the search for information, especially on the Web, is particularly related to expert use of the search engine system, including knowledge of procedures and online research tools, the strategies to use in the search for information, and the ability to quickly and accurately assess the content quality and the credibility of the data and information returned. The significant growth of information on the Internet requires more effective search tools that can distinguish relevant information from hundreds or even thousands of raw data items. However, the quality of the results provided by traditional search engines is not always appropriate, particularly when the user request becomes increasingly complex (Aoulad Abdelouarit et al. 2015). Figure 1 shows the model of an online search engine in the Big Data environment. As shown in this figure, the user accesses the online search page via the web browser, enters keywords and submits the search form. The search engine intercepts the user request and starts searching the Internet data based on the keywords entered. The collected data are referenced, indexed and ordered before being presented to the user on the results page.
Figure 1. Online search model in Big Data environment.
The parsed data include all Internet data: websites, social networks, user-generated data, and external data sources other than Web data. This great mass of heterogeneous data negatively impacts the user of online search and makes the search process difficult to succeed. Thus, a support system for online information search, built on the raw data of the Big Data phenomenon, would represent a major achievement. To do this, the need for a conceptual model to represent and process this type of data, which is not necessarily textual, becomes essential.

2.2 The problem of using massive and unstructured data generated from Big Data
The massive and unstructured data on the Internet are growing exponentially: if 20% of the data available on the Web is structured, the other 80% is unstructured (Goher et al. 2016). Unstructured data means that the elements within the data have no structure and do not follow a specific or universal format manifesting the information they carry. Unstructured data can include content that is similar or identical to corresponding structured data but is not organized so that it is easily consumable, presentable, or usable by an application or user (Donneau-Golencer et al. 2016). Table 1 shows the different categories of unstructured data that circulate on the Web, with examples for each category. As presented in this table, the information circulating on the Internet is manifested in several types (image, video, text, etc.) and comes from different sources (satellites, websites, social networks, etc.). While unstructured data do not take any form of organization, their information is so valuable that companies and researchers are constantly trying to find new ways to exploit it. Unstructured data can also contain digital information and factual details that may serve as a potential source of information (Nikhil et al. 2015).
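By way of illustration, the following minimal Java sketch (our own example, not drawn from the cited works) shows how simple pattern-based extraction can pull such factual details, here ISO dates and e-mail addresses, out of raw unstructured text; the sample text and patterns are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FactExtractor {

    // Hypothetical patterns for two kinds of factual details
    // often buried in unstructured text.
    private static final Pattern EMAIL =
        Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern DATE =
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    /** Collects every match of a pattern found in the raw text. */
    static List<String> extract(Pattern p, String rawText) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(rawText);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String raw = "Survey logged 2016-11-22, contact: analyst@example.com.";
        System.out.println("Dates:  " + extract(DATE, raw));   // [2016-11-22]
        System.out.println("Emails: " + extract(EMAIL, raw));  // [analyst@example.com]
    }
}
```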
3 PROCESSING MASSIVE AND UNSTRUCTURED DATA TO IMPROVE AND ORGANIZE ONLINE SEARCH RESULTS

3.1 How to make massive and unstructured data consumable by users and applications?
Unstructured data cannot be processed effectively in their raw format. Thus, information extraction techniques have been widely applied to extract important, structured and manageable data from the original unstructured data. Indeed, unstructured data are processed by solutions that extract structured data from them.

Table 1. Categories of unstructured data generated from the Big Data layer.

Data category: Examples
Satellite images: weather data, satellite surveillance imagery, etc.
Scientific data: seismic imagery, atmospheric data, and high-energy physics.
Photographs/video: security, surveillance, and traffic video.
Radar or sonar data: vehicular, meteorological, and oceanographic seismic profiles.
Text internal to the company: documents, logs, survey results and e-mails.
Social media data: YouTube, Facebook, Twitter, LinkedIn, Flickr, etc.
Mobile data: text messages and location information.
Website content: any site delivering unstructured content, like wikis, YouTube, or Instagram.

The extracted data provide summaries and sketches of the original unstructured data. Some information is inevitably lost after data reduction; however, in many applications the unstructured data are so large that a small summary can be precise enough to meet the needs of the analysis and processing of the data (Chen et al. 2013). With the variety, velocity and volume of data circulating on the Web, it has become increasingly difficult to find patterns that lead to meaningful conclusions based on these data. Thus, the processing of massive and unstructured data may be performed in several steps:
Integration and cleansing: Data integration is the process of standardizing the data definitions and data structures of multiple data sources by using a common schema, thereby providing a unified view of the data. Data cleansing is the process of detecting, correcting or removing incomplete, incorrect, inaccurate, irrelevant, out-of-date, corrupt, redundant, incorrectly formatted, duplicate or inconsistent records from a record set, table or database (a minimal code sketch follows below). Data cleansing is considered a major challenge of the Big Data era because of the increasing volume, velocity and variety of data in many applications.
Reduction: This consists in reducing the countless amounts of data to their significant parts. It is the transformation of masses of data or information, usually empirically or experimentally derived, into a corrected, ordered and simplified form, a kind of summarized report. Big Data opens new opportunities and challenges for these techniques, which have been well studied in the past and include the machine-learning approach as a possible
way to improve traditional techniques for reducing data in order to process massive data.
Query and indexing: Indexing is the sorting of a number of records on multiple fields, allowing searches and queries to be performed on them. In the Big Data era, methods of indexing and searching designed for small structured data sets are no longer adequate. The tree structure is very popular in traditional indexing, but in the field of large volumes of data this approach does not work well for providing simultaneous read and write operations without bottlenecks in the data structure. An effective technical method to improve data querying and indexing over Big Data is therefore essential.
Analysis and exploitation: Data analysis is the process of inspecting and modeling data with the goal of discovering useful information. Exploitation means gaining valuable and actionable insights from large and ever more complex data. However, considering the heterogeneity and massiveness of data generated from Big Data, analysis and exploitation become very difficult. Typical relational database technologies meet many difficulties when used to address the challenges of deep analysis of massive data, because of their limited ability to scale.
Figure 2 shows the overall shape of the solution just described, involving the processing of massive and unstructured data generated by the Big Data layer so that they are easily consumable and presentable to users and applications. As shown in this figure, the massive and unstructured data generated from the Big Data layer must first undergo integration and cleansing. Then, the reduction technique avoids having to query and index the full mass of data retrieved from the Big Data layer.
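To make the cleansing and reduction steps concrete, here is a minimal Java sketch (our own illustration under simplifying assumptions; the records are hypothetical and Java 9+ is assumed for List.of). It trims, normalizes and deduplicates raw records, then reduces them to per-term counts as a tiny stand-in for the summaries discussed above.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CleanseAndReduce {

    public static void main(String[] args) {
        // Hypothetical raw records: duplicated, inconsistently cased, padded.
        List<String> raw = List.of(" Hadoop ", "hadoop", "SOLR", "solr ", "", "Lucene");

        // Cleansing: trim whitespace, normalize case, drop empty records,
        // remove duplicates.
        List<String> cleansed = raw.stream()
            .map(String::trim)
            .map(String::toLowerCase)
            .filter(s -> !s.isEmpty())
            .distinct()
            .collect(Collectors.toList());

        // Reduction: summarize the normalized records as per-term counts,
        // a small summary standing in for the full record set.
        Map<String, Long> summary = raw.stream()
            .map(String::trim)
            .map(String::toLowerCase)
            .filter(s -> !s.isEmpty())
            .collect(Collectors.groupingBy(s -> s, Collectors.counting()));

        System.out.println(cleansed); // [hadoop, solr, lucene]
        System.out.println(summary);  // e.g. {hadoop=2, solr=2, lucene=1} (order not guaranteed)
    }
}
```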
Table 2. Processing tools for massive and unstructured data.

Hadoop: Apache Hadoop is a framework for distributed storage and processing of large sets of structured and unstructured data. It is an open-source project based on Google's work and uses the MapReduce algorithm for data processing.

MapReduce: The basis of the Hadoop technology, combining the two techniques Map and Reduce. In the Map() step, the master node takes the input data and divides it into small parts that are distributed to different nodes. On each node a recursive operation is performed, which leads to a tree structure with multiple levels; the result of the processing is returned to the master node. In the Reduce() step, the master node collects the solutions from all the worker nodes and merges them together to form the output result.

NoSQL: Allows the manipulation of data without a previously defined schema, that is, semi-structured or unstructured data. This system avoids schema definition and the reloading of new data after cleaning. Its main features are: horizontal scaling of the system to perform simple distributed operations, data partitioning and replication on multiple servers, a more flexible concurrency model than traditional transactions, distributed indexing and in-memory storage, and easy changes to the data structure.

Machine learning: Allows the system to analyze hundreds of variables simultaneously, with their interconnections, to form patterns. It is well suited to complex problems involving multiple variables and works very well with large and unstructured data, including images, text, audio, sensor data, etc. However, this approach is limited by its rules and does not provide options to represent the processed data well.
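To ground the Map()/Reduce() description in the table, the sketch below shows the canonical word-count job written against the Hadoop MapReduce Java API. It is a minimal illustration, not the system built in this paper; the input and output paths are hypothetical.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: split each input line into words and emit (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/bigdata/in"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/bigdata/out")); // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```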
Figure 2. Processing solution for massive and unstructured data generated by the Big Data layer.
Finally, the analysis and exploitation of information will focus on a smaller part of the data, which will take less processing time and provide more relevant information.
3.2 Towards a conceptual model for processing massive and unstructured data
Among the major challenges of the Big Data phenomenon is the treatment and representation of its massive and heterogeneous data. Since it is no longer a question of processing data in a standard format and using generic models to represent their content, the trend today is the use of data representations that promote the rapid handling, storage and recovery of raw, distributed and heterogeneous data (Adiba et al. 2016). The real challenge is not to capture this mass of various data, but to analyze these data and exploit the most valuable pieces of information. Table 2 shows the most used techniques for the treatment of massive and unstructured data. Beyond the comparison in that table, we can say that NoSQL systems provide overly simple indexing strategies relative to RDBMSs. They encourage the programming of queries using the MapReduce model, unlike the declarative and optimized approach of relational DBMSs. In contrast, Hadoop can process large amounts of data directly, without defining a schema as relational DBMSs do, using the MapReduce technique (Adiba et al. 2016). Thus, we can say that the most appropriate model for dealing with the massive, heterogeneous and unstructured data resulting from online search is implemented by the Hadoop-MapReduce technique. This technique allows capturing and storing huge amounts of unstructured data in their native format.

4 TOWARDS AN APPROACH BASED ON HADOOP TO IMPROVE AND ORGANIZE MASSIVE AND UNSTRUCTURED DATA

4.1 The use of a system based on Hadoop for online search in the Big Data environment

Today, the major challenges of online search engines lie in comprehension capacity, speed and accuracy in the collection and evaluation of the search results requested by the user. Most search engines are based on the PageRank algorithm to assess the significance of the sites covered; this greatly enhances search engine accuracy, but the value of the rendered content does not always match the needs of the user. In addition, today's search engines need to deal with the huge amounts of data and complex calculations that emerge on the Internet daily. The Hadoop technology alone does not provide a complete data searching system, but it is a distributed system programming tool that essentially consists of two parts: the distributed MapReduce model and the distributed file system HDFS (Chen et al. 2014). Consequently, we can rework the architecture of the complete online search system from Figure 1 by replacing the conventional data processing layer of the search engine with the Hadoop layer for massive data processing. Figure 3 below shows the new model architecture, implemented with the Hadoop-MapReduce technique to treat the massive and heterogeneous data, and the interfacing of this model with the online search process. This figure represents the typical architecture of a fully integrated online search system for the interactive exploration of massive, heterogeneous data coming from several sources: website clickstreams, user-generated content, content management systems, external Web content, etc. This solution is based on the Hadoop Framework. The Big Data sources are stored in the Hadoop Distributed File System (HDFS). Using the MapReduce technique, these data are processed, reduced and indexed to make them well suited for search, and therefore provide a better platform for data mining than relational databases. The system should allow natural-language search using keywords and interactive navigation through its interface, without additional training or advanced knowledge of programming. Its main features should include scalable storage in HDFS and batch indexing via MapReduce to create a scalable index of the data stored in HDFS. The system should also allow real-time indexing, making data searchable as soon as it is collected and stored.
Figure 3. Integrating Hadoop-MapReduce with the online search system.
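As a small illustration of the HDFS side of this architecture, the following Java sketch reads one raw input file from the cluster before any MapReduce processing takes place; the namenode address and file path are hypothetical, and this is only a minimal example, not the paper's implementation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address for the cluster storing the Big Data sources.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path to one raw, still unstructured input file.
        Path file = new Path("/bigdata/raw/clickstream-0001.log");
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // raw records, prior to map-side transformation
            }
        }
    }
}
```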
4.2 The integration of online search with the Hadoop technique

The Hadoop Framework is normally used to process massive and heterogeneous data. It can perform various operations such as data analysis, analysis of results, etc. However, Hadoop can be combined with other techniques to offer an online search system to explore massive and disparate data coming from multiple Internet sources, including structured and unstructured data. Table 3 presents a set of solutions based on the Hadoop technology that offer an online search system in an environment of massive and heterogeneous data. According to this comparative table of solutions, to meet our need for improving online search in the Big Data environment, the best solution is to combine the Hadoop technology, for the storage and processing of massive and heterogeneous data, with a search framework such as Solr and a data indexing engine such as Lucene. Moreover, specific development is needed in the presentation layer of this system to meet the ergonomic and formatting needs of the search results returned by the online search system.

Table 3. Online search solutions based on Hadoop.
Solr: An open-source search platform based on the Apache Lucene project. It includes full-text search, using Lucene as its library for full-text indexing and search. It provides distributed indexing, replication and load-balanced querying, automated failover and recovery, and centralized configuration.

Lucene: An open-source Apache project that offers a text search engine library. It includes indexing, ranked searching, powerful query types such as phrase queries, wildcard queries, proximity queries and range queries, and fielded searching on fields such as title, author and contents.

Elasticsearch: An open-source, distributed, real-time search and analytics engine. It uses Lucene internally to build its distributed search and analytics capabilities. Elasticsearch clusters detect and remove failed nodes and reorganize themselves to ensure that the data is safe and accessible.

Nutch: An open-source web search engine based on Lucene for the search and index component. It is a highly extensible and scalable web crawler software project.

Cloudera: An open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), targeting enterprise-class deployments of that technology.
Figure 4. Integrating Hadoop-MapReduce with the Solr search framework.
Figure 4 shows the architecture of the new solution, with the use of the Solr framework, via its search interface, for searching the data generated by the Big Data layer and processed by the Hadoop technology. As presented in this figure, data coming from Big Data are intercepted by the Hadoop layer, which implements both a distributed file system (HDFS) and an execution layer that supports the MapReduce programming model. Thus, data are loaded and transformed during the map phase, then combined and saved during the reduce phase to write out Lucene indexes. The Lucene layer reads the stored data from HDFS and stores them using a Lucene scheme, which in turn saves the records as Lucene documents in an index. Once all of the files are indexed at the Lucene layer, we can perform queries against them. At the Solr layer, we need to create a schema that matches the index generated from the Lucene layer.
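The following minimal Java sketch illustrates the Lucene layer just described, writing one record as a Lucene document into an index. The Lucene 5+ API is assumed; a local directory stands in for the HDFS-backed storage, and the field names are hypothetical.

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneIndexer {
    public static void main(String[] args) throws Exception {
        // A local directory stands in for the HDFS-backed index of the paper.
        Directory dir = FSDirectory.open(Paths.get("/tmp/bigdata-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // One record becomes one Lucene document; field names are hypothetical.
            Document doc = new Document();
            doc.add(new StringField("id", "rec-0001", Field.Store.YES));
            doc.add(new TextField("content",
                "raw unstructured text extracted from a web source", Field.Store.YES));
            writer.addDocument(doc);
        } // commit and close happen here
    }
}
```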
4.3 Results and discussion
We can directly access the Solr admin console by pointing our browser at http://<host>:8983/solr/admin, and from there we can run queries against Solr, as Figure 5 shows. As shown in this figure, the response is in JSON format by default, and in this example it found 17 matches in 0 ms for products that have the property inStock=true. The combination of Hadoop and Solr makes it easy to explore a lot of data and then deliver the results quickly via a flexible search interface. Solr supports multiple query styles; we can say that it can stand in for a NoSQL system in place of traditional databases, especially when the data size exceeds what is reasonable for a typical RDBMS. However, Solr has some limitations:
− updating the Lucene index generates a new segment, which impacts performance;
− the replication feature is not yet supported in its cloud mode (SolrCloud);
− many SQL queries cannot be easily expressed as Solr queries.
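The same inStock:true query can also be issued programmatically. Below is a minimal SolrJ sketch; the SolrJ 6+ client API is assumed, and the core name and URL are hypothetical, chosen to mirror the admin-console example above.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core; mirrors the admin-console example above.
        try (HttpSolrClient client =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/products").build()) {
            SolrQuery query = new SolrQuery("inStock:true");
            query.setRows(10); // first page of results

            QueryResponse response = client.query(query);
            System.out.println("Found: " + response.getResults().getNumFound());
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```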
Figure 5. Executing queries from the Solr admin console.
5 CONCLUSION AND FUTURE WORK
The growth of unstructured data is a particular challenge of Big Data, in addition to a volume and diversity of data types that are beyond the capabilities of older technologies such as relational databases. Companies and researchers are constantly exploring the next generation of technologies for the analysis of such data. One of the most promising is the Apache Hadoop technology with MapReduce for managing massive and heterogeneous data. Thus, to improve the online search results used in this massive, heterogeneous data environment, we must design a system based on the Hadoop-MapReduce technique. As we have already mentioned, Apache offers several solutions in this context, including the Solr system, Cloudera, Nutch, etc. In this paper, we proposed the Solr search framework combined with the Lucene indexing engine to implement online search in the Big Data environment. As a perspective of this work, we intend to implement the usage scenario of online search by integrating one of the solutions already mentioned. Then, in a second step, we will study the possibility of integrating our solution to improve teaching and scientific research for learners in an e-learning environment, which was the subject of our previous study within Abdelmalek Essaadi University (Aoulad Abdelouarit et al. 2015).
REFERENCES

Adiba, Michel, Castrejon-Castillo, Juan-Carlos & Espinosa-Oviedo, Javier Alfonso et al. 2016. Big Data Management Challenges, Approaches, Tools and their limitations. Networking for Big Data.
Aoulad Abdelouarit, Karim, Sbihi, Boubker & Aknin, Noura. 2015. Big-Learn: Towards a Tool Based on Big Data to Improve Research in an E-Learning Environment. International Journal of Advanced Computer Science and Applications (IJACSA) 6(10): 59–63.
Chen, Jinchuan, Chen, Yueguo & Du, Xiaoyong et al. 2013. Big data challenge: a data management perspective. Frontiers of Computer Science 7(2): 157–164.
Chen, Ning & Chai, Xiangyang. 2014. Investigation on Hadoop-based Distributed Search Engine. Journal of Software Engineering 8(3): 127–131.
Donneau-Golencer, Thierry & Nitz, Kenneth C. 2016. Extracting and leveraging knowledge from unstructured data. U.S. Patent No 9,245,010.
Gayathri, J. & Saraswathi, K. 2013. Extraction of Data from Streaming Database. International Journal of Computer Trends and Technology (IJCTT) 4(10).
Goher, S. ZerAfshan, Javed, Barkha & Bloodsworth, Peter. 2016. A Survey of Cloud-Based Services Leveraged by Big Data Applications. Managing and Processing Big Data in Cloud Computing: 121.
Lakhani, Ajeet, Gupta, Ashish & Chandrasekaran, K. 2015. IntelliSearch: A search engine based on Big Data analytics integrated with crowdsourcing and category-based search. Circuit, Power and Computing Technologies (ICCPCT), 2015 International Conference on. IEEE: 1–6.
Leeder, Chris & Shah, Chirag. 2016. Measuring the Effect of Virtual Librarian Intervention on Student Online Search. The Journal of Academic Librarianship 42(1): 2–7.
Matei, Laura. 2014. Big Data Issues: Performance, Scalability, Availability. Journal of Mobile, Embedded and Distributed Systems 6(1): 1–10.
Nikhil, R., Tikoo, N., Kurle, S., Pisupati, H. S. & Prasad, G. R. 2015. A survey on text mining and sentiment analysis for unstructured web data. Journal of Emerging Technologies and Innovative Research (JETIR) 2(4).
Sbihi, Boubker, El Kadiri, Kamal Eddine & Aknin, Noura. 2010. Towards a participatory E-learning 2.0: A new E-learning focused on learners and validation of the content. arXiv preprint arXiv:1001.4738.