Hierarchical Multi-Log Cloud-Based Search Engine

Ajitpal Singh and Horacio González-Vélez
National College of Ireland, Cloud Competency Centre, IFSC, Dublin 1, Ireland. W: www.ncirl.ie/cloud

Abstract—Having become the leading trend in IT infrastructure, service delivery, and multi-layered resource sharing, cloud services typically include SaaS (Software as a Service), PaaS (Platform as a Service), and IaaS (Infrastructure as a Service). With the increasing popularity of cloud computing, users store large amounts of data as documents, text files, databases and, more relevant to this work, system logs. Current cloud services are increasingly decoupled, with each layer in the cloud stack generating different logs for networks, applications, databases, and programming interfaces on different machines. At any point in time, cloud providers, users, or application developers arguably need to understand the status of different components, monitor business processes, and analyse machine logs in real time. However, there are no specialised search engines for the systematic analysis of logs across different cloud providers. Hence, this paper presents Simha, an agent-based document search service for cloud platforms. It implements a proof-of-concept system to analyse user documents, logs, and folders in real time from different virtual machines. Based on an Elasticsearch server, our search process distributively searches data stored in the cloud, so we propose an application which looks for data in both private and public clouds. In this paper, we describe its design and implementation. We have obtained encouraging initial results, and we further discuss how to extend our scheme in several ways.

Keywords—BigData; Cloud Computing; Hybrid Cloud; ElasticSearch; Log; Search Engine; Data

I. INTRODUCTION

Cloud computing has brought major advancements and innovations in the ICT services industry. In a very short period of time, the term cloud computing has defined a new style of computing in which resources are easily virtualised, providing scalable services that can be accessed over the network [1]. Nonetheless, in addition to the multiplicity of applicative data sources collected on clouds, efficiently running any application service in "The Cloud" requires insight into the on-going system processes and workflows. While enterprises tend to keep their most sensitive data in-house, huge volumes of big data can be located externally. The choice of a cloud environment, public or private, constitutes an important decision for a CIO seeking to mitigate risk and maintain control. System data—such as infrastructure logs, database audits, security checks, etc.—is evolving into an asset in itself which offers the promise of providing valuable insight for organisations. Organisations are quickly moving beyond questions of what and how to store Big Data for applications into how to efficiently derive meaningful insights that respond to real business needs at optimal cost and energy consumption. As cloud computing continues to mature, a growing number of enterprises are building efficient and agile cloud environments,

and cloud providers continue to expand service offerings. It makes sense for any organisation to look to search capabilities in the cloud as a means to analyse large amounts of applicative and system data. Nonetheless, a cloud-enabled search engine is required to distributively locate documents in cloud environments. It is in the best interest of users and cloud providers alike to seamlessly search documents, log files, SQL, and NoSQL databases. However, a cloud-based search application can become very complex, as the ability to locate data on different cloud platforms depends on multiple protocols, vendors, instances, APIs, and technologies. We have developed Simha, a simple engine to efficiently search over unstructured and structured data on different cloud platforms in real time without user intervention. Simha provides a singular approach to locating unstructured data across multiple clouds using the Elasticsearch engine as its back-end. Based on an Elasticsearch server [2], it stores and analyses data from different cloud platforms, furnishing a distributed real-time engine for effectively searching and indexing documents in cloud environments. Through Lucene, a Java search-engine library, it provides full-text search capabilities using a powerful query language, automatic search, and conflict management for document processing in real time. Simha significantly extends the search capabilities of Elasticsearch by creating additional interactions for Linux and Windows file systems, distinct databases, and SharePoint files. This paper is organised as follows. Section II introduces the background for Simha and related work. Section III describes the application design and technology framework, followed by Section IV, where the Simha implementation and its workflow are reported. Section V presents a performance evaluation of the proposed proof of concept. Finally, Section VI gives the concluding remarks.
II. BACKGROUND

Cloud computing [3] offers a new computing model that has rapidly become a widely adopted paradigm where resources can be shared as services over the internet. Its underpinning technologies include hardware optimisation, on-demand services, software as a service, elasticity, payment models, and flexible resource allocation [1]. Cloud computing provides ubiquitous access to data, improved scalability and performance and, moreover, a 'computing utility' abstraction, like other utility services available in today's society [4]. Rather than a new concept, Lee et al. [5] describe cloud computing as a fused paradigm which includes network computing, utility computing, virtualisation, and grid computing.

Recently, more organisations and enterprises have been seeking ways to achieve scalability, cost savings, and resource utilisation in the cloud. With such rapid growth in the industry, the management and provisioning of the underlying infrastructure and services to customers is becoming a challenging task [6]. Datacenters for cloud computing are evolving in terms of both software and hardware resources as well as traffic volume. Rochwerger et al. [7] acknowledge that cloud operation and management are getting more complex. Moreover, due to the inherent complexity of services at various levels, Kc and Gu [8] argue that cloud systems are prone to runtime problems caused by hardware and software bugs. In this scenario, managing these cloud platforms effectively requires accurate and fine-grained log monitoring. An under-utilised resource in cloud environments, system logs provide an important piece of analytical data for monitoring and investigating the services running on any platform [9]. Logs provide access to historical information for forensic investigation and debugging, such that application developers, service providers, and users can monitor business processes. Each layer in the cloud application stack generates logs for applications, databases, networks, etc. Marty [9] emphasises that a logging framework is required for generating and collecting log data for forensic investigation at regular intervals. Since cloud-based infrastructure consists of numerous IT management services and operations, the volume of data generated by daily IT operations at various levels can provide in-depth insights to assist IT management services. Song et al. [10] insist that traditional IT management solutions cannot analyse and exploit this rich information due to its very large volume and velocity, and the lack of real-time data mining and search capabilities.
Log data needs to be stored reliably over a period of time to enable predictive analysis and derive meaningful results. Lee et al. [5] describe an important problem in log management: cloud computing is multi-tenant at various levels, so every tenant environment generates a large amount of logs, and analysing logs for multiple tenants in real time therefore becomes very complex. Further, Lee et al. [5] propose that a multi-level IDS (intrusion detection system) is required to protect cloud computing environments from various security threats and attack patterns. Today, major cloud service providers include capacity and resources for their services and QoS guarantees in their SLAs, specified in a service level specification (SLS) including SLOs (service level objectives) [11], [12]. Log analysis can provide insights into Key Performance Indicators (KPIs) for a cloud provider's SLA and QoS. Moreover, troubleshooting issues in cloud environments represents a big problem due to the associated complex interactions. Any given cloud is modelled into several layers which can be controlled either by a cloud service consumer or a cloud service provider, and diagnosing a problem requires searching all layers and their components [13]. Similarly, Spring [14] states that in order to troubleshoot within complex cloud infrastructures, both cloud providers and consumers need

to know the exact location of logs and events in virtual machines at different layers. Du et al. [15] discuss how virtual machine profiling can be monitored inside and outside the cloud environment and propose system-wide and guest-wide measurements. Marty [9] lists a few challenges associated with cloud-based log management and forensics:
• Archival and retention of logs
• Decentralisation of logs
• Volatility and accessibility of logs
• Multiple tiers and logs
• Arbitrary and incompatible log formats
Fatos et al. [16] define a Generic Log Adapter (GLA) to address most of the challenges described by Marty [9]. Moreover, the IETF also proposes a universal format for logger messages, which is a collection of guidelines for any logger program to generate log files. Most log files contain records of system events, transactions, user activity, and application and error messages. Log data is frequently critical for analysing business workflows and replaying events that occur, but Ray et al. [17] argue that logs may store private user information and can cause privacy breaches, since they are accessible to everyone. They present a comprehensive framework for secure logging-as-a-service in cloud-based environments for log storage and retrieval, storing logging information in a MySQL database. It provides a novel secure logging framework, but it only supports operating-system-based logging, and real-time search across other cloud platforms is not implemented yet. In a very similar approach, Suakanto et al. [18] build a crawler engine for cloud infrastructure running on the Eucalyptus cloud platform; their current prototype searches for words and strings only. It is important to mention that indexing is slow in their current process due to waiting delays between visits [18]. Within the context of document searching, Elasticsearch provides real-time indexing of data which can be distributed horizontally [19].
As the logging infrastructure grows, it is important to understand the accountability factor in log tracking and its management. Nakahara and Ishimoto [20] introduce the concept of accountability in cloud logging to resolve disputes if any arise; this requires a strong identity that can be held responsible for any incidents. To achieve accountability, we need to: 1) track all the activities in the cloud; 2) associate each activity with the person or entity responsible for changes, so that it can be traced back later; and 3) maintain individual evidence with sufficient information. However, this represents a big challenge, as a misconfigured machine may result in incorrect accountability, viruses or malware can pilfer valuable data, and cloud data may not be available at the desired point in time due to loss. As opposed to MapReduce [21] or structured parallel computing [22] in the cloud, we must consider two key characteristics of server log files. Firstly, the log file size depends upon the number of online users and services deployed. Secondly, data is dynamically generated and typically requires some preprocessing to be analysed. Caballe and Xhafa [23] present

a novel concept for evaluating massive log file data in a virtual campus using different distributed infrastructures, such as clusters and PlanetLab. Their approach includes data mining techniques to find various trends and patterns within a virtual campus, and shows how to filter and parse large log files effectively. Nevertheless, we need to address the challenges in present cloud services, where log distribution varies across thousands of nodes with terabytes-size repositories. Kc and Gu [8] present a novel hybrid log analysis approach which employs both coarse-grained and fine-grained log features. It can detect both known and previously unknown anomalies in log searches, and these anomalies provide valuable information about cloud logging. However, this framework still needs to reduce memory consumption in the cluster, and its online log analysis system is currently under development. Kwang [24] describes an agent-based service search engine that finds relationships between cloud services by consulting a cloud ontology. Moreover, this paradigm aids in the development of software tools for service operation in cloud environments using an agent-based approach, which can be applied to cloud log search. On the other hand, Yang et al. [25] implemented an intrusion detection system via inter-VM log analysis using the MapReduce algorithm under Hadoop, but it only detects intrusion patterns and does not address their integration and reporting to customers. Log pattern analysis is always a complex task, and developing a standard framework is an ongoing process as we shift toward the cloud. In fact, pattern identification has always remained the core issue in log analysis. Fatos et al. [16] identify that a typical problem consumes between 30% and 70% of an organisation's resources; thus the basic approach is to cut down cost and implement a generic framework for log formats and semantics.
They have developed a novel framework by 'gridifying' the IBM Generic Log Adapter (GLA) to increase the performance and processing of logs, transforming existing log data into CBE format using a master-worker strategy. The GLA implements the Common Base Event (CBE) format for log data events, uses a rule-based approach, and is written in Java. CBE [26] is an XML-based universal log data format defined in an XML schema and based on an IBM implementation. The user has to develop his own parser, based on the basic framework elements, for a log file format and compile it as a plug-in to the GLA. We need to consider that, as we move toward the cloud, components become distributed and sequential log processing can become a performance bottleneck. As the infrastructure of any cloud gets more complex, it requires more effort to manage and monitor logs. Searching provides an efficient way to retrieve information. Ichikawa and Uehara [27] proposed a distributed search engine, the cooperative search engine (CSE), for private clouds, to efficiently search documents distributed across many sites; it searches documents stored in different virtual machines [27] and is implemented using Apache and Cassandra. Nurseitov et al. [28] compare two data exchange formats (XML and JSON), and their results indicate that XML is slower than its JSON

counterpart. Elasticsearch can back up an index in JSON format for recovery or for transferring results.

A. Contribution

Cloud-based search implies some challenges and trade-offs, such as latency, communication, sampling, and data analysis delays. With shorter sampling intervals, there is frequently a smaller delay between when a condition occurs and when it is captured in a log. Thus, to obtain such up-to-date information, we need to analyse system-wide resources. There is inevitably a delay in such analysis to obtain the necessary information and the computing resources required to complete the task, since any cloud entails a system architecture unevenly distributed across several layers. Communication delays play a significant role, as information needs to travel across several processing nodes and links. Despite the growing interest in cloud computing, scant research has been devoted to inter-VM search across different cloud platforms and operating systems. As enterprises continuously integrate multiple cloud services from different providers and platforms, a comprehensive search and monitoring system is required to troubleshoot system-wide irregularities using system logs and services, while preserving isolation among different tenants. However, uninterrupted indexing of system data files often becomes an unbalanced task, as one cannot accurately predict how or when troublesome logs will be generated. Furthermore, such indexing does not provide any information about inter-VM search per se. From this perspective, Simha services enable customers to search their cloud environment and integrate other cloud platform services to meet cloud search management goals. Customers can create indexers that collect, process, and store data in the Elasticsearch engine for management and monitoring at specific intervals.
Simha uses the multi-tenant features of the Elasticsearch cluster to maintain user data in different clusters with references to machine configuration [19], and backs up data at regular intervals to maintain accountability. Data is parsed and stored in JSON format in the Elasticsearch server. JSON is a human-readable data exchange language which is easy to parse and use. It is directly supported by JavaScript and thus provides a significant performance benefit over XML, which requires an extra library to parse information in DOM format.

III. DESIGN

The Simha application has been created to provide central search management across all virtual machines and sets of services defined by a user. It provides a simple user interface to create, delete, and update the search index for virtual machines on different platforms. The Simha application requires two virtual machines on the user cloud platform. The first VM hosts the Elasticsearch server, installed with the required river plugins and running as a daemon. This Elasticsearch virtual machine has 5 shards with a replication factor of 1. A user can easily scale

the Elasticsearch nodes since it is horizontally scalable. The second VM deploys the web frontend package, which integrates the user web interface with the Elasticsearch engine. Both virtual machines are required for the Simha application. Advanced configuration and virtual machine settings are administered via a web interface.

A. Simha Overview

Firstly, we shall explain the basic idea of Simha. A user can create and manage the directory index required for a virtual machine via a web interface. The Elasticsearch server listens to user requests using a REST-based API. To minimise resource utilisation, every indexer works on specific time intervals. Moreover, each indexer parses the system directory and sends the information in JSON to the Elasticsearch server. Once the information is collected, the indexer closes the connection with the virtual machine until the next interval. All Simha activities and logs are saved in a MySQL database accessible to the system administrator for debugging purposes. To gather usage metrics, all user requests are saved in a metering database along with the Elasticsearch health API. When search data is available within a server, any client can send a query to the node to retrieve the results. The Simha service includes the following components (see Figure 1).
• Web console (WC): Users access the Simha application to configure virtual machine indexes via the WC. Each user manages different virtual machine indexes, and all configurations are saved in XML files. The WC interacts with a MySQL database to authenticate users and logs all user activities. The WC can manage indexers for Linux and Windows machines, SharePoint via a REST-based API, and databases via an external DB river plugin. All user requests are forwarded to the Elasticsearch node.
• Elasticsearch Server (ES): Elasticsearch is an open-source server which provides distributed and real-time search capabilities, also known as a document database.
It implements Lucene as its backend for document parsing and structuring [2]. Elasticsearch is API-driven, and users can manage almost any action using a simple RESTful API with JSON to monitor server health, search data, etc. WC requests are mapped to the Elasticsearch server and initiate the virtual machine index process. Elasticsearch enhances its core capability through plugins and rivers; any change to plugins or rivers requires restarting the search node or cluster [2].
• Metrics: Simha stores all Elasticsearch server activity, including REST-based calls, server monitoring values, time constraints, system performance, etc., in the Metrics database. It records statistics for user cloud platforms to monitor performance and usage from a business perspective.
• Indexers: An indexer in Simha creates an index for a virtual machine directory, gathering the documents and parsing them into JSON. The JSON format is structured according to a mapping defined by the river plugin

Fig. 2. River overview.

and the data is sent back to the Elasticsearch server. Data is transferred using SFTP on Linux and the JCIFS API on Windows. All indexers collect data at pre-defined scheduled intervals. A user can define the document types, such as .txt, .xls, .doc files, etc. For indexers to collect data from public clouds, the system administrator should add the Elasticsearch server IP and open the required ports in its security configuration.

B. Elasticsearch Plugins and Rivers

Elasticsearch provides a way to enhance its core capabilities by adding custom functions in the form of plugins. Plugins range from rivers to analysers, mapping types, native scripts, etc. Plugins can be installed either manually or automatically under the plugins directory. In manual mode, plugins must be copied into the /elasticsearch/plugins/plugin_name/ folder. In automatic mode, a user can install a plugin using the command plugin --install //. A river is a service that can be plugged in and out of an Elasticsearch cluster for pushing or pulling data, which can be searched as well as indexed into a single or different cluster. Every river has a unique name and type: the name identifies the river within an Elasticsearch cluster, and its type defines the full data type to the cluster. Figure 2 presents an overview of rivers in Elasticsearch. Next, we describe the FS-River plugin implemented in Simha.

C. FS-River plugin

The FS-River plugin helps users send file system documents and logs from a source to the Elasticsearch cluster. It indexes the local file system, Linux file systems via SSH, and Windows file systems using SMB protocol libraries. Indexers parse specific directories at particular time intervals, so users can schedule the indexers to read data at specified times. Figure 3 illustrates the file system plugin installed in a single Elasticsearch node. Once the indexer is started, it identifies the river for reading files in a directory, database, or SharePoint document library.
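Registering a river, as described above, amounts to sending a settings document to the cluster. The sketch below is a hedged illustration assuming the historical `_river/<name>/_meta` registration convention of early Elasticsearch releases; the helper name and example settings are ours, not part of Simha:

```python
import json

def river_meta_request(river_name, river_type, settings):
    """Build the (method, path, body) triple that registers a river.

    A river is registered by PUTting a _meta document whose 'type' field
    names the river implementation, e.g. 'fs' for FS-River; deleting
    /_river/<name> unplugs it again.
    """
    body = {"type": river_type}
    body.update(settings)
    return ("PUT", "/_river/%s/_meta" % river_name, json.dumps(body))

method, path, body = river_meta_request("sshlogs", "fs",
                                        {"fs": {"url": "/var/log"}})
```

The triple can then be sent to the cluster with any HTTP client, exactly as the curl examples later in the paper do.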
Each river plugin establishes a one-to-one relationship with a file directory. A river reads the data from the directory as a byte stream, which is then passed to

Fig. 1. Simha overview.

Fig. 4. SSH JSON script.

D. Sample index

Fig. 3. River plugin internal components.

the Elasticsearch node in JSON format by FS-River. Every index has its own Lucene backend, known as shards and replicas in Elasticsearch terms. FS-River allows indexing multiple documents and folder structures. A shard can contain the whole document index, or data can be distributed between shards to give better performance and routing capabilities. An index is built on Lucene and can be divided into a variable number of segments at any given point. Each shard can contain single or multiple documents and provides very high fault tolerance and distributed search capabilities. Next, we describe the sample index created for a Linux filesystem.

We define an SSH JSON index for the Elasticsearch node, required to parse a directory in a Linux-based VM. Figure 4 represents a sample JSON script to send an SSH request to the FS-River plugin.
• Create a machine index named sshlogs using the curl command to send a JSON request to the Elasticsearch node: curl -XPUT localhost:9200/sshlogs/
• Configure the sshlogs index properties:
– url: path of the directory to parse
– server: server name or IP address of the machine
– username: machine username
– password: machine password
– update-rate: time interval in milliseconds
– includes: document types to index, e.g. .txt, .pdf, etc.
– domainname: required for Windows VMs
– protocol: SSH, SMB, or SP (SharePoint)
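The bulleted properties above can be assembled into the request body programmatically. A minimal sketch: the field spellings follow the bullet list and may differ across FS-River versions, and the helper itself is ours, not part of Simha:

```python
import json

def ssh_index_settings(url, server, username, password,
                       update_rate=900000, includes=".txt", protocol="ssh"):
    """Assemble the river settings document described by the bullet list.

    update-rate is a polling interval in milliseconds; includes names the
    document types to index.
    """
    return {
        "type": "fs",
        "fs": {
            "url": url,                    # path of the directory to parse
            "server": server,              # server name or IP address
            "username": username,
            "password": password,
            "update-rate": update_rate,
            "includes": includes,
            "protocol": protocol,          # ssh, smb or sp (SharePoint)
        },
    }

settings = ssh_index_settings("/home/demo/demoshare", "10.0.0.5",
                              "demo", "secret")
payload = json.dumps(settings)
```

The resulting payload is what the curl command above would PUT to the Elasticsearch node.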

IV. IMPLEMENTATION

In this section, we discuss the implementation of the ES plugin and the Simha frontend. Simha extends the Elasticsearch river plugin capabilities to read data over the network. The Simha application contains Java servlets to configure the indexes and search pages in the Elasticsearch node. The Simha application is divided into two parts:
• FS-River plugin for Elasticsearch
• Simha website

A. FS-River plugin Development

The FS-River plugin accesses data over the network using the JSch library for Linux machines, the jCIFS library for Windows machines, and HttpAsyncClient for reading documents in SharePoint libraries via a REST API. The plugin integration with Elasticsearch presents a number of challenges, as Elasticsearch does not support automatic data parsing over the network. We extended the FS plugin to read files over the network, convert them to the JSON mapping format, and send the data back to the Elasticsearch node using its API. The Elasticsearch node automatically indexes the data. We have defined a fixed mapping style for formatting data: the plugin provides a schema mapping which defines our index structure to the Elasticsearch node. The FS-River plugin depends on the following software packages:
1) Elasticsearch.jar (0.90.10)
2) JSch.jar (0.1.50) for SSH
3) jCIFS.jar (1.3.17) for SMB
4) HttpAsyncClient (4.0) for REST/HTTP
In order to index databases, Simha installs the JDBC plugin in the Elasticsearch server. The JDBC river enables fetching data from JDBC sources in order to index it into Elasticsearch. The relational data is internally converted into structured JSON objects and pushed to the Elasticsearch nodes.

B. Simha Frontend

The Simha interface is implemented using Java servlets, which act as an intermediate layer between the HTTP client and requests coming from the browser. Every user request is processed by servlets, which generate HTML pages in response. For testing purposes, the web application runs on the Jetty webserver.
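The JDBC river described above pulls rows with a SQL statement and indexes each row as a JSON document. A minimal sketch of assembling such a settings document, with hypothetical connection details (the exact field names may vary across JDBC river versions):

```python
import json

def jdbc_river_settings(driver, url, user, password, sql):
    """Assemble a JDBC river settings document: the river periodically
    runs the SQL statement and indexes each result row."""
    return {
        "type": "jdbc",
        "jdbc": {
            "driver": driver,      # JDBC driver class
            "url": url,            # JDBC connection URL
            "user": user,
            "password": password,
            "sql": sql,            # statement whose rows are indexed
        },
    }

settings = jdbc_river_settings(
    "com.mysql.jdbc.Driver",
    "jdbc:mysql://dbhost:3306/simha",
    "simha", "secret",
    "SELECT * FROM activity_log",
)
payload = json.dumps(settings)
```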
During deployment, the application creates a .war file stored in the Jetty web server folder. To interact with the REST-based interface of Elasticsearch, the application integrates the Jersey client to send and receive requests from the server. Jersey helps in the development of RESTful web services using its own Java API that implements the JAX-RS specification. The frontend implements the following frameworks:
• Framework
– Jersey RESTful API
– Jetty HTTP web server
– Java servlets
• Dependency libraries
– Jetty-server (9.0.0) for embedding the HTTP web server

Fig. 5. Manage Index.

– javax-servlets API (3.1) for the servlets implementation
– slf4j-api (1.7.2) for logging purposes
– gson (2.2.2) Google client library for JSON parsing
• Supporting languages and libraries
– Java, Ajax, JavaScript, HTML & CSS
Next, we describe the functionality of the Simha frontend. The web console provides two tab views: a search tab and an indexing tab. A user can add a new index for a machine as shown in Figure 5. When a user request is processed by the servlets, the server is authenticated and, if the credentials are correct, the index request is sent to the Elasticsearch server. Once an index is verified, its entry is shown in a grid. Users can start and stop the indexer at any time. To verify that an index has been created at the Elasticsearch node, one can browse the URL http://esnode:9200/plugin/head/. This shows all the indexes in a node and their respective shards, as shown in Figure 6. Searching indexed data can be performed via the search tab. A search page gives details about the total results found and the time taken by the server to find the data, as shown in Figure 7. Every result shows a relevance score based on the Lucene search algorithm. We can search the entire document repository indexed by the Elasticsearch cluster, which includes documents from Linux machines, Windows platforms, SharePoint servers, and MySQL databases. To test the capabilities of the plugin, we have performed some evaluations by reading large text files over the network and checking the system processing and network speeds across different cloud platforms. In the next section, we present the results of such evaluations.

V. EVALUATION

We evaluate our implementation of the Simha application on a private cloud environment created on top of the VMware platform. Additionally, we have enabled Simha to index virtual machines in both the Amazon and Rackspace public cloud platforms in order to evaluate hybrid clouds. The evaluation configuration is shown in Table I.
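The search tab described above issues Elasticsearch query-DSL requests and reports the hit count and server time from the response. A minimal sketch of building such a request and reading those totals; the field name file.content is a hypothetical example, not Simha's actual mapping:

```python
import json

def match_query(field, text, size=10):
    """Build an Elasticsearch query-DSL body for a full-text match search."""
    return {"query": {"match": {field: text}}, "size": size}

def summarise_hits(response):
    """Extract what a search page displays: total hits and server time (ms)."""
    return response["hits"]["total"], response["took"]

body = json.dumps(match_query("file.content", "error timeout"))
# A response shaped like those of 0.90-era Elasticsearch:
total, took_ms = summarise_hits({"took": 12, "hits": {"total": 3, "hits": []}})
```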

Fig. 6. Elasticsearch Head.

Fig. 7. Search Tab.

The Elasticsearch server runs in the virtual machine esserver-01; the configured node has 2 x 2.54 GHz Xeon processors and 8 GB RAM. Furthermore, we have installed the plugins and the web frontend for the Elasticsearch node to monitor its internal processing and clusters. The Simha application is configured and deployed on another virtual machine, named Simha-webserver. This node has 1 x 2.54 GHz Xeon processor and 2 GB RAM. We have installed the Jetty standalone webserver and deployed the frontend .war package. For reproducibility purposes, we have not added any cache or proxy server in front of the Simha website.

A. Functional Evaluation

During the first evaluation, we created a common folder named demoshare, with different document types, on all virtual machines to be indexed with a common file structure. We have 5 virtual machines running across all cloud environments, and the overall configuration is described in Table II. We have


created 4 indexers (2 for local VMs and another 2 for public cloud VMs). Each indexer runs at a different time interval. The indexers have successfully collected the index data from the local VMs as well as from the public cloud VMs. Currently, the interface does not support uploading a private key to log in to public cloud VMs, so we have copied the private keys into the specific folder. We intend to eventually update the interface to upload public-private keys to the Simha application. The directory structure data has been indexed in the search cluster so that users can perform search queries from the web interface. For reproducibility purposes, the index mapping is the default one for all the VMs, but users can clearly create custom mappings for every VM.

B. Performance Evaluation

For the second evaluation, we have downloaded 2 text log files with sizes of 1.28 GB and 2.70 GB respectively, in order to test the indexer speed over the network and CPU utilisation

TABLE I
Simha CONFIGURATION

Component                       Virtual machine name
Elasticsearch version 0.90.10   esserver-01
Webserver                       Simha-webserver

TABLE II
VM CONFIGURATION

Cloud Environment        Virtual machine
Private Cloud - VMware   Linux and Windows VM (m1.medium)
Rackspace Cloud          Linux VM (m1.medium)
Amazon EC2               Linux VM (m1.medium)

over time. After the setup was completed, we created a testing ssh index for fs-river and sent the request to the Elasticsearch server. Parsing large files in the server has required increasing the default JVM heap size to properly load the data into memory. We have started the server with the following configuration to proceed with our test: elasticsearch -f -Xms1g -Xmx3g -Des.index.storage.type=memory. Next, we present our findings and evaluations.

First, we have evaluated ssh index creation by parsing a 1 GB file with fs-river. In this evaluation, fs-river parses the 1 GB text file in 61 seconds using the ssh protocol. Figure 8 shows the CPU usage over time for 1 minute. As we can see from the graph, the average CPU utilisation for reading 1 GB is approximately 58.04%; however, the CPU utilisation has fluctuated around this mean (58.04%) with a standard deviation of 19.94. That is to say, CPU utilisation varies between 38.10% and 77.98%. Next, we have evaluated the network usage for reading the 1 GB file over the network. Figure 9 shows that the average network utilisation for reading the 1 GB file is 13.81 MB/s; the network utilisation varies between [8.59, 19.03] MB/s, with a standard deviation of 5.22. We have also evaluated the parsing of the 2.57 GB file using the SSH protocol, but fs-river threw a not-enough-heap-memory error, so we increased the memory settings of the Elasticsearch

Fig. 8. CPU usage for 1 GB file
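The variability bands reported in this section are the mean plus or minus one standard deviation of the sampled utilisation. The computation can be reproduced with a short script; the sample values below are purely illustrative, not the actual measured trace:

```python
import statistics

# Illustrative per-sample CPU utilisation values (%); not the real trace.
cpu_samples = [38.0, 45.5, 52.0, 61.0, 58.5, 77.0, 70.5, 62.0, 49.5, 66.4]

mean = statistics.mean(cpu_samples)
sd = statistics.pstdev(cpu_samples)  # population standard deviation

# Utilisation is reported as varying within [mean - sd, mean + sd].
low, high = mean - sd, mean + sd
print(f"mean={mean:.2f}% sd={sd:.2f}% band=[{low:.2f}%, {high:.2f}%]")
```

The same calculation applies to the network throughput samples, with MB/s in place of percentage points.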

Fig. 9. Network utilisation for 1 GB file

Fig. 10. CPU utilisation for 2.57 GB file

cluster to -Xms1g -Xmx5g. As a result, we were able to read the 2.57 GB file, with the following results: fs-river took 3.1 minutes to parse the file. Figure 10 shows the CPU utilisation for the 2.57 GB file. The CPU utilisation kept fluctuating around its mean value (58%) with a standard deviation of 13.51; in short, CPU utilisation varies within the interval [44.5%, 71.52%]. As the file size has grown, the CPU utilisation has remained almost stable over time, whereas the average network utilisation, as shown in Figure 11, is 15.54 MB/s with a standard deviation of 3.63.

We have described the evaluation results for different file sizes, but reading files larger than 3 GB typically causes memory issues in Elasticsearch: it requires more RAM to parse larger files, which is not possible in every case. Since the Elasticsearch server does not have a multi-part indexing feature for a single file, loading heavy log files can lead to memory issues within the server. This means that it is possible to define a custom ETL (Extract, Transform, Load) process for storing data in the Elasticsearch cluster and maintain the mapping at the multi-part level. Such an ETL process can definitely improve the memory consumption in the search cluster. At the moment, this constitutes our future research direction.
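A custom ETL process along these lines would split a large log into bounded parts before indexing, so that no single request needs to hold the whole file in memory. The following is a minimal sketch of such a splitting stage; the function name and chunk size are our own illustration, not part of Simha:

```python
def chunk_log(path, max_bytes=64 * 1024 * 1024):
    """Yield successive parts of a log file, each at most max_bytes,
    split on line boundaries so no log record is cut in half."""
    buf, size = [], 0
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            # Flush the current part before it would exceed the limit.
            if size + len(line) > max_bytes and buf:
                yield "".join(buf)
                buf, size = [], 0
            buf.append(line)
            size += len(line)
    if buf:
        yield "".join(buf)
```

Each yielded part could then be indexed as its own document, with a mapping that records the part number so the original file can be reconstructed at query time.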

Fig. 11. Network utilisation for 2.57 GB file
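Multi-part data of this kind could then be submitted through Elasticsearch's bulk API, whose request body is newline-delimited JSON alternating an action line with a document line. A sketch of building such a body follows; the index, type, and field names are hypothetical:

```python
import json

def to_bulk_body(parts, index="logs", doc_type="logpart"):
    """Build a bulk-API request body: one action line plus one
    document line per log part, newline-delimited."""
    lines = []
    for i, text in enumerate(parts):
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type, "_id": i}}))
        lines.append(json.dumps({"part": i, "content": text}))
    return "\n".join(lines) + "\n"  # a bulk body must end with a newline
```

The resulting body would be POSTed to the cluster's /_bulk endpoint; we have not yet wired such an ETL stage into Simha.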

VI. CONCLUSION

In this paper, we propose Simha, an application to search logs in hybrid cloud platforms. We present how the core capabilities of Elasticsearch can be extended, which has provided great insight into data-intensive processing. Specifically, the ETL (Extract, Transform, Load) processing for Elasticsearch has proved difficult to implement for every data source which qualifies as big data in cloud environments. As presented in the evaluation section, indexing very large data sets into Elasticsearch nodes requires very carefully designed ETL processes; otherwise, it can lead to system exceptions. Additionally, given that Elasticsearch is still under heavy development, some features may or may not be supported in the future.

Future work is required to improve the application's searching capabilities using complex analysers, facets, and term queries. Furthermore, we intend to develop plugins for different data sources, including NoSQL and blobs, and to implement a custom dashboard for better data visualisation.