Prioritized Prefetching and Caching Process of the Web Log Database

1 downloads 129 Views 125KB Size Report
Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is the process that attempts t
IJRIT International Journal of Research in Information Technology, Volume 3, Issue 4, April 2015, Pg. 476-481

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

Prioritized Prefetching and Caching Process of the Web Log Database Wasim Research Scholor, Deptt. Of IT Doon Valley Instiitute of Engg. & Tech., Karnal Dinesh Kumar Assistant Professor, Deptt. Of CSE/IT Doon Valley Instiitute of Engg. & Tech., Karnal ABSTRACT This research paper is focused on providing solution by enhancing web prefetching process. For experimentation we have used database with various web entries and have done cleaning process on the database. Data cleaning is the first step that is applied to the web mining and any other web searching technique. In data cleaning all the images, jsp pages and the user shown data is removed. We will be removing the php link, and user details that is friend. After the cleaning of the data whatever the data is required by the user will be displayed on the basis of the A priori algorithm. In this algorithm we will assume the confidence level near about 60% and the term that appears less than 60% will be removed and the more combination is applied to take the proper frequent set of the given data. Keywords: Web Mining, Prefetching, Caching, Database, Web.

I. INTRODUCTION Introduction to Web Mining World Wide Web is a huge repository of data. It has become one of the most important media to store, share and distribute information. The expansion of web is very rapid which has provided a great opportunity to study user and system behavior by exploring web access [5]. Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is the process that attempts to discover patterns in large data. WEB MINING can be defined as applying data mining techniques on web data to discover knowledge [3]. Some researchers have applied mining techniques on the web logs maintained by the servers to discover user access and traversal path [3]. The fast growth of the Internet and the World Wide Web results in network congestion and server overloading. Web has become today not only an accessible and searchable information source but also one of the most important communication channels, almost a virtual society. Web mining is a challenging activity that aims to discover new, relevant and reliable information and knowledge by investigating the web structure, its content and its usage [1]. Web mining includes web usage mining, web content mining and web structure mining as shown in figure 1.1 Web caching is used as one of the effective techniques to reduce network traffic, thereby decreasing user access latencies. However, the cache storage space is limited. Some pages need to be removed when the cache is full. As a result, the efficiency is dropping from what supposed to be, because the deleted page may be requested again [4].

Wasim, IJRIT-476

IJRIT International Journal of Research in Information Technology, Volume 3, Issue 4, April 2015, Pg. 476-481

It is the type of data mining process for discovering the usage patterns from web information for the purpose of understanding and to provide the requirements of web-based applications. Web usage mining imitates the actions of human as they interact with the internet. Examination of user actions in communication with web site can offer insights causing to customization and personalization of a user’s web practice. As a result of this, web usage mining is of extreme attention for e-marketing and e-commerce professionals. Web usage mining itself can be classified further depending on the kind of usage data considered.

Figure 1.1 Web mining User logs are collected by the web server and typically include IP address, page reference and access time. New kinds of events can be defined in an application, and logging can be turned on for them generating histories of these events. It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the above the categories. Web content mining is the process of retrieving the information from WWW into more structured forms and indexing the information to retrieve it quickly. It focuses mainly on the structure within a document i.e. inner document level. Web content mining is related to data mining because many data mining techniques can be applied in web content mining.Web structure mining is the process by which we can discover the model of link structure of the web pages. We catalog the links; generate the information such as the similarity and relations among them by taking the advantage of hyperlink topology. Page rank and hyperlink analysis also fall in this category. The goal of web structure mining is to generate structured summary about the websites and web pages. It tries to discover the link structure of hyperlinks at inter document level.This can be further divided into two kinds based on the kind of structure information used.

II - LITERATURE REVIEW I. Dzitac et al [1] explained that the World Wide Web has evolved in less than two decades as the major source of data and information for all domains. Web has become today not only an accessible and searchable information source but also one of the most important communication channels, almost a virtual society. Web mining is a challenging activity that aims to discover new, relevant and reliable information and knowledge by investigating the web structure, its content and its usage. Though the web mining process is similar to data mining, the techniques, algorithms and methodologies used to mine the web encompass those specific to data mining, mainly because the web has a great amount of unstructured data and the changes are frequent and rapid. This paper is structured into two sections. The first one briefly discusses the different web mining tasks and the second one is focused on advanced artificial intelligence (AI) methods for information retrieval and web search, link analysis, opinion mining and web usage mining.

Wasim, IJRIT-477

IJRIT International Journal of Research in Information Technology, Volume 3, Issue 4, April 2015, Pg. 476-481

Alexandros Nanopoulos et.al [3] explained that improving the performance of the Web is a crucial requirement, since its popularity resulted in a large increase in the user perceived latency. In this paper, authors describe a Web caching scheme that capitalizes on pre-fetching. Prefetching refers to the mechanism of deducing forthcoming page accesses of a client, based on access log information. Web log mining methods are exploited to provide effective prediction of web-user accesses. The proposed scheme achieves coordination between the two techniques (i.e., caching and prefetching). The prefetched documents are accommodated in a dedicated part of the cache, to avoid the drawback of incorrect replacement of requested documents. The requirements of the web are taken into account, compared to the existing schemes for buffer management in database and operating systems. Experimental results indicate the superiority of the proposed method compared to the previous ones, in terms of improvement in cache performance. Siddharth Jain et al, Ruchi Dave et al, Devendra Kumar Sharma et al [4] explained that web mining allows you to check for patterns in data through content mining, structure mining, and usage mining Two dynamic areas of today’s research are data mining and the WWW. A combination of the two areas sometimes referred to as Web mining. Association rule mining is a very important data mining model studied extensively by the database and data mining community. Frequent set mining was motivated by the problem of analyzing transaction data in any organization like educational Institute. The purpose to find out frequent patterns in web log data is to find information about the navigational actions or performance of the users. By using the resultant log file we have seen that the server has a best web-log data for making a good database for web-log mining. Research summarized the web log data with introducing algorithm and found better results and also for improving the quality of the web-log file. P. Somrutai et al [6] explained that Proxy servers have been used widely to reduce the network traffic by caching frequently requested web pages by using web caching. Proxy server acts as an intermediary between the web server and the web user requesting the web page. The proxy servers try to serve as many requests at the proxy server level. Proxy servers first fetch the requested web pages from the original web servers and store the web pages in the proxy server’s cache. If a user makes a request to a web page already stored in the cache, the proxy server accesses the local copy of the web page stored in the cache and serves it to the user who requested the web page. The proxy server’s cache has limited capacity in terms of size of web pages that can be stored in the cache at any given time. Once the cache capacity is reached, the temporally stale web pages in the cache are discarded and replaced by newly requested web pages. The web pages stored in the proxy server cache are managed by the cache replacement algorithms. This approach of caching is called as web caching. Josep Dome`nech, Ana Pont et al [9] explained about web prefetching evaluation according to user focused manner. Authors proposed a web prefetching mechanisms to benefit web users by hiding the download latencies. Nevertheless, to the knowledge of the authors, there is no attempt to compare different prefetching techniques that consider the latency perceived by the user as the key metric. The lack of performance comparison studies from the user’s perspective has been mainly due to the difficulty to accurately reproduce the large amount of factors that take part in the prefetching process, ranging from the environment conditions to the workload. This paper is aimed at reducing this gap by using a cost-benefit analysis methodology to fairly compare pre-fetching algorithms from the user’s point of view. Yin-Fu Huang et al and Jhao-Min Hsu et al [11] explained about mining web logs to improve hit ratios of pre-fetching and caching. Authors explained that in the internet, proxy servers play the key roles between users and web sites, which could reduce the response time of user requests and save network bandwidth. Basically, an efficient buffer manager should be built in a proxy server to cache frequently accessed documents in the buffer, thereby achieving better response time. Authors developed an access sequence miner to mine popular surfing sequences with their conditional probabilities from the proxy log, and stored them in the rule table. Then, according to buffer contents and the rule table, a predictionbased buffer manager also developed here will make appropriate actions such as document caching, document prefetching, and even cache pre-fetch buffer size adjusting to achieve better buffer utilization. Ray-I Chang, Jan-Ming Ho et al [19] proposed Domain-Top (DT) proxy pre-fetching method. DT uses the popular pages in the same popular domain to model users’ future demands. If there is a request for any one of the pages in the popular

Wasim, IJRIT-478

IJRIT International Journal of Research in Information Technology, Volume 3, Issue 4, April 2015, Pg. 476-481

domain, the popular pages in the same domain are considered as its future demands and will be pre-fetched. The development of DT prefetching is based on a hypothesis that the browse-behaviour is always domain-preferential. However, clients may explore the Internet aimlessly and will access different domains in the near future. Analyzing proxy logs without considering diverse browse-behaviour may acquire wrong anticipation in prefetching. Here authors proposes the DTC (DT prefetching with Classification) method that tries to improve DT prefetching by removing unreliable logs. DTC adopts the concept of entropy to discriminate the browse-behaviour from "domain mode" and "exploratory mode". Only access logs in domain mode are considered in calculating the popular domains. Different from DT that considers a constant number of popular pages in pre-fetching, we assign each domain a suitable number of popular pages. Experiments on real traces show that the proposed DTC method can achieve higher hit ratio than that of the DT method. As DTC utilizes only the historical logs to offline decide the popular pages and the popular domains for prefetching, only few function modules on the present proxy need to be revised.

III. Proposed Algorithm Design Caching is an important technique for improving the performance of web based applications with help of web caching techniques. Web caching provides great features like traffic reduction, less load on servers, user-end retrieval delays by replicating popular content on proxy caches that are strategically placed within the network. Web pre-fetching schemes have also been widely discussed where web pages and web objects are pre-fetched into the proxy server cache. In our research we will work on integration of web caching and web pre-fetching approaches to improve the performance of proxy server’s cache. In Domain Top approach for web prefetching, combination of knowledge of most popular domains and most popular documents is done by proxy server. In this approach proxy is responsible for calculating the most popular domains and most popular documents in those domains, and then prepares a rank list for prefetching. In Dynamic web pre-fetching technique, each user can keep a list of sites to access immediately called user’s preference list. The preference list is stored in proxy server’s database. Intelligent agents are used for parsing the web pages; monitoring the bandwidth usage and maintaining hash tables, preference list and cache consistency. It controls the web traffic by reducing pre-fetching at heavy traffic and increasing pre-fetching at low traffic. In this research the concept of preference list from Dynamic technique into Domain Top approach is brought. Optimized Domain Top approach will consist of preference list along with the rank list.

Wasim, IJRIT-479

IJRIT International Journal of Research in Information Technology, Volume 3, Issue 4, April 2015, Pg. 476-481

Figure 1.2 Architecture of Proposed Scheme Architecture for proposed scheme is shown in below figure1.2. Web users firstly requests to the internet for some documents of their need. In the background intelligent agents are continuously reading the web log files and on the basis of information provided by web log files, they fetch the top domain entries from the web they give it to data preprocessing module. Data preprocessing module performs the cleaning process and then save them in a buffer. On the other side smart browsers or say applets and java scripts play the role of knowing user behavior by parsing the user cookies? They also send the parsed cookies data to the buffer. Now some less used data from cookies is fetched using Apriori algorithm.

IV- Conclusion The integration of the pre-fetching and caching techniques using web logs has been discussed and developed according to the user of the different domains. The main focus was to show the performance of pre-fetching process with priority. Database of university process can be taken for experimentation and data cleaning process has been done on the database so that the useful data can be fetched and unwanted and repeated data can be removed. We have done with cleaning of the URLs according to the domains which means all the domains com, in, ac.in, edu, has been considered. We have fetched all the top 10 entries of the all the domain because our approach is domain based. After fetching all the Top 10 entries of the entire domain, we have collected the data that is of user interest because we have used the dynamic approach. To consider user requirements we have collected the data from the user cookies. We have taken the user cookies data from three users and put that data in to the database. Now user cache has the data from the user cookies and the data from the top 10 domains that we fetched. Here we are taking the top entries of from the user cookies considering also the data that is used by all three users and appear less in the cookies. We have fetched less used data by using Apriori algorithm so that we can get the data that is used by the all three users.

V- References [1] Sonia Setia, Dr. Jyoti, Dr. Neelam Duhan, ” Survey of Recent Web Prefetching Techniques”, International Journal of Research in Computer and Communication Technology, Vol 2, Issue 12, December- 2013 [2] Siddharth Jain, Ruchi Dave, Devendra Kumar Sharma, "An approach using Association Rule Mining Technique for frequently matched pattern of an Organization’s web log data”, International Journal of Engineering Sciences & Research Technology, Vol.1, No.5, pp 297-300, 2012. [3] Rudeekorn Soonthornsutee, Pramote Luenam,“ Web Log Mining for Improvement of Caching Performance ”, Proceedings of the International Multi-conference of Engineers and Computer Scientists, Vol- 1, pp 14-16, March 2012. [4] Greeshma G. Vijayan1 and Jayasudha J. S,” A Survey on web perfecting and caching in A Mobile Environment”, Natarajan Meghanathan, et al. (Eds): ITCS, SIP, JSE-2012 [5] Greeshma G. Vijayan and Jayasudha, ”A Survey on Web Pre-fetching and Web Caching Techniques in a Mobile Environment”, ITCS, SIP, JSE-2012, CS & IT 04, pp. 119–136, 2012 [6] Dushyant Rathod, ”A Review On Web Mining”, International Journal of Engineering Research and Technology (IJERT), Vol. 1 Issue 2, pp.34-37, April – 2012. [7] Wei Kong, ”Exploring Health Website Users by Web Mining”, Master of Science in Health Informatics, Dissertation Report, Indiana University, May. 2012.

Wasim, IJRIT-480

IJRIT International Journal of Research in Information Technology, Volume 3, Issue 4, April 2015, Pg. 476-481

[8] P. Somrutai, “Improving the Performance of a Proxy Server using Web log mining,” M.S. thesis, San Jose State University, 2011. [9] V. Sathiyamoorthi, V. Murali Bhaskaran, “Improving the Performance of Web Page Retrieval through PreFetching and Caching using Web Log Mining”, European Journal of Scientific Research, Vol.66, No.2, pp. 207-218, 2011. [10] Waleed Ali , Siti Mariyam Shamsuddin, and Abdul Samad Ismail,” A Survey on web perfecting and caching “;Int. J. Advance. Soft Comput. Appl., Vol. 3, No. 1, March 2011 [11] J. B. Patil and B. V. Pawar, “Improving Performance on WWW using Intelligent Predictive Caching for Web Proxy Servers,” IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 1, January 2011. [12] Ms. Dipa Dixit,” Preprocessing of web logs”, International Journal on Computer Science and Engineering(IJCSE), Vol.2, No.7, 2010. [13] Stratis Ioannidis, Laurent Massoulie, Augustin Chaintreau, “ Distributed Caching Over Heterogeneous Mobile Networks”, ACM SIGMETRICS Performance Evaluation Review, Vol. 38, Issue 1, 2010. [14] JaideepSrivastava, PrasannaDesikan, Vipin Kumar, ”Web Mining – Accomplishments& Future Directions”, Department of Computer Science, 200 Union Street SE, 4-192, EE/CSC Building, University of Minnesota, 2009. [15] Yin-Fu Huang, Jhao-Min Hsu,” Mining web logs to improve hit ratios of prefetching and caching”, KnowledgeBased Systems, Science Direct, Vol- 21, pp 62-69, 2008. [16] Chung-Sheng Li, Yung-Chih Tseng, Han-Chieh Chao &Yueh-Min Huang, (2008)“A neighbour caching mechanism for handoff in IEEE 802.11 wireless networks”, The Journal of Supercomputing , Volume 45, Number 1, pp. 1-14. [17] I. Dzitac, “Advanced AI techniques for web mining,” Proceedings of the 10th WSEAS international conference on Mathematical methods, computational techniques and intelligent systems. Corfu Greece. 2008. [18] Wenzhong Li, Edward Chan & Daoxu Chen,(2007) “Energy Efficient Cache Replacement Policies for Cooperative Caching in Mobile Ad-hoc Networks”, Proceedings of IEEE International Conference on Wireless Communications and Networking, pp. 3349-3354. [19] Beihong Jin, Sihua Tian, Chen Lin, Xin Ren & Yu Huang, (2007)“An Integrated Prefetching and Caching Scheme for Mobile Caching System”, Proceedings of IEEE International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, pp. 512-527. [20] Josep Dome` nech, Ana Pont, Julio Sahuquillo, Jose´ A. Gil,” A user-focused evaluation of web prefetching algorithms”, Computer Communications, Science Direct, Vol- 30, pp 2213-2224, 2007. [21] Junichiro Mori1, Yutaka Matsuo, Koichi Hashida, and Mitsuru Ishizuka, ”Web Mining Approach for a User-centered Semantic Web”,http://swoogle.umbc.edu, University of Tokyo, Japan, 2006.

Wasim, IJRIT-481