International Journal of Computer Engineering & Technology (IJCET) Volume 8, Issue 4, July-August 2017, pp. 12–18, Article ID: IJCET_08_04_002 Available online at http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=8&IType=4 Journal Impact Factor (2016): 9.3590(Calculated by GISI) www.jifactor.com ISSN Print: 0976-6367 and ISSN Online: 0976–6375 © IAEME Publication
WEB USER IDENTIFICATION: A REVIEW OF APPROACHES AND ISSUES
Sunny Sharma, Research Scholar, Department of Computer Science & Engineering, Arni University, Kathgarh, Indora, HP, India
Vijay Rana, Department of Computer Science & Engineering, Arni University, Kathgarh, Indora, HP, India
ABSTRACT
Web usage mining has become a popular research area, used for capturing the interests of web users. This information can be used for several purposes, such as web structure enhancement, web navigation prediction, and web personalization. One of the key issues in web mining is identifying web users. Identifying users from web log files is not a straightforward problem, so various methods have been developed. Several difficulties have to be overcome, such as client-side caching and changing or shared IP addresses. This paper reviews three methods for identifying web users: IP address, cookies, and user registration. A large number of systems have been built on these methods; they store the available personal information about a user, such as the user's preferences and state. We also discuss algorithms for user identification for the purpose of web personalization, so that World Wide Web sites on a given topic can be matched to a user's interests.
Key words: User Identification, IP Address, Web Cookies, Web Personalization, Web Usage Mining.
Cite this Article: Sunny Sharma and Vijay Rana, Web User Identification: A Review of Approaches and Issues. International Journal of Computer Engineering & Technology, 8(4), 2017, pp. 12–18. http://www.iaeme.com/ijcet/issues.asp?JType=IJCET&VType=8&IType=4
1. INTRODUCTION
Web usage mining [6] is the application of data mining techniques to extract useful information from web data. It has many applications on the web, such as web structure enhancement, web navigation prediction, web personalization, e-commerce, e-learning, tourism and cultural heritage, digital libraries, travel planning, and interaction in instrumented environments. Personalization is gaining great momentum on the web. Web personalization [7] is the process of customizing a website for specific users by making dynamic recommendations based on each user's interests. User profiling [8] is
an excellent approach for achieving this goal. Web user profiling can be done either explicitly or implicitly. Information about a user can be gained explicitly through registration or any other form, which may contain fields such as the user's preferences, username, password, and address. Information captured explicitly is static but detailed. The implicit method of gathering information about a user is dynamic and exploits cookies stored in the user's browser.
The first task in user profiling is web user identification. It is one of the most difficult and most important steps in the process of web usage mining. In this step the unique users are identified and distinguished. The notion is to distinguish or trace an individual's identity, either alone or when combined with other public information that is linkable to a particular person. Moreover, it is important to know which log entries belong to which user. The same user can access multiple computers, and a single computer can be accessed by multiple users. Furthermore, proxy servers can hide the identity of users, because multiple computers behind a proxy server appear on the Internet with the same IP address.
A user can be identified by cookies, IP address, or login authentication. Common uses for cookies are authentication, storing the user's preferences, storing shopping cart items, and server session identification. One of the most common privacy issues involves web cookies, so the user might delete them. Although a cookie itself may be harmless, a privacy problem emerges when an unprincipled party gets hold of a cookie that divulges information about a site on which the user has entered sensitive data. Compared with cookies and IP addresses, login provides better accuracy and consistency. User Identification (UI) as proposed in existing work classifies the web user profile dataset periodically in order to learn users' behaviors and interests through temporal pattern analysis.
The temporal data stored in the database follows interval stamping of tuples, where the start-time and end-time of the temporal attributes are provided as two separate attributes. Each tuple in the database is uniquely identified by a composite key in which the temporal start-time is one of the attributes. Fuzzy logic converts the quantitative information into qualitative information using test-score semantics and fuzzy rules. It is used for intelligent classification, in which relevancy is increased by enhancing semantics in addition to the relevancy measures provided by conventional syntax-based approaches.
The user interface is one of the most important parts of this architecture, because it is through the user interface that the user interacts with the system. It helps users to perceive and express information effectively. In particular, the user interface provides formats and languages that present information to users with more accuracy and a higher level of control, giving them a means of effective communication on the Web [11].
In many cases temporal constraints are used because different user groups access the Internet in different time periods. Therefore, the users' temporal data is stored, analyzed, and classified, and the relevant rules are extracted. Using these rules, the relevant web pages are retrieved by matching the pages with the user's interests, even when the user's access time varies.
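The interval stamping described above can be illustrated with a minimal Python sketch. It is not the schema of any cited system: the record fields, the `user_id` attribute, and the choice of `(user_id, start_time)` as the composite key are assumptions made only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessRecord:
    """One interval-stamped tuple (hypothetical layout)."""
    user_id: str          # assumed identifier of the user
    page: str             # page that was accessed
    start_time: datetime  # start of the interval, stored as its own attribute
    end_time: datetime    # end of the interval, stored as its own attribute

    @property
    def key(self):
        # Composite key in which the temporal start-time is one attribute.
        return (self.user_id, self.start_time)


def active_at(records, instant):
    """Return the records whose [start_time, end_time) interval contains instant."""
    return [r for r in records if r.start_time <= instant < r.end_time]


records = [
    AccessRecord("u1", "/news", datetime(2017, 7, 1, 9, 0), datetime(2017, 7, 1, 9, 30)),
    AccessRecord("u1", "/sports", datetime(2017, 7, 1, 20, 0), datetime(2017, 7, 1, 20, 15)),
    AccessRecord("u2", "/news", datetime(2017, 7, 1, 9, 10), datetime(2017, 7, 1, 9, 20)),
]

# Index the tuples by their composite key.
table = {r.key: r for r in records}

# Which users were active at 09:15?
morning = active_at(records, datetime(2017, 7, 1, 9, 15))
```

Because the start-time is part of the key, the same user can appear in several tuples, one per access interval, which is what allows access patterns to be compared across time periods.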
2. LITERATURE REVIEW
A literature survey plays an imperative role in research work. It documents a comprehensive review of a particular theme and records the past and present development of the topic, and thus motivates the development of innovative techniques and models. This section describes the work of eminent researchers and highlights the challenges that still need to be addressed.
Krishnamurthy et al. [1] defined "personally identifiable information" (PII) as information that can be used to distinguish or trace an individual's identity, either alone or when combined with
other information that is linkable to a specific individual. They used long-term data to present a longitudinal analysis of privacy diffusion on the Web, and describe it as the first study to measure this diffusion over an extended period of time (On the Leakage of Personally Identifiable Information Via Online Social Networks).
Ivancsy et al. [2] presented three different methods for identifying web users. Two of them are the methods most commonly used in web log mining systems, whereas the third is a novel approach that uses a complex cookie-based method to identify web users. To demonstrate its efficiency they developed an implementation called the Web Activity Tracking (WAT) system, which aims at a more precise distinction of web users based on log data. Furthermore, they presented statistical analysis produced by WAT on real data about the behavior of Hungarian web users, together with a comprehensive analysis and comparison of the three methods (Analysis of Web User Identification Methods).
Carmagnola et al. [3] described the conceptualization and implementation of a framework that provides a common base for user identification for cross-system personalisation among web-based user-adaptive systems. The framework can be easily adopted in different working environments and for different purposes. Furthermore, it represents a hybrid approach that draws on both centralized and decentralized solutions for user modeling (User identification for cross-system personalisation).
Pazzani et al. [4] discussed algorithms for learning and revising user profiles that can determine which World Wide Web sites on a given topic would be interesting to a user. They described the use of a naive Bayesian classifier for this task and demonstrated that it can incrementally learn profiles from user feedback on the interestingness of web sites. Furthermore, the Bayesian classifier may easily be extended to revise user-provided profiles (Learning and Revising User Profiles: The Identification of Interesting Web Sites).
Peacock et al. [5] described how researchers have turned their focus to keystroke biometrics, which seeks to identify individuals by their typing characteristics. The article addresses several issues while surveying recent developments, compares results from the field using both well-known and newly proposed metrics, and examines the potential roadblocks to widespread implementation of keystroke biometrics (Typing Patterns: A Key to User Identification).
3. ISSUES
Web user identification is one of the most challenging steps in the process of web usage mining. The main issue for webmasters is how unique users are to be identified and distinguished. Current approaches to web user identification have the following limitations:
Existing approaches to user identification are not lightweight.
A single user can use multiple terminals.
Multiple users can share the same terminal, for example in an office or a college.
A proxy server can hide an individual's information, such as the IP address.
Multiple terminals can appear on the web with the same IP address.
The web user can disable the cookies that websites use to track the user's behavior, or delete them from the browser.
4. METHODS OF USER IDENTIFICATION
A user on the web can be identified in numerous ways. The most widely used methods are based on the IP address, cookies, and user registration.
IP address: This is a very common heuristic for user identification. An IP address is the address that identifies a computer on the Internet, so requests arriving from the same IP address are attributed to the same user.
Cookie based: Cookies are pieces of information stored on the client's computer for a specific amount of time. They let a website recognize a returning browser and can store information about the user, so this information can be extracted from the cookies.
User registration: Registered information such as user name, address, and contact number is a comparatively more reliable source for user identification, provided that all the information filled in by the user is correct.
4.1. IP Address
Each computer on the Internet can be identified by its address, known as the Internet Protocol (IP) address. The IP address is personally identifiable information that is automatically captured by the other computer whenever communication takes place over the web. The other computer may be a web server or any other machine. Such communication includes browsing a website, sending requests to a server or receiving responses, and sending or receiving e-mail.
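As an illustration of IP-based identification, the following Python sketch groups the requests in a web server log by client IP address. It assumes the log is in the Common Log Format; the sample lines, the addresses (taken from documentation ranges), and the function name are hypothetical.

```python
import re
from collections import defaultdict

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+'
)


def group_requests_by_ip(log_lines):
    """Map each client IP address to the list of URLs it requested."""
    requests_by_ip = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip malformed lines
        ip, _timestamp, _method, url, _status = match.groups()
        requests_by_ip[ip].append(url)
    return dict(requests_by_ip)


sample_log = [
    '192.0.2.10 - - [01/Jul/2017:09:00:01 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '192.0.2.10 - - [01/Jul/2017:09:00:07 +0000] "GET /news.html HTTP/1.1" 200 2310',
    '198.51.100.7 - - [01/Jul/2017:09:01:12 +0000] "GET /index.html HTTP/1.1" 200 1043',
]

users = group_requests_by_ip(sample_log)
```

Under this heuristic every distinct IP address is treated as one user, which is exactly where the issues of Section 3 arise: all users behind one proxy server collapse into a single entry, and one user on several terminals is split into several entries.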
4.2. Cookie-Based Identification
A web cookie, or Internet cookie, is a small piece of data sent from a web server to a web browser and stored on the user's computer while the user is browsing. The browser stores the message in a text file. The message is then sent back to the server each time the user requests a page from that server. The main purpose of a cookie is to identify users and possibly to prepare personalized web pages for them [9]. When you browse a website that uses cookies, you provide some information, such as interests and preferences like language and country. This information is bundled into a cookie and sent to your browser, which stores it for later use. The next time you visit the same website, your browser sends the cookie to the server. The server can use this information to personalize web content for individual users. For example, instead of seeing a generic welcome page, the user might see a page that welcomes him/her by name or shows the pages he/she preferred most.
Figure 1 Working of Cookies
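The cookie exchange described in Section 4.2 can be sketched with Python's standard `http.cookies` module. This is a simplified illustration, not the method of any cited system: the cookie name `visitor_id`, the 30-day lifetime, and the function `identify_visitor` are assumptions, and no real web server is involved.

```python
import uuid
from http.cookies import SimpleCookie

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def identify_visitor(cookie_header):
    """Return (visitor_id, set_cookie_value).

    cookie_header is the value of the request's Cookie header (or None).
    set_cookie_value is None for a returning visitor; for a new visitor it is
    the value the server would send in a Set-Cookie response header.
    """
    cookie = SimpleCookie()
    cookie.load(cookie_header or "")
    if COOKIE_NAME in cookie:
        # Returning visitor: the browser sent the identifier back.
        return cookie[COOKIE_NAME].value, None

    # New visitor: issue a fresh random identifier.
    visitor_id = uuid.uuid4().hex
    response_cookie = SimpleCookie()
    response_cookie[COOKIE_NAME] = visitor_id
    response_cookie[COOKIE_NAME]["max-age"] = 30 * 24 * 3600  # assumed lifetime
    response_cookie[COOKIE_NAME]["path"] = "/"
    return visitor_id, response_cookie[COOKIE_NAME].OutputString()


# First visit: no cookie yet, so the server issues one.
first_id, set_cookie = identify_visitor(None)

# Later visit: the browser returns only the name=value pair.
cookie_header = set_cookie.split(";", 1)[0]
second_id, second_set_cookie = identify_visitor(cookie_header)
```

If the user deletes or disables cookies, the second call receives no header and a new identifier is issued, so the same person is counted as a new visitor.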
4.3. User Registration
This method prompts the user to authenticate with a username and password pair. To support it, websites need to maintain a database that stores each user's information. Each time the user opens the site, he/she needs to enter the username and password. It is one of the most reliable methods of identification.
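A minimal Python sketch of registration and login follows. It is an illustration under stated assumptions rather than a production design: the in-memory dictionary stands in for the website's database, and the class name `UserStore`, the salted PBKDF2 hashing, and the iteration count are choices made for this example only.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # illustrative work factor, not a recommendation


class UserStore:
    """In-memory stand-in for the website's user database (hypothetical)."""

    def __init__(self):
        self._users = {}  # username -> (salt, password_hash)

    def register(self, username, password):
        if username in self._users:
            raise ValueError("username already registered")
        salt = os.urandom(16)
        self._users[username] = (salt, self._hash(password, salt))

    def authenticate(self, username, password):
        """Return True only if the username exists and the password matches."""
        record = self._users.get(username)
        if record is None:
            return False
        salt, stored_hash = record
        # Constant-time comparison of the stored and the recomputed hash.
        return hmac.compare_digest(stored_hash, self._hash(password, salt))

    @staticmethod
    def _hash(password, salt):
        return hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), salt, ITERATIONS
        )


store = UserStore()
store.register("alice", "s3cret")
login_ok = store.authenticate("alice", "s3cret")
login_bad = store.authenticate("alice", "wrong")
```

Once `authenticate` succeeds, every request in that session can be attributed to the registered username regardless of the terminal or IP address in use, which is why login gives more consistent identification than the other two methods.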
5. CONCLUSIONS
The main aim of this paper was to analyze different user identification methods used in web log mining. We presented three methods for collecting data about users' activity. We showed some interesting measurement results about the popularity of different content providers and about the visit behavior of web users. In addition to user identification, we are working on the definition of an optimal web personalization system using semantic annotation.
REFERENCES
[1] B. Krishnamurthy and C. E. Wills, (2009), On the Leakage of Personally Identifiable Information Via Online Social Networks, Proceedings of the 2nd ACM Workshop on Online Social Networks, pp. 7–12.
[2] R. Ivancsy and S. Juhasz, (2007), Analysis of Web User Identification Methods, World Academy of Science, Engineering and Technology.
[3] F. Carmagnola and F. Cena, (2009), User Identification for Cross-System Personalisation, Information Sciences, Vol. 179, pp. 16–32.
[4] M. Pazzani and D. Billsus, (1997), Learning and Revising User Profiles: The Identification of Interesting Web Sites, Machine Learning, Vol. 27, pp. 313–331.
[5] A. Peacock, X. Ke, and M. Wilkerson, (2004), Typing Patterns: A Key to User Identification, IEEE Security & Privacy, Vol. 2 (5), pp. 40–47.
[6] K. Wu, P. S. Yu, and A. Ballman, (1998), SpeedTracer: A Web Usage Mining and Analysis Tool, IBM Systems Journal, Vol. 37, No. 1.
[7] R. Lixandroiu and C. Maican, (2015), Personalization in E-Commerce Using Profile Similarity, Bulletin of the Transilvania University of Brasov, Series V, Vol. 8(57), pp. 11-6.
[8] Z. Ma, G. Pant, and O. R. L. Sheng, (2007), Interest-Based Personalized Search, ACM Transactions on Information Systems, Vol. 25, No. 1, Article 5.
[9] A. Cahn, S. Alfeld, and P. Barford, (2016), An Empirical Study of Web Cookies, Proceedings of the 25th International Conference on World Wide Web, pp. 891–901.
[10] S. Sharma and V. Rana, (2017), Web Personalization through Semantic Annotation System, Advances in Computational Sciences and Technology, Vol. 10(6), pp. 1683–1690.
[11] S. Mahajan, S. Sharma, and V. Rana, (2017), Design a Perception Based Semantics Model for Knowledge Extraction, International Journal of Computational Intelligence Research, Vol. 13, No. 6, pp. 1547–1556.
[12] P. Dubey and S. Dave, (2013), Effective Web Mining Technique for Retrieval Information on the World Wide Web, International Journal of Computer Engineering and Technology (IJCET), Vol. 4(6), pp. 156–160.
[13] S. P. Menon and N. P. Hegde, (2013), Research on Classification Algorithms and its Impact on Web Mining, International Journal of Computer Engineering and Technology (IJCET), Vol. 4(4), pp. 495–504.