1 of 18
Web Usage Mining: A Review
Presented by: Urvek Shah (SVNIT), M. A. Zaveri (SVNIT)
National Conference on “Emerging Trends in Computer Technology”, SCET, Surat, 26/6/2008
2 of 18
Web Mining
The extraction of unknown, interesting knowledge from the WWW
Web data: content, structure, usage
A Taxonomy of Web Mining
Content Mining
Structure Mining
Usage Mining
3 of 18
Web Mining – Categories
Content Mining: the discovery of useful information from Web contents
Structure Mining: the discovery of useful information from the Web structure
Web Usage Mining: the discovery of interesting user access patterns from Web server logs
4 of 18
WUM – Server logs
Example log entry:
123.456.78.9 - - [25/Apr/2008:19:13:44 -0400] "GET /depht.php/coed.php HTTP/1.0" 200 1849 http://www.svnit.ac.in/ "Mozilla/4.51 [en] (Win98;I)"

IP Address    Time                          Method/URL/Protocol  Status  Size  Referrer  Agent
123.456.78.9  [25/Apr/2008:03:04:41 -0500]  GET A.html HTTP/1.0  200     3290  -         Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:03:05:34 -0500]  GET B.html HTTP/1.0  200     2050  A.html    Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:03:05:39 -0500]  GET L.html HTTP/1.0  200     4130  -         Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:03:06:02 -0500]  GET F.html HTTP/1.0  200     5096  B.html    Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:03:06:58 -0500]  GET A.html HTTP/1.0  200     3290  -         Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9  [25/Apr/2008:03:07:42 -0500]  GET B.html HTTP/1.0  200     2050  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9  [25/Apr/2008:03:07:55 -0500]  GET R.html HTTP/1.0  200     8180  L.html    Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:03:09:50 -0500]  GET C.html HTTP/1.0  200     1820  A.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9  [25/Apr/2008:03:10:02 -0500]  GET O.html HTTP/1.0  200     2270  F.html    Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:03:10:45 -0500]  GET J.html HTTP/1.0  200     9430  C.html    Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9  [25/Apr/2008:03:12:23 -0500]  GET G.html HTTP/1.0  200     7220  B.html    Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:05:05:22 -0500]  GET A.html HTTP/1.0  200     3290  -         Mozilla/3.01 (Win95, I)
123.456.78.9  [25/Apr/2008:05:06:03 -0500]  GET D.html HTTP/1.0  200     1680  A.html    Mozilla/3.01 (Win95, I)
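The fields listed above (IP address, time, method/URL/protocol, status, size, referrer, agent) can be pulled out of a raw log line with a regular expression. The following is a minimal sketch, assuming the log uses the common "combined" layout in which the referrer and agent are both quoted; the names LOG_PATTERN and parse_log_line are hypothetical and not from the slides.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method url protocol" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match the assumed layout."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the textual fields into typed values.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Sample line built from the second row of the table (quoting is an assumption).
sample = ('123.456.78.9 - - [25/Apr/2008:03:05:34 -0500] '
          '"GET B.html HTTP/1.0" 200 2050 "A.html" "Mozilla/3.01 (Win95, I)"')
entry = parse_log_line(sample)
print(entry["ip"], entry["url"], entry["status"], entry["referrer"])
```

Lines that do not fit the assumed layout return None, so a caller can count or inspect them rather than silently mis-parse them.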
5 of 18
Web Usage Mining (WUM)
Possible Data Sources
Server-side collection
Client-side collection
Proxy-side collection
6 of 18
WUM – Three Phases
Raw server log → Pre-Processing → User session file → Pattern Discovery → Rules and patterns → Pattern Analysis → Interesting knowledge
7 of 18
WUM – Pre-Processing
Pre-processing includes the tasks of:
Data Cleaning: removes log entries that are not needed for the mining process
User Identification: identifies different users (by IP address)
Session Identification: groups each user’s page references into user sessions
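The data cleaning task can be illustrated with a small filter. This is only a sketch: the slides do not say which entries count as "not needed", so the rules below (drop embedded resources such as images and style sheets, drop non-successful responses) are assumptions, and the names IGNORED_SUFFIXES, is_relevant, and clean_log are hypothetical.

```python
# Assumed cleaning rules; adjust to the site being analysed.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def is_relevant(entry):
    """Keep successful page requests; drop embedded resources and failed requests."""
    url = entry["url"].lower().split("?", 1)[0]  # ignore any query string
    if url.endswith(IGNORED_SUFFIXES):
        return False
    return 200 <= entry["status"] < 300

def clean_log(entries):
    """Return only the entries that are useful for mining."""
    return [entry for entry in entries if is_relevant(entry)]

# Hypothetical entries, already parsed into dicts.
raw_entries = [
    {"ip": "123.456.78.9", "url": "A.html", "status": 200},
    {"ip": "123.456.78.9", "url": "logo.gif", "status": 200},
    {"ip": "123.456.78.9", "url": "missing.html", "status": 404},
]
cleaned = clean_log(raw_entries)
print([entry["url"] for entry in cleaned])
```

Only the request for A.html survives here; the image request and the failed request are removed before user and session identification.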
9 of 18
WUM – Issues in User Session Identification
A single IP address is used by many users (different users → proxy server → Web server)
Different IP addresses appear in a single session (single user → ISP server → Web server)
Cache hits are missing from the server logs
10 of 18
WUM – Solutions
Remote Agent: a remote agent is implemented as a Java applet. It is loaded into the client only once, when the first page is accessed. Subsequent requests are captured and sent back to the server.
Modified Browser: the source code of an existing browser can be modified to collect user-specific data at the client side.
Heuristics: use a set of assumptions to identify user sessions and find the missing cache hits in the server log.
11 of 18
WUM – Heuristics
The session identification heuristics:
Timeout: if the time between page requests exceeds a certain limit, it is assumed that the user is starting a new session.
IP/Agent: each different agent type for an IP address represents a different session.
Referring page field: if the referring page for a request is not part of an open session, it is assumed that the request belongs to a different session.
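The three heuristics can be combined in a short routine. The sketch below is one possible reading of them, not the authors' implementation: the 30-minute timeout is an assumed value (the slides give no limit), a request with referrer "-" always opens a new session, and the names identify_sessions and row are hypothetical. The sample data is taken from the server log table.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)  # assumed limit; the slides do not specify one

def identify_sessions(entries, timeout=TIMEOUT):
    """Group time-ordered entries (dicts with ip, agent, time, url, referrer) into sessions."""
    sessions_by_user = {}  # (ip, agent) -> list of sessions
    for entry in entries:
        user = (entry["ip"], entry["agent"])  # IP/agent heuristic
        user_sessions = sessions_by_user.setdefault(user, [])
        target = None
        for session in reversed(user_sessions):  # prefer the most recent session
            still_open = entry["time"] - session[-1]["time"] <= timeout  # timeout heuristic
            referrer_seen = any(e["url"] == entry["referrer"] for e in session)  # referrer heuristic
            if still_open and referrer_seen:
                target = session
                break
        if target is None:  # no open session contains the referrer: start a new one
            target = []
            user_sessions.append(target)
        target.append(entry)
    return [s for sessions in sessions_by_user.values() for s in sessions]

def row(clock, url, referrer, agent):
    """Build one entry for 25 Apr 2008 from the server log table."""
    return {"ip": "123.456.78.9", "agent": agent, "url": url, "referrer": referrer,
            "time": datetime.strptime("25/Apr/2008 " + clock, "%d/%b/%Y %H:%M:%S")}

WIN = "Mozilla/3.01 (Win95, I)"
X11 = "Mozilla/3.01 (X11, I, IRIX6.2, IP22)"
log = [
    row("03:04:41", "A.html", "-", WIN), row("03:05:34", "B.html", "A.html", WIN),
    row("03:05:39", "L.html", "-", WIN), row("03:06:02", "F.html", "B.html", WIN),
    row("03:06:58", "A.html", "-", X11), row("03:07:42", "B.html", "A.html", X11),
    row("03:07:55", "R.html", "L.html", WIN), row("03:09:50", "C.html", "A.html", X11),
    row("03:10:02", "O.html", "F.html", WIN), row("03:10:45", "J.html", "C.html", X11),
    row("03:12:23", "G.html", "B.html", WIN), row("05:05:22", "A.html", "-", WIN),
    row("05:06:03", "D.html", "A.html", WIN),
]
session_paths = [[e["url"] for e in s] for s in identify_sessions(log)]
for path in session_paths:
    print(" -> ".join(path))
```

Under these assumptions the thirteen requests from one IP address split into four sessions: A-B-F-O-G and L-R for the Win95 agent, A-B-C-J for the X11 agent, and A-D after the two-hour gap.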
12 of 18
WUM – Pattern Discovery
The pattern discovery process applies data mining techniques to generate rules and patterns.
Data Mining Techniques
Association rule generation
Clustering
Sequential patterns
13 of 18
WUM – Association Rule Generation
Discovers correlations between pages that are most often referenced together in a single server session.
Provides answers to questions such as:
What sets of pages are frequently accessed together by Web users?
What page will be fetched next?
What paths are frequently accessed by Web users?
Association rule: A → B [Support = 60%, Confidence = 80%]
Support is the fraction of sessions that contain both A and B; confidence is the fraction of sessions containing A that also contain B.
Example: “50% of visitors who accessed URLs /index.php and coed.php also visited coed_faculty.php”
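Support and confidence for a rule A → B can be computed directly from a list of sessions. The sketch below uses the standard definitions (support = fraction of sessions containing both sides, confidence = that support divided by the support of A). The session data is invented for illustration and chosen so that the rule from the example comes out at 50% confidence; the function names are hypothetical.

```python
def support(sessions, pages):
    """Fraction of sessions that contain every page in `pages`."""
    pages = set(pages)
    return sum(pages <= set(session) for session in sessions) / len(sessions)

def confidence(sessions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    antecedent_support = support(sessions, antecedent)
    if antecedent_support == 0:
        return 0.0  # the rule never applies in this data
    both = set(antecedent) | set(consequent)
    return support(sessions, both) / antecedent_support

# Hypothetical sessions, each a list of requested URLs.
sessions = [
    ["/index.php", "coed.php", "coed_faculty.php"],
    ["/index.php", "coed.php"],
    ["/index.php", "coed.php", "coed_faculty.php"],
    ["/index.php", "coed.php"],
    ["/index.php"],
]
antecedent = ["/index.php", "coed.php"]
consequent = ["coed_faculty.php"]
rule_support = support(sessions, antecedent + consequent)
rule_confidence = confidence(sessions, antecedent, consequent)
print(f"support = {rule_support:.0%}, confidence = {rule_confidence:.0%}")
```

In this data, four of the five sessions contain both /index.php and coed.php, and two of those four also contain coed_faculty.php, which gives the 50% confidence. A full rule miner such as Apriori would additionally enumerate candidate page sets rather than evaluate a single given rule.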
14 of 18
WUM – Clustering
Groups together a set of items having similar characteristics.
User clusters: discover groups of users exhibiting similar browsing patterns.
Page recommendation: the user’s partial session is classified into a single cluster, and the links contained in this cluster are recommended.
Page clusters: discover groups of pages having related content.
Page recommendation: links are presented based on how often URL references occur together across user sessions.
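The page-based recommendation idea, ranking links by how often URLs occur together across sessions, can be sketched with simple co-occurrence counting. This is not a full clustering algorithm, only the counting step such a method might build on; the function names are hypothetical, and the sessions are the four that the referrer-based heuristic would produce from the server log table.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sessions):
    """Count, for every unordered pair of pages, the sessions in which both appear."""
    counts = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(page, sessions, top_n=3):
    """Return up to `top_n` pages that most often share a session with `page`."""
    scores = Counter()
    for (a, b), n in cooccurrence_counts(sessions).items():
        if a == page:
            scores[b] += n
        elif b == page:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]

sessions = [
    ["A.html", "B.html", "F.html", "O.html", "G.html"],
    ["L.html", "R.html"],
    ["A.html", "B.html", "C.html", "J.html"],
    ["A.html", "D.html"],
]
print(recommend("A.html", sessions, top_n=1))
```

B.html shares two sessions with A.html while every other page shares at most one, so it is the top recommendation for a visitor currently on A.html.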
15 of 18
WUM – Sequential Patterns
Mining sequential patterns (SP) is very similar to mining association rules; the only difference is that the time element (the order of events) is taken into account.
Example: 15% of accesses follow the sequence “a.html, then b.html, then c.html”.
Applications: Web site structure modification, Web personalization, Web access patterns
Algorithms: AprioriAll, GSP, WAP-tree
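The role of ordering can be shown by measuring how many sessions contain a given page sequence in order. The sketch below only checks one candidate pattern; it is not AprioriAll, GSP, or the WAP-tree, which also generate the candidates. The sessions and function names are hypothetical, and gaps between the pattern's pages are allowed by assumption.

```python
def contains_in_order(session, pattern):
    """True if the pages of `pattern` occur in `session` in the same order (gaps allowed)."""
    remaining = iter(session)
    # `page in remaining` advances the iterator, so each match must come after the previous one.
    return all(page in remaining for page in pattern)

def sequence_support(sessions, pattern):
    """Fraction of sessions that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sessions) / len(sessions)

# Hypothetical sessions, each an ordered list of requested pages.
sessions = [
    ["a.html", "b.html", "c.html"],
    ["a.html", "x.html", "b.html", "c.html"],
    ["c.html", "b.html", "a.html"],
    ["a.html", "b.html"],
]
pattern = ["a.html", "b.html", "c.html"]
ordered = sequence_support(sessions, pattern)
unordered = sum(set(pattern) <= set(s) for s in sessions) / len(sessions)
print(f"ordered support = {ordered:.0%}, unordered support = {unordered:.0%}")
```

Three of the four sessions contain all three pages, but only two visit them in the order a, b, c, so the sequential support is lower than the corresponding association-rule support.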
16 of 18
Conclusion
Web Usage Mining is an active research area today.
User access patterns found by the WUM process can help with:
Predicting users’ next requests, which can be used for prefetching
Web site restructuring
Web personalization
18 of 18
Thank You