Mining the Web for Target Marketing Information

James Geller, Richard Scherl, and Yehoshua Perl
Department of Computer Science, College of Computing Sciences
New Jersey Institute of Technology
University Heights, Newark, New Jersey 07102-1982, U.S.A.
email: {geller, scherl, perl}@cis.njit.edu
Mar 1, 2002
Abstract
This paper describes the architecture of a system designed to mine home pages available on the web for marketing purposes. Today there are millions of home pages on the web. People post their likes and dislikes on their home pages. We argue that these web pages constitute a valuable source of information, both to narrowly identify individuals as potential customers for particular products and as the basis for drawing conclusions about the relation between interests and demographic categories. The system described here is our initial attempt to address this problem. We also lay out numerous topics for further research that have developed out of the initial work.
1 Introduction
Today there are millions of home pages on the web. Not only academics have home pages. Regular people have them too. They freely express their likes and dislikes on their home pages. By posting these home pages, they make themselves available online and create their own identity on the web. The information is posted there for all to see. There are no privacy issues. We propose that these pages are a valuable source of data for marketing purposes.

One approach is to use the contact information for direct marketing. For example, if someone is interested in music, then he or she might want to buy CDs. Thus the marketing can be directed towards a very narrow niche. If someone is interested in Stevie Ray Vaughan, then he or she can be targeted for a particular CD containing recordings of the musician or for a biography that may be of interest to a fan.

A second important use of this data is for research relevant to marketing. The data may be mined for useful correlations between interests and also between interests and demographic categories. If someone is interested in Stevie Ray Vaughan, what is the likelihood that he or she is interested in the music of another musician? What age groups are interested in particular types of music? The available data can be used for such investigations. The results may again be useful for marketing. For example, it may be determined that people interested in Stevie Ray Vaughan are generally interested in certain other musicians. Additionally, the results may be of general sociological interest.

This paper describes a prototype system that we have implemented to test our ideas. In Section 2, we outline the architecture of the system. Related work is surveyed in Section 3. Issues for further work are discussed in Section 4. The bulk of the system is implemented in Java.
2 The Project
To test our ideas, we have implemented a prototype system. The architecture of the system is shown in Figure 1. Each aspect of the system depicted in the diagram is described in turn.
2.1 Data Extraction
There are a number of web portals that allow participants to post their home pages in a structured or semi-structured format. We began our work with Yahoo (http://www.yahoo.com) home pages because they have a particularly simple structure that is well suited to our purposes: the home pages are organized so that they are accessible through an interest tree. Figure 2 shows a typical Yahoo web page. We are also currently working with ICQ (http://web.icq.com) and Tripod (http://www.tripod.lycos.com) pages. The structured information includes the name, age, sex, marital status, occupation, email address, hobbies, likes, dislikes, interests, links to favorite websites, and a free-text section. Information from the structured sections is easily extracted, although it is necessary to write a separate extractor program for each portal.
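As a rough illustration of what a per-portal extractor has to do, the following Java sketch pulls labelled fields out of a profile. It is not the system's extraction code: the class name `ProfileExtractor`, the list of labels, and the assumption that a page has already been reduced to plain `Label: value` lines are all hypothetical choices made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: extracts "Label: value" pairs from one profile,
// assuming the portal page has already been reduced to plain text lines.
class ProfileExtractor {

    // Labels this illustrative extractor looks for (an assumption, modeled
    // on the structured fields listed in the text).
    private static final String[] LABELS = {
        "Real Name", "Location", "Age", "Marital Status", "Gender", "Occupation"
    };

    // Returns the known fields found in the text, keyed by label,
    // in the order in which they appear.
    static Map<String, String> extract(String profileText) {
        Map<String, String> fields = new LinkedHashMap<String, String>();
        for (String line : profileText.split("\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "Label: value" line
            }
            String label = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            for (String known : LABELS) {
                if (known.equalsIgnoreCase(label) && !value.isEmpty()) {
                    fields.put(known, value);
                }
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String page = "Real Name: Randy\nLocation: Canada\nAge: 36\nGender: Male\n";
        System.out.println(extract(page));
    }
}
```

A different portal would need its own label list and its own way of reducing the page to text, which is why one extractor per portal is required.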
[Figure 1: Architecture. Components shown: Web, Web Extraction Component, Advanced Extraction Component, Data Filtering Component, Ontology, Relational Customer/Interest DB, Data Mining Component, Front End, Web Browser, User.]

We have not yet, in our working system, made use of the unstructured data or the links to additional sites. The issues involved in making use of this information are similar to those that come up in utilizing unstructured web pages (discussed below).

Needless to say, there are also many “non-portal” home pages. These do not follow any specific format and so are more difficult to mine. The needed techniques are similar to those found in text classification [MS01] and information retrieval [BYRN99]. Currently we are writing rules in the expert system shell language Jess (http://herzberg.ca.sandia.gov/jess/) to categorize web pages (and also to gain added information from the free-text sections of structured web pages) with regard to the interests and demographic characteristics of each page owner.

The issue of gathering “non-portal” home pages is one that we have not yet addressed. Certainly there are many pages available on university web sites. These are portals of a sort, but they generally do not enforce a structured type of home page. Many internet providers work in a similar fashion. What is needed is a search engine that gathers home pages. This is a topic for future research.

The Web Extraction component is written in a language called WebL (http://research.compaq.com/SRC/WebL), an interpreted language implemented in Java. The language has special facilities to extract information from web pages.

[Figure 2: Sample Yahoo Page. Profile of bluesisblood666 (http://profiles.yahoo.com/bluesisblood666), last updated March 01, 2002.
Basics: Yahoo! ID: bluesisblood666. Real Name: Randy. Location: Canada. Age: 36. Marital Status: No Answer. Gender: Male. Occupation: first mate on the s.s. minnow/bluesman.
More About Me: Hobbies: Listening to and playing music (blues,jazz and classic rock).Going to concerts and reading.Training little blues guys(see profile pic.) Latest News: been livin the blues,now i’m lookin for the light. Favorite Quote: "been down so goddamn long,that it looks like up to me"
Links: Home Page: No home page specified. Cool Link 1: http://www.led-zeppelin.com. Cool Link 2: http://www.bluesboymusic.com. Cool Link 3: http://www.live365.com/stations/52559.
My Interests: Hard Rock; Blues; Blues Brothers, The; Vaughan, Stevie Ray; Winter, Johnny; Classic Rock; Guitar; Davis, Miles; Baker, Chet; Parker, Charlie; Deep Purple; Santana; Canadian Football League (CFL); National Football League (NFL); Dallas Cowboys; Hockey; Montreal Canadiens; Ottawa Senators; Horror; King, Stephen; Rice, Anne; Mystery; Christie, Agatha; Doyle, Sir Arthur Conan; Allman Brothers Band; Black Crowes; Led Zeppelin; Lynyrd Skynrd; Clapton, Eric; Reading Groups.]

2.2 Front-End
Additionally, we have built a web-accessible front-end (http://web.njit.edu/challeng/) for purposes of testing the system. A user can sit at a terminal and retrieve the email addresses of individuals who have particular interests and satisfy certain demographic requirements. A screen dump of the front-end is given in Figure 3. As illustrated in this screen dump, the user of the system is provided with a pop-up menu to assist in selecting interests that correspond to our interest ontology, the next topic to be discussed.

[Figure 3: User Interface]
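The kind of lookup the front-end offers can be sketched in Java as a filter over stored profiles. This is a minimal sketch under stated assumptions: the `Person` class, its fields, the example addresses, and the in-memory list are all hypothetical, whereas the system itself stores its data in a relational database (Section 2.5).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of a front-end query: find the email addresses of
// people with a given interest who satisfy age and gender requirements.
class InterestQuery {

    // Illustrative record of one page owner; field names are assumptions.
    static class Person {
        final String email;
        final int age;
        final String gender;
        final Set<String> interests;

        Person(String email, int age, String gender, String... interests) {
            this.email = email;
            this.age = age;
            this.gender = gender;
            this.interests = new HashSet<String>(Arrays.asList(interests));
        }
    }

    // Returns emails of people who list the interest, whose age lies in
    // [minAge, maxAge], and whose gender matches (a null gender means "any").
    static List<String> emailsFor(List<Person> people, String interest,
                                  int minAge, int maxAge, String gender) {
        List<String> result = new ArrayList<String>();
        for (Person p : people) {
            boolean ageOk = p.age >= minAge && p.age <= maxAge;
            boolean genderOk = gender == null || gender.equalsIgnoreCase(p.gender);
            if (ageOk && genderOk && p.interests.contains(interest)) {
                result.add(p.email);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person("a@example.org", 36, "Male", "Blues", "Hockey"),
            new Person("b@example.org", 17, "Female", "Skateboarding"),
            new Person("c@example.org", 41, "Male", "Blues"));
        System.out.println(emailsFor(people, "Blues", 30, 39, "Male"));
    }
}
```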
2.3 Ontology
The system is organized around an ontology of interests, which organizes concepts in a generalization hierarchy. An illustration is given in Figure 4. We began by using the interest hierarchy of the Yahoo web portal, discussed above. Therefore our current ontology is closely related to the Yahoo ontology. The interest ontology is quite large: it contains 31,531 interests arranged in 11 levels. These are stored with their unique IDs in the Oracle database, so the ontology currently has 31,531 rows.

[Figure 4: Ontology Example. An IS-A hierarchy with MUSIC at the top, over the nodes ROCK, JAZZ, HARD ROCK, SOFT ROCK, METAL, and GRUNGE.]

We also have developed ontologies of people, i.e., demographic categories. We are starting to work on an ontology of products, that is, things that people would want to buy. Some of this work is discussed in [XGPH02]. Ultimately, we want to have rules mapping interests to items that a person with those interests would likely buy.
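A generalization hierarchy of this kind can be represented with child-to-parent IS-A links. The Java sketch below is illustrative only: the class name `InterestOntology` is hypothetical, and the specific edges (for example, placing METAL and GRUNGE under HARD ROCK) are one plausible reading of Figure 4, not a dump of the stored ontology.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of an interest ontology stored as IS-A links.
class InterestOntology {

    // Maps each interest to its direct generalization (its IS-A parent).
    private final Map<String, String> parent = new HashMap<String, String>();

    // Records that child IS-A parentInterest.
    void addIsA(String child, String parentInterest) {
        parent.put(child, parentInterest);
    }

    // Returns the chain of generalizations from the interest up to the root,
    // nearest ancestor first; empty if the interest has no recorded parent.
    List<String> ancestors(String interest) {
        List<String> chain = new ArrayList<String>();
        String current = parent.get(interest);
        while (current != null && !chain.contains(current)) { // guard against cycles
            chain.add(current);
            current = parent.get(current);
        }
        return chain;
    }

    // True if specific equals general or is a descendant of it.
    boolean isA(String specific, String general) {
        return specific.equals(general) || ancestors(specific).contains(general);
    }

    // Example hierarchy; the exact edges are an assumption.
    static InterestOntology example() {
        InterestOntology o = new InterestOntology();
        o.addIsA("ROCK", "MUSIC");
        o.addIsA("JAZZ", "MUSIC");
        o.addIsA("HARD ROCK", "ROCK");
        o.addIsA("SOFT ROCK", "ROCK");
        o.addIsA("METAL", "HARD ROCK");
        o.addIsA("GRUNGE", "HARD ROCK");
        return o;
    }

    public static void main(String[] args) {
        System.out.println(example().ancestors("METAL"));
    }
}
```

Walking the ancestor chain is also what allows a specific interest to be replaced by a more general one when too few pages mention it.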
2.4 Data Filtering Component
It is necessary to take into account unreasonable data. We have found ages above 100, names like The Destroyer, offensive 4-letter words, and racial epithets. Many pages are missing names and ages. There are two issues here. One is when a page becomes so questionable that we should throw it out and not enter it into our database. The other is under what circumstances a page is not useful as a source of data, even though there is nothing to indicate that it is questionable. We have had to develop policies on each of these. For example, if any of the usual set of 4-letter words is found in a page, we simply ignore the page. Also, if the age is unrealistic, e.g., above 100, we ignore the page. We do accept pages that have obviously false names. We ignore pages that do not have either gender or age, as then we have very little demographic information to work with.
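The filtering policies just described can be summarized in a short Java sketch. It is hedged in several ways: the class name `PageFilter` and the placeholder word list are hypothetical, the negative-age check is an added example of an "unrealistic" age, and the last rule is read here as rejecting pages that state neither gender nor age, which is only one possible reading of the text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the page-filtering policy described in the text.
class PageFilter {

    // Placeholder banned-word list; the real list is not given in the text.
    private static final Set<String> BANNED =
        new HashSet<String>(Arrays.asList("badword1", "badword2"));

    private static final int MAX_PLAUSIBLE_AGE = 100;

    // Decides whether a page should be entered into the database.
    // age and gender may be null when the page does not state them.
    static boolean accept(String pageText, Integer age, String gender) {
        // Rule 1: reject pages containing any banned word.
        for (String token : pageText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (BANNED.contains(token)) {
                return false;
            }
        }
        // Rule 2: reject unrealistic ages (above 100; negative is an added assumption).
        if (age != null && (age < 0 || age > MAX_PLAUSIBLE_AGE)) {
            return false;
        }
        // Rule 3 (assumed reading): reject pages with neither gender nor age.
        boolean hasGender = gender != null && !gender.trim().isEmpty();
        if (age == null && !hasGender) {
            return false;
        }
        // Obviously false names are deliberately not checked.
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept("I love the blues", 36, "Male"));
        System.out.println(accept("hello", 140, "Male"));
    }
}
```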
2.5 The Database
We use a standard relational database (Oracle) to store the data gathered from the home pages. This includes a table for the demographic data (age, gender, name, etc.) and an interest table (interests and user IDs). Currently we have 1,113,723 rows in our user interest table and 230,846 rows in our demographic table.
2.6 Data Mining Component
Now that the data is available in a relational database, standard data mining algorithms can be applied to uncover potentially interesting or useful generalizations concerning the data. We are using the software package WEKA [WF99], which implements a number of data-mining algorithms. We have concentrated on the APRIORI algorithm for learning association rules. Five sample rules learned by the system from our data are as follows.

1. interest=Wargaming 183 ==> age=20-29 84 conf:(0.46)
2. interest=Gellar_Sarah_Michelle 1310 ==> age=10-19 846 conf:(0.65)
3. interest4=Role_Playing_Games 5 ==> interest1=Wargaming 3 conf:(0.6)
4. interest1=Wargaming 17 ==> interest3=Role_Playing_Games 3 conf:(0.18)
5. interest1=Skateboarding 204 ==> age=10-19 146 conf:(0.72)

With a high degree of confidence the system determined that if someone is a fan of Sarah Michelle Gellar then he or she is between the ages of 10 and 19. With a somewhat lower degree of confidence an individual interested in wargames is between the ages of 20 and 29. With a high degree of confidence, people interested in Role Playing Games are also interested in Wargaming, but the association is not as strong the other way around. Not surprisingly, with a high degree of confidence, those interested in skateboarding are relatively young.

The Data Mining Component makes use of the Ontology since in areas where the data is sparse, we move up the interest hierarchy and make use of more general interests but with data that is less sparse.
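The confidence values in the five sample rules follow from the two counts printed in each rule. Assuming the usual reading of this output format, the first number is the count of records matching the left-hand side and the second is the count of those records that also match the right-hand side, so the confidence is their ratio. The Java sketch below recomputes the values; the class and method names are hypothetical, and this is only the confidence arithmetic, not WEKA's APRIORI implementation.

```java
// Hypothetical sketch: confidence of an association rule A ==> B,
// computed as count(A and B) / count(A).
class RuleConfidence {

    // antecedentCount: records matching A; bothCount: records matching A and B.
    static double confidence(int antecedentCount, int bothCount) {
        if (antecedentCount <= 0) {
            throw new IllegalArgumentException("antecedent count must be positive");
        }
        return (double) bothCount / antecedentCount;
    }

    // Rounds to two decimals, matching the precision shown in the rules.
    static double round2(double value) {
        return Math.round(value * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        // Counts taken from the five sample rules in the text.
        int[][] rules = { {183, 84}, {1310, 846}, {5, 3}, {17, 3}, {204, 146} };
        for (int[] r : rules) {
            System.out.println(r[0] + " ==> " + r[1]
                + "  conf: " + round2(confidence(r[0], r[1])));
        }
    }
}
```

The counts also show how much evidence stands behind each rule: rule 3 has a confidence of 0.6 but rests on only 5 records, while rule 2 rests on 1,310.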
3 Related Work
A closely related piece of work is the study of Ling and Li [LL98] on the use of data mining for direct marketing. Their approach is rather different, as they are in effect attempting to learn the characteristics of people who buy particular products so that others with those same characteristics may be contacted. We begin with a much richer set of characteristics (i.e., interests) and are then able to use a set of common-sense rules to deduce that an individual would be likely to be interested in particular products. To the best of our knowledge, no one else has investigated the mining of individual home pages for marketing purposes.

There has been some interest in web mining [CHMW01], in extending data mining to cover the web [HZCC00], and in querying material available on the web [ACHK93, KMA+02, KLMM02]. Viewed very broadly, this work is an example of a recommender system, on which there is a substantial literature; an example is [TH01]. There is also literature on building specialized home-page search engines. An example is Ahoy [SLE97], the home page finder developed at the University of Washington.
4 Summary and Further Research Issues
Our initial experimentation has yielded positive results. It is indeed possible to mine personal home pages on the web and obtain useful data. We have built a database containing both interest information and demographic information from more than 200,000 web pages. This database can be used for two purposes. One is direct niche marketing of products, matched to the interests of people. The other is data-mining research on marketing topics. We have demonstrated that the latter is feasible.

Currently, we are extending the size of our database by considering web pages from a larger set of portals and other sources. A larger database will enable us to obtain more generalizations about the data through the use of data-mining techniques. Another topic to consider in the future is business-to-business marketing through mining company home pages.

In the process of carrying out this work we have begun to investigate a variety of issues that are the subject of our ongoing research. Some of these are listed below:
Non-Portal Web Pages, free text: We are actively working on a set of expert system rules (coded in Jess) that operate on a set of features (primarily noun phrases) taken from a page and categorize it in terms of the interests of the author. Another possible approach is to use machine learning techniques: train on pages already categorized in terms of the interests of their authors and then use the result to categorize new pages. A third possible approach is to write high-level plans for obtaining interest information from web pages. We are investigating using the agent programming language Golog [LRL+97] for this purpose.
Searching for non-portal web pages: This issue was discussed in the body of the paper. How can we effectively search for home pages located on non-portal sites?
Ontology Integration: Different portals utilize different ontologies for interests as well as demographic data. Integrating these ontologies by hand is quite difficult. Is it possible to develop automated techniques that work with a human to specify a method for integrating the ontologies, a method which would then work automatically after specification?
Data Mining Algorithms: As of now we have only experimented with the generation of association rules. In the future, we will try other algorithms as well, e.g., decision trees and Bayesian techniques.
Acknowledgements
This work has been supported by the New Jersey Commission for Science and Technology through the New Jersey Center for Software Engineering at the Stevens Institute of Technology. We also acknowledge the contribution of the numerous NJIT students who have worked on different aspects of this project.
References
[ACHK93] Yigal Arens, Chin Y. Chee, Chun-Nan Hsu, and Craig Knoblock. Retrieving and integrating data from multiple information sources. International Journal on Intelligent and Cooperative Information Systems, 2(2):127–158, 1993.
[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, Harlow, England, 1999.
[CHMW01] George Chang, Marcus Healey, James McHugh, and Jason Wang. Mining the World Wide Web: An Information Search Approach. Kluwer, Boston, 2001.
[HZCC00] J. Han, O.R. Zaïane, S. Chee, and J.Y. Chiang. Towards on-line analytical mining on the internet for electronic commerce. In Electronic Commerce Technology Trends: Challenges and Opportunities. IBM Press, 2000.
[KLMM02] Craig A. Knoblock, Kristina Lerman, Steven Minton, and Ion Muslea. Accurately and reliably extracting data from the web: A machine learning approach. Data Engineering Bulletin, 2002.
[KMA+02] Craig A. Knoblock, Steven Minton, Jose Luis Ambite, Naveen Ashish, Ion Muslea, Andrew Philpot, and Sheila Tejada. The Ariadne approach to web-based information integration. International Journal on Intelligent and Cooperative Information Systems, 2002.
[LL98] Charles X. Ling and Chenghui Li. Data mining for direct marketing: Problems and solutions. In KDD-98, 1998.
[LRL+97] Hector Levesque, Raymond Reiter, Yves Lespérance, Fangzhen Lin, and Richard B. Scherl. Golog: A logic programming language for dynamic domains. Journal of Logic Programming, 1997.
[MS01] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, 2001.
[SLE97] Jonathan Shakes, Marc Langheinrich, and Oren Etzioni. Dynamic reference sifting: A case study in the homepage domain. In Proceedings of the Sixth International World Wide Web Conference, pages 189–200, 1997.
[TH01] L.G. Terveen and W. Hill. Human-computer collaboration in recommender systems. In HCI in the New Millennium. Addison, 2001.
[WF99] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, Los Altos, California, 1999.
[XGPH02] X. Zhou, J. Geller, Y. Perl, and M. Halper. Design of a marketing ontology, 2002. Submitted for publication.