Enhancing Socioeconomic Surveys by Data about Internet Usage

Tomas Aluja (2), Gerhard Paaß (1), Albert Prat (2), and Ingo Schwab (1)

(1) Fraunhofer Institute for Autonomous Intelligent Systems, St. Augustin, Germany
(2) Universitat Politècnica de Catalunya, Barcelona, Spain
Abstract. One phenomenon of the current technological innovation is the continuously extending use of the Internet. The DIASTASIS project attempts to measure and assess Internet usage and to define the profile of Internet users. This is very important for many actors, such as public and private policy makers, organizations and individuals. Internet usage has several dimensions. One aspect is the frequency of Internet usage as well as the different types of Internet usage, e.g. email or the web. With respect to web usage, the distribution of different topics is important. In Diastasis we first recognize the language of a web page using character n-grams. Then we categorize the content of different web pages into a number of categories by language-dependent text mining methods. Finally we use socioeconomic properties of the users to combine this information with existing surveys by statistical grafting. This allows us to develop new indicators to quantify web usage and explore the web behavior of different socioeconomic groups.
1 Introduction
One of the main aspects of the current technological innovation is the continuously extending use of the Internet. There exist, however, virtually no reliable statistics about the composition of downloaded web pages and about the socioeconomic features of web users. To fill this information gap the DIASTASIS (Digital Era Statistical Indicators, EU Project IST-2000-31083) project attempts to measure and assess Internet usage and to define the profile of Internet users. This is very important for many actors, such as public and private policy makers, organizations and individuals. The activity of Internet users generates an enormous amount of data which may be analyzed by statistical methods. A recent overview of possible approaches is given by [SCDT00]. Internet usage has several dimensions. One aspect is the frequency of Internet usage as well as the different types of Internet usage, e.g. email or the web. With respect to web usage, the distribution of different topics is important. In Diastasis we categorize the content of different web pages into a number of categories and combine this information with the socioeconomic properties of the user. This allows us to develop new indicators to quantify web usage and explore the web behavior of different socioeconomic groups. National statistical offices have collected a wealth of data containing detailed information on the members of the population. On the other hand, the amount of detail which can be gathered about the actual web users is limited.
Sat Feb 1 00:51:47 2003  010.010.010.043  http://arc6.msn.com/ADSAdClient31.dll?
Sat Feb 1 00:51:47 2003  010.010.010.043  http://svcs.microsoft.com/svcs/mms/ads.asp?
Sat Feb 1 00:53:06 2003  010.010.010.043  http://www.msn.es/40231/ULI-e0a57f63.gif
Mon Feb 3 14:22:37 2003  010.010.010.016  http://www.bcn.es/imatges/flecha3azul.gif
Mon Feb 3 14:22:37 2003  010.010.010.016  http://www.bcn.es/imatges/punto.gif
...

Fig. 1. Sample from a web log file.

In Diastasis we use the technique of statistical grafting (merging, matching) to enhance a survey from a National Statistical Institute (NSI) by web mining data. It basically combines a survey record with the data of a specific web user if some important common variables existing in both datasets have similar values. This should preserve the correlations between the variables of interest to a large extent and allow the evaluation of the full set of variables with the usual statistical methods.
2 General Evaluation Strategy
The project is conducted in several steps:

1. A sample of Internet users is selected, which should be larger than the survey sample and should contain all groups of the population covered in the survey sample.
2. For each user basic socioeconomic data is collected by an electronic questionnaire, describing age, profession, marital status, income group, household composition, etc. These features are called common variables, as they also exist in the survey data.
3. Given the consent of the participants, their Internet usage for a certain period is recorded by a "sniffer", a device which stores the URL and the access time of each web page the user downloads.
4. The recorded URLs are downloaded again and the language of each web page is classified.
5. Using a language-dependent text mining system the pages are assigned to one or more meaningful classes, e.g. politics, sports, newspaper, banking, shopping, etc.
6. For each user the information about the downloaded classes is aggregated and combined with the common variables.
7. In a final step the Internet user data is statistically grafted with a comprehensive and representative survey from the statistical office (Institut d'Estadística de Catalunya) to extend the number of variables. The merge criterion is the distance on the values of the common variables.

To show the viability of the approach, the Internet usage of students and staff from the Universitat Politècnica de Catalunya will be monitored for three months during the summer of 2004.
Fig. 2. Yahoo top level directory entries:
– Arts & Humanities: Photography, History, Literature...
– Business & Economy: B2B, Finance, Shopping, Jobs...
– Computers & Internet: Internet, WWW, Software, Games...
– Education: College and University, K-12...
– Entertainment: Movies, Humor, Music...
– Government: Elections, Military, Law, Taxes...
– Health: Diseases, Drugs, Fitness...
– News & Media: Newspapers, TV, Radio...
– Recreation & Sports: Sports, Travel, Autos, Outdoors...
– Reference: Phone Numbers, Dictionaries, Quotations...
– Regional: Countries, Regions, US States...
– Science: Animals, Astronomy, Engineering...
– Social Science: Languages, Archaeology, Psychology...
– Society & Culture: People, Environment, Religion...
For the classification of Internet pages a number of classifiers have been trained which can recognize Yahoo categories. This is done for Spanish, Catalan and English. Statistical grafting of web data and survey data uses a novel nearest neighbor strategy, which leads to an optimal assignment of records. The methods required have already been implemented and are currently being tested. After the project it is planned to apply the approach at other European national statistical offices. In the following sections we describe these steps in more detail.
3 Classification of Web Pages
Information about web activity is stored in web log files containing the sequence of web events. Log files represent a record of the requests issued from a web browser. Typically the events are requests (e.g. html pages, pdf documents, script executions, applet downloads, etc.) or responses (e.g. containing documents, output from scripts, applet code). They can be collected by a network monitor (also called a network sniffer). It listens passively to all information that travels over the network, identifies packets containing parts of HTTP messages, and constructs a log of the URL requests in those packets. Each line in the log file represents one request of a client. It is composed of a set of different fields. As shown in figure 1 these are the date, the IP address or host name of the client, and the URL of the requested information. The task of text mining in the Diastasis project is the association of URLs from web logs with meaningful text categories. Text classification requires a training set with given categories. These may be determined from existing web directories like Google or Yahoo. These directories consist of a hierarchy of categories.
Fig. 3. Sample code of a web page (left) and cleaned version (right).
Original:
  Press Quotes
  Press can e-mail press @ ryze.com
  New York Times, 2/16/03
  At Ryze.com, a networking group, members get a free profile page where they list their vitals, favorite quotations, hobbies, previous jobs and future . . .
Cleaned:
  Press Quotes Press can e mail press ryze com New York Times 2 16 03 At Ryze com a networking group members get a free profile page where they list their vitals favorite quotations hobbies previous jobs and future . . .
They are available for different languages, e.g. English (www.google.com/, www.yahoo.com/, cf. figure 2), Spanish (www.google.com/intl/es/, es.yahoo.com/), or Catalan (www.google.com/intl/ca). Across different languages these directories have a nearly uniform structure, which allows us to classify documents from different languages into categories of similar meaning. Using the documents linked in these directories we may train text classifiers for the different categories of this hierarchy, e.g. all categories down to the second level. Of course the selection of categories should be adapted to the domain; e.g. for users from the computer science field there should be an extended number of information technology categories. In addition there are other classes of documents, e.g. pictures, music, or pdf documents. Pictures and music might be categorized using the text of the documents in which they appear, while pdf or postscript documents may be classified from their own text. The effort for classifying a single document is less than a CPU-second. Therefore the online classification of a large number of documents is possible. Observing the users in this way can generate a lot of data. That is because when someone visits one web page, the browser sends multiple requests to the server. One type of request is for HTML files, but that is just one of many request types that will be sent. Additional request types are sent for the individual elements that make up the web page, including graphics files, audio files, Flash files, and so on. Therefore one page view can generate many requests, and these requests should be grouped together.
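As an illustration of this grouping step, the following minimal Python sketch (a hypothetical parser, not the actual Diastasis sniffer software) reads log lines of the form shown in figure 1 and attaches auxiliary requests such as images to the preceding page request of the same client within a short time window; the file-extension list and the time window are arbitrary assumptions.

from datetime import datetime
from collections import defaultdict

def parse_line(line):
    """Parse one log line of the form shown in figure 1:
    'Sat Feb 1 00:51:47 2003 010.010.010.043 http://...'."""
    parts = line.split()
    timestamp = datetime.strptime(" ".join(parts[:5]), "%a %b %d %H:%M:%S %Y")
    ip, url = parts[5], parts[6]
    return timestamp, ip, url

# Assumed extensions of auxiliary page elements (graphics, style sheets, etc.).
AUXILIARY = (".gif", ".jpg", ".png", ".css", ".js", ".ico", ".swf")

def group_page_views(lines, window_seconds=10):
    """Group requests per client IP: auxiliary requests issued shortly after a
    page request are attached to that page view instead of being counted as
    separate views; stray auxiliary requests without a preceding page are dropped."""
    views = defaultdict(list)   # ip -> list of (timestamp, page url, [auxiliary urls])
    for line in lines:
        ts, ip, url = parse_line(line)
        is_aux = url.lower().endswith(AUXILIARY)
        history = views[ip]
        if is_aux and history and (ts - history[-1][0]).total_seconds() <= window_seconds:
            history[-1][2].append(url)   # attach to the most recent page view
        elif not is_aux:
            history.append((ts, url, []))
    return views

In practice the grouping rules would have to be tuned to the observed traffic, e.g. for frames and dynamically generated page elements.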
3.1 Language Classification
An important subtask of our approach is to identify the language of the observed web pages. For this we use an n-gram-based approach. An n-gram is an n-character slice of a longer string. Typically, the string is sliced into a set of overlapping n-grams of several different lengths simultaneously. In this approach blanks are also added to the beginning and the end of the string, which helps with matching beginning-of-word and end-of-word situations. Thus, writing the padding blanks as "_", the word "BOOK" would be composed of the following n-grams:

Bi-grams: _B, BO, OO, OK, K_
Tri-grams: _BO, BOO, OOK, OK_, K__
Quad-grams: _BOO, BOOK, OOK_, OK__, K___

Experiments have shown that the 300 top ranked n-grams are almost always highly correlated with the language. That is, a long English passage about system architecture and a long English text about philosophy tend to have a great many n-grams in common among the top 300 entries of their respective profiles. On the other hand, a long passage in German on almost any topic would have a very different distribution of the most frequent 300 n-grams. This system works very well for language classification, achieving in one test a 99.8 percent correct classification rate on Usenet newsgroup articles written in different languages.
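A minimal Python sketch of such an n-gram based language classifier follows (the rank-order "out-of-place" distance and the tiny reference texts below are illustrative assumptions, not the profiles actually used in the project; real profiles would be built from large reference corpora):

from collections import Counter

def ngram_profile(text, n_values=(2, 3, 4), top=300):
    """Profile of the `top` most frequent character n-grams. Words are padded
    with blanks (written as underscores) as in the BOOK example above."""
    counts = Counter()
    for word in text.lower().split():
        for n in n_values:
            padded = "_" + word + "_" * (n - 1)
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [g for g, _ in counts.most_common(top)]
    return {g: rank for rank, g in enumerate(ranked)}

def out_of_place_distance(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing in the language profile get the maximum penalty."""
    return sum(abs(rank - lang_profile.get(g, max_penalty))
               for g, rank in doc_profile.items())

def classify_language(text, language_profiles):
    doc = ngram_profile(text)
    return min(language_profiles,
               key=lambda lang: out_of_place_distance(doc, language_profiles[lang]))

# Usage sketch with placeholder reference texts.
language_profiles = {
    "english": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "spanish": ngram_profile("el veloz zorro marrón salta sobre el perro perezoso"),
}
print(classify_language("the book is on the table", language_profiles))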
4 Text Classification
A well-known approach to assign nominal features to text documents is text classification. Assume we have a collection of documents $d_1, \ldots, d_n$, each of which is assigned to one or more of $k$ categories. It is important that the categories usually are not exclusive, i.e. a document may belong to several categories. Let $c_i \in \{0,1\}^k$ be the indicator vector of these categories for document $d_i$.
4.1 Preprocessing of Documents
A text classification method has the aim to "learn" the stochastic relation between the text of a document $d_i$ and the corresponding class vector $c_i$. To apply automatic statistical methods we first have to code the words of the text into a representation which can be readily processed by statistical estimation procedures. Surprisingly, for many applications it is sufficient simply to count the frequency of occurrence of each word in a text. For this "bag-of-words" representation [SM83] the sequence of words is ignored. Note that the resulting vectors of counts are sparse and very long, as they comprise the counts for each of the possibly more than 100000 words of a language. Variations of the feature selection include removing case, punctuation, infrequent words, and extremely frequent words (stop words). In addition, similar words may be mapped to a common stem by reducing words to their morphological roots (stemming) as well as by splitting compound words. For example the words "information", "informing", "informer" and "informed" would be stemmed to their common root "inform", and only this stem is used as a feature instead of the original words. The features can be reduced further by applying other feature selection techniques, such as information gain, mutual information, cross entropy, or odds ratio. Other preprocessing includes principal component analysis (PCA), which seeks to transform the original document vectors to a lower dimensional space by analyzing the correlational structure of terms in the document collection [Seb02].
Fig. 4. Separating hyperplane of a Support Vector Machine: the hyperplane $w \cdot x + b = 0$ separates class $1$ ($w \cdot x + b > 0$) from class $-1$ ($w \cdot x + b < 0$) with maximal margin.
Each reduction step involves a loss of information. Therefore the optimal amount of information reduction cannot be known beforehand but has to be determined for the specific domain and corpus by experimentation and cross-validation.
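A minimal Python sketch of such a preprocessing step follows (the stop word list and the suffix-stripping rule are crude placeholders; a real system would use proper stop word lists and a morphological stemmer for each language):

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are"}  # placeholder list
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "s")                          # crude stemming rule

def stem(word):
    """Very crude suffix stripping; a real stemmer handles far more cases."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[:-len(suffix)]
    return word

def bag_of_words(text, min_length=2):
    """Map a document to a sparse vector of stem counts:
    lowercase, strip punctuation, drop stop words, stem, count."""
    tokens = re.findall(r"[a-záéíóúàèòüïçñ]+", text.lower())
    stems = [stem(t) for t in tokens if t not in STOP_WORDS and len(t) >= min_length]
    return Counter(stems)

print(bag_of_words("Information about informing the informer was informed"))
# -> counts the common stem "inform" instead of the four surface forms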
4.2 Text Classification Algorithms
During the last years a number of estimation methods have been proposed and evaluated which may be used for text classification. A recent overview is given by [Seb02].

– Decision trees recursively divide the input space according to the values of single variables. They may be extended to ensembles of trees by Bayesian methods or boosting.
– Naïve Bayes methods model the relation between the class indicators $c_{ij}$ and the input vector (bag of words) $x_i = (x_{i1}, \ldots, x_{iN})$ by a probability model assuming independence between words given a specific class. Naïve Bayes classifiers do surprisingly well in spite of their restrictive assumptions [DPHS98,Joa98a].

During the last years a new procedure, the support vector machine (SVM), was applied to text classification [Joa98a]. SVMs take the input space and map it onto a high-dimensional space induced by a kernel (see figure 4). Complexity is restricted by the fact that the decision boundary in the high-dimensional space is the maximum margin hyperplane, i.e. the separating hyperplane with maximum distance (margin) to the data points.
Table 1. First Results for Web Mining for 13 Yahoo Classes in English.

class                  precision  recall  F-measure
recreation                  49.3    44.2       46.6
reference                   91.2    76.1       83.0
news & media                62.6    32.7       43.0
arts                        39.0    52.7       44.8
government                  62.5    41.6       50.0
science                     55.1    44.7       49.4
social science              55.6    35.6       43.4
entertainment               21.7    66.9       32.7
computers & internet        75.3    50.2       60.2
education                   63.2    55.9       59.3
society & culture           45.3    33.0       38.2
business & economy          61.3    40.5       48.8
health                      77.6    47.5       58.9
The data points which are closest to the hyperplane, and therefore define it, are the support vectors. Because the margin of the decision boundary is maximal, the Vapnik-Chervonenkis (VC) dimension can be controlled independently of the dimensionality of the input space. This implies good generalization. Since not all problems are linearly separable, Cortes and Vapnik [CV95] proposed a modification, the SVM with soft margins, to the optimization formulation that allows, but penalizes, examples that fall on the wrong side of the decision boundary. Instead of restricting the number of features, support vector machines reduce the number of sample points used and thus arrive at good generalization. Several authors [Joa98b,Seb02] showed that SVMs perform substantially better at classifying text documents into topic categories than the currently best performing conventional methods (Naïve Bayes, Rocchio, decision trees, k-nearest neighbor) in such a context. In Diastasis we have applied SVMs to web pages linked to the English Yahoo tree. These pages were downloaded and preprocessed as described above. Table 1 shows the performance of the SVM classifiers on a separate test set which was not used for training. The precision and recall values vary around 50%, showing that the classification of web pages is a difficult task. Currently we conduct experiments to improve the performance by using the information in linked web pages. For Diastasis we estimate classifiers for different languages using the Yahoo directories for these languages (cf. figure 5). To classify a web page, we first use the language classifier to determine its language (cf. figure 6). Subsequently the corresponding text classifier is employed to determine the class (or classes) of the web page. Basically, the probability of each class is estimated.
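The following sketch illustrates this classification setup with the scikit-learn library (an illustrative reconstruction, not the classifiers trained in the project; the training texts and labels are placeholders): one linear soft-margin SVM is trained per Yahoo category in a one-vs-rest fashion, since a page may belong to several categories.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Placeholder training data: cleaned page texts and their (possibly multiple) categories.
train_texts = ["stock market shares economy", "football match season goals",
               "university course exam lecture", "election government parliament law"]
train_labels = [["business & economy"], ["recreation"], ["education"], ["government"]]

binarizer = MultiLabelBinarizer()                 # class indicator vectors c_i in {0,1}^k
y_train = binarizer.fit_transform(train_labels)

vectorizer = CountVectorizer(lowercase=True, stop_words="english", min_df=1)
x_train = vectorizer.fit_transform(train_texts)   # sparse bag-of-words count vectors

# One soft-margin linear SVM per category (one-vs-rest).
classifier = OneVsRestClassifier(LinearSVC(C=1.0)).fit(x_train, y_train)

test_texts = ["the parliament passed a new tax law"]
y_pred = classifier.predict(vectorizer.transform(test_texts))
print(binarizer.inverse_transform(y_pred))

# On a held-out test set, precision, recall and F-measure per class as in table 1
# could be obtained with sklearn.metrics.precision_recall_fscore_support(y_test, y_pred, average=None).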
Fig. 5. Training of classifier models using web pages from the Yahoo tree for different languages: training documents from the English Yahoo tree are turned into English count vectors, from which the English model is trained; training documents from the Spanish Yahoo tree are turned into Spanish count vectors, from which the Spanish model is trained.
5 Merging Web Data and Survey Information
The final result of the data collection effort will be a representative database of individuals simultaneously having all variables of the survey as well as Internet usage features. This allows the usual statistical evaluation and modelling technologies to be applied to this database. We assume that the survey data is a representative sample of the population, collected by an official institute of statistics (IDESCAT for the pilot study) or an accredited polling institute, with a fairly standard questionnaire for the ICT society containing a long list of information about socio-demographics, habits, equipment, opinions, etc., and with a standard sample design, usually a stratified two-stage sample with a list of substitutes. This information is currently collected on a household basis, but it can be transformed to an individual basis if desired. On the other hand, the web data is collected following a protocol assuring national and European rules on the confidentiality of Internet navigation. This information includes the log data collected for a period of three months, the associated web document classes estimated by text mining, and a certain number of specified common variables shared with the survey data. We also assume that we have a sufficient number of participants to cover all segments of the population under study, although not in a representative way. Merging the information from independent files can be achieved in different ways. We only mention here two basic approaches:

– The first consists in finding an explicit model connecting the specific variables with the common variables in the complete file and applying this model in the incomplete file. Modelling can be performed by simultaneous multiple regression, neural networks, principal component regression (PCR), PLS regression, multivariate adaptive regression splines (MARS), etc.
– The other (hot-deck, implicit modelling) consists in finding for each individual of the recipient file one or more donor individuals as similar as possible, and then in some way transferring the values of the specific variables to the recipient individual. Hot deck can be done using the k-nearest neighbors, by a preliminary classification of the donor and receptor individuals, or by a segmentation tree.
Fig. 6. Classification of web pages by first detecting the language and then applying the appropriate classifier model: a web document is mapped to letter n-grams, the language classifier activates either the Spanish or the English path, and the corresponding count vector is fed into the Spanish or English model to produce the Spanish or English class probabilities.
In Diastasis we employ the hot-deck approach. We assume the canonical situation for file grafting (see figure 7); that is, we have two files:

– Donor file: the web data, containing the complete information, with common variables x and specific variables y.
– Receptor file: the survey data, with common variables x and additional socioeconomic variables z.

The common variables are a specified list of questions with good predictive power for Internet usage, whereas the specific variables form the Internet profile of each user. The interest of this methodology as an alternative to a single data source stems from the fact that splitting the complete information into two sources facilitates the collection of data. It allows having a representative sample of households, which otherwise would be fairly difficult to obtain. It does not increase the burden for the participants in the Internet data collection, since the number of common variables is very small, and because we avoid asking personal or private questions the participants will feel less checked out. Moreover, it allows greater detail in the collected Internet data, so that the different kinds of web usage can be recorded, thus increasing the overall quality of the obtained results. The quality of the grafting operation depends on the predictive power of the common variables with respect to the Internet behavior of the users. These common variables have been selected from the analysis of the IDESCAT household IT survey of November 2002 and have been adapted to the specific case of university members for the pilot study.
Fig. 7. Sketch of the merging on the basis of common variables to impute the missing web data: the receptor file (survey) contains the common variables x, demographic data and socioeconomic data; the donor file contains the common variables x, demographic data and the web data to be transferred.
They include:

1. Speed of Internet access.
2. Frequency of connection to the Internet.
3. Age.
4. Typology of professor/staff.
5. Department or center.
6. Years of Internet use.
7. Years of being a professor/staff.
8. Average number of e-mails received per day.
9. Average time of connection per day.
Only the italicized questions need to be asked of the web user directly; the remaining ones can be obtained directly and without error from the ISP. Of course, all questions should appear in the questionnaire of the survey data, with the same wording and list of possible answers, and care should be taken to ensure that these questions were answered by the respondents to the survey study. All these variables will be measured on a categorical scale. The specific variables to be transferred to the receptor file are the vector of classified usages of the Internet, aggregated by IP address. This vector will contain the connection time and frequency for each class. The classes will be obtained directly from the Yahoo or Google categories. Every request to the Internet will be classified into one category. As an alternative we can consider a probabilistic assignment to more than one category; the time spent on a page is then distributed over the different categories according to their probabilities. Finally, and for the whole period, the complete set of assignments will be aggregated by IP address, producing a vector of categories with the time spent on each one. This is the most important result to be grafted; it will also be interesting to have the frequency of connection and eventually the probability per category.
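A small Python sketch of this aggregation follows (the data layout is a hypothetical simplification: each classified request is assumed to carry the client IP, the time spent on the page, and the estimated class probabilities):

from collections import defaultdict

# Each classified request: (ip, seconds spent on the page, {category: probability}).
requests = [
    ("010.010.010.043", 120, {"news & media": 0.7, "entertainment": 0.3}),
    ("010.010.010.043", 300, {"business & economy": 1.0}),
    ("010.010.010.016", 60,  {"government": 0.8, "reference": 0.2}),
]

def aggregate_usage(classified_requests):
    """Per IP address, accumulate connection time and request frequency per category.
    With probabilistic assignment the time of a page is distributed over the
    categories according to their estimated probabilities."""
    time_per_class = defaultdict(lambda: defaultdict(float))
    freq_per_class = defaultdict(lambda: defaultdict(float))
    for ip, seconds, class_probs in classified_requests:
        for category, prob in class_probs.items():
            time_per_class[ip][category] += seconds * prob
            freq_per_class[ip][category] += prob
    return time_per_class, freq_per_class

times, freqs = aggregate_usage(requests)
print(dict(times["010.010.010.043"]))   # time spent per category for this user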
5.1 The Grafting Procedure
File grafting is a k-nn hot-deck imputation methodology. It is based on a simple idea, namely that everyone behaves like those who are similar to him or her. It contains the idea of finding a proxy for each receptor, it allows using the same neighbor table for different grafting operations, and it allows keeping the specific variables confidential, since the imputation stage can be done separately. That is, it is not necessary to have the web data in order to compute the neighbor table; we just need the corresponding common variables. This can be used to keep the surfing behavior of the participants confidential. File grafting is divided into the following steps:

1. Positioning donors and receptors in the same subspace, defined from the common variables.
2. Performing an instrumental clustering of the donors.
3. Computing the neighbor table, relating each receptor to its k nearest donor neighbors.
4. Imputing the specific variables.
5. Validating the imputation.

We explain these stages in the following. The positioning of donors and receptors in the same factorial subspace is achieved by a Multiple Correspondence Analysis of the common variables measured in the donor file. Then a sensitivity analysis is conducted to identify the stable axes of the obtained solution. The stable axes define the common subspace. Upon these axes the receptor individuals are projected as supplementary points. Finally the factorial coordinates on the stable axes are stored for later use. For the instrumental clustering of the donors we use a hierarchical clustering algorithm and obtain a dendrogram of the donors with a large number of final classes (say 100). This simply means defining a partition (the final classes) of the common subspace and having the donors organized in a tree. The k-nn module implements a branch and bound algorithm for finding the k nearest donor neighbors for each receptor, following the Fukunaga-Narendra algorithm [FN75]. The search of the k nearest neighbors for a given receptor $x_0$ is performed in two steps. First we perform a search over the nodes of the dendrogram, and second we perform the search within the selected nodes of the tree. With the first step we eliminate the nodes (clusters) far away from the actual receptor $x_0$. In the second step we eliminate the individuals of the node that are too distant from $x_0$ to become its nearest neighbor. This procedure can simply be extended to k nearest neighbors. The objective of the imputation stage is to estimate for every receptor the specific variables, in our case the profile of Internet usage. At this stage we need to assume a probability distribution for the specific variables. In particular, we assume that for each receptor there exists a multivariate normal distribution of the specific variables. We estimate the parameters of this distribution from the neighbors of the receptor.
Then, according to the general objectives of the grafting procedure, which focus on preserving the association among variables rather than maximizing the accuracy of the imputation, we perform a multivariate imputation by random draws from this local distribution. In this way the imputation is done conditionally on the common variables, which reduces bias and preserves the association among the specific variables.
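A condensed Python/numpy sketch of steps 3 and 4 (neighbor search and imputation) follows, under simplifying assumptions: the common variables are already given as numerical factorial coordinates, an exhaustive neighbor search stands in for the branch and bound algorithm of [FN75], and the data are randomly generated placeholders.

import numpy as np

def graft(donor_common, donor_specific, receptor_common, k=5, rng=None):
    """k-nn hot-deck imputation sketch: for each receptor, find the k nearest donors
    in the space of common variables (here the factorial coordinates), fit a
    multivariate normal to their specific variables (the Internet usage profile)
    and impute by a random draw from that local distribution."""
    if rng is None:
        rng = np.random.default_rng()
    imputed = np.empty((len(receptor_common), donor_specific.shape[1]))
    for i, x0 in enumerate(receptor_common):
        dist = np.linalg.norm(donor_common - x0, axis=1)       # distances to all donors
        neighbors = donor_specific[np.argsort(dist)[:k]]       # specific variables of the k-nn
        mean = neighbors.mean(axis=0)
        cov = np.cov(neighbors, rowvar=False)
        imputed[i] = rng.multivariate_normal(mean, cov)        # random draw, not the mean
    return imputed

# Toy usage with placeholder data: 200 donors, 50 receptors,
# 3 common coordinates and 4 specific usage variables.
rng = np.random.default_rng(0)
donor_x = rng.normal(size=(200, 3))
donor_y = rng.normal(size=(200, 4)) + donor_x[:, :1]   # usage loosely related to the coordinates
receptor_x = rng.normal(size=(50, 3))
imputed_y = graft(donor_x, donor_y, receptor_x, k=10, rng=rng)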
6 Summary
In this paper we have described a procedure to enhance available survey data with web usage variables. For a representative sample of web users we collect the web pages they have downloaded during a certain period, together with the time of downloading, as well as a number of informative common variables about their socioeconomic status. By statistical grafting we link this information to data from a national statistical office, yielding a comprehensive dataset. It can be used to estimate the dependency between many socioeconomic variables and web usage information and is potentially of interest for government, administration, education and the economy.
References

[CV95] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273–297, 1995.
[DPHS98] S. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In 7th International Conference on Information and Knowledge Management, 1998.
[FN75] K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Trans. Comput., C-26:917–922, 1975.
[Joa98a] T. Joachims. Making large-scale SVM learning practical. Technical report, Universität Dortmund, 1998.
[Joa98b] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In C. Nedellec and C. Rouveirol, editors, European Conference on Machine Learning (ECML), 1998.
[SCDT00] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1:12–23, 2000.
[Seb02] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34:1–47, 2002.
[SM83] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.