Understanding how people use search engines: a statistical analysis for e-Business

Fidel Cacheda, Ángel Viña
Departamento de Tecnoloxías da Información e das Comunicacións
Facultad de Informática, Universidad de A Coruña
Campus de Elviña s/n, 15071 A CORUÑA, SPAIN
Telephone: +34-981-167000
e-mail: {fidel, avc}@udc.es
Abstract. This paper analyses the behaviour of the users of a Spanish portal named BIWE (http://www.biwe.es). The portal offers its users a wide variety of services, although our study focuses on the search services: a Web directory and a search engine. During a period of two weeks (from 3 May 2000 to 18 May 2000) we logged every access made by users to the search services. We then carried out a statistical study of four aspects: the categories visited by users in the Web directory, the queries submitted to the search engine, the Web pages visited afterwards, and the users’ sessions. The study lets us answer questions such as: how far does a user browse into the categories of the directory? How many words does a typical query contain? How complex are these queries? How many searches does a user perform during a session? The answers help us build a profile of a typical Web user and understand better how Web users behave, which is fundamental for e-Business.
1 Introduction
With the rapid growth in usage of the Web, there has been increasing interest in studying a variety of issues related to its use. However, to date there have been few studies of Web search users. At the beginning of 1998, Kirsch presented some search statistics of Infoseek usage in [3]. The main studies are [1], by Jansen et al., and [2], by Silverstein et al., which demonstrate that Web search users differ significantly from users of traditional Information Retrieval systems. The first studies queries taken from the query logs of Excite, and the second studies a very large number of queries taken from the AltaVista logs.

Our objective is to study not only the users’ queries, but also how they navigate through the results once a search has been performed and which documents they visit. Our research uses the transaction log, collected over several days, of a Spanish Internet directory called BIWE (Internet Searcher of Spanish Webs). Because we are working with a directory and not only a search engine, we also obtain useful information about the categories that users browse in order to find information. Consequently, our research covers the whole search process of Web users, including categories, Web documents and, of course, queries.

Some of the questions we set out to answer concern queries: which queries are most common, what is the average number of words per query, and how many queries does the average user session include? We also examine the categories and the result documents visited by users. Thus we obtain information about how many categories a user visits per session, which categories are most visited and, more importantly, which Web documents are most visited, how many Web documents a user visits per session, and the relationship among queries, Web documents and categories in a session.

The next section describes the BIWE search environment and the transaction log. The following sections analyse the queries, the categories browsed and the documents examined by users, respectively. Section 6 focuses on users’ sessions and, finally, Section 7 presents the conclusions as applied to e-Business.

2 The search environment
2.1 The directory and the search engine

BIWE is a Web directory whose contents consist of Web pages written in Spanish. A Web directory is a hierarchical taxonomy that classifies the information [4] found on the World Wide Web, restricted in our case to information written in Spanish. There are two ways of obtaining information from a directory: browsing its categories or searching its content. Documents are located through the graph that forms the hierarchical taxonomy of the directory.

The search engine is based on a weighted vector search. The simplest query consists of a collection of words that are ORed together, but the ranking algorithm places first the documents that best match the AND of those words. Some simple operators can be used instead of Boolean expressions: -example instructs the search engine to ignore documents containing example; +example instructs it to ignore documents not containing example. The quotation mark operator (") can be used to enclose phrases. The common Boolean operators can also be used in a query: the words and, or, not (and their Spanish translations) are interpreted as Boolean operators. Simple and Boolean operators can be combined by using parentheses to separate them.

By default, the search engine examines all the documents of the directory, although the search can be restricted to a category (and its descendants) by selecting this option before searching. Moreover, BIWE has a detailed search in which the user can control some additional aspects of the search. For instance, the user can select which fields of the documents are searched (four fields are stored: title, description, URL and some keywords).
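As an illustration of the query syntax just described, the following Python sketch reduces a query to its content words by dropping the +, - and quotation mark operators, parentheses, the Boolean reserved words and stopwords. It is not BIWE's implementation: the function name, the regular expression, the lowercasing step and in particular the stopword list are assumptions made for this example.

```python
import re

# Reserved words interpreted as Boolean operators (English and Spanish).
BOOLEAN_WORDS = {"and", "or", "not", "y", "o", "no"}

# Hypothetical stopword list: the paper does not give the list BIWE uses.
STOPWORDS = {"de", "del", "la", "el", "en", "the", "of"}

# Either a quoted phrase (optionally prefixed by + or -), or a run of
# characters containing no whitespace, parentheses or quotation marks.
TOKEN_RE = re.compile(r'[+-]?"([^"]*)"|([^\s()"]+)')


def search_terms(search_string):
    """Return the lowercased content words of a query string."""
    terms = []
    for phrase, word in TOKEN_RE.findall(search_string):
        if phrase:
            candidates = phrase.split()        # words inside a quoted phrase
        else:
            candidates = [word.lstrip("+-")]   # drop a leading + or - operator
        for candidate in candidates:
            lowered = candidate.lower()        # assumption: case is ignored
            if lowered and lowered not in BOOLEAN_WORDS and lowered not in STOPWORDS:
                terms.append(lowered)
    return terms
```

For example, `search_terms('Madrid +museum')` yields the two words `madrid` and `museum`, with the + operator removed.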
A pull-down menu allows the user to choose the number of results per page, and a radio button selects how to group the results: by category, ungrouped but with category information, or simply ungrouped (the default).

We use the term search string to denote the exact query that the user has entered, and the term search term to denote each of the words of the search string once the Boolean operators, the reserved words and the stopwords have been removed. After a search string is entered and the search terms are processed, BIWE returns a screen with 10 URLs (or the number previously selected by the user), showing the title and the description of each URL. The URLs are ranked in order of relevance to the query. The user may click on any URL to explore the associated Web page.

2.2 The transaction log

The BIWE transaction log is divided into two parts. On the one hand, there is the transaction log proper, which stores a series of requests; on the other, there is a Web server log that stores the IP address of the client and the action performed.

There are three types of request: a query, a category exploration and a document exploration (reached by searching or by browsing). Different information is stored for each type. For a query, the following fields are stored: the search string (exactly as submitted), the search terms (once the stopwords, the reserved words and all the operators have been removed), a timestamp indicating when the query was submitted, and the remaining parameters of the search. For a category exploration, just two fields are stored: the category identifier (each category of the directory is identified by a unique number) and a timestamp indicating the moment of the request. For a document exploration, the fields are similar: the document identifier (each document of the directory is identified by a unique number) and a timestamp indicating when the URL was visited.

Likewise, the Web server log stores the IP addresses of all clients and all the actions they performed while connected. This log is used to obtain information about users’ sessions. A session is a series of requests made by a single user within a short period of time. With the information available in the transaction logs it is difficult to identify all user sessions exactly. Consequently, to obtain the session statistics we have assumed that each distinct IP address corresponds to a different user and that a session expires if the user performs no action for 30 minutes (Section 6 shows that this assumption is reasonable). Some of these assumptions are weak, but we consider them strong enough to give a good estimate, since other factors could also alter the number of sessions, such as a user trying to satisfy two or more information needs in one sitting.

The requests were collected over 16 days, from 3 May 2000 at 15:00 to 18 May 2000 at 7:00. During this period we obtained 324,503 requests: 105,786 queries, 61,050 visited categories and 157,667 viewed documents, in 57,259 different user sessions.

3 Analysis of queries
3.1 Analysis of search strings and search terms

To analyse the queries we performed two studies. The first computes statistics over the search string of each query. The second repeats the same study using the search terms of all the queries instead of the search strings.

In the first part of our analysis we worked with 105,786 search strings, 35,518 of which were different. The most repeated query is the empty query, which was submitted 2,173 times. On average each query is repeated 3 times, although the standard deviation is high (15.61). The first part of Table 1 tells us more about the statistical distribution of the search strings. As concluded in [2], a small set of queries is repeated many times: the most frequent 8.18% of the unique search strings account for half of all the queries submitted.

The first part of Table 1 shows the percentages of query duplication relative to two different bases: the total number of queries and the number of unique queries. We consider two queries to be duplicates if they have exactly the same words in the same order, ignoring capitalization (this also includes queries with several result screen requests). In [2], Silverstein et al. conclude that two-thirds of the queries are asked only once, a percentage computed relative to the unique queries only. Relative to the total, however, we have calculated that only about 20% of all queries are asked once, and most queries are searched many times, as the median indicates.

The second part of our analysis studies the search terms of all the queries. The queries comprise 173,128 terms in total, but only 23,707 unique words. Note that the total number of search terms is higher than the total number of search strings, whereas the number of unique terms is smaller than the number of unique search strings. We can therefore infer that the subjects users search for are quite similar, even though they use similar but not identical queries. For example, two users may search for Madrid +museum and +culture +Madrid: the search strings are different, but the subjects are very similar. The analysis of search terms can therefore give more realistic information about users’ behaviour.

Table 1: Search strings and search terms duplication statistics
Search strings              Total       Unique
  Query occurs 1 time       20.27 %     60.39 %
  Query occurs 2 times      11.25 %     16.75 %
  Query occurs 3 times       7.58 %      7.53 %
  Query occurs >3 times     60.9 %      15.33 %

Search terms                Total       Unique
  Term occurs 1 time         6.33 %     46.26 %
  Term occurs 2 times        4.22 %     15.39 %
  Term occurs 3 times        3.41 %      8.3 %
  Term occurs >3 times      80.45 %     30.05 %
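The two columns of Table 1 use different denominators, which is easy to misread. As a sketch (the function name and bucket labels are ours, not part of the original study), the following Python code computes both percentages from a list of queries or terms: the "Total" figure weights each distinct item by its number of occurrences, whereas the "Unique" figure counts each distinct item once.

```python
from collections import Counter


def duplication_table(items):
    """Return {bucket: (percent_of_total, percent_of_unique)}.

    Buckets are "1", "2", "3" and ">3", i.e. how many times a distinct
    item occurs in `items`.
    """
    counts = Counter(items)
    total = sum(counts.values())   # all occurrences
    unique = len(counts)           # distinct items
    buckets = {"1": [0, 0], "2": [0, 0], "3": [0, 0], ">3": [0, 0]}
    if total == 0:
        return {key: (0.0, 0.0) for key in buckets}
    for frequency in counts.values():
        key = str(frequency) if frequency <= 3 else ">3"
        buckets[key][0] += frequency   # occurrences contributed to the total
        buckets[key][1] += 1           # one more distinct item in this bucket
    return {
        key: (100.0 * occurrences / total, 100.0 * distinct / unique)
        for key, (occurrences, distinct) in buckets.items()
    }
```

To ignore capitalization, as in the duplicate definition used for search strings, the caller can lowercase the input first, for instance `duplication_table(q.lower() for q in queries)`.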
Figure 1: Search terms and search strings incremental percentage
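A figure such as "the most frequent 8.18% of unique search strings account for half of all queries" can be obtained by accumulating frequencies from the most frequent item downwards, which is also the quantity an incremental percentage curve plots. The sketch below shows one way to compute it; the function name and the default threshold are choices made for this example.

```python
from collections import Counter


def share_of_items_covering(items, fraction=0.5):
    """Percentage of distinct items, taken from the most frequent down,
    needed to cover at least `fraction` of all occurrences."""
    counts = Counter(items)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = 0
    # most_common() yields (item, frequency) pairs in decreasing frequency.
    for needed, (_, frequency) in enumerate(counts.most_common(), start=1):
        covered += frequency
        if covered >= fraction * total:
            return 100.0 * needed / len(counts)
    return 100.0
```

A small value means that a few very popular items dominate the log.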
In fact, the most repeated search term occurs more than twice as often as the same word does as a search string. Furthermore, on average each search term is repeated 7.3 times, although the standard deviation is quite high (33.31). Only 3.86% of the most frequent search terms account for half of the term occurrences in all the queries, an even smaller percentage than for search strings. Similarly, according to Table 1, search terms that occur only once make up only 6.33% of the total, versus 20% for search strings. This confirms our hypothesis that many queries ask about the same subject using somewhat different words, while the main search terms are common to many queries. Figure 1 shows the incremental percentage of search strings and search terms, where the difference between them is clear. We can conclude that most Web users have the same information needs. Nevertheless, we must not forget that a significant percentage of queries and subjects are searched only a few times, probably because they are very specific.

3.2 Analysis of individual queries

It is clear that searches on the Web tend to have far fewer search terms than searches in traditional information retrieval contexts [1]. In our research the average number of words per query (1.63) is slightly smaller than the values found by Jansen et al. and Silverstein et al., probably for linguistic reasons specific to Spanish.

Table 2 gives statistics on the operators used in queries. In general, logic operators are little used, and this is even more evident if we consider that 2.89% of the queries combine logic operators with parentheses (nearly all the queries that use parentheses). This underutilization also appears in [1] and [2], and suggests that Web users lack basic knowledge of Boolean logic.
The most used operator is the Spanish AND operator, Y, and in general the Spanish operators are used more than the English ones. This suggests that users searching in a given language tend to use the logic operators of that language, probably without realizing it. It is therefore essential that a search engine also interprets the logic operators of its own language. The next most used operators are the quotation mark operator and the + and - operators. This shows that users find non-logic operators easy to use: although they are less flexible than the logic ones, they seem better suited to the typical Web user query. Finally, only 0.36% of all the queries contain syntax errors or no significant words (e.g. because all the search terms are stopwords).

An important feature of Web directories is the ability to limit a search to a specific category and its descendants. In BIWE this option is available next to the search box, yet it is used in only 2.1% of all queries. This may be because users have to browse to a category before searching, and most of the time it is easier to check the accessible documents directly.

Other search parameters are available in BIWE, but users have to go to the detailed search in order to change their values. Table 3 shows the usage statistics, and the conclusion is clear: very few searches use these parameters, probably because they are “difficult” to reach (although it takes just one click on a link), so most queries simply use the default values.

Table 2: Logic operators usage statistics
Spanish AND operator: Y      3.539%
AND operator                 0.266%
Spanish NOT operator: NO     0.016%
NOT operator                 0.007%
Spanish OR operator: O       0.033%
OR operator                  0.018%
Parentheses ( )              2.94%
+ operator                   1.962%
- operator                   0.862%
" operator                   2.471%
Table 3: Search parameters in the detailed search
Default document grouping              99.768%
Document grouping. Type I               0.199%
Document grouping. Type II              0.033%

Default searched fields                99.72%
Default searched fields + URL           0.198%
Other searched fields                   0.082%

10 documents per screen (default)      87.679%
20 documents per screen                 0.049%
30 documents per screen                12.081%
40 documents per screen                 0.004%
50 documents per screen                 0.068%
The number of documents viewed per screen is also an important value in the analysis of the queries. In our case most users view 10 documents per screen, but the value 30 has a higher percentage than expected, which could be caused by a meta-search engine or other automatic search agents that set this parameter when they use our search engine. Studying these statistics in more depth, together with the information about the most used result intervals, we find that in 67.88% of the queries only the first screen was checked, and in 13.24% the second one was also checked (these values are quite similar to [1], and slightly smaller than in [2]). Consequently, 81.12% of users check only the first or second results screen and do not examine any further screens. This shows how important ranking algorithms are in Web search engines.

4 Analysis of categories
The next part of our analysis studies the categories viewed by users. A category can be accessed in two ways: directly, while the user is browsing the directory, or from the results of a search, where some categories are also listed and can be opened directly. We analysed 61,050 accesses to categories, covering 898 unique categories (BIWE has 911 categories in all). Each category was visited 67.98 times on average, with a quite high standard deviation of 192.82. Table 4 describes the distribution of the visited categories.
Table 4: Categories statistics
Categories viewed     All categories     Unique categories
1 time                 0.03 %             1.78 %
2 times                0.07 %             2.23 %
3 times                0.11 %             2.45 %
> 3 times             99.998 %           93.54 %

Figure 2: Percentage of categories visited versus categories depth
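A breakdown of category visits by depth, of the kind summarised in Figure 2, could be computed along the following lines. This is only a sketch under stated assumptions: it treats the directory as a tree in which every category has a single parent, it represents that tree as a hypothetical `parent` mapping from category identifier to parent identifier (`None` for the root), and it excludes the root from the percentages.

```python
from collections import Counter


def depth(category, parent):
    """Number of steps from `category` up to the root (the root has depth 0)."""
    steps = 0
    while parent[category] is not None:
        category = parent[category]
        steps += 1
    return steps


def visits_by_depth(visited, parent):
    """Percentage of non-root category visits at each depth.

    visited: iterable of category identifiers, one per logged access.
    parent:  hypothetical mapping category id -> parent id (None for root).
    """
    per_depth = Counter()
    for category in visited:
        d = depth(category, parent)
        if d > 0:                      # skip visits to the root category
            per_depth[d] += 1
    total = sum(per_depth.values())
    if total == 0:
        return {}
    return {d: 100.0 * n / total for d, n in sorted(per_depth.items())}
```

If the real taxonomy lets a category hang from several parents, a depth definition (for example the shortest path from the root) would have to be chosen first.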
This distribution is slightly different from that of the queries. Here, only 5.68% of the categories receive half of all the visits (a value quite similar to that of the queries). However, a smaller percentage of categories is viewed fewer than 3 times, which is significantly different from the queries. This implies that there is also a set of heavily viewed categories, but the set of rarely viewed categories is smaller. This is very important for a Web directory, because it means that its hierarchy has no useless components.

Figure 2 shows the percentage of categories visited in relation to their depth. For this part of the study we removed the root category, because all users enter BIWE through it. The figure makes clear that the most accessed categories are at a middle depth, probably because the categories at depth one are too general and the deepest categories are too specific. We can also conclude that most of these categories are accessed after a search rather than by browsing the directory, as might have been supposed: otherwise, the categories at depth one would also have a large number of accesses.

5 Analysis of documents
There are two ways of opening a document: from the results of a search, or by browsing a category and viewing its associated documents. During our analysis 157,667 documents were accessed, but only 30,341 were unique, with an average of 5.2 visits per document and a standard deviation of 16.12.

The distribution of the documents viewed is quite similar to the distribution of the queries, especially that of the search terms. Here, the most viewed 5.75% of documents account for half of the document accesses through BIWE. And if we consider the documents inspected only one, two or three times, the resemblance to the search terms is very strong. In both cases a small number of items collects a high percentage of the accesses, and a significant percentage of items is accessed very few times.

At this point we must emphasize the importance of search terms versus search strings. As mentioned in Section 3.1, most users search for the same subjects; they therefore obtain the same result documents and will probably also open the same documents (remember that most users view only the first or second result screen). This is why both distributions are so similar.

6 Analysis of sessions
We studied 57,259 sessions, with an average duration of nine and a half minutes. Unfortunately, neither Jansen et al. nor Silverstein et al. report information about session duration in their papers that could be compared with ours. In any case, this figure supports the assumption that a user’s session will not last more than 30 minutes. The average number of queries per session is 1.75 (slightly smaller than in [1] and [2]).
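The session rule from Section 2.2 (one user per IP address, and a session that expires after 30 minutes without activity) can be sketched in Python as follows. The input format, a sequence of (IP address, timestamp) pairs, and the function names are assumptions made for this example; the layout of the actual Web server log is not described in this paper.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(requests):
    """Group requests into sessions.

    requests: iterable of (ip, timestamp) pairs, in any order.
    Returns {ip: [session, ...]}, each session being a list of timestamps
    in increasing order.
    """
    sessions_by_ip = {}
    for ip, timestamp in sorted(requests):
        sessions = sessions_by_ip.setdefault(ip, [])
        # Same session only if the gap to the previous request of this IP
        # is shorter than the timeout; otherwise a new session starts.
        if sessions and timestamp - sessions[-1][-1] < SESSION_TIMEOUT:
            sessions[-1].append(timestamp)
        else:
            sessions.append([timestamp])
    return sessions_by_ip


def count_sessions(sessions_by_ip):
    """Total number of sessions over all IP addresses."""
    return sum(len(sessions) for sessions in sessions_by_ip.values())
```

Under this rule, users who share one IP address (for instance behind a proxy) are merged into a single user, which is one of the weaknesses of the assumption acknowledged in Section 2.2.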
Thus, a user comes to the directory looking for specific information and performs a query. If he considers the results to be of poor quality, he will probably modify the query to try to retrieve better documents. Obviously, some users modify the query many times, but this is not the common case.

During a session a user checks 1.3 categories on average. To interpret this value, we have to bear in mind that all users enter BIWE through the main category, so every user checks at least this category. Apart from it, very few categories are accessed. This, together with the fact that the most accessed categories are at a middle depth, shows that most categories are reached directly from the results of a search, without browsing the directory.

But the most interesting value is that only 3 documents are inspected during a user’s session. Considering also that most users check only one or two result screens, it is clear that users operate very selectively, opening only the documents whose title and description best suit their requirements. Consequently, the fact that a user views many result screens does not imply that he will open many documents.

Silverstein et al. deduce that sessions are often simple. Studying different parameters, we reach the same conclusion: a user connects to BIWE to satisfy an information need, performs a couple of queries, opens some documents to find the answer and then disconnects. Of course, we hope that the user disconnects after the answer has been found. Many social aspects seem to be involved in this pattern (such as connecting to the Internet while working), but their study is beyond the scope of this research.

7 Industrial benefits and conclusions
All the analyses carried out help to build a profile of a typical Web user, which is fundamental for e-Business nowadays. We confirmed that Web users seem to differ significantly from users of traditional Information Retrieval systems. Web users are not comfortable with Boolean operators and, when they do use them, they tend to use those of their native language (probably without realizing it). Nor do they change the search parameters. Furthermore, most users check only the first two result screens, and they do not usually browse the categories but access them directly after a search.

In our query analysis we observed that most users search for the same subjects, although they may use slightly different queries. Consequently, the documents retrieved by users are also the same. Nevertheless, a significant percentage of queries is asked very few times, and so a significant number of documents is viewed only a few times.

All these aspects, together with the fact that an average user session lasts no more than 10 minutes, indicate that Web users use Web directories or search engines simply to obtain some piece of information as soon as possible, without spending time browsing categories or exploring search options. This behaviour of Internet users is creating a new way of doing business, in which we have to attract the user’s attention in some way (e.g. using customisable banners) during the few minutes he remains connected, and also offer a high-quality service (in our case, a search service) in order to gain the user’s loyalty, because nowadays the most valuable asset of an Internet enterprise is its users.

8 References
[1] B. Jansen, A. Spink, J. Bateman, T. Saracevic. “Real Life Information Retrieval: A Study Of User Queries On The Web”. SIGIR FORUM, Spring 98.
[2] C. Silverstein, M. Henzinger, H. Marais, M. Moricz. “Analysis of a Very Large Web Search Engine Query Log”. SIGIR FORUM, Fall 99.
[3] S. Kirsch. “Infoseek’s experiences searching the Internet”. SIGIR FORUM, Fall 98.
[4] R. Baeza-Yates, B. Ribeiro-Neto. “Modern Information Retrieval”. Ed. Addison Wesley, ISBN 0-201-39829-X.