Information Processing and Management 41 (2005) 1193–1205 www.elsevier.com/locate/infoproman

Measuring search engine bias

Abbe Mowshowitz *, Akira Kawaguchi 1

Department of Computer Science, The City College of New York, Convent Avenue at 138th Street, New York, NY 10031, USA

Received 25 October 2003; accepted 14 May 2004; available online 2 July 2004

Abstract

This paper examines a real-time measure of bias in Web search engines. The measure captures the degree to which the distribution of URLs, retrieved in response to a query, deviates from an ideal or fair distribution for that query. This ideal is approximated by the distribution produced by a collection of search engines. Differences between bias and classical retrieval measures are highlighted by examining the possibilities for bias in four extreme cases of recall and precision. The results of experiments examining the influence on bias measurement of subject domains, search engines, and search terms are presented. Three general conclusions are drawn: (1) the performance of search engines can be distinguished with the aid of the bias measure; (2) bias values depend on the subject matter under consideration; (3) choice of search terms does not account for much of the variance in bias values. These conclusions underscore the need to develop "bias profiles" for search engines.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Bias; Search engines; Retrieval performance; Statistical analysis of bias measure; Search engine profiles

1. Introduction

This paper reports on an investigation of a measure of bias introduced by Mowshowitz and Kawaguchi (1999, 2002a, 2002b) and Kawaguchi and Mowshowitz (2001). Bias is compared with the classical measures of retrieval performance in an effort to show what the bias measure can and cannot do. Several experiments designed to test the measure and to demonstrate its utility are discussed and analyzed. The experiments aim to resolve questions about performance differences between search engines, and the possible influence of subject areas and keywords on the measure. The findings suggest that (1) there are significant differences in the performance of search engines, (2) the measure is sensitive to the subject domain being searched, and (3) the search terms chosen in a given subject domain have little influence on bias.

* Corresponding author. Tel.: +1 212 650 6161. E-mail addresses: [email protected] (A. Mowshowitz), [email protected] (A. Kawaguchi).
1 Tel.: +1 212 650 6015.
0306-4573/$ - see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2004.05.005


The bias measure is designed to capture the degree to which the distribution of URLs, retrieved by a search engine in response to a query, deviates from an ideal or fair distribution for that query. This ideal is approximated by the distribution produced by a collection of search engines. Like traditional measures of retrieval performance (i.e., recall and precision), bias is a function of a system's response to a query, but it does not depend on a determination of relevance (Meadow, 1973; Saracevic, 1975; Wishard, 1998). Instead, the ideal distribution of items in a response set must be determined. Using the output of a collection of search engines to approximate the ideal makes the measurement of bias computationally feasible in real time.

Bias is a relative concept. A search engine is being weighed against its peers, not against an absolute norm derived from features of the universe. It might be desirable to adopt the latter approach, but it is simply not feasible given the enormous size of the World Wide Web. Until there is a method of ascertaining, in real time, the distributions of items that might be retrieved in response to any given query, other approaches to defining an "ideal" distribution, such as the one taken here, are needed.

Whether bias in a search engine is intentional or not (see Mowshowitz and Kawaguchi (2002a) for a discussion of sources of bias), it is important for designers to know a system's bias profile. If intentional, designers need to know how effective the engine is in realizing a particular bias (e.g., prominently listing items dealing with one particular subtopic or viewpoint in response to a query concerning a research topic). If unintentional and the designer wishes to minimize bias, measurement is likewise essential to gauging performance. Users would benefit by knowing whether or not the URLs retrieved on a given topic are representative of that subject area. This is very different from the question of relevance. Consider a search on the topic "euthanasia". The Websites retrieved by a particular search engine may all be judged relevant to the topic "euthanasia" by a user, but the selection may be biased in the sense that the Websites retrieved are uniformly pro-euthanasia.

A cluster analysis of URLs retrieved in a search can be used to clarify the bias in results. This is one way of showing the structure of a response set of URLs obtained by one or more search engines for a given query (Flake, Lawrence, Giles, & Coetzee, 2002). A subset of URLs on the Web (and the Web itself (Kleinberg, Kumar, Raghavan, Rajagopalan, & Tomkins, 1999)) can be interpreted as a directed graph whose nodes represent the URLs, and in which there is a directed edge from node x to y if the URL corresponding to x has a hyperlink pointing to the URL corresponding to y. In the case of a subset of URLs on the Web, hyperlinks to URLs that are not in the subset are ignored in the current analysis. With this interpretation, clusters of URLs can be determined quite naturally as either weakly or strongly connected components of the directed graph (Harary, 1969). Fig. 1 shows the results of such an analysis of the results of a search on the keyword "euthanasia" conducted in late 1999.
The example is meant to underscore the importance of bias, but it should be noted that clustering requires interpretation of the contents of Websites, which bias measurement itself does not do. Although some of the search engines used and some of the URLs obtained are now defunct, the analysis illustrates a general method for interpreting bias values. Ten URLs retrieved by each of five search engines (AltaVista, Excite, Google, NorthernLight and Yahoo) were included in the analysis. Thirty-nine of these 50 URLs are distinct. As shown in the figure, the directed graph corresponding to the response set consists of two non-trivial, weakly connected components. Not shown in the figure are 28 isolated nodes. The division into two clusters corresponds to the split between groups that respectively accept (e.g., soros.org/debate/Euthan.htm, bitsnet.com/choicebyright/euthnews.htm) and reject (e.g., euthanasia.com/, iaetf.org/, priestsforlife.org/euthanasia/euthanasia.html) the practice of euthanasia. The substantially larger cluster represents factions opposed to euthanasia, while the smaller one represents groups that accept the practice.


Fig. 1. "Euthanasia" clusters.

This particular example made use of the collective results of several search engines, but the same analysis could be performed on the results of a single engine with a view to clarifying a bias value computed for that engine. Cluster analysis might reveal even tighter connections between URLs. A search producing a large number of nodes might exhibit non-trivial, strongly connected components. The example shown here, with only 39 nodes, did not reveal any such strong components.

Notice of possible bias may be especially important for the naïve user who might be able to judge relevance, but would not be in a position to determine whether or not the retrieved Websites contain a disproportionate amount of material favoring one point of view over others. It is of course possible that a user would want a search engine to turn up only those Websites espousing a particular viewpoint, in which case a high bias value suggesting a weighted selection might be desired. Since there could be different interpretations of the views expressed in a Website, other users in this situation might be better served by the results of a search exhibiting a low bias value.
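The cluster analysis described above reduces to building a directed graph over the retrieved URLs and extracting its non-trivial weakly (or strongly) connected components. A minimal sketch in Python using the networkx library is given below; the URLs and hyperlinks are hypothetical placeholders, not data from the 1999 experiment.

```python
# Sketch of the cluster analysis described above: retrieved URLs are nodes, and
# a directed edge (x, y) is added when page x links to page y within the
# response set (links pointing outside the set are ignored, as in the paper).
# Clusters are the non-trivial weakly (or strongly) connected components.
# All URLs and links below are illustrative placeholders.
import networkx as nx

retrieved_urls = [
    "pro-example.org/debate", "pro-example.org/news",
    "con-example.org/", "con-example.org/facts",
    "isolated-example.org/",
]
links = [
    ("pro-example.org/debate", "pro-example.org/news"),
    ("con-example.org/", "con-example.org/facts"),
    ("con-example.org/facts", "con-example.org/"),
]

g = nx.DiGraph()
g.add_nodes_from(retrieved_urls)
g.add_edges_from(links)

weak = [c for c in nx.weakly_connected_components(g) if len(c) > 1]
strong = [c for c in nx.strongly_connected_components(g) if len(c) > 1]
print("non-trivial weak components:", weak)
print("non-trivial strong components:", strong)
```

In the euthanasia example, the two non-trivial weak components would correspond to the pro- and anti-euthanasia clusters, with the remaining URLs appearing as isolated nodes.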

2. Bias, recall and precision

The definition of bias used in this research is described briefly below. For more discussion of the definition and its justification see Mowshowitz and Kawaguchi (2002a); for a detailed example of a bias computation see Mowshowitz and Kawaguchi (2002b).

The bias of search engine E is defined as one minus the similarity between a vector representing the set of responses generated by E and a vector representing the response set of a collection C of search engines used to approximate an "ideal" set of responses. In both cases the responses are elicited by a set of queries related to a given subject.

More precisely, suppose $t$ queries $q_i$ $(1 \le i \le t)$ are to be processed by a given search engine. Let $R_{i,j}$ denote the response sequence of URLs generated for query $q_i$ $(1 \le i \le t)$ by search engine $E_j$ $(1 \le j \le n)$ in the collection C, and let $S_{i,j}$ be the set of URLs included in $R_{i,j}$. Then

$R_{i,j} = (u_1, u_2, \ldots, u_{l_{i,j}})$ and $S_{i,j} = \{u_1, u_2, \ldots, u_{l_{i,j}}\}$,

where $u_k$ is the $k$th URL in the list retrieved by search engine $E_j$ in processing query $q_i$, and $l_{i,j}$ is the size of the list. For simplicity the subscript $l_{i,j}$ is taken to be a fixed value $m$ for all $i$ and $j$.

Now suppose the union of the $tn$ sets $S_{i,j}$ is the set $A = \{a_1, a_2, \ldots, a_K\}$, and let $R_{(i-1)n+j} = R_{i,j}$, i.e., the response sets listed in row-major form. The number of times each URL occurs among the $R_k$ $(1 \le k \le nt)$ is tabulated by computing a $tn \times K$ matrix whose $(k, l)$th element is 1 if $S_k$ contains $a_l$ and 0 otherwise. The sum $P_l$ of the $l$th column is the number of times URL $a_l$ occurs among the response sets $R_1, \ldots, R_{tn}$.

1196

A. Mowshowitz, A. Kawaguchi / Information Processing and Management 41 (2005) 1193–1205

Without loss of generality, the URLs of A are given in non-increasing order of frequency. The vector $X = (P_1, P_2, \ldots, P_K) = (X_l)$ is called the response vector for the collection of search engines.

For simplicity in describing the computation, E is assumed to be in C so that the response set of E is a subset of the response set of C. The response vector for a particular search engine E is determined as follows. First the union of response sets is computed. In this case there are $t$ sets $R_i$ making up the union, one for each set of responses generated by E for query $i$ $(1 \le i \le t)$. Suppose $B = \{b_1, b_2, \ldots, b_N\}$ is the union of the $R_i$. Now form the membership matrix $T$ whose $(i, j)$th element is 1 if $b_j$ belongs to $R_i$ and 0 otherwise. Let $p_l$ be the sum of the elements in column $l$ $(1 \le l \le N)$. The response vector for E is given by $x = (p_1, p_2, \ldots, p_N) = (x_l)$. The components of $x$ are ordered so as to correspond to those of $X$. For simplicity of presentation, we will use the notation $X = (X_l)$ and $x = (x_l)$ to represent the response vectors and assume that each is appropriately ordered with the requisite number $K$ of components.

The similarity $s(v, w)$ of vectors $v = (v_1, \ldots, v_n)$ and $w = (w_1, \ldots, w_n)$ is computed using a measure that is well known in information retrieval research, namely

$s(v, w) = \dfrac{\sum v_i w_i}{\left\{\sum v_i^2 \sum w_i^2\right\}^{1/2}}$,

where all the summations run from $i = 1$ to $i = n$. The bias of E with respect to the collection of engines C for queries $q_1, \ldots, q_t$ is given by

$b(E; q_1, \ldots, q_t; E_1, \ldots, E_n) = 1 - s(x, X)$.

Characterizing performance has become an important issue for researchers as well as search engine designers. Much of the research has focused on statistical analysis of Web coverage by search engines (Gordon & Pathak, 1999; Lawrence & Giles, 1998, 1999). Typically, search engines are compared according to the percentage of the indexable URLs on the Web that they actually cover (Schwartz, 1998; Xie, Wang, & Goh, 1998). The research reported here aims to complement these studies by establishing procedures for assessing bias as a characteristic of search engine performance. In particular, measurement of bias is intended to complement the measurement of recall and precision, i.e., to help establish performance benchmarks for search engines.

The reference distribution (or norm), with which the performance of a particular search engine is to be compared, depends on the search engines in the collection used to define the norm. The selection of search engines comprising the norm should take account of changes in the search engine industry (Sullivan, 2003a). New companies may appear on the scene, and existing ones may disappear or be absorbed by other companies, in part as a result of changes in business strategies. Interdependencies among commercial search engines, and the lack of detailed information about the indexing procedures and retrieval algorithms they use, should also be taken into account in defining the norm. Sullivan (2003c) has compiled a table showing that some search engines are "powered" by other engines. Google, for example, powers AOL, Yahoo and Netscape as well as Google itself. The meaning of the verb "to power" is not entirely clear, but it appears that there is considerable overlap in the sets of URLs retrieved by different engines powered by the same one. However, a good norm for bias measurement is one that approximates the universe of search responses seen by users. This suggests choosing the most popular engines.
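The computation just defined is straightforward to implement. The following sketch, assuming each engine's response to each query is available as a list of URLs, tabulates the frequency vector X for the collection, aligns the vector x of one engine with it, and returns one minus their cosine similarity. It treats responses as sets (emphasis rather than prominence) and is an illustration, not the authors' implementation.

```python
# Sketch of the bias computation defined above, treating responses as sets.
# responses[j][i] is the list of URLs returned by engine j for query i.
# All names and data below are illustrative.
from collections import Counter
from math import sqrt

def collection_vector(responses):
    """Frequency P_l of each URL over all response sets of all engines."""
    counts = Counter()
    for engine_responses in responses:
        for urls in engine_responses:
            counts.update(set(urls))          # each set S_{i,j} counted once
    # URLs in non-increasing order of frequency, as in the definition of X
    ordered = sorted(counts, key=counts.get, reverse=True)
    return ordered, [counts[u] for u in ordered]

def engine_vector(engine_responses, ordered_urls):
    """Membership counts p_l for one engine, aligned with the collection order."""
    counts = Counter()
    for urls in engine_responses:
        counts.update(set(urls))
    return [counts.get(u, 0) for u in ordered_urls]

def cosine(v, w):
    num = sum(a * b for a, b in zip(v, w))
    den = sqrt(sum(a * a for a in v) * sum(b * b for b in w))
    return num / den if den else 0.0

def bias(engine_responses, responses):
    ordered, big_x = collection_vector(responses)
    x = engine_vector(engine_responses, ordered)
    return 1.0 - cosine(x, big_x)

# Toy example: two queries, three engines; engine 0 agrees closely with the norm.
responses = [
    [["a", "b", "c"], ["d", "e"]],
    [["a", "b", "c"], ["d", "e"]],
    [["a", "z"],      ["y", "x"]],
]
print(round(bias(responses[0], responses), 3))   # low bias (about 0.07)
print(round(bias(responses[2], responses), 3))   # higher bias (about 0.43)
```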
To approximate the distribution likely to be seen by a typical user, a sizable sample of commercially available search engines is used in the current study to define the norm for purposes of investigating properties of the bias measure.

Bias can be interpreted in very different ways. On the positive side, the skewing of results may mean that an engine picks up interesting items not found by the others; on the negative side, it may be that the engine simply fails to find the most interesting items retrieved by the majority. Bias values, like those computed for most performance measures, simply indicate the level of bias in the system; they cannot pinpoint a particular source of bias. Further analysis, such as that undertaken in determining the significance of recall and precision, is required to account for the computed values.


Bias captures an aspect of a retrieval system that is not covered by the classical measures (Becker & Hayes, 1963; Salton, 1968; Salton & McGill, 1983) of recall and precision. The differences between bias and the classical measures stand out sharply in the extreme cases for recall and precision values. Two aspects of bias measurement must be considered, namely, emphasis and prominence. The former can be analyzed by treating the responses to a query as a set; the latter by interpreting the responses as a list of items in which order of presentation matters. There are four cases to examine.

Case 1. High recall, low precision. This case obtains when most of the relevant items in the given database have been retrieved but these items are overwhelmed by the inclusion of irrelevant ones. Taken as a set of items, the response could exhibit high or low bias depending on the norm. If most of the other engines making up the norm have lower recall values for the given query, bias would be high. If recall is also high for most of the other engines, bias would be low. Taking order of presentation into account, the bias value would be low if most of the relevant items are at the top of the list; it would be high if the relevant ones were closer to the bottom of the list.

Case 2. High precision, low recall. This could occur when relatively few of the relevant items are retrieved from the database, but even fewer irrelevant ones are retrieved in response to the given query. Once again bias could be high or low when the responses are treated as a set. If most other engines retrieve a different subset of relevant items, bias could be high; it would be low if most of the engines retrieved the same subset of relevant items as the engine being measured. When order of presentation is considered, further variations in bias values become possible. That is to say, even if the engines defining the norm agree on the set of items, these items may be presented in different orders by the different engines.

Case 3. Low recall, low precision. When both recall and precision are low, relatively few relevant items have been retrieved from the database and of those retrieved most are irrelevant. Under these conditions bias could be high or low but is most likely to be in the mid-range and roughly the same for all the engines making up the norm. Low recall and precision may indicate a poorly constructed query, and differences between any two engines' deviation from the norm (with or without taking account of presentation order) are likely to be small.

Case 4. High recall, high precision. If almost all the relevant items in the database are retrieved and very few irrelevant ones are included among the responses to a query, bias (treating the responses as a set) would be high unless most of the engines in the norm also score high on recall and precision. Taking account of presentation order complicates the picture. Even if most of the engines achieve high recall and precision, they may order the results differently, in which case the bias of a particular engine could be high or low. Bias of engine E would be high if most of the other engines order their results in the same way but differently from E's results; bias is likely to be low if no one presentation order is dominant.
The foregoing analysis shows how bias differs from recall and precision under various conditions. Even in the case of high recall and high precision, the bias value is illuminating. A high bias score for an engine that does well on the classical measures may indicate superior performance in retrieving useful items or it may reveal an idiosyncratic ordering of results relative to other engines. These two possible outcomes can be resolved by comparing the results produced by the different engines. Consistently high bias for different queries, coupled with high recall and precision that is not caused by idiosyncratic presentation, gives evidence of superior search engine performance.
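To make the contrast concrete: recall and precision require relevance judgments for the retrieved items, whereas the bias computation sketched earlier needs only the response sets themselves. A minimal illustration of the classical measures, with hypothetical relevance data:

```python
# Recall and precision need relevance judgments; bias (previous sketch) does not.
# 'relevant' is the set of items judged relevant in the database (hypothetical).
def recall_precision(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Case 1 flavour: most relevant items found, but swamped by irrelevant ones.
print(recall_precision(["r1", "r2", "r3", "x1", "x2", "x3", "x4", "x5"],
                       ["r1", "r2", "r3", "r4"]))   # (0.75, 0.375)
```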

3. Experiments on the bias measure

To facilitate empirical investigation, the authors have developed a specialized system that acts as a metasearch engine (Glover, Lawrence, Birmingham, & Giles, 1999; Liu, 1998) capable of automatically computing bias in search results.


The system, together with explanatory details, is accessible at http://wikiwiki.engr.ccny.cuny.edu/IntelSearch. A brief description of the system is also given in Mowshowitz and Kawaguchi (2002a). At the time of writing this paper, the system could invoke 22 commercially available search engines, namely, About, Ah-ha, AltaVista, AOLSearch, FastSearch, FindWhat, Galaxy, Google, InfoTiger, Jayde, LookSmart, Lycos, Msn, Netscape, OpenDirectory, Overture, Sprinks, Teoma, TrueSearch, WiseNut, Xuppa, and Yahoo. Almost all of these are among the listings of major search engines in Sullivan (2003b) and Search Engines.com (2003).

The results of earlier experiments have been reported in Mowshowitz and Kawaguchi (2002b). These experiments examined the influence on bias measurement of the following three variables: (i) subject domain; (ii) search engines; (iii) search terms employed to represent a given subject domain. The subject area "computer science", represented by the classification scheme adopted by the Communications of the ACM for the computing literature, was used in those experiments. The research reported here is meant to check the validity and generalizability of these results in other subject domains. The same variables are investigated further here.

For consistency and simplicity, since choice of search terms is known to influence search results (Spink, Jansen, Wolfram, & Saracevic, 2002), domains and search terms have been chosen from tree-structured classification schemes. Two such schemes are used in the current experiments, namely, "Outline of the Law" (West Publishing, 2002) and the Library of Congress classification system (http://www.loc.gov). The experiments reported here use the same procedure as in Mowshowitz and Kawaguchi (2002b). Reliance on tree-structured classification schemes simplifies the identification of relatively disjoint subject domains. Two domains (defined by terms in the classification) can be viewed as independent if their respective terms do not lie on the same path to the root of the classification tree; a term covers the nodes (subject areas) in the maximal subtree of which it is the root (see the sketch below). For purposes of experimentation, at least three different domains should be chosen within each classification system. The choice of domains is dictated by independence and coverage (say, at least half of the subject areas in the classification system).

Sixteen popular commercial search engines were selected to compute bias values and collectively to define a norm in the current experiments. These search engines (i.e., About, Ah-ha, AltaVista, AOLSearch, FastSearch, Google, LookSmart, Lycos, Msn, Netscape, Overture, Sprinks, Teoma, TrueSearch, WiseNut, and Yahoo) were chosen because they are either well known or heavily used (Sullivan, 2003b). Popular commercial search engines are more likely to be well maintained and upgraded when necessary, and to keep pace with the growing Web. In each search session (i.e., the processing of a set of search terms), the top 30 URLs returned by the 16 search engines for each of the terms were used to compute the bias values. Thus the norm for each bias calculation was based on 480 (not necessarily distinct) URLs.

"Outline of the Law" experiments. The "Outline of the Law" classification scheme distinguishes seven main categories of law, namely, Persons, Property, Contracts, Torts, Crimes, Remedies, and Government. These main categories are further subdivided.
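The independence check mentioned above (two terms define independent domains if neither lies on the other's path to the root of the classification tree) can be made mechanical. A small sketch follows, assuming the classification tree is given as a child-to-parent map; the category names are simplified placeholders, not the actual West key numbers.

```python
# Two classification terms define independent subject domains when neither lies
# on the path from the other to the root, i.e. neither covers the other's subtree.
# 'parent' is a hypothetical child -> parent map for a classification tree.
parent = {
    "adoption": "personal relations", "personal relations": "Persons",
    "bonds": "particular agreements", "particular agreements": "Contracts",
    "Persons": "Law", "Contracts": "Law",
}

def path_to_root(term):
    path = [term]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def independent(term_a, term_b):
    return term_a not in path_to_root(term_b) and term_b not in path_to_root(term_a)

print(independent("adoption", "bonds"))     # True: different branches of the tree
print(independent("adoption", "Persons"))   # False: same path to the root
```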
For example, the Persons category has five subdivisions, containing 86 terms in all. Five distinct categories have been chosen from "Outline of the Law" and five terms randomly selected from each of the five categories. These are as follows:

• Persons category: personal relations subdivision (adoption, husband and wife, labor relation, parent and child, master and servant).
• Contracts category: particular classes of agreements subdivision (bailment, bonds, guaranty, joint adventures, pensions).


• Crimes category (adultery, kidnapping, perjury, robbery, suicide).
• Remedies category: means and method of proof subdivision (acknowledgment, affidavit, oath, witness, evidence).
• Government category: judicial powers and functions subdivision (security regulation, taxation, federal courts, judges, social security).

In the first experiment (examining bias across subject domains), 25 bias values were computed for each search engine, one value for each of five search terms per subject domain. Thus, sixteen 5 × 5 tables were constructed whose rows correspond to search terms and whose columns correspond to subject domains. Table 1 shows the results of a one-way analysis of variance (one-way ANOVA) of bias values across the five subject areas. Two-way analysis of variance (two-way ANOVA) is unwarranted in this case since each set of five keywords corresponds to a given subject area, i.e., the keywords are subject-specific. The software system used for the statistical analysis in all the experiments reported here is Minitab 13 (Minitab Inc.).

The p-values (probability values) in the table measure the credibility of the null hypothesis, i.e., they indicate whether the sample could have been drawn from the population being tested given the assumption that the null hypothesis is true (Wonnacott & Wonnacott, 1984). The null hypothesis is rejected if the computed p-value is less than or equal to the widely accepted error threshold of 5% (0.05). The table shows that, except for AltaVista, Fast, Google, Lycos, and Yahoo, the p-values are smaller than 0.05. With 0.05 as the rejection threshold, this means that the null hypothesis is rejected for all but these five search engines. Thus the computed bias values for 11 of the 16 search engines, namely, About, Ah-ha, AOL, LookSmart, MSN, Netscape, Sprinks, Overture, Teoma, TrueSearch, and WiseNut, exhibit sensitivity to the subject area being searched.

Table 2 shows the results of tests on the two other variables mentioned above, i.e., "search engine" and "keyword selection". Both of these tests used one-way analysis of variance. In the "search engine" test, exactly one bias value was computed for each keyword. Similarly, in the "keyword selection" test exactly one value was computed for each search engine. The "search engine" test was designed to determine whether or not the bias measure discriminates between search engines. Eighty bias values were computed for each subject domain, one value for each of five search terms per search engine. Thus, five 5 × 16 tables were constructed whose rows correspond to search terms and whose columns correspond to search engines. The p-values in the "search engine" row, all of which are 0.000, show that the null hypothesis must be rejected for each of the subject areas shown. This means that the differences in bias values between search engines are statistically significant in each of the subject areas examined.

Table 1
Analysis of variance across subjects

Engine        p-value
About         0.001
Ah-ha         0.025
AltaVista     0.128
AOL           0.001
Fast          0.082
Google        0.068
LookSmart     0.000
Lycos         0.120
MSN           0.00
Netscape      0.012
Sprinks       0.038
Overture      0.013
Teoma         0.004
TrueSearch    0.020
WiseNut       0.030
Yahoo         0.166

Table 2
Analysis of variance across search engines and keyword sets

Legal subject                Persons  Crimes  Contract  Remedies  Government
Search engine p-value        0.000    0.000   0.000     0.000     0.000
Keyword selection p-value    0.831    0.996   0.555     0.786     0.973
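The one-way tests summarized in Tables 1 and 2 were run in Minitab; a rough equivalent can be obtained with scipy's one-way ANOVA. The bias values below are hypothetical stand-ins for one engine's 5 × 5 table of values, not the experimental data.

```python
# One-way ANOVA of bias values across subject areas for a single engine,
# analogous to the tests behind Table 1. The bias values are hypothetical.
from scipy.stats import f_oneway

bias_by_subject = {
    "Persons":    [0.62, 0.58, 0.61, 0.64, 0.60],
    "Contracts":  [0.71, 0.69, 0.73, 0.70, 0.72],
    "Crimes":     [0.55, 0.57, 0.54, 0.56, 0.58],
    "Remedies":   [0.66, 0.65, 0.68, 0.64, 0.67],
    "Government": [0.74, 0.76, 0.73, 0.75, 0.77],
}

stat, p = f_oneway(*bias_by_subject.values())
print(f"F = {stat:.2f}, p = {p:.4f}")
# p <= 0.05 would reject the null hypothesis that mean bias is the same across
# subjects, i.e. this engine's bias is sensitive to the subject area searched.
```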


Table 3
Two-way ANOVA (legal classification): bias vs. engine, subject

Source        DF    SS         MS        F        P
Engine         15   14.90857   0.99390   412.17   0.000
Subject         4    0.15544   0.03886    16.12   0.000
Interaction    60    0.52533   0.00876     3.63   0.000
Error         320    0.77164   0.00241
Total         399   16.36098
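Table 3's two-way ANOVA (engine and subject as factors, with interaction), discussed in the next paragraph, could be reproduced along the following lines with statsmodels; the data frame below is a hypothetical stand-in for the 16-engine × 5-subject × 5-keyword design, not the experimental data.

```python
# Two-way ANOVA of bias against engine and subject (with interaction),
# analogous to Table 3. The data are synthetic placeholders.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
engines = [f"E{j}" for j in range(16)]
subjects = ["Persons", "Contracts", "Crimes", "Remedies", "Government"]
rows = [
    {"engine": e, "subject": s,
     "bias": rng.normal(0.5 + 0.02 * j + 0.01 * k, 0.03)}
    for j, e in enumerate(engines)
    for k, s in enumerate(subjects)
    for _ in range(5)          # five keywords per subject
]
df = pd.DataFrame(rows)

model = ols("bias ~ C(engine) * C(subject)", data=df).fit()
# F and p-values for the engine, subject, and interaction terms
print(sm.stats.anova_lm(model, typ=2))
```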

A two-way analysis of variance that includes as factors both engines and subjects was also applied to obtain stronger evidence against the null hypothesis (the Minitab result is shown in Table 3). This test was performed on a 5 × 16 design whose rows correspond to subjects and whose columns correspond to search engines. The cell corresponding to subject S and engine E in this design has five bias values, computed by engine E for each of the five keywords associated with subject S. The p-values for the engine and subject factors are 0.000, which clearly indicates that the bias values computed for the search engines are statistically different and that these differences also stem from the choice of subject areas.

The "keyword selection" test was designed to ascertain whether or not the bias measure discriminates between search terms associated with a given subject domain. Eighty bias values were computed for each subject domain, one value for each of 16 search engines per subject domain. Thus, five 16 × 5 tables were constructed whose rows correspond to search engines and whose columns correspond to search terms. The p-values in the "keyword selection" row of Table 2 are all greater than 0.05, indicating that the null hypothesis is not rejected for any of the five. This means that for all of the five subject areas selected for this experiment from "Outline of the Law", the distributions of the bias values computed for the 16 search engines exhibit no statistically significant differences, even though the search terms used to represent a given subject are different.

Rating search engines based on their "bias performance" may be especially useful for consumers. A critical question to answer is which search engines tend to perform with relatively high or low bias. Fig. 2 (with the x-axis representing collections of search terms and the y-axis representing bias values between 0 and 1) illustrates the separation between search engines on bias values computed for the keywords in the legal subject areas Crimes and Government. The successive bias values plotted in the figure for each search engine were computed using a growing set of search terms. This is indicated by the label "number of search terms" on the x-axis. That is to say, the leftmost value is the bias of a search using the first search term, the second value is the bias computed for the first and second terms together, etc. One can see from inspection of the graphs that the bias measure does discriminate between search engines. For instance, Ah-ha's results, plotted as line segments in both graphs, are consistently placed higher than those of Google. Statistical analysis confirms this informal observation: for each of the Crimes and Government experiments, the differences between Ah-ha and Google are statistically significant. Furthermore, Fig. 3 shows the bias values computed using all 25 terms applied to the search. Comparing the confidence intervals indicates that, limited to these term sets, the engines AOLSearch, FastSearch, Google, Netscape, and Yahoo all exhibit a low bias profile, whereas Ah-ha, Msn, Sprinks, Teoma, and TrueSearch have a high bias profile that is independent of the legal category.

LC Subject "Philosophy" experiments. The second set of experiments analyzing the bias measure makes use of the "Library of Congress Subject Headings in Philosophy" (http://www.loc.gov). Subdomains and keywords for searches in the subject area "philosophy" are selected from the headings used in the Library of Congress classification.



Fig. 2. Bias values computed for the keywords in the legal subject areas Crimes and Government (panels (a) and (b); x-axis: number of search terms; y-axis: bias).

Roughly speaking, the Library of Congress classification differentiates subject areas according to disciplines, i.e., humanities, social sciences, fine arts, natural sciences, and physical sciences. The scheme divides knowledge into 21 classes, with each class further broken down from the general to the specific subject. The class entitled "General Philosophy" was chosen, and the following five subclasses (out of six possibilities), together with search terms for each, were selected for the experiment.


Fig. 3. Dotplots of bias values computed for keywords in the legal classification.

• Ancient philosophy: assyria babylonia, modern thought, plato, nature philosophy, hedonism
• Medieval philosophy: arabic philosophy, aristotle influence, platonism, islamic philosophy, albertus magnus
• Renaissance philosophy: humanism, skepticism, montaigne, thomas more, galileo
• Modern philosophy: realism, comparative philosophy, conservatism, individualism, alienation
• General work: tradition, positivism, absurd, ideals, monism

As in the "Outline of the Law" experiments, five subject domains and five search terms for each domain have been used for the tests in the "General Philosophy" case, and the same 16 search engines were employed. Thus the tables of computed bias values have the same dimensions, and their respective rows and columns represent the same elements as in the first set of experiments; the Minitab 13 (Minitab Inc.) system was used for the statistical analysis.

Table 4 shows the results of an analysis of variance of bias values across five subject areas. Only About's p-value is larger than 0.05, which means that, except for About, the computed bias values are dependent on the choice of subject areas. The test for search engine differences showed the same pattern as in the "Outline of the Law" case. All of the p-values in the "search engine" row of Table 5 below are 0.000, which shows that the differences in bias values between search engines are statistically significant in each of the subject areas examined. As in the previous set of experiments for the "Outline of the Law" classification scheme, the result of a two-way analysis of variance indicates that bias differences are strongly related to differences in search engines and subject domains (Minitab results shown in Table 6).

Table 4
Analysis of variance across subjects

Engine        p-value
About         0.533
Ah-ha         0.000
AltaVista     0.003
AOL           0.000
Fast          0.000
Google        0.000
LookSmart     0.001
Lycos         0.000
MSN           0.00
Netscape      0.000
Sprinks       0.018
Overture      0.000
Teoma         0.000
TrueSearch    0.037
WiseNut       0.000
Yahoo         0.000


Table 5
Analysis of variance across search engines and keyword sets

Philosophy subject           Ancient  Medieval  Renaissance  Modern  General work
Search engine p-value        0.000    0.000     0.000        0.000   0.000
Keyword selection p-value    0.997    0.996     0.998        0.876   0.976

Table 6
Two-way ANOVA (LC philosophy): bias vs. engine, subject

Source        DF    SS         MS        F         P
Engine         15   17.84320   1.18955   1242.72   0.000
Subject         4    0.25768   0.06442     67.30   0.000
Interaction    60    0.41558   0.00693      7.24   0.000
Error         320    0.30631   0.00096
Total         399   18.82277

The influence of keywords on bias reinforces the result obtained for the "Outline of the Law" scheme. All of the p-values in the "keyword selection" row of Table 5 are greater than 0.05, which means that for any of the five philosophy subject areas selected for this experiment, the distributions of the bias values computed for the 16 search engines exhibit no statistically significant differences, even though the search terms used to represent a given subject are different.

Fig. 4 shows the bias values computed using all 25 terms applied to the search. Although the distributions of the bias values are different, the relative order of their mean values is strikingly similar to the result obtained from the previous experiment using the "Outline of the Law" classification scheme.

Fig. 4. Dotplots of bias values computed for keywords in the LC philosophy classification.


AOLSearch, FastSearch, Google, Netscape, and Yahoo all exhibit a low bias profile, whereas Ah-ha, Msn, Sprinks, Teoma, and TrueSearch have high bias values.

4. Conclusion

These two sets of experiments and the earlier results of Mowshowitz and Kawaguchi (2002b) provide support for several important conclusions about bias measurement and search engine performance.

First, the bias measure adopted by the authors is useful for comparing the performance of search engines. Some search engines tend to retrieve items that are found by others, and some search engines do not. This difference in search performance can be determined with the aid of the bias measure, as demonstrated.

Second, the distribution of bias values depends on the subject matter under consideration. The experiment reported in Mowshowitz and Kawaguchi (2002b) using the CACM classification scheme showed no statistically significant difference in bias values over the subdomains searched except for two of the fifteen search engines included in that analysis. But in the current experiments (using "Outline of the Law" and the Library of Congress classification) the majority of search engines differ significantly in bias from one subject to another, strongly suggesting that such variations across subjects are likely to be the norm. In general, one search engine does not perform uniformly better than others in obtaining either popular or rare information on the Web. Moreover, within a given subject area (such as the ones examined, i.e., computer science, law, philosophy), choice of search terms relevant to that subject area does not account for much of the variance among the bias values. This is especially true when the items retrieved are the collective result of a series of searches with different but related search terms. Differentiating between engines on general performance calls for establishing a bias profile defined for a variety of well-chosen subjects.

Third, all of these observations point to the need for further research to characterize search engine performance, analyze its sensitivity to subject areas, and determine the significance of bias under specified search conditions. Bias measurement is one of the critical tools needed to evaluate today's search services; bias profiling as part of a broader benchmarking procedure would be a logical extension of the results reported here.

As argued above, assessing bias is an important problem since the Internet is already a major source of information for individuals and organizations, and its role as a source can be expected to increase in the future. Failure to take account of bias in the results of searches could be hazardous to Web users. The measures and procedures for assessing bias described here are intended as additions to the stock of tools for assessing the quality of information obtained on the new medium. In particular, practical measures can be implemented for use in detecting bias in Web search engines.

References

Becker, B., & Hayes, R. M. (1963). Information storage and retrieval: tools, elements, theories. New York: Wiley.
Flake, G. W., Lawrence, S., Giles, C. L., & Coetzee, F. M. (2002). Self-organization and identification of Web communities. IEEE Computer, 35(3), 66–71.
Glover, E. J., Lawrence, S., Birmingham, W. P., & Giles, C. L. (1999). Architecture of a metasearch engine that supports user information needs. In Proceedings of the eighth international conference on information and knowledge management (CIKM-99) (pp. 210–216). New York: ACM.
Gordon, M., & Pathak, P. (1999). Finding information on the World Wide Web: the retrieval effectiveness of search engines. Information Processing and Management, 35(2), 141–180.
Harary, F. (1969). Graph theory. Reading, MA: Addison-Wesley.
Kawaguchi, A., & Mowshowitz, A. (2001). Analyzing search engine bias. In Proceedings of the 1st international conference on computing and information technologies (ICCIT), Montclair, NJ (pp. 3–8).


Kleinberg, J. M., Kumar, R., Raghavan, P., Rajagopalan, S., & Tomkins, A. (1999). The Web as a graph: measurements, models, and methods. In Proceedings of the fifth international conference on computing and combinatorics, Tokyo, Japan, July 26–28 (pp. 1–17).
Lawrence, S., & Giles, C. L. (1998). Searching the World Wide Web. Science, 280(3), 98–100.
Lawrence, S., & Giles, C. L. (1999). Accessibility of information on the Web. Nature, 400(199), 107–109.
Liu, J. (1998). Guide to meta-search engines. BF Bulletin (Special Libraries Association, Business and Finance Division), 10, 17–20.
Meadow, C. T. (1973). The analysis of information systems (2nd ed.). Los Angeles: Melville.
Mowshowitz, A., & Kawaguchi, A. (1999). Bias in information retrieval systems. In Proceedings of the ninth annual workshop on information systems and technologies (pp. 32–37). American Society for Information Systems.
Mowshowitz, A., & Kawaguchi, A. (2002a). Assessing bias in search engines. Information Processing and Management, 38(1), 141–156.
Mowshowitz, A., & Kawaguchi, A. (2002b). Bias on the Web. Communications of the ACM, 45(9), 56–60.
Salton, G. (1968). Automatic information organization and retrieval. New York: McGraw-Hill.
Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.
Saracevic, T. (1975). Relevance: a review of and a framework for thinking on the notion in information science. Journal of the American Society for Information Science, 26, 321–343. Reprinted in Sparck Jones and Willett (1997).
Schwartz, C. (1998). Web search engines. Journal of the American Society for Information Science, 49(11), 973–982.
Search Engines.com (2003). Available: http://www.searchengines.com/searchengine_listings.html.
Sparck Jones, K., & Willett, P. (Eds.). (1997). Readings in information retrieval. San Francisco: Morgan Kaufmann.
Spink, A., Jansen, B. J., Wolfram, D., & Saracevic, T. (2002). From e-sex to e-commerce: Web search changes. IEEE Computer, 35(3), 107–109.
Sullivan, D. (2003a). Where are they now? Search engines we've known and loved. The search engine report, March 4. Available: http://www.searchenginewatch.com/sereport/article.php/2175241.
Sullivan, D. (2003b). Major search engines and directories, April 29. Available: http://searchenginewatch.com/links/article.php/2156221.
Sullivan, D. (2003c). Who powers whom? Search providers chart, May 5. Available: http://www.searchenginewatch.com/reports/print.php/34701_2156401.
West Publishing (2002). West's analysis of American law. Eagan, MN: Thomson-West.
Wishard, L. (1998). Precision among Internet search engines: an earth sciences case study. Issues in Science and Technology Librarianship, Number 18, Spring 1998. Available: http://www.istl.org/98-spring/article5.html.
Wonnacott, T., & Wonnacott, R. (1984). Introductory statistics for business and economics (3rd ed.). John Wiley & Sons.
Xie, M., Wang, H., & Goh, T. N. (1998). Quality dimensions of Internet search engines. Journal of Information Science, 24(5), 365–372.