What are you searching for? A content analysis of intranet search engine logs Dick Stenmark Viktoria Institute, Hörselgången 4, 2nd floor, S-41756 Göteborg, Sweden
[email protected] Abstract. Web search engines have become often-used tools for many ordinary people today and a growing number of researcher are therefore studying how these lay-persons interact with such tools. However, an overwhelming majority of these studies have focused on the public web and not equally much effort has yet been devoted to the use of intranet search engines. This oversight is unfortunate since corporate intranets are where much business-related information today is kept, and to understand and improve the way technology may aid organisational members to find information is thus a worthwhile activity. In this paper we report from a longitudinal and comparative transaction log analysis of a large intranet’s search engine covering the period 2000-2004. We conclude that intranet users search for other topics than do users of the public web. Intranet searchers do not use operators to the same extent as Internet searcher, but seem to use a larger mix of languages. Finally, we note that there are information needs that appear to be persistent across time.
Introduction With the popularisation of the World Wide Web (hereafter the web) the amount of information made available to the average information seeker has increased manifold. This evolvement has propelled the development of web-based information retrieval (IR) tools that help people find the information they need in this gigantic repository of seemingly unordered information. Many of these so called search engines have come, and in some cases gone, during the last decade, and names such as AltaVista, Excite, Yahoo, and Google are known not only to a
1
Stenmark, D. : What are you searching for?
small bunch of computer-savvy techies but to a fairly large portion of the population. Studies of public search engines have revealed that web users behave very different from and apply a whole new set of search strategies compared to professionally trained information retrieval specialists (Jansen et al., 2000; Spink et al., 2001; Jansen & Spink, 2003). These observations have made researchers call for a better understanding of everyday web users’ information behaviour, since today’s search engines are sprung out of the needs and behaviours of traditional IR professionals. It seems plausible that the design of IR tools for the broad masses need to be different from design for a small group of specialists. In a parallel vein, other researchers have noticed that organisational internal webs – intranets – have their own specific requirements and that information seeking behaviour seen on the public web not necessarily can be expected to be repeated on intranets (Fagin et al., 2003; Stenmark, 2005). Intranet information is narrower in the sense that it is business oriented and more context specific. Whereas public search engines such as Google cannot be influenced by the user, intranet search engines may be customised to fit the specific requirements of the company hosting them. To understand the information need and behaviour of the organisational members is thus of vital interest. Although Stenmark (2005) has investigated how intranet users search their internal web, no-one has yet studied what they search for. In this paper, we contribute to the understanding of intranet search behaviour by making a longitudinal comparison of search topics from a large corporate intranet. Log files from three different years have been analysed and compared, and we have in particular studied the 100 most frequently used search terms, the most popular search topics, and if and how these have changed over time. The paper is organised as follows. In the next section we account for related research from intranet and public web studies and thereafter we present out research setting and research method. In section four the result of or work is accounted for and we subsequently discuss this in detail in section five. In section six, finally, we draw our conclusions and suggest design implications based on our findings.
Related work Relatively little work has yet been devoted to intranet searching and practically nothing to the content of intranet searching. Choo et al. (1998) studied corporate employees’ use of the web as an information resource to support their daily work activities, and found them engage in a range of complementary modes of information seeking, varying from undirected viewing to formal searching. Göker and He (2000) examined a week’s worth of log file data from Reuter’s intranet search engine in order to develop a method for automatic session boundary
2
Stenmark, D. : What are you searching for?
detection. Hawking et al. (2000) implemented a search engine on a university intranet in order to “reality test” an algorithm, and in a similar vein, Fagin et al. (2003) studied IBM’s intranet with a focus on technical matters. Stenmark (2005), finally, reported a time-based analysis of a week’s worth of intranet search engine behaviour but he only studied how users interacted with the technology; not what they actually searched for. The current study does thus make an explicit contribution to this field, but it also means that there is little previous work on which to build. We have thus had to compare and contrast or results to what is known about public web searching. Users’ interactions with IR systems have been studied and reported for many years, but initially these were all focusing on professional IR-personnel. It was not until quite recently that scholars began to show an interest in how laymen interacted with web search engines (Jansen et al., 2000). Amongst the early work Hölscher’s (1998) study of the German web search engine Fireball and Silverstein et al.’s (1999) report on Alta Vista usage can be mentioned. The most consistent examination of public search engine usage, however, has been carried out by Spink and Jansen, who over the last decade have established a useful research base of web searching behaviour (e.g. Jansen et al., 2000; Jansen & Spink, 2003; 2004; Spink et al., 2001; 2002). In this paper, we shall in particular use Jansen and Spink’s (2004) analysis of trends and differences in users’ search topics. By manually analysing a randomly selected sample of 2,600 queries, Jansen and Spink derived eleven non-mutually exclusive categories and found an overall trend towards using the web as a tool for information or commerce rather than for entertainment, although this trend was stronger in the U.S. Table I shows the distribution across the categories. Table I. Distribution of search topics in 2002 (Jansen & Spink, 2006) Category
pos
AlltheWeb
pos
Alta Vista
People, places or things
1
41.5%
1
49.3%
Computers or Internet
2
16.3%
3
12.4%
Commerce, travel, employment or economy
3
12.7%
2
12.5%
Sex and pornography
4
9.5%
7
3.3%
Health or science
6
4.5%
4
7.5%
Entertainment or recreation
5
4.9%
6
4.6%
Education or humanities
9
2.3%
5
5.0%
Society, culture, ethnicity or religion
7
2.6%
8
3.1%
10
2.1%
9
1.6%
8
2.5%
10
0.7%
11
1.1%
11
0.0%
Government Performing or fine arts Non-English or unknown
3
Stenmark, D. : What are you searching for?
Both in the European-based AlltheWeb and in the U.S.-based Alta Vista the People, places or things category ranked number one in 2002, way ahead of the opposition. AlltheWeb had the Computers or Internet category second and the Commerce, travel, employment or economy category in third, whereas Alta Vista had them in reverse order. These three categories accounted for more than 70% or the queries. The Sex or pornography category (ranking 4th and 7th, respectively) showed a noticeably decreasing percentage in both search engines compared to previous years. Another relevant related work is the analysis of the Utah state public web site search engine carried out by Chau et al. (2005). Whereas a general-purpose search engine such as Google spiders as much of the web as possible to create as big an index as possible, a web site search engine only indexes its own content and allows users to search for information only within this narrow context. Such search engines can be found at the homepages of many commercial web sites such as Microsoft Corporation (http://www.microsoft.com). These local search engines can be update more frequently than can general-purpose search engines and their interfaces may be customised and include domain-specific features (Chau et al., 2005). These local web site search engines have much in common with intranet search engines, we argue, and we shall use the results of Chau and colleagues as a point of reference for our own work. Chau et al. (2005) found both similarities and differences when comparing general-purpose search engine users and web site search engine users. The both groups were similar in the average number of terms per query and the average number of result pages viewed per session, but differed in terms of number of queries per session. The use of operators was also reported and compared to Spink et al.’s EXITE study. However, there are some calculation errors in Spink et al.’s (2001) table that have been reproduced without comments in Chau et al.’s work. Further Chau et al. also have some calculation error in their table. In our table II below we have recalculated Spink et al.’s and Chau et al.’s percentages using the authors’ data as presented in their articles. Table II. The use of operators in the Utah study and the Excite study (* marks corrected values) Feature AND
the Utah study
the EXCITE study
2.6%
2.8%
OR
0.08%
0.11%
NOT
0.09%
0.03%
+ (plus) - (minus) “ ” (phrase) ( ) (parenthesis) Any sort of operator
0.02% 0.05%
* *
* *
4.3% 2.1%
29.0%
5.1%
0.1%
(not reported)
34.4%
(not reported)
4
Stenmark, D. : What are you searching for?
As can be seen from table II there where both similarities and differences. The use of traditional Boolean operators (AND, OR, NOT) is similar, whereas + and – operators are much more common in the EXCITE study. The most significant difference, however, is the use of phrase search in the Utah study, where almost a third of all queries contained quotation marks. Another major difference was found in terms of search topics; Chau et al. compared the most frequently used query terms with those reported by Spink et al. (2001) and found that web site searchers had much more specific information needs. Web site search engine users submitted terms much more related to the specific domain and much less concerned with sex. The top queries to the Utah state search engine were dmv, tax forms, sex offenders, forms, and jobs. Only nine terms were found in both Spink et al.’s and in Chau et al.’s top-50 term lists; ‘of’, ‘and’, ‘for’, ‘sex’, ‘in’, ‘the’, ‘jobs’, ‘a’, and ‘to’, and as can be seen from this list all but two are functional words rather than semantic words.
Research setting and method This research is based on analysis of search engine log files from Volvo’s intranet. Volvo is a big manufacturer group with offices and production plants in many countries around the world and employs some +80,000 people. Volvo’s intranet was established in 1995 and quickly developed into a large information repository. In 1998, Volvo purchased and implemented a commercial search engine from Infoseek Inc, and when spidering the intranet little over 400,000 documents were indexed from some 450 web servers. These numbers continued to grow; at the end of the millennium the search engine had indexed 750,000 documents and found more than 700 web servers and in 2002 there were over 1,500 known web servers on the intranet, according to Volvo sources. The search engine generates a log file where every transaction the users have with the server is recorded. This log file contains the IP address of the user’s computer, the date and time (datetime) of the transaction (as logged by the server using Central European Time), the query string as entered by the users, information regarding which result page the users have requested, and some additional parameters not used in this particular study. The three log files used were collected in 2000, 2002 and 2004, respectively. The 2000 log file contains almost four week’s worth of transactions from January 31st to February 24th. The 2002 log file contains one week’s worth of transactions from October 21st to October 27th, and the 2004 log file, finally, contains one week’s worth of transactions from October 14th to October 20th. In all, the log files contain more than 128,000 activities from more than 23,000 users. Transaction log analysis (TLA) is a well-established method when examining search engine usage (Jansen, 2006). Still, commentators acknowledged that no standardised metrics have been agreed upon and interpretations and definitions
5
Stenmark, D. : What are you searching for?
differ between studies (cf. Jansen and Pooch, 2001; Spink et al., 2001). In our study, we extracted all query strings from the log files. These strings where thereafter split up in individual words and operators, and counted for frequency. We also counted all term pairs and term triplets. This included both “natural” pairs/triplets where users explicitly had submitted the two/three terms together (such as in human resources or volvo ocean race), and “derived” pairs/triplets where these were extracted from longer query phrases (e.g., the query volvo ocean race generates the two pairs volvo ocean and ocean race). To derive topics from the query terms we utilised card sorting, a technique to understand how users categorise items often used in information architecture (cf. Dong et al., 2001; Sinha and Boutelle, 2004). Open card sorting is typically used in explorative studies to generate candidate categories. We printed the top-100 terms from each year on small cards and let 24 randomly selected Volvo employees each sort 50 of these into categories at their own discretion. This gave us four individual categorisations for each term. We thereafter renamed and synthesised this set into 9 topics.
Results We first calculated the absolute frequency for every query term and year. For year 2000 we found 69,369 terms being submitted of which 17,390 (25.1%) were unique terms. For year 2002 we found 25,320 terms of which 8,021 were unique (31.7%). For year 2004, finally, we counted 30,719 submitted terms of which 9,037 were unique (29.4%). The 100 most frequently used terms (the top-100) accounted for between 22.9 and 24.0% of the total terms, as can be seen in Table III. Table III also accounts for the portion of the total that the top-50 and top-10 terms result in. The portion of not repeated terms (i.e., terms only used in one query) constituted between 15 and 19% of the total terms but nearly 60% of the different terms. Table III. Portions of the query terms being on the top-100, top-50, top-10 or being unique for each year. 2000
2002
2004
Total terms
69,360
25.320
30.719
Different terms
17,390
8.021
9.037
top-100
22.9%
24.0%
23.0%
top-50
16.8%
17.6%
16.0%
top-10
8.0%
7.7%
7.3%
10.377
4.772
5.179
not repeated terms out of total
15.0%
18.8%
16.9%
out of different
59.7%
59.5%
57.3%4
6
Stenmark, D. : What are you searching for?
Table IV below shows the top-100 terms for each year. No stemming or other linguistic techniques were used which, for example, can be seen in positions 2 and 3 for the year 2000, where “tjänstebilar” and “tjänstebil” occupies position 2 and 3, respectively. The former is simply the plural form of the latter and the two terms thus represent the same information need. Table IV. Top-100 terms and their rank listed per year (top-10 coloured). pos 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
2000 volvo tjänstebilar tjänstebil and kola sif standard memo rapid it job class service eddo truck ford quality lediga race vcc cars ocean car mcs v70 metall training vtc manual pc job parts sales business global system bil personal karta parma 2000 trucks leasing product pictures engine map iso c-bil download
2002 volvo kola outlook rapid pc standard tidinfo mail tjänstebilar web mailforms parma password it service eddo and parts gps forms business std access class mcs truck reseräkning quality hempc cats tjänstebil global gdi vtc image policy bil webmail utbildningspc email purchasing tdm gdp manual 3p sbgtools telefon hem plan product
2004 volvo kola rapid tidinfo outlook it service gps ebd password gdi business standard parts parma group web tdm reseräkning and pbp plan gdp global of impact map project change vinst sox support hr the 3p travel mail trucks karta teamplace order penta helpdesk manager management system truck product manual cats
pos
2000
2002
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 97 99 100
tdm std web management & cf blanketter of 1999 celero bilbiten group standards pv bilar project news motor the vhk vcom market sprint notes career report cst travel vit development tsu sörredsgården for technical bcd profile dator corporate vtv human homepage new ab start vtj bokenäs skövde policy gdi bus
change utbildnings support learning vinst profile travel volvobil training bilbiten vce nice utbildning visitkort group trucks tools process kvitto jobb 26400 nap order alviva vcom bilar engine bonus 26300 aqua map sörredsgården partner management development webx scs celero sap hr start personal form standards services iso design traktamente sif system
2004 class policy corporate form center test training lundby std protus mack dealer 2004 f2b template vmt mailforms scs standards printer way function vcom catia process alviva portal image air exter in sap glopps report team for vcd plant phoenix list pc johansson learning iso quality forms telesupport cars bil skövde
7
Stenmark, D. : What are you searching for?
There were a total of 185 unique terms amongst the 300 most frequently used search words. Thirty-two of these (representing 17.3%) were found across all three years. Another 49 terms (26.5%) were found in two of the years, and the remaining 104 terms (56.2%) were only used in one year. Looking specifically at the top-10 for each year, we found the distribution to be very similar. Three out of a total of 19 terms (representing 15.8%) were amongst the top-10 for all three years (volvo, rapid, kola), five terms (26.3%) were found in two of the years, and 11 terms (57.9%) were only found in one top-10 set. The frequencies of the terms appearing in Table IV were left out due to space limitations but to give the reader a flavour of the numbers we here present a few samples. Position #1 for the year 2000 (volvo) occurred 1,713 times, position #10 (it) 262 times, position #50 (download) 104 times, and position #100 (bus) occurred 70 times. Corresponding frequencies for 2002 were 414, 108, 43, and 27, and for 2004 655, 120, 51, and 35. As can be seen from these numbers, the frequencies drop radically with decreasing rank. This is a since long known phenomenon documented by Zipf (1932), who noted that a double-log rankfrequency plot generates a straight line with a slop of -1 for large English texts. Plotting the query words from our log data in such diagrams, we received the output illustrated in figure 1 below. Our lines were not as steep as Zipf’s prediction; the slops for the three years were -0.8895, -0.8133, and -0.8435, respectively. 4
4
4
3
3
3
2
2
2
1
1
1
0
0
0 0
1
2
3
4
5
0
1
2
3
4
5
0
1
2
3
4
5
Figure 1. Double-log rank-frequency for the years 2000, 2002, and 2004. The x-axis shows the rank and the y-axis the frequency, both logarithmically.
In contrast to table IV above, which lists the most frequently used terms, tables V and VI below show the most frequent pairs and triplets found amongst the query terms. The tables contain both naturally occurring pairs and triples and derived occurrences, i.e., pair and triplets extracted from longer text sequences. As we can see, tables V and VI differ from table IV. The term volvo is now combined with other words and appears in about one third of the pairs/triplets, and many of the frequent terms in table IV (such as kola, sif, rapid, and tidinfo) are not represented in tables V and VI.
8
Stenmark, D. : What are you searching for?
Table V. Term pair rank and frequency for all three years 2000 Pos
Freq
Terms
1
152
ocean race
2
131
3
104
4
96
2002 Freq
Terms
2004 Freq
Terms
51
web access
56
volvo it
lediga jobb
43
mail forms
43
volvo way
volvo tjänstebilar
42
hem pc
27
volvo trucks
volvo cars
40
utbildnings pc
27
the volvo
5
81
volvo it
37
volvo it
26
volvo cars
6
80
volvo car
34
standard parts
25
volvo group
7
77
volvo ocean
25
volvo bil
25
function group
8
63
volvo truck
23
business plan
25
outlook password
9
59
volvo nu
20
change password
24
business plan
10
53
volvo way
19
volvo way
23
business objects
11
45
Volvo trucks
18
image bank
22
air drier
12
45
human resources
17
function group
21
order it
13
42
lotus notes
15
volvo cars
20
volvo bil
14
37
hem pc
14
-vcc -vct
20
change password
15
35
no 4
14
sbg tools
19
hr partner
16
33
stock market
13
human resources
18
radio service
17
32
product development
13
-vct vtc
17
password change
18
31
business plan
13
volvo truck
17
holiday schedule
19
31
volvo if
13
lediga jobb
17
corporate standards
20
30
memo for
13
quality policy
16
active surface
21
30
attitude survey
12
vce quality
16
leif johansson
22
29
for windows
12
volvo car
15
net meeting
23
29
lediga platser
12
job postings
15
image service
24
29
ford dator
12
net meeting
25
28
cst newsletter
26 next: 2 pairs with 27 occurrences
next: 4 pairs with 11 occurrences
14
image bank
14
team place
14
volvo penta
next: 4 pairs with 13 occurrences
Out of the 75 term pairs in table V, only 4 pairs (volvo cars, volvo it, volvo way and business plan) were present in all three years, which corresponds to 5.3%. Another 11 pairs (14.7%) were present in two of the years, whereas the remaining 60 pairs (80.0%) only ranked amongst the top-25 in one year. Comparing tables V and IV, we see that although there are no term pairs in table IV many of the terms in table IV appear in the terms of table V. Little less than half of the pairs (often the highly ranked ones) consist of terms on the top-50 list. When examining triplets in an analogous way we produced Table VI. Out of the reported 78 term triplets in table VI, only one triplet (the volvo way) was represented in all three years, which corresponds to 1.3%. Another 4 triplet (5.1%) were present in two of the years, whereas the remaining 73 triplets (93.6%) only were present in one year.
9
Stenmark, D. : What are you searching for?
Table VI. Term triplets rank and frequency for all three years. 2000 Terms
2002
Pos
Freq
Freq
1
71
volvo ocean race
2
29
memo for windows
3
25
2000 1999 1998
7
4
24
aftermarket and service
6
5
23
no 4 1999
6
13 7
Terms
2004 Freq
Terms
-vcc -vct vtc
20
the volvo way
function group index localização das concessionarias who is who design building landscaping
10
code of conduct
10 9 8
volvo trucks plant volvo do brasil i-shift gear box regulations and certification engine data sheet
6
23
cst newsletter, no
6
the volvo way
8
7
23
newsletter, no 4
5
outlook web access
7
8
20
volvo tjänstebilar ab
5
vce quality policy
7
trucks plant in
9
18
volvo action service
5
-vct vtc -it
7
welding manual design
10
16
volvo attitude survey
5
one company vision
6
class for unix
11
16
the volvo way
5
ett friskt arbetsliv
6
personal business plan
12
13
global trucks investment
6
volvo learning partner
13
4
volvo learning partners
6
of internal consultants
14
13
bil på g communications and sales volvo cars marketing
4
13
4
second to none
6
volvo ocean race
15
12
volvo do brasil
4
arbetsgång -vcc -vct
6
outlook password change
16
10
of the brand
4
volvo truck korea
6
vehicle regulations and
17
10
and service bulletin
4
6
18
10
truck of the
4
19
10
soul of the
4
5
volvo's air drier business objects password and six customers
20
10
volvo personbilar sverige
4
5
trucks north america
21
9
volvo group presentation
4
5
manual design analysis
5
who is who
5
air drier active
22
9
corporate identity manual
4
cisco vpn client -information technology investering volvo global trucks vtc -information technology pubtex output 2002.01.09:0620 aftermarket training kit
23
9
volvo cars magazine
4
volvo attitude survey
5
24
9
molnen lättar över
4
chart of accounts
5
volvo trucks north
25
9
of the year
4
investering -vcc -vct
5
drier active surface
4
-vct vtc -information
5
just in time windows systems engineer
26 27
5 next: 8 triplets with 8 occurrences
next: 42 triplets with 3 occurrences
next: 43 triplets with 4 occurrences
We also examined the most frequently occurring queries as submitted by the users and the results are presented in table VII. Focusing on table VII, we see that single term queries dominate; there are only nine multiple term queries amongst the top-100 for the year 2000 and eight and seven for the years 2002 and 2004, respectively. The number of multiple term queries amongst the top-50 queries is five, four and one for the three years. There is only one three-term query (volvo ocean race) and ten of the multiple term queries contain the word volvo. One noticeable difference when comparing table VII with table IV (the term frequency table) is that the term volvo has disappeared from the former. In fact, comparing the two tables year by year, we find that there are only 46 terms overlapping for the year 2000. The portion of overlap is actually larger when looking at the top-50 terms; 36 of the 50 most frequently used queries appear also amongst the most
10
Stenmark, D. : What are you searching for?
frequently used terms. Corresponding numbers for year 2002 are 56 overlapping terms (41 out of the top-50) and for year 2004 39 overlaps (32 out of the top-50). Table VII. Top-100 queries and their rank listed per year (multiple word queries coloured). pos
2000
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
tjänstebilar tjänstebil kola sif rapid eddo mcs metall memo standard class parma c-bil tdm cf blanketter lediga jobb bilbiten sörredsgårde n vtj eifel bokenäs jobb
kola rapid outlook tidinfo mailforms parma tjänstebilar eddo gps mcs cats reseräkning tjänstebil hempc webmail standard tdm utbildningspc web access
kola rapid tidinfo ebd gps parma gdi tdm pbp reseräkning sox outlook impact cats gdp teamplace mailforms vinst standard
51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
sbgtools mail forms hem pc email
f2b protus scs alviva
70 71 72
volvo tjänstebilar job tsu
gdp
phoenix
mail bilbiten
password vcd
75
leasing profile celero start v70 ocean race e-bil vcom personal
volvobil utbildnings password kvitto standard parts profile 26400 gdi sörredsgården
exter glopps telesupport orderit helpdesk techla resplan park visitkort
77 78 79 80 81 82 83 84
gdi ford aqua virus protus vcc volvo ocean race reseräkning career hempc sprint standards volvo nu bcd aftonbladet
nice 26300 alviva forms visitkort vinst webx
bilbiten start sif hempc celero cpt 3p
86 87 88 89 90 91
scs 26000 protus class techla aqua traktamente webaccess
tjänstebil vsib volvo way vcom pcm nap friskvård sbgtools
93 94 95 96 97 97 99
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
2002
2004
pos
69
73 74 76
85
92
100
2000
2002
2004
karta training visitkort jobpostings tä forms kvitto sbgtools volvoresultat personalköp cars telefon bms gdp bilder outlook impact vtv news
3p 29000 pc nap orderit vdw milersättning celero purchasing sif telefon bms bennett vcd volvo bil start image travel jobb
qulis printer volvobil terminalglasögon skövde vda travel order it function group amica 26400 26300 vdw tjänstebilar ths niceweb netmeeting motiv hr partner
dynafleet bonus ftp travel
telesupport bonus resplan utbetalningsan modan wss
email nice metall jena
hrt gid
pictures utbildning quality fritid images antivirus jobs internbuss bibliotek
utbildning change password vct2000 image bank metall pcm digipass qulis vtj ntsb bennet
volvo it map fmea traktamente security bilköp 2510
vcom volvo it 26500 blanketter netmeeting hälsovård impact
stock market vtk bilia volvo way digipass visits motor kurser
sap vega foto vu glopps pins köping skövde
volvo cars leasingbil maps
iprint
forms edb bennet volvo bil vmt traktamente reseinfo pa-handbok change password bite2 antivirus 29000 sörredsgården servicebureau sem sap personalhandbok maps karta vcads umeå telefon qjs outlook password
11
Stenmark, D. : What are you searching for?
As is evident from the previous tables, the users sometime include operators in their queries. The use of operators, however, was declining. In year 2000, little less than 10% of the queries included operators of some sort. In year 2004 this percentage was down to just over 3%. The most commonly used operator was the quotation mark, which is used to indicate a phrase search. This single operator accounts for approximately half of all operator-including queries. The second most common operator is the + sign, which specifies that a term has to be present in the result set. In 2004, the + sign was used in around 20% of all operatorincluding queries but only in 0.7% of the total queries (se also table IX in the Discussion section). Table VIII. Language distribution for the top-100 terms across all years. Language
2000
2002
2004
English
40
33
43
Swedish
16
21
6
0
0
0
44
46
51
Other Indeterminable
What can also be observed in the previous tables is that the queries come in several different languages. We manually analysed the top-100 terms from each year and determined the language used. This is reported in table VIII and as can be seen from table VIII many terms were indeterminable. This category includes acronyms (e.g., gps), language-independent names of places (e.g., lundby), persons (e.g., johansson) or computer systems (e.g., parma), and terms that are spelled the same way in both English and Swedish (e.g., global). No languages other than Swedish and English were found amongst the top-100 terms. 30 25 20
2000 2002 2004 Total
15 10 5
Sy st em s
s
da rd s St an
Pr oj ec t
uc ts Pr od
Pl ac es
rg .
IT
H R
O
M
is c.
0
Figure 2. Distribution (%) of search topics for all three years and synthesised total.
Our open card sort exercise resulted in 9 categories as can be seen in Figure 2. Human Resources, Organisational Issues and Computer systems and Applications were the dominating categories, seen over all years. As is evident
12
Stenmark, D. : What are you searching for?
from Figure 2, there were large differences between years. Project information and Standards & policies have been growing every year, although they are still rather small. In contrast, the Organisational Issues and Products categories are consistently declining in interest. Places & People is the smallest of all categories, although many small topics are hidden in the Miscellaneous category.
Discussion As was evident from Table II, the top-x terms portions of the total are pretty consistent over the years. The portion of not repeated words is not equally stable, although the variances are rather small. Close to 60% of the query terms are used only once, but since the repeated words are sometimes used very frequently, the not repeated terms only make up some 15-19% of the total. Still, a significant portion of the query terms are not repeated, indicating that the information need is focused on quite a narrow field. When studying the top-100 terms, we noted that some 17% of the terms reappeared every year, whereas more than half of the terms only were present in one year. This distribution holds also for the top-10 terms. Apparently, there are things that the Volvo employees continue to search for year after year and such information would be useful to information providers and site designers within the organisation. Regarding term frequency we posit that the large portion of unique words indicates that there is a shift in information seeking behaviour from year to year. It is, however, not possible to say whether this depends on a shift in information needs or on a better organisation of the information. Studying table IV, one can come to the conclusion that volvo is a rather common query term. This is only partly true. The term volvo does actually top the list every year but volvo as a stand-alone term occurs only in 34 of the 2,782 queries that includes the term volvo. In 98.78% of the volvo-related queries, the term volvo is combined with other terms, which can be seen also from tables V and VI. Although table IV is correct, such listing of individual terms may skew the understanding of the search behaviour in unexpected ways. Nonetheless, term frequency lists are presented in much of the published research in this area (cf. Jansen & Spink, 2005; Spink et al., 2001; Jansen et al., 2000). Based on our study, we argue it may be better to instead, or at least as a complement, list the most frequently submitted queries or to include the most frequently used pairs and triples, as do Chau et al. (2005). Doing so adds a bit of context to the terms and helps the interpretation. Organising the query terms into topics or categories is another way to contextualise the terms. In the year 2000, queries about Organisational Issues topped the ranking with to 19%, closely followed by Human Resources (18%) and Miscellaneous (18%). These three categories accounted for more than half of the queries logged during the sampling weeks. This picture had changed
13
Stenmark, D. : What are you searching for?
dramatically in 2002, where Computer systems and Applications had doubled its portion to 26% and moved into first place. The Human Resources category was still in second place, having gained to 23%, and Organisational Issues, now in third, had fallen off to 16%. In the most recent sample, from 2004, Computer systems and Applications held on to the lead but was down to 23%. The second place was spilt between Organisational Issues and Miscellaneous, both with 14%. Human Resources had dropped to 10%. The only category scoring amongst the top three every year was the Organisational Issues category, but this category is also one that consistently is losing in frequency. Standards & policies and Project information are instead gaining in frequency for each year but are still relatively small. We interpret this to mean that much of the intranet use has been to find information about various parts of the Volvo organisation. The fact that the search interest is declining may indicate a more salient presence of this sort of information in the company provided start pages. Another noticeable detail is the Miscellaneous category, which, apart from a drop in 2002, has been pretty dominant and holds a fourth position in the overall size ranking. This also means that many small topics that are hidden under this label may be seasonal or unique to a particular year. Referring to table VIII, we see that there is no clear pattern as to how the use of different languages is developing. We see that the use of Swedish terms has dropped significantly in 2004 but we do not know if this is a lasting condition or just a temporary phenomenon. What are consistent are the dominance of indeterminable words, i.e., acronyms or words with identical spelling, and the dominance of English over Swedish. This may, however, not be the case in the corpus as a whole, and further analyses are needed to verify this. When comparing our findings with that of Jansen and Spink (2006) we immediately see that the search terms and search topics reported by them from the public search engines differ significantly from the terms and topics we found at Volvo. This is not surprising and echoes the findings of Chau et al. (2005) who noted that terms used in site searching were very different from those used in general-purpose search engines. For example, neither we nor Chau et al. found many sex related terms, whereas these terms dominated the ranking list in Spink et al.’s (2001) study. Further, the Places and people category was one of the smallest in our study while it topped the rank in Jansen and Spink’s (2006) study. We speculate that the absence of personal homepages in Volvo’s intranet and the fact that organisations such as Volvo may have other applications and functions to locate people within the organisation have lessened the need for this sort of queries. Computer systems and Applications is the most commonly used category followed by Human Resources and Organisational Info. We note that the HR category has dropped recently. This could mean that people have lost interest in these issues but another – perhaps more plausible – explanation is the HR info has been made easier to find. We also note that many categories percentages fluctuate
14
Stenmark, D. : What are you searching for?
drastically between years, again illustrating that information needs varies over time. As in Chau et al.’s (2005) study, we see that the frequencies for the highest ranked term pair is considerably lower in our study than the frequency of the highest ranked term, and that the frequency for most sought for triplet is lower still. However, the drop is much more pronounced in our data than in Chau et al.’s study. We suggest that this is because intranets contain more jargon and more acronyms than do the public web. Another possible explanation is the presence of Swedish terms. The Swedish language makes use of compound words, resulting in single terms where e.g. English would have used two terms. Additional linguistic analysis is required to fully understand this issue. We also see that the slopes of the Zipf plots in figure 1 are not as steep as theory would have them. This again can be explained by the Swedish language, where our habit of constructing compound words makes the number of terms grow quicker than the frequency. Finally, when comparing the use of operators in our study to that of Spink et al.’s and Chau et al.’s we must first remember that Spink et al.’s data dates back to 1997 whereas Chau et al.’s data is from 2003 and thus much closer in time to ours (see table IX). Table IX. Comparing the use of query operators in our study with those of the Utah study (Chau et al., 2005) and the EXCITE study (Spink et al., 2001). Feature AND
Utah study
EXCITE study
2000
2002
2004
2.6%
2.8%
0.8%
0.4%
0.3%
OR
0.08%
0.11%
0.02%
--
--
NOT
0.09%
0.03%
0.01%
0.02%
0.02
+ (plus)
0.02%
4.3%
4.2%
0.9%
0.7%
- (minus)
0.05%
2.1%
0.1%
0.2%
0.04%
“ ” (phrase)
29.0%
5.1%
4.4%
2.2%
1.9%
0.1%
(not reported)
0.05%
0.03%
0.06%
34.4%
(not reported)
9.6%
3.8%
3.1%
( ) (parenthesis) Any sort of operator
Nevertheless, our data is more similar to that of the EXCITE study, in particular for phrase searches and the use of + signs, at least as far as year 2000 is concerned. Our statistics thereafter drop radically. Overall, the use of operators is much lower than what Chau and colleagues reported, and although Spink et al. did not report their total, we know it cannot be less than 14% and thus also higher that what we found. We do not know why the use of operators has declined so dramatically nor if a similar trend can be seen on the public web. This, too, is an area where more research is needed since the correct use of operators is likely to have a significant positive impact on the results.
15
Stenmark, D. : What are you searching for?
There are organisational implications to be drawn from this study. Some information needs appear to be time-independent and organisations should adjust their information provision accordingly. This means adding information, updating it, highlighting it, adding metadata to it and linking to it from many places are important activities for the organisation once these needs are identified. Search engine log file analysis may thus be a useful tool when assessing the effects of information architecture remakes and new web site designs. Other information needs are more short term; they emerge and disappear in short cycles. To be able to respond to such shifting demands, organisations must closely monitor the queries and be quick to provide the required information. As we have illustrated, it is not enough to study the most frequently used terms, but the whole query. Out study shows that a large international organisations may have multilingual intranets, despite an official corporate language (English in this case). This means that the research on multi-language information retrieval becomes important. The large number of indeterminable terms also point to the need for research on how to correctly deal with synonyms and homonyms. Search engine vendors aiming for the intranet market should closely follow this development and preferably form joint ventures with multi-lingual retrieval researcher to incorporate the latest findings. There are obviously limitations to this study. Although we have used data from three different years and thus been able to follow the development of the queries, our study is limited to one intranet. This is understandable, since a lot of work is required to analyse this amount of data, but our findings still have to be replicated and tested elsewhere before any far-reaching conclusions can be drawn. In our qualitative analysis of the data we have restricted us to the most frequently used terms from each year. It is possible that this has skewed the outcome of the analysis and that our findings do not represent the corpus as a whole. This also has to be taken into consideration.
Conclusions We have studied three log files from a corporate intranet search engine; one file from 2000, one from 2002, and one from 2004. Having extracted the query terms we have been able to analyse what the organisational member have sought for and how their information needs have shifted over time. It is common practice to use query term frequency lists to illustrate information needs. In this paper we have shown that this may produce misleading conclusions since many words in isolation carry very little information. More useful is to present the most frequently used queries or the most frequently used term pairs or term triplet, since this approx allows for more context.
16
Stenmark, D. : What are you searching for?
Not surprisingly, we find that intranet users search for other things than do users of public search engines. Not only are sex related queries rare; intranet users also search less for people and places. The interest for computer-related things seems to be shared between intra and Internet users. The majority of the queries and query terms are replaced from year to year, and the topic categories also shift in size over time. This suggests that information needs fluctuate and are time-dependent. Organisations must thus continuously keep track of the current and emergent needs and be ready to provide the corresponding information. However, we also conclude that certain information needs are rather persistent and time-independent and organisations should focus on providing content in these areas.
Acknowledgments We are grateful to Volvo for providing the log files and to the Swedish Council for Work and Social Research (FAS) who sponsored this research via grant #004-1268. Thanks are also due to Richard Wallmark and Artur Foxander from the Department of Computational Linguistics at Göteborg University for their assistance during the preparation of the data.
References Chau, M., Fang, X., and Sheng, O. R. L. (2005). “Analysis of the Query Logs of a Web Site Search Engine”. Journal of the American Society for Information Science and Technology, 56(13), pp. 1363-1376. Choo, C. W., Detlor, B., and Turnbull, D. (1998). “A Behavioral Model of Information Seeking on the Web: Preliminary Results of a Study of How Managers and IT Specialists Use the Web”. In Proceedings of ASIS Annual Meeting, Pittsburgh, PA., Oct 24-25, pp. 290-302. Dong J., Shirley M. and Waldo P. (2001). “A user input and analysis tool for information architecture”. In Proceedings of CHI2001, pp. 23-24. Fagin, R., Kumar, R., McCurley, K., Novak, J., Sivakumar, D., Tomlin, J. and Williamson, D. (2003). “Searching the Corporate Web”. In Proceedings of WWW2003, Budapest, Hungary, pp. 366-375. Göker, A. and He, D. (2000). “Analysing Web Search Logs to Determine Session Boundaries for User-Oriented Learning”. In Proceedings of Adaptive Hypermedia and Adaptive Web-based Systems, Trento, Italy, pp. 319-322. Hawking, D., Bailey, P. and Craswell, N. (2000). “An intranet reality check for TREC ad hoc”. Technical report: CSIRO Mathematical and Information Sciences. Hölscher, C. (1998). “How Internet experts search for information on the Web”. In Proceedings of WebNet ’98, Orlando, FL. Jansen, B. (2006). “Search log analysis: What is it; what’s been done; how to do it”. Library and Information Science Research, forthcoming. Jansen B. and Pooch U. (2004). “Assisting the searcher: utilizing software agents for Web search systems”. Internet Research: Electronic Networking Applications and Policy, 14 (1), 19-33.
17
Stenmark, D. : What are you searching for?
Jansen, B. and Spink, A. (2003). “An Analysis of Web Documents Retrieved and Viewed”. In Proceedings of ICIC’03, Las Vegas, NE, pp. 65-69. Jansen, B. and Spink, A. (2005). ”An analysis of Web searching by Eruopean AlltheWeb.com users”. Information Processing and Management, 41, pp. 361-381. Jansen, B. and Spink, A. (2006). “How are we searching the World Wide Web? A comparison of nine search engine transaction logs”. Information Processing and Management, 42, pp. 248263. Jansen, B., Spink, A., and Saracevic, T. (2000). “Real life, Real users, and Real needs: A study and analysis of user queries on the web”. Information Processing and management. 36, pp. 207-227. Silverstein, C., Henzinger, M., Marais, H. and Moricz, M. (1999). “Analysis of a very large web search engine query log”. ACM SIGIR Forum, 33(1), pp. 6-12. Sinha, R. and Boutelle, J. (2004). “Rapid Information Architecture Prototyping”. In Proceedings of DIS 2004, Cambridge, MA., 1-4 August, pp. 349-352. Spink, A. and Jansen, B. (2004). Web Search: Public searching of the web. Kluwer Academic Publisher. Spink, A., Ozmutlu, S., Ozmutlu, H. and Jansen, B. (2002). “U.S. versus European Web Searching Trends”. ACM SIGIR Forum, 36(2), pp. 32-38. Spink, A., Wolfram, D., Jansen, B. and Saracevic, T. (2001). “Searching the web: The public and their queries”. Journal of the American Society for Information Science and Technology, 52(3), pp. 226-234. Stenmark, D. (2005). “One week with a corporate search engine: A time-based analysis of intranet information seeking”. In Proceedings of AMCIS, Omaha, NE, 11-14 August. Zipf, G. K. (1932). Selected studies of the principle of relative frequencies in language. AddisonWesley.
18