A review of the development and application of the Web impact factor

Xuemei Li

The author
Xuemei Li works in the School of Computing and Information Technology at the University of Wolverhampton, UK.

Keywords
Internet, Worldwide Web, Journals, Bibliographic systems

Abstract
Since 1996, hyperlinks have been studied extensively by applying existing bibliometric methods. The Web impact factor (WIF), for example, is the online counterpart of the journal impact factor. This paper reviews how this link-based metric has been developed, enhanced and applied. Not only has the metric itself undergone improvement but also the relevant data collection techniques have been enhanced. WIFs have also been validated by significant correlations with traditional research measures. Bibliometric techniques have been further applied to the Web and patterns that might have otherwise been ignored have been found from hyperlinks. This paper concludes with some suggestions for future research.

Electronic access
The Emerald Research Register for this journal is available at http://www.emeraldinsight.com/researchregister
The current issue and full text archive of this journal is available at http://www.emeraldinsight.com/1468-4527.htm

Online Information Review · Volume 27 · Number 6 · 2003 · pp. 407-417
© Emerald Group Publishing Limited · ISSN 1468-4527
DOI 10.1108/14684520310510046

Introduction
Although the Web is unorganised and anarchic in nature because anyone can publish any content and create hyperlinks pointing elsewhere on the Web, it has become an invaluable information resource. Suspecting that hyperlinks can reveal information about Web sites, information scientists have studied hyperlinks since 1996. The rationale behind this is an analogy between hyperlinks and citations (Almind and Ingwersen, 1997; Rousseau, 1997; Davenport and Cronin, 2000). Citation analysis has proved to be fruitful, partly due to the work of the Institute for Scientific Information (ISI), which has indexed 8,500 of the most prestigious, high impact research journals in the world (ISI, 2003). Journal impact factors (JIFs) are calculated by dividing the number of citations made in one time period to articles that a journal published in another period by the number of articles the journal published in that second period. Citations can be used, with caution, to evaluate the scholarly output of researchers, departments, whole universities or even whole nations (Moed, 2002). Collaboration patterns in science have been examined using citation information supplied by the ISI (White and Griffith, 1982; Small, 1999; Cawkell, 2000; Glänzel, 2001; Glänzel and Schubert, 2001).

The question is: what kind of information can hyperlinks illustrate? The content of the Web is not of the same quality as the database maintained by the ISI. Although there are more and more online journals, any sort of content can appear on the Web. In addition to peer reviewed papers, lecture notes, preprints of draft papers, the author’s CV, hobbies, family information and religious content can appear on personal Web pages. To measure a researcher’s scholarly output by the number of links their Web pages receive is therefore ridiculous. However, the number of hyperlinks can record informal communication between scholars. Unlike telephone or oral conversations, which usually leave no record, hyperlinks can persist (though they can also disappear over time) and so give an accessible source of evidence.

Refereed article received 5 July 2003
Accepted for publication 12 August 2003

The author wishes to thank Dr Mike Thelwall and Dr David Wilkinson for their help in writing up this paper. Thanks also to the referees for their helpful comments.
This paper reviews recent progress that information scientists have made towards understanding this new information source. Metrics have been developed to measure Web interlinking. Techniques have been invented to clean the link data. Motivations for hyperlink creation have been studied to validate interpretations. We conclude with an indication of possible future research directions in this area.
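The journal impact factor described earlier in this section can be written compactly. The symbols are introduced here purely for illustration and are not the author's notation: P_1 is the citing period, P_2 the publication period, C_J(P_1 → P_2) the number of citations made during P_1 to articles that journal J published in P_2, and N_J(P_2) the number of articles J published in P_2.

```latex
\mathrm{JIF}_J \;=\; \frac{C_J(P_1 \to P_2)}{N_J(P_2)}
```

In the commonly used two-year form, P_1 is a single year and P_2 is the two years preceding it.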
The origins of the Web impact factor

The Web impact factor (WIF) was developed by Ingwersen (1998) to measure the impact of a Web area by the number of links it receives. Rodríguez Gairín (1997) introduced the same concept the previous year, but the paper was published in Spanish and was not as influential as Ingwersen’s. The WIF was based on an analogy between hyperlinks and citations and was the adaptation of the JIF for the Web. Generally speaking, the WIF is the number of links to a site divided by the number of Web pages inside the site in question. Ingwersen (1998) defined three types of WIF: internal, external and overall. For the internal WIF of a Web site, the numerator is the number of inlinks counted from within the site; for the external WIF, the numerator is the number of inlinks counted from outside the site; for the overall WIF, the numerator is the number of inlinks from both within and outside the site. The denominator is the same in all three cases: the number of Web pages within the Web site in question. In Ingwersen’s study, AltaVista was used to count the links and Web pages needed to calculate WIFs. The number of links returned by AltaVista is actually the number of Web pages that contain at least one link to the Web site in question.

Egghe (2000) pointed out that hyperlinks can be bi-directional (Web pages can link to each other regardless of their publication date) while citations are unidirectional. Normally only previously published papers can be cited by later published ones, not vice versa. However, it is also possible for authors to cite each other’s papers at the same time due to “invisible colleges” (Egghe, 2000). The time periods for the WIF and JIF are also different. The JIF measures citations made in journals published during one time period to articles published in another time period while the WIF is a snapshot of the Web at a certain time. Compared with the content of a journal paper, the content of a Web page lacks peer review and thus lacks quality control. The WIF is therefore not a direct translation of the JIF.

Early WIF calculations were found to be a crude instrument for Webometric studies (Smith, 1999a; Thelwall, 2000a; Thomas and Willet, 2000; Björneborn and Ingwersen, 2001). First, the majority of internal inlinks may serve navigation purposes rather than endorsing the content of target pages. The bigger the Web site, the larger the number of internal inlinks will tend to be. Links from outside represent more effort to point to target pages, and thus contain more valuable information. However, it is not always easy to separate internal from external links. For example, the School of Computing and Information Technology (scit.wlv.ac.uk) is a subsite of the University of Wolverhampton (wlv.ac.uk). Should the links from the parent site be regarded as internal links or external links? As the links are still within the same university Web site, they should be regarded as internal links. Second, the search engines that were used to collect the link data had inherent deficiencies. The results from AltaVista were especially unstable before it was re-launched in October 1999 (Rousseau, 1999; Björneborn and Ingwersen, 2001). Logically identical Boolean queries gave different results, so information scientists had to design methods to downplay this effect (Ingwersen, 1998; Smith, 1999a). Third, the WIF denominator is the number of Web pages within the Web site studied. This introduces another source of uncertainty as there is no convention for Web page output format. One document can be displayed in one Web page or separated into several screen-sized Web pages. For example, suppose that one online document attracts 100 inlinks from outside. If the document is represented by one huge Web page, the WIF for this document is 100, while if the document is represented by 100 smaller Web pages, the WIF for this document will be 1. This shows how much results can be affected by the way in which documents are presented on the Web.
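The three WIF variants attributed to Ingwersen (1998) in this section, and the page-splitting problem, can be illustrated with a minimal sketch. The `SiteCounts` container and the function name are hypothetical, introduced here only for illustration; the counts are assumed to have been obtained already (for example from a search engine).

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Hypothetical container for the three raw counts a WIF needs."""
    internal_inlinks: int  # links to the site from pages inside it
    external_inlinks: int  # links to the site from pages outside it
    pages: int             # number of Web pages within the site


def web_impact_factors(counts: SiteCounts) -> dict:
    """Return the internal, external and overall WIF for one site."""
    if counts.pages <= 0:
        raise ValueError("a site needs at least one page")
    total_inlinks = counts.internal_inlinks + counts.external_inlinks
    return {
        "internal": counts.internal_inlinks / counts.pages,
        "external": counts.external_inlinks / counts.pages,
        "overall": total_inlinks / counts.pages,
    }


# The single-document example from the text: 100 external inlinks,
# published either as one large page or as 100 small pages.
one_page = web_impact_factors(SiteCounts(0, 100, 1))
hundred_pages = web_impact_factors(SiteCounts(0, 100, 100))
print(one_page["external"], hundred_pages["external"])  # 100.0 1.0
```

The same inlink count yields very different WIFs depending only on how the document is split into pages, which is the uncertainty the third criticism describes.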
Techniques for data collection

To investigate the Web quantitatively, raw link data must first be collected. For Webometric research, both search engines and an academic Web crawler have been used. Each has advantages and disadvantages: sometimes a search engine is more appropriate, at other times a personal Web crawler is more suitable.

Search engines

AltaVista (www.altavista.com/) and AllTheWeb (www.alltheWeb.com/) have been used to count the number of links or Web pages for Webometrics research (Ingwersen, 1998; Smith, 1999a; Smith and Thelwall, 2002) because they have advanced search facilities (Sullivan, 2001a). Smith and Thelwall (2002) used both AltaVista and AllTheWeb in their study. Below is the syntax used to count the number of links from the UK academic domain to the Australian academic domain:
• AltaVista: host:.ac.uk AND link:xxx.edu.au
• AllTheWeb: URL.host:ac.uk+link.all:xxx.edu.au

Here xxx stands for the third level domain name of an Australian university.

Google (www.google.com) also has an advanced search facility but it does not support the same level of Boolean querying as AltaVista or AllTheWeb. Google can only count all Web pages linking to a given Web page and not all links to a given site. Its advanced search can limit the source to a given domain but it cannot explicitly exclude all links from within the site itself (it cannot deduct the internal inlinks), a second critical gap in its functionality. Although Google is the most used search engine at the moment (Sullivan, 2001b; Sullivan, 2002), it is not recommended for collecting link data for link analysis purposes.

It is free and convenient to use search engines to collect link data. For a large Web area, search engines can be the only choice to collect link data, because it is not pragmatic for a self designed crawler to crawl the whole Web or even the Web for a single nation.
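The two query strings quoted above for AltaVista and AllTheWeb can be generated for a list of target universities with a small helper. This is only a sketch: the function names are hypothetical, the strings simply reproduce the syntax as quoted in this section, and nothing is sent to any search engine.

```python
def altavista_query(source_domain: str, target_host: str) -> str:
    """Build the AltaVista-style query quoted in the text."""
    return f"host:{source_domain} AND link:{target_host}"


def alltheweb_query(source_domain: str, target_host: str) -> str:
    """Build the AllTheWeb-style query quoted in the text."""
    return f"URL.host:{source_domain}+link.all:{target_host}"


# "xxx" is the placeholder used in the text for the third level domain
# name of an Australian university; real names would replace it.
third_level_names = ["xxx"]
for name in third_level_names:
    target = f"{name}.edu.au"
    print(altavista_query(".ac.uk", target))
    print(alltheweb_query("ac.uk", target))
```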
Nevertheless, search engines have inherent problems, for example, partial coverage of the Web and opaqueness of the search algorithm (Bar-Ilan, 1999; Lawrence and Giles, 1999; Thelwall, 2000b; Bar-Ilan, 2001; Björneborn and Ingwersen, 2001). Thus, the results of analyses of the Web by search engines can only be regarded as indications rather than definite conclusions. Because of this, Bar-Ilan (2001) has urged information scientists to create their own crawler in order to get accurate results.

The academic Web crawler

Thelwall (2001a, b) has designed an academic crawler to overcome the problems of search engines. Essentially, the crawler starts from the home page of a university Web site, extracts all of its links and then downloads all of the pages found that are on the same site. This process repeats until all links have been followed. It more rigorously identifies and eliminates duplicate pages within a Web site, as well as mirror sites that are not created by the staff or students within that university. Through the use of the data collected in this way, the researcher has more control over the extent of coverage of sites, and can be in control of the algorithm used to count links from the database. The academic crawler can only crawl the publicly indexable Web pages that can be accessed by following links (Lawrence and Giles, 1999). Web pages which are not linked (directly or indirectly) by the home page of a university will not be covered, even if they are linked to by the Web pages outside the university. If the priority of a study is to achieve accurate link counts, the special academic crawler is a better choice than a search engine. In order to get rid of link anomalies, sometimes we have to use alternative document models (ADMs) (discussed in more detail later) to count the number of links at different levels, for example, directory, domain or whole university site. As counting links between Web pages is how search engines work, the only choice for the ADMs is to use the academic crawler. The advantage of the academic Web crawler is that it is a more scientific approach.
It is possible to cover individual Web sites comprehensively within specified parameters (Thelwall, 2002a). Search engines have to cover a significant proportion of the Web, and the priority of a search engine is to respond to a user’s query efficiently rather than accurately. A user cannot be expected to be patient enough to wait for search results nor to check hundreds and thousands of result pages to find the most useful ones. A search engine
that is slow or does not tend to provide relevant results in the first page is likely to lose customers. However, the drawback of the academic crawler is that it is not suitable for large Web area studies. If a large Web area is the object of a study, only search engines can be used to supply the data. If the object of a study is to focus on a small Web area and accuracy of the data is of high priority, or if ADMs have to be used to count the number of links, then the academic Web crawler is the only choice for data collection.
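The crawling procedure described in this section (start at the home page, extract links, follow those on the same site, repeat until none are left) can be sketched as a breadth-first traversal. This is an illustrative sketch, not Thelwall's crawler: the names `LinkExtractor`, `crawl_site` and `fetch` are hypothetical, "same site" is simplified to "same host name", and duplicates are detected by URL only, without the duplicate-page and mirror-site checks the text mentions. The `fetch` function is passed in so the example can run against an in-memory site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_site(home_url, fetch):
    """Breadth-first crawl of pages reachable from home_url on the same host.

    fetch(url) must return the page's HTML as a string, or None on failure.
    Returns (pages, links): the set of crawled URLs and a list of
    (source_url, target_url) pairs, one per link found.
    """
    site_host = urlsplit(home_url).netloc.lower()
    seen = {home_url}
    queue = deque([home_url])
    pages, links = set(), []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any #fragment.
            target, _fragment = urldefrag(urljoin(url, href))
            links.append((url, target))
            on_site = urlsplit(target).netloc.lower() == site_host
            if on_site and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages, links


# A tiny invented site; the host names are placeholders, not real universities.
SITE = {
    "http://www.univ-a.example/":
        '<a href="/research.html">R</a> <a href="http://www.univ-b.example/">B</a>',
    "http://www.univ-a.example/research.html":
        '<a href="/">home</a> <a href="#top">top</a>',
    "http://www.univ-a.example/orphan.html":
        "<p>not linked from the home page</p>",
}
pages, links = crawl_site("http://www.univ-a.example/", SITE.get)
print(sorted(pages))
```

The orphan page is never reached, which mirrors the limitation noted above: pages not linked, directly or indirectly, from the home page are not covered.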
Enhancements to the WIF

Early developments

Although early WIF calculations did not give significant results, they served to start the era of Webometric studies. The WIF has been enhanced since it was created. Because of the internal inlinks problem mentioned above, external inlinks are now almost always used as the WIF numerator. Smith (1999b), for example, studied Australasian universities and online journals, using external inlinks as the WIF numerator. Thelwall (2001c) used the academic Web crawler discussed above to study links to six UK universities. The denominators were still the number of Web pages. Smith and Thelwall (2001) studied the links between UK, Australian and New Zealand universities. In this study both AltaVista and the academic Web crawler were used to collect the link data. The number of academic staff members was introduced in this study to represent the size of universities, replacing Web pages as the WIF denominator. The WIF results were not validated by comparison with traditional research ratings. However, both the enhancement of the WIFs and the creation of the academic Web crawler were useful steps towards deriving significant results.

WIF and traditional research measures

Thelwall (2001d) studied external WIFs for 25 UK universities. Both AltaVista and the academic Web crawler were used to collect the link data, and both the number of staff members and the number of Web pages were used as WIF denominators. The target pages were classified as research-related or not. Four different WIF versions were calculated. These were the general WIF, the research WIF, the AltaVista general WIF and the original general AltaVista WIF. They were all found to correlate significantly with average research assessment exercise (RAE) ratings (Higher Education Funding Council for England (HEFCE), 1998). The Spearman rather than the Pearson correlation coefficient was used to test the significance of the relationship between the link information and the RAE ratings, because the frequency distributions of inlink counts are highly skewed. The four versions were defined as follows:
• The general WIF is the number of external inlinks to a university (derived by the academic Web crawler) divided by the number of academic staff members in the university.
• With the research WIF, the numerator is the number of external inlinks to research related targets. The denominator is the same as for the general WIF.
• With the AltaVista general WIF, the numerator is the number of external inlinks to a university, as counted by AltaVista. The denominator is the same as for the general WIF.
• With the original general AltaVista WIF, the numerator is the same as for the AltaVista general WIF while the denominator is the number of Web pages within the university in question, as counted by AltaVista.
This was the first time that WIFs were found to correlate significantly with an external research measure. Not surprisingly, the correlation coefficient of the research WIF was found to be the best while the original general AltaVista WIF was found to be the worst. However, classifying target pages is very time consuming. To avoid manual classification of the target pages, and on the assumption that international links to a Web page in a university are more likely to point to research related content, Thelwall (2002b) studied links to 96 UK universities from 10 different domains and from all external sources. The results showed that edu, ac.uk, uk, org and external WIFs all correlated similarly with the average RAE ratings. Although more suitable domains to count links from were not found, the results suggested that WIFs could be extended to different countries. Smith and Thelwall (2002) used AltaVista, AllTheWeb and the academic crawler to collect link data, used the number of academic staff as denominators, and found that uk-au, nz-au and au-au WIFs all correlated significantly with the research quantum (RQ). The RQ is the official Australian government measurement for research infrastructure grants, which is calculated by a simple publicly available formula (DETYA, 2001; Smith and Thelwall, 2002). Tang and Thelwall (2002) also found significant correlations between absolute external WIFs and a public university ranking system in mainland China.
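The staff-normalised WIF and the rank-based correlation used in the studies reviewed in this section can be sketched as follows. All figures are invented for illustration and are not data from any of the cited studies; the function names are hypothetical, and the sketch computes the Spearman coefficient only, not its significance level.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented figures for five imaginary universities (NOT real data).
external_inlinks = [120, 450, 900, 2600, 40000]  # deliberately skewed
academic_staff = [400, 600, 800, 1000, 2000]
research_rating = [2.1, 3.0, 3.4, 4.2, 4.9]      # stand-in for an average rating

# General WIF: external inlinks divided by academic staff numbers.
general_wif = [links / staff for links, staff in zip(external_inlinks, academic_staff)]

rho = spearman(general_wif, research_rating)
r = pearson(general_wif, research_rating)
print(rho, r)
```

With one extreme inlink count, the rank-based coefficient reflects the perfectly monotonic relationship in this toy data while the Pearson coefficient is pulled down by the outlier, which is the motivation given above for preferring Spearman on skewed link counts.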
Alternative document models

In order to get useful information from the anarchic Web, it is necessary to remove noise (unwanted variations) as much as possible. For example, Thelwall (2002c) found that the minimum number of links between pairs of UK universities is a more reliable data source than the actual one-way link counts. ADMs, which were also introduced by Thelwall (2002d), are a more significant step towards link count data cleansing. Later studies exploited this technique to identify new link patterns. We review this technique in detail in this section.

Citations, the counterpart of hyperlinks in bibliometrics, are counted between papers in journals. The basic unit for counting citations is a scholarly paper. For example, if one journal paper is cited by two other journal papers, the citation count of that paper is two, even if the same part of the paper is cited by two different papers in one journal; but if one journal paper is cited many times in different parts by another paper then the citation count is only one. For link counting on the Web, a Web page is the default unit. The number of links to a Web page is taken to be the number of Web pages that contain at least one link to that Web page, if a search engine is used (e.g. AltaVista, AllTheWeb). This is different from citation counting, where different pages within one paper are regarded as one unit. A Web page is not necessarily the best choice of unit to count links on the Web, as one Web document can sometimes be broken down into several small Web pages and links pointing to these pages might arise from a single motivation.

We know that links within a Web site can be created for navigation rather than for endorsing the content of a Web page (Smith, 1999a; Thelwall, 2000a). Links from external joint project Web sites to all host universities, however, can also be for navigational purposes. Occasionally an individual faculty might create thousands of related links, for example, between static HTML databases. All these can be anomalies in studies of link impact and might render link counts meaningless. ADMs are a good method for dealing with repeated inlinks from outside (Thelwall, 2002d). The intention of the ADMs is to aggregate and remove repeated links at a higher document level. Here are descriptions of the ADMs, where A is a source university and B is a target university (see also Figure 1 and Table I):
• At the page level, the original link data is transformed into page link data by truncating the URLs of links from A to B before the first hash (#) to avoid repeated links to different parts of the same Web page. Duplicate links from the same page are then removed. The link count from A to B is the number of links in A that target B.
• At the directory model level, the original link data is transformed into directory link data by truncating the URLs of all pages and links in A at the last slash, then merging duplicated directories into one directory and removing duplicate links within each directory. Links are then counted as above.
• At the domain model level, the link data is transformed into domain link data by truncating the page and link URLs after the first slash following the domain name, then merging the same domains and removing duplicate links within each domain. Links are then counted as above.
• For the university model, the whole university is regarded as the unit for counting links: one university can have one link to the other if any page in A targets any page in B, otherwise none.

Figure 1 A simple example shows Web pages in university A linking to Web pages in university B

Table I The numbers of links at different ADM levels from university A to university B

Model name        Source (unit)       Target (unit)       Links from A to B
Page              Web page            Web page            12
Directory         Directory           Directory           6
Domain            Domain              Domain              4
University        Whole university    Whole university    1
Page range        Whole university    Web page            4
Directory range   Whole university    Directory           3
Domain range      Whole university    Domain              2
The above mentioned four ADMs were created by Thelwall (2002d). “The impact of both widely linked to individual resource-related pages and widely linked to entire sites” (Thelwall, 2002d) was reduced by the domain model. The university model might be too greatly aggregated to give any useful information while the page model involved too many repeated links. Three further ADMs have been used:
(1) With the page-range model, the whole source university A is regarded as one unit and duplicate links to the same target page are eliminated. The count of links is the same as above.
(2) With the directory-range model, the whole source university A is regarded as one unit and duplicate links to the same target directory are eliminated. The count of links is the same as above.
(3) With the domain-range model, the whole source university A is regarded as one unit and duplicate links to the same target domain are eliminated. The count of links is the same as above.

The three range models mentioned above were introduced in Thelwall and Wilkinson (2003a). The rationale behind these range models is that even counting links from the source site at the domain level may still produce anomalies. For example, the same person might be authorised to use multiple domains within a university and create repeated links to a target page, or the information from the target site might be shared by people in different domains within the source site. The directory range was found to be the best link counting method for UK universities (Thelwall and Wilkinson, 2003a).
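The seven document models can be read as one rule: map each link's source and target URL to a unit (page, directory, domain or whole university), then count distinct source-target pairs. The sketch below follows that reading. It is an interpretation of the descriptions in this section, not the published implementation: the function names and example URLs are hypothetical, query strings are ignored, and the input is assumed to contain only links from university A to university B.

```python
from urllib.parse import urlsplit


def page_unit(url):
    """Host plus path; any #fragment (and, in this sketch, any query) is dropped."""
    parts = urlsplit(url)
    return parts.netloc.lower() + (parts.path or "/")


def directory_unit(url):
    """Host plus path up to and including the last slash."""
    page = page_unit(url)
    return page[: page.rfind("/") + 1]


def domain_unit(url):
    """Host name only."""
    return urlsplit(url).netloc.lower()


def university_unit(url):
    """Whole site as one unit (valid because every link here goes from A to B)."""
    return "whole university"


# model name -> (source unit function, target unit function)
MODELS = {
    "page": (page_unit, page_unit),
    "directory": (directory_unit, directory_unit),
    "domain": (domain_unit, domain_unit),
    "university": (university_unit, university_unit),
    "page range": (university_unit, page_unit),
    "directory range": (university_unit, directory_unit),
    "domain range": (university_unit, domain_unit),
}


def count_links(links, model):
    """Count distinct (source unit, target unit) pairs under the given model."""
    source_fn, target_fn = MODELS[model]
    return len({(source_fn(s), target_fn(t)) for s, t in links})


# Invented links from an imaginary university A to an imaginary university B.
A, B = "http://www.univ-a.example", "http://www.univ-b.example"
links = [
    (A + "/cs/staff/x.html", B + "/lib/guide.html#top"),
    (A + "/cs/staff/x.html", B + "/lib/guide.html#refs"),
    (A + "/cs/staff/y.html", B + "/lib/guide.html"),
    (A + "/cs/index.html", B + "/lib/index.html"),
    ("http://maths.univ-a.example/people.html", B + "/"),
]
counts = {model: count_links(links, model) for model in MODELS}
print(counts)
```

As in Table I, the count falls as the unit is aggregated, and each range model gives a count no larger than its non-range counterpart because all sources collapse into one unit.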
Alternative approaches

WIFs have so far been shown to be able to measure some aspects of the Web over a large area. However, the focus of research in Webometrics is changing from measuring the Web to illustrating behaviours of Web creators. In this section we will review patterns that have been found through simple link counts and another metric, link propensity.

Simple link count studies

Although the Web is unorganised in nature, as a whole it does display regularities. Power laws were found both for Web pages and hyperlinks (Rousseau, 1997; Albert et al., 1999; Pennock et al., 2002; Thelwall and Wilkinson, 2003b). Thelwall (2002c) used the minimum number of links between pairs of universities and factored out the impact of size and research from both source and target sites. A geographical pattern was found for UK universities’ interlinking: universities nearer to each other tend to link more than those further apart. However, no geographic pattern was found between departments in the US (Tang and Thelwall, 2003). Thelwall (2002e) used links to and from UK universities together with multivariate statistical techniques to identify several geographic clusters on the Web. Disciplinary variations have also been found on the Web. Tang and Thelwall (2003) found that in the US hard science interlinked more than social science. However, Thelwall et al. (2003a) found that hard science was less dominant over social science in Taiwan and Australia. Subjects have different online impacts in Taiwan and Australia. Vaughan and Thelwall (2003) found that library and information science journal Web sites attract more links than those of law.

Link propensity

The link propensity metric was introduced by Smith and Thelwall (2002). It is the number of external inlinks divided by the product of the total number of staff of the source university and that of the target university (sometimes the number of Web pages is used instead if staff numbers are not available).
The link propensity effectively factors out size from both source and target sites. It is suitable for illustrating the intention of two sites to link to one another. The link propensities from the UK and Australia to New Zealand were the lowest of the three countries, so New Zealand was found to be relatively isolated on the Web among the three countries.

Thelwall (2001e) used four different link-count-based weightings to study the link patterns among general top level domains (gTLDs) (Groups, 2001) and among some computer companies and universities:
(1) absolute number of external links;
(2) external links divided by the number of Web pages in target site;
(3) external links divided by the number of Web pages in source site; and
(4) external links divided by the product of number of Web pages in both source and target sites.
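The four weightings listed above differ only in their denominator, which a short sketch makes explicit. The function and key names are hypothetical and the figures are invented; the size arguments can be page counts, as in the list above, or staff numbers, as in the link propensity definition.

```python
def link_weightings(external_links, source_size, target_size):
    """Return the four weightings for links from one source site to one target site.

    source_size and target_size are whatever size measure is available
    (Web pages or academic staff); both must be positive.
    """
    if source_size <= 0 or target_size <= 0:
        raise ValueError("site sizes must be positive")
    return {
        "absolute": external_links,                                   # (1)
        "per_target_size": external_links / target_size,              # (2)
        "per_source_size": external_links / source_size,              # (3)
        "propensity": external_links / (source_size * target_size),   # (4)
    }


# Invented example: 600 links from a 2,000-page source to a 300-page target.
weights = link_weightings(600, source_size=2000, target_size=300)
print(weights)
```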
The first is the raw link count. The second removes the impact of the target site while the third removes the impact of the source site. The fourth is the link propensity. The .com domain was found to be a major source of external links among gTLDs. Business relationships were also identified among universities and companies. Note that the first weighting is the same as the absolute WIF while the second is the same as the relative WIF. However, the intention here is not to measure the Web but to illustrate link patterns. Although link propensity is the most balanced weighting among the four, it alone cannot illustrate a full picture of linking patterns. Thelwall and Smith (2002) used the four weightings to study the link patterns among universities in the Asia-Pacific area. Australia and Japan were found to be the centre of Web use in that area. This resembled the results from Glänzel and Schubert (2001) and Glänzel (2001) using citation analysis. Thelwall et al. (2003b) used the link propensity technique to study the linguistic Web link patterns for Western European universities. English was found to be the dominant language on the Web in Western Europe. Countries sharing the same language tend to link more than those that do not.

Critical analysis

The WIF metric has been modified extensively since it was first devised. Counting the number of external inlinks rather than internal inlinks removed the problem associated with internal inlinks. Using the number of academic staff members instead of the number of Web pages for the WIF denominator removed the uncertainty about the number of Web pages. The WIF became “a hybrid calculation combining Web information with another source” (Thelwall, 2001d). The design of the academic Web crawler made it possible to crawl a subset of the Web extensively and accurately. The unreliability of search engine results has also been alleviated. The ADMs effectively removed repeated links at different document levels.

The significant correlation coefficients found between WIFs and research ratings support their use. The UK is a good choice for this sort of study because of the number and range of types of universities it has and also the existence of an authoritative research assessment scheme in this country. Nevertheless, significant correlation coefficients were also found in Australia (Smith and Thelwall, 2002), Taiwan (Thelwall and Tang, 2003) and mainland China (Tang and Thelwall, 2002). Significant correlation coefficients were found mainly at the university level but also at the departmental level (Li et al., 2003; Tang and Thelwall, 2003). WIF studies have not been applied in European countries other than the UK as they lack authoritative research measures (Thelwall et al., 2002). However, the linguistic patterns of Web use in Western European countries have been identified (Thelwall et al., 2003b).

All of the above studies showed that WIFs may measure an aspect of online scholarly communication for target Web sites. Can we simply conclude that the WIFs measure the research profile of a university? This is certainly not true. Web pages do not contain enough research content. WIFs may measure the reputation of a university rather than its research output. Web pages created by famous universities tend to attract more links regardless of the language used on the page (Smith, 1999c; Thelwall et al., 2003b). Even citations can only measure the impact of research rather than its quality (Moed, 2002). We must be cautious about the conclusions we make because hyperlinks can be created for many reasons. Citations can also be created for various reasons (Garfield, 1979; Brooks, 1986; Case and Higgins, 2000), and so caution also must be exercised when drawing conclusions from citation measures (Moed, 2002). Kim (2000) found that motivations for hyperlink creation between online journal papers can be
different from those of traditional citations because of their use to easily access multimedia resources. More effort has been devoted to finding out motivations for link creation on the Web. Thelwall (2001c) attributed the lack of significant results between WIFs and RAE ratings in his early study to the fact that most of the target pages were not research related. Thelwall (2001d) classified the target pages and found that the research WIFs correlated better with RAE ratings than the general WIFs. Thelwall (2002f) studied the UK top 100 linked to pages and found that the most linked to pages were university homepages, which can facilitate access to a large range of information rather than supplying specific content. He concluded that “simple link counts are highly unreliable indicators of the average behaviour of scholars” (Thelwall, 2002f). Wilkinson et al. (2003) studied 414 randomly collected links from the ac.uk domain, and 90 percent of the links were academic related. However, less than 1 percent of the targets were copies of refereed publications. Thelwall (2003) studied 100 randomly collected links to UK universities’ homepages. By investigating the source pages, four types of link creation motivations were found. He concluded that “very few hyperlinks between academic sites are created as a result of a necessity on a par with that for citations” (Thelwall, 2003). The significant correlation between WIFs and research ratings showed that “it seems very likely that Web activities unrelated to research are influenced by it directly or indirectly, for example through the availability of computing resources or the development of the technological know-how” (Thelwall, 2002d). We can see that the efforts to recognise link creation motivations have so far been focused either on the content of target pages or that of both source and target pages.
Focusing on the content of the target pages rests on the assumption that all links pointing to the same Web page arise from the same motivation. However, this is not always the case. A link to a page might be made for a negative purpose, for example to illustrate bad Web design. Studying both source and target content may put the link creation motivation more fully in context.

In addition to the problems of link motivation, another fundamental problem is that all links are currently treated as equal in Webometric studies. A link made by a first year student may not be as important as one created by a professor. The idea of giving links different weights is not new. In bibliometrics, it is considered more sensible to give a higher weight to a citation from a prestigious journal than to one from a trivial journal (Pinski and Narin, 1976). In computer science, Google’s PageRank gives each Web page a weight that it passes on to other pages through its links (Brin and Page, 1998). Kleinberg (1999) used the weight of inlinks to determine a Web page’s authority score and the weight of outlinks to determine its hub score. Computer scientists exploit link structure in search engines in order to find the most relevant pages for user searches. The aim of Webometrics, however, is to identify patterns of interlinking among Web creators, which is very similar to the aim of bibliometrics, in which information scientists try to identify collaboration patterns among scientists, departments, universities or even countries. Although PageRank is a natural way to rate pages, it includes links within a site and is therefore not a good technique for Webometric purposes. Meaningful results have already been obtained for macroscopic Web areas in Webometrics. In order to obtain meaningful results for individual Web sites, allocating different weights to links might be a sensible direction for future research.
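To make the idea of weighted links more concrete, the following is a minimal sketch of the hub and authority iteration in the spirit of Kleinberg (1999). It is an illustration only, not the procedure used in any of the studies reviewed here: the function name `hits`, the fixed number of iterations and the toy page labels are all assumptions. For Webometric purposes, links within a site would presumably need to be removed from the input before scoring.

```python
from math import sqrt


def hits(links, iterations=50):
    """Return (hub, authority) score dictionaries for a set of links.

    links: iterable of (source_page, target_page) pairs (hypothetical input).
    """
    edges = set(links)  # ignore duplicate links between the same two pages
    pages = {page for edge in edges for page in edge}
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages linking in.
        authority = {page: 0.0 for page in pages}
        for source, target in edges:
            authority[target] += hub[source]
        norm = sqrt(sum(v * v for v in authority.values())) or 1.0
        authority = {page: v / norm for page, v in authority.items()}

        # Hub score: sum of the authority scores of the pages linked to.
        hub = {page: 0.0 for page in pages}
        for source, target in edges:
            hub[source] += authority[target]
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {page: v / norm for page, v in hub.items()}

    return hub, authority


# Toy example: pages "a" and "b" both link to "c"; "a" also links to "d".
hub, authority = hits([("a", "c"), ("b", "c"), ("a", "d")])
```

In this toy graph the page with the most inlinks from good hubs ("c") receives the highest authority score and the page linking to the most authorities ("a") receives the highest hub score, so a link from "a" effectively carries more weight than a link from "b".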
Conclusions
Despite the unplanned nature of the Web, hyperlinks do reveal significant trends over large areas of the Web. After enhancements, WIFs have been found to correlate significantly with phenomena external to the Web. There is therefore some promise in continuing to apply other bibliometric techniques to the Web in order to mine further meaningful information from it. ADMs have been found to be a useful data cleansing method on the Web. However, the directory, the domain and the whole university are not necessarily the best document levels. It might be more sensible to identify the Web pages created by an author, a group or a department in order to count links using a more appropriate document model. It is to be hoped that individual authors, research groups or departments can then be identified and measured on the Web, perhaps in a similar way to bibliometrics.
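The effect of the document level on link counts can be illustrated with a short sketch. It assumes that each link is a pair of full URLs, that links within one institution are discarded, and that an institution can be recognised from the last three labels of the host name (for example wlv.ac.uk). The function names and example URLs are hypothetical, and the rules are a simplification rather than the exact ADM definitions used in the studies reviewed.

```python
from urllib.parse import urlsplit


def document_id(url, model):
    """Map a URL to the document that contains it under the given model."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if model == "page":
        return host + parts.path
    if model == "directory":
        # Everything up to and including the last "/" of the path.
        return host + parts.path.rsplit("/", 1)[0] + "/"
    if model == "domain":
        return host
    if model == "university":
        # Assumption: the last three labels identify the institution.
        return ".".join(host.split(".")[-3:])
    raise ValueError("unknown document model: " + model)


def count_inlinks(links, model):
    """Count inlinks per target document, once per source document."""
    pairs = set()
    for source, target in links:
        # Discard links that stay within one institution.
        if document_id(source, "university") == document_id(target, "university"):
            continue
        pairs.add((document_id(source, model), document_id(target, model)))
    counts = {}
    for _, target_doc in pairs:
        counts[target_doc] = counts.get(target_doc, 0) + 1
    return counts


links = [
    ("http://www.a.ac.uk/x/1.html", "http://www.b.ac.uk/r/p.html"),
    ("http://www.a.ac.uk/x/2.html", "http://www.b.ac.uk/r/p.html"),
    ("http://cs.a.ac.uk/y.html", "http://www.b.ac.uk/r/q.html"),
    ("http://www.b.ac.uk/i.html", "http://www.b.ac.uk/r/p.html"),
]
for model in ("page", "directory", "domain", "university"):
    print(model, count_inlinks(links, model))
```

With these four example links, the same data yield three inter-university links at the page level, two at the directory and domain levels, and one at the university level, which shows how strongly the choice of document model can change the counts on which a WIF is based.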
Google’s successful use of link structures in its results ranking algorithm suggests that Web metrics need to be refined further so that they reveal useful information not only about large Web areas but also about individual Web sites. Methods must therefore be developed to give individual links different weights. Since its inception in 1996, Webometrics has become a promising and exciting new area. However, despite advances in techniques for the collection, processing and interpretation of link data, there are still problems with data reliability and with the interpretation of results that pose a significant challenge for future researchers.
References
Albert, R., Jeong, H. and Barabasi, A.L. (1999), “Diameter of the World Wide Web”, Nature, No. 401, pp. 130-1.
Almind, T.C. and Ingwersen, P. (1997), “Informetric analyses on the World Wide Web: methodological approaches to ‘Webometrics’”, Journal of Documentation, Vol. 53 No. 4, pp. 404-26.
Bar-Ilan, J. (1999), “Search engine results over time: a case study on search engine stability”, Cybermetrics, Vol. 2/3 No. 1, available at: www.cindoc.csic.es/cybermetrics/articles/v2i1p1.html
Bar-Ilan, J. (2001), “Data collection methods on the Web for informetric purposes: a review and analysis”, Scientometrics, Vol. 50 No. 1, pp. 7-32.
Björneborn, L. (2001), “Shared outlinks in small-world co-linkage analysis: a Webometric pilot study of bibliographic couplings on researchers’ bookmark lists on the Web”, Royal School of Library and Information Science, Copenhagen.
Björneborn, L. and Ingwersen, P. (2001), “Perspectives of Webometrics”, Scientometrics, Vol. 50 No. 1, pp. 65-82.
Brin, S. and Page, L. (1998), “The anatomy of a large scale hypertextual Web search engine”, Computer Networks and ISDN Systems, Vol. 30 No. 1-7, pp. 107-17.
Brooks, T.A. (1986), “Evidence of complex citer motivations”, Journal of the American Society for Information Science, Vol. 37 No. 1, pp. 34-6.
Case, D.O. and Higgins, G.M. (2000), “How can we investigate citation behaviour? A study of reasons for citing literature in communication”, Journal of the American Society for Information Science, Vol. 51 No. 7, pp. 635-45.
Cawkell, T. (2000), “Visualising citation connections”, in Cronin, B. (Ed.), The Web of Knowledge: A Festschrift in Honour of Eugene Garfield, Information Today, Medford, NJ, pp. 177-94.
Davenport, E. and Cronin, B. (2000), “The citation network as a prototype for representing trust in virtual environments”, in Cronin, B. (Ed.), The Web of Knowledge: A Festschrift in Honour of Eugene Garfield, Information Today, Medford, NJ, pp. 517-34.
DETYA (2001), Higher Education Report for the 2000 to 2002 Triennium, available at: www.detya.gov.au/highered/he_report/2000_2002/html/3_8.htm
Egghe, L. (2000), “New informetric aspects of the Internet: some reflections – many problems”, Journal of Information Science, Vol. 26 No. 5, pp. 329-35.
Garfield, E. (1979), Citation Indexing: its Theory and Applications in Science, Technology and the Humanities, Wiley Interscience, New York, NY.
Glänzel, W. (2001), “National characteristics in international scientific co-authorship relations”, Scientometrics, Vol. 51 No. 1, pp. 69-115.
Glänzel, W. and Schubert, A. (2001), “Double effort = double impact? A critical view at international co-authorship in chemistry”, Scientometrics, Vol. 50 No. 2, pp. 199-214.
Groups, I.G.C. (2001), “gTLD registries”, available at: www.dnso.org/constituency/gtld/gtld.html (accessed 13 February 2002).
Higher Education Funding Council for England (HEFCE) (1998), An Introduction to the Work of the Higher Education Funding Council for England, available at: www.hefce.ac.uk/Pubs/HEFCE/1998/98_16.htm
Ingwersen, P. (1998), “The calculation of Web Impact Factors”, Journal of Documentation, Vol. 54 No. 2, pp. 236-43.
ISI (2003), ISI Web of Science, available at: www.isinet.com/isi/products/citation/wos/
Kim, H.J. (2000), “Motivations for hyperlinking in scholarly electronic articles: a qualitative study”, Journal of the American Society for Information Science, Vol. 51 No. 10, pp. 887-99.
Kleinberg, J.M. (1999), “Authoritative sources in a hyperlinked environment”, Journal of the ACM, Vol. 46 No. 5, pp. 604-32.
Lawrence, S. and Giles, C.L. (1999), “Accessibility of information on the Web”, Nature, No. 400, pp. 107-9, available at: http://wwwmetrics.com/
Li, X., Thelwall, M., Musgrove, P. and Wilkinson, D. (2003), “The relationship between the WIFs or inlinks of computer science departments in the UK and their RAE ratings or research productivities in 2001”, Scientometrics, Vol. 57 No. 2, pp. 239-55.
Moed, H.F. (2002), “The impact-factors debate: the ISI’s uses and limits”, Nature, No. 415, pp. 731-2.
Pennock, D., Flake, G.W., Lawrence, S., Glover, E.J. and Giles, C.L. (2002), “Winners don’t take all: characterizing the competition for links on the Web”, Proceedings of the National Academy of Sciences, Vol. 99 No. 8, pp. 5207-11.
Pinski, G. and Narin, F. (1976), “Citation influence for journal aggregates of scientific publications: theory with application to the literature of physics”, Information Processing and Management, Vol. 12 No. 5, pp. 297-312.
Rodríguez Gairín, J.M. (1997), “Valorando el impacto de la información en Internet: Altavista, el ‘Citation Index’ de la Red” (“Impact assessment of information on the Internet: AltaVista, the citation index of the Web”), Revista Española de Documentación Científica, Vol. 20 No. 2, pp. 175-81, available at: www.kronosdoc.com/publicacions/altavis.htm
Rousseau, R. (1997), “Sitations: an exploratory study”, Cybermetrics, Vol. 1 No. 1, available at: www.cindoc.csic.es/cybermetrics/articles/v1i1p1.html
Rousseau, R. (1999), “Daily time series of common single word searches in AltaVista and NorthernLight”, Cybermetrics, Vol. 2/3 No. 1, available at: www.cindoc.csic.es/cybermetrics/articles/v2i1p2.html
Small, H. (1999), “Visualising science through citation mapping”, Journal of the American Society for Information Science, Vol. 50 No. 9, pp. 799-812.
Smith, A.G. (1999a), “A tale of two Web spaces: comparing sites using Web impact factors”, Journal of Documentation, Vol. 55 No. 5, pp. 577-92.
Smith, A.G. (1999b), “ANZAC Webometrics: exploring Australasian Web structures”, Proceedings of Information Online and On Disc 99, Sydney, Australia, 19-21 January 1999, pp. 159-81, available at: www.csu.edu.au/special/online99/proceedings99/203b.htm
Smith, A.G. (1999c), “The impact of Web sites: a comparison between Australasia and Latin America”, Proceedings of INFO’99, Congreso Internacional de Informacion, Havana, 4-8 October 1999, available at: www.vuw.ac.nz/~agsmith/publns/austlat/
Smith, A.G. and Thelwall, M. (2001), “Web impact factors and university research links”, Proceedings of the 8th International Conference on Scientometrics and Informetrics, Sydney, Australia, 16-20 July 2001, Vol. 2, pp. 657-64.
Smith, A.G. and Thelwall, M. (2002), “Web Impact Factors for Australasian universities”, Scientometrics, Vol. 54 No. 1/2, pp. 363-80.
Sullivan, D. (2001a), “Search engine features”, SearchEngineWatch, available at: http://searchenginewatch.com/facts/assistance.html
Sullivan, D. (2001b), “Search engine sizes”, SearchEngineWatch, available at: http://searchenginewatch.com/reports/sizes.html
Sullivan, D. (2002), “Google tops in ‘Search Hours’ ratings”, SearchEngineWatch, available at: http://searchenginewatch.com/sereport/02/05-ratings.html
Tang, R. and Thelwall, M. (2002), “Exploring the pattern of links between Chinese university Web sites”, Proceedings of the 65th Annual Meeting of the American Society for Information Science and Technology, Vol. 39, pp. 417-24.
Tang, R. and Thelwall, M. (2003), “Disciplinary differences in US academic departmental Web site interlinking”, Library and Information Science Research, forthcoming.
Thelwall, M. (2000a), “Web impact factors and search engine coverage”, Journal of Documentation, Vol. 56 No. 2, pp. 185-9.
Thelwall, M. (2000b), “Implications of search engine coverage on the viability of commercial Websites”, Proceedings of ICEIS 2000, poster session.
Thelwall, M. (2001a), “A Web crawler design for data mining”, Journal of Information Science, Vol. 27 No. 5, pp. 319-25.
Thelwall, M. (2001b), “A publicly accessible database of UK university Website links and a discussion of the need for human intervention in Web crawling”, University of Wolverhampton, Wolverhampton.
Thelwall, M. (2001c), “Results from a Web Impact Factor crawler”, Journal of Documentation, Vol. 57 No. 2, pp. 177-91.
Thelwall, M. (2001d), “Extracting macroscopic information from Web links”, Journal of the American Society for Information Science and Technology, Vol. 52 No. 13, pp. 1157-68.
Thelwall, M. (2001e), “Exploring the link structure of the Web with network diagrams”, Journal of Information Science, Vol. 27 No. 6, pp. 393-402.
Thelwall, M. (2002a), “Methodologies for crawler-based Web surveys”, Internet Research: Electronic Networking and Applications, Vol. 12 No. 2, pp. 124-38.
Thelwall, M. (2002b), “A comparison of sources of links for academic Web Impact Factor calculations”, Journal of Documentation, Vol. 58 No. 1, pp. 60-72.
Thelwall, M. (2002c), “Evidence for the existence of geographic trends in university Web site interlinking”, Journal of Documentation, Vol. 58 No. 5, pp. 563-74.
Thelwall, M. (2002d), “Conceptualising documentation on the Web: an evaluation of different heuristic-based models for counting links between university Web sites”, Journal of the American Society for Information Science and Technology, Vol. 53 No. 12, pp. 995-1005.
Thelwall, M. (2002e), “An initial exploration of the link relationship between UK university Web sites”, ASLIB Proceedings, Vol. 54 No. 2, pp. 118-26.
Thelwall, M. (2002f), “The top 100 linked pages on UK university Web sites: high inlink counts are not usually directly associated with quality scholarly content”, Journal of Information Science, Vol. 28 No. 6, pp. 485-93.
Thelwall, M. (2003), “What is this link doing here? Beginning a fine-grained process of identifying reasons for academic hyperlink creation”, Information Research, Vol. 8 No. 3, available at: http://informationr.net/ir/8-3/paper151.html
Thelwall, M. and Smith, A.G. (2002), “A study of interlinking between Asia-Pacific University Web sites”, Scientometrics, Vol. 55 No. 3, pp. 335-48.
Thelwall, M. and Tang, R. (2003), “Disciplinary and linguistic considerations for academic Web linking: an exploratory hyperlink mediated study with mainland China and Taiwan”, Scientometrics, Vol. 58 No. 1, pp. 153-79.
Thelwall, M. and Wilkinson, D. (2003a), “Three target document range metrics for university Web sites”, Journal of the American Society for Information Science and Technology, Vol. 54 No. 6, pp. 489-96.
Thelwall, M. and Wilkinson, D. (2003b), “Graph structure in three national academic Webs: power laws with anomalies”, Journal of the American Society for Information Science and Technology, Vol. 54 No. 8, pp. 706-12.
Thelwall, M., Binns, R., Harries, G., Page-Kennedy, T., Price, E. and Wilkinson, D. (2002), “European Union associated university Websites”, Scientometrics, Vol. 53 No. 1, pp. 95-111.
Thelwall, M., Vaughan, L., Cothey, V., Li, X. and Smith, A.G. (2003a), “Which academic subjects have most online impact? A pilot study and a new classification process”, Online Information Review, forthcoming.
Thelwall, M., Tang, R. and Price, E. (2003b), “Linguistic patterns of academic Web use in Western Europe”, Scientometrics, Vol. 56 No. 3, pp. 417-32.
Thomas, O. and Willet, P. (2000), “Webometric analysis of departments of librarianship and information science”, Journal of Information Science, Vol. 26 No. 6, pp. 421-8.
Vaughan, L. and Thelwall, M. (2003), “Scholarly use of the Web: what are the key inducers of links to journal Web sites?”, Journal of the American Society for Information Science and Technology, Vol. 54 No. 1, pp. 29-38.
White, H.D. and Griffith, B.C. (1982), “Author co-citation: a literature measure of intellectual structure”, Journal of the American Society for Information Science, Vol. 32 No. 3, pp. 163-72.
Wilkinson, D., Harries, G., Thelwall, M. and Price, E. (2003), “Motivations for academic Web site interlinking: evidence for the Web as a novel source of information on informal scholarly communication”, Journal of Information Science, Vol. 29 No. 1, pp. 59-66.