Jointly published by Akadémiai Kiadó, Budapest and Springer, Dordrecht

Scientometrics, Vol. 68, No. 3 (2006) 395–414

‘Mini small worlds’ of shortest link paths crossing domain boundaries in an academic Web space

LENNART BJÖRNEBORN
Department of Information Studies, Royal School of Library and Information Science, Copenhagen (Denmark)

Combining webometric and social network analytic approaches, this study developed a methodology to sample and identify Web links, pages, and sites that function as small-world connectors affecting short link distances along link paths between different topical domains in an academic Web space. The data set comprised 7669 subsites harvested from 109 UK universities. A novel corona-shaped Web graph model revealed reachability structures among the investigated subsites. Shortest link path nets functioned as investigable small-world link structures – ‘mini small worlds’ – generated by deliberate juxtaposition of topically dissimilar subsites. Indicative findings suggest that personal Web page authors and computer science subsites may be important small-world connectors across sites and topics in an academic Web space. Such connectors may counteract balkanization of the Web into insularities of disconnected and unreachable subpopulations.

Introduction

The Web is the largest complex network for which connectivity data is presently available, and it has thus become a testing ground and model system for many current research efforts to build models of the growth, connectivity formation and dynamics of complex networks (e.g., ALBERT & BARABÁSI, 2002; NEWMAN, 2003; SCHARNHORST, 2003).

Received December 20, 2005
Address for correspondence: LENNART BJÖRNEBORN, Department of Information Studies, Royal School of Library and Information Science, Birketinget 6, DK-2300, Copenhagen, Denmark
E-mail: [email protected]
0138–9130/US $ 20.00
Copyright © 2006 Akadémiai Kiadó, Budapest
All rights reserved

L. BJÖRNEBORN: ‘Mini small worlds’

Link connectivity structures and topical clusters on the Web evolve in a self-organized way as macro-level aggregations of micro-level interactions, as an ongoing distributive weaving of dynamic hypertext networks conducted by millions of local Web page authors. These dynamic Web cluster aggregations reflect social and cultural trends and formations, including emerging scientific networks (BJÖRNEBORN, 2004). Local actions by Web page authors creating links thus have global consequences for the vast hypertext networks on the Web. One example is the intriguing so-called small-world phenomenon of short distances along paths of links (ALBERT et al., 1999) from Web page to page, from Web site to site, and from Web cluster to cluster.

Small-world theory stems from research in social network analysis (cf. Appendix) on short distances between two arbitrary persons through intermediate chains of acquaintances (MILGRAM, 1967; KOCHEN, 1989). WATTS & STROGATZ (1998) revived small-world theory by introducing a small-world network model characterized by a combination of highly clustered network nodes and short average path lengths between pairs of network nodes. In recent years, an avalanche of research has revealed small-world properties in a large variety of networks, including biochemical, neural, ecological, physical, technical, social, economic, and informational networks. For instance, scientific collaboration networks, citation networks and semantic networks show small-world features (NEWMAN, 2001; STEYVERS & TENENBAUM, 2001). Combining high local clusterization with short global separation, small-world networks have both small local and small global distances, which facilitates efficient dissemination of information, ideas, contacts, signals, energy, viruses, etc., on both a local and a global scale in the networks concerned.
On the Web, small-world link structures bear on core library and information science issues such as navigability and reachability of information across vast document networks. For instance, short link distances along link paths affect the speed and exhaustivity with which Web crawlers can reach and retrieve Web pages when following links from Web page to Web page. Further, small-world link structures may reflect cross-social connections between different interest communities and cross-disciplinary contacts in scientific networks. Moreover, small-world Web topologies may have implications for the ways users explore the Web and the ease with which they gather information (ADAMIC, 1999).

So far, research on small-world phenomena in complex networks including the Web has yielded important results regarding overall structural factors such as graph components, clustering coefficients, characteristic path lengths and scale-free properties including power-law frequency distribution of network connections (cf. reviews in, e.g., ALBERT & BARABÁSI, 2002; NEWMAN, 2003; SCHARNHORST, 2003). However, more detail is needed on which micro-level Web activities contribute to the emergence of small-world phenomena across macro-level Web structures.


Most links within and between Web sites connect Web pages containing similar topics (e.g., DAVISON, 2000), leading to topically clustered aggregations of Web pages and sites. The high degree of clusterization of the Web implies that there must be cross-cluster, that is, cross-topic, connections somewhere in order to create short distances between different topical clusters. It would therefore be interesting to identify these cross-topic connections in order to reveal how small-world properties emerge in a Web space. The present webometric study (BJÖRNEBORN, 2004) focused on small-world properties of academic Web spaces. Against this background, the main research question was: What types of Web links, Web pages, and Web sites function as cross-topic connectors in small-world link structures across an academic Web space?

Data set

The data set in the study was harvested from 109 UK university Web sites in June/July 2001 by a special Web crawler (THELWALL, 2001).* The data set comprised all 7669 identified university subsites, i.e. departments, research groups, etc., with derivative university domain names, for instance, www.geog.plym.ac.uk (Geography Department, University of Plymouth). Data files contained source URLs of harvested Web pages listed with target URLs of all outlinks from each Web page. Because many universities have variant domain names, including old versions, canonical domain name forms were used to ensure data comparability. For example, .doc.edinburgh.ac.uk was converted to .doc.ed.ac.uk.

The data set contained almost 3.4 million Web pages with outlinks. Of the 39.3 million page outlinks, 34.4 million were site selflinks pointing to a Web page at the same university. There was sparse link connectivity within the investigated UK academic Web space.
An average outlinking university Web page in the UK data set had 11.6 outlinks comprising 10.1 site selflinks and 1.5 site outlinks.** Of the site outlinks, only 7.7% were targeted to the other 108 universities and their subsites. The vast majority of site outlinks in the data set thus were targeted to academic, commercial, and other targets outside the data set. Universities employ different policies to locate subunit information. Some universities use an extensive range of subsites, sub-subsites, and further subordinate derivative domain names – like Chinese boxes within boxes – as Web territories for schools, departments, centers, laboratories, research groups, individual researchers, etc.

* The 2001 data set is a subset of a freely accessible database at http://cybermetrics.wlv.ac.uk/database (cf. THELWALL, 2002/2003).
** Link terminology by BJÖRNEBORN (2004) and BJÖRNEBORN & INGWERSEN (2004).

Other universities locate the same type of subunits into different sublevels of directories in the server file hierarchy. Many universities use a combination of derivative domain names and directories. Obviously, such diverse allocation and naming practices complicate comparability in webometric studies including the present one.

In the study, links to and from the 109 university main sites were excluded from the data set because multi-disciplinary contents allocated in sub-directories would impede identification of cross-topic links across sites. Another effort to obtain a more clear-cut picture of cross-disciplinary connectivity was the exclusion of site selflinks in order to avoid subsites belonging to the same university functioning as cross-topic shortcuts due to embedded university navigational links within subsite pages. The exclusion of site selflinks and links to and from university main sites logically created delimited link structures. However, the included data subset still reflected real-world connectivity structures in the investigated academic Web space.

Only links connecting subsites belonging to different universities were thus included in the study. There were 207,865 such links located at 105,817 Web pages. However, a data set this size was not tractable for the employed network analysis tool Pajek.* Instead, the 7669 university subsites were selected as units of analysis in the study. Data runs in Pajek were based on an adjacency matrix of the 7669 subsites interconnected by 48,902 subsite level connections comprising the above 207,865 page level links, cf. Figures 1a and b.


Figure 1. Two subsite level links (a) between subsites A, B, C comprise six page level links (b).
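The aggregation from page level links to subsite level links can be illustrated with a minimal Python sketch. It is not the crawler or the Pajek workflow used in the study. The function names, the one-entry canonicalization map (taken from the .doc.edinburgh.ac.uk example above), and the assumption that every host has the form <subsite>.<university>.ac.uk are illustrative only.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical variant-to-canonical suffix map; the only entry is the
# example mentioned in the text (.edinburgh.ac.uk -> .ed.ac.uk).
CANONICAL_SUFFIXES = {".edinburgh.ac.uk": ".ed.ac.uk"}


def host_of(url):
    """Return the canonical host of a URL without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    for variant, canonical in CANONICAL_SUFFIXES.items():
        if host.endswith(variant):
            host = host[: -len(variant)] + canonical
    return host


def university_of(host):
    """Assumption: the last three labels (e.g. 'ox.ac.uk') name the university."""
    return ".".join(host.split(".")[-3:])


def subsite_links(page_links):
    """Count page level links per (source subsite, target subsite) pair.

    Links to or from university main sites and site selflinks
    (both ends at the same university) are skipped.
    """
    counts = Counter()
    for source_url, target_url in page_links:
        src, tgt = host_of(source_url), host_of(target_url)
        if src == university_of(src) or tgt == university_of(tgt):
            continue  # main site (or unparsable host), not a subsite
        if university_of(src) == university_of(tgt):
            continue  # site selflink
        counts[(src, tgt)] += 1
    return counts
```

Each key of the returned counter corresponds to one subsite level link (Figure 1a), and its value to the number of page level links behind it (Figure 1b).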

Methodology

A five-step methodology was developed in the study in order to sample and identify small-world properties by zooming stepwise into more and more fine-grained Web node levels and link structures among the harvested 7669 UK academic subsites.

* Available at http://vlado.fmf.uni-lj.si/pub/networks/pajek/

The five steps A-E are described in more detail further below. In summary, step A identified overall Web graph components and connectivity patterns among the 7669 subsites. From the strongly connected component (SCC) identified in step A, a random sample of 189 subsites was examined in step B in order to classify overall subsite topics. Within the SCC, all subsites can reach each other through directed link paths. This enabled a stratified sample to be extracted in step C resulting in 10 path nets comprising all shortest link paths between pairs of SCC subsites belonging to dissimilar topics. In step D, page genres and topics of source and target pages along followed link paths in the 10 path nets were classified. The objective of all these steps was to lead up to the final step E, concerned with identifying what types of Web links, pages, and sites function as cross-topic connectors in small-world link structures across an academic Web space, the main research question in the study.

Step A: Corona model

In recent years, researchers from computer science, physics and mathematics, as well as information scientists, have applied approaches from graph theory and social network analysis (cf. Appendix) to explore and exploit connectivity patterns on the Web. The Web may thus be treated as a directed graph consisting of nodes in the shape of Web pages connected by directed edges in the shape of hypertext links. Other levels of Web nodes may also be chosen as units of analysis (cf. BJÖRNEBORN & INGWERSEN, 2004), for example, directories, Web sites, entire top level domains, or subsites as in the present study. The first step in the five-step methodology in the present study was to establish a graph model of connectivity patterns among the 7669 UK university subsites in order to enable the selection of a suitable sample for further investigation.
Figure 2 shows a Corona* Web graph model (BJÖRNEBORN, 2004) of graph components and reachability structures among the 7669 subsites extracted by a special program. Reachability structures are graph connectivity patterns that reflect whether and how Web nodes can reach each other along intermediate paths of directed links. The Corona model is a modification of the Bow-tie model by BRODER et al. (2000). The Corona model depicts actual inter-component reachability structures in the investigated Web subgraph not evident in the Bow-tie model. For example, direct links from the IN component to the OUT component indicated at the top of the Corona model in Figure 2 are not clearly revealed in the Bow-tie model.

* The term ‘corona’ denotes the model’s rudimentary resemblance to a solar corona with protuberances.

Figure 2. Corona model (BJÖRNEBORN, 2004) of Web graph components and reachability structures among 7669 UK university subsites with numbers of subsites in each component and outlining some possible link paths. Number of nodes and sizes of components roughly reflect the actual numbers in the investigated UK academic Web space

In the strongly connected component (SCC) any subsite can reach any other along intermediate link paths. The IN component contains subsites that can reach the SCC along link paths but cannot be reached from it. The OUT component contains subsites that can be reached from the SCC but cannot reach back. Subsites in the Tendrils* and Tube components are connected with the IN and OUT components but cannot reach the SCC or be reached from the SCC. The remaining disconnected components consist of subsites not connected in any way with the main ‘corona’. See BJÖRNEBORN (2004) for more details of the Corona model and its different components.

Similar graph components have been identified in other subgraphs of the Web, including individual countries (BAEZA-YATES et al., 2002), and national university systems (THELWALL & WILKINSON, 2003). The Corona model of the UK academic subweb supports the notion of a fractal self-similar Web (DILL et al., 2001) with subsets of the Web displaying the same graph properties as the Web at large. It would be interesting to investigate similar Corona-like models for cyclic networks other than the Web (i.e. networks containing closed loops of interconnected nodes with both unidirectional and bidirectional relations), e.g., journal-level** citation networks or semantic networks.

* The same overall graph component terms are used in the Corona model as in the Bow-tie model by BRODER et al. (2000). In the present study, though, the tendrils were assigned more specific terms depending on whether they are connected with the IN or OUT component.
** Contrary to this, citation networks on the article level cannot be Corona-modeled. On the article level, citation networks are acyclic as no closed loops like A → B → C → A (letters denote articles, arrows citations) are possible. On the journal level, though, citation networks contain loops and bidirectional paths. The hypothesis of Corona-modeled journal citation networks has yet to be tested.
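The SCC, IN and OUT components described above follow directly from forward and backward reachability. The sketch below is a minimal Python illustration, not the special program used in the study. It assumes the graph is given as a dictionary mapping each node to a set of link targets and that the chosen seed node lies in the SCC of interest. The function names are hypothetical, and Tendrils, Tubes and disconnected components are not separated out here.

```python
from collections import deque


def reachable(adj, start):
    """Return all nodes reachable from start along directed links (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def reverse(adj):
    """Return the graph with every link direction flipped."""
    rev = {}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, set()).add(u)
    return rev


def core_components(adj, seed):
    """Split nodes into SCC, IN and OUT relative to the SCC containing seed."""
    forward = reachable(adj, seed)             # nodes the seed can reach
    backward = reachable(reverse(adj), seed)   # nodes that can reach the seed
    scc = forward & backward
    return {"SCC": scc, "IN": backward - scc, "OUT": forward - scc}
```

A node belongs to the SCC exactly when it can both reach the seed and be reached from it; IN nodes satisfy only the first condition and OUT nodes only the second.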

Step B: SCC subsite topics

The study focused on the strongly connected component because only within the SCC can there be link paths between all pairs of subsites irrespective of topical dissimilarity, thus allowing easier identification of cross-topic connections. A random sample of 189 (10%) subsites was extracted from the 1893 SCC subsites. Overall topics of the sampled subsites were classified by the author by visiting each subsite in the Internet Archive (www.archive.org), using the copy indexed as closely as possible to the original Web crawl in June/July 2001, from which no contents other than source and target URLs had been extracted. The Internet Archive was used because subsite topics might have changed on the current Web. Only one of the 189 subsites was not available in the Archive. Subsite topics were classified and grouped with related topics in a pragmatic way to form topical groups as a basis for stratified sampling of diversified topical pairs of subsites in the subsequent step C.

Step C: path nets

From the topical groups formed in step B, a stratified sample was extracted consisting of five subsites from different categories in humanities and social sciences (hum/soc) and five subsites from natural sciences and technology (nat/tech). The ten seed subsites were randomly paired, one hum/soc subsite with one nat/tech subsite, giving the five pairs listed in Table 1.

Table 1. Stratified sample of five pairs of SCC subsites (prefix www removed from URLs)

Path net | Hum/soc node | Affiliation | Nat/tech node | Affiliation
HN01 | hum.port.ac.uk | Faculty of Humanities and Social Sciences, Portsmouth | atm.ox.ac.uk | Atmospheric, Oceanic and Planetary Physics, Physics Dept, Oxford
HN02 | economics.soton.ac.uk | Economics Dept, Southampton | chem.gla.ac.uk | Chemistry Dept, Glasgow
HN03 | psy.man.ac.uk | Psychology Dept, Manchester | maths.gcal.ac.uk | Mathematics Dept, Glasgow Caledonian
HN04 | speech.essex.ac.uk | Speech Research Group, Language and Linguistics Dept, Essex | palaeo.gly.bris.ac.uk | Palaeontology Research Group, Earth Sciences Dept, Bristol
HN05 | geog.plym.ac.uk | Geography Dept, Plymouth | eye.ox.ac.uk | Ophthalmology Dept, Oxford

Subsequently, the order of start and end nodes was reversed, resulting in a total list of five plus five diversified topical pairs of subsites, cf. Table 2 further below. Pajek extracted all shortest link paths between the ten subsite pairs based on the adjacency matrix of how the 7669 subsites were interconnected. The objective of this step was to construct investigable small-world link structures, ‘mini small worlds’, generated by the deliberate juxtaposition of topically dissimilar seed nodes.

Figure 3 shows one of the resulting 10 subgraphs consisting of all shortest link paths within the data set (see all 10 subgraphs in BJÖRNEBORN, 2004, Appendix 10). In the study, such an ‘all shortest paths’ subgraph was called a path net.* The path net (HN01, cf. Table 1) in the figure consists of all shortest link paths between node 2099 (hum.port.ac.uk), the Faculty of Humanities and Social Sciences in Portsmouth, and node 1904 (atm.ox.ac.uk), the atmospheric physics subsite in Oxford. All shortest link paths between the two seed nodes have the same path length 3. As shown in Table 2 below, the shortest path lengths were 3 or 4 in the 10 path nets. Due to the reachability structures in the strongly connected component, cf. Figure 2, all intermediary nodes in a path net belong to the SCC when the start and end node in the path net belong to the SCC.

Figure 3. Path net with all shortest link paths (path length 3) between node 2099 (hum.port.ac.uk) and node 1904 (atm.ox.ac.uk) with counts of page level links (cf. Fig. 1b). Path net levels denote degrees of separation from start node. See affiliations in BJÖRNEBORN (2004, Appendix 10)
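A path net in the above sense, i.e. the subgraph of all shortest link paths between a start and an end node, can be extracted with two breadth-first searches. The following Python sketch is an illustration under stated assumptions and does not reproduce the Pajek procedure used in the study. The graph is assumed to be a dictionary mapping each node to a set of link targets, and the function names are hypothetical.

```python
from collections import deque


def bfs_dist(adj, start):
    """Return shortest directed distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_net(adj, start, end):
    """Return (shortest path length, set of links on any shortest path)."""
    rev = {}
    for u, targets in adj.items():
        for v in targets:
            rev.setdefault(v, set()).add(u)
    from_start = bfs_dist(adj, start)   # distance start -> node
    to_end = bfs_dist(rev, end)         # distance node -> end
    if end not in from_start:
        return None, set()
    length = from_start[end]
    # A link (u, v) lies on a shortest path exactly when the distance to u,
    # plus the link itself, plus the distance from v to the end equals length.
    edges = {
        (u, v)
        for u, targets in adj.items() if u in from_start
        for v in targets
        if v in to_end and from_start[u] + 1 + to_end[v] == length
    }
    return length, edges
```

The value of from_start for each node in the result corresponds to the path net level, i.e. the degree of separation from the start node as in Figure 3.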

* No literature has been found with similar investigations of subgraphs of all shortest paths between two network nodes.

Step D: page topics and genres

This step zoomed in on source pages and target pages with outlinks and inlinks in the 10 path nets. The objective was to identify page genres and topics and thus enable the identification of pages and links that provide cross-topic connections between subsites in the final step E.

Table 2. Summary of 10 path nets. Headings are explained in Step D below

Path net | Start node* | End node* | Path length | # subsites in path net | # visited subsites | # subsite level links | # page level links | # followed page links | # source pages | # visited source pages | # target pages | # visited target pages
HN01 | hum | atm | 3 | 15 | 15 | 23 | 48 | 48 | 41 | 41 | 28 | 28
HN02 | econ | chem | 3 | 6 | 6 | 7 | 7 | 7 | 7 | 7 | 7 | 7
HN03 | psych | math | 4 | 8 | 8 | 12 | 38 | 38 | 29 | 29 | 28 | 28
HN04 | speech | palaeo | 3 | 4 | 4 | 3 | 4 | 4 | 4 | 4 | 3 | 3
HN05 | geogr | eye | 3 | 6 | 6 | 7 | 12 | 12 | 7 | 7 | 12 | 12
NH01 | atm | hum | 3 | 15 | 8 | 25 | 87 | 33 | 68 | 26 | 47 | 23
NH02 | chem | econ | 4 | 28 | 12 | 53 | 183 | 53 | 134 | 37 | 114 | 37
NH03 | math | psych | 3 | 5 | 5 | 5 | 8 | 8 | 7 | 7 | 7 | 7
NH04 | palaeo | speech | 4 | 41 | 26 | 99 | 232 | 111 | 167 | 90 | 152 | 77
NH05 | eye | geogr | 4 | 13 | 13 | 18 | 38 | 38 | 33 | 33 | 27 | 27
Total (path length: mean) | | | 3.4 | 141 | 103 | 252 | 657 | 352 | 497 | 281 | 425 | 249

* See Table 1 for affiliations of start and end nodes.

Figure 4 shows a node diagram with an excerpt of a path net (NH05, cf. Table 2).

Figure 4. Node diagram with excerpt of a path net (NH05, cf. Table 2) with source and target pages, and page level links belonging to shortest link paths (path length 4) between start node eye.ox.ac.uk and end node geog.plym.ac.uk. Bold links show one of the link paths. Double border circles denote subsites, and single borders main sites, not reflecting actual sizes or numbers. See affiliations in BJÖRNEBORN (2004, Appendix 10). Stacked pages are from the same subsite directories


The figure gives an impression of how page level links connect source and target pages along shortest link paths, in this case between the subsites of eye research in Oxford and geography in Plymouth. All source pages and target pages in the smaller path nets were visited in the study (cf. Table 2). In the large path nets (NH01, NH02, NH04, cf. Table 2), only pages belonging to link paths not passing service-type (with no specific topic) subsites were visited. A total of 530 pages comprising 281 unique source pages and 249 unique target pages were visited on 103 subsites connected by 352 page level links, cf. Table 2. Internet Archive was used to retrieve source and target pages indexed as closely as possible to June/July 2001 when the original link data set was collected. 45 source pages (16.0%) and 30 target pages (12.0%) were not available in the Archive but were visited on the current Web. The higher availability of target pages in the Archive may reflect that more inlink-prone page genres like institutional homepages may have larger probability of being reached and retrieved by the Archive’s Web crawlers. Page genres were classified by the author for all visited 530 source and target pages in the 10 path nets. The term genre is here used in a broad sense in accordance with contemporary Web terminology for describing types of functions and purposes of Web sites and Web pages (e.g., KOEHLER, 1999). The visited Web pages represented a rich diversity of genres reflecting a multitude of creators, purposes and potential audiences. By an iterative process of comparing and grouping similar pages, 17 broader meta genres crystallized, divided into two main categories, personal and institutional pages as listed in Table 3. Personal Web pages are pages used for personal academic or nonacademic purposes. 
Typical personal Web pages are personal homepages, hobby pages, personal link lists, personal publications, software programs, and personal course pages including students’ assignments. Institutional Web pages are created for official, institutional and non-personal purposes, both academic and non-academic. The term institutional was used in the study as a generic term to cover Web pages at any academic unit level in the investigated Web space. Typical institutional Web pages are homepages of schools, departments, centers, research groups, etc.; generic campus-wide service pages; institutional link lists pointing to research partners, institutions, research projects etc.; staff and publication lists; institutional reports, newsletters and other publications; conference and workshop pages. A prioritized categorization order (BJÖRNEBORN, 2004, p. 156) was used to classify pages in a consistent way. The classification was not exhaustive, as a larger sample of academic Web pages would probably yield additional genres. Nor are the genres mutually exclusive; they overlap with fluid borders.


Table 3. Meta genres of academic Web pages: 9 institutional genres (i.) and 8 personal (p.)

Meta genre | Description
i.archive | collection of data in institutional archive or database
i.conf | conferences, workshops, seminars, etc.
i.generic | non-research-related, campus-wide service Web pages
i.hp | homepages of academic institutions
i.list | institutional pages with link lists as main content
i.proj | joint or single research groups, including specific projects
i.publ | institutional publications, e.g., reports
i.soft | institutional software programs, software documentation, tutorials, etc.
i.teach | institutional teaching-focused Web pages
p.archive | collection of data in personally maintained archive or database, e.g., discussion group archive
p.hobby | personal hobby webpages not related with the person’s main academic activity
p.hp | personal homepages
p.list | personal pages with link lists as main content
p.proj | personal research project pages
p.publ | personal publications, e.g., papers, posters
p.soft | personal software programs, software manuals, etc.
p.teach | personal teaching-focused Web pages

Institutional and personal page genres made up about 50% each of the visited source pages in the 10 path nets. This result may indicate a relatively large influence by personal Web pages for providing site outlinks in an academic Web space. About 60% of the visited target pages belonged to institutional page genres, perhaps reflecting more relevant and authoritative contents of institutional pages. About 40% of the target pages had personal genres, which still may indicate a relatively large influence by personal Web pages for receiving site inlinks in an academic Web space.

In contrast to the scientific literature, where references interconnect a small number of different document genres such as papers, monographs, etc., academic Web spaces contain a much richer diversity of interconnected document genres. There were 83 different interlinked genre pairs among the visited pages in the 10 path nets. The most frequent genre pairs were institutional link lists linking to institutional homepages (9.4%), personal link lists linking to personal publications (4.8%), personal link lists linking to institutional homepages (4.3%), and institutional link lists linking to other such lists (4.0%).

The study supports the view that the Web literally is a Web of genres with genre drift and topic drift, that is, changes in page genres and page topics along link paths that may affect small-world properties (cf. BJÖRNEBORN, 2004, Section 7.2.3). The manual inspection of Web pages in the 10 path nets showed how page genres are linked to other genres, with genre drift and topic drift along the followed shortest link paths.

Based on a genre adjacency matrix, Pajek extracted the corresponding graph shown in Figure 5. Dark nodes in the figure represent personal page genres like personal homepages, link lists, projects, publications, hobby pages, etc. (see Table 3). White nodes are corresponding institutional page genres. Link width reflects frequencies of genre pairs, i.e. genre connectivity. The figure clearly illustrates how institutional homepages (i.hp) and personal homepages (p.hp) are inlink-prone, and personal link lists (p.list) as well as institutional link lists (i.list) are outlink-prone. Institutional homepages and personal homepages thus received most followed inlinks in the path nets, 23% and 13%, respectively. Personal link lists provided almost a third (32%) of all followed outlinks in the 10 path nets and the institutional link lists about a quarter (25%).

Figure 5. A Web of genres. Genre pairs among the classified 17 meta genres (Table 3). Link width reflects frequencies of genre connectivity. Due to the Pajek software, thinner reciprocal links are concealed underneath thicker links. Genre selflinks are not shown. White nodes denote institutional (i.) meta genres and dark personal (p.)
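The genre connectivity behind Figure 5 amounts to counting (source genre, target genre) pairs over the followed page level links. The following Python sketch shows one way to derive such pair frequencies together with outlink and inlink shares per genre. It assumes that each followed link has already been classified manually into a pair of meta genre codes from Table 3. The function name is hypothetical and no data from the study is reproduced.

```python
from collections import Counter


def genre_connectivity(link_genres):
    """Summarize genre pairs of followed links.

    link_genres: iterable of (source_genre, target_genre) tuples.
    Returns (pair counts, outlink share per genre, inlink share per genre).
    """
    pairs = Counter(link_genres)
    total = sum(pairs.values())
    outlinks, inlinks = Counter(), Counter()
    for (source_genre, target_genre), n in pairs.items():
        outlinks[source_genre] += n   # how often a genre provides links
        inlinks[target_genre] += n    # how often a genre receives links
    out_share = {g: n / total for g, n in outlinks.items()}
    in_share = {g: n / total for g, n in inlinks.items()}
    return pairs, out_share, in_share
```

The pair counts correspond to the entries of a genre adjacency matrix, and genres with a high inlink or outlink share are the inlink-prone and outlink-prone genres discussed above.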

Step E: cross-topic connectors

In the final step E, 112 cross-topic links (called transversal links in the study) between dissimilar subsite topics were identified among the 352 followed page level links in the 10 path nets. The question of dissimilarity and topic drift between subsites was determined by the author based on affiliations and topical descriptions given by the visited subsites and Web pages. This was not trivial because of many interdisciplinary and multidisciplinary subunits at the UK universities. For example, environmental studies and meteorology posed interdisciplinary overlaps with earth sciences, making it inappropriate to designate links between them as cross-topic. This illustrates how topic drift may appear as sliding transitions along overlapping interdisciplinary topics. A pragmatic view on cross-topic links was employed in the study, focusing on links crossing more clear-cut topical borders, for instance, between computer science and atmospheric physics as in Figure 6.

Figure 6. Path net (cf. Fig. 3) with enclosed topical areas in humanities (hum), computer science (cs), geography (geo) and atmospheric sciences (atm). Non-enclosed nodes are service-type subsites. Cross-topic links are marked with dashed bold. See BJÖRNEBORN (2004, Appendix 10) for affiliations
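Once subsites carry topical labels, flagging cross-topic links is a simple comparison. The Python sketch below is only an illustration: in the study the judgment was made manually by the author, and the labels, the "service" marker, the overlapping parameter and the node names used here are hypothetical. It assumes that links touching service-type subsites are not counted as cross-topic and that topic pairs listed as overlapping are treated as similar.

```python
SERVICE = "service"  # hypothetical label for service-type subsites


def cross_topic_links(links, topic_of, overlapping=frozenset()):
    """Return links whose source and target subsites have dissimilar topics.

    links: iterable of (source_subsite, target_subsite) pairs.
    topic_of: dict mapping subsite -> topical label.
    overlapping: set of frozenset({topic_a, topic_b}) pairs regarded as
        interdisciplinary overlaps and therefore not cross-topic.
    """
    result = []
    for source, target in links:
        a, b = topic_of.get(source), topic_of.get(target)
        if a is None or b is None:
            continue  # unclassified subsite
        if SERVICE in (a, b):
            continue  # assumption: service-type subsites have no specific topic
        if a != b and frozenset((a, b)) not in overlapping:
            result.append((source, target))
    return result
```

For example, with topic_of = {"hum1": "hum", "cs1": "cs", "atm1": "atm"}, the links ("hum1", "cs1") and ("cs1", "atm1") would both be flagged, mirroring the role of the computer-science area as a connector in Figure 6.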

Figure 6 shows all the shortest link paths (path length 3) between the Faculty of Humanities and Social Sciences in Portsmouth, and the atmospheric physics subsite in Oxford (cf. Figure 3). One link path passes a humanities area (hum). Another topical area (atm) consists of three atmospheric science subsites. In a similar way, the topics of the 10 subsites that functioned as start and end nodes (cf. Table 1) of the 10 path nets naturally influenced the number of subsites with similar topics in each path net. In the figure, the computer-science area (cs) functions as connector on many of the shortest link paths between the start node and end node. One geography subsite (geo) is also a cross-topic connector. The non-enclosed nodes are service-type subsites, typically providing services for staff and students at the whole university. The dashed links denote cross-topic links connecting one topical area with another in the path net, for example, a link from the Faculty of Classics in Cambridge (node 337) to an oceanographer at the atmospheric subsite in Oxford (node 1904) having a hobby Web page devoted to an ancient Greek warship.

Selected findings

The findings in the study cannot be generalized to the whole Web or to all academic Web spaces. The findings are indicative only as there were delimitations in the investigated data set: national (only UK sites), sectoral (only academic sites), institutional (only university subsites), and temporal (snapshot data set from June/July 2001 not covering dynamic changes on the Web). Further, a small stratified sample of 10 path nets was used. However, the indicative findings and identified phenomena may be fruitful for future large-scale investigations. For example, longitudinal time series of snapshots of the same population of academic Web subsites could reveal how graph components, reachability structures, topical clusters, and cross-topic connectors change over the years on the dynamic Web. In order to enable large-scale investigations, automatic classification of overall topics both on the subsite and page levels would be necessary, as well as more objective heuristics for determining dissimilarity between topics, for instance, by including low co-inlink or co-outlink measures (analogous to co-citations and bibliographic couplings, respectively, cf. BJÖRNEBORN & INGWERSEN, 2004) of Web sites, combined with low co-word occurrences.

Below, some selected findings are outlined regarding small-world properties and small-world connectors in the investigated UK academic Web space. See BJÖRNEBORN (2004) for a more comprehensive presentation and discussion.

Small-world properties

The investigated UK academic Web space was found to contain small-world properties. Figure 7 shows the distribution of lengths of shortest link paths between pairs of subsites in the data set of 7669 UK academic subsites.
Only 19.5% of all pairs of subsites could be connected by a directed link path within the data set. This low percentage reflects the high share of subsites belonging to the OUT and Disconnected graph components (cf. Figure 2). The characteristic path length, that is, the average length of the shortest link paths among subsites that could reach each other, was 3.46 in the UK academic subweb. The diameter of the UK academic subweb, that is, the longest of the existing shortest paths between subsites, was 10.
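
As a hedged illustration of these three measures, the following minimal Python sketch computes the share of connected pairs, the characteristic path length, and the diameter by breadth-first search over a directed graph. It illustrates the definitions only and is not claimed to reproduce the computations in the study. The function names, the dictionary-of-lists representation, the counting of ordered pairs, and the four-node toy graph are assumptions made for illustration.

```python
from collections import deque


def shortest_path_lengths(arcs, source):
    """Breadth-first search: link distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in arcs.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def path_statistics(arcs):
    """Reachable share, characteristic path length and diameter of a directed graph."""
    nodes = set(arcs) | {v for targets in arcs.values() for v in targets}
    lengths = []
    for s in nodes:
        dist = shortest_path_lengths(arcs, s)
        # Keep one entry per ordered pair (s, t) with t reachable from s.
        lengths.extend(d for t, d in dist.items() if t != s)
    ordered_pairs = len(nodes) * (len(nodes) - 1)
    if not lengths:
        return {"reachable_share": 0.0,
                "characteristic_path_length": None,
                "diameter": None}
    return {
        "reachable_share": len(lengths) / ordered_pairs,
        # Average over connected pairs only, as in the text above.
        "characteristic_path_length": sum(lengths) / len(lengths),
        # Longest of the existing shortest paths.
        "diameter": max(lengths),
    }


# Hypothetical toy graph: a, b, c form a cycle; d links in but receives no links.
toy = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
stats = path_statistics(toy)
print(stats)
```

In the toy graph, node d can reach the cycle but cannot be reached from it, so only 9 of the 12 ordered pairs are connected. This mirrors, on a very small scale, how components outside the SCC lower the share of connected pairs.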

Figure 7. Distribution of shortest path lengths between 7669 UK subsites. Semi-log scale

Pajek was used to construct a corresponding random graph with the same number of nodes (7669) and edges (48,902) as the UK academic subweb. The characteristic path length of the random graph was 5.04 compared with 3.46 for the UK graph. Pajek also computed all local clustering coefficients for the 7669 subsites in the UK graph, giving an average overall clustering coefficient of 0.09038. In other words, if a subsite node v1 was connected with two subsite nodes v2 and v3, there was a 9.0% probability that v2 and v3 were also connected. The clustering coefficient for the corresponding random graph was 0.00084, that is, over 100 times smaller.

The small characteristic path length of the UK subweb and the large clustering coefficient compared with the corresponding random graph meet the requirements for a small-world network as defined by WATTS & STROGATZ (1998). However, as also noted by BRODER et al. (2000), true small-world properties are only present within the SCC component, because, as stated above, only 19.5% of all pairs of subsite nodes in the investigated UK graph could be connected by link paths.

Small-world connectors

Below are some selected indicative findings answering the main research question in the study: what types of Web links, Web pages, and Web sites function as cross-topic connectors in small-world link structures across an academic Web space?

Of the 112 cross-topic links identified in the 10 path nets, 64 (57.1%) originated from a personal Web page, whereas 48 (42.9%) were from an institutional page. Over
80% of the identified cross-topic links were related to academic activities such as research and teaching. There was a higher percentage of personal link lists among the cross-topic links (40%) and cross-topic source pages (36%) than among all the followed links (32%) and visited source pages (30%) in the 10 path nets. This finding may indicate a special impact of personal Web authors and their link lists for the emergence of small-world phenomena across dissimilar topical Web domains.

About 46% of subsites providing or receiving cross-topic links in the 10 path nets were related to computer science. In the random sample of 189 SCC subsites, only about 11% were related to computer science. Even if the sample of 10 path nets was small, this finding may indicate a special role of computer science subsites as cross-topic connectors on shortest link paths in an academic Web space. This probably reflects the auxiliary function of computer technology in many scientific disciplines. Furthermore, a more experienced and extrovert Web presence may exist among persons and institutions in computer science.

The special role of computer science subsites as connectors across an academic Web space is supported by the fact that 8 of the 10 subsites with the highest betweenness centrality in the whole data set were related to computer science. The social network analytic measure of betweenness centrality (FREEMAN, 1977) reflects the probability that a node occurs on a shortest link path between two arbitrary nodes. The present study (BJÖRNEBORN, 2004, Section 6.3.2.4) indicated a close relation between KLEINBERG’s (1999) concepts of hubs and authorities on the Web and the measure of betweenness centrality, as the subsites with the highest betweenness centrality also were among the strongest hubs and authorities as computed by Pajek. No other literature has been found discussing such a relation.
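
To make the betweenness measure concrete, the following minimal Python sketch counts, for each node, the shortest link paths between other node pairs that pass through it. It uses Brandes' accumulation scheme, which is a substituted technique chosen here for brevity and not the procedure used in the study. The function name, the toy graph, and the use of raw unnormalized scores are assumptions made for illustration; the sketch also assumes that no arc is listed twice.

```python
from collections import deque


def betweenness(arcs):
    """Unnormalized betweenness centrality for a directed, unweighted graph."""
    nodes = set(arcs) | {v for targets in arcs.values() for v in targets}
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        sigma = dict.fromkeys(nodes, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}    # predecessors on shortest paths
        order = []                        # nodes in order of increasing distance
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in arcs.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Back-propagate path dependencies from the farthest nodes to s.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical toy graph: every path from {a, b} to {d, e} runs through c.
toy = {"a": ["c"], "b": ["c"], "c": ["d", "e"]}
scores = betweenness(toy)
print(scores)
```

In the toy graph, node c lies on all four shortest paths between the source nodes a, b and the target nodes d, e, so it receives the only non-zero score. Dividing each score by the number of ordered node pairs would give a normalized variant closer to the probability reading given above.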
Conclusion and perspectives

This webometric study has developed a five-step methodology to identify Web links, pages, and sites that function as small-world connectors across an academic Web space. The methodology includes a novel Corona model of reachability structures in a Web graph, as well as shortest link path nets functioning as investigable small-world link structures, ‘mini small worlds’, generated by deliberate juxtaposition of topically dissimilar subsites.

Of importance to Web sociology and science studies, small-world link structures may reflect cross-social connections between different interest communities including cross-domain contacts in scientific networks. Indicative findings in the present study suggest that personal Web page authors, such as researchers and students, as well as computer science-related subsites may be important small-world connectors across sites and topics in an academic Web space. Such small-world connectors are important as they counteract balkanization of the Web into insularities of disconnected
subpopulations (cf. Figure 2 of the Corona model) not reachable by search engine crawlers following paths of links. Further, such small-world connectors reduce distances and ‘crumple up’ the Web by pulling Web sites and their neighborhoods close together across the Web. Possibilities of ‘crumpling up’ Web spaces are unlimited because the virtual space of the Web is billion-dimensional, as every new link literally adds a new dimension.

As academic Web spaces increasingly include extensive scholarly self-presentations and link creations, science studies may employ small-world approaches including social network analytic concepts like the abovementioned betweenness centrality for automatic tracking of central gatekeepers and interdisciplinary boundary crossings in academic Web spaces. This might identify fertile areas for interdisciplinary exploration and cross-pollination.

Special decentralized algorithms have been developed that utilize local connectivity information for identifying short traversal paths through a network where link data for the whole network are not available (e.g., ADAMIC et al., 2001; MENCZER, 2002). In particular, well-connected hub-like Web nodes may be exploited in such decentralized algorithms. The findings in the present study suggest that the rich diversity of inlinks and outlinks to and from computer-science Web sites and personal link lists may be utilized for such computer-aided navigation along small-world shortcuts, which also may be exploited as transit points for more exhaustive Web coverage by search engine crawlers.

* The paper is an extended version of BJÖRNEBORN (2005): Identifying small-world connectors across an academic Web space: A webometric study. In: Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics. Stockholm: Karolinska University Press, 2005, pp. 56–66. The author thanks Professor Mike Thelwall at the University of Wolverhampton for giving access to the UK link data set and for programming help with extracting the necessary data to be analyzed.

References

ADAMIC, L. A. (1999), The small world Web. Lecture Notes in Computer Science, 1696 : 443–452.
ADAMIC, L. A., R. M. LUKOSE, A. R. PUNIYANI, B. A. HUBERMAN (2001), Search in power-law networks. Physical Review E, 64 : 46135.
ALBERT, R., A.-L. BARABÁSI (2002), Statistical mechanics of complex networks. Reviews of Modern Physics, 74 (1) : 47–97.
ALBERT, R., H. JEONG, A.-L. BARABÁSI (1999), Diameter of the World-Wide Web. Nature, 401 : 130–131.

BAEZA-YATES, R., F. SAINT-JEAN, C. CASTILLO (2002), Web structure, age and page quality. Online Proceedings of the 2nd International Workshop on Web Dynamics, in conjunction with WWW 2002, Honolulu, Hawaii, 7th May 2002. (http://www.dcs.bbk.ac.uk/webdyn2/proceedings/baeza_yates_web_strucutre.pdf; visited 15.09.2005)
BJÖRNEBORN, L. (2004), Small-World Link Structures across an Academic Web Space: A Library and Information Science Approach. PhD thesis. Royal School of Library and Information Science. (http://www.db.dk/lb/phd; visited 15.09.2005)
BJÖRNEBORN, L. (2005), Identifying small-world connectors across an academic Web space: A webometric study. In: Proceedings of the 10th International Conference of the International Society for Scientometrics and Informetrics. Stockholm: Karolinska University Press, pp. 56–66.
BJÖRNEBORN, L., P. INGWERSEN (2004), Toward a basic framework for webometrics. Journal of the American Society for Information Science and Technology, 55 (14) : 1216–1227.
BOLLOBÁS, B. (1998), Modern Graph Theory. New York: Springer.
BOTAFOGO, R. A., E. RIVLIN, B. SHNEIDERMAN (1992), Structural analysis of hypertexts: identifying hierarchies and useful metrics. ACM Transactions on Information Systems, 10 (2) : 142–180.
BRODER, A., R. KUMAR, F. MAGHOUL, P. RAGHAVAN, S. RAJAGOPALAN, R. STATA, A. TOMKINS, J. WIENER (2000), Graph structure in the Web. Computer Networks, 33 (1-6) : 309–320.
CRANE, D. (1972), Invisible Colleges: Diffusion of Knowledge in Scientific Communities. Chicago: University of Chicago Press.
DAVISON, B. D. (2000), Topical locality in the Web. Proceedings of the 23rd International ACM SIGIR Conference. ACM Press, pp. 272–279.
DILL, S., S. R. KUMAR, K. MCCURLEY, S. RAJAGOPALAN, D. SIVAKUMAR, A. TOMKINS (2001), Self-similarity in the Web. Proceedings of the 27th International Conference on Very Large Data Bases, pp. 69–78.
DOREIAN, P., T. J. FARARO (1985), Structural equivalence in a journal network. Journal of the American Society for Information Science, 36 (1) : 28–37.
EGGHE, L., R. ROUSSEAU (1990), Introduction to Informetrics: Quantitative Methods in Library, Documentation and Information Science. Amsterdam: Elsevier.
EGGHE, L., R. ROUSSEAU (2002), Co-citation, bibliographic coupling and a characterization of lattice citation networks. Scientometrics, 55 (3) : 349–361.
EGGHE, L., R. ROUSSEAU (2003), BRS-compactness in networks: theoretical considerations related to cohesion in citation graphs, collaboration networks and the Internet. Mathematical and Computer Modelling, 37 (7-8) : 879–899.
FANG, Y., R. ROUSSEAU (2001), Lattices in citation networks: an investigation into the structure of citation graphs. Scientometrics, 50 (2) : 273–287.
FREEMAN, L. C. (1977), A set of measures of centrality based on betweenness. Sociometry, 40 (1) : 35–41.
FURNER, J., D. ELLIS, P. WILLETT (1996), The representation and comparison of hypertext structures using graphs. In: M. AGOSTI, A. F. SMEATON (Eds), Information Retrieval and Hypertext. Boston: Kluwer Academic Publishers, pp. 75–96.
GARNER, R. (1967), A computer oriented, graph theoretic analysis of citation index structures. In: B. FLOOD (Ed.), Three Drexel Information Science Research Studies. Drexel Press, pp. 3–46. (http://www.garfield.library.upenn.edu/rgarner.pdf; visited 15.09.2005)
GROSS, J., J. YELLEN (1999), Graph Theory and its Applications. Boca Raton, Flor.: CRC Press.
HUMMON, N. P., P. DOREIAN (1989), Connectivity in a citation network: The development of DNA theory. Social Networks, 11 : 39–63. (http://www.garfield.library.upenn.edu/papers/hummondoreian1989.pdf; visited 15.09.2005)
KLEINBERG, J. M. (1999), Authoritative sources in a hyperlinked environment. Journal of the ACM, 46 (5) : 604–632.
KOCHEN, M. (Ed.) (1989), The Small World. Norwood, NJ: Ablex.
KOEHLER, W. (1999), Classifying Web sites and Web pages: the use of metrics and URL characteristics as markers. Journal of Librarianship and Information Science, 31 (1) : 297–307.

MENCZER, F. (2002), Growing and navigating the small world Web by local content. Proceedings of the National Academy of Sciences, 99 (22) : 14014–14019.
MILGRAM, S. (1967), The small-world problem. Psychology Today, 1 (1) : 60–67.
NEWMAN, M. E. J. (2001), The structure of scientific collaboration networks. Proceedings of the National Academy of Sciences, 98 (2) : 404–409.
NEWMAN, M. E. J. (2003), The structure and function of complex networks. SIAM Review, 45 (2) : 167–256.
OTTE, E., R. ROUSSEAU (2002), Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28 (6) : 441–454.
ROGERS, E. M. (2003), Diffusion of Innovations. 5.ed. New York: The Free Press.
SCHARNHORST, A. (2003), Complex networks and the Web: insights from nonlinear physics. Journal of Computer-Mediated Communication, 8 (4). (http://www.ascusc.org/jcmc/vol8/issue4/scharnhorst.html; visited 15.09.2005)
SCOTT, J. (2000), Social Network Analysis: A Handbook. 2.ed. Thousand Oaks, Cal.: SAGE Publications.
STEYVERS, M., J. B. TENENBAUM (2001), The large-scale structure of semantic networks: statistical analyses and a model for semantic growth. (http://arxiv.org/pdf/cond-mat/0110012; visited 15.09.2005)
THELWALL, M. (2001), A Web crawler design for data mining. Journal of Information Science, 27 (5) : 319–326.
THELWALL, M. (2002/2003), A free database of university Web links: data collection issues. Cybermetrics, 6/7 (1) : paper 2. (http://www.cindoc.csic.es/cybermetrics/articles/v6i1p2.html; visited 15.09.2005)
THELWALL, M., D. WILKINSON (2003), Graph structure in three national academic Webs: power laws with anomalies. Journal of the American Society for Information Science and Technology, 54 (8) : 706–712.
WASSERMAN, S., K. FAUST (1994), Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.
WATTS, D. J., S. H. STROGATZ (1998), Collective dynamics of ‘small-world’ networks. Nature, 393 (June 4) : 440–442.

Appendix

Theories and empirical work on small-world phenomena (e.g., MILGRAM, 1967; KOCHEN, 1989) stem from research in social network analysis and the related mathematical discipline of graph theory. A graph is a mathematical representation of a network structure consisting of network nodes (or vertices) connected by edges (undirected relations between nodes) or arcs (directed) (cf., e.g., GROSS & YELLEN, 1999). A graph can thus represent, for instance, a social network with the nodes corresponding to people and undirected edges corresponding to acquaintanceships, friendships and kinships. The nodes could also be ecological actors (in a food web), Internet servers, documents (in a citation network), concepts (in a thesaurus or semantic network), and all other types of interrelated network objects.

In a directed graph the edges represent directional relations between the nodes. An example of a directed graph is the World-Wide Web with Web pages (or directories, Web sites, entire top level domains – or subsites as in the present study) as the nodes and hyperlinks as directed arcs (e.g., BRODER et al., 2000; KLEINBERG, 1999).

Graph theory has developed a large mathematical framework (e.g., BOLLOBÁS, 1998; GROSS & YELLEN, 1999) useful in a very wide variety of scientific disciplines for mathematical modeling of networks, e.g., in biology, chemistry, physics, sociology, psychology, and technology. Graph theory has also been applied in information sciences for modeling and analyzing, for instance, citation networks (e.g., DOREIAN & FARARO, 1985; EGGHE & ROUSSEAU, 1990; 2002; 2003; FANG & ROUSSEAU, 2001; GARNER, 1967; HUMMON & DOREIAN, 1989), information systems (e.g., KORFHAGE et al., 1972), and hypertextual networks (e.g., BOTAFOGO et al., 1992; FURNER et al., 1996).
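
The distinction between directed arcs and undirected edges described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the node names and the three arcs are hypothetical and are not taken from the investigated data set.

```python
from collections import defaultdict

# Hypothetical directed arcs (source subsite, target subsite); illustrative only.
arcs = [("classics", "atmos"), ("atmos", "cs"), ("cs", "atmos")]

# Directed view: outlinks and inlinks are kept separately for every node.
outlinks = defaultdict(set)
inlinks = defaultdict(set)
for source, target in arcs:
    outlinks[source].add(target)
    inlinks[target].add(source)

# Undirected view: each arc collapses to an unordered pair of nodes,
# so the reciprocal arcs between "atmos" and "cs" become a single edge.
edges = {frozenset(arc) for arc in arcs}

print(dict(outlinks))
print(dict(inlinks))
print(len(edges))
```

Keeping inlinks and outlinks apart preserves the direction of hyperlinks, whereas the undirected view retains only the fact that two nodes are related.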
One of the scientific disciplines that make extensive use of graph theory is social network analysis, which deals with quantitative measurement of structural relations between actors in social networks, employing different measures of, e.g., density, centrality, components, cores, cliques, clusters, weak ties and strong ties (cf. overviews in, e.g., SCOTT, 1992; WASSERMAN & FAUST, 1994). Concepts, techniques and measures developed in social network analysis are used in a wide array of social and behavioral science disciplines, e.g., social psychology, sociology, social anthropology, cultural geography, economy, and political sciences. Research topics where social network analysis is applicable include, for example, global political and economic systems, social support, problem solving in groups, diffusion of innovations (cf. ROGERS, 2003), ‘invisible colleges’ in the sociology of science (CRANE, 1972), power and exchange, and coalition formation (see more examples in WASSERMAN & FAUST, 1994). OTTE & ROUSSEAU (2002) give an overview of applications and potentials of social network analysis in the information sciences with regard to studies of, for example, citation and co-citation networks, collaboration structures and other forms of social interaction networks.
