Databases on the Web: national web domain survey

Denis Shestakov
Department of Media Technology, Aalto University
Konemiehentie 2, Espoo, 02150 Finland

[email protected]

ABSTRACT The deep Web, the part of the Web consisting of web pages filled with information from myriads of online databases, remains relatively unexplored to date. Even its basic characteristics, such as the number of searchable databases on the Web, are disputable. In this paper, we address the problem of accurately estimating the deep Web by sampling one national web domain. We report some of the results obtained when surveying the Russian Web. The survey findings, namely the size estimates of the deep Web, could be useful for further studies on handling data in the deep Web.

Categories and Subject Descriptors H.3.5 [Information Storage and Retrieval]: Online Information Services—Web-based services; H.3.7 [Information Storage and Retrieval]: Digital Libraries—Collection

General Terms Measurement

Keywords web databases, deep web, web characterization, web measurement, national web, structured data, cluster random sampling, virtual hosting

1. INTRODUCTION
Dynamic pages generated from parameters provided by a user via web search forms are poorly indexed by major web search engines and, hence, scarcely present in their results. Such search interfaces give web users online access to myriads of databases, whose contents comprise a huge part of the Web known as the deep Web [17]. Since introducing structured web data into search results is one of the current priorities for web search engines such as Google or Microsoft Bing [12], there is great interest in a better understanding of deep web resources, the main sources of structured data.

Though the term deep Web was coined back in 2000 [9], long ago by the standards of web-related concepts, many important characteristics of the deep Web still remain unknown. For example, such a parameter as the total number of searchable databases on the Web is highly disputable. In fact, until now only three works (namely, [9, 13, 19]) have been devoted solely to deep web characterization and, moreover, one of them is a white paper whose findings were obtained using proprietary methods. Another matter of concern is that the mentioned surveys are based on approaches with inherent limitations. The most serious drawback is ignoring so-called virtual hosting, i.e., the fact that multiple web sites can share the same IP address. Neglecting the virtual hosting factor in earlier deep web surveys means that their estimates are highly biased.

In this work, our goal is to obtain accurate characteristics of the deep Web by sampling one national web domain. In our characterization survey we use the Host-IP clustering approach, which is based on the idea of clustering hosts sharing the same IPs and analyzing such "neighbors by IP" together. Using host-IP mapping data allows us to address the drawbacks of previous surveys, specifically to take the virtual hosting factor into account. We obtain rough estimates for the number of entities (or objects) in the analyzed national web domain and argue that the size of the deep Web (measured in the number of searchable entities) is similar to the size of the indexable Web.

The next section gives background on methods to characterize the deep Web. In Section 3 we present our approach, the Host-IP cluster sampling technique. The results of our survey of the Russian Web are described in Section 4. Discussion and a literature review are given in Sections 5 and 6, respectively. Finally, Section 7 concludes the paper.

2. BACKGROUND: DEEP WEB CHARACTERIZATION

Existing attempts to characterize the deep Web [9, 13, 19] are based on two methods originally applied to general Web surveys: overlap analysis [10] and random sampling of IP addresses [16]. The first technique involves pairwise comparisons of listings of deep web sites, where the overlap between each pair of sources is used to estimate the size of the deep Web (specifically, the total number of deep web sites) [9]. The critical requirement that the listings be independent of one another is infeasible in practice, which makes the estimates produced by overlap analysis seriously biased. Additionally, the method is generally non-reproducible.
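To make the idea concrete, overlap analysis is essentially a capture-recapture estimate: the sizes of two listings and the size of their overlap yield an estimate of the total population. A minimal sketch in Python (illustrative only; the toy listings are hypothetical and this is not the exact estimator used in [9]):

def overlap_estimate(listing_a, listing_b):
    # Capture-recapture (Lincoln-Petersen style) estimate of the total number
    # of deep web sites given two listings of deep web site hostnames.
    # Assumes the listings were collected independently of each other,
    # which is exactly the requirement that rarely holds in practice.
    a, b = set(listing_a), set(listing_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("listings do not overlap: estimate undefined")
    return len(a) * len(b) / overlap

# Hypothetical toy listings.
dir1 = {"books.example.ru", "cars.example.ru", "jobs.example.ru"}
dir2 = {"cars.example.ru", "flats.example.ru"}
print(overlap_estimate(dir1, dir2))   # 3 * 2 / 1 = 6 deep web sites estimated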


Figure 1: Random sampling of IP addresses method: a sample of IPs (IP1, ..., IPn) is tested for active web servers (found at IP2 and IPk), which are then checked for the presence of interfaces to web databases; due to the inability to find out all web sites hosted on a particular IP, only three sites in total are analyzed while the rest are missed.

Unlike overlap analysis, the second technique, random sampling of IP addresses (rsIP for short), is easily reproducible and requires no pre-built listings. The rsIP method estimates the total number of deep web sites by analyzing a sample of unique IP (Internet Protocol) addresses randomly generated from the entire space of valid IPs and extrapolating the findings to the Web at large. Since the entire IP space is of finite size and every web site is hosted on one or several web servers, each with an IP address (such an address is not a unique identifier for a server, though: a single server may use multiple IPs and, conversely, several servers can answer for the same IP), analyzing an IP sample of adequate size can provide reliable estimates for the characteristics of the Web in question. In [13], one million unique randomly-selected IP addresses were scanned for active web servers by making an HTTP connection to each IP. Detected web servers were exhaustively crawled, and those hosting deep web sites (defined as web sites with search interfaces, or search forms, that allow a user to search in underlying databases) were identified and counted. The technique is depicted in Figure 1, where one deep web site is found to be hosted on a web server with IPk. Note that the main indicator of a deep web site is the ability to search the content of an underlying database (or databases) rather than the (crawlable) content of the site's pages. In this way, a deep web site and a database-driven web site are two different notions.

Unfortunately, the rsIP approach has several limitations. The most serious drawback is ignoring virtual hosting, i.e., the fact that multiple web sites can share the same IP address. This leads to ignoring a certain number of sites, some of which are apparently deep web sites. To illustrate, Figure 1 shows that the servers with IP2 and IPk host twenty and two web sites, respectively, but only three out of the 22 web sites are actually crawled to discover interfaces to web databases. The numbers of analyzed and missed sites per IP in this example are perfectly typical: a reverse IP lookup usually returns only one or two of the web sites hosted on a given IP address, while hosting many sites on the same IP is common practice. Table 1 presents the average numbers of virtual hosts per IP address obtained in four web studies conducted in 2003-2007. The data clearly suggest that: (1) one IP address is, on average, shared by 7-11 hosts; and (2) the number of hosts per IP increases over time [4, 14, 3].

Another factor overlooked by the rsIP method is DNS load balancing, i.e., the assignment of multiple IP addresses to a single web site. For instance, the Russian news site newsru.com, mapped to three (here and hereafter, if not otherwise indicated, resolved in 05/2010) IPs, is three times more likely to appear in a sample of random IPs than a site with one assigned IP. Since DNS load balancing benefits mostly popular and highly trafficked web sites, we expect the bias it causes to be smaller than the bias due to virtual hosting. Indeed, according to the SecuritySpace survey as of April 2004, only 4.7% of hosts had their names resolved to multiple IP addresses [5], while more than 90% of hosts shared the same IP with others (see the first row of Table 1). To summarize, virtual hosting cannot be ignored in any IP-based sampling survey. Next we present a sampling strategy that addresses these challenges.
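Before turning to it, the probing step of rsIP-style surveys can be made concrete with a minimal Python sketch (illustrative only: real surveys such as [13] exclude reserved IP ranges, crawl detected servers exhaustively, and inspect the found forms much more carefully than the naive heuristic below):

import random
import urllib.request

def random_ip():
    # Naive generator of a random IPv4 address (reserved ranges not excluded).
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def probe_ip(ip, timeout=5):
    # Return the page served at this IP, or None if no web server answers.
    try:
        with urllib.request.urlopen("http://%s/" % ip, timeout=timeout) as resp:
            return resp.read(100_000).decode("utf-8", errors="replace")
    except OSError:
        return None

def has_search_form(html):
    # Crude indicator of a search interface: a form with a text input.
    html = html.lower()
    return "<form" in html and ('type="text"' in html or "type='text'" in html)

sample = [random_ip() for _ in range(1000)]
servers = {}
for ip in sample:
    page = probe_ip(ip)
    if page is not None:
        servers[ip] = page          # an active web server answered at this IP
candidates = [ip for ip, page in servers.items() if has_search_form(page)]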

3. HOST-IP CLUSTERING TECHNIQUE

Real-world web sites are hosted on several web servers, share their web servers with other sites, and are often accessible via multiple hostnames. Neglecting these issues makes estimates produced by IP-based or host-based sampling seriously biased. The clue to a better sampling strategy lies in the fact that hostname aliases for a given web site are frequently mapped to the same IP address. In this way, given a hostname resolved to some IP address, we can identify other hostnames potentially pointing to the same web content by checking the other hostnames mapped to this IP. There is a strong resemblance here to the virtual hosting problem, where all hosts sharing a given IP address have to be found.

Assuming a large listing of hosts is available, we can learn which hosts map to which IPs by resolving all hostnames in the listing to their corresponding IP addresses. Technically, such massive resolving of available hosts to their IPs is essentially a process of clustering hosts into groups, each including the hosts that share the same IP address. Grouping hosts with the same IPs together is quite natural because it is exactly what happens on the Web, where a web server serves requests only to those hosts that are mapped to the server's IP. Once the overall list of hosts is clustered by IPs, we can apply a cluster sampling strategy, where an IP address is a primary sampling unit consisting of a cluster of secondary sampling units, the hosts. Our Host-IP approach to characterization of the deep Web consists of the following major steps:

• Resolving, clustering and sampling: resolve a large number of hosts relating to the studied web segment to their IP addresses, group the hosts based on their IPs, and generate a sample of random IP addresses from all resolved IPs.

• Crawling: for each sampled IP, analyze the hosts sharing it for near-duplicates, remove the near-duplicates and crawl the rest to a predefined depth. While crawling, new hosts (which are not in the initial main list) may be found: those mapped to a sampled IP are to be analyzed, while the others are analyzed only if they belong to the studied web segment.

• Deep web site identification: analyze all pages retrieved during the crawling step and detect those with search interfaces to databases.

Table 1: Virtual hosting: the average number of hosts per IP address reported in several web surveys.

Analyzed hostname dataset (and references) | When conducted | Num of hosts | Num of IPs | Aver. num of hosts per IP
Entire Web: all hosts known to Netcraft [4] | 04/2004 | 49.75×10^6 | ≈4.4×10^6 | ≈11.3
Russian Web: all 2nd-level domain names in .ru and .su zones [3] | 03/2007 | 639,174 | 68,188 | 9.4
Russian Web: all 2nd-level domain names in .ru and .su zones [2] | 03/2006 | 387,470 | 51,591 | 7.5
Portuguese Web: 85% of hosts in .pt, 12% in .com, 3% in others [14] | 04/2003 | 46,457 | 6,856 | 6.8

The detailed description of the Host-IP clustering sampling approach can be found elsewhere [18]. In the next section we describe the survey of the Russian deep Web conducted using the Host-IP clustering technique in September 2006.
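A minimal Python sketch of the resolving, clustering and sampling step described above (illustrative only; the input listing is hypothetical, and hostname de-duplication, near-duplicate detection and crawling are omitted):

import random
import socket
from collections import defaultdict

def resolve(host):
    # Return the set of IPv4 addresses a hostname resolves to (empty on failure).
    try:
        return {info[4][0] for info in socket.getaddrinfo(host, 80, socket.AF_INET)}
    except socket.gaierror:
        return set()

def cluster_by_ip(hosts):
    # Keep one host-IP pair per host (sidestepping DNS load balancing)
    # and group together the hosts that share the same IP address.
    clusters = defaultdict(list)
    for host in hosts:
        ips = sorted(resolve(host))
        if ips:
            clusters[ips[0]].append(host)
    return clusters

def sample_ips(clusters, n, seed=0):
    # Cluster sampling: an IP is the primary unit, its hosts the secondary units.
    random.seed(seed)
    chosen = random.sample(sorted(clusters), min(n, len(clusters)))
    return {ip: clusters[ip] for ip in chosen}

# Hypothetical listing of hostnames from the studied web segment.
listing = ["www.example.ru", "lib.example.ru", "shop.example.ru"]
clusters = cluster_by_ip(listing)
for ip, hosts in sample_ips(clusters, n=2).items():
    print(ip, hosts)   # hosts sharing this IP are then crawled and analyzed together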

4. RUSSIAN DEEP WEB SURVEY
We started our survey by merging two sets of hostnames (namely, the datasets "Hostgraph" and "RU-hosts" [20]) into one listing of unique hostnames. We then resolved all the hosts to IP addresses and, after removing all host-IP pairs with invalid IPs, ended up with 717,240 host-IP pairs formed by 672,058 unique hosts and 79,679 unique IP addresses. To avoid dealing with the DNS load-balancing factor at the subsequent clustering and sampling steps, we excluded redundant host-IP pairs (e.g., www.google.ru, resolved to six IPs, produces six host-IP pairs: one pair participating in the further analysis and five 'redundant' ones), and finally got 672,058 host-IP pairs on 78,736 IP addresses. In this way, our compiled dataset gives yet another confirmation of the magnitude of virtual hosting: there are nine virtual hosts per IP address on average. The degree of IP address sharing is depicted in Figure 2. In particular, 55.6% (398,608) of all hosts in the dataset share their IPs with at least 200 other hosts.

We then clustered the 672,058 host-IP pairs by their IPs and, in this manner, got 78,736 groups of pairs, each having from one to thousands of hosts. It is natural to assume that deep web sites are more likely to be found within host groups of certain sizes, i.e., it might be beneficial to study groups with a few hosts separately from groups including hundreds of hosts. One of the reasons to stratify was to actually verify whether deep web sites are served by servers hosting only a few sites. We formed three strata using the following stratification criteria: Stratum 1 included those host-IP pairs whose IP addresses are each associated with seven or fewer hostnames, groups of size from 8 to 40 inclusive formed Stratum 2, and Stratum 3 combined groups with no less than 41 hosts each. The thresholds 8 and 41 were chosen to make Stratum 1 contain 90% of all IP addresses and to put 70% of all hosts into Stratum 3. In particular, Stratum 3 comprised 70% (472,474) of all hosts and only 2% (1,860) of all IP addresses. We randomly selected 964, 100 and 11 primary sampling units (IP addresses) from Strata 1, 2 and 3, respectively. This resulted in 6,237 secondary units (hosts) in total to crawl. The hosts of every sampled IP were crawled to depth three (see [6] for a discussion of the crawling depth value).
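For illustration, the stratification and sample allocation just described can be sketched as follows (thresholds and sample sizes as in our survey; clusters is assumed to be a mapping from an IP address to the list of hosts resolved to it, as in the sketch of Section 3):

import random

def stratify(clusters):
    # Split IP clusters into three strata by the number of hosts per IP:
    # Stratum 1: 7 or fewer hosts, Stratum 2: 8-40 hosts, Stratum 3: 41 or more.
    strata = {1: {}, 2: {}, 3: {}}
    for ip, hosts in clusters.items():
        if len(hosts) <= 7:
            strata[1][ip] = hosts
        elif len(hosts) <= 40:
            strata[2][ip] = hosts
        else:
            strata[3][ip] = hosts
    return strata

def allocate(strata, sizes, seed=0):
    # Draw the per-stratum samples of primary units (IP addresses).
    random.seed(seed)
    return {h: random.sample(sorted(ips), min(sizes[h], len(ips)))
            for h, ips in strata.items()}

sample = allocate(stratify(clusters), sizes={1: 964, 2: 100, 3: 11})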

Figure 2: IP address sharing for our dataset.

The estimates for the total numbers of deep web sites and databases and their corresponding confidence intervals were calculated according to the formulas given in [18] and are presented in Table 2. The 'Num of all detected' columns show the numbers of deep web sites and web databases that were actually detected in the strata samples. However, not all of them turned out to be Russian deep web sites. In particular, several sampled hosts in .RU were in fact redirects to non-Russian deep web resources. Another noticeable example in this category was xxx.itep.ru, one of the aliases for the Russian mirror of arXiv (http://arxiv.org/), an essentially international open e-print archive. We excluded all such non-Russian resources and put the updated numbers in the 'Num of Russian' columns. The survey results, the overall numbers of deep web sites and web databases in the Russian segment of the Web as of September 2006, estimated by the Host-IP clustering method, are 15,700±3,700 and 20,700±4,400, respectively (see Table 2).
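The exact estimator and variance formulas are those given in [18] and are not reproduced here; for intuition only, an estimator of a total under stratified simple random sampling of clusters has roughly the following textbook shape (a sketch; the y-values are per-sampled-IP counts of deep web sites or databases, and the example numbers are hypothetical):

from math import sqrt

def stratified_total(strata, z=1.96):
    # strata: list of (N_h, counts_h) pairs, where N_h is the total number of IPs
    # in stratum h and counts_h the per-sampled-IP counts (e.g., of deep web sites).
    # Returns the estimated total and the half-width of an approximate 95% CI.
    total, variance = 0.0, 0.0
    for N_h, counts in strata:
        n_h = len(counts)
        mean = sum(counts) / n_h
        total += N_h * mean                                    # expand to the stratum
        s2 = sum((y - mean) ** 2 for y in counts) / (n_h - 1)  # sample variance
        variance += N_h ** 2 * (1 - n_h / N_h) * s2 / n_h      # finite population corr.
    return total, z * sqrt(variance)

# Hypothetical example: three strata with their IP counts and per-IP observations.
est, half_width = stratified_total([(70_000, [0, 1, 0, 0]),
                                    (7_000, [1, 0]),
                                    (1_800, [2, 3])])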

4.1 Size of Russian deep Web

With estimates for the total numbers of deep web sites and web databases at hand, we then turned to estimating the size of the Russian deep Web. Our goal was to get an estimate for the total number of searchable objects (also referred to as records or entities) accessible in the Russian deep Web. For each web database detected in our survey we assessed the number of searchable objects in it. Sometimes this assessment was straightforward, as some page of the corresponding deep web site stated the size of the database. More often, however, no such information was available and, for these resources, we used a combination of queries to roughly determine how many entities are accessible. Note that the entity-based size evaluation for certain types of databases, such as those supporting transport timetable or airfare search, is highly uncertain and, hence, our size assessments for a number of resources give only the order of magnitude.
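When a database did not state its size anywhere, one simple way to approximate it is to probe the search interface with a few broad queries and keep the largest reported result count. The sketch below is only a rough illustration of this idea, not the exact procedure we followed; result_count(site, query) is a hypothetical, necessarily site-specific helper that submits a query and parses the reported number of matches:

def estimate_db_size(site, result_count, probe_queries=("a", "e", "o", "2006")):
    # Lower-bound the number of entities reachable via a search interface by
    # taking the maximum result count over a handful of broad probe queries.
    best = 0
    for q in probe_queries:
        try:
            best = max(best, result_count(site, q))
        except Exception:
            continue          # a failed probe simply contributes nothing
    return best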

Table 2: Approximate 95% confidence intervals for the total numbers of deep web sites (dws) and web databases (dbs) in each stratum and in the entire survey.

(row) | All detected, dws | All detected, dbs | Russian, dws | Russian, dbs
Stratum 1, detected in sample | 80 | 131 | 72 | 106
Stratum 1, confidence interval [10^3] | 6.0±1.4 | 9.9±3.6 | 5.4±1.3 | 8.0±2.8
Stratum 2, detected in sample | 38 | 46 | 38 | 46
Stratum 2, confidence interval [10^3] | 2.1±0.7 | 2.5±1.1 | 2.1±0.7 | 2.5±1.1
Stratum 3, detected in sample | 64 | 87 | 55 | 68
Stratum 3, confidence interval [10^3] | 9.6±3.4 | 13.0±5.5 | 8.2±3.5 | 10.2±3.5
Survey total, confidence interval [10^3] | 17.7±3.7 | 25.4±6.5 | 15.7±3.7 | 20.7±4.4


Figure 3: Estimated numbers of small, medium and large Russian deep web sites for each stratum and in total.

For each deep web site we summed up the sizes of the databases accessible through it and, based on the obtained sizes, classified each deep web site into three classes: small if less than 10^5 entities are accessible in total via the site, medium if the number of entities is in the range from 10^5 to 10^6, and large if the site leads to more than 10^6 records. Figure 3 shows the estimated numbers of Russian deep web sites (non-Russian omitted) for each size class within each stratum and in total. Interestingly, almost 90% of Russian deep web sites are either of small (approximately 12,100 sites; 77%) or medium (1,900; 12%) size, and only approximately one in ten deep web sites (1,700) provides a search over more than 10^6 entities. Among the large deep web sites detected in the survey there were only three resources that led to more than 10^7 records: tury.ru (tour search) allowed, at the time of our survey, a user to access slightly more than 20 million entities, while the other two resources, actually web interfaces to a Z39.50 gateway [15] (through which a user can search the records of a large number of libraries), led to dozens of millions of records. Based on the numbers above, the upper-bound estimate for the total number of entities in the Russian deep Web is around twenty billion (= 12,100·10^5 + 1,900·10^6 + 1,700·10^7).

Since large deep web sites make the main contribution to this estimate, we manually inspected all large deep web sites detected in the samples and, in turn, divided them into three groups. The first group includes sites aggregating their content from a large number of smaller web sites; the price aggregator price.ru, collecting offers from more than two thousand companies (most of which have their own web sites, each serving its own small piece of content), is an example of a site in this group. Deep web sites with 'mirrored' content form the second group. Such sites typically provide licensed access to a database (or databases) owned and managed by a third party. For example, all five large deep web sites detected in the sample of Stratum 3 provide a search through the contents of the same two databases (both in the travel domain, but owned by different companies). As we further investigated travel web sites, it turned out that there are around a dozen major tour search systems in Russia, and almost any Russian travel web site with tour search functionality is in fact serving the content of one of these systems' databases. Finally, the third group consists of the deep web sites with original content. The numbers of large deep web sites with aggregated, mirrored and original content detected in each sample, together with the total projected estimates, are given in Table 3. One can see that more than half of the large deep web sites provide access to essentially the same set of databases rather than serve original content. By excluding large resources with mirrored content, we corrected our upper-bound estimate for the total number of entities in the Russian deep Web to around ten billion (= 20.1·10^9 − 970·10^7).

While ten billion entities in web databases on the Russian Web is a huge number, it is, as noted above, an upper-bound estimate and includes sites whose content is aggregated (i.e., quite likely available, perhaps in another format, on other web sites). The estimate also suggests that the size of the Russian deep Web is comparable to the indexed size of the Russian Web: indeed, around one billion web documents were indexed by Yandex, a web search engine for the Russian part of the Web, in September 2006 [1]. Assuming that an entity roughly corresponds to a web page (which is often not true, as many web pages list full information about a number of entities with no further links), we argue that the deep and indexed parts of the Russian Web are of similar sizes.
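For the record, the arithmetic behind the two upper-bound figures can be retraced in a few lines (the per-class multipliers 10^5, 10^6 and 10^7 are those used in the text above):

small, medium, large = 12_100, 1_900, 1_700     # estimated numbers of Russian deep web sites
upper_bound = small * 10**5 + medium * 10**6 + large * 10**7
print(upper_bound)                              # 20110000000, i.e. around twenty billion

mirrored_large = 970                            # large sites with mirrored content (Table 3)
corrected = upper_bound - mirrored_large * 10**7
print(corrected)                                # 10410000000, i.e. around ten billion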

Table 3: Number of large (> 10^6 entities) deep web sites in each sample and total estimates.

(row) | Overall num of large dws | Aggregated content | Mirrored content | Original content
Stratum 1 sample | 11 | 4 | 3 | 4
Stratum 2 sample | 2 | 2 | 0 | 0
Stratum 3 sample | 5 | 0 | 5 | 0
Total number (estimate) | 1680 (=100%) | 410 (24%) | 970 (58%) | 300 (18%)

5. DISCUSSION
The most surprising results of our survey are two empirical observations. Firstly, the deep Web in terms of the number of resources (namely, deep web sites) is larger than commonly considered. In particular, the numbers of deep web sites obtained in [13] are underestimates due to ignoring the virtual hosting factor. Secondly, the deep Web in terms of the number of entities is not that huge and is likely to be of a size comparable to the indexed part of the Web. At the very least, we doubt the claim in [9] that the deep Web is two orders of magnitude larger than the indexed Web. Notwithstanding that our survey focused on just one national segment of the deep Web, we presume that, by analogy with the Russian part of the Web, the majority of deep web sites are of small and medium size and, additionally, a large fraction of web databases with more than 10^6 entities contain duplicated content (such content is not only available on several deep web resources but may even be indexed at one of them by regular search engines).

Another interesting observation is that around a half of all deep web sites are hosted on IP addresses shared by more than 40 hosts (see Table 2). This is somewhat unexpected, since a deep web site serves dynamic content and thus normally requires more resources than an ordinary web site. Common sense suggests that a dedicated server (i.e., a server hosting from one or two to perhaps dozens of hosts, most of which are aliases) would be a better alternative for hosting a web site with database access. Nevertheless, it gives us yet another strong justification for taking the virtual hosting factor into consideration.

While we are generally satisfied with our decisions on stratification and sample sizes, we could still do better. Firstly, it appears that having four strata would be useful: the third (including IPs shared by 41 to around 150-200 hosts) and fourth (the rest) strata could together form the current Stratum 3. Such a stratification could identify whether there is some borderline (i.e., a certain number of hosts per IP) for servers hosting deep web sites. Secondly, to improve the accuracy of the obtained estimates, we could include more IPs in the Stratum 3 sample.

6. RELATED WORK
In Section 2, we mentioned the existing deep web surveys and discussed the limitations of the techniques used in these studies. The most serious drawback is ignoring the virtual hosting factor, which, in the context of deep web characterization, was first noticed in [19]. The same study also suggested a simple modification of the rsIP approach, namely random sampling of hosts, in which the primary unit of analysis is a host rather than an IP address and the analyzed hosts are randomly selected from a large listing of hosts. While the virtual hosting factor is apparently not a problem for host-based sampling, such a technique has other limitations [19]. Firstly, the method requires a large list of hosts to properly cover a certain part of the Web (e.g., a specific national segment); otherwise the sample analysis findings would be meaningless. Secondly, many web sites are accessible via several hostnames and, in general, identifying all such hostname aliases in a given list of hosts is uncertain. For example, googel.com and www.ziqing.net are among the numerous aliases for the Google.com search site: while the first is easily identified using string similarity, the second is hard to reveal (a toy illustration of such a similarity check is sketched at the end of this section). Thus, as some hosts in a sample may have unknown aliases in the non-sampled population, the estimates produced by the random sampling of hosts method are upper bounds.

Several studies on the characterization of the indexable Web space of various national domains have been published (e.g., [8, 14, 21]). The review [7] surveys several reports on national Web domains, discusses survey methodologies and presents a side-by-side comparison of their results. The idea of grouping hosts based on their IP addresses was used by Bharat et al. [11] to identify host aliases (or mirrored hosts, in Bharat's terminology). At the same time, we are unaware of any web survey study based on the Host-IP clustering approach.
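A toy illustration of the string-similarity check mentioned above, using Python's difflib (the 0.8 threshold is an arbitrary assumption, and, as the ziqing.net example shows, no lexical similarity measure can reveal such aliases):

from difflib import SequenceMatcher

def likely_alias(host_a, host_b, threshold=0.8):
    # Flag hostnames that look like lexical variants of each other.
    return SequenceMatcher(None, host_a, host_b).ratio() >= threshold

print(likely_alias("googel.com", "google.com"))      # True: easily caught
print(likely_alias("www.ziqing.net", "google.com"))  # False: the alias stays hidden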

7. CONCLUSION

The Host-IP clustering sampling technique addresses the drawbacks of previous deep web surveys and allows us to accurately characterize a national segment of the deep Web. We conducted a survey of the Russian deep Web and estimated, as of September 2006, the overall number of deep web sites in the Russian segment of the Web as 15,700±3,700 and the overall number of web databases as 20,700±4,400. We also gained insight into the size of the Russian deep Web by calculating an upper-bound estimate for the total number of entities in online databases. The obtained estimate, ten billion entities, suggests that the deep and indexed parts of the Russian Web are of similar sizes. Lastly, we argued that the deep Web in terms of the number of entities is not that huge and is of a size comparable to the indexed part of the Web.

Notes and Comments. Data collected during the survey is available at http://www.tml.tkk.fi/~denis/datasets/.

8. REFERENCES
[1] Internet Archive's snapshot of Yandex statistics page as of September 18, 2006. http://web.archive.org/web/20060918081218/http://company.yandex.ru/.
[2] Runet in March 2006: domains, hosting, geographical location. http://www.rukv.ru/analytics-200603.html. In Russian.
[3] Runet in March 2007: domains, hosting, geographical location. http://www.rukv.ru/runet-2007.html. In Russian.
[4] April 2004 Web Server Survey. http://news.netcraft.com/archives/2004/04/01/april_2004_web_server_survey.html, April 2004.
[5] DNS load balancing report. http://www.securityspace.com/s_survey/data/man.200404/dnsmult.html, April 2004.
[6] R. Baeza-Yates and C. Castillo. Crawling the infinite Web: five levels are enough. In Proceedings of the Third Workshop on Web Graphs (WAW), pages 156-167, 2004.
[7] R. Baeza-Yates, C. Castillo, and E. N. Efthimiadis. Characterization of national Web domains. ACM Trans. Internet Technol., 7(2), 2007.
[8] R. Baeza-Yates, C. Castillo, and V. López. Characteristics of the Web of Spain. Cybermetrics, 9(1), 2005.
[9] M. Bergman. The deep Web: surfacing hidden value. Journal of Electronic Publishing, 7(1), 2001.
[10] K. Bharat and A. Broder. A technique for measuring the relative size and overlap of public web search engines. Comput. Netw. ISDN Syst., 30(1-7):379-388, 1998.
[11] K. Bharat, A. Broder, J. Dean, and M. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. J. Am. Soc. Inf. Sci., 51(12):1114-1122, 2000.
[12] M. J. Cafarella, A. Halevy, and J. Madhavan. Structured data on the Web. Commun. ACM, 54:72-79, 2011.
[13] K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the Web: observations and implications. SIGMOD Rec., 33(3):61-70, 2004.
[14] D. Gomes and M. J. Silva. Characterizing a national community web. ACM Trans. Internet Technol., 5(3):508-531, 2005.
[15] C. A. Lynch. The Z39.50 information retrieval protocol: an overview and status report. SIGCOMM Comput. Commun. Rev., 21(1):58-70, 1991.
[16] E. T. O'Neill, P. D. McClain, and B. F. Lavoie. A methodology for sampling the World Wide Web. Annual Review of OCLC Research 1997, 1997.
[17] D. Shestakov. Deep Web: databases on the Web. In Handbook of Research on Innovations in Database Technologies and Applications, pages 581-588. IGI Global, 2009.
[18] D. Shestakov. Sampling the national deep Web. In Proceedings of DEXA 2011, pages 331-340, 2011.
[19] D. Shestakov and T. Salakoski. On estimating the scale of national deep Web. In Proceedings of DEXA 2007, pages 780-789, 2007.
[20] D. Shestakov and T. Salakoski. Characterization of national deep Web. Technical Report 892, Turku Centre for Computer Science, May 2008.
[21] G. Tolosa, F. Bordignon, R. Baeza-Yates, and C. Castillo. Characterization of the Argentinian Web. Cybermetrics, 11(1), 2007.