
2010 12th International Asia-Pacific Web Conference

Host-IP Clustering Technique for Deep Web Characterization

Denis Shestakov
Department of Media Technology, Helsinki University of Technology
PL 5400, FI-02015 TKK, Finland
[email protected]

Tapio Salakoski
Department of Information Technology, University of Turku
FI-20014 Turun yliopisto, Finland


Abstract—A huge portion of today's Web consists of web pages filled with information from myriads of online databases. This part of the Web, known as the deep Web, is to date relatively unexplored, and even such a major characteristic as the number of searchable databases on the Web is somewhat disputable. In this paper, we aim at a more accurate estimation of the main parameters of the deep Web by sampling one national web domain. We propose the Host-IP clustering sampling technique, which addresses the drawbacks of existing approaches to characterizing the deep Web, and report our findings based on a survey of the Russian Web conducted in September 2006. The obtained estimates, together with the proposed sampling method, could be useful for further studies handling data in the deep Web.


Keywords-deep web; web characterization; virtual hosting; host-IP clustering sampling; search interface discovery

I. INTRODUCTION

Dynamic pages generated from parameters provided by a user via web search forms are poorly indexed by major web search engines and, hence, scarcely presented in search results. Such search interfaces provide web users with online access to myriads of databases, whose contents comprise a huge part of the Web known as the deep Web [1]. Though the term deep Web was coined in 2000 [2], sufficiently long ago for any web-related concept or technology, many important characteristics of the deep Web still remain unknown. For example, such a parameter as the total number of searchable databases on the Web is highly disputable. In fact, to date there are only three works (namely, [2], [3], [4]) solely devoted to deep web characterization and, moreover, one of these works is a white paper in which all findings were obtained using proprietary methods. Another matter of concern is that the mentioned surveys are based on approaches with inherent limitations. The most serious drawback is ignoring so-called virtual hosting, i.e., the fact that multiple web sites can share the same IP address. Neglecting the virtual hosting factor means that the estimates produced by existing deep web surveys are highly biased.

In this work, our goal is to propose a better technique for deep web characterization. Our approach is based on the idea of clustering hosts sharing the same IPs and analyzing such "neighbors by IP" hosts together. Using host-IP mapping data allows us to address the drawbacks of previous surveys, specifically to take into account the virtual hosting factor. The next section gives background on methods to characterize the deep Web. In Section III we present our approach, the Host-IP cluster sampling technique. The experiments and results of our survey of the Russian Web are described in Section IV. Finally, Section V concludes the paper.

1 An IP address is not a unique identifier for a web server as a single server may use multiple IPs and, conversely, several servers can answer for the same IP.


II. BACKGROUND: DEEP WEB CHARACTERIZATION

Existing attempts to characterize the deep Web [2], [3], [4] are based on two methods originally applied to general Web surveys: namely, overlap analysis [5] and random sampling of IP addresses [6]. The first technique involves pairwise comparisons of listings of deep web sites, where the overlap between each two sources is used to estimate the size of the deep Web (specifically, the total number of deep web sites) [2]. The critical requirement that the listings be independent from one another is unfeasible in practice, thus making the estimates produced by overlap analysis seriously biased. Additionally, the method is generally non-reproducible.

Unlike overlap analysis, the second technique, random sampling of IP addresses (rsIP for short), is easily reproducible and requires no pre-built listings. The rsIP estimates the total number of deep web sites by analyzing a sample of unique IP (Internet Protocol) addresses randomly generated from the entire space of valid IPs and extrapolating the findings to the Web at large. Since the entire IP space is of finite size and every web site is hosted on one or several web servers, each with an IP address1, analyzing an IP sample of adequate size can provide reliable estimates for the characteristics of the Web in question. In [3], one million unique randomly-selected IP addresses were scanned for active web servers by making an HTTP connection to each IP. Detected web servers were exhaustively crawled, and those hosting deep web sites (i.e., web sites with at least one search interface to a database) were identified and counted.

Unfortunately, the rsIP approach has several limitations. The most serious drawback of the rsIP is ignoring virtual hosting, i.e., the fact that multiple web sites can share the same IP address. The virtual hosting problem is that for a given IP address there is no reliable procedure to obtain a full list of the web sites hosted on a web server with this IP. As a result, the rsIP method misses a certain number of sites, some of which may well be deep web sites.

The virtual hosting issue in the context of deep web characterization was first noticed in [4]. The same study also suggested a plain modification of the rsIP approach, namely random sampling of hosts, in which the primary unit of analysis is a host rather than an IP address and the analyzed hosts are randomly selected from a large listing of hosts. While the virtual hosting factor is apparently not a problem for host-based sampling, there are other limitations to such a technique [4]. Firstly, the method requires a large list of hosts to properly cover a certain part of the Web (e.g., a specific national segment); otherwise the sample analysis findings would be meaningless. Secondly, many web sites are accessible via several hostnames and, in general, identifying all such hostname aliases in a given list of hosts is uncertain. Thus, as some hosts in a sample may have unknown aliases in the non-sampled population, the estimates produced by the random sampling of hosts method are upper bounds.

To summarize, virtual hosting cannot be ignored in any IP-based sampling survey, while the change from IP-based to host-based sampling eliminates the "many web sites on the same IP" problem but has its own inherent limitations. Next we propose a sampling strategy that addresses these challenges.
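To make the rsIP procedure concrete, the following is a minimal sketch of probing randomly generated IPv4 addresses for an active web server on port 80. It is only an illustration of the idea, not the instrumentation used in [3] or [6]; the sample size, timeout, and the crude filtering of reserved address ranges are assumptions of the sketch.

```python
# Minimal sketch of the rsIP idea: probe randomly generated IPv4 addresses
# for active web servers by attempting a TCP connection on port 80.
# This is an illustration, not the instrumentation used in the cited surveys.
import random
import socket

def random_public_ip():
    """Draw a random IPv4 address, skipping a few obviously reserved ranges."""
    while True:
        octets = [random.randint(1, 254) for _ in range(4)]
        if octets[0] in (10, 127) or (octets[0], octets[1]) == (192, 168):
            continue  # crude filter for private/loopback space (illustrative only)
        return ".".join(str(o) for o in octets)

def has_web_server(ip, timeout=3.0):
    """Return True if something accepts a TCP connection on port 80 at this IP."""
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    sample = [random_public_ip() for _ in range(1000)]  # sample size is arbitrary here
    active = [ip for ip in sample if has_web_server(ip)]
    print(f"{len(active)} of {len(sample)} sampled IPs answered on port 80")
```

Each detected server would then have to be exhaustively crawled and checked for search interfaces, which is the expensive part of such a survey.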

III. HOST-IP CLUSTERING SAMPLING APPROACH

Real-world web sites typically share their web servers with other sites and are often accessible via multiple hostnames. Neglecting these issues makes the estimates produced by IP-based or host-based sampling seriously biased. The key to a better sampling strategy lies in the fact that hostname aliases for a given web site are frequently mapped to the same IP address. In this way, given a hostname resolved to its IP address, we can identify other hostnames potentially pointing to the same web content by checking the other hostnames mapped to this IP. One can see here a strong resemblance to the virtual hosting problem, where the task is to find all hosts sharing a given IP address.

Assuming a large listing of hosts is available, we can acquire the knowledge of which hosts are mapped to which IPs by resolving all hostnames in the listing to their corresponding IP addresses. Technically, such massive resolving of known hosts to their IPs is essentially a process of clustering hosts into groups, each including the hosts that share the same IP address. Besides, grouping hosts with the same IPs together is quite natural because it is exactly what happens on the Web, where a web server serves requests only to those hosts that are mapped to the server's IP. Once the overall list of hosts is clustered by IPs, we can apply a cluster sampling strategy, where an IP address is the primary sampling unit consisting of a cluster of secondary sampling units, the hosts. In cluster sampling, whenever a primary unit is included in a sample, every secondary unit within it is analyzed.

To summarize, our Host-IP approach to the characterization of the deep Web consists of the following two major steps (illustrated by the sketch after this section):
1) Resolving, clustering and sampling: resolve a large number of hosts relating to the studied web segment to their IP addresses, group the hosts by their IPs, and generate a sample of random IP addresses from the list of all resolved IPs.
2) Crawling and deep web site identification: for each sampled IP, crawl the hosts sharing the sampled IP to a predefined depth. While crawling, new hosts (i.e., hosts not in the initial list) may be found: those mapped to a sampled IP are analyzed as well, while others are analyzed only if certain conditions are met. Finally, analyze all crawled pages for the presence of search interfaces to databases2.
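The sketch below illustrates the first step under the assumption that the hostnames are available in a plain text file (a hypothetical hosts.txt); it resolves each host with ordinary DNS lookups, clusters the hosts by IP, and draws a random sample of IPs as primary units. The file name, sample size, and error handling are illustrative, not part of the method as published.

```python
# Sketch of the first step of the Host-IP approach: resolve a list of hostnames,
# cluster them by IP address, and draw a random sample of IPs (primary units).
# File name and sample size are illustrative assumptions.
import random
import socket
from collections import defaultdict

def resolve_hosts(hostnames):
    """Map each IP address to the set of known hostnames resolving to it."""
    clusters = defaultdict(set)
    for host in hostnames:
        try:
            _, _, ips = socket.gethostbyname_ex(host)
        except socket.gaierror:
            continue  # unresolvable host: drop the host-IP pair as invalid
        for ip in ips:
            clusters[ip].add(host)  # a multi-IP host is counted for each of its IPs
    return clusters

if __name__ == "__main__":
    with open("hosts.txt") as f:  # hypothetical input listing of hostnames
        hostnames = [line.strip() for line in f if line.strip()]
    clusters = resolve_hosts(hostnames)
    sampled_ips = random.sample(sorted(clusters), k=min(100, len(clusters)))
    for ip in sampled_ips:
        # every secondary unit (host) in a sampled cluster is analyzed, i.e. crawled
        # to a predefined depth and checked for search interfaces (cf. [7])
        print(ip, sorted(clusters[ip]))
```

A host that resolves to several IPs simply appears in several clusters, which matches the counting convention of footnote 3.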

IV. EXPERIMENTS AND RESULTS

We ran our experiments in September 2006. We used two sets of hostnames from the datasets "Hostgraph" and "RUhosts" [8] and merged them into one listing of unique hostnames. We then resolved all the hosts to IP addresses and, after removing all host-IP pairs with invalid IPs, obtained 717,240 host-IP pairs formed by 672,058 unique hosts and 79,679 unique IP addresses. Excluding redundant host-IP pairs left us with 672,058 hosts on 78,736 IP addresses. In this way, our compiled dataset gives yet another indication of the magnitude of virtual hosting: there are, on average, nine virtual hosts per IP address3. The degree of IP address sharing for our dataset is discussed in [8]; in particular, 55.6% (398,608) of all hosts in the dataset share their IPs with at least 200 other hosts.

Next we clustered the hosts by their IPs and, in this manner, obtained 78,736 groups of hosts, each containing from one to thousands of hosts. It is natural to assume that deep web sites are more likely to be found within host groups of certain sizes, i.e., it might be beneficial to study groups with a few hosts separately from groups including hundreds of hosts. The idea behind such a separation lies in the fact that IP addresses associated with a large number of hosts are good indicators of servers hosting spam web sites and, hence, deep web sites are less likely to be found when analyzing such IPs. Another reason to stratify was to verify whether deep web sites are indeed served by servers hosting only a few sites. We decided to form three strata and chose the number of hosts in a group as the stratification criterion. Stratum 1 includes those IP addresses that are each associated with seven or fewer hostnames, while Stratum 3 combines host groups with no fewer than 41 hosts each. The criterion values, namely 7 and 41, were chosen to make Stratum 1 contain 90% of all IP addresses and to put 70% of all hosts into Stratum 3.
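The grouping rule behind the three strata can be written down directly; the sketch below assumes the host-IP clusters produced by the resolving step and uses the thresholds 7 and 41 from the text (the boundary of Stratum 2, 8 to 40 hosts per IP, is implied rather than stated).

```python
# Sketch of the stratification step: assign each IP cluster to one of three strata
# by the number of hosts sharing the IP (thresholds 7 and 41, as in the text).
from collections import defaultdict

def stratify(clusters):
    """clusters: dict mapping IP -> iterable of hostnames sharing that IP."""
    strata = defaultdict(list)
    for ip, hosts in clusters.items():
        size = len(hosts)
        if size <= 7:
            strata[1].append(ip)   # Stratum 1: seven or fewer hosts per IP
        elif size <= 40:
            strata[2].append(ip)   # Stratum 2: between 8 and 40 hosts per IP (implied)
        else:
            strata[3].append(ip)   # Stratum 3: 41 or more hosts per IP
    return strata

# Toy example (real clusters would come from the resolving step):
toy = {"10.0.0.1": ["a.ru"], "10.0.0.2": ["b%d.ru" % i for i in range(50)]}
print({s: len(ips) for s, ips in stratify(toy).items()})
```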

2 We identified search interfaces to online databases using the framework described in [7].
3 A host resolved to two or more IPs is counted for each corresponding IP.
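Search interfaces were identified with the framework of [7] (footnote 2); purely as a stand-in illustration of the per-page check, the following naive heuristic flags a page that contains a form with a text or search input. It is far cruder than an actual search interface classifier and would produce many false positives (e.g., login forms).

```python
# Naive stand-in for search interface detection: flag pages containing a form
# with at least one text input field. The real survey used the framework of [7];
# this heuristic only illustrates the kind of check performed for each crawled page.
from html.parser import HTMLParser

class FormSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_candidate_interface = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag == "input":
            if (attrs.get("type") or "text").lower() in ("text", "search"):
                self.has_candidate_interface = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

def looks_like_search_interface(html):
    sniffer = FormSniffer()
    sniffer.feed(html)
    return sniffer.has_candidate_interface

print(looks_like_search_interface('<form action="/s"><input type="text" name="q"></form>'))
```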


Table I
HOST-IP CLUSTERING METHOD: APPROXIMATE 95% CONFIDENCE INTERVALS FOR THE TOTAL NUMBERS OF DEEP WEB SITES AND WEB DATABASES IN EACH STRATUM AND IN THE ENTIRE SURVEY.

                                    Num of all                Num of Russian            Num of (after unequal
                                                                                        probability reduction)
Strata                              deep web    web           deep web    web           deep web    web
                                    sites       databases     sites       databases     sites       databases
Stratum 1:
  - Detected in sample              80          131           72          106           61.2        86.7
  - Confidence interval, [10^3]     6.0±1.4     9.9±3.6       5.4±1.3     8.0±2.8       4.6±1.2     6.6±2.1
Stratum 2:
  - Detected in sample              38          46            38          46            36.1        44.1
  - Confidence interval, [10^3]     2.1±0.6     2.5±1.0       2.1±0.6     2.5±1.0       2.0±0.6     2.4±1.0
Stratum 3:
  - Detected in sample              64          87            55          68            51.2        62.6
  - Confidence interval, [10^3]     9.6±3.1     13.0±5.2      8.2±3.3     10.2±3.1      7.6±3.3     9.3±3.4
Survey total:
  - Confidence interval, [10^3]     17.7±3.4    25.4±6.2      15.7±3.5    20.7±4.1      14.2±3.5    18.3±4.0

Depending on the stratum, we included 0.6-1.9% of each stratum's IP addresses in a sample and analyzed every IP from our three samples (1075 IPs in total). The estimates for the total numbers of deep web sites and databases and their corresponding confidence intervals were calculated according to the formulas given in [8]. The final results are presented in Table I. The 'Num of all' column shows the numbers of deep web sites and web databases that were actually detected in the strata. However, not all of them appeared to be Russian deep web sites. In particular, several sampled hosts in .RU were in fact redirects to non-Russian deep web resources. Another noticeable example in this category is xxx.itep.ru, which is one of the aliases of the Russian mirror of the essentially international open e-print archive arXiv.org. We excluded all such non-Russian resources and put the updated numbers in the 'Num of Russian' column. Next we examined whether each detected deep web site is accessible via hosts on IPs other than the sampled IP. If so, a corresponding weight has to be assigned to such a deep web resource [8]. As a result, we assigned such weights to a certain number of deep web sites and databases and aggregated the numbers in the 'Num of (after unequal probability reduction)' column of Table I.

The survey results, i.e., the overall numbers of deep web sites and web databases in the Russian segment of the Web as of September 2006 as estimated by the Host-IP clustering method, are 14,200±3,500 and 18,300±4,000, respectively (see the bottom row of Table I).
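The exact estimators and variance formulas behind Table I are those given in [8]; as a rough, textbook-style sketch of how counts in sampled IP clusters extrapolate to a stratum total with an approximate 95% confidence interval, one can use the standard expansion estimator below. The stratum size and per-cluster counts in the example are placeholders, not the survey's raw data.

```python
# Textbook-style stratified cluster-sampling estimate (a sketch, not the exact
# formulas of [8]): in each stratum, scale the mean count of deep web sites found
# per sampled IP cluster by the number of clusters in the stratum.
import math

def stratum_estimate(N_h, sampled_counts):
    """N_h: total number of IP clusters in the stratum;
    sampled_counts: number of deep web sites found in each sampled cluster."""
    n_h = len(sampled_counts)
    mean = sum(sampled_counts) / n_h
    total = N_h * mean  # expansion estimator for the stratum total
    # sample variance of per-cluster counts, with finite population correction
    var = sum((c - mean) ** 2 for c in sampled_counts) / (n_h - 1)
    half_width = 1.96 * N_h * math.sqrt((1 - n_h / N_h) * var / n_h)
    return total, half_width  # estimate and approximate 95% CI half-width

# Placeholder example: 1,000 sampled clusters out of 70,000 in a stratum,
# with 80 deep web sites found across the sampled clusters.
counts = [1] * 80 + [0] * 920
print(stratum_estimate(70_000, counts))  # about (5600.0, 1170), i.e. roughly 5,600 ± 1,200
```

Stratum totals and their variances would then be combined across strata to obtain the survey-wide estimate; the weighting of sites reachable through several IPs (the unequal probability reduction) is an additional refinement described in [8].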

V. CONCLUSION

We described a new sampling strategy, Host-IP clustering sampling, for characterizing the deep Web. The Host-IP clustering sampling technique allows us to address the drawbacks of previous deep web surveys, specifically to take into account the virtual hosting factor. Finally, we conducted a survey of the Russian deep Web and estimated, as of September 2006, the overall number of deep web sites in the Russian segment of the Web as 14,200±3,500 and the overall number of web databases as 18,300±4,000.

ACKNOWLEDGMENT

We would like to thank Yandex LLC, a Russian search engine company, for providing us with the Hostgraph dataset.

REFERENCES

[1] D. Shestakov, "Deep Web: databases on the Web," in Handbook of Research on Innovations in Database Technologies and Applications, IGI Global, 2009.
[2] M. Bergman, "The deep Web: surfacing hidden value," Journal of Electronic Publishing, vol. 7, no. 1, 2001.
[3] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang, "Accessing the deep Web," Commun. ACM, vol. 50, no. 5, pp. 94–101, 2007.
[4] D. Shestakov and T. Salakoski, "On estimating the scale of national deep Web," in Proceedings of DEXA'07, 2007, pp. 780–789.
[5] K. Bharat and A. Broder, "A technique for measuring the relative size and overlap of public web search engines," Comput. Netw. ISDN Syst., vol. 30, no. 1-7, pp. 379–388, 1998.
[6] E. T. O'Neill, P. D. McClain, and B. F. Lavoie, "A methodology for sampling the World Wide Web," Annual Review of OCLC Research 1997, 1997.


[7] D. Shestakov, "On building a search interface discovery system," in Proceedings of RED'09, 2009.
[8] D. Shestakov and T. Salakoski, "Characterization of national deep Web," Turku Centre for Computer Science, Tech. Rep. 892, May 2008.

