Observing spatial and temporal metadata in hostnames 1. Introduction

Paraskevas V. Lekeas

Journal of Information & Knowledge Management, Vol. 3, No. 1 (2004) 97-106, ISSN: 0219-6492.

Observing spatial and temporal metadata in hostnames Paraskevas V. Lekeas* National Technical University of Athens Thessalias 4, Zographou, 15772, Greece

Abstract. In this work we present evidence that spatial and temporal metadata can be mined from hostnames. Hostnames map to IPs, and the set of IP addresses belonging to a specific geographic region forms an IP map. If we choose a geographic region and obtain its IP map, we can study the hostname representation of the map and mine spatial and temporal information (metadata) about the entities that use these hostnames. In this work we choose the United Kingdom as the region of interest and take a random sample of its IP map. Using the hostname representation of this sample we compute the spatial distribution of the infrastructure of a British ISP and a snapshot of its Internet traffic. We also compute the rate of growth of another British ISP by introducing time into the hostname representation with the help of a partial order. Finally, we conclude that this spatial and temporal information can serve as an independent source for the evaluation of these companies.

Keywords: IP sampling, Internet Service Providers, Knowledge discovery.

1. Introduction

We introduce this work with the following motivating example concerning Internet Service Providers (ISPs). ISPs are companies that offer various kinds of internet services (dial-up connections, voice over IP, ADSL etc.) and usually spread their infrastructure over the geographic regions of countries. Suppose now that someone would like to draw a map showing how the infrastructure of an ISP is geographically distributed. How is such a map going to be constructed? One may ask why such a map is needed; some obvious answers are to visualize the coverage of the ISP in the territory and to compare ISPs in terms of their coverage, also estimating the size of their infrastructure. Another question that may be posed is "how does the infrastructure of an ISP evolve in time?". An answer to this question can inform us about the growth rate of the ISP and help us compare two ISPs in terms of their economic development. In this work we try to answer all of these questions by processing representative samples of hostnames that belong to ISPs. These samples are drawn from parts of the IPv4 address space called country code Top Level Domains (ccTLDs).

The main idea is the following. The Internet is a heterogeneous collection of interconnected networks that use IP addresses to route information. The set of all these addresses forms the IPv4 (version 4) address space, and this space can be clustered into smaller parts that can be viewed as IP maps. Each IP map refers to a geographic region and contains the IP addresses that network entities (e.g. an ISP, an Autonomous System) use inside that region. As we said, the infrastructure of an ISP may spread over the geographic region and may change over time. This kind of information (spatial and temporal) is what we seek to extract from IP maps. Unfortunately we cannot extract such information directly, since IP addresses are just numbers: they neither indicate geographic location nor capture time events. What we can do instead is use another representation of the IP addresses that captures spatial and temporal information. In this work we show that this can be done using the hostname representation of the IP addresses.

Related work on this problem deals with mapping IPs to the geographic locations of the hosts to which the IP addresses have been assigned. In Padmanabhan and Subramanian (2001) two techniques close to this work are presented. The first technique is called GeoCluster and builds location maps for subsets of the IPv4 address space assuming predetermined knowledge of the location of some hosts in the subset. The second is called GeoTrack and tries to infer the locations of hosts using the hostname representation. In Ziviani, Fdida, Rezende and Duarte (2002) hosts with a known geographic location are distributed and host locations are provided by network delay measurements.
In Moore, Periakaruppan, Donohoe and Claffy (2000) and in IP2LL the NetGeo and IP2LL tools, respectively, are presented, which query whois servers to infer the geographic locations of hosts. Although all of the previous techniques and tools try to solve the problem of finding the geographic location of a given IP address, our work differs in the following sense: our objective is not just to map IPs to hosts but to infer spatial and temporal information for groups of hosts (e.g. an ISP) using their IPs and hostnames.


The rest of the paper is structured as follows. In section 2 we give some terminology from computer networks; readers familiar with it may skip that section. In section 3 we present the methodology: we choose a geographic region, construct its IP map, and use a sampler to get a subset (dataset) of this map that consists of hostnames. In section 4 we present how the dataset is clustered, and in section 5 we mine spatial and temporal information from two of its clusters. More specifically, we calculate the spatial distribution of the infrastructure of an ISP and a snapshot of its internet traffic. We also compute the growth of another ISP over a six-year period. In section 6 we give the conclusions, and finally we give an appendix with the constructed spatial map of the ISP's infrastructure and internet traffic.

2. Some basics from Computer Networks

In this section we present some definitions and terminology from the theory of Computer Networks that are necessary for the rest of the paper. Familiar readers may skip this section.

In the theory of Computer Networks (Peterson & Davie, 1996) we refer to the networks that are interconnected using the TCP/IP protocol as the "Internet". A host in the Internet is a node in the network that either serves users or runs application programs. In order for a set of hosts to communicate with each other, each one must know which of the other nodes it wants to communicate with. This means that each node must have a unique address. This unique address is called an IP address (referencing the Internet Protocol service) and consists of two parts: a network part that specifies the network to which the host is attached and a host part that identifies the host uniquely in that network. All of these unique addresses compose the so-called IPv4 (version 4) address space. Since the current Internet version (4) uses 32-bit words, the total number of IP addresses is $2^{32}$. Each IP address is written as four decimal integers separated by dots. Each integer belongs to the range [0, ..., 255], so the IPv4 address space starts at IP 0.0.0.0 and ends at IP 255.255.255.255.

The IPv4 address space is administered worldwide by IANA (IANA). IANA assigns addresses according to geographic regions, and the earth has been divided into four regions. In each region there is an organization responsible for the IP addresses. These organizations are called RIRs (Regional Internet Registries) and are: APNIC (Asia Pacific Region), ARIN (America and Southern Africa), LACNIC (Latin America and some Caribbean islands) and RIPE NCC (Europe and surrounding areas) (APNIC; ARIN; LACNIC; RIPE NCC). Each RIR holds IP allocation maps that tell us which IP addresses have been further assigned to internet entities (such as LIRs, Local Internet Registries, or ASs, Autonomous Systems) that belong to the region the RIR is responsible for. For example, RIPE NCC holds some of its maps at ftp://ftp.ripe.net/ripe/stats.

We have seen so far that IPs are used to identify hosts, but since they are numbers they are not user friendly. The solution is to map these addresses to unique names, so that each host has a unique hostname. Since hostnames are represented by alphanumeric characters they can be mnemonic and so easier for humans to remember (we will see later that this representation is important for our spatial and temporal observations). All the possible hostnames constitute the Internet name space, which is administered by the Domain Name System (DNS): a hierarchical name space that has domain names for each country (called country code Top Level Domains, ccTLDs for short, e.g. .fr for France, .uk for the United Kingdom etc.) plus the big six domains .edu, .mil, .org, .gov, .com, .net. From the above we can see that humans prefer to work in the Internet name space using hostnames, machines prefer to work in the IPv4 address space using IPs, and between these two spaces there exists a resolution mechanism that resolves IPs to hostnames and vice versa (footnote 1).
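The equivalence between the dotted notation and the 32-bit view of an address is used repeatedly in later sections, so a small sketch may help. It relies on Python's standard `ipaddress` module; the helper names `ip_to_int` and `int_to_ip` are introduced here for illustration only and are not part of the original study.

```python
import ipaddress

def ip_to_int(ip: str) -> int:
    """Return the 32-bit integer value of a dotted IPv4 address."""
    return int(ipaddress.IPv4Address(ip))

def int_to_ip(value: int) -> str:
    """Return the dotted notation of a 32-bit integer."""
    return str(ipaddress.IPv4Address(value))

# The address space runs from 0.0.0.0 to 255.255.255.255, i.e. 2**32 values.
lowest = ip_to_int("0.0.0.0")
highest = ip_to_int("255.255.255.255")
space_size = highest - lowest + 1
```

Because addresses are plain integers in this view, they can be compared and subtracted, which is what the ordering argument of section 5.2 depends on.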

3. A method for obtaining hostname samples

In this section we present a methodology for obtaining representative samples of hostnames from geographic regions. The first step is to decide which region we want to study and the second step is to construct the IP map of that region. The third step is to sample that map and the last step is to convert the sample to its hostname representation. After completing all of these steps we have a dataset that contains hostnames and we are ready to process it in order to extract spatial and temporal information.

3.1 The geographic region and the corresponding IP map

As we said, the first step is the choice of the geographic region. Although this step may seem easy, there is a restriction that has to do with how Regional Internet Registries assign IP addresses. RIRs usually assign addresses to network entities that establish activities in countries inside the RIR's region of authority. This restricts our choice of geographic region to countries, since IP allocations are organized according to the country to which they were assigned. The good part of this restriction is that it simplifies our method considerably. Suppose, for instance, that we take Greece as the geographic region of interest. Then, in order to construct the IP map, we have to check only one ccTLD (.gr) and find the assignments inside it.

Taking now the first step, let us choose the United Kingdom as the geographic region of interest. This choice is based on the high British rate of hosts per inhabitant according to Eurostat (Eurostat), which results in a dataset that is "rich" in hostnames. Other datasets that we collected (e.g. from Greece and Jordan) were not as rich in data as that of the United Kingdom.

For the second step we have to construct the IP map of the United Kingdom. For this we use an approximation (Paltridge, 1999) which says that, with a fair degree of confidence, we can correlate the assignments of internet hosts to countries according to ccTLDs. The ccTLD for the United Kingdom is .uk and the Regional Internet Registry in charge is RIPE. We contacted RIPE (RIPE NCC) and collected all the assignments that were made inside .uk until December 2002. The full IP map contains 21,771,520 IPs and can be accessed at http://users.ntua.gr/plekeas/Data_sets/metadata_in_hostnames (dataset). In Table 1 we can see a small part (the beginning and the end) of this map. The map is read as follows: each row gives the first IP of an allocation and the number of IPs the allocation contains. For example, the first row says that, starting from IP 62.3.64.0, 16384 = $2^{14}$ consecutive IPs have been assigned to some network entity in the United Kingdom.

Footnote 1: Commands that use this resolution mechanism are ping (DOS) and nslookup (UNIX).

Table 1. Small part of .uk IP map

First IP of allocation    Number of IPs
62.3.64.0                 16384
62.6.0.0                  65536
62.7.0.0                  65536
62.8.96.0                 8192
62.12.64.0                8192
62.13.128.0               8192
62.18.0.0                 131072
62.24.128.0               32768
62.25.0.0                 32768
62.25.128.0               32768
62.28.0.0                 65536
62.30.0.0                 65536
...                       ...
217.196.0.0               4096
217.196.224.0             4096
217.197.32.0              4096
217.197.192.0             4096
217.198.32.0              4096
217.199.160.0             4096
217.199.176.0             4096
217.204.0.0               262144
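To make the reading of Table 1 concrete, the following sketch treats the map as a list of (first IP, number of IPs) rows and looks up the n-th address of the concatenated allocations. Only the first three rows of Table 1 are included, and the names `IP_MAP`, `last_ip` and `nth_address` are hypothetical helpers, not part of the original tooling.

```python
import ipaddress

# First three rows of Table 1 as (first IP, number of IPs); an excerpt only.
IP_MAP = [("62.3.64.0", 16384), ("62.6.0.0", 65536), ("62.7.0.0", 65536)]

def last_ip(first_ip: str, count: int) -> str:
    """Last address of an allocation that starts at first_ip and holds count IPs."""
    return str(ipaddress.IPv4Address(first_ip) + (count - 1))

def nth_address(ip_map, n: int) -> str:
    """Return the n-th IP (1-based) when all allocations are laid end to end."""
    for first_ip, count in ip_map:
        if n <= count:
            return str(ipaddress.IPv4Address(first_ip) + (n - 1))
        n -= count  # skip this allocation and keep counting in the next one
    raise IndexError("n exceeds the number of IPs in the map")

total = sum(count for _, count in IP_MAP)
```

Numbering the map in this way is what allows a random integer between 1 and the map size to select one concrete IP address.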

3.2 Sampling the map

Having defined the geographic region and its IP map, we move to the third step, which is drawing a representative sample from this map. This step is very important since we cannot process the whole map, which contains 21,771,520 IPs. Instead we use a sampling technique to draw a representative sample and then extrapolate to the whole population. We use the simple random sampling technique (Cochran, 1977) to draw a 0.1% sample of the map as follows: using a random number generator (Press, Teukolsky, Vetterling, & Flannery, 2002 and Park & Miller, 1988) we produce 21,772 random real numbers in the interval (0, 1).

1. Input sample size, S;
2. Output S random numbers;
3. Input numbered list of IP map, L;
4. Transform S to the specific L range, S';
5. Select from L the S' numbers and produce a presample, P;
6. Filter P excluding IPs with port 80 closed and Output data set, D;

Fig. 1. The Sampling Algorithm

By applying a linear transformation we convert them to 21,772 random natural numbers in the range (1, 21771520). Using these random numbers we draw the sample from the map. In Fig. 1 we can see the algorithm that samples the IP map. This algorithm was implemented in C (the random number generator) and in Java (the rest of the sampler).
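The first steps of the sampling algorithm can be sketched as follows. This is only an illustration: it substitutes Python's built-in `random` module for the C generator used in the study, it leaves out the port 80 filter of step 6, and the function name `draw_indices` and its `seed` parameter are assumptions made for the example.

```python
import random

def draw_indices(population_size: int, fraction: float, seed=None):
    """Draw round(fraction * population_size) positions in [1, population_size].

    Each uniform number u in [0, 1) is mapped linearly to an integer index,
    mirroring the transformation from (0, 1) to the range of the numbered map.
    """
    rng = random.Random(seed)
    sample_size = round(population_size * fraction)
    indices = []
    for _ in range(sample_size):
        u = rng.random()
        # Linear transformation; min() guards against floating-point round-up.
        indices.append(min(int(u * population_size) + 1, population_size))
    return indices

indices = draw_indices(21_771_520, 0.001, seed=1)
```

Indices drawn this way may repeat; a sampler that must avoid duplicates could use `random.sample(range(1, population_size + 1), sample_size)` instead.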

3.3 Converting the sample to hostname representation

The last step of the method is to take the sample and keep only those IPs that resolve to hostnames. Using the ping command we queried all the IPs in the sample for their hostnames. IPs that did not return a hostname were excluded from the final sample, which we call the "dataset".
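A rough sketch of this conversion step is given below. It is not the procedure used in the study, which relied on the ping command: here the lookup is done with `socket.gethostbyaddr` from the Python standard library, and the names `resolve` and `to_dataset` are hypothetical. The resolver is passed in as a parameter so the filtering logic can be tried without network access.

```python
import socket

def resolve(ip: str):
    """Return the hostname for ip via a reverse lookup, or None if it fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def to_dataset(ips, resolver=resolve):
    """Keep only the hostnames of the IPs that resolve."""
    hostnames = (resolver(ip) for ip in ips)
    return [name for name in hostnames if name]

# Offline illustration with a stub resolver; the pairing below is made up.
stub = {"192.0.2.1": "host-a.example.net"}.get
dataset = to_dataset(["192.0.2.1", "192.0.2.2"], resolver=stub)
```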

4. The dataset

In this section we present the dataset that we collected using the methodology of section 3. Due to its large size the dataset cannot be shown in a list or table, but it can be accessed at http://users.ntua.gr/plekeas/Data_sets/metadata_in_hostnames. Table 2 presents a small part of it.

Table 2. Small part of the dataset

...
modem-3575.orangutan.dialup.pol.co.uk
modem-3719.hyena.dialup.pol.co.uk
modem-972.barrelled.dialup.pol.co.uk
ns.ecat-tech.com
ns2.avantdns.co.uk
ns2.qtec.uk.net
ntfm382-10.facility.pipex.com
pc1-cmbg3-5-cust115.cmbg.cable.ntl.com
pc1-papw1-6-cust77.cmbg.cable.ntl.com
80-194-58-94.cable.ubr08.bf.blueyonder.co.uk
adsl.r-t-f-m.co.uk
213.219.21.165
213.249.159.75
217.13.131.167
www.heavens-above.org
www.jacksonspower.co.uk
www.kayto.co.uk
www.learning-opportunities.co.uk
fp03-347.web.dircon.net
ftp.dip.co.uk
host213-1-133-95.in-addr.btopenworld.com
...

From this table we can see that the hostnames inside the dataset form clusters and that in each cluster information is encoded in the same manner. In general there exist two kinds of clusters. The first kind contains hostnames that begin with "www" (see Table 3).

Table 3. Part of the www cluster

www.focus21.co.uk
www.continentalresources.org
www.easyshopping4u.co.uk
www.oouid.man.ac.uk
www.sundogs.co.uk
www.patricksofcamelon.com
www.quays.co.uk
www.issltech.net
www.vinesgroup.co.uk

These hostnames are base URLs and contain no spatial information because, according to Svensson (1998), URLs do not give a physical location but a location in hyperspace (the web), which is not spatial in the sense of bodily experience. All these hostnames form one big cluster (the "www" cluster), which is not useful to us since we cannot extract spatial and temporal information from it. The second kind contains hostnames that are not URLs. These hostnames form various clusters, and in each cluster all the hostnames have the same "tail". For example the hostnames pc3hudd3-3-cust138.hudd.cable.ntl.com and pc1bagu4-3-cust54.mant.cable.ntl.com share the same tail "cable.ntl.com". Each tail gives us information about the owner of each hostname (e.g. an ISP, an institution, a bank etc.). As we will see in the next section, these clusters prove useful since we can extract spatial and temporal metadata from them.
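One simple way to form such clusters is sketched below. It is a heuristic written for illustration, not the clustering procedure of the study: it assumes that a non-www hostname has the form A.B.tail, so the tail is whatever follows the first two labels. The names `tail_of`, `cluster_by_tail` and the parameter `head_labels` are hypothetical.

```python
from collections import defaultdict

def tail_of(hostname: str, head_labels: int = 2) -> str:
    """Return the cluster key of a hostname.

    Hostnames starting with the label "www" go to the "www" cluster; for all
    others the first head_labels labels are dropped and the rest is the tail.
    """
    labels = hostname.lower().split(".")
    if labels[0] == "www":
        return "www"
    return ".".join(labels[head_labels:])

def cluster_by_tail(hostnames):
    """Group hostnames by their tail."""
    clusters = defaultdict(list)
    for name in hostnames:
        clusters[tail_of(name)].append(name)
    return dict(clusters)

sample = [
    "pc1-cmbg3-5-cust115.cmbg.cable.ntl.com",
    "pc1-papw1-6-cust77.cmbg.cable.ntl.com",
    "modem-3575.orangutan.dialup.pol.co.uk",
    "www.kayto.co.uk",
]
clusters = cluster_by_tail(sample)
```

The fixed split is crude (for a short name such as ns2.qtec.uk.net it leaves only "uk.net"), so a real clustering would probably need a rule that adapts to the depth of each domain.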

5. Knowledge discovery from the dataset

So far we applied our methodology and produced a representative dataset of hostnames from .uk ccTLD. We also saw that in this dataset clusters of hostnames are formed. In this section we will examine two of these clusters extracting spatial and temporal information. From the first cluster we are going to extract spatial information about the infrastructure and the internet traffic of a British ISP. From the second cluster we are going to extract temporal information about the rate of growth of another British ISP.

5.1 Mining an Infrastructure and an Internet traffic spatial map of a British ISP

Let us enumerate all the clusters in the dataset (as we said, the dataset can be accessed at http://users.ntua.gr/plekeas/Data_sets/metadata_in_hostnames) and examine cluster 5. This cluster belongs to one of the biggest ISPs in the UK (NTL Internet). We can see a small part of cluster 5 in Table 4.

Table 4. Small part of cluster 5

...
pc2-whit1-3-cust241.cdif.cable.ntl.com
pc3-hitc1-6-cust172.lutn.cable.ntl.com
pc2-cmbg4-5-cust140.cmbg.cable.ntl.com
...

From this table we can see that each hostname can be written in the form A.B.tail. For example the hostname pc2-whit1-3-cust241.cdif.cable.ntl.com has A = pc2-whit1-3-cust241, B = cdif and tail = cable.ntl.com. Inspection of the B parts of cluster 5 shows that they encode British city names. For example cdif encodes Cardiff, lutn encodes Luton, cmbg encodes Cambridge etc. To check this conjecture we compared all B parts of cluster 5 with the first three letters of British city names as these are stated in the 2001 census of population of England and Wales (UK's National Statistics online, 2001). We consider it a match if the first three letters of the B part coincide exactly with the first three letters of a British city (if two or more candidate cities exist we take the one with the greater population according to the census).

Table 5. Comparing the first 3 letters of B parts in cluster 5 (X = no match)

First 3 letters    match            match              matches in
of B segment       (Census 2001)    (Ripe database)    cluster 5
blf                X                Belfast            3
brh                X                Birmingham         5
bro                Bromley          Bromley            1
cdi                X                Cardiff            5
cmb                X                Cambridge          6
col                Colchester       Colchester         2
glf                X                Guildford          1
hud                X                Huddersfield       1
lds                X                Leeds              4
lut                Luton            Luton              5
man                Manchester       Baguley            1
mid                Middlesbrough    Middlesbrough      7
not                Nottingham       Nottingham         18
nrt                X                Northampton        2
oxf                Oxford           Oxford             3
pop                X                Waltham Park       1
ren                X                Renfrew            9
swa                Swansea          Swansea            3
win                Windsor          Brentford          2
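The matching rule described above (exact agreement on the first three letters, with ties broken by population) can be sketched as follows. The city list and its population figures are placeholders chosen for the example and are not census data; the function name `match_city` is likewise hypothetical.

```python
def match_city(b_part: str, cities: dict):
    """Return the city whose first three letters equal those of b_part.

    cities maps a city name to its population. When several cities share the
    prefix, the most populous one wins. Returns None when nothing matches.
    """
    prefix = b_part[:3].lower()
    candidates = [
        (population, name)
        for name, population in cities.items()
        if name[:3].lower() == prefix
    ]
    return max(candidates)[1] if candidates else None

# Placeholder populations, only meant to exercise the tie-break.
CITIES = {"Luton": 300, "Manchester": 500, "Mansfield": 100, "Oxford": 200}

luton = match_city("lutn", CITIES)
tie_break = match_city("mant", CITIES)
no_match = match_city("cdif", CITIES)
```

With this rule a code such as cdif finds no city starting with "cdi", which is the kind of case marked X in the census column of Table 5.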

To validate our results we also queried the RIPE whois database (RIPE NCC) with the resolved IP of each hostname. The RIPE whois database contains information about IP address space allocations, routing policies etc. in the RIPE region and beyond. Records in this database are known as objects and we are interested in two specific objects: "inetnum" (contains information about assignments of IPv4 address space) and "role" (contains information about technical or administrative contacts, but also describes a role performed by one or more human beings). These objects may sometimes contain spatial information that identifies the holder of the assignments (in our case the ISP). In Table 5 we summarize our results. There we can see that there is a 43% match with the census and a 62% match with RIPE.

These numbers can help us estimate a spatial pattern that maps the geographic spread of the ISP's infrastructure in the UK, since we can think of the hostnames of cluster 5 as routers (infrastructure). The map can also help us measure the ISP's internet traffic, since in every A part of the hostnames "-cust" can be interpreted as "customer". Together these observations allow us to construct an infrastructure and internet traffic spatial map of the ISP. This map can be seen in Fig. 2 (Appendix). The circle size represents the number of appearances of encoded UK city names in the hostnames of cluster 5. The circle size can also represent the internet traffic of the specific ISP in the UK geographic region during the period of our data collection (Dec. 2002). Fig. 2 was created using the coastline libraries of the Generic Mapping Tools (GMT), and the centers of the circles are the real latitudes and longitudes of the specific UK cities.

5.2 Detecting the growth rate of an ISP

In section 5.1 we presented a case where spatial data were mined. In this section we try to do the same with temporal data. In almost all the clusters of our dataset the hostnames contain codes with an attached number that increases, which captures some enumeration of these objects. For example, take the hostnames modem-2266.giraffe.dialup.pol.co.uk and modem-785.giraffe.dialup.pol.co.uk from cluster 3. Using common-sense reasoning we can see that this coding states that on server giraffe there exist two modems, named 785 and 2266. Other examples are the hostnames ppp-0-162.birm-a-2.access.uk.tiscali.com, ppp-1-189.birm-a-2.access.uk.tiscali.com and ppp-1-98.birm-a-1.access.uk.tiscali.com from cluster 7, or the hostnames web1113.pavilion.net, web151.pavilion.net and web411.pavilion.net from cluster 13.

Going back to the hostnames in cluster 3 (in this section we process cluster 3, which belongs to another large British ISP that appears in our dataset), we can assume that the information likely to change over time is the number of modems on a specific server (e.g. the number may increase due to higher demand or decrease due to some technical problem). Although the group of modems attached to a specific server may change its size over time, it has a specific location in physical space. This happens because wherever dial-up is offered systematically (e.g. at an Internet Service Provider) all the modems are located in a rack close to the server for easy inspection and maintenance.

But how can we introduce time in cluster 3? Our idea is to insert time with the help of a partial order: if two hostnames belong to the same server they can be compared with a relation $\le_t$. Let $C_3$ be the set of hostnames of cluster 3. $C_3$ is a partially ordered set under the relation $\le_t$ because it is a non-empty set and for all $a, b, c \in C_3$ the properties of reflexivity ($a \le_t a$), antisymmetry ($a \le_t b,\ b \le_t a \Rightarrow a = b$) and transitivity ($a \le_t b,\ b \le_t c \Rightarrow a \le_t c$) are satisfied. The three properties are satisfied because each hostname has an attached IP number that uniquely identifies it, and since IP numbers can be represented as integers they can be compared with the standard inequality relation ($\le$) on integers. The relation that we use ($\le_t$) differs from the standard inequality in the sense that it compares integers that capture time events (which is why we use the subscript $t$).

The idea behind all this is that as the ISP grows, more customers want to dial up and so the number of modems on the servers must increase (up to a limit, of course). More servers may also be brought into service in order to satisfy the higher demand. But since each user must have an IP when connecting, each new modem attached to a server gets a unique IP. As we stated in section 2, the ISP is an LIR and gets its IP addresses from RIPE. The ISP assigns IPs to the new modems by pulling addresses from its reservoir. The pull is not random; it follows an order that tries to minimise fragmentation of the ISP's bank of addresses. It is natural, when large numbers of IPs have to be assigned to large numbers of interfaces (modems here), to assign them statically.

For example, if 100 new modems get connected, a block of 100 consecutive IPs would probably map to these modems. This is exactly what we need, because with such an encoding the notion of time is preserved. For example modem-99, which is older than modem-100 (because it was named first), will have the IP that precedes or succeeds that of modem-100. In this sense we can compare hostnames and order them in time. In Table 6 we present cluster 3 with the corresponding IPs, server names and modem numbers. There we can see that some servers join the cluster with only one modem. This fact prevents $\le_t$ from being a total relation on $C_3$, because at least two elements of the same server are needed in order to order them in time.
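The comparison described above can be sketched in a few lines. This is only an illustration under the assumption that cluster 3 hostnames have the form modem-N.server.dialup.pol.co.uk; the function names `parse` and `leq_t` are hypothetical, and the (hostname, IP) pairs are built from the modem numbers and IPs listed in Table 6.

```python
import ipaddress

def parse(hostname: str):
    """Split 'modem-2266.giraffe.dialup.pol.co.uk' into (2266, 'giraffe')."""
    modem_label, server = hostname.split(".")[:2]
    return int(modem_label.split("-")[1]), server

def leq_t(a, b):
    """Compare two (hostname, ip) pairs under the time order.

    Returns True or False when both hostnames belong to the same server,
    and None when they are not comparable.
    """
    if parse(a[0])[1] != parse(b[0])[1]:
        return None
    return int(ipaddress.IPv4Address(a[1])) <= int(ipaddress.IPv4Address(b[1]))

older = ("modem-785.giraffe.dialup.pol.co.uk", "81.78.83.17")
newer = ("modem-2266.giraffe.dialup.pol.co.uk", "81.78.88.218")
other = ("modem-3575.orangutan.dialup.pol.co.uk", "217.135.237.247")

same_server = leq_t(older, newer)
different_server = leq_t(older, other)
```

Returning None for hostnames on different servers is what makes the order partial rather than total.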

Table 6. Servers, modems and corresponding IPs in cluster 3

Servers that appear with several modems (modem numbers and IPs are listed in the same order):

jaguar (IPs 81.76.x.y)
  modem #: 144, 645, 1080, 1102, 1450, 1603, 1689, 2125, 2539, 2669, 2756, 2778, 3192, 3606
  IPs: 176.144, 178.133, 180.56, 180.78, 181.170, 182.67, 182.153, 184.77, 185.235, 186.109, 186.196, 186.218, 188.120, 190.22

hyena (IPs 81.78.x.y)
  modem #: 214, 279, 475, 584, 1237, 1346, 1629, 1672, 1955, 2391, 2456, 2586, 3044, 3087, 3218, 3719, 3828
  IPs: 112.214, 113.23, 113.219, 114.72, 116.213, 117.66, 118.93, 118.136, 119.163, 121.87, 121.152, 122.26, 123.228, 124.15, 124.146, 126.135, 126.144

gorilla (IPs 81.78.x.y)
  modem #: 216, 303, 521, 543, 652, 761, 957, 1131, 1218, 1240, 1566, 1675, 2002, 2176, 2633, 3134, 3221, 3308, 3591, 3765, 4092
  IPs: 96.216, 97.47, 98.9, 98.31, 98.140, 98.249, 99.189, 100.107, 100.194, 100.216, 102.30, 102.139, 103.210, 104.128, 106.73, 108.62, 108.149, 108.236, 110.7, 110.181, 111.252

giraffe (IPs 81.78.x.y)
  modem #: 459, 698, 785, 1177, 1330, 1351, 1656, 1765, 1787, 1918, 2070, 2266, 2549, 2919, 2941, 3137, 3158, 3202, 3376, 3398, 3550, 4008, 4095
  IPs: 81.203, 82.186, 83.17, 84.153, 85.50, 85.71, 86.120, 86.229, 86.251, 87.126, 88.22, 88.218, 89.245, 91.103, 91.125, 92.65, 92.86, 92.130, 93.48, 93.70, 93.222, 95.168, 95.255

gazelle (IPs 81.78.x.y)
  modem #: 201, 397, 745, 897, 984, 1093, 1311, 1485, 2182, 2225, 2509, 2617, 2682, 3314, 3336, 3444, 3684, 3793, 3858, 3967
  IPs: 64.201, 65.141, 66.233, 67.129, 67.216, 68.69, 69.31, 69.205, 72.134, 72.177, 73.205, 74.57, 74.122, 76.242, 77.8, 77.116, 78.100, 78.209, 79.18, 79.127

Servers that appear with a single modem:

Name of server      modem #    Corresponding IP
alakazam            647        217.135.13.135
antelope            1833       217.134.23.41
argonath            47         62.136.124.47
barrelled           972        62.25.143.204
bluestreakdamsel    74         62.136.241.74
buffalo             1257       217.134.68.233
charizard           1149       217.135.73.125
cheetah             2013       217.134.103.221
clefairy            601        217.135.91.89
cougar              3099       217.134.236.27
dragonwrasse        115        62.137.1.115
elephant            3489       217.134.253.161
Elk                 1801       81.76.167.9
grommet             163        62.25.156.163
hodad               272        62.25.161.16
lemur               2483       217.135.137.179
leopard             2197       217.135.152.149
lion                3303       217.135.172.231
monkey              1293       217.135.213.13
new-mexico          170        62.137.82.170
orangutan           3575       217.135.237.247
parrotfish          27         62.137.46.155
porcupine           449        217.134.193.193
tiger               510        62.136.209.254
wolf                1176       81.76.132.152
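A property that can be checked directly on the rows of Table 6 is whether, within one server, the gap between two modem numbers equals the gap between the corresponding IPs. The sketch below tests this on a few jaguar entries; the helper names `ip_int` and `offsets_consistent` are assumptions made for the example.

```python
import ipaddress
from itertools import combinations

# A few (modem number -> IP) pairs of server jaguar, taken from Table 6.
JAGUAR = {
    144: "81.76.176.144",
    645: "81.76.178.133",
    1080: "81.76.180.56",
    3192: "81.76.188.120",
    3606: "81.76.190.22",
}

def ip_int(ip: str) -> int:
    """32-bit integer value of a dotted IPv4 address."""
    return int(ipaddress.IPv4Address(ip))

def offsets_consistent(modems: dict) -> bool:
    """True when |m_a - m_b| == |IP_a - IP_b| for every pair of entries."""
    return all(
        abs(ma - mb) == abs(ip_int(modems[ma]) - ip_int(modems[mb]))
        for ma, mb in combinations(modems, 2)
    )

jaguar_ok = offsets_consistent(JAGUAR)
```

When the check holds, modem numbers and addresses advance in lockstep, which is the pattern expected if consecutive IPs were handed out to consecutively numbered modems.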

By simple inspection of Table 6 we can see that for all elements $a, b \in C_3$ that can be compared (that is, that are attached to the same server) it holds that $a \le_t b \Rightarrow IP_a \le_t IP_b$, where $IP_x$ stands for the IP of hostname $x$. It also holds that for all comparable elements $a, b \in C_3$, $|m_a - m_b| = |IP_a - IP_b|$, where $m_x$ stands for the modem number of hostname $x$ and $|x|$ is the absolute value of $x$. This confirms our hypothesis that blocks of consecutive IPs are assigned to new modems attached to the same server.

This simple structure of the partially ordered set $C_3$ can be very helpful in our data mining. First of all, it helps us predict where to move in the address space in order to expand our dataset with members of the cluster of interest only. In $C_3$ we can define a server chain of the form $IP_{sn,k} \le_{t+k+1} IP_{sn,k+1} \le_{t+k+2} \dots \le_{t+k+m} IP_{sn,k+m}$, where $k, m \in \mathbb{N}$, $IP_{sn,k}$ stands for the IP of the $k$th modem of the server with name "sn", and $\le_{t+k+m}$ means that, starting from moment $t+k+0$ when we assigned $IP_{sn,k}$, we are now at moment $t+k+m$ assigning $IP_{sn,k+m}$. Knowing a small part of this chain we can find its whole range by finding its minimum and maximum elements. We take the maximum and minimum elements (with the same server name) that we have in our cluster, ping them to get their IPs and explore upwards and downwards, respectively, until the server name changes. The points where this happens are the edges of the chain and define the population of modems attached to the specific server. The minimum element of the chain can also be found easily by pinging the IP that lies $k$ addresses below $IP_{sn,k}$.

For example, the first server in Table 6 is jaguar and the chain that we have for it runs from 81.76.176.144 (moment $t+144$) through 81.76.178.133 ($t+645$), ..., 81.76.188.120 ($t+3192$) to 81.76.190.22 ($t+3606$). The start of the chain must be the IP that is 144 units smaller than 81.76.176.144. Indeed, pinging 81.76.176.0 gives us modem number 0 (we should be careful to use ping -a for each IP in order to resolve addresses to hostnames; a simple ping would probably result in a request time-out). Pinging 81.76.175.255 gives us a hostname with a different server name, which assures us that we have found the minimum of the chain. Moving upwards from 81.76.190.22 we found that the maximum element of the chain is at 81.76.191.255. So the full chain of modems for the server is $81.76.176.0 \le_{t+1} 81.76.176.1 \le_{t+2} \dots \le_{t+4094} 81.76.191.254 \le_{t+4095} 81.76.191.255$. We can see from this chain that the specific server (jaguar) has $2^{12}$ modems attached to it. Applying the same technique to each server that contributes more than one element to cluster 3, we can see that each of them has $2^{12}$ modems attached.

We can also apply this technique to servers that join our dataset with only one element, and form their chains by calculating their minimum and maximum elements. For example, the cougar server appears in our dataset with only one IP (217.134.236.27), with corresponding modem number 3099. Pinging 217.134.224.0, which is 3099 IPs below our element, we get modem number 0, which indicates the beginning of the chain. For the upper bound of the chain we get 217.134.239.255. Reconstructing the chains of all the servers in Table 6, together with their attached modems, gives us Table 7.

Table 7. Servers, modems and corresponding IPs in cluster 3

server name

Number of IP of first modem IP of last modem modems

antelope porcupine cougar elephant buffalo cheetah alakazam lemur leopard lion llama monkey orangutan charizard clefairy argonath tiger bluestreakdamsel dragonwrasse parrotfish new-mexico barrelled grommet hodad wolf elk jaguar hyena gazelle giraffe gorilla

217.134.16.0 217.134.192.0 217.134.224.0 217.134.240.0 217.134.64.0 217.134.96.0 217.135.11.0 217.135.128.0 217.135.144.0 217.135.160.0 217.135.176.0 217.135.208.0 217.135.224.0 217.135.69.0 217.135.89.0 62.136.124.1 62.136.208.0

217.134.31.255 (/) 4096 217.134.208.255 4096 217.134.239.255 4096 217.134.255.255 4096 217.134.78.255 4096 217.134.111.255 4096 217.135.15.255 1280 217.135.143.255 4096 217.135.159.255 4096 217.135.175.255 4096 217.135.191.255 4096 212.135.223.255 4096 217.135.239.255 4096 217.135.73.255 1280 217.135.93.255 (/) 1280 62.136.124.126 (*) 126 62.136.223.255 4096

62.136.241.1

62.136.241.254 (*) 254

62.137.1.1 62.137.46.129 62.137.82.1 62.25.140.0 62.25.156.0 62.25.160.0 81.76.128.0 81.76.160.0 81.76.176.0 81.78.112.0 81.78.64.0 81.78.80.0 81.78.96.0

62.137.1.254 62.137.46.254 62.137.82.254 62.25.143.255 62.25.159.255 62.25.164.255 81.76.143.255 81.76.175.255 81.76.191.255 81.78.127.255 81.78.79.255 80.78.95.255 81.78.111.255

total number of modems

(/) 52992

(-) 254 126 (-) 254 (+) 1024 1024 (+) 1024 (^) 4096 4096 4096 4096 4096 4096 (^) 4096

(*) 4476

(-) 634

(+) 3072

(^) 28672
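The chain reconstruction described above reduces to integer arithmetic on IPv4 addresses. The following is a minimal sketch using Python's standard `ipaddress` module. The helper names `chain_start` and `chain_length` are illustrative and do not come from the original method, and the upper bound of each chain is taken from the probing results reported in the text rather than computed.

```python
from ipaddress import IPv4Address


def chain_start(ip: str, modem_number: int) -> IPv4Address:
    """Return the address of modem 0, assuming modem numbers grow by one per address."""
    return IPv4Address(ip) - modem_number


def chain_length(first: str, last: str) -> int:
    """Count the addresses in the closed range [first, last]."""
    return int(IPv4Address(last)) - int(IPv4Address(first)) + 1


# cougar: observed once, at 217.134.236.27 with modem number 3099.
print(chain_start("217.134.236.27", 3099))  # 217.134.224.0

# jaguar: chain bounds as found by probing in the text.
print(chain_length("81.76.176.0", "81.76.191.255"))  # 4096
```

The same two helpers can be applied to any row of Table 7 whose bounds are known, which gives a quick way to cross-check the reported modem counts against the address ranges.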

So far our hypothesis was that only elements that belong to the same server can be compared with $\le_t$. Would it be correct to allow comparisons between elements from different servers? In general it is not, because RIPE assigns addresses to LIRs according to their needs and also according to RIPE's available IPs. Table 8 shows the complete IP map that the ISP received from RIPE (http://www.ripe.net/ripencc/mem-services/general/allocs4.html).
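The restriction to a single server can be made concrete with a small comparability check. This is only a sketch under assumptions: the functions `server_of` and `precedes` are hypothetical names, the chain bounds are copied from three rows of Table 7, and within one chain the order is assumed to follow the numeric order of the addresses.

```python
from ipaddress import IPv4Address
from typing import Optional

# Chain bounds for three servers, copied from Table 7.
CHAINS = {
    "lion": ("217.135.160.0", "217.135.175.255"),
    "elk": ("81.76.160.0", "81.76.175.255"),
    "jaguar": ("81.76.176.0", "81.76.191.255"),
}


def server_of(ip: str) -> Optional[str]:
    """Return the server whose chain contains ip, or None if no chain does."""
    addr = IPv4Address(ip)
    for server, (first, last) in CHAINS.items():
        if IPv4Address(first) <= addr <= IPv4Address(last):
            return server
    return None


def precedes(a: str, b: str) -> Optional[bool]:
    """Return whether a precedes b within one chain, or None if they are incomparable."""
    server_a, server_b = server_of(a), server_of(b)
    if server_a is None or server_a != server_b:
        return None
    return IPv4Address(a) <= IPv4Address(b)


print(precedes("81.76.176.0", "81.76.190.22"))   # True: both belong to jaguar
print(precedes("81.76.160.0", "217.135.160.0"))  # None: elk and lion are incomparable
```

Returning `None` for elements of different servers, instead of falling back to numeric order, is what keeps the relation a partial order rather than a total one.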


Table 8. The complete IP map of cluster 3's ISP

| Date of assignment (YYYYMMDD) | IPs assigned | Number of modems |
|---|---|---|
| 20000329 | 62.25.64/18 | |
| 20001123 | 62.25.128/17 | |
| 19970506 | 62.136/16 | |
| 19990521 | 62.137/16 | |
| 20020306 | 81.76/14 | |
| 19960119 | 194.152.64/19 | |
| 19970116 | 195.80.64/19 | |
| 19960612 | 195.92/16 | |
| 20010125 | 217.134/15 | |

We can see that allocations happen at different times, and that numerically larger IPs are not necessarily later allocations than smaller ones. On 2001/01/25 RIPE assigned $2^{32-15} = 2^{17}$ IPs (217.134.0.0 – 217.135.255.255), while on 2002/03/06 RIPE gave $2^{32-14} = 2^{18}$ IPs (81.76.0.0 – 81.79.255.255) to the ISP. From Table 7, servers lion and elk cannot be compared with $\le_t$. If they could, we would conclude that elk is older than lion, which Table 8 shows to be wrong. What is true is that the five colored regions of Table 7, marked (/), (*), (-), (+) and (^), represent five different time periods in the ISP's growth history. Using Tables 7 and 8 we plot Fig. 3, which shows the ISP's growth over a six-year period. From the plot we can see the high rates of growth, which explain the large assignments that RIPE gave to the specific ISP (RIPE asks for proof of demand before assigning large blocks of addresses to ISPs).
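The block sizes quoted above follow directly from the prefix lengths, and can be checked with Python's standard `ipaddress` module. In this sketch the abbreviated prefixes of Table 8 are written out in full (an assumption about how 217.134/15 and 81.76/14 expand), and the last address of each block is computed rather than copied from the text.

```python
from ipaddress import ip_network

# Two of the allocations listed in Table 8, written out as full CIDR prefixes.
alloc_2001 = ip_network("217.134.0.0/15")  # assigned 2001/01/25
alloc_2002 = ip_network("81.76.0.0/14")    # assigned 2002/03/06

for net in (alloc_2001, alloc_2002):
    # A /n prefix leaves 32 - n host bits, i.e. 2**(32 - n) addresses.
    print(net, net.num_addresses, net[0], net[-1])

# The later allocation has the numerically smaller addresses, so numeric
# order across allocations does not reflect assignment order.
print(alloc_2002.network_address < alloc_2001.network_address)  # True
```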

[Figure: plot of ISP growth; x-axis: year (1997 to 2002); y-axis: modems in cluster (0 to 100000)]

Fig. 3. ISP growth
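One way to combine Tables 7 and 8 into a growth curve is sketched below. It rests on two assumptions that the text does not state explicitly: that the plot shows the cumulative number of modems by year of allocation, and that each marked group of Table 7 falls entirely inside one allocation of Table 8. Each group is represented by the first-modem address of one of its servers, and the abbreviated prefixes of Table 8 are written out in full.

```python
from ipaddress import ip_address, ip_network

# Allocations from Table 8: (prefix, year of assignment).
allocations = [
    ("62.25.64.0/18", 2000), ("62.25.128.0/17", 2000), ("62.136.0.0/16", 1997),
    ("62.137.0.0/16", 1999), ("81.76.0.0/14", 2002), ("194.152.64.0/19", 1996),
    ("195.80.64.0/19", 1997), ("195.92.0.0/16", 1996), ("217.134.0.0/15", 2001),
]

# Group totals from Table 7, keyed by one representative first-modem address.
group_totals = {
    "217.134.16.0": 52992,  # (/)
    "62.136.124.1": 4476,   # (*)
    "62.137.1.1": 634,      # (-)
    "62.25.140.0": 3072,    # (+)
    "81.76.128.0": 28672,   # (^)
}

# Attribute each group's modems to the year of the allocation that contains it.
modems_by_year = {}
for ip, total in group_totals.items():
    year = next(y for prefix, y in allocations if ip_address(ip) in ip_network(prefix))
    modems_by_year[year] = modems_by_year.get(year, 0) + total

# Accumulate the yearly totals in chronological order.
running = 0
cumulative = {}
for year in sorted(modems_by_year):
    running += modems_by_year[year]
    cumulative[year] = running

print(cumulative)
# {1997: 4476, 1999: 5110, 2000: 8182, 2001: 61174, 2002: 89846}
```

Under these assumptions the curve stays below 10000 modems until 2000 and then rises sharply with the 2001 and 2002 allocations, which is in line with the axis range of the figure.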

6. Conclusions

In this work we presented some heuristic techniques that help extract spatial and temporal metadata from hostnames. To this end we introduced a method for drawing representative samples of hostnames from ccTLDs. More specifically, we drew a representative sample (dataset) of hostnames from .uk and, by manipulating it, discovered new knowledge about two different companies (ISPs) that operate in the corresponding region (United Kingdom). For the first ISP we calculated a snapshot of its Internet traffic and a snapshot of how its infrastructure is distributed in the region (spatial information). For the second ISP we produced a diagram showing its growth rate in terms of how the company has used its IP assignments (temporal information). We used the notion of a partial order to predict how many modems are attached to the servers of the ISP found in the dataset. All this knowledge was extracted only from the hostname representation of the dataset, which led us to the conclusion that spatial and temporal information is indeed related to hostnames. The significance of the findings does not lie in the fact that we managed to discover new knowledge about specific Internet companies, but in the fact that we have a method that enables us to use independent sources (such as hostnames, the web and its Internet infrastructure), free to all citizens, for computing some ISP metrics. These metrics can be used to compare ISPs and to base decisions on information other than the advertisements or unrealistic data that these companies may provide in order to promote their services.

References

APNIC - Asia Pacific Network Information Center. http://www.apnic.net/

ARIN - American Registry for Internet Numbers. http://www.arin.net/

Cochran, W. (1977). Sampling Techniques. USA: John Wiley.

dataset. http://users.ntua.gr/plekeas/Data_sets/metadata_in_hostnames

Eurostat - Statistical Office of the European Communities. http://www.europa.eu.int/comm/eurostat

GMT - The Generic Mapping Tool. http://gmt.soest.hawaii.edu

IP2LL - IP to Latitude/Longitude tool. http://cello.cs.uiuc.edu/cgi-bin/slamm/ip2ll

IANA - Internet Assigned Numbers Authority. http://www.iana.org/

LACNIC - Regional Latin-American and Caribbean IP Address Registry. http://lacnic.net/en/index.html

Moore, D., Periakaruppan, R., Donohoe, J., & Claffy, K. (2000). Where in the world is netgeo.caida.org? Proceedings of the INET'2000. Yokohama, Japan.

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2002). Numerical Recipes in C. Cambridge University Press.

Padmanabhan, V. N., & Subramanian, L. (2001). An investigation of geographic mapping techniques for Internet hosts. Proceedings of the ACM SIGCOMM '2001, San Diego, CA, USA.

Paltridge, S. (1999). OECD regulatory and statistical update. Telecommunication Policy, 23, 683-686.

Park, S. K., & Miller, K. W. (1988). Random Number Generators: Good ones are hard to find. Communications of the ACM, 31 (10).

Peterson, L. L., & Davie, B. S. (1996). Computer Networks: A Systems Approach. Morgan Kaufmann, San Francisco, CA, USA.

RIPE NCC - Réseaux IP Européens Network Coordination Centre. http://www.ripe.net/

Svensson, M. (1998). Social Navigation. Exploring Navigation; Towards a Framework for Design and Evaluation of Navigation in Electronic Spaces. N. Dahlbaeck. Kista, Sweden, Swedish Institute of Computer Science.

UK's National Statistics online (2001). Census of population of England and Wales. http://www.statistics.gov.uk

Ziviani, A., Fdida, S., Rezende, J. F., & Duarte, O. C. M. B. (2002). Placing Landmarks to Locate Internet Hosts. Workshop on Quality of Service and Mobility, WQoSM 2002, Angra dos Reis, Brazil.


Appendix

Fig. 2. Infrastructure and Internet traffic spatial map of a British ISP