Random Sampling of IP Addresses

We extracted all ranges of IP addresses used by Russian networks from the IP-Country database [6]. In total there were around 10.5 million IP addresses as of June 2005, N = 10.5 × 10^6. Then n = 10.5 × 10^4 unique IP addresses (1% of the total number) were randomly selected and scanned for active web servers (the tools we used for this are mentioned in Appendix A). We detected 1,379 machines with web servers running on port 80. For each of these machines the corresponding hostnames were resolved based on the machine's IP address.
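For illustration, a minimal sketch of this sampling-and-probing step could look like the Python fragment below. It is hypothetical code, not the actual tools from Appendix A; the function names and the assumption that the country ranges arrive as (start, end) pairs of integer-encoded IPv4 addresses are ours.

```python
import random
import socket
import struct

def int_to_ip(value):
    """Convert a 32-bit integer into dotted-quad IPv4 notation."""
    return socket.inet_ntoa(struct.pack("!I", value))

def sample_ips(ranges, n):
    """Draw n unique IP addresses uniformly from (start, end) integer ranges."""
    sizes = [end - start + 1 for start, end in ranges]
    total = sum(sizes)                        # N, about 10.5 million in our case
    offsets = random.sample(range(total), n)  # n unique positions, no replacement
    ips = []
    for offset in offsets:
        for (start, _end), size in zip(ranges, sizes):
            if offset < size:
                ips.append(int_to_ip(start + offset))
                break
            offset -= size
    return ips

def has_web_server(ip, timeout=5):
    """Return True if the host accepts TCP connections on port 80."""
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical usage: `ranges` would be built from the IP-Country database [6].
# sample = sample_ips(ranges, 105_000)
# live = [ip for ip in sample if has_web_server(ip)]
# hosts = {ip: socket.gethostbyaddr(ip)[0] for ip in live}  # hostname resolution
```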
The next step was crawling each host to depth three. While crawling, we checked whether links point to pages located on hosts with the same IP address. In order not to violate the sampling procedure (i.e., to study only those IP addresses which are in the sample), we ignored any page returned by a server on another IP.
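A rough sketch of this same-IP restriction is given below. The breadth-first crawler and its helper names are hypothetical, and link extraction is reduced to plain `<a href>` tags; only the idea of skipping pages served from a different IP is taken from the procedure above.

```python
import socket
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_same_ip(start_url, sampled_ip, max_depth=3):
    """Breadth-first crawl down to max_depth, ignoring pages served from other IPs."""
    seen, pages = set(), {}
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        host = urlparse(url).hostname
        try:
            # Keep the sampling procedure intact: only hosts on the sampled IP count.
            if host is None or socket.gethostbyname(host) != sampled_ip:
                continue
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            frontier.append((urljoin(url, link), depth + 1))
    return pages
```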
After that, the retrieved pages were automatically analyzed by our Perl script. All pages that do not contain web forms were excluded, as were pages whose forms are not interfaces to databases (i.e., forms for site search, navigation, login, registration, subscription, polling, posting, etc.). In order to consider only unique search forms, pages with duplicate forms were removed as well. Finally, we manually inspected the remaining pages and identified a total of x = 33 deep web sites.
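The actual filtering was done by a Perl script; the fragment below is only a hypothetical Python re-sketch of the kind of heuristics involved. The keyword list and regular expressions are illustrative assumptions, not the original rules.

```python
import re

# Hypothetical keyword list: hints that a form is not a query interface to a
# database (site search, navigation, login, registration, subscription, ...).
NON_SEARCH_HINTS = re.compile(
    r"login|password|register|signup|subscribe|poll|vote|comment|navigation",
    re.IGNORECASE,
)
FORM_RE = re.compile(r"<form\b.*?</form>", re.IGNORECASE | re.DOTALL)

def candidate_search_forms(pages):
    """Keep unique forms that may be search interfaces to web databases."""
    seen, candidates = set(), []
    for url, html in pages.items():
        for form in FORM_RE.findall(html):
            if NON_SEARCH_HINTS.search(form):
                continue                      # navigation/login/etc. form, skip it
            key = re.sub(r"\s+", " ", form).strip().lower()
            if key in seen:
                continue                      # duplicate form, already counted
            seen.add(key)
            candidates.append((url, form))
    return candidates                         # the rest is inspected manually
```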
It should be noted that, unlike [3], we counted only the number of deep web sites. The number of web databases accessible via the found deep web sites, as well as the number of interfaces to each particular database, were not counted, since we did not have a consistent and reliable procedure to detect how many web databases are accessible via a particular site. A typical case (though not encountered in this sample) is deciding how many databases are accessed via a site with two search forms, one for searching new cars and another for searching used ones. Both interpretations are admissible: either there are two databases, one for new and one for used cars, or it is just one combined database. Nevertheless, according to our informal database detection, 5 of the 33 deep web sites found had interfaces to two databases, which gives 38 web databases in the sample in total.
The estimate for the total number of deep web sites is $D_{rsIP} = \frac{x \cdot N}{n} = \frac{33 \times 10.5 \times 10^{6}}{10.5 \times 10^{4}} = 3300$. An approximate 95% confidence interval for $D_{rsIP}$ is given by the following formula: $D_{rsIP} \pm 1.96\sqrt{\frac{N(N-n)(1-p)p}{n-1}}$, where $p = x/n$ (see Chapter 5 in [11]). Thus, the total number of deep web sites estimated by the rsIP method is 3300 ± 1120.
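The arithmetic behind these numbers can be checked by plugging the sample values into the formula above, for example with a few lines of Python:

```python
from math import sqrt

N = 10_500_000   # IP addresses in the Russian ranges
n = 105_000      # randomly sampled IP addresses (1%)
x = 33           # deep web sites found in the sample

p = x / n
estimate = x * N / n                                         # point estimate, 3300
half_width = 1.96 * sqrt(N * (N - n) * (1 - p) * p / (n - 1))

print(round(estimate), round(half_width))                    # 3300 1120
```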
To our knowledge, there are four factors which were not taken into account in the rsIP experiment, and thus we can expect the obtained estimate DrsIP to be biased away from the true value. Among the four sources of bias, the most significant one is virtual hosting. A recent analysis [12] of all second-level domains in the .RU zone, conducted in March 2006, has shown that there are, on average, 7.5 web sites per IP address. Unfortunately, even with the help of advanced tools for reverse IP lookup (see Appendix B), there is no strong guarantee that all hostnames related to a particular IP address would be resolved correctly. This means that during the experiment we certainly overlooked a number of sites, some of which are presumably deep web sites.
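As a simple illustration of why virtual hosting leads to undercounting (this is not one of the advanced tools from Appendix B), a plain reverse DNS lookup returns at most one PTR name per IP address, so the other sites hosted behind that address remain invisible:

```python
import socket

def reverse_lookup(ip):
    """Return the single PTR hostname registered for an IP, or None."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip)
        return hostname
    except OSError:
        return None

# A virtually hosted server may serve many sites, yet at most one name (if any)
# comes back here; the remaining hostnames can only be recovered from external
# indexes, which is why some deep web sites were inevitably missed.
```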