
Hack 46. Spot Trends with Geotargeting

Compare the relative popularity of a trend or fashion in different locations, using only Google and Directi search results.

One of the latest buzzwords on the Internet is geotargeting, which is just a fancy name for the process of matching hostnames (e.g., http://www.oreilly.com) to addresses (e.g., 208.201.239.36) to country names (e.g., U.S.). The whole thing works because there are people who compile such databases and make them readily available. This information must be compiled by hand or at least semiautomatically because the DNS system that resolves hostnames to addresses does not store it in its distributed database.

While it is possible to add geographic location data to DNS records, it is highly impractical to do so. However, since we know which addresses have been assigned to which businesses, governments, organizations, or educational establishments, we can assume with a high probability that the geographic location of the institution matches that of its hosts, at least for most of them. For example, if the given address belongs to the range of addresses assigned to British Telecom, then it is highly probable that it is used by a host located within the territory of the United Kingdom.
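To make the hostname-to-address half of that mapping concrete, here is a minimal Perl sketch using only the core Socket module (the hostname is just an example); the address-to-country half is exactly what a compiled database such as Directi's supplies, and the full script later in this hack performs that step.

#!/usr/bin/perl -w
# A minimal sketch: resolve a hostname to its addresses. Mapping those
# addresses to a country then requires a compiled database such as Directi's.
use strict;
use Socket;

my $host = shift || 'www.oreilly.com';
my @info = gethostbyname($host) or die "Can't resolve $host\n";
# In list context, gethostbyname() returns (name, aliases, type, length, @addrs).
print "$host => ", inet_ntoa($_), "\n" for @info[4 .. $#info];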

Why go to such lengths when a simple DNS lookup (e.g., nslookup 208.201.239.36) gives the name of the host, and from that name we can read the top-level domain (e.g., .pl, .de, or .uk) to find out where this particular host is located? There are four good reasons for going to the trouble (a sketch of this naive shortcut follows the list):

  • Not all lookups on addresses return hostnames.

  • A single address might serve more than one virtual host.

  • Some country domains are registered by foreigners and hosted on servers on the other side of the globe.

  • .com, .net, .org, .biz, or .info domains tell us nothing about the geographic location of the servers they are hosted on. That's where geotargeting can help.
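For comparison, here is a minimal sketch of the naive reverse-lookup shortcut itself: reverse-resolve an address and read the top-level domain off the hostname. The example address is the one used above, and the reasons in the list are exactly the cases this approach stumbles over.

#!/usr/bin/perl -w
# The naive shortcut: reverse DNS lookup, then inspect the top-level domain.
use strict;
use Socket;

my $ip = shift || '208.201.239.36';
my $packed = inet_aton($ip) or die "Not a valid address: $ip\n";
my $name = gethostbyaddr($packed, AF_INET);
if (defined $name) {
    my ($tld) = $name =~ /\.(\w+)$/;
    print "$ip resolves to $name (TLD: ", $tld ? ".$tld" : "unknown", ")\n";
} else {
    print "$ip has no reverse DNS entry -- nothing to inspect\n";
}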

Geotargeting is by no means perfect. For example, if an international organization such as AOL gets a large chunk of addresses that it uses not only for servers in the U.S., but also in Europe, the European hosts might be reported as being based in the U.S. Fortunately, such aberrations do not constitute a large percentage of addresses.

The first users of geotargeting were advertisers, who thought it would be a neat idea to serve local advertising. In other words, if a user visits a New York Times site, the ads they see depend on their physical location. Those in the U.S. might see ads for the latest Chrysler car, while those in Japan might see ads for i-mode; users from Poland might see ads for "Ekstradycja" (a cult Polish police TV series), and those in India might see ads for the latest Bollywood movie. While such use of geotargeting might maximize the return on each advertising dollar, it also goes against the idea behind the Internet, which is a global network. (In other words, if you are entering a global audience, don't try to hide from it by compartmentalizing it.) Another problem with geotargeted ads is that they follow the viewer. Advertisers must love it, but it is annoying to the user; how would you feel if you saw the same ads for your local burger bar everywhere you went in the world?

Another application of geotargeting is to serve content in the local language. The idea is really nice, but it's often poorly implemented and takes a lot of clicking to get to the pages in other languages. The local pages also have a habit of returning out of nowhere, especially after you upgrade your web browser. A much more interesting application of geotargeting is the analysis of trends, which is usually done in one of two ways: analysis of server logs, or analysis of the results of Google queries.

Server log analysis is used to determine the geographic location of your visitors. For example, you might discover that your company's site is being visited by a large number of people from Japan. Perhaps that number is so significant that it would justify the rollout of a Japanese version of your site. Or it might be a signal that your company's products are becoming popular in that country and you should spend more marketing dollars there. But if you run a server for U.S. expatriates living in Tokyo, the same information might mean that your site is growing in popularity and you need to add more information in English. This method is based on the list of addresses of hosts that connect to the server, stored in your server's access log. You could write a script that looks up their geographic location to find out where your visitors come from. It is more accurate than looking up top-level domains, although it's a little slower due to the number of DNS lookups that need to be done.
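As a rough illustration of the log-analysis idea, the sketch below tallies client addresses from an Apache-style access log by country. The ip_to_country() helper here is a hypothetical stub; in practice you would back it with the same Directi CSV lookup that the geospider.pl script performs below (see its match_ip routine) or with a database.

#!/usr/bin/perl -w
# A rough sketch: tally visitors in an Apache common/combined log by country.
# ip_to_country() is a hypothetical stub -- back it with the Directi CSV
# lookup (see the match_ip routine in geospider.pl below) or a database.
use strict;

my %by_country;
open my $log, '<', 'access.log' or die "Can't open access.log: $!\n";
while (<$log>) {
    # The client address is the first whitespace-delimited field.
    my ($addr) = split ' ', $_, 2;
    $by_country{ ip_to_country($addr) }++ if defined $addr;
}
close $log;

print "$_: $by_country{$_}\n"
    for sort { $by_country{$b} <=> $by_country{$a} } keys %by_country;

sub ip_to_country { return 'UNKNOWN' }   # hypothetical stub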

Another interesting use of geotargeting is analysis of the spread of trends. This can be done with a simple script that plugs into the Google API and the IP-to-Country database provided by Directi (http://ip-to-country.directi.com). The idea behind trend analysis is simple: perform repetitive queries using the same keywords, but change the language of results and top-level domains for each query. Compare the number of results returned for each language, and you will get a good idea of the spread of the analyzed trend across cultures. Then, compare the number of results returned for each top-level domain, and you will get a good idea of the spread of the analyzed trend across the globe. Finally, look up geographic locations of hosts to better approximate the geographic spread of the analyzed trend.

You might discover some interesting things this way: it could turn out that a particular .com domain that serves a significant number of documents matching the given query in Japanese is located in Germany. It might be a sign that there is a large Japanese community in Germany that uses that particular .com domain for their portal. Shouldn't you be trying to get in touch with them?
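Before looking at the full spider, here is a minimal sketch of that comparison loop: the same query repeated once per language and per top-level domain. The languages, domains, and example query are arbitrary choices, and estimated_hits() is deliberately left as a stub; fill it in with whatever accessor your installed Net::Google version provides for the estimated total (the underlying Google SOAP field is estimatedTotalResultsCount).

#!/usr/bin/perl -w
# A minimal sketch of the trend-comparison loop: the same query, repeated
# once per language and per top-level domain. estimated_hits() is a stub --
# replace it with your Net::Google version's accessor for the estimated
# total (the underlying Google SOAP field is estimatedTotalResultsCount).
use strict;
use Net::Google;
use constant GOOGLEKEY => 'insert key here';

my $query     = 'amphetadesk';        # example query
my @languages = qw(en de ja pl);      # arbitrary choices
my @domains   = qw(.com .de .jp .pl); # arbitrary choices

my $google = Net::Google->new(key => GOOGLEKEY);

for my $lang (@languages) {
    for my $dom (@domains) {
        my $search = $google->search( );
        $search->lr($lang);
        $search->max_results(1);             # we only need the estimate
        $search->query("$query site:$dom");
        my ($response) = @{ $search->response( ) };
        printf "%-3s %-5s %10d estimated results\n",
               $lang, $dom, estimated_hits($response);
    }
}

sub estimated_hits {    # hypothetical stub; replace with the real accessor
    my ($response) = @_;
    return 0 unless defined $response;
    return 0;           # placeholder only
}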

The geospider.pl script shown in this hack is a sample implementation of this idea. It queries Google and then matches the names of hosts in returned URLs against the IP-to-Country database.

2.28.1. The Code

Save the following code ["How to Run the Hacks" in the Preface] as geospider.pl.

You will need the Getopt::Std and Net::Google modules for this script. You'll also need a Google API key (http://api.google.com) and the latest ip-to-country.csv database (http://ip-to-country.webhosting.info/downloads/ip-to-country.csv.zip).


#!/usr/bin/perl -w
#
# geospider.pl
#
# Geotargeting spider -- queries Google through the Google API, extracts
# hostnames from returned URLs, looks up addresses of hosts, and matches
# addresses of hosts against the IP-to-Country database from Directi:
# ip-to-country.directi.com. For more information about this software:
# http://www.artymiak.com/software or contact jacek@artymiak.com.
#
# This code is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#

use strict;
use Getopt::Std;
use Net::Google;
use constant GOOGLEKEY => 'insert key here';
use Socket;

my $help = <<"EOH";
----------------------------------------------------------------------------
Geotargeting trend analysis spider
----------------------------------------------------------------------------
Options:

  -h    prints this help
  -q    query in utf8, e.g. 'Spidering Hacks'
  -l    language codes, e.g. 'en fr jp'
  -d    domain, e.g. '.com'
  -s    which result should be returned first (count starts from 0), e.g. 0
  -n    how many results should be returned, e.g. 700
----------------------------------------------------------------------------
EOH

# Define our arguments and show the
# help if asked, or if missing query.
my %args; getopts("hq:l:d:s:n:", \%args);
die $help if exists $args{h};
die $help unless $args{'q'};

# Create the Google object.
my $google = Net::Google->new(key=>GOOGLEKEY);
my $search = $google->search( );

# Language(s), defaulting to English.
$search->lr(split(/\s+/, $args{l} || "en"));

# What search result to start at, defaulting to 0.
$search->starts_at($args{'s'} || 0);

# How many results, defaulting to 10.
$search->max_results($args{n} || 10);

# Input and output encoding.
$search->ie("utf8"); $search->oe("utf8");

# Our final query string; restrict it to the
# requested domain with Google's site: operator.
my $querystr;
if ($args{d}) { $querystr = "$args{q} site:$args{d}"; }
else { $querystr = $args{'q'} }

# Load in our lookup list from
# http://ip-to-country.directi.com/.
my $file = "ip-to-country.csv";
print STDERR "Trying to open $file... \n";
open (FILE, "<$file") or die "[error] Couldn't open $file: $!\n";

# Now load the whole shebang into memory.
print STDERR "Database opened, loading... \n";
my (%ip_from, %ip_to, %code2, %code3, %country);
my $counter=0; while (<FILE>) {
    chomp; my $line = $_; $line =~ s/"//g; # strip all quotes.
    my ($ip_from, $ip_to, $code2, $code3, $country) = split(/,/, $line, 5);

    # Remove leading zeros from the numeric range boundaries.
    $ip_from =~ s/^0{0,10}//g;
    $ip_to =~ s/^0{0,10}//g;

    # And assign to our permanents.
    $ip_from{$counter} = $ip_from;
    $ip_to{$counter}   = $ip_to;
    $code2{$counter}   = $code2;
    $code3{$counter}   = $code3;
    $country{$counter} = $country;
    $counter++; # move on to next line.
}
close (FILE);

$search->query(qq($querystr));
print STDERR "Querying Google with $querystr... \n";
print STDERR "Processing results from Google... \n";

# For each result from Google, display
# the geographic information we've found.
foreach my $result (@{$search->response( )}) {
    print "-" x 80 . "\n";
    print " Search time: " . $result->searchTime( ) . "s\n";
    print "       Query: $querystr\n";
    print "   Languages: " . ( $args{l} || "en" ) . "\n";
    print "      Domain: " . ( $args{d} || "" ) . "\n";
    print "    Start at: " . ( $args{'s'} || 0 ) . "\n";
    print "Return items: " . ( $args{n} || 10 ) . "\n";
    print "-" x 80 . "\n";

    map {
        print "url: " . $_->URL( ) . "\n";
        my $addresses = get_host($_->URL( ));
        if (defined $addresses) {
            match_ip($addresses);
        } else {
            print "address: unknown\n";
            print "country: unknown\n";
            print "code3: unknown\n";
            print "code2: unknown\n";
        } print "-" x 50 . "\n";
    } @{$result->resultElements( )};
}

# Extract the hostname from a URL and
# return its addresses as a single string.
sub get_host {
    my ($url) = @_;

    # Chop the URL down to just the hostname.
    my ($name) = $url =~ m!^\w+://([^/:]+)!;
    unless (defined $name) { print "host: unknown\n"; return; }
    print "host: $name\n";

    # And get the matching IPs.
    my @addresses = gethostbyname($name);
    if (scalar @addresses != 0) {
        @addresses = map { inet_ntoa($_) } @addresses[4 .. $#addresses];
    } else { return; }
    return "@addresses";
}

# Check our addresses against the
# Directi list loaded in memory.
sub match_ip {
    my (@addresses) = split(/ /, "@_");
    foreach my $address (@addresses) {
        print "address: $address\n";

        # Convert the dotted quad into the same 32-bit integer
        # form used by the ip_from/ip_to columns of the database.
        my $p = unpack("N", pack("C4", split(/\./, $address)));

        my $matched = 0;
        for (my $i = 0; $i < $counter; $i++) {
            if ($p >= int($ip_from{$i}) and $p <= int($ip_to{$i})) {
                print "country: " . $country{$i} . "\n";
                print "code3: "   . $code3{$i}   . "\n";
                print "code2: "   . $code2{$i}   . "\n";
                $matched = 1;
                last;
            }
        }
        unless ($matched) {
            print "country: unknown\n";
            print "code3: unknown\n";
            print "code2: unknown\n";
        }
    }
}

Be sure to replace insert key here with your Google API key.

2.28.2. Running the Hack

Here, we're querying to see how much worldly penetration AmphetaDesk, a popular news aggregator, has, according to Google's top search results:

% perl geospider.pl -q "amphetadesk"

Trying to open ip-to-country.csv... 

Database opened, loading... 

Querying Google with amphetadesk... 

Processing results from Google... 

--------------------------------------------------------------

 Search time: 0.081432s

       Query: amphetadesk

   Languages: en

      Domain: 

    Start at: 0

Return items: 10

--------------------------------------------------------------

url: http://www.macupdate.com/info.php/id/9787

host: www.macupdate.com

address: 64.5.48.152

country: UNITED STATES

code3: USA

code2: US

--------------------------------------------------

url: http://allmacintosh.forthnet.gr/preview/214706.html

host: allmacintosh.forthnet.gr

address: 193.92.150.100

country: GREECE

code3: GRC

code2: GR

--------------------------------------------------

...etc...

2.28.3. Hacking the Hack

This script is only a simple tool. You will make it better, no doubt. The first thing you could do is implement a more efficient way to query the IP-to-Country database: storing the data from ip-to-country.csv in a real database would cut the script's startup time by several seconds and make each address-to-country query much faster. One way to do it is sketched below.
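Here is one possible sketch of that improvement, using the DBI and DBD::SQLite modules; the database filename and table layout are this sketch's own choices, not part of the original hack. The CSV is imported once, and each lookup then becomes a single SELECT on the same 32-bit integer form of the address that match_ip() computes.

#!/usr/bin/perl -w
# Import ip-to-country.csv into SQLite once, then answer address-to-country
# queries with a single SELECT. Filenames and table layout are arbitrary.
use strict;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=ip2country.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do("CREATE TABLE IF NOT EXISTS ip2country (
              ip_from INTEGER, ip_to INTEGER,
              code2 TEXT, code3 TEXT, country TEXT)");
$dbh->do("CREATE INDEX IF NOT EXISTS idx_range ON ip2country (ip_from, ip_to)");

my $ins = $dbh->prepare("INSERT INTO ip2country VALUES (?, ?, ?, ?, ?)");
open my $csv, '<', 'ip-to-country.csv' or die "Can't open CSV: $!\n";
while (<$csv>) {
    chomp; s/"//g;
    my ($from, $to, $code2, $code3, $country) = split /,/, $_, 5;
    $ins->execute($from + 0, $to + 0, $code2, $code3, $country);
}
close $csv;
$dbh->commit;

# Look up an address: the same unpack("N", ...) value that match_ip() uses.
my $sth = $dbh->prepare("SELECT country, code3, code2 FROM ip2country
                         WHERE ? BETWEEN ip_from AND ip_to LIMIT 1");
$sth->execute(unpack("N", pack("C4", split /\./, '208.201.239.36')));
print join(" / ", $sth->fetchrow_array), "\n";
$dbh->disconnect;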

You might ask if it wouldn't be easier to write a spider that doesn't use the Google API and instead downloads page after page of results returned by Google at http://www.google.com. Yes, it is possible, and it is also the quickest way to get your script blacklisted for breaching Google's user agreement. Google is not only the best search engine, it is also one of the best-monitored sites on the Internet.

Jacek Artymiak
