
Hack 35. SafeSearch Certify URLs

Feed URLs into Google's SafeSearch to determine whether they point at questionable content.

Only three things in life are certain: death, taxes, and accidentally visiting a once family-safe web site that now contains text and images that would make a horse blush.

As you probably know if you've ever put up a web site, domain names are registered for finite lengths of time. Sometimes registrations accidentally expire; sometimes businesses fold and allow the registrations to expire; sometimes other companies take them over.

Some new owners just want the domain name, others want the traffic that the defunct site generated, and in a few cases the new owners try to hold the domain name hostage, offering to sell it back to the original owners for a great deal of money. (This doesn't work as well as it used to because of the dearth of Internet companies that actually have a great deal of money.)

When a site isn't what it once was, that's no big deal. When it's not what it once was and is now X-rated, that's a bigger deal. When it's not what it once was, is now X-rated, and is on the link list of a site you run, that's a really big deal.

But how to keep up with all the links? You can visit each link periodically to determine if it's still okay, you can wait for hysterical emails from site visitors, or you can just not worry about it. Or you can put the Google API to work.

This program lets you check a list of URLs in Google's SafeSearch mode. If they appear in the SafeSearch mode, they're probably okay. If they don't appear, they're either not in Google's index or not "safe" enough to pass through Google's filter. The program then checks the URLs missing from a SafeSearch with a nonfiltered search. If they do not appear in a nonfiltered search, they're labeled as unindexed. If they do appear in a nonfiltered search, they're labeled as "suspect."
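The decision logic described above boils down to a few lines. The following is a minimal sketch for illustration, not part of suspect.pl; the classify function and its two boolean arguments are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical helper: map the two search outcomes to a rating.
#   $in_safe       - true if the URL appeared in the SafeSearch results
#   $in_unfiltered - true if the URL appeared in the unfiltered results
sub classify {
    my ($in_safe, $in_unfiltered) = @_;
    return 'safe'    if $in_safe;        # passed the SafeSearch filter
    return 'suspect' if $in_unfiltered;  # indexed, but filtered out
    return 'unindexed';                  # not found in either search
}

print classify(1, 1), "\n";
print classify(0, 1), "\n";
print classify(0, 0), "\n";
```

The three calls print safe, suspect, and unindexed, in that order.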

2.17.1. Danger, Will Robinson!

While Google's SafeSearch filter is good, it's not infallible. (I have yet to see an automated filtering system that is infallible.) So if you run a list of URLs through this hack and they all show up in a SafeSearch query, don't take that as a guarantee that they're all completely inoffensive. Take it merely as a pretty good indication that they are. If you want absolute assurance, you're going to have to visit every link personally and frequently.

Here's a fun idea if you need an Internet-related research project. Take 500 or so domain names at random and run this program on the list once a week for several months, saving the results to a file each time. It'd be interesting to see how many domains/URLs end up being filtered out of SafeSearch over time.
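One way to approach such a project is to compare the saved results from two runs. The sketch below assumes each run was saved in the CSV form that this hack prints; the helper names parse_results and newly_suspect are hypothetical, and example.com/links/ is invented sample data rather than a real result.

```perl
use strict;
use warnings;

# Hypothetical helper: turn saved CSV output into a url => rating hash.
sub parse_results {
    my ($csv_text) = @_;
    my %rating_for;
    for my $line (split /\n/, $csv_text) {
        next unless $line =~ /^"([^"]*)","([^"]*)"/;
        next if $1 eq 'url';    # skip the CSV header row
        $rating_for{$1} = $2;
    }
    return \%rating_for;
}

# Hypothetical helper: URLs rated "safe" in the old run and "suspect" in the new.
sub newly_suspect {
    my ($old, $new) = @_;
    my @changed = grep {
        $old->{$_} eq 'safe' && ($new->{$_} || '') eq 'suspect'
    } sort keys %$old;
    return @changed;
}

# Invented sample data standing in for two weekly result files.
my $week1 = parse_results(<<'CSV');
"url","safe/suspect/unindexed","title"
"oreilly.com/catalog/essblogging/","safe"
"example.com/links/","safe"
CSV

my $week2 = parse_results(<<'CSV');
"url","safe/suspect/unindexed","title"
"oreilly.com/catalog/essblogging/","safe"
"example.com/links/","suspect"
CSV

print "$_\n" for newly_suspect($week1, $week2);
```

With this sample data, the only line printed is example.com/links/, the one URL whose rating moved from safe to suspect between the two runs.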


2.17.2. The Code

Save the following Perl source code as a text file named suspect.pl:

#!/usr/local/bin/perl
# suspect.pl
# Feed URLs to a Google SafeSearch. If inurl: returns results, the
# URL probably isn't questionable content. If inurl: returns no
# results, either it points at questionable content or isn't in
# the Google index at all.

use strict;
use SOAP::Lite;

# Your Google API developer's key.
my $google_key = 'put your key here';

# Location of the GoogleSearch WSDL file.
my $google_wsdl = "./GoogleSearch.wsdl";

$|++; # turn off output buffering

my $google_search = SOAP::Lite->service("file:$google_wsdl");

# CSV header
print qq{"url","safe/suspect/unindexed","title"\n};

while (my $url = <>) {
  chomp $url;
  # Strip the scheme and a leading "www." before querying.
  $url =~ s!^\w+?://!!;
  $url =~ s!^www\.!!;

  # SafeSearch (the seventh argument, "true", turns filtering on)
  my $results = $google_search ->
      doGoogleSearch(
      $google_key, "inurl:$url", 0, 10, "false", "",  "true",
      "", "latin1", "latin1"
    );

  print qq{"$url",};

  # \Q...\E makes regex metacharacters in the URL (".", "?") match literally.
  if (grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}}) {
    print qq{"safe"\n};
  }
  else {
    # unSafeSearch (the same query with filtering off)
    my $results = $google_search ->
        doGoogleSearch(
        $google_key, "inurl:$url", 0, 10, "false", "",  "false",
        "", "latin1", "latin1"
      );

    # Unsafe or Unindexed?
    print (
      (scalar grep /\Q$url\E/, map { $_->{URL} } @{$results->{resultElements}})
        ? qq{"suspect"\n}
        : qq{"unindexed"\n}
      );
  }
}
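Two steps in the script can be tried without a Google key: the substitutions that trim each URL, and the test for whether the trimmed URL shows up among the result URLs. The sketch below isolates both. The names normalize_url and url_in_results are hypothetical, and the result URL in the example is made up for illustration. The match quotes the URL with \Q...\E so that characters such as "." and "?" are compared literally rather than treated as regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the scheme and a leading "www." from a URL.
sub normalize_url {
    my ($url) = @_;
    chomp $url;
    $url =~ s!^\w+?://!!;   # drop "http://", "https://", and so on
    $url =~ s!^www\.!!;     # drop a leading "www."
    return $url;
}

# Hypothetical helper: count the result URLs that contain the trimmed URL.
sub url_in_results {
    my ($url, @result_urls) = @_;
    return scalar grep { /\Q$url\E/ } @result_urls;
}

my $url = normalize_url("http://www.oreilly.com/catalog/essblogging/\n");
print "$url\n";

# A made-up result URL, standing in for what a search might return.
my $found = url_in_results($url,
    'http://www.oreilly.com/catalog/essblogging/index.html');
print $found ? "found\n" : "not found\n";
```

This prints oreilly.com/catalog/essblogging/ followed by found.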

2.17.3. Running the Hack

To run the hack, you'll need a text file that contains the URLs that you want to check, one line per URL. For example:

http://www.oreilly.com/catalog/essblogging/
http://www.xxxxxxxxxx.com/preview/home.htm
hipporhinostricow.com

The program runs from the command line ["How to Run the Hacks" in the Preface]. Enter the name of the script, a less-than sign, and the name of the text file that contains the URLs that you want to check. The program will return results that look like this:

% perl suspect.pl < urls.txt
"url","safe/suspect/unindexed","title"
"oreilly.com/catalog/essblogging/","safe"
"xxxxxxxxxx.com/preview/home.htm","suspect"
"hipporhinostricow.com","unindexed"

The first item is the URL being checked, and the second is its probable safety rating, as follows:


safe
    The URL appeared in a Google SafeSearch for the URL.

suspect
    The URL did not appear in a Google SafeSearch but did in an unfiltered search.

unindexed
    The URL appeared in neither a SafeSearch nor an unfiltered search.

You can redirect output from the script to a file for import into a spreadsheet or database:

% perl suspect.pl < urls.txt > urls.csv
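If a spreadsheet is more than you need, a few lines of Perl can summarize the saved results. This is a minimal sketch; tally_ratings is a hypothetical helper, and the sample lines are the example output shown earlier rather than the contents of a real urls.csv.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many URLs received each rating.
sub tally_ratings {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        # The header row fails this match because its second field is
        # the literal text "safe/suspect/unindexed".
        next unless $line =~ /^"([^"]*)","(safe|suspect|unindexed)"\s*$/;
        $count{$2}++;
    }
    return %count;
}

my @sample = (
    qq{"url","safe/suspect/unindexed","title"},
    qq{"oreilly.com/catalog/essblogging/","safe"},
    qq{"xxxxxxxxxx.com/preview/home.htm","suspect"},
    qq{"hipporhinostricow.com","unindexed"},
);

my %count = tally_ratings(@sample);
print "$_: $count{$_}\n" for sort keys %count;
```

For the sample lines this reports one URL in each of the three categories. To summarize a real run, pass the lines read from the saved file in place of @sample.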

2.17.4. Hacking the Hack

You can use this hack interactively, feeding it URLs one at a time. Invoke the script with perl suspect.pl, but don't feed it a text file of URLs to check. Enter a URL and hit the return key on your keyboard. The script will reply in the same manner that it does when fed multiple URLs. This is handy when you just need to spot-check a couple of URLs on the command line. When you're ready to quit, break out of the script using Ctrl-D under Unix or Ctrl-Break on a Windows command line.

Here's a transcript of an interactive session with suspect.pl:

% perl suspect.pl
"url","safe/suspect/unindexed","title"
http://www.oreilly.com/catalog/essblogging/
"oreilly.com/catalog/essblogging/","safe"
http://www.xxxxxxxxxx.com/preview/home.htm
"xxxxxxxxxx.com/preview/home.htm","suspect"
hipporhinostricow.com
"hipporhinostricow.com","unindexed"
^d
%
