Team LiB
Previous Section Next Section

Advanced String Comparison

It is somewhat difficult to make a computer "understand" strings the way a human being would. A typical example of this problem is spelling mistakes, particularly when you're dealing with names.

Although no solution exists that even approximates the capabilities of the human brain, several algorithms have been developed over the years to provide a way to measure the "similarity" between two strings in shades of gray instead of black and white.

One such example is the soundex algorithm, initially devised as a filing system for use in the U.S. Census at the end of the 1800s. Soundex works by assigning a value to each consonant of the alphabet and then calculating the total value of a word based on its initial and component syllables. The resulting soundex value is represented by the initial letter of the word and the combined value of its syllables.

The soundex algorithm, which is implemented in PHP through the soundex function, can be extremely valuable when searching for names based on their phonetic representations. For example, the word "Tabini" and "Tabani" have the same soundex values:

<?php
    echo soundex ('Tabini');
    echo "\n";
    echo soundex ('Tabani');
    echo "\n";
?>

which returns the following:

T150
T150

As a result, looking for a name becomes much easier even if its exact spelling is unknown.

A better algorithm for comparing two words based on their phonetic representation is metaphone, which was developed in 1990 by Lawrence Philips. The metaphone algorithm works by assigning a phonetic value to combinations of characters based on their typical use in the English language.

PHP provides an implementation of this algorithm through the metaphone function:

<?php

    echo metaphone ('Tabini');
    echo "\n";
    echo metaphone ('Tabani');
    echo "\n";

?>

The preceding script returns the metaphone value "TBN" for both strings.

Comparing Phrases

Other comparison functions deal with entire phrases. For example, the levenshtein() function calculates the "distance" between two phrases, defined as the minimum number of additions, deletions, or replacements to transform a string into another:

<?php
    echo levenshtein ('Tabini', 'Tabani');
    echo "\n";
?>

The preceding script will return U1 because it's necessary to change only the first "i" in Tabini to an "a" to obtain the string 'Tabani'. Although a lower Levenshtein distance generally means a closer similarity between the two parameters, the value returned by this function gives a better idea of the closeness of the two sentences when you compare it to the length of the first parameter:

<?php

    $lev = levenshtein ('Tabini', 'Tabani');
    $per = $lev / strlen ('Tabini') * 100;
    echo "$per\n";

?>

This results in a value that approximates the percentage of distance between the two parameters. The preceding script will return a distance of approximately 16.67%, which can be translated in a similarity of approximately 83.33% between the two strings by subtracting the distance value from one hundred.

Another way to determine the similarity between two strings is provided by the similar_text function, which computes the number of matches between two strings and determines their similarity:

<?php
    $matches = similar_text ('Tabini', 'Tabani', &$per);
    echo "Matches: $matches - Percentage: $per\n";
?>

Interestingly enough, running this script returns the following result:

Matches: 5 - Percentage: 83.333333333333

which is exactly what we had calculated earlier, based on our percentage transformation of the Levenshtein distance.

    Team LiB
    Previous Section Next Section