Advanced String Comparison
Although no solution exists that even approximates the capabilities of the human brain, several algorithms have been developed over the years to provide a way to measure the "similarity" between two strings in shades of gray instead of black and white.
One such example is the soundex algorithm, initially devised as a filing system for use in the U.S. Census at the end of the 1800s. Soundex works by assigning a value to each consonant of the alphabet and then calculating the total value of a word based on its initial and component syllables. The resulting soundex value is represented by the initial letter of the word and the combined value of its syllables.
The soundex algorithm, which is implemented in PHP through the soundex function, can be extremely valuable when searching for names based on their phonetic representations. For example, the word "Tabini" and "Tabani" have the same soundex values:
<?php echo soundex ('Tabini'); echo "\n"; echo soundex ('Tabani'); echo "\n"; ?>
which returns the following:
A better algorithm for comparing two words based on their phonetic representation is metaphone, which was developed in 1990 by Lawrence Philips. The metaphone algorithm works by assigning a phonetic value to combinations of characters based on their typical use in the English language.
PHP provides an implementation of this algorithm through the metaphone function:
<?php echo metaphone ('Tabini'); echo "\n"; echo metaphone ('Tabani'); echo "\n"; ?>
Other comparison functions deal with entire phrases. For example, the levenshtein() function calculates the "distance" between two phrases, defined as the minimum number of additions, deletions, or replacements to transform a string into another:
<?php echo levenshtein ('Tabini', 'Tabani'); echo "\n"; ?>
The preceding script will return U1 because it's necessary to change only the first "i" in Tabini to an "a" to obtain the string 'Tabani'. Although a lower Levenshtein distance generally means a closer similarity between the two parameters, the value returned by this function gives a better idea of the closeness of the two sentences when you compare it to the length of the first parameter:
<?php $lev = levenshtein ('Tabini', 'Tabani'); $per = $lev / strlen ('Tabini') * 100; echo "$per\n"; ?>
This results in a value that approximates the percentage of distance between the two parameters. The preceding script will return a distance of approximately 16.67%, which can be translated in a similarity of approximately 83.33% between the two strings by subtracting the distance value from one hundred.
<?php $matches = similar_text ('Tabini', 'Tabani', &$per); echo "Matches: $matches - Percentage: $per\n"; ?>
Interestingly enough, running this script returns the following result:
Matches: 5 - Percentage: 83.333333333333