Tag Archives: Fun

Building a spell-checker using soundex

One of the little know facts about MySQL is that it also provides SOUNDEX() function. Soundex is a well known phonetic algorithm that indexes different words such that ‘homophones’ get the same encoding. So, lets try to see it in action using a MySQL server.

mysql> SELECT SOUNDEX('hello');
+------------------+
| SOUNDEX('hello') |
+------------------+
| H400             |
+------------------+
1 row in set (0.01 sec)

mysql> SELECT SOUNDEX('hellow');
+-------------------+
| SOUNDEX('hellow') |
+-------------------+
| H400              |
+-------------------+
1 row in set (0.00 sec)

Great! As we can see soundex’s encoding has a proper format : Lnnn, i.e the first letter of the supplied word followed by a 3-digit number. This output should mostly be same for homophones (words with similar pronunciation). Lets us now harness this fact to build a ‘not-so-perfect spell checker’.

First, create a table that stores all possible valid words and their soundex code, for brevity I will just use ‘hello’.

mysql> CREATE TABLE `spell_checker` (`word` VARCHAR(50), `code` VARCHAR(4));
Query OK, 0 rows affected (0.13 sec)

mysql> INSERT INTO `spell_checker` VALUES ('hello', SOUNDEX('hello'));
Query OK, 1 row affected (0.00 sec)

Now, once we have the soundex store in place, we can use it as the spell-checker backend to provide the correct spelling for an invalid word.

# hello
mysql> SELECT `word` FROM `spell_checker` WHERE SOUNDEX('hello') = `code`;
+-------+
| word  |
+-------+
| hello |
+-------+
1 row in set (0.00 sec)

# hellow
mysql> SELECT `word` FROM `spell_checker` WHERE SOUNDEX('hellow') = `code`;
+-------+
| word  |
+-------+
| hello |
+-------+
1 row in set (0.00 sec)

# hallow
mysql> SELECT `word` FROM `spell_checker` WHERE SOUNDEX('hallow') = `code`;
+-------+
| word  |
+-------+
| hello |
+-------+
1 row in set (0.01 sec)

# hallo
mysql> SELECT `word` FROM `spell_checker` WHERE SOUNDEX('hallo') = `code`;
+-------+
| word  |
+-------+
| hello |
+-------+
1 row in set (0.00 sec)

As we can see the output is the correct word even it we feed an invalid one.
Sounds interesting.. Isn’t it!!