About Projects Blog Tags Email GitHub Twitter Press Feed

02 May 2014
Finding words that sound alike but are spelled wildly differently
coding nlp

I’ve been working on search stuff lately, and we needed some wordlists to help test search results that match only because they sound similar to the query, and not because they’re spelled similarly.

Turns out we couldn’t find a pre-existing wordlist of homophones (words that sound the same but are spelled differently) that are dramatically different in spelling. And our QA team especially wanted some examples of people’s names that meet those criteria.

So, sure, I figured that’d be fun and quick to throw together for them!

It’s a lot like finding anagrams: the basic structure was a dict (a hash map, for the non-Python folks reading this) keyed by the phonetic encoding of each word. Each key pointed to a nested dict, which held a list of the words that phonetically matched the key and a bool indicating whether that group fit my criteria. In the end, every matching group was written to stdout as a comma-separated list of homophones.
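Here’s a minimal sketch of that structure, not the actual script. The `TOY_CODES` table and `toy_encode` are hypothetical stand-ins for a real phonetic encoder such as metaphone, and the default `is_match` rule (more than one word in the group) is just a placeholder for whatever criteria you care about.

```python
from collections import defaultdict

# Hypothetical stand-in for a real phonetic encoder such as metaphone.
# These codes are hand-written for illustration, not produced by any library.
TOY_CODES = {
    "knight": "NT", "night": "NT", "nite": "NT",
    "phish": "FX", "fish": "FX",
    "cat": "KT",
}


def toy_encode(word):
    """Look up the made-up phonetic code for a word."""
    return TOY_CODES[word.lower()]


def group_by_sound(words, encode, is_match=lambda group: len(group) > 1):
    """Build {phonetic_key: {"words": [...], "keep": bool}}."""
    groups = defaultdict(lambda: {"words": [], "keep": False})
    for word in words:
        entry = groups[encode(word)]
        if word not in entry["words"]:
            entry["words"].append(word)
    # Flag each group once all of its words are known.
    for entry in groups.values():
        entry["keep"] = is_match(entry["words"])
    return dict(groups)


def format_matches(groups):
    """Return one comma-separated line per group that met the criteria."""
    return [", ".join(entry["words"]) for entry in groups.values() if entry["keep"]]


groups = group_by_sound(["knight", "night", "cat", "phish", "nite", "fish"], toy_encode)
for line in format_matches(groups):
    print(line)
```

Passing the encoder and the match rule in as arguments keeps the grouping logic the same if you swap in a different soundalike algorithm or a stricter filter.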

I determined whether words were spelled differently enough by checking whether a small enough percentage of their trigrams were the same. (I also had a minimum length set, so I’d be sure to have enough trigrams per word to be worth checking for match percentage.)
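A rough sketch of that trigram check might look like the following. The specifics here are my assumptions for illustration: the ratio is measured against the smaller trigram set, and the `max_shared=0.2` and `min_length=5` defaults are made-up values, not the ones from the original script.

```python
def trigrams(word):
    """Return the set of 3-character substrings of a lowercased word."""
    word = word.lower()
    return {word[i:i + 3] for i in range(len(word) - 2)}


def shared_trigram_ratio(a, b):
    """Fraction of the smaller trigram set that also appears in the other word."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def spelled_differently(a, b, max_shared=0.2, min_length=5):
    """True if both words are long enough and share few enough trigrams.

    max_shared and min_length are assumed defaults; tune them for your data.
    """
    if len(a) < min_length or len(b) < min_length:
        return False  # too few trigrams for the percentage to mean much
    return shared_trigram_ratio(a, b) <= max_shared


print(spelled_differently("Caitlin", "Katelyn"))  # no trigrams in common
print(spelled_differently("knight", "night"))     # every trigram of "night" is shared
```

With these settings, a pair like "Sean" and "Shawn" is rejected by the length check alone, which is the trade-off of requiring enough trigrams per word.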

(It was kinda neat to find something that felt more like an interview puzzle than anything else, but was actually useful for my day job. Oh hey, look, those skills are occasionally actually useful! Now you don’t have to feel weird about all the time you spent learning how to solve these sorts of puzzles!)

Sweet and simple and fun! Here’s my script and a few of the wordlists I created with it, since I figure other people may also find this sort of thing useful when testing search implementations. (FYI, if you’re using something other than a metaphone/doublemetaphone soundalike algorithm and trigrams for misspellings, you may want to make some adjustments.)
