A while ago I came across the very interesting Soundex algorithm. It’s a way to find similarity between words based on how they sound – I’ll let Wikipedia explain:
Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms (in part because it is a standard feature of popular database software such as DB2, PostgreSQL, MySQL, Ingres, MS SQL Server and Oracle) and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms.
So basically Soundex can help you fix spelling mistakes by finding the word you meant to use based on how the words sound – so if you accidently search the internet for “Drawer Hellber” you’ll still be able find my blog:
Actually you won’t, but you get the point
It’s fairly easy to follow the steps of the algorithm (as defined by Wikipedia):
In a clear case of “When you have an hammer everything looks like a nail” I thought to myself – why not implement this algorithm in LINQ. And so I came up with the following code:
public string Encode(string word) { if (IsNullOrEmpty(word)) { return Empty; } return word .Select((ch, index) => EncodeCharacter(word, ch, index)) .Where((encodedChar, index) => encodedChar.IsValidEncoding && encodedChar.Curr != encodedChar.Prev) .Select(arg => arg.Curr) .Concat(Enumerable.Repeat("0", MaxEncodedLength)) .Take(MaxEncodedLength) .Aggregate((I, j) => I + j); }
If you’re interested the entire Encoder.cs code file can be found here.
Although the algorithm seems simple enough, I had four bugs in my initial implementation but they were quickly squashed using OzCode with it’s new LINQ debugging feature (in the Early Access Preview). Now that I’ve got it working, I’m going to use OzCode to show you how the Soundex algorithm processes “Drawer” and “Dror” (one of which is not my name) and check that they both provide the same results.
Let’s start with “Drawer” – it should be encoded as D660.
Stopping the debugger at the beginning of the LINQ query shows us that indeed the result of the LINQ query is “D660” but it also shows us numbers at the end of each operator – those numbers indicate how many items were returned from each LINQ operator.
Looking at these numbers, I can tell that from the 6 characters in the beginning, only three were left after Where. The rest were either Invalid or the same as the letter before (rules 1-3). Then we concatenated 4 ‘0’ and then taken the first four characters (Take) and from there it was a simple case of aggreagating all of the letters into a single string.
Let see it step by step – using OzCode’s Detailed LINQ tool window:
Step 1: First letter saved and encode the rest of the letters according to rule #2 (‘*’ means invalid encoding):
As you can see, ‘D’ was kept and the two ‘r’s were encoded as ‘6’s while ‘a’, ‘w’ and ‘e’ were replaced with stars (invalid character).
Next we get rid of invalid & duplicate letters:
As you can tell by the red X’s all of the invalid characters were thrown away.
The Select is a simple conversion from my result type into simple strings.
With your permission I’ll jump directly to Aggregate where I stitch the strings together:
Simple.
In the case of “Dror” the result is similar
How cool is that?
If you want to find out more about OzCode and LINQ debugging – try the EAP, or better yet come to visit us at the OzCode booth at NDC Oslo – we’ve just landed and we plan on having a great week.