Thursday, October 28, 2010

Porting C# Code to PHP (Porter Stemming Algorithm)

Jeff Mose wrote an absolutely great article, "Notes from porting C# code to PHP". It is perhaps the most objective article about PHP ever written.

It inspired me to revisit my port of Kamil Bartocha's C# implementation of the English (Porter2) stemming algorithm, which I did a few years ago.

My native language is PHP and, of course, my view on the problem is from the opposite direction.

PHP and C# are similar. It was not hard to understand the C# code and port it to PHP. However, I know it would be much, much harder for me to go the other way.

The most notable difference was the set of string manipulation functions that come out of the box in PHP. Although there was no need to port the C# StringBuilder class, I did it for the sake of clarity.
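To give a rough idea of what such a port can look like, here is a minimal sketch of a StringBuilder-like class in PHP. It is not the class from the project: the method names (append, remove, length) are assumptions loosely modeled on the C# API. The last line shows the same step done with a built-in string function.

```php
<?php
// Hypothetical minimal StringBuilder; NOT the class from the actual port.
final class StringBuilder
{
    private $buffer;

    public function __construct($initial = '')
    {
        $this->buffer = (string) $initial;
    }

    // Append text to the end of the buffer.
    public function append($text)
    {
        $this->buffer .= $text;
        return $this;
    }

    // Remove $count characters starting at offset $start.
    public function remove($start, $count)
    {
        $this->buffer = substr_replace($this->buffer, '', $start, $count);
        return $this;
    }

    // Current length of the buffer in bytes.
    public function length()
    {
        return strlen($this->buffer);
    }

    public function __toString()
    {
        return $this->buffer;
    }
}

$sb = new StringBuilder('caresses');
$sb->remove($sb->length() - 2, 2);    // drop the last two characters
echo $sb, "\n";                        // prints "caress"

// The same step with a built-in string function:
echo substr('caresses', 0, -2), "\n";  // prints "caress"
```

The wrapper adds nothing that substr() and friends cannot do, which is why it is optional in PHP. Its only benefit is that the ported code reads closer to the C# original.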

I made a few changes to make it run on PHP 5.3 and added a test page. The Porter2 Stemming Algorithm in PHP 5.3 project is available at Kenai.

Unfortunately, I made the bad choice of using a flat INI-formatted text file to store the sample English vocabulary and its stemmed equivalent. As a consequence, I had to remove around 20 words such as false, true, on, off, etc.
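The sketch below illustrates why such words are a problem in INI data. It assumes a "key = value" layout with made-up keys (word1, word2, word3), and it only demonstrates the value side. The PHP manual also lists these words as reserved for keys, which this example does not cover.

```php
<?php
// Hypothetical INI content; the keys are placeholders, not real vocabulary entries.
$ini = "word1 = caress\n"
     . "word2 = true\n"
     . "word3 = off\n";

// Default scanner mode: boolean-like words are converted.
$normal = parse_ini_string($ini);
var_dump($normal['word1']); // string(6) "caress"
var_dump($normal['word2']); // string(1) "1"
var_dump($normal['word3']); // string(0) ""

// Raw scanner mode (available since PHP 5.3): values are left as written.
$raw = parse_ini_string($ini, false, INI_SCANNER_RAW);
var_dump($raw['word2']);    // string(4) "true"
var_dump($raw['word3']);    // string(3) "off"
```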

Since the vocabulary has almost 30,000 words, you will have to increase the maximum execution time in php.ini if you decide to run the test.
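If editing php.ini is not convenient, the limit can also be raised from within the script itself. A minimal sketch follows; the value of 300 seconds is an arbitrary assumption, not a measured requirement of the test page.

```php
<?php
// Equivalent php.ini setting would be:  max_execution_time = 300
// Raise the limit for this script only; 300 is an arbitrary example value.
set_time_limit(300);

// Read the effective limit back to confirm the change.
echo ini_get('max_execution_time'), "\n";
```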

I am working on both problems. Most likely I will try to store the vocabulary and its stemmed equivalent in XML format. I am also looking into using Dojo Data Grid to display the test results.
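As a rough sketch of the XML idea, the vocabulary could be stored as one element per word and read with SimpleXML. The element and attribute names (vocabulary, entry, word, stem) are assumptions made for illustration, and the three entries are sample data only.

```php
<?php
// Hypothetical layout: one <entry> per word, with the stem as an attribute.
$xml = <<<XML
<vocabulary>
  <entry word="caresses" stem="caress"/>
  <entry word="true" stem="true"/>
  <entry word="off" stem="off"/>
</vocabulary>
XML;

$vocabulary = simplexml_load_string($xml);

// Build a word => stem map; attribute values are cast to plain strings.
$stems = array();
foreach ($vocabulary->entry as $entry) {
    $stems[(string) $entry['word']] = (string) $entry['stem'];
}

var_dump($stems['true']); // string(4) "true"
var_dump(count($stems));  // int(3)
```

Because the words are attribute values rather than INI tokens, entries like true and off come through unchanged.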