Foreign languages, and the art of pain

I've been pretty lucky in my coding career so far, in that I've mostly managed to avoid writing code that has to work in foreign languages. Well, actually, that's not strictly true, but some languages are more foreign than others when it comes to storing and manipulating text within a program.

Ever since I can remember, with only minor exceptions, text has been stored in ASCII or ANSI form within all of the programs that I've worked on. This uses a single byte for each character stored within a string of text. It is simple, and it has been the way of things since time immemorial. Admittedly, this doesn't give you a very good range of visible characters, and you're restricted to just a couple of hundred displayable glyphs. Luckily, however, English and most of the accented characters of the Romance European languages are covered by the ANSI character set, and life has been good. ANSI has served me well.

But no longer. Recently I was asked to write a web site that had to support so many languages that there was no way ANSI was going to cut it. Hindi, Korean, Chinese and Japanese are just a few of the languages that I needed to support within the code infrastructure. Several of these languages have thousands of characters on their own. Luckily this was web code though, and all written from scratch with internationalisation in mind from the very start. It was a relatively simple job to store the PHP and JavaScript in UTF-8 format while putting all of the translated text into a MySQL database. The database itself had to be set up for Unicode, but this too was no big deal. All in all, the language aspect of the web site went rather well. The real problem there was having to get everything that is displayed to the user translated into so many languages in the first place.
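To make the "one byte per character" assumption concrete, here is a minimal sketch in C++ (not the PHP of the site itself) that counts characters in a UTF-8 string. The function name countCodePoints is made up for illustration, and the code assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string.
// Every code point starts with exactly one byte that is NOT a
// continuation byte (continuation bytes look like 10xxxxxx), so
// counting the non-continuation bytes counts the code points.
// Assumption: the input is valid UTF-8; no validation is done here.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For plain English text the byte count and the character count agree, which is why ANSI-era code gets away with treating them as the same thing. For Chinese text encoded as UTF-8 each character typically takes three bytes, so anything that uses the byte length as the character count goes wrong.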

Enter another scenario, and one much scarier this time. Take 1,000,000 lines of code, written in C++, using a combination of C null-terminated strings and C++ string objects, and convert that from ANSI to Unicode. Hell fire! There can be few more painful jobs in coding than this. This is precisely what I have had to do on a project recently and it really was something that could literally make your mind implode.

I daren't even try to contemplate just how many lines of code I have had to change, but I suspect it's up there in the tens of thousands. Honestly, the number of complex links between layers and layers of code is just mind-blowing. Days and days of constant edit / compile cycles without actually seeing a single difference in the running code. Remember, all of the text displayed in this project is currently in English. After all the work I have been painstakingly grinding through, day in, day out, I can't see a single tangible result for all of my monumental efforts. Knowing that I can render Chinese text when the time comes is all that I have to cling to.

All that discrete null-terminated string manipulation code that riddled the old project has had to be completely ripped out and replaced with C++ Unicode strings and their associated member function calls instead. Just working out what the old code actually did was mind-numbing enough: carefully reading through what all of those character pointers were really used for, and how the strings were being manipulated through pointer dereferencing and the like. What a holy ball-ache.
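As a rough illustration of the kind of rewrite involved, here is a small before-and-after sketch. Both functions are made up for this example (they are not from the project), and the task, pulling the extension off a file name, is just a stand-in for the sort of pointer juggling the old code did.

```cpp
#include <cstring>
#include <string>

// Old style: raw char pointers into a null-terminated ANSI buffer.
// Returns a pointer into the caller's buffer, or "" if there is no dot.
const char* extensionOfAnsi(const char* path) {
    const char* dot = std::strrchr(path, '.');
    return dot ? dot + 1 : "";
}

// New style: a wide string object and its member functions.
// Returns a copy of everything after the last dot, or an empty
// string if there is no dot at all.
std::wstring extensionOf(const std::wstring& path) {
    const std::wstring::size_type dot = path.rfind(L'.');
    if (dot == std::wstring::npos) {
        return std::wstring();
    }
    return path.substr(dot + 1);
}
```

The second version is no shorter, but it owns its result and says what it means, which matters when the character type underneath it changes. Neither version tries to cope with dots in directory names; that is left out to keep the sketch small.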

And all those (LPARAM)s that I've had to track down by hand too. Why is the Windows messaging system so utterly sh#t when it comes to text? They made Windows nice and Unicode friendly, then they made the messaging system completely typeless so that catching problems with changing string formats becomes just that much more painful.
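One way to limit this, sketched here under assumptions rather than taken from the project, is to keep the casts in a single pair of typed helper functions. MessageParam below is a portable stand-in for LPARAM, and packText and unpackText are hypothetical names. The idea is that if the string type changes again, the compiler complains at every call site instead of silently accepting a cast.

```cpp
#include <cstdint>
#include <string>

// Stand-in for the Windows LPARAM type: a pointer-sized integer
// that carries no information about what it really points to.
using MessageParam = std::intptr_t;

// The only place a string pointer is turned into an untyped value.
// Passing anything other than a std::wstring* fails to compile.
inline MessageParam packText(const std::wstring* text) {
    return reinterpret_cast<MessageParam>(text);
}

// The only place the untyped value is turned back into a string.
// Assumption: param was produced by packText and the string it
// points to is still alive.
inline const std::wstring& unpackText(MessageParam param) {
    return *reinterpret_cast<const std::wstring*>(param);
}
```

This does nothing for the existing casts scattered through a code base, which still have to be found by hand, but it stops new ones from creeping in.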

All of those stored file names and text strings within your data files, though, have got to be the worst thing of all. Oh God, the agony of having to version your new data files with the new Unicode strings, while keeping the old loaders for previously stored ANSI strings and converting those up at load time. Not the clean lines you were hoping to achieve with all of this code refactoring.

And then there's the inevitable stuff that just cannot be stored in Unicode at all so you have to convert it every time you display it. The whole thing has been an utterly miserable experience from start to finish.

But now I'm done. The project is running again and I can actually get on with coding new stuff instead. What a release. But there is more pain to come, it seems. Now I have the depressing job of tracking down all of the hard-coded text within the project and changing it to reference a translatable resource file instead.
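In its simplest form that change looks something like the sketch below. The StringTable type, the lookupText function and the idea of keying text by an ASCII identifier are all assumptions for illustration; a real project would load the table from whatever resource format it uses.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical string table: an ASCII identifier mapped to the
// translated text for the current language.
using StringTable = std::unordered_map<std::string, std::wstring>;

// Looks up the translated text for an identifier. If the identifier
// is missing from the table, the identifier itself is returned
// (widened character by character, which is only valid because the
// identifiers are assumed to be ASCII) so the gap is visible on screen.
std::wstring lookupText(const StringTable& table, const std::string& id) {
    const auto it = table.find(id);
    if (it != table.end()) {
        return it->second;
    }
    return std::wstring(id.begin(), id.end());
}
```

Every hard-coded literal in the code then becomes a call such as lookupText(table, "menu.file.open"), and the English text moves out into the table alongside the other languages.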

Sometimes, coding just isn't any fun at all ...

[Posted 06/17/2010]
