2008-05-06

Mr Donald, Please Correct the Alphabet First!

I have already replied to Mr Donald Gaminitillake's mudslinging campaign against Sinhala Unicode, which he wages through the akuru.org web site and by hijacking discussions on various blogs and forums.

Mr Donald's motives are quite clear. He claims that every Sinhala character shape needs an individual "code point", and has applied for a patent for this "invention". With Sinhala Unicode becoming mainstream, avenues for making money from his pending patent are wearing thin.

So he is doing what any desperate human being (or animal for that matter) would do; try everything to remove the "opponent".

One of the examples Mr Donald always uses is the absence of the character "du" from the Sinhala Unicode codepage.

Of course he conveniently forgets to mention that "da" and "papilla" are in fact available. Well, it requires a bit of brains to put them together. ;-)
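To make this concrete, here is a minimal sketch (Python is used purely for illustration; the two code point values are the ones assigned in the Unicode Sinhala block). It builds "du" from "da" followed by "papilla" and prints the official name of each code point:

```python
import unicodedata

DA = "\u0DAF"       # SINHALA LETTER ALPAPRAANA DAYANNA ("da")
PAPILLA = "\u0DD4"  # SINHALA VOWEL SIGN KETTI PAA-PILLA ("papilla")

# Two code points in sequence; a Sinhala-aware font renders them as one shape.
du = DA + PAPILLA

for ch in du:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
print(du, len(du))
```

The font and rendering engine are responsible for drawing the pair as the single "du" shape, so no separate code point is needed for it.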

Mr Donald, there are lots of characters missing from the Sinhala Hodiya (alphabet), including your infamous "du", let alone "yansaya" and "rakaransaya". If you love the Sinhala language as much as you claim, please start a campaign to "fix" the Hodiya!

I have previously pointed out this similarity between the Hodiya and Sinhala Unicode, and why "da" + "papilla" is as good as "du". This blog post discusses the technicalities in detail, including the matter of "yansaya" and "rakaransaya".

Unfortunately for Mr Donald, his "opponent", namely Sinhala Unicode, is growing stronger day by day. Implementations are maturing, more standards-compliant fonts are beginning to appear, and as I wrote earlier, more web sites and blogs are now Unicode compliant (e.g. Sinhala Bloggers, Sinhala Wikipedia, Sinhala Blogs and of course our own Sinhala GNU/Linux).

13 comments:

කාලිංග said...

Well said, Anuradha. This fellow just doesn't understand what he's even talking about. Rather than doing something to help the Sinhalese, this guy is just talking and talking. If he has a better solution, why doesn't he take the time to make a product out of it? The time he spends fighting with people over Unicode should have gone into the product. If he can make a better product, people will use it, but all he has is words. It's as simple as that. This is exactly what the saying "planting sweet potatoes with your mouth" (all talk and no action) means.

Sean said...

Nice blog!! This blog is great for opening other people's eyes about this too...

Anuradha Ratnaweera said...

This is for those who are not aware of the history of this topic.

We tried to initiate a friendly discussion with Mr Donald. Here is the full thread. However, not only did he continue to repeat the same story without listening at all, but also selectively quoted and published from the discussion to his advantage. Here is the same page on archive.org as of April 2007 in case it changes.

I didn't want to write this post, or my earlier blog post on Mr Donald's claims, but as he continues his mudslinging campaign, which confuses newcomers to the topic, I couldn't resist.

Sam said...

I’m not a Unicode expert.
How can I get a character count of 3 for [රාබොමු] instead of 6? (Just an example.) Or is it actually supposed to be 6? Or should we use some sort of logic to ignore all those legs and caps and flags and whatnot before doing the character count? Also, I tried to sort a character list with Java and it failed miserably, but Excel did a good job. Do you guys have any idea why?
Just curious..

Anuradha Ratnaweera said...

Dear Sam,

රාබොමු is supposed to be counted as 6 characters. If you can give me an example where you need it counted as 3, perhaps we can figure out a way.
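As a rough sketch of both counts (Python is used only for illustration, and the helper name `count_letters` is made up for this example), the word can be counted once by code points and once by "visible letters", where combining vowel signs are skipped:

```python
import unicodedata

# "raabomu" written as explicit code points, so the example does not
# depend on how this page happens to be encoded.
WORD = "\u0DBB\u0DCF\u0DB6\u0DDC\u0DB8\u0DD4"

def count_letters(text):
    """Approximate count of visible letters: ignore combining marks.

    Assumption: every code point in category Mn or Mc attaches to the
    preceding letter. Conjuncts built with ZWJ are NOT handled here.
    """
    return sum(1 for ch in text
               if unicodedata.category(ch) not in ("Mn", "Mc"))

print(len(WORD))            # 6 code points
print(count_letters(WORD))  # 3 visible letters
```

A complete solution would follow the grapheme cluster rules in Unicode's text segmentation annex rather than this shortcut.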

Most programming languages I use (C, Perl, etc.) do a decent job when it comes to sorting, but something called collation data is needed to make it perfect. For the UNIX C library, this should be part of the Sinhala locale. I am not sure about Java.

As far as I know, the UCSC did comprehensive research to finalize the collation order (yes, there were some very fine points), and they also produced collation data. I am not sure how much of it has made it into the language packs. I'll try to get someone more knowledgeable on the subject to comment here.
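To illustrate the difference between plain sorting and locale-aware sorting, here is a hedged sketch in Python. The function name `sort_words` is hypothetical, and the locale name `si_LK.UTF-8` is an assumption about how a Sinhala locale would be named on a given system; it may not be installed, in which case the sketch falls back to plain code point order.

```python
import locale

def sort_words(words, locale_name="si_LK.UTF-8"):
    """Sort using the named locale's collation data when available,
    otherwise fall back to plain code point order."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return sorted(words)  # locale (and its collation data) not installed
    try:
        # strxfrm turns each string into a key that compares in locale order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

words = ["\u0DB8\u0DD4", "\u0D9A", "\u0DB6\u0DDC"]
print(sort_words(words))
```

Whether the two orders differ for a given word list depends entirely on the collation data shipped with the system.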

Anuradha Ratnaweera said...

Harshadeva has also blogged on this topic.

Sam said...

Well, if we treat that as 6 letters, then it is 6 letters. I have no issue with it.
If that is the case, we may have to start addressing other, non-Unicode-related issues. Maybe other countries (languages) have already solved those issues, and we can follow them.

For example, how much character length should we practically allocate for the last name field or the street name field in, for instance, the immigration database? If you have seen an immigration form, it has a box for every letter. So we may write in English like |R|A|B|O|M|U| and in Sinhala |රා|බො|මු|.
Now the possible issue here is that English has 6 letters (9 bytes on the HDD) and the Sinhala human has a very short, excellent 3 letters (12 bytes??), but the Sinhala computer has a lengthy 7 characters (21 bytes), more than double what we humans read. Since Sinhala human characters and Sinhala computer characters are technically different, I have to allocate at least 5 times the length of the field in the database (should it be 5 or more?) for the possible combinations of characters coming from each letter space I have given in a form (such as an immigration form).

Sinhalese are the type of people who like to include their entire family tree in their last name and name streets after the head monk of the village temple, who happens to have a name a 5th grader may not be able to write. We write lengthy names and complex sounds, maybe because so far we have had the luxury of writing complex sounds using a minimum number of letters, which we are not going to have with computers any more. Blogging in Sinhala is quite fun, since I don't have to pay for each character I use, but if I have to write my address in an SMS, where each character costs a considerable amount, all those hats, flags and shoes are going to be expensive. The cost may look tiny at first glance, but considering that UPS saved a couple of million dollars last year by having its delivery trucks avoid right turns at intersections, and American Airlines saved a few million dollars by removing one olive from a salad, in the long run those extra 4 characters will cost us quite a large sum.

But if that is how it is, then that is how it is. I know Korean letters also have Hats, Flags and Shoes like we do, but I don’t know how they handle those. Do you guys have any idea?

Outstander said...

Hmm... good thinking!
But with the boom in storage devices and memory technologies, gigabytes have become surprisingly cheap. When it comes to mobile devices there is still a problem, but not for long.

Anuradha Ratnaweera said...

Hi Sam,

First, apologies for not replying earlier.

Yes, you have a very valid point. Let's explore the options to write රාබොමු.

- Present system needs six characters (not seven) = 18 bytes

- If we go with a one-character-per-shape system, we will have to go further down in the Unicode range, and would probably require 4 bytes per character, like old Chinese. This would be 12 bytes.

- If we go with our own standard, one character per shape, we will need 2 bytes per character, so 6 bytes to write රාබොමු.

- If we go with our own standard, but in a manner similar to Unicode, we can use one byte per character and we will still use 6 bytes.

Summary:
- Unicode - present system: 18 bytes
- Unicode - one char per shape: 12 bytes
- Custom - one char per shape: 6 bytes
- Custom - Unicode like system: 6 bytes
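
The two Unicode figures can be checked with a short sketch (Python, for illustration only). It assumes UTF-8 storage, which is what the 18-byte figure implies, and uses U+10000 as a hypothetical stand-in for a per-shape code point outside the Basic Multilingual Plane. A UTF-16 count is included for comparison; that figure is not part of the summary above.

```python
# "raabomu" as the six code points used by the present system.
WORD = "\u0DBB\u0DCF\u0DB6\u0DDC\u0DB8\u0DD4"

# Sinhala code points take 3 bytes each in UTF-8.
utf8_bytes = len(WORD.encode("utf-8"))

# For comparison: 2 bytes per code point in UTF-16 (no byte order mark).
utf16_bytes = len(WORD.encode("utf-16-le"))

# Hypothetical one-code-point-per-shape scheme: three shapes, each
# assigned beyond the Basic Multilingual Plane (4 bytes in UTF-8).
STAND_IN = "\U00010000"
per_shape_bytes = 3 * len(STAND_IN.encode("utf-8"))

print(utf8_bytes, utf16_bytes, per_shape_bytes)  # 18 12 12
```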

However, custom systems will not be accepted by software written in other countries, or Free And Open Source Software like GNU/Linux.

So we have only the first two options. Each has its advantages and disadvantages. The example you mentioned is a strong point against the present system, not just for Sinhala, but for all other South Asian scripts, Thai and others.

The advantages include the possibility of creating any complex character. For example, recently I saw an old document which had a "repaya" on top of "thayanna bandi mahaprana thayanna"! This is a combination of three consonants.

Let's say somebody found a new shape like the above.

With present Unicode, we can easily encode this: just the three consonants, joined using al-lakuna and ZWJ (zero-width joiner).

If we have a character-per-shape system, a committee will have to decide on a new code point, submit it to Unicode, and wait for it to be included in Unicode: a long process.

And that will have to be done for every new character shape found like the above.

So there we are, two systems with advantages and disadvantages in each. Obviously we have to go with one.

Sam said...

By no means am I suggesting that we find our own path instead of Unicode.
Talking about the long process, I guess we have time. We already have 2000 years or more behind us, and I hope we have a similar span in front of us. The whole point of going to Unicode is that it is easier to update globally over time.


Again, I'm not a Unicode expert. I'm talking from a user's point of view.


Is it possible to have a combination of both options one and two?
If somebody wrote something utterly unique back in the day, should we still be able to create those characters? Assuming character combinations have no artistic or magical value and we use them for communication alone, if we can still write those old character combinations with the current characters without losing sound or meaning, then ignoring those combinations may not be a big practical issue either. Is it?


Even though Outstander's point is very hopeful, I don't think it will help us much in some cases. Take eBay, for example. eBay allows only 55 characters in its item title. The reason is that a large number of records, over a long period of time, uses a lot of processing power, even for a big organization like eBay. Cotton may weigh very little, but cotton vendors sell cotton by the kilogram. Quantity turns anything tiny into big numbers. And it is not only that; there are interface issues and marketing issues too.

So the problem is not storage alone: we cannot satisfactorily restrict our application interfaces to 55 LETTERS. Think about a very common everyday situation: we have to make a form and say next to the field (අකුරු පනහක් පමනයී, "fifty letters only"). We cannot technically do that right now. (I may be wrong. I hope I'm wrong; please correct me if I am.) So now we have to say something technical like (50 Characters only). Now my question is, how can we translate "50 Characters only" into Sinhala? Any idea?


I bring up this question because I was trying to translate a web application into Sinhala, and I faced these sorts of issues. I have a 20-character length for the first name field, a fair enough length for English. But if I use Sinhala Unicode, I have to change that to 100 to allow for the possible combinations of characters that may come from 20 Sinhala letters. Practically, it is not worth expanding my database, since I'm paying a premium price for database storage. My workaround is to write a converter that converts all the Unicode LETTERS (multiple characters) into my own personal character table, and back. That way, I can count exactly 20 letters and stop the user from typing more than that, and I also don't have to expand the database. But then again, if I have to do that, I'm not using Unicode in my back end, only in my front end. It looks like a lot of programming to do, but it could be a commercially viable product if we stay with option number one. Still, I like the second option, since we wouldn't have to go through any of those troubles.
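
One possible way to enforce a limit in LETTERS without a private character table is sketched below (Python, for illustration only; `truncate_letters` is a hypothetical helper, not part of any existing library). It treats each base character plus the combining vowel signs that follow it as one letter, and cuts the input at a letter boundary. It addresses only the input limit, not the storage size.

```python
import unicodedata

def is_mark(ch):
    # Combining marks (vowel signs, al-lakuna) attach to the previous letter.
    return unicodedata.category(ch) in ("Mn", "Mc")

def truncate_letters(text, max_letters):
    """Keep at most max_letters visible letters, never splitting a
    letter from its marks. Assumption: conjuncts joined with ZWJ are
    not handled by this simple rule."""
    if max_letters <= 0:
        return ""
    kept = []
    letters = 0
    for ch in text:
        if not is_mark(ch):
            if letters == max_letters:
                break  # the next letter would exceed the limit
            letters += 1
        kept.append(ch)
    return "".join(kept)

WORD = "\u0DBB\u0DCF\u0DB6\u0DDC\u0DB8\u0DD4"  # "raabomu", 6 code points
print(len(truncate_letters(WORD, 2)))  # 4 code points, 2 letters
```

The stored value stays in standard Unicode; only the limit check works in letters.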

SeeJay said...

I think we all know what Donald Gaminitillake is up to. Don't let these money minded people slow you down.
Keep up the Great work!

madura said...

This guy is spreading pure bull****. He can make new glyphs with a font of his own; there is no need for a separate character set. He'd have to draw all those glyphs that he says need to be separate. Well, he can try to output a TTF of like 100 MB and use a supercomputer to render a Sinhala article... while we stay with Unicode, made for sane people.
And sorting is no problem as far as I can see: all my files with Sinhala names get sorted correctly.

donald gaminitillake said...

My patent has been approved.
SLSI 1134 and Sinhala Unicode do not represent our language correctly.
The patent is a public document; you can read the text at the Patent Office.
The number is 13120.

Donald Gaminitillake