2006-03-20

Is Sinhala Unicode Incomplete?

"The SLSI 1134 is incorrect & incomplete and it should be corrected immediately.", claims Mr Donald Gaminitillake, who is trying to ignite a campaign against Sinhala Unicode standard through www.akuru.org (history of the site), and frequent newspaper articles.

We of the Sinhala GNU/Linux project think otherwise, and we are not alone. The Language Technology Research Center of the University of Colombo School of Computing, research groups from the University of Moratuwa and the Arthur C Clarke Center for Modern Technology, Microsoft, Microimage and Science Land also think that the standard is correct. A full list would be quite long.

GNU/Linux was the first platform to implement Sinhala Unicode rendering. We didn't find any issues in encoding or displaying the characters Mr Donald claims are impossible: yansaya, rakaaransaya, reepaya, joint letters and all that. Microsoft then released a "Sinhala Enabling Kit for Windows". Most vendors today support Sinhala Unicode. None of them, having actually got their hands dirty writing code to implement the standard, see any missing "letters" in it.

Implementation is proof enough for most people. But for some not-so-obvious "reason", Mr Donald continues to say that certain characters are missing!

Our first encounter with Mr Donald happened when I wrote an open letter to him, which became a lengthy debate (more, more and a separate archive) on our project mailing list. Harshula, our standards expert, tried to explain to Mr Donald how the basic Unicode code-page and cartesian products of various "sets" create the complete Sinhala character set. However, Mr Donald never tried to cooperate with us in "understanding" it, and the discussion led nowhere.

However, Mr Donald selectively quoted some parts of the discussion on his site... ;-)

Recently, Niranjan Meegammana, creator of kaputa.com, started a Google Group to communicate in Sinhala - using Unicode. This group has now grown to a very interesting community of a unique, intellectual and polite culture. Although the group members use diverse technologies to write and read Sinhala Unicode, we find the standard quite functional and interoperable. And we use yansaya, rakaaransaya and other "special characters" every day.

Most of us on this group have a great passion for language and literature, and therefore the discussions are very interesting and intellectually rich.

This Google group was meant to communicate in Sinhala Unicode to popularize it, and to act as a test bed for implementations. Mr Donald recently joined the group, not to communicate in Sinhala Unicode, but to start another debate. He continues to repeat the same old story and conveniently ignores some of our questions.

Here are some of Mr Donald's claims and what I think of them.

Donald G: SLS1134 doesn't contain all the Sinhala characters

Wrong. Here is why:

Most Western languages use simple alphabets. Even with upper and lower case variants, and some "odd" characters with bubbles and hats, the number of characters doesn't exceed 50-100.

However, Asian languages are different. Most characters have different "forms", determined either phonetically (e.g.: Sinhala, Tamil and Hindi) or by their position within the word (e.g.: Arabic). Therefore, it's impractical to allocate a code point for each variant.

Think of atoms and molecules. There is a limited number of atoms, and molecule names can be formed by putting together the names of atoms. I have never heard of a "Chemists' Revolution" demanding a symbol for each molecule....;-)

Unicode is very similar to chemistry in that sense. Each script is assigned a "code page" (a "block" in Unicode terminology), typically containing 128 "code points". They form the basis for building more complex character variants, i.e., the actual characters seen by the eye, sometimes referred to as "glyphs".

In Western languages, "characters", "glyphs" and "code points" are mostly the same thing, because they don't need variants. For example, the English character, or glyph, "A" maps to code point 65 (hexadecimal 0041), one to one.

For complex languages, only basic characters are represented by code points. Variants are produced by sequences of code points. For example, the character "da" (as in "dambana") is directly mapped to code point 0DAF, whereas "du" (as in "dumriya"), which is a variant of "da", is produced by the sequence of two code points 0DAF ("da") and 0DD4 ("papilla"). More complex characters (glyphs) are formed by longer sequences.
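As a rough illustration, the two-code-point sequence for "du" can be inspected with Python's standard unicodedata module. This is only a sketch: it assumes U+0DD4 for the papilla (the Unicode character database names it SINHALA VOWEL SIGN KETTI PAA-PILLA), and the helper name describe is made up for this example.

```python
import unicodedata

DA = "\u0DAF"       # "da", a basic character with its own code point
PAPILLA = "\u0DD4"  # "papilla", the vowel sign that turns "da" into "du"

# "du" has no code point of its own; it is simply the two-code-point sequence.
du = DA + PAPILLA


def describe(text):
    """Return (hex code point, Unicode name) for every code point in text."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(du):
    print(code_point, name)
```

Printing du in a terminal or editor with a Sinhala-capable rendering engine should show a single joined glyph, even though the string holds two code points.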

Most modern operating systems have rendering engines that can display the proper glyphs from these sequences of code points (e.g.: Pango, Qt and ICU on GNU/Linux, Uniscribe on Windows). Therefore, it is not a problem that each glyph doesn't have an individual code point.

In Unicode, some characters are directly mapped to code points, while others are produced by sequences of two or more code points.

Deciding which characters should be basic code points, and which should be produced by combining code points, is a different question. It obviously depends on the language, and is likely to be subjective. Input from several Sinhala scholars and experts has been taken into account in deciding that repaya, rakaaransaya and yansaya should not be basic code points, but should be produced by sequences of code points, as they are linguistically alternative forms. In other words, they are there as sequences of code points, not as single code points. Nevertheless, they are there, so the claim is wrong.

If Mr Donald's claim is "yansaya, rakaaransaya and reepaya should be individual code points", that would be more valid. However, somebody has to eventually decide what's basic and what's not, and it has already been done. Technically, this is not an issue at all.

Donald G: Unicode can't produce a matrix of 1600+ characters needed by OCR etc

Wrong. Here is why:

I am not an expert on OCR, but if Mr Donald claims that OCR requires a matrix of 1600+ characters, that's exactly what Sinhala Unicode is. Only it doesn't list all the 1600+ characters, but defines the basic code points (not characters) and the way to generate all the other characters by using sequences of them.

Even a primary school kid can understand something like this: "ka and papilla produce ku, and this rule applies to all the consonants." It would be ridiculous if the document describing the standard included a 1600+ entry table listing each variant (ka + papilla = ku, kha + papilla = khu ... la + papilla = lu and so on)... ;-)
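The "rule instead of table" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the consonants to be the assigned characters between U+0D9A and U+0DC6 and the vowel signs to be those named SINHALA VOWEL SIGN in the Unicode character database, and it builds only consonant + vowel sign pairs. It does not try to reproduce the full 1600+ matrix, which would also need bare consonants, hal forms and joint letters. The helper name named is made up for this example.

```python
import unicodedata
from itertools import product


def named(start, end, prefix):
    """Assigned characters in [start, end] whose Unicode name starts with prefix."""
    return [
        chr(cp)
        for cp in range(start, end + 1)
        if unicodedata.name(chr(cp), "").startswith(prefix)
    ]


# Basic code points: consonants and dependent vowel signs of the Sinhala block.
consonants = named(0x0D9A, 0x0DC6, "SINHALA LETTER")
vowel_signs = named(0x0DCF, 0x0DF3, "SINHALA VOWEL SIGN")

# The "table" is just the cartesian product of the two sets.
syllables = [c + v for c, v in product(consonants, vowel_signs)]

print(len(consonants), "consonants x", len(vowel_signs), "vowel signs =", len(syllables))
print(" ".join(syllables[:10]))
```

A few dozen basic code points already yield several hundred combined forms, which is why the standard lists the building blocks and not every product.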

Showing the basic code points and claiming "not all the characters are here" for the first time is fine. Second time is still fine, IMHO. But 100+th time is definitely a joke... ;-)

Donald G: SLS 1134 doesn't consider Tamil

There is no need.

Character representation in SLS 1134 is almost identical (if not identical) to the Sinhala subset of Unicode. As Sri Lanka is the only country with a major Sinhala speaking population, it's SLSI's responsibility to contribute to Sinhala in Unicode, and SLSI does this through SLS 1134. Developers eventually use Unicode. To my knowledge, none of the FOSS packages found in a typical GNU/Linux system refer to SLS 1134. In other words, SLS 1134 is more of an intermediate standard.

India has a much bigger Tamil speaking population, and the Unicode code page for Tamil has already been worked out. Therefore, there is absolutely no need to create a separate standard for Sri Lanka. Sinhala Unicode is not a Sri Lankan standard either.

Donald G: Sinhala Unicode doesn't have yansaya on the keyboard

Wrong. Here is why:

Unicode is about representing characters. How they are typed on the keyboard is completely up to the keyboard driver. There are different keyboard drivers: some are classic Wijesekara, some are modified Wijesekara, and some are transliterated (somewhat "singlish"). Some driver authors include yansaya etc. on the keyboard itself, whereas others provide ZWJ (zero width joiner) as an alternative way to type them.

Whatever the keyboard is, yansaya, rakaaransaya and repaya can be typed, and eventually represented and displayed in the same code point sequences.
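To make the ZWJ approach concrete, here is a minimal sketch of the code point sequences involved, as I understand them from the Sinhala section of the Unicode standard: the al-lakuna (U+0DCA) followed by ZWJ (U+200D) requests the joined form. The function names are made up for this example, and the exact sequences should be checked against the standard before relying on them.

```python
AL_LAKUNA = "\u0DCA"  # SINHALA SIGN AL-LAKUNA (the "hal" sign)
ZWJ = "\u200D"        # ZERO WIDTH JOINER
YA = "\u0DBA"         # SINHALA LETTER YAYANNA
RA = "\u0DBB"         # SINHALA LETTER RAYANNA


def yansaya(consonant):
    """consonant + al-lakuna + ZWJ + ya (assumed sequence for yansaya)."""
    return consonant + AL_LAKUNA + ZWJ + YA


def rakaaransaya(consonant):
    """consonant + al-lakuna + ZWJ + ra (assumed sequence for rakaaransaya)."""
    return consonant + AL_LAKUNA + ZWJ + RA


def repaya(consonant):
    """ra + al-lakuna + ZWJ + consonant (assumed sequence for repaya)."""
    return RA + AL_LAKUNA + ZWJ + consonant


KA = "\u0D9A"  # SINHALA LETTER ALPAPRAANA KAYANNA
for label, text in [("yansaya", yansaya(KA)),
                    ("rakaaransaya", rakaaransaya(KA)),
                    ("repaya", repaya(KA))]:
    print(label, text, " ".join(f"{ord(ch):04X}" for ch in text))
```

A keyboard driver with a dedicated yansaya key and one that makes the user type ZWJ explicitly would both end up storing the same four code points.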

Other claims

There are so many other claims on akuru.org. For example, Mr Donald from time to time challenges that certain words can't be "written" in Sinhala Unicode (latest being the name of the President). When we send him screenshots to show that it's possible (with and without joint characters), he claims that they are fake!!!

Hidden agenda?

There is a saying that it's easy to wake up a sleeping person, but it's very difficult to wake up someone who pretends to sleep.

Mr Donald has applied for a patent for his "system". Although he doesn't seem to have implemented it, he has promised to deliver results if given an "opportunity" (as far as I know, nobody is holding him back). And as Sinhala Unicode is becoming mainstream, his "pending" patent is going to be worthless, unless... oh, well!

Update: 2006-03-21 08:00

There are "valid" articles on akuru.org. Some are about the history of characters, and some are good articles by other authors. For example, articles written by Mr Aelien Silva, one of my favourite writers and linguists, who has created so many good Sinhala technical words (e.g.: "manu", "thekala"), bring out very good points about technology localization. In fact, I have often quoted Mr Aelien Silva on the Sinhala Unicode list and elsewhere (you need to enable Sinhala Unicode to read it; instructions are here for GNU/Linux and here for Windows, not sure how to do it on Mac... :-( ). However, I believe that hosting such articles is just an attempt to make akuru.org, which would otherwise be totally useless, look more authentic.

19 comments:

Ragazzo Freddo said...

Well said Anuradha. But I doubt that Mr. Donald can understand this fact, because he is held by his blind beliefs. As I see it, his final target is to make Sinhala Unicode unpopular and give his system validity.

His site is called Akuru.org, but I can't see any Sinhala text on it. It's another subsidiary joke site of crazylanka.

JC said...

I am reading this battle about Sinhala Unicode with some apprehension. I hope we could keep the rhetoric civil.

Your technical explanation seems reasonable and practical. You point to a link that has instructions on how to enable Sinhala Unicode for Windows: fonts.lk
I went to its Sample Sinhala Pages and see the Sinhala words mangled. (e.g. kombuva following the vyaçjanaya).

It seems that I need the code point sequencing software to fix the words.

Is there some place where all the characters that have their own code points and, more importantly, those that would be generated by sequences of code points are displayed? Or is this a job for the keyboard layout drivers?

Thank you.

Donald Gaminitillake said...

Why not Ragazzo call this Crazy Sinhala Unicode SLSI 1134

I quote from your text:

"Sinhala Unicode is not a Sri Lankan standard either."

--- you have admitted the fact your unicode is incorrect ----

Even arabic ISO/IEC 10646-1 1993
these were parts of characters (glyphs)
They found the problem and re did it
ISO/IEC 10646-1 1993

I QUOTE JUST ONE NUMBER FROM UNICODE ISO/IEC 10646-1 1993

Table 121 -row FE Arabic presentation form -A (....B)

FDFB = Arabic ligature JALLAJALALOUHOU


I quote from unicode
If Sinhala is registered in Unicode paste a location for "DU"

I quote from Unicode Consortium webpage (www.unicode.org)

"quote"
What is Unicode?
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Fundamentally, computers just deal with numbers. They store letters and
other characters by assigning a number for each one. Before Unicode was
invented, there were hundreds of different encoding systems for
assigning these numbers. No single encoding could contain enough
characters: for example, the European Union alone requires several
different encodings to cover all its languages. Even for a single
language like English no single encoding was adequate for all the
letters, punctuation, and technical symbols in common use..............

"unquote"

Answer the question ----- "DU" "Yansaya" "repaya" in four digits in
unicode
"Yaksha" = (word) therefore two sets of unicode numbers = Ya = ,
Ksha = .

You went and registered the wrong Sinhala character set with the Unicode
Consortium: SLSI 1134.
Now we all suffer and you ruin my language, Sinhala.
We have got to correct it to save the Sinhala language.
I request you to join me and raise your voice to correct SLSI 1134.

Anuradha said...

Here are two more myths:

Unicode provides a unique number for every character

This is just playing with words by quoting one isolated sentence. Here is why:

First, this is supposed to be quoted from the official Unicode site. However, the word "character" in this sentence has an implicit "basic" prefix, which is understood by anyone who has studied the standard in detail.

To put it more precisely, Unicode provides a unique code point only for basic "characters". Other characters are then generated by sequences of them.

This is not something special for Sinhala. Most of the other Asian languages are also quite happy with basic code points.

Mr Donald: Answer the question - "DU" "Yansaya" "repaya" in four digits in
unicode


I have asked three times: "why four digits? what's wrong with six, eight or hundred?" And now I am going to ask for the fourth time.

Anuradha said...

This FAQ page on official Unicode site explains why Mr Donald's "each character should have a unique code point" claim is a myth. Notably, the first two entries, and the example of a Devanagari (script used to write Hindi and Sanskrit) "ka" variation:

Quoting from that page:

Q: Does "text element" mean the same as "combining character sequence"?

A: No, this is a common misperception. A text element just means any sequence of characters that are treated as a unit by some process. A combining character sequence is a base character followed by any number of combining characters. It is one type of a text element, but words and sentences are also examples of text elements.

Q: So is a combining character sequence the same as a "character"?

A: That depends. For a programmer, a Unicode code value represents a single character (for exceptions, see below). For an end user, it may not. The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

For example, å (A + COMBINING RING or A-RING) is a grapheme in the Danish writing system, while KA + VIRAMA + TA + VOWEL SIGN U is one in the Devanagari writing system. Graphemes are not necessarily combining character sequences, and combining character sequences are not necessarily graphemes. Moreover, there are a number of other cases where a user would not count "characters" the same way as a programmer would: where there are invisible characters such as the RLM used in BIDI, compatibility composites such as "Dz", "ij", or Roman numerals, and so on.

Donald Gaminitillake said...

Hope you have the unicode locations

Have you seen Table 3 row 1 Latin Extended A and Table 4 row 1 Latin Extended B
(ISO /IEC 10646-(E))

"DU" is not registered in Unicode
That is why you are unable to give the location.

Anuradha said...

"DU" is registered in Unicode. I have already mentioned in the article that it is the sequence 0DAF ("da") followed by 0DD4 ("papilla").

Explanation:

Strictly speaking "DU" is a grapheme rather than a character, and that's why it doesn't need to have a single code point. See this FAQ from the official Unicode site. According to the answer to the question 2 (quoted above, too):

The better word for what end-users think of as characters is grapheme (as defined in the Unicode glossary): a minimally distinctive unit of writing in the context of a particular writing system.

"DU" is a "minimally distinctive unit in writing" to an end user, therefore it's a grapheme.

0DAF ("da") followed by 0DD4 ("papilla") is a "combining character sequence", which generates the grapheme "DU". See the answer to question 1 of the same FAQ:

A combining character sequence is a base character followed by any number of combining characters.

0DAF ("da") is the "base character" here, and 0DD4 ("papilla") is a "combining character".

Anuradha said...

Answer to JC's question

I can't give a complete answer for the Sinhala Unicode enabler kit for Windows, as I use only GNU/Linux on my desktop. However, I have seen people installing it and rendering works after that. As far as I understand it, this kit installs a Unicode Sinhala font, adds Sinhala rendering to Uniscribe and also installs a keyboard driver.

To view Sinhala Unicode, you don't need the keyboard driver.

Prasad Gunaratne said...

Well said Anuradha. I really appreciate what you are doing.

I further read about Unicode on the Unicode.org site. It clearly indicates that there is nothing wrong with the specification for Sinhala.

I found that Unicode FAQ pages answer questions raised by Donald.
Firstly, the 'Where is my character' FAQ explains that not all glyphs are encoded.
http://www.unicode.org/standard/where/
There are various good examples given. The best example on that page is 'ch', which is considered a character in Slovak and traditional Spanish. It is not allocated a code point, and instead uses 0063 and 0068, i.e. the code points for 'c' and 'h'. There are other examples for Indian scripts as well.

Secondly, there is Donald's claim that the current spec will break sorting. I also thought that there was some truth to this, but not any more. See the following page.
http://www.unicode.org/faq/collation.html
I quote:

--start quote

My script does not sort right because the characters were assigned to Unicode code points in the wrong order. What can I do about that?

A: There is a misunderstanding here: Linguistically meaningful sorting is done not by comparing code point values (an approach which would fail even for English), but by assigning multi-level weights to characters or sequences of characters and then comparing those weights on each level. There are many algorithms and implementations for this; the standard Unicode Collation Algorithm (UCA) comes with a default weight table for all assigned characters as well as a tailoring mechanism that describes how this table can be modified to conform to local conventions, where necessary.

--end quote

Donald is clearly on another agenda. Glad that none of the developers fall into his trap.

JC said...

Thank you for your reply, Anuradha.

Where can I get the Unicode enabler kit for Windows that you mention? I have Windows 98 (Reg. Ver. & SE), NT Server, Windows XP and Macintosh.

Your answer would be much appreciated.

JC

Donald Gaminitillake said...

What do you mean by "Donald is clearly on another agenda. Glad that none of the developers fall into his trap."

TRAP

All the developers are in a mess without proper SLSI for Sinhala.

I am the only person exposing the truth.

I "quote"

The Chairman of ICTA is also the former chairman of CINTEC who created the current problem with Sinhala fonts. He is the least likely to do anything to resolve it, because he will then be exposing his earlier bungling.

"unquote"

You all are covering up and messing things up more and more - your path is to destroy the Sinhala Language

I quote from your own text again and again

Quote
"Sinhala Unicode is not a Sri Lankan standard either."
unquote

This is the truth. I want to correct it.

Readers will have the freedom to read and decide.

--end of my postings--

Anuradha said...

I said:

Sinhala Unicode is not a Sri Lankan Standard either.

This is good, and there is absolutely no need to "correct it".

If we have a standard for Sri Lanka only, it is unlikely to be supported by international software. But with Sinhala Unicode being an international standard, all software written anywhere in the world that supports Unicode automatically supports Sinhala.

Being an international standard, anyone living anywhere in the world can communicate in Sinhala Unicode. It has already become a reality thanks to the implementations. We have practically proved it through the Sinhala Unicode Group and elsewhere.

That's why I said SLS 1134 is an intermediate local standard, whereas Unicode is going to be the eventual international standard, although both are identical.

JC said...

Disclosure:
I developed romanized Sinhala. It is mainly for users of Pali and Sanskrit outside of Sri Lanka.

My Congratulations:
After following this debate, I am finally convinced that ligatures are supported by Sinhala Unicode fonts.

Questions:
Since on my computers, the kombuva shows after the consonant, it is presumed that one needs a special driver to show Unicode Sinhala font correctly. Is this right? If so, where does one get the special software?

Another question more applicable to Pali transliterators in the US:
Is it possible for the group that developed Unicode font to port the font to match code positions of romanized Sinhala in Latin-1 code page? (This could help in graceful fallback of Sinhala to Latin-1 when the special font is missing).

For us, it would be a great help for all non-Sinhala speaking transliterators who can cross check their work against Sinhala font simply by switching fonts.

Perhaps providing an alternative Sinhala font mapped to Latin-1 code page has its compelling merits for the Sinhalese too.

The transliteration alphabet is essentially Icelandic added with the letters, ñ, µ, ç and ø.
It is as follows:
a æ i u
e o
á í ú ó (binduva)
k kh g gh ñ
c ch j jh ç
t th d dh µ
þ th ð ðh n
p ph b bh m
y r l v
z x s h
ø (muurdhaja layanna)
ä (visarjaniiyaya with a)
f (upadhmaaniiya)
q (jihvaamuuliya - Sinhala character is shaped like X and sounds like Greek X!)

Prenasalized letters are digraphs in romanized Sinhala. (Too bad. However, these are unimportant for us because Pali/Sans. do not have them. Stated here for completeness):
ñg, µd, nð, mb

JC said...

Well, I have to change my question regarding kombuva etc. showing after the consonant:
I believe that the font I have on my computer is an older version that was replaced by the newer Unicode font. So my question is now: where are these fonts distributed?

I went and read part of the Unicode standard and Microsoft's pages on fonts.

Mr. Gaminitillake may want to read the following pages by Microsoft:

About Open Type
On Ligature making
And the following two technical reports by Unicode:
Unicode: The Unicode character property model
Unicode: How ligatures are made

Also, Unicode: the Unicode database might be a good reference point.

As you might see, Mr. Gaminitillake, the arguments about the Sri Lankan standard are irrelevant. Thank you for leaving a link to the proposed standard on your site. I read it too and it does not pose any problem for font designers. One thing I disagree with in the standard is that the Bureau has unnecessarily introduced confusion about þaaluja sañyuga naasikyaya (I am using romanized Sinhala here -- a bit of my ego. Note that þ is not p but it is the dental voiceless plosive consonant).

We should set aside this fruitless debate. Both you gentlemen have one thing at heart: the progress of the Sinhala people in the era of Information Technology.

My two-cents worth:
I believe that making Sinhala fonts for the Latin-1 Extension Block is a better proposition than making them for the Sinhala Block. Having both is even better.

Pubudu said...

Anuradha,
I understand that the Unicode standard does not represent all the Sinhala characters directly, but that it is possible to represent all of them using combinations. Would you be able to show those combinations (I mean a link), since I am not a language specialist? I could not find any document that explains how characters can be generated with combinations. How can I find the Sinhala glyph table? I opened a Unicode Sinhala font, but it shows the same as what I see in the Unicode standard. Where are those glyphs defined? Is it something hidden from the public?
I ask because my application needs all the possible Unicode combinations in Sinhala. I posted this in many places, but did not get any fair answer.

Pubudu

Anandawardhana said...

Dear Anuradha,
Thanks to Unicode, I am writing my blog in Sinhala today.
By reading what was published on your blog, I was also able to find a lot of information for this. Many thanks for that!

Thusitha said...

My understanding is that what Anuradha says is right. Thanks to Sinhala Unicode, I am writing my blog in Sinhala today.
I learned a great deal from reading what is published here. Best wishes!

කාලිංග said...

Anyone can read, but who is there to do the work? Because of Unicode, even the mobile phone is in Sinhala now.... Unicode is the way to go! Who says it isn't cross-platform compatible?

When the country is moving forward, these devils throw a spanner in the works. How did Sinhala come to Google? Thanks to Unicode!

How do we search in Sinhala?
Thanks to Unicode!

Harshana Weerasinghe said...

The content of the post is great.
I don't know, Mr Donald seems to have got the wrong idea about the progress of technology. I think that if he understood 3rd Normal Form and database management, he would not say things like this.
Likewise, I think he sees Sinhala Unicode as incomplete in order to gain an advantage for himself.

In my view, Sinhala Unicode is perfectly fine as it stands now.
