ASTERISK-14329: [patch] TW is not an ISO Language Code

Summary: ASTERISK-14329: [patch] TW is not an ISO Language Code

Reporter: Vincent Olivier (volivier)
Labels:
Date Opened: 2009-06-17 16:49:34
Date Closed: 2009-06-30 13:44:37
Priority: Minor
Regression? No
Status: Closed/Complete
Components: Core/Internationalization
Versions:
Frequency of Occurrence:
Related Issues:
Environment:
Attachments:
( 0) 20090617__issue15346__1.4.diff.txt
( 1) 20090617__issue15346__1.6.0.diff.txt
( 2) 20090617__issue15346__1.6.1.diff.txt
( 3) 20090617__issue15346__1.6.2.diff.txt
( 4) 20090617__issue15346__trunk.diff.txt

Description: Taiwanese is actually Standard Mandarin Chinese, which is categorized as "zh-tw" or simply "zh" and NEVER "tw". It should simply be "zh" in Asterisk, because Asterisk is not concerned with the actual script of a locale (the way it is written).

Setting Taiwanese as "tw" is a problem because it fragments the efforts to localize for Standard Mandarin Chinese (which is coded as "zh"), and it is a bad internationalization practice altogether.

Comments: By: Vincent Olivier (volivier) 2009-06-17 16:55:39

Actually, I'm sorry for the confusion. TW IS an ISO 639-1 code, but for a totally unrelated language, the Twi dialect of the Akan language : http://en.wikipedia.org/wiki/Twi

So it's actually very very very wrong to use it.

It is related to the following issues :

https://issues.asterisk.org/view.php?id=6135
https://issues.asterisk.org/view.php?id=12319
https://issues.asterisk.org/view.php?id=1615
https://issues.asterisk.org/view.php?id=7827
https://issues.asterisk.org/view.php?id=9964
https://issues.asterisk.org/view.php?id=9963
By: Tilghman Lesher (tilghman) 2009-06-17 18:36:43

You are incorrect, as noted on the ISO site here:
http://www.iso.org/iso/english_country_names_and_code_elements
By: Vincent Olivier (volivier) 2009-06-17 18:44:22

The list you provided is for countries, not languages. ISO 3166 is a standard for country codes.

ISO 639, a completely unrelated standard, is for language codes. See here : http://www.loc.gov/standards/iso639-2/php/code_list.php

And as you can see, the language code "tw" is for Twi, not Taiwanese.
By: Vincent Olivier (volivier) 2009-06-17 18:45:46

And Taiwanese has no language code, because Taiwanese, in its spoken form, is Standard Mandarin Chinese, for which the ISO code is "zh".
By: Tilghman Lesher (tilghman) 2009-06-17 19:06:32

As the earliest version with this support is 1.4, I am uploading a patch for this version.
By: Vincent Olivier (volivier) 2009-06-17 19:23:20

Good! My company is also looking to contribute a full set of Chinese sound recordings for Asterisk. Who is the best person to talk to about that? (I understand that some synchronization with the people responsible for the actual code is required as well.)

Thanks!
By: Tilghman Lesher (tilghman) 2009-06-17 19:44:59

In terms of contributing sounds, I'll have to get back to you on that. It's not currently straightforward; however, the standard method of contributing support is to create a spreadsheet, similar to the existing files in doc/lang/ (Hebrew and Urdu), containing module name, file name, and the native script of the text for the indicated language.

There's a specific contribution license for sounds, which differs from the standard code contribution, which is why I don't currently have a straightforward answer.
By: Tilghman Lesher (tilghman) 2009-06-17 19:47:40

Technically, I do know that for our sound library, raw recordings are made as uncompressed 48000 Hz audio, to preserve the greatest fidelity when the audio is later compressed into the various formats. However, there is still the licensing question that needs to be resolved. I will respond later with a more specific answer about licensing.
By: Vincent Olivier (volivier) 2009-06-17 20:14:35

Cool, thanks!
By: David Woolley (davidw) 2009-06-19 06:18:03

Spoken Taiwanese is a language called Hokkien ?????. It is not Mandarin ?????, as spoken in northern China. This is more than a dialect issue. Because of its history, people in Taiwan would normally also speak a dialect of Mandarin. An example of the difference is that Hokkien in Mandarin is pronounced Fujian Hua.

As far as I can tell, the ISO codes are only meaningful for written languages, and conventionally zh_TW means written Chinese using traditional style characters, although there are some dialect implications, as well.

?????'s are an attempt to enter the proper simplified characters for the names.

By: Vincent Olivier (volivier) 2009-06-19 07:26:22

But is TW in Asterisk actually Hokkien? Also, if the syntax of Hokkien matches that of other Chinese dialects (because Hokkien is still ethnographically considered Chinese), then all the work on Taiwanese should benefit ALL the Chinese variants that share the same syntax, and therefore should be coded zh as well.

Also, ISO is not only for written languages and zh_TW means more than mandarin written with traditional characters. For instance, I live in Quebec, Canada, and the language here is coded as fr_CA, sometimes referred to as Canadian French, but it has more implications than just the glyphs that are used, the spoken language is also different, in its pronunciation and vocabulary.
By: David Woolley (davidw) 2009-06-19 08:07:45

My understanding is that the syntax of written Chinese is relatively consistent, but still has variations. Spoken forms tend to vary a lot more.

In practice, ISO don't code spoken Chinese languages. I doubt that Cantonese speakers from Guangdong would be happy with the idea that they should use zh_HK, especially when they would have to use zh_CN when writing, and I didn't see any codes for Shanghaiese or Hokkien.

The big complication with Chinese is that, because it is ideographic, there is a disconnect between written and spoken languages. zh_TW doesn't mean Mandarin with traditional characters; it means written Chinese with traditional characters, as used in Taiwan. In practice I'd expect people using Chinese on computers in Malaysia to use zh_TW, even though they actually speak one of several languages (Cantonese, Hokkien or Mandarin, and maybe others).

On computers, zh_CN tends to imply a phonetic input method, although even then not necessarily, whilst zh_TW input is independent of pronunciation, as the input methods encode the character structure.

One of the catches for a PABX is that spoken Mandarin tends to use a special pronunciation for the number one, in phone numbers, but not in numeric numbers. That's the sort of thing that may vary regionally.

Some direct local knowledge from several countries would be useful here.

By: David Woolley (davidw) 2009-06-19 13:30:21

It looks as though one has to go to ISO 639-3's 3-letter codes before one can distinguish between different spoken forms of the language that is written as Chinese, or distinguish between the written and spoken styles. <http://www.sil.org/iso639-3/documentation.asp?id=zho>. The original Taiwanese would be "nan" and Mandarin would be "cmn". Formal written Chinese would be "lcn". Cantonese is "yue".
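
The relationship described above, where individual 3-letter codes roll up to the single 2-letter code "zh", could be sketched as a small lookup table. This is only an illustration: `struct lang_map`, `zh_members` and `macro_for()` are hypothetical names, not part of Asterisk, and the table lists only the three codes mentioned in this comment.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mapping from an ISO 639-3 individual language code to the
 * ISO 639-1 code of its macrolanguage. Not an Asterisk structure. */
struct lang_map {
	const char *individual; /* 3-letter ISO 639-3 code */
	const char *macro2;     /* 2-letter ISO 639-1 code of the macrolanguage */
};

/* Only the codes named in this thread; the real zho macrolanguage has more members. */
static const struct lang_map zh_members[] = {
	{ "cmn", "zh" }, /* Mandarin */
	{ "nan", "zh" }, /* Min Nan (includes Taiwanese Hokkien) */
	{ "yue", "zh" }, /* Cantonese */
};

/* Returns the 2-letter macrolanguage code, or NULL if the code is not in the table. */
const char *macro_for(const char *code)
{
	size_t i;

	for (i = 0; i < sizeof(zh_members) / sizeof(zh_members[0]); i++) {
		if (strcmp(code, zh_members[i].individual) == 0)
			return zh_members[i].macro2;
	}
	return NULL;
}
```

With such a table, `macro_for("nan")` yields "zh", while an unrelated code such as "tw" is simply not found.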
By: Tilghman Lesher (tilghman) 2009-06-19 13:47:41

I'm fine with doing whatever you find to be compliant with ISO.
By: Vincent Olivier (volivier) 2009-06-19 13:58:56

I think it's a VERY bad idea to go with the 3-letter codes, for the simple reason that the syntactic localization effort made for one language variant will not apply to another, even if they are syntactically equivalent as far as Asterisk is concerned. It is possible to achieve differentiation with the 2-letter ISO language codes. This can be done with the following notation (as used in Java and other fundamentally internationalizable systems): %LANGUAGE%_%VARIANT%, where %LANGUAGE% is a 2-letter ISO language code, and %VARIANT% can be anything, including an ISO country+region code, with even an additional prefix for very precise variations. This allows the syntax structure code to be shared (for all "zh" languages, the C code remains the same), while giving you the possibility of having different recording sets for each variation ("zh_CN", "zh_TW").

See the Java doc for more information : http://java.sun.com/j2se/1.4.2/docs/api/java/util/Locale.html
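
A minimal sketch of how the %LANGUAGE%_%VARIANT% notation could be split in C, so that every "zh_*" tag selects the same syntax code while the full tag still selects the recording set. The helpers `lang_base_len()`, `lang_matches()` and `lang_variant()` are hypothetical and are not taken from the Asterisk source.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h> /* strncasecmp (POSIX) */

/* Length of the base-language part of a tag: 2 for "zh_TW", 2 for "zh". */
size_t lang_base_len(const char *tag)
{
	return strcspn(tag, "_");
}

/* Non-zero if tag ("zh", "zh_TW", "zh_CN", ...) belongs to language base.
 * The comparison is case-insensitive and ignores the variant. */
int lang_matches(const char *tag, const char *base)
{
	size_t n = lang_base_len(tag);

	return n == strlen(base) && strncasecmp(tag, base, n) == 0;
}

/* Variant part of the tag ("TW" for "zh_TW"), or "" if there is none. */
const char *lang_variant(const char *tag)
{
	const char *sep = strchr(tag, '_');

	return sep ? sep + 1 : "";
}
```

Under these assumptions, `lang_matches("zh_TW", "zh")` and `lang_matches("zh_CN", "zh")` are both true, so one branch of syntax code could serve both, and `lang_variant()` would distinguish the sound directories.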
By: Vincent Olivier (volivier) 2009-06-19 14:13:45

Or, even better yet, but this requires difficult implementation, is to use the macro/individual language mappings. : http://www.sil.org/iso639-3/download.asp

But, do you want to go there?
By: Vincent Olivier (volivier) 2009-06-19 17:35:52

I'm volunteering for helping on this if we go for the most elegant solution possible. But I would need some explaining on the current L10N/I18N architecture from someone before that.
By: David Woolley (davidw) 2009-06-23 13:37:45

You couldn't use zh_CN or zh_TW as they already imply something other than specific variants of Chinese. Note that Chinese variants are mutually incomprehensible and do have grammatical differences, although we really need someone with first hand knowledge. I only have second hand knowledge and I think the same is true for volivier.

I have a feeling that you might have to end up with the three-letter codes, e.g. zh_CMN, zh_NAN, or zh_YUE, if you insist on keeping the zh_ format. Someone once told me that language boundaries don't follow administrative boundaries in the Chinese mainland, so you cannot use those to qualify the language.

Incidentally, the specific special rule that you need for Mandarin is to use yao instead of yi when spelling out telephone numbers. I don't know if that is true of other Chinese variants, although one might provide the structure to allow it and then treat others as degenerate cases.

I think the reasoning is that yi is basically a pure vowel. If you use it in numeric numbers, that isn't a problem, as the multipliers delimit the digits, e.g. yi bai yi shi yi for 111. But if you spell out the number, you would end up with yi yi yi, which could easily sound like eeeeeeee.

PS I think yi shi may be a special case, where the yi is dropped.
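
The yi/yao rule described above could be sketched as follows. This is an illustration only: `zh_digit_sound()` is a hypothetical helper, and the sound file name "digits/yao" is an assumption (the thread does not say what such a recording would be called). Whether the rule applies to Chinese variants other than Mandarin is left open here.

```c
#include <stdio.h>

/* Hypothetical helper, not Asterisk code: picks the sound file for one digit.
 * spelled != 0 means the digit is being read out as part of a digit string
 * (e.g. a telephone number), where "yao" replaces "yi" for the digit 1.
 * Returns 0 on success, -1 if digit is not '0'..'9'. */
int zh_digit_sound(char digit, int spelled, char *fn, size_t len)
{
	if (digit < '0' || digit > '9')
		return -1;

	if (spelled && digit == '1')
		snprintf(fn, len, "digits/yao"); /* assumed file name */
	else
		snprintf(fn, len, "digits/%d", digit - '0');

	return 0;
}
```

Reading "111" digit by digit would then select "digits/yao" three times, while saying the number 111 would go through the normal "digits/1" path.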

By: Vincent Olivier (volivier) 2009-06-28 21:23:01

Actually, even if I can't have a philosophical conversation I do have first hand experience with Chinese and, in particular, I have one year of mandarin grammar classes. And I do know, for instance that the yi/yao pronunciations for one (1) are both also applicable in mandarin, that's why the biggest social network in China is called 51.com, because 5-1 is theoretically pronounced "wu-yi", but you can also pronounce it "wo-yao" which is homophonic to "I want", "wo yao". And what I know for CERTAIN is that ALL the main Chinese dialects can share the Asterisk C logic syntax. FOR SURE. Again, I'm not talking about the recordings, here....

Furthermore, the "zh_CMN, zh_NAN, or zh_YUE" format breaks the ISO logic. It's better to go with the macro/individual language mappings, even if it's harder. Heck, it's better, even, to go with the Java notation, which, as far as I know, works VERY well in the PRC, Taiwan, Hong Kong, Singapore, Guangdong, etc.

I think we are going in circles here. I think the safest and best decision is to go with the Java notation. There is also the ICU package which could help. There is a java and a C/C++ version : http://demo.icu-project.org/icu-bin/locexp

But I can have our Chinese partners look at this, if you want....

By: Tilghman Lesher (tilghman) 2009-06-29 10:42:04

Whether you use 3 letters or 2 letters is immaterial to the Asterisk project, as long as a) it's consistent, and b) you use an underscore to delineate the regional differences from the main language. However, using 2 letters would probably be better, as other languages are implemented this way (e.g. pt_BR).

If you're fine with the patches already uploaded, please say so, and we'll get them committed.
By: David Woolley (davidw) 2009-06-29 11:50:51

It's wrong or incomplete if 1.6.1.1 is the same in this respect as 1.6.1.0, as the code also handles another bogus language code, with slightly different handling for zero (this may be historic and unreachable, as I can't see a difference between the if and else branches):

	}
	if (strcasecmp(language, "twz") == 0)
		snprintf(fn, sizeof(fn), "digits/%d", num);
	else
		snprintf(fn, sizeof(fn), "digits/%d", num);

Also, I don't think it should be saying Taiwanese in the comments, as it is talking about Chinese as a language, not about administrative units.

The only clue as to the actual language, for which the code was originally written, is the use of "wan", which is consistent with Mandarin, but might be more universal.

ast_say_digit_str_full needs changing to use yao. As I say it really needs native speakers to determine whether this special case applies to zh_*. This is probably a different issue, but potentially affects whether one can use a single set of syntax rules for zh_*.

One big problem may be tone sandhi. It's possible that say_digits mode can treat each character in isolation, but say number really ought to apply the rules for combining tones (which are, unfortunately, rather more complex than those given in the average text book for foreigners). The tone systems are definitely not consistent across zh_*.

For the record, my Chinese level was HSK Basic Level 2 about 3 years ago, which is one level below the minimum required to take undergraduate science courses, and well below the level needed to be certain about language issues. I believe the grading scheme has changed since.
By: Tilghman Lesher (tilghman) 2009-06-29 12:21:24

This issue was opened to clarify an administrative issue of "tw" being misused as a language code. The code in question solves this issue. Any other contribution for questions which are not yet agreed upon can be resolved with another issue. I think the code contribution solves the problem that this issue raises. You're welcome to continue this discussion off the issue tracker, but this issue shall not be extended to cover all issues with the Chinese language in Asterisk, unless you have candidate code within 5 business days.

I have to put my foot down and ensure that issues are not inappropriately extended for the long term with no end in sight.
By: Tilghman Lesher (tilghman) 2009-06-29 12:22:52

I can certainly remove the "twz" exception, if that's the only thing that raises your hackles.
By: David Woolley (davidw) 2009-06-29 12:51:19

Definitely remove the "twz", which is, I think, dead anyway.

Some of the text, including text aimed at users, is wrong, e.g. zh is not equivalent to Mandarin, and is not intended to be, and the other comments are also misleading for similar reasons, but maybe, if we get the code on the right track, I can re-raise those when I have time, as it is really something I ought to do outside working hours.
By: Vincent Olivier (volivier) 2009-06-29 13:58:20

I agree that our discussion goes beyond the simple issue here. And I agree with the "tw" to "zh" patches as they were submitted. Can you tell us where it would be more appropriate to discuss broader G11N issues? Thanks!
By: Tilghman Lesher (tilghman) 2009-06-29 14:47:36

The asterisk-dev list should be used to discuss such issues. When you finally have candidate code, it can be posted on the issue tracker, under a new issue.
By: Digium Subversion (svnbot) 2009-06-30 13:23:36

Repository: asterisk
Revision: 204469

U branches/1.4/UPGRADE.txt
U branches/1.4/main/say.c

------------------------------------------------------------------------
r204469 | tilghman | 2009-06-30 13:23:35 -0500 (Tue, 30 Jun 2009) | 11 lines

"tw" is the language specification for Twi (from Ghana) not Taiwanese.
(closes issue ASTERISK-14329)
Reported by: volivier
Patches:
20090617__issue15346__1.4.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__trunk.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.0.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.1.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.2.diff.txt uploaded by tilghman (license 14)
Tested by: volivier

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=204469
By: Digium Subversion (svnbot) 2009-06-30 13:36:24

Repository: asterisk
Revision: 204470

_U trunk/
U trunk/UPGRADE.txt
U trunk/apps/app_voicemail.c
U trunk/main/say.c

------------------------------------------------------------------------
r204470 | tilghman | 2009-06-30 13:36:24 -0500 (Tue, 30 Jun 2009) | 18 lines

Recorded merge of revisions 204469 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r204469 | tilghman | 2009-06-30 13:23:35 -0500 (Tue, 30 Jun 2009) | 11 lines

"tw" is the language specification for Twi (from Ghana) not Taiwanese.
(closes issue ASTERISK-14329)
Reported by: volivier
Patches:
20090617__issue15346__1.4.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__trunk.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.0.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.1.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.2.diff.txt uploaded by tilghman (license 14)
Tested by: volivier
........

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=204470
By: Digium Subversion (svnbot) 2009-06-30 13:44:16

Repository: asterisk
Revision: 204471

_U branches/1.6.0/
U branches/1.6.0/UPGRADE.txt
U branches/1.6.0/apps/app_voicemail.c
U branches/1.6.0/main/say.c

------------------------------------------------------------------------
r204471 | tilghman | 2009-06-30 13:44:16 -0500 (Tue, 30 Jun 2009) | 25 lines

Recorded merge of revisions 204470 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r204470 | tilghman | 2009-06-30 13:36:24 -0500 (Tue, 30 Jun 2009) | 18 lines

Recorded merge of revisions 204469 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r204469 | tilghman | 2009-06-30 13:23:35 -0500 (Tue, 30 Jun 2009) | 11 lines

"tw" is the language specification for Twi (from Ghana) not Taiwanese.
(closes issue ASTERISK-14329)
Reported by: volivier
Patches:
20090617__issue15346__1.4.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__trunk.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.0.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.1.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.2.diff.txt uploaded by tilghman (license 14)
Tested by: volivier
........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=204471
By: Digium Subversion (svnbot) 2009-06-30 13:44:26

Repository: asterisk
Revision: 204472

_U branches/1.6.1/
U branches/1.6.1/UPGRADE.txt
U branches/1.6.1/apps/app_voicemail.c
U branches/1.6.1/main/say.c

------------------------------------------------------------------------
r204472 | tilghman | 2009-06-30 13:44:26 -0500 (Tue, 30 Jun 2009) | 25 lines

Recorded merge of revisions 204470 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r204470 | tilghman | 2009-06-30 13:36:24 -0500 (Tue, 30 Jun 2009) | 18 lines

Recorded merge of revisions 204469 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r204469 | tilghman | 2009-06-30 13:23:35 -0500 (Tue, 30 Jun 2009) | 11 lines

"tw" is the language specification for Twi (from Ghana) not Taiwanese.
(closes issue ASTERISK-14329)
Reported by: volivier
Patches:
20090617__issue15346__1.4.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__trunk.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.0.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.1.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.2.diff.txt uploaded by tilghman (license 14)
Tested by: volivier
........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=204472
By: Digium Subversion (svnbot) 2009-06-30 13:44:36

Repository: asterisk
Revision: 204473

_U branches/1.6.2/
U branches/1.6.2/UPGRADE.txt
U branches/1.6.2/apps/app_voicemail.c
U branches/1.6.2/main/say.c

------------------------------------------------------------------------
r204473 | tilghman | 2009-06-30 13:44:36 -0500 (Tue, 30 Jun 2009) | 25 lines

Recorded merge of revisions 204470 via svnmerge from
https://origsvn.digium.com/svn/asterisk/trunk

................
r204470 | tilghman | 2009-06-30 13:36:24 -0500 (Tue, 30 Jun 2009) | 18 lines

Recorded merge of revisions 204469 via svnmerge from
https://origsvn.digium.com/svn/asterisk/branches/1.4

........
r204469 | tilghman | 2009-06-30 13:23:35 -0500 (Tue, 30 Jun 2009) | 11 lines

"tw" is the language specification for Twi (from Ghana) not Taiwanese.
(closes issue ASTERISK-14329)
Reported by: volivier
Patches:
20090617__issue15346__1.4.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__trunk.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.0.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.1.diff.txt uploaded by tilghman (license 14)
20090617__issue15346__1.6.2.diff.txt uploaded by tilghman (license 14)
Tested by: volivier
........
................

------------------------------------------------------------------------

http://svn.digium.com/view/asterisk?view=rev&revision=204473