TDIL: Technology Development for Indian Languages Programme, India

STATUS OF LOCALIZATION : INDIAN PERSPECTIVE

What is Localization?
Localization is the process of adapting an INTERNATIONALIZED application to a specific cultural/linguistic environment, much as a TV programme is developed/transmitted to suit the cultural requirements of a region, without changing the basic technology.

Localization vs Internationalization

Internationalization is the process of producing an application platform that is capable of being localized for any cultural environment. Such an application may have been developed earlier for particular cultural environments and languages.

International Scenario
In countries like Japan, China, Korea, Thailand etc., English is a second language; hence there is significant demand for IT systems in the local language that can be used by common people.

Japan

Almost all applications of IT have the local language support such as:

Billing Businesses: Taxes, fees for public services, Bank transaction statements

Online processing: Railway reservation, ATMs in banks, library catalog searching

Internet: E-mail, web browsers in development

Systems: Mostly PCs with DOS/Windows, UNIX machines and mainframes.

User interface/Local: It is impossible to sell a PC-based product without Japanese language support.

Popular packages: MS Word, Lotus 1-2-3, Oracle etc.

Packages without Japanese support: Specialized software like FTP.

Main issues addressed and the approach for localization:

Standardization of characters and cultural conventions: Constitution of the JIS (Japanese Industrial Standards) Board. Software developers contact the JIS Board before designing a software package, in order to standardize terminology/conventions.

JIS X0201 7-bit and 8-bit character sets: First version came in 1969, second version in 1976 and the latest revision in 1997. Based on the Latin alphabet; it also defines 63 Katakana characters.

JIS X0208 7-bit and 8-bit double-byte Kanji characters: First version in 1978, revised in 1983 and 1990, with the final revision in 1997. Defines the combined character set, Kanji and also non-Kanji. It also defines the shifted coded expression (Shift JIS).

JIS X0212 Supplementary graphic characters: Established in 1990. 5,801 Kanji characters and 245 non-Kanji characters are defined.

JIS X0221 Universal Multiple-Octet Coded Character Set: First version established in 1993, with the final version in 1995. This includes all the characters defined by JIS X0201, X0208 and X0212. It can support 20,902 characters out of about 34,000 characters in total.

Coded character sets differ from product to product and from platform to platform. Code conversion programs/overlays exist for all of them.
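The kind of code conversion mentioned above can be sketched in a few lines. This is only an illustration, assuming Python's built-in codecs (`shift_jis`, `euc_jp`, `iso2022_jp`); the helper name `convert` is hypothetical and is not taken from any of the products discussed here.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from one coded character set to another."""
    # Decode to abstract characters first, then encode in the target code.
    return data.decode(src).encode(dst)


text = "\u65e5\u672c\u8a9e"  # three Kanji meaning "Japanese language"

sjis = text.encode("shift_jis")
euc = convert(sjis, "shift_jis", "euc_jp")
jis = convert(sjis, "shift_jis", "iso2022_jp")

# The same three characters have a different byte representation in each code.
for name, raw in [("Shift JIS", sjis), ("EUC-JP", euc), ("ISO-2022-JP", jis)]:
    print(f"{name:12} {raw.hex(' ')}")
```

Going through an intermediate character representation, rather than mapping bytes to bytes directly, means that each additional code needs only one decoder and one encoder.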

Major US suppliers are now supplying APIs for multilingual information processing. Products that support multiple character sets, such as Netscape, are now available for use with more than one language.

Other Issues such as:

Rendering of mixed scripts, fonts, text layouts etc.; sorting/ordering of characters; NLP: still under consideration.

China

There are two character sets: traditional and simplified. The simplified set is used in urban areas of China. The traditional set is used in Hong Kong and Taiwan.

The first GB character set was established in 1988. This is a 7-bit character coding scheme. GB2312 is a simplified character set that also includes digits and the Latin and Greek alphabets.

A two-byte character set is used in DOS, Windows, and Open Windows.
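To make "two-byte" concrete, the following sketch encodes a short mixed string in GB2312 and in Big5. It assumes Python's built-in `gb2312` and `big5` codecs; Big5 is not named in this document and is used here only as a familiar example of a traditional-character code.

```python
text = "\u4e2d\u6587 ABC"  # two Chinese characters, a space, three Latin letters

gb = text.encode("gb2312")   # simplified-character code
big5 = text.encode("big5")   # a traditional-character code (assumed for comparison)

# Each Chinese character occupies two bytes; ASCII characters occupy one.
print(len(text), len(gb), len(big5))
print(gb.hex(" "))
print(big5.hex(" "))
```

Both encodings use the same number of bytes for this string, but the byte values assigned to the Chinese characters differ, which is why conversion between the two is needed.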

The latest version, in 1995, is ISO/IEC 10646.

For Internet: RFC 1922, "Chinese Character Encoding for Internet Messages".

Status of Localization: Most applications for the common man, such as WP, DB and email, are available.

Thailand

Character display has four levels. Earlier, the character code was designed based on display. The need for separate overlays for NLP was realized later.
Two types of standards exist. Most systems support both codes.
Status of Localization: Office systems and DBMS are available with Thai support.
Authoring tools for CAI: Authorware, ToolBook. Commercial OCRs with 90% accuracy.

Korea

Word processing, spreadsheets and DBMS are localized on DOS, Windows 3.x and Windows 95 platforms. A few MT systems (Japanese-Korean, English-Korean) have been developed for limited vocabulary.

Three codes are popular: Wansung, Johab and Han. There is a separate code for the Web, email etc.

Two types of Keyboards: KS and Hung.

Speech processing is only at lab level. Speech synthesis is still far from human quality.

Mongolia, Vietnam, Brunei, Singapore etc.

These countries realized the need for localization a few years ago. Localized word processors are already available.

Brunei expressed their problem in representing JAWI.

Indian Perspective

What level of localisation do we require?
DIR --- 'SOOCHI'
COPY --- 'NAKAL'
WINDOW ---- 'KHIDIKI'
Not at this moment

What are the expectations?

IT Expert -- Able to develop application system in the language selected by the client.

E.g. A financial statement, an inventory database, employee records of a company etc. can be developed by the programmer in English. The package should allow the programmer to design the I/O screens, data entry forms, menus, queries/outputs etc. in the language of choice.
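One common way to meet this expectation is to keep the program logic and its keys in English and to look up the displayed text in a per-language table. The sketch below is only an illustration: the names `CATALOG` and `label`, the language code "hi", and the two Hindi strings are assumptions chosen for the example, not part of any package described here.

```python
# Hypothetical message catalog: English keys, per-language display strings.
CATALOG = {
    "hi": {
        "Name": "\u0928\u093e\u092e",          # Hindi word for "name"
        "Amount": "\u0930\u093e\u0936\u093f",  # Hindi word for "amount"
    },
}


def label(key: str, lang: str) -> str:
    """Return the display text for an English key, falling back to English."""
    return CATALOG.get(lang, {}).get(key, key)


# The programmer writes the form once, using English keys only.
for field in ["Name", "Amount", "Quantity"]:
    print(f"{field:10} -> {label(field, 'hi')}")
```

A key with no entry in the table (here "Quantity") is shown in English, so a partly translated application remains usable.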

Word-processing: Preparation of letters etc. in local/ regional language.

Applications: Inventory/payroll/financial accounting etc. for entrepreneurs. Is it possible to develop these in one language (English) and let the user access them in another Indian language (IL)? Does transliteration help?

Post offices: Registration/Speed post/M.O.s in local languages.

Word processing and spreadsheets with local language support. Menus may be retained in English but transliterated.

Issues: Language/cultural diversity: 18 languages with 10 scripts

Hindi is the national language, and 40% of the population are Hindi speakers.
Is it worth developing all applications in Hindi in the first phase and then in other languages?
Only about 10-20% of the IT community are Hindi speakers.
People working in language technology/NLP for Indian languages get only second-class treatment in IT.

What is to be done?

Standardization

Standardization of Characters: A perusal of the international scene reveals that many countries devised their codes with display in mind, later realized the problem of CV separation, and revised their character codes.
Three levels are better: Level 1: Basic characters; Level 2: Rendering rules/overlays for display; Level 3: Rendering rules/overlays for speech/NLP.
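The separation between basic characters and display can be illustrated with a single Devanagari syllable. The sketch below assumes Unicode code points as the Level 1 representation and reads "CV" as consonant-vowel; both are assumptions made for the example. The syllable is stored as a consonant followed by a vowel sign, even though the vowel sign is drawn to the left of the consonant when rendered.

```python
import unicodedata

syllable = "\u0915\u093f"  # consonant KA followed by vowel sign I

# Level 1: the stored sequence keeps consonant and vowel as separate characters.
for ch in syllable:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I

# Levels 2 and 3 would be separate rule sets applied on top of this sequence:
# a renderer reorders and shapes glyphs; a speech/NLP layer reads the same
# sequence in its logical order.
```

Because the stored order follows the language rather than the screen layout, sorting, searching and speech processing can all work on the same data.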

Standardization of Cultural conventions: For each region/language we should have a standard way of representing the following:
* Month/date/year/era formats, week days
* Currency representations
* Measurements, units
* Different symbols, numbers etc.
* Names
* Transliteration rules
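A standardized convention table could be consulted by applications at run time. The sketch below is a hypothetical illustration: the locale identifier "en_IN", the table layout, the "Rs." prefix and the function names are all assumptions, not an existing standard. It covers two items from the list above, a day/month/year date format and the Indian digit grouping (last three digits, then groups of two).

```python
from datetime import date

# Hypothetical convention table for one region/language.
CONVENTIONS = {
    "en_IN": {"date": "%d/%m/%Y", "currency": "Rs. {}"},
}


def group_indian(n: int) -> str:
    """Group digits in the Indian style: 1234567 -> 12,34,567."""
    digits = str(abs(n))
    head, tail = digits[:-3], digits[-3:]
    groups = []
    while len(head) > 2:          # peel off groups of two from the right
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    groups.append(tail)
    return ("-" if n < 0 else "") + ",".join(groups)


def format_amount(n: int, locale_id: str) -> str:
    return CONVENTIONS[locale_id]["currency"].format(group_indian(n))


def format_date(d: date, locale_id: str) -> str:
    return d.strftime(CONVENTIONS[locale_id]["date"])


print(format_amount(1234567, "en_IN"))
print(format_date(date(1998, 8, 15), "en_IN"))
```

Keeping the conventions in data rather than in code means a new region can be supported by adding a table entry once its conventions have been standardized.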

National Standardization:

Two levels:
Level 1: Low-level working group. Tasks: Preparation of concatenation rules for transliteration, cultural conventions.
Level 2: High level (policy-making level). Decision making, protocol maintenance with software developers and MNCs.
Identify resource centres/Language Engg. Centres
* Needed for alpha, beta testing.
* R&D/ Productionization, Market survey
* Customer support etc.
* Frequently interacting with customers and giving feedback to the Standards Committee.
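As an illustration of the kind of transliteration rules a working group would need to standardize, the sketch below converts a few Devanagari characters to Roman letters. The mapping tables and Roman spellings are assumptions made for the example, and only a handful of characters are covered. The one concatenation rule shown is that a consonant carries an inherent "a" unless a vowel sign or a virama follows it.

```python
# Hypothetical, deliberately tiny mapping tables.
CONSONANTS = {
    "\u0915": "k",   # KA
    "\u0916": "kh",  # KHA
    "\u091a": "ch",  # CA
    "\u0928": "n",   # NA
    "\u0932": "l",   # LA
    "\u0938": "s",   # SA
}
VOWEL_SIGNS = {
    "\u093e": "aa",  # VOWEL SIGN AA
    "\u093f": "i",   # VOWEL SIGN I
    "\u0940": "ii",  # VOWEL SIGN II
    "\u0941": "u",   # VOWEL SIGN U
    "\u0942": "uu",  # VOWEL SIGN UU
}
VIRAMA = "\u094d"    # suppresses the inherent vowel


def transliterate(text: str) -> str:
    out = []
    pending_a = False  # True while the last consonant still owes its inherent "a"
    for ch in text:
        if ch in CONSONANTS:
            if pending_a:
                out.append("a")
            out.append(CONSONANTS[ch])
            pending_a = True
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])   # explicit vowel replaces the inherent "a"
            pending_a = False
        elif ch == VIRAMA:
            pending_a = False
        else:
            if pending_a:
                out.append("a")
            out.append(ch)                # pass unknown characters through
            pending_a = False
    if pending_a:
        out.append("a")
    return "".join(out)


print(transliterate("\u0928\u0915\u0932"))        # the word written above as 'NAKAL'
print(transliterate("\u0938\u0942\u091a\u0940"))  # the word written above as 'SOOCHI'
```

The output of this sketch keeps the final inherent vowel and spells long vowels as doubled letters, so it differs from the informal spellings 'NAKAL' and 'SOOCHI'. Choices of this kind are exactly what a transliteration standard would have to fix.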