Indian Language Technology Proliferation and Deployment Centre

Frequently Asked Questions

Language Technology Industry
Localization
Standards
Tools and Technologies
Research Areas

Language Technology Industry

What is Language Technology?
Language technology is the study of computer systems that understand and/or synthesize spoken and written human language. The area includes speech processing (recognition, understanding, and synthesis), information extraction, handwriting recognition, machine translation, text summarization, and language generation.

What is Computational Linguistics?
Computational linguistics (CL) is a discipline between linguistics and computer science concerned with the computational aspects of the human language faculty. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science aiming at computational models of human cognition.
Natural language interfaces enable the user to communicate with the computer in German, English or another human language. Some applications of such interfaces are database queries, information retrieval from texts and so-called expert systems.
Computational linguists have created software systems which can simplify the work of human translators and clearly improve their productivity. Even though the successful simulation of human language competence is not to be expected in the near future, computational linguists have numerous immediate research goals involving the design, realization and maintenance of systems which facilitate everyday work, such as grammar checkers for word processing programs.
Computational linguists develop formal models simulating aspects of the human language faculty and implement them as computer programmes. These programmes constitute the basis for the evaluation and further development of the theories. In addition to linguistic theories, findings from cognitive psychology play a major role in simulating linguistic competence.

What is the best way to deal with encoding issues in forms that may use multiple languages and scripts?
The best way to deal with encoding issues in (X)HTML forms is to serve all your pages in UTF-8. UTF-8 can represent the characters of the widest range of languages. Browsers send back form data in the same encoding as the page containing the form, so the user can fill in data in whatever language and script they need to.
It is important to tell the browser that the form page is in UTF-8, and there are various ways to do so. This matters even if the page itself contains no characters outside US-ASCII, because your users may still type in other characters.
It may be a good idea for the script that receives the form data to check that the data returned indeed uses UTF-8 (in case something went wrong, e.g. the user changed the encoding). Checking is possible because UTF-8 has a very specific byte-pattern not seen in any other encoding. If non-UTF-8 data is received, an error message should be sent back.
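As an illustration, a server-side script can verify that submitted bytes really are UTF-8, because invalid byte sequences fail to decode. A minimal Python sketch (the function name is hypothetical):

```python
def validate_form_field(raw: bytes) -> str:
    """Decode submitted form bytes, rejecting anything that is not UTF-8.

    UTF-8 has a very specific byte pattern, so a strict decode is a
    reliable validity check.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        raise ValueError("Form data is not valid UTF-8; please resubmit.")

print(validate_form_field("नमस्ते".encode("utf-8")))  # नमस्ते
```

A stream of legacy 8-bit bytes (e.g. an ISCII-style upper-half byte such as 0xA4) is rejected, which is exactly the error case the paragraph above describes.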


Localization

What is software localization?
Software localization is the process of adapting a software product to the linguistic, cultural and technical requirements of a target market. This process is labour-intensive and often requires a significant amount of time from the development teams.

What is Internationalization?
Definitions of internationalization vary. This is a high-level working definition for use with W3C Internationalization Activity material. Some people use other terms, such as 'globalization' to refer to the same concept.
Internationalization is the design and development of a product, application or document content that enables easy localization for target audiences that vary in culture, region, or language. Internationalization is often written "i18n", where 18 is the number of letters between 'i' and 'n' in the English word.
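The numeronym convention can be computed mechanically: first letter, count of interior letters, last letter. A small Python illustration:

```python
def numeronym(word: str) -> str:
    # First letter + number of letters in between + last letter.
    return word[0] + str(len(word) - 2) + word[-1]

print(numeronym("internationalization"))  # i18n
print(numeronym("localization"))          # l10n
```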

Why localize any product?
Localization, or L10N, is the process of adapting a product or content to a specific locale. Translation is one of several services that form the localization process. In addition to translation, localization may include adapting graphics to the target market, modifying content layout to fit the translated text, converting to local currencies, using proper formats for dates, addresses, and phone numbers, addressing local regulations, and more. The goal is to give the product the look and feel of having been created for the target market, eliminating or minimizing local sensitivities.

What are the guidelines to create XML document types that will be easier to localize?
The W3C ITS Working Group is working to produce such guidelines. Some of the main aspects of the guidelines are the following:
• Avoid using attributes for translatable data.
• Provide a way to specify the language of your elements, and use xml:lang for this.
• Provide specific elements to delimit content that is coming from an external source (e.g. error messages or prompts from a resource file).
• Provide a mechanism of IDs for translatable elements.
• When naming your elements, think about their purpose, not how you imagine the rendering of their content. For example, if an element is used to emphasize a text run, call it <emph>, not <bold>.
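A hypothetical resource-file fragment following these guidelines, built with Python's xml.etree.ElementTree (the element names and id scheme are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical message catalogue: translatable text sits in element
# content (not attributes), each message carries xml:lang and a unique id.
def make_message(msg_id: str, lang: str, text: str) -> ET.Element:
    el = ET.Element("msg", {"id": msg_id, "xml:lang": lang})
    el.text = text
    return el

root = ET.Element("messages")
root.append(make_message("err.disk", "en", "Disk full"))
xml = ET.tostring(root, encoding="unicode")
print(xml)
```

The resulting markup keeps translatable strings addressable by id, which lets translation tools track each message across revisions.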

Is there a standard set of localization directives?
Yes and no. There is a standard called the Internationalization Tag Set (ITS) that is a W3C Recommendation. While ITS is not exactly a standard for localization directives, some of its features can help you with this. ITS can be used as a namespace in any XML document.

Do you have any information on the best practices for in-context review of content?
In-context review of localized content is a vital step in any localization process. In general localizers should be provided with as much context as is reasonably possible and reviewers should be able to see all content in its final context. If it is not possible to provide all context, you should provide a description of anything not available so that the localizer and reviewer can better understand their job.

What is Localization Testing?
After an application has been localized, it must be tested before market release. While some may worry that testing increases time-to-market, it should be noted that the cost of correcting a problem increases dramatically over time. There is a slight but significant difference between localization and linguistic testing. Here are simple definitions:
• Localization testing focuses on the correct functionality, appearance and completeness of the localized product.
• Linguistic testing ensures that the correct language rules are being used and focuses on correct in-context linguistic usage.

Testing has often been considered only for software that is localized. But, in fact, all localized content should be tested to make sure it is correct. Whether the localized content is a version of software for Asian audiences, appears on the side of a box containing the company's product, or appears in an online ad, it represents the company and should be considered as important as the original content.
Localization testing focuses primarily on the user interface, but it also reaches farther: the localization process can introduce severe functionality problems into the software. Those problems can be caused either by over-translation of system variables that are invisible to the target user and must not be translated, or by modified functionality that sometimes must be implemented in the product to meet local market expectations. Letter wizards and spell checkers are typical examples.
Localization testing requires both source and target language versions of the product, installed in the environment a typical user would use. Attention must therefore be paid to the correct version of the operating system, language, regional settings and more. The builds used for this testing must also match in terms of functionality: localization starts at an early stage of product development, when not all features are yet implemented, and mismatched localized and English builds cannot provide consistent functionality testing.

How does software localization differ from traditional document translation?
Software localization is the translation and adaptation of a software or web product, including the software itself and all related product documentation. Traditional translation is typically an activity performed after the source document has been finalized. Software localization projects, on the other hand, often run in parallel with the development of the source product to enable simultaneous shipment of all language versions. Translation is only one of the activities in a localization project; other tasks include project management, software engineering, testing and desktop publishing.

What is the standard software localization process?
A software product that has been localized properly has the look and feel of a product originally written and designed for the target market. Here are just a number of points that have to be considered, in addition to the language, in order to effectively localize a software product or website: measuring units, number formats, address formats, time and date formats (long and short), paper sizes, fonts, default font selection, case differences, character sets, sorting, word separation and hyphenation, local regulations, copyright issues, data protection, payment methods, currency conversion, and taxes.
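Even a single checklist item, such as date formats, varies per locale. The format strings below are illustrative only; a real product would use a locale library (such as ICU) rather than hand-written patterns:

```python
from datetime import date

# Illustrative locale-to-format table (not an exhaustive locale database).
DATE_FORMATS = {
    "en_US": "%m/%d/%Y",   # 04/22/2022
    "en_GB": "%d/%m/%Y",   # 22/04/2022
    "de_DE": "%d.%m.%Y",   # 22.04.2022
}

def format_date(d: date, locale: str) -> str:
    return d.strftime(DATE_FORMATS[locale])

print(format_date(date(2022, 4, 22), "en_GB"))  # 22/04/2022
```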


Standards

What is ISCII?
The Bureau of Indian Standards (BIS) created ISCII (Indian Script Code for Information Interchange) for use in all computer and communication media, allowing the use of 7- or 8-bit characters. In an 8-bit environment, the lower 128 characters are the same as those defined in IS 10315:1982 (ISO 646 IRV), the 7-bit coded character set for information interchange also known as the ASCII character set. The top 128 characters cater to all the Indian scripts based on the ancient Brahmi script. In a 7-bit environment, the control code SI can be used to invoke the ISCII code set and the control code SO to reselect the ASCII code set.
There are 22 officially recognized languages in India. Apart from the Perso-Arabic scripts, the other 10 scripts used for Indian languages have evolved from the ancient Brahmi script and share a common phonetic structure, which makes a common character set possible. An attribute mechanism is provided for selecting different Indian script fonts and display attributes, and an extension mechanism allows additional characters to be used along with the ISCII code. The ISCII code table is a superset of all the characters required in the Brahmi-based Indian scripts; for convenience, the alphabet of the official script, Devanagari, is used in the standard. IS 13194:1991, issued by the Bureau of Indian Standards, is the latest Indian standard for information interchange and is widely used for the development of IT products in Indian languages.
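Because ISCII is an 8-bit superset of ASCII, the two halves of a byte stream can be told apart by the high bit alone. A Python sketch (an actual ISCII-to-Unicode conversion table is omitted):

```python
def classify_iscii(data: bytes):
    """Split an 8-bit ISCII stream into its ASCII and Indic halves.

    Bytes 0x00-0x7F are the IS 10315:1982 (ASCII) lower half; bytes with
    the high bit set belong to the Brahmi-based script characters in the
    upper half. Converting the upper half to Unicode would need a lookup
    table not shown here.
    """
    return [("ascii", chr(b)) if b < 0x80 else ("iscii", b) for b in data]

print(classify_iscii(b"Hi\xa4"))  # [('ascii', 'H'), ('ascii', 'i'), ('iscii', 164)]
```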

What is ACII Script Code?
Alphabetic Code for Information Interchange (pronounced "Ae-Kee"). This is an 8-bit code containing the ASCII character set in the bottom half; the top half contains the ACII characters. PC-ACII script code is the version of ACII script code in which the upper-half characters are split, for compatibility with the IBM PC. This splitting is necessary to keep intact the line-drawing characters located in the middle of the upper half of the character set.

What is Unicode?
The Unicode Standard is a 16-bit character encoding standard used internationally by industry for the development of multilingual software. It is the universal character encoding standard for the representation of text for computer processing, and provides the capacity to encode all of the characters used for the written languages of the world. The standard assigns each character a unique numeric value and name, and provides information about characters and their use. Together with the ISO/IEC 10646 standard, it provides an extension mechanism called UTF-16 that allows for encoding as many as a million additional characters.
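The UTF-16 extension mechanism encodes code points beyond U+FFFF as a pair of 16-bit units (a surrogate pair). This can be seen directly in Python:

```python
# U+10400 (DESERET CAPITAL LETTER LONG I) lies beyond the 16-bit range,
# so UTF-16 represents it as the surrogate pair 0xD801 0xDC00.
ch = chr(0x10400)
units = ch.encode("utf-16-be")
print(units.hex())      # d801dc00
print(len(units) // 2)  # 2 -- two 16-bit units for one character
```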

What is the Unicode policy for character encoding stability?
The Unicode Consortium has laid down an encoding stability policy under which no character deletion or change of character name is possible; only annotation updates are allowed. In particular:
1. Once a character is encoded, it will not be moved or removed.
2. Once a character is encoded, its character name will not be changed.
3. Once a character is encoded, its canonical combining class and decomposition (either canonical or compatibility) will not be changed in a way that would affect normalization.
4. Once a character is encoded, its properties may still be changed, but not in such a way as to change the fundamental identity of the character.
5. The structure of certain property values in the Unicode character database will not be changed.
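These guarantees mean that properties such as a character's name and canonical combining class are stable across Unicode versions, so software can rely on them. Python's unicodedata module exposes both:

```python
import unicodedata

# The name assigned at encoding time never changes (policy 2).
print(unicodedata.name("\u0905"))       # DEVANAGARI LETTER A

# The canonical combining class is normalization-stable (policy 3).
print(unicodedata.combining("\u093c"))  # 7 (the nukta class)
```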

What is the basic difference between Unicode and ISCII code?
Unicode uses a 16-bit encoding that provides code points for more than 65,000 characters (65,536). It assigns each character a unique numeric value and name, and provides the capacity to encode all of the characters used for the written languages of the world. ISCII uses an 8-bit code, an extension of the 7-bit ASCII code, containing the basic alphabet required for the 10 Indian scripts that originated from the Brahmi script. There are 22 officially recognized languages in India; apart from the Perso-Arabic scripts, the other 10 scripts used for Indian languages have evolved from the ancient Brahmi script and share a common phonetic structure, which makes a common character set possible. The ISCII code table is a superset of all the characters required in the Brahmi-based Indian scripts; for convenience, the alphabet of the official script, Devanagari, is used in the standard.

Are ISO/IEC 10646 and Unicode the same thing?
No. Although the character codes and encoding forms are synchronized between Unicode and ISO/IEC 10646, the Unicode Standard imposes additional constraints on implementations to ensure that they treat characters uniformly across platforms and applications. To this end, it supplies an extensive set of functional character specifications, character data, algorithms and substantial background material that is not in ISO/IEC 10646.

What is the role of W3C India Office?
The W3C India Office is the apex body for W3C activities in India and acts as a single window for bi-directional communication between stakeholders and the W3C. It promotes and proliferates W3C standards, and generates national recommendations for specific standards through stakeholder consultation. For further information, please refer to http://www.w3cindia.in/

What is WCAG 2.0 Standard?
The Web Content Accessibility Guidelines (WCAG) 2.0 documents explain how to make Web content accessible to people with disabilities. Web "content" generally refers to the information in a Web page or Web application, including text, images, forms, sounds, and so on. For further information, please refer to http://www.w3.org/TR/WCAG20/

What is the InScript Keyboard Layout?
InScript (Indian Script) is a touch-typing keyboard layout for entering Indic text on a computer, standardized by the Government of India for Indic computing. It has a common layout for all the Indic scripts, and is the default option for data entry in Indian languages and all ten Indian scripts. The layout uses the standard 101-key keyboard, and the overlay fits on any existing English keyboard. The mapping of the characters is common to all the Indian languages written left to right, because the basic character set of the Indian languages is common. InScript keyboards now come built into all newer operating systems, including Windows (2000, XP, Vista), Linux and Macintosh.


Tools and Technologies

What are OpenType fonts?
OpenType is a registered trademark of Microsoft Corporation. Because of their wide availability and typographic flexibility, including provisions for handling the diverse behaviours of all the world's writing systems, OpenType fonts are commonly used today on the major computer platforms. OpenType support consists of three types: basic OpenType support (the fonts work like any other fonts); Unicode support (access to extended language character sets); and OpenType layout support (support for advanced typographic features). Some operating systems (or operating system extensions) can provide support for one or more of these, but support for Unicode and layout features requires that an application be programmed to provide this functionality. OTF and OFF (Open Font Format) are technically synonymous.

What are TrueType fonts?
TrueType is an outline font standard originally developed by Apple Computer in the late 1980s as a competitor to Adobe's Type 1 fonts used in PostScript. TrueType offered font developers a high degree of control over precisely how their fonts are displayed, right down to particular pixels, at various font sizes (with widely varying rendering technologies in use today, pixel-level control is no longer certain).

What are dynamic fonts?
Dynamic fonts are a technology for delivering TrueType fonts to the client side in a transparent way. If a user needs to view pages in Indian languages, the fonts can be delivered to the client in EOT (Embedded OpenType) or PFR (Portable Font Resource) format.

When I use dynamic fonts on a colored background, the color around the text differs from the rest of the background, but only in a Netscape browser. Why?
With Communicator 4.04 and earlier versions, some 256-color systems have trouble displaying text with an explicitly declared background color. This problem was fixed in version 4.05 of Communicator and Navigator; check with Netscape to see if the updated version of the software is available for your system. You may also want to see if you can set your display adapter to 16-bit color (65,536 colors) or higher. When building your pages, for best results on 256-color systems, we recommend using one of the following named background colors: aqua, black, blue, cyan, fuchsia, gray, green, lime, magenta, maroon, navy, olive, purple, red, silver, teal, white, yellow. You can also use RGB equivalents, such as #000000 (black), #FF0000 (red), #00FF00 (green), #0000FF (blue), #FFFF00 (yellow), etc.

What can I do with my dynamic font documents for browsers that do not support dynamic fonts?
You can specify alternate fonts in FONT FACE tags and Cascading Style Sheets (CSS), using fonts that are readily available in most operating systems. The most common fonts are:
• Sans serif: Arial (Windows), Helvetica (Mac), Helvetica (UNIX/X Windows)
• Serif: Times New Roman (Windows), Times (Mac), Times (UNIX/X Windows)
• Fixed pitch: Courier New (Windows), Courier (Mac), Courier (UNIX/X Windows)
In FONT FACE, for example, you would declare alternate fonts like this: <FONT FACE="Arial, Helvetica">. If the first font is not available, the second font is used, and so on. As far as we know, there is no limit on the number of alternate fonts you can list, although more than three is probably not practical. For Cascading Style Sheets, look through your HTML editor's documentation for advice on specifying alternate fonts in the font-family property.


Research Areas

What is machine translation?
Machine translation is also called "automatic translation" or simply translation software. Machine translation software translates text in one natural language into another natural language, taking into account the grammatical structure of each language and using rules to transfer the grammatical structure of the source language (text to be translated) into the target language (translated text). Machine translation cannot replace a human translator for demanding applications such as legal or literary work, nor is it intended to. Many companies that represent themselves as MT providers are actually selling "word by word" translation. Make sure you know what you are getting and ask a lot of questions about upgrade paths and integration.

What is HWR?
Handwriting recognition (HWR) is the software process by which handwritten characters are analyzed and displayed as computer text characters.

What is "word by word" translation?
Word by word translation translates each word or phrase that it understands, but does not take grammar into account. Word by word translators are generally not as effective as automatic or machine translators, but still can be very useful, for instance as a translation aid.
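A toy sketch of the idea in Python, using an invented two-entry English-to-Hindi glossary; words outside the glossary pass through unchanged and no grammar is applied, which is exactly the method's limitation:

```python
# Hypothetical glossary for illustration only.
GLOSSARY = {"hello": "नमस्ते", "world": "दुनिया"}

def word_by_word(text: str) -> str:
    # Translate each token independently; grammar and word order are ignored.
    return " ".join(GLOSSARY.get(w.lower(), w) for w in text.split())

print(word_by_word("hello world"))  # नमस्ते दुनिया
```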

How can I translate letters and other paper documents with my computer?
You can scan them using optical character recognition (OCR) software and then use a translation program to translate the result. You may need special OCR software designed to recognize the source language. You can also re-type documents into your computer, although this may be impractical if you are not familiar with the language or do not have the proper equipment, such as multilingual word processing software or special keyboards.

What quality of translation can I expect from translation software?
This depends on many factors such as the translation program, the type of translation, the grammar of the document to be translated, the use of a specialty dictionary or glossary, among other factors. The quality of the engine of the translation program and the size of its dictionary are usually the most important factors. Generally, you can expect draft-quality translations: the result can be readily understood, but will need editing and correction for professional use. Again, a professional translator or firm should be used for demanding or mission-critical applications such as legal or literary work.

Can I translate web pages and e-mail?
Yes! Several translation programs allow you to translate web pages and/or e-mail directly. Other programs require you to cut and paste the information into the translation program before translating it.

What is OCR?
Optical Character Recognition (OCR) is a process of converting printed materials into text or word processing files that can be easily edited and stored.

What is the best scanning resolution for OCR?
Most OCR engines are optimized for 300 dpi images, so scanning at true 300 dpi optical resolution is very important. Scanning at a lower resolution and then using scanner software to increase the dpi afterwards does nothing for OCR. In cases where the font size of the characters in an image is very small (a point size of 4 or less), scanning at 400 dpi can improve character recognition; this again requires a scanner that supports true 400 dpi optical resolution.
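The interaction between point size and resolution is simple arithmetic: a glyph's height in pixels is its point size divided by 72 (points per inch) times the scan resolution. A quick check of why 4-point text benefits from 400 dpi:

```python
def glyph_height_px(point_size: float, dpi: int) -> float:
    # 1 point = 1/72 inch, so height in pixels = points / 72 * dpi.
    return point_size / 72 * dpi

print(round(glyph_height_px(4, 300), 1))  # 16.7 -- marginal detail for OCR
print(round(glyph_height_px(4, 400), 1))  # 22.2 -- noticeably more detail
```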

What is the difference between Forms-based OCR and Full-Text OCR?
A typical form has a structured page layout that contains both static and variable information. If the variable information on the form has been filled in using machine-printed characters, the form is a candidate for forms-based OCR. If each page you want to OCR always has the same form (i.e., the layout of text on every page is the same), you can create a zone "template" that OCR can use to extract the data you are looking for. Full-text OCR just means that you intend to OCR the entire page without prior zoning; in effect, the entire page is treated as a single zone. There are cases, however, when zoning is valuable even in a full-text environment.

Website Last Updated on : 22 Apr 2022