1. Common Locale Data Repository

CLDR provides key building blocks for software to support the world's languages, with the largest and most extensive standard repository of locale data available. A wide spectrum of companies use this data for software internationalization and localization, adapting software to the conventions of different languages for common software tasks.

CLDR contains the following types of data:

  • Locale-specific patterns for formatting and parsing: dates, times, time zones, numbers and currency values
  • Translations of names: languages, scripts, countries and regions, currencies, eras, months, weekdays, day periods, time zones, cities, and time units
  • Language & script information: characters used; plural cases; gender of lists; capitalization; rules for sorting & searching; writing direction; transliteration rules; rules for spelling out numbers; rules for segmenting text into graphemes, words, and sentences
  • Country information: language usage, currency information, calendar preference and week conventions, and telephone codes
  • Other: ISO & BCP 47 code support (cross mappings, etc.), keyboard layout
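
As one illustration of the locale-specific number patterns listed above, CLDR records the decimal pattern `#,##,##0.###` for Indian locales such as `en_IN` and `hi`: the last three digits form one group and the remaining digits are grouped in twos. The sketch below hand-codes that single rule in Python. The helper name `group_indian` is hypothetical and is not part of CLDR or of any library; real applications would normally read the pattern from CLDR through a library such as ICU.

```python
def group_indian(value: int) -> str:
    """Group an integer's digits as ...,2,2,3 (lakh/crore style)."""
    sign = "-" if value < 0 else ""
    digits = str(abs(value))
    # The rightmost three digits always form the final group.
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # The remaining digits are split into groups of two, right to left.
    while len(head) > 2:
        groups.insert(0, head[-2:])
        head = head[:-2]
    if head:
        groups.insert(0, head)
    return sign + ",".join(groups + [tail])


print(group_indian(12345678))  # 1,23,45,678
```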

The Technology Development for Indian Languages (TDIL) programme of the Ministry of Electronics and Information Technology focuses on developing futuristic technologies for all 22 constitutionally recognized languages, and on the standardization issues that will act as enablers in deploying these technologies. TDIL has initiated the process of building locale data for Indian languages. The Common Locale Data Repository (CLDR) is maintained at the international level by the Unicode Consortium.

We invite you to contribute CLDR data for your language in this data format, so that the standard can also serve as a reference guide for developers of localized applications, such as e-governance applications. The CLDR data of other languages is available for your reference.

2. Web Payment

Web and Digital Payment

Web payment comprises a set of rules for executing payment transactions. These rules are followed by adhering entities (payment processors, payers and payees), and the transactions take place over networks such as the Web. Some digital payment schemes internally use payment instruments from other payment schemes.

The Web Payments ecosystem strives to support fundamental Web principles by:

  • Adhering to Web architecture fundamentals
  • Supporting network and device independence
  • Providing for payers and payees with differing physical and cognitive abilities
  • Being machine-readable where possible to enable automation and engagement of non-human entities

WSI has proposed work on the localization of Web payments in Indian languages. An expert committee has been constituted and a background document has been developed. The following organizations are taking part in this activity:

  • NPCI
  • SBI
  • Canara Bank
  • HDFC PayZapp
  • BillDesk
  • TCS
  • ReBIT

The background document on Web payment

3. Internationalization

I. Character Model for the World Wide Web: String Matching
The goal of this Group Note is to make it possible to transmit and process the characters used around the world in a well-defined and well-understood way. It describes the ways in which semantically equivalent texts can be encoded differently, and provides a common reference for consistent, interoperable text manipulation on the World Wide Web.
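
A minimal sketch of the problem, using Python's standard `unicodedata` module: the Devanagari letter QA can be stored either as one code point (U+0958) or as KA followed by a combining nukta (U+0915 U+093C). The two strings render identically but compare as unequal until both are normalized. The helper name `canonically_equal` is an assumption made for this example, and Unicode normalization is only one of the matching techniques that such a note can discuss.

```python
import unicodedata

precomposed = "\u0958"        # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093C"   # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == decomposed)                   # False: different code points
print(canonically_equal(precomposed, decomposed))  # True: same text after NFC
```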
II. Ready-made Counter Styles
The W3C “Ready-made Counter Styles” Working Group Note provides code snippets for user-defined counter styles used by various cultures around the world. It can serve as a reference for those who wish to create their own user-defined counter styles for CSS style sheets. Alphabetic list styles still need to be defined for Indian languages, for those who wish to use them in CSS.
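
To make the idea of an alphabetic listing concrete, the sketch below reimplements in Python the numbering rule that CSS Counter Styles defines for the `alphabetic` system: after the last symbol, numbering continues with two-symbol combinations, much like a, b, ..., z, aa, ab. The symbol list (the first five Devanagari consonants) and the function name are illustrative assumptions, not a counter style taken from the W3C note.

```python
def alphabetic_counter(value: int, symbols: list[str]) -> str:
    """Render a list index (starting at 1) using an 'alphabetic' symbol set."""
    if value < 1 or len(symbols) < 2:
        raise ValueError("needs value >= 1 and at least two symbols")
    n = len(symbols)
    out = []
    while value > 0:
        value -= 1                      # shift to a zero-based digit
        out.append(symbols[value % n])  # least significant symbol first
        value //= n
    return "".join(reversed(out))


# Illustrative symbol set only: the first five Devanagari consonants.
DEVANAGARI_SAMPLE = ["क", "ख", "ग", "घ", "ङ"]

for i in (1, 2, 5, 6, 7):
    print(i, alphabetic_counter(i, DEVANAGARI_SAMPLE))
```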
III. Text Layout Requirements for Arabic Script (Urdu, Kashmiri, Sindhi)
This document describes the basic requirements for Arabic script layout and text support on the Web and in eBooks. According to the W3C language matrix for typography on the Web, the following areas still need investigation for Urdu, Kashmiri, and Sindhi:

  • Fonts, font styles
  • Glyph control
  • Cursive text
  • Text decoration
  • Line breaking
  • Hyphenation
  • Styled initials (application developers, e-publishing, browsers, etc.)
  • Lists, counters
  • Letter spacing & word spacing etc.
IV. Unicode Technical Reports
Requirements for Indian languages need to be identified in the technical annexes of Unicode, in areas such as text segmentation, line breaking, emoji, and vertical alignment.

4. CSS & Digital Publishing

CSS stands for Cascading Style Sheets. A style sheet holds a collection of rules that control how web pages are presented. CSS can be applied to web pages in many ways, but the most powerful is an external style sheet: the design and appearance of a whole site can then be controlled from a single location, which makes site-wide updates easy. Each cultural community has its own language, script, and writing system, so bringing each writing system into cyberspace is a task of very high importance for information and communication technology.

Current work on Indic layout requirements in CSS & Digital Publishing is listed below:

1. First draft of Indic layout requirements

2. Minimum Requirements of E-publishing for Indic (pdf file) 926 KB

5. Mobile Technology

SMS standard

Mobile technology is an important means of communication today. With its accelerating growth in India, the number of subscribers from rural areas will grow manifold. Because English literacy is relatively low in rural areas, a large number of these subscribers will be deprived of the benefits of SMS unless Indian language messaging support improves significantly.

In mobile technology, multilingual data handling is vital across different layers. Any encoding scheme chosen for data transmission should meet the following requirements:

1. The data encoding scheme should support all characters and character combinations of Indian languages as per the Unicode Standard.

2. There should be a provision to change languages within a single message.

3. The encoding should be flexible enough to accommodate future versions of the Unicode Standard.

The three SMS encoding schemes currently prevalent in India are:

ISCII-based 7-bit encoding

7-bit default alphabet as per the GSM standard

UCS2 as per the GSM standard

The GSM standard supports the 7-bit default alphabet and UCS2. For Indian languages, each of these encodings has its own pros and cons, especially in terms of the number of characters supported and standard implementation. The 7-bit EA-ISCII encoding can handle all the intricacies of Indian languages, but it lacks flexibility and does not at present support all Unicode characters. Adopting 7-bit standards to cater to the growing demands of Indian languages will therefore not make mobile devices truly localized for Indian languages.
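
The trade-off in the number of characters can be sketched with a little arithmetic. A single SMS carries 140 octets of user data, which holds 160 characters in a 7-bit encoding but only 70 in UCS2 (two octets per character); concatenated messages hold 153 and 67 characters per part respectively. The Python sketch below estimates how many parts a message needs. It rests on two simplifying assumptions: it treats ASCII as a stand-in for the GSM 7-bit default alphabet (the real alphabet differs in some characters), and it does not model ISCII-based 7-bit schemes. The function name `sms_parts` is hypothetical.

```python
import math


def sms_parts(text: str) -> tuple[str, int]:
    """Return (encoding, number of SMS parts) needed for a message."""
    if text.isascii():
        # Assumption: ASCII approximates the GSM 7-bit default alphabet.
        encoding, single, multi = "7-bit", 160, 153
        units = len(text)
    else:
        # Any other text falls back to 16-bit code units.
        encoding, single, multi = "UCS2", 70, 67
        units = len(text.encode("utf-16-be")) // 2
    parts = 1 if units <= single else math.ceil(units / multi)
    return encoding, parts


print(sms_parts("Hello"))    # ('7-bit', 1)
print(sms_parts("नमस्ते"))     # ('UCS2', 1)
print(sms_parts("न" * 71))   # ('UCS2', 2)
```

Under these assumptions, a message in an Indian language reaches the single-message limit after 70 characters, less than half the 160 available to a 7-bit message.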

Meetings on the SMS standard for Indian languages

1st Meeting (pdf file) 142 KB

2nd Meeting (pdf file) 133 KB

3rd Meeting (pdf file) 130 KB

4th Meeting (pdf file) 345 KB

Consultation papers

1. Consultation paper for Mobile Manufacturers (pdf file) 865 KB

2. Consultation paper for Mobile Service Providers (pdf file) 652 KB

3. Consultation paper for VAS (pdf file) 895 KB

6. Semantic Web

E-governance in India has recently gained momentum through various national and state-level Mission Mode Projects, which aim at better citizen-centric services and the long-term vision of 'Digital Unite for All'. However, the accessibility of data remains a major concern, as most data is coupled with applications and cannot be reused for better planning and coordination. To overcome this barrier, the National Data Accessibility Policy has been framed. It will enable the use of open linked data and other Semantic Web technologies to create an ecosystem built on an open framework for data publishing and accessibility. We shall elucidate the present state of e-governance implementation in India and the future direction towards open linked data for reaching out to the masses.

1. Background paper on Semantic Web (pdf file) 144 KB

2. Presentation on NLP Interchange format(NIF) (pdf file) 1.15 MB

3. Research Paper on Semantic Web (pdf file) 334 KB

7. Internationalization Tag Set 2.0

ITS 2.0 is a framework for adding metadata to Web content, for the benefit of localization, language technologies, and internationalization [1]. The Internationalization Tag Set (ITS) 2.0 addresses some of the challenges and opportunities related to internationalization, translation, and localization. In particular, it contributes to concepts in the realm of metadata for internationalization, translation, and localization related to core Web technologies such as XML. The ITS 2.0 specification both identifies concepts (such as “Translate”) that are important for internationalization and localization, and defines implementations of these concepts (termed “ITS data categories”) as a set of elements and attributes called the Internationalization Tag Set (ITS).
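
The “Translate” concept can be sketched as follows. In XML content, the local attribute `its:translate="no"` marks an element whose content should not be translated, and child elements inherit the value. The Python sketch below walks a small document and collects only the translatable text. The sample document and the function name `translatable_text` are made up for illustration, and the sketch handles only local `its:translate` attributes, not the global rules that ITS 2.0 also defines.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
TRANSLATE = f"{{{ITS_NS}}}translate"

# Hypothetical sample document, for illustration only.
SAMPLE = """<doc xmlns:its="http://www.w3.org/2005/11/its">
  <title>User guide</title>
  <code its:translate="no">git commit</code>
  <p>Press <kbd its:translate="no">Ctrl</kbd> to continue.</p>
</doc>"""


def translatable_text(elem, inherited="yes"):
    """Yield text whose effective its:translate value is 'yes'."""
    flag = elem.get(TRANSLATE, inherited)  # a local value overrides the inherited one
    if flag == "yes" and elem.text and elem.text.strip():
        yield elem.text.strip()
    for child in elem:
        yield from translatable_text(child, flag)
        # Text after a child element belongs to the parent element.
        if flag == "yes" and child.tail and child.tail.strip():
            yield child.tail.strip()


print(list(translatable_text(ET.fromstring(SAMPLE))))
```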

ITS 2.0 requirements for Indian languages (pdf file) 267 KB

8. Speech Technology

Speech processing provides powerful capabilities for improving interaction between humans and machines, and between humans using machines. It can be enhanced with Natural Language Processing (NLP) technology to model the human capacity to comprehend and process the content of language, enabling translation of a spoken sentence from one language to another and many other intelligent linguistic applications. Speech tools help to a great extent in providing information access interfaces for differently-abled persons, such as people with visual or cerebral disabilities. Since speech resources are the key building blocks of speech-based systems, initiatives are being taken to develop speech resources for Indian languages. Developing speech resources for synthesis, recognition, and speaker identification requires different standards and methodologies. WSI currently focuses on W3C speech standards such as the Pronunciation Lexicon Specification (PLS), SSML, SRGS, and SISR.
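
As a rough idea of what a pronunciation lexicon looks like, the sketch below uses Python's standard XML library to build a tiny PLS document in which a written abbreviation is mapped to the word a synthesizer should speak, using the PLS `alias` element. The Hindi entry (the abbreviation for "Doctor") and the function name `build_lexicon` are illustrative assumptions, not content from the WSI drafts listed below.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_lexicon(lang: str, aliases: dict[str, str]) -> ET.Element:
    """Build a PLS <lexicon> mapping written forms to spoken expansions."""
    ET.register_namespace("", PLS_NS)  # serialize PLS as the default namespace
    lexicon = ET.Element(
        f"{{{PLS_NS}}}lexicon",
        {"version": "1.0", "alphabet": "ipa", XML_LANG: lang},
    )
    for written, spoken in aliases.items():
        lexeme = ET.SubElement(lexicon, f"{{{PLS_NS}}}lexeme")
        ET.SubElement(lexeme, f"{{{PLS_NS}}}grapheme").text = written
        ET.SubElement(lexeme, f"{{{PLS_NS}}}alias").text = spoken
    return lexicon


# Illustrative entry: the Hindi abbreviation for "Doctor" and its expansion.
lexicon = build_lexicon("hi-IN", {"डॉ.": "डॉक्टर"})
xml_text = ET.tostring(lexicon, encoding="unicode")
print(xml_text)
```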

1. SSML & SRGS requirements for Punjabi (pdf file) 348 KB

2. SSML & SRGS requirements for Bengali (pdf file) 800 KB

3. Pronunciation Lexicon Specification (PLS) Draft (pdf file) 493 KB

4. Speech Synthesis Markup Language (SSML) requirements for Indic (pdf file) 74 KB

5. Report of Workshop on PLS for Indic languages (pdf file) 170 KB

