Application developed for computational processing of Khasi dialect of Meghalaya

A researcher from Assam Don Bosco University (ADBU), Dr. Medari Janai Tham, has developed a Natural Language Processing (NLP) application, the “Tham Khasi Annotated Corpus”, with the aim of enabling computational processing of the Khasi dialect.

Dr Medari Janai Tham – Developer of the NLP application “Tham Khasi Annotated Corpus”

NLP is a set of computational approaches for analyzing and synthesizing human language, including speech and text.

Additionally, generating a corpus, a collection of machine-readable content, is an important step in developing NLP systems for a language.

The British National Corpus (BNC) is one of the most widely used English corpora and is popular among academics due to its accessibility.

Since there is no publicly available corpus for Khasi, it is classified as a low-resource language.

However, the publication of the “Tham Khasi Annotated Corpus”, accessible via the European Language Resources Association (ELRA), has made a significant contribution to this subject.

To keep its tagging consistent with that of other Indian languages, the corpus was manually annotated using the PoS (Parts-of-Speech) tag set formulated by the BIS (Bureau of Indian Standards).

Tham was awarded a PhD by the Department of Computer Science and Engineering at ADBU for the thesis “Shallow Parsing for Khasi”, supervised by Professor Pushpak Bhattacharyya of IIT Bombay.

Details of the corpus, including the annotation scheme and the development of Khasi NLP tools, can be found in research papers published as part of Tham's PhD. They are available on a website that also serves as a companion to the book “Ka Grammar Khasi Da Ka Jingdro”, published by Macmillan Education, India.

Tham's other contributions include the BIS Khasi tag set, a Khasi PoS hybrid tagger, a Khasi PoS HMM tagger, a Khasi PoS NLTK tagger, a Khasi HMM shallow parser, and a Khasi shallow parser using the bidirectional gated recurrent unit.
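For readers curious about what an HMM-based PoS tagger involves, the following is a minimal sketch of the general technique, not Tham's implementation. It counts tag transitions and word emissions from a tagged corpus, then uses Viterbi decoding to pick the most likely tag sequence. The training sentences are toy English placeholders and the DET/NOUN/VERB labels are illustrative only; they are not Khasi data or BIS tags. The add-one smoothing is likewise an assumption made to keep the example short.

```python
import math
from collections import Counter, defaultdict

# Toy tagged corpus: English placeholders, NOT Khasi data or BIS tags.
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in sentences:
        prev = "<s>"              # sentence-start marker
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    vocab = {word for sent in sentences for word, _ in sent}
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (a simplifying assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most likely tag sequence for a list of words."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1      # one extra slot for unseen words
    # best[i][tag] = (score of best path ending in tag at position i, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans["<s>"], t, n_tags) + log_prob(emit[t], words[0], n_words)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

trans, emit, vocab = train_hmm(TRAIN)
print(viterbi(["the", "dog", "sleeps"], trans, emit, vocab))
# ['DET', 'NOUN', 'VERB']
```

A tagger of this kind depends entirely on the annotated corpus it is trained on, which is why a manually tagged resource matters so much for a low-resource language.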