A Sri Lankan research team from the University of Moratuwa and local AI firm Decryptogen has built a Sinhala large language model that scored 4.5 out of 5 in blind evaluation, compared with just 1 out of 5 for Meta’s base Llama 3.1 on the same Sinhala prompts — a result that has been accepted for publication in the peer-reviewed IEEE Access journal.

The team, working with just two GPUs, cut the model’s perplexity by close to 90 percent. Perplexity is a standard measure of how well a language model predicts text in a given language, and lower is better. By comparison, major US AI labs use thousands of GPUs and spend hundreds of millions of dollars on their own training runs. In practical terms, the model can hold a natural conversation in Sinhala, follow instructions, and stay coherent across long responses, the project statement said.
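For readers who want to see what a perplexity figure means, the following is a minimal sketch of the standard definition: the exponential of the average negative log-likelihood a model assigns to each token. The per-token probabilities below are hypothetical values chosen for illustration, not numbers from the paper, and the helper names `perplexity` and `relative_reduction` are ours.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token), natural log."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def relative_reduction(before, after):
    """Fractional drop from `before` to `after` (0.9 means a 90 percent cut)."""
    return (before - after) / before


# Hypothetical probabilities each model assigns to the correct next token.
# These are illustrative assumptions only, not measurements from the study.
base_probs = [0.02, 0.05, 0.01, 0.04]
tuned_probs = [0.30, 0.45, 0.20, 0.35]

base_ppl = perplexity([math.log(p) for p in base_probs])
tuned_ppl = perplexity([math.log(p) for p in tuned_probs])

print(f"base perplexity:  {base_ppl:.1f}")
print(f"tuned perplexity: {tuned_ppl:.1f}")
print(f"reduction:        {relative_reduction(base_ppl, tuned_ppl):.0%}")
```

A model that spreads its guesses evenly over four options has a perplexity of exactly 4, so a lower score means the model is, on average, choosing among fewer plausible next tokens.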

IEEE Access, an open-access journal of the Institute of Electrical and Electronics Engineers, carries a 3.6 Journal Impact Factor and an h5-index above 200. The journal uses a binary peer-review process in which manuscripts are either accepted, with at most minor edits, or rejected, with no major-revision cycle. The team describes this as a quality bar few low-resource-language AI papers have cleared. The full paper, “End-to-End Adaptation of LLMs for Low-Resource Languages,” will appear under DOI 10.1109/ACCESS.2026.3693119.

The team had to build its training corpus from scratch. They scraped Sinhala news sites, books and online sources and used Hindi datasets — Hindi and Sinhala share Indo-Aryan roots — as a starting point, assembling 3.6 million question-answer pairs and 4 billion tokens. The dataset, one of the largest public Sinhala AI corpora, has been released on Hugging Face.

A key engineering breakthrough was the redesign of the tokenizer. The original Llama tokenizer used an average of 91 tokens per Sinhala sentence and failed on 97.5 percent of Sinhala characters at the byte level. After the team added around 35,000 Sinhala-specific tokens, the average dropped to 23 tokens per sentence, roughly a fourfold reduction, with zero byte-level failures.
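The effect of adding language-specific tokens can be illustrated with a toy example. The sketch below is not the team's tokenizer and does not reproduce Llama's byte-pair encoding: it swaps in a simple greedy longest-match segmenter and a three-entry hypothetical vocabulary, purely to show why a script whose characters each take three bytes in UTF-8 is expensive for a tokenizer that lacks dedicated entries for it.

```python
def utf8_byte_tokens(text):
    """Worst case for a byte-level tokenizer: one token per UTF-8 byte."""
    return [bytes([b]) for b in text.encode("utf-8")]


def greedy_tokens(text, vocab):
    """Greedy longest-match segmentation (a stand-in for BPE, for illustration).

    Characters not covered by any vocabulary entry become single-character tokens.
    """
    tokens = []
    max_len = max((len(entry) for entry in vocab), default=1)
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# "Lankawa" (Sri Lanka) in Sinhala script, written as escapes: 5 code points.
word = "\u0dbd\u0d82\u0d9a\u0dcf\u0dc0"

# Hypothetical Sinhala-specific vocabulary entries (an assumption for this sketch).
sinhala_vocab = {"\u0dbd\u0d82", "\u0d9a\u0dcf", "\u0dc0"}

print(len(utf8_byte_tokens(word)))              # 15: three bytes per character
print(len(greedy_tokens(word, sinhala_vocab)))  # 3: one token per vocabulary entry

# Ratio implied by the reported averages: 91 tokens down to 23 per sentence.
print(round(91 / 23, 2))                        # 3.96
```

Fewer tokens per sentence means more Sinhala text fits in the same context window and each training step covers more of the language, which is one reason tokenizer changes can matter as much as extra compute.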

The work was conducted at the Department of Electrical Engineering, University of Moratuwa. It was led by Decryptogen CEO Sanjeewa Alwis; Dr. Chathura Wanigasekara, an IEEE Senior Member at the Institute of Maritime Technologies and Propulsion Systems at the German Aerospace Centre (DLR), Geesthacht; and Dr. Logeeshan Velmanickam, Senior Lecturer in the same department. Wanigasekara and Velmanickam are the corresponding authors. Core engineering, covering model training, dataset construction, tokenizer redesign and evaluation, was carried out by P.K. Udith I. Sandaruwan, Nimesh M.A. Fonseka and Pamith C. Salwathura, all University of Moratuwa graduates, working with Decryptogen’s R&D team. A preliminary version was presented at the IEEE AIIoT Congress in Seattle in 2025.

The team frames the achievement as one of sovereignty rather than just performance. Sinhala is spoken by more than 20 million people but is barely represented in the training data of leading commercial AI systems. Even when foreign tools work in Sinhala, the model weights, training data, safety rules and ultimately the off-switch sit with companies in the United States or China. A locally hosted Sinhala model could be audited and fine-tuned on the island, and could continue to operate regardless of foreign-policy, pricing or licensing shifts.

Identified applications include Sinhala-medium government services, educational tools for Sinhala-medium students, healthcare information for elderly and rural users, accessibility tools for citizens who do not speak English, and natural-sounding customer service for local businesses. Next steps include longer training runs, larger and more diverse Sinhala datasets, and deployments in assistive technologies and conversational systems.