Fine-Tuned Indic Llamas are ‘Utter Garbaggio’ (2024)

  • Last updated June 19, 2024

Models built on top of Llama 2 with 2 billion tokens of another language cannot be called products.


  • By Mohit Pandey

It has been reiterated several times that the existing AI models from Google, Meta, and OpenAI are not inherently good at dealing with Indian language data, or indeed data in any language other than English. Worse, even expanding the models’ capabilities by feeding them Indic language data does not necessarily improve quality.

Raj Dabre, a prominent researcher at NICT in Kyoto, adjunct faculty at IIT Madras and a visiting professor at IIT Bombay, recently posted similar thoughts on X. “People be taking llama2, expanding vocabulary, pretraining on 2B tokens of a language and calling it a product,” he said, adding that he has already trained around 50 such models.

People be taking llama2, expanding vocabulary, pretraining on 2B tokens of a language and calling it a product. Bruh I have trained like 50 such models but I can tell you that outside of being useful to answer some research questions they are utter garbaggio.

— Raj Dabre (@prajdabre1) June 18, 2024

He further added that, apart from answering some research questions, such models built on top of existing ones such as Mistral, Gemma, or Llama 2 are “utter garbaggio”.

Much of this criticism is aimed at the rise of open-source Indic language models such as Tamil Llama, Telugu Llama, and Kannada Llama, among many similar offerings, all built on top of open-source English-first models.
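The recipe Dabre describes — expand the tokenizer’s vocabulary, then continue pretraining on a couple of billion tokens — hinges on initialising embedding rows for the new tokens. A minimal NumPy sketch of that initialisation step, using the common mean-of-existing-embeddings heuristic (the sizes and function name here are illustrative, not taken from any actual Llama checkpoint):

```python
import numpy as np

def expand_embeddings(embed: np.ndarray, n_new: int, rng=None) -> np.ndarray:
    """Append rows for new vocabulary tokens, initialised to the mean of the
    existing embeddings plus small noise -- a common heuristic when extending
    a pretrained model's vocabulary before continued pretraining."""
    rng = rng or np.random.default_rng(0)
    mean = embed.mean(axis=0)
    noise = rng.normal(scale=0.02, size=(n_new, embed.shape[1]))
    return np.vstack([embed, mean + noise])

# Toy "pretrained" embedding table: a real Llama 2 table is 32,000 x 4,096;
# we use 100 x 16 here so the sketch runs instantly.
old = np.random.default_rng(1).normal(size=(100, 16))
new = expand_embeddings(old, n_new=20)
print(new.shape)  # (120, 16)
```

The original rows are left untouched; only the appended rows are new, which is why such models start out fluent in English and must learn the new language almost entirely from the continued-pretraining tokens.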

The sentiment that India is not innovating in the AI space and merely building on top of existing models from the West has been echoed several times. Discussing why India’s future in AI looks bleak, several Indian AI experts have said that most LLMs produced in India are built on top of already-available LLMs and cannot be called fundamental research.

There are others, though, such as Pratik Desai from Kissan AI or Anubhav Sabharwal from CoRover.ai, who believe that building on top of existing open-source models is good enough for creating proprietary models and serving specialised use cases.

Though startups like Sarvam AI are planning to build foundational models in Indic languages, its current OpenHathi model is built on top of Meta’s Llama 2. Meanwhile, Soket AI Labs has already launched the open-source foundational model Pragna-1B, but it too has yet to see much adoption.

Much of this stems from the lack of adoption of Indic language models in the country: everyone wants Indian models, but the industry is not adopting them readily.

Researchers, too, are content with fine-tuning English-based models on new languages and trying them out for specific use cases, since training a frontier foundational model would be a waste of resources for them.

The Creators Agree About the Adoption Problem

There is a widespread view that open source is a good enough start for India, given how low the adoption rate is. As Nandan Nilekani recently said, India’s focus should be on using AI to make a difference in people’s lives. “We are not in the arms race to build the next LLM, let people with capital, let people who want to pedal ships do all that stuff… We are here to make a difference, and our aim is to put this technology in the hands of people.”

Cost definitely plays a big role, along with the flexibility of the open-source models created by big-tech companies. Speaking with AIM, Adarsh Shirawalmath, the creator of Kannada Llama, agreed that much of the problem in the country is that the industry is unwilling to adopt models that are built on top of existing ones.

In a recent podcast with AIM, Arjun Rao, the founding partner of Speciale Invest, said that he would not be interested in investing in a company that is not doing foundational research and is merely building wrappers or models on top of existing open-source offerings.

Earlier, in a conversation with AIM, Dabre also discussed the complexities of building models for Indic languages. “These models [GPT-3] have seen close to tens of trillions of tokens or words in English. Unless you have seen the entirety of the web, or more or less all of it, none of these models will be able to actually solve the generative AI problem for that [Indian] language,” said Dabre.

Dabre rued that chatbots for Indian languages are still a dream. “You will see a lot of people claiming that they can make a chatbot or LLM for Indian languages, but 99% of those things are transient. They are not going to be too useful in production, because nobody has solved the data problem yet,” said Dabre. The biggest missing link is Indic language data, a gap that still needs to be closed.

For now, even if these fine-tuned models cannot be classified as products, they are ideal for research by university students. If companies such as Sarvam AI, Kissan AI, and Krutrim are still struggling to build foundational Indic language models, the individuals experimenting with such models should definitely be encouraged — just not under the label of products.

Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words.
