Here at Ultimate, we’re building the world’s most powerful virtual agent platform. For the past 5 years this has been our vision. And part of making that vision a reality is being able to seamlessly support multiple languages with a single bot. That’s why we’re so excited to announce that we’ve developed a new language detection model that innovates above and beyond the current tools available on the market.
We can now proudly say that it’s the most accurate model in the field, with the best performance on short messages (which is key for those quick customer support chat conversations). In fact, on many standardized NLP benchmarking tests, it reached nearly 100% accuracy.
So speak Spanish to us (or is that Italian?): our state-of-the-art language detection model will be able to tell the difference. And now your bot can too.
Why we embarked on this language detection research
Language detection is not the hardest task in Natural Language Processing (NLP). When it comes to long, well-written texts, most language detection AI models can recognize what language a text is written in 99% of the time.
But live chat conversations with customer service are rarely ever long or well-written.
And therein lies the issue. Language detection was supposed to be a solved problem. But fastText, the go-to open-source language detection tool that is widely considered the industry standard, didn’t work quite as well on the short, informal messages typical of our industry. So given that we have an in-house team of AI experts, we thought we’d train our own language detection model. And this one would be purpose-built for customer support.
The challenges of creating a new language detection method
When we set about creating a new language detection model, there were a few challenges we needed to overcome. The new model would have to:
- cover a wide array of languages
- be very accurate, even if the message contained only one or two words
- be small in size to avoid overloading the system memory
- work in real time to avoid blocking the flow of chat conversations
To overcome these challenges, we did a few key things:
- We developed a preprocessing method that removes “noise” — the elements of text like numbers and punctuation that aren’t needed to recognize language — because they often lead to false predictions.
- We created a large greeting dictionary because statistical models tend to have trouble with greetings, which makes sense since so many languages have similar words for saying hello.
- And lastly, we designed and trained the machine learning model on the types of messages it would actually be dealing with: chat dialogues, including short sentences in many languages, with the same amount of training data for each language, so there would be no bias toward English.
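The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production code: the function names `strip_noise` and `detect_greeting`, the tiny `GREETINGS` table, and the choice to drop every number, punctuation mark, and symbol by Unicode category are all hypothetical.

```python
import unicodedata
from typing import Optional

# Hypothetical, tiny greeting dictionary for illustration only.
# A real one would be far larger and would have to deal with
# greetings that are shared between several languages.
GREETINGS = {
    "hola": "es",
    "bonjour": "fr",
    "hello": "en",
    "ciao": "it",
}


def strip_noise(text: str) -> str:
    """Drop numbers, punctuation and symbols, then lowercase and collapse spaces."""
    kept = []
    for ch in text:
        # Unicode major categories: N = number, P = punctuation, S = symbol.
        if unicodedata.category(ch)[0] in ("N", "P", "S"):
            kept.append(" ")
        else:
            kept.append(ch)
    return " ".join("".join(kept).lower().split())


def detect_greeting(text: str) -> Optional[str]:
    """Return a language code if the cleaned message is a known greeting, else None."""
    return GREETINGS.get(strip_noise(text))
```

In this sketch, a message that is not a known greeting returns `None`, and it would then be handed to the statistical model.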
How the new language detection model performs
The new language detection model is incredibly accurate: it makes up to 71% fewer mistakes than the pre-trained fastText model. Below you can see the comparison of how each model performed on single-word, word-pair, and sentence tasks.
What’s more, the new model is a lot fairer to smaller languages than the original fastText. Greetings, normally the toughest challenge, can be detected reliably, and overall the model is better able to handle the short messages that are typical in chat. The new model covers 109 languages, and more languages can easily be added on demand.
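For readers who want to see what a "fewer mistakes" figure means in practice, here is a minimal sketch of the arithmetic, assuming the figure is a relative reduction in the number of wrong predictions. The function name and the error counts in the usage note are made up for illustration; they are not our benchmark results.

```python
def relative_error_reduction(baseline_errors: int, new_errors: int) -> float:
    """Fraction of the baseline model's mistakes that the new model avoids."""
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return (baseline_errors - new_errors) / baseline_errors
```

With hypothetical counts of 100 baseline mistakes and 29 new-model mistakes on the same test set, `relative_error_reduction(100, 29)` comes out at roughly 0.71, which is what a "71% fewer mistakes" claim would describe.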
Of course, no statistical model is perfect. Some languages have overlapping words (we’re looking at you, Serbian and Croatian) and a lot of messages contain only one or two words, so some ambiguity is unavoidable. Within those limits, the model is about as good as it can get. And if there’s any doubt, your bot will always ask the user which language they prefer to speak.
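One common way to implement this kind of "ask when in doubt" fallback is a confidence check on the model’s scores. The sketch below is an assumption about how such a check could look, not a description of our implementation: the function name `choose_language` and both threshold values are hypothetical.

```python
from typing import Mapping, Optional


def choose_language(
    scores: Mapping[str, float],
    min_confidence: float = 0.8,  # hypothetical threshold
    min_margin: float = 0.2,      # hypothetical gap to the runner-up
) -> Optional[str]:
    """Return the top language, or None when the bot should ask the user."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_lang, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    # Too uncertain overall, or too close to the second-best guess.
    if best_score < min_confidence or best_score - runner_up_score < min_margin:
        return None
    return best_lang
```

In this sketch, a near tie such as `{"sr": 0.51, "hr": 0.47}` returns `None`, so the bot would ask the user, while a clear result such as `{"es": 0.97, "it": 0.02}` returns `"es"`.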
So, are you ready to scale your support by speaking your customers’ language? Ultimate’s state-of-the-art language detection model combined with our powerful conversational AI technology can help you do just that.