NLU Benchmarking: Here’s Why Ours Is the Ultimate Conversational AI Platform

BlogHeader_ 5August2022_1200x628_2x

We’ve been saying for years that our conversational AI is industry-leading. Here’s the data to back it up.

How does conversational AI work in customer support automation?

When we think about AI, many of us still think about robots, cyborgs, and voice assistants that suddenly develop feelings. But as our Chief Science Officer Jaakko Pasanen has written elsewhere, AI is actually much less scary, much more prevalent in our everyday lives, and much simpler than that.

Take conversational AI’s revolution of customer support, for example: The subfield of AI that’s responsible for that sea change is called Natural Language Understanding, or NLU. And the sole reason NLU has made it so much easier for businesses to scale with automation comes down to a very simple task it excels at: Grouping customers’ messages (expressions) into categories (intents).

Today’s chatbots, or intelligent virtual agents, leverage the power of NLU to create better customer support.

Say a customer on a website types into a chat widget, “Where can I see the books I ordered?”, or “Has my book been shipped yet?” The intent, or category, for both expressions could be “Order status”. And NLU-powered AI can recognize that connection within a fraction of a second.

What’s more, a virtual agent can not only understand and classify what customers are asking, but also provide immediate responses based on that understanding – provided it is properly trained to maximum accuracy.

What is a key differentiator of conversational AI?

The key differentiating factor when it comes to comparing conversational AI solutions is how accurately they classify intents. The more accurate and reliable the AI, the more powerful the solution that it drives, and the more satisfied your customers.  Accurate AI will enable you to serve them exactly the information they’re looking for, reducing frustration and waiting times, and servicing a higher number of incoming queries, much higher than an agent could. 

And all of that reflects on your overall customer experience: According to a recent study by researchers at the University of Oslo, efficiency is the number one attribute customers look for in automated support interactions, and the key to effective and efficient conversational AI is – you guessed it, accurate intent classification.

So how can you measure your NLU engine’s accuracy?

To test your AI’s accuracy, you’ll need two things:  A set of data to train your AI engine, and samples of natural language that your engine hasn’t interacted with before to run and test its mettle. And that’s exactly what we did with our AI engine in a recent experiment. Read on for a rundown and key results!

Comparing NLU engines: How good a matchmaker is our AI?

The experiment we ran on our NLU engine is based on a previous experiment by Cognigy, which worked like this: Two datasets (one small, one large) were tested using 64 intents to see how reliably their AI model would correctly match these intents with randomly chosen expressions, or sample sentences.

One of the ways to determine accuracy was to look at the number of “false positives” – instances where the AI would incorrectly match random expressions with an intent even though there wasn’t one. The lower the number of false positives, the more accurate the AI model.

For the small data set, Cognigy used 64 intents and 10 randomly chosen expressions to test their NLU, as well as 1,076 example sentences that had not been included in the original training set. For the larger set, they went for 30 expressions and 5,518 example sentences. And we followed suit.

All of these expressions were grammatically correct and the test data set featured short sentences in English with all relevant details to help identify the intent. While this varies slightly from real-life chatbot interactions, which are a little more messy (and international), it still gives us a good idea of the model’s general accuracy. 

NLU and AI Metrics: The proof is in the precision

To measure our conversational AI model’s quality, we relied on two metrics: accuracy and the F1 score.

Accuracy refers to the percentage of test sentences correctly matched with the underlying intent. A score of .51 would mean that 51% of sentences were successfully paired to an intent.

F1 is the harmonic mean, or average value, between two additional metrics called precision and recall.

Precision shows us how many expressions that the engine identified were actually relevant to our intent. We measured this by identifying how many of the expressions that the engine marked as “weather-related” were actually weather-related. 

Recall shows us the rate of relevant instances that were retrieved: Out of all weather-related queries available in the test data set, how many did the engine correctly pair to a weather-related intent?

Calculating the F1 score let us balance false positives and false negatives, providing an even better benchmarking of the NLU engine by giving equal weight to both precision and recall.

Results: Told ya our conversational AI is the best!

In the table below, you can see how Ultimate’s NLU engine performed on both smaller and larger data sets and how it stacked against other AI engines. 

In the table below you can see how Ultimate’s NLU engine performed on both smaller and larger data sets and how it stacked against other engines.

5August2022_InTextImage_848x748_1_2x

19% reduction in error rate, while narrowing the F1 error by 17%

5August2022_InTextImage_848x748_2_2x

As you’ll see from the tables, it’s clear that Ultimate's NLU engine outperformed all other systems examined. 

What should you do with this information?

A reliable AI model is indispensable for creating better and faster customer support with automation – but it’s not the only piece of the puzzle. Having the right team in place to wield conversational AI for maximum value is just as important. Check out our overview of conversational AI here.

Want to know how NLU-powered automation can help you scale?