Technology behemoths like Google and Facebook have gotten us used to, even fatigued by, their never-ending string of impressive announcements of progress in AI. Nevertheless, when Google announced that it had built a “conversational agent that can chat about… anything,” even the most jaded among us had to pay attention.
Since I work in the field, helping organizations build conversational solutions, I was particularly intrigued. One of the biggest challenges for bots is to handle the practically infinite range of phrases that a user might say and respond appropriately. A bot that can chat about anything seems like just the thing we need to solve this challenge. So the question becomes: exactly what impact will Google’s new bot, called Meena, have on organizations looking to deploy conversational AI applications? Have we found the holy grail? Will our bots finally stop saying “I’m sorry, I didn’t quite understand that”? Well, the short answer is no, we are not quite there yet. Nevertheless, Meena is incredibly impressive and represents a fascinating attempt to solve the problem. In the next few paragraphs, I will summarize what Google did and how this might impact conversational AI in the days, months, and years to come.
What is Meena?
Let’s start by analyzing what we are dealing with here. What did Google invent?
Well, Meena is a 2.6 billion parameter end-to-end trained neural conversational model. The best version of Meena, according to Google, was trained over 30 days using 2,048 TPU cores (the tensor processing unit, or TPU, is Google’s dedicated AI chip) on a dataset of 40 billion words. Not just random words. Google mined public domain social media for “multi-turn conversations,” where a turn is a single message from one participant in a conversation. So Google went out and got our conversations, 40 billion words’ worth of them, and trained a neural net to reply by showing it up to seven previous turns of a conversation as the input. By any measure, Meena is vast. Even if Google released all its code, which it hasn’t, only a few organizations would have the resources to train a Meena-like model. That is the first thing to understand. Meena is very much still in the lab and is very complex to manage. You cannot incorporate it into a tool just yet, and it is unlikely Google will make it available as a service soon. So in the short-to-medium term, our bots will have to survive without Meena’s help, I am afraid.
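To make the training setup a little more concrete, here is a minimal Python sketch of how a multi-turn conversation could be sliced into (context, response) examples with at most seven turns of context. It treats each message as one turn, which is an assumption about the setup. The function name make_training_pairs and the sample conversation are hypothetical and invented for illustration; this is not Google's code or data.

```python
from typing import List, Tuple

# Context window size mentioned in the article (seven turns).
MAX_CONTEXT_TURNS = 7


def make_training_pairs(
    conversation: List[str], max_context: int = MAX_CONTEXT_TURNS
) -> List[Tuple[List[str], str]]:
    """Slice one conversation into (context, response) examples.

    Every message after the first becomes a response; the messages
    immediately before it (at most `max_context`) become its context.
    Hypothetical helper, for illustration only.
    """
    pairs = []
    for i in range(1, len(conversation)):
        context = conversation[max(0, i - max_context):i]
        pairs.append((context, conversation[i]))
    return pairs


# Made-up conversation, not taken from any real dataset.
conversation = [
    "I really love spy films!",
    "Amazing. I like all the Mission Impossible spy films. Which is your favorite?",
    "Probably the first one.",
]

for context, response in make_training_pairs(conversation):
    print(context, "->", response)
```

A three-message conversation yields two examples: the first message alone as context for the second, and the first two messages as context for the third. In a longer conversation, only the seven most recent messages are kept as context.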
What can Meena do?
Meena is not immediately available, as is typically the case with cutting-edge research, which takes time to make its way into actual products. But what does this research tell us about what direction our products might take in the medium-to-long term? Will we really be able to chat about anything, and is Meena really the best chatbot out there? Here is where things get interesting.
First, to claim that Meena outperforms other chatbots, we would need some evaluation criteria. Google introduced a new metric to help with this called the “Sensibleness and Specificity Average” or SSA. The innovation this score introduces is that it measures both whether a bot’s answer is sensible, i.e. what a human would reasonably rate as an appropriate response, and whether the response is specific. Typically, bots employ “tricks” to make you think they are keeping up with the conversation, when in practice they are just giving generic responses that are not necessarily specific to the context. For example, you could have a conversation such as:
Human: I really love spy films!
Bot: Amazing. Tell me more!
This is a sensible response, but it is not specific. The same answer works for any number of statements from a person. A more specific reply would be:
Human: I really love spy films!
Bot: Amazing. I like all the Mission Impossible spy films. Which is your favorite?
Google scored Meena and a group of other chatbots (XiaoIce, Mitsuku, Cleverbot, and DialoGPT) on sensibleness and specificity, and the average of the two gave the final SSA score. There is some nuance in the numbers as Google describes them, but roughly:
Meena scored 79% SSA, Mitsuku and Cleverbot 56% each, DialoGPT 48%, and XiaoIce 31%. Because the metric can be used to evaluate human conversations as well, Google also measured the average human SSA at 86%, so Meena gets tantalizingly close to that.
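As a rough illustration of how an SSA-style score comes together, the Python sketch below averages two rates: the share of responses that raters marked sensible and the share they marked specific. The names Rating and ssa are hypothetical, the ratings are made up, and the rule that a response only counts as specific if it is also sensible is an assumption about the labeling protocol rather than a reproduction of Google's evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rating:
    """One human judgment of a single bot response (hypothetical type)."""
    sensible: bool
    specific: bool


def ssa(ratings: List[Rating]) -> float:
    """Average of the sensibleness rate and the specificity rate."""
    if not ratings:
        raise ValueError("need at least one rating")
    n = len(ratings)
    sensibleness = sum(r.sensible for r in ratings) / n
    # Assumption: a response that is not sensible cannot count as specific.
    specificity = sum(r.sensible and r.specific for r in ratings) / n
    return (sensibleness + specificity) / 2


# Made-up ratings for four bot responses.
ratings = [
    Rating(sensible=True, specific=False),   # "Amazing. Tell me more!"
    Rating(sensible=True, specific=True),    # the Mission Impossible reply
    Rating(sensible=True, specific=False),   # another generic reply
    Rating(sensible=False, specific=False),  # a reply that makes no sense
]

print(f"SSA: {ssa(ratings):.0%}")  # prints "SSA: 50%"
```

Here three of four responses are sensible (75%) and one of four is specific (25%), so the score is 50%. A bot that leans on generic replies can score well on sensibleness but is pulled down by specificity, which is the behavior the metric is designed to expose.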
To summarize, based on Google’s own scoring approach that directly measures whether the responses of the bot are both sensible and specific in conversations of up to seven turns, Meena scores higher than the other chatbots out there. To give some context, Mitsuku is a multiple-time winner of the Loebner Prize, a Turing Test competition, and XiaoIce powers an immensely popular Microsoft service that converses with hundreds of millions of users. Even though one can easily find weaknesses in the scoring approach and question the objectivity of Google using a metric it came up with itself, what Meena did is impressive. Even more so when we consider that Meena is an end-to-end trained neural net model while Mitsuku and XiaoIce are hybrid systems with much more human intervention.
What is the impact?
Meena can chat, over a few turns of a conversation, believably. Meena, however, cannot reliably teach you anything. Meena is not specifically trying to help you finish a task or learn something new. It converses with no explicit goal or purpose. While we probably spend too much of our time chatting about nothing of much importance, we tend to be looking for something specific when interacting with a bot-powered digital service. We want to get a ticket booked or a customer support issue resolved. Or we want accurate information about a particular domain, or emotional or psychological support for a challenge we are facing.
Conversational products have a purpose, and even if they fail at the more open-ended questions, they are trying to work with you to complete a task. Meena places the human-likeness of the conversation above all. However, there is much for us to learn about which conversational approach is appropriate for different types of tasks. There is research that shows that more “robot-like” responses are preferable in certain situations (especially where sensitive personal information is involved) and that being human-like is not the be-all and end-all of bots. Where does Meena, with the conversations it has learned from social media interactions, find a role? And if it is plugged into a conversational experience, how do we guarantee that inappropriate things are not said? Are the millions of public domain social media conversations the right dataset for the best chatbot in the world?
The bottom line
Meena is a fantastic contribution to the chatbot space. It is hard to capture the scale of what Google has achieved here. But we need to be careful about how we communicate the results of that research. Descriptions such as “the bot that can chat about anything” or “the best chatbot” are not necessarily useful. They distract from what is really important about this research: defining human-like conversation and exploring what role or importance there is in the chatbot world for that type of conversation. As more and more conversational AI solutions enter our daily lives, we need to focus on what is most valuable for us as humans. Meena moves us closer to that goal but doesn’t quite get us there yet.