As a linguist, I am fascinated by language. Communicating by combining separate sounds into larger parts, thereby creating something meaningful with nothing but your voice, is magical to me.
But we humans have our brains: phenomenal organs, mysterious and impressive. They enable us to do what we do and be who we are. Sure, you need your speech organs to talk, but the managing factor is your brain. That makes the interaction between computers and language perhaps even more mysterious. Computers are dumb. They do not have an impressive brain like ours. So how do we make them understand language?
A beautiful example of computer-language interaction is the virtual assistant. Every phone nowadays is equipped with a hidden secretary, ready to send emails, answer questions of the utmost importance (“Siri, do you know how to beatbox?”) or make a phone call. You don’t even need to open the corresponding app: these assistants can be addressed purely by speech. They usually need an activation trigger, like saying ‘hey Siri’ or double-tapping the home button. Right after you’ve ‘awoken’ them, they are at your service. You can verbally ask them whatever you need them to do. But how do these secretaries know what you mean?
There are three different stages in this process [1]. The first step is to break down what you’re telling your device. As you might know, the way we transcribe our language is kind of strange. You might’ve wondered how ‘I’ and ‘eye’ have the same pronunciation but a different spelling. This blog is not about the arbitrariness of our spelling system (thank God, I could go on for days…), but understanding the existence of phonemes is necessary to comprehend these virtual assistants.
Phonemes are abstract units of language. The initial consonants of ‘circle’ and ‘sir’ are written in different ways but sound the same. This is because they go back to the same phoneme. When your VA hears your question, the first thing it does is identify the different phonemes. These are formed into words. The problem presented before, regarding words that sound alike, is solved by trigram analysis [2]. This kind of analysis focuses on patterns that tend to occur often: “what am…” will more likely be followed by “I” than “eye”. These three-word clusters and their occurrence frequencies are crucial to understanding your question.
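To make that a little more concrete, here is a minimal sketch of the idea behind trigram disambiguation. The counts are entirely made up for illustration; a real system would harvest them from a huge corpus of text.

```python
from collections import Counter

# Hypothetical trigram counts, as if harvested from a large text corpus.
trigram_counts = Counter({
    ("what", "am", "I"): 1200,
    ("what", "am", "eye"): 2,
    ("my", "left", "eye"): 310,
    ("my", "left", "I"): 1,
})

def pick_word(prev_two, candidates):
    """Choose the candidate that forms the most frequent trigram
    with the two preceding words."""
    return max(candidates, key=lambda w: trigram_counts[(*prev_two, w)])

# 'I' and 'eye' sound identical; the trigram counts break the tie.
print(pick_word(("what", "am"), ["I", "eye"]))   # -> I
print(pick_word(("my", "left"), ["I", "eye"]))   # -> eye
```

Same sound, different words, and the surrounding two words are enough to pick the right one.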
Now that we have a transcription of your question, it needs an answer. That’s the second step: determining what the purpose of your question was. Software like IBM DeepQA [3] creates a different thread for every possible answer to your question. Every thread is assigned a value, based on various factors like relevance and reliability. The software has ‘learned’ which answers are probable through previous exposure (machine learning). Based on these assigned values, one thread is chosen and marked as the winner.
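The sketch below illustrates that “pick the winner” step in miniature. It is not DeepQA itself: the candidate answers, the two evidence scores and their weights are all invented, and in a real system the weights would be learned from training data rather than typed in by hand.

```python
# Toy candidate answers with hypothetical evidence scores.
candidates = [
    {"answer": "Paris",  "relevance": 0.92, "reliability": 0.88},
    {"answer": "Lyon",   "relevance": 0.40, "reliability": 0.75},
    {"answer": "France", "relevance": 0.55, "reliability": 0.60},
]

# In a real system these weights are learned from examples;
# here they are simply made up.
weights = {"relevance": 0.7, "reliability": 0.3}

def score(candidate):
    """Combine the evidence features into a single value for this thread."""
    return sum(weights[f] * candidate[f] for f in weights)

winner = max(candidates, key=score)
print(winner["answer"], round(score(winner), 2))  # -> Paris 0.91
```

Every thread gets a score, and the highest-scoring one is the answer you hear back.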
Once you have your answer, the only thing left is action. Some commands, like ‘call Tony’, require not only an answer but a specific operation. Specific word clusters or sentences trigger these actions. For example, if you want to order an Uber using Alexa, Alexa demands you phrase the command correctly: “Alexa, ask Uber to request a ride.” [4] These triggers result in opening a different app or performing a certain action.
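A bare-bones sketch of that last step might look like the following. The trigger phrases and the actions they fire are hypothetical, but the idea is the same: recognise a known word cluster in the transcribed command and hand it off to the right operation.

```python
def call_contact(name):
    print(f"Dialing {name}...")

def request_ride(service):
    print(f"Asking {service} to request a ride...")

# Each entry pairs a recognisable word cluster with an operation.
commands = {
    "call":                lambda text: call_contact(text.split("call ", 1)[1]),
    "ask uber to request": lambda text: request_ride("Uber"),
}

def handle(utterance):
    """Find the first trigger phrase in the utterance and run its action."""
    text = utterance.lower()
    for trigger, action in commands.items():
        if trigger in text:
            return action(text)
    print("Sorry, I don't know how to do that yet.")

handle("Call Tony")                          # -> Dialing tony...
handle("Alexa, ask Uber to request a ride")  # -> Asking Uber to request a ride...
```

That is also why Alexa is so picky about phrasing: if the trigger cluster isn’t there, nothing matches and nothing happens.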
Language is hard to understand, even for those of us with brains. Brainless as they are, computers have an even harder time.
Luckily they have us.
Sources:
[1] https://www.marshall.usc.edu/blog/how-do-digital-voice-assistants-eg-alexa-siri-work
[2] https://www.britannica.com/technology/speech-recognition
[3] https://www.techrepublic.com/article/ibm-watson-the-inside-story-of-how-the-jeopardy-winning-supercomputer-was-born-and-what-it-wants-to-do-next/
[4] https://www.lifewire.com/virtual-assistants-4138533