Playing with VOSK in Vector Wirepod
How to play around with audio processing in the Vector robot
Introduction
We have been discussing the Vector Wirepod in the last few articles, including a summary of the Knowledge Graph implementation in Wirepod and a recent overview of Wirepod. One of the unique contributions of Wirepod is that it implements the chipper server, which is responsible for audio processing and speech recognition. This article discusses how chipper uses the Vosk speech recognition toolkit, and the options you have to experiment with other speech recognition modules.
Audio Processing in Vector Robot
First, let’s revisit the basics of how Vector processes audio. Audio is captured by four microphones at the four corners of Vector’s head. Anything you say after the wake word “Hey Vector” is sent off the robot for speech analysis. (Note that recognition of “Hey Vector” itself is implemented in Vector’s firmware and therefore cannot be changed.)
The server responsible for handling this speech analysis is called chipper. In the original implementation from Anki, chipper ran in the Amazon Web Services (AWS) cloud and interacted with a number of services, such as Amazon Lex for natural language processing, Houndify for answering questions via a knowledge bank, and IBM Weather for weather-related questions. The chipper server would return text to the Vector robot, which Vector would then use to understand the speaker’s request or answer their question. Details of this implementation can be found in Chapter 35 of Randall Maas’s Vector Bible.
The current owner of Vector, Digital Dream Labs, made the chipper server code open source, but it was mostly shell code and didn’t implement any logic for audio processing. Wirepod re-implements the chipper server to include speech recognition modules, which can be chosen at setup time. One typical module is Vosk.
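The idea of a pluggable speech recognition module can be sketched language-agnostically. Below is a hypothetical Python sketch in the spirit of chipper’s design; wire-pod itself is written in Go, and the names here (`Recognizer`, `handle_utterance`, `CannedRecognizer`) are invented for illustration, not wire-pod’s actual API.

```python
from typing import Protocol


class Recognizer(Protocol):
    """Anything that can turn raw audio into a transcript."""
    def transcribe(self, pcm: bytes) -> str: ...


class CannedRecognizer:
    """Stand-in for a real backend such as Vosk: always returns one phrase."""
    def transcribe(self, pcm: bytes) -> str:
        return "how are you"


def handle_utterance(rec: Recognizer, pcm: bytes) -> str:
    # chipper-style flow: audio in, transcript out; intent matching
    # would happen downstream on the returned text.
    return rec.transcribe(pcm)


print(handle_utterance(CannedRecognizer(), b""))  # -> how are you
```

Because the server only depends on the `Recognizer` shape, a Vosk-backed module can be swapped in at setup time without touching the rest of the pipeline.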
Vosk
Automatic Speech Recognition (ASR) is a long-studied problem in computer science. Numerous techniques have been developed to solve it accurately, but the problem remains hard because of the diversity of accents in spoken language. There is also a huge range of applications for ASR, since voice assistants are now a common part of daily life: phone assistants, digital home assistants, and so on. Vosk is a practical, enterprise-grade ASR toolkit. A big plus of Vosk is that it is very easy to use, thanks to integrations with a large number of programming languages. The following article may be useful to understand more about Vosk.
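In Vosk’s language bindings (for example, the Python package’s `Model` and `KaldiRecognizer` classes), recognition results come back as JSON strings, with the final transcript under a `"text"` key (partial results use `"partial"`). A minimal stdlib-only sketch of pulling the transcript out of such a result; the helper name `transcript` is my own:

```python
import json


def transcript(result_json: str) -> str:
    """Extract the transcript from a Vosk-style final result JSON string."""
    return json.loads(result_json).get("text", "")


# Example shaped like what a Vosk recognizer's Result() yields:
print(transcript('{"text": "how are you"}'))  # -> how are you
```

This is the same shape you can see in wirepod’s speed-test log below, where the transcribed text for the canned test phrase is “how are you”.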
Vosk in wirepod
When you start the wirepod server, the logs show which Vosk model is being used. Here are the logs from my wirepod installation.
$ sudo chipper/start.sh
.....
Initiating vosk voice processor with language en-US
Opening VOSK model (../vosk/models/en-US/model)
Initializing VOSK recognizers
VOSK initiated successfully, running speed tests...
Running general recognizer test...
(General Recognizer) Transcribed text: how are you
General recognizer test completed, took 406.67ms
The VOSK model used is at: vosk/models/en-US/model
If we look through this directory, the model used is vosk-model-small-en-us-0.15, a lightweight wideband model aimed at Android and Raspberry Pi platforms, about 40 MB in size. This is quite small in the context of speech recognition. It is likely the default because small platforms such as the Raspberry Pi are the target installations for wirepod: a smaller model runs fast and consumes few computational resources.
Playing with other Vosk Models
But you can always swap in other models available from Vosk. Using a larger model is expected to increase the accuracy of speech-to-text conversion, but at the cost of higher latency. As an example, I downloaded the large English model vosk-model-en-us-0.22, which is 1.8 GB in size, and tried it by symlinking the model in wire-pod. My Vosk model directory in Wirepod now looks like:
/wire-pod/vosk/models/en-US$ ls -ltr
lrwxrwxrwx 1 amitabha amitabha 60 Nov 7 12:50 model -> /home/amitabha/allhome/data/Downloads/vosk-model-en-us-0.22/
where vosk-model-en-us-0.22 contains the unzipped model downloaded from the Vosk website.
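The swap amounts to re-pointing the `model` symlink at the new model directory (the shell equivalent would be `ln -sfn`). Here is a self-contained Python sketch of that step; the paths are illustrative stand-ins created in a temporary directory, where in a real install `model_dir` lives inside your wire-pod checkout and `new_model` is wherever you unzipped the download:

```python
import os
import tempfile

base = tempfile.mkdtemp()  # stands in for your filesystem; demo only
model_dir = os.path.join(base, "wire-pod", "vosk", "models", "en-US")
new_model = os.path.join(base, "vosk-model-en-us-0.22")
os.makedirs(model_dir)
os.makedirs(new_model)

link = os.path.join(model_dir, "model")
# Equivalent of `ln -sfn new_model model`: replace any existing link.
if os.path.islink(link):
    os.unlink(link)
os.symlink(new_model, link)

print(os.readlink(link))  # now points at the vosk-model-en-us-0.22 directory
```

After re-pointing the link, wire-pod picks up the new model on its next restart, since it always opens the path `vosk/models/en-US/model`.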
To use the new model, I just need to restart wire-pod. Note that the time it takes to process the test input “how are you” more than doubles, from about 407 ms to about 953 ms. On my computer, the bigger model thus incurs roughly 2.3x the latency.
Opening VOSK model (../vosk/models/en-US/model)
Initializing VOSK recognizers
VOSK initiated successfully, running speed tests...
Running general recognizer test...
(General Recognizer) Transcribed text: how are you
General recognizer test completed, took 953.221599ms
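Wire-pod’s speed test reports the elapsed time for a canned utterance; you can measure your own recognizer calls the same way with a wall-clock timer. A minimal helper (the name `time_call` and its structure are my own, not wire-pod’s):

```python
import time


def time_call(fn, *args):
    """Call fn(*args) once and return (result, elapsed milliseconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return out, elapsed_ms


# Usage: wrap any transcription call, e.g. a Vosk recognizer invocation.
text, ms = time_call(lambda: "how are you")
print(f"Transcribed {text!r} in {ms:.2f} ms")
```

Averaging several such timed calls over representative audio gives a fairer accuracy-versus-latency picture than a single test phrase.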
Other Languages?
VOSK supports 20 different languages, so if you are a non-English speaker, you can download a model for your language. You can then use a Large Language Model (LLM) in your language to back the knowledge bank, enabling Vector to answer questions in that language. Vector already has the ability to speak a few different languages. Our next posts will explore integrating Vector with non-English languages.
Conclusion
Wire-pod opens up many new possibilities for experimenting with the Vector robot, and playing with speech processing is one of them. If you have tried Wire-pod with a different Vosk model, please share your thoughts in the comments below.