Playing with VOSK in Vector Wirepod
How to play around with audio processing in the Vector robot
We have been discussing Vector Wirepod in the last few articles, including a summary of the Knowledge Graph implementation in Wirepod and a recent overview of Wirepod. One of the unique contributions of Wirepod is that it implements the chipper server, which is responsible for audio processing and speech recognition. This article discusses how chipper uses the Vosk speech recognition toolkit, and the options you have for experimenting with other speech recognition modules.
Audio Processing in Vector Robot
First, let’s review the basics of how Vector processes audio. As we all know, audio is captured by 4 microphones in the 4 corners of Vector’s head. Anything you say after the wake word “Hey Vector” is sent off the robot to a server for speech analysis. (Please note that recognition of “Hey Vector” is implemented in Vector’s firmware, and therefore cannot be changed.)
The server responsible for handling the speech analysis is called chipper. In the original implementation from Anki, chipper ran in the Amazon Web Services (AWS) cloud and interacted with several services, such as Amazon Lex for natural language processing, Houndify for answering questions via a knowledge bank, and IBM Weather for answering weather-related questions. The chipper server would then return text to the Vector robot, and Vector would use that text to understand the speaker’s request or answer their question. Details of this implementation can be found in Chapter 35 of Randall Maas’s Vector Bible.
The current owner of Vector, Digital Dream Labs, made the chipper server code open source, but that was mostly shell code and didn’t implement any logic for audio processing. Wirepod re-implements the chipper server code to include speech recognition modules, which can be selected and installed at setup time. One typical module is Vosk.
Automatic Speech Recognition (ASR) is a long-studied problem in computer science. Numerous techniques have been developed to solve it accurately, but the problem has always been complex because of the diversity of accents in spoken language. There is also a huge range of applications for ASR, because voice assistants are now a common part of our daily lives: think about phone assistants, digital assistants, etc. Vosk is a practical, enterprise-grade ASR toolkit. The good part about Vosk is that it is very easy to use, because it integrates with a large number of programming languages. The following article may be useful to understand more about Vosk.
Vosk in wirepod
When you start the wirepod server, you can see the details of which vosk model is being used. Here are the logs from my wirepod installation.
$ sudo chipper/start.sh
.....
Initiating vosk voice processor with language en-US
Opening VOSK model (../vosk/models/en-US/model)
Initializing VOSK recognizers
VOSK initiated successfully, running speed tests...
Running general recognizer test...
(General Recognizer) Transcribed text: how are you
General recognizer test completed, took 406.67ms
The VOSK model used is at:
vosk/models/en-US/model
If we look through this directory, the model in use is vosk-model-small-en-us-0.15, a lightweight wideband model intended for Android and Raspberry Pi platforms, about 40MB in size. That is quite small by speech recognition standards. This small model is likely the default because small platforms like the Raspberry Pi are the target installation for wirepod: a smaller model runs faster and consumes fewer computational resources.
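To get a feel for this model outside of wirepod, here is a minimal Python sketch. It is not how wirepod itself calls Vosk; it only assumes that the vosk Python package is installed and that you point it at a model directory like the one above. The helper names (model_size_mb, extract_text, transcribe_wav) and the WAV file name in the usage note are made up for illustration.

```python
import json
import wave
from pathlib import Path


def model_size_mb(model_dir):
    """Return the total size of all files under model_dir, in megabytes."""
    total = sum(p.stat().st_size for p in Path(model_dir).rglob("*") if p.is_file())
    return total / (1024 * 1024)


def extract_text(result_json):
    """Pull the transcript out of a Vosk result, a JSON string like '{"text": "..."}'."""
    return json.loads(result_json).get("text", "")


def transcribe_wav(wav_path, model_dir):
    """Transcribe a mono 16-bit PCM WAV file with the Vosk model in model_dir."""
    # Imported lazily so the two helpers above work without vosk installed.
    from vosk import Model, KaldiRecognizer

    model = Model(str(model_dir))
    parts = []
    with wave.open(str(wav_path), "rb") as wf:
        rec = KaldiRecognizer(model, wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if len(data) == 0:
                break
            # AcceptWaveform returns True when an utterance has been finalized.
            if rec.AcceptWaveform(data):
                parts.append(extract_text(rec.Result()))
        # Flush whatever audio is still buffered in the recognizer.
        parts.append(extract_text(rec.FinalResult()))
    return " ".join(p for p in parts if p)
```

With a recording saved as, say, hello.wav (a placeholder name), calling transcribe_wav("hello.wav", "vosk/models/en-US/model") should return the recognized text, and model_size_mb("vosk/models/en-US/model") gives a quick check of how much disk space the model takes. The recording should be mono 16-bit PCM, since that is the format Vosk's own examples expect.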