Building a Voice-Controlled Front End to IoT Devices

Apple, Google and Amazon are taking voice control to the next level. But can voice control be a DIY project? Turns out, it can. And, it isn't as hard as you might think.

Siri, Alexa and Google Home can all translate voice commands into basic activities, especially if those activities involve nothing more than sharing digital files like music and movies. Integration with home automation is also possible, though perhaps not as simply as users might desire—at least, not yet.

Still, the idea of converting voice commands into actions is intriguing to the maker world. The offerings from the big three seem like magic in a box, but we all know it's just software and hardware. No magic here. If that's the case, one might ask how anyone could build magic boxes?

It turns out that, using only one online API and a number of freely available libraries, the process is not as complex as it might seem. This article covers the Jarvis project, a Java application for capturing audio, translating to text, extracting and executing commands and vocally responding to the user. It also explores the programming issues related to integrating these components for programmed results. That means there is no machine learning or neural networks involved. The end goal is to have a selection of key words cause a specific method to be called to perform an action.

APIs and Messaging

Jarvis started life several years ago as an experiment to see if voice control was possible in a DIY project. The first step was to determine what open-source support already existed. A couple weeks of research uncovered a number of possible projects in a variety of languages. This research is documented in a text document included in the docs/notes.txt file in the source repository. The final choice of a programming language was based on the selection of both a speech-to-text API and a natural language processor library.

Since Jarvis was experimental (it has since graduated to a tool in the IronMan project), it started with a requirement that it be as easy as possible to get working. Audio acquisition in Java is very straightforward and a bit simpler to use than in C or other languages. More important, once audio is collected, an API for converting it to text would be needed. The easiest API found for this was Google's Cloud Speech REST API. Since both audio collection and REST interfaces are fairly easy to handle in Java, it seemed that would be the likely choice of programming language for the project.