An embedded end-to-end voice assistant
Lazzaroni L.; Bellotti F.; Berta R.
2024-01-01
Abstract
Voice assistants are spreading across environments such as homes and cars, making it possible to control heterogeneous Internet of Things devices with simple voice commands. However, heavy reliance on a cloud connection for speech processing requires a fast and robust Internet connection and raises privacy concerns. We therefore propose an end-to-end solution that works entirely offline, based on a system architecture that combines different Deep Learning models to implement every step of the speech processing pipeline. To target the Italian language, we exploited the transfer learning paradigm, which leverages models trained in English on large datasets and fine-tunes them to the target language on a smaller dataset. The proposed system architecture is configurable and easily extensible to other languages. Experimental results in an automotive use case show that our solution outperforms the other embedded models and achieves performance comparable to state-of-the-art cloud-connected solutions for Automatic Speech Recognition. Moreover, overall latency is significantly reduced by eliminating the need to connect to the cloud.