Proof of Concept for a Real-Time AI Voice Translation App That Successfully Attracted Investors
About Our Client
The Client is a UAE-based innovative startup developing AI-powered voice translation and video dubbing solutions.
Need for a Real-Time Voice Translation App with Voice Cloning Capabilities
The Client had an idea for a web app that would translate live speech and clone the speaker’s voice to generate matching audio output in the target language. To attract investors, the Client needed a demo version of the app. As the Client lacked the in-house resources to deliver the required functionality, it turned to ScienceSoft for AI consulting and development services.
App Conceptualization and AI Strategy Development
ScienceSoft assembled a team of one project manager and four AI developers for the project. The Client and ScienceSoft agreed to start with a proof of concept (PoC) and develop it into a full-scale app after the startup attracts investors.
While designing the app, ScienceSoft prioritized output quality and the app’s flexibility for future evolution. Though the obvious solution was to enable direct audio-to-audio translation, our team decided to include intermediate translation steps. The app would convert audio into text, translate the text, and transform the translation into speech while also cloning the speaker’s voice. The suggested approach has several benefits:
- The text translation AI models are more accurate than their audio translation counterparts.
- The text translation models cover more languages than audio models.
- Breaking down the translation process into several steps allows for more flexibility in future app development. With a modular architecture, it would be easier to add new languages or update and replace AI models.
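The modular idea behind these benefits can be illustrated with a minimal sketch. It is not ScienceSoft's actual code: the `createPipeline` helper and the three stub stages are hypothetical names introduced here, and the stubs only build strings where the real app would call AI services. The sketch shows how each step can stay behind a small, uniform interface so that a model can be swapped without touching the other steps.

```javascript
// Hypothetical helper: composes independent async stages into one pipeline.
// Each stage receives the accumulated context and returns the fields it adds.
function createPipeline(stages) {
  return async function run(input) {
    let context = { ...input };
    for (const stage of stages) {
      context = { ...context, ...(await stage(context)) };
    }
    return context;
  };
}

// Stub stages (assumptions for illustration; real stages would call AI services).
const transcribe = async ({ audio }) => ({ text: `transcript(${audio})` });
const translate = async ({ text, targetLanguage }) => ({
  translatedText: `${targetLanguage}:${text}`,
});
const synthesize = async ({ translatedText, voiceId }) => ({
  outputAudio: `speech[${voiceId}](${translatedText})`,
});

// Replacing a model means replacing one entry in this array.
const translateSpeech = createPipeline([transcribe, translate, synthesize]);
```

With this shape, adding a language or switching a provider would only affect the stage that owns that concern.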
Next, ScienceSoft chose technologies for each functional block. We compared popular market options against the following parameters: output quality, output speed, ease of integration, ease of customization, and service costs. Azure AI Speech Service demonstrated the best combination of these characteristics. For voice cloning, our experts selected AI Voice Generator by ElevenLabs because it offers more customization flexibility than Azure.
Implementing the voice cloning functionality was going to be the most costly and time-consuming part of the future app. To optimize the PoC timeline and budget, ScienceSoft suggested selecting five diverse voice samples and making them available in the app. When trying out the demo, users would be able to record their speech and choose one of the five voices for the translation. Once the startup attracted investors, the AI capabilities would be expanded to directly clone any speaker’s voice.
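A fixed set of preset voices could be modeled roughly as follows. This is a sketch under stated assumptions: the `PRESET_VOICES` catalog, its placeholder IDs and labels, and the `resolveVoice` helper are all hypothetical and do not reflect the voices or identifiers used in the PoC.

```javascript
// Hypothetical catalog of the five demo voices (placeholder IDs and labels).
const PRESET_VOICES = Object.freeze(
  Array.from({ length: 5 }, (_, index) => ({
    id: `voice-${index + 1}`,
    label: `Voice ${index + 1}`,
  }))
);

// Returns the preset matching the user's choice, or throws for an unknown ID.
function resolveVoice(requestedId, presets = PRESET_VOICES) {
  const voice = presets.find((candidate) => candidate.id === requestedId);
  if (!voice) {
    const known = presets.map((candidate) => candidate.id).join(", ");
    throw new RangeError(`Unknown voice "${requestedId}"; choose one of: ${known}`);
  }
  return voice;
}
```

Keeping the lookup behind one function means a later version could resolve a cloned voice for the speaker instead of a preset without changing the callers.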
AI Voice Translation App Providing Output in Under 3 Seconds
ScienceSoft developed a web-based PoC for the AI-powered voice translation app. The app is built in the Azure cloud and is powered by four AI services: Azure Speech to Text, Azure Text Translation, Azure Text to Speech, and ElevenLabs AI Voice Generator. The first two services are deployed in individual containers, and the last two share a container. The containers communicate via transcription, translation, and text-to-speech clients. To provide automated voice translation across more than 100 languages, the application performs the following steps:
Speech recording
In the web app, users define the source and target languages, choose the voice they want to clone, and record their speech. The app records user speech from a microphone or a headset in Opus format with the help of the browser's MediaRecorder API.
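For illustration, browser-side recording with MediaRecorder might look like the sketch below. It is an assumption-laden example, not the PoC's code: the `pickOpusMimeType` and `startRecording` helpers, the list of candidate MIME types, and the 250 ms chunk interval are all choices made here for the example.

```javascript
// Opus is usually delivered inside a WebM or Ogg container, depending on the browser.
const OPUS_MIME_CANDIDATES = ["audio/webm;codecs=opus", "audio/ogg;codecs=opus"];

// Returns the first candidate the environment supports, or null if none is.
function pickOpusMimeType(isSupported) {
  return OPUS_MIME_CANDIDATES.find((type) => isSupported(type)) ?? null;
}

// Browser-only: asks for microphone access and emits recorded chunks (Blobs).
async function startRecording(onChunk) {
  const mimeType = pickOpusMimeType((type) => MediaRecorder.isTypeSupported(type));
  if (!mimeType) throw new Error("No Opus recording format is supported by this browser");

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType });

  recorder.addEventListener("dataavailable", (event) => {
    if (event.data.size > 0) onChunk(event.data);
  });
  // Release the microphone once recording stops.
  recorder.addEventListener("stop", () => {
    stream.getTracks().forEach((track) => track.stop());
  });

  recorder.start(250); // assumed interval: deliver a chunk roughly every 250 ms
  return recorder;
}
```

A caller would keep the returned recorder and call `recorder.stop()` when the user finishes speaking; each chunk passed to `onChunk` could then be forwarded over a WebSocket.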
Speech transcription into text
The recording is sent to the transcription client via WebSocket. The client uses FFmpeg to convert the Opus recording to PCM format and passes it to the Azure Speech to Text container, where the audio is converted into text.
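The conversion step could be sketched in Node.js as follows. This is an illustrative example that assumes an `ffmpeg` binary is on the PATH and that the target is 16 kHz, 16-bit, mono PCM (a common input format for speech recognition); the helper names `buildFfmpegArgs` and `convertOpusToPcm` are hypothetical, and the source does not state the exact settings used in the PoC.

```javascript
// Builds FFmpeg arguments that read encoded audio from stdin and write raw
// signed 16-bit little-endian PCM to stdout.
function buildFfmpegArgs({ sampleRate = 16000, channels = 1 } = {}) {
  return [
    "-hide_banner", "-loglevel", "error",
    "-i", "pipe:0",          // input: recording piped via stdin
    "-f", "s16le",           // output container: raw PCM samples
    "-acodec", "pcm_s16le",  // output codec: 16-bit little-endian PCM
    "-ar", String(sampleRate),
    "-ac", String(channels),
    "pipe:1",                // output: stdout
  ];
}

// Pipes one Opus recording (a Buffer) through FFmpeg and resolves with PCM bytes.
async function convertOpusToPcm(opusBuffer, options) {
  // Dynamic import keeps this snippet loadable as CommonJS or as an ES module.
  const { spawn } = await import("node:child_process");
  return new Promise((resolve, reject) => {
    const ffmpeg = spawn("ffmpeg", buildFfmpegArgs(options), {
      stdio: ["pipe", "pipe", "inherit"],
    });
    const chunks = [];
    ffmpeg.stdout.on("data", (chunk) => chunks.push(chunk));
    ffmpeg.on("error", reject); // e.g. the ffmpeg binary is not installed
    ffmpeg.stdin.on("error", reject);
    ffmpeg.on("close", (code) => {
      if (code === 0) resolve(Buffer.concat(chunks));
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
    ffmpeg.stdin.end(opusBuffer);
  });
}
```

Piping through stdin and stdout avoids temporary files, which helps keep the added latency of this step small.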
Text translation
The transcribed text is passed to the Azure Text Translation container, where it is translated into the target language and then forwarded to the text-to-speech container.
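A translation client for this step might resemble the sketch below, based on the Translator REST API v3.0 request shape (`/translate` with `api-version`, `from`, and `to` query parameters and a JSON array body). The base URL of the container, the helper names, and the absence of authentication headers are assumptions for this example; a cloud-hosted endpoint would additionally require a subscription key header.

```javascript
// Builds the URL and fetch options for a single-text translation request.
// Assumes baseUrl points at the service root, e.g. a locally exposed container.
function buildTranslateRequest(baseUrl, { text, from, to }) {
  const url = new URL("/translate", baseUrl);
  url.searchParams.set("api-version", "3.0");
  url.searchParams.set("from", from);
  url.searchParams.set("to", to);
  return {
    url: url.toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify([{ Text: text }]),
    },
  };
}

// Pulls the first translation out of the v3.0 response array.
function extractTranslation(responseBody) {
  const translated = responseBody?.[0]?.translations?.[0]?.text;
  if (typeof translated !== "string") {
    throw new Error("Unexpected translation response shape");
  }
  return translated;
}

// Sends the request with the global fetch (available in browsers and Node 18+).
async function translateText(baseUrl, params) {
  const { url, init } = buildTranslateRequest(baseUrl, params);
  const response = await fetch(url, init);
  if (!response.ok) throw new Error(`Translation failed with HTTP ${response.status}`);
  return extractTranslation(await response.json());
}
```

Separating request building and response parsing from the network call makes both easy to check without a running container.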
Text-to-speech transformation and voice cloning
The Azure Text to Speech service transforms the translated text into audio. The output mimics the voice of the chosen speaker with the help of ElevenLabs AI Voice Generator (in the full-scale version, the app will clone the voice of the person who recorded the audio). The translated audio is sent to the UI via WebSocket. Users can play, pause, stop, and replay the recording. The app takes no more than 3 seconds to translate 1–2 sentences.
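One way to keep an eye on the 3-second target is to time each stage of the pipeline, as in the sketch below. The `runWithTimings` helper and the `LATENCY_BUDGET_MS` constant are hypothetical names introduced for this example, and the source does not describe how latency was actually measured in the PoC.

```javascript
// Assumed budget taken from the stated target of under 3 seconds per request.
const LATENCY_BUDGET_MS = 3000;

// Runs named async stages in order, feeding each stage's result to the next,
// and records how long every stage and the whole run took.
async function runWithTimings(stages, input) {
  const timings = {};
  let value = input;
  const startedAt = performance.now();
  for (const [name, stage] of Object.entries(stages)) {
    const stageStartedAt = performance.now();
    value = await stage(value);
    timings[name] = performance.now() - stageStartedAt;
  }
  const totalMs = performance.now() - startedAt;
  return { output: value, timings, totalMs, withinBudget: totalMs <= LATENCY_BUDGET_MS };
}
```

Per-stage timings make it clear which step (transcription, translation, or speech synthesis) to optimize first if a request exceeds the budget.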
AI Voice Translation App PoC Successfully Attracted Investors
Within just 8 weeks, the Client received a PoC of an AI-powered web application that enables real-time audio translation across more than 100 languages. The app provides output in under 3 seconds and is able to clone five preselected voices. Using the PoC as the app demo, the Client was able to attract investors for further product development. As of February 2024, the Client and ScienceSoft are collaborating on evolving the application into a full-scale product with comprehensive voice cloning capabilities and mobile versions.
Technologies and Tools
JavaScript, Azure Cloud, Azure Speech to Text, Azure Text Translation, Azure Text to Speech, ElevenLabs AI Voice Generator.