In JavaScript, I’m attempting to stream audio, created in the Audio16Khz32KBitRateMonoMp3 format by the SpeechSynthesizer from “microsoft-cognitiveservices-speech-sdk”, via Express to a React frontend app. The first couple of sentences sound fine, but after that the speech is very distorted.
Here is the code that sends the audio:
synthesizer.synthesizing = function (s, e) {
  currentAudioChunk = {
    audio: Buffer.from(e.result.audioData),
    offset: e.result.audioDuration / 10000, // Convert from 100 ns ticks to milliseconds
  };
  sendEvent("audioData", {
    audio: currentAudioChunk.audio.toString("base64"),
    // This audioOffset data is null, so I'm sending a placeholder for now
    audioOffset: "0",
  });
  currentAudioChunk = null;
};
When the audio is good the string sent looks like this:
“//NIxCElU/5IAY+IAYJ/2sMoSxHk0RH9BVzA0GTFmDQIAQ4Xv8iBn5oEqAUwtPDlBnyDh9f+ukXFoLdRAhSgoYRmQwQAIIJ0DVH/+eez/FwFkR+DcgYghGI/ImQQ0kU/////LRuTiDFw8fLRmk9Bv////v/+hTTN0Cuzk4TZ8qGCFf6UeMhhgB8BjVMd/t5V//NIxA4g2tqkAc9YAD7MqNv/AVisORkvL3sJA7Fje5e+5N7r/ZNxXO9991vl98n9A+bqGiY3myJPJAOEDjHGJfRMWtAsfGoO9ejd0GhoOy32zfEvfHvZL4vfvTvfV5xO2TeymMriv6//j+v5/j2MNzc+w4oEElx7v6u/2t/UcW0Tg+oDGEGSAs30kiaCvEsa//NIxA0f6tbFtGsQsB/wSwiDZ+tZaEobTf5rXCqf/XqycoAzWXtvSKlDpriVZX/mmfM+YFhrAo1ookTMDgiGJRKHii37Q6Eh+fFkyMUXF4c2/fTr+0rpvsyXQevBDueX8PPz/PPX88VXXxySFgZQJQqJRW71f/Ss+4XRpRipAO1XGSKjcylABX/VcJmGEBBG”
but when the distortion starts there are lots of repeated letters that appear to be junk, like this:
“//NIxHwAAANIAAAAAFVVVVVVVVVMQU1FMy4xMDBVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV//NIxHwAAANIAAAAAFVVVVVVVVVMQU1FMy4xMDBVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV//NIxHwAAANIAAAAAFVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV”
The repetition is already present in the raw data received from Azure; it is not an artefact of the conversion to a string.
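One way to check what the repeated characters actually are is to decode the start of the distorted string and read it as an MPEG audio frame header. The following is a minimal diagnostic sketch, not part of my original code: `sample` is just the first 44 characters of the string above, and the lookup tables only cover MPEG-2 Layer III, which is an assumption about what this format produces.

```javascript
// Diagnostic sketch: decode the start of the "junk" chunk and read its
// MPEG audio frame header. `sample` is copied from the distorted string above.
const sample = "//NIxHwAAANIAAAAAFVVVVVVVVVMQU1FMy4xMDBVVVVV";
const bytes = Buffer.from(sample, "base64");

// Lookup tables for MPEG-2 Layer III only (assumed to be the relevant case).
const MPEG2_L3_BITRATES_KBPS = [0, 8, 16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160];
const MPEG2_SAMPLE_RATES_HZ = [22050, 24000, 16000];

// The header is the first 4 bytes, read as one big-endian 32-bit value.
const header = bytes.readUInt32BE(0);
const syncOk = (header >>> 21) === 0x7ff;      // 11 sync bits, all set
const versionBits = (header >>> 19) & 0x3;     // 0b10 means MPEG-2
const layerBits = (header >>> 17) & 0x3;       // 0b01 means Layer III
const bitrateIndex = (header >>> 12) & 0xf;
const sampleRateIndex = (header >>> 10) & 0x3;
const paddingBit = (header >>> 9) & 0x1;
const channelMode = (header >>> 6) & 0x3;      // 0b11 means mono

const bitrateKbps = MPEG2_L3_BITRATES_KBPS[bitrateIndex];
const sampleRateHz = MPEG2_SAMPLE_RATES_HZ[sampleRateIndex];

// Frame length in bytes for MPEG-2 Layer III.
const frameBytes = Math.floor((72 * bitrateKbps * 1000) / sampleRateHz) + paddingBit;

// Look for an encoder tag embedded in the frame body.
const tagStart = bytes.indexOf("LAME");
const tag = tagStart >= 0 ? bytes.toString("latin1", tagStart, tagStart + 9) : null;

console.log("hex:", bytes.toString("hex"));
console.log({ syncOk, versionBits, layerBits, bitrateKbps, sampleRateHz, channelMode, frameBytes, tag });
```

If the header fields match the requested format (32 kbit/s, 16 kHz, mono), that would suggest the runs of V are 0x55 filler bytes inside otherwise well-formed frames rather than corrupted data, although it does not by itself explain the distortion.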
How can I get clean audio from Azure TTS?
I tried stripping out the Vs but that just corrupted the data entirely.
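A possible reason the stripping made things worse, sketched here with a synthetic frame rather than real Azure output: if each frame has a fixed length implied by its header (assumed to be 144 bytes for this format), removing characters from the base64 text shortens the decoded frame, so everything after it no longer lines up.

```javascript
// Sketch with a synthetic frame (an assumption, not real Azure output):
// a 4-byte MPEG header followed by 0x55 filler, 144 bytes in total.
const FRAME_BYTES = 144;
const frame = Buffer.alloc(FRAME_BYTES, 0x55);
Buffer.from([0xff, 0xf3, 0x48, 0xc4]).copy(frame, 0);

// Encode as it would be sent, then strip the Vs as I tried to do.
const encoded = frame.toString("base64");
const stripped = Buffer.from(encoded.replace(/V/g, ""), "base64");

console.log("base64 starts with:", encoded.slice(0, 12));
console.log("original frame bytes:", frame.length);
console.log("bytes after stripping Vs:", stripped.length);
```

In this sketch the stripped buffer is far shorter than the original frame, which is consistent with the data becoming unplayable once the Vs are removed.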