I am using microsoft-cognitiveservices-speech-sdk (1.38.0) to do real-time speech-to-text.
The offset is correct when I send the full audio in one piece, but it is wrong when I send the same audio cut into many chunks.
The more chunks there are, the more inaccurate the offset becomes:
- No chunks: 1,726,300,000
- 369 chunks of 0.5 seconds: 1,729,600,000
- 923 chunks of 0.2 seconds: 1,744,600,000
- 1443 chunks of 0.1 seconds: 1,757,900,000
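For reference, the Speech SDK reports offsets in ticks of 100 nanoseconds, so the figures above can be converted to seconds to make the drift easier to see (the tick unit is from the SDK documentation; the numbers are the ones I measured above):

```javascript
// Speech SDK offsets are expressed in 100-nanosecond ticks.
const TICKS_PER_SECOND = 10_000_000;
const toSeconds = (ticks) => ticks / TICKS_PER_SECOND;

// Measured offsets from the list above:
console.log(toSeconds(1_726_300_000)); // whole file  -> 172.63 s
console.log(toSeconds(1_757_900_000)); // 1443 chunks -> 175.79 s
```

So the smallest chunk size shifts the final offset by a bit more than 3 seconds.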
To reproduce, here is a piece of code:

const fs = require('fs');
const { SpeechConfig, AudioConfig, AudioInputStream, SpeechRecognizer } = require('microsoft-cognitiveservices-speech-sdk');

const speechConfig = SpeechConfig.fromSubscription(<KEY>, <REGION>);
const pushStream = AudioInputStream.createPushStream();
const audioConfig = AudioConfig.fromStreamInput(pushStream);
const speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);
speechRecognizer.recognized = async (recognizer, event) => { console.log(event); };
speechRecognizer.canceled = async (recognizer, event) => { console.log(event); };
speechRecognizer.startContinuousRecognitionAsync();
for (let i = 1; i <= 1443; i++) {
  const formattedNumber = i.toString().padStart(4, '0');
  const buffer = fs.readFileSync(`/var/tmp/chunks/output_${formattedNumber}.wav`);
  pushStream.write(buffer);
}
pushStream.close(); // signal end of audio
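One thing worth ruling out (a guess on my part, not confirmed): ffmpeg's `-f segment` writes every chunk as a complete WAV file, so each chunk starts with its own RIFF header (44 bytes or more). `pushStream.write(buffer)` pushes those header bytes into the stream as if they were audio, and if the recognizer counts them as audio time, the offset would grow with the number of chunks. Below is a minimal sketch of a helper (hypothetical, not part of the SDK) that skips past the container header to the `data` payload:

```javascript
// Hypothetical helper: return only the PCM payload of a WAV buffer, so that
// per-chunk RIFF headers are not pushed into the stream as audio bytes.
function stripWavHeader(buffer) {
  // A WAV file starts with "RIFF" <size> "WAVE", followed by sub-chunks.
  if (buffer.length < 12 || buffer.toString('ascii', 0, 4) !== 'RIFF') {
    return buffer; // not a WAV container, pass through unchanged
  }
  let offset = 12; // skip the 12-byte RIFF/WAVE preamble
  while (offset + 8 <= buffer.length) {
    const chunkId = buffer.toString('ascii', offset, offset + 4);
    const chunkSize = buffer.readUInt32LE(offset + 4);
    if (chunkId === 'data') {
      // Payload starts right after the 8-byte chunk header.
      return buffer.subarray(offset + 8, offset + 8 + chunkSize);
    }
    offset += 8 + chunkSize; // jump to the next sub-chunk (e.g. past "fmt ")
  }
  return buffer;
}
```

In the loop above, one could then call `pushStream.write(stripWavHeader(buffer))` and check whether the offsets stop drifting with the chunk count.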
To create the audio chunks:
ffmpeg -i <INPUT_FILE> -f segment -segment_time 0.1 -c copy output_%04d.wav
Here is the audio link : https://drive.google.com/file/d/1H_RJuqMiBaVkpo9XHrgp1bpuFdgQl64O/view?usp=sharing
Thanks for your help