I am recording voice on the client side using MediaRecorder and sending the resulting Blobs of WebM/Opus bytes to the server over a WebSocket, with this code:
<script type="text/javascript">
  navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
    const mediaRecorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });
    const socket = io();
    // Forward each recorded chunk to the server as it becomes available.
    mediaRecorder.addEventListener("dataavailable", (event) => {
      socket.emit("message", event.data);
    });
    // Ask MediaRecorder to emit a dataavailable event roughly every 100 ms.
    const latency_ms = 100;
    mediaRecorder.start(latency_ms);
  });
</script>
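(If it matters: my understanding from the MediaRecorder documentation is that when a timeslice is passed to start(), only the first dataavailable Blob contains the WebM container header; the later Blobs are continuations of the same stream and are not playable files on their own, only their concatenation is.)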
I am wondering how I should process the received data on the server side so that I can feed it to Whisper for transcription. I am using Flask (with Flask-SocketIO), and the code looks something like this:
import torch
from flask import Flask
from flask_socketio import SocketIO
from transformers import pipeline

app = Flask(__name__)
socket = SocketIO(app)

# Load Whisper once at startup rather than per message.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    chunk_length_s=30,
    device=device,
)

@socket.on("message")
def message(audio):
    # audio is the raw bytes of the Blob emitted by the browser
    prediction = pipe(audio, batch_size=8)["text"]
    print(prediction)
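For what it is worth, the direction I have been experimenting with is to keep appending the raw bytes to a buffer and re-decode the whole buffer to 16 kHz mono PCM with ffmpeg before handing it to the pipeline, i.e. replacing the handler above with something like the sketch below. This is only a guess on my part: the webm_to_float32 helper is my own, and I am assuming a single client, an ffmpeg binary on the PATH, and an arbitrary 2-second threshold before transcribing.

import subprocess

import numpy as np

audio_buffer = bytearray()  # accumulated WebM/Opus bytes (single client assumed)

def webm_to_float32(webm_bytes, sampling_rate=16000):
    # Decode a complete WebM/Opus byte stream to mono float32 PCM using ffmpeg.
    proc = subprocess.run(
        ["ffmpeg", "-i", "pipe:0", "-f", "f32le", "-ac", "1",
         "-ar", str(sampling_rate), "pipe:1"],
        input=webm_bytes,
        capture_output=True,
        check=True,
    )
    return np.frombuffer(proc.stdout, dtype=np.float32)

@socket.on("message")
def message(chunk):
    # Chunks after the first are not standalone files, so decode the
    # concatenation of everything received so far.
    audio_buffer.extend(chunk)
    audio = webm_to_float32(bytes(audio_buffer))
    if len(audio) >= 2 * 16000:  # arbitrary: wait for ~2 s of audio
        prediction = pipe({"raw": audio, "sampling_rate": 16000}, batch_size=8)["text"]
        print(prediction)

Re-decoding the entire buffer on every 100 ms message is obviously quadratic in the length of the recording, so I suspect this is not the intended way to do it. Is there a better way to turn these chunks into something Whisper can consume?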