How to use grammar text editors for speech to text documents in JavaScript / NodeJS

I’m relatively new to programming (one year working as an intern, and finishing my degree) and might be biting off more than I can chew. This is also my first post here (yay), so let me explain the problem thoroughly:

I’m currently using the Google Speech-to-Text API to get transcripts of my interviews, which are conducted in English, Spanish and Portuguese.

The English transcription is perfect; I don’t need to update or fix anything. But the Spanish and Portuguese transcription lacks punctuation and speaker diarization/labeling, since those are not available for these languages in the Google Speech-to-Text API, and it contains some grammatical errors, such as repetition and tangled speakers’ words and utterances (especially when it comes to questions).

I’m using JavaScript and Node.js, and I’m not sure which packages, methods, APIs, or libraries I should use.
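For reference, this is a rough sketch of the request shape in which punctuation, word timestamps and diarization would be switched on when the language supports them. The field names follow the v1 `RecognitionConfig` as I understand it; `buildRecognizeRequest`, the bucket URI and the speaker count are placeholders I made up for illustration. Whether these options are actually honored for Spanish and Portuguese depends on the API version and its language support, which I have not verified.

```javascript
// Builds a request object shaped like a Speech-to-Text v1 recognize request.
// buildRecognizeRequest, gcsUri and speakerCount are hypothetical names.
function buildRecognizeRequest(gcsUri, languageCode, speakerCount = 2) {
  return {
    config: {
      languageCode, // e.g. 'en-US', 'es-ES', 'pt-BR'
      enableAutomaticPunctuation: true, // ask for commas, periods, question marks
      enableWordTimeOffsets: true, // ask for start/end time on every word
      diarizationConfig: {
        enableSpeakerDiarization: true, // ask for a speakerTag on every word
        minSpeakerCount: speakerCount,
        maxSpeakerCount: speakerCount,
      },
    },
    audio: { uri: gcsUri },
  };
}

// Placeholder bucket path, only to show the resulting shape.
const request = buildRecognizeRequest('gs://my-bucket/interview.wav', 'pt-BR');
console.log(JSON.stringify(request, null, 2));
```

The object is only built here, not sent, so it can be inspected without credentials or network access.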

So, to fix this, I landed on two concepts that I would like to integrate into my service:

Correctness – Eliminates grammar, spelling, and punctuation mistakes and ensures word choices sound natural and fluent.

Clarity – Makes every sentence concise and easy to follow and rewrites hard-to-read sentences.

All of this without losing the key words needed for qualitative/quantitative research.

I got some answers and suggestions, but I don’t know which one to use: implement an ABNF grammar, use a pre-trained NLP model for those languages, or use other paid services (which would generate more expenses and might not be great for my growth as a developer). These terms and concepts are completely new to me, and I got lost in the documentation trying to determine the best way to deal with this.

That being said:

How do I make use of ABNF / NLP?
Is there a way to do speaker labeling without access to the audio channels, for example using ABNF / EBNF, NLP, Document AI, or some kind of logical grammatical timing?
How do I standardize the text clustering/segmentation so that punctuation can be recognized, using grammar or audio timing offsets?
What packages would help me?
What other ways could/should I use to resolve this?
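To make the timing question more concrete, here is a minimal sketch of what I mean by using offsets for punctuation: close a sentence wherever the silence between two words is longer than a threshold. The word shape `{ word, start, end }` (times in seconds), the name `punctuateByPauses` and the 0.7 s threshold are all my own assumptions, not something the API gives me in this form.

```javascript
// Closes a sentence wherever the pause before the next word is long enough.
// Assumed input: [{ word: 'bom', start: 0.0, end: 0.3 }, ...] with times in seconds.
function punctuateByPauses(words, pauseSeconds = 0.7) {
  const sentences = [];
  let current = [];

  words.forEach((w, i) => {
    current.push(w.word);
    const next = words[i + 1];
    // The last word always closes the sentence in progress.
    const gap = next ? next.start - w.end : Infinity;

    if (gap >= pauseSeconds) {
      const text = current.join(' ');
      // Capitalize the first letter and end with a period.
      sentences.push(text.charAt(0).toUpperCase() + text.slice(1) + '.');
      current = [];
    }
  });

  return sentences;
}

const pauseSample = [
  { word: 'bom', start: 0.0, end: 0.3 },
  { word: 'dia', start: 0.35, end: 0.6 },
  { word: 'como', start: 1.6, end: 1.9 },
  { word: 'posso', start: 1.95, end: 2.2 },
  { word: 'ajudar', start: 2.25, end: 2.7 },
];
console.log(punctuateByPauses(pauseSample));
```

This only ever produces periods, so it cannot tell a question from a statement, and a speaker who pauses mid-sentence would be split incorrectly. That is why I am asking whether grammar-based or NLP-based approaches would do better.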

I know this sounds like overkill (and it probably is), but I got really interested; it would help me a lot at the company I’m working for and give me some experience in problem solving.

I’d be super grateful if anyone could help sort any of these problems out.

When the conversation/interview is between two people, with back-and-forth dialog (question and answer), this is not a problem at all; it is just a matter of:

const wordsInfo = result.alternatives[0].words;

wordsInfo.forEach(a =>
  console.log(` word: ${a.word}, speakerTag: ${a.speakerTag}`)
);

It is a solution, but not for all problems, and I’m still getting used to my colleagues’ architecture. The project has not been updated in a long time, and I’m probably using an outdated version of the Google Speech-to-Text API.

So I’ve found a way to get the result time and clustering using xlsx and docx:
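When `speakerTag` is available, the per-word loop could be turned into labeled lines roughly like this. It is only a sketch: `word` and `speakerTag` come from the snippet above, while `startSeconds` (a plain number), `toClock` and `groupBySpeaker` are names I am assuming for illustration.

```javascript
// Formats a number of seconds as hh:mm:ss.
function toClock(totalSeconds) {
  const pad = (n) => String(n).padStart(2, '0');
  const s = Math.trunc(totalSeconds);
  return `${pad(Math.trunc(s / 3600))}:${pad(Math.trunc((s % 3600) / 60))}:${pad(s % 60)}`;
}

// Merges consecutive words with the same speakerTag into one turn,
// then prints each turn with the start time of its first word.
function groupBySpeaker(words) {
  const turns = [];
  for (const { word, speakerTag, startSeconds } of words) {
    const last = turns[turns.length - 1];
    if (last && last.speakerTag === speakerTag) {
      last.words.push(word);
    } else {
      turns.push({ speakerTag, startSeconds, words: [word] });
    }
  }
  return turns.map(
    (t) => `[${toClock(t.startSeconds)}] [Speaker${t.speakerTag}]: ${t.words.join(' ')}`
  );
}

const sample = [
  { word: 'Good', speakerTag: 1, startSeconds: 16.0 },
  { word: 'morning,', speakerTag: 1, startSeconds: 16.4 },
  { word: 'Anita.', speakerTag: 1, startSeconds: 16.9 },
  { word: 'Good', speakerTag: 2, startSeconds: 19.0 },
  { word: 'morning,', speakerTag: 2, startSeconds: 19.3 },
  { word: 'Lucas.', speakerTag: 2, startSeconds: 19.8 },
];
console.log(groupBySpeaker(sample).join('\n'));
// [00:00:16] [Speaker1]: Good morning, Anita.
// [00:00:19] [Speaker2]: Good morning, Lucas.
```

This works only as well as the tags themselves, so it does not help for the languages where I get no `speakerTag` at all.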

formatResultTime(resultEndTime) {
    // resultEndTime arrives as a string in seconds, e.g. "16.500s"
    const totalSeconds = Math.trunc(parseFloat(resultEndTime.replace('s', '')));
    const pad = (value) => String(value).padStart(2, '0');
    const hour = pad(Math.trunc(totalSeconds / 3600));
    // Minutes are taken modulo one hour so they roll over correctly
    const minute = pad(Math.trunc((totalSeconds % 3600) / 60));
    const second = pad(totalSeconds % 60);

    return `${hour}:${minute}:${second}`;
  }

  getTxts(results) {
    const txts = results.reduce((acc, { alts, endTime }) => {
      // Collect every non-empty transcript of this result
      const txt = alts.reduce((paragraphs, alternative) => {
        if (alternative.transcript) {
          paragraphs.push(alternative.transcript);
        }
        return paragraphs;
      }, []);

      acc.push({
        text: txt.join(' '),
        endTime: this.formatResultTime(endTime),
      });

      return acc;
    }, []);
    return txts;
  }

But the company’s QA asks me to fix word errors/identification and to separate paragraphs and timestamps in a better way.
Here is an example of my output:


[00:00:16]  Good morning, Anita.
[00:00:19]  Good morning, Lucas.
[00:00:21]  How's it going?
[00:00:31]  I have a feeling that something is going to happen today.[Should break here] Why do you think that? [Should break here*] I don't know. It's just my gut feeling.
[00:00:34]  Don't worry, it's going to be okay.
[00:00:36]  I hope so.
[00:00:43]                   // [Empty time stamp]
[00:00:54]  Good morning. How may I help you? [Should break here] I'm a guest calling from room 703. My TV remote is not working.
[00:00:57]  Could you please describe your problem in detail?
[00:01:12]  I haven't been able to use the control since last night, every time I want to change the channel. I have to run back and forth and press the, but this makes me very upset. Please get someone to fix it right away.
[00:01:20]  I'm sorry for the inconvenience. I will send the technician up to you right away.  [Should break here] Alright, thank you.
[00:01:34]  Excuse me. Hello, sir. How may I help you? [Should break here] I'm a guest of room 615. My room is right next to an elevator.
[00:01:42]  Yes, I remember. Is there something wrong last night? [Should break here] I kept hearing loud. [ Incorrect punctuation ] Talking nearby.
[00:01:55]  Not only that, but the sound of the elevator moving is also annoying me. Then I don't understand what's wrong with you. Out to the staff kept moving furniture all day.
[00:02:02]  The noise of all these things is very disturbing to me, makes it impossible to sleep.
[00:02:04]  I am sorry to hear that.
[00:02:23]  But yesterday, the staff of the hotel did not move the furniture. Maybe they were just moving the luggage for guests.[Should break here] I don't need to know. I want to change the room immediately. [Should break here] I'm so sorry for the bad experience that you went through, but there are no rooms available now.
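As a first step toward those breaks, something like the following could drop the empty timestamps and put each sentence on its own line. It is a naive sketch that assumes entries shaped like `{ text, endTime }` (with `endTime` already formatted); `splitSentences` and `toLines` are hypothetical names. It splits on every `.`, `!` or `?`, so it over-splits (a decimal number such as "3.5" would be cut in two) and it knows nothing about who is speaking.

```javascript
// Splits a block of text after each run of '.', '!' or '?'.
// Trailing text without a terminator is kept as a final sentence.
function splitSentences(text) {
  const matches = text.match(/[^.!?]+(?:[.!?]+|$)/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s !== '');
}

// Drops entries with no text, then emits one line per sentence,
// repeating the timestamp of the entry the sentence came from.
function toLines(entries) {
  return entries
    .filter(({ text }) => text.trim() !== '')
    .flatMap(({ endTime, text }) =>
      splitSentences(text).map((sentence) => `[${endTime}]  ${sentence}`)
    );
}

const entries = [
  { endTime: '00:00:43', text: '' },
  {
    endTime: '00:00:54',
    text: "Good morning. How may I help you? I'm a guest calling from room 703. My TV remote is not working.",
  },
];
console.log(toLines(entries).join('\n'));
```

Every sentence in a block gets the same timestamp here, because the only time available per block is its end time; finer timestamps would need the word-level offsets.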

And how it should be:

[00:00:16] [Lucas]: Good morning, Anita.
[00:00:19] [Anita]: Good morning, Lucas.

[00:00:16] [Speaker0]: Good morning, Anita.
[00:00:19] [Speaker1]: Good morning, Lucas.

How do I make it follow a pattern? (These problems usually happen in Portuguese and Spanish.) I speak all three languages to some degree, so feel free to answer in any of them, though of course the community’s general preference is English.