Google TTS cuts off the first few ms of a script, but not the way I would expect

I’ve built a PHP service that calls Google Cloud Text-to-Speech and writes the base64-decoded audioContent directly to an MP3 file. When I prepend a hard-coded phrase (e.g. “Ein Moment der Ruhe…”, which is the same phrase that later appears at the start of the dynamic text), it comes through perfectly, with no words lost. But as soon as my dynamic text begins, its first 1–2 words are always faded in or cut off. So the affected words are no longer the first words of the audio, yet they are still “faded in”.

I’ve tried adding an SSML <break> at the beginning or between the two parts, but the same issue persists whenever I play back the MP3 that Google returns.
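A minimal sketch of the SSML variants described above, for concreteness. The helper name `buildSsml()` and its parameters are hypothetical and not part of the original service; it only illustrates where the break sits relative to an optional hard-coded lead-in and the dynamic text.

```php
<?php
// Hypothetical helper (not in the original service): builds the SSML variants that were tried.
function buildSsml(string $text, int $breakMs = 100, string $leadIn = ''): string
{
    // Escape text for XML so characters like & or < cannot break the SSML document.
    $escape = static fn (string $s): string => htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');

    $break = sprintf('<break time="%dms"/>', $breakMs);

    // With an empty lead-in the break is the very first element;
    // with a lead-in the break sits between the lead-in and the dynamic text.
    return '<speak>'
         . ($leadIn !== '' ? $escape($leadIn) : '')
         . $break
         . $escape($text)
         . '</speak>';
}
```

Calling `buildSsml($text, 100, 'Ein Moment der Ruhe...')` corresponds to the "in between" variant, and `buildSsml($text, 100)` to the "at the beginning" variant.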

I’m on shared hosting so I’d prefer to consume Google’s MP3 directly instead of running a conversion pipeline myself. What else can I try to guarantee the very first syllable of dynamic text isn’t lost?

namespace App\Services;

class GoogleTtsService extends TtsService
{
    private const API_URL = 'https://texttospeech.googleapis.com/v1/text:synthesize';

    public function synthesize(string $text, array $options, string $outputPath, string $apiKey): bool
    {
        // Pause (in ms) inserted between the hard-coded phrase and the dynamic text
        $breakMs = defined('TTS_INITIAL_BREAK_MS') ? TTS_INITIAL_BREAK_MS : 100;

        // This hard-coded phrase plays fine
        $ssml = '<speak>'
              . 'Ein Moment der Ruhe...'
              . '<break time="' . $breakMs . 'ms"/>'
              // The first 1-2 words of this dynamic text are always faded in or cut
              . htmlspecialchars($text, ENT_QUOTES | ENT_XML1, 'UTF-8')
              . '</speak>';

        $payload = [
            'input'       => ['ssml' => $ssml],
            'voice'       => [
                'languageCode' => $options['language'] ?? 'de-DE',
                'name'         => $options['voice']    ?? 'de-DE-Standard-A'
            ],
            'audioConfig' => [
                'audioEncoding'   => 'MP3',
                'speakingRate'    => $options['speed']  ?? 1.0
            ]
        ];

        $response = $this->postJson(self::API_URL . '?key=' . urlencode($apiKey), $payload);
        if (empty($response['audioContent'])) {
            return false;
        }

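        // Strict mode returns false on invalid base64 instead of silently skipping bad characters
        $audioData = base64_decode($response['audioContent'], true);
        if ($audioData === false) {
            return false;
        }

        return file_put_contents($outputPath, $audioData) !== false;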
    }
}
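Because the service writes the decoded bytes straight to disk, one cheap sanity check is to confirm that the data at least starts like an MP3 stream before saving it. The sketch below is an assumption-laden illustration: `looksLikeMp3()` is a hypothetical helper that is not part of the service, and it only inspects the first bytes for an ID3v2 tag or an MPEG frame sync.

```php
<?php
// Hypothetical helper (not in the original service): rough check that bytes start like an MP3.
function looksLikeMp3(string $bytes): bool
{
    if (strlen($bytes) < 3) {
        return false;
    }

    // An ID3v2 tag, if present, begins with the ASCII bytes "ID3".
    if (strncmp($bytes, 'ID3', 3) === 0) {
        return true;
    }

    // Otherwise expect an MPEG frame sync: 0xFF followed by a byte whose top three bits are set.
    return ord($bytes[0]) === 0xFF && (ord($bytes[1]) & 0xE0) === 0xE0;
}
```

If this returns true for the decoded `audioContent`, a corrupted decode or write is unlikely. It says nothing about the fade-in itself, so it only helps narrow the problem down to the audio that Google returns.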

Relevant debug log:

[28-Jun-2025 00:15:21] TTS chunk for job 311: text length=4159
[28-Jun-2025 00:15:21] SSML sent (first 100 chars): 
   <speak>Ein Moment der Ruhe...<break time="100ms"/>Ein Moment der Ruhe am Morgen. Ein bewu…