How to Stream Responses from the vLLM API Server and Display Them in a Flask App?

I am running the vLLM API server with the following command:

python -m vllm.entrypoints.api_server --model=mistralai/Mistral-7B-Instruct-v0.3 --dtype=half --tensor-parallel-size=4 --gpu-memory-utilization=0.5 --max-model-len=27000

I am sending requests to the server using this Python function:

import requests

def send_request_2_llm(prompt: str):
    url = "http://localhost:8000/generate"
    # Rough length guard (note: this truncates characters, not tokens).
    if len(prompt) > 27_000:
        prompt = prompt[:27_000]
    payload = {
        "prompt": prompt,
        "stream": True,
        "min_tokens": 256,
        "max_tokens": 1024
    }
    # stream=True so requests returns as soon as the headers arrive
    # instead of buffering the whole body.
    response = requests.post(url, json=payload, stream=True)
    return response

I want to display the streamed response on the page served by my Flask app. The issue I’m encountering is the structure of the streamed responses: the API server returns a sequence of cumulative JSON objects like this:

{"text": "SYSTEM_PROMPT + hello"}
{"text": "SYSTEM_PROMPT + hello how"}
{"text": "SYSTEM_PROMPT + hello how are"}
{"text": "SYSTEM_PROMPT + hello how are you"}
{"text": "SYSTEM_PROMPT + hello how are you?"}

In my Flask app, I want to show only the generated text (“hello how are you?”) on a single line, updated in a streaming fashion. I believe I can slice the SYSTEM_PROMPT out of the “text” field, but I’m unsure how to do this correctly.
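
My rough idea for the slicing is to strip the prompt prefix and then yield only the suffix that is new compared with the previous chunk, along these lines (this relies on my assumption that every cumulative "text" starts with the prompt I sent, including the SYSTEM_PROMPT, and uses the stream_llm_text helper above):

def stream_deltas(prompt: str):
    """Yield only the newly generated text from each cumulative chunk."""
    emitted = 0  # number of generated characters already yielded
    for full_text in stream_llm_text(prompt):
        # Assumption: the cumulative "text" always begins with the prompt.
        if full_text.startswith(prompt):
            full_text = full_text[len(prompt):]
        delta = full_text[emitted:]
        emitted = len(full_text)
        if delta:
            yield delta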

Here is the JavaScript code I am using to handle the streaming:

function askQuestion() {
    const fileInput = document.getElementById('fileInput');
    const questionInput = document.getElementById('questionInput');
    const responseDiv = document.getElementById('response');

    const formData = new FormData();
    formData.append('file', fileInput.files[0]);
    formData.append('question', questionInput.value);

    responseDiv.innerHTML = '';  // Clear previous response

    fetch('/upload', {
        method: 'POST',
        body: formData
    })
    .then(response => {
        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }
        const reader = response.body.getReader();
        const decoder = new TextDecoder();

        return new ReadableStream({
            start(controller) {
                function push() {
                    reader.read().then(({ done, value }) => {
                        if (done) {
                            controller.close();
                            return;
                        }
                        const chunk = decoder.decode(value, { stream: true });
                        console.log("Received chunk:", chunk);  // Debug log
                        // Re-enqueue the raw bytes (value), not the decoded string,
                        // so the wrapped stream stays a byte stream that
                        // new Response(stream).text() below can consume.
                        controller.enqueue(value);
                        responseDiv.innerHTML += chunk;
                        push();
                    }).catch(error => {
                        console.error('Stream reading error:', error);
                        controller.error(error);
                    });
                }
                push();
            }
        });
    })
    .then(stream => new Response(stream).text())
    .then(result => {
        console.log('Complete response received');
    })
    .catch(error => {
        console.error('Error:', error);
        responseDiv.innerHTML = 'An error occurred while processing your request.';
    });
}
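
For context, my plan is for the /upload route that this fetch() calls to proxy the stream_deltas generator back to the browser as a plain-text stream, roughly like this simplified sketch (the real route would also read the uploaded file into the prompt):

from flask import Flask, Response, request, stream_with_context

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    question = request.form["question"]
    # Simplified: in the real route the uploaded file from
    # request.files["file"] is read and combined with the question
    # to build the final prompt.
    prompt = question
    # Stream the deltas straight through so the reader in askQuestion()
    # receives text as soon as it is generated.
    return Response(stream_with_context(stream_deltas(prompt)),
                    mimetype="text/plain")

With this, the reader in askQuestion() should receive only new text, but I am not sure this is the right or idiomatic way to wire it up.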

My Questions:

  1. How can I correctly slice the SYSTEM_PROMPT out of the “text” field and display only the generated text?
  2. How can I implement the streaming so the response updates on screen in real time, without repeating the earlier cumulative fragments?

Any advice or guidance would be greatly appreciated!