I am using the vLLM API server with the following setup:
python -m vllm.entrypoints.api_server --model=mistralai/Mistral-7B-Instruct-v0.3 --dtype=half --tensor-parallel-size=4 --gpu-memory-utilization=0.5 --max-model-len=27000
I am sending requests to the server using this Python function:
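import requests

def send_request_2_llm(prompt: str):
    url = "http://localhost:8000/generate"
    # Crude guard: this caps characters, while --max-model-len counts tokens.
    if len(prompt) > 27_000:
        prompt = prompt[:27_000]
    payload = {
        "prompt": prompt,
        "stream": True,
        "min_tokens": 256,
        "max_tokens": 1024
    }
    # stream=True keeps the connection open so chunks can be read as they arrive.
    response = requests.post(url, json=payload, stream=True)
    return response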
I want to display the streamed response on a page of my Flask app. The issue I’m encountering is with the structure of the streamed response: each JSON object the API server sends contains the prompt plus all of the text generated so far, like this:
{"text": "SYSTEM_PROMPT + hello"}
{"text": "SYSTEM_PROMPT + hello how"}
{"text": "SYSTEM_PROMPT + hello how are"}
{"text": "SYSTEM_PROMPT + hello how are you"}
{"text": "SYSTEM_PROMPT + hello how are you?"}
In my Flask app, I want to show only the generated text (“hello how are you?”) on a single line that grows as the stream arrives. I believe I can slice SYSTEM_PROMPT off the “text” field, but I’m unsure how to do this correctly.
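To make the slicing idea concrete, here is a minimal sketch in Python. It assumes the server echoes the prompt verbatim at the start of every "text" value, and that the prompt passed in is the one actually sent (after the 27,000-character cut). The names `strip_prompt` and `iter_deltas` are hypothetical helpers, not part of vLLM.

```python
from typing import Iterable, Iterator


def strip_prompt(text: str, prompt: str) -> str:
    """Return only the generated part of `text`."""
    # Assumption: the server echoes the prompt verbatim at the start of "text".
    if text.startswith(prompt):
        return text[len(prompt):]
    return text  # prompt not echoed (or altered): leave the text untouched


def iter_deltas(cumulative_texts: Iterable[str], prompt: str) -> Iterator[str]:
    """Turn cumulative texts into only the newly generated piece of each one."""
    seen = ""  # generated text already forwarded to the client
    for text in cumulative_texts:
        generated = strip_prompt(text, prompt)
        if generated.startswith(seen):
            delta = generated[len(seen):]
        else:
            # The text did not extend the previous one; resend it whole.
            delta = generated
        seen = generated
        if delta:
            yield delta
```

Appending each yielded piece to the page would build up "hello how are you?" once, instead of repeating the earlier fragments.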
Here is the JavaScript code I am using to handle the streaming:
function askQuestion() {
    const fileInput = document.getElementById('fileInput');
    const questionInput = document.getElementById('questionInput');
    const responseDiv = document.getElementById('response');
    const formData = new FormData();
    formData.append('file', fileInput.files[0]);
    formData.append('question', questionInput.value);
    responseDiv.innerHTML = ''; // Clear previous response
    fetch('/upload', {
        method: 'POST',
        body: formData
    })
    .then(response => {
        if (!response.ok) {
            throw new Error(`HTTP error! status: ${response.status}`);
        }
        const reader = response.body.getReader();
        const decoder = new TextDecoder();
        return new ReadableStream({
            start(controller) {
                function push() {
                    reader.read().then(({ done, value }) => {
                        if (done) {
                            controller.close();
                            return;
                        }
                        const chunk = decoder.decode(value, { stream: true });
                        console.log("Received chunk:", chunk); // Debug log
                        controller.enqueue(chunk);
                        responseDiv.innerHTML += chunk;
                        push();
                    }).catch(error => {
                        console.error('Stream reading error:', error);
                        controller.error(error);
                    });
                }
                push();
            }
        });
    })
    .then(stream => new Response(stream).text())
    .then(result => {
        console.log('Complete response received');
    })
    .catch(error => {
        console.error('Error:', error);
        responseDiv.innerHTML = 'An error occurred while processing your request.';
    });
}
My Questions:
- How can I correctly slice out the SYSTEM_PROMPT from the “text” field and display only the final text?
- How can I implement streaming so that the response updates on the screen in real time, without the earlier fragments being repeated?
Any advice or guidance would be greatly appreciated!
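For reference, here is a rough sketch of how the raw stream could be parsed on the Flask side before anything is forwarded to the browser. A network chunk can end in the middle of a JSON object, so the bytes are buffered and split on a delimiter. The newline delimiter is an assumption (the separator may differ between server versions, so it is worth checking the raw bytes), and `iter_texts` and `_extract_text` are hypothetical helper names.

```python
import json
from typing import Iterable, Iterator


def _extract_text(obj: dict) -> str:
    text = obj["text"]
    # Assumption: the server may wrap the text in a list (one entry per sample).
    return text[0] if isinstance(text, list) else text


def iter_texts(byte_chunks: Iterable[bytes], delimiter: bytes = b"\n") -> Iterator[str]:
    """Yield the "text" value of every complete JSON object in a byte stream."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        # Everything before the last delimiter is complete; the rest stays buffered.
        *complete, buffer = buffer.split(delimiter)
        for raw in complete:
            if raw.strip():
                yield _extract_text(json.loads(raw))
    # Flush a final object that was not followed by a delimiter.
    if buffer.strip():
        yield _extract_text(json.loads(buffer))
```

With `requests`, the byte chunks could come from `response.iter_content(chunk_size=None)`. The yielded texts are still cumulative, so the prompt and the previously seen text would need to be removed before each piece is yielded from the generator handed to Flask's `Response`. The browser code can then keep appending chunks as it does now.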