Overview
I’ve been using the Vertex API relatively successfully for the past few months, but I’ve noticed that when the text part of a prompt becomes extremely long (around 130,000 characters), the API seems to malfunction.
Implementation Details
- Gemini 1.5 Pro as the LLM
- React front-end
- Firebase / Firestore / Cloud Functions as the back-end
API Infrastructure
I’ve tried two approaches to integrating with Vertex:
- Using the vertexai-preview package that ships with firebase
- Using the VertexAI NodeJS package in a Cloud Function with increased (1 GB) memory allocation and a longer (120 sec) max runtime (a rough sketch of that configuration is below)
Package Structure
All of my calls to Vertex follow the documentation’s pattern, where the “files” sent to the LLM are included via a Cloud Storage URI and the text portions of the prompt are sent as inline text parts. Like this:
const { VertexAI } = require('@google-cloud/vertexai');

// Model setup (placeholder project ID; I use Gemini 1.5 Pro)
const vertexAI = new VertexAI({ project: 'my-project-id', location: 'us-central1' });
const generativeVisionModel = vertexAI.getGenerativeModel({ model: 'gemini-1.5-pro' });

async function multiPartContent() {
  // One Cloud Storage file part plus one inline text part
  const filePart = { fileData: { fileUri: 'gs://generativeai-downloads/images/scones.jpg', mimeType: 'image/jpeg' } };
  const textPart = { text: 'What is this picture about?' };
  const request = {
    contents: [{ role: 'user', parts: [textPart, filePart] }],
  };
  const streamingResult = await generativeVisionModel.generateContentStream(request);
  for await (const item of streamingResult.stream) {
    console.log('stream chunk: ', JSON.stringify(item));
  }
  const aggregatedResponse = await streamingResult.response;
  console.log(aggregatedResponse.candidates[0].content);
}
In my case, I am using the generateContentStream approach.
Expected Behavior
Given the model’s massive context window, I expect to be able to send a large amount of information to the LLM and get a response back.
Observed Behavior
vertexai-preview client-side package
When using the vertexai-preview package, I get a FirebaseError with no message property as I start pushing larger requests that include more files and text.
I can confirm that my usage is nowhere near the 2M-token context window; these heavier requests are usually around 200k tokens.
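For context, the client-side call is wired up roughly like this (a sketch of how I use the Vertex AI in Firebase preview SDK; the config object, model name, and helper name are placeholders, not my exact code):

// Client-side (React) -- rough sketch of the vertexai-preview usage
import { initializeApp } from "firebase/app";
import { getVertexAI, getGenerativeModel } from "firebase/vertexai-preview";

const firebaseConfig = { /* web app config from the Firebase console */ };
const app = initializeApp(firebaseConfig);
const vertexAI = getVertexAI(app);
const model = getGenerativeModel(vertexAI, { model: "gemini-1.5-pro" });

async function run(parts) {
  // parts: mix of { text } and { fileData: { fileUri, mimeType } } objects
  const result = await model.generateContentStream({
    contents: [{ role: "user", parts }],
  });
  for await (const chunk of result.stream) {
    console.log("stream chunk received");
  }
  return (await result.response).candidates[0].content;
}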
VertexAI server-side approach
Here’s a relevant code block from my Cloud Function:
const req = {
  contents: [{ role: "user", parts }],
};
console.log(`Initiating content generation for docId: ${docId}`);
const streamingResp = await generativeModel.generateContentStream(req);
This logic works for smaller requests, but heavier requests fail. In the Cloud logs, I see the “Initiating content generation” entry and, even though I’m catching errors in my code (the block above sits inside a try / catch block), I don’t see any additional logs. The process simply ends with no error and no further output.
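For completeness, the surrounding structure looks roughly like this (a sketch, slightly simplified; docId and parts are built earlier in the function):

// Rough sketch of the surrounding structure inside my Cloud Function
try {
  const req = {
    contents: [{ role: "user", parts }],
  };
  console.log(`Initiating content generation for docId: ${docId}`);
  const streamingResp = await generativeModel.generateContentStream(req);

  // Consume the stream; this is where the heavier requests seem to die
  for await (const chunk of streamingResp.stream) {
    console.log("chunk received");
  }

  const aggregated = await streamingResp.response;
  console.log(`Completed generation for docId: ${docId}`);
  return aggregated.candidates[0].content;
} catch (err) {
  // For the large requests this never fires -- no log line appears at all
  console.error(`Generation failed for docId: ${docId}`, err);
  throw err;
}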
Some things I’ve tried
Chunking the text parts
I’ve tried to convert long text strings into multiple smaller (e.g. ~50k character) text parts. So, the parts I send the LLM have, for example:
- 3 fileUri parts
- 3 text parts
This didn’t work at all.
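The chunking looked roughly like this (a sketch; chunkText, fileUris, and longPromptText are illustrative names, and the mime types vary in my actual app):

// Illustrative helper: split a long string into ~50k-character text parts
function chunkText(longText, chunkSize = 50000) {
  const parts = [];
  for (let i = 0; i < longText.length; i += chunkSize) {
    parts.push({ text: longText.slice(i, i + chunkSize) });
  }
  return parts;
}

// Example: 3 Cloud Storage file parts followed by the chunked text parts
const fileParts = fileUris.map((fileUri) => ({
  fileData: { fileUri, mimeType: "application/pdf" }, // mime type varies per file
}));
const parts = [...fileParts, ...chunkText(longPromptText)];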
Sending the long text part as a fileUri part
I’ve tried converting long text strings into stored plain text files, then sending them as fileUri parts.
This approach does seem to improve reliability, but it creates something of a prompt-engineering problem, because the actual prompt now lives inside the stored text file that I’ve sent to the LLM.
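The workaround looks roughly like this (a sketch; the bucket name, file path, and helper name are placeholders, and this runs in the Cloud Function before the Vertex call):

const { Storage } = require("@google-cloud/storage");
const storage = new Storage();

// Store the long prompt text as a plain-text object, then reference it by gs:// URI
async function textToFilePart(longText, docId) {
  const bucketName = "my-app-prompts";      // placeholder bucket
  const filePath = `prompts/${docId}.txt`;  // placeholder path
  await storage.bucket(bucketName).file(filePath).save(longText, {
    contentType: "text/plain",
  });
  return {
    fileData: { fileUri: `gs://${bucketName}/${filePath}`, mimeType: "text/plain" },
  };
}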
Summary
Overall, I’m finding it difficult to use the Vertex API for these larger requests. The API claims to handle them, but as I push these higher-token requests it fails completely with non-descriptive errors.
- I never see any max memory errors
- I never see any function runtime errors
- I never see any errors about the length of the text part of my prompt (also, if you happen to know what, if any, limitations there are on the text part of Vertex prompts, please let me know)
I’d love to know how to approach this.
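For reference, token counts can be checked with the Node SDK’s countTokens before sending; a sketch (same request shape and generativeModel as in the Cloud Function above):

// Sketch: estimate the token count of a request before calling generateContentStream
async function logTokenCount(parts) {
  const countResp = await generativeModel.countTokens({
    contents: [{ role: "user", parts }],
  });
  console.log(`totalTokens: ${countResp.totalTokens}`);
  return countResp.totalTokens;
}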