
Comments (26)

allancarvalho avatar allancarvalho commented on June 11, 2024 6

I have done a small workaround for this behavior.

Basically, this function:

export function logStream(originalStream: ReadableStream) {
  // Tee so the original stream could still be consumed elsewhere if needed.
  const [loggedStream, loggingStream] = originalStream.tee();
  return new ReadableStream({
    async start(controller) {
      const reader = loggingStream.getReader();
      // Re-enqueue each chunk with a short pause so the client sees a
      // steadier flow instead of large bursts.
      async function read() {
        const { done, value } = await reader.read();
        if (done) {
          controller.close();
          return;
        }
        controller.enqueue(value);
        await new Promise(resolve => setTimeout(resolve, 80));
        read();
      }
      read();
    }
  });
}

After that, change

return new StreamingTextResponse(stream);

to

return new StreamingTextResponse(logStream(stream));

from ai.

ElectricCodeGuy avatar ElectricCodeGuy commented on June 11, 2024 4

This streaming behavior is due to Azure content filtering: [Azure content filter](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython).
This can cause delay and is also a contributor to why the tokens are received in much larger chunks.
We have reached out to Microsoft on the issue but to no avail.

We came up with a temporary "fix".
One way was to read the received tokens into an array on the client and then push them out little by little to imitate a smoother streaming effect. However, this caused some weird UI bugs.

So now we instead read the received tokens into a Uint8Array[] on the server using Transform and TransformCallback from node:stream, and then calculate how fast the tokens should be emitted to the client, since the tokens from Azure OpenAI can vary in speed. It is quite hard to implement, but it is possible. If you need help with it, feel free to reach out to me on LinkedIn or here :)
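
A minimal sketch of that idea, using a node:stream Transform (the fixed slice size and delay here are illustrative assumptions; the real implementation calculates the pacing dynamically, as described above):

import { Transform, TransformCallback } from 'node:stream';

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Re-emits each large chunk from Azure as small slices with a short pause
// between them, so the client sees a steady flow instead of bursts.
export function createSmoothingTransform(sliceSize = 4, delayMs = 15): Transform {
  return new Transform({
    async transform(chunk: Buffer, _encoding, callback: TransformCallback) {
      const text = chunk.toString('utf8');
      for (let i = 0; i < text.length; i += sliceSize) {
        this.push(text.slice(i, i + sliceSize));
        await sleep(delayMs);
      }
      // Signal that this chunk is done; back-pressure keeps the next Azure
      // chunk waiting until the pacing above has finished.
      callback();
    },
  });
}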

from ai.

afbarbaro avatar afbarbaro commented on June 11, 2024 2

I can confirm this even happens on the Azure OpenAI Playground, so it is indeed an Azure issue. I have access to customize the content filters (had better luck than @ElectricCodeGuy), but no matter what their configuration is, the streaming is still slow and chunky. See video below.

Has anyone implemented a solution other than creating another stream that splits the chunks into letters or a few characters to emulate smoother streaming (at the expense of higher initial latency and more code)?

azure.slow.streaming.mov

from ai.

christophmeise avatar christophmeise commented on June 11, 2024 2

Yes, I see that it also happens in the Azure playground - which indicates, but does not prove, that it is an Azure problem.

Has anyone tried to use @langchain/openai and just call Azure directly?
Here is a video from a part of our app that uses Langchain + Azure without any chunk splitting and with zero additional latency.

Aufzeichnung.2024-03-07.153139.mp4

Here is a screenshot from the handleLLMNewToken console logs
Screenshot 2024-03-07 153249

Here is the code snippet showing how I stream from Azure with no problems, as you can see in the video & logs:

const model = new ChatOpenAI({
  temperature: 0.6,
  topP: 0.96,
  maxTokens: -1,
  modelName: "gpt-4",
  azureOpenAIApiDeploymentName: "xxx", // using the same Azure model as with the Vercel SDK
}).bind(modelParams);

const chain = prompt
  .pipe(model as any)
  .pipe(new JsonOutputFunctionsParser()); // using JSON because I have an array, but it works with all parsers

const stream = await chain.stream({
  some_context: xxx
});

return new Observable((subscriber) => {
  (async () => {
    let hooks;
    for await (const chunk of stream) {
      console.log(chunk); // this is what the logs screenshot shows
      subscriber.next({ data: chunk }); // I just pass it to the endpoint and display it in the UI
      hooks = (chunk as any).hooks;
    }

    ...
});

Maybe I am missing something, but this seems to be a working solution for the exact problem in this thread?
If yes -> this is not an Azure problem and the Vercel SDK has an issue with streaming.
If no -> I would appreciate an explanation of why I don't have this problem with my own endpoint and why the same can't be done in the SDK.

from ai.

JoshFriedmanO3 avatar JoshFriedmanO3 commented on June 11, 2024 2

Yeah, I've basically narrowed it down (at least in my env) to the point at which the SDK is making requests to the API. The streaming from there to the FE actually works for each token/character, but there is a ~2 second delay for each chunk that is sent by Azure's API.

Basically it's coming through like this:

I
I am
I am working
I am working fine

---- 2 second delay

I am working fine. Are
I am working fine. Are you
I am working fine. Are you working
I am working fine. Are you working fine?

---- 2 second delay

I am working fine. Are you working fine? Finished
I am working fine. Are you working fine? Finished streaming.

Regardless of Vercel's SDK, I can recreate the streaming issues in Azure's playground.

@christophmeise So that's all I've been able to go off of; the issue may be in Vercel, but I have not been able to identify what would cause it.

Clearly the custom streaming integration you wrote solves this issue, but then why would it also apply to the Azure playground? Quite a confusing issue, haha.

For some clarity, here's the video I sent to Microsoft when I was chatting with them: https://www.loom.com/share/31a36da7ded44881a2a720ebdae0346d?sid=957bead8-1831-49fc-892d-9ac1b7d1b1cb

It clearly shows what's happening.

from ai.

JakobStadlhuber avatar JakobStadlhuber commented on June 11, 2024 2

Yeah, I've basically narrowed it down (at least in my env) to the point at which the SDK is making requests to the API. The streaming from there to the FE actually works for each token/character, but there is a ~2 second delay for each chunk that is sent by Azure's API.

Basically it's coming through like this:

I
I am
I am working
I am working fine

---- 2 second delay

I am working fine. Are
I am working fine. Are you
I am working fine. Are you working
I am working fine. Are you working fine?

---- 2 second delay

I am working fine. Are you working fine? Finished
I am working fine. Are you working fine? Finished streaming.

Regardless of Vercel's SDK, I can recreate the streaming issues in Azure's playground.

@christophmeise So that's all I've been able to go off of; the issue may be in Vercel, but I have not been able to identify what would cause it.

Clearly the custom streaming integration you wrote solves this issue, but then why would it also apply to the Azure playground? Quite a confusing issue, haha.

For some clarity, here's the video I sent to Microsoft when I was chatting with them: https://www.loom.com/share/31a36da7ded44881a2a720ebdae0346d?sid=957bead8-1831-49fc-892d-9ac1b7d1b1cb

It clearly shows what's happening.

We have the exact same experience.

from ai.

christophmeise avatar christophmeise commented on June 11, 2024 1

This was happening on a client component. The component itself is always the same in my tests, and it worked fine with OpenAI / Nest.js + Langchain + Azure. Only when I use Azure with Vercel AI do I get the slow behaviour.

I also tried logging the token stream in the "handleLLMNewToken" callback handler. Even there, the tokens come in slowly, in big chunks. So the problem is not in rendering; it is in the streaming implementation.

from ai.

kyb3r avatar kyb3r commented on June 11, 2024 1

It is an Azure problem; it's related to content filtering. It's possible to turn content filtering off, or use an async mode in Azure settings. But that is limited access and only available to managed customers.

But I think that the AI SDK also has a problem: I noticed that the new RSC demo has chunking instead of a smooth flow of tokens?

Do you notice the same issue here: https://sdk.vercel.ai/demo

from ai.

christophmeise avatar christophmeise commented on June 11, 2024 1

This is a problem I've been experiencing for months; I've tried to talk to Microsoft with no luck as well.

It's been quite hard to debug, as I'm not sure where that chunky streaming is originating, haha.

@ElectricCodeGuy, @christophmeise I am really curious about both your custom integrations.

We are similarly using langchain/openai, e.g.:

export const streamingModel = new ChatOpenAI({
  // modelName: "gpt-4",
  azureOpenAIApiDeploymentName: "gpt4",
  streaming: true,
  temperature: 0,
  tags: ["GPT-4 Streaming"]
});

but we're calling the model via the langchain LLMChain: import { LLMChain } from "langchain/chains";

answerWithContextChain.stream({
  chatHistory,
  context: vectorResults,
  question: sanitizedQuestion,
}, {
  callbacks: [handlers, runCollector]
});

and then returning to the FE with return new StreamingTextResponse(stream, {}, data);

As I said in my comment - it works for us when we don't use Vercel and use our own Node endpoint with SSE. We have no workaround or custom buffering for the tokens - it works out of the box.

I am still wondering why everyone is convinced that it is an issue with Azure and not Vercel.

from ai.

faisal-saddique avatar faisal-saddique commented on June 11, 2024 1

Following for updates

from ai.

vhiairrassary avatar vhiairrassary commented on June 11, 2024 1

What about https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython#asynchronous-modified-filter?
Screenshot 2024-04-11 at 09 17 09

EDIT: There are some drawbacks to be carefully reviewed: "Customers must be aware that while the feature improves latency, it's a trade-off against the safety and real-time vetting of smaller sections of model output."

from ai.

MaxLeiter avatar MaxLeiter commented on June 11, 2024

I believe it's known that Azure streams in larger chunks than OpenAI. Maybe we can provide a utility here, but the best solution is for your client not to rely on the server's chunking strategy if you want a consistent feel.

https://learn.microsoft.com/en-us/answers/questions/1359927/azure-openai-api-with-stream-true-does-not-give-ch
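
A minimal sketch of that client-side idea (not an SDK utility, just an illustration; the reveal rate below is an arbitrary assumption):

// Buffers whatever chunk sizes the server sends and reveals the text at a
// fixed rate, so the typing effect does not depend on Azure's chunking.
export function createSmoothRenderer(
  onUpdate: (text: string) => void,
  charsPerTick = 3,
  tickMs = 20,
) {
  let buffered = '';
  let shownLength = 0;
  const timer = setInterval(() => {
    if (shownLength < buffered.length) {
      shownLength = Math.min(buffered.length, shownLength + charsPerTick);
      onUpdate(buffered.slice(0, shownLength));
    }
  }, tickMs);
  return {
    // Call with each chunk received from the response stream.
    append(chunk: string) {
      buffered += chunk;
    },
    // Call when the stream ends to flush the remainder and stop the timer.
    stop() {
      clearInterval(timer);
      onUpdate(buffered);
    },
  };
}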

from ai.

christophmeise avatar christophmeise commented on June 11, 2024

Thank you! I know that it is possible with Azure because I have built a Nest.js endpoint that streams from Azure just fine with the exact same model and settings.

I used an endpoint with Server-Sent Events (SSE) and manually parsed the stream using microsoft/fetch-event-source in the client.
The stream of tokens is super quick and steady, but they arrive unsorted - that means I needed to fix the sorting client-side so that the response is correct in the UI.
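
For illustration, a minimal sketch of that client-side setup, assuming an SSE endpoint at /api/chat and an { index, token } payload per event (both are assumptions for illustration, not the exact code):

import { fetchEventSource } from '@microsoft/fetch-event-source';

// Tokens can arrive out of order, so collect them by index and re-sort
// before rendering, as described above.
export async function streamAnswer(prompt: string, onText: (text: string) => void) {
  const tokens = new Map<number, string>();

  await fetchEventSource('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    onmessage(event) {
      // Hypothetical payload shape: { index: number, token: string }
      const { index, token } = JSON.parse(event.data) as { index: number; token: string };
      tokens.set(index, token);
      const ordered = [...tokens.entries()].sort(([a], [b]) => a - b).map(([, t]) => t);
      onText(ordered.join(''));
    },
  });
}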

I am currently trying to transition from my own Nest.js backend to Vercel AI, but the streaming is not working as before.

from ai.

valstu avatar valstu commented on June 11, 2024

Was this happening when streaming RSC? I noticed the same behaviour; here's a comparison video between RSC and text streaming + a client component: https://x.com/valtterikaresto/status/1764412056948576712

Streaming text and rendering client component (right side of the video) seems much smoother for some reason.

from ai.

christophmeise avatar christophmeise commented on June 11, 2024

Sorry, but this seems wrong.

Again: Azure streams correctly when I stream via Langchain on my node backend - tokens come bit by bit, not chunky at all. I use the same content filter and deployment.

How can this be an Azure problem when it works for me on Node and not with this SDK??

from ai.

JoshFriedmanO3 avatar JoshFriedmanO3 commented on June 11, 2024

This is a problem I've been experiencing for months; I've tried to talk to Microsoft with no luck as well.

It's been quite hard to debug, as I'm not sure where that chunky streaming is originating, haha.

@ElectricCodeGuy, @christophmeise I am really curious about both your custom integrations.

We are similarly using langchain/openai, e.g.:

export const streamingModel = new ChatOpenAI({
  // modelName: "gpt-4",
  azureOpenAIApiDeploymentName: "gpt4",
  streaming: true,
  temperature: 0,
  tags: ["GPT-4 Streaming"]
});

but we're calling the model via the langchain LLMChain: import { LLMChain } from "langchain/chains";

answerWithContextChain.stream({
  chatHistory,
  context: vectorResults,
  question: sanitizedQuestion,
}, {
  callbacks: [handlers, runCollector]
});

and then returning to the FE with return new StreamingTextResponse(stream, {}, data);

from ai.

JakobStadlhuber avatar JakobStadlhuber commented on June 11, 2024

I have the same problem. Any news?

from ai.

JakobStadlhuber avatar JakobStadlhuber commented on June 11, 2024

We do not use Vercel and have the same problem directly with the SDK in Kotlin. Also the same behaviour in the Playground.

from ai.

emrahtoy avatar emrahtoy commented on June 11, 2024

Same here 👍

from ai.

red-hunter avatar red-hunter commented on June 11, 2024

Yeah, I've basically narrowed it down (at least in my env) to the point at which the SDK is making requests to the API. The streaming from there to the FE actually works for each token/character, but there is a ~2 second delay for each chunk that is sent by Azure's API.
Basically it's coming through like this:

I
I am
I am working
I am working fine

---- 2 second delay

I am working fine. Are
I am working fine. Are you
I am working fine. Are you working
I am working fine. Are you working fine?

---- 2 second delay

I am working fine. Are you working fine? Finished
I am working fine. Are you working fine? Finished streaming.

Regardless of Vercel's SDK, I can recreate the streaming issues in Azure's playground.
@christophmeise So that's all I've been able to go off of; the issue may be in Vercel, but I have not been able to identify what would cause it.
Clearly the custom streaming integration you wrote solves this issue, but then why would it also apply to the Azure playground? Quite a confusing issue, haha.
For some clarity, here's the video I sent to Microsoft when I was chatting with them: https://www.loom.com/share/31a36da7ded44881a2a720ebdae0346d?sid=957bead8-1831-49fc-892d-9ac1b7d1b1cb
It clearly shows what's happening.

We have the exact same experience.

I am streaming to a Teams bot and I have the exact same problem.

from ai.

kemeny avatar kemeny commented on June 11, 2024

From what I have been discussing with folks at Microsoft, their recommendation is to use provisioned throughput units; this secures output quota and should fix the chunkiness of the output. I am not fully convinced. Has anyone tried this? Here's the documentation: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput
Might this be the root cause?

from ai.

kemeny avatar kemeny commented on June 11, 2024

From what I have been discussing with folks at Microsoft, their recommendation is to use provisioned throughput units; this secures output quota and should fix the chunkiness of the output. I am not fully convinced. Has anyone tried this? Here's the documentation: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput Might this be the root cause?

So, that might not be the solution at all. I did go through another similar issue on the Microsoft forum, where they mentioned that content filtering limits streaming throughput.

In the OAI Azure portal, within Content Filtering (Preview), there's an option to set Streaming mode from Default to Asynchronous Modified Filter. However, this requires approval from Microsoft to activate.

Screenshot 2024-04-10 at 8 22 43 a m

from ai.

JoshFriedmanO3 avatar JoshFriedmanO3 commented on June 11, 2024

@kemeny This is further than I was able to get with them. I still do think it's related to the content filtering. The streamed response has a ton of JSON wrapped around each token, which makes me think the pre-processing is causing that chunkiness. I was also curious about the PTUs, but don't have the ability to get them. Same for the content filtering - was not allowed access.

Really just an overall frustrating experience. Legit no other cloud platform has streaming issues like Azure.

from ai.

valstu avatar valstu commented on June 11, 2024

I heard that buying PTUs does actually improve performance, although in the same sentence I also heard that GPT-4 requires a lot of PTUs and pricing suddenly climbs to five figures per month.

Also, it seems quite difficult to get access to the modified content filter program. So hands are pretty much tied at this point.

from ai.

JoshFriedmanO3 avatar JoshFriedmanO3 commented on June 11, 2024

@allancarvalho Yeah, this is a great solution considering we can't get the performance we want unless we turn off the content filtering and have the async streaming.

Thank you for this!

from ai.

vashat avatar vashat commented on June 11, 2024

I am having the same problem with the Python SDK. If we managed to disable the content filter, would it be less safe than using OpenAI directly instead of Azure, or the same?

from ai.
