I was trying to stream LLM responses through a FastAPI backend to show real-time updates, and ended up spending more time on it than I anticipated. I'm writing this up in case someone else runs into a similar problem and wants to avoid the same pitfalls.

What is HTTP streaming?
HTTP streaming involves sending data in small, sequential chunks over a standard HTTP response, allowing the client to receive updates in real time. One common approach is server-sent events (SSE), where the response has a `Content-Type` of `text/event-stream`. In this pattern, data is often formatted as JSON strings and sent as plain text using the SSE message format.
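On the wire, each SSE event is one or more `data:` lines followed by a blank line that terminates the event. A JSON payload might look like this:

```
data: {"message": "Hello"}

data: {"message": "world"}

```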
FastAPI endpoint for streaming text data as SSE
Here’s a FastAPI endpoint that streams text data (mocking the LLM response):
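(A minimal sketch; the one-second delay and the `"message"` payload are placeholders standing in for a real LLM token stream.)

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def number_generator():
    """Yield SSE-formatted JSON messages one at a time, mocking an LLM stream."""
    for i in range(1, 11):
        payload = json.dumps({"message": f"chunk {i}"})
        # Each SSE event is "data: <payload>" followed by a blank line.
        yield f"data: {payload}\n\n"
        await asyncio.sleep(1)  # simulate the delay between LLM tokens


@app.get("/stream")
async def stream():
    return StreamingResponse(
        number_generator(),
        media_type="text/event-stream",
        headers={
            # Tell reverse proxies such as Nginx not to buffer the response.
            "X-Accel-Buffering": "no",
        },
    )
```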
Note: the header `"X-Accel-Buffering": "no"` disables response buffering by reverse proxies like Nginx. Without it, the proxy might wait to collect a large chunk of data before sending anything to the client, which defeats the purpose of real-time streaming.
This endpoint uses `StreamingResponse` to continuously send data chunks in real time. The `number_generator` function creates a live data stream by yielding JSON-formatted messages one at a time.
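To try it out locally, assuming the code above lives in `main.py`, you can run the app with Uvicorn and watch the events arrive one at a time with curl (the `-N` flag disables curl's output buffering):

```bash
uvicorn main:app --reload
curl -N http://localhost:8000/stream
```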
Consuming the stream on the client
The client can consume this stream using either `EventSource` or by directly reading the response body with a `ReadableStream`. Below is an example using the Fetch API and a `ReadableStream` to handle SSE-formatted messages.
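(A sketch of the client side; the `updateMessages` callback and the `message` field of the payload are assumptions about how the chunks get wired into your UI.)

```javascript
async function consumeStream(updateMessages) {
  const response = await fetch("/stream");
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Decode the binary chunk and append it to the buffer.
    buffer += decoder.decode(value, { stream: true });

    // SSE messages are separated by a blank line ("\n\n").
    const parts = buffer.split("\n\n");
    buffer = parts.pop(); // keep any incomplete message for the next chunk

    for (const part of parts) {
      if (part.startsWith("data: ")) {
        const payload = JSON.parse(part.slice("data: ".length));
        updateMessages(payload.message);
      }
    }
  }
}
```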
The code above fetches the `/stream` endpoint and reads the response body using a `ReadableStream`. Each chunk is decoded, split into individual SSE messages, and parsed as JSON before updating the message stream.
Streaming data from an LLM in real time can feel a bit tricky at first, but once the streaming backend and the readable stream on the frontend click together, it becomes a powerful pattern. Hopefully, this guide makes things a little easier if you’re setting up something similar.