I was trying to stream LLM responses through a FastAPI backend to show real-time updates, and ended up spending more time on it than I anticipated. I'm writing this up in case someone else runs into a similar problem and wants to avoid the same pitfalls.

What is HTTP streaming?
HTTP streaming involves sending data in small, sequential chunks over a standard HTTP response, allowing the client to receive updates in real time. One common approach is server-sent events (SSE), where the response has a `Content-Type` of `text/event-stream`. In this pattern, data is often formatted as JSON strings and sent as plain text using the SSE message format.
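On the wire, each SSE event is one or more `data:` lines followed by a blank line that terminates the event. A JSON payload might look like this:

```
data: {"message": "Hello"}

data: {"message": "world"}

```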
FastAPI endpoint for streaming text data as SSE
Here’s a FastAPI endpoint that streams text data (mocking the LLM response):
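(A minimal sketch; the one-second delay and the `"message"` payload are placeholders standing in for a real LLM token stream.)

```python
import asyncio
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()


async def number_generator():
    """Yield SSE-formatted JSON messages one at a time, mocking an LLM stream."""
    for i in range(1, 11):
        payload = json.dumps({"message": f"chunk {i}"})
        # Each SSE event is "data: <payload>" followed by a blank line.
        yield f"data: {payload}\n\n"
        await asyncio.sleep(1)  # simulate the delay between LLM tokens


@app.get("/stream")
async def stream():
    return StreamingResponse(
        number_generator(),
        media_type="text/event-stream",
        headers={
            # Tell reverse proxies such as Nginx not to buffer the response.
            "X-Accel-Buffering": "no",
        },
    )
```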
Note: the header `"X-Accel-Buffering": "no"` disables response buffering by reverse proxies like Nginx. Without it, the proxy might wait to collect a large chunk of data before sending anything to the client, which defeats the purpose of real-time streaming.
This endpoint uses `StreamingResponse` to continuously send data chunks in real time. The `number_generator` function creates a live data stream by yielding JSON-formatted messages one at a time.
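To try it out locally, assuming the code above lives in `main.py`, you can run the app with Uvicorn and watch the events arrive one at a time with curl (the `-N` flag disables curl's output buffering):

```bash
uvicorn main:app --reload
curl -N http://localhost:8000/stream
```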
Consuming the stream on the client
The client can consume this stream using either `EventSource` or by directly reading the response body with a `ReadableStream`. Below is an example using the Fetch API and a `ReadableStream` to handle SSE-formatted messages.
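(A sketch of the client side; the `updateMessages` callback and the `message` field of the payload are assumptions about how the chunks get wired into your UI.)

```javascript
async function consumeStream(updateMessages) {
  const response = await fetch("/stream");
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // Decode the binary chunk and append it to the buffer.
    buffer += decoder.decode(value, { stream: true });

    // SSE messages are separated by a blank line ("\n\n").
    const parts = buffer.split("\n\n");
    buffer = parts.pop(); // keep any incomplete message for the next chunk

    for (const part of parts) {
      if (part.startsWith("data: ")) {
        const payload = JSON.parse(part.slice("data: ".length));
        updateMessages(payload.message);
      }
    }
  }
}
```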
The code above fetches the `/stream` endpoint and reads the response body using a `ReadableStream`. Each chunk is decoded, split into individual SSE messages, and parsed as JSON before updating the message stream.
Streaming data from an LLM in real time can feel a bit tricky at first, but once the streaming backend and the readable stream on the frontend click together, it becomes a powerful pattern. Hopefully, this guide makes things a little easier if you’re setting up something similar.