Spring AI: Streaming LLM Tokens with NDJSON in Spring Boot
Transcript:
So today we’ll learn how to stream tokens with Spring AI. LLMs produce responses token by token, in other words, in small text chunks. The Spring AI chat client lets us return a Flux of chunks of a model response. I will show you how to enable this functionality and how to process these chunks using NDJSON. The demo is available on GitHub; you can clone the project and follow along.
First, start the model. I run the model locally, the 3.2 3B variant. If you run the model locally as well, it’s just a matter of running Ollama with the 3.2 3B model and serving it. If you prefer running the model in a Docker container, the project on GitHub contains a Docker Compose file. If you are using another model, just spin it up accordingly.
We will have a multi-module project where the frontend resides in a separate module. This way, we separate server-side logic and client-side logic. As for dependencies, we need Spring Boot WebFlux for reactive streaming. We also need Vaadin for the frontend. I am using the Spring Ollama starter, and if you are using another model, just add the corresponding starter.
Let’s first look at the core service. Here we have an AI chat service, and we use the stream method of Spring AI’s ChatClient API to get an asynchronous response. The chat client can stream tokens as a Flux of strings or as a Flux of chat responses. To keep things simple, we produce a Flux of strings. That’s basically it for this class. Streaming is enabled, and the interesting part starts in the controller where we process the incoming Flux.
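As a rough sketch, such a service can look like the following; the class and method names (AiChatService, streamAnswer) are my own placeholders rather than necessarily those used in the GitHub project:

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

@Service
public class AiChatService {

    private final ChatClient chatClient;

    // Spring AI auto-configures a ChatClient.Builder for the configured model (here via the Ollama starter).
    public AiChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // stream() is the asynchronous counterpart of call(); content() narrows the result to plain text chunks.
    // Using stream().chatResponse() instead would yield a Flux of ChatResponse objects.
    public Flux<String> streamAnswer(String prompt) {
        return chatClient.prompt()
                .user(prompt)
                .stream()
                .content();
    }
}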
You can process the Flux of AI responses on the server using server-sent events (SSE) or NDJSON. The first approach is well established and works well in browsers. It uses a single long-lived HTTP response where each token is wrapped into a server-sent event. The problem with this approach is that you may end up with AI responses on the client side that have missing spaces or no spaces at all. The root cause is that different models use different tokenizers. Many LLMs encode a word and its leading space as a single token, so the space often ends up in the next chunk. Since SSE delivers many small chunks, you are responsible for reconstructing the response correctly. If the SSE pipeline trims data, parses it poorly, or if proxies compress the events, the final response can be badly corrupted.
Newline-delimited JSON, or NDJSON, solves this problem. It is a streaming format where each line is a complete JSON object and lines are separated by a newline character. Each token we receive can be easily parsed as JSON.
Let’s look at how to implement this on the server side in the controller. We have a method annotated with PostMapping that returns the AI response to the client. In the simplest scenario, we specify application/x-ndjson as the produced media type in Spring WebFlux. This means each emitted item from the reactive stream is serialized to a JSON object and flushed as its own line. Instead of returning a Flux of strings, we return a Flux of maps, a reactive stream of JSON objects where both the key and the value are strings. The key represents the event type, and the value contains the actual response chunk. Spring serializes each map to JSON using Jackson and writes it immediately, followed by a newline. As a result, the client receives a stream of lines, each being a small JSON object with a single field.
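A minimal version of such an endpoint could look roughly like this; the controller name and the "token" key are placeholders, and the /chat path matches the one used later on the client:

import java.util.Map;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

    private final AiChatService chatService;

    public ChatController(AiChatService chatService) {
        this.chatService = chatService;
    }

    // Each emitted Map is serialized by Jackson to one JSON object and written as its own NDJSON line.
    @PostMapping(value = "/chat", produces = MediaType.APPLICATION_NDJSON_VALUE)
    public Flux<Map<String, String>> chat(@RequestBody String prompt) {
        return chatService.streamAnswer(prompt)
                .map(chunk -> Map.of("token", chunk));
    }
}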
Now let’s address error handling and stream completion. It is a good practice to explicitly signal when the stream ends and to handle backpressure. Backpressure occurs when the rate of message production exceeds the rate at which the client can consume them.
After receiving the streamed response from the model, we map each chunk to an event structure. We define different event types and associate content with each piece of text. For example, if the model emits three chunks such as “hello”, “world”, and “!”, the HTTP body will contain three token events. In addition, we define an error event with an error message and a done event that signals the end of the stream.
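On the wire, those three chunks followed by the end marker would look roughly like this, one small JSON object per line (the exact key names depend on how you model the events; these match the sketches in this section):

{"token":"hello"}
{"token":"world"}
{"token":"!"}
{"done":""}

An upstream failure would instead end the stream with a single line such as {"error":"connection reset"}.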
We then apply the onErrorResume operator. If any upstream error occurs, the stream does not break abruptly. Instead, it switches to a fallback emission: a single JSON line with type error and the error message, and then completes. This allows the client to handle terminal errors cleanly. After that, we use concatWith to append a final done event. This gives the client a deterministic end marker so it knows when to stop reading and clean up the UI state.
The limitRate operator allows us to control backpressure by requesting elements in batches instead of prefetching an unbounded number. In this case, we request up to 64 elements at a time. Optionally, we can add logging using doOnSubscribe and doFinally. This allows us to log when a subscriber connects and when the stream terminates, whether due to normal completion, an error, or client cancellation. This is useful for tracing sudden client disconnects.
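Put together, the endpoint method from the earlier sketch might then grow into something like this; it assumes the same controller class plus an SLF4J logger field named log, and the batch size and key names remain illustrative:

// Revised version of the chat endpoint from the sketch above; assumes a logger field such as:
// private static final Logger log = LoggerFactory.getLogger(ChatController.class);
@PostMapping(value = "/chat", produces = MediaType.APPLICATION_NDJSON_VALUE)
public Flux<Map<String, String>> chat(@RequestBody String prompt) {
    return chatService.streamAnswer(prompt)
            // wrap every model chunk in a "token" event
            .map(chunk -> Map.of("token", chunk))
            // on any upstream error, emit one terminal "error" event and complete instead of failing abruptly
            .onErrorResume(ex -> Flux.just(Map.of("error", String.valueOf(ex.getMessage()))))
            // append an explicit "done" event as a deterministic end marker
            .concatWith(Flux.just(Map.of("done", "")))
            // request upstream elements in batches of up to 64 instead of unbounded prefetching
            .limitRate(64)
            // optional tracing of subscription and termination (completion, error, or cancellation)
            .doOnSubscribe(subscription -> log.info("chat stream subscribed"))
            .doFinally(signal -> log.info("chat stream terminated with signal {}", signal));
}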
With that, we are done with the server-side logic. Let’s move on to the client side. I will omit most of the UI element creation to reduce noise. The full code is available on GitHub, and I also have a separate video about integrating Vaadin with Spring Boot if you are interested.
In the main view class, we use WebClient, a reactive HTTP client, to call the backend chat endpoint. A Disposable field holds the current active subscription so it can be cancelled when needed, for example when a new request is made or when the view is detached. In the constructor, we build a WebClient with the base backend URL so we can later use relative URIs like /chat. We then wire the Ask button click handler to call the onAsk method. This method takes the input text and returns a Flux of streamed chunks, which is immediately passed to the startStream method for consumption.
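A trimmed-down sketch of that view class could look like the following; the route, the component choices, the output element, and the backend port are assumptions, and most UI setup is omitted as mentioned:

import com.vaadin.flow.component.button.Button;
import com.vaadin.flow.component.html.Div;
import com.vaadin.flow.component.orderedlayout.VerticalLayout;
import com.vaadin.flow.component.textfield.TextField;
import com.vaadin.flow.router.Route;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.Disposable;

@Route("")
public class MainView extends VerticalLayout {

    private final WebClient webClient;
    private Disposable activeStream;   // current subscription; cancelled on new requests or when the view goes away

    private final TextField promptField = new TextField("Prompt");
    private final Button askButton = new Button("Ask");
    private final Div output = new Div();

    public MainView() {
        // Base URL of the backend module; adjust host and port to your setup.
        this.webClient = WebClient.builder().baseUrl("http://localhost:8080").build();
        // onAsk(...) and startStream(...) are shown in the following sketches.
        askButton.addClickListener(event -> startStream(onAsk(promptField.getValue())));
        add(promptField, askButton, output);
    }
}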
In the onAsk method, we first validate the input. If it is null, we show a notification and return an empty stream. Then we use WebClient to build a POST request, send the prompt as plain text, and accept NDJSON. After executing the request with retrieve, we parse the NDJSON response using bodyToFlux as a Flux of maps, one map per line.
Next, we transform this Flux of maps into a Flux of strings containing only the text chunks. We use Reactor’s handle operator, which allows mapping, filtering, and terminating in one place. For token events, we extract the content and emit it downstream. For error events, we convert them into terminal error signals. For done events, we complete the stream. Unknown event types are ignored to remain forward compatible. The result is a Flux of strings representing only the streamed text chunks.
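The onAsk method from that sketch could then look roughly like this; it additionally needs imports for java.util.Map, org.springframework.core.ParameterizedTypeReference, org.springframework.http.MediaType, com.vaadin.flow.component.notification.Notification, and reactor.core.publisher.Flux, and the event keys mirror the server-side sketch:

private Flux<String> onAsk(String prompt) {
    if (prompt == null) {
        Notification.show("Please enter a prompt");
        return Flux.empty();
    }
    return webClient.post()
            .uri("/chat")
            .contentType(MediaType.TEXT_PLAIN)
            .accept(MediaType.APPLICATION_NDJSON)
            .bodyValue(prompt)
            .retrieve()
            // one Map per NDJSON line
            .bodyToFlux(new ParameterizedTypeReference<Map<String, String>>() {})
            // handle() lets us map, filter, and terminate in a single operator
            .<String>handle((event, sink) -> {
                if (event.containsKey("token")) {
                    sink.next(event.get("token"));                        // emit the text chunk downstream
                } else if (event.containsKey("error")) {
                    sink.error(new RuntimeException(event.get("error"))); // terminal error signal
                } else if (event.containsKey("done")) {
                    sink.complete();                                      // explicit end of the stream
                }
                // unknown event types are simply ignored for forward compatibility
            });
}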
This Flux is passed to the startStream method. There, we cancel any previous active stream to avoid overlapping output, clear the output area, and disable the Ask button. Using UI.getCurrent, we obtain the current Vaadin UI so we can safely update the UI from reactive callbacks. We subscribe to the stream and store the subscription. On errors, we schedule a UI notification. In doFinally, regardless of whether the stream completed normally, errored, or was cancelled, we reset the UI state and clear the active subscription. Each text chunk is appended to the output using UI.access to ensure updates happen on the correct UI thread.
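And the corresponding startStream method, again as an illustrative sketch using the fields introduced above (it also needs an import for com.vaadin.flow.component.UI):

private void startStream(Flux<String> chunks) {
    // cancel any previous active stream so outputs never overlap
    if (activeStream != null) {
        activeStream.dispose();
    }
    output.setText("");
    askButton.setEnabled(false);

    // grab the current Vaadin UI so updates can be pushed safely from reactive callbacks
    UI ui = UI.getCurrent();
    activeStream = chunks
            .doFinally(signal -> ui.access(() -> {
                // restore the UI state no matter how the stream ended: completion, error, or cancellation
                askButton.setEnabled(true);
                activeStream = null;
            }))
            .subscribe(
                    chunk -> ui.access(() -> output.setText(output.getText() + chunk)),
                    error -> ui.access(() -> Notification.show("Stream failed: " + error.getMessage())));
}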
Finally, we add the Push annotation to the class implementing AppShellConfigurator, usually the main frontend class. This enables server-to-client push so UI updates triggered from the reactive stream are flushed to the browser immediately. As a result, each chunk appears on the screen in real time, producing a smoothly streaming AI response.
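In a typical multi-module setup, that annotation ends up on the frontend module’s application class, roughly like this (the class name here is just an example):

import com.vaadin.flow.component.page.AppShellConfigurator;
import com.vaadin.flow.component.page.Push;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@Push
@SpringBootApplication
public class FrontendApplication implements AppShellConfigurator {

    public static void main(String[] args) {
        SpringApplication.run(FrontendApplication.class, args);
    }
}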
Thank you for watching. If you want more AI-related videos, leave a comment, like the video, subscribe to the channel, and until next time.





