Spring AI: Streaming LLM Tokens with NDJSON in Spring Boot

 

Transcript:

So today we’ll learn how to stream tokens with Spring AI. LLMs produce responses token by token, in other words, in text chunks. The Spring AI chat client enables us to return a Flux of chunks of a model response. I will show you how to enable this functionality and how to process these chunks using NDJSON. The demo is available on GitHub; you can clone the project and follow along.

First, start the model. I run the model locally: Llama 3.2 3B. If you run the model locally as well, it’s just a matter of running Ollama with the 3.2 3B model and serving it. If you prefer running the model in a Docker container, the project on GitHub contains a Compose file. If you are using another model, just spin it up accordingly.
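Assuming a local Ollama installation and the `llama3.2:3b` model tag (both assumptions; substitute whichever model you actually use), the local setup boils down to two commands:

```shell
# Download the model weights (tag assumed; pick the model you actually use)
ollama pull llama3.2:3b

# Start the Ollama server, which listens on http://localhost:11434 by default
ollama serve
```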

We will have a multi-module project where the frontend resides in a separate module. This way, we separate server-side logic and client-side logic. As for dependencies, we need Spring Boot WebFlux for reactive streaming. We also need Vaadin for the frontend. I am using the Spring Ollama starter, and if you are using another model, just add the corresponding starter.
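As a sketch, the relevant Maven dependencies might look like this (artifact IDs taken from the Spring Boot, Spring AI, and Vaadin starters; versions and BOM configuration omitted):

```xml
<!-- Reactive streaming on the server side -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
<!-- Spring AI starter for Ollama; swap for the starter matching your model provider -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>
<!-- Vaadin for the frontend module -->
<dependency>
    <groupId>com.vaadin</groupId>
    <artifactId>vaadin-spring-boot-starter</artifactId>
</dependency>
```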

Let’s first look at the core service. Here we have an AI chat service, and we use the stream method of Spring AI’s ChatClient API to get an asynchronous response. The chat client can stream tokens as a Flux of strings or as a Flux of chat responses. To keep things simple, we produce a Flux of strings. That’s basically it for this class. Streaming is enabled, and the interesting part starts in the controller where we process the incoming Flux.
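A minimal sketch of such a service, assuming Spring AI's `ChatClient` fluent API (the class and method names here are illustrative, not the exact code from the repository):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

@Service
public class AiChatService {

    private final ChatClient chatClient;

    public AiChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // stream() switches the call to asynchronous mode;
    // content() yields the response as a Flux<String> of text chunks
    public Flux<String> streamAnswer(String prompt) {
        return chatClient.prompt()
                .user(prompt)
                .stream()
                .content();
    }
}
```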

You can process the Flux of AI responses on the server using server-sent events (SSE) or NDJSON. The first approach is well established and works well for browsers. It uses a single long-lived HTTP response where each token is wrapped into a server-sent event. The problem with this approach is that you may end up with AI responses on the client side with missing spaces or no spaces at all. The root cause is that different models use different tokenizers. Many LLMs encode a word and its leading space as a single token, so the space often ends up in the next chunk. Since SSE uses many small chunks, you are responsible for reconstructing the response correctly. If the SSE pipeline trims data, parses it poorly, or if proxies compress the event stream, the final response can be badly corrupted.

Newline-delimited JSON, or NDJSON, solves this problem. It is a streaming format where each line is an explicit JSON object separated by a newline character. Each token we receive can be easily parsed as JSON.

Let’s look at how to implement this on the server side in the controller. We have a method annotated with PostMapping that returns the AI response to the client. In the simplest scenario, we specify application/x-ndjson as the produced media type in Spring WebFlux. This means each emitted item from the reactive stream is serialized to a JSON object and flushed one per line. Instead of returning a Flux of strings, we return a Flux of maps: a reactive stream of JSON objects where both the key and the value are strings. The key represents the event type, and the value contains the actual response chunk. Spring serializes each map to JSON using Jackson and writes it immediately, followed by a newline. As a result, the client receives a stream of lines, each being a small JSON object with a single field.
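A hypothetical sketch of this simplest controller variant (the endpoint path and the service name are assumptions; the service corresponds to the AI chat service from the transcript):

```java
import java.util.Map;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

    private final AiChatService aiChatService;

    public ChatController(AiChatService aiChatService) {
        this.aiChatService = aiChatService;
    }

    // produces = "application/x-ndjson": Spring writes each emitted Map
    // as one JSON object per line and flushes it immediately
    @PostMapping(value = "/chat", produces = MediaType.APPLICATION_NDJSON_VALUE)
    public Flux<Map<String, String>> chat(@RequestBody String prompt) {
        return aiChatService.streamAnswer(prompt)
                // single-field JSON object per chunk, e.g. {"token":"hello"}
                .map(chunk -> Map.of("token", chunk));
    }
}
```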

Now let’s address error handling and stream completion. It is a good practice to explicitly signal when the stream ends and to handle backpressure. Backpressure occurs when the rate of message production exceeds the rate at which the client can consume them.

After receiving the streamed response from the model, we map each chunk to an event structure. We define different event types and associate content with each piece of text. For example, if the model emits three chunks such as “hello”, “world”, and “!”, the HTTP body will contain three token events. In addition, we define an error event with an error message and a done event that signals the end of the stream.

We then apply the onErrorResume operator. If any upstream error occurs, the stream does not break abruptly. Instead, it switches to a fallback emission: a single JSON line with type error and the error message, and then completes. This allows the client to handle terminal errors cleanly. After that, we use concatWith to append a final done event. This gives the client a deterministic end marker so it knows when to stop reading and clean up the UI state.
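The event mapping, the error fallback, and the final done marker can be sketched together like this (a minimal sketch using plain Reactor; the single-field event shape is an assumption based on the description above):

```java
import java.util.Map;
import reactor.core.publisher.Flux;

public class EventStreamSketch {

    public static Flux<Map<String, String>> toEvents(Flux<String> chunks) {
        return chunks
                // wrap each text chunk into a token event, e.g. {"token":"hello"}
                .map(chunk -> Map.of("token", chunk))
                // on any upstream error, emit one error event instead of breaking the stream
                .onErrorResume(e -> Flux.just(Map.of("error", String.valueOf(e.getMessage()))))
                // always append a deterministic end-of-stream marker
                .concatWith(Flux.just(Map.of("done", "")));
    }
}
```

With three chunks such as "hello", " world", and "!", this produces three token events followed by one done event, each serialized as its own NDJSON line.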

The limitRate operator allows us to control backpressure by requesting elements in batches instead of prefetching an unbounded number. In this case, we request up to 64 elements at a time. Optionally, we can add logging using doOnSubscribe and doFinally. This allows us to log when a subscriber connects and when the stream terminates, whether due to normal completion, an error, or client cancellation. This is useful for tracing sudden client disconnects.
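These operators can be sketched as one small pipeline step (a sketch with plain Reactor; the logging messages are illustrative):

```java
import java.util.Map;
import reactor.core.publisher.Flux;

public class BackpressureSketch {

    public static Flux<Map<String, String>> instrument(Flux<Map<String, String>> events) {
        return events
                // request upstream elements in batches of 64 instead of an unbounded prefetch
                .limitRate(64)
                .doOnSubscribe(sub -> System.out.println("client subscribed"))
                // signal is ON_COMPLETE, ON_ERROR, or CANCEL (e.g. a sudden client disconnect)
                .doFinally(signal -> System.out.println("stream terminated: " + signal));
    }
}
```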

With that, we are done with the server-side logic. Let’s move on to the client side. I will omit most of the UI element creation to reduce noise. The full code is available on GitHub, and I also have a separate video about integrating Vaadin with Spring Boot if you are interested.

In the main view class, we use WebClient, a reactive HTTP client, to call the backend chat endpoint. A Disposable field holds the current active subscription so it can be cancelled when needed, for example when a new request is made or when the view is detached. In the constructor, we build a WebClient with the base backend URL so we can later use relative URIs like /chat. We then wire the Ask button click handler to call the onAsk method. This method takes the input text and returns a Flux of streamed chunks, which is immediately passed to the startStream method for consumption.

In the onAsk method, we first validate the input. If it is null, we show a notification and return an empty stream. Then we use WebClient to build a POST request, send the prompt as plain text, and accept NDJSON. After executing the request with retrieve, we parse the NDJSON response using bodyToFlux as a Flux of maps, one map per line.
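The request itself might be sketched as follows (the /chat path, base URL, and class name are assumptions matching the transcript; input validation omitted for brevity):

```java
import java.util.Map;
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class ChatRequestSketch {

    private final WebClient client = WebClient.builder()
            .baseUrl("http://localhost:8080") // base backend URL; adjust to your setup
            .build();

    Flux<Map<String, String>> ask(String prompt) {
        return client.post()
                .uri("/chat")
                .contentType(MediaType.TEXT_PLAIN)    // send the prompt as plain text
                .accept(MediaType.APPLICATION_NDJSON) // ask for newline-delimited JSON
                .bodyValue(prompt)
                .retrieve()
                // one Map per NDJSON line, e.g. {"token":"hello"}
                .bodyToFlux(new ParameterizedTypeReference<Map<String, String>>() {});
    }
}
```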

Next, we transform this Flux of maps into a Flux of strings containing only the text chunks. We use Reactor’s handle operator, which allows mapping, filtering, and terminating in one place. For token events, we extract the content and emit it downstream. For error events, we convert them into terminal error signals. For done events, we complete the stream. Unknown event types are ignored to remain forward compatible. The result is a Flux of strings representing only the streamed text chunks.
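The handle step can be sketched like this (a sketch with plain Reactor; the single-field event shape is an assumption based on the server-side description):

```java
import java.util.Map;
import reactor.core.publisher.Flux;

public class EventToTextSketch {

    public static Flux<String> extractText(Flux<Map<String, String>> events) {
        return events.handle((event, sink) -> {
            if (event.containsKey("token")) {
                sink.next(event.get("token"));       // emit the text chunk downstream
            } else if (event.containsKey("error")) {
                sink.error(new IllegalStateException(event.get("error"))); // terminal error signal
            } else if (event.containsKey("done")) {
                sink.complete();                     // deterministic end of stream
            }
            // unknown event types fall through and are ignored for forward compatibility
        });
    }
}
```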

This Flux is passed to the startStream method. There, we cancel any previous active stream to avoid overlapping output, clear the output area, and disable the Ask button. Using UI.getCurrent, we obtain the current Vaadin UI so we can safely update the UI from reactive callbacks. We subscribe to the stream and store the subscription. On errors, we schedule a UI notification. In doFinally, regardless of whether the stream completed normally, errored, or was cancelled, we reset the UI state and clear the active subscription. Each text chunk is appended to the output using UI.access to ensure updates happen on the correct UI thread.
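A sketch of that subscription management in the view (field and component names are illustrative, not the exact code from the repository):

```java
import com.vaadin.flow.component.UI;
import com.vaadin.flow.component.button.Button;
import com.vaadin.flow.component.html.Paragraph;
import com.vaadin.flow.component.notification.Notification;
import reactor.core.Disposable;
import reactor.core.publisher.Flux;

public class StreamConsumerSketch {

    private final Paragraph output = new Paragraph();
    private final Button ask = new Button("Ask");
    private Disposable activeStream; // current subscription, cancelled before a new request

    void startStream(Flux<String> chunks) {
        if (activeStream != null) {
            activeStream.dispose();   // avoid overlapping output from a previous stream
        }
        output.setText("");
        ask.setEnabled(false);
        UI ui = UI.getCurrent();      // captured so reactive callbacks can update this UI safely
        activeStream = chunks
                .doFinally(signal -> ui.access(() -> {
                    ask.setEnabled(true); // reset UI state on complete, error, or cancel
                    activeStream = null;
                }))
                .subscribe(
                        chunk -> ui.access(() -> output.setText(output.getText() + chunk)),
                        error -> ui.access(() -> Notification.show(error.getMessage())));
    }
}
```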

Finally, we add the Push annotation to the class implementing AppShellConfigurator, usually the main frontend class. This enables server-to-client push so UI updates triggered from the reactive stream are flushed to the browser immediately. As a result, each chunk appears on the screen in real time, producing a smoothly streaming AI response.
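The annotation itself is a one-liner on the app shell class (the class name here is illustrative):

```java
import com.vaadin.flow.component.page.Push;
import com.vaadin.flow.server.AppShellConfigurator;

// @Push enables server-to-client push (WebSocket by default),
// so updates made via UI.access reach the browser immediately
@Push
public class Application implements AppShellConfigurator {
}
```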

Thank you for watching. If you want more AI-related videos, leave a comment, like the video, subscribe to the channel, and until next time.

Summary

This video explains how to stream LLM responses token by token using Spring AI and reactive programming. It shows why traditional server-sent events can cause formatting issues and how NDJSON provides a more reliable streaming format. The server-side implementation uses Spring WebFlux to stream structured JSON events with proper error handling, backpressure control, and a clear end-of-stream signal. On the client side, a reactive WebClient and Vaadin UI consume the stream and render tokens in real time. Together, this approach enables smooth, robust, and user-friendly AI response streaming in Spring applications.

About Catherine

Java developer passionate about Spring Boot. Writer. Developer Advocate at BellSoft
