Spring AI: Streaming LLM Tokens with NDJSON in Spring Boot

Transcript:

So today we’ll learn how to stream tokens with Spring AI. LLMs produce responses token by token, in other words, in text chunks. Spring AI’s ChatClient lets us return a model response as a Flux of chunks. I will show you how to enable this functionality and how to process these chunks using NDJSON. The demo is available on GitHub; you can clone the project and follow along.

First, start the model. I run the model locally: Llama 3.2 3B. If you run the model locally as well, it’s just a matter of running Ollama with the Llama 3.2 3B model and serving it. If you prefer running the model in a Docker container, the project on GitHub contains a Compose file. If you are using another model, just spin it up accordingly.

We will have a multi-module project where the frontend resides in a separate module. This way, we separate server-side logic from client-side logic. As for dependencies, we need Spring WebFlux for reactive streaming and Vaadin for the frontend. I am using the Spring AI Ollama starter; if you are using another model, just add the corresponding starter.

Let’s first look at the core service. Here we have an AI chat service, and we use the stream method of Spring AI’s ChatClient API to get an asynchronous response. The chat client can stream tokens as a Flux of strings or as a Flux of chat responses. To keep things simple, we produce a Flux of strings. That’s basically it for this class. Streaming is enabled, and the interesting part starts in the controller where we process the incoming Flux.
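Here is a minimal sketch of such a service. The class and method names are my own, but the fluent `prompt().user(...).stream().content()` chain is the standard Spring AI ChatClient API for obtaining a Flux of strings:

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

@Service
public class AiChatService {

    private final ChatClient chatClient;

    // The ChatClient.Builder is auto-configured by the Spring AI starter.
    public AiChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Stream the model's answer as a Flux of raw text chunks.
    public Flux<String> streamAnswer(String prompt) {
        return chatClient.prompt()
                .user(prompt)
                .stream()   // streaming mode instead of a blocking call()
                .content(); // Flux<String>, one element per chunk
    }
}
```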

You can process the Flux of AI responses on the server using server-sent events (SSE) or NDJSON. The first approach is well established and works well in browsers. It uses a single long-lived HTTP response where each token is wrapped into a server-sent event. The problem with this approach is that you may end up with AI responses on the client side with missing spaces or no spaces at all. The root cause is that different models use different tokenizers. Many LLMs encode a word and its leading space as a single token, so the space often ends up in the next chunk. Since SSE delivers many small chunks, you are responsible for reconstructing the response correctly. If the SSE pipeline trims data, parses it poorly, or if proxies compress the events, the final response can be badly corrupted.

Newline-delimited JSON, or NDJSON, solves this problem. It is a streaming format where each line is a complete JSON object terminated by a newline character. Each chunk we receive can then be parsed as ordinary JSON.

Let’s look at how to implement this on the server side in the controller. We have a method annotated with PostMapping that returns the AI response to the client. In the simplest scenario, we specify application/x-ndjson as the produced media type in Spring WebFlux. This means each emitted item from the reactive stream is serialized to a JSON object and flushed one per line. Instead of returning a Flux of strings, we return a Flux of maps: a reactive stream of JSON objects where both keys and values are strings. The key represents the event type, and the value contains the actual response chunk. Spring serializes each map to JSON using Jackson and writes it immediately, followed by a newline. As a result, the client receives a stream of lines, each being a small JSON object with a single field.
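In code, the simplest version of that controller method might look like this. The /chat path matches the endpoint mentioned later in the client section; the "token" event key is an assumption, and error handling is added in the next step:

```java
import java.util.Map;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

    private final AiChatService chatService;

    public ChatController(AiChatService chatService) {
        this.chatService = chatService;
    }

    // Jackson serializes each Map as one JSON object; WebFlux writes it followed by a newline.
    @PostMapping(value = "/chat", produces = MediaType.APPLICATION_NDJSON_VALUE)
    public Flux<Map<String, String>> chat(@RequestBody String prompt) {
        return chatService.streamAnswer(prompt)
                .map(chunk -> Map.of("token", chunk)); // single-field event: {"token": "..."}
    }
}
```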

Now let’s address error handling and stream completion. It is a good practice to explicitly signal when the stream ends and to handle backpressure. Backpressure occurs when the rate of message production exceeds the rate at which the client can consume them.

After receiving the streamed response from the model, we map each chunk to an event structure. We define different event types and attach the corresponding text as the event content. For example, if the model emits three chunks such as “hello”, “world”, and “!”, the HTTP body will contain three token events. In addition, we define an error event carrying an error message and a done event that signals the end of the stream.
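On the wire, the single-field events described above would look something like this (illustrative, using the assumed event keys):

```
{"token":"hello"}
{"token":"world"}
{"token":"!"}
{"done":""}
```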

We then apply the onErrorResume operator. If any upstream error occurs, the stream does not break abruptly. Instead, it switches to a fallback emission: a single JSON line with type error and the error message, and then completes. This allows the client to handle terminal errors cleanly. After that, we use concatWith to append a final done event. This gives the client a deterministic end marker so it knows when to stop reading and clean up the UI state.

The limitRate operator allows us to control backpressure by requesting elements in batches instead of prefetching an unbounded number. In this case, we request up to 64 elements at a time. Optionally, we can add logging using doOnSubscribe and doFinally. This allows us to log when a subscriber connects and when the stream terminates, whether due to normal completion, an error, or client cancellation. This is useful for tracing sudden client disconnects.
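Putting the error handling, completion marker, backpressure, and logging together, the revised controller method might look roughly like this. The "error" and "done" keys mirror the "token" key from the earlier sketch, limitRate(64) is the batch size from the transcript, and the log messages are placeholders:

```java
private static final org.slf4j.Logger log =
        org.slf4j.LoggerFactory.getLogger(ChatController.class);

@PostMapping(value = "/chat", produces = MediaType.APPLICATION_NDJSON_VALUE)
public Flux<Map<String, String>> chat(@RequestBody String prompt) {
    return chatService.streamAnswer(prompt)
            .map(chunk -> Map.of("token", chunk))
            // On any upstream error, emit one terminal error event instead of breaking the stream.
            .onErrorResume(ex -> Flux.just(Map.of("error", String.valueOf(ex.getMessage()))))
            // Deterministic end marker so the client knows when to stop reading.
            .concatWith(Flux.just(Map.of("done", "")))
            // Request elements in batches of up to 64 instead of an unbounded prefetch.
            .limitRate(64)
            .doOnSubscribe(sub -> log.info("Client subscribed"))
            // Fires on completion, error, and cancellation alike.
            .doFinally(signal -> log.info("Stream terminated: {}", signal));
}
```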

With that, we are done with the server-side logic. Let’s move on to the client side. I will omit most of the UI element creation to reduce noise. The full code is available on GitHub, and I also have a separate video about integrating Vaadin with Spring Boot if you are interested.

In the main view class, we use WebClient, a reactive HTTP client, to call the backend chat endpoint. A Disposable field holds the current active subscription so it can be cancelled when needed, for example when a new request is made or when the view is detached. In the constructor, we build a WebClient with the base backend URL so we can later use relative URIs like /chat. We then wire the Ask button click handler to call the onAsk method. This method takes the input text and returns a Flux of streamed chunks, which is immediately passed to the startStream method for consumption.
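A condensed sketch of that view class, with hypothetical component names and an assumed backend address; onAsk and startStream are shown below:

```java
import com.vaadin.flow.component.button.Button;
import com.vaadin.flow.component.html.Div;
import com.vaadin.flow.component.orderedlayout.VerticalLayout;
import com.vaadin.flow.component.textfield.TextField;
import com.vaadin.flow.router.Route;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.Disposable;

@Route("")
public class MainView extends VerticalLayout {

    private final WebClient webClient;
    private Disposable subscription; // currently active stream, if any

    // UI components (layout and styling omitted, as in the video)
    private final TextField promptField = new TextField("Prompt");
    private final Button askButton = new Button("Ask");
    private final Div output = new Div();

    public MainView() {
        // Base URL so later calls can use relative URIs like /chat (port is an assumption).
        this.webClient = WebClient.builder()
                .baseUrl("http://localhost:8080")
                .build();

        askButton.addClickListener(e -> startStream(onAsk(promptField.getValue())));
        add(promptField, askButton, output);
    }
}
```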

In the onAsk method, we first validate the input. If it is null, we show a notification and return an empty stream. Then we use WebClient to build a POST request, send the prompt as plain text, and accept NDJSON. After executing the request with retrieve, we parse the NDJSON response using bodyToFlux as a Flux of maps, one map per line.
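The request half of onAsk could look like this, continuing the MainView sketch (fetchEvents is a hypothetical helper name; imports such as Notification, MediaType, and ParameterizedTypeReference are omitted for brevity):

```java
private Flux<Map<String, String>> fetchEvents(String prompt) {
    if (prompt == null || prompt.isBlank()) {  // blank check added for safety
        Notification.show("Please enter a prompt");
        return Flux.empty();
    }
    return webClient.post()
            .uri("/chat")
            .contentType(MediaType.TEXT_PLAIN)     // the prompt is sent as plain text
            .accept(MediaType.APPLICATION_NDJSON)  // ask the server for NDJSON
            .bodyValue(prompt)
            .retrieve()
            // One Map per NDJSON line, e.g. {"token": "hello"}
            .bodyToFlux(new ParameterizedTypeReference<Map<String, String>>() {});
}
```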

Next, we transform this Flux of maps into a Flux of strings containing only the text chunks. We use Reactor’s handle operator, which allows mapping, filtering, and terminating in one place. For token events, we extract the content and emit it downstream. For error events, we convert them into terminal error signals. For done events, we complete the stream. Unknown event types are ignored to remain forward compatible. The result is a Flux of strings representing only the streamed text chunks.
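The transformation with handle could then look like this, completing the onAsk method from above (event keys match the server-side assumptions):

```java
private Flux<String> onAsk(String prompt) {
    return fetchEvents(prompt).handle((event, sink) -> {
        if (event.containsKey("token")) {
            sink.next(event.get("token"));          // emit the text chunk downstream
        } else if (event.containsKey("error")) {
            sink.error(new RuntimeException(event.get("error"))); // terminal error signal
        } else if (event.containsKey("done")) {
            sink.complete();                        // clean end of stream
        }
        // Unknown event types fall through and are ignored (forward compatibility).
    });
}
```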

This Flux is passed to the startStream method. There, we cancel any previous active stream to avoid overlapping output, clear the output area, and disable the Ask button. Using UI.getCurrent, we obtain the current Vaadin UI so we can safely update the UI from reactive callbacks. We subscribe to the stream and store the subscription. On errors, we schedule a UI notification. In doFinally, regardless of whether the stream completed normally, errored, or was cancelled, we reset the UI state and clear the active subscription. Each text chunk is appended to the output using UI.access to ensure updates happen on the correct UI thread.
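A sketch of startStream under those assumptions, where output and askButton are the hypothetical components from the earlier snippet:

```java
private void startStream(Flux<String> tokens) {
    if (subscription != null) {
        subscription.dispose();          // cancel any previous active stream
    }
    output.setText("");
    askButton.setEnabled(false);

    UI ui = UI.getCurrent();             // capture the UI for use in reactive callbacks
    subscription = tokens
            // Runs on completion, error, and cancellation alike: reset the UI state.
            .doFinally(signal -> ui.access(() -> {
                askButton.setEnabled(true);
                subscription = null;
            }))
            .subscribe(
                    // UI.access ensures updates happen with the UI lock held.
                    chunk -> ui.access(() -> output.setText(output.getText() + chunk)),
                    error -> ui.access(() ->
                            Notification.show("Error: " + error.getMessage())));
}
```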

Finally, we add the Push annotation to the class implementing AppShellConfigurator, usually the main frontend class. This enables server-to-client push so UI updates triggered from the reactive stream are flushed to the browser immediately. As a result, each chunk appears on the screen in real time, producing a smoothly streaming AI response.
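For example, on the frontend module’s application class (class name assumed):

```java
import com.vaadin.flow.component.page.Push;
import com.vaadin.flow.server.AppShellConfigurator;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@Push // enables server-to-client push; streamed chunks are flushed to the browser immediately
@SpringBootApplication
public class Application implements AppShellConfigurator {

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
```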

Thank you for watching. If you want more AI-related videos, leave a comment, like the video, subscribe to the channel, and until next time.

Summary

This video explains how to stream LLM responses token by token using Spring AI and reactive programming. It shows why traditional server-sent events can cause formatting issues and how NDJSON provides a more reliable streaming format. The server-side implementation uses Spring WebFlux to stream structured JSON events with proper error handling, backpressure control, and a clear end-of-stream signal. On the client side, a reactive WebClient and a Vaadin UI consume the stream and render tokens in real time. Together, this approach enables smooth, robust, and user-friendly AI response streaming in Spring applications.

About Catherine

Java developer passionate about Spring Boot. Writer. Developer Advocate at BellSoft.
