Spring AI: Streaming LLM Tokens with NDJSON in Spring Boot

 

Transcript:

So today we’ll learn how to stream tokens with Spring AI. LLMs produce responses token by token, in other words, in text chunks. The Spring AI chat client enables us to return a Flux of chunks of a model response. I will show you how to enable this functionality and how to process these chunks using NDJSON. The demo is available on GitHub; you can clone the project and follow along.

First, start the model. I run the model locally: Llama 3.2 3B. If you run the model locally as well, it’s just a matter of running Ollama with the 3.2 3B model and serving it. If you prefer running the model in a Docker container, the project on GitHub contains a Compose file. If you are using another model, just spin it up accordingly.
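Assuming a local Ollama installation and the `llama3.2:3b` model tag (both assumptions; substitute whichever model you actually use), the local setup boils down to two commands:

```shell
# Download the model weights (tag assumed; pick the model you actually use)
ollama pull llama3.2:3b

# Start the Ollama server, which listens on http://localhost:11434 by default
ollama serve
```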

We will have a multi-module project where the frontend resides in a separate module. This way, we separate server-side logic and client-side logic. As for dependencies, we need Spring Boot WebFlux for reactive streaming. We also need Vaadin for the frontend. I am using the Spring Ollama starter, and if you are using another model, just add the corresponding starter.
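As a sketch, the relevant Maven dependencies might look like this (artifact IDs taken from the Spring Boot, Spring AI, and Vaadin starters; versions and BOM configuration omitted):

```xml
<!-- Reactive streaming on the server side -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
<!-- Spring AI starter for Ollama; swap for the starter matching your model provider -->
<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>
<!-- Vaadin for the frontend module -->
<dependency>
    <groupId>com.vaadin</groupId>
    <artifactId>vaadin-spring-boot-starter</artifactId>
</dependency>
```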

Let’s first look at the core service. Here we have an AI chat service, and we use the stream method of Spring AI’s ChatClient API to get an asynchronous response. The chat client can stream tokens as a Flux of strings or as a Flux of chat responses. To keep things simple, we produce a Flux of strings. That’s basically it for this class. Streaming is enabled, and the interesting part starts in the controller where we process the incoming Flux.
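A minimal sketch of such a service, assuming Spring AI's `ChatClient` fluent API (the class and method names here are illustrative, not the exact code from the repository):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

@Service
public class AiChatService {

    private final ChatClient chatClient;

    public AiChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // stream() switches the call to asynchronous mode;
    // content() yields the response as a Flux<String> of text chunks
    public Flux<String> streamAnswer(String prompt) {
        return chatClient.prompt()
                .user(prompt)
                .stream()
                .content();
    }
}
```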

You can process the Flux of AI responses on the server using server-sent events (SSE) or NDJSON. The first approach is well established and works well for browsers. It uses a single long-lived HTTP response where each token is wrapped into a server-sent event. The problem with this approach is that you may end up with AI responses on the client side with missing spaces or no spaces at all. The root cause is that different models use different tokenizers. Many LLMs encode a word and its leading space as a single token, so the space often ends up in the next chunk. Since SSE uses many small chunks, you are responsible for reconstructing the response correctly. If the SSE pipeline trims data, parses it poorly, or if proxies compress the event stream, the final response can be badly corrupted.

Newline-delimited JSON, or NDJSON, solves this problem. It is a streaming format where each line is an explicit JSON object separated by a newline character. Each token we receive can be easily parsed as JSON.

Let’s look at how to implement this on the server side in the controller. We have a method annotated with PostMapping that returns the AI response to the client. In the simplest scenario, we specify application/x-ndjson as the produced media type in Spring WebFlux. This means each emitted item from the reactive stream is serialized to a JSON object and flushed one per line. Instead of returning a Flux of strings, we return a Flux of maps: a reactive stream of JSON objects where both the key and the value are strings. The key represents the event type, and the value contains the actual response chunk. Spring serializes each map to JSON using Jackson and writes it immediately, followed by a newline. As a result, the client receives a stream of lines, each being a small JSON object with a single field.
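A hypothetical sketch of this simplest controller variant (the endpoint path and the service name are assumptions; the service corresponds to the AI chat service from the transcript):

```java
import java.util.Map;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

    private final AiChatService aiChatService;

    public ChatController(AiChatService aiChatService) {
        this.aiChatService = aiChatService;
    }

    // produces = "application/x-ndjson": Spring writes each emitted Map
    // as one JSON object per line and flushes it immediately
    @PostMapping(value = "/chat", produces = MediaType.APPLICATION_NDJSON_VALUE)
    public Flux<Map<String, String>> chat(@RequestBody String prompt) {
        return aiChatService.streamAnswer(prompt)
                // single-field JSON object per chunk, e.g. {"token":"hello"}
                .map(chunk -> Map.of("token", chunk));
    }
}
```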

Now let’s address error handling and stream completion. It is a good practice to explicitly signal when the stream ends and to handle backpressure. Backpressure occurs when the rate of message production exceeds the rate at which the client can consume them.

After receiving the streamed response from the model, we map each chunk to an event structure. We define different event types and associate content with each piece of text. For example, if the model emits three chunks such as “hello”, “world”, and “!”, the HTTP body will contain three token events. In addition, we define an error event with an error message and a done event that signals the end of the stream.

We then apply the onErrorResume operator. If any upstream error occurs, the stream does not break abruptly. Instead, it switches to a fallback emission: a single JSON line with type error and the error message, and then completes. This allows the client to handle terminal errors cleanly. After that, we use concatWith to append a final done event. This gives the client a deterministic end marker so it knows when to stop reading and clean up the UI state.
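The event mapping, the error fallback, and the final done marker can be sketched together like this (a minimal sketch using plain Reactor; the single-field event shape is an assumption based on the description above):

```java
import java.util.Map;
import reactor.core.publisher.Flux;

public class EventStreamSketch {

    public static Flux<Map<String, String>> toEvents(Flux<String> chunks) {
        return chunks
                // wrap each text chunk into a token event, e.g. {"token":"hello"}
                .map(chunk -> Map.of("token", chunk))
                // on any upstream error, emit one error event instead of breaking the stream
                .onErrorResume(e -> Flux.just(Map.of("error", String.valueOf(e.getMessage()))))
                // always append a deterministic end-of-stream marker
                .concatWith(Flux.just(Map.of("done", "")));
    }
}
```

With three chunks such as "hello", " world", and "!", this produces three token events followed by one done event, each serialized as its own NDJSON line.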

The limitRate operator allows us to control backpressure by requesting elements in batches instead of prefetching an unbounded number. In this case, we request up to 64 elements at a time. Optionally, we can add logging using doOnSubscribe and doFinally. This allows us to log when a subscriber connects and when the stream terminates, whether due to normal completion, an error, or client cancellation. This is useful for tracing sudden client disconnects.
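These operators can be sketched as one small pipeline step (a sketch with plain Reactor; the logging messages are illustrative):

```java
import java.util.Map;
import reactor.core.publisher.Flux;

public class BackpressureSketch {

    public static Flux<Map<String, String>> instrument(Flux<Map<String, String>> events) {
        return events
                // request upstream elements in batches of 64 instead of an unbounded prefetch
                .limitRate(64)
                .doOnSubscribe(sub -> System.out.println("client subscribed"))
                // signal is ON_COMPLETE, ON_ERROR, or CANCEL (e.g. a sudden client disconnect)
                .doFinally(signal -> System.out.println("stream terminated: " + signal));
    }
}
```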

With that, we are done with the server-side logic. Let’s move on to the client side. I will omit most of the UI element creation to reduce noise. The full code is available on GitHub, and I also have a separate video about integrating Vaadin with Spring Boot if you are interested.

In the main view class, we use WebClient, a reactive HTTP client, to call the backend chat endpoint. A Disposable field holds the current active subscription so it can be cancelled when needed, for example when a new request is made or when the view is detached. In the constructor, we build a WebClient with the base backend URL so we can later use relative URIs like /chat. We then wire the Ask button click handler to call the onAsk method. This method takes the input text and returns a Flux of streamed chunks, which is immediately passed to the startStream method for consumption.

In the onAsk method, we first validate the input. If it is null, we show a notification and return an empty stream. Then we use WebClient to build a POST request, send the prompt as plain text, and accept NDJSON. After executing the request with retrieve, we parse the NDJSON response using bodyToFlux as a Flux of maps, one map per line.
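The request itself might be sketched as follows (the /chat path, base URL, and class name are assumptions matching the transcript; input validation omitted for brevity):

```java
import java.util.Map;
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class ChatRequestSketch {

    private final WebClient client = WebClient.builder()
            .baseUrl("http://localhost:8080") // base backend URL; adjust to your setup
            .build();

    Flux<Map<String, String>> ask(String prompt) {
        return client.post()
                .uri("/chat")
                .contentType(MediaType.TEXT_PLAIN)    // send the prompt as plain text
                .accept(MediaType.APPLICATION_NDJSON) // ask for newline-delimited JSON
                .bodyValue(prompt)
                .retrieve()
                // one Map per NDJSON line, e.g. {"token":"hello"}
                .bodyToFlux(new ParameterizedTypeReference<Map<String, String>>() {});
    }
}
```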

Next, we transform this Flux of maps into a Flux of strings containing only the text chunks. We use Reactor’s handle operator, which allows mapping, filtering, and terminating in one place. For token events, we extract the content and emit it downstream. For error events, we convert them into terminal error signals. For done events, we complete the stream. Unknown event types are ignored to remain forward compatible. The result is a Flux of strings representing only the streamed text chunks.
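The handle step can be sketched like this (a sketch with plain Reactor; the single-field event shape is an assumption based on the server-side description):

```java
import java.util.Map;
import reactor.core.publisher.Flux;

public class EventToTextSketch {

    public static Flux<String> extractText(Flux<Map<String, String>> events) {
        return events.handle((event, sink) -> {
            if (event.containsKey("token")) {
                sink.next(event.get("token"));       // emit the text chunk downstream
            } else if (event.containsKey("error")) {
                sink.error(new IllegalStateException(event.get("error"))); // terminal error signal
            } else if (event.containsKey("done")) {
                sink.complete();                     // deterministic end of stream
            }
            // unknown event types fall through and are ignored for forward compatibility
        });
    }
}
```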

This Flux is passed to the startStream method. There, we cancel any previous active stream to avoid overlapping output, clear the output area, and disable the Ask button. Using UI.getCurrent, we obtain the current Vaadin UI so we can safely update the UI from reactive callbacks. We subscribe to the stream and store the subscription. On errors, we schedule a UI notification. In doFinally, regardless of whether the stream completed normally, errored, or was cancelled, we reset the UI state and clear the active subscription. Each text chunk is appended to the output using UI.access to ensure updates happen on the correct UI thread.
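A sketch of that subscription management in the view (field and component names are illustrative, not the exact code from the repository):

```java
import com.vaadin.flow.component.UI;
import com.vaadin.flow.component.button.Button;
import com.vaadin.flow.component.html.Paragraph;
import com.vaadin.flow.component.notification.Notification;
import reactor.core.Disposable;
import reactor.core.publisher.Flux;

public class StreamConsumerSketch {

    private final Paragraph output = new Paragraph();
    private final Button ask = new Button("Ask");
    private Disposable activeStream; // current subscription, cancelled before a new request

    void startStream(Flux<String> chunks) {
        if (activeStream != null) {
            activeStream.dispose();   // avoid overlapping output from a previous stream
        }
        output.setText("");
        ask.setEnabled(false);
        UI ui = UI.getCurrent();      // captured so reactive callbacks can update this UI safely
        activeStream = chunks
                .doFinally(signal -> ui.access(() -> {
                    ask.setEnabled(true); // reset UI state on complete, error, or cancel
                    activeStream = null;
                }))
                .subscribe(
                        chunk -> ui.access(() -> output.setText(output.getText() + chunk)),
                        error -> ui.access(() -> Notification.show(error.getMessage())));
    }
}
```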

Finally, we add the Push annotation to the class implementing AppShellConfigurator, usually the main frontend class. This enables server-to-client push so UI updates triggered from the reactive stream are flushed to the browser immediately. As a result, each chunk appears on the screen in real time, producing a smoothly streaming AI response.
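The annotation itself is a one-liner on the app shell class (the class name here is illustrative):

```java
import com.vaadin.flow.component.page.Push;
import com.vaadin.flow.server.AppShellConfigurator;

// @Push enables server-to-client push (WebSocket by default),
// so updates made via UI.access reach the browser immediately
@Push
public class Application implements AppShellConfigurator {
}
```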

Thank you for watching. If you want more AI-related videos, leave a comment, like the video, subscribe to the channel, and until next time.

Summary

This video explains how to stream LLM responses token by token using Spring AI and reactive programming. It shows why traditional server-sent events can cause formatting issues and how NDJSON provides a more reliable streaming format. The server-side implementation uses Spring WebFlux to stream structured JSON events with proper error handling, backpressure control, and a clear end-of-stream signal. On the client side, a reactive WebClient and Vaadin UI consume the stream and render tokens in real time. Together, this approach enables smooth, robust, and user-friendly AI response streaming in Spring applications.

About Catherine

Java developer passionate about Spring Boot. Writer. Developer Advocate at BellSoft
