Spring AI: Streaming LLM Tokens with NDJSON in Spring Boot


Transcript:

So today we’ll learn how to stream tokens with Spring AI. LLMs produce responses token by token, in other words, in text chunks. The Spring AI chat client enables us to return a Flux of chunks of a model response. I will show you how to enable this functionality and how to process these chunks using NDJSON. The demo is available on GitHub; you can clone the project and follow along.

First, start the model. I run the model locally, version 3.2 3B. If you run the model locally as well, it’s just a matter of running Ollama with the 3.2 3B model and serving it. If you prefer running the model in a Docker container, the project on GitHub contains a Compose file. If you are using another model, just spin it up accordingly.

We will have a multi-module project where the frontend resides in a separate module. This way, we separate server-side logic from client-side logic. As for dependencies, we need Spring Boot WebFlux for reactive streaming. We also need Vaadin for the frontend. I am using the Spring AI Ollama starter; if you are using another model, just add the corresponding starter.

Let’s first look at the core service. Here we have an AI chat service, and we use the stream method of Spring AI’s ChatClient API to get an asynchronous response. The chat client can stream tokens as a Flux of strings or as a Flux of chat responses. To keep things simple, we produce a Flux of strings. That’s basically it for this class. Streaming is enabled, and the interesting part starts in the controller where we process the incoming Flux.
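A minimal sketch of such a service, assuming the `ChatClient.Builder` auto-configured by the Spring AI starter (class and method names here are illustrative, not necessarily the exact ones from the repository):

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;

@Service
public class AiChatService {

    private final ChatClient chatClient;

    // ChatClient.Builder is auto-configured by the Spring AI starter
    public AiChatService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Returns the model response as an asynchronous stream of text chunks
    public Flux<String> streamAnswer(String prompt) {
        return chatClient.prompt()
                .user(prompt)
                .stream()   // switch from the blocking call() to streaming
                .content(); // Flux<String>, one element per token chunk
    }
}
```

Using `.content()` keeps things simple, as mentioned above; `.chatResponse()` would instead yield a `Flux<ChatResponse>` with metadata per chunk.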

You can process the Flux of AI responses on the server using server-sent events (SSE) or NDJSON. The first approach is well established and works well in browsers. It uses a single long-lived HTTP response where each token is wrapped into a server-sent event. The problem with this approach is that you may end up with AI responses on the client side with missing spaces or no spaces at all. The root cause is that different models use different tokenizers. Many LLMs encode a word and its leading space as a single token, so the space often ends up in the next chunk. Since SSE delivers many small chunks, you are responsible for reconstructing the response correctly. If the SSE pipeline trims data, parses it poorly, or if proxies compress or buffer the events, the final response can be badly corrupted.

Newline-delimited JSON, or NDJSON, solves this problem. It is a streaming format where each line is a complete JSON object terminated by a newline character. Each chunk we receive can be parsed as JSON on its own, so whitespace inside token values survives intact.
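As an illustration of the framing (not code from the project), here is how three streamed tokens could look on the wire and how a client could reassemble them line by line. The parser is deliberately naive to keep the sketch self-contained; a real client would use Jackson:

```java
import java.util.ArrayList;
import java.util.List;

public class NdjsonDemo {

    // Extracts the value from a one-field JSON object such as {"token":" world"}.
    // Naive on purpose: assumes exactly one string field; real code uses Jackson.
    static String value(String jsonLine) {
        int colon = jsonLine.indexOf(':');
        // strip the {"key":"..."} wrapping, keeping inner text (incl. spaces)
        return jsonLine.substring(colon + 2, jsonLine.length() - 2);
    }

    public static void main(String[] args) {
        // Three NDJSON lines as they would arrive over HTTP
        String body = "{\"token\":\"Hello\"}\n"
                    + "{\"token\":\" world\"}\n"
                    + "{\"token\":\"!\"}\n";

        List<String> chunks = new ArrayList<>();
        for (String line : body.split("\n")) {
            chunks.add(value(line));
        }
        // Leading spaces inside tokens survive, so the text reassembles cleanly
        System.out.println(String.join("", chunks)); // Hello world!
    }
}
```

Because each line is valid JSON, the leading space in `" world"` is preserved by any conforming parser, which is exactly what SSE pipelines tend to lose.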

Let’s look at how to implement this on the server side in the controller. We have a method annotated with PostMapping that returns the AI response to the client. In the simplest scenario, we specify application/x-ndjson as the produced media type in Spring WebFlux. This means each emitted item from the reactive stream is serialized to a JSON object and flushed one per line. Instead of returning a Flux of strings, we return a Flux of maps, a reactive stream of JSON objects where the key is a string and the value is also a string. The key represents the event type, and the value contains the actual response chunk. Spring serializes each map to JSON using Jackson and writes it immediately, followed by a newline. As a result, the client receives a stream of lines, each being a small JSON object with a single field.
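Under these assumptions, the controller might look roughly like this (the endpoint path and the `token` event key are illustrative):

```java
import java.util.Map;
import org.springframework.http.MediaType;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;
import reactor.core.publisher.Flux;

@RestController
public class ChatController {

    private final AiChatService chatService;

    public ChatController(AiChatService chatService) {
        this.chatService = chatService;
    }

    // APPLICATION_NDJSON_VALUE is "application/x-ndjson"; Jackson serializes
    // each Map to one JSON object, and WebFlux flushes it as one line
    @PostMapping(value = "/chat", produces = MediaType.APPLICATION_NDJSON_VALUE)
    public Flux<Map<String, String>> chat(@RequestBody String prompt) {
        return chatService.streamAnswer(prompt)
                .map(chunk -> Map.of("token", chunk)); // key = event type
    }
}
```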

Now let’s address error handling and stream completion. It is a good practice to explicitly signal when the stream ends and to handle backpressure. Backpressure occurs when the rate of message production exceeds the rate at which the client can consume them.

After receiving the streamed response from the model, we map each chunk to an event structure. We define different event types and associate content with each piece of text. For example, if the model emits three chunks such as “hello”, “world”, and “!”, the HTTP body will contain three token events. In addition, we define an error event with an error message and a done event that signals the end of the stream.

We then apply the onErrorResume operator. If any upstream error occurs, the stream does not break abruptly. Instead, it switches to a fallback emission: a single JSON line with type error and the error message, and then completes. This allows the client to handle terminal errors cleanly. After that, we use concatWith to append a final done event. This gives the client a deterministic end marker so it knows when to stop reading and clean up the UI state.

The limitRate operator allows us to control backpressure by requesting elements in batches instead of prefetching an unbounded number. In this case, we request up to 64 elements at a time. Optionally, we can add logging using doOnSubscribe and doFinally. This allows us to log when a subscriber connects and when the stream terminates, whether due to normal completion, an error, or client cancellation. This is useful for tracing sudden client disconnects.
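Putting the operators from the last few paragraphs together, the controller’s stream pipeline could be sketched like this (event keys, batch size, and log messages are illustrative):

```java
import java.util.Map;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import reactor.core.publisher.Flux;

public class ChatPipeline {

    private static final Logger log = LoggerFactory.getLogger(ChatPipeline.class);

    // Wraps raw model chunks into typed NDJSON events with a deterministic end
    public static Flux<Map<String, String>> toEvents(Flux<String> chunks) {
        return chunks
                .map(chunk -> Map.of("token", chunk))
                // on any upstream error, switch to a single error event and complete
                .onErrorResume(ex ->
                        Flux.just(Map.of("error", String.valueOf(ex.getMessage()))))
                // always append an explicit end-of-stream marker
                .concatWith(Flux.just(Map.of("done", "")))
                // request upstream elements in batches of 64 instead of unbounded
                .limitRate(64)
                .doOnSubscribe(s -> log.info("client subscribed"))
                // the signal is COMPLETE, ON_ERROR, or CANCEL (client disconnect)
                .doFinally(signal -> log.info("stream terminated: {}", signal));
    }
}
```

Note the ordering: `onErrorResume` sits before `concatWith`, so even a failed stream still ends with the `done` event the client is waiting for.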

With that, we are done with the server-side logic. Let’s move on to the client side. I will omit most of the UI element creation to reduce noise. The full code is available on GitHub, and I also have a separate video about integrating Vaadin with Spring Boot if you are interested.

In the main view class, we use WebClient, a reactive HTTP client, to call the backend chat endpoint. A Disposable field holds the current active subscription so it can be cancelled when needed, for example when a new request is made or when the view is detached. In the constructor, we build a WebClient with the base backend URL so we can later use relative URIs like /chat. We then wire the Ask button click handler to call the onAsk method. This method takes the input text and returns a Flux of streamed chunks, which is immediately passed to the startStream method for consumption.

In the onAsk method, we first validate the input. If it is null, we show a notification and return an empty stream. Then we use WebClient to build a POST request, send the prompt as plain text, and accept NDJSON. After executing the request with retrieve, we parse the NDJSON response using bodyToFlux as a Flux of maps, one map per line.

Next, we transform this Flux of maps into a Flux of strings containing only the text chunks. We use Reactor’s handle operator, which allows mapping, filtering, and terminating in one place. For token events, we extract the content and emit it downstream. For error events, we convert them into terminal error signals. For done events, we complete the stream. Unknown event types are ignored to remain forward compatible. The result is a Flux of strings representing only the streamed text chunks.
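A sketch of this request-and-transform step, assuming the server emits one-field maps keyed by event type as described above (class and path names are illustrative):

```java
import java.util.Map;
import org.springframework.core.ParameterizedTypeReference;
import org.springframework.http.MediaType;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.core.publisher.Flux;

public class ChatStreamClient {

    private static final ParameterizedTypeReference<Map<String, String>> EVENT =
            new ParameterizedTypeReference<>() {};

    private final WebClient webClient;

    public ChatStreamClient(WebClient webClient) {
        this.webClient = webClient;
    }

    public Flux<String> onAsk(String prompt) {
        if (prompt == null || prompt.isBlank()) {
            return Flux.empty(); // nothing to ask; UI shows a notification
        }
        return webClient.post()
                .uri("/chat")
                .contentType(MediaType.TEXT_PLAIN)
                .accept(MediaType.APPLICATION_NDJSON)
                .bodyValue(prompt)
                .retrieve()
                .bodyToFlux(EVENT) // one Map per NDJSON line
                // handle = map + filter + terminate in one operator
                .handle((event, sink) -> {
                    if (event.containsKey("token")) {
                        sink.next(event.get("token"));   // emit text chunk
                    } else if (event.containsKey("error")) {
                        sink.error(new RuntimeException(event.get("error")));
                    } else if (event.containsKey("done")) {
                        sink.complete();                  // end of stream
                    } // unknown event types ignored for forward compatibility
                });
    }
}
```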

This Flux is passed to the startStream method. There, we cancel any previous active stream to avoid overlapping output, clear the output area, and disable the Ask button. Using UI.getCurrent, we obtain the current Vaadin UI so we can safely update the UI from reactive callbacks. We subscribe to the stream and store the subscription. On errors, we schedule a UI notification. In doFinally, regardless of whether the stream completed normally, errored, or was cancelled, we reset the UI state and clear the active subscription. Each text chunk is appended to the output using UI.access to ensure updates happen on the correct UI thread.
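A sketch of startStream along these lines; the UI helper methods at the bottom are placeholders for the real Vaadin components:

```java
import com.vaadin.flow.component.UI;
import com.vaadin.flow.component.notification.Notification;
import reactor.core.Disposable;
import reactor.core.publisher.Flux;

public class StreamConsumer {

    private Disposable subscription; // currently active stream, if any

    void startStream(Flux<String> chunks, UI ui) {
        if (subscription != null) {
            subscription.dispose(); // cancel any previous stream
        }
        clearOutput();
        setAskEnabled(false);

        subscription = chunks
                // runs on completion, error, or cancellation alike
                .doFinally(signal -> ui.access(() -> {
                    setAskEnabled(true); // reset UI state
                    subscription = null;
                }))
                .subscribe(
                        // UI.access marshals updates onto the UI's session lock
                        chunk -> ui.access(() -> appendOutput(chunk)),
                        error -> ui.access(() ->
                                Notification.show("Stream failed: " + error.getMessage())));
    }

    // --- illustrative stand-ins for the real UI components ---
    void clearOutput() { /* clear the output area */ }
    void appendOutput(String chunk) { /* append chunk to the output area */ }
    void setAskEnabled(boolean enabled) { /* toggle the Ask button */ }
}
```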

Finally, we add the Push annotation to the class implementing AppShellConfigurator, usually the main frontend class. This enables server-to-client push so UI updates triggered from the reactive stream are flushed to the browser immediately. As a result, each chunk appears on the screen in real time, producing a smoothly streaming AI response.
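For example, on the application shell class (the class name here is illustrative):

```java
import com.vaadin.flow.component.page.AppShellConfigurator;
import com.vaadin.flow.component.page.Push;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@Push // enable server-to-client push so streamed updates reach the browser
@SpringBootApplication
public class FrontendApplication implements AppShellConfigurator {

    public static void main(String[] args) {
        SpringApplication.run(FrontendApplication.class, args);
    }
}
```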

Thank you for watching. If you want more AI-related videos, leave a comment, like the video, subscribe to the channel, and until next time.

Summary

This video explains how to stream LLM responses token by token using Spring AI and reactive programming. It shows why traditional server-sent events can cause formatting issues and how NDJSON provides a more reliable streaming format. The server-side implementation uses Spring WebFlux to stream structured JSON events with proper error handling, backpressure control, and a clear end-of-stream signal. On the client side, a reactive WebClient and Vaadin UI consume the stream and render tokens in real time. Together, this approach enables smooth, robust, and user-friendly AI response streaming in Spring applications.

About Catherine

Java developer passionate about Spring Boot. Writer. Developer Advocate at BellSoft
