Build RAG System with Spring AI: No More AI Lies

 

Transcript:

Ever asked ChatGPT something and it confidently made stuff up? Yeah, it happened a lot to me recently. I bet it happened to you too. The thing is, it's not the only problem. I have to know a lot about our products. But sometimes it's almost impossible to find information because we have so much documentation, so many blog posts, and so many YouTube videos. It's hardly possible to process them all.

Hi, my name is Pasha and I'm a developer advocate for BellSoft, and today we will fight LLM hallucinations in your product. But first, let's cover several fundamentals.

First, vectors. This is the base for all the LLM magic. A vector is just a set of numbers. You can think of it as an array of doubles. Here is a simple metaphor: imagine you need to distinguish New York from Tokyo. They have coordinates, and in a way these coordinates are vectors. For New York and Tokyo, there are only two dimensions: longitude and latitude. LLMs operate on multi-dimensional vectors that are much longer, often more than a thousand dimensions, for example 1,536 for OpenAI's text-embedding-3-small model.

The second thing you should know about is a vector database. What's a vector database? Essentially, it's specialized storage built specifically to handle vectors. It can be a regular database with vector support, for example Postgres with the PGVector extension. There are also dedicated vector databases, for example Milvus or Chroma. And of course, Oracle has support for vectors too.

What does it mean to handle vectors? It means that this database can perform operations on vectors very quickly. One of the most important operations on vectors is called finding distance. One of the most popular ways to find distance between vectors is cosine distance.

Now, here is a small demonstration of what cosine distance is. Cosine similarity is the dot product of two vectors divided by the product of their magnitudes, which equals the cosine of the angle between them. Cosine distance is one minus that similarity. Now imagine we have three objects: rose, sunflower, and sun. Sunflower is visually closer to rose than to sun. But how does it work for vectors? As you can see, there are three vectors and they have angles between them. The angle between rose and sunflower is smaller than the angle between sunflower and sun, so the cosine distance is also smaller. A smaller cosine distance means more similar meaning.
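To make the definition concrete, here is a minimal sketch in plain Java. The class name and the two-dimensional rose, sunflower, and sun vectors are made up for illustration; real embeddings have many more dimensions.

```java
final class CosineDemo {

    // Cosine similarity: dot product divided by the product of the magnitudes.
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Cosine distance: 0 for the same direction, 1 for orthogonal vectors, 2 for opposite ones.
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        // Hypothetical vectors, chosen only so that rose and sunflower point in similar directions
        double[] rose = {1.0, 0.2};
        double[] sunflower = {0.9, 0.5};
        double[] sun = {0.1, 1.0};
        System.out.println("rose vs sunflower: " + cosineDistance(rose, sunflower));
        System.out.println("sunflower vs sun:  " + cosineDistance(sunflower, sun));
    }
}
```

With these made-up numbers, the rose and sunflower pair comes out with the smaller distance, which matches the picture of the smaller angle.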

Now you are probably asking me: Pasha, why are you telling us all of this? Why do we care about vectors? Why do we care about cosine distance? Why do we care about databases? The answer is that this is part of the solution to our problem. This is part of the solution to LLM hallucinations.

If you think about it, browsing through our documents is almost impossible for us, but it's very simple for machines. There is a very simple approach to solve this problem. It's called RAG — retrieval augmented generation.

The workflow looks like this. We take a huge document, or small documents if we have many of them. Documents are usually text, but they might also be images and so on. Then we split every document into chunks. We take every chunk and convert it into a vector. How do we do it? There are special neural networks that can do it for us. For example, OpenAI has one, and there are many others.
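The splitting step can be sketched in plain Java. This is only one possible strategy, assuming word-based chunks with a small overlap so that a sentence cut at a boundary still appears whole in one chunk. The class name, chunk size, and overlap are illustrative assumptions, not what any particular library does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class Chunker {

    // Splits text into chunks of at most chunkSize words.
    // Consecutive chunks share `overlap` words.
    static List<String> chunk(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("Need chunkSize > 0 and 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return chunks;
        }
        String[] words = trimmed.split("\\s+");
        int step = chunkSize - overlap;
        for (int start = 0; start < words.length; start += step) {
            int end = Math.min(start + chunkSize, words.length);
            chunks.add(String.join(" ", Arrays.copyOfRange(words, start, end)));
            if (end == words.length) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Tiny sizes so the overlap is visible; real chunks are much larger
        chunk("one two three four five six seven", 4, 1).forEach(System.out::println);
    }
}
```

Each chunk would then be sent to the embedding network to get its vector.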

When we have a vector, we save this vector to our vector database along with the source text, because a vector is just a set of numbers. It's impossible to decode a vector back into text without knowing what was encoded. We have to preserve the source.

Then when I want to find something in my documents, I don't have to read through them anymore. What I need is to ask a question. The same neural network that converted the chunks gives me a vector for my question. Then I find the documents closest to my question in my vector database. Then I put these documents into the context, and I ask the LLM: given all the information I just provided, please answer my question.
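Here is a toy sketch of the store-and-search part in plain Java, assuming the vectors have already been computed elsewhere. The class is a hypothetical in-memory stand-in for a real vector database, and the vectors are made up. It keeps the source text next to each vector and returns the texts closest to the question vector.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TinyVectorStore {

    // A stored chunk: the vector plus the source text it was computed from.
    record Entry(String text, double[] vector) {}

    private final List<Entry> entries = new ArrayList<>();

    void add(String text, double[] vector) {
        entries.add(new Entry(text, vector));
    }

    // Returns the source texts of the k entries closest to the query vector.
    List<String> search(double[] query, int k) {
        return entries.stream()
                .sorted(Comparator.comparingDouble((Entry e) -> cosineDistance(query, e.vector())))
                .limit(k)
                .map(Entry::text)
                .toList();
    }

    // Assumes non-zero vectors of equal length.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        TinyVectorStore store = new TinyVectorStore();
        store.add("Text about roses", new double[] {1.0, 0.2});
        store.add("Text about sunflowers", new double[] {0.9, 0.5});
        store.add("Text about the sun", new double[] {0.1, 1.0});
        // Hypothetical vector for a question about flowers
        System.out.println(store.search(new double[] {1.0, 0.3}, 2));
    }
}
```

A real vector database does the same job with indexes instead of a full scan, which is what makes it fast on large collections.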

Now the LLM does not need to hallucinate anymore. It knows the exact answer if it is in the documents. If no relevant documents are found, it can say so. If it is not sure, it can answer: “I'm not sure. I don't have enough information.”

Now let's talk about how to implement this whole system in code. Obviously, we'll use Spring.

The first part is uploading our documents to a vector database. As a reminder, we split documents into chunks, convert every chunk into a vector, and save it to the database. Here's how we do it in Spring.

The simplest way is to have one endpoint. In my case, it's a @PostMapping with a path, and we send a multipart file. Then I use a document reader, which is part of the Spring AI suite, to convert the file into a list of documents. In this case, every document is a chunk that will be converted into a vector. Then I call vectorStore.add(docs).

All the magic is hidden. Spring sends every document to the neural network to convert it into a vector, and then behind the scenes it stores the vector together with the source text in the vector store. Then we respond that everything is done.

To make our bot answer user questions, we add another endpoint, for example a @GetMapping, and pass the question text. Then we build a search request from the question text with topK set to 3. This part means: please find the top three documents that are closest to our query. This uses cosine distance.

We execute the query on the vector store, and everything is abstracted. It doesn't matter whether it's PGVector, Milvus, or Chroma. We get the documents and convert each document into a specific format. In my case, it's the document title, the document text, and the file name from metadata. I split the documents with a delimiter to make sure the LLM does not mix them up.
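The formatting step is plain string work, so it can be sketched without any Spring classes. The record, the field labels, and the delimiter below are illustrative assumptions, not the exact format used in the video.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ContextFormatter {

    // Hypothetical stand-in for a retrieved chunk and its metadata.
    record RetrievedDoc(String title, String text, String fileName) {}

    // Assumed delimiter; any string unlikely to appear inside the documents works.
    static final String DELIMITER = "\n-----\n";

    // Renders every document as title, text, and file name, separated by the delimiter.
    static String format(List<RetrievedDoc> docs) {
        return docs.stream()
                .map(d -> "Title: " + d.title() + "\nText: " + d.text() + "\nFile: " + d.fileName())
                .collect(Collectors.joining(DELIMITER));
    }

    public static void main(String[] args) {
        List<RetrievedDoc> docs = List.of(
                new RetrievedDoc("Install", "Run the installer.", "install.md"),
                new RetrievedDoc("Upgrade", "Back up first.", "upgrade.md"));
        System.out.println(format(docs));
    }
}
```

Keeping the file name in every block is what later lets the model reference a document by name.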

Now I provide system instructions. It does not have to be this complex, but this is what I use: “You are a helpful assistant. You find information in the data provided below. The format of the data is the title and the text of the document or its part. Documents are separated by a delimiter. If you reference a document, reference it by name.”

Referencing by name lets us check that the LLM is not hallucinating: we can find the document by name in storage and read it fully if needed. The instructions continue: “Always reference the document where you found the information. If there is no information, answer: ‘Sorry, I don't have such information.’”

Then I insert the documents into the prompt. I add the user's question, create a prompt, and call chatClient.prompt(...).call(). The content of the response is a string that I send back to the user.
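Assembling the final prompt text can be sketched in plain Java as well. The instructions are the ones quoted in the talk, while the class name and the DOCUMENTS and QUESTION section labels are assumptions made for this example. The resulting string is what would be handed to the chat client.

```java
final class RagPrompt {

    // System instructions as quoted in the talk.
    static final String SYSTEM_INSTRUCTIONS = """
            You are a helpful assistant. You find information in the data provided below.
            The format of the data is the title and the text of the document or its part.
            Documents are separated by a delimiter.
            If you reference a document, reference it by name.
            Always reference the document where you found the information.
            If there is no information, answer: "Sorry, I don't have such information."
            """;

    // Instructions first, then the retrieved documents, then the user's question.
    static String build(String documents, String question) {
        return SYSTEM_INSTRUCTIONS
                + "\nDOCUMENTS:\n" + documents
                + "\n\nQUESTION:\n" + question;
    }

    public static void main(String[] args) {
        String documents = "Title: Upgrade\nText: Back up first.\nFile: upgrade.md";
        System.out.println(build(documents, "How do I upgrade?"));
    }
}
```

Putting the question last keeps it right next to where the model starts answering, and the fixed labels make it easy to see in logs what the model actually received.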

This is the easiest and simplest possible RAG workflow. It does not support chatting or advanced features. It is intentionally simple. But now you have all the basic information to know where to dig.

This is Spring AI for you. It will not magically make your application smarter than you, but it will make your data useful and help save you from hallucinations.

If you like the video, like it, subscribe to the channel, leave a comment, and who knows, maybe I'll create the next one. Thank you so much. Pash out.

 

 

Summary

This talk shows how to fight LLM hallucinations, a common problem when product knowledge is spread across large amounts of documentation. It introduces vectors and vector databases as the foundation for Retrieval Augmented Generation (RAG). Documents are split into chunks, converted into vectors, and stored so that relevant information can be retrieved using cosine distance. When a user asks a question, the LLM receives the most relevant document context instead of guessing. As a result, answers are grounded in real data and hallucinations are significantly reduced.
