Build a RAG System with Spring AI: No More AI Lies

 

Transcript:

Ever asked ChatGPT something and it confidently made stuff up? Yeah, that's happened to me a lot recently, and I bet it's happened to you too. The thing is, hallucination isn't the only problem. I have to know a lot about our products, but sometimes it's almost impossible to find information: we have so much documentation, so many blog posts, and so many YouTube videos that it's hardly possible to process them all.

Hi, my name is Pasha, I'm a developer advocate for BellSoft, and today we will fight LLM hallucinations in your product. But first, let's cover a few fundamentals.

First, vectors. They are the basis for all the LLM magic. A vector is just a set of numbers; you can think of it as an array of doubles. Or take a simple metaphor: imagine you need to tell New York apart from Tokyo. Each city has coordinates, and in a way those coordinates are a vector. For New York and Tokyo there are only two dimensions, latitude and longitude. The vectors LLMs operate on are much longer, often more than a thousand dimensions (1,536 is a common size, for example).
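
As a toy illustration, those two-dimensional "city vectors" could look like this in code (the coordinates are approximate):

```java
// Two-dimensional "city vectors": latitude and longitude (approximate values).
double[] newYork = {40.7, -74.0};
double[] tokyo   = {35.7, 139.7};
// Embedding vectors used with LLMs work the same way, just with far more dimensions.
```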

The second thing you should know about is the vector database. What's a vector database? Essentially, it's specialized storage built to handle vectors. A regular database can play this role: Postgres, for example, has an extension called pgvector. There are also dedicated vector databases such as Milvus or Chroma. And of course, Oracle has support for vectors too.

What does it mean to handle vectors? It means that the database can perform operations on vectors very quickly. One of the most important operations is finding the distance between two vectors, and one of the most popular distance measures is cosine distance.

Now, here is a small demonstration of what cosine distance is. Cosine similarity is the dot product of two vectors divided by the product of their magnitudes, and cosine distance is one minus that value. Imagine we have three objects: a rose, a sunflower, and the sun. A sunflower is visually closer to a rose than to the sun. But how does this work for vectors? There are three vectors, and there are angles between them. The angle between rose and sunflower is smaller than the angle between sunflower and sun, so the cosine distance is also smaller. A smaller cosine distance means a more similar meaning.
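
To make the formula concrete, here is a minimal sketch of cosine similarity and cosine distance for two plain double[] vectors; real embedding vectors are just much longer:

```java
/** Cosine similarity: the dot product divided by the product of the magnitudes. */
static double cosineSimilarity(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Cosine distance: 1 minus similarity, so a smaller value means "closer in meaning". */
static double cosineDistance(double[] a, double[] b) {
    return 1.0 - cosineSimilarity(a, b);
}
```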

Now you are probably asking me: Pasha, why are you telling us all of this? Why do we care about vectors, cosine distance, and databases? The answer is that they are part of the solution to our problem, part of the solution to LLM hallucinations.

If you think about it, browsing through all of our documents is almost impossible for us, but it's very simple for machines. There is a very simple approach to solving this problem. It's called RAG: retrieval-augmented generation.

The workflow looks like this. We take a huge document, or many small documents if that's what we have. Documents are usually text, but they might also be images and so on. Then we split every document into chunks and convert every chunk into a vector. How do we do that? There are special neural networks, embedding models, that do it for us. OpenAI has one, for example, and there are many others.

When we have a vector, we save this vector to our vector database along with the source text, because a vector is just a set of numbers. It's impossible to decode a vector back into text without knowing what was encoded. We have to preserve the source.

Then, when I want to find something in my documents, I don't have to read through them anymore. I ask my question, the embedding model turns the question into a vector, and I find the documents closest to that vector in my vector database. Then I put those documents into the context and ask the LLM: given all the information I just provided, please answer my question.

Now the LLM does not need to hallucinate anymore. If the answer is in the documents, it knows the exact answer. If there are no relevant documents, it can say so. And if it is not sure, it can answer: “I'm not sure, I don't have enough information.”

Now let's talk about how to implement this whole system in code. Obviously, we'll use Spring.

The first part is uploading our documents to a vector database. As a reminder, we split documents into chunks, convert every chunk into a vector, and save it to the database. Here's how we do it in Spring.

The simplest way is to have one endpoint. In my case, it's a @PostMapping with a path, and we send a multipart file. Then I use a document reader, which is part of the Spring AI suite, to convert the file into a list of documents; every document here is a chunk that will be converted into a vector. Then I call vectorStore.add() with the documents.
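
Here is a minimal sketch of such an upload endpoint, assuming Spring AI with a Tika-based document reader and a token splitter. The path, request field name, and splitter choice are illustrative, and exact class and method names can differ slightly between Spring AI versions:

```java
import java.io.IOException;
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.InputStreamResource;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
class DocumentUploadController {

    private final VectorStore vectorStore;

    DocumentUploadController(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @PostMapping("/documents")
    ResponseEntity<String> upload(@RequestParam("file") MultipartFile file) throws IOException {
        // Read the uploaded file into Spring AI Document objects.
        var reader = new TikaDocumentReader(new InputStreamResource(file.getInputStream()));
        List<Document> documents = reader.get();

        // Split large documents into chunks; each chunk will become one vector.
        List<Document> chunks = new TokenTextSplitter().apply(documents);

        // Spring AI embeds every chunk and stores the vector plus the source text.
        vectorStore.add(chunks);

        return ResponseEntity.ok("done");
    }
}
```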

All the magic is hidden. Spring sends every document to the neural network to convert it into a vector, and then behind the scenes it stores the vector together with the source text in the vector store. Then we respond that everything is done.

To make our bot answer user questions, we add another endpoint, for example a @GetMapping, and pass in the question text. Then we create a search request with topK set to three, which means: please find the three documents closest to our query. This uses cosine distance.
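
As a sketch, and reusing the question text passed to that endpoint, the search request could be built like this (the builder shown is the Spring AI 1.x style; older milestones used SearchRequest.query(question).withTopK(3) instead):

```java
// Find the three documents whose vectors are closest to the question vector.
SearchRequest request = SearchRequest.builder()
        .query(question)  // the user's question text
        .topK(3)          // return the top three matches
        .build();
```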

We execute the query on the vector store, and everything is abstracted away: it doesn't matter whether it's pgvector, Milvus, or Chroma. We get the documents back and convert each one into a specific format; in my case, it's the document title, the document text, and the file name from the metadata. I separate the documents with a delimiter to make sure the LLM does not mix them up.
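
A sketch of that retrieval and formatting step, continuing the same handler, might look like this. The metadata keys ("title", "file_name") are assumptions that depend on how the documents were read and enriched at upload time, and the text accessor is getText() in recent Spring AI versions (getContent() in older ones):

```java
// Run the similarity search; the VectorStore interface hides whether the
// backend is pgvector, Milvus, or Chroma.
List<Document> documents = vectorStore.similaritySearch(request);

// Turn each document into "title, text, file name" and join them with a
// delimiter so the LLM can tell the documents apart.
String context = documents.stream()
        .map(doc -> doc.getMetadata().getOrDefault("title", "Untitled")
                + "\n" + doc.getText()
                + "\nSource: " + doc.getMetadata().get("file_name"))
        .collect(java.util.stream.Collectors.joining("\n---\n"));
```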

Now I provide system instructions. They don't have to be this elaborate, but this is what I use: “You are a helpful assistant. You find information in the data provided below. The format of the data is the title and the text of the document or its part. Documents are separated by a delimiter. If you reference a document, reference it by name.”

This way we can be sure the LLM does not hallucinate, and we can also find the referenced document by name in storage and read it fully if needed. The instructions continue: “Always reference the document where you found the information. If there is no such information, answer: Sorry, I don't have such information.”

Then I insert the documents into the prompt, add the user's question, build the prompt, and call the chat client. The call returns a string that I send back to the user.
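
Putting it together, the final call could look roughly like this, assuming Spring AI's fluent ChatClient API and reusing the context and question variables from the previous sketch; the system text is a shortened version of the instructions described above:

```java
// The retrieved documents are appended to the system instructions.
String systemText = """
        You are a helpful assistant. You find information in the data provided below.
        Each document has a title and a text; documents are separated by "---".
        Always reference the document where you found the information, by name.
        If there is no such information, answer: "Sorry, I don't have such information."

        %s
        """.formatted(context);

// Ask the model and return its answer to the user as a plain string.
String answer = chatClient.prompt()
        .system(systemText)   // instructions plus retrieved documents
        .user(question)       // the original user question
        .call()
        .content();

return answer;
```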

This is the simplest possible RAG workflow. It does not support chat history or other advanced features; it is intentionally simple. But now you have all the basics and know where to dig.

That's Spring AI for you. It will not magically make your application smarter than you, but it will make your data useful and help save you from hallucinations.

If you liked the video, give it a like, subscribe to the channel, leave a comment, and who knows, maybe I'll create the next one. Thank you so much. Pasha out.

 

 

Summary

This talk explains why LLM hallucinations happen and why they are a common problem when product knowledge is spread across large amounts of documentation. It introduces vectors and vector databases as the foundation for retrieval-augmented generation (RAG). Documents are split into chunks, converted into vectors, and stored so that relevant information can be retrieved using cosine distance. When a user asks a question, the LLM receives the most relevant document context instead of guessing. As a result, answers are grounded in real data and hallucinations are significantly reduced.
