Build RAG System with Spring AI: No More AI Lies

 

Transcript:

Ever asked ChatGPT something and it confidently made stuff up? Yeah, that's happened to me a lot recently, and I bet it's happened to you too. The thing is, that's not the only problem. I have to know a lot about our products, but sometimes it's almost impossible to find information because we have so much documentation, so many blog posts, and so many YouTube videos. It's hardly possible to process them all.

Hi, my name is Pasha and I'm a developer advocate for BellSoft, and today we will fight LLM hallucinations in your product. But first, let's cover a few fundamentals.

First, vectors. This is the basis for all the LLM magic. A vector is just a set of numbers; you can think of it as an array of doubles. Or use a simple metaphor: imagine you need to distinguish New York from Tokyo. Each city has coordinates, and in a way those coordinates are a vector. LLMs operate on multi-dimensional vectors. For New York and Tokyo there are only two dimensions, longitude and latitude; the vectors used by LLMs are much longer, often with more than a thousand dimensions (1,536 is a common size, for example).
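To make the metaphor concrete, here is a tiny Java sketch. The numbers are simply the well-known latitude and longitude of the two cities; a real embedding would have far more entries.

```java
// Two-dimensional "vectors" for the city metaphor: latitude and longitude.
double[] newYork = { 40.7128, -74.0060 };
double[] tokyo   = { 35.6762, 139.6503 };

// An LLM embedding is the same idea, just with hundreds or thousands of numbers.
```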

The second thing you should know about is the vector database. What's a vector database? Essentially, it's specialized storage suited specifically to handling vectors. A regular database can play this role: Postgres, for example, has an extension called pgvector. There are also dedicated vector databases such as Milvus or Chroma, and of course Oracle has support for vectors too.

What does it mean to handle vectors? It means the database can perform operations on vectors very quickly. One of the most important operations is finding the distance between vectors, and one of the most popular distance measures is cosine distance.

Now, here is a small demonstration of what cosine distance is. Cosine similarity is the dot product of two vectors divided by the product of their magnitudes, and cosine distance is one minus that value. Now imagine we have three objects: a rose, a sunflower, and the sun. A sunflower is closer in meaning to a rose than to the sun. But how does this work for vectors? We have three vectors with angles between them. The angle between rose and sunflower is smaller than the angle between sunflower and sun, and the cosine distance is also smaller. A smaller cosine distance means more similar meaning.
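As a small illustration of the arithmetic (not something you would write yourself in production, since the vector store does this for you), here it is in plain Java:

```java
// Cosine similarity = dot(a, b) / (|a| * |b|); cosine distance = 1 - similarity.
static double cosineDistance(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    double similarity = dot / (Math.sqrt(normA) * Math.sqrt(normB));
    return 1.0 - similarity; // smaller distance = more similar meaning
}
```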

Now you are probably asking: Pasha, why are you telling us all of this? Why do we care about vectors, cosine distance, and databases? The answer is that they are part of the solution to our problem: LLM hallucinations.

If you think about it, browsing through all of our documents is almost impossible for us, but it's very simple for machines. There is a very simple approach to solving this problem, called RAG: retrieval-augmented generation.

The workflow looks like this. We take a huge document, or many small documents. Documents are usually text, but they can also be images and so on. We split every document into chunks and convert every chunk into a vector. How do we do that? There are special neural networks, called embedding models, that do this for us. OpenAI provides one, and there are many others.
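As a sketch of what that conversion looks like through Spring AI's EmbeddingModel abstraction (method names can vary slightly between Spring AI versions, and the chunk text is just a placeholder):

```java
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.stereotype.Component;

@Component
class ChunkEmbedder {

    private final EmbeddingModel embeddingModel; // auto-configured for your provider (OpenAI, etc.)

    ChunkEmbedder(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    float[] embed(String chunk) {
        // One call turns a text chunk into a vector: one number per dimension
        return embeddingModel.embed(chunk);
    }
}
```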

Once we have a vector, we save it to our vector database along with the source text, because a vector is just a set of numbers: it's impossible to decode it back into text without knowing what was encoded. We have to preserve the source.
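In Spring AI terms, what gets stored is a Document: the chunk text plus any metadata you want to keep. Here is a minimal sketch, assuming an injected VectorStore; the "file_name" key and the file name itself are illustrative.

```java
import java.util.List;
import java.util.Map;
import org.springframework.ai.document.Document;

// The vector store embeds the Document and keeps the vector and the text side by side.
Document chunk = new Document(
        "Some chunk of documentation text",
        Map.of("file_name", "getting-started.pdf")); // illustrative metadata

vectorStore.add(List.of(chunk));
```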

Then, when I want to find something in my documents, I don't have to read through them anymore. I ask my question, and the embedding model gives me a vector for it. I find the documents closest to that vector in my vector database, put them into the context, and ask the LLM: given all the information I just provided, please answer my question.

Now the LLM does not need to hallucinate anymore. It knows the exact answer if it is in the documents. If no relevant documents are found, we can say that, and if it is not sure, it can answer: “I'm not sure. I don't have enough information.”

Now let's talk about how to implement this whole system in code. Obviously, we'll use Spring.

The first part is uploading our documents to a vector database. As a reminder, we split documents into chunks, convert every chunk into a vector, and save it to the database. Here's how we do it in Spring.

The simplest way is to have one endpoint. In my case, it's a @PostMapping with a path, and we send a multipart file. I use a document reader, which is part of the Spring AI suite, to convert the file into a list of documents; each document is a chunk that will be converted into a vector. Then I call vectorStore.add() with those documents.

All the magic is hidden: Spring sends every document to the embedding model to convert it into a vector, and behind the scenes it stores the vector together with the source text in the vector store. Then we respond that everything is done.
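Here is a minimal sketch of that upload endpoint, assuming Spring AI's TikaDocumentReader and TokenTextSplitter; the path and response text are illustrative, and class names can differ slightly between Spring AI versions.

```java
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
class DocumentUploadController {

    private final VectorStore vectorStore;

    DocumentUploadController(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @PostMapping("/documents")
    String upload(@RequestParam("file") MultipartFile file) {
        // Read the uploaded file into Spring AI Document objects
        var documents = new TikaDocumentReader(file.getResource()).get();

        // Split the documents into chunks suitable for embedding
        var chunks = new TokenTextSplitter().apply(documents);

        // Spring AI embeds every chunk and stores the vector with its source text
        vectorStore.add(chunks);

        return "Stored " + chunks.size() + " chunks";
    }
}
```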

To make our bot answer user questions, we add another endpoint, for example a @GetMapping, and pass the question text. Then we build a search request with topK set to three, which means: please find the three documents closest to our query. The closeness is measured with cosine distance.

We execute the query against the vector store, and everything is abstracted away: it doesn't matter whether it's pgvector, Milvus, or Chroma. We get the documents back and convert each one into a specific format; in my case, it's the document title, the document text, and the file name from the metadata. I separate the documents with a delimiter to make sure the LLM does not mix them up.
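A sketch of that retrieval step, assuming an injected VectorStore, the question string from the request, and the SearchRequest builder from recent Spring AI versions (older versions use a slightly different fluent style); the "file_name" metadata key and the delimiter are illustrative.

```java
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;

// Find the three chunks closest to the question (by vector distance)
List<Document> documents = vectorStore.similaritySearch(
        SearchRequest.builder()
                .query(question)
                .topK(3)
                .build());

// Turn each chunk into "file name + text" and join with a delimiter
// so the LLM can tell the documents apart
String context = documents.stream()
        .map(doc -> doc.getMetadata().getOrDefault("file_name", "unknown")
                + "\n" + doc.getText())
        .collect(Collectors.joining("\n---\n"));
```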

Now I provide system instructions. It does not have to be this complex, but this is what I use. You are a helpful assistant. You find information in the data provided below. The format of the data is the title and the text of the document or its part. Documents are separated by a delimiter. If you reference a document, reference it by name.

This way we make sure the LLM does not hallucinate, and we can also find the referenced document by name in storage and read it fully if needed. The instructions continue: always reference the document where you found the information, and if there is no information, answer: “Sorry, I don't have such information.”

Then I insert the documents into the prompt, add the user's question, create the prompt, and call chatClient.prompt().call(). This returns a string that I send back to the user.
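Putting it together, here is a sketch of that final call, assuming an injected ChatClient plus the context string built above; the system text is a shortened paraphrase of the instructions described earlier, not the exact prompt from the video.

```java
// The system text carries the grounding rules plus the retrieved chunks
String systemText = """
        You are a helpful assistant. Find the answer in the data provided below.
        Each entry is the title and the text of a document or its part,
        separated by ---. Always reference the document you used by name.
        If the information is not there, answer:
        "Sorry, I don't have such information."

        %s
        """.formatted(context);

// ChatClient sends the prompt to the model and returns the answer as a String
String answer = chatClient.prompt()
        .system(systemText)
        .user(question)
        .call()
        .content();

return answer;
```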

This is the easiest and simplest possible RAG workflow. It does not support chatting or advanced features. It is intentionally simple. But now you have all the basic information to know where to dig.

This is Spring AI for you. It will not magically make your application smarter than you, but it will make your data useful and help save you from hallucinations.

If you liked the video, give it a like, subscribe to the channel, leave a comment, and who knows, maybe I'll create the next one. Thank you so much. Pasha out.

 

 

Summary

This talk explains why LLM hallucinations happen and why they are a common problem when product knowledge is spread across large amounts of documentation. It introduces vectors and vector databases as the foundation for Retrieval Augmented Generation (RAG). Documents are split into chunks, converted into vectors, and stored so that relevant information can be retrieved using cosine distance. When a user asks a question, the LLM receives the most relevant document context instead of guessing. As a result, answers are grounded in real data and hallucinations are significantly reduced.
