Build RAG System with Spring AI: No More AI Lies

 

Transcript:

Ever asked ChatGPT something and it confidently made stuff up? Yeah, it happened a lot to me recently. I bet it happened to you too. The thing is, it's not the only problem. I have to know a lot about our products. But sometimes it's almost impossible to find information because we have so much documentation, so many blog posts, and so many YouTube videos. It's hardly possible to process them all.

Hi, my name is Pasha and I'm a developer advocate at BellSoft, and today we will fight LLM hallucinations in your product. But first, let's cover a few fundamentals.

First, vectors. This is the base for all the LLM magic. A vector is just a set of numbers; you can think of it as an array of doubles. Here's a simple metaphor: imagine you need to distinguish New York from Tokyo. They have coordinates, and in a way, these coordinates are vectors. For New York and Tokyo, there are only two dimensions: longitude and latitude. The embedding vectors that LLMs operate on are much longer, often more than a thousand dimensions (1,536, for example).

The second thing you should know about is a vector database. What's a vector database? Essentially, it's specialized storage built specifically to handle vectors. It can be a regular database with vector support; for Postgres, that's the PGVector extension. There are also dedicated vector databases, for example Milvus or Chroma. And of course, Oracle has support for vectors too.

What does it mean to handle vectors? It means that the database can perform operations on vectors very quickly. One of the most important operations is finding the distance between vectors, and one of the most popular distance measures is cosine distance.

Now, here is a small demonstration of what cosine distance is. Cosine similarity is the dot product of two vectors divided by the product of their magnitudes, and cosine distance is one minus that similarity. Now imagine we have three objects: a rose, a sunflower, and the sun. The sunflower is closer in meaning to the rose than to the sun. But how does that look for vectors? There are three vectors with angles between them. The angle between rose and sunflower is smaller than the angle between sunflower and sun, so the cosine distance is also smaller. A smaller cosine distance means a more similar meaning.
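As a sketch, here is the cosine distance computation in plain Java, with toy two-dimensional "embeddings" for the rose, sunflower, and sun example (real embeddings have hundreds or thousands of dimensions):

```java
public class CosineDistance {

    // Cosine similarity: dot product divided by the product of magnitudes.
    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Cosine distance: smaller means more similar meaning.
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    public static void main(String[] args) {
        double[] rose = {1.0, 0.2};
        double[] sunflower = {0.9, 0.4};
        double[] sun = {0.1, 1.0};
        // rose-sunflower distance is smaller than sunflower-sun distance.
        System.out.println(cosineDistance(rose, sunflower));
        System.out.println(cosineDistance(sunflower, sun));
    }
}
```

The vector values here are made up purely for illustration; only the formula matters.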

Now you are probably asking me: Pasha, why are you telling us all of this? Why do we care about vectors? Why do we care about cosine distance? Why do we care about databases? The answer is that this is part of the solution to our problem. This is part of the solution to LLM hallucinations.

If you think about it, browsing through our documents is almost impossible for us, but it's very simple for machines. There is a very simple approach to solve this problem. It's called RAG — retrieval augmented generation.

The workflow looks like this. We take a huge document, or small documents if we have many of them. Documents are usually text, but they might also be images and so on. Then we split every document into chunks. We take every chunk and convert it into a vector. How do we do it? There are special neural networks that can do it for us. For example, OpenAI has one, and there are many others.

When we have a vector, we save this vector to our vector database along with the source text, because a vector is just a set of numbers. It's impossible to decode a vector back into text without knowing what was encoded. We have to preserve the source.

Then when I want to find something in my documents, I don't have to read through them anymore. I ask my question, and an embedding model gives me a vector for it. Then I find the documents closest to my question in the vector database, put those documents into the context, and ask the LLM: given all the information I just provided, please answer my question.

Now the LLM does not need to hallucinate anymore. It knows the exact answer if it is in the documents. If there are no documents, we can say that. If it is not sure, it can answer: “I'm not sure. I don't have enough information.”

Now let's talk about how to implement this whole system in code. Obviously, we'll use Spring.

The first part is uploading our documents to a vector database. As a reminder, we split documents into chunks, convert every chunk into a vector, and save it to the database. Here's how we do it in Spring.

The simplest way is to have one endpoint. In my case, it's a PostMapping with a path, and we send a multipart file. Then I use a document reader, which is part of the Spring AI suite, to convert the file into a list of documents. Here, every document is a chunk that will be converted into a vector. Then I call vectorStore.add(docs).

All the magic is hidden. Spring sends every document to the neural network to convert it into a vector, and then behind the scenes it stores the vector together with the source text in the vector store. Then we respond that everything is done.
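The upload endpoint described above can be sketched roughly like this. This assumes Spring AI 1.x (TikaDocumentReader and TokenTextSplitter are Spring AI classes; check the names against the version you use), and the "/documents" path is made up for the example:

```java
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.multipart.MultipartFile;

@RestController
public class IngestController {

    private final VectorStore vectorStore;

    public IngestController(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    @PostMapping("/documents")
    public ResponseEntity<String> upload(@RequestParam("file") MultipartFile file) {
        // Read the uploaded file into Spring AI Document objects.
        var reader = new TikaDocumentReader(file.getResource());
        // Split into chunks; each chunk becomes one vector.
        var chunks = new TokenTextSplitter().apply(reader.read());
        // Spring AI embeds each chunk and stores vector + source text.
        vectorStore.add(chunks);
        return ResponseEntity.ok("Uploaded " + chunks.size() + " chunks");
    }
}
```

The embedding model and vector store are picked up from Spring configuration, which is exactly the "hidden magic": swapping PGVector for Milvus or Chroma doesn't change this code.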

To make our bot answer user questions, we add another endpoint, for example a GetMapping, and pass in the question text. Then we build a search request with topK set to three, which means: please find the top three documents closest to our query. This uses cosine distance.

We execute the query on the vector store, and everything is abstracted. It doesn't matter whether it's PGVector, Milvus, or Chroma. We get the documents and convert each document into a specific format. In my case, it's the document title, the document text, and the file name from metadata. I split the documents with a delimiter to make sure the LLM does not mix them up.
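The formatting step can be sketched in plain Java. The Chunk record here is a stand-in for Spring AI's Document, and the exact labels and the "---" delimiter are my choices, not a requirement:

```java
import java.util.List;
import java.util.stream.Collectors;

public class DocumentFormatter {

    // A tiny stand-in for a retrieved chunk; in Spring AI this would be
    // org.springframework.ai.document.Document with a metadata map.
    record Chunk(String title, String text, String fileName) {}

    // Render each chunk as title + text + file name, and join chunks
    // with a delimiter so the LLM does not mix them up.
    static String format(List<Chunk> chunks) {
        return chunks.stream()
                .map(c -> "Title: " + c.title() + "\n"
                        + c.text() + "\n"
                        + "File: " + c.fileName())
                .collect(Collectors.joining("\n---\n"));
    }
}
```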

Now I provide system instructions. They don't have to be this complex, but this is what I use: “You are a helpful assistant. You find information in the data provided below. The format of the data is the title and the text of the document or its part. Documents are separated by a delimiter. If you reference a document, reference it by name. Always reference the document where you found the information. If there is no information, answer: ‘Sorry, I don't have such information.’”

This way the LLM is far less likely to hallucinate, and because every answer references a document by name, we can find that document in storage and read it fully if needed.

Then I insert the documents into the prompt, add the user's question, create the prompt, and call chatClient.prompt(...).call(). This returns a string that I send back to the user.
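The query side can be sketched like this. Again this assumes the Spring AI 1.x fluent API (SearchRequest.builder(), ChatClient, Document.getText()); the "/ask" path and the shortened system prompt are made up for the example:

```java
import java.util.stream.Collectors;

import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class AskController {

    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public AskController(VectorStore vectorStore, ChatClient.Builder builder) {
        this.vectorStore = vectorStore;
        this.chatClient = builder.build();
    }

    @GetMapping("/ask")
    public String ask(@RequestParam String question) {
        // Find the top 3 chunks closest to the question (cosine distance).
        var request = SearchRequest.builder().query(question).topK(3).build();
        var docs = vectorStore.similaritySearch(request);

        // Join the retrieved chunks with a delimiter.
        var context = docs.stream()
                .map(Document::getText)
                .collect(Collectors.joining("\n---\n"));

        // Ask the LLM to answer using only the provided context.
        return chatClient.prompt()
                .system("You are a helpful assistant. Answer using only the data below. "
                        + "If the answer is not there, say you don't have that information.\n"
                        + context)
                .user(question)
                .call()
                .content();
    }
}
```

As in the upload endpoint, the vector store behind similaritySearch is fully abstracted: PGVector, Milvus, or Chroma all work without changing this code.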

This is the easiest and simplest possible RAG workflow. It does not support chatting or advanced features. It is intentionally simple. But now you have all the basic information to know where to dig.

That's Spring AI for you. It will not magically make your application smarter than you, but it will make your data useful and help save you from hallucinations.

If you like the video, like it, subscribe to the channel, leave a comment, and who knows, maybe I'll create the next one. Thank you so much. Pash out.

 

 

Summary

This talk explains why LLM hallucinations happen and why they are a common problem when product knowledge is spread across large amounts of documentation. It introduces vectors and vector databases as the foundation for Retrieval Augmented Generation (RAG). Documents are split into chunks, converted into vectors, and stored so that relevant information can be retrieved using cosine distance. When a user asks a question, the LLM receives the most relevant document context instead of guessing. As a result, answers are grounded in real data and hallucinations are significantly reduced.

Videos
Mar 9, 2026
jOOQ Deep Dive: CTE, MULTISET, and SQL Pipelines

Some backend developers reach the point where the ORM stops being helpful. Complex joins, nested result graphs, or CTE pipelines quickly push frameworks like Hibernate to their limits. And when that happens, teams often end up writing fragile raw SQL strings or fighting performance issues like the classic N+1 query problem. In this video, we build a healthcare scheduling application NeonCare using jOOQ, Spring Boot 4, and PostgreSQL, and show how to write production-grade SQL directly in Java while keeping full compile-time type safety.

Videos
Feb 27, 2026
Spring Developer Roadmap 2026: What You Need to Know

Spring Boot is powerful. But knowing the framework isn’t the same as understanding backend engineering. In this video, I walk through the roadmap I believe matters for a Spring developer in 2026. We start with data. That means real SQL — CTEs, window functions, normalization trade-offs — and understanding what ACID and BASE actually imply for system guarantees. Spring Data JPA is useful, but you still need to know what happens underneath. Then architecture: microservices vs modular monolith, serverless, CQRS, and when HTTP, gRPC, Kafka, or WebSockets make sense. Not as buzzwords — but as design choices with trade-offs. Security and infrastructure follow: OWASP Top 10, AuthN vs AuthZ, encryption in transit and at rest, Docker, Kubernetes, Infrastructure as Code, and observability with Micrometer, OpenTelemetry, and Grafana. This roadmap isn’t about mastering every tool. It’s about knowing what affects reliability in production.

Further watching

Videos
Apr 2, 2026
Java Memory Options You Need in Production

JVM memory tuning can be tricky. Teams increase -Xmx and assume the problem is solved. Then the app still hits OOM. Because maximum heap size is not the only thing that affects memory footprint. The JVM uses RAM for much more than heap: metaspace, thread stacks, JIT/code cache, direct buffers, and native allocations. That’s why your process can run out of memory while heap still looks “fine”. In this video, we break down how JVM memory actually works and how to control it with a minimal, production-safe set of flags. We cover heap sizing (-Xms, -Xmx), dynamic resizing, direct memory (-XX:MaxDirectMemorySize), and total RAM limits (-XX:MaxRAMPercentage) — especially in containerized environments like Docker and Kubernetes. We also explain GC choices such as G1, ZGC, and Shenandoah, when defaults are enough, and why GC logging (-Xlog:gc*) is mandatory before tuning. Finally, we show how to diagnose failures with heap dumps and OOM hooks. This is not about adding more flags. It’s about understanding what actually consumes memory — and making decisions you can justify in production.

Videos
Mar 26, 2026
Java Developer Roadmap 2026: From Basics to Production

Most Java roadmaps teach tools. This one teaches order — the only thing that actually gets you to production. You don’t need to learn everything. You need to learn the right things, in the right sequence. In this video, we break down a practical Java developer roadmap for 2026 — from syntax and OOP to Spring, databases, testing, and deployment. Structured into 8 levels, it shows how real engineers grow from fundamentals to production-ready systems. We cover what to learn and what to ignore: core Java, collections, streams, build tools, Git, SQL and JDBC before Hibernate, the Spring ecosystem, testing with JUnit, and deployment with Docker and CI/CD. You’ll also understand why most developers get stuck — jumping into frameworks too early, skipping SQL, or treating tools as knowledge. This roadmap gives you a clear path into real-world Java development — with priorities, trade-offs, and production context.

Videos
Mar 19, 2026
TOP-5 Lightweight Linux Distributions for Containers

In this video, we compare five lightweight Linux distributions commonly used as base images: Alpine, Alpaquita, Chiseled Ubuntu, RHEL UBI Micro, and Wolfi. There are no rankings or recommendations — just a structured look at how these distros differ so you can evaluate them in your own context.