
Creating a Bot for Documentation Assistance with Spring AI and Unstructured.io

Jul 25, 2024
Pasha Finkelshteyn

Spring AI enables developers to integrate the power of AI into their projects with minimal effort, as Spring does most of the magic behind the scenes.

In this article, I will show you how to build an AI bot that answers a user's questions based on the provided set of documents. The application leverages the power of Spring AI and Unstructured.io, a solution for preparing texts for LLMs.

The source code of the application is available on GitHub.

What is a vector in AI

A vector in AI is a set of numeric values, often represented as an array, where each value corresponds to a particular attribute or feature of a data point in a multidimensional space. These values are floating-point numbers (Spring AI represents an embedding as a List of Doubles). Machine learning models use these vectors to classify data, make predictions, and identify patterns by means of mathematical operations. A very simplified but straightforward example: if you take the vector for the word ‘mother’, take away the ‘woman’ attribute, and then add the ‘man’ attribute, you will get the vector for the word ‘father.’

Vectors can represent not only words but also phrases, sentences, or even larger text blocks.
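The ‘mother’ to ‘father’ example can be made concrete with toy numbers. This is only a minimal sketch: the three dimensions (parent, female, male), the values, and the helper names are invented for illustration, whereas real embeddings have hundreds or thousands of learned dimensions.

```kotlin
import kotlin.math.sqrt

// Toy 3-dimensional "embeddings": [parent, female, male]. All values are invented.
val mother = doubleArrayOf(1.0, 1.0, 0.0)
val woman = doubleArrayOf(0.0, 1.0, 0.0)
val man = doubleArrayOf(0.0, 0.0, 1.0)
val father = doubleArrayOf(1.0, 0.0, 1.0)

// Element-wise vector addition and subtraction.
fun add(a: DoubleArray, b: DoubleArray) = DoubleArray(a.size) { a[it] + b[it] }
fun subtract(a: DoubleArray, b: DoubleArray) = DoubleArray(a.size) { a[it] - b[it] }

// Cosine similarity: 1.0 means the vectors point in the same direction.
fun cosineSimilarity(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "Vectors must have the same number of dimensions" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / sqrt(normA * normB)
}
```

With these toy values, add(subtract(mother, woman), man) has the same components as father, so the cosine similarity between the two is 1.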

What does a documentation assistance bot do

The purpose of the chatbot I created is to answer user questions about company products. But the bot shouldn’t make up answers, so it is based on the Retrieval-Augmented Generation (RAG) concept. RAG is a technique for increasing the reliability of LLM responses with facts fetched from specific sources. In our case, the bot will use the existing company documentation.

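To do that, we need to break the whole documentation corpus into smaller units, each of which can be turned into an embedding (another name for a vector) by the AI model. I used the ADA embedding model provided by OpenAI, which accepts at most 8,191 tokens per input, so each unit has to stay within that limit. Capping a unit at 8,191 characters, as I do later, keeps ordinary text within it.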

But we don’t want to create embeddings manually as it poses certain challenges, such as the necessity to extract plain text and get rid of noise such as formatting. This is where Unstructured.io comes to the rescue.

What is Unstructured.io?

Unstructured.io is a solution that takes a document of any type (HTML, PDF, CSV, PNG, PPTX, etc.) and transforms it into an LLM-ready file free of artifacts.

Unstructured can divide the text into smaller elements (or chunks) according to one of four smart chunking strategies:

  • “Basic” combines sequential elements,
  • “By title” preserves section boundaries, 
  • “By page” creates separate chunks for each page,
  • “By similarity” combines topically similar sequential elements into chunks.
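To give a feel for what combining sequential elements means, here is a deliberately simplified sketch of the idea behind the “Basic” strategy. It is not Unstructured's actual algorithm; the function name chunkBasic and its behavior for oversized elements are assumptions made for this illustration.

```kotlin
// Hypothetical helper: greedily packs sequential text elements into chunks
// that never exceed maxCharacters. Elements inside a chunk are joined by a newline.
fun chunkBasic(elements: List<String>, maxCharacters: Int): List<String> {
    require(maxCharacters > 0) { "maxCharacters must be positive" }
    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (element in elements) {
        // Simplification: an element longer than the limit is cut into fixed-size pieces.
        for (piece in element.chunked(maxCharacters)) {
            val separatorLength = if (current.isEmpty()) 0 else 1
            // Close the current chunk if the next piece would not fit into it.
            if (current.length + separatorLength + piece.length > maxCharacters) {
                chunks.add(current.toString())
                current.setLength(0)
            }
            if (current.isNotEmpty()) current.append('\n')
            current.append(piece)
        }
    }
    if (current.isNotEmpty()) chunks.add(current.toString())
    return chunks
}
```

For example, chunkBasic(listOf("aaaa", "bb", "cccc"), 7) packs the first two elements into one chunk and starts a new chunk for the third.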

Bot application structure

My chatbot is a Gradle application written in Kotlin. The project is based on Java 21 and Spring AI 1.0.0-M1. I used Liberica JDK recommended by Spring as a Java runtime. You can download Liberica JDK 21 for your platform from the site or get it directly through IntelliJ IDEA.

There are a few dependencies:

dependencies {
    implementation(libs.jackson.module.kotlin)
    implementation(libs.kotlin.reflect)
    implementation(libs.kotlinx.coroutines.core)
    implementation(libs.kotlinx.coroutines.reactor)
    implementation(libs.kotlinx.serialization.json)
    implementation(libs.ktor.client.content.negotiation)
    implementation(libs.ktor.client.core)
    implementation(libs.ktor.client.java)
    implementation(libs.ktor.serialization.kotlinx.json)
    implementation(libs.spring.ai.openai.spring.boot.starter)
    implementation(libs.spring.ai.pgvector.store.spring.boot.starter)
    implementation(libs.spring.boot.starter.web)
    testImplementation(libs.kotlin.test.junit5)
    testImplementation(libs.spring.boot.starter.test)
    testRuntimeOnly(libs.junit.platform.launcher)
}

The application.properties file

First, let’s look at the application configuration.

The application.properties file includes the following configs:

spring.application.name=bell-sw-bot
spring.main.banner-mode=off
spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.dimensions=1536
spring.ai.openai.api-key={openai_key}
spring.ai.openai.chat.enabled=true
spring.ai.openai.embedding.enabled=true
spring.datasource.url=jdbc:postgresql://localhost:5432/postgres
spring.datasource.username=postgres
spring.datasource.password=password
unstructured.key={unstructured_key}
unstructured.endpoint=https://api.unstructuredapp.io/general/v0/general

Let’s take note of the most important ones:

  • spring.ai.vectorstore.pgvector.index-type defines the nearest neighbor search index type. HNSW creates a multilayer graph; other possible index types are described in the documentation.
  • spring.ai.vectorstore.pgvector.distance-type defines a search distance type. COSINE_DISTANCE is the default setting, but if the vectors are normalized, you can use other types defined in the docs.
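  • spring.ai.vectorstore.pgvector.dimensions defines the number of embedding dimensions, that is, the size of the List of Doubles stored for each document. The value 1536 matches the output size of the OpenAI ADA embedding model.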
  • spring.ai.openai.api-key contains the key for accessing OpenAI.
  • spring.ai.openai.chat.enabled and spring.ai.openai.embedding.enabled enable the OpenAI chat model and embedding model.
  • unstructured.key and unstructured.endpoint are used to configure access to the Unstructured.io API.
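The remark about normalized vectors and distance types can be checked numerically. The sketch below is plain Kotlin, independent of pgvector, with helper names invented for this illustration. It relies on the fact that for unit-length vectors the squared Euclidean distance equals twice the cosine distance, so both distance types rank neighbors in the same order.

```kotlin
import kotlin.math.sqrt

// Dot product of two vectors of equal size.
fun dot(a: DoubleArray, b: DoubleArray): Double {
    require(a.size == b.size) { "Vectors must have the same number of dimensions" }
    return a.indices.sumOf { a[it] * b[it] }
}

// Scales a non-zero vector to length 1.
fun normalize(v: DoubleArray): DoubleArray {
    val length = sqrt(dot(v, v))
    require(length > 0.0) { "Cannot normalize a zero vector" }
    return DoubleArray(v.size) { v[it] / length }
}

// Cosine distance: 0.0 for vectors pointing the same way, 1.0 for orthogonal ones.
fun cosineDistance(a: DoubleArray, b: DoubleArray): Double =
    1.0 - dot(a, b) / sqrt(dot(a, a) * dot(b, b))

// Straight-line (L2) distance between two points.
fun euclideanDistance(a: DoubleArray, b: DoubleArray): Double {
    val difference = DoubleArray(a.size) { a[it] - b[it] }
    return sqrt(dot(difference, difference))
}
```

For normalized inputs a and b, euclideanDistance(a, b) squared and 2 * cosineDistance(a, b) agree up to floating-point rounding.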

The application itself contains just a few classes.

BellSwBotApplication class

The BellSwBotApplication class is the entry point for our application. It contains the usual @SpringBootApplication annotation as well as @EnableConfigurationProperties that enables support for the classes annotated with @ConfigurationProperties (in my application, it is the UnstructuredConfig class).

@SpringBootApplication
@EnableConfigurationProperties(UnstructuredConfig::class)
class BellSwBotApplication

fun main(args: Array<String>) {
    runApplication<BellSwBotApplication>(*args)
}

UnstructuredConfig class

The UnstructuredConfig class helps to configure access to the Unstructured.io API.

@ConfigurationProperties("unstructured")
class UnstructuredConfig(val key: String, val endpoint: String)

The class is annotated with @ConfigurationProperties and is deliberately simple. Its prefix is “unstructured”, and it includes only two properties: key and endpoint. Both are defined in the application.properties file, as we saw above.

The class can be injected as a usual Spring Bean, which we will later do in the UnstructuredClient class.

UnstructuredResponse class

The typealias UnstructuredResponse represents a List of UnstructuredDocument objects (it basically means I’m too lazy to type List<UnstructuredDocument> all around the application).

typealias UnstructuredResponse = List<UnstructuredDocument>

UnstructuredDocument is a simple data class that contains basic information about a text chunk: its type, ID, text block, and Metadata, which is a separate class containing document metadata such as file name, file type, and so on. The Metadata class also contains the asMap() method that I will later use to persist the metadata in the database.

@Serializable
data class UnstructuredDocument(
    val type: String,

    @SerialName("element_id")
    val elementID: String,

    val text: String,
    val metadata: Metadata
)

@Serializable
data class Metadata(
    @SerialName("category_depth")
    val categoryDepth: Long? = null,

    val languages: List<String>,
    @SerialName("link_texts")
    val linkTexts: List<String>? = null,
    @SerialName("link_urls")
    val linkUrls: List<String>? = null,
    val filename: String,
    val filetype: String,

    @SerialName("parent_id")
    val parentID: String? = null,

    @SerialName("emphasized_text_contents")
    val emphasizedTextContents: List<String>? = null,

    @SerialName("emphasized_text_tags")
    val emphasizedTextTags: List<String>? = null,

    @SerialName("orig_elements")
    val origElements: String? = null,

    @SerialName("is_continuation")
    val isContinuation: Boolean? = null,
) {
    @Suppress("UNCHECKED_CAST")
    fun asMap(): Map<String, Any> = mapOf(
        "categoryDepth" to categoryDepth,
        "languages" to languages,
        "filename" to filename,
        "filetype" to filetype,
        "parentID" to parentID,
        "isContinuation" to isContinuation,
    ).filterValues { it != null } as Map<String, Any>
}

Right, we have text chunks with metadata. Now, we need to turn them into embeddings: each text chunk that we got from Unstructured.io needs to be sent to OpenAI, which will transform it into an embedding.

But Spring has its own Document class used in Spring AI. Here is a snippet from it:

public class Document implements Content {
   // omitted
   private final String id;
   private Map<String, Object> metadata;
   private String content;
   private List<Media> media;
   @JsonProperty(
       index = 100
   )
   private List<Double> embedding;
   // omitted

So, we need to turn each UnstructuredDocument in the UnstructuredResponse into a Spring AI Document. We will do that with the UnstructuredResponseDocumentAdapter class:

class UnstructuredResponseDocumentAdapter(response: UnstructuredDocument) :
        Document(UUID.nameUUIDFromBytes(response.elementID.toByteArray()).toString(), response.text, response.metadata.asMap()) {
    override fun toString(): String = "UnstructuredResponseDocumentAdapter() ${super.toString()}"
}

The adapter takes a single UnstructuredDocument and passes three arguments to the Document constructor: a UUID generated from the element ID returned by Unstructured.io, the chunk text, and the chunk metadata as a Map.
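One property of UUID.nameUUIDFromBytes() is worth spelling out: it is deterministic, so the same element ID always maps to the same document ID. The sketch below isolates that call; the function name documentIdFor and the sample IDs in the usage note are made up for illustration.

```kotlin
import java.util.UUID

// Derives a name-based (version 3) UUID from an element ID,
// using the same JDK call as the adapter above.
fun documentIdFor(elementId: String): String =
    UUID.nameUUIDFromBytes(elementId.toByteArray()).toString()
```

Calling documentIdFor("element-1") twice returns the same string, while documentIdFor("element-2") returns a different one. As a consequence, processing the same element again produces the same ID; how a vector store treats a repeated ID depends on its implementation.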

The next step is to save the resulting embedding. For that purpose, I use pgvector, which is a PostgreSQL extension for storing embeddings and performing vector similarity search. The embeddings are stored in a table with the following fields:

  • A Spring Document object we got after converting our UnstructuredResponse documentation block to Spring-supported format,
  • A corresponding embedding,
  • A UUID,
  • Metadata in JSON.

UnstructuredClient class

The UnstructuredClient class is responsible for sending texts to Unstructured.io and receiving LLM-ready text chunks.

First, we need to create an HttpClient. I used the Ktor HttpClient: its API is built on coroutines, so requests are non-blocking, which helps the application cope with high load.

@Component
class UnstructuredClient(val unstructuredConfig: UnstructuredConfig) {
    private val client = HttpClient(Java) {
        install(ContentNegotiation) {
            json(Json {
                allowStructuredMapKeys = true
                ignoreUnknownKeys = true
            })
        }
    }

The parse() method works with the Unstructured.io API. It takes 25 arguments, but for this application, there’s no need to configure most of them. They simply default to null. The only required argument is the MultipartFile sent to Unstructured.io. Below you will find a snippet of the method; the full signature can be found on GitHub.

    suspend fun parse(
        file: MultipartFile,
        chunkingStrategy: String? = null,
        combineUnderNChars: Int? = null,
        maxCharacters: Int? = null
    ) = client
        .submitFormWithBinaryData(
            url = unstructuredConfig.endpoint,
            formData = formData {
                file.apply {
                    append("files", file.bytes, Headers.build {
                        append(HttpHeaders.ContentDisposition, "filename=\"${file.originalFilename}\"")
                    })
                }
                chunkingStrategy?.apply { append("chunking_strategy", chunkingStrategy) }
                combineUnderNChars?.apply { append("combine_under_n_chars", combineUnderNChars.toString()) }
                maxCharacters?.apply { append("max_characters", maxCharacters.toString()) }
            }
        ) {
            accept(Json)
            contentType(FormData)
            header("unstructured-api-key", unstructuredConfig.key)
        }

Let’s look more closely at the method body.

The HttpClient submits a form-data request to the endpoint URL provided by the UnstructuredConfig class. The file is always appended; each optional argument is appended only if it is not null, otherwise the client skips it and moves on to the next one.

Finally, when the request is formed, we set the accepted response type (application/json), the request content type (multipart form data), and the API key provided by the UnstructuredConfig class.
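The null-skipping pattern is easy to try in isolation. The sketch below leaves Ktor out entirely and collects the optional parameters into a plain map; the function name formFields is hypothetical, while the field names are the ones used in parse().

```kotlin
// Hypothetical helper mirroring the pattern in parse():
// only arguments that are not null end up as form fields.
fun formFields(
    chunkingStrategy: String? = null,
    combineUnderNChars: Int? = null,
    maxCharacters: Int? = null,
): Map<String, String> = buildMap<String, String> {
    chunkingStrategy?.let { put("chunking_strategy", it) }
    combineUnderNChars?.let { put("combine_under_n_chars", it.toString()) }
    maxCharacters?.let { put("max_characters", it.toString()) }
}
```

Calling formFields(chunkingStrategy = "basic", maxCharacters = 8191) yields two entries and no combine_under_n_chars key, which is the behavior described above.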

BotController class

All the building blocks of the application are ready, so it’s time to handle communication with the user!

The BotController class is a RestController configured in the following way:

@RestController
class BotController(val vectorStore: VectorStore, chatModel: ChatModel, val unstructuredClient: UnstructuredClient) {
    private val chatClient = ChatClient.create(chatModel)

Note that we can’t inject ChatClient, which is used to communicate with OpenAI, directly. Instead, we create it from the ChatModel bean, which is enabled in the .properties file with

spring.ai.openai.chat.enabled=true

VectorStore is also configured in the .properties file with

spring.ai.vectorstore.pgvector.index-type=HNSW
spring.ai.vectorstore.pgvector.distance-type=COSINE_DISTANCE
spring.ai.vectorstore.pgvector.dimensions=1536

Our BotController contains only two methods, but that’s all we need to answer users’ questions and save embeddings derived from our documentation in the database.

Storing the document embeddings

First, let’s create the storeDocument() method for sending documents to Unstructured.io (in a real-world scenario, it should be hidden behind the Spring Security wall to prevent users from sending random documents and storing them in our database).

    @PostMapping("/")
    suspend fun storeDocument(@RequestParam file: MultipartFile): ResponseEntity<Unit> {
        val responses: UnstructuredResponse = unstructuredClient.parse(
            file = file,
            combineUnderNChars = 500,
            maxCharacters = 8191,
            includeOrigElements = false,
            chunkingStrategy = "basic",
        )
            .body()

        val docs = responses.map(::UnstructuredResponseDocumentAdapter)
        vectorStore.add(docs)
        return ResponseEntity.of(Optional.of(Unit))
    }

This method receives a MultipartFile and hands it over to the UnstructuredClient class, which communicates with Unstructured.io. Along with the file, we pass a few other essential settings:

  • combineUnderNChars = 500 tells the Unstructured API that if for some reason, we get tiny pieces of text under 500 characters, they should be combined into one,
  • maxCharacters = 8191 is the max size of a resulting text chunk,
  • includeOrigElements = false instructs the API not to return the original document together with the chunks,
  • chunkingStrategy = "basic" sets one of the chunking strategies we discussed earlier.

After receiving a List of responses from Unstructured.io (UnstructuredResponse), we convert all responses to Spring Documents and save them to the database with

vectorStore.add(docs)

This small line of code actually contains tons of Spring magic under the hood. The method determines which model we use for creating embeddings, obtains embeddings for these Documents from the model, and saves them to the VectorStore along with the source texts.

Querying the bot

Now that we have stored our documents, we can build the most exciting part of the bot: the user interface. In our case, it will just return text, but it can be made as complex as needed to integrate it anywhere. To achieve this, let’s write the query() method for processing user queries:

    @GetMapping("/search")
    fun query(@RequestParam q: String): String {
        val query = SearchRequest.query(q).withTopK(3)
        val docs = vectorStore.similaritySearch(query)
        val information = docs.joinToString("\n---||---\n") {
            """TITLE: ${it.metadata["filename"]}
            |BODY:
            |${it.content}
        """.trimMargin()
        }
        val systemPromptTemplate = SystemPromptTemplate(
            """You are a helpful assistant.

You find information in data provided below.
The format of data is the following:
---
TITLE: title of the source post
BODY:
text of the document or its part
---
The delimiter between documents is "---||---"
If you have to reference a document, reference it by name.
Always reference a document where you found the information.
If there is no information, you  answer with a message
"Sorry, I do not have such an information".

Use the following information to answer the question:
---
{information}"""
        )
        val systemMessage = systemPromptTemplate.createMessage(
            mapOf("information" to information)
        )
        val userMessagePromptTemplate =
            PromptTemplate(""""{question}"""")
        val model = mapOf("question" to q)

        val userMessage = UserMessage(userMessagePromptTemplate.create(model).getContents())
        val prompt = Prompt(listOf(systemMessage, userMessage))
        val response = chatClient.prompt(prompt).call().content()
        return response
    }

The whole heavy-lifting process of searching for similar documents in our vector storage is hidden in these two lines of code:

val query = SearchRequest.query(q).withTopK(3)
val docs = vectorStore.similaritySearch(query)

Here, we feed the user's question to the static query() method provided by SearchRequest and ask for the three most relevant documents associated with the topic of the question.

The similaritySearch() method hides the logic of searching for similar embeddings: a request to OpenAI to create an embedding out of the user’s question, an SQL query against the stored embeddings, and so on. The method returns a list of documents found in the database.
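To make the hidden steps tangible, here is a toy in-memory version of the same flow: embed each text when it is added, embed the question, rank by cosine similarity, and keep the top K. Everything in it is an assumption for illustration: ToyVectorStore is not Spring AI's implementation, and keywordEmbedding is a crude keyword counter standing in for the call to the embedding model.

```kotlin
import kotlin.math.sqrt

// Hypothetical in-memory stand-in for a vector store.
class ToyVectorStore(private val embed: (String) -> DoubleArray) {
    private val entries = mutableListOf<Pair<String, DoubleArray>>()

    // Embeds each text once at insertion time and keeps the vector next to the text.
    fun add(texts: List<String>) {
        for (text in texts) entries.add(text to embed(text))
    }

    // Embeds the question, then returns the topK texts with the highest cosine similarity.
    fun similaritySearch(question: String, topK: Int): List<String> {
        val queryVector = embed(question)
        return entries
            .sortedByDescending { (_, vector) -> cosineSimilarity(queryVector, vector) }
            .take(topK)
            .map { (text, _) -> text }
    }

    private fun cosineSimilarity(a: DoubleArray, b: DoubleArray): Double {
        require(a.size == b.size) { "Vectors must have the same number of dimensions" }
        var dot = 0.0
        var normA = 0.0
        var normB = 0.0
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        val denominator = sqrt(normA * normB)
        // A zero vector has no direction; treat it as unrelated to everything.
        return if (denominator == 0.0) 0.0 else dot / denominator
    }
}

// Hypothetical embedding: counts occurrences of three keywords.
// A real model returns learned values in many more dimensions.
fun keywordEmbedding(text: String): DoubleArray {
    val keywords = listOf("java", "docker", "license")
    val words = text.lowercase().split(Regex("\\W+"))
    return DoubleArray(keywords.size) { index -> words.count { it == keywords[index] }.toDouble() }
}
```

A store created with ToyVectorStore { keywordEmbedding(it) } and filled with a few sentences returns the sentence mentioning Java first when asked a question that contains the word “java”. In the real application, the ranking runs inside PostgreSQL instead of in memory.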

After that, we take these documents and format them by adding a TITLE (which will contain the file name) and BODY (with the text block). And then we add them to the prompt.
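The formatting step can be tried on its own with plain Kotlin. In this sketch, Chunk is a hypothetical stand-in for the two pieces of a Spring AI Document that the formatter reads (the filename from the metadata and the content), and formatInformation is an invented name.

```kotlin
// Hypothetical stand-in for the parts of a Document used by the formatter.
data class Chunk(val filename: String, val content: String)

// Renders each chunk as a TITLE/BODY block and joins the blocks with the delimiter
// announced in the system prompt.
fun formatInformation(chunks: List<Chunk>): String =
    chunks.joinToString("\n---||---\n") {
        """
        |TITLE: ${it.filename}
        |BODY:
        |${it.content}
        """.trimMargin()
    }
```

Note that trimMargin() runs after the content has been interpolated, so a content line that itself begins with “|” (after optional whitespace) would lose that prefix. For documentation text this is rarely a concern, but it is worth knowing for chunks that contain tables.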

The systemPromptTemplate variable defines how we communicate with OpenAI. It includes the format of our data, specifies a delimiter, sets other essential requirements, and provides the document chunks that the bot uses to form an answer. These document chunks are the ones that Spring found for us and we formatted into the information block.

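Finally, we combine the system message and the user message into a prompt, send it to the model through chatClient, and return the model's answer to the user.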

Conclusion

As you can see, building an AI-backed application with Spring is not difficult, and if you have a working knowledge of basic Spring concepts, you will get the hang of Spring AI pretty quickly.

If you want to read more tutorials on modern Spring development, subscribe to our newsletter!

 
