johnnyreilly.com

Using Kernel Memory to Chunk Documents into Azure AI Search

John Reilly April 21, 2024

I've recently been working on building retrieval augmented generation (RAG) experiences into applications: systems where large language models (LLMs) can query documents. To achieve this, we first need a strategy to chunk those documents and make them LLM-friendly. Kernel Memory, a sister project of Semantic Kernel, supports this.

If you haven't heard of Kernel Memory before, it's a library that, amongst other things, provides a way to chunk documents into smaller pieces. To quote the Kernel Memory GitHub repository:

Kernel Memory (KM) is a service built on the feedback received and lessons learned from developing Semantic Kernel (SK) and Semantic Memory (SM). It provides several features that would otherwise have to be developed manually, such as storing files, extracting text from files, providing a framework to secure users' data, etc. The KM codebase is entirely in .NET, which eliminates the need to write and maintain features in multiple languages. As a service, KM can be used from any language, tool, or platform, e.g. browser extensions and ChatGPT assistants.

In this post, I'll show you how to use Kernel Memory to chunk documents in the background of an ASP.NET application.

Kernel Memory: Serverless vs Service

There are two ways that we can run Kernel Memory: "Serverless" and "Service".

Running the full service is more powerful, but effectively requires running a separate application. We would then need to integrate our main app with that. Given that I'm building with ASP.NET, I'll be using the serverless approach, which allows us to run Kernel Memory within the context of a single application (which will contain our main app code as well). We can then manage our integrations with Kernel Memory as simple method calls.

This is simpler and more cost-effective, but it does have some limitations. The service approach offers more features, including persistent retry logic. The documentation states that if we want to scale then we'll likely want to consider the service approach. But my own experience has been that serverless works very well for small to medium-sized applications.

Perhaps surprisingly, using serverless we can still have the experience of running Kernel Memory as a non-blocking separate service within the context of our ASP.NET application. This is achieved by running Kernel Memory as a hosted service - this is the standard ASP.NET mechanism for running background tasks. That's what we're going to use.

There are four parts to bring this to life:

Our Kernel Memory serverless instance - this is where the integration between Kernel Memory, Azure OpenAI, Azure AI Search and the actual chunking takes place
A queue which we'll use to provide documents for chunking with Kernel Memory
Our hosted service which will bring together the queue and the Kernel Memory integration to manage our background document processing
An endpoint in our ASP.NET application to add documents to the queue

Setting up Kernel Memory serverless

There are a number of dependencies that we need to add to our project to get Kernel Memory working. These are:

With this in place we'll start to integrate with Kernel Memory. We will first construct ourselves an IKernelMemory like so:

What we're doing here is creating an IKernelMemory instance and making it aware of all our deployed Azure resources. Going through how to deploy those is out of the scope of this post, but it's probably worth highlighting that we're using AzureIdentity for auth as it's particularly secure. If you would like to use other options, you certainly can.

Note also that we're using the text-embedding-ada-002 model for text embedding and the gpt-3.5-turbo-16k model for text generation. These are the models that I've found to be most effective for my use cases. Of the two, the text embedding model is the more important here - it's the one that generates the embeddings for the chunks that our documents are split into.

You'll also note we're using Azure AI Document Intelligence; this is optional and just tackles a few more document chunking scenarios.

Chunking with Kernel Memory serverless

With our IKernelMemory ready to go, we now need a way to chunk documents. Deep down, this is achieved by acquiring the document we want to chunk from blob storage and passing it to _memory.ImportDocumentAsync with the name of the index we want to process into. You can see examples of this usage in the Kernel Memory docs. You can also see how it works in the Kernel Memory repository itself.

However, it's often helpful to have a number of other things in place to manage:

Applying tags to documents (this gives us more power when querying later)
Creating acceptable names / ids for the Azure AI Search Service
Handling rate limiting - more on that in a moment
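
To make the second item concrete, here is a minimal sketch of turning a file name into a conservative id. It is not the code from this post: the `ToSafeId` helper is hypothetical, and the rule it applies (lowercase letters, digits and dashes only) is an assumption, so check the naming rules of the service you are targeting before relying on it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeNames
{
    // Hypothetical helper: converts an arbitrary file name into a conservative id.
    // Assumption: the target only accepts lowercase letters, digits and dashes.
    public static string ToSafeId(string fileName)
    {
        if (string.IsNullOrWhiteSpace(fileName))
            throw new ArgumentException("A file name is required.", nameof(fileName));

        var lowered = fileName.Trim().ToLowerInvariant();

        // Replace every run of disallowed characters with a single dash.
        var dashed = Regex.Replace(lowered, "[^a-z0-9]+", "-");

        // Remove any dashes left over at the start or end.
        var trimmed = dashed.Trim('-');

        if (trimmed.Length == 0)
            throw new ArgumentException("The file name contains no usable characters.", nameof(fileName));

        return trimmed;
    }
}
```

With this rule, a name such as "My Report (v2).PDF" becomes "my-report-v2-pdf". Note that two different file names can collapse to the same id, so a real implementation may want to append a hash or timestamp.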

To that end, I tend to end up implementing a Process method that looks something like this:

Much of the code above concerns rate limiting / 429s. It's not uncommon when chunking to be hit by 429s - "Too many requests". Chunking documents requires use of Azure OpenAI resources, and the level of access we have is typically restricted and controlled via quotas. There's an element of this that we can avoid by controlling the quota available on our Azure OpenAI deployments (you can read more about this here), and we can implement a certain amount of retry logic also.

The code above tries to handle a number of re-attempts as wisely as it can, using the information that Azure APIs surface around when re-attempting is allowed. Interestingly you'll see a variety of strategies employed here around retry times, as the way information is surfaced to support this keeps changing! We can likely have less code in future when a final standard is committed to.
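
The general shape of that retry handling can be sketched as below. This is an illustration under stated assumptions rather than the implementation from this post: the `RateLimitRetry` class is hypothetical, it assumes the 429 surfaces as an `HttpRequestException` with a `TooManyRequests` status code (the exception type you actually see may differ), and it assumes the error message may contain text like "retry after 26 seconds".

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

public static class RateLimitRetry
{
    // Assumption: some 429 messages contain text like "retry after 26 seconds".
    // Returns null when no such hint is present.
    public static TimeSpan? TryGetRetryAfter(string message)
    {
        var match = Regex.Match(message ?? "", @"retry after (\d+) second", RegexOptions.IgnoreCase);
        if (match.Success && int.TryParse(match.Groups[1].Value, out var seconds))
            return TimeSpan.FromSeconds(seconds);
        return null;
    }

    // Runs the operation, retrying on 429s until maxAttempts is reached.
    // The final failure (or any non-429 failure) propagates to the caller.
    public static async Task RunAsync(
        Func<Task> operation,
        int maxAttempts = 5,
        TimeSpan? fallbackDelay = null,
        CancellationToken cancellationToken = default)
    {
        var fallback = fallbackDelay ?? TimeSpan.FromSeconds(10);

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                await operation();
                return;
            }
            catch (HttpRequestException ex)
                when (ex.StatusCode == HttpStatusCode.TooManyRequests && attempt < maxAttempts)
            {
                // Prefer the wait time the service suggests; otherwise use the fallback.
                var delay = TryGetRetryAfter(ex.Message) ?? fallback;
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}
```

A call site would wrap the chunking call, for example `await RateLimitRetry.RunAsync(() => ProcessDocumentAsync(uri))`, where `ProcessDocumentAsync` is a placeholder for whatever performs the import.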

Bringing it together

We're going to put this all together in a single class called RagGestionService.

You might be puzzled by the name "RagGestion" - this is a term my good friend George Karsas coined to describe the process of preparing documents for Retrieval Augmented Generation. It's a great term, and I've adopted it!

The RagGestionService will look like this:

By the way, I don't advise hard-coding the Azure resources as I have here, but rather passing them in as configuration. Incidentally, we could also use dependency injection to inject a prepared IKernelMemory instance into the service, but again, I'm keeping it simple here for clarity.

Our document processor queue

To provide documents for chunking, we need a queue: something simple that we can add documents to and then work through in the background. We're going to use a ConcurrentQueue for this, with a small wrapper around it so we can share the queue between our UI and our background task, and also do some logging.

The EnqueueDocumentUri method above will be called from the context of our UI - from an ASP.NET controller. This will be invoked when someone uploads a file and will also be responsible for adding the file to a BlobService for storage prior to processing.

By contrast, the DequeueDocumentUri method will be called from the context of our background service; it will call this method to pick up a file for processing.
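
As a rough illustration of such a wrapper, here is a dependency-free sketch. The method names come from this post, but everything else is an assumption: the `QueuedDocument` record is a hypothetical shape for a queued item, and a plain `Action<string>` stands in for a real logger such as `ILogger`.

```csharp
#nullable enable
using System;
using System.Collections.Concurrent;

// Hypothetical shape for a queued item: a document URI plus the index it should go into.
public sealed record QueuedDocument(string IndexName, Uri DocumentUri);

public sealed class DocumentProcessorQueue
{
    private readonly ConcurrentQueue<QueuedDocument> _queue = new();
    private readonly Action<string> _log;

    // Assumption: Action<string> stands in for ILogger to keep the sketch self-contained.
    public DocumentProcessorQueue(Action<string>? log = null)
    {
        _log = log ?? (_ => { });
    }

    // Called from the UI side (for example a controller) after the file has been stored.
    public void EnqueueDocumentUri(string indexName, Uri documentUri)
    {
        _queue.Enqueue(new QueuedDocument(indexName, documentUri));
        _log($"Queued {documentUri} for index {indexName}");
    }

    // Called from the background service; returns null when the queue is empty.
    public QueuedDocument? DequeueDocumentUri()
    {
        if (_queue.TryDequeue(out var document))
        {
            _log($"Dequeued {document.DocumentUri} for index {document.IndexName}");
            return document;
        }

        return null;
    }
}
```

Because ConcurrentQueue is thread-safe, a single instance can be shared between request threads and the background task without extra locking.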

Our background service

Next, we need a background service to bring together our DocumentProcessorQueue and our RagGestionService. This is a standard ASP.NET hosted service. It will look like this:

This service will run in the background of the ASP.NET application, pick up documents from the queue (if there are any) and pass them to the RagGestionService for processing. It will trigger every 5 seconds, running for the lifetime of the application.

You'll see we're doing some timing here - this is because it's useful to know how long the process takes. If we're processing a lot of documents, we'll want to know how long it's taking to process each one.
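
The polling-and-timing loop described above can be sketched without any hosting dependencies. This is a simplified stand-in, not the service from this post: the `DocumentPollingLoop` class, its parameters and the choice to process one document per tick are all assumptions. In a real hosted service, a loop like this would sit inside `BackgroundService.ExecuteAsync`, with the cancellation token supplied by the host.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class DocumentPollingLoop
{
    // Wakes up on every tick, takes at most one document from the queue,
    // processes it, and logs how long the processing took.
    public static async Task RunAsync(
        ConcurrentQueue<Uri> queue,
        Func<Uri, CancellationToken, Task> processDocument,
        TimeSpan interval,
        Action<string> log,
        CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(interval);
        try
        {
            while (await timer.WaitForNextTickAsync(stoppingToken))
            {
                if (!queue.TryDequeue(out var documentUri))
                    continue; // nothing to do on this tick

                var stopwatch = Stopwatch.StartNew();
                try
                {
                    await processDocument(documentUri, stoppingToken);
                    log($"Processed {documentUri} in {stopwatch.ElapsedMilliseconds} ms");
                }
                catch (Exception ex) when (ex is not OperationCanceledException)
                {
                    // One failed document should not stop the loop.
                    log($"Failed to process {documentUri}: {ex.Message}");
                }
            }
        }
        catch (OperationCanceledException)
        {
            // Expected when the application is shutting down.
        }
    }
}
```

Passing `TimeSpan.FromSeconds(5)` as the interval would match the 5 second trigger described above.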

Adding documents to the queue

To add documents to the queue, we'll need to create an endpoint in our ASP.NET application. This endpoint will accept files and add them to the queue. Here's an example of how we might do that:

As we can see, this endpoint:

Accepts files from a POST request with an index name in the querystring
Uploads them to Blob Storage (matching the container name to the index they will be processed into in future)
Adds them to the queue with _documentProcessorQueue.EnqueueDocumentUri. This will then be picked up by the background service and processed.

Registering our services

Finally, we'll need to register our services in the Program.cs file. We'll want to add the following:

With this in place we have an application that can upload documents and chunk them in the background.

Conclusion

And that's it! This is an ASP.NET application that can chunk documents (or RagGest 😉) in the background using Kernel Memory running in serverless mode. I haven't yet had the need to upgrade to the full Kernel Memory service. Perhaps the day will come, but the mileage we can get with this approach is considerable.

Many thanks to David Rosevear and George Karsas for their help working on this mechanism. And George for "RagGestion" - I love it!
