Michelle Anne Tabirao
on 15 May 2025
Introduction
One of the most critical gaps in traditional Large Language Models (LLMs) is that they rely on static knowledge already contained within them. They might be very good at understanding and responding to prompts, but they often fall short in providing current or highly specific information. This is where retrieval-augmented generation (RAG) comes in: it addresses this gap by incorporating current, external information that serves as a reliable source of truth for these models.
In our previous blog on understanding and deploying RAG, we walked you through the basics of what this technique is and how it enhances generative AI models by utilizing external knowledge sources such as documents and extensive databases. These external knowledge bases enhance machine learning models for enterprise applications by providing verifiable, up-to-date information that reduces errors, simplifies implementation, and lowers the cost of continuous retraining.
Building a robust generative AI infrastructure, such as those for RAG, can be complex and challenging. It requires careful consideration of the technology stack, data, scalability, ethics, and security. For the technology stack, the hardware, operating systems, cloud services, and generative AI services must be resilient and efficient based on the scale that enterprises require.
There are several open source software options available for building generative AI infrastructure and complex AI projects that accelerate development, avoid vendor lock-in, reduce costs, and satisfy enterprise needs.
Objective
In this guide, we will take you through setting up a RAG pipeline. We will utilize open source tools such as Charmed OpenSearch for efficient search retrieval and KServe for machine learning inference, specifically in Azure and Ubuntu environments, while taking advantage of the underlying silicon.
This guide is intended for data enthusiasts, engineers, scientists, and machine learning professionals who want to start building RAG solutions on public cloud platforms, such as Azure, using enterprise open source tools that are not part of Azure's native service offering. It can be used for various projects, including proofs of concept, development, and production.
Please note that multiple open source tools not highlighted in this guide can be used in place of the ones we outline. In cases where you do use different tools, you should adjust the hardware specifications—such as storage, computing power, and configuration—to meet the specific requirements of your use case.
RAG workflow
When building a generative AI project, such as a RAG system and advanced generative AI reference architectures, it is crucial to include multiple components and services. These components typically encompass databases, knowledge bases, retrieval systems, vector databases, model embeddings, large language models (LLMs), inference engines, prompt processing, and guardrail and fine-tuning services, among others.
RAG allows users to choose the most suitable RAG services and applications for their specific use cases. The reference workflow outlined below mainly utilizes two open source tools: Charmed OpenSearch and KServe. In the RAG workflow depicted below, fine-tuning is not mandatory; however, it can enhance the performance of LLMs as the project scales.

Figure 1: RAG workflow diagram using open source tools
The table below describes all the RAG services highlighted in the workflow diagram above and maps the open source solutions that are used in this guide.
| Services | Description | Open source solutions |
| --- | --- | --- |
| Advanced parsing | Text splitters are advanced parsing techniques applied to documents before they enter the RAG system, so that the input is cleaner, more focused, and more informative. | Charmed Kubeflow: text splitters |
| Ingest/data processing | The ingest or data processing layer is a data pipeline responsible for extracting data, cleansing it, and removing unnecessary content. | Charmed OpenSearch (document processing) |
| Embedding model | The embedding model is a machine learning model that converts raw data into vector representations. | Charmed OpenSearch: sentence transformer |
| Retrieval and ranking | This component retrieves data from the knowledge base and ranks the fetched information based on relevance scores. | Charmed OpenSearch with FAISS (Facebook AI Similarity Search) |
| Vector database | A vector database stores vector embeddings so data can be easily searched by the retrieval and ranking service. | Charmed OpenSearch: k-NN index as a vector database |
| Prompt processing | This service formats user queries and retrieved text into a structured prompt for the LLM. | Charmed OpenSearch: ML agent predict |
| LLM | This component provides the final response using one or more GenAI models. | GPT, Llama, DeepSeek |
| LLM inference | This refers to operationalizing machine learning in production by running new data through a trained model so that it produces an output. | Charmed Kubeflow with KServe |
| Guardrail | This component ensures ethical content in the GenAI response by applying a guardrail filter to inputs and outputs. | Charmed OpenSearch: guardrail validation model |
| LLM fine-tuning | Fine-tuning is the process of taking a pre-trained machine learning model and further training it on a smaller, targeted dataset. | Charmed Kubeflow |
| Model repository | This component stores and versions trained machine learning models, especially during fine-tuning. The registry can track a model's lifecycle from deployment to retirement. | Charmed Kubeflow, Charmed MLflow |
| Framework for building LLM applications | This simplifies LLM workflows, prompts, and service integration so that building LLM applications is easier. | LangChain |
This table provides an overview of the key components involved in building a RAG system and advanced GenAI reference solution, along with the associated open source solutions for each service. Each service performs a specific task that can enhance your LLM setup, whether it relates to data management and preparation, embedding models and vector storage, or the LLM itself.
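To make the retrieval, prompt-processing, and LLM stages more concrete, the sketch below shows a minimal query flow in Python. It is illustrative only: it assumes a reachable OpenSearch cluster with a k-NN index (here hypothetically named rag-documents) that already contains embedded document chunks, and it uses a sentence-transformers model for query embedding. Your endpoint, credentials, index name, and field names will differ.

```python
# Minimal RAG query sketch (illustrative only).
# Assumptions: an OpenSearch cluster reachable at OPENSEARCH_HOST, a k-NN index
# named "rag-documents" with a "text" field and an "embedding" knn_vector field,
# and the opensearch-py and sentence-transformers packages installed.
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

OPENSEARCH_HOST = "localhost"  # placeholder: your cluster endpoint
INDEX_NAME = "rag-documents"   # placeholder: your index name

client = OpenSearch(
    hosts=[{"host": OPENSEARCH_HOST, "port": 9200}],
    http_auth=("admin", "admin"),  # replace with your cluster credentials
    use_ssl=True,
    verify_certs=False,
)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query and fetch the k most similar document chunks."""
    vector = embedder.encode(query).tolist()
    response = client.search(
        index=INDEX_NAME,
        body={"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}},
    )
    return [hit["_source"]["text"] for hit in response["hits"]["hits"]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Assemble retrieved context and the user question into a single prompt."""
    context = "\n\n".join(chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

question = "How do I scale OpenSearch?"
prompt = build_prompt(question, retrieve(question))
# `prompt` is then sent to the LLM inference service (see the KServe section below).
```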
The deployment guide below will cover most of the services except the following: model repository, LLM fine-tuning, and text splitters.
The rate of innovation in this field, particularly within the open source community, has become exponential. It is crucial to stay updated with the latest developments, including new models and emerging RAG solutions.
RAG component: Charmed OpenSearch
Charmed OpenSearch will be mainly used in this RAG workflow deployment. Charmed OpenSearch is an operator that builds on the OpenSearch upstream by integrating automation to streamline the deployment, management, and orchestration of production clusters. The operator enhances efficiency, consistency, and security. Its rich features include high availability, seamless scaling features for deployments of all sizes, HTTP and data-in-transit encryption, multi-cloud support, safe upgrades without downtime, roles and plugin management, and data visualization through Charmed OpenSearch Dashboards.
With the Charmed OpenSearch operator (also known as a charm), you can deploy and run OpenSearch on physical and virtual machines (VMs) and other cloud and cloud-like environments, including AWS, Azure, Google Cloud, OpenStack, and VMware. The deployment guide in the next section uses Azure VM instances:
Figure 2: Charmed OpenSearch architecture
Charmed OpenSearch uses Juju. Juju is an open source orchestration engine for software operators that enables the deployment, integration, and lifecycle management of applications at any scale on any infrastructure. In the deployment process, the Juju controller manages the flow of data and interactions within multiple OpenSearch deployments, including mediating between different parts of the system.
Deploying and using Charmed OpenSearch is straightforward. If you'd like to learn how to use and deploy it in a range of cloud environments, you can read more in our in-depth Charmed OpenSearch documentation.
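Once the charm has deployed the cluster, applications interact with it through the standard OpenSearch API. As a hedged illustration, the Python sketch below creates a k-NN-enabled index suitable for storing vector embeddings; the endpoint, credentials, index name, and vector dimension are placeholders that depend on your deployment and on the embedding model you choose.

```python
# Sketch: create a k-NN index for vector embeddings on a deployed OpenSearch cluster.
# Endpoint, credentials, index name, and dimension are placeholders; the dimension
# must match your embedding model (384 for all-MiniLM-L6-v2, for example).
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "your-opensearch-endpoint", "port": 9200}],  # placeholder
    http_auth=("admin", "your-password"),                        # placeholder
    use_ssl=True,
    verify_certs=False,
)

index_body = {
    "settings": {"index": {"knn": True}},  # enable the k-NN plugin for this index
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "knn_vector", "dimension": 384},
        }
    },
}

client.indices.create(index="rag-documents", body=index_body)
```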
RAG component: KServe
KServe is a cloud-native solution within the Kubeflow ecosystem that serves machine learning models. By leveraging Kubernetes, KServe operates effectively in cloud-native environments. It can be used for various purposes, including model deployment, machine learning model versioning, LLM inference, and model monitoring.
In the RAG use case discussed in this guide, we will use KServe to perform LLM inference: it serves an already-trained LLM so that it can make predictions on new data. This calls for a robust LLM inference system that works with both local and public LLMs, scales well, handles high concurrency, and delivers low-latency, accurate responses.
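To illustrate what calling such an inference service can look like, here is a hedged sketch that posts a prompt to a KServe InferenceService using the V1 REST protocol. The ingress URL, Host header, and model name are placeholders for whatever your Charmed Kubeflow and KServe deployment exposes, and the exact request and response payload depends on the model server you deploy behind KServe.

```python
# Sketch: query a KServe InferenceService over the V1 REST protocol.
# URL, Host header, and model name are placeholders; the payload schema depends
# on the predictor (model server) deployed behind KServe.
import requests

INGRESS_URL = "http://localhost:8080"     # placeholder: your cluster ingress address
SERVICE_HOST = "llm.default.example.com"  # placeholder: InferenceService hostname
MODEL_NAME = "llm"                        # placeholder: name of the served model

def generate(prompt: str) -> dict:
    """Send a prompt to the model's V1 predict endpoint and return the raw response."""
    response = requests.post(
        f"{INGRESS_URL}/v1/models/{MODEL_NAME}:predict",
        headers={"Host": SERVICE_HOST},
        json={"instances": [{"prompt": prompt}]},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

print(generate("Summarise how Charmed OpenSearch supports RAG."))
```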
In the deployment guide, we’ll take you through a comprehensive and hands-on guide to building and deploying a RAG service using Charmed OpenSearch and KServe. Charmed Kubeflow by Canonical natively supports KServe.
Deployment guide to building an end-to-end RAG workflow with Charmed OpenSearch and KServe
Our deployment guide for building an end-to-end RAG workflow with Charmed OpenSearch and KServe covers everything you need to make your own RAG workflow, including:
- Prerequisites
- Install Juju and configure Azure credentials
- Bootstrap Juju controller and create Juju model for Charmed OpenSearch
- Deploy Charmed OpenSearch and set up the RAG service
- Ask and start conversational flow with your RAG
Canonical for your RAG requirements
Canonical provides data and AI workshops, enterprise open source tools and services, and advice on securing your code, data, and models in production.
Build the right RAG architecture and application with the Canonical RAG workshop
Canonical offers a 5-day workshop designed to help you start building your enterprise RAG systems. By the end of the workshop, you will have a thorough understanding of RAG and LLM theory, architecture, and best practices. Together, we will develop and deploy solutions tailored to your specific needs. Download the datasheet here.
Learn and use best-in-class RAG tooling on any hardware and cloud
Unlock the benefits of RAG with open source tools designed for your entire data and machine learning lifecycle. Run RAG-enabled LLMs on any hardware and cloud platform, in production and at scale.
Canonical offers enterprise-ready AI infrastructure along with open source data and AI tools to help you kickstart your RAG projects.
Secure your AI stack with confidence
Enhance the security of your GenAI projects while mastering best practices for managing your software stack. Discover ways to safeguard your code, data, and machine learning models in production with Confidential AI.