Hello RAG! Using YugabyteDB to Power a RAG Pipeline
In the rapidly evolving world of AI, Retrieval-Augmented Generation (RAG) has emerged as a way of supplementing large language models (LLMs) with custom-curated data. This helps make the results relevant, efficient, and context-aware. But, as with any hybrid AI system, data infrastructure plays a crucial role behind the scenes.
In this post, we explore how YugabyteDB, a distributed SQL database, can power the retrieval layer of a RAG pipeline, offering scale, resilience, and low-latency access to semantically rich data. Whether you’re building smart chatbots, enterprise search, or generative AI assistants, this setup can improve performance, operational efficiency, and cost-effective scaling.
Why Use RAG?
One of the more compelling AI applications is frontline customer support.
Virtually every company that sells a product or service requires some form of customer support. Almost universally, support teams experience budget constraints—whether it be time, staffing, or cost.
What if we could offer better support for less cost?
That’s where AI, and more specifically Retrieval-Augmented Generation (RAG), can step up.
How do we get our own support data into the AI?
That’s the challenge. Support content typically resides across many spaces, including public documentation, internal knowledge bases, Slack threads, support tickets, and more. To leverage this content effectively, we need to vectorize it. This means converting it into embeddings that preserve semantic meaning and can be searched efficiently.
For many companies, keeping this data in-house is non-negotiable. Hosting your own vector database ensures control, privacy, and security, which are key requirements for enterprise adoption.
Here’s how to keep support documents on premises while still supplying the LLM with the context it needs:
- First, transform all the support content into vector embeddings and store them in a PostgreSQL-compatible vector database like YugabyteDB, which supports hybrid transactional and vector workloads.
- Second, when a user asks a question, generate an embedding for that question, search the internal vector store for semantically similar content, and pass those results as context to the LLM to generate accurate, relevant responses.
This is the essence of RAG.
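To make those two phases concrete before the full example, here is a minimal sketch in which a plain Python list stands in for the vector store. It assumes the openai Python package (v1+) and an OPENAI_API_KEY in the environment; the YugabyteDB-backed version of both phases follows later in this post.

# Minimal RAG sketch: an in-memory list stands in for the vector store.
# Assumes the openai Python package (v1+) and OPENAI_API_KEY in the environment.
import math
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def embed(text):
    # Turn text into a 1536-dimension embedding (same model used later in this post).
    return client.embeddings.create(model="text-embedding-ada-002", input=text).data[0].embedding

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Phase 1: vectorize the support content and store it.
docs = [
    "Passwords can be reset from the account settings page.",
    "Invoices are emailed on the first business day of each month.",
]
store = [(doc, embed(doc)) for doc in docs]

# Phase 2: embed the question, retrieve the closest content, pass it to the LLM.
question = "How do I change my password?"
question_vec = embed(question)
context = max(store, key=lambda item: cosine_similarity(question_vec, item[1]))[0]

answer = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}],
).choices[0].message.content
print(answer)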
A Practical Example of RAG in Action
Below is a simple example that:
- Loads a directory of files (./data/paul_graham)
- Splits them into chunks
- Vectorizes the chunks using OpenAI embeddings
- Stores each chunk’s ID, text, and vector in a YugabyteDB table called vectors
The same setup can be used for real support content—internal docs, chat logs, email threads, etc.—instead of sample essays.
All you need is an OpenAI API key (exported as OPENAI_API_KEY), and a running YugabyteDB instance with vector support enabled.
Once loaded, you can query the vector table with any user question, retrieve the most relevant matches, and use them to feed a large language model like GPT-4.
This same RAG (Retrieval-Augmented Generation) approach isn’t limited to just “Ask Your Support Knowledge Base” scenarios. It can also be applied to a wide range of enterprise use cases, including:
- Semantic search
- Recommendations (for products, services, advice, and more)
- Personalization
- Fraud detection
RAG is a versatile framework that can enhance decision-making and user experience across many business domains.
Now for a code example of using RAG with YugabyteDB. The full code can be found in this GitHub repo.
First, I created the YugabyteDB database at https://cloud.yugabyte.com/ using the latest available version (2.25.1), enabled the vector extension, and created a table for the vector embeddings:
CREATE EXTENSION vector;

CREATE TABLE vectors (
    id TEXT PRIMARY KEY,
    article_text TEXT,
    embedding VECTOR(1536)
);

CREATE INDEX NONCONCURRENTLY ON vectors USING ybhnsw (embedding vector_cosine_ops);
Visit Yugabyte Vector Docs for additional information.
The SQL commands can be run from the cloud web console, which can connect directly to the database, or with tools like psql or DBeaver.
- I installed psql on my Mac with: “brew install postgresql@15”
- Then connected with: “psql -h hostname -p 5433 -U yugabyte”
- On EC2, I installed with: “sudo dnf install -y postgresql15-server postgresql15-contrib”
- I used a Python 3.9 setup:
python3.9 -m venv aiblog
source aiblog/bin/activate
cd aiblog
pip install llama-index
pip install psycopg2
export OPENAI_API_KEY='your openAI key'

# in ./aiblog/data I have a file about "paul_graham"
# you could put any textual data in ./aiblog/data
# that you want to supplement the LLM retrieval with

# I only needed the llama-index and psycopg2 packages, but you
# can install my full environment with requirements.txt from
# the github repo
# https://github.com/kyle-hailey/rag_example
pip install -r requirements.txt
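If you'd rather create the extension and table from Python than from psql or the web console, the same DDL can be run over psycopg2. This is just a sketch; it assumes the same connection string that insert.py uses below.

import psycopg2

# Sketch: run the same DDL from Python instead of psql/DBeaver.
# Assumes the connection string used by insert.py below.
conn = psycopg2.connect("postgresql://yugabyte:password@127.0.0.1:5433/yugabyte")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS vectors (
            id TEXT PRIMARY KEY,
            article_text TEXT,
            embedding VECTOR(1536)
        );
    """)
    # The ybhnsw index from the earlier CREATE INDEX statement can be created the same way.
conn.close()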
Use the Python program below to store documentation as embeddings, i.e., vectorize the text data and store it in a vector column in the database for later lookup.
SimpleDirectoryReader() can be used to read the data; it recognizes many formats, including text, PDF, Word, JPEG, MP3, and more.
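If your support content mixes formats, you can point the reader at a directory and restrict which files it picks up. A small sketch, assuming llama_index's documented required_exts and recursive parameters (check the version you have installed if this differs):

from llama_index.core import SimpleDirectoryReader

# Sketch: read only Markdown and PDF files, recursing into subdirectories.
documents = SimpleDirectoryReader(
    input_dir="./data",
    required_exts=[".md", ".pdf"],
    recursive=True,
).load_data()
print(f"Loaded {len(documents)} documents")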
Here are the contents of the Python code file:
insert.py
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.schema import Document
import openai
import psycopg2

connection_string = "postgresql://yugabyte:password@127.0.0.1:5433/yugabyte"

# Try to connect and give feedback
try:
    conn = psycopg2.connect(connection_string)
    cursor = conn.cursor()
    print("✅ Successfully connected to the database.\n")
except Exception as e:
    print("❌ Failed to connect to the database.")
    print("Error:", e)
    exit(1)

# Load documents and create index
print("📄 Loading documents...")
documents = SimpleDirectoryReader("./data").load_data()
print(f"📦 Loaded {len(documents)} documents.\n")

print("🔍 Vectorizing documents...")
index = VectorStoreIndex.from_documents(documents)
print("✅ Vectorization complete.\n")

# Insert documents with clean feedback
for doc_id, doc in index.docstore.docs.items():
    embedding = index._embed_model.get_text_embedding(doc.text)
    embedding_str = "[" + ",".join(map(str, embedding)) + "]"

    insert_sql = """
        INSERT INTO vectors (id, article_text, embedding)
        VALUES (%s, %s, %s)
        ON CONFLICT (id) DO NOTHING;
    """

    try:
        cursor.execute(insert_sql, (doc_id, doc.text, embedding_str))
        conn.commit()
        text_snippet = doc.text[:40].replace("\n", " ").strip()
        print(f"📥 {len(doc.text):4d} chars | \"{text_snippet}\" | { [round(v, 4) for v in embedding[:5]] }")
    except Exception as e:
        print(f"❌ Failed to insert row: {e}")

print("\n🎉 Done inserting all data.")
cursor.close()
conn.close()
Running the program produces the following output:
$ python insert.py
✅ Successfully connected to the database.

📄 Loading documents...
📦 Loaded 1 documents.

🔍 Vectorizing documents...
✅ Vectorization complete.

📥 4170 chars | "What I Worked On February 2021 Before" | [0.0041,
📥 4325 chars | "All that seemed left for philosophy were" | [0.0197,
📥 4193 chars | "Its brokenness did, as so often happens," | [0.0065,
📥 4339 chars | "If he even knew about the strange classe" | [-0.0068,
📥 4291 chars | "The students and faculty in the painting" | [-0.0073,
📥 4329 chars | "I wanted to go back to RISD, but I was n" | [0.0019,
📥 4261 chars | "But alas it was more like the Accademia" | [0.0065,
📥 4293 chars | "After I moved to New York I became her d" | [-0.0001,
📥 4319 chars | "Now we felt like we were really onto som" | [-0.0179,
📥 4258 chars | "In its time, the editor was one of the b" | [-0.0091,
📥 4181 chars | "A company with just a handful of employe" | [0.0008,
📥 4244 chars | "I stuck it out for a few more months, th" | [0.0073,
📥 4292 chars | "But about halfway through the summer I r" | [0.0034,
📥 4456 chars | "One of the most conspicuous patterns I'v" | [-0.0037,
📥 4454 chars | "Horrified at the prospect of having my i" | [0.0007,
📥 4235 chars | "We'd use the building I owned in Cambrid" | [0.0128,
📥 4128 chars | "It was originally meant to be a news agg" | [0.0031,
📥 4161 chars | "It had already eaten Arc, and was in the" | [0.0125,
📥 4381 chars | "Then in March 2015 I started working on" | [-0.0092,
📥 4352 chars | "I remember taking the boys to the coast" | [0.0182,
📥 4472 chars | "But when the software is an online store" | [-0.0007,
📥 1805 chars | "[17] Another problem with HN was a bizar" | [0.005,

🎉 Done inserting all data.
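The roughly 4,000-character chunks above come from llama_index's default text splitter. If you want smaller or larger chunks, one option (a sketch, assuming llama_index's global Settings object) is to set the chunk size before building the index:

from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

# Assumption: llama_index's global Settings drives the default text splitter.
# Set these before calling VectorStoreIndex.from_documents() in insert.py.
Settings.chunk_size = 512     # target tokens per chunk
Settings.chunk_overlap = 50   # tokens shared between neighboring chunks

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)  # chunks now follow the settings above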
The following program executes these steps:
- Your question is converted to a vector
- The vector is used to retrieve relevant chunks from the database
- Those chunks are passed to GPT-4
- GPT-4 answers your question using that context
Here is the Python code for question.py:
import psycopg2
import openai
import os
from llama_index.embeddings.openai import OpenAIEmbedding

# --- Setup ---
openai.api_key = os.getenv("OPENAI_API_KEY")
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
client = openai.OpenAI()

connection_string = "postgresql://yugabyte:password@127.0.0.1:5433/yugabyte"

def ask_question(question, top_k=7):
    # Connect to DB
    conn = psycopg2.connect(connection_string)
    cursor = conn.cursor()

    # Get embedding of question
    query_embedding = embed_model.get_query_embedding(question)
    embedding_str = "[" + ",".join(map(str, query_embedding)) + "]"

    # Vector search
    search_sql = """
        SELECT id, article_text, embedding <=> %s AS distance
        FROM vectors
        ORDER BY embedding <=> %s
        LIMIT %s;
    """
    cursor.execute(search_sql, (embedding_str, embedding_str, top_k))
    results = cursor.fetchall()
    cursor.close()
    conn.close()

    if not results:
        return "No matching documents found."

    # Print the first 40 characters of each retrieved chunk
    print("\n🔍 Retrieved context snippets:")
    for _, text, distance in results:
        print(f"- {text[:40]!r} (distance: {distance:.4f})")

    # Build context
    context = "\n\n".join([f"{text}" for (_, text, _) in results])

    # Ask GPT
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that answers questions using the provided context."},
            {"role": "user", "content": f"Context:\n{context}"},
            {"role": "user", "content": f"Question: {question}"}
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content.strip()

# --- Interactive Loop ---
if __name__ == "__main__":
    try:
        print("Ask me a question (press Ctrl+C to quit):\n")
        while True:
            question = input("❓ Your question: ").strip()
            if not question:
                continue
            answer = ask_question(question)
            print("\n💡 Answer:\n" + answer + "\n")
    except KeyboardInterrupt:
        print("\nGoodbye! 👋")
% python question.py
Ask me a question (press Ctrl+C to quit):

❓ Your question: tell me about paul graham

🔍 Retrieved context snippets:
- 'Over the next several years Paul Graham' (distance: 0.1471)
- 'The article is about Paul Graham\n\nWhat ' (distance: 0.1513)
- 'Paul Graham certainly did. So at the end' (distance: 0.1523)
- 'They either lived long ago or were myste' (distance: 0.1530)
- 'But the most important thing Paul Graha' (distance: 0.1583)
- 'You can do something similar on a map of' (distance: 0.1621)
- 'When Paul Graham was dealing with some ' (distance: 0.1628)

💡 Answer:
Paul Graham is a writer, programmer, and entrepreneur. He has written numerous essays on various topics, some of which were reprinted as a book titled "Hackers & Painters". He has also worked on spam filters and has a passion for painting. He was known for hosting dinners for a group of friends every Thursday night, teaching him how to cook for groups. Before college, Graham mainly focused on writing and programming. He wrote short stories and tried programming on the IBM 1401. He later got a microcomputer and started programming more seriously, writing simple games and a word processor. In college, he initially planned to study philosophy but switched to AI. Graham also worked on a new dialect of Lisp, called Arc, and gave a talk at a Lisp conference about how they'd used Lisp at Viaweb. This talk gained significant attention online, leading him to realize the potential of online essays. In 2003, Graham met Jessica Livingston at a party. She was in charge of marketing at a Boston investment bank and later compiled a book of interviews with startup founders. In 2005, Graham and Livingston, along with Robert and Trevor, decided to start their own investment firm, which became Y Combinator. Graham also worked on several different projects, including the development of the programming language Arc and the creation of the online platform Hacker News. In 2012, he decided to hand over Y Combinator to Sam Altman and retire.
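Since ask_question() takes top_k as a parameter, you can also import it from question.py and experiment with how much context gets retrieved. A quick usage sketch:

# Quick usage sketch: reuse ask_question() from question.py with a smaller top_k
# for a tighter (and cheaper) prompt.
from question import ask_question

answer = ask_question("How did Y Combinator get started?", top_k=3)
print(answer)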
Conclusion
RAG architectures hold the key to moving beyond static, pre-trained models by anchoring responses in live, curated data.
YugabyteDB’s distributed nature and PostgreSQL compatibility make it uniquely suited to serve as a scalable backbone for the retrieval component of modern AI systems.
By integrating embeddings, vector search, and structured metadata in one place, developers can bridge the gap between generative fluency and factual grounding.
Want to try it for yourself? Spin up your own YugabyteDB database at http://cloud.yugabyte.com and try out vector embeddings.