RAG is the acronym for Retrieval-Augmented Generation.
It is an AI technique that combines two approaches:
Retrieval: the system fetches relevant information from a predefined source, such as a document store or a database.
Generation: a language model uses both the retrieved information and its own internal knowledge to generate the answer.
In simple terms, instead of relying only on what it was trained on, the system looks things up to build a context and then generates the answer.
This makes RAG especially useful for:
Question answering
Document search + Summarization
Chatbots with private or specialized knowledge
RAG is implemented in a phased approach:
Ingestion
Text pre-processing
Tokenization
Embeddings
Storing
Retrieval
Generation
Post-processing
Safety and Governance
As an example, we will build a RAG system that delivers legal and regulatory information to the staff of a municipality.
In the ingestion phase, documents are ingested into the system: laws and regulations in the form of PDF, Word, or plain-text files.
Tools such as the Tesseract OCR library would be used to extract text from the documents.
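A minimal sketch of this extraction step, assuming Python with the pypdf, pdf2image, and pytesseract libraries and a hypothetical file name:

from pypdf import PdfReader

def extract_text(pdf_path):
    # Return the text of each page of a digital (non-scanned) PDF.
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]

# For scanned documents, OCR with Tesseract would be used instead, e.g.:
# from pdf2image import convert_from_path
# import pytesseract
# images = convert_from_path(pdf_path)
# pages = [pytesseract.image_to_string(img, lang="sqi") for img in images]  # assumes the Albanian language pack

pages = extract_text("law_example.pdf")  # hypothetical file name
print(len(pages), "pages extracted")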
In the pre-processing phase, the text is cleaned of unnecessary information and metadata is recorded, such as document type, date, issuing institution, etc.
The text is divided into small parts called chunks.
Chunking can be done by a number of tokens (words or characters), for example 400-1000 tokens per chunk, by text headings, or by regular expressions, for example a pattern like "Article %".
This phase outputs a list of text chunks.
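A rough sketch of the chunking step in Python; the heading pattern and the chunk size are illustrative choices, and words are used as a stand-in for tokens:

import re

def chunk_text(text, max_words=400):
    # First split on article headings such as "Article 5".
    parts = re.split(r"\n(?=Article\s+\d+)", text)
    chunks = []
    for part in parts:
        words = part.split()
        # Fall back to fixed-size windows if an article is too long.
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks

chunks = chunk_text(open("law_example.txt", encoding="utf-8").read())  # hypothetical file
print(len(chunks), "chunks")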
Each part of the text (a chunk of 400-1000 tokens, a law article, or a paragraph) is divided into tokens.
For example "Bashkia e Tiranës" → tokens [Bash, kia, e, Tiran, ës].
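A small sketch of this step, assuming the Hugging Face transformers library and the BGE-M3 tokenizer; the exact token pieces may differ from the example above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
print(tokenizer.tokenize("Bashkia e Tiranës"))
# Prints subword pieces; the exact split depends on the model's vocabulary.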
For each text chunk an embedding vector is calculated.
Embeddings are a way to turn text (words, sentences, documents) into numbers so a computer can compare meanings.
Instead of comparing words by spelling, we compare them by semantic similarity (meaning).
Examples
“leje ndërtimi” (building permit) and “autorizim ndërtimi” (construction authorization) should be close together in vector space.
“leje ndërtimi” and “raport financiar” (financial report) should be far apart.
How embeddings are generated.
For each chunk a vector of numbers is calculated.
How is it done?
Embedding models, for example BGE-M3, are trained on huge amounts of text in one or more languages.
The training results in a vector in a d-dimensional space; for the BGE-M3 model, for example, which supports many languages including Albanian, this is a 1024-dimensional space.
This means that the model has an embedding, i.e. a vector of 1024 numbers, for each token of a given language, including Albanian.
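A minimal sketch, assuming the sentence-transformers library can load the BGE-M3 model and using cosine similarity for the comparison:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")
emb = model.encode(["leje ndërtimi", "autorizim ndërtimi", "raport financiar"])
print(emb.shape)                     # (3, 1024): one 1024-dimensional vector per text
print(util.cos_sim(emb[0], emb[1]))  # expected: high similarity (related meanings)
print(util.cos_sim(emb[0], emb[2]))  # expected: clearly lower similarity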
To understand this, suppose there is a 3-dimensional space. Training of the embedding model has produced a space where each token has its own coordinates. For simplicity we will use words as tokens.
Conceptual illustration of a trained embedding model in a 3-dimensional space
The picture shows, conceptually, a 3-dimensional space where each word of the Albanian language has a predefined position determined by its coordinates, i.e. its embedding vector, here a vector with 3 elements. Once the concept is understood, we should not forget that the real vector has 1024 numbers, but a space with 1024 dimensions cannot be visualized.
As can be seen in the picture, each word (standing in for a token) has its own defined location in the space. The illustration shows that words (tokens) that are semantically close to each other appear near each other in the embedding space. Words like "mesues, nxenes, shkolle" (teacher, pupil, school) are near each other in the space, while away from them, but near each other, are words like "plehra, pastrim, riciklim" (waste, cleaning, recycling).
This is how you should conceptually think of a trained embedding model. Of course this is only an illustration of the concept; the real space has 1024 dimensions and tokens are smaller than whole words.
Conceptual illustration of the embedding vectors calculated for the parts (chunks) of the text of a document uploaded into the RAG
The embedding model calculates an embedding vector for each text chunk.
It does this using its prior knowledge, more concretely the coordinates of each word (token) in the text chunk.
It uses a pooling (aggregation) function, for example the average. (This is a simplification of course; BGE-M3 uses CLS pooling.)
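A toy numpy sketch of the averaging idea (not the actual BGE-M3 pooling, which uses the CLS token); the per-token vectors are hypothetical and kept 3-dimensional for readability:

import numpy as np

# Hypothetical per-token embedding vectors for a 4-token chunk.
token_vectors = np.array([
    [0.1, 0.8, 0.3],
    [0.2, 0.7, 0.4],
    [0.0, 0.9, 0.2],
    [0.1, 0.6, 0.5],
])
chunk_vector = token_vectors.mean(axis=0)  # average pooling over the tokens
print(chunk_vector)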
The figure illustrates how the articles of a law are positioned in the space.
In this figure our documents are conceptually loaded into the RAG space. In reality the embedding model is used only to calculate the embeddings of the text. Afterwards, metadata about the text, such as document name, page, etc., together with the embedding vector, is stored in a database, for example in Postgres with the extension for storing vectors.
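A sketch of this storing step, assuming Python with psycopg2 and a Postgres instance with the pgvector extension; the connection settings and values are placeholders, and the table follows the query shown further below:

import psycopg2

conn = psycopg2.connect("dbname=rag user=rag password=rag host=localhost")  # hypothetical credentials
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
cur.execute("""
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id bigserial PRIMARY KEY,
        doc_name text,
        page_number int,
        text text,
        embedding vector(1024)
    )
""")
vec = [0.0] * 1024  # placeholder for the embedding calculated for one chunk
cur.execute(
    "INSERT INTO doc_chunks (doc_name, page_number, text, embedding) VALUES (%s, %s, %s, %s::vector)",
    ("law_example.pdf", 1, "Article 1 ...", "[" + ",".join(str(x) for x in vec) + "]"),
)
conn.commit()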
Conceptual illustration of how querying text in RAG is done
When querying the RAG, the first step is to ask the embedding model to calculate the embedding vector for the question itself. In other words, we ask the model to calculate the position of the question in the space.
The model does this the same way it did with the text chunks stored in the database.
After calculating the position of the question in the space, it is a matter of a database query. For example, in Postgres it would be something similar to the query below:
SELECT doc_name, page_number, text, embedding <=> %question_vector AS distance
FROM doc_chunks
ORDER BY distance
LIMIT 3;
That is, search the doc_chunks table for the top 3 rows with the smallest distance between the stored vectors and the question vector, meaning the paragraphs that are nearest in the space to the position of the question.
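Putting the query side together in Python (same hypothetical connection and model as in the earlier sketches; the question vector is passed as a parameter in place of the %question_vector placeholder):

import psycopg2
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3")
q_vec = model.encode("Cilat dokumente duhen për leje ndërtimi?")  # "Which documents are needed for a building permit?"

conn = psycopg2.connect("dbname=rag user=rag password=rag host=localhost")  # hypothetical credentials
cur = conn.cursor()
cur.execute(
    """
    SELECT doc_name, page_number, text, embedding <=> %s::vector AS distance
    FROM doc_chunks
    ORDER BY distance
    LIMIT 3
    """,
    ("[" + ",".join(str(x) for x in q_vec) + "]",),
)
for doc_name, page_number, text, distance in cur.fetchall():
    print(doc_name, page_number, round(distance, 3))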
The illustration reinforces the fact that the semantics lives in the embedding model. These are pretrained models that have learned by reading huge amounts of text in several languages, and as a result of the training they have calculated a weight vector for each token of the language. A more technically correct way to say this is that the embedding model has calculated the embeddings for each token. The relationship between tokens (think of words for simplicity) is computed as a distance between the respective vectors.
When we load text into the RAG, the semantics of the text is calculated as an aggregation of the semantics of the tokens (words) it contains.
When we search for texts that are semantically close, we search for vectors of numbers that have a smaller distance between them.
Since semantics is represented by the embedding model, RAG performs very well across different languages. Words that are semantically close in Albanian would also be close to their English counterparts.
This feature enables querying text stored in one language with a question asked in another language.
The figure illustrates the fact that semantically similar words even in different languages would be near each other in the 3-dimensional space.
As can be seen in the picture, which illustrates words from the domain of education, the words student (English), nxenes (Albanian), and studente (Italian) are near each other, and words like shkolle (Albanian), school (English), and scuola (Italian) are near each other in the space.
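A small sketch of this cross-lingual behaviour, again assuming the BGE-M3 model loaded through sentence-transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-m3")
emb = model.encode(["shkolle", "school", "raport financiar"])
print(util.cos_sim(emb[0], emb[1]))  # Albanian "shkolle" vs English "school": expected high
print(util.cos_sim(emb[0], emb[2]))  # unrelated pair: expected clearly lower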
Once the embedding vector is generated for each chunk of text, it is stored in a database. There are several databases that support vectors. We will use Postgres with the pgvector extension.
pgvector is a Postgres extension for storing and retrieving vectors.
How it stores vectors
It introduces a vector data type.
A vector is stored as a fixed-length array of 32-bit floats.
You define the length in the column declaration; for example, vector(1024) creates a vector with 1024 elements of type 32-bit float.
Indexes for fast retrieval
By default, vectors are queried without an index (exact, brute-force search).
For faster similarity search, pgvector supports approximate nearest neighbor (ANN) indexes; a creation sketch follows this list:
IVFFlat: Inverted File index (clusters vectors into lists, then searches a subset of them)
HNSW: Hierarchical Navigable Small World graph (graph-based ANN index with higher recall and speed)
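For illustration, index creation could look like this, executed through the same hypothetical psycopg2 connection; the index name and the lists parameter are example values, not tuned settings:

import psycopg2

conn = psycopg2.connect("dbname=rag user=rag password=rag host=localhost")  # hypothetical credentials
cur = conn.cursor()
# HNSW index on the embedding column, using cosine distance (matches the <=> operator).
cur.execute("CREATE INDEX IF NOT EXISTS doc_chunks_embedding_hnsw ON doc_chunks USING hnsw (embedding vector_cosine_ops)")
# Alternative: an IVFFlat index; 'lists' controls how many clusters the vectors are split into.
# cur.execute("CREATE INDEX ON doc_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)")
conn.commit()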
Vector Functions and Operators
pgvector provides common similarity functions:
Distance Functions
Distance operators can be used in the ORDER BY or WHERE clause.
vector <-> vector: Euclidean (L2) distance
vector <#> vector: negative inner (dot) product (more negative means more similar)
vector <=> vector: cosine distance
Aggregate functions and operators
avg(vector): element-wise mean of vectors