GCP – An advanced LlamaIndex RAG implementation on Google Cloud
Introduction
Retrieval Augmented Generation (RAG) is revolutionizing how we build Large Language Model (LLM)-powered applications, but unlike tabular machine learning, where XGBoost reigns supreme, there's no single "go-to" solution for RAG. Developers need efficient ways to experiment with different retrieval techniques and evaluate their performance. This post provides a practical guide to rapidly prototyping and evaluating RAG solutions using LlamaIndex, Streamlit, RAGAS, and Google Cloud's Gemini models. We'll move beyond simple tutorials and explore how to build reusable components, extend existing frameworks, and test performance reliably.
Explore the interactive chat experience provided by our full-stack application
Dive into the comprehensive batch evaluation process
RAG design and LlamaIndex
LlamaIndex is a powerful framework for building RAG applications. It simplifies the process of connecting to data sources, structuring information, and querying with LLMs. Here's how LlamaIndex breaks down the RAG workflow:
Indexing and storage – how do we chunk, embed, organize and structure the documents we want to query.
Retrieval – how do we retrieve relevant document chunks for a given user query. In LlamaIndex, chunks of documents retrieved from an index are called nodes.
Node (chunk) post-processing – given a set of relevant nodes, further process them to make them more relevant (e.g. re-ranking them).
Response synthesis – given a final set of relevant nodes, curate a response for the user.
LlamaIndex offers a wide variety of techniques and integrations to complete these steps, from simple keyword search all the way to agentic approaches. The list of techniques can be quite overwhelming at first, so it's better to think of each step in terms of the trade-offs you're making and the core questions you're trying to address:
Indexing and storage: What is the structure/nature of documents we want to query?
Retrieval: Are the right documents being retrieved?
Node (chunk) post-processing: Are the raw retrieved documents in the right order and format for the LLM to curate a response?
Response synthesis: Are responses relevant to the query and faithful to the documents provided?
For each of these questions in the RAG design lifecycle, let's walk through a sampling of proven techniques.
Indexing and storage
Indexing and storage consists of its own labyrinth of complex steps. You are faced with multiple choices for algorithms; techniques for parsing, chunking, and embedding; metadata extraction considerations; and the need to create separate indices for heterogeneous data sources. As complex as it may seem, in the end, indexing and storage is all about taking some group of documents, pre-processing them in such a way that a retrieval system can grab relevant chunks of those documents, and storing those pre-processed documents somewhere.
To help avoid much of the headache of choosing what path to take, Google Cloud provides the Document AI Layout Parser, which can process various file types including HTML, PDF, DOCX, and PPTX (in preview), identifying a wide range of content elements such as text blocks, paragraphs, tables, lists, titles, headings, and page headers and footers out of the box. By conducting a comprehensive layout analysis, Layout Parser maintains the document's organizational hierarchy, which is crucial for context-aware information retrieval. See the full code for the DocAI Layout Parser implementation here.
parser = DocAIParser(
    project_id=PROJECT_ID,
    location=DOCAI_LOCATION,
    processor_name=f"projects/{PROJECT_ID}/locations/{DOCAI_LOCATION}/processors/{DOCAI_PROCESSOR_ID}",
    gcs_output_path=GCS_OUTPUT_PATH,
)

parsed_docs, raw_results = parser.batch_parse(blobs, chunk_size=CHUNK_SIZE, include_ancestor_headings=True)
li_docs = [Document(text=doc.text, metadata=doc.metadata) for doc in parsed_docs]
Once documents are chunked, we must then create LlamaIndex nodes from them. LlamaIndex nodes include metadata fields that keep track of the structure of their parent documents. For instance, a long document split into consecutive chunks can be represented in LlamaIndex as a doubly-linked list of nodes, with PREVIOUS and NEXT relationships set to the previous and next node IDs.
def link_nodes(node_list):
    """Given a list of nodes, tie them together into a doubly linked list."""
    for i, current_node in enumerate(node_list):
        if i > 0:  # Not the first node
            previous_node = node_list[i - 1]
            current_node.relationships[NodeRelationship.PREVIOUS] = RelatedNodeInfo(node_id=previous_node.node_id)

        if i < len(node_list) - 1:  # Not the last node
            next_node = node_list[i + 1]
            current_node.relationships[NodeRelationship.NEXT] = RelatedNodeInfo(node_id=next_node.node_id)
    return node_list


node_chunk_list = []
for doc in li_docs:
    doc_dict = doc.to_dict()
    metadata = doc_dict.pop("metadata")
    doc_dict.update(metadata)
    # split_to_chunks is a helper from the repo that splits a parsed document into chunk dicts
    chunks = split_to_chunks(doc_dict, target_heading_level=0, target_chunk_size=512, max_chunk_size=750)

    # Create nodes with relationships and flatten
    nodes = []
    for chunk in chunks:
        text = chunk.pop("text")
        doc_source_id = doc.doc_id
        node = TextNode(text=text, metadata=chunk)
        node.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(node_id=doc_source_id)
        nodes.append(node)

    nodes = link_nodes(nodes)
    node_chunk_list.extend(nodes)

nodes = node_chunk_list
Once we have LlamaIndex nodes, we can employ techniques to pre-process them before embedding to enable more advanced retrieval techniques (like the auto-merging retrieval below). The Hierarchical Node Parser takes a list of nodes from a document and creates a hierarchy in which smaller chunks link to larger chunks further up. We might have leaf chunks of 512 characters linking to parent chunks of 1,024 characters, and so on, where each level up the hierarchy represents a larger and larger section of a given document. When we store this hierarchy, we embed only the leaf chunks and store the rest in a document store where we can query them by ID. At retrieval time, we perform vector similarity search only on leaf chunks and use the hierarchy relationships to obtain larger sections of the document for additional context. This logic is performed by the LlamaIndex Auto-merging Retriever.
from llama_index.core.node_parser import HierarchicalNodeParser
from llama_index.core.node_parser import get_leaf_nodes, get_root_nodes

node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=chunk_sizes)
nodes = node_parser.get_nodes_from_documents(node_chunk_list)

leaf_nodes = get_leaf_nodes(nodes)
root_nodes = get_root_nodes(nodes)
We can then embed the nodes and choose how and where to store them for downstream retrieval. A vector database is an obvious choice, but we may need to store documents in another way to facilitate other search methods that combine with semantic retrieval, for instance, hybrid search. For this example, we illustrate how to create a hybrid store where document chunks are stored both as embedded vectors and as a key-value store, in Google Cloud's Vertex AI Vector Search and Firestore, respectively. This is useful when we need to query documents by either vector similarity or an ID/metadata match.
aiplatform.init(project=PROJECT_ID, location=LOCATION)

# Creating Vector Search index
vs_index, vs_endpoint = get_or_create_existing_index(VECTOR_INDEX_NAME,
                                                     INDEX_ENDPOINT_NAME,
                                                     APPROXIMATE_NEIGHBORS_COUNT)

# Vertex AI Vector Search vector DB and Firestore docstore
vector_store = VertexAIVectorStore(
    project_id=PROJECT_ID,
    region=LOCATION,
    index_id=vs_index.name,  # Use .name instead of .resource_name
    endpoint_id=vs_endpoint.name,  # Use .name instead of .resource_name
    gcs_bucket_name=DOCSTORE_BUCKET_NAME,
)

docstore = FirestoreDocumentStore.from_database(project=PROJECT_ID,
                                                database=FIRESTORE_DB_NAME,
                                                namespace=FIRESTORE_NAMESPACE)

# Set up embedding model and LLM
embed_model = VertexTextEmbedding(model_name=EMBEDDINGS_MODEL_NAME,
                                  project=PROJECT_ID,
                                  location=LOCATION)
llm = Vertex(model="gemini-1.5-flash", temperature=0.0)
Settings.llm = llm
Settings.embed_model = embed_model

docstore.add_documents(li_docs)
storage_context = StorageContext.from_defaults(docstore=docstore, vector_store=vector_store)
# Creating an index automatically embeds the nodes and populates the vector DB collection
index = VectorStoreIndex(
    nodes=leaf_nodes,
    storage_context=storage_context,
    embed_model=embed_model,
    llm=llm
)
We should create multiple indices to explore the differences between combinations of approaches. For instance, we can create a flat, non-hierarchical index of fixed-sized chunks in addition to the hierarchical one.
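For example, a flat index of fixed-size chunks can be built alongside the hierarchical one. Here is a minimal sketch, assuming a second vector store (flat_vector_store) has been created the same way as above:

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Flat, fixed-size chunking over the same parsed documents (no hierarchy)
flat_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
flat_nodes = flat_splitter.get_nodes_from_documents(li_docs)

# flat_vector_store is assumed to point at a second Vertex AI Vector Search index
flat_storage_context = StorageContext.from_defaults(vector_store=flat_vector_store)
flat_index = VectorStoreIndex(
    nodes=flat_nodes,
    storage_context=flat_storage_context,
    embed_model=embed_model,
)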
Retrieval
Retrieval is the task of obtaining a small set of relevant documents from our vector store/docstore combination, which an LLM can use as context to curate a relevant response. The Retriever module in LlamaIndex provides a nice abstraction of this task. Subclasses of this module implement the _retrieve method, which takes as an argument a query and returns a list of NodesWithScore — basically a list of document chunks with a score indicating their relevance to the question. LlamaIndex has many popular implementations of retrievers. It is always good to try a baseline retriever that simply does vector similarity search to retrieve a specified top k number of NodesWithScore.
from llama_index.core import StorageContext, VectorStoreIndex

# Instantiating the index at retrieval time:
aiplatform.init(project=self.project_id, location=self.location)

# Get the Vector Search index
indexes = aiplatform.MatchingEngineIndex.list(
    filter=f'display_name="{index_name}"'
)
if not indexes:
    raise ValueError(f"No index found with display name: {index_name}")
vs_index = indexes[0]

# Get the Vector Search endpoint
endpoints = aiplatform.MatchingEngineIndexEndpoint.list(
    filter=f'display_name="{endpoint_name}"'
)
if not endpoints:
    raise ValueError(f"No endpoint found with display name: {endpoint_name}")
vs_endpoint = endpoints[0]

# Create the vector store
vector_store = VertexAIVectorStore(
    project_id=self.project_id,
    region=self.location,
    index_id=vs_index.resource_name.split("/")[-1],
    endpoint_id=vs_endpoint.resource_name.split("/")[-1],
    gcs_bucket_name=self.vs_bucket_name
)

if firestore_db_name and firestore_namespace:
    docstore = FirestoreDocumentStore.from_database(project=self.project_id,
                                                    database=firestore_db_name,
                                                    namespace=firestore_namespace)
else:
    docstore = None

# Create storage context
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
    docstore=docstore
)

# Create the index and a baseline retriever from it
vector_store_index = VectorStoreIndex(nodes=[],
                                      storage_context=storage_context,
                                      embed_model=self.embed_model)
baseline_retriever = vector_store_index.as_retriever()
nodes_with_scores = baseline_retriever.retrieve(query)
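To make the Retriever contract concrete, here is a minimal custom subclass (an illustrative sketch, not part of the repo) that delegates to the baseline retriever and keeps only nodes above a score threshold:

from typing import List

from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle


class ThresholdRetriever(BaseRetriever):
    """Illustrative retriever: delegate to a base retriever, then drop low-score nodes."""

    def __init__(self, base_retriever, score_threshold: float = 0.5):
        self._base_retriever = base_retriever
        self._score_threshold = score_threshold
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        nodes = self._base_retriever.retrieve(query_bundle)
        return [n for n in nodes if n.score is None or n.score >= self._score_threshold]


retriever = ThresholdRetriever(baseline_retriever, score_threshold=0.6)
nodes_with_scores = retriever.retrieve("What were the main revenue drivers last quarter?")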
Auto-merging retrieval
The above baseline_retriever does not incorporate the structure of the hierarchical index we created earlier. An auto-merging retriever retrieves nodes not just by vector similarity, but also by the source document from which they came, using the hierarchy of chunks we maintain in the document store. This allows us to retrieve additional content that may encapsulate the initial set of node chunks. For instance, a baseline_retriever may retrieve five node chunks based on vector similarity. Those chunks may be quite small (e.g., 512 characters) and, if our query is complex, may not contain everything needed to answer it properly. Of the five chunks returned, three may come from the same document, each referencing a different paragraph of a single section. Because we stored the hierarchy of these chunks and their relation to larger chunks, and because together they make up that larger section, the auto-merging retriever can "walk" the hierarchy, retrieving the larger chunks and returning a bigger section of the document for the LLM to compose a response from. This balances the trade-off between the retrieval accuracy that comes with smaller chunk sizes and supplying the LLM with as much relevant data as possible.
retriever = AutoMergingRetriever(baseline_retriever,
                                 storage_context,  # contains reference to docstore
                                 simple_ratio_thresh=0.5,  # If greater than 50% of returned nodes belong to same document, perform auto-merging
                                 verbose=True)
LlamaIndex Query Engine
Now that we have a set of NodesWithScores, we need to assess whether they are in the optimal order. You may also want to do additional post-processing, like removing PII or reformatting. Finally, we need to pass these chunks to an LLM, which will provide an answer catered to the user's original intent. Orchestration of retrieval with node post-processing and response synthesis happens through the LlamaIndex QueryEngine. You create a QueryEngine by defining a retriever, a node post-processing method (if any), and a response synthesizer, and passing them in as arguments. QueryEngine exposes the query and aquery (the asynchronous equivalent of query) methods, which take a string query as input and return a Response object that includes not only the LLM-generated answer but also the list of NodeWithScores (the chunks passed to the LLM as context).
from llama_index.core import PromptTemplate, get_response_synthesizer
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import LLMRerank

# Loading of index happens above...
storage_context = index.storage_context
base_retriever = index.as_retriever(similarity_top_k=5)

synth = get_response_synthesizer(text_qa_template=qa_prompt,
                                 refine_template=refine_prompt,
                                 response_mode="compact",
                                 use_async=False)

retriever = AutoMergingRetriever(base_retriever,
                                 storage_context,
                                 verbose=True)

ranker_prompt = PromptTemplate(choice_select_prompt_tmpl)
llm_reranker = LLMRerank(choice_batch_size=10,  # Re-rank the top 10 chunks
                         top_n=5,  # Only return the top 5 after re-ranking
                         choice_select_prompt=ranker_prompt,
                         llm=reranker_llm)

query_engine = RetrieverQueryEngine.from_args(retriever,
                                              response_synthesizer=synth,
                                              node_postprocessors=[llm_reranker])
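Querying the engine then returns a Response object carrying both the generated answer and the supporting nodes (the query string here is illustrative):

response = query_engine.query("What were the main drivers of revenue growth?")

print(response.response)  # the LLM-generated answer
for node_with_score in response.source_nodes:  # the chunks passed to the LLM as context
    print(node_with_score.score, node_with_score.node.get_content()[:80])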
Hypothetical document embedding
Most LlamaIndex retrievers perform retrieval by embedding the user's query and computing the vector similarity between the query's embedding and those in the vector store. This can be suboptimal, however, because the linguistic structure of a question may differ significantly from that of its answer. Hypothetical document embedding (HyDE) is a technique that addresses this by using LLM hallucination as a strength: first hallucinate a response to the user's query without any provided context, then embed the hallucinated answer and use it as the basis for vector similarity search in the vector store.
Expansion with generated answers — Image by the author (inspired by [Gao, 2022])
HyDE is easy to integrate with LlamaIndex:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

hyde = HyDEQueryTransform(include_original=True)  # Include original query when doing similarity search
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
LLM node re-ranking
A Node Post-Processor in LlamaIndex implements a _postprocess_nodes method, which takes as input the query and the list of NodesWithScores and returns a new list of NodesWithScores. The initial set of nodes obtained from the retriever may not be ranked optimally, so it can be beneficial to perform re-ranking, where we re-sort the nodes by relevance as determined by an LLM. There are models fine-tuned specifically for re-ranking chunks against a given query, or we can use a generic LLM to do the re-ranking for us. We can use a prompt like the one below to ask an LLM to rank the nodes returned by a retriever:
choice_select_prompt_tmpl = (
    "A list of documents is shown below. Each document has a number next to it along "
    "with a summary of the document. A question is also provided. \n"
    "Respond with the numbers of the documents "
    "you should consult to answer the question, in order of relevance, as well \n"
    "as the relevance score. The relevance score is a number from 1-10 based on "
    "how relevant you think the document is to the question.\n"
    "Do not include any documents that are not relevant to the question. \n"
    "Example format: \n"
    "Document 1:\n<summary of document 1>\n\n"
    "Document 2:\n<summary of document 2>\n\n"
    "...\n\n"
    "Document 10:\n<summary of document 10>\n\n"
    "Question: <question>\n"
    "Answer:\n"
    "Doc: 9, Relevance: 7\n"
    "Doc: 3, Relevance: 4\n"
    "Doc: 7, Relevance: 3\n\n"
    "Let's do this now and it is extremely important that you follow the EXACT format above where 1 line of output is: \n"
    "Doc: <doc_num>, Relevance: <score>\n"
    "Do not include any extra formatting whatsoever\n"
    "Go!\n\n"
    "{context_str}\n"
    "Question: {query_str}\n"
    "Answer:\n"
)
For an example of a custom LLM re-ranker class, see the accompanying code repo.
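To illustrate the node post-processor contract itself, here is a minimal sketch (assuming current llama_index.core import paths; this is not the custom re-ranker from the repo):

from typing import List, Optional

from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.schema import NodeWithScore, QueryBundle


class ScoreSortPostprocessor(BaseNodePostprocessor):
    """Illustrative post-processor: re-sort nodes by score and keep the top_n."""

    top_n: int = 5

    def _postprocess_nodes(
        self,
        nodes: List[NodeWithScore],
        query_bundle: Optional[QueryBundle] = None,
    ) -> List[NodeWithScore]:
        ranked = sorted(nodes, key=lambda n: n.score or 0.0, reverse=True)
        return ranked[: self.top_n]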
Response synthesis
There are many ways to instruct an LLM to create a response given a list of NodeWithScores. If the nodes are especially large, we might want to condense the nodes via summarization before asking the LLM to give a final response. Or given an initial response, we might want to give the LLM another chance to refine it or correct any errors that may be present. The ResponseSynthesizer in LlamaIndex lets us determine how the LLM will formulate a response given a list of nodes.
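The response_mode argument is the main knob here. A short sketch of switching modes follows, reusing the qa_prompt and refine_prompt templates defined with the query engine above:

from llama_index.core import get_response_synthesizer

# "refine" passes the draft answer back through the LLM once per additional chunk,
# giving it a chance to correct or extend it; "tree_summarize" recursively
# summarizes large sets of nodes before producing the final answer.
refine_synth = get_response_synthesizer(response_mode="refine",
                                        text_qa_template=qa_prompt,
                                        refine_template=refine_prompt)
tree_synth = get_response_synthesizer(response_mode="tree_summarize")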
ReAct agent
Reasoning and acting, or ReAct (Yao et al., 2022), introduces a reasoning loop on top of the query pipeline we have created. This allows an LLM to perform chain-of-thought reasoning to address complex queries, or queries that may require multiple retrieval steps to arrive at a correct answer. To implement a ReAct loop in LlamaIndex, we expose the query_engine created above as a tool that the ReAct agent can use as part of its reasoning-and-acting procedure. You can add multiple tools here, letting the ReAct agent choose among them or consolidate results across many.
query_engine_tools = [
    QueryEngineTool(
        query_engine=query_engine,
        metadata=ToolMetadata(
            name="google_financials",
            description=(
                "Provides information about Google financials. "
                "Use a detailed plain text question as input to the tool."
            ),
        ),
    )
]

llm = Vertex(model=llm_name, max_tokens=3000, temperature=temperature)
Settings.llm = llm
agent = ReActAgent.from_tools(query_engine_tools,
                              llm=llm,
                              verbose=True,
                              context=system_prompt)
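Chatting with the agent then drives the reasoning loop, calling the tool as needed (the question is illustrative):

response = agent.chat("How did Google Cloud revenue change year over year?")
print(response.response)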
Creating the final QueryEngine
Once you’ve decided on a few approaches across the steps outlined above, you will need to create logic to instantiate your QueryEngine based on an input configuration. You can find an example function here.
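A sketch of what such a factory might look like is below; the flags mirror the options discussed above, but the exact names and defaults are illustrative rather than the repo's implementation:

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_engine import RetrieverQueryEngine, TransformQueryEngine
from llama_index.core.retrievers import AutoMergingRetriever


def get_query_engine(index, llm, similarity_top_k=5, retrieval_strategy="baseline",
                     use_hyde=False, use_node_rerank=False):
    """Illustrative factory: assemble a query engine from a handful of config flags."""
    retriever = index.as_retriever(similarity_top_k=similarity_top_k)
    if retrieval_strategy == "auto_merging":
        retriever = AutoMergingRetriever(retriever, index.storage_context, verbose=True)

    node_postprocessors = []
    if use_node_rerank:
        node_postprocessors.append(LLMRerank(choice_batch_size=10, top_n=5, llm=llm))

    query_engine = RetrieverQueryEngine.from_args(retriever,
                                                  llm=llm,
                                                  node_postprocessors=node_postprocessors)
    if use_hyde:
        query_engine = TransformQueryEngine(query_engine,
                                            HyDEQueryTransform(include_original=True))
    return query_engine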
Evaluation metrics and techniques
Once we have a QueryEngine object, we have a simple way of passing queries and obtaining answers and associated context from the RAG pipeline. We can then implement the QueryEngine object as part of a backend service such as FastAPI, along with a simple front end, which allows us to experiment with it in different ways (e.g., conversational vs. batch).
When chatting with the RAG pipeline, three pieces of information can be used to evaluate the response: the query, the retrieved context, and of course, the response. We can use these three fields to calculate evaluation metrics and help us compare responses more quantitatively. RAGAS is a framework which provides some out-of-the-box, heuristic metrics that can be computed given this triple, namely answer faithfulness, answer relevancy, and context relevancy. We compute these on the fly with each chat interaction and display them for the user.
Ideally, in parallel, we would attempt to obtain ground-truth answers as well through expert annotation. With ground truth, we can tell a lot more about how the RAG pipeline is performing. We can calculate LLM-graded accuracy, where we ask an LLM whether the answer is consistent with the ground truth, or calculate a variety of other metrics from RAGAS such as context precision and recall. Below is a summary of the metrics we can calculate as part of our evaluation:
RAGAS Metric Name       Requires Ground Truth?
Answer relevancy        No
Faithfulness            No
Context relevancy       No
Context precision       Yes
Context recall          Yes
Answer correctness      Yes
Answer similarity       Yes
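With an annotated eval set, the ground-truth metrics can be computed in batch. A minimal sketch with RAGAS follows; the column names follow RAGAS conventions, and the Vertex AI wrappers and model names are illustrative:

from datasets import Dataset
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness, context_precision, context_recall

eval_ds = Dataset.from_dict({
    "question": ["<eval question>"],
    "answer": ["<answer produced by the RAG pipeline>"],
    "contexts": [["<retrieved chunk 1>", "<retrieved chunk 2>"]],
    "ground_truth": ["<expert-annotated answer>"],
})

# LLM and embeddings used by RAGAS to grade the responses
vertexai_llm = ChatVertexAI(model_name="gemini-1.5-flash")
vertexai_embeddings = VertexAIEmbeddings(model_name="text-embedding-004")

result = evaluate(eval_ds,
                  metrics=[answer_relevancy, faithfulness, context_precision, context_recall],
                  llm=vertexai_llm,
                  embeddings=vertexai_embeddings)
print(result.to_pandas())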
Deployment
The FastAPI backend implements two routes: /query_rag and /eval_batch. /query_rag is used for one-shot chats with the query engine, with the option to evaluate the answer on the fly. /eval_batch allows users to choose an eval_set from a Cloud Storage bucket and run batch evaluation on that dataset using the given query engine parameters.
app = FastAPI()

@app.post("/query_rag")
async def query_rag(rag_request: RAGRequest):
    # get_query_engine encapsulates boilerplate llamaindex for creating a query engine.
    query_engine = get_query_engine(index=index,
                                    llm_name=rag_request.llm_name,
                                    temperature=rag_request.temperature,
                                    similarity_top_k=rag_request.similarity_top_k,
                                    retrieval_strategy=rag_request.retrieval_strategy,
                                    use_hyde=rag_request.use_hyde,
                                    use_refine=rag_request.use_refine,
                                    use_node_rerank=rag_request.use_node_rerank)
    response = await query_engine.aquery(rag_request.query)

    if rag_request.evaluate_response:
        # Evaluate response with RAGAS against the selected metrics
        retrieved_contexts = [r.node.text for r in response.source_nodes]
        eval_df = pd.DataFrame({"question": [rag_request.query],
                                "answer": [response.response],
                                "contexts": [retrieved_contexts]})
        eval_df_ds = Dataset.from_pandas(eval_df)

        # Create LLM and embeddings for RAGAS
        vertexai_llm = ChatVertexAI(credentials=creds, model_name=rag_request.eval_model_name)
        vertexai_embeddings = VertexAIEmbeddings(credentials=creds, model_name=rag_request.embedding_model_name)

        # No ground truth, so we can only compute answer_relevancy, faithfulness and context_relevancy
        metrics = [answer_relevancy, faithfulness, context_relevancy]
        result = evaluate(
            eval_df_ds,
            metrics=metrics,
            llm=vertexai_llm,
            embeddings=vertexai_embeddings
        )
        result_dict = result.to_pandas()[["answer_relevancy", "faithfulness", "context_relevancy"]].fillna(0).iloc[0].to_dict()
        retrieved_context_dict = {"retrieved_chunks": response.source_nodes}
        logger.info(result_dict)
        return {"response": response.response} | result_dict | retrieved_context_dict

    return {"response": response.response}
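The /eval_batch route follows the same pattern. Here is a hedged sketch; the request model (EvalBatchRequest), eval-set format, bucket layout, and the reuse of the RAGAS wrappers from /query_rag are assumptions, not the repo's exact implementation:

@app.post("/eval_batch")
async def eval_batch(eval_request: EvalBatchRequest):
    # Load the chosen eval set (question + ground_truth columns) from Cloud Storage
    eval_df = pd.read_csv(f"gs://{EVAL_BUCKET_NAME}/{eval_request.eval_set_name}.csv")

    query_engine = get_query_engine(index=index,
                                    llm_name=eval_request.llm_name,
                                    retrieval_strategy=eval_request.retrieval_strategy,
                                    use_hyde=eval_request.use_hyde)

    # Run every question through the pipeline, collecting answers and contexts
    answers, contexts = [], []
    for question in eval_df["question"]:
        response = await query_engine.aquery(question)
        answers.append(response.response)
        contexts.append([n.node.text for n in response.source_nodes])
    eval_df["answer"], eval_df["contexts"] = answers, contexts

    # Ground truth is available, so the ground-truth RAGAS metrics apply
    metrics = [answer_relevancy, faithfulness, context_precision, context_recall]
    result = evaluate(Dataset.from_pandas(eval_df), metrics=metrics,
                      llm=vertexai_llm, embeddings=vertexai_embeddings)
    return result.to_pandas().mean(numeric_only=True).to_dict()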
Streamlit's Chat elements make it easy to spin up a UI that lets us interact with the QueryEngine object via the FastAPI backend, with sliders and input forms matching the configuration options we set forth earlier.
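A minimal sketch of such a Streamlit front end (the backend URL and the payload, a subset of the RAGRequest fields above, are assumptions):

import requests
import streamlit as st

st.title("RAG playground")

# Sidebar controls mirroring the query-engine configuration
top_k = st.sidebar.slider("similarity_top_k", 1, 20, 5)
use_hyde = st.sidebar.checkbox("Use HyDE", value=False)

if "messages" not in st.session_state:
    st.session_state.messages = []
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

if prompt := st.chat_input("Ask a question about the documents"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    payload = {"query": prompt, "similarity_top_k": top_k,
               "use_hyde": use_hyde, "evaluate_response": True}
    answer = requests.post("http://localhost:8000/query_rag", json=payload).json()

    with st.chat_message("assistant"):
        st.markdown(answer["response"])
    st.session_state.messages.append({"role": "assistant", "content": answer["response"]})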
Click here for the full code repo.
Conclusion
In summary, building an advanced RAG application on GCP with modular tools such as LlamaIndex, RAGAS, FastAPI, and Streamlit gives you maximum flexibility as you explore different techniques and tweak various aspects of the RAG pipeline. With any luck, you may end up finding that magical combination of parameters, prompts, and algorithms that becomes the "XGBoost" equivalent for your RAG problem.
Additional resources
https://cloud.google.com/python/docs/reference/documentai/latest
https://docs.llamaindex.ai/en/stable/
https://cloud.google.com/vertex-ai/generative-ai/docs/llamaindex-on-vertexai
https://docs.streamlit.io/develop/tutorials/llms/build-conversational-apps