One of the most useful application patterns for generative AI workloads is Retrieval Augmented Generation (RAG). In the RAG pattern, we find pieces of reference content related to an input prompt by performing similarity searches on embeddings. Embeddings capture the information content in bodies of text, allowing natural language processing (NLP) models to work with language in a numeric form. Embeddings are just vectors of floating point numbers, so we can analyze them to help answer three important questions: Is our reference data changing over time? Are the questions users are asking changing over time? And finally, how well is our reference data covering the questions being asked?
In this post, you’ll learn about some of the considerations for embedding vector analysis and detecting signals of embedding drift. Because embeddings are an important source of data for NLP models in general and generative AI solutions in particular, we need a way to measure whether our embeddings are changing over time (drifting). You’ll see an example of performing drift detection on embedding vectors using a clustering technique with large language models (LLMs) deployed from Amazon SageMaker JumpStart. You can also explore these concepts through two provided examples: an end-to-end sample application or, optionally, a subset of that application.
Overview of RAG
The RAG pattern lets you retrieve knowledge from external sources, such as PDF documents, wiki articles, or call transcripts, and then use that knowledge to augment the instruction prompt sent to the LLM. This allows the LLM to reference more relevant information when generating a response. For example, if you ask an LLM how to make chocolate chip cookies, it can include information from your own recipe library. In this pattern, the recipe text is converted into embedding vectors using an embedding model, and stored in a vector database. Incoming questions are converted to embeddings, and then the vector database runs a similarity search to find related content. The question and the reference data then go into the prompt for the LLM.
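The retrieval step described above can be sketched in a few lines. This is a minimal illustration, not the sample application's code: the three-dimensional vectors and the passages are made-up toy data standing in for real embeddings and a vector database, and the helper names `cosine_top_k` and `build_prompt` are hypothetical.

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=2):
    """Return indices and scores of the k rows of doc_vecs most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    order = np.argsort(-scores)[:k]     # highest similarity first
    return order, scores[order]

def build_prompt(question, passages):
    """Combine the retrieved passages and the question into one LLM prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Use the context to answer.\nContext:\n{context}\nQuestion: {question}"

# Toy data: real vectors would come from the embedding model (assumption).
passages = [
    "Cream butter and sugar, then fold in chocolate chips.",
    "Check tire pressure before a long drive.",
    "Bake the cookie dough until the edges are golden.",
]
doc_vecs = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.1, 0.9],
                     [0.8, 0.3, 0.1]])
question = "How do I make chocolate chip cookies?"
query_vec = np.array([1.0, 0.2, 0.0])

top_idx, top_scores = cosine_top_k(query_vec, doc_vecs, k=2)
prompt = build_prompt(question, [passages[i] for i in top_idx])
print(prompt)
```

In a real deployment, the vector database performs the similarity search, so only the prompt assembly would live in application code.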
Let’s take a closer look at the embedding vectors that get created and how to perform drift analysis on those vectors.
Analysis on embedding vectors
Embedding vectors are numeric representations of our data, so analyzing these vectors can provide insight into our reference data that can later be used to detect potential signals of drift. Embedding vectors represent an item in n-dimensional space, where n is often large. For example, the GPT-J 6B model, used in this post, creates vectors of size 4096. To measure drift, assume that our application captures embedding vectors for both reference data and incoming prompts.
We start by reducing dimensionality with Principal Component Analysis (PCA). PCA projects the data onto fewer dimensions while preserving most of its variance. In this case, we find the smallest number of principal components that preserves 95% of the variance. The 95% threshold is loosely analogous to the share of a normal distribution that falls within two standard deviations of the mean.
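As a rough illustration of the 95% variance criterion, the following NumPy-only sketch counts how many principal components are needed. The function name `n_components_for_variance` and the synthetic data are assumptions made for this example. In practice, scikit-learn's `PCA(n_components=0.95)` selects the component count the same way.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.95):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches the threshold."""
    Xc = X - X.mean(axis=0)                      # PCA operates on centered data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    explained = s**2 / np.sum(s**2)              # variance ratio per component
    cumulative = np.cumsum(explained)
    n = int(np.searchsorted(cumulative, threshold) + 1)
    return min(n, len(s))                        # guard against rounding at the tail

# Synthetic stand-in for embeddings: 50 observed dimensions driven by
# 5 latent factors plus a little noise (made-up data for illustration).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

n_95 = n_components_for_variance(X, 0.95)
print("components needed for 95% of the variance:", n_95)
```

Tracking this count across snapshots is what makes it a drift signal: a rising count suggests the embeddings are spreading across more directions.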
Then we use K-Means to identify a set of cluster centers. K-Means tries to group points together into clusters such that each cluster is relatively compact and the clusters are as distant from each other as possible.
We calculate the following information based on the clustering output shown in the following figure:
The number of dimensions in PCA that explain 95% of the variance
The location of each cluster center, or centroid
Additionally, we look at the proportion (higher or lower) of samples in each cluster, as shown in the following figure.
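The clustering outputs listed above (centroids and per-cluster sample proportions) can be sketched with a small NumPy implementation of Lloyd's algorithm. This is an illustrative sketch, not the sample application's code: the function names are hypothetical, the blobs are synthetic, and the farthest-point initialization is a simplification chosen to keep the example deterministic. A library implementation such as scikit-learn's `KMeans` would normally be used instead.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def farthest_point_init(X, k):
    """Deterministic init: start at X[0], then repeatedly take the point
    farthest from all centers chosen so far (a simplification)."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(X[np.argmax(d.min(axis=1))])
    return np.array(centers)

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    centers = farthest_point_init(X, k)
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(X, centers)

# Three well-separated synthetic blobs with 50, 30, and 20 points.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(50, 2)),
    rng.normal([10, 10], 0.5, size=(30, 2)),
    rng.normal([-10, 10], 0.5, size=(20, 2)),
])

centers, labels = kmeans(X, k=3)
proportions = np.bincount(labels, minlength=3) / len(labels)
print("centroids:\n", centers)
print("proportion of samples per cluster:", proportions)
```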
Finally, we use this analysis to calculate the following:
Inertia – Inertia is the sum of squared distances to cluster centroids, which measures how well the data was clustered using K-Means.
Silhouette score – The silhouette score is a measure for the validation of the consistency within clusters, and ranges from -1 to 1. A value close to 1 means that the points in a cluster are close to the other points in the same cluster and far from the points of the other clusters. A visual representation of the silhouette score can be seen in the following figure.
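To make the two metrics concrete, here is a minimal NumPy sketch that computes both from points, labels, and centroids. The function names and the four-point dataset are made up for illustration. scikit-learn exposes the same quantities as `KMeans.inertia_` and `sklearn.metrics.silhouette_score`.

```python
import numpy as np

def inertia(X, labels, centers):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(np.sum((X - centers[labels]) ** 2))

def mean_silhouette(X, labels):
    """Mean over points of (b - a) / max(a, b), where a is the mean distance
    to other points in the same cluster and b is the mean distance to the
    nearest other cluster. Singleton clusters score 0."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue
        a = dists[i, same].sum() / (n_same - 1)   # excludes the point itself
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Two tight pairs of points, far apart (toy data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_labels = np.array([0, 0, 1, 1])
bad_labels = np.array([0, 1, 0, 1])     # deliberately mixes the two pairs
centers = np.array([X[good_labels == j].mean(axis=0) for j in range(2)])

print("inertia:", inertia(X, good_labels, centers))
print("silhouette (good clustering):", mean_silhouette(X, good_labels))
print("silhouette (bad clustering):", mean_silhouette(X, bad_labels))
```

The pairwise distance matrix makes this sketch quadratic in memory, so it only suits small samples. Silhouette scores on large embedding sets are usually computed on a subsample.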
We can periodically capture this information for snapshots of the embeddings for both the source reference data and the prompts. Capturing this data allows us to analyze potential signals of embedding drift.
Detecting embedding drift
Periodically, we can compare the clustering information through snapshots of the data, which includes the reference data embeddings and the prompt embeddings. First, we can compare the number of dimensions needed to explain 95% of the variation in the embedding data, the inertia, and the silhouette score from the clustering job. As you can see in the following table, compared to a baseline, the latest snapshot of embeddings requires 39 more dimensions to explain the variance, indicating that our data is more dispersed. The inertia has gone up, indicating that the samples are in aggregate farther away from their cluster centers. Additionally, the silhouette score has gone down, indicating that the clusters are not as well defined. For prompt data, that might indicate that the types of questions coming into the system are covering more topics.
Next, in the following figure, we can see how the proportion of samples in each cluster has changed over time. This can show us whether our newer reference data is broadly similar to the previous set, or covers new areas.
Finally, we can see if the cluster centers are moving, which would show drift in the information in the clusters, as shown in the following table.
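The three comparisons described in this section can be sketched as a single function over two snapshot summaries. Everything here is an assumption made for illustration: the dictionary layout, the function name `compare_snapshots`, and the toy numbers are not taken from the sample application. The sketch also assumes both snapshots' centroids are expressed in the same coordinate space, and it matches clusters by nearest centroid because K-Means labels are arbitrary between runs.

```python
import numpy as np

def compare_snapshots(baseline, latest):
    """Compare two clustering summaries and report candidate drift signals."""
    b_centers = np.asarray(baseline["centers"], dtype=float)
    l_centers = np.asarray(latest["centers"], dtype=float)
    # Distance from every baseline centroid to every latest centroid.
    dists = np.linalg.norm(b_centers[:, None, :] - l_centers[None, :, :], axis=2)
    match = dists.argmin(axis=1)   # nearest latest cluster per baseline cluster
    b_props = np.asarray(baseline["proportions"], dtype=float)
    l_props = np.asarray(latest["proportions"], dtype=float)
    return {
        "pca_components_delta": latest["n_components"] - baseline["n_components"],
        "inertia_delta": latest["inertia"] - baseline["inertia"],
        "silhouette_delta": latest["silhouette"] - baseline["silhouette"],
        "matched_latest_cluster": match.tolist(),
        "centroid_shift": dists[np.arange(len(b_centers)), match].tolist(),
        "proportion_delta": (l_props[match] - b_props).tolist(),
    }

# Made-up toy summaries; a real run would load these from stored snapshots.
baseline = {"n_components": 12, "inertia": 40.0, "silhouette": 0.60,
            "proportions": [0.6, 0.4], "centers": [[0.0, 0.0], [5.0, 5.0]]}
latest = {"n_components": 15, "inertia": 55.0, "silhouette": 0.45,
          "proportions": [0.3, 0.7], "centers": [[5.5, 5.0], [0.0, 1.0]]}

report = compare_snapshots(baseline, latest)
for name, value in report.items():
    print(name, value)
```

Nearest-centroid matching can map two baseline clusters to the same latest cluster. If a one-to-one pairing is required, an assignment solver such as `scipy.optimize.linear_sum_assignment` is one option.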
Reference data coverage for incoming questions
We can also evaluate how well our reference data aligns to the incoming questions. To do this, we assign each prompt embedding to a reference data cluster. We compute the distance from each prompt to its corresponding center, and look at the mean, median, and standard deviation of those distances. We can store that information and see how it changes over time.
The following figure shows an example of analyzing the distance between the prompt embedding and reference data centers over time.
As you can see, the mean, median, and standard deviation of the distances between prompt embeddings and reference data centers are decreasing between the initial baseline and the latest snapshot. Although the absolute value of the distance is difficult to interpret, we can use the trends to determine whether the semantic overlap between reference data and incoming questions is getting better or worse over time.
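A minimal sketch of this coverage measurement follows. It assumes each prompt is assigned to the nearest reference centroid by Euclidean distance, and that prompts and centroids live in the same vector space. The function name `coverage_stats` and the toy vectors are hypothetical.

```python
import numpy as np

def coverage_stats(prompt_vecs, ref_centers):
    """Assign each prompt to its nearest reference centroid and summarize
    the resulting distances."""
    dists = np.linalg.norm(
        prompt_vecs[:, None, :] - ref_centers[None, :, :], axis=2
    )
    nearest = dists.argmin(axis=1)      # cluster assignment per prompt
    d = dists.min(axis=1)               # distance to that cluster's center
    stats = {
        "mean": float(d.mean()),
        "median": float(np.median(d)),
        "std": float(d.std()),
    }
    return stats, nearest

# Toy data: two reference centroids and three prompt embeddings.
ref_centers = np.array([[0.0, 0.0], [10.0, 0.0]])
prompt_vecs = np.array([[1.0, 0.0], [0.0, 2.0], [10.0, 3.0]])

stats, nearest = coverage_stats(prompt_vecs, ref_centers)
print(stats, nearest)
```

Storing one such `stats` record per snapshot gives the time series whose trend indicates whether coverage is improving or degrading.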
Sample application
In order to gather the experimental results discussed in the previous section, we built a sample application that implements the RAG pattern using embedding and generation models deployed through SageMaker JumpStart and hosted on Amazon SageMaker real-time endpoints.
The application has three core components:
We use an interactive flow, which includes a user interface for capturing prompts, combined with a RAG orchestration layer, using LangChain.
The data processing flow extracts data from PDF documents and creates embeddings that get stored in Amazon OpenSearch Service. We also use these in the final embedding drift analysis component of the application.
The embeddings are captured in Amazon Simple Storage Service (Amazon S3) via Amazon Kinesis Data Firehose, and we run a combination of AWS Glue extract, transform, and load (ETL) jobs and Jupyter notebooks to perform the embedding analysis.
The following diagram illustrates the end-to-end architecture.
The full sample code is available on GitHub. The provided code is available in two different patterns:
Sample full-stack application with a Streamlit frontend – This provides an end-to-end application, including a user interface using Streamlit for capturing prompts, combined with the RAG orchestration layer, using LangChain running on Amazon Elastic Container Service (Amazon ECS) with AWS Fargate
Backend application – If you don’t want to deploy the full application stack, you can deploy only the backend AWS Cloud Development Kit (AWS CDK) stack, and then use the provided Jupyter notebook to perform RAG orchestration using LangChain
To create the provided patterns, there are several prerequisites detailed in the following sections, starting with deploying the generative and text embedding models, then moving on to the additional prerequisites.
Deploy models through SageMaker JumpStart
Both patterns assume the deployment of an embedding model and generative model. For this, you’ll deploy two models from SageMaker JumpStart. The first model, GPT-J 6B, is used as the embedding model and the second model, Falcon-40b, is used for text generation.
You can deploy each of these models through SageMaker JumpStart from the AWS Management Console, Amazon SageMaker Studio, or programmatically. For more information, refer to How to use JumpStart foundation models. To simplify the deployment, you can use the provided notebook derived from notebooks automatically created by SageMaker JumpStart. This notebook pulls the models from the SageMaker JumpStart ML hub and deploys them to two separate SageMaker real-time endpoints.
The sample notebook also has a cleanup section. Don’t run that section yet, because it will delete the endpoints just deployed. You will complete the cleanup at the end of the walkthrough.
After confirming successful deployment of the endpoints, you’re ready to deploy the full sample application. However, if you’re more interested in exploring only the backend and analysis notebooks, you can deploy just the backend, which is covered in the next section.
Option 1: Deploy the backend application only
This pattern allows you to deploy the backend solution only and interact with the solution using a Jupyter notebook. Use this pattern if you don’t want to build out the full frontend interface.
Prerequisites
You should have the following prerequisites:
A SageMaker JumpStart model endpoint deployed – Deploy the models to SageMaker real-time endpoints using SageMaker JumpStart, as previously outlined
Deployment parameters – Record the following:
Text model endpoint name – The endpoint name of the text generation model deployed with SageMaker JumpStart
Embeddings model endpoint name – The endpoint name of the embedding model deployed with SageMaker JumpStart
Deploy the resources using the AWS CDK
Use the deployment parameters noted in the previous section to deploy the AWS CDK stack. For more information about AWS CDK installation, refer to Getting started with the AWS CDK.
Make sure that Docker is installed and running on the workstation that will be used for AWS CDK deployment. Refer to Get Docker for additional guidance.
$ cd pattern1-rag/cdk
$ cdk deploy BackendStack --exclusively \
    -c textModelEndpointName= \
    -c embeddingsModelEndpointName=
Alternatively, you can enter the context values in a file called cdk.context.json in the pattern1-rag/cdk directory and run cdk deploy BackendStack --exclusively.
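As one way to prepare that file, the following sketch writes the two context keys into cdk.context.json, merging with any values already present so that cached context is not lost. The endpoint names are hypothetical placeholders and must be replaced with the names recorded earlier. The helper `write_cdk_context` is an illustration, not part of the sample repository.

```python
import json
import tempfile
from pathlib import Path

def write_cdk_context(directory, values):
    """Merge the given keys into <directory>/cdk.context.json and return its path."""
    path = Path(directory) / "cdk.context.json"
    existing = json.loads(path.read_text()) if path.exists() else {}
    existing.update(values)
    path.write_text(json.dumps(existing, indent=2) + "\n")
    return path

# Hypothetical placeholder endpoint names; substitute your own.
context = {
    "textModelEndpointName": "my-text-generation-endpoint",
    "embeddingsModelEndpointName": "my-embeddings-endpoint",
}

# Demonstrated in a throwaway directory; in practice the target would be
# the pattern1-rag/cdk directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_cdk_context(tmp, context)
    loaded = json.loads(written.read_text())
    print(written.name, loaded)
```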
The deployment prints several outputs, some of which are needed to run the notebook. Before you can start question answering, embed the reference documents, as shown in the next section.
Embed reference documents
For this RAG approach, reference documents are first embedded with a text embedding model and stored in a vector database. In this solution, an ingestion pipeline has been built that intakes PDF documents.
An Amazon Elastic Compute Cloud (Amazon EC2) instance has been created for the PDF document ingestion and an Amazon Elastic File System (Amazon EFS) file system is mounted on the EC2 instance to save the PDF documents. An AWS DataSync task is run every hour to fetch PDF documents found in the EFS file system path and upload them to an S3 bucket to start the text embedding process. This process embeds the reference documents and saves the embeddings in OpenSearch Service. It also saves an embedding archive to an S3 bucket through Kinesis Data Firehose for later analysis.
To ingest the reference documents, complete the following steps:
Retrieve the sample EC2 instance ID that was created (see the AWS CDK output JumpHostId) and connect using Session Manager, a capability of AWS Systems Manager. For instructions, refer to Connect to your Linux instance with AWS Systems Manager Session Manager.
Go to the directory /mnt/efs/fs1, which is where the EFS file system is mounted, and create a folder called ingest:
$ cd /mnt/efs/fs1
$ mkdir ingest && cd ingest
Add your reference PDF documents to the ingest directory.
The DataSync task is configured to upload all files found in this directory to Amazon S3 to start the embedding process.
The DataSync task runs on an hourly schedule; you can optionally start the task manually to start the embedding process immediately for the PDF documents you added.
To start the task, locate the task ID from the AWS CDK output DataSyncTaskID and start the task with defaults.
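If you prefer to start the task programmatically, the following sketch uses the boto3 DataSync client's `start_task_execution` call, which takes a task ARN. The helper names, Region, account ID, and task ID are hypothetical placeholders. The sketch assumes the AWS CDK output holds a bare task ID that must be expanded into an ARN. If the output is already a full ARN, pass it directly.

```python
def build_task_arn(region, account_id, task_id):
    """Build a DataSync task ARN from its parts."""
    return f"arn:aws:datasync:{region}:{account_id}:task/{task_id}"

def start_datasync_task(task_arn):
    """Start one execution of the task with its default options and
    return the execution ARN. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so build_task_arn works without boto3

    client = boto3.client("datasync")
    response = client.start_task_execution(TaskArn=task_arn)
    return response["TaskExecutionArn"]

# Hypothetical placeholder values; use your Region, account, and the
# DataSyncTaskID value from the AWS CDK output.
task_arn = build_task_arn("us-east-1", "111122223333", "task-0123456789abcdef0")
print(task_arn)
# start_datasync_task(task_arn)  # uncomment to start the task
```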
After the embeddings are created, you can start RAG question answering through a Jupyter notebook, as shown in the next section.
Question answering using a Jupyter notebook
Complete the following steps:
Retrieve the SageMaker notebook instance name from the AWS CDK output NotebookInstanceName and connect to JupyterLab from the SageMaker console.
Go to the directory fmops/full-stack/pattern1-rag/notebooks/.
Open and run the notebook query-llm.ipynb in the notebook instance to perform question answering using RAG.
Make sure to use the conda_python3 kernel for the notebook.
This pattern is useful to explore the backend solution without needing to provision additional prerequisites that are required for the full-stack application. The next section covers the implementation of a full-stack application, including both the frontend and backend components, to provide a user interface for interacting with your generative AI application.
Option 2: Deploy the full-stack sample application with a Streamlit frontend
This pattern allows you to deploy the solution with a user frontend interface for question answering.
Prerequisites
To deploy the sample application, you must have the following prerequisites:
SageMaker JumpStart model endpoint deployed – Deploy the models to your SageMaker real-time endpoints using SageMaker JumpStart, as outlined in the previous section, using the provided notebooks.
Amazon Route 53 hosted zone – Create an Amazon Route 53 public hosted zone to use for this solution. You can also use an existing Route 53 public hosted zone, such as example.com.
AWS Certificate Manager certificate – Provision an AWS Certificate Manager (ACM) TLS certificate for the Route 53 hosted zone domain name and its applicable subdomains, such as example.com and *.example.com for all subdomains. For instructions, refer to Requesting a public certificate. This certificate is used to configure HTTPS on Amazon CloudFront and the origin load balancer.
Deployment parameters – Record the following:
Frontend application custom domain name – A custom domain name used to access the frontend sample application. The domain name provided is used to create a Route 53 DNS record pointing to the frontend CloudFront distribution; for example, app.example.com.