How to Create Infinite Memory LLMs?

Despite recent improvements in context length (Claude 2.1 has a 200K-token context window), LLMs have finite memory: the context available for a single prompt is limited.

Imagine you're a lawyer and you want to write a document containing numerous legal requirements using generative AI.

The model needs the relevant legal texts in its context so that it can write a document that conforms to legal expectations. However, legal texts often contain tens or even hundreds of thousands of pages (the US Internal Revenue Code alone contains over 4,000 pages). Anthropic's Claude model, despite its context of 200K tokens, can hold only roughly 500 pages at a time.

So we need a way to select only the passages of the legal texts that are needed to write our document.

Vector databases

By using a vector database, we can work around this limitation and give the model only the context relevant to the task at hand.

So how do we select the right content to give as context?

The first step is to split our tens of thousands of pages of legal text into smaller units of a few pages each, which we'll call "fragments" from here on.
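A minimal sketch of this splitting step, assuming fragments are measured in words rather than pages. The function name `split_into_fragments` and the default of 1,000 words are illustrative choices, not something the approach prescribes.

```python
def split_into_fragments(text: str, max_words: int = 1000) -> list[str]:
    """Split text into consecutive fragments of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    # Walk through the word list in steps of max_words; the last fragment
    # may be shorter than the others.
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]


# Stand-in for a long legal text: 2,500 placeholder words.
sample_text = " ".join(f"word{i}" for i in range(2500))
fragments = split_into_fragments(sample_text, max_words=1000)
```

In practice you might prefer to cut on paragraph or section boundaries so that a legal clause is not split in the middle; the word-count rule above is just the simplest option.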

The second step is to convert these fragments into numerical vectors.

To do this, we use an embedding, a technique used in natural language processing to represent words or sentences¹ in a vector space. In simpler terms, it converts text data into numbers that can be understood by a computer.

The resulting embeddings have a useful property: words or passages that are semantically similar tend to lie close to each other in the vector space.

After that, our text fragments become lists of numbers that we can store in a vector database.
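To make the "text becomes a list of numbers" idea concrete, here is a toy stand-in for an embedding model. It is an assumption made purely for illustration: it hashes each word into one of 64 buckets, so it only captures word overlap, not the semantic similarity that a trained embedding model provides. The name `toy_embed` and the dimension are hypothetical.

```python
import hashlib
import math

DIMENSIONS = 64  # illustrative; trained models typically use far more


def toy_embed(text: str, dimensions: int = DIMENSIONS) -> list[float]:
    """Map text to a fixed-size vector by hashing each word into a bucket.

    This is NOT a real embedding model: it only counts words, so two texts
    are "close" when they share vocabulary, not when they share meaning.
    """
    vector = [0.0] * dimensions
    for word in text.lower().split():
        # hashlib gives a hash that is stable across runs, unlike hash().
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dimensions
        vector[bucket] += 1.0
    # Scale to unit length so that vectors of long and short texts compare fairly.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


fragment_vector = toy_embed("The merger requires approval from the competition authority")
```

Whatever model is used, the important point is that every fragment, long or short, ends up as a vector of the same size.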

A vector database is a collection of vectors stored and organized in such a way as to enable efficient querying and searching.

Prompting the model

Now that our several-thousand-page legal text has been indexed and stored in a vector database, we can begin our query.

As lawyers, we have legal expectations, and we formulate them in the prompt we give to the model.

For example: "Write me a contract for the merger of company ABC and company XYZ, taking into account that they will have a total market share of 9%, etc.".

First, we create an embedding of our query, i.e. we convert it into a numerical vector, in the same way as for fragment indexing.

Next, we search the vector database for the $K$ vectors most similar² to our query vector. The fragments behind these top $K$ vectors³ are therefore the most likely to contain the information we need to answer our query.
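The search itself can be sketched as a brute-force scan, assuming the "database" is just a dictionary from fragment identifiers to vectors. The identifiers, the three-dimensional vectors and the function name `top_k` are made up for illustration; real vector databases typically use approximate nearest-neighbor indexes instead of comparing against every stored vector.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Return the k (fragment_id, score) pairs with the highest similarity, best first."""
    scored = ((fragment_id, cosine_similarity(query, vector))
              for fragment_id, vector in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical index with tiny 3-dimensional vectors.
index = {
    "frag-a": [1.0, 0.0, 0.0],
    "frag-b": [0.8, 0.6, 0.0],
    "frag-c": [0.0, 1.0, 0.0],
    "frag-d": [-1.0, 0.0, 0.0],
}
query_vector = [1.0, 0.1, 0.0]
results = top_k(query_vector, index, k=2)
```

Here `results` lists `frag-a` first and `frag-b` second, since their vectors point in nearly the same direction as the query.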

The final step is to combine our initial drafting instructions with the $K$ retrieved fragments that contain the elements needed to draft the contract.

This prompt engineering step requires us to clearly separate the instructions we give to the model from the $K$ fragments we give it as reference material.
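One possible way to keep the two parts apart is to wrap each in its own delimiters. The sketch below assumes XML-style tags; the tag names, the opening sentence and the function name `build_prompt` are illustrative choices rather than a required format.

```python
def build_prompt(instructions: str, fragments: list[str]) -> str:
    """Assemble a prompt that keeps retrieved fragments and instructions apart."""
    # Each fragment gets its own numbered tag so the model can tell them apart.
    context = "\n\n".join(
        f'<fragment index="{i}">\n{fragment}\n</fragment>'
        for i, fragment in enumerate(fragments, start=1)
    )
    return (
        "Use only the legal excerpts in the context section to carry out "
        "the instructions that follow it.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"<instructions>\n{instructions}\n</instructions>"
    )


prompt = build_prompt(
    "Write a contract for the merger of company ABC and company XYZ.",
    ["Placeholder text of the first retrieved fragment.",
     "Placeholder text of the second retrieved fragment."],
)
```

Putting the reference material first and the instructions last is only one ordering; what matters is that the boundary between the two is unambiguous.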

Conclusion

Vector databases are powerful tools for giving LLMs what is effectively infinite memory. Only the most relevant information is inserted into the model's context to answer a given query, so queries can be run over millions of pages of content and the limited context windows of current LLMs can be worked around.

¹: In an embedding, each token is represented by a vector. A vector can be generated for a whole sentence either by using the [CLS] token of a BERT model, or by performing mean pooling over the token vectors of the sentence.
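Mean pooling can be sketched in a few lines, assuming the token vectors have already been produced by a model. The three-dimensional vectors below are made-up values, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors: list[list[float]]) -> list[float]:
    """Average the token vectors component by component."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dimensions = len(token_vectors[0])
    if any(len(vector) != dimensions for vector in token_vectors):
        raise ValueError("all token vectors must have the same length")
    count = len(token_vectors)
    return [
        sum(vector[d] for vector in token_vectors) / count
        for d in range(dimensions)
    ]


# Three made-up token vectors for a three-token sentence.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 1.0],
]
sentence_vector = mean_pool(token_vectors)  # [2.0, 2.0, 1.0]
```

When sentences are processed in padded batches, the padding tokens are usually excluded from the average; the sketch leaves that out.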

²: But how do we compare two vectors? We use a mathematical formula called cosine similarity, which takes values between $-1$ and $1$. The closer the value is to $1$, the more similar the vectors and the greater their semantic similarity. On the other hand, if the score is close to $-1$, it means that the texts associated with these vectors have very different meanings.

$\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$, where $A$ and $B$ are the two vectors being compared.
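As a worked example with made-up values (not taken from any real embedding), take $A = (1, 2, 2)$ and $B = (2, 1, 2)$:

$A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8$, $\|A\| = \sqrt{1 + 4 + 4} = 3$, $\|B\| = \sqrt{4 + 1 + 4} = 3$

$\cos(\theta) = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.89$

A score this close to $1$ would indicate two texts with closely related meanings.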

³: $K$ is a number that can vary according to the size of our text fragments, as well as the size of our model's context. Depending on these parameters, you can retrieve 5, 10 or even 100 vectors from the database.
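A rough way to pick an upper bound for $K$ is to divide the tokens left in the context window by the size of one fragment. The sketch below assumes fragments of about 2,000 tokens and 10,000 tokens reserved for the instructions and the model's answer; both figures and the function name `max_fragments` are illustrative assumptions.

```python
def max_fragments(context_window: int, fragment_tokens: int, reserved_tokens: int) -> int:
    """Largest number of fragments that fit next to the reserved tokens."""
    if fragment_tokens <= 0:
        raise ValueError("fragment_tokens must be positive")
    available = context_window - reserved_tokens
    # Never return a negative count if the reservation exceeds the window.
    return max(0, available // fragment_tokens)


# 200K-token window, ~2,000-token fragments, 10,000 tokens kept in reserve.
k = max_fragments(context_window=200_000, fragment_tokens=2_000, reserved_tokens=10_000)  # 95
```

This only gives a ceiling: retrieving fewer, more relevant fragments is often preferable to filling the whole window.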