TN002 Learning from Human Feedback
Objective
Allow Foyle to learn from human feedback.
TL;DR
If a user corrects a command generated by the AI, we want to be able to use that feedback to improve the AI. The simplest way to do this is using few-shot prompting. This tech note focuses on how we will collect and retrieve the examples to enrich the prompts. At a high level, the process will look like the following:
- Process logs to identify examples where the human corrected the AI generated response
- Turn each mistake into one or more examples of a query and response
- Compute and store the embeddings of the query
- Use brute force to compute distance to all embeddings
This initial design prioritizes simplicity and flexibility over performance. For example, we may want to experiment with using AI to generate alternative versions of a query. For now, we avoid using a vector database to efficiently store and query a large corpus of documents. I think it's premature to optimize for large corpora given that users may not have large corpora.
Generate Documents
The first step of learning from human feedback is to generate examples that can be used to train the AI. We can think of each example as a tuple (Doc, Blocks) where Doc is the document sent to the AI for completion and Blocks are the Blocks the AI should return.
We can obtain these examples from our block logs. A good starting point is to look at where the AI made a mistake; i.e. the human had to edit the command the AI provided. We can find these cases by looking through our BlockLogs for logs where the executed cell differed from the generated block.
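The filtering step could be sketched as follows. This is only an illustration: the BlockLog class, its field names, and the whitespace-insensitive comparison are assumptions made for the example and are not Foyle's actual log schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockLog:
    """Hypothetical, simplified stand-in for a Foyle block log entry."""
    block_id: str
    generated: str  # command the AI proposed
    executed: str   # command the user actually ran ("" if never executed)


def find_corrections(logs: List[BlockLog]) -> List[BlockLog]:
    """Return logs where the user ran something other than what the AI generated."""
    corrections = []
    for log in logs:
        if not log.executed:
            continue  # never executed, so there is no feedback signal
        # Assumption: ignore leading/trailing whitespace when comparing.
        if log.executed.strip() != log.generated.strip():
            corrections.append(log)
    return corrections


logs = [
    BlockLog("a1", "kubectl get pods", "kubectl get pods"),
    BlockLog("b2", "kubectl get pods", "kubectl get pods -n foyle"),
    BlockLog("c3", "ls -la", ""),
]
corrected_ids = [log.block_id for log in find_corrections(logs)]
print(corrected_ids)
```

Only the second log qualifies here: the first was executed unchanged and the third was never executed.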
There are many ways we could store tuples (Doc, Blocks), but the simplest and most obvious way is to append the desired block to Doc.blocks and then serialize the Doc proto as a .foyle file. We can start by assuming that the query corresponds to all but the last block, Doc.blocks[:-1], and the expected answer is the last block, Doc.blocks[-1].
This will break when we want to allow the AI to respond with more than one block. However, to get started this should be good enough. Since the embeddings will be stored in an auxiliary file containing a serialized proto, we could extend that proto to include information about which blocks to use as the answer.
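A minimal sketch of the append-then-split convention is shown below. The Doc and Block dataclasses are simplified, hypothetical stand-ins for the real protos, and make_example and split_example are illustrative helper names that do not come from Foyle.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical stand-in for the Block proto."""
    contents: str


@dataclass
class Doc:
    """Hypothetical stand-in for the Doc proto."""
    blocks: List[Block] = field(default_factory=list)


def make_example(doc: Doc, corrected: Block) -> Doc:
    """Build an example by appending the corrected block to a copy of the doc."""
    return Doc(blocks=list(doc.blocks) + [corrected])


def split_example(example: Doc) -> Tuple[List[Block], Block]:
    """Split an example into (query blocks, expected answer block)."""
    if not example.blocks:
        raise ValueError("an example needs at least one block")
    return example.blocks[:-1], example.blocks[-1]


doc = Doc(blocks=[Block("How do I list pods?"), Block("kubectl get pods")])
example = make_example(doc, Block("kubectl get pods -n foyle"))
query, answer = split_example(example)
print(len(query), answer.contents)
```

Copying the block list keeps the original document untouched, which fits the idea that an example is a snapshot rather than a live view of the user's file.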
Embeddings
In Memory + Brute Force
With text-embedding-3-small, embeddings have dimension 1536 and are float32, so each embedding takes about 6KB (1536 × 4 bytes). If we have 1000 documents, that's about 6MB. This should easily fit in memory for the near future.
Computation-wise, computing dot products against 1000 documents takes about 3.1 million floating point operations (FLOPs): each dot product needs roughly 2 × 1536 operations. This is orders of magnitude less than LLAMA2, which clocks in at 1700 gigaflops. Given that people are running LLAMA2 locally (albeit on GPUs), it seems like we should be able to get pretty far with a brute force approach.
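The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. The constants come from the text; the operation count assumes one multiply per dimension plus the additions needed to sum the products.

```python
DIM = 1536             # dimension of text-embedding-3-small embeddings
BYTES_PER_FLOAT32 = 4
NUM_DOCS = 1000

bytes_per_embedding = DIM * BYTES_PER_FLOAT32   # 6144 bytes, i.e. 6 KiB
total_bytes = bytes_per_embedding * NUM_DOCS    # about 6 MB for 1000 documents

# One dot product: DIM multiplications and DIM - 1 additions.
flops_per_dot = 2 * DIM - 1
total_flops = flops_per_dot * NUM_DOCS          # about 3.1 million operations

print(f"{bytes_per_embedding} bytes per embedding")
print(f"{total_bytes / 1e6:.2f} MB for {NUM_DOCS} documents")
print(f"{total_flops / 1e6:.2f} million floating point operations")
```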
A brute force in memory option would work as follows
- Load all the vector embeddings of documents into memory
- Compute a dot product using matrix multiplication
- Find the K documents with the largest dot products (equivalently, the smallest distances, since the embeddings are normalized)
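The steps above can be sketched in plain Python. This assumes the embeddings are normalized to unit length, so a larger dot product means a closer match; the function names are illustrative, and a real implementation would more likely use a matrix library for the multiplication.

```python
import heapq
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query: Sequence[float],
          embeddings: List[Sequence[float]],
          k: int) -> List[Tuple[float, int]]:
    """Score every document against the query and keep the k best.

    Returns (score, document index) pairs, highest score first.
    """
    scores = [(dot(query, emb), i) for i, emb in enumerate(embeddings)]
    return heapq.nlargest(k, scores)


# Toy 2-dimensional unit vectors standing in for 1536-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
nearest = top_k([1.0, 0.0], embeddings, k=2)
print([index for _, index in nearest])
```

For the toy query, document 0 is an exact match and document 2 is the next closest, so those two indices are returned in that order.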
Serializing the embeddings
To store the embeddings we will use a serialized proto that lives side by side with the file we computed the embeddings for. As proposed in the previous section, we will store the example in the file ${BLOCKLOG}.foyle, and its embeddings will live in the file ${BLOCKLOG}.binpb. This will contain a serialized proto like the following
message Example {
  // Embedding vector for the document (protobuf's 32-bit float type is "float").
  repeated float embedding = 1;
}
We use a proto so that we can potentially enrich the data format over time. For example, we may want to
- Store a hash of the source text so we can determine when to recompute embeddings
- Store additional metadata
- Store multiple embeddings for the same document corresponding to different segmentations
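As one illustration of the first enrichment, a stored hash of the source text could drive the decision to recompute. The ExampleRecord class and its source_hash field are hypothetical; they are not part of the proto shown above.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleRecord:
    """Hypothetical in-memory view of an enriched Example proto."""
    embedding: List[float] = field(default_factory=list)
    source_hash: str = ""  # assumed extra field: SHA-256 of the source text


def content_hash(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded source text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_recompute(source_text: str, record: Optional[ExampleRecord]) -> bool:
    """True if there is no usable embedding or the source text has changed."""
    if record is None or not record.embedding:
        return True
    return record.source_hash != content_hash(source_text)


text = "kubectl get pods -n foyle"
record = ExampleRecord(embedding=[0.1, 0.2], source_hash=content_hash(text))
print(needs_recompute(text, record), needs_recompute(text + " -o yaml", record))
```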
This design makes it easy to add or remove documents from the collection. We can
- Add “.foyle” or “.md” documents
- Use globs to match files
- Check if the embeddings already exist
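A sketch of how glob matching and the existence check might fit together is shown below. It assumes the side-by-side naming from this section (the document's extension replaced with .binpb); the helper names are made up for the example.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence


def embedding_path(doc: Path) -> Path:
    """Assumed convention: the embeddings file sits next to the document."""
    return doc.with_suffix(".binpb")


def docs_missing_embeddings(root: Path, patterns: Sequence[str]) -> List[Path]:
    """Return documents matching any glob pattern that have no embeddings file."""
    missing = []
    for pattern in patterns:
        for doc in sorted(root.glob(pattern)):
            if not embedding_path(doc).exists():
                missing.append(doc)
    return missing


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.foyle").write_text("{}")
    (root / "a.binpb").write_bytes(b"")   # a.foyle already has embeddings
    (root / "b.foyle").write_text("{}")
    (root / "notes.md").write_text("# notes")
    missing = docs_missing_embeddings(root, ["*.foyle", "*.md"])
    missing_names = [p.name for p in missing]

print(missing_names)
```

Removing a document from the collection then amounts to deleting the file and its embeddings file; no index needs to be rebuilt.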
The downside of this approach is likely performance. Opening and deserializing large numbers of files is almost certainly going to be less efficient than using a format like HDF5 that is optimized for matrices.
Learn command
We can add a command to the Foyle CLI to perform all these steps.
foyle learn
This command will operate in a level-based, declarative way. Each time it is invoked it will determine what work needs to be done and then perform it. If no additional work is needed it will be a no-op. This design means we can run it periodically as a background process so that learning happens automatically.
Here's how it will work. It will iterate over the log entries in the block logs to identify logs that need to be processed. We can use a watermark to keep track of processed logs to avoid constantly rereading the entire log history. For each BlockLog that should be turned into an example, we can look for the file {BLOCK_ID}.foyle; if the file doesn't exist then we will create it.
Next, we can check that for each {BLOCK_ID}.foyle file there is a corresponding {BLOCK_ID}.embeddings.binpb file. If it doesn't exist then we will compute the embeddings.
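The level-based behavior can be sketched as a function that is safe to run repeatedly. Everything here is a simplification for illustration: logs are plain dicts with an assumed seq field used as the watermark, fake_embed stands in for a real embeddings call, and JSON stands in for the serialized protos.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def fake_embed(text: str) -> List[float]:
    """Placeholder for a real embeddings call (assumption for this sketch)."""
    return [float(len(text))]


def learn(logs: List[dict], out_dir: Path, watermark: int,
          embed: Callable[[str], List[float]] = fake_embed) -> Tuple[int, int]:
    """Do whatever work is outstanding; return (new watermark, files written)."""
    written = 0
    for log in logs:
        if log["seq"] <= watermark:
            continue  # already processed on an earlier run
        if log["executed"] != log["generated"]:
            example = out_dir / f"{log['block_id']}.foyle"
            if not example.exists():
                example.write_text(json.dumps(log))  # real code would write a Doc proto
                written += 1
        watermark = max(watermark, log["seq"])
    # Second pass: make sure every example has an embeddings file.
    for example in sorted(out_dir.glob("*.foyle")):
        emb = out_dir / f"{example.stem}.embeddings.binpb"
        if not emb.exists():
            emb.write_text(json.dumps(embed(example.read_text())))
            written += 1
    return watermark, written


logs = [
    {"seq": 1, "block_id": "a1", "generated": "ls", "executed": "ls"},
    {"seq": 2, "block_id": "b2", "generated": "ls", "executed": "ls -la"},
]
with tempfile.TemporaryDirectory() as tmp:
    first = learn(logs, Path(tmp), watermark=0)
    second = learn(logs, Path(tmp), watermark=first[0])

print(first, second)
```

The first run writes the example and its embeddings; the second run finds nothing to do, which is what makes it reasonable to schedule the command as a periodic background job.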
Discussion
Why Only Mistakes
In the current proposal we only turn mistakes into examples. In principle, we could use examples where the model got things right as well. The intuition is that if the model is already handling a query correctly, there's no reason to include few-shot examples for it. Arguably, those positive examples might end up confusing the AI if we retrieve them rather than examples corresponding to mistakes.
Why duplicate the documents?
The {BLOCK_ID}.foyle files are likely very similar to the actual .foyle files the user created. An alternative design would be to just reuse the original .foyle documents. This is problematic for several reasons.
The {BLOCK_ID}.foyle files are best considered internal to Foyle's self-learning and shouldn't be directly under the user's control. Under the current proposal the {BLOCK_ID}.foyle files are generated from logs and represent snapshots of the user's documents at specific points in time. If we used the user's actual files we'd have to worry about them changing over time and causing problems. Treating them as internal to Foyle also makes it easier to move them to a different storage backend in the future. It also doesn't require Foyle to have access to the user's storage system.
Use a vector DB
Another option would be to use a vector DB. Using a vector database (e.g. Weaviate, Chroma, Pinecone) adds a lot of complexity. In particular, it creates an additional dependency which could be a barrier for users just interested in trying Foyle out. Furthermore, a vector DB means we need to think about schemas, indexes, updates, etc. Until it's clear we have sufficient data to benefit from a vector DB, it's not worth introducing.
References
- OpenAI Embeddings Are Normalized: reference doc explaining that OpenAI embeddings are normalized to length 1.
- Weaviate Schemas
- Common File Extensions For Protos