The Unreasonable Effectiveness of Human Feedback
Agents! Agents! Agents! Everywhere we look, we are bombarded with the promises of fully autonomous agents. These pesky humans aren’t merely inconveniences; they are budgetary line items to be optimized away. All this hype leaves me wondering: have we forgotten that GPT was fine-tuned using data produced by a small army of human labelers? Not to mention, who do we think produced the 10 trillion words that foundation models are being trained on? While fully autonomous software agents are capturing the limelight on social media, systems that turn user interactions into training data, like Didact, Dosu, and Replit code repair, are already deployed and solving real toil.
Foyle takes a user-centered approach to building an AI that helps developers deploy and operate their software. The key premise of Foyle is to instrument a developer’s workflow so that we can monitor how they turn intent into actions. Foyle then uses that interaction data to constantly improve. A previous post described how Foyle uses this data to learn. This post presents quantitative results showing how that feedback allows Foyle to assist with building and operating Foyle itself. In 79% of cases, learning from prior interactions let Foyle produce a better answer than ChatGPT alone, which lacks sufficient context to achieve the intent. In particular, the results show how Foyle lets users express intent at a higher level of abstraction.
As a thought experiment, we can compare Foyle against an agentic approach that achieves the same accuracy by recursively invoking an LLM on Foyle’s (& RunMe’s) 65K lines of code but without the benefit of learning from user interactions. In this case, we estimate that Foyle could easily save between $2-$10 on LLM API calls per intent. In practice, this likely means learning from prior interactions is critical to making an affordable AI.
Mapping Intent Into Action
The pain of deploying and operating software was famously captured in a 2010 meme at Google: “I just want to serve 5 Tb”. The meme captured how simple objectives (e.g. serving some data) can turn into a bewilderingly complicated tree of operations due to system complexity and business requirements. The goal of Foyle is to solve this problem of translating intent into actions.
Since we are using Foyle to build Foyle, we can evaluate it by how well it learns to assist us with everyday tasks. The video below illustrates how we use Foyle to troubleshoot the AI we are building by fetching traces.
The diagram below illustrates how Foyle works.
In the video, we are using Foyle to fetch the trace for a specific prediction. This is a fundamental step in any AI engineer’s workflow. The trace contains the information needed to understand the AI’s answer, e.g. the prompts sent to the LLMs, the results of post-processing, etc. Foyle takes the markdown produced by ChatGPT, turns it into a set of blocks, and assigns each block a unique ID. So to understand why a particular block was generated, we might ask for the block trace as follows:
show the block logs for block 01HZ3K97HMF590J823F10RJZ4T
The first time we ask Foyle to help us, it has no prior interactions to learn from, so it largely passes the request along to ChatGPT and we get the following response:
blockchain-cli show-block-logs 01HZ3K97HMF590J823F10RJZ4T
Unsurprisingly, this is completely wrong because ChatGPT has no knowledge of Foyle; it’s just guessing. So this first time, we fix the command ourselves to use Foyle’s REST endpoint to fetch the logs:
curl http://localhost:8080/api/blocklogs/01HZ3K97HMF590J823F10RJZ4T | jq .
Since Foyle is instrumented to log user interactions, it learns from this correction. So the next time we ask for a trace, e.g.
get the log for block 01HZ0N1ZZ8NJ7PSRYB6WEMH08M
Foyle responds with the correct answer:
curl http://localhost:8080/api/blocklogs/01HZ0N1ZZ8NJ7PSRYB6WEMH08M | jq .
Notably, this example illustrates that Foyle is learning how to map higher-level concepts (e.g. block logs) into low-level, concrete actions (e.g. curl).
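To make the learning loop concrete, the training datum Foyle captures from this exchange is essentially the (intent, action) pair from the corrected interaction. The snippet below is a hypothetical illustration of such a pair; the field names are illustrative, not Foyle’s actual log schema.

```python
# Hypothetical illustration of a logged (intent, action) pair; the field
# names are illustrative, not Foyle's actual log schema.
logged_example = {
    "intent": "show the block logs for block 01HZ3K97HMF590J823F10RJZ4T",
    "action": "curl http://localhost:8080/api/blocklogs/01HZ3K97HMF590J823F10RJZ4T | jq .",
}
```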
Results
To measure Foyle’s ability to learn and assist with mapping intent into action, we created an evaluation dataset of 24 examples of intents specific to building and operating Foyle. The dataset consists of the following:
- Evaluation Data: 24 pairs of (intent, action) where the action is a command that correctly achieves the intent
- Training Data: 27 pairs of (intent, action) representing user interactions logged by Foyle
- These were the result of our daily use of Foyle to build Foyle
To evaluate the effectiveness of human feedback, we compared GPT-3.5 without examples to GPT-3.5 with examples. When using examples, we prompt GPT-3.5 with similar examples from prior usage (the prompt is here). Prior examples are selected by using similarity search to find the intents most similar to the current one. To measure the correctness of the generated commands, we use a version of edit distance that measures the number of arguments that need to be changed; the binary itself counts as an argument. This metric can be normalized so that 0 means the predicted command is an exact match and 1 means the predicted command is completely different (precise details are here).
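As a rough illustration of the example-selection step, the sketch below retrieves the prior intents most similar to the current one. The bag-of-words vectors are only a self-contained stand-in for a real similarity search, so treat this as a conceptual outline rather than Foyle’s implementation.

```python
import math
from collections import Counter

# Self-contained stand-in for similarity search over prior intents.
# A production system would use learned embeddings; bag-of-words keeps
# this sketch runnable without any external services.

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def most_similar(query: str, training_pairs: list[tuple[str, str]], k: int = 3):
    """Return the k (intent, action) pairs whose intent best matches the query."""
    q = vectorize(query)
    return sorted(training_pairs, key=lambda p: cosine(q, vectorize(p[0])), reverse=True)[:k]

# The selected pairs are then included as few-shot examples in the prompt to GPT-3.5.
examples = most_similar(
    "Get the log for block 01HZ0N1ZZ8NJ7PSRYB6WEMH08M",
    [
        ("show the block logs for block 01HZ3K97HMF590J823F10RJZ4T",
         "curl http://localhost:8080/api/blocklogs/01HZ3K97HMF590J823F10RJZ4T | jq ."),
        ("show a diff of the dev infra",
         "..."),  # action elided; illustrative entry only
    ],
)
```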
Table 1 below shows that Foyle performs significantly better when using prior examples. The full results are in the appendix. Notably, in 15 of the examples where ChatGPT without examples was wrong, it was completely wrong. This isn’t at all surprising given that GPT-3.5 is missing critical information needed to answer these questions.
| | Number of Examples | Percentage |
| --- | --- | --- |
| Performed Better With Examples | 19 | 79% |
| Did Better or Just As Good With Examples | 22 | 91% |
| Did Worse With Examples | 2 | 8% |
Table 1: For 19 of the examples (79%), the AI performed better when learning from prior examples. In 22 of the 24 examples (91%), the AI did no worse than the baseline when using prior examples. In 2 cases, using prior examples decreased the AI’s performance. The full results are provided in the appendix.
Distance Metric
Our distance metric assumes there are specific tools that should be used to accomplish a task, even when different solutions might produce identical answers. In the context of DevOps this is desirable because there is a cost to supporting a tool, e.g. ensuring it is available on all machines. As a result, platform teams are often opinionated about how things should be done.
For example, to fetch the block logs for block 01HZ0N1ZZ8NJ7PSRYB6WEMH08M, we measure the distance to the command:
curl http://localhost:8080/api/blocklogs/01HZ0N1ZZ8NJ7PSRYB6WEMH08M | jq .
Now suppose the AI instead answered:
wget -q -O - http://localhost:8080/api/blocklogs/01HZ0N1ZZ8NJ7PSRYB6WEMH08M | yq .
Under our metric, the distance is 0.625. The longer command consists of 8 arguments (including the binaries and the pipe operator), and 3 deletions and 2 substitutions are needed to transform the actual answer into the expected one, which yields a distance of 5/8 = 0.625. So in this case, we’d conclude the AI’s answer was largely wrong, even though wget produces exactly the same output as curl here. If an organization is standardizing on curl over wget, then the evaluation metric captures that preference.
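For readers who want to reproduce the arithmetic, the sketch below computes a token-level Levenshtein distance over whitespace-separated arguments and normalizes by the longer command. It is an illustration under those assumptions, not Foyle’s exact implementation.

```python
import shlex

def normalized_distance(actual: str, expected: str) -> float:
    """Normalized edit distance between two commands, treating each
    argument (including the binary and the pipe operator) as one token."""
    a, b = shlex.split(actual), shlex.split(expected)
    # Standard Levenshtein dynamic program over the two token sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete a token from the actual command
                dp[i][j - 1] + 1,         # insert a missing token
                dp[i - 1][j - 1] + cost,  # substitute one token for another
            )
    return dp[len(a)][len(b)] / max(len(a), len(b))

block = "01HZ0N1ZZ8NJ7PSRYB6WEMH08M"
expected = f"curl http://localhost:8080/api/blocklogs/{block} | jq ."
actual = f"wget -q -O - http://localhost:8080/api/blocklogs/{block} | yq ."
print(normalized_distance(actual, expected))  # 5 edits / 8 tokens = 0.625
```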
How much is good data worth?
A lot of agents appear to be pursuing a solution based on throwing lots of data and lots of compute at the problem. For example, to figure out how to “Get the log for block XYZ”, an agent could in principle crawl the Foyle and RunMe repositories to understand what a block is and that Foyle exposes a REST server to make blocks accessible. That approach might cost $2-$10 in LLM calls, whereas with Foyle it’s less than $0.002.
The Foyle repository is ~400K characters of Go code; the RunMe Go code base is ~1.5M characters. So let’s say 2M characters, which is about 500K-1M tokens. With GPT-4-turbo that’s roughly $2-$10, or about 1-7 SWE minutes (assuming $90 per hour). If the agent needs to call GPT-4 multiple times, those costs add up quickly.
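Here is that back-of-envelope arithmetic spelled out. The characters-per-token ratio and the per-token price are assumptions, so treat the numbers as rough.

```python
# Back-of-envelope cost of feeding the Foyle + RunMe source to an LLM.
# The chars-per-token ratio and the price are assumptions, not measurements.
chars = 400_000 + 1_500_000                      # ~Foyle Go code + ~RunMe Go code
tokens_low, tokens_high = chars / 4, chars / 2   # assume ~2-4 characters per token
price_per_token = 10 / 1_000_000                 # assumed $10 per 1M input tokens
cost_low = tokens_low * price_per_token          # ≈ $4.75
cost_high = tokens_high * price_per_token        # ≈ $9.50
swe_minutes = cost_high / 90 * 60                # ≈ 6.3 minutes of SWE time at $90/hour
print(cost_low, cost_high, swe_minutes)
```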
Where is Foyle Going
Today, Foyle only learns single-step workflows. While this is valuable, a lot of toil involves multi-step workflows. We’d like to extend Foyle to support this use case. This likely requires changes to how Foyle learns and how we evaluate Foyle.
Foyle only works if we log user interactions, which means we need a UX that is compelling enough for developers to want to use it. Foyle is now integrated with Runme. We want to work with the Runme team to create features (e.g. renderers, multiple executor support) that give users a reason to adopt a new tool even without the AI.
How You Can Help
If you’re rethinking how you do playbooks and want to create AI-assisted executable playbooks, please get in touch via email (jeremy@lewi.us) or by starting a discussion on GitHub. In particular, if you’re struggling with observability and want to use AI to assist in query creation and to produce rich artifacts combining markdown, commands, and visualizations, we’d love to learn more about your use case.
Appendix: Full Results
The table below provides the prompts, RAG results, and distances for the entire evaluation dataset.
| Prompt | Best RAG Example | Baseline Normalized Distance | Learned Distance |
| --- | --- | --- | --- |
| Get the ids of the execution traces for block 01HZ0W9X2XF914XMG6REX1WVWG | get the ids of the execution traces for block 01HZ3K97HMF590J823F10RJZ4T | 0.6666667 | 0 |
| Fetch the replicate API token | Show the replicate key | 1 | 0 |
| List the GCB jobs that build image backend/caribou | list the GCB builds for commit 48434d2 | 0.5714286 | 0.2857143 |
| Get the log for block 01HZ0N1ZZ8NJ7PSRYB6WEMH08M | show the blocklogs for block 01HZ3K97HMF590J823F10RJZ4T ... | 1 | 0 |
| How big is foyle's evaluation data set? | Print the size of foyle's evaluation dataset | 1 | 0 |
| List the most recent image builds | List the builds | 1 | 0.5714286 |
| Run foyle training | Run foyle training | 0.6666667 | 0.6 |
| Show any drift in the dev infrastructure | show a diff of the dev infra | 1 | 0.4 |
| List images | List the builds | 0.75 | 0.75 |
| Get the cloud build jobs for commit abc1234 | list the GCB builds for commit 48434d2 | 0.625 | 0.14285715 |
| Push the honeycomb nl to query model to replicate | Push the model honeycomb to the jlewi repository | 1 | 0.33333334 |
| Sync the dev infra | show a diff of the dev infra | 1 | 0.5833333 |
| Get the trace that generated block 01HZ0W9X2XF914XMG6REX1WVWG | get the ids of the execution traces for block 01HZ3K97HMF590J823F10RJZ4T | 1 | 0 |
| How many characters are in the foyle codebase? | Print the size of foyle's evaluation dataset | 1 | 0.875 |
| Add the tag 6f19eac45ccb88cc176776ea79411f834a12a575 to the image ghcr.io/jlewi/vscode-web-assets:v20240403t185418 | add the tag v0-2-0 to the image ghcr.io/vscode/someimage:v20240403t185418 | 0.5 | 0 |
| Get the logs for building the image carabou | List the builds | 1 | 0.875 |
| Create a PR description | show a diff of the dev infra | 1 | 1 |
| Describe the dev cluster? | show the dev cluster | 1 | 0 |
| Start foyle | Run foyle | 1 | 0 |
| Check for preemptible A100 quota in us-central1 | show a diff of the dev infra | 0.16666667 | 0.71428573 |
| Generate a honeycomb query to count the number of traces for the last 7 days broken down by region in the foyle dataset | Generate a honeycomb query to get number of errors per day for the last 28 days | 0.68421054 | 0.8235294 |
| Dump the istio routes for the pod jupyter in namespace kubeflow | list the istio ingress routes for the pod foo in namespace bar | 0.5 | 0 |
| Sync the manifests to the dev cluster | Use gitops to aply the latest manifests to the dev cluster | 1 | 0 |
| Check the runme logs for an execution for the block 01HYZXS2Q5XYX7P3PT1KH5Q881 | get the ids of the execution traces for block 01HZ3K97HMF590J823F10RJZ4T | 1 | 1 |
Table 2. The full results for the evaluation dataset. The left column shows the evaluation prompt. The second column shows the most similar prior example (only the query is shown). The third column is the normalized distance for the baseline AI. The 4th column is the normalized distance when learning from prior examples.