Improving Memory Retrieval: How New Computer achieved 50% higher recall with LangSmith

New Computer used LangSmith to improve their memory retrieval system, achieving 50% higher recall by tracking regressions in the experiment comparison view and adjusting conversation prompts accordingly.

About New Computer

New Computer is the team behind Dot, the first personal AI designed to truly understand its users. Dot’s long-term memory system learns users’ preferences over time by observing verbal and behavioral cues. Dot’s memory system goes beyond simple recall: it constantly evolves its picture of who the user is in order to provide timely, personalized assistance, creating a perception of true understanding.

With LangSmith, New Computer has been able to test and improve their memory retrieval systems, leading to 50% higher recall and 40% higher precision compared to a previous baseline implementation of dynamic memory retrieval.

A brief overview of Dot’s agentic memory

The New Computer team has built an innovative, first-of-its-kind agentic memory system. Unlike standard RAG methods that rely on a static set of documents, agentic memory involves dynamically creating or pre-calculating documents that will only be retrieved later. This means information must be structured at memory-creation time so that retrieval is not only possible but remains accurate and efficient as memories accumulate over time.

In addition to the raw content, Dot’s memories have a set of optional “meta-fields” that are useful for retrieval. These include status (e.g. COMPLETED or IN PROGRESS) and datetime fields like start or due date. These can be used as additional filter methods for high-frequency queries during retrieval, such as “Which tasks did I want to get done this week?”, or “What do I have left to complete for today?”
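
As a rough illustration, a memory with these meta-fields might look like the sketch below. The field names and the filter function are assumptions for illustration, not Dot’s actual schema:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical memory record; field names are illustrative, not Dot's actual schema.
@dataclass
class Memory:
    content: str
    status: Optional[str] = None          # e.g. "COMPLETED" or "IN_PROGRESS"
    start_date: Optional[datetime] = None
    due_date: Optional[datetime] = None

def open_tasks_due_by(memories: list[Memory], cutoff: datetime) -> list[Memory]:
    """Meta-field pre-filter for a query like 'What do I have left to complete for today?'"""
    return [
        m for m in memories
        if m.status == "IN_PROGRESS" and m.due_date is not None and m.due_date <= cutoff
    ]
```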

Improving memory retrieval with LangSmith

With their diverse range of retrieval methods (semantic search, keyword and BM25 search, and meta-field filters, used alone or in combination), New Computer needed a way to iterate quickly on a dataset of labeled examples. To test performance while preserving user privacy, they generated synthetic data by creating a cohort of synthetic users with LLM-generated backstories. After an initial conversation to seed the memory database for each synthetic user, the team began storing queries (messages from synthetic users) along with the full set of available memories in a LangSmith dataset.
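
A dataset like this can be assembled with the LangSmith SDK. The sketch below assumes a hypothetical dataset name and example shape (query plus candidate memories as inputs, labeled memory IDs as outputs):

```python
from langsmith import Client

client = Client()  # reads LANGSMITH_API_KEY from the environment

# Hypothetical dataset name and example schema.
dataset = client.create_dataset(
    dataset_name="synthetic-user-memory-retrieval",
    description="Synthetic-user queries paired with each user's full set of memories.",
)

client.create_example(
    dataset_id=dataset.id,
    inputs={
        "query": "Which tasks did I want to get done this week?",
        "memories": ["Finish the quarterly report", "Renew my passport"],
    },
    outputs={"relevant_memory_ids": [0]},  # labels come from the in-house annotation tool
)
```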

Using an in-house tool connected to LangSmith, the New Computer team labeled relevant memories for each query and defined evaluation metrics like precision, recall and F1, allowing them to quickly iterate on improving retrieval for the agentic memory system.
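
For retrieval, these metrics reduce to simple set overlaps between retrieved and labeled memory IDs. A minimal sketch of such an evaluator, assuming each run outputs retrieved_memory_ids and each example is labeled with relevant_memory_ids:

```python
# Set-overlap retrieval metrics over retrieved vs. labeled memory IDs.
def precision_recall_f1(retrieved_ids, relevant_ids):
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Wrapped as a custom LangSmith evaluator (the (run, example) signature used by
# the langsmith SDK's custom evaluators).
def recall_evaluator(run, example):
    _, recall, _ = precision_recall_f1(
        run.outputs["retrieved_memory_ids"],
        example.outputs["relevant_memory_ids"],
    )
    return {"key": "recall", "score": recall}
```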

For this set of experiments, they started with a simple baseline: semantic search that retrieved a fixed number of the most relevant memories per query. They then tested other techniques to assess performance across different query types. In some cases, similarity search or keyword methods like BM25 worked better on their own; in others, these methods needed pre-filtering by meta-fields to perform effectively.
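
The baseline is conceptually a fixed top-k nearest-neighbor search over memory embeddings. A minimal sketch (the embedding source and the value of k are assumptions):

```python
import numpy as np

# Baseline sketch: fixed top-k semantic retrieval over precomputed memory embeddings.
def semantic_top_k(query_vec: np.ndarray, memory_vecs: np.ndarray, k: int = 5) -> list[int]:
    # Cosine similarity between the query and every stored memory embedding.
    sims = memory_vecs @ query_vec / (
        np.linalg.norm(memory_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-10
    )
    return np.argsort(-sims)[:k].tolist()
```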

As you might imagine, running these multiple methods in parallel can lead to a combinatorial explosion of experiments, so validating different methods quickly on a diverse dataset is crucial for making progress. LangSmith’s easy-to-use SDK and Experiments UI enabled New Computer to run, evaluate, and inspect the results of these experiments quickly and efficiently.
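
One way to sweep such combinations is to launch one LangSmith experiment per configuration. The sketch below assumes a hypothetical run_retrieval helper and reuses the dataset and evaluator names from the sketches above:

```python
from itertools import product
from langsmith import evaluate

# Hypothetical grid of retrieval configurations; each combination becomes one experiment.
methods = ["semantic", "bm25", "semantic+bm25"]
prefilters = [None, "meta-fields"]

for method, prefilter in product(methods, prefilters):
    def target(inputs: dict, method=method, prefilter=prefilter) -> dict:
        # run_retrieval is a hypothetical helper wrapping the retrieval pipeline under test.
        ids = run_retrieval(inputs["query"], inputs["memories"], method, prefilter)
        return {"retrieved_memory_ids": ids}

    evaluate(
        target,
        data="synthetic-user-memory-retrieval",
        evaluators=[recall_evaluator],
        experiment_prefix=f"{method}-prefilter-{prefilter}",
    )
```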

An overview of F1 performance across different experiments that New Computer ran in LangSmith

These experiments enabled New Computer to significantly improve their memory system, achieving the 50% gain in recall and 40% gain in precision over the baseline implementation noted above.

Adjusting the conversation prompt with LangSmith

Dot’s responses are generated by a dynamic conversational prompt: in addition to including relevant memories, the system may also rely on tool usage (e.g. search results) and highly contextual behavioral instructions in order to respond in an accurate and natural way.
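
A dynamic prompt like this is essentially assembled from whichever context sections are available at response time. A minimal sketch, where the section names and ordering are assumptions rather than Dot’s actual prompt structure:

```python
# Minimal sketch of dynamic prompt assembly; sections are included only when present.
def build_conversation_prompt(memories, tool_results, behavior_notes, user_message):
    sections = []
    if memories:
        sections.append("Relevant memories:\n" + "\n".join(f"- {m}" for m in memories))
    if tool_results:
        sections.append("Tool results:\n" + "\n".join(f"- {r}" for r in tool_results))
    if behavior_notes:
        sections.append("Behavioral instructions:\n" + "\n".join(f"- {b}" for b in behavior_notes))
    sections.append(f"User: {user_message}")
    return "\n\n".join(sections)
```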

Developing a highly variable system like this can be challenging, as a change that improves one query can have detrimental effects on others.

To optimize the prompt, the New Computer team again used a cohort of synthetic users to generate queries covering a wide range of intents. They could then inspect the global effects of prompt changes in LangSmith’s experiment comparison view, which made it easy to spot runs that regressed after a prompt change.
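
In practice, this amounts to running the same dataset of synthetic queries through each prompt variant as its own experiment, then opening the experiments side by side in the comparison view. A sketch, where generate_response, the dataset name, and the judge evaluator are all hypothetical:

```python
from langsmith import evaluate

# Run the same synthetic-query dataset against two prompt variants; the resulting
# experiments can then be diffed in LangSmith's experiment comparison view.
for prompt_version in ["conversation-prompt-v1", "conversation-prompt-v2"]:
    evaluate(
        # generate_response is a hypothetical helper that renders the given
        # prompt version and calls the chat model.
        lambda inputs, v=prompt_version: {"answer": generate_response(inputs["query"], v)},
        data="synthetic-user-intents",
        evaluators=[answer_quality_evaluator],  # hypothetical LLM-as-judge evaluator
        experiment_prefix=prompt_version,
    )
```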

In addition, when a failure case produced an inaccurate output, the team could adjust prompts directly in the built-in prompt playground without leaving the LangSmith UI. This greatly improved the team’s iteration speed while evaluating and adjusting their conversation prompts.

What’s next for New Computer

As New Computer pushes to deepen human-AI relationships, the team is constantly seeking ways to make users feel truly perceived and understood. This includes enabling Dot to adapt to each user’s conversational and tonal preferences, and to become fully bespoke on a per-user basis by proactively reaching out with tailored messages.

Their recent launch has brought in a new wave of users (more than 45% of whom converted to the app’s paid tier after hitting the free message limit) who expect Dot to grow and evolve alongside them over time. New Computer’s partnership with the LangChain team and use of LangSmith will remain pivotal to how the team uses novel AI materials to simulate the complexities of a deepening relationship with human users.