1:20 AM. My phone buzzed violently on the nightstand. The head of customer support pinged me three times in the group chat, saying a VIP client had complained: yesterday they taught the AI assistant “I’m vegetarian, no spicy food, budget 200 per person,” but when they asked for a restaurant recommendation today, it suggested Chongqing hot pot. The client snapped, “Does your AI have amnesia?” and immediately cancelled their renewal.
I jolted awake, opened the logs, and started digging. The code was using LangChain’s ConversationSummaryBufferMemory, with Chroma as the underlying vector store for persistence. On the surface each conversation round saved correctly, but checking the recall records revealed that yesterday afternoon, right after the user expressed their preferences, a system-triggered high-frequency Q&A inserted a summary snippet without any conversation context. That snippet pushed the earlier user-preference embedding out of the top_k retrieval window. Root cause in one sentence: memory writes and summary generation were not atomic, lacked consistency checks, and our regression tests only covered single turns — never the multi-turn, interleaved insertion scenarios that happen in the wild.
That night I spent four hours manually fixing data, then two more days building an automated regression testing suite for the memory layer. This article is the post-mortem plus the solution, for all the brothers and sisters tormented by “model amnesia.”
Problem breakdown: why are RAG memory bugs so hard to catch?
The scenario isn’t complicated: users add personal preferences over multiple turns, and the AI must remember those facts and recall them accurately later. Our implementation was LangChain’s ConversationSummaryBufferMemory using Chroma as persistent storage for conversation summaries and raw messages. Every new message gets appended, triggers a summary chain update, and is upserted back into Chroma.
Routine monitoring only looked at QPS and error rates — it completely missed the “silent drift” of memory content. Three reasons:
- Vector similarity is not an exact match. Even if the recall veers off, the answer just becomes more generic. No hard error gets thrown, so it’s easily dismissed as “the model occasionally acting weird.”
- Non-atomic writes under concurrency. Two near-simultaneous messages from user A may be handled by two async tasks, each calling memory.save_context. The order in which their embeddings land is unpredictable. On top of that, summary updates are asynchronous, so the final list of stored documents may not reflect the actual conversation order.
- No regressable semantic assertions. Previously, tests just printed a few outputs for human eyes. There was no way to turn logic like “were user preferences accidentally overwritten” into a red/green result in CI.
In short: we weren’t missing a feature — we were missing a framework that can automatically construct multi-turn dialogs and precisely assert semantic consistency.
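To make the second reason concrete, here is a minimal sketch of how two concurrent saves can land out of order, and how serializing them restores conversation order. It is not our production code: FakeMemory, unordered_writes, serialized_writes, and the delay values are hypothetical stand-ins, and the per-session asyncio.Lock is just one possible fix that only helps within a single process.

```python
import asyncio


class FakeMemory:
    """Hypothetical stand-in for a memory backend: records writes in landing order."""

    def __init__(self) -> None:
        self.docs: list[str] = []

    async def save_context(self, text: str, delay: float) -> None:
        # Simulate embedding + upsert latency before the document lands.
        await asyncio.sleep(delay)
        self.docs.append(text)


async def unordered_writes() -> list[str]:
    """Two concurrent saves: the slower first message lands after the second."""
    mem = FakeMemory()
    await asyncio.gather(
        mem.save_context("turn 1: I'm vegetarian", delay=0.05),
        mem.save_context("turn 2: budget 200 per person", delay=0.0),
    )
    return mem.docs


async def serialized_writes() -> list[str]:
    """Same saves, but a lock makes each write finish before the next starts."""
    mem = FakeMemory()
    lock = asyncio.Lock()  # assumption: one lock per user or session

    async def guarded_save(text: str, delay: float) -> None:
        async with lock:
            await mem.save_context(text, delay)

    await asyncio.gather(
        guarded_save("turn 1: I'm vegetarian", delay=0.05),
        guarded_save("turn 2: budget 200 per person", delay=0.0),
    )
    return mem.docs


if __name__ == "__main__":
    print(asyncio.run(unordered_writes()))   # turn 2 lands before turn 1
    print(asyncio.run(serialized_writes()))  # turn 1, then turn 2
```

The lock trades a little latency for a deterministic write order, and a deterministic order is what makes the behavior testable at all.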
Design choices: why not switch to SQL joins — and why we decided to “hack” LangChain’s memory instead
When the bug surfaced, somebody suggested, “Let’s just switch to a relational database and use SQL joins to pull all user history and feed it to the LLM.” But the reality is the LLM context window is limited; you can’t just dump everything in. You still need semantic summarization and retrieval. So a vector memory layer is unavoidable — and if it’s unavoidable, you have to be able to test it.
We stuck to:
- LangChain: already in production, no migration headaches. The key was to wrap its memory abstraction with a testable shell.
- Chroma: lightweight, supports both embedded and client‑server modes, can be deployed locally during tests — no cloud dependencies, no unpredictable latency or leftover data.
- pytest + custom fixtures: able to construct dialog sequences, simulate real user behavior, and clear the Chroma collection after each test so suites stay isolated.
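As a rough illustration of the isolation idea, the sketch below hands each test its own collection name and persist directory and deletes the directory afterwards. It uses only the standard library. The helper name isolated_store and the naming scheme are assumptions for illustration; the pair it yields would be passed to whatever constructs the Chroma-backed memory.

```python
import contextlib
import os
import shutil
import tempfile
import uuid


@contextlib.contextmanager
def isolated_store():
    """Yield a (collection_name, persist_dir) pair unique to one test, then clean up."""
    persist_dir = tempfile.mkdtemp(prefix="chroma_test_")
    collection_name = f"test_memory_{uuid.uuid4().hex[:8]}"
    try:
        yield collection_name, persist_dir
    finally:
        # Remove everything the test persisted so the next test starts empty.
        shutil.rmtree(persist_dir, ignore_errors=True)


# In a pytest suite this could be wrapped as a fixture (assumes pytest is installed):
#
#   @pytest.fixture
#   def store():
#       with isolated_store() as pair:
#           yield pair

if __name__ == "__main__":
    with isolated_store() as (name, path):
        print(name, os.path.isdir(path))
    print(os.path.exists(path))
```

A fresh directory per test avoids relying on collection deletion alone, which is useful when a failed test leaves data behind.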
Core idea: abstract “memory write → recall → check that critical facts still exist” into a test function, then call that function with different dialog sequences to build a regression suite. To deal with LangChain’s internal caching and async flushes, we explicitly force refresh via memory.load_memory_variables and (where needed) sleep to wait for Chroma persistence. The pitfall section below gets into the details.
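The shape of that test function can be sketched without any external service. In the example below, ToyMemory is a hypothetical stand-in that ranks documents by keyword overlap instead of real embeddings, and missing_facts is an illustrative name for the "write, recall, check" helper. The point is only the structure: feed a dialog sequence in, recall with top_k, and return the critical facts that went missing.

```python
import re
from dataclasses import dataclass, field


def _tokens(text: str) -> set:
    """Lowercased word set; a crude substitute for an embedding."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


@dataclass
class ToyMemory:
    """Hypothetical memory layer: keyword-overlap recall limited to top_k documents."""
    docs: list = field(default_factory=list)

    def write(self, text: str) -> None:
        self.docs.append(text)

    def recall(self, query: str, top_k: int = 2) -> list:
        q = _tokens(query)
        ranked = sorted(self.docs, key=lambda d: len(q & _tokens(d)), reverse=True)
        return ranked[:top_k]


def missing_facts(dialog, query, required_facts, top_k=2):
    """Write every turn, recall for the query, return the required facts not recalled."""
    mem = ToyMemory()
    for turn in dialog:
        mem.write(turn)
    recalled = " ".join(mem.recall(query, top_k)).lower()
    return [fact for fact in required_facts if fact.lower() not in recalled]


PREFERENCE = "I'm vegetarian, no spicy food, budget 200 per person"
QUERY = "recommend a restaurant for a vegetarian with no spicy food"
FACTS = ["vegetarian", "no spicy"]

# Case 1: an ordinary dialog, the preference is still recalled.
clean = [PREFERENCE, "What's the weather like today?"]

# Case 2: system-inserted snippets that resemble the query crowd the preference out of top_k.
interleaved = [
    PREFERENCE,
    "FAQ: how to recommend a restaurant for a group with no reservation",
    "FAQ: recommend a restaurant with parking for a family",
]

if __name__ == "__main__":
    print(missing_facts(clean, QUERY, FACTS))        # []
    print(missing_facts(interleaved, QUERY, FACTS))  # ['vegetarian', 'no spicy']
```

In a real suite the helper would take the actual memory wrapper, and each dialog sequence would become one parametrized test case asserting that the returned list is empty.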
Core implementation: from memory wrapper to automated assertions
1. A testable memory wrapper
This snippet solves one problem: it pins down the persistence behavior of LangChain’s native memory so that you can control cleanup and collection isolation in a test environment, without introducing extra wrappers in production.
import os
from typing import List, Dict

from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma


class TestableMemory:
    """Wrap LangChain memory, force persistence to Chroma, and expose a cleanup interface."""

    def __init__(self, collection_name: str = "test_memory", persist_dir: str = "./chroma_test"):
        self.llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
        self.embeddings = OpenAIEmbeddings()
        self.collection_name = collection_name
        self.persist_dir = persist_dir
        # Key point: give Chroma its own directory so tests do not pollute each other
        self.vectorstore = Chroma(
            collection_name=collection_name,
            embedding_function=self.embeddings,
            persist_directory=persist_dir,
        )
        self.memory = ConversationSummaryBufferMemory(
            llm=self.llm,
            max_token_limit=500,
            return_messages=True,
            # Use Chroma as the storage backend for chat_memory.
            # Note: this field is typed as BaseChatMessageHistory, so a plain
            # vector store likely needs an adapter to be accepted here.
            chat_memory=self.vectorstore,
        )

    def add_interaction(self, user_msg: str, ai_msg: str):
        self.memory.save_context(
            {"input": user_msg},
            {"output": ai_msg}
        )
        # Force a load right away so the write is visible to later retrieval (reduces caching effects)
        self.memory.lo