mem0 on X: "Agent memory is blind to time, so we built a temporal reasoning layer in production to fix it"

mem0 (@mem0ai), May 12, 2026


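TL;DR (AI summary)

The Temporal Reasoning layer manages memories through time signatures and improves memory accuracy for long-running agents.

Key points

On LoCoMo, Temporal Reasoning raised overall accuracy from 86.1% to 90.2%, with the largest gains on temporal and multi-hop questions.
Temporal Reasoning manages memories through time signatures, distinguishing past events, current states, and future plans.
On LongMemEval, Temporal Reasoning raised overall accuracy from 90.4% to 94.8% at top_50, with the biggest lift on multi-session questions.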


We’re introducing Temporal Reasoning in Mem0, a new memory layer that helps AI agents understand not just what they remember, but when it was true.

Long-running agents don’t fail only because they forget. They fail because they remember too much in the same tense. A user’s old city, previous job, past plan, and current preference can all sit in memory as if they are equally active today. Traditional retrieval can find the right topic, but it cannot always tell whether that memory is still current.

Temporal Reasoning fixes this by giving every memory a time signature, managing evolving states like location or job history, and reranking time-sensitive queries around what is current, historical, or upcoming. So when a user asks “where do I live now?” or “what am I working on this week?”, the agent retrieves the right dated instance, not just the closest semantic match.

Across our evaluations, this shows up most clearly in long-running memory benchmarks. On LoCoMo, Temporal Reasoning improved overall accuracy from 86.1% to 90.2%, with the largest gains on temporal and multi-hop questions. On LongMemEval, it improved overall accuracy from 90.4% to 94.8% at top_50, with the biggest lift on multi-session questions, where agents need to track how facts evolve across many conversations.

Every memory gets a time signature. When a memory is written, a separate temporal reasoning pass reads the memory text alongside the original conversation and the date it happened. It extracts when the event occurred, whether it's still ongoing or completed, how precise the timing is, and what kind of memory it is, such as a plan, state, event, relationship, preference, or absence.
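As a rough illustration of what such a time signature might hold, the sketch below models the four things the pass is described as extracting. The class name `TimeSignature` and the fields `kind`, `event_start`, and `precision` are hypothetical; only `event_end` is named elsewhere in this post, and Mem0's actual schema may differ. The dates are made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


# Hypothetical schema: field names are illustrative, not Mem0's storage format.
@dataclass
class TimeSignature:
    kind: str                    # e.g. "plan", "state", "event", "preference"
    event_start: Optional[date]  # when the event or state began, if known
    event_end: Optional[date]    # None while no end has been recorded
    precision: str               # how exact the timing is, e.g. "day", "month"

    @property
    def ongoing(self) -> bool:
        # No recorded end date means the fact is treated as still active.
        return self.event_end is None


# Illustrative values only
lives_in_austin = TimeSignature("state", date(2024, 3, 1), None, "month")
japan_trip = TimeSignature("event", date(2023, 7, 1), date(2023, 7, 31), "month")
```

Keeping this structure next to the memory text is what would let search tell an open state apart from a completed event without re-reading the conversation.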

Memories are understood as temporal facts, not just text. The system stores and distinguishes between past events, ongoing states, future plans, relationships, preferences, and absences. That temporal structure lives with the memory itself, so search can treat "lives in Austin," "has a dentist appointment next Tuesday," and "went to Japan last summer" as fundamentally different kinds of facts.

Time-sensitive queries get time-aware ranking. Queries like "where does she live now?", "what's she planning this week?", or "how long has she had that job?" are classified by their temporal intent, with no extra LLM call. The temporal intent is then used to rerank retrieval results so the right dated instance surfaces, not just the most semantically similar one.

Temporal reasoning is fully additive: it layers on top of the existing mem0 algorithm pipeline without replacing the base retrieval system.

7 memory types: from one-time events to stable timeless facts, every memory is classified at write time

7 temporal query modes: classified at query time, zero extra LLM calls on the read path

Temporal scoring is additive: it nudges ranking toward the right dated instance; semantic relevance always dominates

LoCoMo (1,540 questions): +9.1 pts at top_50, where temporal reranking has the most room to work; +4.1 pts overall across categories; and +6.7 pts on temporal questions.

LongMemEval: 94.8% at top_50, up from 90.4% (a +4.4 pt overall lift), and +11.2 pts on multi-session questions (from 82.0% to 93.2% at top_50)

Median search latency stays flat: +1 ms overhead on the read path

Async enrichment: latency-sensitive writes complete immediately; temporal metadata patches within seconds in the background

Zero data deletion: temporal metadata is additive, so historical context remains accessible
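The headline lifts in this list follow directly from the before/after accuracies quoted in this post. A quick check, using only numbers that appear here (the dictionary keys are just labels chosen for this sketch):

```python
# (baseline %, with Temporal Reasoning %) pairs as quoted in this post
reported = {
    "LoCoMo overall": (86.1, 90.2),
    "LoCoMo top_50": (82.7, 91.8),
    "LongMemEval overall top_50": (90.4, 94.8),
    "LongMemEval multi-session top_50": (82.0, 93.2),
}

# Lift in percentage points, rounded to one decimal to avoid float noise
lifts = {
    name: round(after - before, 1)
    for name, (before, after) in reported.items()
}

for name, lift in lifts.items():
    print(f"{name}: +{lift} pts")
```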

The API doesn't change. Add memories the way you always have:

Python

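from mem0 import MemoryClient

# Assumes the hosted Mem0 platform client; replace the placeholder API key
client = MemoryClient(api_key="your-api-key")

client.add(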
    "I just accepted a job offer at Stripe. Starting in two weeks.",
    user_id="alice"
)

# Later in another conversation
client.add(
    "I left Stripe after a year. Started my own company now.",
    user_id="alice"
)

# Search returns the right answer: current state, not historical noise
client.search("where does Alice work?", user_id="alice")
# → "Alice runs her own company" (Stripe memory is now a closed state, dated)

Node.js

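import MemoryClient from "mem0ai";

// Assumes the hosted Mem0 platform client; replace the placeholder API key
const client = new MemoryClient({ apiKey: "your-api-key" });

await client.add(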
  "Planning a trip to Tokyo for the spring conference.",
  { user_id: "bob" }
);

// After the trip
await client.add(
  "Just got back from Tokyo. The conference was amazing.",
  { user_id: "bob" }
);

await client.search("has Bob been to Tokyo?", { user_id: "bob" });
// → "Bob attended a conference in Tokyo" (as a past event, correctly dated)

Temporal Reasoning is on by default for every Mem0 project. If you ever need standard retrieval for a particular call, pass temporal_reasoning=False on that call to skip it.

Every memory can now carry four kinds of temporal structure:

Table 1. Temporal Structure

And every memory falls into one of seven types:

Table 2. Memory Type

Ongoing facts (states, relationships, preferences) carry a state key: a stable identifier that links every memory about the same evolving fact for one person. When a new state takes over, the old one's event_end gets set automatically. Timeline stays clean, nothing gets deleted.
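To make the state-key behavior concrete, here is a minimal sketch of how a new state might close the previous one without deleting it. Everything except the `event_end` field name is an assumption: the `StateMemory` class, the `"alice:employer"` key format, the helper functions, and the dates are illustrative, not Mem0's implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class StateMemory:
    text: str
    state_key: str                    # hypothetical format, e.g. "alice:employer"
    event_start: date
    event_end: Optional[date] = None  # None means the state is still active


def add_state(timeline: Dict[str, List[StateMemory]], new: StateMemory) -> None:
    """Append a state and close whichever state it supersedes; nothing is deleted."""
    history = timeline.setdefault(new.state_key, [])
    for old in history:
        if old.event_end is None:
            old.event_end = new.event_start  # close the previously active state
    history.append(new)


def current_state(timeline: Dict[str, List[StateMemory]], key: str) -> Optional[StateMemory]:
    """Return the open state for a key, if there is one."""
    return next((m for m in timeline.get(key, []) if m.event_end is None), None)


timeline: Dict[str, List[StateMemory]] = {}
add_state(timeline, StateMemory("Works at Stripe", "alice:employer", date(2024, 1, 15)))
add_state(timeline, StateMemory("Runs her own company", "alice:employer", date(2025, 1, 20)))

print(current_state(timeline, "alice:employer").text)  # Runs her own company
print(len(timeline["alice:employer"]))                 # 2, the Stripe memory is kept
```

The closed Stripe entry stays in the list with its end date set, which is how the history remains queryable while only one state per key is open at a time.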

Write: separate extraction and temporal enrichment

Memory extraction and temporal enrichment are intentionally separate passes. The extraction model does what it always did: reads the conversation and writes memories as natural language. The temporal reasoning pass then runs over those memories and returns structured time metadata for each one, covering when it happened, what type of memory it is, whether it's still active, and related temporal fields.

Keeping these separate means each can be improved independently. A better extraction model doesn't touch temporal enrichment, and vice versa.

For latency-sensitive applications, temporal enrichment can run asynchronously. Memories are written immediately so the add call returns fast, and a background worker patches in temporal metadata afterward. Until enrichment completes, search falls back gracefully to the base behavior for those memories.
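The asynchronous path can be pictured with a small in-process sketch: the write returns at once, a background worker patches in metadata, and scoring treats unenriched memories as carrying no temporal signal. The `store`, `enrich`, and `temporal_score` names are hypothetical stand-ins, and `enrich` is a toy rule where the real pass would call a model.

```python
import queue
import threading
from typing import Dict, Optional

store: Dict[int, dict] = {}                  # memory_id -> record
pending: "queue.Queue[int]" = queue.Queue()  # ids waiting for enrichment


def enrich(text: str) -> dict:
    # Toy stand-in for the temporal reasoning pass.
    return {"kind": "state" if "now" in text.lower() else "event"}


def add(text: str) -> int:
    """Write the memory right away and defer enrichment to the background."""
    memory_id = len(store) + 1
    store[memory_id] = {"text": text, "temporal": None}
    pending.put(memory_id)
    return memory_id


def worker() -> None:
    while True:
        memory_id = pending.get()
        store[memory_id]["temporal"] = enrich(store[memory_id]["text"])
        pending.task_done()


def temporal_score(memory_id: int) -> float:
    """Unenriched memories add no temporal signal, so ranking is base behavior."""
    meta: Optional[dict] = store[memory_id]["temporal"]
    return 0.0 if meta is None else 1.0  # 1.0 is a placeholder for a real score


memory_id = add("Started my own company now.")
before = temporal_score(memory_id)  # written, not yet enriched
threading.Thread(target=worker, daemon=True).start()
pending.join()                      # wait for the background patch (demo only)
after = temporal_score(memory_id)   # metadata has been patched in
```

In this sketch the fallback is simply a zero temporal contribution, so a memory that has not been enriched yet is ranked on semantic relevance alone.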

Read: intent classification, retrieval, and temporal reranking

Queries are classified by their temporal intent, with no additional LLM call. The system recognises modes like:

historical_range: "what happened in March?"

current_state: "where does she live now?"

duration_state: "how long has she had that job?"

upcoming: "what's she planning this week?"

soft_recency: "what has she been up to lately?"
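One way to classify intent without an LLM call is a small set of keyword rules. The sketch below is only an illustration of that idea: the patterns, their ordering, and the `"none"` fallback label are assumptions, since this post does not describe how Mem0's classifier actually works.

```python
import re

# Hypothetical keyword rules; more specific patterns are checked first.
RULES = [
    ("duration_state", re.compile(r"\bhow long\b")),
    ("upcoming", re.compile(r"\b(planning|upcoming|next|tomorrow|this week)\b")),
    ("current_state", re.compile(r"\b(now|currently|these days)\b")),
    ("soft_recency", re.compile(r"\b(lately|recently)\b")),
    ("historical_range", re.compile(
        r"\bin (january|february|march|april|may|june|july|august|"
        r"september|october|november|december|\d{4})\b")),
]


def classify_intent(query: str) -> str:
    """Return the first matching temporal mode, or "none" if nothing matches."""
    q = query.lower()
    for mode, pattern in RULES:
        if pattern.search(q):
            return mode
    return "none"


for example in [
    "what happened in March?",
    "where does she live now?",
    "how long has she had that job?",
    "what's she planning this week?",
    "what has she been up to lately?",
]:
    print(f"{classify_intent(example)}: {example}")
```

Rules like these cost microseconds per query, which is consistent with a read path that adds no model call, though a production classifier would need far broader coverage.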

Critically, this classification does not filter what gets retrieved. Retrieval runs the same way for every query; the temporal intent has no say in which memories enter the candidate pool. It is used only at the reranking step, where each retrieved candidate is scored by how well its stored time metadata matches the query's intent.

This is intentional. Pre-filtering by time would silently drop memories with imprecise or missing dates. Instead, temporal scoring is additive, a soft signal layered on top of semantic relevance. A memory with strong semantic fit but weak temporal alignment will still rank above one with perfect temporal alignment but weak semantic fit. The temporal layer nudges ranking; it does not override it.
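A minimal sketch of additive scoring, assuming both scores lie in [0, 1] and the temporal term carries a small fixed weight. The weight of 0.1, the `rerank` function, and all candidate scores are made up for illustration; the post does not publish the actual formula.

```python
from typing import List, Tuple

# Assumed weight: small enough that semantic relevance dominates.
TEMPORAL_WEIGHT = 0.1


def rerank(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """candidates: (text, semantic score in [0, 1], temporal alignment in [0, 1])."""
    scored = [(sem + TEMPORAL_WEIGHT * temp, text) for text, sem, temp in candidates]
    return [text for _, text in sorted(scored, reverse=True)]


# Query: "where does she live now?" (all scores are illustrative)
candidates = [
    ("Lived in Denver until 2023", 0.85, 0.0),            # on topic, closed state
    ("Lives in Austin", 0.82, 1.0),                       # on topic, current state
    ("Is currently training for a marathon", 0.30, 1.0),  # current, but off topic
]
ranked = rerank(candidates)
print(ranked[0])  # Lives in Austin
```

With these numbers the temporal term is enough to reorder two near-identical candidates, but the off-topic memory stays last despite perfect temporal alignment, because the bonus is capped at 0.1.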

Temporal scoring happens after retrieval as an additive ranking signal, so the best temporal match can rise above memories that are only semantically similar.

We tested the performance of our baseline algorithm without temporal reasoning, and then with temporal reasoning, on the LoCoMo and LongMemEval benchmarks.

LoCoMo includes 1,540 questions across temporal, multi-hop, open-domain, and single-hop categories. These are the kinds of questions that distinguish a system with a concept of time from one without.

We report results in two views: by retrieval cutoff, and by question category.

Because these views aggregate the benchmark differently, their overall numbers are not expected to match exactly. The cutoff table shows performance at different retrieval depths, while the category table shows performance by question type.

Performance by retrieval cutoff

We tested the performance of our baseline algorithm without temporal reasoning, and then with temporal reasoning, by retrieval cutoff across all categories.

Table 3. Accuracy by retrieval cutoff on LoCoMo

The strongest LoCoMo gain appears at top_50, where accuracy improves from 82.7% to 91.8%. With more candidates in the pool, temporal reranking has more room to distinguish the right dated instance from near-identical alternatives, which is exactly the kind of distinction the LoCoMo benchmark is designed to stress.

Performance by question category

Table 4. Accuracy by question category on LoCoMo (Across Top-10/20/50/200)

Note on reading the tables: The LoCoMo results are shown in two different views of the same benchmark. The category breakdown reports performance by question type, and its overall score is weighted by the number of questions in each category. The cutoff table reports performance by retrieval depth (top_10, top_20, top_50, and top_200).

Temporal Reasoning category only

Table 5. Accuracy by retrieval cutoff on LoCoMo (Temporal reasoning category only)

Overall, we saw a +4.1 point lift across 1,540 questions. The biggest wins show up where they matter most: temporal and multi-hop questions, where the system has to figure out which instance applies and what is true right now.

The open-domain dip (96 questions, −1.9 pts) is real, and we’re actively tuning it. It is a small slice of the benchmark, and those questions are less likely to benefit from temporal reranking. We’re calling it out here so it’s visible, not buried.

Temporal Reasoning also improves LongMemEval performance over our latest baseline, reaching 94.8% at top_50 and 94.4% at top_200 across 500 questions.

The biggest top_50 gain comes from multi-session questions, where accuracy improves from 82.0% to 93.2%. At top_200, the strongest lift appears in the temporal-reasoning category, improving from 93.2% to 97.0%.

LongMemEval is especially useful because it stresses the cases that show up in production memory systems: older facts competing with newer facts, user preferences evolving over time, and questions that require combining evidence across many sessions.

Performance by question category and retrieval cutoff

Table 6. Accuracy by question category and retrieval cutoff on LongMemEval

The strongest LongMemEval lift is at top_50, where Temporal Reasoning improves overall accuracy from 90.4% to 94.8%. Multi-session questions see the largest category gain, improving from 82.0% to 93.2%, because temporal structure helps the system keep track of what happened when across many conversations.

At top_200, Temporal Reasoning reaches 94.4% overall, ahead of the V3 baseline at 93.4%. The temporal-reasoning category improves from 93.2% to 97.0%, showing that dated memory metadata becomes more valuable as the candidate set grows and contains more near-duplicate or time-shifted facts.

Knowledge-update remains the hardest category for an additive memory architecture. The latest algorithm release intentionally preserves historical memories rather than deleting or replacing them, so older semantically similar facts can still appear near newer facts. Temporal Reasoning gives the system stronger time-awareness, and the memory decay feature we recently launched further tackles stale-memory pressure by reducing the impact of older, less-current memories over time. This is part of the broader shift from static memory state toward memory evolution: preserving history while making the current truth easier to retrieve.

Table 7. Search Latency

Median latency is essentially unchanged at +1 ms. Tail latency grows more noticeably (p95 adds ~198 ms), which is worth monitoring during rollout.

Personal assistants and copilots. "What does the user prefer for dinner?" should return current preferences, not an accumulation of every meal mention over two years of conversations. Recent stated preferences rank above historical ones. Superseded preferences (the user switched from vegetarian to pescatarian) don't surface as conflicting facts.

Coding agents and dev tools. "What was the user working on?" means this sprint, not the side project from six months ago that quietly died. Plan-type memories age out of current-state queries naturally, no manual pruning needed.

CRM and sales intelligence. "What's the customer's current contract tier?" can be answered precisely. Subscription tiers are linked by state key. Customer upgrades close the old tier and open the new. No ambiguity, no hallucinated answers off stale context.

Healthcare and care coordination. "What medications is the patient currently taking?" is a question where temporal precision is the entire point. Discontinued meds carry an event_end. Active ones don't. Active states surface first, closed ones sit as historical context.

Long-running agents. Anything that accumulates memory across weeks or months hits the same wall: oldest memories feel as present as newest ones. Temporal Reasoning gives those agents a built-in sense of recency, no manual pruning required.

Temporal reasoning changes how memory behaves across the full pipeline: how memories are written, classified, and ranked. Classify once at write time so reads stay free. Score additively so semantic relevance still wins when it should. Never delete, because temporal context is metadata, not a replacement.

The result shows up in both benchmarks: +6.7 points on temporal questions in LoCoMo, and on LongMemEval, overall accuracy improves from 90.4% to 94.8% at top_50 (with the biggest lift on multi-session questions). Median latency stays essentially flat at +1 ms on the read path. Writes stay fast. Reads stay efficient. History stays intact.

That's the bar we wanted to clear before shipping, and the foundation for what comes next: reasoning across conflicting timelines, detecting stale facts, and tracking how beliefs evolve over time.

Temporal Reasoning is the foundation. With every memory carrying a time signature, the next layer is temporal query answering: direct responses to questions like "how long has the user had their current subscription?" computed from stored temporal metadata rather than inferred from semantic similarity.
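The post does not say how temporal query answering will be implemented. As a sketch of the general idea, a duration question could be answered by date arithmetic over stored metadata instead of by similarity search. The function name, the `event_start` field, and the dates below are assumptions for illustration.

```python
from datetime import date
from typing import Optional


def state_duration_days(event_start: date, event_end: Optional[date], today: date) -> int:
    """Days a state has lasted: up to event_end if closed, otherwise up to today."""
    end = event_end if event_end is not None else today
    return (end - event_start).days


# Illustrative dates: an open subscription state that started on 2025-01-10
days = state_duration_days(date(2025, 1, 10), None, today=date(2025, 4, 10))
print(days)  # 90
```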

After that comes temporal conflict resolution: when two memories about the same state have overlapping or contradictory time windows, the system will flag the conflict rather than silently picking one.

Both are backward-compatible. Memory written today will work with these improvements without any migration. Read more in the

.

Mem0 is an intelligent, open-source memory layer designed for LLMs and AI agents to provide long-term, personalized, and context-aware interactions across sessions.

Get your free API Key here :
or self-host mem0 from our open source