Open-source — pip install mnemebrain-benchmark

Most AI memory systems cannot detect contradictions.

We built a 48-task benchmark to measure belief dynamics in AI memory systems. MnemeBrain scores 100%. Every RAG system scores 0%. Not because they implemented it badly — because their architecture cannot represent contradictions.

The problem

Agents accumulate facts.
Facts conflict.

Week 1:
User: "I'm vegetarian."

Week 6:
User: "I ate steak last night."

A correct system should detect the contradiction. Instead, most memory systems silently keep both statements, overwrite the old one, or retrieve whichever chunk happens to match the query.

They cannot represent contradiction.

The missing layer

Where belief memory fits
in the AI stack.

Current stack:
  Application
  Agent framework
  LLM
  Vector DB / RAG
  Database

With MnemeBrain:
  Application
  Agent framework
  LLM
  Belief layer  ← MnemeBrain
  Vector DB / RAG
  Database
What MnemeBrain stores

Not text chunks.
Structured beliefs.

Each belief tracks its supporting and attacking evidence, computes a truth state using Belnap's four-valued logic, and maintains a confidence score that decays over time.

When evidence conflicts, the system doesn't silently pick a winner. It enters a BOTH state — explicitly representing the contradiction for the agent to reason about.

TRUE:    supported
FALSE:   refuted
BOTH:    conflicting
NEITHER: no evidence
"user is vegetarian"
supporting_evidence:
  "user said they avoid meat"
attacking_evidence:
  "user ate steak last night"
truth_state: BOTH
confidence:  0.52
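The mapping from evidence to the four truth states can be sketched in a few lines. This is an illustrative reimplementation under Belnap's four-valued logic, not MnemeBrain's internal code; the `truth_state` helper and its signature are assumptions.

```python
from enum import Enum

class TruthState(Enum):
    TRUE = "supported"          # supporting evidence only
    FALSE = "refuted"           # attacking evidence only
    BOTH = "conflicting"        # evidence on both sides: explicit contradiction
    NEITHER = "no evidence"     # nothing recorded yet

def truth_state(supporting: int, attacking: int) -> TruthState:
    """Map evidence counts onto Belnap's four truth values."""
    if supporting and attacking:
        return TruthState.BOTH  # the contradiction is represented, not hidden
    if supporting:
        return TruthState.TRUE
    if attacking:
        return TruthState.FALSE
    return TruthState.NEITHER
```

With one supporting and one attacking piece of evidence, as in the vegetarian/steak example, the belief lands in BOTH rather than collapsing to a single winner.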
Benchmark results

The gap is not small.

Belief Maintenance Benchmark (BMB) — 48 tasks · 8 categories · ~100 checks · zero LLM calls · fully deterministic
mnemebrain (full)    100%
mnemebrain_lite       93%
structured_memory     36%
mem0 (real API)       29%
naive_baseline         0%
rag_baseline           0%
openai_rag (real)      0%
langchain_buffer       0%
Every retrieval-based system scores 0% on contradiction detection. Not because they implemented it badly. Because their architecture cannot represent contradictions.
Why the scores differ
System                      Score
mnemebrain                  100%
mnemebrain_lite              93%
structured_memory            36%
mem0 (real API)              29%
rag / openai / langchain      0%
Benchmark v0.1.0a1 · Run: 2026-03-08 · Embeddings: all-MiniLM-L6-v2 (sentence-transformers).
Mem0 tested with real cloud API, graph memory enabled. OpenAI RAG with real text-embedding-3-small.
Methodology

How the benchmark works.

Capability-aware scoring

Each adapter declares its capabilities (store, query, retract, explain, contradiction, etc.). Tasks requiring undeclared capabilities are skipped, not scored as failures. A system that only supports 2 of 12 capabilities is measured on the tasks it can attempt.
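Capability-aware scoring reduces to filtering before averaging. A minimal sketch, assuming a hypothetical task record of the form `{"requires": set_of_capabilities, "passed": bool}` (the real task schema is not shown here):

```python
def capability_aware_score(capabilities: set, tasks: list) -> float:
    """Average pass rate over only the tasks the adapter can attempt.

    Tasks whose required capabilities are not all declared are skipped
    rather than counted as failures.
    """
    attempted = [t for t in tasks if t["requires"] <= capabilities]
    if not attempted:
        return 0.0
    return sum(t["passed"] for t in attempted) / len(attempted)
```

A system declaring only `store` and `query` is never penalized for skipping a `contradiction` task, but it also never earns credit for one.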

Deterministic evaluation

Zero LLM calls during evaluation. All scoring is computed by the belief engine against structural assertions: truth states, evidence counts, confidence values. Same inputs always produce the same results.
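A structural assertion is just a field-by-field comparison on the belief record, which is what makes the scoring deterministic. The `structural_check` helper below is a hypothetical illustration, not the benchmark's actual checker:

```python
def structural_check(belief: dict, expected: dict) -> bool:
    """Deterministic structural assertion: compare recorded fields directly.

    No LLM judge is involved, so the same belief record always yields
    the same verdict.
    """
    return all(belief.get(field) == value for field, value in expected.items())
```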

Real API baselines

Mem0 and OpenAI RAG are tested against their real cloud APIs with default configurations. No synthetic mocks. Graph memory enabled for Mem0. text-embedding-3-small for OpenAI.

Versioned releases

Benchmark version, run date, and embedding model are tracked per run. Results are pinned to a specific benchmark version (currently v0.1.0a1) to ensure comparability across time.
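The pinning rule can be made concrete as a small record plus a comparability predicate. Class and function names here are hypothetical; only the tracked fields (version, run date, embedding model) come from the benchmark's own metadata:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkRun:
    benchmark_version: str
    run_date: str
    embedding_model: str

def comparable(a: BenchmarkRun, b: BenchmarkRun) -> bool:
    """Results are only compared across runs pinned to the same benchmark version."""
    return a.benchmark_version == b.benchmark_version
```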

Try it yourself

Reproduce in 60 seconds.

# Install
pip install mnemebrain-benchmark

# Run the full benchmark (all adapters)
mnemebrain-bmb

# Single adapter
mnemebrain-bmb --adapter mnemebrain
mnemebrain-bmb --adapter mem0        # requires MEM0_API_KEY

# Single category
mnemebrain-bmb --category contradiction

# Task-level evaluations (downstream correctness)
mnemebrain-task-eval

No LLM calls. Deterministic results. All logic computed by the belief engine.

Example

Contradiction detection
in four lines.

Store two conflicting beliefs. Query the result. The system detects the contradiction and returns BOTH instead of silently picking a winner.

Call explain() to get the full evidence chain — which evidence supports, which attacks, and why the truth state changed.

This is a structural trace, not generated prose. The audit trail is the architecture.

from mnemebrain import Brain

brain = Brain()

brain.store("user is vegetarian")
brain.store("user ate steak last night")

belief = brain.query("user diet")
belief.truth_state  # BOTH
belief.confidence   # 0.52

trace = brain.explain(belief)
trace.supporting   # 1 item
trace.attacking    # 1 item
Architecture

Evidence in,
beliefs out.

Evidence → Belief Node → TruthState (Belnap) → Confidence + Decay → Agent API
  • Explicit evidence provenance
  • Contradiction tracking (Belnap four-valued logic)
  • AGM-style belief revision
  • Temporal decay with type-specific half-lives
  • Copy-on-write sandbox for counterfactual reasoning
  • Episodic-to-semantic consolidation
  • HippoRAG multi-hop graph retrieval
  • ANN-first pattern separation
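Temporal decay with type-specific half-lives is plain exponential decay. A minimal sketch, assuming hypothetical half-life values (the text says half-lives are type-specific but does not publish the numbers):

```python
# Hypothetical half-lives in days, one per belief type.
HALF_LIFE_DAYS = {"episodic": 14.0, "preference": 90.0, "identity": 365.0}

def decayed_confidence(initial: float, belief_type: str, age_days: float) -> float:
    """Confidence halves once per type-specific half-life (exponential decay)."""
    return initial * 0.5 ** (age_days / HALF_LIFE_DAYS[belief_type])
```

Under these assumed values, an episodic belief loses half its confidence in two weeks while an identity belief barely moves in the same span.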

8 categories in BMB. Each tests a different architectural capability. Systems that lack the capability score 0% on that category — no amount of prompt engineering can fix a missing abstraction.

Open benchmark

Any memory system
can run BMB.

class MemorySystem:

    def store(self, claim, evidence):
        """Persist a claim together with a piece of evidence."""
        ...

    def query(self, claim):
        """Return the current belief for a claim."""
        ...

    def explain(self, belief_id):
        """Return the evidence chain behind a belief."""
        ...
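To make the interface concrete, here is a toy adapter built on a plain dict. The class name and internals are illustrative only; note that by declaring just three capabilities, it would be skipped (not failed) on tasks needing contradiction detection or revision:

```python
class AppendOnlyAdapter:
    """Toy adapter: a dict of claim -> list of evidence strings."""

    capabilities = {"store", "query", "explain"}

    def __init__(self):
        self._memory = {}

    def store(self, claim, evidence):
        # Append-only: conflicting evidence piles up unnoticed.
        self._memory.setdefault(claim, []).append(evidence)

    def query(self, claim):
        return {"claim": claim, "evidence": list(self._memory.get(claim, []))}

    def explain(self, claim):
        return list(self._memory.get(claim, []))
```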
Adapters welcome for:
MemGPT · Zep · Graphiti · LlamaIndex · CrewAI · Weaviate · ChromaDB · your system

Run the benchmark.
See for yourself.

Open-source. Deterministic. No LLM calls. Reproduce every result.