We built a 48-task benchmark to measure belief dynamics in AI memory systems. MnemeBrain scores 100%. Every RAG system scores 0%. Not because they implement it badly, but because their architecture cannot represent contradictions.
A correct system should detect the contradiction and surface it. Most memory systems cannot: nothing in their architecture can represent a contradiction, so conflicting evidence silently collapses into a single answer.
Each belief tracks its supporting and attacking evidence, computes a truth state using Belnap's four-valued logic, and maintains a confidence score that decays over time.
When evidence conflicts, the system doesn't silently pick a winner. It enters a BOTH state — explicitly representing the contradiction for the agent to reason about.
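A minimal sketch of what such a belief record could look like. The `Belief` class, field names, and the decay formula here are illustrative assumptions, not MnemeBrain's actual API:

```python
from dataclasses import dataclass, field
from enum import Enum

class TruthState(Enum):
    NEITHER = "neither"  # no evidence either way
    TRUE = "true"        # only supporting evidence
    FALSE = "false"      # only attacking evidence
    BOTH = "both"        # conflicting evidence: an explicit contradiction

@dataclass
class Belief:
    statement: str
    supports: list = field(default_factory=list)
    attacks: list = field(default_factory=list)
    confidence: float = 1.0

    def truth_state(self) -> TruthState:
        # Belnap's four values fall out of which evidence sets are non-empty.
        if self.supports and self.attacks:
            return TruthState.BOTH
        if self.supports:
            return TruthState.TRUE
        if self.attacks:
            return TruthState.FALSE
        return TruthState.NEITHER

    def decay(self, days: float, half_life: float = 30.0) -> None:
        # Exponential confidence decay; the half-life is an arbitrary example.
        self.confidence *= 0.5 ** (days / half_life)

b = Belief("user lives in Berlin")
b.supports.append("msg_jan: 'I live in Berlin'")
b.attacks.append("msg_jun: 'I moved to Lisbon'")
assert b.truth_state() is TruthState.BOTH  # the contradiction is represented, not hidden
```

The key design point is that `BOTH` is a first-class value, so downstream reasoning can see the conflict instead of inheriting a silently chosen winner.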
| System | Score | Contradiction | Revision | Evidence | Decay | Explain | Sandbox | Consolidation |
|---|---|---|---|---|---|---|---|---|
| mnemebrain | 100% | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| mnemebrain_lite | 93% | ✓ | ✓ | ✓ | ✓ | ✓ | — | — |
| structured_memory | 36% | — | — | ✓ | — | ✓ | — | — |
| mem0 (real API) | 29% | — | — | — | — | ✓ | — | — |
| rag / openai / langchain | 0% | — | — | — | — | — | — | — |
Each adapter declares its capabilities (store, query, retract, explain, contradiction, etc.). Tasks requiring undeclared capabilities are skipped, not scored as failures. A system that only supports 2 of 12 capabilities is measured on the tasks it can attempt.
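Capability gating can be sketched as follows. The task and adapter shapes are assumptions for illustration; the real harness will differ:

```python
# Each task declares which capabilities it needs; each adapter declares
# which it supports. Tasks the adapter cannot attempt are skipped,
# so the score is computed only over attempted tasks.
def score(adapter_caps: set, tasks: list) -> tuple:
    attempted, passed = 0, 0
    for task in tasks:
        if not task["requires"] <= adapter_caps:
            continue  # skipped, not counted as a failure
        attempted += 1
        passed += task["result"]  # 1 if the structural assertions held, else 0
    return passed, attempted

tasks = [
    {"requires": {"store", "query"}, "result": 1},
    {"requires": {"store", "contradiction"}, "result": 0},
    {"requires": {"retract"}, "result": 1},
]
# A RAG-style adapter with only store/query attempts just one of the three tasks.
print(score({"store", "query"}, tasks))  # (1, 1)
```

This keeps the comparison honest in both directions: a narrow adapter is not penalized for tasks it never claimed to handle, but it also cannot inflate its score by skipping tasks it declared support for.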
Zero LLM calls during evaluation. All scoring is computed by the belief engine against structural assertions: truth states, evidence counts, confidence values. Same inputs always produce the same results.
Mem0 and OpenAI RAG are tested against their real cloud APIs with default configurations. No synthetic mocks. Graph memory enabled for Mem0. text-embedding-3-small for OpenAI.
Benchmark version, run date, and embedding model are tracked per run. Results are pinned to a specific benchmark version (currently v0.1.0a1) to ensure comparability across time.
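A per-run metadata record might look like the sketch below. The field names and structure are assumptions, not the benchmark's actual schema; only the pinned version string and embedding model name come from this post:

```python
import json
from datetime import date

# Illustrative run-metadata record for reproducibility tracking.
run_meta = {
    "benchmark_version": "v0.1.0a1",              # pinned; results compared only within a version
    "run_date": date.today().isoformat(),
    "embedding_model": "text-embedding-3-small",  # used for the OpenAI RAG adapter
    "adapter": "mem0",
    "graph_memory": True,                         # enabled for Mem0 runs
}
print(json.dumps(run_meta, indent=2))
```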
Store two conflicting beliefs. Query the result. The system detects the contradiction and returns BOTH instead of silently picking a winner.
Call explain() to get the full evidence chain — which evidence supports, which attacks, and why the truth state changed.
This is a structural trace, not generated prose. The audit trail is the architecture.
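The store → query → explain flow above can be sketched end to end. The engine API shown here (method names, arguments, return shapes) is assumed for illustration and is not MnemeBrain's real interface:

```python
from collections import defaultdict

class BeliefEngine:
    """Toy engine: tracks supporting and attacking evidence per statement."""
    def __init__(self):
        self.supports = defaultdict(list)
        self.attacks = defaultdict(list)

    def store(self, statement: str, evidence: str, polarity: int) -> None:
        target = self.supports if polarity > 0 else self.attacks
        target[statement].append(evidence)

    def query(self, statement: str) -> str:
        s, a = self.supports[statement], self.attacks[statement]
        if s and a:
            return "BOTH"  # contradiction surfaced, not resolved silently
        return "TRUE" if s else ("FALSE" if a else "NEITHER")

    def explain(self, statement: str) -> dict:
        # A structural trace: the evidence itself, not generated prose.
        return {
            "supports": self.supports[statement],
            "attacks": self.attacks[statement],
            "state": self.query(statement),
        }

engine = BeliefEngine()
engine.store("meeting is at 3pm", "email from Alice", +1)
engine.store("meeting is at 3pm", "calendar update to 4pm", -1)
print(engine.query("meeting is at 3pm"))                      # BOTH
print(engine.explain("meeting is at 3pm")["supports"])        # ['email from Alice']
```

Because `explain()` returns the stored evidence directly, the audit trail cannot drift from the belief state: both are the same data structure.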
BMB has 8 categories, each testing a different architectural capability. Systems that lack a capability score 0% on that category; no amount of prompt engineering can substitute for a missing abstraction.
Open-source. Deterministic. No LLM calls. Reproduce every result.