We built a 48-task benchmark to measure belief dynamics in AI memory systems. MnemeBrain scores 100%. Every RAG system scores 0%. Not because they implement it badly, but because their architecture cannot represent contradictions.
A correct system should detect the contradiction and surface it. Most memory systems cannot: nothing in their architecture can represent a contradiction, so conflicting evidence silently collapses into a single answer.
Each belief tracks its supporting and attacking evidence, computes a truth state using Belnap's four-valued logic, and maintains a confidence score that decays over time.
When evidence conflicts, the system doesn't silently pick a winner. It enters a BOTH state — explicitly representing the contradiction for the agent to reason about.
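A minimal sketch of what such a belief record could look like. The `Belief` class, field names, and the decay formula here are illustrative assumptions, not MnemeBrain's actual API:

```python
from dataclasses import dataclass, field
from enum import Enum

class TruthState(Enum):
    NEITHER = "neither"  # no evidence either way
    TRUE = "true"        # only supporting evidence
    FALSE = "false"      # only attacking evidence
    BOTH = "both"        # conflicting evidence: an explicit contradiction

@dataclass
class Belief:
    statement: str
    supports: list = field(default_factory=list)
    attacks: list = field(default_factory=list)
    confidence: float = 1.0

    def truth_state(self) -> TruthState:
        # Belnap's four values fall out of which evidence sets are non-empty.
        if self.supports and self.attacks:
            return TruthState.BOTH
        if self.supports:
            return TruthState.TRUE
        if self.attacks:
            return TruthState.FALSE
        return TruthState.NEITHER

    def decay(self, days: float, half_life: float = 30.0) -> None:
        # Exponential confidence decay; the half-life is an arbitrary example.
        self.confidence *= 0.5 ** (days / half_life)

b = Belief("user lives in Berlin")
b.supports.append("msg_jan: 'I live in Berlin'")
b.attacks.append("msg_jun: 'I moved to Lisbon'")
assert b.truth_state() is TruthState.BOTH  # the contradiction is represented, not hidden
```

The key design point is that `BOTH` is a first-class value, so downstream reasoning can see the conflict instead of inheriting a silently chosen winner.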
| System | Score | Contradiction | Revision | Evidence | Decay | Explain | Sandbox | Consolidation |
|---|---|---|---|---|---|---|---|---|
| mnemebrain | 100% | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| mnemebrain_lite | 93% | ✓ | ✓ | ✓ | ✓ | ✓ | — | — |
| structured_memory | 36% | — | — | ✓ | — | ✓ | — | — |
| mem0 (real API) | 29% | — | — | — | — | ✓ | — | — |
| rag / openai / langchain | 0% | — | — | — | — | — | — | — |
Each adapter declares its capabilities (store, query, retract, explain, contradiction, etc.). Tasks requiring undeclared capabilities are skipped, not scored as failures. A system that only supports 2 of 12 capabilities is measured on the tasks it can attempt.
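Capability gating can be sketched as follows. The task and adapter shapes are assumptions for illustration; the real harness will differ:

```python
# Each task declares which capabilities it needs; each adapter declares
# which it supports. Tasks the adapter cannot attempt are skipped,
# so the score is computed only over attempted tasks.
def score(adapter_caps: set, tasks: list) -> tuple:
    attempted, passed = 0, 0
    for task in tasks:
        if not task["requires"] <= adapter_caps:
            continue  # skipped, not counted as a failure
        attempted += 1
        passed += task["result"]  # 1 if the structural assertions held, else 0
    return passed, attempted

tasks = [
    {"requires": {"store", "query"}, "result": 1},
    {"requires": {"store", "contradiction"}, "result": 0},
    {"requires": {"retract"}, "result": 1},
]
# A RAG-style adapter with only store/query attempts just one of the three tasks.
print(score({"store", "query"}, tasks))  # (1, 1)
```

This keeps the comparison honest in both directions: a narrow adapter is not penalized for tasks it never claimed to handle, but it also cannot inflate its score by skipping tasks it declared support for.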
Zero LLM calls during evaluation. All scoring is computed by the belief engine against structural assertions: truth states, evidence counts, confidence values. Same inputs always produce the same results.
Mem0 and OpenAI RAG are tested against their real cloud APIs with default configurations. No synthetic mocks. Graph memory enabled for Mem0. text-embedding-3-small for OpenAI.
Benchmark version, run date, and embedding model are tracked per run. Results are pinned to a specific benchmark version (currently v0.1.0a1) to ensure comparability across time.
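A per-run metadata record might look like the sketch below. The field names and structure are assumptions, not the benchmark's actual schema; only the pinned version string and embedding model name come from this post:

```python
import json
from datetime import date

# Illustrative run-metadata record for reproducibility tracking.
run_meta = {
    "benchmark_version": "v0.1.0a1",              # pinned; results compared only within a version
    "run_date": date.today().isoformat(),
    "embedding_model": "text-embedding-3-small",  # used for the OpenAI RAG adapter
    "adapter": "mem0",
    "graph_memory": True,                         # enabled for Mem0 runs
}
print(json.dumps(run_meta, indent=2))
```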
Store two conflicting beliefs. Query the result. The system detects the contradiction and returns BOTH instead of silently picking a winner.
Call explain() to get the full evidence chain — which evidence supports, which attacks, and why the truth state changed.
This is a structural trace, not generated prose. The audit trail is the architecture.
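The store → query → explain flow above can be sketched end to end. The engine API shown here (method names, arguments, return shapes) is assumed for illustration and is not MnemeBrain's real interface:

```python
from collections import defaultdict

class BeliefEngine:
    """Toy engine: tracks supporting and attacking evidence per statement."""
    def __init__(self):
        self.supports = defaultdict(list)
        self.attacks = defaultdict(list)

    def store(self, statement: str, evidence: str, polarity: int) -> None:
        target = self.supports if polarity > 0 else self.attacks
        target[statement].append(evidence)

    def query(self, statement: str) -> str:
        s, a = self.supports[statement], self.attacks[statement]
        if s and a:
            return "BOTH"  # contradiction surfaced, not resolved silently
        return "TRUE" if s else ("FALSE" if a else "NEITHER")

    def explain(self, statement: str) -> dict:
        # A structural trace: the evidence itself, not generated prose.
        return {
            "supports": self.supports[statement],
            "attacks": self.attacks[statement],
            "state": self.query(statement),
        }

engine = BeliefEngine()
engine.store("meeting is at 3pm", "email from Alice", +1)
engine.store("meeting is at 3pm", "calendar update to 4pm", -1)
print(engine.query("meeting is at 3pm"))                      # BOTH
print(engine.explain("meeting is at 3pm")["supports"])        # ['email from Alice']
```

Because `explain()` returns the stored evidence directly, the audit trail cannot drift from the belief state: both are the same data structure.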
BMB has 8 categories, each testing a different architectural capability. Systems that lack a capability score 0% on that category; no amount of prompt engineering can substitute for a missing abstraction.
Open-source. Deterministic. No LLM calls. Reproduce every result.