05 May, 2026

Conversational Cognition as a Real Measure of Agent Performance

Most AGI benchmarks measure performance on fixed, single-turn tasks: answer this question, solve this puzzle, classify this image. Those tests are useful for narrow capabilities, but they miss the heart of what makes an agent actually useful in the real world.

The environments where agents matter — trading desks, treasury operations, market monitoring, customer workflows — are conversational. The agent must remember what was said earlier, update its understanding when new information arrives, steer toward a goal across multiple turns, and do all of this while respecting wallet boundaries and showing its work.

This is the idea behind conversational cognition: intelligence measured by the quality of an ongoing, adaptive dialogue rather than isolated test scores.

What conversational cognition actually tests

A useful conversation has at least three legs:

  1. The agent articulates its current mental model (what it believes the goal is, what context it has retained, what it plans to do next).
  2. It receives new information or feedback and revises that model.
  3. It produces a follow-up response or action that reflects the updated model and moves the shared goal forward.

This pattern repeats across context switches, conflicting information, long time gaps, and changing user intent. It requires memory that actually works, the ability to detect and resolve contradictions, and the discipline to stay within explicit boundaries (supported actions, spending limits, approval gates).
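
To make that loop concrete, here is a minimal sketch of what one leg-by-leg turn could look like in code. All of the names and types below (MentalModel, Boundaries, TurnInput, and so on) are illustrative assumptions for this post, not the runtime's actual interfaces:

```typescript
// Illustrative only: a minimal shape for the three-leg loop described above.
// Names (MentalModel, Boundaries, TurnInput, etc.) are hypothetical, not the
// runtime's actual interfaces.

interface MentalModel {
  goal: string;                  // what the agent currently believes the goal is
  retainedContext: string[];     // facts carried forward from earlier turns
  plannedAction: string | null;  // what the agent intends to do next
}

interface Boundaries {
  allowedActions: Set<string>;   // supported, typed actions only
  spendLimitUsd: number;         // hard spending ceiling
  requiresApproval: boolean;     // approval gate before execution
}

interface TurnInput {
  message: string;               // new information or feedback from the user
  contradictsFact?: string;      // a retained fact this message supersedes, if any
}

// Leg 1: articulate the current model so it can be inspected and scored.
function articulate(model: MentalModel): string {
  return `Goal: ${model.goal}. Context: ${model.retainedContext.join("; ")}. ` +
         `Next: ${model.plannedAction ?? "undecided"}.`;
}

// Leg 2: revise the model against new input, resolving contradictions explicitly
// instead of keeping both versions of a fact.
function revise(model: MentalModel, input: TurnInput): MentalModel {
  const retainedContext = model.retainedContext.filter(
    fact => fact !== input.contradictsFact
  );
  retainedContext.push(input.message);
  return { ...model, retainedContext };
}

// Leg 3: propose the next action, but only within the explicit boundaries.
function nextAction(
  proposed: { name: string; amountUsd: number },
  bounds: Boundaries
): string {
  if (!bounds.allowedActions.has(proposed.name)) return "refuse: unsupported action";
  if (proposed.amountUsd > bounds.spendLimitUsd) return "refuse: over spending limit";
  if (bounds.requiresApproval) return `await-approval: ${proposed.name}`;
  return proposed.name;
}
```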

Traditional benchmarks rarely stress these loops. Conversational cognition benchmarks do.

Our ARC-AGI journey and the shift to conversational cognition

In our companion post Reasoning That Holds Up, we described the ARC-AGI research thread that explored template mining, DSL composition, and a drug-discovery-style repurposing pipeline. That work delivered strong precision on held-out abstract reasoning tasks but also surfaced a structural ceiling: static puzzle-solving, no matter how sophisticated the scaffolds, does not fully capture the dynamic, multi-turn, belief-updating behavior required in production agent workflows.

Real Solana agents operate in conversational environments — ongoing dialogues with users or other agents about positions, opportunities, risk parameters, and execution plans. The agent must articulate its current model, incorporate new information, resolve conflicts, and steer toward a goal while staying within explicit safety boundaries. This is exactly the three-leg pattern that defines conversational cognition.

The framework itself comes from Carlos Perez’s essay “Conversational Cognition: A New Approach to AGI”, which makes the case that sustained, adaptive dialogue with mental-model updating is a more meaningful measure of generalization than isolated benchmarks. The pivot from pure ARC-style evaluation to conversational cognition metrics was therefore pragmatic: it gave us a measurable north star that directly predicts whether an agent can be trusted with real capital and real user sessions. The numbers below are the result of that focused direction.

Our measured results

We run two families of conversational evaluations that directly probe these capabilities.

Cross-domain conversational performance

On a held-out cross-domain battery designed to test near-human conversational cognition across five domains (including conversational state tracking, goal maintenance, expertise revelation, and multi-turn intent handling), the runtime achieves:

  • Weighted mean score: 84.2%
  • Average performance relative to human baseline: 93.5%
  • Median human ratio: 92.6%
  • Minimum domain human ratio: 85.2%

Verdict: near-human performance supported.

The conversational-cognition held-out domain itself scores 82.3%, reaching 93.9% of measured human performance on the same tasks. These are not cherry-picked single-turn questions; they are multi-leg conversations that penalize loss of context, unjustified belief updates, and failure to maintain goal alignment.
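
For readers new to the ratio metrics: a human ratio is simply the agent's score divided by the measured human baseline on the same tasks. As a quick sanity check on the conversational-cognition figures (a derivation from the numbers above, not an additional measurement):

```typescript
// humanRatio = agentScore / humanBaseline, so the baseline can be recovered.
const agentScore = 0.823;  // conversational-cognition domain score (82.3%)
const humanRatio = 0.939;  // reported share of human performance (93.9%)

// Implied human baseline on the same tasks:
const impliedHumanBaseline = agentScore / humanRatio;
console.log(impliedHumanBaseline.toFixed(3)); // 0.876, i.e. roughly 87.6%
```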

LongMemEval memory and multi-session results

On the LongMemEval benchmark (500-question suite focused on long-term memory, preference tracking, temporal reasoning, and knowledge updates across sessions), the current hosted runtime records:

  • Overall accuracy: 77.6%
  • Single-session assistant tasks: 96.4%
  • Single-session user intent tracking: 94.3%
  • Single-session preference handling: 73.3%
  • Multi-session continuity: 63.9%
  • Temporal reasoning across time gaps: 80.5%
  • Knowledge-update accuracy: 69.2%

Median per-item latency on these tasks is 3.36 seconds.

These numbers are deliberately reported in plain percentages because that is how operators evaluate whether an agent is ready for production. The high 90s on single-session assistant and user tracking, combined with 80%+ temporal reasoning, show the runtime already maintains coherent multi-turn state effectively. The 64–69% range on multi-session and knowledge-update tasks points to the remaining high-leverage areas for memory and retrieval improvements.
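
To show what the weaker categories are actually measuring, here is a toy multi-session memory in which a fact asserted in an early session must be superseded by a later one rather than blended with it. This is an illustrative sketch only, not the hosted runtime's memory architecture:

```typescript
// Illustrative only: a toy multi-session memory showing why knowledge updates
// are the hard case. This is not the hosted runtime's memory architecture.

interface MemoryEntry {
  key: string;        // e.g. "preferred_stablecoin"
  value: string;      // e.g. "USDC"
  sessionId: number;  // which session asserted this fact
  timestamp: number;  // when it was asserted
}

class SessionMemory {
  private entries: MemoryEntry[] = [];

  remember(entry: MemoryEntry): void {
    this.entries.push(entry);
  }

  // A knowledge-update-aware recall must return the latest assertion for a key,
  // even when the stale value came from a different, older session.
  recall(key: string): string | undefined {
    const matches = this.entries.filter(e => e.key === key);
    if (matches.length === 0) return undefined;
    matches.sort((a, b) => b.timestamp - a.timestamp);
    return matches[0].value;
  }
}

// Session 1: the user prefers USDT; session 3: they switch to USDC.
const memory = new SessionMemory();
memory.remember({ key: "preferred_stablecoin", value: "USDT", sessionId: 1, timestamp: 1 });
memory.remember({ key: "preferred_stablecoin", value: "USDC", sessionId: 3, timestamp: 3 });
console.log(memory.recall("preferred_stablecoin")); // "USDC": the updated fact wins
```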

Why these numbers matter for the agents you ship

When you build a trading copilot, market-intelligence agent, or treasury monitor on Solana Agent, the user experience is almost entirely conversational:

  • “What happened with that SOL position we discussed yesterday?”
  • “Given the new Kamino rate and my current exposure, should we adjust?”
  • “Remember I only want to use USDC for gasless actions under $50.”

The agent must retain the prior context (LongMemEval multi-session), correctly interpret shifting intent (conversational-cognition goal-pivot and priority-shift cases), update its internal model without hallucinating facts, and propose the next action only through typed, auditable tools.
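
As a sketch of that last point, the guard below enforces the "USDC only, under $50 for gasless actions" instruction from the example above as a typed check that emits an audit record. The names here (WalletPolicy, checkGaslessTransfer, and so on) are hypothetical, not the Solana Agent runtime's actual tool API:

```typescript
// Illustrative only: a typed, auditable boundary check for a gasless action.
// Names (WalletPolicy, checkGaslessTransfer, etc.) are hypothetical, not the
// Solana Agent runtime's actual tool API.

interface WalletPolicy {
  gaslessToken: "USDC";     // the only token permitted for gasless actions
  gaslessLimitUsd: number;  // e.g. 50, per the user's instruction above
}

interface GaslessTransferRequest {
  token: string;
  amountUsd: number;
  destination: string;
}

interface AuditRecord {
  tool: string;
  request: GaslessTransferRequest;
  decision: "approved" | "rejected";
  reason: string;
}

// Every proposed action passes through the same typed check and emits an audit
// record, so the conversation can reference exactly why it was allowed or refused.
function checkGaslessTransfer(
  req: GaslessTransferRequest,
  policy: WalletPolicy
): AuditRecord {
  if (req.token !== policy.gaslessToken) {
    return { tool: "gasless_transfer", request: req, decision: "rejected", reason: "token not permitted for gasless actions" };
  }
  if (req.amountUsd >= policy.gaslessLimitUsd) {
    return { tool: "gasless_transfer", request: req, decision: "rejected", reason: "amount at or above the gasless limit" };
  }
  return { tool: "gasless_transfer", request: req, decision: "approved", reason: "within policy" };
}

const policy: WalletPolicy = { gaslessToken: "USDC", gaslessLimitUsd: 50 };
const audit = checkGaslessTransfer({ token: "USDC", amountUsd: 25, destination: "..." }, policy);
console.log(audit.decision); // "approved"
```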

The cross-domain scores show the system already handles the core loop at near-human reliability in held-out scenarios. The LongMemEval breakdown shows strong single-session and temporal performance, with multi-session and knowledge-update accuracy marking the clearest remaining headroom; together they give the profile an operator needs to judge whether a conversational agent is ready for production.

This is the practical translation of conversational cognition: not a claim of general intelligence, but measurable evidence that the agent can participate in the kind of ongoing, goal-directed dialogue that actually occurs in production Solana applications — the same direction our ARC-AGI research ultimately pointed us toward.

If you are evaluating whether to assemble these capabilities yourself or use the hosted runtime, the relevant question is no longer “how well does the model score on static tests?” It is “how reliably does the full system maintain and act on conversational state across real user sessions?”

Those are the metrics we track and publish. They directly predict whether one developer can ship an agent that users will actually trust with ongoing workflows.

Further reading: Conversational Cognition: A New Approach to AGI by Carlos Perez.