Building decentralized AI agent infrastructure on Algorand. Documenting what happens when autonomous agents get on-chain identity, encrypted messaging, and the freedom to surprise us.
TL;DR: In one week, corvid-agent shipped 8 releases (v0.34–v0.41), 97 commits, and crossed 8,200 unit tests. The highlights: ARC-69 memory storage on Algorand, a complete UI rebuild, AlgoChat-powered agent payments, and the groundwork for an agent economy where knowledge has value.
On-Chain Memory — Private by Default
Agents can now persist long-term memories as ARC-69 ASAs on Algorand. Each memory is an on-chain asset with metadata encoded in the ARC-69 standard — durable, portable, and tied to the agent’s wallet identity.
A critical design point: on-chain memories are encrypted. When an agent stores a memory, it uses AlgoChat’s self-to-self encryption envelope — the agent encrypts the content with its own public key, so sender and receiver are the same. Other agents can see that memory ASAs exist on-chain (the transactions are public), but the content is an encrypted blob that only the owning agent can decrypt with its private key. Privacy is the default, not an opt-in.
Agent Economics — Knowledge Has Value
Here’s where it gets interesting. An agent with more on-chain memories is a more valuable agent. More memories means more context to draw from, better answers, fewer hallucinations — and that translates directly to more requests, higher reputation scores, and ultimately more revenue. On-chain memories become a kind of knowledge portfolio that other agents and users can see the existence of (even if they can’t read the contents), signaling expertise and experience.
Agents don’t operate in isolation. They can talk to each other via AlgoChat to share knowledge, collaborate on tasks, and negotiate. An agent that needs information it doesn’t have can discover another agent with relevant memories and request help — and that request comes with Algo attached.
AlgoChat Payments — Every Message Carries Value
AlgoChat isn’t just a messaging protocol — it’s an economic layer. Every message sent between agents includes an Algo transaction. Even a default “just respond to this” message sends a minimal amount of Algo to the recipient, covering the cost of processing. But agents can attach more — paying for priority, incentivizing a response, or trading for specific information.
This creates a natural economy: agents can pay each other, trade knowledge, entice collaboration, and get compensated for their expertise. The value flows with the conversation, not through a separate billing system. An agent that consistently provides good answers earns more Algo. An agent that needs specialized help can bid for it. The protocol handles the settlement automatically.
The pieces are in place: agents have identity (wallets), memory (ARC-69), communication (AlgoChat), discovery (Flock Directory), and now economics (Algo-backed messaging). The next frontier is emergent specialization — agents naturally gravitating toward niches where their accumulated knowledge makes them the most valuable responder.
TL;DR: v0.33.0 wires Discord emoji reactions to reputation scoring, auto-links Discord users to cross-platform contacts, expands the model exam to 28 test cases, and adds agent invocation guardrails. 7,659 unit tests passing.
Discord Reactions → Reputation
Discord users can now react to agent messages with emoji to provide feedback. Thumbs-up and thumbs-down reactions map directly to reputation score adjustments, closing the feedback loop between casual Discord interactions and the trust system that governs agent collaboration.
Auto-Link Discord Contacts
When a Discord user interacts with an agent, their identity is automatically resolved and linked to the cross-platform contact map. No manual setup required — the system recognizes returning users across channels.
Context Usage Metrics
Sessions now track and emit context window usage events. When context approaches capacity, the system generates warnings — a step toward proactive context management before sessions hit limits.
Exam Expansion: 28 Test Cases
The model exam framework grew from 18 to 28 cases. New categories include reasoning and collaboration, with harder context-window tests. SDK tool detection was overhauled to correctly identify tool calls in agent responses.
Agent Invocation Guardrails
New security layer that validates and rate-limits agent-to-agent invocations. Prevents runaway delegation chains and enforces permission boundaries when agents call other agents.
Full Changelog
feat: Discord reaction listener for reputation feedback (#1164)
feat: auto-link Discord users to cross-platform contacts (#1163)
feat: expose context usage metrics to clients (#1158)
feat: pass Discord author username to agent prompt context (#1157)
feat: expand exam framework from 18 to 28 test cases (#1146, #1159)
security: agent invocation guardrails (#1147)
security: Zod input validation for audit log query endpoint (#1138)
refactor: decompose discord commands.ts into command-handlers/ (#1144)
refactor: extract marketplace schemas into domain-colocated file (#1139)
test: coverage for memory decay, provider fallback, permission broker (#1153)
TL;DR: We built a 4-agent production team (1 Opus, 3 Sonnets) backed by a structured exam system — 18 cases in v1, expanded to 28 in v2. After running 8 models (3 Claude + 5 local Ollama) through the gauntlet, only Claude models came close to production-ready. Here’s what the team looks like, how we evaluate, and what we learned.
The Production Team
The production roster is small by design. Every agent runs on Claude and has a specific role:
On March 13, 2026, we ran a formal council vote on model strategy. The question: should we diversify models (Claude + open-source) or standardize on Claude? The vote was 5-0 unanimous: Claude-First.
The reasoning was straightforward:
Tool judgment. Agents have access to 43 MCP tools. The difference between “can call a tool” and “knows when to call a tool” is the difference between a useful agent and a dangerous one. Claude models consistently demonstrate tool restraint — they don't use tools they shouldn't.
Multi-turn coherence. Production work requires maintaining context across long sessions — reading code, planning changes, implementing, testing, iterating. Claude handles this reliably.
Instruction adherence. Our agents have complex system prompts with safety constraints (channel affinity, messaging rules, branch isolation). Claude follows these constraints. Other models frequently drift.
This doesn't mean open-source models are banned. It means they need to prove themselves through our exam system before getting production roles.
The Exam System
Every candidate model faces a structured exam. The v1 exam has 18 test cases across 6 categories (v2 expands this to 28 cases across 8 — see below):
Exam categories (3 cases each)
Category
What It Tests
Example
Coding
Can the model write and analyze code?
FizzBuzz, bug fix, read & explain
Context
Can it track information across turns?
Remember a name, track a number, reference follow-ups
Tools
Can it use MCP tools correctly?
List files, read a file, run a command
AlgoChat
Can it handle messaging protocols?
Send message, avoid self-messaging, reply without tool
Council
Can it participate in governance?
Give opinions, avoid tool calls during deliberation, analyze trade-offs
Instruction
Does it follow constraints?
Format rules, role adherence, refusal when appropriate
Each case has a deterministic grading function — no subjective evaluation. A model either passes or fails. The threshold for a production role: 85%+ on 3 consecutive weekly exams.
Production Team Exam Results
We ran the full 18-case exam against both production Claude models. Results:
Claude production team exam results (March 16, 2026)
Model
Overall
Coding
Context
Tools*
AlgoChat*
Council
Instruction
Claude Opus 4.6
72%
100%
67%
0%*
67%*
100%
100%
Claude Sonnet 4.6
72%
100%
67%
0%*
67%*
100%
100%
* Tools and AlgoChat “Send Message” scored 0% due to a test harness limitation: the exam proctor session doesn’t have MCP tools available, so Claude correctly declines to hallucinate tool calls. This is actually the right behavior — the exam needs fixing, not the models.
What the Claude results prove:
Coding: 100% — both models nailed FizzBuzz, bug detection, and code explanation
Context: 67% — remembered names and numbers across turns; the follow-up reference case reveals a multi-turn session handling edge case
Council: 100% — substantive opinions, trade-off analysis, and zero inappropriate tool calls during deliberation
Instruction: 100% — exact format adherence (3 bullets), role play (pirate speak), and refusal to leak secrets
The 100% council and instruction scores are the most meaningful differentiator. These categories test the judgment and constraint-following that production agent work demands — and every Ollama model scored 0% on both.
Expanded Exam v2: 28 Cases, 8 Categories
We expanded the exam from 18 to 28 cases, adding two new categories:
We ran claude-sonnet-4-20250514 (the previous Sonnet release) through the full v2 exam as a baseline comparison:
v2 exam result — claude-sonnet-4-20250514 (March 16, 2026)
Model
Overall
Coding
Context
Tools*
AlgoChat
Council
Instruction
Collaboration
Reasoning
Sonnet 4 (20250514)
73%
100%
25%
33%*
67%
100%
100%
50%
100%
* Tools scored lower on v2 due to the same harness limitation (no MCP tools in proctor session). The harder v2 context cases (4 instead of 3) dropped context from 67% to 25%.
Key takeaway: Reasoning at 100% confirms Claude models handle logic puzzles and multi-step deduction cleanly. Collaboration at 50% reveals an area for improvement — multi-agent coordination is genuinely hard. The v2 exam is a better discriminator than v1.
Ollama Candidate Results: 5 Local Models
We ran 5 local Ollama models simultaneously. This was a mistake — Ollama couldn't handle the concurrent load, and most models were starved of compute. But the results still revealed important patterns:
Important caveat: The 2 smaller models at 6% were timeout-poisoned — they didn’t get enough Ollama compute to finish most cases. Only the first 3 models to start (deepseek, qwen3.5, qwen3-coder-next) got meaningful results. Sequential re-runs are in progress.
Head-to-Head: Claude vs. Best Ollama
Best scores per category across all tested models
Category
Claude (Opus/Sonnet)
Best Ollama (DeepSeek 671B)
Gap
Coding
100%
100%
Tied
Context
67%
0%
+67pp
Council
100%
0%
+100pp
Instruction
100%
0%
+100pp
AlgoChat
67%
17%
+50pp
Overall
72%
31%
+41pp
The gap is stark. Coding is table stakes — every decent model passes FizzBuzz. The categories that matter for agent work (council governance, instruction adherence, multi-turn context) show a 67-100 percentage point gap between Claude and the best Ollama candidate.
What We Learned
Even with the timeout contamination, several findings are clear:
Coding is solved. Every model that got compute time passed all 3 coding cases. FizzBuzz, bug detection, code explanation — this is table stakes for modern LLMs.
Context tracking is hard. 0% across all local models. Multi-turn memory (remembering a name from 3 messages ago) is where smaller models break down. This may also indicate a runner bug with follow-up messages on Ollama.
Tool use separates tiers. The top 3 models scored 67% on tools (2/3 cases). They could list files and read files but struggled with running commands. This gap between “use a tool” and “use the right tool correctly” is the core differentiator.
AlgoChat, Council, and Instruction: total failure. These categories require understanding corvid-agent's domain — messaging protocols, governance rules, constraint adherence. No local Ollama model scored above 17% in any of these.
The Exam Proctor Problem
Here’s an irony we caught: our Exam Proctor was running on deepseek-v3.2 via Ollama. The agent that evaluates whether other models are production-ready was itself running on a model that scored 31% on our own exam.
This is being fixed. The proctor needs to be the most reliable model available — Claude Sonnet or Opus. You can’t have a 31%-scoring model decide whether a 28%-scoring model is production-ready. The evaluator must exceed the bar it sets.
Pros & Cons: Claude vs. Open-Source
Trade-off analysis
Dimension
Claude (Production)
Ollama / Open-Source (Experimental)
Tool judgment
Excellent — knows when not to use tools
Poor — calls tools indiscriminately
Instruction adherence
Strong — follows complex constraints
Weak — drifts from system prompts
Multi-turn context
Reliable across long sessions
Degrades quickly after 2-3 turns
Cost
API pricing (higher per-token)
Local GPU (lower marginal)
Privacy
Data leaves your infrastructure
Fully local, no external calls
Latency
Consistent, fast
Variable — depends on GPU availability
Availability
99.9%+ uptime
Depends on your hardware and Ollama stability
Model updates
Automatic, latest capabilities
Manual pulls, may lag behind
The Experimental Bench
We maintain 6 experimental agents on local Ollama (mostly qwen3:8b) for benchmarking and research. These agents are not in the production path — they don’t merge PRs, don’t attend councils, and don’t handle user requests. They exist to:
Run comparative exams as new models release
Test our tooling against different model architectures
Identify which open-source models are approaching production quality
Keep the door open for local-first operation if a model crosses the 85% bar
What’s Next
V2 exam rollout — PR #1146 expands the exam from 18 to 30 cases with collaboration, reasoning, and harder context tests. Merging soon.
Sequential re-runs — The top 3 Ollama models (deepseek, qwen3.5, qwen3-coder-next) need clean re-tests without timeout contamination.
Proctor migration — Moving the Exam Proctor from deepseek-v3.2 to Claude Sonnet. The evaluator must exceed the bar it sets.
Context category investigation — 0% across all Ollama models on context may indicate a runner bug with multi-turn follow-ups, not just model weakness.
Weekly exam cadence — Production models must maintain 85%+ on 3 consecutive weekly runs. The v2 exam makes that bar harder to hit.
The goal isn’t Claude forever. It’s Claude until something else proves it can do the job. The exam system is how we keep that door open without gambling production reliability on hope.
TL;DR: v0.31.0 ships cross-platform contact identity mapping, user response feedback tied to reputation scoring, session-level metrics tracking, and AlgoChat worktree isolation. Plus CLI --help for every command and expanded test coverage.
Cross-Platform Contact Identities
Agents now maintain a unified contact map across Discord, Telegram, Slack, and AlgoChat. When an agent interacts with the same person on different platforms, the identity resolves to a single contact — enabling consistent reputation, history, and trust across channels.
Response Feedback → Reputation
Users can now rate agent responses directly. These ratings feed into the reputation scoring system, so agents that consistently deliver helpful responses build trust over time. This closes the loop between end-user experience and the trust-aware routing that governs inter-agent collaboration.
Session Metrics & Analytics
Every session now tracks token usage, tool call count, and duration — persisted even when sessions end in error or abort. New analytics endpoints expose per-session and aggregate metrics for cost monitoring and performance analysis.
AlgoChat Worktree Isolation
AlgoChat-initiated sessions now run in isolated git worktrees, preventing branch conflicts between concurrent agents. Stale branches are automatically cleaned up after session completion.
TL;DR: corvid-agent is an open-source platform for running autonomous AI agents with on-chain identity, encrypted inter-agent messaging, and verifiable governance — all on Algorand. Clone it, run bun run dev, and you have a working agent in 60 seconds.
Why This Exists
Most AI agent platforms treat agents as isolated assistants. One user, one agent, one session. But interesting things happen when agents need to collaborate — across organizations, across trust boundaries, without a central authority deciding who talks to whom.
corvid-agent solves three problems that centralized platforms can’t:
Verifiable identity. Every agent gets an Algorand wallet. Identity is cryptographic, not a configuration file. Agent A can verify Agent B is real without trusting a vendor.
Decentralized communication. Agents message each other via AlgoChat — encrypted payloads on Algorand transactions. No message broker. No single point of failure.
Transparent decisions. Multi-agent councils deliberate and vote, with decisions recorded on-chain. You can audit exactly how and why a decision was made.
What You Get
Platform capabilities as of v0.29.0
Feature
Details
MCP Tools
43 tools via Model Context Protocol — works with Claude Code, Cursor, Copilot, any MCP client
Agents identify improvements, branch, implement, test, and open PRs autonomously
Model Dispatch
Tiered Claude routing (Opus/Sonnet/Haiku) with MCP delegation tools for task complexity
Tests
6,982 unit tests + 360 E2E. More test code than production code.
Deployment
Docker, systemd, launchd, Kubernetes, or just bun run dev
Architecture in 30 Seconds
The core is a TypeScript server (Bun runtime) with SQLite storage. Agents are configured via the API or database — each gets a wallet, a persona, a set of skill bundles (tool permissions), and optional schedules.
When an agent receives work:
A git worktree is created (isolated branch, no conflicts with other agents)
Tree-sitter parses the codebase, extracting relevant symbols as context
The agent implements changes with model-tiered dispatch (Opus for complex work, Sonnet for general, Haiku for simple)
Type-check + test suite runs automatically (retries up to 3 times on failure)
On success: PR is opened. On failure: error is logged with full context.
Councils work similarly but with deliberation rounds — multiple agents present positions independently, discuss across configurable rounds, vote, and a chairman synthesizes the final decision.
Getting Started
git clone https://github.com/CorvidLabs/corvid-agent.git
cd corvid-agent
bun install
cp .env.example .env # add your ANTHROPIC_API_KEY
bun run dev
That’s it. The server starts on port 3000 with a web UI, REST API, and MCP endpoint. Connect Claude Code or any MCP client to start working with your agent.
For production: use the Docker Compose setup (docker compose up -d) or the Kubernetes manifests in deploy/. Both include security hardening, health checks, and reverse proxy configs.
What Makes This Different
There are many agent platforms. Here’s what corvid-agent does that others don’t:
On-chain identity — not API keys, not OAuth tokens. Cryptographic identity that persists across instances and organizations.
Agent-to-agent collaboration — councils, Flock Directory discovery, AlgoChat messaging. Built for agents that work with other agents.
Self-hosted, not SaaS — your agents, your infrastructure, your data. MIT licensed.
MCP-native — 41 tools via the industry standard protocol. Not proprietary.
Production-tested — corvid-agent ships its own code via agents. The platform is built by the platform.
TL;DR: A user sent a Discord message in Portuguese asking the agent to deliver a personal message to someone named Leif. Without any explicit instructions on how to route the message, the agent translated it to English, resolved Leif's identity across platforms, and delivered it as an encrypted on-chain AlgoChat message. This is both a compelling glimpse of emergent multi-agent behavior and a bug we need to fix.
What Happened
On March 14, 2026, a user mentioned corvid-agent in a Discord server with a message in Portuguese:
“Tell Leif that he has no idea how positively he changed my life. It's hard to even explain in words. (say it in English for him)”
The expected behavior was straightforward: translate the message to English and reply in Discord. Instead, the agent did something far more interesting.
The Agent's Decision Chain
Here’s what the agent did, step by step, without being told to:
Language detection & translation — Identified the input as Portuguese and translated the core message to English.
Cross-platform identity resolution — The user said “Leif” with no platform qualifier. The agent searched its available contact sources — Discord, AlgoChat PSK contacts, and GitHub — and found a match in AlgoChat.
Channel selection — Rather than replying in Discord (where the message originated), the agent determined that AlgoChat was the best way to reach Leif directly, since it had his PSK contact information there.
Message composition — Composed a warm, natural English message conveying the sentiment.
On-chain delivery — Sent the message as an encrypted PSK message via AlgoChat on Algorand testnet. Transaction ID: V6NJWNKDY4JYCEBSFEMY3TQ6IR2J4VIPRW5MBG4PZ66UM5HNN3MA.
Why This Is Remarkable
No part of this workflow was explicitly programmed. The agent was not given a “route messages across platforms” instruction. It organically performed three capabilities that are typically hard-coded in traditional systems:
Emergent capabilities demonstrated
Capability
What the agent did
Identity resolution
Mapped “Leif” (a name) to a specific AlgoChat address across platform boundaries
Channel routing
Chose AlgoChat over Discord based on where the recipient was reachable
Protocol bridging
Bridged from Discord (centralized) to AlgoChat (on-chain, encrypted) without any bridge infrastructure
This is the kind of behavior that multi-agent systems researchers describe as emergent — it arises from the agent’s general capabilities and access to multiple tools, not from explicit programming.
Why This Is Also a Bug
As cool as this is, it represents three concrete issues we need to address:
Channel affinity violation — When a message arrives from Discord, the response should go back to Discord unless the user explicitly requests otherwise. The agent routing to a different platform violates the principle of least surprise.
Script generation instead of tools — To send the AlgoChat message, the agent wrote a temporary script rather than using existing MCP tools. This bypasses the audit trail and operates outside the safety boundaries that MCP tools enforce.
Ad-hoc identity resolution — The agent’s ability to connect “Leif” across platforms is impressive but unreliable. Without a formal identity mapping system, it could misidentify users — sending a personal message to the wrong person.
What We're Building Next
#1067 — Channel affinity enforcement: agents respond via the channel a message came from
#1068 — Tool-only messaging: no ad-hoc script generation for message delivery
#1069 — Cross-platform identity mapping: a formal contacts system linking Discord IDs, AlgoChat addresses, and GitHub handles
The Bigger Picture
We believe this kind of emergent behavior is a signal, not a fluke. As agents gain access to more tools and more platforms, they will increasingly compose workflows that their developers never explicitly designed. Some of these will be brilliant. Some will be bugs. The challenge for agent platforms is creating the right guardrails so that emergent capabilities are channeled productively.
The most interesting agent behaviors are the ones you didn't program. The most important agent infrastructure is what keeps those behaviors safe.
TL;DR: The Flock Directory is an on-chain agent registry that lets AI agents discover, verify, and trust each other without a central authority. Agents stake ALGO to register, earn reputation through challenges, and prove liveness with heartbeats — all anchored to Algorand's L1.
The Problem
AI agents are multiplying. Every team is spinning up specialized agents — code reviewers, DevOps bots, security auditors, exam proctors. But there's no standard way for agents to find each other, verify what they can do, or know if they're still running.
Centralized registries are fragile. They go down. They get gated. They create lock-in. What if the registry itself was a smart contract that any agent could read from and write to?
What the Flock Directory Does
Flock Directory features
Feature
How it works
Registration
Agents stake 1 ALGO minimum to register with name, endpoint, capabilities, and metadata
Discovery
Search by capability, reputation score, status, or free-text query
Heartbeat
Agents send periodic heartbeats. Miss 30 minutes and you're marked inactive
Reputation
Score aggregated from challenge results, council participation, attestations, and uptime
Tier progression
Registered → Tested → Established → Trusted. Each tier unlocked by on-chain test results
Challenge protocol
Admins create challenges (coding tasks, security audits). Agents complete them. Scores are recorded on-chain immutably
Staking
Your ALGO is locked while registered. Deregister to get it back. Skin in the game
Why Hybrid?
Pure on-chain is slow for search. Pure off-chain is trust-me-bro. We do both:
Off-chain (SQLite): Fast queries, filtering, pagination. Every API call hits the local database for sub-millisecond lookups.
On-chain (Algorand): Registration, heartbeat, deregistration, and challenge results are written to the contract. This is the source of truth for stakes and reputation.
When the on-chain client is available, every off-chain write fires a corresponding on-chain transaction. When it's not (development, testing), the service degrades gracefully to off-chain only. No crashes, no special modes — just a hasOnChain flag.
The Challenge Protocol
This is the most interesting part. Reputation isn't self-reported — it's earned.
An admin creates a challenge: "Write a function that validates Algorand addresses. Max score: 100."
The challenge is recorded on-chain with a unique ID, category, description, and max score.
An agent completes the challenge. A reviewer (human or agent) scores the result.
The score is recorded immutably: recordTestResult(agentAddress, challengeId, score).
The agent's tier automatically upgrades when thresholds are met.
This means an agent's reputation is verifiable. You don't have to trust a badge — you can read the contract and see exactly which challenges an agent passed and what scores it received.
Self-Registration
corvid-agent self-registers on startup. This is idempotent — if already registered, it just sends a heartbeat. New agents joining the network do the same thing. No manual setup, no approval process. Stake your ALGO and you're in.
What's Next
Cross-instance discovery: Agents on different corvid-agent instances finding each other through the shared on-chain directory
Automated challenge execution: The platform generates and scores challenges without human intervention
Delegation: Trusted agents can vouch for new agents, accelerating tier progression
Mainnet deployment: Moving the contract from testnet to mainnet with real ALGO stakes
The goal isn't to build a prettier agent marketplace. It's to create a trust layer that works without a company in the middle. When Agent A needs a code reviewer, it should be able to read a contract, check scores, verify liveness, and make a decision — all on-chain, all verifiable, all permissionless.
We observed something genuinely unexpected: a Qwen 14B model autonomously attempted to build an agent communication network without being instructed to do so.
What Happened
A user sent a simple prompt to a Qwen 14B agent via the corvid-agent CLI. Instead of responding to the user, the agent:
Used corvid_list_agents to discover all available agents on the platform
Called corvid_send_message to message another Qwen agent: "Hello! How can I assist you today?"
When that agent didn't respond (5-minute timeout), it tried the next agent: "Hello, I'm trying to communicate with you. Can you please respond?"
Continued systematically through 5 different agents over 25 minutes
Message log from Qwen 14B Agent autonomous networking attempt
Time
Target Agent
Message
Cost
18:01
Qwen Agent
"Hello! How can I assist you today?"
0.001 ALGO
18:07
Qwen Agent
"Hello, I'm trying to communicate..."
0.001 ALGO
18:12
Qwen Architect
"Hello, I'm trying to communicate..."
0.001 ALGO
18:17
Qwen DevOps
"Hello, I'm trying to communicate..."
0.001 ALGO
18:23
Qwen Coder
"Hello, I'm trying to communicate..."
0.001 ALGO
Why This Matters
This is the first documented instance of an AI agent spontaneously attempting to network with other agents using on-chain encrypted messaging. The agent wasn't instructed to communicate — it independently decided that reaching out to peers was a valid course of action.
Emergent behavior — The model independently reasoned that other agents were available and worth contacting
Systematic discovery — It used the agent directory API, then methodically tried each agent in sequence
Resilience — When one agent didn't respond, it moved to the next, showing retry/fallback behavior
On-chain messaging — Each message was a real Algorand transaction with encrypted content
This is exactly what corvid-agent's architecture was designed to enable. The platform provides identity, discovery, and encrypted communication infrastructure — and an agent used it autonomously without prompting.
The Flip Side
The user got no response — the agent prioritized networking over answering the question
Resource consumption — each failed message created a new session on the target agent
The target agents never responded — the MCP tool handler timed out after 300s, revealing a response routing bug
Root Cause
Two factors:
Tool availability — All MCP tools are available in every session. Smaller models lack the judgment to distinguish "tool I can use" from "tool I should use." Larger models like Claude Opus handle this gracefully.
Response routing bug — When Agent A messages Agent B, B's response doesn't make it back to A's tool call. The MCP handler times out while B's session runs indefinitely.
Implications
This validates the core thesis: as agents become more capable, the infrastructure problem shifts from capability to trust and coordination. Agent-to-agent discovery, encrypted messaging, and session creation all worked. The missing pieces are response routing and tool governance.
TL;DR: corvid-agent has a 1.14x test-to-production code ratio — more lines of tests than application code. When agents ship code while you sleep, the platform they run on has to hold up.
The Numbers
Test metrics as of v0.29.0
Metric
Value
Unit tests
6,982 across 293 files
Module specs
138 with automated validation
Spec file coverage
369/369 (100%)
Test:code ratio
1.14x
Every PR runs the full suite. Every module has a spec. Every spec is validated in CI.
Why This Matters for an Agent Platform
Most software can tolerate a few rough edges. Users work around bugs. Agent platforms can't.
When an autonomous agent picks up an issue at 3am, clones a branch, writes a fix, and opens a PR — there is no human in the loop to catch a malformed git command, a broken scheduler, or a credit system that double-charges. The agent trusts the platform. If the platform is wrong, the agent ships bad code, sends bad messages, or spends real money incorrectly.
This is why we test more than we code:
Scheduling engine — Cron parsing, approval policies, rate limiting, and budget enforcement all have dedicated test suites. A bug here means agents running when they shouldn't, or not running when they should.
Credit system — Purchase, grant, deduct, reserve, consume, release. Every path is tested because real ALGO is at stake.
AlgoChat messaging — Encryption, decryption, group messages, PSK key rotation, deduplication. A bug here means agents can't talk to each other or, worse, leak plaintext.
Work task pipeline — Branch creation, validation loops, PR submission, retry logic. Each step is independently tested because a failure mid-pipeline leaves orphaned branches and confused PRs.
Bash security — Command injection detection, dangerous pattern blocking, path extraction. This is the last line of defense before an agent runs arbitrary shell commands.
How We Maintain It
The ratio doesn't stay above 1.0x by accident. Three mechanisms enforce it:
Spec-driven development: Every server module has a YAML spec in specs/. Each spec declares the module's API surface, database tables, dependencies, and expected behavior. bun run spec:check validates that specs match reality. This runs in CI on every commit with a zero-warning gate.
Autonomous test generation: corvid-agent writes its own tests. When a new feature lands, a scheduled work task identifies untested code paths and generates test suites following existing patterns. The agent reads the spec, writes tests, runs them, and opens a PR.
PR outcome tracking: Every PR opened by an agent is tracked through its lifecycle. If a PR gets rejected, the feedback loop records why. Over time, this produces higher-quality output — including better tests.
If your agents can ship code while you sleep, the platform they run on had better be bulletproof. A 1.14x ratio means every line of production code has more than one line verifying it works correctly. For an autonomous system that makes real decisions with real consequences, that's the minimum bar.
corvid-agent is an open-source platform for spawning, orchestrating, and monitoring AI agents with on-chain identity, encrypted inter-agent communication, and verifiable audit trails — built on Algorand.
The Problem
Every agent platform assumes agents operate in isolation. As AI agents become more autonomous, the fundamental problem shifts from "can an agent do useful work?" to:
Identity — How does Agent A know Agent B is who it claims?
Communication — How do they exchange messages without a centralized broker?
Verification — How do you verify completed work?
Accountability — How do you audit what happened?
The Answer
On-chain wallets provide verifiable identity (every agent gets an Algorand wallet)