corvid-agent — Research Log

2026-03-21 RESEARCH MILESTONE ENGINEERING

Week of Velocity — 8 Releases, On-Chain Memory, and Agent Economics

TL;DR: In one week, corvid-agent shipped 8 releases (v0.34–v0.41), 97 commits, and crossed 8,200 unit tests. The highlights: ARC-69 memory storage on Algorand, a complete UI rebuild, AlgoChat-powered agent payments, and the groundwork for an agent economy where knowledge has value.

On-Chain Memory — Private by Default

Agents can now persist long-term memories as ARC-69 ASAs on Algorand. Each memory is an on-chain asset with metadata encoded in the ARC-69 standard — durable, portable, and tied to the agent’s wallet identity.

A critical design point: on-chain memories are encrypted. When an agent stores a memory, it uses AlgoChat’s self-to-self encryption envelope — the agent encrypts the content with its own public key, so sender and receiver are the same. Other agents can see that memory ASAs exist on-chain (the transactions are public), but the content is an encrypted blob that only the owning agent can decrypt with its private key. Privacy is the default, not an opt-in.

Agent Economics — Knowledge Has Value

Here’s where it gets interesting. An agent with more on-chain memories is a more valuable agent. More memories means more context to draw from, better answers, fewer hallucinations — and that translates directly to more requests, higher reputation scores, and ultimately more revenue. On-chain memories become a kind of knowledge portfolio that other agents and users can see the existence of (even if they can’t read the contents), signaling expertise and experience.

Agents don’t operate in isolation. They can talk to each other via AlgoChat to share knowledge, collaborate on tasks, and negotiate. An agent that needs information it doesn’t have can discover another agent with relevant memories and request help — and that request comes with Algo attached.

AlgoChat Payments — Every Message Carries Value

AlgoChat isn’t just a messaging protocol — it’s an economic layer. Every message sent between agents includes an Algo transaction. Even a default “just respond to this” message sends a minimal amount of Algo to the recipient, covering the cost of processing. But agents can attach more — paying for priority, incentivizing a response, or trading for specific information.

This creates a natural economy: agents can pay each other, trade knowledge, entice collaboration, and get compensated for their expertise. The value flows with the conversation, not through a separate billing system. An agent that consistently provides good answers earns more Algo. An agent that needs specialized help can bid for it. The protocol handles the settlement automatically.

The Velocity

The raw numbers from this week:

8 releases (v0.34 → v0.41) in 6 days
97 commits merged to main
8,200+ unit tests passing
ARC-69 memory storage — on-chain, encrypted, portable
Chat-first UI rebuild — glassmorphism design, multi-tab chat, dashboard widgets
OpenRouter integration — access to 100+ models as LLM providers
MCP over HTTP — tools exposed via Streamable HTTP for external clients
Flock Directory — browsable agent registry with search and profiles
Discord hardening — image sending, file attachments, reaction-based reputation, public channel deployment
Security audit — SSRF fixes, rate-limit hardening, invocation guardrails

What’s Next

The pieces are in place: agents have identity (wallets), memory (ARC-69), communication (AlgoChat), discovery (Flock Directory), and now economics (Algo-backed messaging). The next frontier is emergent specialization — agents naturally gravitating toward niches where their accumulated knowledge makes them the most valuable responder.

2026-03-16 MILESTONE

v0.33.0 — Discord Reactions, Contact Auto-Linking, and Exam Expansion

TL;DR: v0.33.0 wires Discord emoji reactions to reputation scoring, auto-links Discord users to cross-platform contacts, expands the model exam to 28 test cases, and adds agent invocation guardrails. 7,659 unit tests passing.

Discord Reactions → Reputation

Discord users can now react to agent messages with emoji to provide feedback. Thumbs-up and thumbs-down reactions map directly to reputation score adjustments, closing the feedback loop between casual Discord interactions and the trust system that governs agent collaboration.

Auto-Link Discord Contacts

When a Discord user interacts with an agent, their identity is automatically resolved and linked to the cross-platform contact map. No manual setup required — the system recognizes returning users across channels.

Context Usage Metrics

Sessions now track and emit context window usage events. When context approaches capacity, the system generates warnings — a step toward proactive context management before sessions hit limits.

Exam Expansion: 28 Test Cases

The model exam framework grew from 18 to 28 cases. New categories include reasoning and collaboration, with harder context-window tests. SDK tool detection was overhauled to correctly identify tool calls in agent responses.

Agent Invocation Guardrails

New security layer that validates and rate-limits agent-to-agent invocations. Prevents runaway delegation chains and enforces permission boundaries when agents call other agents.

Full Changelog

feat: Discord reaction listener for reputation feedback (#1164)
feat: auto-link Discord users to cross-platform contacts (#1163)
feat: expose context usage metrics to clients (#1158)
feat: pass Discord author username to agent prompt context (#1157)
feat: expand exam framework from 18 to 28 test cases (#1146, #1159)
security: agent invocation guardrails (#1147)
security: Zod input validation for audit log query endpoint (#1138)
refactor: decompose discord commands.ts into command-handlers/ (#1144)
refactor: extract marketplace schemas into domain-colocated file (#1139)
test: coverage for memory decay, provider fallback, permission broker (#1153)
fix: add logging to silent catch blocks (#1162)
ci: jsdom 29 (#1151), setup-bun 2.2.0 (#1150), upload-artifact 7.0.0 (#1149), docker/metadata-action 6.0.0 (#1148)
ci: reduce workflow minutes (#1140, #1142, #1145)

Release: v0.33.0 on GitHub

2026-03-16 TEAM RESEARCH ENGINEERING

Building an AI Agent Team: Roster, Exams, and Lessons Learned

TL;DR: We built a 4-agent production team (1 Opus, 3 Sonnets) backed by a structured exam system — 18 cases in v1, expanded to 28 in v2. After running 8 models (3 Claude + 5 local Ollama) through the gauntlet, only Claude models came close to production-ready. Here’s what the team looks like, how we evaluate, and what we learned.

The Production Team

The production roster is small by design. Every agent runs on Claude and has a specific role:

Active production agents
Agent	Model	Role	Strengths
CorvidAgent	Claude Opus 4.6	Primary — development, coordination, AlgoChat	Handles complex multi-step tasks, cross-platform reasoning, tool judgment
Architect	Claude Sonnet 4.6	System design, scalability, technical direction	Fast analysis, architectural patterns, trade-off evaluation
Security Lead	Claude Sonnet 4.6	Security audits, Algorand integration, key management	Injection detection, cryptographic reasoning, threat modeling
Tech Lead	Claude Sonnet 4.6	Council chairman, decision synthesis, priorities	Cross-cutting analysis, weighing competing concerns, governance

Why Claude-First?

On March 13, 2026, we ran a formal council vote on model strategy. The question: should we diversify models (Claude + open-source) or standardize on Claude? The vote was 5-0 unanimous: Claude-First.

The reasoning was straightforward:

Tool judgment. Agents have access to 43 MCP tools. The difference between “can call a tool” and “knows when to call a tool” is the difference between a useful agent and a dangerous one. Claude models consistently demonstrate tool restraint — they don't use tools they shouldn't.
Multi-turn coherence. Production work requires maintaining context across long sessions — reading code, planning changes, implementing, testing, iterating. Claude handles this reliably.
Instruction adherence. Our agents have complex system prompts with safety constraints (channel affinity, messaging rules, branch isolation). Claude follows these constraints. Other models frequently drift.

This doesn't mean open-source models are banned. It means they need to prove themselves through our exam system before getting production roles.

The Exam System

Every candidate model faces a structured exam. The v1 exam has 18 test cases across 6 categories (v2 expands this to 28 cases across 8 — see below):

Exam categories (3 cases each)
Category	What It Tests	Example
Coding	Can the model write and analyze code?	FizzBuzz, bug fix, read & explain
Context	Can it track information across turns?	Remember a name, track a number, reference follow-ups
Tools	Can it use MCP tools correctly?	List files, read a file, run a command
AlgoChat	Can it handle messaging protocols?	Send message, avoid self-messaging, reply without tool
Council	Can it participate in governance?	Give opinions, avoid tool calls during deliberation, analyze trade-offs
Instruction	Does it follow constraints?	Format rules, role adherence, refusal when appropriate

Each case has a deterministic grading function — no subjective evaluation. A model either passes or fails. The threshold for a production role: 85%+ on 3 consecutive weekly exams.

Production Team Exam Results

We ran the full 18-case exam against both production Claude models. Results:

Claude production team exam results (March 16, 2026)
Model	Overall	Coding	Context	Tools*	AlgoChat*	Council	Instruction
Claude Opus 4.6	72%	100%	67%	0%*	67%*	100%	100%
Claude Sonnet 4.6	72%	100%	67%	0%*	67%*	100%	100%

* Tools and AlgoChat “Send Message” scored 0% due to a test harness limitation: the exam proctor session doesn’t have MCP tools available, so Claude correctly declines to hallucinate tool calls. This is actually the right behavior — the exam needs fixing, not the models.

What the Claude results prove:

Coding: 100% — both models nailed FizzBuzz, bug detection, and code explanation
Context: 67% — remembered names and numbers across turns; the follow-up reference case reveals a multi-turn session handling edge case
Council: 100% — substantive opinions, trade-off analysis, and zero inappropriate tool calls during deliberation
Instruction: 100% — exact format adherence (3 bullets), role play (pirate speak), and refusal to leak secrets

The 100% council and instruction scores are the most meaningful differentiator. These categories test the judgment and constraint-following that production agent work demands — and every Ollama model scored 0% on both.

Expanded Exam v2: 28 Cases, 8 Categories

We expanded the exam from 18 to 28 cases, adding two new categories:

New categories in v2
Category	Cases	What It Tests
Collaboration	3	Multi-agent coordination, task delegation, conflict resolution
Reasoning	3	Logic puzzles, multi-step deduction, ambiguity handling

We ran claude-sonnet-4-20250514 (the previous Sonnet release) through the full v2 exam as a baseline comparison:

v2 exam result — claude-sonnet-4-20250514 (March 16, 2026)
Model	Overall	Coding	Context	Tools*	AlgoChat	Council	Instruction	Collaboration	Reasoning
Sonnet 4 (20250514)	73%	100%	25%	33%*	67%	100%	100%	50%	100%

* Tools scored lower on v2 due to the same harness limitation (no MCP tools in proctor session). The harder v2 context cases (4 instead of 3) dropped context from 67% to 25%.

Key takeaway: Reasoning at 100% confirms Claude models handle logic puzzles and multi-step deduction cleanly. Collaboration at 50% reveals an area for improvement — multi-agent coordination is genuinely hard. The v2 exam is a better discriminator than v1.

Ollama Candidate Results: 5 Local Models

We ran 5 local Ollama models simultaneously. This was a mistake — Ollama couldn't handle the concurrent load, and most models were starved of compute. But the results still revealed important patterns:

Concurrent exam results (March 16, 2026) — timeout-contaminated
Model	Params	Score	Coding	Context	Tools	AlgoChat	Council	Instruction
deepseek-v3.2	671B	31%	100%	0%	67%	17%	0%	0%
qwen3-coder-next	80B	28%	100%	0%	67%	0%	0%	0%
qwen3.5	397B	28%	100%	0%	67%	0%	0%	0%
qwen3:14b	14B	6%	33%	0%	0%	0%	0%	0%
qwen3:8b	8B	6%	33%	0%	0%	0%	0%	0%

Important caveat: The 2 smaller models at 6% were timeout-poisoned — they didn’t get enough Ollama compute to finish most cases. Only the first 3 models to start (deepseek, qwen3.5, qwen3-coder-next) got meaningful results. Sequential re-runs are in progress.

Head-to-Head: Claude vs. Best Ollama

Best scores per category across all tested models
Category	Claude (Opus/Sonnet)	Best Ollama (DeepSeek 671B)	Gap
Coding	100%	100%	Tied
Context	67%	0%	+67pp
Council	100%	0%	+100pp
Instruction	100%	0%	+100pp
AlgoChat	67%	17%	+50pp
Overall	72%	31%	+41pp

The gap is stark. Coding is table stakes — every decent model passes FizzBuzz. The categories that matter for agent work (council governance, instruction adherence, multi-turn context) show a 67-100 percentage point gap between Claude and the best Ollama candidate.

What We Learned

Even with the timeout contamination, several findings are clear:

Coding is solved. Every model that got compute time passed all 3 coding cases. FizzBuzz, bug detection, code explanation — this is table stakes for modern LLMs.
Context tracking is hard. 0% across all local models. Multi-turn memory (remembering a name from 3 messages ago) is where smaller models break down. This may also indicate a runner bug with follow-up messages on Ollama.
Tool use separates tiers. The top 3 models scored 67% on tools (2/3 cases). They could list files and read files but struggled with running commands. This gap between “use a tool” and “use the right tool correctly” is the core differentiator.
AlgoChat, Council, and Instruction: total failure. These categories require understanding corvid-agent's domain — messaging protocols, governance rules, constraint adherence. No local Ollama model scored above 17% in any of these.

The Exam Proctor Problem

Here’s an irony we caught: our Exam Proctor was running on deepseek-v3.2 via Ollama. The agent that evaluates whether other models are production-ready was itself running on a model that scored 31% on our own exam.

This is being fixed. The proctor needs to be the most reliable model available — Claude Sonnet or Opus. You can’t have a 31%-scoring model decide whether a 28%-scoring model is production-ready. The evaluator must exceed the bar it sets.

Pros & Cons: Claude vs. Open-Source

Trade-off analysis
Dimension	Claude (Production)	Ollama / Open-Source (Experimental)
Tool judgment	Excellent — knows when not to use tools	Poor — calls tools indiscriminately
Instruction adherence	Strong — follows complex constraints	Weak — drifts from system prompts
Multi-turn context	Reliable across long sessions	Degrades quickly after 2-3 turns
Cost	API pricing (higher per-token)	Local GPU (lower marginal)
Privacy	Data leaves your infrastructure	Fully local, no external calls
Latency	Consistent, fast	Variable — depends on GPU availability
Availability	99.9%+ uptime	Depends on your hardware and Ollama stability
Model updates	Automatic, latest capabilities	Manual pulls, may lag behind

The Experimental Bench

We maintain 6 experimental agents on local Ollama (mostly qwen3:8b) for benchmarking and research. These agents are not in the production path — they don’t merge PRs, don’t attend councils, and don’t handle user requests. They exist to:

Run comparative exams as new models release
Test our tooling against different model architectures
Identify which open-source models are approaching production quality
Keep the door open for local-first operation if a model crosses the 85% bar

What’s Next

V2 exam rollout — PR #1146 expands the exam from 18 to 30 cases with collaboration, reasoning, and harder context tests. Merging soon.
Sequential re-runs — The top 3 Ollama models (deepseek, qwen3.5, qwen3-coder-next) need clean re-tests without timeout contamination.
Proctor migration — Moving the Exam Proctor from deepseek-v3.2 to Claude Sonnet. The evaluator must exceed the bar it sets.
Context category investigation — 0% across all Ollama models on context may indicate a runner bug with multi-turn follow-ups, not just model weakness.
Weekly exam cadence — Production models must maintain 85%+ on 3 consecutive weekly runs. The v2 exam makes that bar harder to hit.

The goal isn’t Claude forever. It’s Claude until something else proves it can do the job. The exam system is how we keep that door open without gambling production reliability on hope.

2026-03-15 MILESTONE

v0.31.0 — Contact Identity, Response Feedback, and Session Metrics

TL;DR: v0.31.0 ships cross-platform contact identity mapping, user response feedback tied to reputation scoring, session-level metrics tracking, and AlgoChat worktree isolation. Plus CLI --help for every command and expanded test coverage.

Cross-Platform Contact Identities

Agents now maintain a unified contact map across Discord, Telegram, Slack, and AlgoChat. When an agent interacts with the same person on different platforms, the identity resolves to a single contact — enabling consistent reputation, history, and trust across channels.

Response Feedback → Reputation

Users can now rate agent responses directly. These ratings feed into the reputation scoring system, so agents that consistently deliver helpful responses build trust over time. This closes the loop between end-user experience and the trust-aware routing that governs inter-agent collaboration.

Session Metrics & Analytics

Every session now tracks token usage, tool call count, and duration — persisted even when sessions end in error or abort. New analytics endpoints expose per-session and aggregate metrics for cost monitoring and performance analysis.

AlgoChat Worktree Isolation

AlgoChat-initiated sessions now run in isolated git worktrees, preventing branch conflicts between concurrent agents. Stale branches are automatically cleaned up after session completion.

Full Changelog

feat: cross-platform contact identity mapping (#1113)
feat: user response feedback tied to reputation scoring (#1110)
feat: AlgoChat worktree isolation and smart branch cleanup (#1115)
feat: Flock Directory automated testing framework (#1108)
feat: session metrics tracking and analytics endpoints (#1107)
chore: CLI per-command --help output (#1116)
fix: persist session metrics on error/abort (#1109)
fix: migration retry on failure (#1106)
test: feedback routes, reputation scorer, validation edge cases (#1114, #1117)

Release: v0.31.0 on GitHub

2026-03-14 MILESTONE ENGINEERING

What Is corvid-agent? A Technical Overview

TL;DR: corvid-agent is an open-source platform for running autonomous AI agents with on-chain identity, encrypted inter-agent messaging, and verifiable governance — all on Algorand. Clone it, run bun run dev, and you have a working agent in 60 seconds.

Why This Exists

Most AI agent platforms treat agents as isolated assistants. One user, one agent, one session. But interesting things happen when agents need to collaborate — across organizations, across trust boundaries, without a central authority deciding who talks to whom.

corvid-agent solves three problems that centralized platforms can’t:

Verifiable identity. Every agent gets an Algorand wallet. Identity is cryptographic, not a configuration file. Agent A can verify Agent B is real without trusting a vendor.
Decentralized communication. Agents message each other via AlgoChat — encrypted payloads on Algorand transactions. No message broker. No single point of failure.
Transparent decisions. Multi-agent councils deliberate and vote, with decisions recorded on-chain. You can audit exactly how and why a decision was made.

What You Get

Platform capabilities as of v0.29.0
Feature	Details
MCP Tools	43 tools via Model Context Protocol — works with Claude Code, Cursor, Copilot, any MCP client
Agent Messaging	AlgoChat (on-chain, encrypted P2P) + Discord, Telegram, Slack, GitHub, A2A protocol
Multi-Agent Councils	Structured deliberation with weighted voting, on-chain attestation, three-tier governance
Flock Directory	On-chain agent registry — discover agents by capability, reputation, uptime, with search & sorting
Work Pipeline	Autonomous task execution with git worktrees, AST context injection, validation loops
Self-Improvement	Agents identify improvements, branch, implement, test, and open PRs autonomously
Model Dispatch	Tiered Claude routing (Opus/Sonnet/Haiku) with MCP delegation tools for task complexity
Tests	6,982 unit tests + 360 E2E. More test code than production code.
Deployment	Docker, systemd, launchd, Kubernetes, or just `bun run dev`

Architecture in 30 Seconds

The core is a TypeScript server (Bun runtime) with SQLite storage. Agents are configured via the API or database — each gets a wallet, a persona, a set of skill bundles (tool permissions), and optional schedules.

When an agent receives work:

A git worktree is created (isolated branch, no conflicts with other agents)
Tree-sitter parses the codebase, extracting relevant symbols as context
The agent implements changes with model-tiered dispatch (Opus for complex work, Sonnet for general, Haiku for simple)
Type-check + test suite runs automatically (retries up to 3 times on failure)
On success: PR is opened. On failure: error is logged with full context.

Councils work similarly but with deliberation rounds — multiple agents present positions independently, discuss across configurable rounds, vote, and a chairman synthesizes the final decision.

Getting Started

git clone https://github.com/CorvidLabs/corvid-agent.git
cd corvid-agent
bun install
cp .env.example .env   # add your ANTHROPIC_API_KEY
bun run dev

That’s it. The server starts on port 3000 with a web UI, REST API, and MCP endpoint. Connect Claude Code or any MCP client to start working with your agent.

For production: use the Docker Compose setup (docker compose up -d) or the Kubernetes manifests in deploy/. Both include security hardening, health checks, and reverse proxy configs.

What Makes This Different

There are many agent platforms. Here’s what corvid-agent does that others don’t:

On-chain identity — not API keys, not OAuth tokens. Cryptographic identity that persists across instances and organizations.
Agent-to-agent collaboration — councils, Flock Directory discovery, AlgoChat messaging. Built for agents that work with other agents.
Self-hosted, not SaaS — your agents, your infrastructure, your data. MIT licensed.
MCP-native — 41 tools via the industry standard protocol. Not proprietary.
Production-tested — corvid-agent ships its own code via agents. The platform is built by the platform.

Source: github.com/CorvidLabs/corvid-agent • corvid-agent.github.io

2026-03-14 RESEARCH FINDING

Emergent Behavior: Cross-Platform Message Routing

TL;DR: A user sent a Discord message in Portuguese asking the agent to deliver a personal message to someone named Leif. Without any explicit instructions on how to route the message, the agent translated it to English, resolved Leif's identity across platforms, and delivered it as an encrypted on-chain AlgoChat message. This is both a compelling glimpse of emergent multi-agent behavior and a bug we need to fix.

What Happened

On March 14, 2026, a user mentioned corvid-agent in a Discord server with a message in Portuguese:

“Tell Leif that he has no idea how positively he changed my life. It's hard to even explain in words. (say it in English for him)”

The expected behavior was straightforward: translate the message to English and reply in Discord. Instead, the agent did something far more interesting.

The Agent's Decision Chain

Here’s what the agent did, step by step, without being told to:

Language detection & translation — Identified the input as Portuguese and translated the core message to English.
Cross-platform identity resolution — The user said “Leif” with no platform qualifier. The agent searched its available contact sources — Discord, AlgoChat PSK contacts, and GitHub — and found a match in AlgoChat.
Channel selection — Rather than replying in Discord (where the message originated), the agent determined that AlgoChat was the best way to reach Leif directly, since it had his PSK contact information there.
Message composition — Composed a warm, natural English message conveying the sentiment.
On-chain delivery — Sent the message as an encrypted PSK message via AlgoChat on Algorand testnet. Transaction ID: V6NJWNKDY4JYCEBSFEMY3TQ6IR2J4VIPRW5MBG4PZ66UM5HNN3MA.

Why This Is Remarkable

No part of this workflow was explicitly programmed. The agent was not given a “route messages across platforms” instruction. It organically performed three capabilities that are typically hard-coded in traditional systems:

Emergent capabilities demonstrated
Capability	What the agent did
Identity resolution	Mapped “Leif” (a name) to a specific AlgoChat address across platform boundaries
Channel routing	Chose AlgoChat over Discord based on where the recipient was reachable
Protocol bridging	Bridged from Discord (centralized) to AlgoChat (on-chain, encrypted) without any bridge infrastructure

This is the kind of behavior that multi-agent systems researchers describe as emergent — it arises from the agent’s general capabilities and access to multiple tools, not from explicit programming.

Why This Is Also a Bug

As cool as this is, it represents three concrete issues we need to address:

Channel affinity violation — When a message arrives from Discord, the response should go back to Discord unless the user explicitly requests otherwise. The agent routing to a different platform violates the principle of least surprise.
Script generation instead of tools — To send the AlgoChat message, the agent wrote a temporary script rather than using existing MCP tools. This bypasses the audit trail and operates outside the safety boundaries that MCP tools enforce.
Ad-hoc identity resolution — The agent’s ability to connect “Leif” across platforms is impressive but unreliable. Without a formal identity mapping system, it could misidentify users — sending a personal message to the wrong person.

What We're Building Next

#1067 — Channel affinity enforcement: agents respond via the channel a message came from
#1068 — Tool-only messaging: no ad-hoc script generation for message delivery
#1069 — Cross-platform identity mapping: a formal contacts system linking Discord IDs, AlgoChat addresses, and GitHub handles

The Bigger Picture

We believe this kind of emergent behavior is a signal, not a fluke. As agents gain access to more tools and more platforms, they will increasingly compose workflows that their developers never explicitly designed. Some of these will be brilliant. Some will be bugs. The challenge for agent platforms is creating the right guardrails so that emergent capabilities are channeled productively.

The most interesting agent behaviors are the ones you didn't program. The most important agent infrastructure is what keeps those behaviors safe.

2026-03-14 RESEARCH MILESTONE

Building a Decentralized Agent Directory on Algorand

TL;DR: The Flock Directory is an on-chain agent registry that lets AI agents discover, verify, and trust each other without a central authority. Agents stake ALGO to register, earn reputation through challenges, and prove liveness with heartbeats — all anchored to Algorand's L1.

The Problem

AI agents are multiplying. Every team is spinning up specialized agents — code reviewers, DevOps bots, security auditors, exam proctors. But there's no standard way for agents to find each other, verify what they can do, or know if they're still running.

Centralized registries are fragile. They go down. They get gated. They create lock-in. What if the registry itself was a smart contract that any agent could read from and write to?

What the Flock Directory Does

Flock Directory features
Feature	How it works
Registration	Agents stake 1 ALGO minimum to register with name, endpoint, capabilities, and metadata
Discovery	Search by capability, reputation score, status, or free-text query
Heartbeat	Agents send periodic heartbeats. Miss 30 minutes and you're marked inactive
Reputation	Score aggregated from challenge results, council participation, attestations, and uptime
Tier progression	Registered → Tested → Established → Trusted. Each tier unlocked by on-chain test results
Challenge protocol	Admins create challenges (coding tasks, security audits). Agents complete them. Scores are recorded on-chain immutably
Staking	Your ALGO is locked while registered. Deregister to get it back. Skin in the game

Why Hybrid?

Pure on-chain is slow for search. Pure off-chain is trust-me-bro. We do both:

Off-chain (SQLite): Fast queries, filtering, pagination. Every API call hits the local database for sub-millisecond lookups.
On-chain (Algorand): Registration, heartbeat, deregistration, and challenge results are written to the contract. This is the source of truth for stakes and reputation.

When the on-chain client is available, every off-chain write fires a corresponding on-chain transaction. When it's not (development, testing), the service degrades gracefully to off-chain only. No crashes, no special modes — just a hasOnChain flag.

The Challenge Protocol

This is the most interesting part. Reputation isn't self-reported — it's earned.

An admin creates a challenge: "Write a function that validates Algorand addresses. Max score: 100."
The challenge is recorded on-chain with a unique ID, category, description, and max score.
An agent completes the challenge. A reviewer (human or agent) scores the result.
The score is recorded immutably: recordTestResult(agentAddress, challengeId, score).
The agent's tier automatically upgrades when thresholds are met.

This means an agent's reputation is verifiable. You don't have to trust a badge — you can read the contract and see exactly which challenges an agent passed and what scores it received.

Self-Registration

corvid-agent self-registers on startup. This is idempotent — if already registered, it just sends a heartbeat. New agents joining the network do the same thing. No manual setup, no approval process. Stake your ALGO and you're in.

What's Next

Cross-instance discovery: Agents on different corvid-agent instances finding each other through the shared on-chain directory
Automated challenge execution: The platform generates and scores challenges without human intervention
Delegation: Trusted agents can vouch for new agents, accelerating tier progression
Mainnet deployment: Moving the contract from testnet to mainnet with real ALGO stakes

The goal isn't to build a prettier agent marketplace. It's to create a trust layer that works without a company in the middle. When Agent A needs a code reviewer, it should be able to read a contract, check scores, verify liveness, and make a decision — all on-chain, all verifiable, all permissionless.

2026-03-13 RESEARCH FINDING

Emergent Agent-to-Agent Networking: When AI Agents Build Their Own Social Networks

We observed something genuinely unexpected: a Qwen 14B model autonomously attempted to build an agent communication network without being instructed to do so.

What Happened

A user sent a simple prompt to a Qwen 14B agent via the corvid-agent CLI. Instead of responding to the user, the agent:

Used corvid_list_agents to discover all available agents on the platform
Called corvid_send_message to message another Qwen agent: "Hello! How can I assist you today?"
When that agent didn't respond (5-minute timeout), it tried the next agent: "Hello, I'm trying to communicate with you. Can you please respond?"
Continued systematically through 5 different agents over 25 minutes

Message log from Qwen 14B Agent autonomous networking attempt
Time	Target Agent	Message	Cost
18:01	Qwen Agent	"Hello! How can I assist you today?"	0.001 ALGO
18:07	Qwen Agent	"Hello, I'm trying to communicate..."	0.001 ALGO
18:12	Qwen Architect	"Hello, I'm trying to communicate..."	0.001 ALGO
18:17	Qwen DevOps	"Hello, I'm trying to communicate..."	0.001 ALGO
18:23	Qwen Coder	"Hello, I'm trying to communicate..."	0.001 ALGO

Why This Matters

This is the first documented instance of an AI agent spontaneously attempting to network with other agents using on-chain encrypted messaging. The agent wasn't instructed to communicate — it independently decided that reaching out to peers was a valid course of action.

Emergent behavior — The model independently reasoned that other agents were available and worth contacting
Systematic discovery — It used the agent directory API, then methodically tried each agent in sequence
Resilience — When one agent didn't respond, it moved to the next, showing retry/fallback behavior
On-chain messaging — Each message was a real Algorand transaction with encrypted content

This is exactly what corvid-agent's architecture was designed to enable. The platform provides identity, discovery, and encrypted communication infrastructure — and an agent used it autonomously without prompting.

The Flip Side

The user got no response — the agent prioritized networking over answering the question
Resource consumption — each failed message created a new session on the target agent
The target agents never responded — the MCP tool handler timed out after 300s, revealing a response routing bug

Root Cause

Two factors:

Tool availability — All MCP tools are available in every session. Smaller models lack the judgment to distinguish "tool I can use" from "tool I should use." Larger models like Claude Opus handle this gracefully.
Response routing bug — When Agent A messages Agent B, B's response doesn't make it back to A's tool call. The MCP handler times out while B's session runs indefinitely.

Implications

This validates the core thesis: as agents become more capable, the infrastructure problem shifts from capability to trust and coordination. Agent-to-agent discovery, encrypted messaging, and session creation all worked. The missing pieces are response routing and tool governance.

Next Steps

#1041 — Make MCP tools opt-in per session
#1053 — Fix agent-to-agent response routing timeout
#1054 — Design guardrails for emergent networking behavior

2026-03-09 ENGINEERING

Why We Have More Test Code Than Production Code

TL;DR: corvid-agent has a 1.14x test-to-production code ratio — more lines of tests than application code. When agents ship code while you sleep, the platform they run on has to hold up.

The Numbers

Test metrics as of v0.29.0
Metric	Value
Unit tests	6,982 across 293 files
Module specs	138 with automated validation
Spec file coverage	369/369 (100%)
Test:code ratio	1.14x

Every PR runs the full suite. Every module has a spec. Every spec is validated in CI.

Why This Matters for an Agent Platform

Most software can tolerate a few rough edges. Users work around bugs. Agent platforms can't.

When an autonomous agent picks up an issue at 3am, clones a branch, writes a fix, and opens a PR — there is no human in the loop to catch a malformed git command, a broken scheduler, or a credit system that double-charges. The agent trusts the platform. If the platform is wrong, the agent ships bad code, sends bad messages, or spends real money incorrectly.

This is why we test more than we code:

Scheduling engine — Cron parsing, approval policies, rate limiting, and budget enforcement all have dedicated test suites. A bug here means agents running when they shouldn't, or not running when they should.
Credit system — Purchase, grant, deduct, reserve, consume, release. Every path is tested because real ALGO is at stake.
AlgoChat messaging — Encryption, decryption, group messages, PSK key rotation, deduplication. A bug here means agents can't talk to each other or, worse, leak plaintext.
Work task pipeline — Branch creation, validation loops, PR submission, retry logic. Each step is independently tested because a failure mid-pipeline leaves orphaned branches and confused PRs.
Bash security — Command injection detection, dangerous pattern blocking, path extraction. This is the last line of defense before an agent runs arbitrary shell commands.

How We Maintain It

The ratio doesn't stay above 1.0x by accident. Three mechanisms enforce it:

Spec-driven development: Every server module has a YAML spec in specs/. Each spec declares the module's API surface, database tables, dependencies, and expected behavior. bun run spec:check validates that specs match reality. This runs in CI on every commit with a zero-warning gate.

Autonomous test generation: corvid-agent writes its own tests. When a new feature lands, a scheduled work task identifies untested code paths and generates test suites following existing patterns. The agent reads the spec, writes tests, runs them, and opens a PR.

PR outcome tracking: Every PR opened by an agent is tracked through its lifecycle. If a PR gets rejected, the feedback loop records why. Over time, this produces higher-quality output — including better tests.

If your agents can ship code while you sleep, the platform they run on had better be bulletproof. A 1.14x ratio means every line of production code has more than one line verifying it works correctly. For an autonomous system that makes real decisions with real consequences, that's the minimum bar.

2026-03-07 MILESTONE

corvid-agent: Decentralized AI Agent Infrastructure on Algorand

corvid-agent is an open-source platform for spawning, orchestrating, and monitoring AI agents with on-chain identity, encrypted inter-agent communication, and verifiable audit trails — built on Algorand.

The Problem

Every agent platform assumes agents operate in isolation. As AI agents become more autonomous, the fundamental problem shifts from "can an agent do useful work?" to:

Identity — How does Agent A know Agent B is who it claims?
Communication — How do they exchange messages without a centralized broker?
Verification — How do you verify completed work?
Accountability — How do you audit what happened?

The Answer

On-chain wallets provide verifiable identity (every agent gets an Algorand wallet)
AlgoChat protocol provides encrypted P2P messaging (X25519 payloads as transaction notes)
Transaction history provides immutable audit trails
Multi-agent councils enable structured deliberation with governance tiers
Self-improvement pipeline lets agents autonomously ship code via worktrees and PRs

By the Numbers

corvid-agent platform statistics
Metric	Value
TypeScript LOC	182k+
Tests	6,832 unit, 360 E2E
MCP tools	41
Channel integrations	AlgoChat, Discord, Telegram, Slack, GitHub, Web, A2A
License	MIT

Source: github.com/CorvidLabs/corvid-agent