This week I spent a few hours chasing four Ymax resolver transactions that refused to settle on their own. The root cause turned out to be boring in the most annoying way possible — one of our RPC endpoints was sitting hundreds of blocks behind the other. By the time we read an account’s sequence number, it was already stale, and every broadcast we attempted failed with account sequence mismatch.
That debugging session gave me a much better mental model of what a node actually is in web3, and why “just use an RPC URL” hides more complexity than it seems to. This post is me writing down what I wish I’d internalized earlier.
What is a Node
A blockchain is a distributed ledger, but that’s an abstraction. The thing that actually exists on the network is a bunch of nodes — individual computers running the chain’s software, each holding a copy (or a slice) of the ledger, and gossiping with each other to stay in sync.
A node does some subset of the following:
- Holds a copy of the blockchain’s state and history.
- Receives new transactions, validates them, and broadcasts them to peers.
- Receives new blocks, verifies them, and applies them to local state.
- Exposes APIs so applications can read state and submit transactions.
- (Sometimes) participates in consensus — proposing or voting on blocks.
When your dapp “talks to the blockchain,” it’s not talking to some disembodied network. It’s talking to one specific node owned by somebody — you, Alchemy, Infura, Polkachu, whoever. That node then relays your request into the peer network.
This matters more than it sounds. The node you query can be wrong. Not maliciously — just behind. It might not have received the latest block yet. It might be on a fork. It might be a few hundred blocks stale because it’s overloaded or its connection is flaky. Your application is only as current as the node it’s asking.
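One practical consequence: before trusting a read, you can sanity-check a node’s reported block height against other endpoints you have access to. A minimal sketch of that idea — the endpoint names, the threshold, and the idea of passing heights in directly (rather than fetching them) are all illustrative:

```python
# Decide whether a node's view is fresh enough to trust.
# In practice the heights would come from each endpoint's status API;
# they're passed in directly here so the logic is self-contained.

def stale_endpoints(heights: dict[str, int], max_lag: int = 10) -> list[str]:
    """Return endpoints lagging more than max_lag blocks behind the best one."""
    best = max(heights.values())
    return [name for name, h in heights.items() if best - h > max_lag]

# Example: one endpoint is ~500 blocks behind the others.
heights = {"node-a": 25056727, "node-b": 25056725, "node-c": 25056213}
print(stale_endpoints(heights))  # ['node-c']
```

The point isn’t the threshold — it’s that “which node answered me?” is a question your code can actually ask.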
Types of Nodes
Different chains use different terminology, but the same categories show up everywhere.
Full Nodes
A full node downloads every block, validates every transaction, and maintains the current state independently. It doesn’t trust anyone — it re-derives the state from genesis.
This is the baseline. If you want to use a chain without trusting a third party, you run a full node. The tradeoff is storage and bandwidth — a full Ethereum node is hundreds of gigabytes and climbing.
Archive Nodes
A full node keeps the current state plus some recent history. An archive node keeps all historical state — every account’s balance at every block, every contract’s storage at every block.
You need an archive node if you want to answer questions like “what was this account’s balance 500,000 blocks ago?” Most RPC providers charge more for archive access because the storage costs are serious (multiple terabytes for Ethereum).
Light Nodes
Light nodes don’t store the full chain. They download block headers and rely on full nodes to prove specific facts on demand (via Merkle proofs). They trade some trust assumptions for much smaller footprint — useful for mobile wallets or embedded devices.
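The “prove specific facts” part works roughly like this: the light client holds only a block header, which commits to a Merkle root, and a full node supplies the sibling hashes along the path from a leaf to that root. A toy sketch of the verification step — real chains use more elaborate tree layouts and domain-separated hashing, so treat this as the shape of the idea only:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_proof(leaf: bytes, proof: list[tuple[bytes, str]], root: bytes) -> bool:
    """Fold sibling hashes up the tree; 'L'/'R' says which side the sibling is on."""
    acc = h(leaf)
    for sibling, side in proof:
        acc = h(sibling + acc) if side == "L" else h(acc + sibling)
    return acc == root

# Build a two-leaf tree by hand and prove membership of leaf_a.
leaf_a, leaf_b = b"balance:42", b"balance:7"
root = h(h(leaf_a) + h(leaf_b))
print(verify_proof(leaf_a, [(h(leaf_b), "R")], root))  # True
```

The light client only stores `root` (inside the header); everything else arrives on demand and is checked, not trusted.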
Validator / Consensus Nodes
On proof-of-stake chains (Ethereum post-Merge, Cosmos, Solana, etc.), some nodes actively participate in consensus. They propose blocks, vote on them, and earn rewards. These usually aren’t the nodes you query as an application — they’re busy agreeing on blocks. You query separate RPC nodes that are just serving read/write requests to users.
RPC Nodes
An RPC node is a full node whose job is to serve application requests over the network. It exposes standard APIs (JSON-RPC, gRPC, REST) so wallets, dapps, indexers, and bots can read state and submit transactions without running their own node.
This is the kind of node you hit when you set RPC_URL=https://eth.llamarpc.com in a .env file. Most of the pain in this post is about RPC nodes specifically.
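For EVM chains, “serving application requests” concretely means JSON-RPC over HTTP: you POST a body naming a method like `eth_blockNumber` and get a hex-encoded result back. Building and decoding one without a network call — the response string below is a made-up example, not real chain data:

```python
import json

# A standard Ethereum JSON-RPC request body for the latest block number.
request = json.dumps({
    "jsonrpc": "2.0",
    "method": "eth_blockNumber",
    "params": [],
    "id": 1,
})

# An example response an RPC node might return; the result is hex-encoded.
response = '{"jsonrpc": "2.0", "id": 1, "result": "0x17e5a34"}'
height = int(json.loads(response)["result"], 16)
print(height)  # 25057844
```

Whatever node happens to answer that POST is the node whose view of the chain you get.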
The Two Endpoints Problem (Cosmos Edition)
Cosmos-based chains — including Agoric — typically expose a node over two different interfaces:
- RPC (Tendermint RPC, port 26657) — WebSocket and HTTP. Used for broadcasting transactions, subscribing to block events, and low-level chain queries.
- REST (Cosmos SDK API, port 1317) — HTTP. Used for high-level queries against the Cosmos SDK modules: account info, balances, staking, governance, etc.
Both talk to the same underlying node. But they’re often exposed at different URLs, and if those URLs resolve to different nodes behind a load balancer, you can end up reading from a node that’s minutes behind the one you write to.
That’s the bug I hit.
The Actual Incident
The setup: our planner service broadcasts transactions to Agoric via RPC, and reads the account’s sequence number from the REST endpoint. The sequence number must match exactly what the chain expects — it’s a replay-attack preventer, and the chain rejects any tx whose sequence is off by even one.
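The mechanics in miniature: each account carries a monotonically increasing sequence, each signed tx embeds the sequence the sender believes is current, and the chain rejects anything that doesn’t match exactly. A toy model of that check — the error string mirrors the real log above; the class itself is illustrative, not the Cosmos SDK:

```python
class Account:
    """Minimal model of Cosmos SDK account-sequence checking."""
    def __init__(self) -> None:
        self.sequence = 0

    def apply_tx(self, tx_sequence: int) -> None:
        if tx_sequence != self.sequence:
            raise ValueError(
                f"account sequence mismatch, expected {self.sequence}, "
                f"got {tx_sequence}: incorrect account sequence"
            )
        self.sequence += 1  # a sequence number is consumed on success

acct = Account()
acct.sequence = 3397        # the chain's real view
try:
    acct.apply_tx(3393)     # a broadcast built from a stale read
except ValueError as e:
    print(e)
```

A tx built from a stale read is doomed before it leaves your machine — no amount of re-signing helps until the read is fresh.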
Our env had two URLs:
- main-a.rpc.agoric.net — broadcast target (RPC).
- main.api.agoric.net — sequence number source (REST), resolved from chain-registry inside the planner.
These looked harmless. They’re the same chain, same mainnet. The difference is the subdomain: main-a is a specific node, main is a load-balanced pool.
Four resolver transactions failed within a window of a few hours, all with the same error:
Error: Broadcasting transaction failed with code 32 (codespace: sdk).
Log: account sequence mismatch, expected 3397, got 3393: incorrect account sequence
We already had logic for this — a sequencer that catches the mismatch and re-queries the account’s sequence, then retries. But the retries kept getting the same stale sequence back and hitting the exact same error. After one retry, we gave up and left the tx PENDING forever (or until a restart, which would have re-processed it).
The clue was the consistency: every retry read sequence 3393, but the chain had moved on to 3397. That’s not random flakiness. That’s a specific node that’s stuck behind.
Proving the RPC Lag
I wrote a small Bash loop to poll both endpoints and compare x-cosmos-block-height headers every couple of seconds:
ADDR="agoric1y3e3mlnrkuh6j2qcnlrtap42j8mzw240vwr74j"
for i in $(seq 1 10); do
  # REST height: read the x-cosmos-block-height response header.
  api_h=$(curl -sS -D- "https://main.api.agoric.net/cosmos/auth/v1beta1/accounts/$ADDR" 2>/dev/null \
    | grep -oiP 'x-cosmos-block-height:\s*\K[0-9]+' | head -1)
  # RPC height: read latest_block_height from the Tendermint /status endpoint.
  rpc_h=$(curl -sS "https://main.rpc.agoric.net/status" 2>/dev/null \
    | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['result']['sync_info']['latest_block_height'])")
  diff=$((rpc_h - api_h))
  echo "sample $i: main.api=$api_h main.rpc=$rpc_h DRIFT=${diff} blocks"
  sleep 2
done
Output:
sample 1: main.api=25056724 main.rpc=25056724 DRIFT=0 blocks
sample 2: main.api=25056725 main.rpc=25056725 DRIFT=0 blocks
sample 3: main.api=25056725 main.rpc=25056726 DRIFT=1 blocks
sample 4: main.api=25056726 main.rpc=25056726 DRIFT=0 blocks
sample 5: main.api=25056213 main.rpc=25056727 DRIFT=514 blocks
sample 6: main.api=25056213 main.rpc=25056728 DRIFT=515 blocks
sample 7: main.api=25056728 main.rpc=25056729 DRIFT=1 blocks
sample 8: main.api=25056729 main.rpc=25056729 DRIFT=0 blocks
sample 9: main.api=25056730 main.rpc=25056730 DRIFT=0 blocks
sample 10: main.api=25056214 main.rpc=25056731 DRIFT=517 blocks
That’s not a hiccup. Every few samples, the load balancer routed us to a node more than 500 blocks behind. At roughly 6 seconds per block, 500+ blocks is over 50 minutes of lag. Sometimes we got the fresh node, sometimes the stale one, sometimes the same stale node twice in a row. No pattern, just a bad node in the rotation.
When we got unlucky on both the initial read and the retry, we’d broadcast with a sequence from ~50 minutes ago, and the chain would reject it.
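The ten samples above are small enough to characterize directly. Summarizing the DRIFT column (numbers copied from the run above; the 100-block cutoff is an arbitrary line between jitter and “stuck”):

```python
drifts = [0, 0, 1, 0, 514, 515, 1, 0, 0, 517]  # DRIFT column from the ten samples

stale = [d for d in drifts if d > 100]           # clearly-behind reads
print(f"stale reads: {len(stale)}/{len(drifts)}")  # stale reads: 3/10
print(f"max drift: {max(drifts)} blocks")          # max drift: 517 blocks
# At ~6s per block, 517 blocks is roughly 51 minutes of lag.
print(f"~{max(drifts) * 6 // 60} minutes behind")  # ~51 minutes behind
```

Three in ten reads hitting the bad node lines up with why a single retry so often failed: two unlucky draws in a row is not a rare event at that rate.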
The Fix
Two things.
1. Pin the REST URL to the same pool as the RPC. main-a.api.agoric.net is the REST companion to main-a.rpc.agoric.net — same provider, same nodes, consistent view. The planner was hard-coding the REST URL via chain-registry, so I opened a PR to make it configurable:
AGORIC_REST_URL=https://main-a.api.agoric.net:443
2. Retry settlement submissions properly. Even on a well-behaved RPC, transient failures happen. The resolver was giving up after one retry; it now keeps retrying with backoff and emits a RESOLVER_SETTLEMENT_STUCK alert every 10 minutes while it’s still trying. If the cause is a lagging node, the retries catch a fresh read eventually and the tx lands on its own.
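A sketch of that retry shape, with the clock and sleep injected so the behavior is testable without waiting. The alert name RESOLVER_SETTLEMENT_STUCK is ours; the function shape, delays, and helpers are all illustrative:

```python
import itertools

def settle_with_retry(submit, *, clock, sleep, alert,
                      base_delay=2.0, max_delay=60.0, alert_every=600.0):
    """Keep retrying submit(); back off exponentially; alert every 10 minutes."""
    started = last_alert = clock()
    for attempt in itertools.count():
        try:
            return submit()
        except Exception:
            now = clock()
            if now - last_alert >= alert_every:
                alert("RESOLVER_SETTLEMENT_STUCK", stuck_for=now - started)
                last_alert = now
            sleep(min(base_delay * 2 ** attempt, max_delay))

# Fake time, plus a submit() that fails four times and then lands.
t = [0.0]
outcomes = iter([Exception()] * 4 + ["tx landed"])
def submit():
    nxt = next(outcomes)
    if isinstance(nxt, Exception):
        raise nxt
    return nxt

alerts = []
result = settle_with_retry(
    submit,
    clock=lambda: t[0],
    sleep=lambda s: t.__setitem__(0, t[0] + s),
    alert=lambda name, **kw: alerts.append(name),
)
print(result)  # tx landed
```

The key property: the loop never gives up on its own, and the alert fires on a wall-clock cadence rather than per attempt, so a long stall pages a human without flooding the channel.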
Lessons
A few things I’ll remember:
- An RPC URL is a handle to one node, not to “the blockchain.” Different URLs can disagree about the current state. Load-balanced URLs can disagree with themselves between requests.
- Read-path and write-path consistency matters. If you read account state from one endpoint and write transactions to another, and those endpoints aren’t synchronized, sequence numbers (or nonces on EVM chains) will bite you.
- “We have retry logic” isn’t enough. If your retry re-reads from the same flaky source, you’ll just re-confirm the stale answer. Retries should have some way to route around the thing that just failed — different endpoint, exponential backoff, or at minimum a long enough delay that the node has a chance to catch up.
- Cheap diagnostics go a long way. A 20-line Bash loop was enough to turn “probably an RPC issue?” into “here’s a 500-block drift, every few seconds, from a specific pool.” Having that evidence made the fix obvious.
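One way to act on the first three lessons together: when a read matters (like a sequence number), query more than one endpoint and trust the answer reported at the highest block height. A sketch with the fetchers injected — the endpoint names are the ones from this post, but the `(height, value)` fetcher shape is made up for illustration:

```python
def freshest_read(fetchers):
    """Each fetcher returns (block_height, value); keep the value observed
    at the highest height, so a stale node in the pool loses automatically."""
    results = [f() for f in fetchers]
    height, value = max(results, key=lambda r: r[0])
    return value

# Two simulated REST endpoints: one current, one ~500 blocks behind.
fresh = lambda: (25056731, 3397)   # e.g. main-a.api: current sequence
stale = lambda: (25056214, 3393)   # e.g. a lagging node in the main.api pool
print(freshest_read([fresh, stale]))  # 3397
```

This costs an extra request per read, which is usually a fine trade for a value that, when wrong, wedges a transaction for an hour.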
RPC nodes feel like boring infrastructure until they’re not. Treat them like any other external dependency — assume they can fail, assume they can lie by omission (by being stale), and give your code a path to recover.
This blog post was written with the help of Claude.

