
Building a Parallel EVM Execution Engine in Rust


I wrote earlier about whether Solana-style parallel execution could work on EVM. This is the implementation: a Rust project called pevm that takes a batch of transactions, statically analyzes the contracts involved to determine storage access patterns, builds a dependency graph, and executes independent transactions concurrently on revm. Parallel and sequential execution produce identical results.

Storage access analysis

Two analysis paths feed into the same pipeline, one from Solidity source and one from bytecode.

The source analyzer extracts the storage layout, each function’s read/write sets, and the cross-contract call graph. Because you have the source, you can trace how function parameters flow into mapping keys: balances[msg.sender] -= amount resolves to a write at a specific base slot keyed by the caller’s address, which at scheduling time (once you know the actual caller) becomes a concrete slot number. Transfers between different address pairs therefore write to different concrete slots and can parallelize.
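The key-to-slot step can be sketched in a few lines. This is illustrative, not pevm's actual API: the function name and layout are made up, and std's DefaultHasher stands in for the keccak256 that Solidity really uses, to keep the sketch dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Resolve the concrete slot of `mapping[key]` declared at `base_slot`.
/// Solidity derives it as keccak256(pad32(key) ++ pad32(base_slot));
/// a std hasher stands in here so the sketch needs no dependencies.
fn mapping_slot(base_slot: u64, key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    base_slot.hash(&mut h);
    h.finish()
}

fn main() {
    let balances_base = 0; // assume `balances` sits at base slot 0
    // Once the actual caller is known, `balances[msg.sender]` becomes a
    // concrete slot; different senders land on different slots, so
    // transfers between different address pairs can run in parallel.
    assert_ne!(
        mapping_slot(balances_base, b"alice"),
        mapping_slot(balances_base, b"bob")
    );
    println!("distinct slots");
}
```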

The bytecode analyzer parses the PUSH4/EQ/JUMPI dispatch table to identify function entry points, then walks reachable paths from each to find SLOAD/SSTORE operations. Each access gets classified as Static(slot), Mapping{base_slot}, or Unknown, but since mapping keys can’t be resolved without the actual inputs, bytecode-derived mapping accesses use a wildcard that conflicts with any access to the same base slot. Fewer things parallelize as a result, but you never miss a real conflict.
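The classification and its conservative conflict rule can be sketched as a small enum; the names are illustrative, not pevm's actual types.

```rust
/// Storage access pattern recovered per SLOAD/SSTORE site
/// (illustrative names, not pevm's actual types).
#[derive(Debug, PartialEq)]
enum Access {
    Static(u64),                // concrete slot known at analysis time
    Mapping { base_slot: u64 }, // key unresolved: wildcard over the base
    Unknown,                    // unclassifiable: conflicts with anything
}

/// Conservative check: an unresolved mapping access collides with any
/// access on the same base slot, and Unknown collides with everything.
fn conflicts(a: &Access, b: &Access) -> bool {
    use Access::*;
    match (a, b) {
        (Unknown, _) | (_, Unknown) => true,
        (Static(x), Static(y)) => x == y,
        (Mapping { base_slot: x }, Mapping { base_slot: y }) => x == y,
        (Static(x), Mapping { base_slot: y })
        | (Mapping { base_slot: y }, Static(x)) => x == y,
    }
}

fn main() {
    // Two wildcard accesses on one base slot must be assumed to conflict,
    // even though their real keys might differ at runtime.
    assert!(conflicts(
        &Access::Mapping { base_slot: 3 },
        &Access::Mapping { base_slot: 3 }
    ));
    assert!(!conflicts(
        &Access::Mapping { base_slot: 3 },
        &Access::Mapping { base_slot: 4 }
    ));
    assert!(!conflicts(&Access::Static(1), &Access::Static(2)));
    assert!(conflicts(&Access::Unknown, &Access::Static(7)));
    println!("conservative conflicts ok");
}
```

The false-positive direction is the safe one: a wildcard can only over-serialize, never miss a real conflict.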

Contracts too complex for either path (heavy dynamic jumps, computed storage offsets) get marked opaque, which serializes every transaction that touches them: parallelism is sacrificed, correctness is guaranteed.

In practice the source analyzer covers most common contracts well, while the bytecode path with wildcards handles the rest at reduced parallelism. Both produce the same intermediate representation so you can mix them freely in one batch.

The dependency DAG

For each transaction the DAG builder resolves the complete storage access set by walking cross-contract calls recursively (depth-capped at 4), tracking how arguments map through call frames. If function A on contract X calls function B on contract Y passing its first parameter as Y’s second, the resolver follows that mapping to determine Y’s concrete slots.

msg.sender needs special handling through the call stack: in the top-level frame it's the transaction's caller, but inside a cross-contract call it becomes the calling contract's address. Getting this wrong means resolving slots under the wrong keys and missing real conflicts.

With resolved access sets in hand, two transactions conflict when they access the same (contract, variable, resolved_keys) tuple and at least one writes. Read-read pairs also get flagged since static analysis can’t determine whether a read feeds into a conditional write further down. From the conflict graph the builder computes topological layers with no intra-layer conflicts.

Concrete example

A batch of 6 transactions:

tx0: USDC.transfer(alice → bob, 100)        writes: USDC.balances[alice], USDC.balances[bob]
tx1: Bank.deposit(alice, USDC, 500)          writes: Bank.deposits[alice][USDC], USDC.balances[alice], USDC.balances[bank]
tx2: WETH.transfer(charlie → dave, 50)       writes: WETH.balances[charlie], WETH.balances[dave]
tx3: Bank.deposit(charlie, WETH, 200)        writes: Bank.deposits[charlie][WETH], WETH.balances[charlie], WETH.balances[bank]
tx4: DAI.transfer(eve → alice, 100)          writes: DAI.balances[eve], DAI.balances[alice]
tx5: Bank.withdraw(alice, USDC, 200)         writes: Bank.deposits[alice][USDC], USDC.balances[bank], USDC.balances[alice]

tx0 and tx1 conflict on USDC.balances[alice], tx1 and tx5 conflict on Bank.deposits[alice][USDC], but tx0 and tx2 are fully independent (different tokens, different addresses) and tx4 touches DAI which nothing else in the batch uses.

Layer 0: tx0, tx2, tx4    ← independent, parallel
Layer 1: tx1, tx3          ← tx1 needs tx0, tx3 needs tx2, but they don't conflict with each other
Layer 2: tx5               ← needs tx1 (shared Bank.deposits[alice][USDC])

Three steps instead of six. A contract-level check would serialize tx1 and tx3 since both touch the Bank contract, but slot-level analysis sees that Bank.deposits[alice][USDC] and Bank.deposits[charlie][WETH] are different locations entirely.
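The layering above can be reproduced with a short Kahn-style sketch over the conflict graph. Types and names are illustrative; following the post's conservative rule, any shared access tuple counts as a conflict (read-read included).

```rust
use std::collections::HashSet;

/// A resolved access: (contract, variable, key-path) — strings for brevity.
type Key = (&'static str, &'static str, &'static str);

/// Kahn-style topological layering over the conflict graph. Edges only
/// point from earlier to later block positions, so the graph is acyclic.
fn layers(accesses: &[HashSet<Key>]) -> Vec<Vec<usize>> {
    let n = accesses.len();
    let mut indeg = vec![0usize; n];
    let mut succ: Vec<Vec<usize>> = vec![Vec::new(); n];
    for i in 0..n {
        for j in i + 1..n {
            // Conservative: any shared tuple is a conflict.
            if !accesses[i].is_disjoint(&accesses[j]) {
                succ[i].push(j);
                indeg[j] += 1;
            }
        }
    }
    let mut done = vec![false; n];
    let mut out = Vec::new();
    while done.iter().any(|d| !d) {
        // Everything with no remaining predecessors forms the next layer.
        let layer: Vec<usize> = (0..n).filter(|&i| !done[i] && indeg[i] == 0).collect();
        for &i in &layer {
            done[i] = true;
            for &j in &succ[i] {
                indeg[j] -= 1;
            }
        }
        out.push(layer);
    }
    out
}

fn main() {
    let k = |c, v, key| (c, v, key);
    let txs: Vec<HashSet<Key>> = vec![
        HashSet::from([k("USDC", "balances", "alice"), k("USDC", "balances", "bob")]),
        HashSet::from([k("Bank", "deposits", "alice.USDC"), k("USDC", "balances", "alice"), k("USDC", "balances", "bank")]),
        HashSet::from([k("WETH", "balances", "charlie"), k("WETH", "balances", "dave")]),
        HashSet::from([k("Bank", "deposits", "charlie.WETH"), k("WETH", "balances", "charlie"), k("WETH", "balances", "bank")]),
        HashSet::from([k("DAI", "balances", "eve"), k("DAI", "balances", "alice")]),
        HashSet::from([k("Bank", "deposits", "alice.USDC"), k("USDC", "balances", "bank"), k("USDC", "balances", "alice")]),
    ];
    // Matches the worked example: {tx0, tx2, tx4}, {tx1, tx3}, {tx5}.
    assert_eq!(layers(&txs), vec![vec![0, 2, 4], vec![1, 3], vec![5]]);
    println!("layers ok");
}
```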

Parallel execution on revm

revm’s Evm struct is !Send (it contains Rc<RefCell<...>> and raw pointers), so EVM instances can’t cross thread boundaries, which rules out pooling or dispatching to a thread-pool. Each EVM must be created in the thread that uses it.

The ParallelExecutor works around this with persistent worker threads and channel-based dispatch:

  • Shared state lives in Arc<RwLock<CacheDB>>, with workers reading concurrently during execution.
  • N worker threads spawn at construction, each creating its own EVM locally and persisting for the executor’s lifetime.
  • The main thread sends TxEnv over channels; workers call transact() which returns an EvmState delta without modifying the shared database, then send results back.
  • Between layers the main thread write-locks the database, applies all deltas from the completed layer, and unlocks before the next layer begins.

The transact()/commit() split is what makes this safe: transact() reads shared state and produces a diff while commit() applies it, so multiple workers can transact() concurrently since they only read. The DAG guarantees no two same-layer transactions write overlapping state, which means delta merge order doesn’t matter.
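A minimal sketch of the channel-based dispatch, with a plain HashMap behind Arc<RwLock<...>> standing in for CacheDB and a (slot, value) pair standing in for TxEnv. The structure mirrors the description above, not pevm's actual code.

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc, RwLock};
use std::thread;

// Stand-ins for revm types: the shared DB is a slot -> value map, a
// "transaction" writes `value` to `slot`, and a delta is its write set.
type Db = Arc<RwLock<HashMap<u64, u64>>>;
type Delta = Vec<(u64, u64)>;

struct Worker {
    job_tx: mpsc::Sender<(u64, u64)>,
    res_rx: mpsc::Receiver<Delta>,
}

/// Spawn a persistent worker. A real worker would build its !Send Evm
/// right here, inside the thread that owns it; jobs and results cross
/// thread boundaries only as plain data over channels.
fn spawn_worker(db: Db) -> Worker {
    let (job_tx, job_rx) = mpsc::channel::<(u64, u64)>();
    let (res_tx, res_rx) = mpsc::channel::<Delta>();
    thread::spawn(move || {
        for (slot, value) in job_rx {
            // transact(): read shared state, produce a delta, commit nothing.
            let base = *db.read().unwrap().get(&slot).unwrap_or(&0);
            let _ = res_tx.send(vec![(slot, base + value)]);
        }
    });
    Worker { job_tx, res_rx }
}

fn main() {
    let db: Db = Arc::new(RwLock::new(HashMap::new()));
    let workers: Vec<Worker> = (0..2).map(|_| spawn_worker(db.clone())).collect();

    // One layer: two non-conflicting transactions execute concurrently.
    workers[0].job_tx.send((1, 100)).unwrap();
    workers[1].job_tx.send((2, 50)).unwrap();
    let deltas: Vec<Delta> = workers.iter().map(|w| w.res_rx.recv().unwrap()).collect();

    // Layer boundary: write-lock once, apply every delta, then move on.
    // The DAG guarantees disjoint write sets, so merge order is irrelevant.
    let mut guard = db.write().unwrap();
    for d in deltas {
        for (slot, value) in d {
            guard.insert(slot, value);
        }
    }
    assert_eq!(guard[&1], 100);
    assert_eq!(guard[&2], 50);
    println!("layer committed");
}
```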

A SharedDb wrapper implements DatabaseRef by read-locking the RwLock on each access, but since every SLOAD from every worker contends on this lock, throughput degrades beyond 8 workers. A lock-free per-layer snapshot would eliminate this contention.

Nonce handling

The DAG builder tracks storage conflicts but doesn’t yet account for sender nonce ordering, so transactions from the same sender that land in the same layer all see the same base nonce from the shared database. The fix is straightforward: consecutive same-sender transactions need an implicit dependency edge forcing them into separate layers, which the DAG builder already supports as a mechanism since it’s the same kind of ordering constraint used for storage conflicts. Not implemented yet because the benchmarks use unique senders and it hasn’t caused issues there.
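The proposed fix amounts to one pass over the batch that chains consecutive same-sender transactions. A sketch with a hypothetical helper, not something pevm implements yet:

```rust
use std::collections::HashMap;

/// Hypothetical helper: emit an implicit (from, to) dependency edge
/// between consecutive transactions from the same sender, forcing the
/// later one into a later layer — the same mechanism the DAG builder
/// already uses for storage conflicts.
fn nonce_edges(senders: &[&str]) -> Vec<(usize, usize)> {
    let mut last: HashMap<&str, usize> = HashMap::new();
    let mut edges = Vec::new();
    for (i, &s) in senders.iter().enumerate() {
        // insert() returns the previous index for this sender, if any.
        if let Some(prev) = last.insert(s, i) {
            edges.push((prev, i));
        }
    }
    edges
}

fn main() {
    // alice at positions 0, 2, 3 gets chained; bob at 1 is unconstrained.
    assert_eq!(
        nonce_edges(&["alice", "bob", "alice", "alice"]),
        vec![(0, 2), (2, 3)]
    );
    println!("nonce edges ok");
}
```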

Contract registry

Analysis results persist in an SQLite registry with a normalized schema (contracts → state variables → functions → storage accesses + external calls), so contracts get analyzed once on deploy and the per-block cost is just looking up metadata for involved contracts and running the DAG builder over cached data. Proxy upgrades invalidate the cached analysis for the implementation address.

A graph database would be a natural fit since the data is literally a call graph and the cross-contract resolution maps directly to graph traversal queries, but for my test setup SQLite works fine.

Benchmarks

Setup: 100-400 transactions per batch across 10 ERC-20 tokens and 50 user accounts interacting with a Bank contract (transfers, approvals, deposits, withdrawals), all release builds averaged over 100 iterations with different random seeds.

DAG build

The builder went through several optimization passes: chain edges instead of pairwise enumeration (O(k) per resource bucket rather than O(k²)), Kahn’s algorithm for topological layering, and string interning with the interner inlined into the resolution phase to avoid intermediate Vec<String> allocations that were immediately re-hashed.
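The chain-edge trick is the big win: for k transactions sharing a resource bucket, pairwise enumeration emits k(k-1)/2 edges while chaining emits k-1, and topological layering orders the bucket identically either way since the chain's transitive closure covers every pair. A sketch:

```rust
/// Chain consecutive transactions in a resource bucket: k-1 edges
/// instead of the k*(k-1)/2 a pairwise enumeration would emit, with
/// the same transitive ordering (and therefore the same layers).
fn chain_edges(bucket: &[usize]) -> Vec<(usize, usize)> {
    bucket.windows(2).map(|w| (w[0], w[1])).collect()
}

fn main() {
    assert_eq!(chain_edges(&[3, 7, 9]), vec![(3, 7), (7, 9)]);
    // 100 txs on one hot resource: 99 chain edges vs 4950 pairwise.
    assert_eq!(chain_edges(&(0..100usize).collect::<Vec<_>>()).len(), 99);
    println!("chain edges ok");
}
```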

Isolated measurement (1000 iterations, 300 txs): 737 µs, with resolve at 47%, index 46%, edges 3%, layers 4%. In-context (inside the full benchmark loop with EVM execution between iterations): ~1.8ms for 300 txs, where the 2.4x gap likely comes from CPU cache pollution as the EVM executor and database operations evict the builder’s working set between runs.

Txs    Isolated    In-context    Layers    Max layer
100    ~250 µs     ~830 µs       8.7       55
200    ~500 µs     ~1340 µs      15.2      95
300    ~740 µs     ~1820 µs      21.4      128
400    ~980 µs     ~2190 µs      27.7      159

Linear scaling at ~5-6 µs/tx isolated, with max layer sizes of 55-159 indicating plenty of available parallelism per layer.

Execution

Txs    Workers    Exec only    Exec speedup    End-to-end
100    1 (seq)    721 µs       1.00x           1.00x
100    8          856 µs       0.84x           0.43x
200    1 (seq)    1333 µs      1.00x           1.00x
200    8          1311 µs      1.02x           0.50x
300    1 (seq)    1959 µs      1.00x           1.00x
300    8          1801 µs      1.09x           0.54x
400    1 (seq)    2621 µs      1.00x           1.00x
400    8          2323 µs      1.13x           0.58x
400    16         2430 µs      1.08x           0.57x

Parallelism barely breaks even on execution alone and loses end-to-end. Against an in-memory CacheDB a transaction takes ~7 µs with storage reads completing in nanoseconds (HashMap lookups), so thread coordination overhead dominates and adding workers beyond 8 makes things slower due to lock contention. The DAG build compounds the problem since at 300 txs the in-context cost (~1.8ms) roughly equals the entire sequential execution time (~2.0ms).

This makes sense because the case for parallel execution was always about I/O-bound workloads. On a real node with LevelDB or MDBX a cold SLOAD costs 100µs+ instead of nanoseconds, at which point coordination overhead becomes noise and there’s enough work per transaction to amortize distribution costs. The benchmark contracts (ERC-20 transfers, basic vault operations) are also on the simple end; heavier contracts like DEX routers or lending liquidations would shift the ratio further toward parallelism.

Correctness is validated across all iterations and worker counts, with transactions that revert (randomly generated insufficient balances) producing identical outcomes in both modes.

Future directions

DAG-based scheduling. The layer model is conservative since everything in layer N finishes before N+1 starts, even when a transaction in N+1 only depends on one specific transaction in N. A per-dependency scheduler could release transactions as soon as their individual predecessors complete, but the challenge is database access: layers provide a clean global read/write boundary while per-dependency scheduling needs transactions to see specific predecessor results without a global commit. A sharded database with per-slot-region commits could work but is nontrivial.

State pre-loading. The DAG gives you the entire block’s storage working set before execution starts, so on a disk-backed database you could batch-prefetch every slot upfront to saturate I/O bandwidth and warm the cache before the first transaction runs. This helps even without parallel execution since you’re front-loading I/O instead of paying it incrementally per-SLOAD. EIP-2930 access lists do something similar per-transaction, but the DAG gives you the full block in one shot.

Transaction ordering. Within a layer the DAG guarantees no conflicts so execution order doesn’t affect results, and between layers block ordering is preserved automatically. Block builders could optimize transaction ordering to maximize independent groups (no protocol change needed), or alternatively the protocol could commit the DAG structure alongside the transaction list to let validators skip re-deriving it.

Analysis improvements. Conditional branch analysis to reduce false conflicts in contracts with complex control flow, and EIP-2930 access lists as runtime ground-truth to validate static analysis results.

The code is available on GitHub: github.com/Wally869/parallel_evm