Tendermint is a consensus protocol developed especially for blockchains. It is widely accepted in academia, with hundreds of citations for each of its core documents (Kwon’s paper, Buchman’s thesis, and Buchman, Kwon, Milosevic). It has also been used in production throughout the Cosmos ecosystem since 2018, making it one of the most battle-tested production protocols in the blockchain space. This reliability makes Tendermint a great candidate for Starknet’s consensus protocol.
This post aims to describe a variant of the Tendermint consensus protocol that is adapted for the needs of Starknet.
The proposal is structured in three parts:
- Tendermint terminology
- Tendermint for Starknet
- Summary
The second part outlines the modifications we have in mind to adapt Tendermint for Starknet.
Tendermint terminology
Participants in the Tendermint consensus protocol shall henceforth be referred to as voters. We will assume n voters with identical voting power.
Tendermint is designed to have good properties in the eventual synchrony model (referred to as partial synchrony in post 1 on decentralization). More specifically, when the number of faulty votes f is less than one third of the total number of votes 3f+1≤n, Tendermint has two key properties (both explained in this appendix):
- It is always safe (even during periods of asynchrony)
- It is moreover live during synchrony.
The height of a block is its distance from genesis in the chain, i.e its “block number” in the chain.
Tendermint proceeds in rounds at every height. At each round, one of the voters takes the role of a proposer/leader and proposes a block of transactions to the network. Afterward, the round has two voting phases: prevote
and precommit
.
A node concludes a round in its local view when it is convinced of consensus, either on a valid block B
or on nil
. In the former case, block B
is appended to the chain, the height is incremented, and the round number resets to zero. In the latter case, the round is incremented and the height is unchanged.
In a bird’s eye view of the network, each round begins with a proposal phase, where a block is broadcast by the leader. Upon receiving a block from the proposer, voters broadcast prevote
messages to each other and enter the prevote phase. As prevote
messages propagate throughout the network, voters broadcast precommit
messages and move to the precommit phase. In the happy flow, voters lock on the received block upon receiving >â…” prevote
messages for it. As precommit
messages propagate throughout the network, voters either append their locked block to the blockchain (in the happy flow) or increment the round in the same height (in the unhappy flow).
Tendermint for Starknet
A “Tendermint for Starknet” protocol will essentially follow the state machine outlined in Algorithm 1 of the Tendermint paper. There may be subtle changes to the state machine as a result of nuances and optimization as part of a production system.
In this section we shall highlight several design choices and modifications beyond the state machine itself.
- Transaction streaming
- Coupling sequencing and execution
- Recovery from corrupted consensus
Let’s go one by one.
Transaction streaming
Tendermint has each proposer broadcast its block in one chunk. Upon processing each block, voters will vote on it. Similar to Solana, Tendermint for Starknet will have each proposer sequentially stream its selected transactions upon deciding to include them in its block, without waiting until the block building is finished. Voters will only vote on the block as a whole, and not on streamed transactions.
Streaming has two noteworthy benefits:
- Improved throughput – a more constant stream of transactions provides a more constant input to every voter’s execution process(es), minimizing their idle time. If too many transactions are broadcast in a batch, the execution process(es) must wait to receive input, and this wasted execution time amounts to wasted throughput.
- Smaller validation latency – obviously, waiting for block building to finish also incurs additional latency from the moment a proposer chooses to include a transaction until the moment a voter receives it. This means transactions take longer to reach consensus.
There is still a question of “batch size”: should transactions be streamed individually or in small batches? Each streamed batch incurs the fixed cost of verifying the proposer’s signature, and may require a more robust connection to peers. The question of batch size will be addressed as we familiarize ourselves with engineering subtleties.
Coupling sequencing and execution
A consensus protocol need not involve any execution: it can output a ledger of transactions, and the ledger can be fed to an execution process later. Decoupling transaction sequencing from execution may result in “sequencing runaway”, with execution lagging behind. Observe, the sequencing throughput depends only on bandwidth and connectivity, without any awareness of the underlying semantics of transactions and their execution.
Severe sequencing runaway can negatively impact user experience by interfering with the fee market. For example, suppose transactions are sequenced five minutes before their execution. In this case, there is a five minute queue for execution which cannot be bypassed even by users who are willing to pay more. In summary, sequencer runaway would diminish the effectiveness of a fee market as a short-term priority mechanism to handle congestion.
The runaway problem warrants a backpressure mechanism for the execution process to slow down sequencing. Since the execution resources consumed by a transaction can only be determined upon execution, an obvious way to achieve backpressure is to put execution in the critical path of consensus: transactions must be executed before they reach consensus. Tendermint for Starknet will employ this coupling.
We are already facing open questions around the coupling between sequencing and execution. To give an example, consider the following options for when the proposer executes transactions.
- The proposer will broadcast its selected transactions after executing them, and stop upon reaching the designated maximum of execution resources. An obvious disadvantage is that voters must wait for the proposer to execute each transaction before it can be received, delaying consensus. However, an example advantage is that the proposer can facilitate deterministic concurrency for voters who re-execute, by broadcasting e.g transaction dependencies or intermediate state-diffs.
- The proposer will broadcast transactions before executing them. The broadcast list will not be externalized to users. Instead, voters will execute transactions from the ordered list until the allocated block resources are depleted, at which point they will truncate the list. After execution, voters will reach consensus on the state obtained by executing the truncated list. In this scheme, it is less clear how to facilitate deterministic concurrency.
In a sense, this discussion is premature, because we do not yet understand the subtleties surrounding it. However, we would love to learn from discussions and community feedback, especially from those experienced in these matters.
Recovery from corrupted consensus
Every consensus protocol faces the threat of agreement on an invalid state transition. In BFT protocols with finality – such as Tendermint – the problem of corrupted consensus is delegated to social recovery mechanisms external to the protocol. This approach is only reasonable if we expect essentially no recourse to social recovery, as it can be a sluggish and cumbersome process. Indeed we wish to avoid extended downtime as much as possible, without compromising too much safety.
Two noteworthy points may undermine the above expectation, possibly calling for an alternative approach.
- The presence of proofs outside the critical path of consensus.
- The threat of lazy voters who do not validate.
Before discussing a potential solution, let us elaborate.
We begin with the first point. Currently, any validation of Ethereum’s state transitions requires (re)execution of all transactions. In other words, the validation method is uniform throughout the network. Consequently, any (sufficiently vigilant) full node can quickly:
- detect corrupted consensus,
- convince other full nodes of corrupted consensus.
In other words, the uniform validation method ensures that few vigilant full nodes can quickly sound the alarm and trigger social recovery. Now compare this to Starknet, whose full nodes validate state transitions by verifying validity proofs. The presence of proofs breaks the symmetry between sequencers and other full nodes, allowing weak machines to validate state transitions. If the proving layer halts, e.g due to a bug, then the network is fragmented: powerful machines who can validate by (re)execution are unaffected, but weaker machines who rely on proof verification are shrouded in a mist of uncertainty, unable to detect corrupted consensus. Now each person faces a dilemma: wait for a proof, or pursue social recovery and overturn the consensus on the problematic state transition? Such a scenario (e.g a completeness bug in the prover) may justify a defined recovery mechanism to help coordinate the network.
The second point is simpler. Naive analysis of BFT consensus protocols such as Tendermint assumes >⅔ honest voters, who vigilantly validate before voting. While it’s clear that having too many malicious voters will wreck protocol security, there is also a fine line which is not so easy to categorize as widespread malicious behavior: lazy voters. Our scenario is a tragedy of the commons: each voter quietly stops validating, trusting the job to the others, and subsequently resulting in too few vigilant voters. In such a horrific world, a single malicious proposal of an invalid block can end in consensus on an invalid state transition. How likely is widespread lazy voting? How to combat it? Such problems are not explicitly handled by the Ethereum protocol now, nor do we plan to tackle them in the near future in the Starknet protocol. As such, perhaps they warrant a recovery mechanism in Starknet.
How will Tendermint for Starknet mitigate these concerns, you ask? A simple proposal is to add a fork choice rule to the following effect: if a Starknet block has not been finalized on L1 for k hours, it is invalidated alongside all of its ancestors. This modified fork choice rule defines a clear recovery mechanism in case of a completeness bug in the prover. It also addresses lazy voters, because the soundness of proofs prevents invalid state transitions from being proven, leading to the same flow.
Summary
In this post, we outlined the Tendermint consensus protocol alongside the main modifications required from its adaptation to Starknet. In the next post we will outline the entire Starknet protocol (including proof production and L1 interaction) and define the block validation process.
Looking forward to reading your thoughts!