Tendermint for Starknet

Tendermint is a consensus protocol developed especially for blockchains. It is widely accepted in academia, with hundreds of citations for each of its core documents (Kwon’s paper, Buchman’s thesis, and the paper by Buchman, Kwon, and Milosevic). It has also been used in production throughout the Cosmos ecosystem since 2018, making it one of the most battle-tested production protocols in the blockchain space. This reliability makes Tendermint a great candidate for Starknet’s consensus protocol.

This post aims to describe a variant of the Tendermint consensus protocol that is adapted for the needs of Starknet.

The proposal is structured in three parts:

  1. Tendermint terminology
  2. Tendermint for Starknet
  3. Summary

The second part outlines the modifications we have in mind to adapt Tendermint for Starknet.

Tendermint terminology

Participants in the Tendermint consensus protocol shall henceforth be referred to as voters. We will assume n voters with identical voting power.

Tendermint is designed to have good properties in the eventual synchrony model (referred to as partial synchrony in post 1 on decentralization). More specifically, when the number of faulty voters f is less than one third of the total number of voters, i.e., 3f+1 ≤ n, Tendermint has two key properties (both explained in this appendix):

  • It is always safe (even during periods of asynchrony)
  • It is moreover live during synchrony.
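To see where the one-third bound comes from, here is the standard quorum-intersection argument (using the post’s n and f): take n = 3f+1 voters, and let Q₁ and Q₂ be any two quorums of more than two thirds of the voters, i.e., |Qᵢ| ≥ 2f+1. Then

$$|Q_1 \cap Q_2| \;\ge\; |Q_1| + |Q_2| - n \;\ge\; 2(2f+1) - (3f+1) \;=\; f+1,$$

so any two quorums share at least f+1 voters, at least one of whom is honest. Since an honest voter never votes for two conflicting blocks in the same round, two conflicting blocks can never both gather quorums; this is the heart of safety, and it holds regardless of message delays.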

The height of a block is its distance from genesis in the chain, i.e. its “block number” in the chain.

Tendermint proceeds in rounds at every height. At each round, one of the voters takes the role of a proposer/leader and proposes a block of transactions to the network. Afterward, the round has two voting phases: prevote and precommit.

A node concludes a round in its local view when it is convinced of consensus, either on a valid block B or on nil. In the former case, block B is appended to the chain, the height is incremented, and the round number resets to zero. In the latter case, the round is incremented and the height is unchanged.

From a bird’s-eye view of the network, each round begins with a proposal phase, where a block is broadcast by the leader. Upon receiving a block from the proposer, voters broadcast prevote messages to each other and enter the prevote phase. As prevote messages propagate throughout the network, voters broadcast precommit messages and move to the precommit phase. In the happy flow, voters lock on the received block upon receiving >⅔ prevote messages for it. As precommit messages propagate throughout the network, voters either append their locked block to the blockchain (in the happy flow) or increment the round within the same height (in the unhappy flow).
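To make the happy flow concrete, here is a deliberately minimal, self-contained simulation of a single round (illustrative only: it omits timeouts, nil votes, the full locking rules, gossip, and Byzantine behavior, and none of the names below are Tendermint or Starknet APIs):

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Voter:
    name: str
    chain: list = field(default_factory=list)
    locked: tuple | None = None  # set once >2/3 prevotes are observed

def quorum(n: int) -> int:
    return 2 * n // 3 + 1  # strictly more than two thirds of n voters

def run_round(voters: list[Voter], height: int, round_: int, block: str) -> bool:
    n = len(voters)

    # Propose phase: the round's designated proposer broadcasts `block`.
    # Prevote phase: every voter that received the proposal prevotes for it.
    prevotes = Counter({block: n})

    # Precommit phase: a voter seeing >2/3 prevotes locks on the block
    # and broadcasts a precommit for it.
    precommits = Counter()
    for v in voters:
        if prevotes[block] >= quorum(n):
            v.locked = (round_, block)
            precommits[block] += 1

    # Decision: >2/3 precommits means the block is appended, the height
    # increments, and the round number resets to zero.
    if precommits[block] >= quorum(n):
        for v in voters:
            v.chain.append((height, block))
        return True
    return False  # unhappy flow: same height, round_ + 1

voters = [Voter(f"v{i}") for i in range(4)]   # n = 4 tolerates f = 1 fault
print(run_round(voters, height=0, round_=0, block="B0"), voters[0].chain)
```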

Tendermint for Starknet

A “Tendermint for Starknet” protocol will essentially follow the state machine outlined in Algorithm 1 of the Tendermint paper. There may be subtle changes to the state machine resulting from the nuances and optimizations of a production system.

In this section we shall highlight several design choices and modifications beyond the state machine itself.

  1. Transaction streaming
  2. Coupling sequencing and execution
  3. Recovery from corrupted consensus

Let’s go one by one.

Transaction streaming

In Tendermint, each proposer broadcasts its block in one chunk, and voters vote on it after processing it. Similar to Solana, Tendermint for Starknet will instead have each proposer stream its selected transactions sequentially, as soon as it decides to include them in its block, without waiting for block building to finish. Voters will vote only on the block as a whole, not on individual streamed transactions.

Streaming has two noteworthy benefits:

  1. Improved throughput – a steadier stream of transactions provides a more constant input to every voter’s execution process(es), minimizing their idle time. If transactions are broadcast in large batches, the execution process(es) must idle while waiting for input, and this wasted execution time amounts to wasted throughput.
  2. Lower validation latency – waiting for block building to finish also adds latency between the moment a proposer chooses to include a transaction and the moment a voter receives it, so transactions take longer to reach consensus.

There is still a question of “batch size”: should transactions be streamed individually or in small batches? Each streamed batch incurs the fixed cost of verifying the proposer’s signature, and may require a more robust connection to peers. The question of batch size will be addressed as we familiarize ourselves with engineering subtleties.
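To make the streaming flow concrete, here is a minimal proposer-side sketch (the key, the batch size, and all helper names are made up for illustration; HMAC stands in for a real signature scheme purely to keep the example self-contained):

```python
import hashlib
import hmac
import json

PROPOSER_KEY = b"demo-signing-key"  # placeholder for the proposer's key
BATCH_SIZE = 4                      # the open "batch size" question above

def sign(payload: bytes) -> str:
    return hmac.new(PROPOSER_KEY, payload, hashlib.sha256).hexdigest()

def stream_batches(txs):
    """Yield (batch, signature) pairs as soon as each batch fills,
    instead of waiting for the whole block to be built."""
    for i in range(0, len(txs), BATCH_SIZE):
        batch = txs[i:i + BATCH_SIZE]
        yield batch, sign(json.dumps(batch).encode())

# Voter side: verify one signature per batch (the fixed per-batch cost),
# then feed the batch straight into the execution process(es). Voting
# still happens only on the completed block.
for batch, sig in stream_batches([f"tx{i}" for i in range(10)]):
    assert hmac.compare_digest(sig, sign(json.dumps(batch).encode()))
    print(batch)
```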

Coupling sequencing and execution

A consensus protocol need not involve any execution: it can output a ledger of transactions, and the ledger can be fed to an execution process later. Decoupling transaction sequencing from execution may result in “sequencing runaway”, with execution lagging behind. Observe that sequencing throughput depends only on bandwidth and connectivity, with no awareness of the underlying semantics of transactions or their execution.

Severe sequencing runaway can negatively impact user experience by interfering with the fee market. For example, suppose transactions are sequenced five minutes before their execution. In this case, there is a five-minute queue for execution which cannot be bypassed even by users who are willing to pay more. In summary, sequencing runaway would diminish the effectiveness of a fee market as a short-term priority mechanism to handle congestion.

The runaway problem warrants a backpressure mechanism for the execution process to slow down sequencing. Since the execution resources consumed by a transaction can only be determined upon execution, an obvious way to achieve backpressure is to put execution in the critical path of consensus: transactions must be executed before they reach consensus. Tendermint for Starknet will employ this coupling.

We are already facing open questions around the coupling between sequencing and execution. To give an example, consider the following options for when the proposer executes transactions (a sketch contrasting them follows the list).

  • The proposer will broadcast its selected transactions after executing them, and stop upon reaching the designated maximum of execution resources. An obvious disadvantage is that voters must wait for the proposer to execute each transaction before receiving it, delaying consensus. However, an example advantage is that the proposer can facilitate deterministic concurrency for voters who re-execute, by broadcasting e.g. transaction dependencies or intermediate state-diffs.
  • The proposer will broadcast transactions before executing them. The broadcast list will not be externalized to users. Instead, voters will execute transactions from the ordered list until the allocated block resources are depleted, at which point they will truncate the list. After execution, voters will reach consensus on the state obtained by executing the truncated list. In this scheme, it is less clear how to facilitate deterministic concurrency.
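Here is the promised sketch contrasting the two options (the resource budget, the placeholder execute(), and all numbers are invented; neither function is the actual Starknet design, they simply make the timing difference concrete):

```python
BLOCK_RESOURCE_BUDGET = 100  # hypothetical execution-resource cap per block

def execute(tx) -> int:
    """Placeholder: run the transaction, return its measured resource cost."""
    return tx["cost"]

def option_1_proposer_stream(mempool):
    """Execute-then-broadcast: each transaction is executed before it is
    streamed, and streaming stops once the budget is reached. Voters
    re-execute, possibly aided by the proposer's hints."""
    spent = 0
    for tx in mempool:
        cost = execute(tx)
        if spent + cost > BLOCK_RESOURCE_BUDGET:
            return
        spent += cost
        yield tx  # broadcast only after execution

def option_2_voter_truncate(ordered_txs):
    """Broadcast-then-execute: the voter executes the proposer's ordered
    list until the budget is depleted, then truncates. Every honest voter
    truncates at the same point, so consensus is on the state reached by
    the same prefix."""
    prefix, spent = [], 0
    for tx in ordered_txs:
        cost = execute(tx)
        if spent + cost > BLOCK_RESOURCE_BUDGET:
            break
        prefix.append(tx)
        spent += cost
    return prefix

mempool = [{"id": i, "cost": 30} for i in range(5)]
print(list(option_1_proposer_stream(mempool)))   # first 3 txs fit the budget
print(option_2_voter_truncate(mempool))          # same 3 txs, voter side
```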

In a sense, this discussion is premature, because we do not yet understand the subtleties surrounding it. However, we would love to learn from discussions and community feedback, especially from those experienced in these matters.

Recovery from corrupted consensus

Every consensus protocol faces the threat of agreement on an invalid state transition. In BFT protocols with finality – such as Tendermint – the problem of corrupted consensus is delegated to social recovery mechanisms external to the protocol. This approach is only reasonable if we expect essentially no recourse to social recovery, as it can be a sluggish and cumbersome process. Indeed, we wish to avoid extended downtime as much as possible, without compromising too much on safety.

Two noteworthy points may undermine the above expectation, possibly calling for an alternative approach.

  • The presence of proofs outside the critical path of consensus.
  • The threat of lazy voters who do not validate.

Before discussing a potential solution, let us elaborate.

We begin with the first point. Currently, any validation of Ethereum’s state transitions requires (re)execution of all transactions. In other words, the validation method is uniform throughout the network. Consequently, any (sufficiently vigilant) full node can quickly:

  1. detect corrupted consensus,
  2. convince other full nodes of corrupted consensus.

In other words, the uniform validation method ensures that a few vigilant full nodes can quickly sound the alarm and trigger social recovery. Now compare this to Starknet, whose full nodes validate state transitions by verifying validity proofs. The presence of proofs breaks the symmetry between sequencers and other full nodes, allowing weak machines to validate state transitions. If the proving layer halts, e.g. due to a bug, then the network is fragmented: powerful machines that can validate by (re)execution are unaffected, but weaker machines that rely on proof verification are shrouded in a mist of uncertainty, unable to detect corrupted consensus. Now each person faces a dilemma: wait for a proof, or pursue social recovery and overturn the consensus on the problematic state transition? Such a scenario (e.g. a completeness bug in the prover) may justify a defined recovery mechanism to help coordinate the network.

The second point is simpler. Naive analysis of BFT consensus protocols such as Tendermint assumes >⅔ honest voters, who vigilantly validate before voting. While it’s clear that having too many malicious voters will wreck protocol security, there is a subtler failure mode that is not so easy to categorize as widespread malicious behavior: lazy voters. Our scenario is a tragedy of the commons: each voter quietly stops validating, trusting the job to the others, eventually leaving too few vigilant voters. In such a horrific world, a single malicious proposal of an invalid block can end in consensus on an invalid state transition. How likely is widespread lazy voting? How can it be combated? Such problems are not explicitly handled by the Ethereum protocol today, nor do we plan to tackle them in the near future in the Starknet protocol. As such, perhaps they warrant a recovery mechanism in Starknet.

How will Tendermint for Starknet mitigate these concerns, you ask? A simple proposal is to add a fork choice rule to the following effect: if a Starknet block has not been finalized on L1 for k hours, it is invalidated alongside all of its ancestors. This modified fork choice rule defines a clear recovery mechanism in case of a completeness bug in the prover. It also addresses lazy voters, because the soundness of proofs prevents invalid state transitions from being proven, leading to the same flow.
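As a sketch of how a node might apply such a rule (the value of k and every structure and field name below are invented for illustration), the net effect is to roll the local chain back to its L1-finalized prefix:

```python
import time

K_SECONDS = 4 * 3600  # hypothetical choice of k = 4 hours

def apply_recovery_rule(chain, finalized_on_l1, now=None):
    """chain: blocks oldest-first, each a dict with `id` and a `seen_at`
    timestamp; finalized_on_l1: ids of blocks whose proofs L1 accepted."""
    now = time.time() if now is None else now
    timed_out = any(
        b["id"] not in finalized_on_l1 and now - b["seen_at"] > K_SECONDS
        for b in chain
    )
    if not timed_out:
        return chain  # every block is within its finalization deadline
    # Recovery: keep only the L1-finalized prefix; the timed-out block and
    # everything not yet finalized is invalidated, and consensus resumes
    # from the tip of the prefix.
    prefix = []
    for b in chain:
        if b["id"] not in finalized_on_l1:
            break
        prefix.append(b)
    return prefix
```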

Summary

In this post, we outlined the Tendermint consensus protocol alongside the main modifications required for its adaptation to Starknet. In the next post we will outline the entire Starknet protocol (including proof production and L1 interaction) and define the block validation process.

Looking forward to reading your thoughts!

As always, great work Ilia. We will be taking a look at this in the next few days!

I have re-read this, Ilia, and I don’t understand the advantage of broadcasting transactions after executing them. Could you please explain it a little bit more?

Thanks in advance.

Hey Fede!

As soon as the proposer executes, it can help others parallelize re-execution (details below). In such a scheme, only the proposer of every block would have purely sequential execution.

If the proposer executes before broadcasting, one can require “parallelization hints” to be part of the broadcast alongside the transactions themselves.

I can think of two main advantages of such concurrency:

  1. Larger blocks.
  2. Less need for powerful cores.

1

Even in the set of voters, we expect some variety in computational power. On the other hand, we want all voters to validate blocks before voting. This imposes a throughput constraint on Starknet: weaker voters must be given time to validate blocks built by the strongest proposer.

To mitigate this throughput constraint, we should minimize validation time for voters without requiring stronger cores. A natural way to achieve this is to parallelize validation.

I slightly oversimplified above: in fact, we want to give voters enough time to make up for small lags and catch up to the tip of the chain. Hence validation must be even faster than block production. This strengthens the benefit of concurrent validation.
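As a toy model (the notation here is illustrative, not from the post): if a block is produced every t seconds and validation takes v seconds per block, a voter lagging ℓ blocks behind closes the gap at a rate of 1/v − 1/t blocks per second, so catching up takes

$$\frac{\ell}{\frac{1}{v} - \frac{1}{t}} \;=\; \frac{\ell\, t\, v}{t - v}\ \text{seconds},$$

which is finite only when v < t and blows up as v approaches t. Parallelizing validation shrinks v, keeping the catch-up time manageable.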

2

Voters then need to optimize the strength of their execution core(s) only for block building. I say core(s) because they may employ optimistic parallelization.

Parallelization hints

Currently every contract call is a syscall, which is handled differently by the Starknet OS (Cairo program) and the sequencer backend. Specifically, the OS (used for proving) is concurrent by nature while the (current!) backend is not. Modifying the backend to more closely resemble the OS in terms of concurrency will facilitate a clearer “parallelization hint” mechanism.

In the OS, you guess the result of every syscall and continue execution; at the end you verify all the syscall guesses. This obviously makes no sense for the proposer, as it cannot know the call results. However, if a voter receives an ordered list of all contract call results from the proposer, it can parallelize every contract call. Each time I encounter the call_contract syscall, I open a new thread to run it, and continue executing with the proposer’s suggested result. In a way, I’m validating the proposer’s execution without naively re-executing. This can be extended to other operations where non-determinism makes sense. For example, the sqrt libfunc: the proposer can tell me the root, and I can just square it and verify.
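A toy sketch of the idea (the hint format, the thread-per-call scheme, and the stand-in run_call are all invented for illustration and are not the Starknet OS):

```python
from concurrent.futures import ThreadPoolExecutor

def run_call(call):
    """Stand-in for executing one contract call; here it just sums args."""
    return sum(call["args"])

def validate_with_hints(calls_with_hints):
    """Continue execution with each hinted result immediately, while the
    actual re-execution of every call runs concurrently as a check."""
    with ThreadPoolExecutor() as pool:
        checks = [
            pool.submit(lambda c=call, h=hinted: run_call(c) == h)
            for call, hinted in calls_with_hints
        ]
        return all(f.result() for f in checks)

def check_sqrt_hint(x: int, root: int) -> bool:
    # Verify the proposer's root by squaring; never recompute the sqrt.
    return root * root <= x < (root + 1) ** 2

calls = [({"args": [1, 2]}, 3), ({"args": [5, 5]}, 10)]
print(validate_with_hints(calls), check_sqrt_hint(17, 4))  # True True
```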

I think that integrating the Tendermint consensus protocol into Starknet might offer some valuable benefits. From what I gather, Tendermint’s track record of reliability in other blockchains could provide a stronger sense of security and stability to Starknet.

The main perk, in my humble opinion, is its ability to reach consensus efficiently even when dealing with malicious actors, which seems like a promising way to ensure smoother operations.

The adjustments proposed to fit Starknet’s needs, such as transaction streaming and addressing concerns about “lazy voters,” strike me as potential enhancements for better performance and security.

Considering that Tendermint is an established protocol seducing even the best devs in the world (Hi Abdel!), I believe it could contribute to the long-term growth and engagement of Starknet within the broader blockchain community, encouraging collaborative learning.

All in all, I think that incorporating Tendermint could be a strategic move to elevate Starknet’s effectiveness and reputation, great work proposing it.

Great discussion here Ilia!

I’ll provide some remarks on a specific implementation of Tendermint, called CometBFT. Remarks grounded in an implementation are relevant because they carry real-world lessons. Hopefully this will give more substance to the discussion and help move it forward with more clarity.

1. On transaction streaming

The idea to stream transactions is in contrast to the CometBFT implementation, where the proposer first constructs the block and then disseminates it as a whole.

Theoretical impact

Transaction streaming should enable more parallelism in the system. Consider the following two tasks:

  • A is the task of a proposer to assemble and broadcast the proposal to a specific voter,
  • and B is the task of that voter to validate the transactions in the block.

Typically, A and B are on the critical path of the consensus protocol lifecycle. Without transaction streaming, tasks A and B run in sequence. Transaction streaming allows the two tasks to overlap: task B can start at each voter before task A finishes. The larger tasks A (i.e., the larger the blocks) and B (i.e., the more compute- and memory-intensive transaction validation is) are, the greater the benefits of transaction streaming.
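To quantify the overlap with a rough model (T_A, T_B, and the batch count m are notation introduced here, not CometBFT’s): without streaming, a proposal round costs roughly T_A + T_B, while streaming the block as m batches pipelines the two tasks:

$$T_{\text{stream}} \;\approx\; \max(T_A, T_B) \;+\; \frac{\min(T_A, T_B)}{m},$$

which approaches max(T_A, T_B) as m grows, at the price of per-batch overhead such as the signature verification discussed in the original post.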

Concerns

In practice, transaction dissemination in CometBFT happens on two paths:

  1. It happens as part of the background gossip among nodes in the P2P layer. Whenever a node observes a transaction from a user, it adds it to its own mempool and then gossips it to its neighbors.
  2. As part of the block lifecycle, on the critical path of consensus. This is equivalent to task A above.

There is redundancy in this design, so optimizations are possible. Specifically, it is possible to eliminate the need for transaction dissemination as part of (2). There is actually a production network experimenting with this optimization: Sei (whitepaper), who recently launched their mainnet. They operate on a fork of Comet, and we are looking into bringing their optimization into mainline CometBFT. Other optimizations are also possible to help improve throughput and latency.

Given the above, transaction streaming seems to make the most sense in the absence of path (1). It depends on the envisioned architecture of the Starknet sequencer network, but in general path (1) is very important because it allows nodes in the network to populate their respective mempools and be ready to propose a block with valuable/recent transactions when it is their turn to be the chosen proposer.

2. Coupling sequencing and execution

The architecture of applications built on CometBFT has two logical layers: CometBFT and the application itself. Comet is in charge of producing blocks, and in doing so it requests input from the application. Throughout the lifecycle of each block, Comet demands input from the application three times. In other words, there are three points of synchronization between Comet and the application. Assuming CometBFT v0.37, these three points are, roughly speaking: PrepareProposal, ProcessProposal, and FinalizeBlock.

The final point of synchronization, FinalizeBlock, is particularly relevant. At this point, CometBFT assumes that the application executes the transactions and computes the new root of the application state (Merkle tree root hash). Without this assumption, a runaway problem is indeed possible. In a shared sequencer design, transaction execution is not necessary, but as the original post states, it makes sense to couple sequencing with execution.
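To illustrate where these synchronization points sit, here is a schematic sketch (not real ABCI bindings: the method names echo the v0.37 terminology above, but the signatures and everything else are invented):

```python
import hashlib
import json

class Application:
    def __init__(self):
        self.state = {}

    def prepare_proposal(self, mempool_txs):
        # Sync point 1 (proposer side): choose and order the block's txs.
        return [tx for tx in mempool_txs if self.is_valid(tx)]

    def process_proposal(self, txs) -> bool:
        # Sync point 2 (voter side): accept or reject before voting.
        return all(self.is_valid(tx) for tx in txs)

    def finalize_block(self, txs) -> str:
        # Sync point 3: execute the decided block and return the new app
        # hash. Consensus waits here, which is what prevents runaway.
        for tx in txs:
            self.state[tx["key"]] = tx["value"]
        payload = json.dumps(self.state, sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()

    def is_valid(self, tx) -> bool:
        return isinstance(tx, dict) and "key" in tx and "value" in tx

app = Application()
block = app.prepare_proposal([{"key": "a", "value": 1}, {"bad": True}])
assert app.process_proposal(block)
print(app.finalize_block(block)[:16])  # prefix of the new app hash
```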

A quick thought looking for feedback

Problem statement:
A chain is only as decentralized as its validators.
Delegated proof of stake employs economics to solve this.
The problem is that open economies naturally lead to the 10% who are really good at capital allocation owning over 85% of the value.
This does not lend itself well to decentralized governance.

Solution:
Step 1 - The Chain
Imagine a PoS ZKP state machine, where the validators are blind to the state changes they are validating, and blind to who sent these transactions (using relayers).
Essentially this makes all transactions indistinguishable from one another. The state machine is either on and processing transactions, or off.

Step 2 - Governance
Imagine a wallet that lets you choose the validators you trust, enabling you to seamlessly interact with any protocol that has enough of your trusted validators.
Going further, what if users could select preferences for general chain governance, so as to only use chains that follow a set of rules they are comfortable with?

Of course, a wallet with good UX is crucial, providing users with a recommended trust setup that they can toggle. Perhaps big names in the space could have their trust setups copied with one-click UX, while wallet prompts help with security.

Protocols & liquidity are incentivized towards chains with the most users. This translates a user’s trust threshold into a practical vote on governance.

Abstract summary:
A democratic republic for the digital world: this setup lets each person have a vote in governance. As for how we know what’s human and what isn’t, the onus is on the validators, protocols, and liquidity to figure out where the users are, for it is in their best interest to do so.

Technical summary:
Stake slashing, blind validation, reputational risk, and user side custom trust thresholds.