Starknet Decentralized Protocol III - Consensus

This is post III, which continues post I (introduction) and post II (leader elections) by diving into the open questions and trade-offs surrounding the consensus protocol for Starknet. The discussion is organized into four sections:

  1. Dual ledgers for safety and liveness
  2. Slashing equivocators
  3. Responsiveness
  4. Streamlining the critical path to consensus

We have tried to make the tone as informal as possible, citing technical references for the interested reader.

The next post will dive into thoughts and questions revolving around proofs in the protocol.

Dual ledgers for safety and liveness

In this section we outline the dual-ledger design for consensus protocols used by Ethereum and then discuss its applicability to Starknet.

Partial synchrony and CAP

Recall (from post I) the partially synchronous network model which provides a convenient middle ground between synchrony and asynchrony by handling network faults of unknown but bounded duration. For instance, this sensible model captures the idea of power outages of finite duration. Unfortunately it also raises a seemingly insurmountable obstacle known as the CAP theorem: during asynchrony, no protocol can satisfy both safety and liveness. Asynchrony can cause network partitions, which provide a simple illustration of the underlying idea. Suppose the network is entirely honest and partitioned into parts A,B. If the protocol is live then A,B progress independently, contradicting safety. If the protocol is safe then A,B cannot progress independently, contradicting liveness.

Dual ledgers: circumventing CAP with checkpoints

As remarked above, the CAP theorem precludes a single consensus protocol from being both safe and live under asynchrony. This obstruction can be circumvented by cleverly coupling a live protocol with a safe protocol. Write \Pi_\text{safe},\Pi_\text{live} for safe and live protocols respectively. The output of \Pi_\text{live} is called the live ledger. The idea is to create a feedback loop:

  1. First, the live ledger is fed as the input to \Pi_\text{safe} . The output of the composition is a checkpointed live ledger.
  2. Next, the checkpointed live ledger is fed back into \Pi_\text{live} . This ensures the checkpointed live ledger is always a prefix of the live ledger.
  3. Repeat

During asynchrony, the safe protocol \Pi_\text{safe} stalls. In this case, the checkpointed prefix of the live ledger lags behind while the live ledger itself continues advancing. For more information, see e.g. this paper and this paper.
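To make the feedback loop concrete, here is a minimal Python sketch of the composition under toy assumptions: blocks are opaque strings, both protocols are replaced by trivial placeholders (the names pi_live and pi_safe are ours, not part of any specification), and asynchrony is modeled as a flag that makes the safe protocol stall.

```python
# Toy model of the dual-ledger feedback loop; not a real consensus protocol.

def pi_live(checkpointed, live, new_blocks):
    """Extend the live ledger; the checkpointed ledger must remain a prefix of it."""
    assert live[:len(checkpointed)] == checkpointed
    return live + new_blocks

def pi_safe(checkpointed, live, safe_is_live):
    """Checkpoint a longer prefix of the live ledger, or stall during asynchrony."""
    if not safe_is_live:
        return checkpointed  # CAP: safety is kept, but no progress is made
    return live              # finalize everything seen so far

live, checkpointed = [], []
schedule = [(["b1"], True), (["b2"], False), (["b3"], True)]  # (new blocks, synchronous?)
for new_blocks, synchronous in schedule:
    live = pi_live(checkpointed, live, new_blocks)           # live ledger keeps advancing
    checkpointed = pi_safe(checkpointed, live, synchronous)  # checkpoints stall under asynchrony
    print("live:", live, "checkpointed:", checkpointed)
```

During the asynchronous round the checkpointed ledger lags behind, and it catches up once synchrony is restored.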

Ethereum has been using such a dual-ledger design for its consensus protocol since the introduction of Casper – a finality gadget in the form of a safe consensus protocol (essentially chained Tendermint).

Checkpoints for Starknet

In this subsection we will examine checkpoints in the context of a dual-ledger consensus protocol for Starknet. Two things to keep in mind:

  • The consensus protocol will be purely PoS.
  • Stake will be managed by an L1 contract.

L1 checkpoints

Under the partially synchronous model, designing the L2 consensus protocol of a validity rollup inevitably faces the same CAP limitations. The dual ledger design outlined above delivers the best of both worlds. But what about L1?

By design, the security of validity rollups eventually maxes out at L1 security (upon L1 state updates). To approach this level of confidence with low latency, it is natural to consider L1 checkpoints, sent via Ethereum transactions. We see two main trade-offs involved:

  1. Cost vs confidence – interaction with L1 is expensive; on the other hand, L1 is safe and visible by assumption. Furthermore, L1 can enforce logic of all kinds, e.g. slashing rules.
  2. Cost vs latency – low latency requires increased frequency, which translates to increased costs.

To mitigate the cost factor, L1 checkpoints can be restricted to commitments certified by a majority of stake. In this case they will not provide any data availability about the state of Starknet, but only evidence of L2 consensus (via quorum certificate). For latency, we are fine with several minutes.
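As a rough illustration, a checkpoint of this restricted kind might look as follows in a Python sketch; all names and fields are our assumptions for illustration rather than a specification:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuorumCertificate:
    block_hash: bytes            # L2 block (or commitment) being certified
    height: int                  # L2 height of that block
    signers: frozenset[str]      # validator identities that signed
    aggregate_signature: bytes   # e.g. an aggregated BLS signature

@dataclass(frozen=True)
class L1Checkpoint:
    state_commitment: bytes      # commitment to the L2 state, not the state itself
    qc: QuorumCertificate        # evidence of L2 consensus; provides no data availability
```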

Here are some more questions of interest regarding L1 checkpoints.

  1. What is verified upon submission of L1 checkpoints? This is a cost/confidence trade-off, since L1 verification of the quorum certificate consumes L1 gas (see the sketch after this list).

  2. To what extent can there be conflicting checkpoints? At one extreme is a single checkpoint chain. At the other extreme lies a checkpoint tree. Many intermediate alternatives are possible.

  3. Checkpoint durability? Should newer and/or larger quorum certificates be allowed to override existing ones or are there rigid expiration dates?
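Regarding the first question above, here is a minimal illustration of one possible meaning of “verifying” a checkpoint: check on L1 that the signers behind the quorum certificate hold a quorum of stake. The more-than-two-thirds threshold and the omitted signature verification are illustrative assumptions; performing real signature verification on-chain is where most of the gas cost would lie.

```python
# Illustrative quorum check for an L1 checkpoint: do the signers hold more
# than 2/3 of total stake? Signature verification is deliberately left out.

def meets_quorum(signers: set[str], stake_table: dict[str, int]) -> bool:
    total_stake = sum(stake_table.values())
    signed_stake = sum(stake_table.get(v, 0) for v in signers)
    return 3 * signed_stake > 2 * total_stake

stake = {"v1": 40, "v2": 35, "v3": 25}
print(meets_quorum({"v1", "v2"}, stake))  # True:  75 of 100 stake signed
print(meets_quorum({"v1", "v3"}, stake))  # False: 65 of 100 is not > 2/3
```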

Digression: checkpoints for fast finality

Starknet L1 state updates must follow the L2 consensus. Hence it is natural to couple proofs with checkpoints. Moreover, the planned latency of Starknet L1 finality is in the realm of several hours, and checkpoints can be used to design fast finality mechanisms. Our thoughts on such mechanisms are outlined separately in an upcoming post which defines some desirable properties and suggests ways of attaining them.

Slashing equivocators

We have previously decided that the Starknet consensus protocol will be leader-based with PoS Sybil resistance. This makes it vulnerable to the “nothing at stake” problem in the form of equivocation. Specifically, a leader can equivocate by producing conflicting blocks during its turn at negligible cost (compared to PoW). If a dual-ledger design is used, checkpoint equivocation is also possible.

Ethereum defends against equivocation with slashing rules. However, slashing instances must be part of the ledger in order to take effect, which means they can be censored by a minority of ⅓ of stake. By maintaining the slashing logic on L1, Starknet will be able to defend against equivocation even by a malicious majority of stake. Note that Starknet’s L1 state updates act as unforkable checkpoints, which restricts the scope of backward simulation attacks (suddenly presenting an alternative history) even without any slashing.
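For illustration, equivocation evidence can be as small as two signed headers. Below is a hypothetical Python sketch of the check a slashing mechanism might perform; the types, field names, and stubbed signature check are our assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SignedHeader:
    signer: str        # validator identity
    height: int        # L2 block height (or checkpoint index)
    block_hash: bytes
    signature: bytes

def signature_valid(header: SignedHeader) -> bool:
    return True  # placeholder for real signature verification

def is_equivocation(a: SignedHeader, b: SignedHeader) -> bool:
    """Two valid signatures by the same validator, at the same height, on
    different blocks constitute slashable equivocation evidence."""
    return (
        a.signer == b.signer
        and a.height == b.height
        and a.block_hash != b.block_hash
        and signature_valid(a)
        and signature_valid(b)
    )
```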

Responsiveness

In our leader-based setting it would be nice for a leader to drive consensus at the pace of the actual network delay δ as opposed to the maximal network delay Δ, since this would decrease latency and increase throughput. Roughly speaking, responsiveness means that during synchrony, transaction confirmation time depends only on the actual network delay δ and not on any a priori upper bound Δ. For example, the Bitcoin protocol is non-responsive because its block difficulty is an explicit function of a pre-specified upper bound Δ. Consequently, transaction confirmation time suffers from the looseness in the estimate of Δ.

Pass and Shi distill a formal definition of responsiveness and also establish an unfortunate trade-off between performance and security: responsive secure consensus protocols can only tolerate <⅓ corruptions, even during synchrony. The argument resembles the proof of CAP and we outline it for three nodes A,B,M with M malicious. Essentially, M partitions the network by presenting contradicting world views to A,B. Specifically:

  • M causes communication A↔B to have maximal latency Δ while maintaining communications A↔M and B↔M at a network delay δ≪Δ.
  • M behaves honestly in each of the communications A↔M and B↔M, but sends conflicting information to A,B.
  • Consequently A thinks it and M are online and B is offline, and therefore perceives communication A↔M as an honest execution of the protocol at ⅓ failures. Similarly with A,B switched.
  • Assuming the protocol is responsive at ⅓ failures, A,B perform conflicting updates to their ledger, breaking consistency.

In light of the above argument, the best we can hope for is optimistic responsiveness: transaction confirmation time depends only on the current maximum latency causable by malicious agents. Perhaps the first partially synchronous protocol which is both safe and optimistically responsive is a leader-based BFT protocol called HotStuff. As shown in the paper, the responsiveness property achieved in this leader-based context means honest leaders can drive the consensus forward at actual network delay δ. A refinement of HotStuff is used in production by Aptos. However, since HotStuff is not live during asynchrony, it cannot serve as \Pi_\text{live} in a dual-ledger design. We are not aware of any battle-tested PoS protocols that are optimistically responsive and live under asynchrony. Here’s a nice technical reference.
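To illustrate the operational difference, here is a toy Python model (not HotStuff itself, and all numbers are arbitrary assumptions): a non-responsive round always waits out the pessimistic bound Δ, while an optimistically responsive round advances as soon as a quorum of votes arrives, i.e. after roughly one round trip of the actual delay δ when the leader is honest.

```python
DELTA_MAX = 10.0  # pessimistic a-priori bound on network delay (seconds)

def round_duration_non_responsive() -> float:
    """Waits out the full timeout regardless of how fast messages actually arrive."""
    return DELTA_MAX

def round_duration_responsive(delta_actual: float, quorum_collected: bool) -> float:
    """Advances after roughly one round trip once a quorum of votes arrives,
    falling back to the timeout only if the quorum never forms."""
    return 2 * delta_actual if quorum_collected else DELTA_MAX

print(round_duration_non_responsive())        # 10.0
print(round_duration_responsive(0.2, True))   # 0.4  -- honest leader, fast network
print(round_duration_responsive(0.2, False))  # 10.0 -- no quorum, fall back to timeout
```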

We suspect responsiveness properties will play an important part in achieving lower latency and higher throughput. Motivated by this suspicion, we are thinking about an optimistically responsive protocol suitable for producing Starknet’s live ledger.

Streamlining the critical path to consensus

The critical path to consensus is the sequence of operations performed after the previous execution of the consensus protocol to complete the current execution. For example, the leader’s critical path may include block-building and computing a commitment to the state obtained by executing the block. Every “obstacle” in the critical path to consensus reduces throughput. What can be taken out? Are the benefits worthwhile for Starknet?

State commitments

Every Ethereum block contains a commitment to the state following its execution. Hence the calculation of state commitments is in the critical path to consensus: it must be completed before the block is propagated. An obvious way to streamline the critical path for Starknet is to exclude any state commitments from the block body. A less extreme approach involves delayed computation, e.g. the body of block n+k must contain a commitment to the state following execution of block n. In this case, assuming some predictability, leader n+k may compute the commitment in advance or at least in parallel with building its own block. The price of the latter optimization is increased latency for non-executing nodes who wish to verify storage proofs with respect to the state at the tip of the chain, since they must now wait k blocks.
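Here is a minimal sketch of the delayed-commitment rule, where the value of k and the field names are illustrative assumptions: block n+k carries the commitment to the state obtained after executing block n, and the first k blocks carry none.

```python
from dataclasses import dataclass
from typing import Optional

K_DELAY = 2  # hypothetical delay parameter k

@dataclass(frozen=True)
class Block:
    number: int
    transactions: tuple
    delayed_state_commitment: Optional[bytes]  # state root after block (number - K_DELAY)

def commitment_target(block: Block) -> Optional[int]:
    """Which block's post-state this block must commit to (None for the first K_DELAY blocks)."""
    target = block.number - K_DELAY
    return target if target >= 0 else None
```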

DAGs: decoupling consensus from block proposal

“Monolithic” leader-based BFT consensus protocols like Tendermint and HotStuff place both block building and block propagation in the critical path to consensus: the leader propagates its block during its turn. The key idea behind DAG-based protocols is to abstract the block proposal layer from the consensus layer and give it a life of its own. Consensus then becomes an overlay that “merely” orders proposed blocks. Apart from streamlining the critical path, this decoupling further benefits throughput by enabling many agents to simultaneously propose blocks for ordering. The DAG (directed acyclic graph) vertices are blocks, while edges are causality relations: a block may refer to several blocks, each reference signifying chronological precedence.
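A minimal Python sketch of the data structure, with illustrative names: each vertex is a block that references several earlier blocks, and those references induce a causal partial order that the consensus overlay later linearizes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DagBlock:
    author: str
    round_number: int
    payload: tuple             # batched transactions (opaque here)
    parents: frozenset[str]    # hashes of referenced earlier blocks (causal edges)

def causal_history(tip_hash: str, blocks: dict[str, DagBlock]) -> set[str]:
    """All blocks reachable from tip_hash via parent references: the set a
    consensus overlay would order once this tip is committed."""
    seen, stack = set(), [tip_hash]
    while stack:
        h = stack.pop()
        if h in seen or h not in blocks:
            continue
        seen.add(h)
        stack.extend(blocks[h].parents)
    return seen
```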

In the PoS setting, Narwhal is a DAG-based PoS mempool protocol serving as a block proposal layer. Bullshark is an elegant partially synchronous PoS consensus overlay for Narwhal. Both are used in production by Sui.

In the PoW setting, DAG KNIGHT is a beautiful DAG-based generalization of Nakamoto consensus used in production by Kaspa. Among several lucid insights, the paper observes the essential independence of the block proposal layer from any upper-bound on network delay Δ. DAG KNIGHT makes full use of this property of the DAG to achieve optimistic responsiveness.

Wen DAG-based Starknet?

Summary

Having described some of our thoughts regarding the consensus protocol for Starknet, we encourage the community to participate in the discussion, offer thoughts and feedback, and help design the Starknet decentralized protocol.

In upcoming posts we will share concrete proposals for a decentralized protocol that addresses some of the open questions presented in the series of posts.

Thank you, clearly explained.
How will a PoS-aware consensus protocol be achieved without compromising decentralization in decision making?

Hi there @ipblat! By “decision making” are you referring to participation in the consensus protocol or to governance?

Thank you for your interest. Both of these concepts. The point is that I am observing a certain picture in the Cosmos ecosystem. It’s really good, but there’s one BUT… From project to project, the validators are the same faces. This borders on enabling manipulations between projects (or should I say blockchains) within the ecosystem. To give you an example, a group of validators in different blockchains of the same ecosystem agree to transfer liquidity from one blockchain to another. I repeat, this is an example and you can think of many similar schemes. You can observe this picture and realize that it is close to the biggest evil within the same ecosystem.

Dear @ipblat, I am trying to understand whether your concerns pertain to the standard reservations about PoS or to “manipulations” that are specific to validity rollups.

I did not quite follow what you meant by the liquidity manipulation. Are you concerned that stakeholders of native tokens of several different blockchains may form a malicious cartel?

Gm, thanks a lot for this great article. I have a few questions/remarks:

  1. Do you think responsiveness is the ultimate bottleneck? Do you have any idea how much more efficient optimistically responsive PoS is than traditional PoS?

  2. Paradigm wrote a POC implementing Narwhal/Bullshark in the Cosmos SDK; it could be a great starting point for a POC for Starknet. Are you considering starting from an SDK or starting from scratch?

  3. Finally, do you see anything special to Starknet that could sit in this path due to its particular L2 architecture? (Trying to figure out if there’s another bottleneck here that could reduce the performance of a DAG-based PoS.)

Dear @matteo, thanks for your questions!

  1. I don’t feel entirely comfortable with “ultimate bottleneck”. For account-based systems, concurrency is tricky and I suspect sequential execution is a considerable bottleneck. Frankly, I just don’t know. For comparative efficiency I think it only makes sense to compare two concrete protocols. Maybe you can find quantitative comparisons of Tendermint and HotStuff. :)

  2. The SDK sounds interesting - thanks for bringing it to our attention!

  3. If I understand your question correctly then no - I don’t see L2-specific obstacles in the critical path. Interaction with L1 is meant to periodically anchor consensus, which seems outside the critical path to me. A potential “obstacle” may come in the future, when small proofs are frequently propagated off-chain to facilitate execution-free verification of state transitions. For example, if the sequencer must compute a proof of the execution of its block of transactions before proposing the block for consensus, the critical path is burdened. This can be resolved e.g. using “delays” as described in the post.

Just to make sure, you don’t suggest having a faster finality than L1 right? 'Cause that would be impossible.
