Starknet Decentralized Protocol III - Consensus

This is post III, which continues post I (introduction) and post II (leader elections) by diving into the open questions and trade-offs surrounding the consensus protocol for Starknet. The discussion is organized into four sections:

  1. Dual ledgers for safety and liveness
  2. Slashing equivocators
  3. Responsiveness
  4. Streamlining the critical path to consensus

We have tried to make the tone as informal as possible, citing technical references for the interested reader.

The next post will dive into thoughts and questions revolving around proofs in the protocol.

Dual ledgers for safety and liveness

In this section we outline the dual-ledger design for consensus protocols used by Ethereum and then discuss its applicability to Starknet.

Partial synchrony and CAP

Recall (from post I) the partially synchronous network model which provides a convenient middle ground between synchrony and asynchrony by handling network faults of unknown but bounded duration. For instance, this sensible model captures the idea of power outages of finite duration. Unfortunately it also raises a seemingly insurmountable obstacle known as the CAP theorem: during asynchrony, no protocol can satisfy both safety and liveness. Asynchrony can cause network partitions, which provide a simple illustration of the underlying idea. Suppose the network is entirely honest and partitioned into parts A,B. If the protocol is live then A,B progress independently, contradicting safety. If the protocol is safe then A,B cannot progress independently, contradicting liveness.

Dual ledgers: circumventing CAP with checkpoints

As remarked above, the CAP theorem precludes a single consensus protocol from being both safe and live under asynchrony. This obstruction can be circumvented by cleverly coupling a live protocol with a safe protocol. Write \Pi_\text{safe},\Pi_\text{live} for safe and live protocols respectively. The output of \Pi_\text{live} is called the live ledger. The idea is to create a feedback loop:

  1. First, the live ledger is fed as the input to \Pi_\text{safe}. The output of the composition is a checkpointed live ledger.
  2. Next, the checkpointed live ledger is fed back into \Pi_\text{live}. This ensures the checkpointed live ledger is always a prefix of the live ledger.
  3. Repeat

During asynchrony, the safe protocol \Pi_\text{safe} stalls. In this case, the safe prefix of the live ledger lags behind while the live ledger itself continues advancing. For more information, see e.g. this paper and this paper.
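
To make the loop concrete, here is a minimal sketch in Python; the pi_live/pi_safe interfaces and the Ledger type are hypothetical placeholders rather than a concrete protocol specification.

    from dataclasses import dataclass, field

    @dataclass
    class Ledger:
        blocks: list = field(default_factory=list)

        def is_prefix_of(self, other: "Ledger") -> bool:
            return self.blocks == other.blocks[:len(self.blocks)]

    def dual_ledger_step(pi_live, pi_safe, live: Ledger, checkpointed: Ledger):
        # 1. The live ledger is fed into the safe protocol; its output is the
        #    checkpointed live ledger. During asynchrony pi_safe simply stalls
        #    and returns the previous checkpoint, which then lags behind.
        checkpointed = pi_safe.checkpoint(live, previous=checkpointed)

        # 2. The checkpoint is fed back into the live protocol, which may only
        #    extend chains containing it. This keeps the checkpointed ledger a
        #    prefix of the live ledger.
        live = pi_live.extend(live, must_extend=checkpointed)

        assert checkpointed.is_prefix_of(live)
        return live, checkpointed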

Ethereum has been using such a dual-ledger design for its consensus protocol since the introduction of Casper – a finality gadget in the form of a safe consensus protocol (essentially chained Tendermint).

Checkpoints for Starknet

In this subsection we will examine checkpoints in the context of a dual-ledger consensus protocol for Starknet. Two things to keep in mind:

  • The consensus protocol will be purely PoS.
  • Stake will be managed by an L1 contract.

L1 checkpoints

Under the partially synchronous model, designing the L2 consensus protocol of a validity rollup inevitably faces the same CAP limitations. The dual ledger design outlined above delivers the best of both worlds. But what about L1?

By design, security of validity rollups eventually maxes out at L1 security (upon L1 state updates). To approach this level of confidence at small latency it is natural to consider L1 checkpoints, sent via Ethereum transaction. We see two main trade-offs involved:

  1. Cost vs confidence – interaction with L1 is expensive; on the other hand, L1 is safe and visible by assumption. Furthermore, L1 can enforce logic of all kinds, e.g. slashing rules.
  2. Cost vs latency – low latency requires increased frequency, which translates to increased costs.

To mitigate the cost factor, L1 checkpoints can be restricted to commitments certified by a majority of stake. In this case they will not provide any data availability about the state of Starknet, but only evidence of L2 consensus (via quorum certificate). For latency, we are fine with several minutes.
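
As a rough illustration of such stake-certified checkpoints, here is a Python sketch; the Checkpoint fields, the verify_signature helper, and the ⅔-of-stake threshold are assumptions made for the example, not the actual Starknet design.

    from dataclasses import dataclass

    @dataclass
    class Checkpoint:
        block_number: int
        state_commitment: bytes   # commitment only: no data availability
        signatures: list          # (validator, signature) pairs: the quorum certificate

    def quorum_reached(cp: Checkpoint, stake: dict, verify_signature) -> bool:
        """Accept the checkpoint only if validators holding a supermajority of
        stake signed the commitment: evidence of L2 consensus, nothing more."""
        total = sum(stake.values())
        signed = sum(
            stake.get(validator, 0)
            for validator, sig in cp.signatures
            if verify_signature(validator, cp.state_commitment, sig)
        )
        return 3 * signed >= 2 * total  # illustrative 2/3-of-stake quorum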

Here are some more questions of interest regarding L1 checkpoints.

  1. What is verified upon submission of L1 checkpoints? This is a cost/confidence trade-off, since L1 verification of the quorum certificate consumes L1 gas.

  2. To what extent can there be conflicting checkpoints? At one extreme is a single checkpoint chain. At the other extreme lies a checkpoint tree. Many intermediate alternatives are possible.

  3. Checkpoint durability? Should newer and/or larger quorum certificates be allowed to override existing ones or are there rigid expiration dates?

Digression: checkpoints for fast finality

Starknet L1 state updates must follow the L2 consensus. Hence it is natural to couple proofs with checkpoints. Moreover, the planned latency of Starknet L1 finality is in the realm of several hours, and checkpoints can be used to design fast finality mechanisms. Our thoughts on such mechanisms are outlined separately in an upcoming post which defines some desirable properties and suggests ways of attaining them.

Slashing equivocators

We have previously decided the Starknet consensus protocol will be leader-based with PoS Sybil resistance. This makes it vulnerable to the “nothing at stake” problem in the form of equivocation. Specifically, a leader can equivocate by producing conflicting blocks during its turn at negligible cost (compared to PoW). If a dual-ledger design is used, checkpoint equivocation is also possible.

Ethereum defends against equivocation with slashing rules. However, slashing instances must be part of the ledger in order to take effect, which means they can be censored by a ⅓-of-stake minority. By maintaining the slashing logic on L1, Starknet will be able to defend against equivocation even by a malicious majority of stake. Note that even without any slashing, Starknet’s L1 state updates act as unforkable checkpoints, restricting the scope of backward-simulation attacks that suddenly present an alternative history.
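
For intuition, here is a Python sketch of what equivocation evidence could look like; the SignedBlock fields and the verify_signature helper are illustrative assumptions, not the Starknet slashing design.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class SignedBlock:
        height: int
        leader: str
        block_hash: bytes
        signature: bytes

    def is_equivocation(a: SignedBlock, b: SignedBlock, verify_signature) -> bool:
        """Valid evidence: the same leader signed two distinct blocks for the
        same height. Checking this on L1 means the slashing logic cannot be
        censored by an L2 stake majority."""
        return (
            a.leader == b.leader
            and a.height == b.height
            and a.block_hash != b.block_hash
            and verify_signature(a.leader, a.block_hash, a.signature)
            and verify_signature(b.leader, b.block_hash, b.signature)
        )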

Responsiveness

In our leader-based setting it would be nice for a leader to drive consensus at the pace of the actual network delay δ as opposed to the maximal network delay Δ, since this would decrease latency and increase throughput. Roughly speaking, responsiveness means that during synchrony, transaction confirmation time depends only on the actual network delay δ and not on any a priori upper bound Δ. For example, the Bitcoin protocol is non-responsive because its block difficulty is an explicit function of a pre-specified upper bound Δ. Consequently, transaction confirmation time suffers from the looseness of the estimate Δ.

Pass and Shi distill a formal definition of responsiveness and also establish an unfortunate trade-off between performance and security: responsive secure consensus protocols can only tolerate < ⅓ corruptions, even during synchrony. The argument resembles the proof of CAP and we outline it for three nodes A,B,M with M malicious. Essentially, M partitions the network by presenting contradicting world views to A,B. Specifically:

  • M causes communication A↔B to have maximal latency Δ while maintaining communications A↔M and B↔M at a network delay δ≪Δ.
  • M behaves honestly in each of the communications A↔M and B↔M, but sends conflicting information to A,B.
  • Consequently A thinks it and M are online and B is offline, and therefore perceives communication A↔M as an honest execution of the protocol at ⅓ failures. Similarly with A,B switched.
  • Assuming the protocol is responsive at ⅓ failures, A,B perform conflicting updates to their ledger, breaking consistency.

In light of the above argument, the best we can hope for is optimistic responsiveness: transaction confirmation time depends only on the current maximum latency causable by malicious agents. Perhaps the first partially synchronous protocol which is both safe and optimistically responsive is a leader-based BFT protocol called HotStuff. As shown in the paper, the responsiveness property achieved in this leader-based context means honest leaders can drive the consensus forward at actual network delay δ. A refinement of HotStuff is used in production by Aptos. However, since HotStuff is not live during asynchrony, it cannot serve as \Pi_\text{live} in a dual-ledger design. We are not aware of any battle-tested PoS protocols that are optimistically responsive and live under asynchrony. Here’s a nice technical reference.

We suspect responsiveness properties will play an important part in achieving lower latency and higher throughput. Motivated by this suspicion, we are thinking about an optimistically responsive protocol suitable for producing Starknet’s live ledger.

Streamlining the critical path to consensus

The critical path to consensus is the sequence of operations that must be performed after the previous execution of the consensus protocol in order to complete the current one. For example, the leader’s critical path may include block building and computing a commitment to the state obtained by executing the block. Every “obstacle” in the critical path to consensus reduces throughput. What can be taken out? Are the benefits worthwhile for Starknet?

State commitments

Every Ethereum block contains a commitment to the state following its execution. Hence the calculation of state commitments is in the critical path to consensus: it must be completed before the block is propagated. An obvious way to streamline the critical path for Starknet is to exclude any state commitments from the block body. A less extreme approach involves delayed computation, e.g. the body of block n+k must contain a commitment to the state following execution of block n. In this case, assuming some predictability, the leader of block n+k may compute the commitment in advance, or at least in parallel with building its own block. The price of the latter optimization is increased latency for non-executing nodes who wish to verify storage proofs with respect to the state at the tip of the chain, since they must now wait k blocks.
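
As a toy sketch of the delayed variant (in Python, with an illustrative delay parameter K and placeholder names):

    K = 4  # illustrative delay: block n carries the commitment for block n - K

    def expected_commitment(block_number: int, state_commitments: dict) -> bytes | None:
        """The commitment the body of block `block_number` must carry: the
        commitment to the state after executing block `block_number - K`.
        The leader has had K blocks' worth of time to compute it, so the
        computation leaves the critical path."""
        if block_number < K:
            return None  # early blocks have nothing to commit to yet
        return state_commitments[block_number - K]

    def validate_block_commitment(block_number: int, declared, state_commitments: dict) -> bool:
        """An executing node checks the declared commitment against its own."""
        return declared == expected_commitment(block_number, state_commitments)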

DAGs: decoupling consensus from block proposal

“Monolithic” leader-based BFT consensus protocols like Tendermint and HotStuff place both block building and block propagation in the critical path to consensus: the leader propagates its block during its turn. The key idea behind DAG-based protocols is to abstract the block proposal layer from the consensus layer and give it a life of its own. Consensus then becomes an overlay to “merely” order proposed blocks. Apart from streamlining the critical path, this decoupling further benefits throughput by enabling many agents to simultaneously propose blocks for ordering. The DAG (directed acyclic graph) vertices are blocks, while edges are causality relations: a block may refer to several blocks, each reference signifying chronological precedence.
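
Schematically, in a Python sketch with illustrative names (production DAG mempools add availability certificates and per-round structure on top of this):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class DagBlock:
        author: str
        round_number: int
        payload: tuple        # batched transactions
        parents: frozenset    # hashes of earlier blocks this block refers to

    def causal_history(tip_hash: bytes, blocks: dict) -> set:
        """All blocks reachable from `tip_hash` through parent references.
        A consensus overlay only has to pick and order such histories;
        block proposal itself proceeds concurrently and independently."""
        seen, stack = set(), [tip_hash]
        while stack:
            h = stack.pop()
            if h in seen:
                continue
            seen.add(h)
            stack.extend(blocks[h].parents)
        return seen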

In the PoS setting, Narwhal is a DAG-based PoS mempool protocol serving as a block proposal layer. Bullshark is an elegant partially synchronous PoS consensus overlay for Narwhal. Both are used in production by Sui.

In the PoW setting, DAG KNIGHT is a beautiful DAG-based generalization of Nakamoto consensus used in production by Kaspa. Among several lucid insights, the paper observes the essential independence of the block proposal layer from any upper-bound on network delay Δ. DAG KNIGHT makes full use of this property of the DAG to achieve optimistic responsiveness.

Wen DAG-based Starknet?

Summary

Having described some of our thoughts regarding the consensus protocol for Starknet, we encourage the community to participate in the discussion, offer thoughts and feedback, and help design the Starknet decentralized protocol.

In upcoming posts we will share concrete proposals for a decentralized protocol that addresses some of the open questions presented in the series of posts.

Thank you, that was clear.
How will a PoS-aware consensus protocol be achieved without compromising decentralization in decision making?

Hi there @ipblat! By “decision making” are you referring to participation in the consensus protocol or to governance?

Thank you for your interest. Both of these concepts. The point is that I am observing a pattern in the Cosmos ecosystem. It’s really good, but there’s one BUT… From project to project the validators are the same faces. This is on the verge of enabling manipulations between projects (or should I say blockchains) within the ecosystem. To give you an example, a group of validators in different blockchains of the same ecosystem agree to transfer liquidity from one blockchain to another. I repeat, this is just an example and you can think of many similar schemes. Watch this for a while and you realize that it is close to the biggest evil within a single ecosystem.

Dear @ipblat, I am trying to understand whether your concerns pertain to the standard reservations about PoS or to “manipulations” that are specific to validity rollups.

I did not quite follow what you meant by the liquidity manipulation. Are you concerned that stakeholders of native tokens of several different blockchains may form a malicious cartel?

Gm, thanks a lot for this great article. I have a few questions/remarks:

  1. Do you think responsiveness is the ultimate bottleneck? Do you have any idea how much more efficient an optimistically responsive PoS protocol is compared to a traditional one?

  2. Paradigm wrote a PoC implementing Narwhal/Bullshark in the Cosmos SDK; it could be a great starting point for a PoC for Starknet. Are you considering starting from an SDK or starting from scratch?

  3. Finally, do you see anything specific to Starknet's L2 architecture that could end up in this critical path? (Trying to figure out whether there is another bottleneck that could reduce the performance of a DAG-based PoS.)

Dear @matteo, thanks for your questions!

  1. I don’t feel entirely comfortable with “ultimate bottleneck”. For account-based systems, concurrency is tricky and I suspect sequential execution is a considerable bottleneck. Frankly, I just don’t know. For comparative efficiency I think it only makes sense to compare two concrete protocols. Maybe you can find quantitative comparisons of Tendermint and HotStuff. 🙂

  2. The SDK sounds interesting - thanks for bringing it to our attention!

  3. If I understand your question correctly then no - I don’t see L2-specific obstacles in the critical path. Interaction with L1 is meant to periodically anchor consensus, which seems outside the critical path to me. A potential “obstacle” may come in the future, when small proofs are frequently propagated off-chain to facilitate execution-free verification of state transitions. For example, if the sequencer must compute a proof of the execution of its block of transactions before proposing the block for consensus, the critical path is burdened. This can be resolved e.g. using “delays” as described in the post.

Just to make sure, you don’t suggest having a faster finality than L1 right? 'Cause that would be impossible.

“We suspect responsiveness properties will play an important part in achieving lower latency and higher throughput.”

Note that under practical conditions, responsiveness is not relevant for performance. From a practical perspective, responsiveness is relevant for a protocol’s recovery from a period of asynchrony (different from common-case performance). Responsiveness is also a very useful theoretical instrument to help categorize consensus algorithms.

In practice, for a responsive protocol to make a difference, a very precise set of conditions has to be met. This is a rare corner case. I can try to give an intuition by crafting a scenario below: Given the implementation of a non-responsive protocol, say Tendermint [Later edit: I was wrong, Tendermint is responsive. Will leave the rest of the discussion as originally posted nevertheless, as it might be interesting.], we need to experience the following during a run of this protocol:

  1. The run has to undergo asynchrony. Specifically, asynchrony has to be severe enough to cause the validators to need multiple rounds to reach a decision on a block. Suppose we start from round = 0. Under synchrony, both responsive and non-responsive protocols will behave similarly and will finish deciding in the first round = 0. So it is important we start from asynchrony.

  2. Asynchrony has to be very targeted: It needs to affect a future proposer (i.e., the validator who would become the proposer in a later round). Let’s call this validator Porky. It is important that Porky misses what the other validators are doing at least during the first round of this run (i.e., round = 0).

  3. It is important that a subset of validators are able to communicate and make some progress. Specifically, they need to lock on a value during round = 0. This subset is not affected by asynchrony. This subset does not include Porky. This subset must include the proposer for the current round = 0.

  4. Suppose round = 0 of the run is unsuccessful, because a quorum of 2f+1 precommits is not reached. A subset of validators are locked on the value proposed in this round. Let’s name this value X.

  5. Now it is important that the validator Porky is the proposer for round = 1. Recall that this validator is not among those who locked on X. Even more importantly, Porky must not see the prevotes for X (line 36 of the algorithm) before it tries to send its own proposal. Since it does not see the prevotes, Porky will be unable to propose value X, instead it will propose some other value Y.

  6. It is important that GST (global stabilization time) kicks in and validators can communicate smoothly, including Porky. The subset of validators who were not affected by asynchrony and who locked on X will refuse to prevote for Y. Instead, these validators will prevote nil because they expected to see X instead of Y. This means that round = 1 will not be fruitful and will not lead to a decision. The protocol needs another round = 2 to reach a decision.

The bottom line is: Under the conditions we crafted here, if the Tendermint protocol were responsive, then round = 1 would have been successful. Outside of exactly this kind of situation, responsiveness is not relevant for latency or throughput.

If any of the steps above is not fulfilled, we will not recreate a scenario where responsiveness matters. For example:

  • Without GST at (6), responsiveness is moot. A responsive protocol would not terminate in round = 1 either and would need subsequent tries to reach a decision.
  • If GST kicks in at any moment before (6) starts, the run will terminate in round = 1 because Porky will be able to see the prevotes from the prior round and propose value X.
  • If Porky is part of the subset of nodes that can communicate during round = 0, it will see prevotes for X and will be able to drive this value (it will not propose Y, but X) to a decision during round = 1.
  • If asynchrony does not affect a large enough (f+1) subset of validators, then the protocol will progress during round = 0 and reach a decision.
  • If the set of validators who lock on value X during round = 0 is too small, then these validators do not matter in round = 1. In this case, Porky can make progress as a proposer, i.e., drive value Y to a decision during round = 1.
  • If Porky is not the proposer for round = 1, and instead the proposer is a validator that has seen value X, then this proposer will drive X to a decision during round = 1.
  • Regarding (1), generally production networks rarely need to go beyond the first round. For a quick intuition & data on this, I fetched the last ~20’000 blocks from the Osmosis network. This network is currently the most active/prone to congestion, and it builds on CometBFT (an implementation of the Tendermint consensus algorithm). As the dataset shows, a few tens of blocks went up to round 1, and an outlier block committed in round 3, while the vast majority of blocks were committed in round 0. This is incomplete data, so it would be very valuable to extend the dataset with the whole history and study that, but it’s a start.

hey,

since sequencers have less power than ethereum validators - would love to understand why pos has been prioritized instead of electing new sequencers through governance process (“proof of governance”).

not sure layer 2s have to be as credibly neutral as ethereum, pos will probably result in lsts and so Starknet will lose a) governance, given that governance will be up to the governance token of the liquid staking protocol and b) revenue, given that the liquid staking provider will take 5% of the rewards (fees + mev) and pass them on to the operators + another 5%, which will be kept for their dao.

there are tradeoffs everywhere but my take is that Ethereum is building for rollups and rollups are building for users. that’s why having the operator election (and all that goes with it) enshrined in the governance of the protocol makes sense except if you don’t want to bother with all the governance process.

maybe i’m too late but still curious about your thoughts on a) and b)!

Hey Sam, you’re never too late. For the future, I’d advise you to tag someone or reply to a specific post if you want quicker answers.

I think the main point of divergence we’d have is whether proof of governance is enough for Starknet. I don’t consider the promise of inheriting Ethereum’s properties through rollups complete if we’re not providing censorship resistance and neutrality when it comes to transaction ordering. To my mind, those properties are not sufficiently guaranteed with PoG.

Nevertheless, I think the two points are valid, and those are two things we can start thinking about now in order to mitigate them. The main ideological difference I’d have with you is that yes, rollups are for users, but the main properties of blockchains sought by users are of course the global hard ledger, but also censorship resistance, transparency, and fairness in tx ordering. I have no problem if you deploy an app chain using a single sequencer or PoG, that’s your choice, but Starknet is intended to be a public chain with many dapps, and thus I think we should really care about those properties.

I found this post to be very informative and insightful and I also enjoyed the section on the dual-ledger design, which is a clever way to circumvent the CAP theorem.

Very interested in the work that Starknet is doing on an optimistically responsive consensus protocol. This sounds like a very promising approach to improving latency and throughput, which I believe can become a bottleneck in the future.

One question please: what are the challenges of implementing an optimistically responsive consensus protocol?

Hey Matteo, oh yea we seem to be in the decentralization part of the protocol so thought it was the right place - thanks for your reply.

The way I see it, PoG could be envisioned as having, say, between 30 and 60 operators geographically distributed (or a similar number*). In addition, there would be a forced inclusion mechanism. In this context, incentives to censor are not particularly relevant and even if it happens, it can be solved imo.

Sure, you wouldn’t have slashing, but as I mentioned, sequencers have much less power than Ethereum validators, e.g. when you think about reorgs, sequencers can only reorg prior to the data/root being posted to Ethereum (a much shorter timeframe) - once this is posted, for Starknet to reorg, Ethereum would have to reorg.

imo CR and MEV are not necessarily worse with PoG, you could keep the mempool private and enforce certain types of tx ordering that cannot be enforced on the client side - e.g. this could help avoid front-running/sandwich attacks.
(Note that mid/long term I believe that sequencers will end up outsourcing block building (and so ordering) to external/more sophisticated actors as it is the case on Ethereum L1 today.)

That being said, with PoG, you could totally imagine that governance could/would remove censoring/frontrunning sequencers from the active set - and instead of slashing, negative incentives would be forfeiting all future revenue.

If you end up in a situation with thousands of sequencers, you might also lose that fast preconfirmation “feature” that users love and Starknet can provide imho.

At the end of the day, what you get with PoS is the rise of a dominant LST protocol, and what is an LST protocol? A governance mechanism for picking operators.

PoG is kinda the same thing but in this case, it’s Starknet’s governance that makes the decisions, not an external entity. Starknet governance would still have the authority to add as many sequencers* (and all that goes with it) as are accepted by the protocol gov.