SOS blocks in the Decentralized Starknet Protocol

TLDR

We propose an alternative recovery mechanism from unprovable blocks, which can appear due to bugs in different components of the protocol. We introduce the notion of SOS blocks. SOS blocks will contain zk-proofs (zkps, for short) showing that a bug was encountered. SOS blocks can be used in-protocol to trigger reverts at the consensus level.

Motivation

The Decentralized Starknet protocol has a very elegant solution for coupling consensus and proving. In addition to transactions, blocks need to contain a zkp for a block that was previously accepted on the L2 chain in order to be considered valid during consensus. However, the protocol doesn’t guarantee liveness.

The Decentralized Starknet protocol uses the battle-tested Tendermint protocol as its underlying consensus protocol, but it currently breaks an assumption of Tendermint, namely that an honest validator can always make a valid proposal. Due to bugs in the various components involved in the architecture of the protocol (e.g., execution clients, prover), a block which seems valid during consensus can be accepted on the L2 chain even though no zkp can be built for it. We call these blocks unprovable blocks.

Current approaches for recovering from unprovable blocks

Currently, the Decentralized Starknet protocol can recover from unprovable blocks by L2 social consensus or the L1 reset mechanism used for handling stake updates. Let us explain how the L1 reset mechanism can be engaged for recovering from unprovable blocks.

Empty blocks are blocks containing no transactions and no proofs. Let us assume k is the number of proofchains (designed to give validators enough time to produce a zkp for their proposal) and P is the limit on consecutive empty blocks in a proofchain. Let’s consider the following scenario:

  • The block at height H (belonging to a proofchain PC) is unprovable due to a bug in an execution client
  • Proposers for blocks at heights H + k*i, with 1 \leq i \leq P, will not be able to produce a zkp for the block at height H and therefore will propose only empty blocks
  • The limit on empty blocks on proofchain PC is reached; from this point on, validators can only propose empty blocks, which are no longer considered valid during consensus, so the Tendermint protocol will just keep advancing to the next round at height H + k*P until L2 social consensus or the L1 reset mechanism (for handling stake updates) is engaged (see the sketch below).
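To make the scenario concrete, here is a minimal sketch with hypothetical parameters (k = 3, P = 2, H = 100; not actual protocol values) that enumerates the heights affected by the unprovable block:

```python
# Minimal sketch with hypothetical parameters: which heights are affected
# once the block at height H on proofchain PC is unprovable.
k = 3    # number of proofchains
P = 2    # limit on consecutive empty blocks in a proofchain
H = 100  # height of the unprovable block

# Proposers at heights H + k*i (1 <= i <= P) cannot produce a zkp for the
# block at height H, so they can only propose empty blocks on proofchain PC.
empty_heights = [H + k * i for i in range(1, P + 1)]
print(empty_heights)  # [103, 106]

# Once the limit P is reached, empty proposals on PC stop being valid and
# consensus keeps advancing rounds at height H + k*P.
print(H + k * P)      # 106
```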

Note that engaging the L1 reset mechanism for stake updates currently takes more than 24h. This can potentially impact liveness: we usually assume that there will come a point where the network is post-GST and timeouts have grown large enough for consensus to make progress, yet in the scenario above the chain still cannot advance. Therefore, we are proposing a faster in-protocol recovery mechanism for such cases which can be triggered on the L2 chain.

SOS blocks

To deal with potential bugs in the different components involved in the protocol (e.g., execution clients, prover), we introduce a new type of block, the SOS block. SOS blocks contain adequate zkps which convince validators during consensus that a bug was encountered and that a revert should be triggered.

Assume B' is an SOS block proving to the validators that they should revert because a block B is unprovable or because a bug appeared in the Prover while trying to build a proof for it. Consensus can be performed normally on B', and block B' will contain zkps that can be checked for validity during consensus. After committing the SOS block B' (which was expected to prove block B), all nodes will revert to the state prior to block B. All blocks between B and B' should be dropped, as there cannot be a proper state transition through them.
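As a rough illustration of this commit-time handling, here is a minimal sketch with hypothetical data structures (Block, chain, states are placeholders, not the actual node implementation):

```python
# Rough sketch (hypothetical structures): effect of committing an SOS block
# B' that was expected to carry the proof for block B.
from dataclasses import dataclass, field

@dataclass
class Block:
    height: int
    txs: list = field(default_factory=list)
    zkps: list = field(default_factory=list)  # proofs carried by the block
    is_sos: bool = False
    unprovable_height: int | None = None      # height of the block B it reports

def on_commit(block: Block, chain: list[Block], states: dict[int, object]):
    """chain is the list of committed blocks; states[h] is the application
    state before executing the block at height h."""
    if block.is_sos:
        b = block.unprovable_height
        # Drop block B and every block between B and B': there is no valid
        # state transition through them.
        chain[:] = [blk for blk in chain if blk.height < b]
        chain.append(block)  # the SOS block B' itself stays on the chain
        return states[b]     # every node reverts to the state prior to B
    chain.append(block)
    return None              # normal path: execute txs as usual
```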

In the next section we describe how SOS blocks and adequate zkps can be constructed.

Building SOS blocks

Assume block B contains some transactions txs. We are trying to build a block B' expected to prove block B. We propose the following mechanism depicted in the figure below.

The main idea is to run CairoRun in a zkVM so as to produce a zkp connecting the trace output by CairoRun with the initial transactions, or a zkp showing that CairoRun could not produce a trace for these transactions (i.e., it produced an error). Further, we engage a TraceChecker to check the validity of the trace created by CairoRun; the TraceChecker produces a zkp as well. We then compare the result of the TraceChecker with the result from the Prover.

The flow of the mechanism and other details are described below.

Step 1. Run CairoRun on txs in a zkVM

  • CairoRun can produce two outputs: a trace tr or an error
  • By running CairoRun in a zkVM, we will get a zkp in either case (see the sketch below):
    a. If CairoRun outputs a trace tr, we obtain a zkp proof attesting that tr is obtained by executing the transactions txs. In this case, we continue with Step 2.
    b. If CairoRun outputs an error, we obtain a zkp proof_1' attesting this fact. This means there was a bug in CairoRun or in the Execution Client(s). In this case, B' will be an SOS-block containing the zkp proof_1' (and maybe also txs).
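A minimal sketch of Step 1 follows. Here run_cairo and zkvm_prove are hypothetical placeholders standing for CairoRun and for proving a guest program inside a zkVM; they are not real APIs.

```python
# Sketch of Step 1 (hypothetical placeholders run_cairo and zkvm_prove).
def cairo_run_guest(txs, run_cairo):
    """Guest program executed inside the zkVM: run CairoRun on txs and
    commit either the trace it produced or the error it raised."""
    try:
        return ("TRACE", run_cairo(txs))
    except Exception as err:
        return ("ERROR", repr(err))

def step_1(txs, run_cairo, zkvm_prove):
    # The zkVM receipt attests that cairo_run_guest was executed on txs
    # and produced exactly the committed output.
    output, receipt = zkvm_prove(cairo_run_guest, txs, run_cairo)
    if output[0] == "TRACE":
        return ("step_2", output[1], receipt)  # this receipt plays the role of proof
    return ("sos_block", output[1], receipt)   # this receipt plays the role of proof_1'
```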

Remarks:

  • Running CairoRun in a zkVM should not use much memory (e.g., RISC-Zero offers a max of 4GB of RAM).
  • Running CairoRun in a zkVM is more feasible than running the entire Prover in a zkVM due to memory constraints.

Step 2. Run TraceChecker on tr

  • The TraceChecker is a small program in a provable language (e.g., Cairo or anything that compiles to RISC-V) which will produce either a zkp showing that the trace tr satisfies the VM constraints (i.e., tr is a valid trace) or a zkp showing that the trace is invalid (i.e., it provides the first trace index where the VM constraints are not satisfied).
  • The TraceChecker repeatedly applies the OneStepChecker to all transitions in tr; the OneStepChecker checks whether a one-step state transition satisfies the VM constraints (see the sketch after this list).
  • The TraceChecker outputs:
    a. A zkp proof_2 if the trace tr is a valid trace satisfying the VM constraints.
    b. A zkp proof_2' if the trace tr is an invalid trace (e.g., it provides the first index in the trace tr where a VM constraint is not satisfied).
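A minimal sketch of the TraceChecker loop, assuming a hypothetical one_step_checker callback that evaluates the CairoVM constraints on a single transition (two consecutive rows of the trace). Running such a program in a provable language would yield proof_2 or proof_2':

```python
# Sketch of the TraceChecker (hypothetical types and callback).
def trace_checker(tr, one_step_checker):
    """Return ("valid", None) if every transition of tr satisfies the VM
    constraints, or ("invalid", i) with the first offending index i.
    Runs in O(|tr|): one OneStepChecker call per transition."""
    for i in range(len(tr) - 1):
        if not one_step_checker(tr[i], tr[i + 1]):
            return ("invalid", i)  # witness behind proof_2'
    return ("valid", None)         # witness behind proof_2
```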

Remarks:

  • Step 2 can be done in parallel with Step 3 below.
  • The TraceChecker should be able to run in O(|tr|).
  • The TraceChecker can potentially be written in the Lean Theorem Prover using the formalization of the Cairo VM constraints that were already specified.
  • The TraceChecker can be formally verified against Prof. Jeremy Avigad’s formalization of the underlying CairoVM polynomial constraints.

Step 3. Run the Prover on tr

  • The Prover can produce a zkp proof_3 (which proves the state transition according to the trace tr) or Nothing.
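For the purpose of the comparison in Step 4, the Prover can be thought of as the following interface; this is just an assumed wrapper for the sketches here, not the actual Prover API:

```python
# Hypothetical wrapper around the Prover: it either returns proof_3 for the
# state transition induced by tr, or Nothing (None) if it fails, e.g. due
# to an internal bug.
from typing import Optional

def run_prover(tr, prover) -> Optional[bytes]:
    try:
        return prover(tr)  # proof_3: proves the state transition of tr
    except Exception:
        return None        # "Nothing": the Prover could not produce a proof
```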

Remarks:

  • Step 3 can be done in parallel with Step 2 above.
  • Formally verifying the Prover does not seem feasible.
  • However, we can formally verify the Verifier in the Lean Theorem Prover.

Step 4. Compare the results from the Prover and the TraceChecker

We have the following four scenarios (a decision-logic sketch is given at the end of this step):

I. Prover produced the zkp proof_3

I.a. TraceChecker produced the zkp proof_2

  • All good, sunny day
  • B' will be a non-empty block containing the zkp proof_3 produced by the Prover (and new transactions)

I.b. TraceChecker produced the zkp proof_2' (i.e., invalid trace)

  • This scenario means that there were bugs in both the Prover and CairoRun (i.e., CairoRun produced an invalid trace and the Prover managed to prove it).
  • B' will be an SOS-block containing the following proofs (and maybe also txs):
    • proof, attesting that the trace tr is indeed a trace obtained by executing the transactions txs, and
    • proof_2', attesting that the trace tr is invalid.

II. Prover produced Nothing

II.a. TraceChecker produced the zkp proof_2 (i.e., valid trace but the Prover cannot prove it)

  • This scenario means that there was a bug in the Prover.
  • B' will be an SOS-block containing the following proofs (and maybe also txs):
    • proof, attesting that the trace tr is indeed a trace obtained by executing the transactions txs, and
    • proof_2, attesting that the trace tr is valid.
  • It must be decided what to do in this case (e.g., reuse the block, revert, use a different Prover)

II.b. TraceChecker produced the zkp proof_2' (i.e., invalid trace)

  • This scenario means that there was a bug in CairoRun.
  • B' will be an SOS-block containing the following proofs (and maybe also txs):
    • proof, attesting that the trace tr is indeed a trace obtained by executing the transactions txs, and
    • proof_2', attesting that the trace tr is invalid.
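The four scenarios can be summarized by a small piece of glue code. A minimal sketch, assuming the outputs of Steps 1-3 as described above (names such as step_4 and the dict-based block representation are hypothetical):

```python
# Sketch of Step 4: combine the Prover result and the TraceChecker result
# into either a regular block or an SOS block.
def step_4(proof, prover_result, checker_result, txs):
    """proof          : zkp from Step 1 (tr was obtained by executing txs)
       prover_result  : proof_3 from Step 3, or None ("Nothing")
       checker_result : ("valid", proof_2) or ("invalid", proof_2_prime)"""
    verdict, checker_proof = checker_result
    if prover_result is not None and verdict == "valid":
        # I.a: sunny day, a regular non-empty block carrying proof_3.
        return {"type": "block", "zkps": [prover_result], "txs": txs}
    if prover_result is not None and verdict == "invalid":
        # I.b: bugs in both the Prover and CairoRun.
        return {"type": "sos", "zkps": [proof, checker_proof], "txs": txs}
    if prover_result is None and verdict == "valid":
        # II.a: bug in the Prover (valid trace it cannot prove).
        return {"type": "sos", "zkps": [proof, checker_proof], "txs": txs}
    # II.b: bug in CairoRun (invalid trace; the Prover rightly produced Nothing).
    return {"type": "sos", "zkps": [proof, checker_proof], "txs": txs}
```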

Depending on the computational costs of running CairoRun in a zkVM, it can be decided when to engage this recovery mechanism. Ideally, this mechanism should be engaged any time we try to obtain a zkp needed for block B'. Alternatively, SOS blocks can be built only when the limit on consecutive empty blocks in a proofchain is reached. In this scenario, several blocks will need to be dropped after the revert, something that could have been avoided by engaging the mechanism earlier.

A more fine-grained architecture

Depending on whether one wants to reason about the program given as input to the Prover, we can extend the above architecture by running the Sierra → CASM compiler in a zkVM (or by writing the compiler in a provable language). The flow remains mostly the same; we just have one additional case in which an SOS block is built. The mechanism is described in the figure below.
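A minimal sketch of this additional step, using the same hypothetical zkvm_prove placeholder as in Step 1 and a hypothetical compile_sierra function standing for the Sierra → CASM compiler:

```python
# Sketch of the extra compilation step (hypothetical placeholders).
def compile_guest(sierra_program, compile_sierra):
    """Guest program executed inside the zkVM: compile Sierra to CASM and
    commit either the CASM output or the error the compiler raised."""
    try:
        return ("CASM", compile_sierra(sierra_program))
    except Exception as err:
        return ("ERROR", repr(err))

def step_0(sierra_program, compile_sierra, zkvm_prove):
    output, receipt = zkvm_prove(compile_guest, sierra_program, compile_sierra)
    if output[0] == "CASM":
        return ("continue", output[1], receipt)  # proceed with Steps 1-4
    return ("sos_block", output[1], receipt)     # compiler bug: build an SOS block
```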

Summary

With this post, we wanted to propose a faster in-protocol recovery mechanism from potential bugs in the components involved in the Decentralized Starknet protocol. We look forward to hearing your feedback on these ideas!

That’s super interesting. I can’t comment on the proof complexity related questions, but I have some comments on what to do in the case an SOS block is accepted:

  1. The notion of “revert” is not clear to me. I guess the idea is to dismiss all the blocks from the unprovable one onwards, pick a new forkID, and build a new chain on top of the last OK block? This also means that the L1 contract must have a way to become aware of the new forkID. Until now, the forkID was related to the last stale registration. (Note that we cannot re-use the old forkID, as this would make validators sign conflicting votes, which would be indistinguishable from equivocation.)
  2. Similarly, I guess one needs to prove to L1 that L2 has forked because an SOS block got accepted. I am not sure how this would be done.
  3. Perhaps this indicates that “revert” should be something different. Perhaps we keep on building blocks for the same forkID (and keep all blocks), but roll back the application state to before the unprovable block, and somehow exempt the blocks in between from the proofs submitted to L1. But then this would make the proof strands / proof scheduling more complex.

So this is a bit unclear to me. What did you have in mind for reverting?

Hi Josef! Thanks a lot for the message. You hit exactly the soft spot of the design :slight_smile:

I think using the ideas from your points 1 and 2 and applying a solution similar to the one used in the case of stale registration is a viable option. Point 2 is indeed a bit more delicate.

Point 3 is more interesting and I think it works. Below I share my thoughts on it and hopefully it is a less invasive solution to the entire architecture of the protocol.

Suppose we have K = 3 proofchains. Let us consider the scenario depicted in the figure below. Green blocks are blocks containing txs, the yellow block 5 is an unprovable block due to a bug, while the blue block 8 is an SOS block proving that block 5 is unprovable.

Proofs contained in blocks are depicted by the colored arrows. Note that Blocks 1, 2, and 3 don’t contain proofs. The letters in each block represent the state before executing the txs in that block. For example, Block 4 proves the state transition s1 → s2, while Block 7 proves the state transitions s4 → s5 (directly) and s1 → s2 (indirectly).

Now, since block 5 is unprovable, the state transitions s5 → s6, s6 → s7, and s7 → s8 are broken. When seeing the SOS block 8, the application state must roll back to the state before the unprovable block (in our example, to state s5). Now we can continue building blocks. In the above scenario, the only lost transactions are those contained in blocks 5, 6, and 7. The SOS block should contain the zkp from the unprovable block, and the next K - 1 blocks should piggyback the zkps contained in the blocks between the unprovable block and themselves. New transactions can be added in this process. I exemplify this below.

The SOS block 8 will contain the following zkps:

  • the zkps \Pi proving that block 5 is unprovable, obtained from the above described mechanism
  • the zkp \pi_1 contained in block 5, proving the state transition s2 → s3
  • possible new txs1

Block 9 will contain:

  • the zkp \pi_2 contained in block 6, proving the state transition s3 → s4
  • possible new txs2

Block 10 will contain:

  • the zkp \pi_3 contained in block 7, proving the state transitions s4 → s5 (directly) and s1 → s2 (indirectly)
  • possible new txs3

Block 11 will contain:

  • the zkps proving the execution of txs1 from the SOS block 8 (hence proving the state transition s5 → s6') and the verification of the zkp \pi_1 (contained in block 8)
  • possible new txs4

Computing the L1 state update is not affected. An L1 state update is obtained by merging the proofs from the last non-empty blocks in every proofchain that have already been proved. So if we merge the proofs in blocks 10, 11, and 12, we have a zkp proving the state transition s1 → s7'.
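To sanity check this merge, here is a small sketch that chains the transitions covered by blocks 10, 11, and 12. The contents of block 12 are not listed above; I assume it follows the same pattern as block 11 (it proves txs2 of block 9, i.e. s6' → s7', and the verification of \pi_2). In the code, s6p and s7p stand for s6' and s7':

```python
# Worked check of the merge (hypothetical encoding; block 12's contents are
# assumed by analogy with block 11).
proved = {
    10: [("s4", "s5"), ("s1", "s2")],    # pi_3: direct + indirect transitions
    11: [("s5", "s6p"), ("s2", "s3")],   # txs1 of block 8 + verification of pi_1
    12: [("s6p", "s7p"), ("s3", "s4")],  # assumed: txs2 of block 9 + verification of pi_2
}

# Chain all transitions covered by the merged proof, starting from s1.
edges = dict(t for ts in proved.values() for t in ts)
state = "s1"
while state in edges:
    state = edges[state]
print(state)  # s7p, i.e. the merged proof indeed covers s1 -> s7'
```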

This is a scenario without empty blocks, but I think that for the case with empty blocks no big obstacles should be encountered (famous last words).