TLDR
We propose an alternative recovery mechanism from unprovable blocks which can appear due to bugs in different components of the protocol. We introduce the notion of SOS blocks. SOS blocks will contain zk-proofs (zkp, for short) for showing that a bug was encountered. SOS blocks can be used in-protocol to trigger reverts at consensus level.
Motivation
The Decentralized Starknet protocol has a very elegant solution for coupling consensus and proving. In addition to transactions, blocks need to contain a zkp for a block that was previously accepted on the L2 chain in order to be considered valid during consensus. However, the protocol doesn’t guarantee liveness.
The Decentralized Starknet protocol uses the battle-tested Tendermint protocol as an underlying consensus protocol, but it currently breaks the consistency assumption of Tendermint (i.e., an honest validator can always make a valid proposal). Due to bugs in several components involved in the architecture of the protocol (e.g., execution clients, prover), a block which seems valid during consensus can be accepted on the L2 chain, but no zkp can be build for it. We call these blocks unprovable blocks.
Current approaches for recovering from unprovable blocks
Currently, the Decentralized Starknet protocol can recover from unprovable blocks by L2 social consensus or the L1 reset mechanism used for handling stake updates. Let us explain how the L1 reset mechanism can be engaged for recovering from unprovable blocks.
Empty blocks are blocks containing no transactions and no proofs. Let us assume k is the number of proofchains (designed to give validators enough time to produce a zkp for their proposal) and P is the limit on consecutive empty blocks in a proofchain. Let’s consider the following scenario:
- The block at height H (belonging to a proofchain PC) is unprovable due to a bug in an execution client
- Proposers for blocks at heights H + k*i, with 1 \leq i \leq P, will not be able to produce a zkp for the block at height H and therefore will propose only empty blocks
- The limit on empty blocks on proofchain PC is reached and from this point on, validators can propose only empty blocks which will not be considered valid anymore during consensus (as the limit on empty blocks is reached), and the Tendermind protocol will just continue to advance to the next round of the height H + k*P until L2 social consensus or the L1 reset mechanism (for handling stake updates) are engaged.
Note that engaging the L1 reset mechanism for stake updates currently takes more than 24h. This can potentially impact liveness as we assume there will come a point where the network is post-GST and timeouts will increase. Therefore, we are proposing a faster in-protocol recovery mechanism for such cases which can be triggered on the L2 chain.
SOS blocks
For dealing with potential bugs in different components involved in the protocol (e.g, execution clients, prover), we introduce a new type of block, SOS block. SOS blocks will convince validators (i.e., by containing adequate zkps) to revert during consensus as a bug was encountered.
Assume B' is an SOS block proving to the validators that they should revert because a block is unprovable or because a bug appeared in the Prover while trying to build a proof. Consensus can be performed normally on B' and block B' will contain zkps that can be checked for validity during consensus. After committing to the SOS block B' (which is expected to prove block B), all nodes will revert prior to block B. All blocks between B and B' should be dropped as there cannot be a proper state transition.
In the next section we describe how SOS blocks and adequate zkps can be constructed.
Building SOS blocks
Assume block B contains some transactions txs. We are trying to build a block B' expected to prove block B. We propose the following mechanism depicted in the figure below.
The main idea is to run CairoRun
in a zkVM to produce a zkp connecting the trace outputed by CairoRun
with the initial transactions, or have a zkp showing that CairoRun
could not produce a trace for these transactions (i.e., it produced an error). Further, we engage a TraceChecker
for checking the validity of the trace created by CairoRun
. The TraceChecker
produces a zkp as well. We then compare the result of the TraceChecker
with the results from the Prover
.
The flow of the mechanism and other details are described below.
Step 1. Run CairoRun
on txs in a zkVM
CairoRun
can produce two outputs: a trace tr or anerror
- By running
CairoRun
in a zkVM, we will get a zkp for either cases:
a. IfCairoRun
outputs a trace tr, we obtain a zkp proof attesting that tr is obtained by executing the transactions txs. In this case, we continue with Step 2.
b. IfCairoRun
outputs anerror
, we obtain a zkp proof_1' attesting this fact. This means there was a bug in CairoRun or in the Execution Client(s). In this case, B' will be an SOS-block containing the zkp proof_1' (and maybe also txs).
Remarks:
- Running
CairoRun
in a zkVM should not use much memory (e.g., RISC-Zero offers a max of 4GB of RAM). - Running
CairoRun
in a zkVM is more feasible than running the entireProver
in a zkVM due to memory constraints.
Step 2. Run TraceChecker
on tr
- The
TraceChecker
is a small program in a provable language (e.g., Cairo or anything that compiles to RISC-V) which will produce a zkp showing that the trace tr satisfies the VM constraints (i.e., tr is a valid trace) or a zkp showing that the trace is invalid (i.e., it provides the first trace index where the VM constraints are not satisfied). - The
TraceChecker
will apply repeatedly theOneStepChecker
on all transitions in tr. TheOneStepChecker
is checking if a one-step state transition satisfies the VM constraints. - The
TraceChecker
outputs:
a. A zkp proof_2 if the trace tr is a valid trace satisfying the VM constraints.
b. A zkp proof_2' if the trace tr is an invalid trace (e.g., provides the first index in the trace tr where a VM constraint is not satisfied).
Remarks:
- Step 2 can be done in parallel with Step 3 below.
- The
TraceChecker
should be able to run in O(|tr|). - The
TraceChecker
can potentially be written in the Lean Theorem Prover using the formalization of the Cairo VM constraints that were already specified. - The
TraceChecker
can be formally verified against Prof. Jeremy Avigad’s formalization of the underlying CairoVM polynomial constraints.
Step 3. Run the Prover
on tr
- The
Prover
can produce a zkp proof_3 (which proves the state transition according to the trace tr) orNothing
.
Remarks:
- Step 3 can be done in parallel with Step 2 above.
- Formally verifying the Prover seems a bit infeasible.
- However, we can formally verify the Verifier in the Lean Theorem Prover.
Step 4. Compare the results from the Prover
and the TraceChecker
We have the following four scenarios:
I. Prover
produced the zkp proof_3
I.a. TraceChecker
produced the zkp proof_2
- All good, sunny day
- B' will be a non-empty block containing the zkp proof_3 produced by the
Prover
(and new transactions)
I.b. TraceChecker
produced the zkp proof_2' (i.e., invalid trace)
- This scenario means that there were bugs in the
Prover
&CairoRun
(i.e., we have an invalid trace produced byCairoRun
and theProver
managed to prove it). - B' will be an SOS-block containing the following proofs (and maybe also txs):
- proof, attesting that the trace tr is indeed a trace obtained by executing the transactions txs, and
- proof_2', attesting that the trace tr is invalid.
II. Prover
produced Nothing
II.a. TraceChecker
produced the zkp proof_2 (i.e., valid trace but the Prover
cannot prove it)
- This scenario means that there was a bug in the
Prover
. - B' will be an SOS-block containing the following proofs (and maybe also txs):
- proof, attesting that the trace tr is indeed a trace obtained by executing the transactions txs, and
- proof_2, attesting that the trace tr is valid.
- It must be decided what to do in this case (e.g., reuse the block, revert, use a different Prover)
II.b. TraceChecker
produced the zkp proof_2' (i.e., invalid trace)
- This scenario means that there was a bug in
CairoRun
. - B' will be an SOS-block containing the following proofs (and maybe also txs):
- proof, attesting that the trace tr is indeed a trace obtained by executing the transactions txs, and
- proof_2', attesting that the trace tr is invalid.
Depending on the computational costs of running CairoRun in a zkVM, it can be decided when to engage this recovery mechanism. Ideally, this mechanism should be engaged any time we try to obtain a zkp needed for block B'. Alternatively, SOS blocks can be build when the limit on consecutive blocks in a proofchain is reached. In this scenario, several blocks will need to be dropped after the revert which could potentially be prevented.
A more fine grained architecture
Depending if one wants to reason about the program given as an input to the Prover
, we can extend the above arhitecture by running the compiler Sierra → CASM
in a zkVM (or write the compiler in a provable language). The flow remains mostly the same, we just have one additional case of building an SOS block. The mechanism is described in the figure below.
Summary
With this post, we wanted to propose a faster in-protocol recovery mechanism from potential bugs in the components involved in the Decentralized Starknet protocol. We look forward to hear your feedback on these ideas!