Starknet downtime post-mortem: November 15, 2023

What happened?

On Nov 15, 2023 10:00 AM GMT, Starknet faced some downtime. Here’s what happened:

  1. 40 minutes without blocks
  2. First post-stall block with 10 txs
  3. 10 minutes without blocks
  4. Second post-stall block with 10 txs
  5. Normal operation

The goal of this document is to give some background, explain the timeline above, and illustrate how upcoming versions will improve the system’s robustness and efficiency.

Background

In the current Starknet architecture, a service called a gateway (henceforth GW) is responsible for receiving transactions, doing some pre-processing (before execution), and passing them down the line to the sequencer. There are always multiple GW instances active on mainnet.

Every transaction that reaches the GW opens an add_tx thread that takes some CPU. Roughly speaking, this thread performs the following operations:

  1. Checks that the user has the declared fee amount
  2. Runs validate
  3. Writes tx to DB
  4. Returns the received/rejected status to users
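The four steps above can be sketched in Python (the GW's language). This is a hypothetical illustration: the function names and transaction fields are made up for clarity, not the actual Starknet GW code.

```python
import threading

def check_fee_balance(tx):
    # Step 1: check the user holds the declared fee amount.
    return tx.get("balance", 0) >= tx.get("max_fee", 0)

def run_validate(tx):
    # Step 2: run the account's validate entry point (simplified to a flag here).
    return tx.get("valid", True)

def write_to_db(tx, db):
    # Step 3: persist the transaction for the rest of the pipeline.
    db.append(tx)

def add_tx(tx, db):
    # Step 4: return the received/rejected status to the user.
    if not check_fee_balance(tx) or not run_validate(tx):
        return "rejected"
    write_to_db(tx, db)
    return "received"

db = []
statuses = []
txs = [
    {"balance": 100, "max_fee": 10, "valid": True},
    {"balance": 5, "max_fee": 10, "valid": True},  # cannot cover its fee
]
# Each incoming transaction opens its own add_tx thread, as in the GW.
threads = [threading.Thread(target=lambda t=t: statuses.append(add_tx(t, db)))
           for t in txs]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(sorted(statuses))  # ['received', 'rejected']
```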

To avoid passing duplicates of the same transaction to the sequencer, there’s a handler called the deduplicator. By design, there is only one instance of this handler at a given point in time, and it sits in the path to block creation. If this process stalls, so does block creation.

  1. The deduplicator works with packets of 10 transactions at a time.
  2. The deduplicator process involves many asynchronous tasks; it uses ≥40 threads that access storage (and therefore require context switching).
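The deduplicator's core job (batching 10 txs at a time and dropping anything already forwarded) can be sketched as follows. This is a minimal illustrative model, not the real implementation; in particular it ignores the ≥40 storage-access threads involved.

```python
BATCH_SIZE = 10  # the deduplicator works with packets of 10 transactions

def dedup_batch(pending, seen):
    """Pull up to BATCH_SIZE new txs from the pending queue, skipping any
    tx hash that has already been handed to the sequencer."""
    batch = []
    while pending and len(batch) < BATCH_SIZE:
        tx = pending.pop(0)
        if tx["hash"] in seen:
            continue  # duplicate: never forwarded twice
        seen.add(tx["hash"])
        batch.append(tx)
    return batch

pending = [{"hash": h} for h in [1, 2, 2, 3, 1, 4]]
seen = set()
batch = dedup_batch(pending, seen)
print([tx["hash"] for tx in batch])  # [1, 2, 3, 4]
```

Because only one such handler exists at a time and it sits on the path to block creation, any slowdown in this loop delays every block.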

Lastly, the GW logic is implemented in Python, whose Global Interpreter Lock (GIL) means that CPU-bound threads run one at a time even when they are independent. Consequently, for a packet of 10 txs to reach the sequencer, the time we must wait for the deduplicator is the sum of the times for each of its ≥40 long threads to resolve, not the maximum.
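The latency impact of this sum-not-max behavior can be shown with a back-of-envelope model. The numbers below are made up for illustration, not measurements from the incident:

```python
# Illustrative model: under the GIL, CPU-bound threads effectively serialize,
# so the wait for one deduplicator packet is the SUM of its thread times
# rather than the MAX a truly parallel runtime would give.
thread_times = [1.0] * 40  # >=40 storage-access tasks, assume ~1 s each under load

sequential_wait = sum(thread_times)  # what sequential execution costs
parallel_wait = max(thread_times)    # what parallel execution would cost

print(sequential_wait)  # 40.0 seconds per 10-tx packet
print(parallel_wait)    # 1.0 second per 10-tx packet
```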

Timeline explanation

  1. 40 minutes without blocks – a big influx of transactions spawned many threads, spreading the GW's CPU very thinly across all of them. Starved of CPU under the high load, the deduplicator took roughly 40 minutes to finish its first packet of 10 txs.
  2. First post-stall block with 10 txs – the second deduplicator task took longer than the block timeout, so the sequencer closed a 10 tx block.
  3. 10 minutes without blocks – the high load lessened, freeing up some CPU, but not enough to immediately resolve the issue as the second deduplicator task still moved slowly.
  4. Second post-stall block with 10 txs – the third deduplicator task also took longer than the block timeout, resulting in another small block being closed.
  5. Normal operation – additional GWs were spun up, and then the existing ones were reset, causing the existing load to be spread over more CPU. The restarted add_tx threads and the fresh deduplicator process now had enough CPU to run faster, resuming normal operation.

Note that transactions whose add_tx thread resolved within roughly 15 seconds returned received or rejected status and joined the backlog. On the other hand, transactions with long add_tx threads that did not resolve in time returned internal server errors and did not reach the backlog. Generally speaking, these were transactions with heavy validate functions.

What next?

Upcoming Starknet versions feature significant improvements to GW performance.

  1. v0.12.2
  • More GW instances were spun up to make more CPU available.
  • Improvements to load balancing when GW instances are congested.
  2. v0.12.3 (already past integration and currently on testnet)
  • The GW has improved logic, making more efficient use of CPU and performing roughly 4x as well.
  3. v0.13.0 (currently on integration)
  • The GW will use a Rust implementation of the Cairo VM to run validate, making CPU usage still more efficient.
  • The deduplicator will be moved to a separate service with dedicated CPU instead of living in one of the GWs.

Thanks for the post-mortem! Being public about these things is essential.

Will the GW be fully deprecated at some point once decentralization progresses? Or will we simply have more GW instances, run by different entities?

Hi @LauriP!

In the decentralized world, each sequencer node will choose its own internal architecture. Currently, the sequencer uses GWs as a convenient filter for rejecting transactions that either don’t have enough fees or don’t pass the respective account’s validate. This seems to me like a natural thing to do, but who knows what others will come up with.

Does that answer the question(s)?

Ah right. So everyone just does as they see fit. Makes sense.
As long as they take in txs, do their job and forward tx bundles to a prover, all’s good.