What happened?
On Nov 15, 2023, at 10:00 AM GMT, Starknet experienced downtime. Here’s what happened:
- 40 minutes without blocks
- First post-stall block with 10 txs
- 10 minutes without blocks
- Second post-stall block with 10 txs
- Normal operation
The goal of this document is to give some background, explain the timeline above, and illustrate how upcoming versions will improve the system’s robustness and efficiency.
Background
In the current Starknet architecture, a service called a gateway (henceforth GW) is responsible for receiving transactions, doing some pre-processing (before execution), and passing them down the line to the sequencer. Multiple GW instances are always active on mainnet.
Every transaction that reaches the GW opens an add_tx thread that takes some CPU. Roughly speaking, this thread performs the following operations (a minimal sketch follows the list):
- Checks the user has the declared fee amount
- Runs validate
- Writes tx to DB
- Returns the received/rejected status to users
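The following is that minimal sketch of a per-transaction add_tx handler. Every name in it (Tx, InMemoryState, InMemoryDb, run_validate) is an illustrative stand-in for this example, not Starknet’s actual gateway code.

```python
# Minimal sketch of the per-transaction add_tx flow described above.
# All names here are illustrative stand-ins, not Starknet's actual gateway code.
from dataclasses import dataclass


@dataclass
class Tx:
    sender: str
    max_fee: int
    calldata: tuple = ()


class InMemoryState:
    """Toy stand-in for the state the GW consults when checking balances."""

    def __init__(self, balances: dict):
        self.balances = balances

    def balance_of(self, account: str) -> int:
        return self.balances.get(account, 0)


class InMemoryDb:
    """Toy stand-in for the DB the GW writes accepted transactions to."""

    def __init__(self):
        self.txs = []

    def write(self, tx: Tx) -> None:
        self.txs.append(tx)


def run_validate(tx: Tx, state: InMemoryState) -> bool:
    # Placeholder for running the account's validate logic (the tx itself is not executed).
    return True


def add_tx(tx: Tx, state: InMemoryState, db: InMemoryDb) -> str:
    # 1. Check that the user holds the declared fee amount.
    if state.balance_of(tx.sender) < tx.max_fee:
        return "REJECTED"
    # 2. Run validate.
    if not run_validate(tx, state):
        return "REJECTED"
    # 3. Write the tx to the DB so it can continue down the line to the sequencer.
    db.write(tx)
    # 4. Return the received/rejected status to the user.
    return "RECEIVED"


if __name__ == "__main__":
    state = InMemoryState({"0xabc": 1_000})
    db = InMemoryDb()
    print(add_tx(Tx(sender="0xabc", max_fee=100), state, db))  # RECEIVED
    print(add_tx(Tx(sender="0xdef", max_fee=100), state, db))  # REJECTED (no balance)
```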
To avoid passing duplicates of the same transaction to the sequencer, there’s a handler called the deduplicator. By design, there is only one instance of this handler at a given point in time, and it sits in the path to block creation. If this process stalls, so does block creation.
- The deduplicator works with packets of 10 transactions at a time.
- The deduplicator process involves many asynchronous tasks; it uses ≥40 threads that access storage (and therefore require context switching).
Lastly, the GW logic is implemented in Python, which has a sequential nature: threads are processed in sequence even when they are independent. Consequently, for a packet of 10 txs to reach the sequencer, the time we must wait for the deduplicator is the sum of the times it takes each of its ≥40 long threads to resolve (the toy example below illustrates how these latencies add up).
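The snippet below is that toy illustration. It is not the real deduplicator; the storage_lookup helper and its 0.05-second latency are invented for the example. It only shows how, when storage-bound lookups effectively run one after another, a packet’s latency becomes roughly the sum of the individual latencies.

```python
# Toy illustration (not the real deduplicator) of why a packet's latency is
# roughly the sum of its per-task latencies when storage-bound lookups run
# one after another. storage_lookup and the 0.05 s latency are invented here.
import time


def storage_lookup(tx_hash: str, latency_s: float = 0.05) -> bool:
    # Stand-in for a storage read that decides whether tx_hash was seen before.
    time.sleep(latency_s)
    return False  # pretend nothing is a duplicate


def dedup_packet(tx_hashes: list) -> list:
    # With effectively sequential execution, total time is about the sum of the lookups' latencies.
    return [h for h in tx_hashes if not storage_lookup(h)]


if __name__ == "__main__":
    packet = [f"0x{i:02x}" for i in range(10)]  # the deduplicator works in packets of 10 txs
    start = time.time()
    fresh = dedup_packet(packet)
    # 10 lookups x 0.05 s is about 0.5 s here; with >=40 slow, CPU-starved tasks the sum grows accordingly.
    print(f"{len(fresh)} txs passed dedup in {time.time() - start:.2f} s")
```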
Timeline explanation
- 40 minutes without blocks – a big influx of transactions spawned many threads, which spread the GW’s CPU very thinly across all of them. Starved of CPU under this high load, the deduplication process took roughly 40 minutes to finish its packet of 10 txs.
- First post-stall block with 10 txs – the second deduplicator task took longer than the block timeout, so the sequencer closed a 10 tx block.
- 10 minutes without blocks – the high load lessened, freeing up some CPU, but not enough to immediately resolve the issue as the second deduplicator task still moved slowly.
- Second post-stall block with 10 txs – the third deduplicator task took longer than the batch timeout, resulting in another small block closed.
- Normal operation – additional GWs were spun up, and then the existing ones were reset, causing the existing load to be spread over more CPU. The restarted add_tx threads and the fresh deduplicator process now had enough CPU to run faster, resuming normal operation.
Note that transactions whose add_tx thread resolved within roughly 15 seconds returned received or rejected status and joined the backlog. On the other hand, transactions with long add_tx threads that did not resolve in time returned internal server errors and did not reach the backlog. Generally speaking, these were transactions with heavy validate functions.
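This timeout behaviour can be pictured with a small asyncio sketch: a request whose add_tx work finishes within the window returns its status, while one that exceeds it surfaces an internal server error and never reaches the backlog. The handler names, timings, and the 15-second default below are illustrative assumptions, not the GW’s actual implementation.

```python
# Hedged sketch of the timeout behaviour described above. The handler names and
# the 15-second window are illustrative assumptions, not the GW's actual code.
import asyncio


async def add_tx(tx_id: str, work_s: float) -> str:
    # Stand-in for the fee check, validate run, and DB write.
    await asyncio.sleep(work_s)
    return "RECEIVED"


async def handle_request(tx_id: str, work_s: float, timeout_s: float = 15.0) -> str:
    try:
        # Resolves in time: the user gets the received/rejected status and the tx joins the backlog.
        return await asyncio.wait_for(add_tx(tx_id, work_s), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Too slow (e.g. a heavy validate): the user gets an internal server error
        # and the tx never reaches the backlog.
        return "500 INTERNAL SERVER ERROR"


if __name__ == "__main__":
    print(asyncio.run(handle_request("0x1", work_s=0.1)))                 # RECEIVED
    print(asyncio.run(handle_request("0x2", work_s=5.0, timeout_s=0.5)))  # 500 INTERNAL SERVER ERROR
```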
What next?
Upcoming Starknet versions feature significant improvements to GW performance.
- v0.12.2
  - More GW instances were spun up to make more CPU available.
  - Improvements to load balancing when GW instances are congested.
- v0.12.3 (already past integration and currently on testnet)
  - The GW has improved logic, making more efficient use of CPU and performing roughly 4x as well.
- v0.13.0 (currently on integration)
  - The GW will use a Rust implementation of the Cairo VM to run validate, making CPU usage still more efficient.
  - The deduplicator will be moved to a separate service with dedicated CPU instead of living in one of the GWs.