Starknet mainnet downtimes, 09.10-15.10

TL;DR

Between Oct 9th and Oct 15th, Starknet experienced a series of downtimes. Most of them lasted a few minutes, but three downtimes were longer, lasting 10-20 minutes each. Two separate issues caused these outages: an Aerospike hotspot issue and a misconfiguration in transaction execution time. Both issues were promptly handled and fixed.

Root Causes

Starknet identified two main bugs as responsible for these outages:
Bug 1 - Aerospike hotspot issue: Growing demand for fetching preconfirmed blocks created hotspots within the Aerospike database. As a result, the committer, a service that consumes the database, could not update the block hashes, which stalled the network.

Bug 2 - Misconfiguration in transaction execution time: Starknet started receiving some transactions that required more than 5 seconds to execute. This caused a loop: the transaction was picked to be sequenced, the block could not include it, it returned to the mempool, and the same thing happened in the next block until the transaction was evicted from the mempool.
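The loop described in Bug 2 can be illustrated with a toy model. This is only a sketch under stated assumptions, not the sequencer's actual code: the `Tx` type, the `build_blocks` function, and the `MAX_ATTEMPTS` eviction threshold are hypothetical, and the 5-second budget is taken from the figure quoted above.

```python
from collections import deque
from dataclasses import dataclass

# Assumed per-transaction execution budget, based on the 5-second figure above.
EXECUTION_BUDGET_S = 5.0
# Hypothetical number of failed attempts before the mempool evicts a transaction.
MAX_ATTEMPTS = 3


@dataclass
class Tx:
    tx_id: str
    execution_time_s: float
    attempts: int = 0


def build_blocks(mempool, num_blocks):
    """Try to sequence `num_blocks` blocks; return (blocks, evicted tx ids)."""
    blocks, evicted = [], []
    for _ in range(num_blocks):
        block, retry = [], []
        while mempool:
            tx = mempool.popleft()
            tx.attempts += 1
            if tx.execution_time_s <= EXECUTION_BUDGET_S:
                block.append(tx.tx_id)      # fits the budget: included
            elif tx.attempts >= MAX_ATTEMPTS:
                evicted.append(tx.tx_id)    # gave up: evicted from the mempool
            else:
                retry.append(tx)            # too slow: goes back to the mempool
        mempool.extend(retry)
        blocks.append(block)
    return blocks, evicted


mempool = deque([Tx("fast", 0.2), Tx("slow", 7.5)])
blocks, evicted = build_blocks(mempool, num_blocks=3)
print(blocks, evicted)
```

In this model the slow transaction is picked up in every block and makes no progress until the eviction threshold is reached, which mirrors the repeated pick, reject, and return cycle described above.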

How did the Starknet team react?

During this period, the team worked around the clock, including weekends, to accomplish the following:

  • Delivered several mainnet changes to ease and fix the core issues.
  • Cooperated with ecosystem applications to mitigate the incident impact and ensure resumed operation.
  • Transparently communicated the situation to the ecosystem in real time in the relevant Telegram group.

What went well

  • The team responded fast to the incidents, monitored the situation, and communicated downtimes transparently over all of the available means of communication.
  • The team was dedicated to solving the issues fast, including shipping mainnet changes on weekends, holidays, and days off, and working alongside the Aerospike team to understand how to solve the issue.
  • In the second incident, the core issues were detected and isolated quickly and efficiently.

What needs improvement

  • Adding metrics and monitoring for Aerospike connection utilization and hotspot creation - Done
  • Making the caching mechanism even more robust and comprehensive than what was done on 12.10 - Done
  • Adding more logs to better investigate execution times - WIP, many logs have already been added
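
To illustrate why caching helps with a hotspot, here is a minimal read-through cache sketch. It is not the caching mechanism the team deployed: the `ReadThroughCache` class, the `fetch` callback, the key name, and the TTL value are all hypothetical, and the backend is a stand-in function rather than a real Aerospike client.

```python
import time


class ReadThroughCache:
    """Serve repeated reads of a hot key from memory for a short TTL."""

    def __init__(self, fetch, ttl_s=1.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callback that reads the backend
        self._ttl_s = ttl_s
        self._clock = clock      # injectable clock, useful for testing
        self._entries = {}       # key -> (expires_at, value)
        self.backend_reads = 0   # simple counter, usable as a metric

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # fresh entry: no backend read
        self.backend_reads += 1
        value = self._fetch(key)
        self._entries[key] = (now + self._ttl_s, value)
        return value


# Stand-in backend; a real deployment would query the database here.
cache = ReadThroughCache(fetch=lambda key: f"data-for-{key}", ttl_s=1.0)
for _ in range(100):
    cache.get("preconfirmed-block")
print(cache.backend_reads)
```

With a cache like this, many readers asking for the same key within the TTL result in a single backend read, which reduces pressure on the hot record.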

Conclusion

These incidents were a valuable stress test for Starknet’s infrastructure. They helped surface edge cases that could only appear under real network load and allowed the core team to harden key components for greater stability. Thanks to quick coordination between core contributors, node operators, and application teams, the network is now more resilient under pressure.