Bugs and errors should be expected as a natural part of developing any complex system. What matters is how these unexpected errors are handled, and whether there is a systematic approach to solving, mitigating and avoiding them.
Here we present a concrete list of proposals to improve the state of Starknet DevOps.
- Documentation
Since blockchains and rollups are complex systems, it’s impossible to keep the entire system in our heads. There are too many details. Each unknown part represents a high risk when building new parts or maintaining existing ones, and extensive documentation reduces the area in which these unknowns can exist.
If we’re successful, Starknet’s complexity will only increase with time: the number of contributors, projects, lines of code, specs and features, as well as the rate of change, will all keep growing.
The only way to manage the huge amount of complexity is by making this knowledge explicit. In software projects this means documentation.
No software project is complete without documentation such as user guides, examples, references, test vectors, and change lists. Documentation shows users how to adopt, deploy, and use any given software. Without proper documentation, potential users might overlook or abandon even the most powerful and dynamic software products.
Documentation is as important as working code. It not only helps with the maintainability of a system but also fosters its improvement, since it helps to onboard new builders into the ecosystem.
We need public documentation and diagrams explaining the full flow of the system: each part involved, from micro-services to the communication and dependencies between them.
- Monitoring and Alerting
Monitoring is based on gathering predefined sets of metrics or logs from each part of the ecosystem.
Currently, the only information available about the network’s performance comes from various explorers that focus solely on the end-to-end aspect. There is no tool to observe each individual component, and no mechanism by which data and requests can be traced through the system and events correlated in time.
Having thorough documentation of the architecture, coupled with the right monitoring tools to publicly display the network’s status, is crucial to Starknet’s development. Tools like Grafana, Prometheus, or similar options are relatively easy to implement in various environments.
Once a simple monitoring tool is in place, you can configure alarms that not only help in addressing problems more quickly, but also potentially aid in error prevention through the use of different threshold settings.
Providing this information to everyone is a key component in establishing trust in a system that undergoes continuous evolution.
Moreover, establishing clear communication channels where authorized members (or automated systems) can send reports about any status changes is essential. It’s not necessary to inundate the community with numerous alerts; rather, having a minimum number of alerts that assure the community of the network’s optimal performance is key.
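As a rough illustration of threshold-based alerting that stays quiet unless something changes, here is a minimal Python sketch. The metric name `seconds_since_last_block` and the threshold values are assumptions made up for the example, not measurements or settings taken from Starknet.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Map a metric value to a severity level."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


class AlertTracker:
    """Emits a message only when the severity of a metric changes."""

    def __init__(self, metric: str, threshold: Threshold) -> None:
        self.metric = metric
        self.threshold = threshold
        self.last = "ok"

    def observe(self, value: float) -> Optional[str]:
        severity = classify(value, self.threshold)
        if severity == self.last:
            return None  # no state change, so nothing is sent
        self.last = severity
        return f"{self.metric}: {severity} (value={value})"


# Hypothetical metric and thresholds, chosen only for illustration.
tracker = AlertTracker("seconds_since_last_block", Threshold(warning=60, critical=300))
messages = [tracker.observe(v) for v in [12, 75, 90, 310, 20]]
```

Five samples produce three messages here: one when the metric crosses the warning level, one when it becomes critical, and one when it recovers. The warning level gives operators a chance to act before the critical one is reached.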
Building a robust monitoring scheme will make the system observable. The key concept behind observability is to learn about unknown-unknowns. A system is observable when you can ask any arbitrary question and dive deep to explore properties and patterns not defined in advance.
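One common way to make requests traceable across components is to attach a shared identifier to every event a request produces, then group and order events by that identifier. The sketch below assumes each component emits events with `trace_id`, `component`, `timestamp` and `message` fields; these field names and the component names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, TypedDict


class Event(TypedDict):
    trace_id: str
    component: str
    timestamp: float
    message: str


def correlate(events: List[Event]) -> Dict[str, List[Event]]:
    """Group events by trace id and order each group by time."""
    timelines: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        timelines[event["trace_id"]].append(event)
    for timeline in timelines.values():
        timeline.sort(key=lambda e: e["timestamp"])
    return dict(timelines)


def path(timeline: List[Event]) -> List[str]:
    """Components a request went through, in chronological order."""
    return [event["component"] for event in timeline]


# Hypothetical events, as they might arrive out of order from several services.
events: List[Event] = [
    {"trace_id": "a1", "component": "sequencer", "timestamp": 2.0, "message": "executed"},
    {"trace_id": "b2", "component": "gateway", "timestamp": 1.5, "message": "received"},
    {"trace_id": "a1", "component": "gateway", "timestamp": 1.0, "message": "received"},
    {"trace_id": "a1", "component": "prover", "timestamp": 9.0, "message": "proved"},
]
timelines = correlate(events)
```

With the events grouped this way, questions such as "where did request `b2` stop?" can be answered after the fact, without having defined a dedicated metric for them in advance.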
- Local development replica
Every engineer building a system should have the ability to launch all the components locally or remotely to test, debug and play with it. One way to address this is to use tools like Docker Compose or Tilt with every service that is part of Starknet, so that any engineer can reproduce bugs seen in testnet or mainnet, or test big changes before they are added.
Another possibility is establishing a standard way across components of querying or dumping production data to replay traffic and/or events in local development environments.
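To make the replay idea concrete, here is a minimal sketch that reads a dump of events and feeds them to a local handler in chronological order. It assumes the dump is a JSON Lines file in which every record has a `timestamp` field; that format and the `kind` and `payload` fields are assumptions for the example, not an existing Starknet convention.

```python
import io
import json
from typing import IO, Callable, Iterator


def load_events(dump: IO[str]) -> Iterator[dict]:
    """Yield one event per non-empty line of a JSON Lines dump."""
    for line in dump:
        line = line.strip()
        if line:
            yield json.loads(line)


def replay(dump: IO[str], handler: Callable[[dict], None]) -> int:
    """Feed dumped events to a local handler in timestamp order."""
    events = sorted(load_events(dump), key=lambda e: e["timestamp"])
    for event in events:
        handler(event)
    return len(events)


# In-memory stand-in for a dump file; in practice this would be open("dump.jsonl").
sample_dump = io.StringIO(
    '{"timestamp": 2, "kind": "tx", "payload": "b"}\n'
    '\n'
    '{"timestamp": 1, "kind": "tx", "payload": "a"}\n'
)
seen = []
count = replay(sample_dump, seen.append)
```

Here the handler only collects events in a list. In a local replica it would be whatever function submits the event to the service under test.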
- A/B Production Testing
A decentralized service running as a mesh of nodes operated by several different organizations does not have the same requirements as a centralized PaaS, because it benefits from node and client diversity. Even so, it can be beneficial for the software to support deploying a new version and running a portion of live traffic through the subsystem under test, to smoke out bugs before the new version goes live on all nodes.
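A simple way to send only a portion of traffic to the version under test is to hash a stable key of each request and compare it against the desired fraction. The following is a sketch under that assumption; the names `stable` and `candidate` and the choice of key (for example a transaction hash) are hypothetical.

```python
import hashlib

BUCKETS = 10_000


def bucket(key: str) -> int:
    """Map a request key to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def route(key: str, candidate_fraction: float) -> str:
    """Send roughly `candidate_fraction` of keys to the version under test."""
    if not 0.0 <= candidate_fraction <= 1.0:
        raise ValueError("candidate_fraction must be between 0 and 1")
    if bucket(key) < candidate_fraction * BUCKETS:
        return "candidate"
    return "stable"


# Hypothetical request keys.
keys = [f"tx-{i}" for i in range(10_000)]
share = sum(route(k, 0.1) == "candidate" for k in keys) / len(keys)
```

Because the decision depends only on the key, the same request always goes to the same version. Raising the fraction only moves more keys to the candidate and never moves one back, which makes a gradual rollout easier to reason about.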
- Automatic deployment and scaling
As a system becomes more complex, it needs tools that automate tasks like building, testing, deploying, and scaling. This applies both to the system as a whole and to each individual service or microservice. Automation tools such as Terraform or Ansible can make this work far more efficient.
Human intervention should be minimal for tasks that can be easily automated, and none at all for scaling the system.
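As an illustration of scaling without human intervention, the sketch below computes a desired replica count from the observed load using a proportional rule, the same shape of rule documented for the Kubernetes Horizontal Pod Autoscaler. The function name, the bounds and the load figures are assumptions for the example; a real setup would take them from monitoring data and from the deployment tooling in use.

```python
import math


def desired_replicas(
    current: int,
    observed_load: float,
    target_load: float,
    min_replicas: int = 1,
    max_replicas: int = 20,
) -> int:
    """Proportional scaling: keep the load per replica close to target_load."""
    if current < 1:
        raise ValueError("current must be at least 1")
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    raw = math.ceil(current * observed_load / target_load)
    # Clamp so a metric spike or outage cannot scale to zero or without limit.
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical figures: 4 replicas, each seeing 90 requests/s against a target of 60.
scale_up = desired_replicas(current=4, observed_load=90, target_load=60)
scale_down = desired_replicas(current=4, observed_load=30, target_load=60)
```

The upper and lower bounds are a deliberate design choice: an automated loop that acts on a faulty metric should be limited in how much damage it can do.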
- Load testing tooling
When systems are designed to handle a large volume of users and transactions (such as a blockchain), it is imperative to understand how they will behave under extreme load conditions. To this end, load testing during all stages of the development process helps identify bottlenecks, performance issues, and general operational limitations in a controlled environment, before the system is live and serving real users.
If we want to ensure high scalability and throughput, it is necessary to put each component of the system under pressure to validate its operation. In the case of blockchains and rollups, there are many components which operate through different interfaces, which increases the number of potential operational limits that we want to understand. By building tools that help us replicate a production environment (as was mentioned above) and load testing these environments easily, we can gain valuable insights into how to make the system more robust and efficient.
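A load testing tool can start very small. The sketch below calls a target function concurrently and reports latency percentiles. The `fake_endpoint` function stands in for a real call to the component under test and is purely hypothetical, as are the request count and concurrency level.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def percentile(sorted_values: List[float], p: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def timed_call(target: Callable[[], None]) -> float:
    """Return how long one call to target takes, in seconds."""
    start = time.perf_counter()
    target()
    return time.perf_counter() - start


def load_test(target: Callable[[], None], requests: int, concurrency: int) -> Dict[str, float]:
    """Call target `requests` times from `concurrency` threads and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(lambda _: timed_call(target), range(requests)))
    return {
        "requests": len(latencies),
        "p50": percentile(latencies, 50),
        "p99": percentile(latencies, 99),
        "max": latencies[-1],
    }


def fake_endpoint() -> None:
    """Hypothetical stand-in for a request to the component under test."""
    time.sleep(0.001)


report = load_test(fake_endpoint, requests=50, concurrency=5)
```

Percentiles are reported instead of an average because bottlenecks usually show up in the tail first. Running the same test at increasing concurrency levels shows where latency starts to degrade.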
LambdaClass Core Engineering