The Zilliqa network experienced a disruption on December 18, 2023, which resulted in several hours of downtime.
Block production was temporarily interrupted until the technical team restored full network functionality later the same day.
Following this disruption, the Zilliqa technical team has conducted a root cause analysis of this event and found it to be the result of a critical inconsistency issue caused by a divergence in root hashes.
We apologise for the inconvenience caused by this incident and appreciate the support exhibited by the Zilliqa community as we worked to resolve the network downtime.
Below is the technical team’s analysis of this disruption and the steps we are taking to improve the reliability of the Zilliqa network.
Root Cause Analysis – Zilliqa Network Disruption on December 18th
The root cause analysis conducted by the Zilliqa technical team found that a critical inconsistency issue was encountered during the processing of block 3428513, with a subset of nodes failing to receive the complete set of microblocks associated with this block.
As a result, this subset of nodes derived a root hash that diverged from the one computed by the rest of the network, leading to a conflict in establishing consensus.
These nodes subsequently exited consensus, leaving an insufficient number of votes to commit this block, and attempts to retry this commitment led to the same issue of a mismatch between root hashes.
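The failure mode described above can be illustrated with a simplified sketch. This is not Zilliqa's code: the `state_root` function is a toy stand-in for a real root hash, the microblock payloads are made up, and the "more than two thirds" commit threshold in `has_quorum` is an assumption used only for illustration.

```python
import hashlib

def state_root(microblocks):
    """Toy stand-in for a root hash: digest over the sorted microblock payloads."""
    h = hashlib.sha256()
    for mb in sorted(microblocks):
        h.update(hashlib.sha256(mb).digest())
    return h.hexdigest()

def has_quorum(votes_for, total_nodes):
    """Assumed threshold: strictly more than two thirds of nodes must agree."""
    return 3 * votes_for > 2 * total_nodes

# Hypothetical payloads; a real block carries actual microblock data.
FULL_SET = [b"microblock-0", b"microblock-1", b"microblock-2"]
PARTIAL_SET = FULL_SET[:2]  # what a node sees if one microblock never arrives

def tally(total_nodes, nodes_missing_data):
    """Count nodes whose root hash matches the expected one and check quorum."""
    expected = state_root(FULL_SET)
    roots = ([state_root(PARTIAL_SET)] * nodes_missing_data
             + [expected] * (total_nodes - nodes_missing_data))
    votes = sum(1 for root in roots if root == expected)
    return votes, has_quorum(votes, total_nodes)
```

In this sketch, a node holding only `PARTIAL_SET` computes a different digest from its peers, and once enough nodes are in that position `tally` reports that the remaining matching votes fall below the assumed threshold. Retrying with the same incomplete data reproduces the same mismatch.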
The Zilliqa network's existing codebase heavily relies on the assumption of reliable object gossiping across the network, and while there are mechanisms in place intended to recover from such discrepancies, in this instance, the relevant recovery code failed to activate.
To rectify this issue and restore consensus, the Zilliqa network had to be restarted. This straightforward procedure was enacted immediately; however, it resulted in several hours of downtime.
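To make the idea of recovering from a gossip gap more concrete, here is a minimal sketch of a pull-based fallback, in which a node that notices missing microblocks requests them from a peer instead of assuming they will arrive. It does not reflect Zilliqa's actual recovery code. The function names, the `fetch_from_peer` callback, and the use of SHA-256 hex digests as microblock identifiers are all hypothetical.

```python
import hashlib

def missing_hashes(announced, received):
    """Return announced microblock hashes absent from the local store, in order."""
    return [h for h in announced if h not in received]

def recover(announced, received, fetch_from_peer):
    """Try to pull each missing microblock from a peer.

    `fetch_from_peer` is a hypothetical callback that returns the payload
    bytes for a hash, or None if the peer does not have it.
    Returns True only if the local set is complete afterwards.
    """
    for h in missing_hashes(announced, received):
        payload = fetch_from_peer(h)
        # Accept the payload only if it actually hashes to the requested id.
        if payload is not None and hashlib.sha256(payload).hexdigest() == h:
            received[h] = payload
    return not missing_hashes(announced, received)
```

A node would run a check like `recover` before computing its root hash, so that an incomplete set is detected and repaired rather than silently producing a divergent hash.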
Improving Zilliqa’s reliability and efficiency
The incident on December 18, 2023 demonstrates the need for the continuous improvement of Zilliqa’s reliability and efficiency.
The technical team is working hard to roll out network architecture upgrades that will enhance the reliability of the Zilliqa network and reduce the risk of downtime due to incidents like the one described above.
Upcoming updates to Zilliqa are set to deliver a major boost to overall network resilience and mitigate both the risk and impact of a similar incident occurring in future.
A number of network improvements are being implemented that will not only deliver major performance enhancements and exciting new features, but will also directly improve the network’s ability to handle disruptive incidents.
These changes include the following:
Zilliqa network upgrade v9.3.0
The Zilliqa v9.3.0 upgrade, deployed on January 3, 2024, greatly diminishes the risk of similar inconsistencies and disruptions to consensus through desharding the network.
This change will improve the network’s efficiency and reliability, providing a solid foundation for the launch of a new and improved sharding architecture with Zilliqa 2.0.
Zilliqa v9.3.0 also features a new active reward control mechanism, improved EVM compatibility, mining efficiency improvements, and much more.
Migration to Google Cloud Platform (GCP)
As part of Zilliqa’s strategic alliance with Google Cloud, nodes operated by the Zilliqa infrastructure team are in the process of migrating to Google Cloud Platform (GCP).
This will significantly reduce network startup times, allowing the network to be restored far more quickly and minimising downtime in scenarios where a restart is required.
A more reliable network with Zilliqa 2.0
Zilliqa 2.0, which is currently expected to be released in the second half of 2024, addresses fundamental issues related to this disruptive incident.
This overhauled and greatly enhanced version of Zilliqa will eliminate the assumption of reliable object gossiping across the network, which was the primary cause of the downtime experienced on December 18. It will also use a new consensus mechanism that allows consensus to be maintained in similar scenarios via a self-healing model.
Zilliqa 2.0 will also introduce more effective data persistence, reducing the lengthy join times currently observed on Zilliqa.
Additionally, instead of the several hours currently required to restore the Zilliqa network, Zilliqa 2.0 is being designed to boot up completely in approximately 10 minutes.
All the changes described above are designed to greatly improve the reliability of the Zilliqa network, delivering an efficient, flexible, and stable network that minimises disruption.
The incident on December 18 underscores the need for a more reliable and dynamic network architecture, which is the core of the design philosophy for Zilliqa 2.0 and the upgrades currently being rolled out to the network.
We apologise again for the inconvenience caused by this downtime and appreciate the continued support of the Zilliqa community as we work to improve the network’s resilience and reliability.