Optics Failure Cases

Optics is a robust system, resistant to many kinds of problems. However, a set of failure cases requires human intervention, and those cases are enumerated here.

Agent State/Config

Updater

  • Two updaters deployed with the same config
    • (See Double Update)
  • Extended updater downtime
    • Effect:
      • Updates stop being sent for a period of time
    • Mitigation:
      • Updater Rotation (not implemented)
  • Fraudulent updater
    • Effect:
      • Invalid or fraudulent update is sent
    • Mitigation:
      • Watcher detects fraud, submits fraud proof (see Improper Update)

Relayer

  • Relayer "relays" the same update more than once
    • Effect:
      • Only the first one works
      • Subsequent transactions are rejected by the replicas
    • Mitigation:
      • Mempool scanning

        • Ask: "Is there a tx in the mempool already that does what I want to do?" If so, do nothing and pick another message to process.

      • If minimizing gas use: Increase polling interval (check less often)
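
The mempool check described above can be sketched roughly as follows. This is only an illustration, not the agents' actual code: `PendingTx`, `already_pending`, and `pick_next` are hypothetical names, and the mempool is modeled as a plain list rather than being fetched from a node.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PendingTx:
    """Hypothetical, simplified view of a transaction waiting in the mempool."""
    to: str         # destination contract address (e.g. a replica)
    payload: bytes  # calldata the transaction would execute


def already_pending(mempool: Iterable[PendingTx], to: str, payload: bytes) -> bool:
    """Return True if some pending tx already does what we want to do."""
    return any(tx.to == to and tx.payload == payload for tx in mempool)


def pick_next(
    mempool: List[PendingTx], candidates: List[Tuple[str, bytes]]
) -> Optional[Tuple[str, bytes]]:
    """Return the first candidate (to, payload) not already pending, or None."""
    for to, payload in candidates:
        if not already_pending(mempool, to, payload):
            return (to, payload)
    return None  # everything we wanted to send is already in flight
```

Comparing destination and calldata is an assumption made for the sketch; a real agent might need to decode the call and compare only the fields that matter.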

Processor

  • Processor "processes" the same message more than once
    • Effect:
      • Only the first one works
      • Subsequent transactions are rejected by the smart contracts

Watcher

  • Watcher and Fraudulent Updater Collude
    • Effect:
      • Fraud is possible
    • Mitigation:
      • Distribute watcher operations to disparate entities. Anyone can run a watcher.

General

  • Transaction Wallets Empty
    • Effect:
      • Transactions cease to be sent
    • Mitigation:
      • Monitor and top-up wallets on a regular basis
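
A wallet monitor of the kind suggested above could start from a check like this one. It is a minimal sketch under stated assumptions: the function name, the wallet labels, and the threshold are hypothetical, and balances are passed in as integers instead of being queried from a chain.

```python
from typing import Dict, List


def wallets_needing_top_up(balances: Dict[str, int], min_balance: int) -> List[str]:
    """Return the labels of wallets whose balance is below min_balance, sorted.

    `balances` maps a hypothetical wallet label (e.g. "relayer") to its balance
    in the chain's smallest unit. In practice these values would be read from
    an RPC node on a schedule.
    """
    return sorted(name for name, balance in balances.items() if balance < min_balance)
```

Running this on a regular interval and alerting on a non-empty result gives operators time to top up before transactions stop being sent.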

Contract State

  • Double Update
    • Happens if the Updater (single key) submits two updates that build off the same "old root" but have different "new roots"
    • If two updaters poll often while message volume is low, they will likely produce the "same update"
    • If two updaters poll often while message volume is high, they will likely produce a "double update"
    • Doesn't necessarily require two updaters: in an edge case, a single updater could crash while submitting a transaction, then reboot and submit a double update
    • Effect:
      • Home and Replicas go into a FAILED state (stop working)
    • Mitigation:
      • Agent code can check its database for a previously signed update, determine whether it is about to submit a double update, and prevent itself from doing so
      • This check still needs improvement
      • Updater wait time
        • To avoid a double update, the Updater creates an update and holds it for some interval. If the update is still valid after the interval, the Updater submits it. (Reorg mitigation)
      • "Just don't run multiple updaters with the same config"
  • Improper Update
    • Should only occur if the chain has a "deep reorg" that is longer than the Updater's pause period OR if the Updater is actively committing fraud.
    • Effect:
      • Home goes into a FAILED state (stops working)
        • No plan for dealing with this currently
      • Updater gets slashed
        • (not implemented currently)
    • Mitigation:
      • Watcher(s) unenroll xApps
      • Humans look at the situation and determine whether the Updater was committing fraud or was just the victim of a poor consensus environment.
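
The database check mentioned under Double Update can be illustrated with a small sketch. It assumes a hypothetical in-memory mapping from old root to the new root already signed for it. The names `check_update` and `sign_if_safe` are invented for this example and are not the agents' real API.

```python
from typing import Dict

# Hypothetical stand-in for the agent database:
# maps an old root to the new root the updater has already signed for it.
SignedUpdates = Dict[bytes, bytes]


def check_update(db: SignedUpdates, old_root: bytes, new_root: bytes) -> str:
    """Classify a candidate update against previously signed updates.

    Returns "new" if nothing was signed for old_root, "same" if the identical
    update was already signed, and "double" if signing it would conflict.
    """
    signed = db.get(old_root)
    if signed is None:
        return "new"
    return "same" if signed == new_root else "double"


def sign_if_safe(db: SignedUpdates, old_root: bytes, new_root: bytes) -> bool:
    """Record the update and return True unless it would be a double update."""
    if check_update(db, old_root, new_root) == "double":
        return False  # refuse: a different new root was already signed
    db[old_root] = new_root
    return True
```

The guard only helps if the record survives a crash, so a real agent would need to persist the signed update before broadcasting it. Two updaters with separate databases would still not be protected.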

Network Environment

  • Network Partition
    • When multiple nodes split off on a fork and break consensus
    • Especially bad if the updater is off on the least-power chain (results in Improper Update)
    • Effect:
      • Manifests as a double-update
      • Manifests as an improper update
      • Messages simply stop
    • Mitigation:
      • Pay attention and be on the right fork
      • Stop signing updates when this occurs!
      • Have a reliable mechanism for determining this is happening and pull the kill-switch.
  • PoW Chain Reorg (See Network Partition)
    • What happens when a network partition ends
    • Mitigation:
  • PoS Chain Reorg (See Network Partition)
    • Safety failure (block producers (BPs) producing conflicting blocks)
    • Liveness failure (no new blocks, chain stops finalizing new blocks)
    • Effect:
      • Slows down finality
      • Blocks stop being produced
    • How would this manifest in Celo?
      • Celo would stop producing blocks.
      • Agents would pause and sit there
      • When agents see new blocks, they continue normal operations.
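
The "pause until new blocks appear" behavior, and the kill-switch idea under Network Partition, could be built on a stall detector along these lines. This is a sketch under assumptions: `StallDetector` and its parameters are hypothetical, and block heights and timestamps are supplied by the caller instead of being read from a chain.

```python
class StallDetector:
    """Track the latest block height seen and report when the chain looks stalled."""

    def __init__(self, max_gap_seconds: float, height: int, now: float) -> None:
        self.max_gap_seconds = max_gap_seconds  # hypothetical tolerance before pausing
        self.height = height                    # highest block height seen so far
        self.last_progress = now                # time at which that height was first seen

    def observe(self, height: int, now: float) -> None:
        """Record a polled height; only a strictly higher height counts as progress."""
        if height > self.height:
            self.height = height
            self.last_progress = now

    def paused(self, now: float) -> bool:
        """Return True if no new block has been seen for more than max_gap_seconds."""
        return (now - self.last_progress) > self.max_gap_seconds
```

An agent loop would call `observe` on every poll and skip signing or submitting while `paused` returns True, then resume once a higher block is observed. This detects a halted chain but not a fork, so it does not replace the checks needed to stay on the right fork.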