Ryan Prior

A few strategies to deal with rollbacks after a server fault:

  1. Create a hierarchy of event types and prioritize persistence (disk I/O and DB writes) according to it. The highest priority goes to the most consequential actions: crafting results, shattered crystals or broken objects, and items added to a player inventory, placed on a map tile, or put into a container. Below that are skill checks that lower HP, mind, or stamina. At the bottom are mob movements, hunger and thirst ticks, and regeneration from resting.
  2. Send events (at least the ones that are consequential enough) to a highly available, low-latency persistent datastore such as Kafka/Redpanda or QuestDB. After a fault you can then reconstruct most of the game state by opening a window at the newest event the post-fault DB knows about and replaying events from there until you're up to date.
  3. For low-priority events at the bottom of the consequence hierarchy, still batch writes, but instead of saving the whole world every 5 minutes, partition the data by zone and save one partition at a time in rotation.
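
To make the replay idea in item 2 concrete, here is a minimal sketch. It assumes every event carries a monotonically increasing sequence number and that the DB records the last sequence number it applied. The in-memory list stands in for the Kafka/Redpanda or QuestDB log, and all names (Event, GameDB, replay) are hypothetical, not taken from any real codebase.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    seq: int      # position in the event log (assumed monotonically increasing)
    player: str
    item: str
    delta: int    # change in the player's count of this item


@dataclass
class GameDB:
    """Stand-in for the game database as restored after a fault."""
    last_applied_seq: int = 0
    inventory: dict = field(default_factory=dict)

    def apply(self, ev: Event) -> None:
        key = (ev.player, ev.item)
        self.inventory[key] = self.inventory.get(key, 0) + ev.delta
        self.last_applied_seq = ev.seq


def replay(db: GameDB, log: list[Event]) -> int:
    """Apply every logged event newer than the DB's last known one.

    Returns the number of events replayed.
    """
    applied = 0
    for ev in sorted(log, key=lambda e: e.seq):
        if ev.seq > db.last_applied_seq:
            db.apply(ev)
            applied += 1
    return applied


# The log has three events, but the post-fault DB only knows about the first.
log = [
    Event(1, "ryan", "crystal", +1),
    Event(2, "ryan", "crystal", -1),   # crystal shattered
    Event(3, "ryan", "sword", +1),     # crafting result
]
db = GameDB(last_applied_seq=1, inventory={("ryan", "crystal"): 1})
replayed = replay(db, log)
```

Because replay skips anything at or below last_applied_seq, running it a second time applies nothing, so a recovery that gets interrupted can simply be restarted.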

The overall effect is that events are persisted to disk in near real time through the low-latency datastore and then, depending on the event type, either sent to the DB ASAP or batched and persisted on a rotating, best-effort basis. That should mean a smaller average transaction size, less damage from a dropped transaction, and the ability to recover most of the game action from the event stream after a DB fault.
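
As a rough sketch of that flow: every event is appended to the log first, top-priority events go to the DB immediately, and everything else is batched per zone and flushed one zone at a time. The lists below stand in for the real datastore and DB. The Priority and Persister names are hypothetical, and sending only the top tier immediately is an assumption; where the cutoff sits is a tuning choice.

```python
from collections import defaultdict
from enum import IntEnum
from itertools import cycle


class Priority(IntEnum):
    HIGH = 0    # crafting results, inventory and container changes
    MEDIUM = 1  # skill checks that change HP, mind, stamina
    LOW = 2     # mob movement, hunger/thirst ticks, resting regeneration


class Persister:
    def __init__(self, zones):
        self.event_log = []                # stand-in for the low-latency event store
        self.db_writes = []                # stand-in for DB transactions, one list each
        self.pending = defaultdict(list)   # zone -> batched events awaiting a flush
        self._rotation = cycle(zones)      # fixed round-robin order over zones

    def record(self, zone, priority, payload):
        # Every event reaches the log first, whatever its priority.
        self.event_log.append((zone, priority, payload))
        if priority == Priority.HIGH:
            # Assumption: only the top tier is written to the DB immediately.
            self.db_writes.append([payload])
        else:
            self.pending[zone].append(payload)

    def flush_next_zone(self):
        """Persist the batched events of the next zone in the rotation."""
        zone = next(self._rotation)
        batch = self.pending.pop(zone, [])
        if batch:
            self.db_writes.append(batch)
        return zone


p = Persister(["forest", "town"])
p.record("town", Priority.HIGH, "ryan crafted sword")
p.record("forest", Priority.LOW, "wolf moved")
p.record("town", Priority.LOW, "ryan hunger tick")
first = p.flush_next_zone()
second = p.flush_next_zone()
```

Calling flush_next_zone on a timer gives the rotating save: each call is one small transaction for one zone rather than a snapshot of the whole world.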