Experiences with Other Systems
sled is motivated by the experiences gained while working with other stateful systems, outlined below.
Most of these points were learned by getting burned rather than delighted.
MySQL
- make it easy to tail the replication stream in flexible topologies
- support merging shards a la MariaDB
- support mechanisms for live, lock-free schema updates a la pt-online-schema-change
- include GTID in all replication information
- actively reduce tree fragmentation
- give operators and distributed database creators first-class support for replication, sharding, backup, tuning, and diagnosis
- O_DIRECT + real Linux AIO is worth the effort (sketched below)
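As a rough illustration of what the O_DIRECT side of that effort looks like, here is a minimal Rust sketch. It assumes Linux, the `libc` crate, a 4096-byte logical block size, and a throwaway path `/tmp/directio.log`; real asynchronous submission (libaio's io_submit, or io_uring) would be layered on top of a file opened this way.

```rust
// A minimal sketch of O_DIRECT usage on Linux: open the file with the flag,
// then make sure the buffer, offset, and length are all block-aligned.
// Assumes the `libc` crate; real AIO (io_submit or io_uring) would sit on top.
use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::fs::OpenOptions;
use std::io::Write;
use std::os::unix::fs::OpenOptionsExt;

// Typical logical block size; a real system queries the device instead.
const BLOCK: usize = 4096;

fn main() -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .custom_flags(libc::O_DIRECT) // bypass the OS page cache
        .open("/tmp/directio.log")?;

    // O_DIRECT rejects writes whose buffer address or length is unaligned,
    // so the buffer is allocated with the block size as its alignment.
    let layout = Layout::from_size_align(BLOCK, BLOCK).unwrap();
    unsafe {
        let buf = alloc_zeroed(layout);
        assert!(!buf.is_null());
        let slice = std::slice::from_raw_parts_mut(buf, BLOCK);
        slice[..5].copy_from_slice(b"hello");
        file.write_all(slice)?;
        dealloc(buf, layout);
    }
    // O_DIRECT skips the page cache, not the drive's write cache: still fsync.
    file.sync_all()
}
```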
Redis
- provide high-level collections that let engineers get to their business logic as quickly as possible, instead of forcing them to define a schema in a relational system first (usually after an hour+ of googling how to even do it); see the sketch after this list
- don’t let single slow requests block all other requests to a shard
- let operators peer into the sequence of operations that hit the database to track down bad usage
- don’t force replicas to retrieve the entire state of the leader when they begin replication
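To make the first point concrete, here is a toy sketch of the kind of high-level collection that lets users skip schema design: a FIFO queue layered on an ordered byte-keyed map. The `Db` and `Queue` types are hypothetical stand-ins, not any real engine's API.

```rust
// A toy sketch of a high-level collection over a byte-oriented ordered map:
// keys are prefix + big-endian sequence number, so the map's ordering gives
// FIFO semantics without the user designing any schema.
use std::collections::BTreeMap;

struct Db {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

struct Queue<'a> {
    db: &'a mut Db,
    prefix: Vec<u8>,
    next_id: u64,
}

impl<'a> Queue<'a> {
    fn new(db: &'a mut Db, name: &str) -> Self {
        Queue { db, prefix: name.as_bytes().to_vec(), next_id: 0 }
    }

    fn push(&mut self, value: &[u8]) {
        let mut key = self.prefix.clone();
        key.extend_from_slice(&self.next_id.to_be_bytes());
        self.next_id += 1;
        self.db.map.insert(key, value.to_vec());
    }

    // Pop the oldest entry under this queue's key prefix, if any.
    fn pop(&mut self) -> Option<Vec<u8>> {
        let (key, value) = self
            .db
            .map
            .range(self.prefix.clone()..)
            .next()
            .filter(|(k, _)| k.starts_with(&self.prefix))
            .map(|(k, v)| (k.clone(), v.clone()))?;
        self.db.map.remove(&key);
        Some(value)
    }
}

fn main() {
    let mut db = Db { map: BTreeMap::new() };
    let mut q = Queue::new(&mut db, "jobs/");
    q.push(b"send-email");
    q.push(b"resize-image");
    assert_eq!(q.pop().as_deref(), Some(&b"send-email"[..]));
}
```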
HBase
- don’t split “the source of truth” across too many decoupled systems, or you will always have downtime
- give users first-class APIs to peer into their system state without forcing them to write scrapers
- serve HTTP pages for high-level overviews and possibly log access
- coprocessors are awesome, but people should have easy ways of doing secondary indexing (sketched below)
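A hedged sketch of what easy, built-in secondary indexing could look like: the primary map and a value-to-keys index are updated under the same lock, so lookups by value never observe a half-applied write. The `Store` type and its single-mutex design are illustrative assumptions, not HBase's or any coprocessor's API.

```rust
// A toy sketch of built-in secondary indexing: every write updates the
// primary map and an index from value -> set of keys under one lock, so
// reads by value are always consistent with reads by key.
use std::collections::{BTreeMap, BTreeSet};
use std::sync::Mutex;

struct Store {
    inner: Mutex<Inner>,
}

struct Inner {
    primary: BTreeMap<String, String>,            // key -> value
    by_value: BTreeMap<String, BTreeSet<String>>, // value -> keys
}

impl Store {
    fn new() -> Self {
        Store {
            inner: Mutex::new(Inner { primary: BTreeMap::new(), by_value: BTreeMap::new() }),
        }
    }

    fn put(&self, key: &str, value: &str) {
        let mut inner = self.inner.lock().unwrap();
        // Drop the old index entry before installing the new one.
        if let Some(old) = inner.primary.insert(key.to_string(), value.to_string()) {
            if let Some(keys) = inner.by_value.get_mut(&old) {
                keys.remove(key);
            }
        }
        inner.by_value.entry(value.to_string()).or_default().insert(key.to_string());
    }

    fn keys_with_value(&self, value: &str) -> Vec<String> {
        let inner = self.inner.lock().unwrap();
        inner.by_value.get(value).map(|s| s.iter().cloned().collect()).unwrap_or_default()
    }
}

fn main() {
    let store = Store::new();
    store.put("user:1", "admin");
    store.put("user:2", "admin");
    store.put("user:1", "viewer");
    assert_eq!(store.keys_with_value("admin"), vec!["user:2".to_string()]);
}
```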
RocksDB
- give users tons of flexibility with different usage patterns
- don’t force users to use distributed machine learning to discover configurations that work for their use cases
- merge operators are extremely powerful (sketched after this list)
- merge operators should be usable from serial transactions across multiple keys
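A minimal sketch of the merge-operator idea: callers submit operands, and the store folds each one into the stored value with a user-supplied function, rather than every caller doing its own read-modify-write. The `KvStore` type and `MergeFn` signature below echo the RocksDB-style shape but are illustrative, not a specific library's API; wiring this into multi-key serial transactions is left out of the sketch.

```rust
// A minimal sketch of a merge operator: the store applies a user-supplied
// function to fold an operand into the existing value for a key.
use std::collections::BTreeMap;

type MergeFn = fn(key: &[u8], old: Option<&[u8]>, operand: &[u8]) -> Option<Vec<u8>>;

struct KvStore {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
    merge_fn: MergeFn,
}

impl KvStore {
    fn merge(&mut self, key: &[u8], operand: &[u8]) {
        let old = self.map.get(key).cloned(); // cloned to keep the sketch simple
        match (self.merge_fn)(key, old.as_deref(), operand) {
            Some(new) => {
                self.map.insert(key.to_vec(), new);
            }
            None => {
                self.map.remove(key);
            }
        }
    }
}

// Example merge function: treat the value as a little-endian u64 counter.
fn add_u64(_key: &[u8], old: Option<&[u8]>, operand: &[u8]) -> Option<Vec<u8>> {
    let current = old
        .map(|v| u64::from_le_bytes(v.try_into().expect("8-byte counter")))
        .unwrap_or(0);
    let delta = u64::from_le_bytes(operand.try_into().expect("8-byte delta"));
    Some((current + delta).to_le_bytes().to_vec())
}

fn main() {
    let mut store = KvStore { map: BTreeMap::new(), merge_fn: add_u64 };
    store.merge(b"hits", &1u64.to_le_bytes());
    store.merge(b"hits", &2u64.to_le_bytes());
    assert_eq!(
        store.map.get(&b"hits"[..]).map(|v| v.as_slice()),
        Some(&3u64.to_le_bytes()[..])
    );
}
```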
etcd
- raft makes operating replicated systems SO MUCH EASIER than popular relational systems / redis etc…
- modify raft to use leader leases instead of using the paxos register, avoiding livelocks in the presence of simple partitions
- give users flexible interfaces
- reactive semantics are awesome, but access must be done through smart clients, because users will assume watches are reliable
- if we have smart clients anyway, quorum reads can be cheap by lower-bounding future reads to the last observed raft log index (see the sketch after this list)
- expose the metrics and operational levers required to build a self-driving stateful system on top of k8s/mesos/cloud providers/etc…
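A hedged sketch of that read lower-bounding idea: the smart client remembers the highest raft log index it has observed, and a replica only serves the read once its applied index has caught up to that bound. `Replica` and `Client` are illustrative types with an invented shape, not etcd's actual API.

```rust
// A sketch of read lower-bounding: replicas refuse reads until they have
// applied at least the index the client last observed, giving monotonic
// reads without a full quorum round-trip per read.
use std::collections::BTreeMap;

struct Replica {
    applied_index: u64,
    data: BTreeMap<String, String>,
}

impl Replica {
    // Apply a committed raft entry (simplified to a plain key/value put).
    fn apply(&mut self, index: u64, key: &str, value: &str) {
        assert_eq!(index, self.applied_index + 1, "entries apply in log order");
        self.data.insert(key.to_string(), value.to_string());
        self.applied_index = index;
    }

    // Serve a read only if this replica has applied at least `min_index`.
    fn read(&self, key: &str, min_index: u64) -> Result<Option<&String>, &'static str> {
        if self.applied_index < min_index {
            return Err("replica is behind the client's observed index; retry or redirect");
        }
        Ok(self.data.get(key))
    }
}

struct Client {
    last_observed_index: u64,
}

impl Client {
    fn read<'a>(&mut self, replica: &'a Replica, key: &str) -> Option<&'a String> {
        match replica.read(key, self.last_observed_index) {
            Ok(value) => {
                // Monotonic session: remember the freshest index we have seen.
                self.last_observed_index = self.last_observed_index.max(replica.applied_index);
                value
            }
            Err(_) => None, // a real client would retry against another replica
        }
    }
}

fn main() {
    let mut replica = Replica { applied_index: 0, data: BTreeMap::new() };
    replica.apply(1, "color", "blue");

    let mut client = Client { last_observed_index: 1 };
    assert_eq!(client.read(&replica, "color"), Some(&"blue".to_string()));
}
```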
Tendermint
- build things in a testable way from the beginning
- don’t seek gratuitous concurrency
- allow replication streams to be used in flexible ways
- instant finality (or interface finality: the operation should be complete by the time the request successfully returns to the client) is mandatory for nice high-level interfaces that don’t push optimism (and rollbacks) into interfacing systems
LMDB
- approach a wait-free tree traversal for reads (see the sketch after this list)
- use modern tree structures that can support concurrent writers
- multi-process is nice for browsers etc…
- people value read performance and are often forgiving of terrible write performance for most workloads
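A hedged sketch of the lock-free read path that goal points toward, using the crossbeam-epoch crate: readers pin an epoch and follow atomic page pointers without ever taking a lock, while a writer installs a replacement page and defers reclamation of the old one until no reader can still see it. The `Page`/`Node` types are stand-ins for real tree nodes, and the write path assumes a single writer; only the read side is the point here.

```rust
// Readers: pin an epoch, load the page pointer atomically, no locks.
// Writer: swap in a new page and defer destruction of the old one.
use crossbeam_epoch::{self as epoch, Atomic, Owned};
use std::sync::atomic::Ordering;

struct Page {
    keys: Vec<u64>,
}

struct Node {
    page: Atomic<Page>,
}

impl Node {
    fn new(keys: Vec<u64>) -> Self {
        Node { page: Atomic::new(Page { keys }) }
    }

    // Read path: an epoch pin plus an atomic pointer load.
    fn contains(&self, key: u64) -> bool {
        let guard = epoch::pin();
        let page = unsafe { self.page.load(Ordering::Acquire, &guard).deref() };
        page.keys.binary_search(&key).is_ok()
    }

    // Write path (single writer assumed): copy-on-write the page, swap it in,
    // and defer freeing the old page until all current readers have unpinned.
    fn insert(&self, key: u64) {
        let guard = epoch::pin();
        let old = unsafe { self.page.load(Ordering::Acquire, &guard).deref() };
        let mut keys = old.keys.clone();
        if let Err(pos) = keys.binary_search(&key) {
            keys.insert(pos, key);
        }
        let prev = self.page.swap(Owned::new(Page { keys }), Ordering::AcqRel, &guard);
        unsafe { guard.defer_destroy(prev) };
    }
}

fn main() {
    let node = Node::new(vec![1, 3, 5]);
    node.insert(4);
    assert!(node.contains(4));
    assert!(!node.contains(2));
}
```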
Zookeeper
- reactive semantics are awesome, but access must be done through smart clients, because users will assume watches are reliable
- the more important the system, the more you should keep old snapshots around for emergency recovery
- never assume a hostname that was resolvable in the past will be resolvable in the future
- if a critical thread dies, bring down the entire system (see the sketch after this list)
- make replication configuration as simple as possible. people will mess up the order and cause split brains if this is not automated.
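A small sketch of the "critical thread dies, take the whole process down" policy in Rust: a panic hook that aborts the process, plus a monitor that treats even a clean exit of the critical thread as fatal. The flusher loop and its simulated failure are invented for illustration.

```rust
// A sketch of process-fatal failure handling: any panic on any thread, or
// any exit of the critical background thread, brings down the whole process
// instead of leaving a zombie node limping along.
use std::panic;
use std::process;
use std::thread;
use std::time::Duration;

fn main() {
    // Any panic on any thread takes the whole process down.
    let default_hook = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        default_hook(info); // still print the normal panic message
        process::abort();
    }));

    // A stand-in for a critical background task (e.g. a log flusher).
    let critical = thread::spawn(|| {
        for i in 0.. {
            // ... do critical work ...
            thread::sleep(Duration::from_millis(10));
            if i == 3 {
                // Simulated fatal error; the hook above aborts the process
                // instead of silently killing just this thread.
                panic!("flusher hit an unrecoverable I/O error");
            }
        }
    });

    // Treat a clean-but-unexpected exit of the critical thread as fatal too.
    let _ = critical.join();
    eprintln!("critical thread exited; shutting down");
    process::abort();
}
```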