Key management systems face a unique challenge when it comes to distributed storage. Unlike typical distributed systems where eventual consistency might be acceptable, key management requires strong consistency while maintaining high availability. A key operation can’t return stale data, yet the system must remain available even during network partitions or data center failures. Here’s how we’re tackling this at Thales.
Why Distribution Matters
Enterprise key management isn’t a single-server problem. Organizations need:
- Geographic distribution for disaster recovery
- High availability so that key management doesn’t become a bottleneck for encryption operations
- Scalability to handle millions of keys and thousands of operations per second
- Strong consistency to prevent cryptographic errors caused by key version mismatches
The challenge is that these requirements often conflict. The CAP theorem tells us that when a network partition occurs, we can’t have both strong consistency and availability. We have to make trade-offs.
The HSM Replication Problem
HSMs provide excellent security but complicate distribution. Keys stored in HSMs can’t be replicated using standard database replication techniques. You can’t simply copy key files from one HSM to another, because keys never leave the HSM in plaintext.
Instead, HSMs use specialized replication protocols. SafeNet HSMs, which we use, support cloning where key material is encrypted under a cloning domain key before transmission to another HSM. The receiving HSM must be part of the same cloning domain to decrypt and import the keys.
This works well for planned replication, but it’s slow. Establishing a new HSM replica might take hours, which isn’t suitable for dynamic scaling. We need additional layers of distribution beyond just HSM replication.
Metadata Distribution
While cryptographic key material must stay in HSMs, metadata about keys can be distributed more conventionally. Metadata includes:
- Key identifiers and aliases
- Key lifecycle state (active, expired, destroyed)
- Usage policies and access controls
- Audit trail of key operations
- Relationships between keys (which data keys are encrypted by which master keys)
We’re using a distributed database to store this metadata. The challenge is keeping metadata synchronized with the actual key state in HSMs. If metadata says a key exists but the HSM doesn’t have it, operations fail. If the HSM has a key that’s not in metadata, the key is effectively inaccessible.
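To make the metadata side concrete, here is a minimal sketch of what a metadata record could look like. The field names and types are illustrative assumptions, not our actual schema; the point is that everything here is distributable state, while the key bytes themselves never leave the HSM.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class KeyState(Enum):
    """Lifecycle states tracked in metadata; the key material itself stays in the HSM."""
    ACTIVE = "active"
    EXPIRED = "expired"
    DESTROYED = "destroyed"


@dataclass
class KeyMetadata:
    key_id: str                                 # stable identifier used to locate the key in an HSM
    aliases: list[str] = field(default_factory=list)
    state: KeyState = KeyState.ACTIVE
    version: int = 1                            # bumped on every rotation
    wrapping_key_id: Optional[str] = None       # master key that wraps this data key, if any
    allowed_regions: list[str] = field(default_factory=list)  # compliance constraint
```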
Consistency Models
We’ve implemented different consistency models for different operations based on their requirements:
Strong consistency for key lifecycle operations: Creating, rotating, or destroying keys requires strong consistency. These operations use consensus protocols (we’re evaluating Raft) to ensure all replicas agree on the state change before acknowledging the operation.
Read-after-write consistency for key operations: After creating a key, subsequent operations using that key must see it immediately. However, reads from other clients can tolerate slight staleness. This allows us to serve reads from local replicas while ensuring a client always sees its own writes.
Eventual consistency for audit logs: Audit records don’t need immediate consistency across all replicas. We can accept that different data centers might have slightly different views of the recent audit trail, as long as they eventually converge.
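The read-after-write model is the one that benefits most from a concrete example. A minimal sketch, assuming each write returns a version token and replicas refuse to serve a client’s read until they have caught up to that client’s own last write (class and method names are hypothetical, not our production API):

```python
class StaleReplicaError(Exception):
    """Raised when a replica is behind the version this client has already written."""


class MetadataReplica:
    """Toy metadata replica that applies writes in order and remembers the last version applied."""

    def __init__(self):
        self.applied_version = 0
        self.records = {}

    def apply(self, version, key_id, record):
        self.records[key_id] = record
        self.applied_version = version

    def read(self, key_id, min_version=0):
        # Reads from other clients pass min_version=0 and tolerate staleness.
        # A client reading back its own write passes the version token that
        # write returned; on StaleReplicaError the caller retries elsewhere.
        if self.applied_version < min_version:
            raise StaleReplicaError(
                f"replica at version {self.applied_version}, client requires {min_version}"
            )
        return self.records.get(key_id)
```

The version token travels with the client session, so local replicas can serve most reads while a client is still guaranteed to see its own writes.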
The Caching Layer
To reduce latency and HSM load, we implement multi-level caching:
Application-level cache: Data keys that have been unwrapped by the HSM are cached in application memory with time-limited leases. This allows thousands of encryption operations without calling the HSM for each one.
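A minimal sketch of this layer, assuming a data key is unwrapped once by the HSM and then reused until its lease expires (names and the lease length are illustrative):

```python
import time
from threading import Lock


class DataKeyCache:
    """In-memory cache of unwrapped (plaintext) data keys with time-limited leases."""

    def __init__(self, lease_seconds=300):
        self._lease = lease_seconds
        self._entries = {}   # key_id -> (plaintext_key, lease expiry)
        self._lock = Lock()

    def get(self, key_id, unwrap_fn):
        """Return the cached plaintext key, calling unwrap_fn (an HSM round trip) only on a miss or expired lease."""
        now = time.monotonic()
        with self._lock:
            entry = self._entries.get(key_id)
            if entry and entry[1] > now:
                return entry[0]
        plaintext = unwrap_fn(key_id)          # HSM unwrap happens outside the lock
        with self._lock:
            self._entries[key_id] = (plaintext, now + self._lease)
        return plaintext

    def invalidate(self, key_id):
        with self._lock:
            self._entries.pop(key_id, None)
```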
Service-level cache: Metadata about keys and access policies is cached in each service instance. Cache invalidation happens through a publish-subscribe mechanism when keys are rotated or policies change.
Regional cache: Each geographic region maintains a cache of frequently used keys. The cache is populated from HSMs in that region, reducing cross-region latency.
The challenge with caching is invalidation. When a key is rotated or revoked, all caches must be invalidated quickly. We use a combination of TTL-based expiration and active invalidation through message queues.
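The active-invalidation path can be sketched as a subscriber that receives key-change events and evicts the corresponding entries from every local cache layer; the event format and field names here are hypothetical, and the TTL remains as a backstop if a message is ever missed.

```python
import json


def handle_invalidation_message(raw_message, caches):
    """Evict a rotated or revoked key from every local cache layer.

    `raw_message` is assumed to be a JSON payload published on a key-events
    topic, e.g. {"event": "rotated", "key_id": "k-123"}. `caches` is any
    collection of objects exposing an invalidate(key_id) method, such as
    the DataKeyCache sketched above.
    """
    event = json.loads(raw_message)
    if event.get("event") in ("rotated", "revoked", "destroyed"):
        key_id = event["key_id"]
        for cache in caches:
            cache.invalidate(key_id)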
Handling Network Partitions
Network partitions are inevitable in distributed systems. When data centers can’t communicate, how does the key management system behave?
Our approach is to prioritize availability within certain constraints:
- Read operations (decryption with existing keys) continue during partitions
- Write operations (creating new keys, rotating keys) require connectivity to a quorum of replicas
- Key revocation is handled specially: revocations are always accepted, even without a quorum, and are reconciled across replicas once the partition heals
This means during a partition, you can’t create new keys but you can continue using existing keys. For most applications, this is an acceptable trade-off.
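A simplified sketch of how this policy could be enforced at the service layer; the operation names and the majority-quorum rule are illustrative, not our actual implementation.

```python
class PartitionPolicy:
    """Decide whether an operation may proceed given current replica connectivity."""

    READS = {"encrypt", "decrypt", "get_metadata"}
    WRITES = {"create_key", "rotate_key", "destroy_key"}

    def __init__(self, total_replicas):
        self.total = total_replicas

    def allow(self, operation, reachable_replicas):
        if operation in self.READS:
            return True                      # reads stay available during a partition
        if operation == "revoke_key":
            return True                      # revocations are accepted and reconciled later
        if operation in self.WRITES:
            quorum = self.total // 2 + 1     # majority quorum required for lifecycle writes
            return reachable_replicas >= quorum
        return False
```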
Geographic Distribution Strategy
We’re deploying across multiple AWS regions and on-premises data centers. The architecture uses a hub-and-spoke model:
Hub sites (typically 3): These contain the master HSMs and authoritative key metadata. They form a consensus group using Raft for coordinating state changes.
Spoke sites (many): These contain read replicas of key metadata and may have local HSMs with cloned key material. They can serve read operations locally but forward write operations to hubs.
This model provides low-latency reads for geographically distributed applications while maintaining strong consistency for writes through the hub sites.
Key Operation Routing
When an application requests a cryptographic operation, the system must route it to an appropriate HSM. The routing logic considers:
- Geographic proximity (minimize latency)
- HSM load (distribute operations evenly)
- Key availability (not all HSMs have all keys)
- Policy constraints (some keys must only be used in specific regions for compliance)
We’ve built a routing layer that maintains real-time information about HSM status, load, and key availability. It’s essentially a load balancer specifically designed for cryptographic operations.
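Conceptually, the routing decision boils down to filtering out ineligible HSMs and then scoring what remains. The sketch below is a simplification under assumed field names and a two-part score (region match, then load); it is not the production routing logic.

```python
from dataclasses import dataclass


@dataclass
class HsmStatus:
    hsm_id: str
    region: str
    healthy: bool
    load: float            # 0.0 (idle) to 1.0 (saturated)
    key_ids: set           # keys currently available on this HSM


def route_operation(key_id, client_region, allowed_regions, hsms):
    """Pick an HSM for a cryptographic operation, or return None if no HSM is eligible."""
    candidates = [
        h for h in hsms
        if h.healthy
        and key_id in h.key_ids                                   # key availability
        and (not allowed_regions or h.region in allowed_regions)  # compliance constraint
    ]
    if not candidates:
        return None
    # Prefer the client's own region, then the least-loaded HSM within it.
    return min(candidates, key=lambda h: (h.region != client_region, h.load))
```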
Data Consistency Verification
With distributed key storage, we need mechanisms to detect and repair inconsistencies:
Merkle trees over key metadata allow quick comparison between replicas. If the tree hashes match, replicas are consistent. If they differ, we can quickly identify which keys have discrepancies.
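A minimal sketch of that comparison, assuming each replica can serialize its metadata records and sort them by key identifier (the serialization itself is left abstract):

```python
import hashlib


def merkle_root(records):
    """Compute a Merkle root over metadata records, where `records` maps key_id -> serialized bytes."""
    leaves = [
        hashlib.sha256(key_id.encode() + b"|" + blob).digest()
        for key_id, blob in sorted(records.items())
    ]
    if not leaves:
        return hashlib.sha256(b"empty").digest()
    while len(leaves) > 1:
        if len(leaves) % 2:                 # duplicate the last node on odd-sized levels
            leaves.append(leaves[-1])
        leaves = [
            hashlib.sha256(leaves[i] + leaves[i + 1]).digest()
            for i in range(0, len(leaves), 2)
        ]
    return leaves[0]


# Two replicas are consistent when their roots match; on a mismatch, a walk
# down the differing subtrees pinpoints which key ranges need repair.
```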
Periodic reconciliation jobs compare key state across all HSMs and metadata replicas, detecting and alerting on any inconsistencies.
Checksums on key metadata and cryptographic verification that keys in HSMs match expectations provide additional consistency checks.
Backup and Recovery
Distributed systems help with availability but don’t eliminate the need for backups. HSM key material is backed up using secure HSM backup protocols:
- Backups are encrypted under an HSM backup key
- Multiple backup copies are stored in geographically separated locations
- Backup restoration is tested regularly through disaster recovery drills
Metadata backups use more conventional database backup techniques, with point-in-time recovery capabilities.
The Latency Challenge
Geographic distribution inherently introduces latency. An application in Singapore using keys stored in HSMs in Virginia will experience significant latency for every operation.
Our approach is to pre-position keys in regions where they’ll be used:
- Analyze access patterns to determine which keys are used in which regions
- Proactively replicate high-use keys to regional HSMs
- Cache unwrapped data keys in regional caches
- For global applications, use regional master keys that protect region-specific data keys
This reduces cross-region traffic to only the initial key establishment, with subsequent operations happening locally.
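A rough sketch of the access-pattern analysis behind pre-positioning; the usage threshold and the shape of the access log are placeholder assumptions, and the resulting plan would feed the (slow) HSM cloning workflow ahead of demand.

```python
from collections import Counter, defaultdict


def plan_replication(access_log, threshold=1000):
    """Decide which keys to clone into which regional HSMs.

    `access_log` is an iterable of (key_id, region) tuples from recent usage;
    a key is pre-positioned in every region where it was used more than
    `threshold` times during the analysis window.
    """
    usage = defaultdict(Counter)
    for key_id, region in access_log:
        usage[key_id][region] += 1

    plan = []
    for key_id, per_region in usage.items():
        for region, count in per_region.items():
            if count > threshold:
                plan.append((key_id, region))
    return plan
```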
Monitoring Distributed State
Operating a distributed key management system requires comprehensive monitoring:
- Replication lag between data centers
- Cache hit rates at each layer
- HSM operation latency by region
- Consistency check results
- Network partition events and duration
We’re building this on top of Elasticsearch, which I’ll cover in detail in future posts. The ability to aggregate metrics and logs from distributed components and query them in real-time is essential for operating this system.
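As one example of these measurements, replication lag can be derived by comparing each spoke’s last applied log index against the hub’s committed index. The sketch below assumes a generic `emit` callable standing in for whatever pipeline ships metric documents downstream; it is not tied to any particular Elasticsearch client.

```python
import time


def report_replication_lag(hub_committed_index, spokes, emit):
    """Compute per-spoke replication lag in log entries and hand it to a metrics emitter.

    `spokes` maps spoke name -> last applied log index;
    `emit` is any callable that ships a metric document downstream.
    """
    now = time.time()
    for name, applied_index in spokes.items():
        emit({
            "timestamp": now,
            "metric": "metadata_replication_lag_entries",
            "spoke": name,
            "value": hub_committed_index - applied_index,
        })
```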
Looking Forward
We’re exploring several improvements to our distributed architecture:
- Global consensus: Moving from regional hubs to a globally distributed consensus group
- Sharding: Partitioning keys across multiple independent consensus groups for better scalability
- Conflict-free replicated data types (CRDTs): For certain metadata that can tolerate eventual consistency with better availability guarantees
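As a taste of the CRDT direction, a last-writer-wins register is enough for metadata fields such as a key’s aliases or tags, where a deterministic merge matters more than strict ordering. This is the textbook construction, not something we run in production:

```python
from dataclasses import dataclass


@dataclass
class LwwRegister:
    """Last-writer-wins register: any two replicas merge to the same state."""
    value: object = None
    timestamp: float = 0.0
    node_id: str = ""      # tie-breaker when timestamps collide

    def set(self, value, timestamp, node_id):
        # Keep whichever write has the later (timestamp, node_id) pair.
        if (timestamp, node_id) > (self.timestamp, self.node_id):
            self.value, self.timestamp, self.node_id = value, timestamp, node_id

    def merge(self, other):
        """Commutative, associative, idempotent merge of two replicas' states."""
        self.set(other.value, other.timestamp, other.node_id)
```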
Distributed key storage is one of those problems that seems straightforward until you dig into the details. The combination of security requirements, consistency needs, and geographic distribution creates unique challenges that standard distributed databases don’t address well.
Key Takeaways
For teams building distributed key management systems:
- Separate key material (stays in HSMs) from metadata (can be distributed more conventionally)
- Choose appropriate consistency models for different operations
- Implement multi-level caching with careful attention to invalidation
- Plan for network partitions and decide which operations should remain available
- Monitor replication lag and consistency closely
- Test disaster recovery procedures regularly
Building distributed systems is hard. Building distributed security systems is harder. But it’s the only way to provide the scale and reliability that modern enterprises require. The constraints make the problem interesting, and solving it is deeply satisfying.