Monitoring is critical for any production system, but especially for security infrastructure. When enterprises depend on your key management platform for every encrypted transaction, you need deep visibility into system health, performance, and security events. We’ve been building CipherTrust Monitor using the ELK stack (Elasticsearch, Logstash, Kibana), and I want to share our architecture and lessons learned.

Why ELK for Security Monitoring?

We evaluated several monitoring solutions before settling on ELK. Here’s what made it the right choice:

Flexible schema: Unlike traditional databases with fixed schemas, Elasticsearch handles semi-structured data well. Security events from different sources have different fields, and we needed flexibility to add new event types without schema migrations.

Full-text search: The ability to search across all text fields is invaluable. When investigating a security incident, you need to search for IP addresses, usernames, key IDs, and error messages across all logs.

Time-series optimization: Security and performance data is time-series in nature. Elasticsearch is optimized for this use case with time-based indices.

Visualization: Kibana provides powerful visualization capabilities out of the box. We can build dashboards for different audiences: operations teams need performance metrics, security teams need audit trails, customers need compliance reports.

Scalability: We’re collecting millions of events per day. Elasticsearch scales horizontally to handle the volume.

Architecture Overview

Our ELK deployment has several components:

Log shippers: Filebeat agents on each host ship logs to Logstash.

Logstash: Processes logs, parses them, enriches with additional data, and forwards to Elasticsearch.

Elasticsearch cluster: Stores and indexes all events. We run a 6-node cluster with 3 master-eligible nodes and 3 data nodes.

Kibana: Provides visualization and search interface. Multiple Kibana instances behind a load balancer for high availability.

Curator: Manages index lifecycle, deleting old indices and optimizing storage.

Data Collection Strategy

We collect several types of data:

HSM operational events: Every cryptographic operation performed by HSMs. Key generation, encryption, decryption, signing, verification. We capture operation type, duration, requesting service, key identifier, and result.

HSM health metrics: CPU usage, memory, session count, operation queue depth. Collected every 30 seconds from each HSM.

Microservice logs: Application logs from all microservices. Includes request logs, error logs, and debug logs.

Microservice metrics: Request count, latency distribution, error rates by endpoint. Captured for each service.

Audit events: Security-relevant events like authentication attempts, authorization decisions, policy changes.

System metrics: OS-level metrics from all hosts. CPU, memory, disk, network.

The challenge is collecting all this without impacting the performance of production systems.

Log Parsing and Enrichment

Raw logs aren’t immediately useful. A line like “Operation completed in 5ms” needs to be parsed and enriched with context.

Logstash uses grok patterns for parsing. For our HSM logs, we have patterns like:

%{TIMESTAMP_ISO8601:timestamp} %{WORD:hsm_id} %{WORD:operation} key=%{WORD:key_id} user=%{WORD:user} duration=%{NUMBER:duration_ms} result=%{WORD:result}

This extracts structured fields from unstructured log lines. We can then query for “all failed decrypt operations taking longer than 10ms” rather than grepping through text.
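
As a rough sketch of that query using the Python Elasticsearch client (the endpoint, the result value, and the Elasticsearch 2.x-style bool/filter syntax are assumptions, not our exact setup):

  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://elasticsearch.internal:9200"])  # hypothetical endpoint

  # Failed decrypt operations slower than 10 ms in one daily index.
  # Assumes duration_ms is mapped as a number and failures are logged as result=FAILURE.
  resp = es.search(
      index="hsm-operations-2015.10.22",
      body={
          "query": {
              "bool": {
                  "filter": [
                      {"term": {"operation": "decrypt"}},
                      {"term": {"result": "FAILURE"}},
                      {"range": {"duration_ms": {"gt": 10}}},
                  ]
              }
          },
          "size": 100,
      },
  )
  for hit in resp["hits"]["hits"]:
      print(hit["_source"]["hsm_id"], hit["_source"]["duration_ms"])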

We also enrich events with additional context:

Geographic data: IP addresses are enriched with country and city using GeoIP databases.

Service metadata: Events from microservices are tagged with service version, deployment environment, and host information.

Key metadata: Key operation events are enriched with key type, algorithm, and lifecycle state by querying our key metadata service.

This enrichment happens in Logstash before events reach Elasticsearch, so queries can filter on these enriched fields.
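
To make the end result concrete, here is roughly what a parsed and enriched HSM operation event looks like by the time it reaches Elasticsearch (field names and values are illustrative, not our exact schema):

  # Illustrative shape of an enriched event document.
  enriched_event = {
      "@timestamp": "2015-10-22T14:03:07.512Z",
      "hsm_id": "hsm-03",
      "operation": "decrypt",
      "key_id": "k-7f3a91",            # hypothetical identifier
      "user": "payments-service",
      "duration_ms": 5,
      "result": "SUCCESS",
      # Fields added by Logstash enrichment:
      "geoip": {"country_name": "United States", "city_name": "Austin"},
      "service": {"name": "payments-service", "version": "1.4.2", "env": "production"},
      "key": {"type": "AES", "algorithm": "AES-256-GCM", "state": "active"},
  }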

Index Design

Elasticsearch uses indices to organize data. Our index design follows these principles:

Time-based indices: We create daily indices for high-volume data (hsm-operations-2015.10.22) and weekly indices for lower-volume data (audit-events-2015.43).

Index templates: Define mappings and settings for indices before they’re created. This ensures consistent field types across all indices.

Index lifecycle: Old indices are transitioned to read-only, optimized for search, and eventually deleted based on retention policies.

We partition data by type and time. This allows us to:

  • Delete old data by simply deleting old indices
  • Optimize query performance by searching only relevant indices
  • Apply different retention policies to different data types (audit logs kept 7 years, HSM operational logs kept 90 days)
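
As an illustration, here is a minimal sketch of an index template for the daily HSM operation indices, applied through the Python Elasticsearch client. The field names, the string/not_analyzed mapping style, and the document type are assumptions about our setup; the five-shard setting anticipates the sharding discussion below.

  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://elasticsearch.internal:9200"])  # hypothetical endpoint

  # Template matching every daily HSM operation index before it is created.
  es.indices.put_template(
      name="hsm-operations",
      body={
          "template": "hsm-operations-*",   # index name pattern
          "settings": {"number_of_shards": 5, "number_of_replicas": 1},
          "mappings": {
              "operation": {                # document type (illustrative)
                  "properties": {
                      "@timestamp":  {"type": "date"},
                      "hsm_id":      {"type": "string", "index": "not_analyzed"},
                      "operation":   {"type": "string", "index": "not_analyzed"},
                      "key_id":      {"type": "string", "index": "not_analyzed"},
                      "duration_ms": {"type": "integer"},
                      "result":      {"type": "string", "index": "not_analyzed"},
                  }
              }
          },
      },
  )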

Query Patterns

The power of Elasticsearch comes from its query capabilities. Some key query patterns we use:

Time-range queries: “Show all failed HSM operations in the last hour”

Aggregations: “Group key operations by service and count”

Percentile queries: “What’s the 95th percentile latency for decrypt operations?”

Correlation queries: “Find all operations using this key in the last 24 hours”

Kibana makes these queries accessible through a UI, but we also use the Elasticsearch API directly for programmatic access and alerting.
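
For example, the 95th-percentile question above maps directly to a percentiles aggregation. A sketch with the Python client, reusing the illustrative index pattern and field names from earlier:

  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://elasticsearch.internal:9200"])  # hypothetical endpoint

  # p95 latency for decrypt operations over the last hour.
  resp = es.search(
      index="hsm-operations-*",
      body={
          "size": 0,
          "query": {
              "bool": {
                  "filter": [
                      {"term": {"operation": "decrypt"}},
                      {"range": {"@timestamp": {"gte": "now-1h"}}},
                  ]
              }
          },
          "aggs": {
              "latency": {"percentiles": {"field": "duration_ms", "percents": [95]}}
          },
      },
  )
  print(resp["aggregations"]["latency"]["values"]["95.0"])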

Alerting

Monitoring isn’t useful unless it triggers action when problems occur. We’ve built alerting on top of Elasticsearch using scheduled searches:

Threshold alerts: “Alert if HSM error rate exceeds 1%”

Anomaly alerts: “Alert if key operation volume deviates significantly from normal patterns”

Security alerts: “Alert if failed authentication attempts exceed threshold”

SLA alerts: “Alert if operation latency p95 exceeds SLA targets”

Alerts are triggered by queries that run periodically. If the query returns results, an alert is sent via email, PagerDuty, or Slack depending on severity.
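
As a simplified sketch of one of these scheduled searches (the threshold, index pattern, result value, and the notify helper are all illustrative), the HSM error-rate alert amounts to two counts and a comparison:

  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://elasticsearch.internal:9200"])  # hypothetical endpoint

  def check_hsm_error_rate(threshold=0.01, window="now-5m"):
      """Alert if the HSM error rate over the window exceeds the threshold."""
      base = {"range": {"@timestamp": {"gte": window}}}
      total = es.count(index="hsm-operations-*", body={"query": base})["count"]
      failed = es.count(
          index="hsm-operations-*",
          body={"query": {"bool": {"filter": [base, {"term": {"result": "FAILURE"}}]}}},
      )["count"]
      if total and failed / float(total) > threshold:
          # notify() is a hypothetical helper that routes to email/PagerDuty/Slack.
          notify("HSM error rate %.2f%% over the last 5 minutes" % (100.0 * failed / total))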

We’re careful about alert fatigue. Too many alerts and operations teams ignore them. Each alert must be actionable - if there’s nothing the on-call engineer can do about it, it shouldn’t alert.

Performance Optimization

Elasticsearch performance is critical - slow queries impact monitoring dashboards and alerting.

Hardware sizing: We use SSD storage for Elasticsearch data nodes. The difference in query performance is dramatic compared to spinning disks.

Heap sizing: Elasticsearch is a Java application. We allocate 50% of system RAM to the JVM heap (staying below roughly 32 GB so the JVM can keep using compressed object pointers), with the other 50% left for the filesystem cache.

Index optimization: Older indices that are no longer written to are force-merged to reduce segment count and improve query performance.

Query optimization: We use filters instead of queries where possible (filters are cacheable), and we avoid wildcard queries on high-cardinality fields.

Sharding strategy: Each index is divided into shards that can be distributed across nodes. Too many shards add overhead; too few reduce parallelism. We use 5 shards for our high-volume daily indices.
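
To illustrate the earlier point about filters: in Elasticsearch 2.x-style syntax (an assumption about the version in use), yes/no conditions belong in the filter clause of a bool query, where they skip relevance scoring and can be cached:

  # Filter context: no relevance scoring, results are cacheable.
  filtered_query = {
      "query": {
          "bool": {
              "filter": [
                  {"term": {"result": "FAILURE"}},
                  {"range": {"@timestamp": {"gte": "now-1h"}}},
              ]
          }
      }
  }
  # Placing the same terms in the scored query context would compute relevance
  # scores we never use and prevent filter caching.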

Kibana Dashboards

We’ve built several dashboards for different audiences:

Operations dashboard: Real-time view of system health. Shows request rates, error rates, latency percentiles, and HSM resource utilization. Operations teams watch this to detect issues proactively.

Security dashboard: Audit trail of sensitive operations. Shows authentication attempts, authorization failures, key lifecycle operations. Security teams use this for investigations and compliance.

Customer dashboard: Per-customer view of their key usage. Shows which keys are being used, operation volumes, and any errors. Customers access this for their own monitoring.

Performance dashboard: Deep dive into performance characteristics. Shows latency heatmaps, slowest operations, and bottleneck identification.

Each dashboard is tailored to its audience with appropriate access controls. Customers can only see their own data, not other customers’.

Security Considerations

Elasticsearch holds sensitive data. We implement several security controls:

Network isolation: The Elasticsearch cluster is in a private network, not accessible from the Internet.

Authentication: Elasticsearch authentication is enabled with user credentials required for access.

Encryption: TLS for all client-to-cluster and node-to-node communication.

Field-level security: Some users can search but not see sensitive fields like encryption keys or authentication credentials.

Audit logging: Elasticsearch queries are themselves logged for audit purposes. We know who searched for what and when.

Data Retention and Compliance

Different data types have different retention requirements:

Audit logs: 7 years (compliance requirement)
HSM operational logs: 90 days
Performance metrics: 30 days
Debug logs: 7 days

Curator runs daily to identify and delete indices past their retention period. For audit logs that need long-term retention, we export to AWS S3 Glacier before deletion from Elasticsearch.
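
Curator does this for us, but the underlying housekeeping is straightforward. A simplified sketch of the equivalent logic for the daily HSM operation indices (retention period and index naming as described above; the client setup is illustrative):

  from datetime import datetime, timedelta
  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://elasticsearch.internal:9200"])  # hypothetical endpoint

  RETENTION_DAYS = 90  # HSM operational logs
  cutoff = datetime.utcnow() - timedelta(days=RETENTION_DAYS)

  # Daily indices are named hsm-operations-YYYY.MM.DD, so the date is recoverable
  # from the index name itself.
  for name in es.indices.get("hsm-operations-*"):
      index_date = datetime.strptime(name.split("-")[-1], "%Y.%m.%d")
      if index_date < cutoff:
          es.indices.delete(index=name)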

Multi-Tenancy

Our platform is multi-tenant - multiple customers share infrastructure. In Elasticsearch, we implement tenant isolation through:

Index-per-tenant: Each customer’s data goes to separate indices. This provides strong isolation and makes it easy to delete a customer’s data if needed.

Document-level security: For shared indices, we use document-level security to ensure users can only query their own documents.

Separate Kibana instances: Each customer gets their own Kibana instance that’s preconfigured to only access their data.
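
For the index-per-tenant approach, a minimal sketch of what routing looks like at write time (the naming scheme, tenant identifier, and document type are illustrative):

  from datetime import datetime
  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://elasticsearch.internal:9200"])  # hypothetical endpoint

  def index_audit_event(tenant_id, event):
      # Weekly per-tenant index, e.g. audit-acme-2015.43 (illustrative naming).
      index_name = "audit-%s-%s" % (tenant_id, datetime.utcnow().strftime("%Y.%W"))
      es.index(index=index_name, doc_type="audit", body=event)

  # Removing a customer's data then reduces to deleting the indices that match
  # their prefix, e.g. es.indices.delete(index="audit-acme-*").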

High Availability

The Elasticsearch cluster is distributed across three availability zones in AWS. If one zone fails, the cluster continues operating with replicas in the other zones.

We configure replica shards to ensure every shard has at least one replica in a different availability zone. This means any single zone failure doesn’t cause data loss.
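
This zone-aware placement relies on shard allocation awareness. A hedged sketch of enabling it dynamically; each node must also advertise its zone through a node attribute in its own configuration, which is not shown here:

  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://elasticsearch.internal:9200"])  # hypothetical endpoint

  # Tell the cluster to spread primaries and replicas across the "zone" attribute.
  # Assumes each data node sets a zone attribute (e.g. us-east-1a) in its config.
  es.cluster.put_settings(
      body={"persistent": {"cluster.routing.allocation.awareness.attributes": "zone"}}
  )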

Kibana is stateless and runs behind a load balancer with instances in multiple zones. Logstash also runs multiple instances for redundancy.

Operational Challenges

Running Elasticsearch in production has some challenges:

Cluster state changes: Adding or removing nodes triggers shard rebalancing, which can impact query performance. We schedule these changes during low-traffic periods.

Memory pressure: Elasticsearch can consume all available memory if not carefully tuned. We monitor JVM heap usage and adjust batch sizes to prevent out-of-memory errors.

Split brain: In network partition scenarios, Elasticsearch clusters can split into multiple clusters. We prevent this by setting discovery.zen.minimum_master_nodes to a majority of master-eligible nodes - 2 in our cluster with 3 master-eligible nodes.

Data growth: Log volume grows continuously. We monitor disk usage closely and provision additional storage or tighten retention policies as needed.

Looking Forward

We’re exploring several enhancements:

Machine learning: Elasticsearch recently added ML capabilities. We’re evaluating anomaly detection for security events and performance metrics.

Beats: Elastic is developing specialized log shippers (Beats) for different data types. We’re testing Metricbeat for system metrics and Packetbeat for network monitoring.

Cross-cluster search: For our multi-region deployment, cross-cluster search would allow querying data across all regions from a single interface.

Security plugins: Third-party plugins like Search Guard provide more sophisticated security features than built-in Elasticsearch authentication.

Key Takeaways

For teams building ELK-based monitoring:

  1. Design your index structure carefully - it’s hard to change later
  2. Invest time in log parsing and enrichment - structured data enables better queries
  3. Monitor Elasticsearch itself - cluster health, query performance, disk usage
  4. Be thoughtful about retention policies - storage costs add up quickly
  5. Build dashboards for different audiences with appropriate access controls
  6. Alert on actionable problems, not just interesting patterns
  7. Use time-based indices for easy data lifecycle management

ELK has been transformative for our monitoring capabilities. The visibility into HSM operations, microservices performance, and security events has helped us identify and fix issues before they impact customers. It’s also essential for compliance, providing the audit trail that regulators require.

The learning curve is steep - Elasticsearch has many knobs to tune and behaviors to understand. But the investment has paid off in improved system reliability and operational efficiency. If you’re building a distributed system and need comprehensive monitoring, ELK is worth serious consideration.