As our infrastructure has grown from a handful of servers to dozens of microservices running across multiple regions, understanding what’s happening in our systems has become exponentially harder. Traditional logging—SSH to a server, tail a log file—doesn’t scale.
We needed centralized logging that could:
- Aggregate logs from all services and infrastructure
- Enable fast searching across billions of log entries
- Correlate events across distributed services
- Alert on anomalies and errors
- Provide visibility for debugging and security investigation
Enter the ELK stack: Elasticsearch, Logstash, and Kibana. After six months of running ELK in production, I’ve learned a lot about what works, what doesn’t, and how to build effective observability for distributed systems.
What is the ELK Stack?
Three components work together:
Elasticsearch: Distributed search and analytics engine. Stores and indexes logs.
Logstash: Log collection and processing pipeline. Ingests, parses, and forwards logs.
Kibana: Visualization and exploration UI. Query logs and build dashboards.
The flow:
Applications → Logstash → Elasticsearch → Kibana
Applications send logs to Logstash, which processes and forwards them to Elasticsearch. Users query Elasticsearch through Kibana.
Architecture Decisions
Centralized vs. Agent-Based Collection
I evaluated two approaches:
Centralized: Applications send logs directly to Logstash:
App → Logstash (over network)
Agent-Based: Applications write to local files, an agent ships them:
App → Local File → Filebeat → Logstash
I went with agent-based for several reasons:
- Reliability: If Logstash is down, logs aren’t lost (buffered on disk)
- Performance: Applications don’t block on network I/O
- Backpressure handling: Agents can buffer when Logstash is overwhelmed
We use Filebeat (lightweight log shipper) on every host:
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: order-service
      environment: production
      region: us-east-1

output.logstash:
  hosts: ["logstash-1:5044", "logstash-2:5044", "logstash-3:5044"]
  loadbalance: true
Filebeat watches log files, reads new entries, and forwards to Logstash with load balancing and retry logic.
Logstash Pipeline
Logstash processes logs through three stages: input, filter, output.
Here’s our pipeline:
# logstash.conf
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/cert.pem"
    ssl_key => "/etc/logstash/key.pem"
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }

  # Add timestamp
  date {
    match => ["timestamp", "UNIX"]
    target => "@timestamp"
  }

  # Classify log level
  if [level] == "ERROR" or [level] == "FATAL" {
    mutate {
      add_tag => ["alert"]
    }
  }

  # Extract user ID from security logs
  if [event] == "authentication" {
    grok {
      match => { "message" => "user=%{WORD:user_id}" }
    }
  }

  # Enrich with GeoIP data
  if [source_ip] {
    geoip {
      source => "source_ip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["es-1:9200", "es-2:9200", "es-3:9200"]
    index => "logs-%{[fields][service]}-%{+YYYY.MM.dd}"
    user => "logstash"
    password => "${LOGSTASH_PASSWORD}"
  }

  # Alert on critical errors
  if "alert" in [tags] {
    http {
      url => "https://alerting-service/alert"
      http_method => "post"
      format => "json"
    }
  }
}
The filter stage is where the magic happens. We parse, enrich, and classify logs before indexing.
Elasticsearch Cluster
Elasticsearch is the heart of the system. Our cluster consists of:
- 3 master-eligible nodes (cluster coordination)
- 6 data nodes (store and query data)
Configuration considerations:
Index Design: We create daily indices per service:
logs-order-service-2015.07.30
logs-auth-service-2015.07.30
logs-encryption-service-2015.07.30
This allows us to:
- Drop old indices easily (retention policy)
- Optimize query performance (query specific services)
- Manage storage per service
Sharding Strategy: Each index is split into shards. We use 5 shards per index, with 1 replica:
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
This balances query performance (more shards allow more parallel work per query) against cluster overhead (every shard costs memory, file handles, and cluster-state bookkeeping).
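Rather than setting this on every new daily index by hand, an index template can apply the settings to anything matching logs-*. A minimal sketch (on recent Elasticsearch versions the field is index_patterns rather than template):

curl -XPUT 'localhost:9200/_template/logs' -d '{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'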
Retention Policy: We keep logs for different periods based on importance:
- Security logs: 2 years
- Error logs: 6 months
- Info logs: 30 days
We use curator to automate deletion:
# curator.yml
actions:
  1:
    action: delete_indices
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
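Curator reads the action file each time it runs; we trigger it nightly from cron. A sketch with illustrative paths (note that a single 30-day rule like the one above would also catch security and error indices, so in practice those get their own actions with longer unit_count values):

# /etc/cron.d/curator (illustrative paths)
0 2 * * * elasticsearch /usr/local/bin/curator --config /etc/curator/config.yml /etc/curator/curator.yml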
Structured Logging
The key to effective logging with ELK is structured logging. Don’t log strings:
// Bad
log.Println("User john logged in from 192.168.1.1")
Log structured data:
// Good
log.WithFields(log.Fields{
    "event":     "user_login",
    "user_id":   "john",
    "source_ip": "192.168.1.1",
    "timestamp": time.Now().Unix(),
    "success":   true,
}).Info("User login")
This produces JSON that Logstash can parse easily:
{
  "event": "user_login",
  "user_id": "john",
  "source_ip": "192.168.1.1",
  "timestamp": 1438300800,
  "success": true,
  "level": "info"
}
Now we can query: “Show all failed login attempts for user john in the last hour.”
In unstructured logs, this requires complex regex. In structured logs, it’s a simple query.
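With the fields above, that question becomes a one-line Kibana query (paired with a one-hour range in the time picker):

event:user_login AND user_id:john AND success:false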
Effective Queries
Kibana provides a powerful query language. Some patterns I use frequently:
Finding Errors
level:ERROR AND service:order-service
Security Investigation
event:authentication AND success:false AND source_ip:"192.168.1.1"
Performance Analysis
event:api_request AND response_time:>1000
Distributed Tracing
Using trace IDs to correlate logs across services:
trace_id:"abc123"
This shows all log entries related to a single request, across all services it touched.
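The only requirement is that every service attaches the same trace ID field to its log entries. A minimal sketch in Go, assuming a logrus-style logger; the X-Trace-Id header name and the package layout are illustrative:

package middleware

import (
    "crypto/rand"
    "encoding/hex"
    "net/http"

    log "github.com/sirupsen/logrus"
)

// traceLogger returns a logger with the request's trace ID attached, so every
// entry logged while handling the request carries the same trace_id field.
// "X-Trace-Id" is an assumed header name; a random ID is generated if absent.
func traceLogger(r *http.Request) *log.Entry {
    traceID := r.Header.Get("X-Trace-Id")
    if traceID == "" {
        buf := make([]byte, 8)
        _, _ = rand.Read(buf)
        traceID = hex.EncodeToString(buf)
    }
    return log.WithFields(log.Fields{
        "trace_id": traceID,
        "service":  "order-service",
    })
}

Handlers then log through the returned entry, and the downstream services forward the same header so their entries share the trace_id.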
Dashboards and Visualization
Kibana’s visualization capabilities are powerful. We’ve built several dashboards:
Security Dashboard
- Failed authentication attempts (by user, by IP)
- Unusual access patterns
- Key operations (creation, rotation, deletion)
- API access by service
Performance Dashboard
- Request rate per service
- P50, P95, P99 latency
- Error rate
- Resource utilization
Business Metrics
- Orders created per minute
- User signups
- API calls by endpoint
These dashboards provide real-time visibility into system health and security posture.
Alerting
Kibana doesn’t have native alerting (yet), so we built custom alerting:
Watcher Approach
We use Elasticsearch Watcher to query for conditions and trigger alerts:
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                {"match": {"event": "authentication"}},
                {"match": {"success": false}}
              ],
              "filter": {
                "range": {
                  "@timestamp": {
                    "gte": "now-5m"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 10
      }
    }
  },
  "actions": {
    "notify_security": {
      "webhook": {
        "method": "POST",
        "url": "https://alerting-service/alert",
        "body": "More than 10 failed auth attempts in last 5 minutes"
      }
    }
  }
}
Every minute, this searches for failed auth attempts. If more than 10 in the last 5 minutes, it sends an alert.
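The watch itself is registered through the Watcher API; with the JSON above saved to failed_auth_watch.json, registration looks roughly like this (the endpoint path has moved between Elasticsearch versions):

curl -XPUT 'localhost:9200/_watcher/watch/failed_auth' -d @failed_auth_watch.json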
Custom Alerting Service
For more complex alerting logic, we built a custom service that queries Elasticsearch and evaluates conditions:
func checkAnomalies() {
    // Query Elasticsearch: key_access events from the last hour, bucketed by user
    query := elastic.NewBoolQuery().
        Must(elastic.NewTermQuery("event", "key_access")).
        Filter(elastic.NewRangeQuery("@timestamp").Gte("now-1h"))

    result, err := esClient.Search().
        Index("logs-*").
        Query(query).
        Aggregation("by_user", elastic.NewTermsAggregation().Field("user_id")).
        Do(context.Background())
    if err != nil {
        log.WithError(err).Error("anomaly query failed")
        return
    }

    // Analyze results: alert on any user above the access threshold
    byUser, found := result.Aggregations.Terms("by_user")
    if !found {
        return
    }
    for _, bucket := range byUser.Buckets {
        if bucket.DocCount > threshold {
            sendAlert(fmt.Sprintf(
                "User %v accessed %d keys in last hour (threshold: %d)",
                bucket.Key, bucket.DocCount, threshold,
            ))
        }
    }
}
This detects anomalies like a user accessing an unusual number of keys.
Performance Optimization
Running ELK at scale requires optimization:
Indexing Performance
Bulk Indexing: Logstash batches logs before sending to Elasticsearch:
output {
  elasticsearch {
    hosts => ["es-1:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    flush_size => 500
    idle_flush_time => 5
  }
}
This batches up to 500 logs or waits 5 seconds before sending.
Replica Delay: We create each daily index with 0 replicas, then add a replica a few hours later:
# Create index with no replicas
curl -XPUT 'localhost:9200/logs-2015.07.30' -d '{
  "settings": {
    "number_of_replicas": 0
  }
}'

# Later, add replicas
curl -XPUT 'localhost:9200/logs-2015.07.30/_settings' -d '{
  "number_of_replicas": 1
}'
This improves indexing speed (no replication during peak load).
Query Performance
Index Patterns: Query specific indices instead of wildcards:
# Slow
logs-*
# Fast
logs-order-service-2015.07.30
Field Filtering: Only retrieve fields you need:
{
  "_source": ["user_id", "timestamp", "event"]
}
Caching: Elasticsearch caches filter results. Put exact-match and date-range clauses in filter context, and round date math where possible, so repeated queries can reuse the cache.
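For example, the failed-authentication lookup can put all of its clauses in filter context; a sketch (the /m rounds the range down to the minute, which makes it cache-friendly):

{
  "query": {
    "bool": {
      "filter": [
        {"term": {"event": "authentication"}},
        {"term": {"success": false}},
        {"range": {"@timestamp": {"gte": "now-15m/m"}}}
      ]
    }
  }
}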
Storage Optimization
Force merging: Older indices no longer receive writes, so we force-merge them down to a single segment to reclaim space:
curl -XPOST 'localhost:9200/logs-2015.06.*/_forcemerge?max_num_segments=1'
Hot/Warm Architecture: Recent logs (hot) on fast SSD. Old logs (warm) on cheaper HDD.
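One way to implement this is shard allocation filtering with a custom node attribute (box_type here is a naming convention, not a built-in setting): tag each node, require hot for new indices, and relax the requirement as they age. A sketch:

# elasticsearch.yml on an SSD-backed node
node.box_type: hot

# Keep a fresh index on hot nodes
curl -XPUT 'localhost:9200/logs-order-service-2015.07.30/_settings' -d '{
  "index.routing.allocation.require.box_type": "hot"
}'

# Days later, let it migrate to warm (HDD) nodes
curl -XPUT 'localhost:9200/logs-order-service-2015.07.30/_settings' -d '{
  "index.routing.allocation.require.box_type": "warm"
}'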
Security Considerations
Encrypted Communication
All communication is encrypted:
- Filebeat → Logstash: TLS
- Logstash → Elasticsearch: HTTPS
- Kibana → Elasticsearch: HTTPS
- User → Kibana: HTTPS
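On the shipping side, Filebeat's TLS settings sit under the Logstash output; a sketch with illustrative certificate paths (older Filebeat releases use a tls: block instead of ssl.*):

# filebeat.yml
output.logstash:
  hosts: ["logstash-1:5044"]
  ssl.certificate_authorities: ["/etc/filebeat/ca.pem"]
  ssl.certificate: "/etc/filebeat/cert.pem"
  ssl.key: "/etc/filebeat/key.pem"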
Authentication and Authorization
Elasticsearch has authentication enabled:
# elasticsearch.yml
xpack.security.enabled: true
Users have role-based access:
- Developers: Read access to application logs
- Security Team: Read access to all logs
- Admins: Full access
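For illustration, such roles might be defined in a roles file along these lines (a sketch; the exact format depends on the X-Pack/Shield version, and roles can also be managed through the security API):

# roles.yml (sketch)
developer:
  indices:
    - names: [ "logs-order-service-*" ]
      privileges: [ "read" ]

security_team:
  indices:
    - names: [ "logs-*" ]
      privileges: [ "read" ]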
Audit Logging
We log all access to Elasticsearch itself:
# elasticsearch.yml
xpack.security.audit.enabled: true
This creates an audit trail of who queried what logs.
Sensitive Data
Be careful what you log. We:
- Never log passwords, tokens, or encryption keys
- Hash or redact PII (personally identifiable information)
- Use field-level security to restrict access to sensitive fields
Example redaction in Logstash:
filter {
  # Redact credit card numbers
  mutate {
    gsub => [
      "message", "\d{4}-\d{4}-\d{4}-\d{4}", "XXXX-XXXX-XXXX-XXXX"
    ]
  }
}
Operational Lessons
Capacity Planning
Elasticsearch storage grows fast. We ingest ~500GB of logs per day. Plan for:
- 1.5-2x raw log size (accounting for indexing overhead)
- Retention period (30 days = 15TB for us)
- Replica factor (1 replica = 2x storage)
Total: roughly 30TB for 30-day retention (15TB of primary data doubled by replicas), and more once indexing overhead is counted.
Monitoring the Monitoring System
Don’t forget to monitor ELK itself:
- Elasticsearch cluster health
- Indexing rate and lag
- Query performance
- Disk usage and growth rate
- Logstash pipeline throughput
We use Prometheus and Grafana to monitor our ELK cluster.
Disaster Recovery
Elasticsearch stores critical security logs. We:
- Replicate across availability zones: Replica shards in different AZs
- Snapshot to S3: Daily snapshots of all indices
- Test restoration: Regular DR drills
# Create snapshot repository
curl -XPUT 'localhost:9200/_snapshot/s3_backup' -d '{
  "type": "s3",
  "settings": {
    "bucket": "elasticsearch-backups",
    "region": "us-east-1"
  }
}'
# Create snapshot
curl -XPUT 'localhost:9200/_snapshot/s3_backup/snapshot_1'
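A DR drill then restores from the same repository; by default this restores every index in the snapshot, so drills typically target a test cluster or use the restore API's rename options:

# Restore a snapshot
curl -XPOST 'localhost:9200/_snapshot/s3_backup/snapshot_1/_restore'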
Future Improvements
Areas we’re working on:
- Machine learning: Anomaly detection using Elasticsearch ML
- Better alerting: More sophisticated alerting rules
- Correlation: Better tools for correlating events across services
- Performance: Optimizing for even faster queries
- Cost optimization: Reducing storage costs with better compression and retention policies
Lessons Learned
After six months with ELK:
What Works:
- Structured logging is essential
- Daily indices make management easier
- Agent-based shipping is more reliable than direct logging
- Dashboards provide great visibility
- Integration with alerting catches issues quickly
What’s Hard:
- Scaling Elasticsearch requires expertise
- Storage costs grow quickly
- Query optimization is non-trivial
- Keeping up with Elasticsearch changes (rapid development)
- Ensuring security of sensitive logs
Conclusion
The ELK stack has transformed our ability to understand and debug our distributed systems. Centralized logging with powerful search and visualization is essential for operating microservices at scale.
The investment in setting up and maintaining ELK is significant, but the payoff in operational visibility and security investigation capability makes it worthwhile.
If you’re running distributed systems, invest in observability. ELK is one good option. The alternative—distributed logs across dozens of services—is untenable.
Start simple: collect logs centrally. Add structure. Build dashboards. Iterate.
Your future self (and your on-call engineers) will thank you.
Getting Started
If you’re new to ELK:
- Start small: Single Elasticsearch node, single Logstash instance
- Use structured logging: JSON format
- Create one useful dashboard: Build from there
- Plan for growth: Elasticsearch scales, but you need to understand how
- Secure it: Authentication, encryption, audit logging
The journey to observability starts with collecting logs. ELK is a proven way to do it.
In future posts, I’ll explore advanced topics: Elasticsearch performance tuning, machine learning for anomaly detection, and integrating distributed tracing.
Happy logging!