As our infrastructure has grown from a handful of servers to dozens of microservices running across multiple regions, understanding what’s happening in our systems has become exponentially harder. Traditional logging—SSH to a server, tail a log file—doesn’t scale.
We needed centralized logging that could:
- Aggregate logs from all services and infrastructure
- Enable fast searching across billions of log entries
- Correlate events across distributed services
- Alert on anomalies and errors
- Provide visibility for debugging and security investigation
Enter the ELK stack: Elasticsearch, Logstash, and Kibana. After six months of running ELK in production, I’ve learned a lot about what works, what doesn’t, and how to build effective observability for distributed systems.
What is the ELK Stack?
Three components work together:
Elasticsearch: Distributed search and analytics engine. Stores and indexes logs.
Logstash: Log collection and processing pipeline. Ingests, parses, and forwards logs.
Kibana: Visualization and exploration UI. Query logs and build dashboards.
The flow:
Applications → Logstash → Elasticsearch → Kibana
Applications send logs to Logstash, which processes and forwards them to Elasticsearch. Users query Elasticsearch through Kibana.
Architecture Decisions
Centralized vs. Agent-Based Collection
I evaluated two approaches:
Centralized: Applications send logs directly to Logstash:
App → Logstash (over network)
Agent-Based: Applications write to local files, an agent ships them:
App → Local File → Filebeat → Logstash
I went with agent-based for several reasons:
- Reliability: If Logstash is down, logs aren’t lost (buffered on disk)
- Performance: Applications don’t block on network I/O
- Backpressure handling: Agents can buffer when Logstash is overwhelmed
We use Filebeat (lightweight log shipper) on every host:
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    fields:
      service: order-service
      environment: production
      region: us-east-1

output.logstash:
  hosts: ["logstash-1:5044", "logstash-2:5044", "logstash-3:5044"]
  loadbalance: true
Filebeat watches log files, reads new entries, and forwards to Logstash with load balancing and retry logic.
Logstash Pipeline
Logstash processes logs through three stages: input, filter, output.
Here’s our pipeline:
# logstash.conf
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/cert.pem"
    ssl_key => "/etc/logstash/key.pem"
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }

  # Add timestamp
  date {
    match => ["timestamp", "UNIX"]
    target => "@timestamp"
  }

  # Classify log level
  if [level] == "ERROR" or [level] == "FATAL" {
    mutate {
      add_tag => ["alert"]
    }
  }

  # Extract user ID from security logs
  if [event] == "authentication" {
    grok {
      match => { "message" => "user=%{WORD:user_id}" }
    }
  }

  # Enrich with GeoIP data
  if [source_ip] {
    geoip {
      source => "source_ip"
    }
  }
}

output {
  elasticsearch {
    hosts => ["es-1:9200", "es-2:9200", "es-3:9200"]
    index => "logs-%{[fields][service]}-%{+YYYY.MM.dd}"
    user => "logstash"
    password => "${LOGSTASH_PASSWORD}"
  }

  # Alert on critical errors
  if "alert" in [tags] {
    http {
      url => "https://alerting-service/alert"
      http_method => "post"
      format => "json"
    }
  }
}
The filter stage is where the magic happens. We parse, enrich, and classify logs before indexing.
Elasticsearch Cluster
Elasticsearch is the heart of the system. Our cluster consists of:
- 3 master-eligible nodes (cluster coordination)
- 6 data nodes (store and query data)
Configuration considerations:
Index Design: We create daily indices per service:
logs-order-service-2015.07.30
logs-auth-service-2015.07.30
logs-encryption-service-2015.07.30
This allows us to:
- Drop old indices easily (retention policy)
- Optimize query performance (query specific services)
- Manage storage per service
Sharding Strategy: Each index is split into shards. We use 5 shards per index, with 1 replica:
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
This balances query performance (more shards allow more parallel work per query) against cluster overhead (every shard costs memory, file handles, and cluster-state bookkeeping).
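Rather than setting this on every new daily index by hand, an index template can apply the settings to anything matching logs-*. A minimal sketch (on recent Elasticsearch versions the field is index_patterns rather than template):

curl -XPUT 'localhost:9200/_template/logs' -d '{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'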
Retention Policy: We keep logs for different periods based on importance:
- Security logs: 2 years
- Error logs: 6 months
- Info logs: 30 days
We use curator to automate deletion:
# curator.yml
actions:
  1:
    action: delete_indices
    filters:
      - filtertype: pattern
        kind: prefix
        value: logs-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
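Curator reads the action file each time it runs; we trigger it nightly from cron. A sketch with illustrative paths (note that a single 30-day rule like the one above would also catch security and error indices, so in practice those get their own actions with longer unit_count values):

# /etc/cron.d/curator (illustrative paths)
0 2 * * * elasticsearch /usr/local/bin/curator --config /etc/curator/config.yml /etc/curator/curator.yml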
Structured Logging
The key to effective logging with ELK is structured logging. Don’t log strings:
// Bad
log.Println("User john logged in from 192.168.1.1")
Log structured data:
// Good
log.WithFields(log.Fields{
    "event":     "user_login",
    "user_id":   "john",
    "source_ip": "192.168.1.1",
    "timestamp": time.Now().Unix(),
    "success":   true,
}).Info("User login")
This produces JSON that Logstash can parse easily:
{
  "event": "user_login",
  "user_id": "john",
  "source_ip": "192.168.1.1",
  "timestamp": 1438300800,
  "success": true,
  "level": "info"
}
Now we can query: “Show all failed login attempts for user john in the last hour.”
In unstructured logs, this requires complex regex. In structured logs, it’s a simple query.
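With the fields above, that question becomes a one-line Kibana query (paired with a one-hour range in the time picker):

event:user_login AND user_id:john AND success:false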
Effective Queries
Kibana provides a powerful query language. Some patterns I use frequently:
Finding Errors
level:ERROR AND service:order-service
Security Investigation
event:authentication AND success:false AND source_ip:"192.168.1.1"
Performance Analysis
event:api_request AND response_time:>1000
Distributed Tracing
Using trace IDs to correlate logs across services:
trace_id:"abc123"
This shows all log entries related to a single request, across all services it touched.
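The only requirement is that every service attaches the same trace ID field to its log entries. A minimal sketch in Go, assuming a logrus-style logger; the X-Trace-Id header name and the package layout are illustrative:

package middleware

import (
    "crypto/rand"
    "encoding/hex"
    "net/http"

    log "github.com/sirupsen/logrus"
)

// traceLogger returns a logger with the request's trace ID attached, so every
// entry logged while handling the request carries the same trace_id field.
// "X-Trace-Id" is an assumed header name; a random ID is generated if absent.
func traceLogger(r *http.Request) *log.Entry {
    traceID := r.Header.Get("X-Trace-Id")
    if traceID == "" {
        buf := make([]byte, 8)
        _, _ = rand.Read(buf)
        traceID = hex.EncodeToString(buf)
    }
    return log.WithFields(log.Fields{
        "trace_id": traceID,
        "service":  "order-service",
    })
}

Handlers then log through the returned entry, and the downstream services forward the same header so their entries share the trace_id.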
Dashboards and Visualization
Kibana’s visualization capabilities are powerful. We’ve built several dashboards:
Security Dashboard
- Failed authentication attempts (by user, by IP)
- Unusual access patterns
- Key operations (creation, rotation, deletion)
- API access by service
Performance Dashboard
- Request rate per service
- P50, P95, P99 latency
- Error rate
- Resource utilization
Business Metrics
- Orders created per minute
- User signups
- API calls by endpoint
These dashboards provide real-time visibility into system health and security posture.
Alerting
Kibana doesn’t have native alerting (yet), so we built custom alerting:
Watcher Approach
We use Elasticsearch Watcher to query for conditions and trigger alerts:
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                {"match": {"event": "authentication"}},
                {"match": {"success": false}}
              ],
              "filter": {
                "range": {
                  "@timestamp": {
                    "gte": "now-5m"
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 10
      }
    }
  },
  "actions": {
    "notify_security": {
      "webhook": {
        "method": "POST",
        "url": "https://alerting-service/alert",
        "body": "More than 10 failed auth attempts in last 5 minutes"
      }
    }
  }
}
Every minute, this searches for failed auth attempts. If more than 10 in the last 5 minutes, it sends an alert.
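The watch itself is registered through the Watcher API; with the JSON above saved to failed_auth_watch.json, registration looks roughly like this (the endpoint path has moved between Elasticsearch versions):

curl -XPUT 'localhost:9200/_watcher/watch/failed_auth' -d @failed_auth_watch.json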
Custom Alerting Service
For more complex alerting logic, we built a custom service that queries Elasticsearch and evaluates conditions:
func checkAnomalies() {
    // Query Elasticsearch: key_access events from the last hour, bucketed by user
    query := elastic.NewBoolQuery().
        Must(elastic.NewTermQuery("event", "key_access")).
        Filter(elastic.NewRangeQuery("@timestamp").Gte("now-1h"))

    result, err := esClient.Search().
        Index("logs-*").
        Query(query).
        Aggregation("by_user", elastic.NewTermsAggregation().Field("user_id")).
        Do(context.Background())
    if err != nil {
        log.WithError(err).Error("anomaly query failed")
        return
    }

    // Analyze results: alert on any user above the access threshold
    byUser, found := result.Aggregations.Terms("by_user")
    if !found {
        return
    }
    for _, bucket := range byUser.Buckets {
        if bucket.DocCount > threshold {
            sendAlert(fmt.Sprintf(
                "User %v accessed %d keys in last hour (threshold: %d)",
                bucket.Key, bucket.DocCount, threshold,
            ))
        }
    }
}
This detects anomalies like a user accessing an unusual number of keys.
Performance Optimization
Running ELK at scale requires optimization:
Indexing Performance
Bulk Indexing: Logstash batches logs before sending to Elasticsearch:
output {
  elasticsearch {
    hosts => ["es-1:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    flush_size => 500
    idle_flush_time => 5
  }
}
This batches up to 500 logs or waits 5 seconds before sending.
Replica Delay: We create each daily index with 0 replicas, then add a replica a few hours later:
# Create index with no replicas
curl -XPUT 'localhost:9200/logs-2015.07.30' -d '{
  "settings": {
    "number_of_replicas": 0
  }
}'

# Later, add replicas
curl -XPUT 'localhost:9200/logs-2015.07.30/_settings' -d '{
  "number_of_replicas": 1
}'
This improves indexing speed (no replication during peak load).
Query Performance
Index Patterns: Query specific indices instead of wildcards:
# Slow
logs-*
# Fast
logs-order-service-2015.07.30
Field Filtering: Only retrieve fields you need:
{
  "_source": ["user_id", "timestamp", "event"]
}
Caching: Elasticsearch caches filter results. Put exact-match and date-range clauses in filter context, and round date math where possible, so repeated queries can reuse the cache.
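For example, the failed-authentication lookup can put all of its clauses in filter context; a sketch (the /m rounds the range down to the minute, which makes it cache-friendly):

{
  "query": {
    "bool": {
      "filter": [
        {"term": {"event": "authentication"}},
        {"term": {"success": false}},
        {"range": {"@timestamp": {"gte": "now-15m/m"}}}
      ]
    }
  }
}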
Storage Optimization
Force merging: Older indices no longer receive writes, so we force-merge them down to a single segment to reclaim space:
curl -XPOST 'localhost:9200/logs-2015.06.*/_forcemerge?max_num_segments=1'
Hot/Warm Architecture: Recent logs (hot) on fast SSD. Old logs (warm) on cheaper HDD.
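One way to implement this is shard allocation filtering with a custom node attribute (box_type here is a naming convention, not a built-in setting): tag each node, require hot for new indices, and relax the requirement as they age. A sketch:

# elasticsearch.yml on an SSD-backed node
node.box_type: hot

# Keep a fresh index on hot nodes
curl -XPUT 'localhost:9200/logs-order-service-2015.07.30/_settings' -d '{
  "index.routing.allocation.require.box_type": "hot"
}'

# Days later, let it migrate to warm (HDD) nodes
curl -XPUT 'localhost:9200/logs-order-service-2015.07.30/_settings' -d '{
  "index.routing.allocation.require.box_type": "warm"
}'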
Security Considerations
Encrypted Communication
All communication is encrypted:
- Filebeat → Logstash: TLS
- Logstash → Elasticsearch: HTTPS
- Kibana → Elasticsearch: HTTPS
- User → Kibana: HTTPS
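On the shipping side, Filebeat's TLS settings sit under the Logstash output; a sketch with illustrative certificate paths (older Filebeat releases use a tls: block instead of ssl.*):

# filebeat.yml
output.logstash:
  hosts: ["logstash-1:5044"]
  ssl.certificate_authorities: ["/etc/filebeat/ca.pem"]
  ssl.certificate: "/etc/filebeat/cert.pem"
  ssl.key: "/etc/filebeat/key.pem"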
Authentication and Authorization
Elasticsearch has authentication enabled:
# elasticsearch.yml
xpack.security.enabled: true
Users have role-based access:
- Developers: Read access to application logs
- Security Team: Read access to all logs
- Admins: Full access
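For illustration, such roles might be defined in a roles file along these lines (a sketch; the exact format depends on the X-Pack/Shield version, and roles can also be managed through the security API):

# roles.yml (sketch)
developer:
  indices:
    - names: [ "logs-order-service-*" ]
      privileges: [ "read" ]

security_team:
  indices:
    - names: [ "logs-*" ]
      privileges: [ "read" ]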
Audit Logging
We log all access to Elasticsearch itself:
# elasticsearch.yml
xpack.security.audit.enabled: true
This creates an audit trail of who queried what logs.
Sensitive Data
Be careful what you log. We:
- Never log passwords, tokens, or encryption keys
- Hash or redact PII (personally identifiable information)
- Use field-level security to restrict access to sensitive fields
Example redaction in Logstash:
filter {
  # Redact credit card numbers
  mutate {
    gsub => [
      "message", "\d{4}-\d{4}-\d{4}-\d{4}", "XXXX-XXXX-XXXX-XXXX"
    ]
  }
}
Operational Lessons
Capacity Planning
Elasticsearch storage grows fast. We ingest ~500GB of logs per day. Plan for:
- 1.5-2x raw log size (accounting for indexing overhead)
- Retention period (30 days = 15TB for us)
- Replica factor (1 replica = 2x storage)
Total: roughly 30TB for 30-day retention (15TB of primary data doubled by replicas), and more once indexing overhead is counted.
Monitoring the Monitoring System
Don’t forget to monitor ELK itself:
- Elasticsearch cluster health
- Indexing rate and lag
- Query performance
- Disk usage and growth rate
- Logstash pipeline throughput
We use Prometheus and Grafana to monitor our ELK cluster.
Disaster Recovery
Elasticsearch stores critical security logs. We:
- Replicate across availability zones: Replica shards in different AZs
- Snapshot to S3: Daily snapshots of all indices
- Test restoration: Regular DR drills
# Create snapshot repository
curl -XPUT 'localhost:9200/_snapshot/s3_backup' -d '{
  "type": "s3",
  "settings": {
    "bucket": "elasticsearch-backups",
    "region": "us-east-1"
  }
}'
# Create snapshot
curl -XPUT 'localhost:9200/_snapshot/s3_backup/snapshot_1'
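A DR drill then restores from the same repository; by default this restores every index in the snapshot, so drills typically target a test cluster or use the restore API's rename options:

# Restore a snapshot
curl -XPOST 'localhost:9200/_snapshot/s3_backup/snapshot_1/_restore'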
Future Improvements
Areas we’re working on:
- Machine learning: Anomaly detection using Elasticsearch ML
- Better alerting: More sophisticated alerting rules
- Correlation: Better tools for correlating events across services
- Performance: Optimizing for even faster queries
- Cost optimization: Reducing storage costs with better compression and retention policies
Lessons Learned
After six months with ELK:
What Works:
- Structured logging is essential
- Daily indices make management easier
- Agent-based shipping is more reliable than direct logging
- Dashboards provide great visibility
- Integration with alerting catches issues quickly
What’s Hard:
- Scaling Elasticsearch requires expertise
- Storage costs grow quickly
- Query optimization is non-trivial
- Keeping up with Elasticsearch changes (rapid development)
- Ensuring security of sensitive logs
Conclusion
The ELK stack has transformed our ability to understand and debug our distributed systems. Centralized logging with powerful search and visualization is essential for operating microservices at scale.
The investment in setting up and maintaining ELK is significant, but the payoff in operational visibility and security investigation capability makes it worthwhile.
If you’re running distributed systems, invest in observability. ELK is one good option. The alternative—distributed logs across dozens of services—is untenable.
Start simple: collect logs centrally. Add structure. Build dashboards. Iterate.
Your future self (and your on-call engineers) will thank you.
Getting Started
If you’re new to ELK:
- Start small: Single Elasticsearch node, single Logstash instance
- Use structured logging: JSON format
- Create one useful dashboard: Build from there
- Plan for growth: Elasticsearch scales, but you need to understand how
- Secure it: Authentication, encryption, audit logging
The journey to observability starts with collecting logs. ELK is a proven way to do it.
In future posts, I’ll explore advanced topics: Elasticsearch performance tuning, machine learning for anomaly detection, and integrating distributed tracing.
Happy logging!