As we close out 2015, I want to reflect on one of the most challenging aspects of our CipherTrust platform: policy management. It’s not enough to provide encryption and key management capabilities - enterprises need to enforce policies about how those capabilities are used. Who can access which keys? Which data must be encrypted? What key rotation schedules apply? Let me share what we’ve built and learned.

Why Policy Management Matters

In a small organization, you might have a few developers who all understand security requirements. As organizations scale, this doesn’t work. You have hundreds of developers, multiple teams, various applications, all handling sensitive data. You can’t rely on everyone making correct security decisions.

Policy management centralizes security decisions. Instead of each application implementing its own access controls, policies are defined centrally and enforced by the key management platform. This provides:

Consistency: The same policies apply across all applications and environments.

Auditability: Policies are documented and versioned. Auditors can review exactly what policies are in effect.

Separation of duties: Security teams define policies. Development teams build applications. Neither can bypass the other.

Compliance: Many regulations require documented policies. A policy management system provides evidence that policies exist and are enforced.

Policy Architecture

Our policy system has several components:

Policy Definition Language: A DSL for expressing policies in a human-readable format.

Policy Repository: Stores policy definitions with versioning and access controls.

Policy Evaluation Engine: Evaluates policies against incoming requests to determine if operations should be allowed.

Policy Administration UI: Allows security teams to create, edit, and test policies without writing code.

Audit Trail: Logs all policy evaluations, showing which policies were evaluated, what decisions were made, and why.

Policy Model

We use an attribute-based access control (ABAC) model. Policies evaluate attributes of:

Subject: Who is making the request? User identity, role, group membership, security clearance.

Resource: What resource is being accessed? Key identifier, key type, data classification, geographic location.

Action: What operation is being performed? Encrypt, decrypt, sign, verify, rotate, delete.

Environment: Context of the request. Time of day, source IP address, authentication method, whether the connection is from an approved network.

A policy might say: “Users in the finance role can decrypt payment_processing keys, but only from the corporate network, and only during business hours.”
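To make the four attribute categories concrete, here is a minimal Python sketch of that example rule. It is an illustration only, not our engine: the `Request` class, the `10.0.0.0/8` network range, and the 09:00 to 17:00 business hours are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import time
from ipaddress import ip_address, ip_network

# Hypothetical values: the network range and hours are made up for this sketch.
CORPORATE_NETWORK = ip_network("10.0.0.0/8")
BUSINESS_HOURS = (time(9, 0), time(17, 0))


@dataclass(frozen=True)
class Request:
    role: str          # subject attribute
    key_type: str      # resource attribute
    action: str        # action
    source_ip: str     # environment attribute
    local_time: time   # environment attribute


def finance_decrypt_allowed(req: Request) -> bool:
    """Return True only if every attribute condition of the example holds."""
    start, end = BUSINESS_HOURS
    return (
        req.role == "finance"
        and req.key_type == "payment_processing"
        and req.action == "decrypt"
        and ip_address(req.source_ip) in CORPORATE_NETWORK
        and start <= req.local_time <= end
    )
```

Each clause maps to one attribute category, so changing a single attribute (a different network, a late-night request) is enough to flip the result.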

Policy Expression Language

We needed a language for expressing policies that’s powerful enough for complex rules but simple enough for security teams to write without being programmers.

Our initial implementation used JSON-based policy documents similar to AWS IAM policies. They worked but were verbose and error-prone. Writing complex policies required deep understanding of the schema.

We’ve developed a higher-level policy language that compiles to the JSON format:

policy "payment_card_decryption" {
  description = "Control access to payment card encryption keys"

  resource {
    key_type = "payment_card_data"
  }

  subject {
    role in ["payment_processor", "fraud_detection"]
    mfa_authenticated = true
  }

  action = "decrypt"

  conditions {
    source_ip in corporate_networks
    time between business_hours
  }

  effect = "allow"
}

This is more readable and easier to write correctly. We validate policies at definition time, catching errors before they’re deployed.
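As a rough idea of what definition-time validation can look like, the Python sketch below checks a compiled policy represented as a plain dictionary. The dictionary shape and field names are assumptions for illustration (the real compiled JSON schema is not shown in this post); the action names come from the policy model described earlier.

```python
KNOWN_ACTIONS = {"encrypt", "decrypt", "sign", "verify", "rotate", "delete"}
KNOWN_EFFECTS = {"allow", "deny"}
# Hypothetical required fields for the compiled form of a policy.
REQUIRED_FIELDS = ("name", "resource", "subject", "action", "effect")


def validate_policy(policy: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in policy]
    action = policy.get("action")
    if action is not None and action not in KNOWN_ACTIONS:
        errors.append(f"unknown action: {action!r}")
    effect = policy.get("effect")
    if effect is not None and effect not in KNOWN_EFFECTS:
        errors.append(f"unknown effect: {effect!r}")
    return errors
```

Returning every problem at once, rather than stopping at the first, tends to work better for policy authors who are not programmers.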

Policy Evaluation

When a request comes in to decrypt a key, the policy engine evaluates all applicable policies:

  1. Identify applicable policies: Which policies apply to this key, action, and subject?

  2. Evaluate conditions: For each policy, evaluate whether conditions are met.

  3. Combine decisions: If multiple policies apply, combine their decisions. Our default is “deny unless explicitly allowed.”

  4. Return decision: Allow or deny, with the reason (which policies contributed to the decision).

Policy evaluation must be fast - it happens on every key operation. We’ve optimized the evaluation engine to make decisions in single-digit milliseconds.
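The four steps can be sketched in a few lines of Python. This is a simplified illustration, not our engine: the `Policy` and `Decision` classes are hypothetical, requests are plain dictionaries, and only allow policies are modeled (explicit denies are covered under Conflict Resolution).

```python
from dataclasses import dataclass
from typing import Callable

Request = dict  # flat attribute map, e.g. {"role": ..., "key_type": ..., "action": ...}


@dataclass(frozen=True)
class Policy:
    name: str
    key_type: str
    action: str
    condition: Callable[[Request], bool]


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def evaluate(policies: list[Policy], request: Request) -> Decision:
    # Step 1: keep only policies that target this key type and action.
    applicable = [
        p for p in policies
        if p.key_type == request["key_type"] and p.action == request["action"]
    ]
    # Step 2: evaluate each applicable policy's conditions.
    matched = [p.name for p in applicable if p.condition(request)]
    # Steps 3 and 4: deny unless some policy explicitly allows, and say why.
    if matched:
        return Decision(True, "allowed by: " + ", ".join(matched))
    return Decision(False, "no applicable policy allowed this request")
```

Filtering by key type and action before running any conditions keeps the expensive part of evaluation limited to the handful of policies that could actually apply.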

Caching and Performance

Evaluating policies against external data sources (LDAP for group membership, GeoIP for location) can be slow. We implement several caching strategies:

Policy compilation: Policies are compiled into an optimized internal representation that’s faster to evaluate than the source language.

Attribute caching: Subject attributes (role, group membership) are cached with short TTLs. We can tolerate slight staleness in non-critical attributes.

Decision caching: For frequently made requests, we cache the policy decision itself. “User X can decrypt key Y” might be cached for 60 seconds.

Caching introduces a window where policy changes aren’t immediately effective. We balance this against performance requirements and accept short delays for policy propagation.
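A decision cache with a fixed TTL is simple to sketch. The Python below is a minimal, single-threaded illustration under assumed names (`DecisionCache`, `get_or_evaluate`); a production cache would also need locking, size limits, and eviction.

```python
import time
from typing import Callable, Hashable


class DecisionCache:
    """Cache allow/deny decisions for a fixed number of seconds."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: dict[Hashable, tuple[float, bool]] = {}

    def get_or_evaluate(self, key: Hashable, evaluate: Callable[[], bool]) -> bool:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]          # fresh entry: skip evaluation
        decision = evaluate()        # expired or missing: run the full evaluation
        self._entries[key] = (now + self._ttl, decision)
        return decision

    def invalidate_all(self) -> None:
        """Drop every cached decision, e.g. after a policy change."""
        self._entries.clear()
```

The TTL is exactly the propagation delay described above: a revoked permission can keep working until its cached entry expires, unless the cache is invalidated explicitly.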

Dynamic Policies

Some policies need to reference dynamic data. For example: “Only encrypt data with keys that have been rotated within the last 90 days.”

This requires the policy engine to query the key management system for key rotation history. We’ve built a plugin architecture where policies can call out to data sources during evaluation.

Plugins must be carefully designed to avoid creating performance bottlenecks or security vulnerabilities. They run with limited permissions and strict timeouts.
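As an illustration of the timeout idea, here is a Python sketch of the 90-day rotation check calling out to a data source. The `fetch_last_rotation` callable stands in for a plugin and is hypothetical, as are the default timeout and the pool size. The sketch fails closed: a slow or failing plugin results in a denial.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from datetime import datetime, timedelta
from typing import Callable

MAX_KEY_AGE = timedelta(days=90)
_executor = ThreadPoolExecutor(max_workers=4)


def rotated_recently(key_id: str,
                     fetch_last_rotation: Callable[[str], datetime],
                     now: datetime,
                     timeout_seconds: float = 0.1) -> bool:
    """Return True if the key was rotated within MAX_KEY_AGE.

    fetch_last_rotation stands in for a plugin call to the key management
    system. Any timeout or plugin error fails closed (returns False).
    """
    future = _executor.submit(fetch_last_rotation, key_id)
    try:
        last_rotation = future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # cancel() only helps if the call has not started; a running
        # worker thread is not interrupted.
        future.cancel()
        return False
    except Exception:
        return False
    return now - last_rotation <= MAX_KEY_AGE
```

Note that a thread-based timeout only stops the caller from waiting. Real isolation of a misbehaving plugin needs stronger boundaries, such as a separate process with restricted permissions.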

Policy Testing

Policies can have bugs just like code. A typo might accidentally grant overly broad access or block legitimate operations. We provide tools for testing policies before deployment:

Policy simulation: Test how a policy would evaluate for specific requests without actually enforcing it.

Policy coverage: Identify which resources and subjects are covered by which policies, highlighting gaps.

Regression testing: Maintain test suites of expected policy decisions. When policies change, verify that expected cases still work correctly.

Shadow mode: Deploy new policies in shadow mode where they’re evaluated but not enforced. This reveals what would change before making it live.
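Shadow mode in particular is easy to picture in code. In this Python sketch (the function and parameter names are hypothetical), the active policy decides the outcome while the candidate policy is evaluated on the side and any disagreement is recorded for review.

```python
from typing import Callable

Request = dict
PolicyFn = Callable[[Request], bool]


def evaluate_with_shadow(active: PolicyFn, candidate: PolicyFn,
                         request: Request, divergences: list) -> bool:
    """Enforce `active`; record requests where `candidate` would disagree."""
    enforced = active(request)
    shadow = candidate(request)
    if shadow != enforced:
        divergences.append(
            {"request": request, "enforced": enforced, "shadow": shadow}
        )
    return enforced
```

Reviewing the recorded divergences before promoting the candidate shows exactly which real requests would start being allowed or denied.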

Policy Versioning and Rollback

Policies change over time. We maintain version history for all policies and support rollback to previous versions.

When policies are updated, we transition through several states:

  1. Draft: Policy is being edited but not active.
  2. Testing: Policy is being evaluated in shadow mode.
  3. Active: Policy is enforced for all requests.
  4. Deprecated: Policy is marked for removal but still enforced.
  5. Archived: Policy is no longer enforced but retained for audit purposes.

This lifecycle allows careful validation before policies go live and maintains history for compliance.
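The lifecycle maps naturally onto a small state machine. The Python sketch below uses the five states listed above; the set of permitted transitions (including the backward moves) is an assumption for illustration, since only the forward order is described here.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    TESTING = "testing"
    ACTIVE = "active"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"


# Assumed transitions: the forward path follows the list above; the
# backward moves (testing -> draft, deprecated -> active) are hypothetical.
ALLOWED = {
    State.DRAFT: {State.TESTING},
    State.TESTING: {State.DRAFT, State.ACTIVE},
    State.ACTIVE: {State.DEPRECATED},
    State.DEPRECATED: {State.ACTIVE, State.ARCHIVED},
    State.ARCHIVED: set(),
}

# Per the list above, active and deprecated policies are both enforced.
ENFORCED = {State.ACTIVE, State.DEPRECATED}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not permitted."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


def is_enforced(state: State) -> bool:
    return state in ENFORCED
```

Making the transition table explicit means a draft can never jump straight to active without passing through testing.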

Multi-Tenancy

Our platform serves multiple customers, each with their own policies. Policies must be strictly isolated - Customer A’s policies can’t reference or affect Customer B’s resources.

We implement tenant isolation at several levels:

Namespace isolation: Each tenant’s policies are in a separate namespace. Policy evaluation only considers policies in the requesting tenant’s namespace.

Resource scoping: Policies can only reference resources that belong to their tenant.

Audit separation: Each tenant’s policy audit logs are stored separately and can only be accessed by that tenant.
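Namespace isolation and resource scoping can be sketched together. In this Python illustration the `PolicyStore` class and the `"<tenant_id>/<resource name>"` identifier format are assumptions, not our actual storage layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    tenant_id: str
    name: str
    resource_id: str   # assumed format: "<tenant_id>/<resource name>"


class PolicyStore:
    def __init__(self) -> None:
        self._by_tenant: dict[str, list[Policy]] = {}

    def add(self, policy: Policy) -> None:
        # Resource scoping: reject policies that point at another tenant's resource.
        owner = policy.resource_id.split("/", 1)[0]
        if owner != policy.tenant_id:
            raise PermissionError("policy references a resource outside its tenant")
        self._by_tenant.setdefault(policy.tenant_id, []).append(policy)

    def policies_for(self, tenant_id: str) -> list[Policy]:
        # Namespace isolation: only the requesting tenant's policies are visible.
        return list(self._by_tenant.get(tenant_id, []))
```

Because lookup is keyed by tenant, there is no code path in evaluation that iterates over another tenant's policies.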

Compliance Integration

Policies are a key component of compliance. We generate compliance reports showing:

Policy coverage: What resources are protected by policies?

Policy violations: What requests were denied and why?

Policy changes: What policies changed and who approved the changes?

Policy effectiveness: Are policies being evaluated? Are they denying inappropriate requests?

These reports map to specific compliance requirements. For PCI-DSS, we can show that access to cardholder data encryption keys is controlled by documented policies that are enforced by the system.

Conflict Resolution

When multiple policies apply to a request, they might conflict. One policy allows the operation, another denies it. How do we resolve conflicts?

Our approach:

  1. Explicit deny always wins: If any policy explicitly denies, the request is denied.
  2. Explicit allow overrides default deny: If a policy explicitly allows and no policy denies, the request is allowed.
  3. Default deny: If no policies explicitly allow or deny, deny by default.

This is conservative - we’d rather block legitimate requests than allow inappropriate access. In practice, explicit denies are rare. Most policies are allow policies, with default deny providing the baseline.
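The three rules reduce to a very small combining function. This Python sketch assumes each matching policy contributes the string `"allow"` or `"deny"`; the function name is hypothetical.

```python
from typing import Iterable


def combine(effects: Iterable[str]) -> str:
    """Combine the effects of all matching policies into one decision."""
    effects = list(effects)
    if "deny" in effects:
        return "deny"     # rule 1: explicit deny always wins
    if "allow" in effects:
        return "allow"    # rule 2: explicit allow with no deny
    return "deny"         # rule 3: default deny
```

The order of the checks is the whole design: the result does not depend on the order in which policies were evaluated, which makes decisions easier to reason about.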

Delegation and Approval Workflows

Some operations require approval beyond policy evaluation. Deleting a master key might require approval from multiple security officers, even if policy would otherwise allow it.

We’ve built approval workflows on top of the policy system:

Approval policies: Special policies that require external approval before allowing operations.

Approval queue: Operations requiring approval go into a queue for review.

Multi-party approval: Critical operations require M-of-N approvals from designated approvers.

Time-limited approvals: Approvals expire after a period, preventing approved requests from being used indefinitely.
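Multi-party and time-limited approvals combine into a single check. The Python sketch below is illustrative only: the function signature and the 24-hour default lifetime are assumptions.

```python
from datetime import datetime, timedelta


def is_approved(approvals: dict[str, datetime],
                designated: set[str],
                required: int,
                now: datetime,
                max_age: timedelta = timedelta(hours=24)) -> bool:
    """Return True if at least `required` designated approvers have approved
    within `max_age`. `approvals` maps approver id to approval time."""
    valid = {
        approver for approver, approved_at in approvals.items()
        # Ignore non-designated approvers, expired approvals, and
        # approvals timestamped in the future.
        if approver in designated and timedelta(0) <= now - approved_at <= max_age
    }
    return len(valid) >= required
```

Keying approvals by approver id means one person approving twice still counts once toward the M-of-N threshold.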

Policy as Code

We’re exploring “policy as code” where policies are defined in version control, reviewed through pull requests, and deployed through CI/CD pipelines.

This brings software engineering practices to policy management:

Code review: Policy changes are reviewed by multiple people before merging.

Automated testing: Policy test suites run on every change.

Continuous deployment: Approved policy changes deploy automatically.

Audit trail: Git history provides a complete audit trail of policy changes.

This is still experimental, but it is showing promise for teams that already use infrastructure-as-code practices.

Challenges and Lessons Learned

Policy management is harder than it first appears:

Complexity: As the number of policies grows, understanding the effective policy for any given request becomes difficult. We’ve built policy analysis tools, but it’s still challenging.

Performance: Policy evaluation can become a bottleneck. Aggressive caching helps but introduces consistency concerns.

Testing: It’s hard to test that policies correctly express intended security requirements. Simulation helps but can’t catch all logical errors.

Evolution: As requirements change, policies must evolve. Maintaining backward compatibility while changing policies is tricky.

User experience: Policy errors need to be surfaced clearly. “Access denied” isn’t helpful. “Access denied because you’re not in the payment_processor role” is better, but you must be careful not to leak sensitive information in error messages.

Looking Forward

We’re working on several improvements for 2016:

Policy analytics: Machine learning to detect requests that satisfy policies but follow unusual patterns.

Policy recommendations: Analyzing access patterns to recommend policy changes (e.g., “This key hasn’t been accessed in 6 months - consider archiving it”).

Better policy language: More expressive policy language with functions, variables, and better composition.

Real-time policy updates: Currently policy changes can take up to 60 seconds to propagate due to caching. We’re working on real-time propagation for critical policy changes.

Key Takeaways

For teams building policy management systems:

  1. Use attribute-based access control for flexibility
  2. Make policies testable before deployment
  3. Build performance into the design - policy evaluation happens on hot paths
  4. Maintain comprehensive audit logs of policy evaluations
  5. Provide good error messages when policies deny requests
  6. Support policy versioning and rollback
  7. Design for multi-tenancy from the start if you’ll need it
  8. Balance policy expressiveness with simplicity - complex policies are hard to validate

Policy management transforms key management from a set of APIs into a comprehensive platform for data protection. It’s the difference between providing capabilities and ensuring those capabilities are used correctly. As we head into 2016, I’m excited to continue evolving our policy platform and helping customers protect their most sensitive data.

This has been an incredible year at Thales. Building enterprise security systems is challenging but rewarding. Every component we build - from HSM integration to policy management to monitoring - comes together to protect data for organizations around the world. Looking forward to what 2016 brings!