Have you ever launched a new service to production? Have you ever maintained one? If you answered “yes” to either question, were you guided through the process? What is good or bad to do in production? And how do you transfer knowledge when new team members want to release production services or take ownership of existing ones?

Most companies end up with organically grown approaches to production practices. Each team figures out its own tools and best practices through trial and error. This imposes a real tax not only on the success of projects but also on engineers.

A trial-and-error culture creates an environment where finger-pointing and blame are more common. Once these behaviors take hold, it becomes harder to learn from mistakes and to avoid repeating them.

Successful organizations:

  • acknowledge the need for production guidelines
  • spend time on researching practices that apply to them
  • start having production readiness discussions when designing new systems or components
  • enforce production readiness practices

Production readiness involves a “review” process. A review can be a checklist or a questionnaire, and it can be done manually, automatically, or both. Rather than a static list of requirements, organizations can produce checklist templates that teams customize to their needs. This gives engineers a way to inherit knowledge while leaving enough flexibility where it is required.
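
To make the template idea concrete, here is a minimal sketch of a checklist template that a team can customize. Everything in it is an assumption for illustration: the `Item` type, the `TEMPLATE` contents, and the rule that only optional items can be waived are hypothetical, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    key: str
    question: str
    required: bool = True  # optional items may be waived by a team


# Hypothetical organization-wide template; the items are illustrative only.
TEMPLATE = (
    Item("slo", "Are SLOs defined for the service?"),
    Item("canary", "Is the canary release process documented?"),
    Item("load-test", "Are there load tests for performance regressions?", required=False),
)


def customize(template, waived=(), extra=()):
    """Build a team checklist: drop waived optional items, append team-specific ones."""
    by_key = {item.key: item for item in template}
    for key in waived:
        if key not in by_key:
            raise KeyError(f"unknown checklist item: {key}")
        if by_key[key].required:
            raise ValueError(f"required item cannot be waived: {key}")
    kept = tuple(item for item in template if item.key not in waived)
    return kept + tuple(extra)


def open_items(checklist, answers):
    """Return the keys of required items that have not been answered True."""
    return [
        item.key
        for item in checklist
        if item.required and not answers.get(item.key, False)
    ]
```

A team would call `customize(TEMPLATE, waived=["load-test"], extra=[...])` once, then run `open_items` against its answers during the review. Whether the review is run by a person or by automation, the template stays the shared source of knowledge.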

When to review a service for production readiness?

Production readiness reviews are not only useful right before pushing to production; they can also serve as a protocol when handing off operational responsibilities to a different team or to a new hire. Use reviews when:

  • Launching a new production service.
  • Handing off the operations of an existing production service to another team such as SRE.
  • Handing off the operations of an existing production service to new individuals.
  • Preparing on-call support.

Production readiness checklists

A while ago, I published an example production readiness checklist to show what such checklists can cover. Even though the list came into existence while working with Google Cloud customers, it is useful and applicable outside of Google Cloud.

Design and Development

  • Have reproducible builds. Your build shouldn’t require access to external services and shouldn’t be affected by an outage of an external system.
  • Define and set SLOs for your service at design time.
  • Document the availability expectations of external services you depend on.
  • Avoid single points of failure by not depending on a single global resource. Have the resource replicated or have a proper fallback (e.g. a hardcoded value) for when the resource is not available.
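
The last item can be sketched as follows. This is a minimal illustration, assuming each replica is exposed as a zero-argument callable that raises `OSError` (or a subclass such as `ConnectionError` or `TimeoutError`) when it is unreachable. The name `read_with_fallback` is hypothetical.

```python
def read_with_fallback(replicas, default):
    """Try each replica reader in order; return `default` if none is available."""
    for read in replicas:
        try:
            return read()
        except OSError:  # also covers ConnectionError and TimeoutError
            continue
    # Every replica failed (or none was configured): use the hardcoded value.
    return default
```

The hardcoded default should be a value the service can run with safely for a while, since it may stay in effect for the whole duration of an outage.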

Configuration Management

  • Static, small and non-secret configuration can be command-line flags. Use a configuration delivery service for everything else.
  • Dynamic configuration should have a reasonable fallback in case the configuration system is unavailable.
  • Development environment configuration shouldn’t inherit from production configuration. Doing so may give development environments access to production services, which can cause privacy issues and data leaks.
  • Document what can be configured dynamically and explain the fallback behavior if the configuration delivery system is not available.
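
One way to read the first two items is sketched below: static settings come from command-line flags, and a dynamic setting keeps its last known good value when the delivery system cannot be reached. The `--port` flag, the `DynamicSetting` class, and the choice of "last known good" as the fallback are assumptions for illustration. Other fallback policies are equally valid as long as they are documented.

```python
import argparse


def parse_flags(argv):
    """Static, small, non-secret settings are read from command-line flags."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8080)  # hypothetical flag
    return parser.parse_args(argv)


class DynamicSetting:
    """A dynamically delivered value with a documented fallback."""

    def __init__(self, fetch, default):
        self._fetch = fetch    # callable that asks the configuration system
        self._value = default  # used until the first successful fetch

    def get(self):
        try:
            self._value = self._fetch()
        except OSError:
            # Configuration system unavailable: keep the last known good value.
            pass
        return self._value
```

With this shape, the fallback behavior is easy to document: the default applies before the first successful fetch, and the most recent successful value applies afterwards.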

Release Management

  • Document all details about your release process. Document how releases affect SLOs (e.g. temporarily higher latency due to cache misses).
  • Document your canary release process.
  • Have a canary analysis plan and set up mechanisms to automatically revert canaries if possible.
  • Ensure rollbacks can use the same process that rollouts use.
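
The last two items can be illustrated with a small sketch. It assumes the canary is judged only on error rate relative to a baseline. The thresholds (`max_ratio`, `min_requests`, `floor`) and all function names are hypothetical, and a real analysis plan would usually look at more signals.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline, canary, max_ratio=1.5, min_requests=100, floor=0.001):
    """Return "wait", "promote", or "rollback" for a canary."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    # The floor keeps a near-zero baseline from failing the canary on one error.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


def run_canary(new_version, old_version, deploy, collect):
    """Deploy a canary, analyze it, and revert automatically if it looks bad."""
    deploy(new_version)
    baseline, canary = collect()
    verdict = canary_verdict(baseline, canary)
    if verdict == "rollback":
        deploy(old_version)  # the rollback goes through the same deploy path
    return verdict
```

Because the revert calls the same `deploy` function as the rollout, the rollback path is exercised by every release rather than only during incidents.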

Observability

  • Ensure the metrics required by your SLOs are collected and exported from your binaries.
  • Make sure client-side and server-side observability data can be differentiated. This is important for debugging issues in production.
  • Tune alerts to reduce toil. For example, remove alerts triggered by routine events.
  • Include underlying platform metrics in your dashboards. Set up alerting for your external service dependencies.
  • Always propagate the incoming trace context. Even if you are not participating in the trace, this allows lower-level services to debug production issues.
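
Propagating trace context without participating in the trace can be as small as copying headers. The sketch below assumes the W3C Trace Context header names (`traceparent` and `tracestate`); the checklist itself does not name a format, and other tracing systems use other headers. The function name is hypothetical, and in practice a tracing library normally handles this.

```python
# Header names defined by the W3C Trace Context specification.
TRACE_HEADERS = ("traceparent", "tracestate")


def propagate_trace_context(incoming, outgoing=None):
    """Copy trace headers from an incoming request onto outgoing request headers."""
    result = dict(outgoing or {})
    # HTTP header names are case-insensitive, so normalize before looking up.
    lowered = {name.lower(): value for name, value in incoming.items()}
    for name in TRACE_HEADERS:
        if name in lowered:
            result[name] = lowered[name]
    return result
```

A service would apply this to every outgoing call made on behalf of an incoming request, so that downstream services see the same trace even though this service records no spans of its own.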

Security and Protection

  • Make sure all external requests are encrypted.
  • Make sure your production projects have proper IAM configuration.
  • Use networks within projects to isolate groups of VM instances.
  • Use a VPN to securely connect remote networks.
  • Document and monitor user data access. Ensure that all user data access is logged and audited.
  • Ensure debugging endpoints are limited by ACL.
  • Sanitize user input. Have payload size restrictions for user input.
  • Ensure your service can block incoming traffic selectively per user. This lets you block abuse without impacting other users.
  • Avoid external endpoints that trigger a large number of internal fan-outs.
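
Payload size restrictions and per-user blocking can be combined in one admission check that runs before any real work. This is a minimal sketch under stated assumptions: the `Admission` class, the 64 KiB limit, and the in-memory block list are hypothetical, and the return values are the standard HTTP status codes 403, 413, and 200.

```python
MAX_PAYLOAD_BYTES = 64 * 1024  # hypothetical limit; pick one that fits your service


class Admission:
    """Decide whether a request may proceed, before doing any expensive work."""

    def __init__(self, max_payload_bytes=MAX_PAYLOAD_BYTES):
        self.max_payload_bytes = max_payload_bytes
        self.blocked_users = set()

    def block(self, user_id):
        self.blocked_users.add(user_id)

    def unblock(self, user_id):
        self.blocked_users.discard(user_id)

    def check(self, user_id, payload):
        """Return an HTTP status code: 403 blocked, 413 too large, 200 admitted."""
        if user_id in self.blocked_users:
            return 403
        if len(payload) > self.max_payload_bytes:
            return 413
        return 200
```

Blocking is keyed on the user, so stopping one abusive user leaves everyone else's traffic untouched.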

Capacity planning

  • Document how your service scales. Examples: number of users, size of incoming payload, number of incoming messages.
  • Document resource requirements for your service. Examples: number of dedicated VM instances, number of Spanner instances, specialized hardware such as GPUs or TPUs.
  • Document resource constraints: resource type, region, etc.
  • Document quota restrictions on creating new resources. For example, document the rate limit of the GCE API if you are creating new instances via the API.
  • Consider having load tests for performance regressions where possible.
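
A load test only catches regressions if its results are compared against something. The sketch below assumes latency samples in milliseconds from a baseline run and a current run, and flags a regression when the 95th percentile grows by more than a tolerance. The nearest-rank percentile, the 10% tolerance, and the function names are assumptions for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct from 1 to 100)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_regression(baseline_ms, current_ms, pct=95, tolerance=0.10):
    """True if the current percentile latency exceeds the baseline by more than `tolerance`."""
    limit = percentile(baseline_ms, pct) * (1 + tolerance)
    return percentile(current_ms, pct) > limit
```

The baseline samples would typically come from the last known good release, measured under the same load profile documented in the scaling section of the review.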