This is the third article in my seven-part series on the Google Cloud Architecture Framework.

Article 1: Google Cloud Architecture Framework Overview
Article 2: Google Cloud Architecture Framework: System design

This category of the Google Cloud Architecture Framework explains how to operate services effectively on Google Cloud. It covers how to run, manage, and monitor systems that deliver value to the business, along with the Google Cloud services and solutions that support operational excellence. Applying the principles of operational excellence lays the groundwork for reliability by establishing fundamental components such as automation, scalability, and observability. The framework is meant to help you create a Google Cloud deployment that best suits your company’s requirements.

Automate your deployments

Automation removes human error from repetitive tasks like code updates and helps you standardise your builds, tests, and deployments. A standardised, machine-controlled approach makes your deployments safer. It also gives you a way to roll back to earlier deployments when necessary without materially impairing the user experience. This section provides best practices for automating your builds, tests, and deployments.

These are some guidelines and best practices to help you build and run a fully automated cloud architecture:

1. Store your code in central code repositories for versioning and labelling.
2. Use continuous integration and continuous deployment (CI/CD) to support agile workflows.
3. Provision and manage your infrastructure using infrastructure as code, with the help of tools like Terraform.
4. Incorporate testing throughout the software delivery lifecycle. Perform unit tests, integration tests, system tests, and load tests.
5. Use separate Google Cloud projects for each test environment you have.
6. Launch deployments gradually (canary testing whenever feasible).
7. Make sure you can restore previous releases seamlessly.
8. Monitor your CI/CD pipelines through services such as audit logs, monitoring dashboards, and build logs.
9. Establish management guidelines for version releases to avoid making mistakes and to enable high-velocity software delivery.
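
To make the gradual-launch and rollback guidelines above more concrete, here is a minimal sketch of a canary rollout loop. It is an illustration only: the stage percentages, the `error_rate_at` callback, and the error threshold are assumptions made for this example, not part of any Google Cloud API. In practice your CI/CD tool would shift the traffic and your monitoring system would supply the error rate.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RolloutResult:
    completed: bool       # True if every stage passed its health check
    final_percent: int    # share of traffic on the new release when we stopped
    history: List[int]    # stages that were attempted, in order


def canary_rollout(
    stages: List[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> RolloutResult:
    """Step through traffic percentages and roll back on the first unhealthy stage.

    `error_rate_at` is a hypothetical hook that returns the observed error
    rate (0.0 to 1.0) of the new release at a given traffic percentage.
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")

    history: List[int] = []
    for percent in stages:
        history.append(percent)
        if error_rate_at(percent) > max_error_rate:
            # Roll back: all traffic returns to the previous release.
            return RolloutResult(completed=False, final_percent=0, history=history)

    return RolloutResult(completed=True, final_percent=stages[-1], history=history)


if __name__ == "__main__":
    # Simulated release that starts failing once it receives 25% of traffic.
    result = canary_rollout(
        stages=[5, 25, 50, 100],
        error_rate_at=lambda percent: 0.05 if percent >= 25 else 0.001,
        max_error_rate=0.01,
    )
    print(result)
```

The point of the design is that a bad release is only ever exposed to a small share of users, and the rollback path is exercised by the same automation that performs the rollout.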

Set up monitoring, alerting, and logging

Monitoring is the process of gathering, examining, and using data to track infrastructure and applications in order to inform business choices. Because it provides you with insight into your work and systems, monitoring is a key capability.

Four golden signals to monitor the system:

  1. Latency : The time it takes to service a request.
  2. Traffic : How much demand is being placed on your system.
  3. Errors : The rate of requests that fail. Failure can be explicit or implicit.
  4. Saturation : How full your system is. It measures the fraction of capacity in use, emphasising the resources that are most constrained.
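
As a rough illustration of the four signals listed above, the sketch below computes each one from a small batch of request records. The `Request` record, the choice of the 95th percentile for latency, and the use of CPU as the constrained resource are all assumptions made for this example. A real setup would read these values from your monitoring system rather than compute them by hand.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # False if the request failed


def golden_signals(
    requests: List[Request],
    window_seconds: float,
    cpu_used: float,
    cpu_capacity: float,
) -> Dict[str, float]:
    """Summarise latency, traffic, errors, and saturation for one time window."""
    if not requests:
        raise ValueError("no requests in the window")

    count = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer arithmetic for the rank.
    rank = (95 * count + 99) // 100
    p95_latency = latencies[rank - 1]

    return {
        "latency_p95_ms": p95_latency,
        "traffic_rps": count / window_seconds,
        "error_rate": sum(1 for r in requests if not r.ok) / count,
        # Saturation of the most constrained resource (assumed here to be CPU).
        "saturation": cpu_used / cpu_capacity,
    }


if __name__ == "__main__":
    sample = [Request(latency_ms=10.0 * i, ok=(i != 7)) for i in range(1, 21)]
    print(golden_signals(sample, window_seconds=10, cpu_used=3, cpu_capacity=4))
```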

These are some guidelines and best practices to help you set up monitoring, alerting, and logging:

1. Your monitoring plan should cover all your systems, including on-premises resources and cloud resources.
2. Your monitoring plan should include your cloud costs, to help make sure that scaling events don’t push usage past your budget thresholds.
3. Build different monitoring strategies for measuring infrastructure performance, user experience, and business KPIs.
4. Define metrics that measure all aspects of your organisation. To do so, it’s very important to know your specific business objectives, KPIs, and SLIs.
5. Consider infrastructure metrics, application metrics, managed service metrics, network connectivity metrics, and SLIs, chosen according to your business objectives.
6. Make alerts actionable to minimise the time to resolution. An alert should have a clear description, provide all the information needed to act on it, carry a defined priority level, and clearly identify the person or team responsible for responding.
7. Enable logging for critical applications.
8. Build monitoring and alerting dashboards to visualise both short-term, real-time analysis and long-term analysis.
9. Set up an audit trail for Admin Activity logs & Data Access audit logs.
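
The guideline on actionable alerts can be checked mechanically. The sketch below is one possible way to do that: it models an alert as a small record and reports what is missing before the alert is allowed to page anyone. The field names and the `P1` to `P4` priority labels are assumptions for this example, not a Cloud Monitoring schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical priority labels; use whatever scheme your organisation defines.
VALID_PRIORITIES = ("P1", "P2", "P3", "P4")


@dataclass
class Alert:
    title: str
    description: str  # what is wrong and what to check first
    priority: str     # one of VALID_PRIORITIES
    owner: str        # person or team responsible for responding


def actionability_gaps(alert: Alert) -> List[str]:
    """Return the reasons an alert is not yet actionable (empty list if it is)."""
    gaps: List[str] = []
    if not alert.description.strip():
        gaps.append("missing description")
    if alert.priority not in VALID_PRIORITIES:
        gaps.append("unknown priority")
    if not alert.owner.strip():
        gaps.append("no owner")
    return gaps


if __name__ == "__main__":
    vague = Alert(title="Something is wrong", description="", priority="urgent", owner="")
    print(actionability_gaps(vague))
```

A check like this could run in code review for alert definitions, so that vague alerts are caught before they reach production.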

Establish cloud support and escalation processes

This section shows you how to define an effective escalation process. Establishing support from your cloud provider or other third-party service providers is a key part of effective escalation management. Google Cloud provides various support channels, including live support and published guidance such as developer communities and product documentation. A Cloud Customer Care offering lets you work with Google Cloud to run your workloads efficiently.

A well-defined escalation process reduces the time and effort it takes to identify and resolve issues with your systems. This includes issues that call for support with Google Cloud products, other cloud providers, or third-party services.

These are some guidelines and best practices for establishing cloud support and escalation processes:

1. Define when and how to escalate issues internally.
2. Define when and how to create support cases with your cloud provider or other third-party service provider.
3. Find or create documents that describe your architecture. Ensure these documents include information that is helpful for support engineers.
4. Define how your teams communicate during an outage.
5. Ensure that people who need support have the appropriate level of support permissions to access your provider’s support channels and to communicate with other support providers.
6. Set up monitoring, alerting, and logging so that you have the information you need to act when issues arise.
7. Create templates for incident reporting. For information to include in your incident reports, see Best practices for working with Customer Care.
8. Document your organisation’s escalation process. Ensure that you have clear, well-defined actions to address escalations.
9. Test your escalation process before major events, such as migrations, new product launches, and peak traffic events.
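
One way to document "when and how to escalate" is to write the policy down as data that both people and tools can read. The sketch below is a minimal example under that assumption. The contacts and time thresholds are invented for illustration and should be replaced with your organisation’s own.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # escalate to this contact once the incident is this old
    contact: str


def current_contact(policy: List[EscalationStep], minutes_open: int) -> str:
    """Return the contact for the latest step whose time threshold has passed."""
    if not policy:
        raise ValueError("escalation policy must have at least one step")

    steps = sorted(policy, key=lambda step: step.after_minutes)
    chosen = steps[0].contact  # the first step is always the initial responder
    for step in steps:
        if minutes_open >= step.after_minutes:
            chosen = step.contact
    return chosen


if __name__ == "__main__":
    # Hypothetical policy: on-call first, then the team lead, then a provider support case.
    policy = [
        EscalationStep(after_minutes=0, contact="on-call engineer"),
        EscalationStep(after_minutes=30, contact="team lead"),
        EscalationStep(after_minutes=60, contact="cloud provider support case"),
    ]
    print(current_contact(policy, minutes_open=45))
```

Keeping the policy in a reviewable form like this also makes it easier to test before major events, as the last guideline recommends.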

Manage capacity and quota

In contrast to a traditional data centre, when you use Google Cloud you hand most capacity planning over to Google. Using the cloud means you don’t have to provision and maintain idle resources. You can, for instance, start, stop, and scale virtual machine instances as needed. Because you only pay for what you use, you can optimise your spending, including on the extra capacity that you only need during periods of high traffic. To help you save money, Compute Engine offers machine type recommendations when it finds underutilised virtual machine instances that can be downsized or removed.

These are some guidelines and best practices for managing capacity and quota:

1. To evaluate your capacity requirements, start by identifying your top cloud workloads. Evaluate the average and peak utilisation of these workloads, and their current and future capacity needs.
2. Analyse load patterns and call distribution. Use factors like the peak over the last 30 days, the hourly peak, and the peak per minute in your analysis.
3. Use Cloud Monitoring to get visibility into the performance, up-time, and overall health of your applications and infrastructure.
4. Ensure your organisation has set up alerts that automatically notify you when you approach quota and capacity limits.
5. Create a process for capacity planning. It should involve running load tests, forecasting future traffic and accounting for growth, estimating the cost of the resources your organisation needs, and working with your cloud provider to get the correct amount of resources at the correct time through quotas and reservations.
6. Plan the capacity requirements of your projects in advance to prevent unexpected limiting of your resource consumption.
7. Use quotas to cap the consumption of a particular resource (for example, the BigQuery API) to avoid overspending.
8. Plan for spikes in usage and include these spikes as part of your quota planning.
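
The quota-alerting guideline above comes down to comparing usage against limits. Here is a minimal sketch of that check, assuming you have already collected usage and limit figures per quota. The quota names, the numbers, and the 80% threshold are illustrative assumptions. In a real project these figures would come from your monitoring tooling rather than a hard-coded dictionary.

```python
from typing import Dict, List, Tuple


def quotas_near_limit(
    usage: Dict[str, Tuple[float, float]],
    threshold: float = 0.8,
) -> List[str]:
    """Return the names of quotas whose utilisation is at or above the threshold.

    `usage` maps a quota name to a (used, limit) pair.
    """
    flagged = []
    for name, (used, limit) in usage.items():
        if limit <= 0:
            raise ValueError(f"quota {name!r} has a non-positive limit")
        if used / limit >= threshold:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    # Hypothetical figures for one project.
    current_usage = {
        "cpus": (72, 100),
        "disks_gb": (4500, 5000),
        "in_use_addresses": (8, 8),
    }
    print(quotas_near_limit(current_usage))
```

Setting the threshold below 100% leaves time to request a quota increase before the limit starts throttling your workloads.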

Plan for peak traffic and launch events

Peak events are major business-related events that cause a sharp increase in traffic beyond the application’s standard baseline. These peak events require planned scaling. Launch events are any substantial rollouts or migrations of new capability in production. Examples include a migration from on-premises to the cloud, or the launch of a new product, service, or feature.

Peak and launch events include three stages:

A. Planning and preparation for the launch or peak traffic event.
B. Launching the event.
C. Reviewing event performance and post event analysis.

These are some guidelines and best practices to help you plan for peak traffic and launch events:

1. Create a general playbook for launch and peak events.
2. Create business projections for upcoming launches and for expected and unexpected peak events.
3. Document any assumptions, risks, and unknown factors.
4. Create a rollback plan for launch and migration events.
5. Create a diagram that shows how the major components of the architecture are connected. A review of the diagram might help you isolate issues and expedite their resolution.
6. Identify business and system metrics to monitor for the event. If any metrics or service level indicators (SLIs) aren’t being collected, modify the system to collect this data.
7. Define a communication plan, timeline, and responsibilities for all teams.
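
Business projections for a peak event eventually have to become a capacity number. The sketch below shows one simple way to do that arithmetic, assuming you know your baseline traffic, an expected peak multiplier, and the load a single instance can handle (for example, from load tests). All the figures and the 25% headroom are assumptions for illustration.

```python
import math


def instances_for_peak(
    baseline_rps: float,
    peak_multiplier: float,
    per_instance_rps: float,
    headroom: float = 0.25,
) -> int:
    """Estimate how many instances are needed to serve a projected peak.

    `headroom` is the extra fraction of capacity kept in reserve for
    traffic above the projection.
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")

    projected_peak_rps = baseline_rps * peak_multiplier
    target_rps = projected_peak_rps * (1 + headroom)
    # Round up: a fractional instance still requires a whole one.
    return math.ceil(target_rps / per_instance_rps)


if __name__ == "__main__":
    # Hypothetical event: 5x the usual 400 requests per second, 150 rps per instance.
    print(instances_for_peak(baseline_rps=400, peak_multiplier=5, per_instance_rps=150))
```

The resulting number is a starting point for the quota and reservation requests you make ahead of the event, not a substitute for load testing.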

Thank you for reading this article. Your time is appreciated.
Until next time, stay curious !!

Organisations using cloud services must prioritise operational excellence in cloud infrastructures. It describes the capacity to manage and oversee systems in order to provide value to the organisation, continuously enhance procedures, and promote efficiency. For businesses looking to get the most out of their cloud investments, operational excellence in cloud architectures is essential. It includes features like cost-effectiveness, dependability, security, adaptability, and ongoing development, all of which are essential to the success of cloud-based operations as a whole.

This was the third article in my seven-part series on the Google Cloud Architecture Framework. We will go into more detail about the remaining 4 pillars in my upcoming articles, along with an understanding of some best practices for creating and managing a well-architected framework on GCP.
