Hey everyone, and welcome to this first post on a topic that we will be talking a lot more about over time!
Microsoft 365 is one of the world’s largest enterprise and consumer cloud services, and customer trust is the foundation of our business: customers and people all around the world rely on us to securely operate and maintain some of their most critical assets. To maintain that trust, we invest heavily in securing the infrastructure that powers our services and hosts this data on behalf of our customers – keeping customer data private and secure is THE top priority for our business. This post, and the other ones we’ll share in this series, will shed light on what we do behind the scenes to secure the infrastructure powering the Microsoft 365 service.
As we think about how to secure our infrastructure, we recognize that the service continues to grow and evolve, both in its user base and in the products and experiences we provide to our customers, so we must constantly work to stay on top of an ever-increasing surface area. Meanwhile, bad actors are not sitting still either. Attacker groups seeking to exploit enterprise and consumer data continue to evolve, and customers looking to secure their most sensitive data are going up against the most sophisticated and well-funded adversarial organizations in the world, including nation-state attackers with seemingly limitless resources.
To secure the service for our customers given these challenges, we focus on these three areas:
- Building tools and architecture that protect the service from compromise
- Building the capability to detect and respond to threats if a successful attack does occur
- Continuous assessment and validation of the security posture of the service
In the rest of this post we will briefly explore each of these areas, or if you’d like to go deep, you can check out the full whitepaper here.
Designing for Security
Before getting into each of these areas, we wanted to touch on some of the major principles that guide our approach to service security. Here are some of the concepts that form the foundation of what we do to secure service infrastructure:
- Data Privacy: We strongly believe customers own their data, and that we are just custodians of the service that hosts their data. Our service is architected to enable our engineers to operate it without ever touching customer data unless and until specifically requested by the customer.
- Assume Breach: Every entity in the service, whether it is personnel administering the service or the service infrastructure itself, is treated as though compromise is a real possibility. Policies governing access to the service are designed with this principle in mind, as is our approach to defense in depth with continuous monitoring and validation.
- Least Privilege: Consistent with the assume breach principle, access to a resource is granted only as needed and with the minimal permissions necessary to perform the task at hand.
- Breach Boundaries: The service is designed with breach boundaries, meaning that identities and infrastructure in one boundary are isolated from resources in other boundaries. Compromise of one boundary should not lead to compromise of others.
- Service Fabric Integrated Security: Security priorities and requirements are built into the design of new features and capabilities, ensuring that our strong security posture scales with the service. At the scale and complexity of Microsoft 365, security is not something that can be bolted on to the service at the end.
- Automated and Automatic: We focus on developing durable products and architectures that can intelligently and automatically enforce service security while giving our engineers the power to safely manage response to security threats at scale. Again, the scale of Microsoft 365 is a key consideration here as our security solutions must handle millions of machines and thousands of internal operators.
- Adaptive Security: Our security capabilities adapt to and are enhanced by continuous evaluation of the threats facing the service. In some cases, our systems adapt automatically through machine learning models that categorize normal behavior (as opposed to attacker behavior which would represent a deviation from the norm). In other cases, we regularly assess service security posture through penetration testing and automated assessment, feeding the results of that back into product development.
The next sections will look into how we put these principles into practice to protect the service, mitigate risk if compromise does occur, and validate our security posture to make sure all of this works.
Minimizing the Risk of Compromise
Our favorite attack is the one that never gets started because we prevented it from happening in the first place. Broadly speaking, protecting the service from attack focuses on two vectors: people (making sure that the Microsoft employees who build and manage the service cannot compromise or damage it), and the technical infrastructure of the service itself (making sure that the machinery running the service has integrated defenses and is architected and configured to be as secure as possible by default).
When it comes to securing the infrastructure from internal operators, our motto here is Zero Standing Access (ZSA). This means that, by default, the teams and personnel charged with developing, maintaining, and repairing core Microsoft 365 services have no elevated access to the service infrastructure, and any elevated privileges must be authorized as shown in the flow below.
Illustration of the Lockbox JIT request process. No account has standing administrative rights in the service. Just-in-time (JIT) accounts are provisioned with just enough access (JEA) to perform the action that is needed.
It is important to keep in mind that even with the approved elevated privileges, a specific restrictive account is provisioned just for that activity. This account is bound by time, scope and approved actions. Ultimately, this is all about making sure that the blast radius for a single account is minimized: even if an internal operator’s account is compromised, it is by design prevented from doing any damage unless additional steps are taken.
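To make the idea of an account "bound by time, scope and approved actions" concrete, here is a minimal sketch in Python. It is not how Lockbox is implemented; the `JitGrant` class, the `approve_request` and `is_allowed` helpers, and the operator, action, and partition names are all hypothetical and exist only to illustrate the default-deny, just-in-time pattern described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class JitGrant:
    """Hypothetical time-bound, scope-bound elevation for a single task."""
    operator: str
    scope: str                   # e.g. one service partition (assumed naming)
    allowed_actions: frozenset   # just enough access for the approved task
    expires_at: datetime

    def permits(self, action: str, scope: str, now: datetime) -> bool:
        # All three bounds must hold: time, scope, and approved action.
        return (
            now < self.expires_at
            and scope == self.scope
            and action in self.allowed_actions
        )


def approve_request(operator, scope, actions, duration_minutes, now):
    """Issue a grant once a request is approved; no access exists before this."""
    return JitGrant(
        operator=operator,
        scope=scope,
        allowed_actions=frozenset(actions),
        expires_at=now + timedelta(minutes=duration_minutes),
    )


def is_allowed(grants, operator, action, scope, now):
    # Zero standing access: with no matching, unexpired grant, the answer is no.
    return any(
        g.operator == operator and g.permits(action, scope, now)
        for g in grants
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    grant = approve_request("alice", "partition-a", ["restart-service"], 30, now)
    print(is_allowed([grant], "alice", "restart-service", "partition-a", now))
    print(is_allowed([grant], "alice", "restart-service", "partition-b", now))
```

The point of the sketch is that the check fails closed: an operator with no grant, an expired grant, or a grant for a different scope or action is denied, which is what keeps the blast radius of any one account small.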
Our protections go beyond restricting the blast radius of accounts. Network controls restrict the types of connections that can be made into our services, and we also restrict the types of connections permitted between service partitions. This reduces the surface area for attackers to target for initial entry, and it also makes it harder for attackers to move around the service to find what they’re looking for.
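One way to picture these network restrictions is as a default-deny allowlist of flows between partitions. The Python sketch below is only an illustration under that assumption: the partition names, ports, `ALLOWED_FLOWS` table, and helper functions are hypothetical and do not describe the actual Microsoft 365 network policy.

```python
# Hypothetical allowlist of (source partition, destination partition, port).
# Anything not listed here is denied.
ALLOWED_FLOWS = {
    ("internet", "frontend", 443),
    ("frontend", "mailbox-store", 8443),
}


def connection_permitted(src, dst, port, allowed=ALLOWED_FLOWS):
    """Default deny: a connection is allowed only if it is explicitly listed."""
    return (src, dst, port) in allowed


def reachable_partitions(start, allowed=ALLOWED_FLOWS):
    """Partitions an attacker in `start` could reach by chaining permitted flows."""
    seen = set()
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for src, dst, _port in allowed:
            if src == current and dst not in seen:
                seen.add(dst)
                frontier.append(dst)
    return seen


if __name__ == "__main__":
    print(connection_permitted("internet", "mailbox-store", 8443))
    print(sorted(reachable_partitions("internet")))
```

A helper like `reachable_partitions` shows why a tight allowlist matters for lateral movement: the fewer flows that are permitted, the smaller the set of partitions reachable from any single foothold.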
Mitigating Risk if the Worst Happens
The assume breach model goes beyond designing architectural protections and access control policies: it means that no matter how effective those protections are, we cannot trust that they will always hold. We must assume a non-zero probability of successful attack, no matter how confident we are in our defenses. We need to have the ability to detect and mitigate these attacks against the service infrastructure before they result in a compromise of customer data.
Our work in this space spans security monitoring and incident response:
- Security Monitoring: This is about building systems and processes to catch compromise of the infrastructure in real time and at scale, allowing us to respond to and stop attacks before they propagate throughout the service.
- Incident Response: We need tools and processes to mitigate risk and evict attackers, also in real time and at scale, in response to the alerts raised by our monitoring systems.
Incident response is cloud-powered and service-aware. It can be triggered autonomously for basic actions, or manually for more complex scenarios. Remediation can take effect on a small number of machines, or across a service partition if necessary.
As the diagram illustrates, automation and scale are priorities for us in this area. For us to catch and stop attacks against a service the size of Microsoft 365, our systems need to be intelligent enough to proactively and accurately alert us to potential issues, and we need the ability to respond quickly and at scale. Anything less simply won’t do given the scale of the service.
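The split between autonomous and manual response, and between machine-level and partition-level remediation, can be sketched as a small routing function. This is a simplified illustration, not our actual tooling: the `Alert` fields, the `AUTOMATIC_ACTIONS` set, and the `partition_threshold` parameter are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical set of basic actions considered safe to trigger without a human.
AUTOMATIC_ACTIONS = {"isolate-machine", "revoke-session"}


@dataclass
class Alert:
    name: str
    proposed_action: str
    machines: list      # affected machine names
    partition: str      # service partition the machines belong to


def plan_response(alert, partition_threshold=100):
    """Decide who triggers the response and how widely it applies.

    `partition_threshold` is an assumed cutoff: at or above this many affected
    machines, remediation targets the whole partition instead of each machine.
    """
    mode = "automatic" if alert.proposed_action in AUTOMATIC_ACTIONS else "manual"
    if len(alert.machines) >= partition_threshold:
        targets = [f"partition:{alert.partition}"]
    else:
        targets = [f"machine:{name}" for name in alert.machines]
    return {"mode": mode, "action": alert.proposed_action, "targets": targets}


if __name__ == "__main__":
    alert = Alert("suspicious-login", "revoke-session", ["web-01", "web-02"], "partition-a")
    print(plan_response(alert))
```

Keeping the decision in one place like this means the same alert pipeline can drive both a quick automated fix on two machines and an operator-approved action across an entire partition.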
Our assume breach principle is all about planning for the worst – given how seriously we take this philosophy, we would be remiss if we did not have a plan for mitigating potential gaps in our security posture. Indeed, we validate our security posture regularly, automatically, and through cloud-based tools (we hope that you notice a trend here).
We have two primary forms of validation:
- Architectural and configuration assessment: verifying that promises we make about our service architecture (for example, that specific networks are correctly segmented or that machines are up to date with required patches) hold and do not regress.
- Post-exploitation validation: simulating attacks directly against our infrastructure, with the goal of verifying that our monitoring and response systems work as expected in the production environment.
Both forms of validation run directly against the service infrastructure, and they do so continuously. If any regression in security posture does occur, we want to learn about it as quickly as possible so that we can repair it before it gets exploited by attackers.
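As a rough illustration of the first form of validation, the sketch below checks two of the promises mentioned above: that machines have their required patches and that forbidden network paths are not observed. The `assess` function, the patch identifiers, and the machine and partition names are hypothetical; a real assessment would draw on live inventory and network telemetry rather than in-memory dictionaries.

```python
def assess(machines, required_patches, forbidden_flows, observed_flows):
    """Return a list of findings; an empty list means no regression was detected.

    machines:         dict of machine name -> list of installed patch IDs
    required_patches: iterable of patch IDs every machine must have
    forbidden_flows:  iterable of (source, destination) pairs that must never occur
    observed_flows:   iterable of (source, destination) pairs actually seen
    """
    findings = []

    # Patch compliance: every machine must have every required patch.
    for name, installed in sorted(machines.items()):
        missing = sorted(set(required_patches) - set(installed))
        if missing:
            findings.append(f"{name}: missing patches {missing}")

    # Segmentation: no observed flow may cross a forbidden boundary.
    for src, dst in sorted(set(observed_flows) & set(forbidden_flows)):
        findings.append(f"segmentation violated: {src} -> {dst}")

    return findings


if __name__ == "__main__":
    findings = assess(
        machines={"web-01": ["KB1", "KB2"], "web-02": ["KB1"]},
        required_patches=["KB1", "KB2"],
        forbidden_flows={("frontend", "admin")},
        observed_flows=[("frontend", "admin"), ("frontend", "store")],
    )
    for finding in findings:
        print(finding)
```

Run on a schedule, a check like this turns an architectural promise into something that fails loudly the moment it regresses, rather than waiting for an attacker to find the gap first.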
Securing the infrastructure of one of the world’s largest cloud services requires us to stay ahead of attackers while also keeping up with constantly increasing service scale and complexity. Maintaining customer trust in Microsoft 365 requires us to design our services to a robust set of core security principles and to make sure those principles are embedded deeply into service design and operations.
We have written a whitepaper that looks deeper into what this means, and we will expand on this and other security topics critical to our business in future papers. We hope you find this interesting and informative and look forward to hearing any feedback.
@Adam Hall on behalf of the entire Datacenter Security team