Notes on “Lessons Learned from Securing Google and Google Cloud” talk by Niels Provos
- Defense in Depth at scale by default
- Protect identities by default
- Protect data across full lifecycle by default
- Protect resources by default
- Trust through transparency
- Automate best practices and prevent common mistakes at scale
- Share innovation to raise the bar, support and invest in the security community.
- Address common cases programmatically
- Empower customers to fulfill their security responsibilities
- Trust and security can be the accelerant
Defense in Depth at scale by default
Homogeneous infrastructure. Defenses within, and spanning between, layers of the stack.
Continue to externalize internal technologies, capabilities, and intelligence so that cloud platform customers can use them themselves up the stack.
Hardware infrastructure: physical access
- Multi-layered: biometric identification, metal detection, cameras, video analytics, vehicle barriers, and laser-based intrusion detection. Physical access to DCs is limited to very few employees.
Hardware infrastructure: e.g. Titan
- Server boards and networking equipment are custom designed and manufactured down to the chip level.
- Custom-built a TPM-like chip that cryptographically authenticates hardware placed on the network, serves as the root of trust for the boot process, and gates firmware updates so the device accepts only authentic signed binaries. Used on servers and peripherals.
- Allows huge performance improvements: the Jupiter network, composed of custom hardware, can deliver 1.3 Pbps of bisection bandwidth, enough for 100,000 servers to transfer data to each other at 10 Gb/s simultaneously. This is relevant to performance, but also to things like mitigating DDoS.
Hardware infrastructure: secure boot
- Titan verifies the BIOS, bootloaders, and kernel of servers and peripherals.
- Each server gets a unique machine identity tied to the hardware root of trust and to the software with which the machine was booted. The machine identity is then used to authenticate service calls to low-level management services. If a machine is compromised, after a reboot it will only come back with an identity if all stages of the boot process complete and are verified. A successful reboot thus means that the machine is clean. (Similar to ChromeOS.)
- Automated systems ensure machines run only up-to-date versions of their software stacks, detect and diagnose hardware and software problems, and remove machines from service if necessary.
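A minimal sketch of that verified-boot-then-identity flow. Everything here is illustrative: the key, stage contents, and identity derivation are invented, standing in for Titan's actual mechanism.

```python
import hashlib
import hmac

ROOT_KEY = b"per-chip-secret"  # hypothetical stand-in for the Titan-held secret

def stage_digest(blob):
    return hashlib.sha256(blob).hexdigest()

# Signed manifest of expected stage digests (assumed already verified
# against the hardware root of trust).
EXPECTED = {
    "bios": stage_digest(b"bios-v42"),
    "bootloader": stage_digest(b"bootloader-v7"),
    "kernel": stage_digest(b"kernel-v5.4"),
}

def boot(stages):
    """Return a machine identity only if every boot stage verifies; else None."""
    for name, blob in stages.items():
        if stage_digest(blob) != EXPECTED.get(name):
            return None  # tampered stage: no identity after reboot
    # The identity binds the hardware root to the exact software booted.
    material = "".join(EXPECTED[n] for n in ("bios", "bootloader", "kernel"))
    return hmac.new(ROOT_KEY, material.encode(), hashlib.sha256).hexdigest()

clean = boot({"bios": b"bios-v42", "bootloader": b"bootloader-v7",
              "kernel": b"kernel-v5.4"})
tampered = boot({"bios": b"bios-v42", "bootloader": b"evil-bootloader",
                 "kernel": b"kernel-v5.4"})
```

The key property mirrored here: a compromised machine that reboots without passing every verified stage simply never regains an identity, so downstream services refuse its calls.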
Everything is a service, and no service is trusted by default
Service Identity and Isolation
- All IPC is mutually authenticated and authorized
- Do not rely on network-layer segmentation (still use ingress/egress filtering to prevent spoofing, etc., as an additional security layer).
- Each service has its own crypto credentials
- Services running on the same machine run in separate processes, use language-based sandboxing, and also virtual-machine-based separation to keep services from interacting outside of valid RPCs.
- User supplied code gets even more layers of isolation.
- Cluster orchestration service and other sensitive services run on dedicated machines.
Interservice Access Management
- Allows a service owner to precisely specify who can communicate with it.
- Service owner whitelists valid service identities, and this access restriction is automatically enforced by the infrastructure.
- Google engineers receive the same kind of strong identity, so services allow or deny their accesses the same way.
- These identities are held in a global namespace, which is the only valid source of identities. This is separate from end-user accounts.
- Includes rich identity workflows, including assigning separate identities to access-control groups which may, for example, require two-party control: one engineer proposes a change, but another must approve it. This scales to enormous services. Again, this ACL/group database is centralized and consistent.
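A toy sketch of those two ideas, whitelist enforcement and two-party control on ACL changes. Service names and the data model are invented for illustration; the real enforcement lives in the infrastructure, not in each service.

```python
# Hypothetical centralized ACL store: for each service, the set of
# identities (services or engineers) allowed to call it.
SERVICE_ACLS = {
    "contacts": {"gmail", "calendar"},
}

def apply_acl_change(service, new_callers, proposed_by, approved_by):
    """Sensitive ACL changes require a proposer and a distinct approver."""
    if approved_by is None or approved_by == proposed_by:
        raise PermissionError("two-party control: a distinct approver is required")
    SERVICE_ACLS[service] = set(new_callers)

def authorize_rpc(callee, caller_identity):
    """Infrastructure-side check run on every RPC; engineers and services
    carry the same kind of identity, so the check is uniform."""
    return caller_identity in SERVICE_ACLS.get(callee, set())

allowed = authorize_rpc("contacts", "gmail")   # True: whitelisted
denied = authorize_rpc("contacts", "photos")   # False: not whitelisted
```

Because the check keys on strong identities rather than network position, moving a service to a different machine or network changes nothing about who may call it.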
Encryption of Inter-Service Communications
- Cryptographic privacy and identity for all communications.
- To provide this for other protocols, such as HTTP, requests are encapsulated inside an RPC as they travel from frontend to backend.
- Provides application-layer isolation, and removes any dependency on the security of the network path. Remains secure even if the network is tapped.
- RPCs are additionally encrypted over inter-datacenter WAN links automatically and without additional configuration, by the infrastructure.
Access Management of End User Data
- A typical service is written to do something for an end user, for example, an end user may store their email on Gmail.
- To provide a complete experience for the end user, the Gmail service may call an API provided by the Contacts service, to access the end user’s address book.
- The Contacts service has been configured to allow the Gmail service to make such a call, but this is still a very broad set of permissions, as it would allow the Gmail service to retrieve the contacts of any user at any time.
- As the Gmail service makes the RPC on behalf of the end user, the infrastructure allows the Gmail service to present an end-user permission ticket on behalf of the user, as part of the RPC.
- The ticket serves as proof that the Gmail service is acting on behalf of the particular end-user, and allows the contacts service to require such a ticket with each RPC, and to return data only for the end user named in this ticket. (Sounds like Macaroons)
- This capability is integrated into the central user identity service, which requires a credential such as a cookie or oauth token, and returns a short lived end-user permission ticket for subsequent RPCs related to the user’s request. In this case, the Gmail service requested the ticket and passed it down to the Contacts service.
- In any cascading call, the ticket can be passed down by the calling service to the callee as part of the RPC. (See Macaroon paper for how chained-HMAC prevents escalation of privilege when passing the ticket.)
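A sketch of the macaroon-style chained-HMAC idea referenced above: each hop can attenuate a ticket by adding a caveat, but can never remove one, because each signature is an HMAC keyed with the previous signature. Key and caveat strings are invented for illustration.

```python
import hashlib
import hmac

ROOT_KEY = b"identity-service-signing-key"  # hypothetical

def mint(user):
    """Identity service mints a ticket bound to one end user."""
    caveat = ("user=" + user).encode()
    sig = hmac.new(ROOT_KEY, caveat, hashlib.sha256).digest()
    return ([caveat], sig)

def attenuate(ticket, caveat):
    """Any holder may narrow the ticket; the new HMAC is keyed on the old."""
    caveats, sig = ticket
    new_sig = hmac.new(sig, caveat, hashlib.sha256).digest()
    return (caveats + [caveat], new_sig)

def verify(ticket):
    """Recompute the chain from the root; all caveats must be intact."""
    caveats, sig = ticket
    expected = hmac.new(ROOT_KEY, caveats[0], hashlib.sha256).digest()
    for caveat in caveats[1:]:
        expected = hmac.new(expected, caveat, hashlib.sha256).digest()
    return hmac.compare_digest(expected, sig)

ticket = mint("alice")                             # minted by identity service
narrowed = attenuate(ticket, b"service=contacts")  # Gmail narrows before passing down
forged = (narrowed[0][:1], narrowed[1])            # dropping a caveat breaks the chain
```

Removing a caveat fails verification because the signature was derived from it, which is exactly why passing the ticket down a cascading call cannot escalate privilege.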
- All data encrypted at rest by the infrastructure.
- Storage services abstract away the physical storage.
- Services then integrate with the central KMS, using keys from it to encrypt data before it is written to physical storage.
- Central KMS also integrates with end-user permission tickets above, so that keys can be tied to particular end-users.
- Application layer encryption insulates data from threats at lower levels, such as malicious disk firmware.
- Hardware encryption is also enabled on hard drives and SSDs.
- Drives are tracked through their lifecycle, and if a drive cannot be erased by a multi-step secure wipe procedure, it is physically destroyed and shredded on premises.
- Cloud customers can use the KMS, and can even provide their own encryption keys at time of use through an API; those keys are stored only in RAM during use. Customers are then responsible for keeping their keys, since Google will not have copies with which to retrieve the data. These customer-provided keys are used as the KEKs, which in turn protect the data encryption keys that encrypt each block of data as it is stored.
- Cloud KMS: keys are customer-managed but stored with Google.
- The Data Loss Prevention API can classify/redact sensitive data in Gmail, Drive, or a customer application.
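The KEK/DEK layering can be sketched as envelope encryption: a customer-supplied KEK wraps a fresh per-block DEK. XOR with a SHA-256 counter stream stands in for a real cipher here; the scheme and names are illustrative, not Google's actual implementation.

```python
import hashlib
import os

def keystream(key, n):
    """Deterministic SHA-256 counter stream (toy cipher for illustration)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def xor_cipher(data, key):
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

def store_block(plaintext, kek):
    dek = os.urandom(32)  # fresh data-encryption key for this block
    return {"wrapped_dek": xor_cipher(dek, kek),
            "ciphertext": xor_cipher(plaintext, dek)}

def read_block(record, kek):
    dek = xor_cipher(record["wrapped_dek"], kek)  # unwrap the DEK with the KEK
    return xor_cipher(record["ciphertext"], dek)

kek = os.urandom(32)  # held only by the customer; no copy stays with the provider
record = store_block(b"attachment bytes", kek)
```

The point of the split: rotating or revoking the customer's KEK never requires re-encrypting the data blocks themselves, only re-wrapping the small DEKs.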
Internet communications – Google Front End (GFE)
- Infrastructure completely on private IP space.
- All traffic from internet travels through GFE
- Super cool: GFE is just a service that runs on the infrastructure, not a physically parallel chained set of proxies. This is how it can scale up and down as needed.
- Services need to register themselves with the GFE. The GFE then performs TLS termination for the outside world. It’s a homogeneous smart reverse proxy.
- Also provides DDoS protection.
- Google Cloud Load Balancer is GFE.
- Due to scale and wholly owned backbone, can simply absorb most DoS attacks.
- When the backbone delivers an external connection to a datacenter, the connection goes through several layers of hardware and software load balancing, which report statistics about incoming connections to a central DoS service. When the central service detects a DoS attack, it can configure the LBs to drop or throttle traffic associated with the attack. (The first thing Niels Provos worked on at Google.)
- The next layer, the GFE instances, also report information about the requests they receive to the central DoS service, including application-layer info not available to the LBs. The central DoS service can then again configure the GFEs and upstream LBs to drop or throttle traffic.
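The reporting/feedback loop above can be sketched as a central service that aggregates per-source counts from LBs and GFEs and pushes drop decisions back down. Thresholds, addresses, and the data model are invented for illustration.

```python
class CentralDoSService:
    """Toy central DoS service: aggregate reports, emit drop decisions."""

    def __init__(self, threshold):
        self.threshold = threshold  # max requests per source per window
        self.counts = {}
        self.throttled = set()

    def report(self, source, requests):
        """Called by an LB or GFE instance with observed request counts."""
        self.counts[source] = self.counts.get(source, 0) + requests
        if self.counts[source] > self.threshold:
            self.throttled.add(source)  # decision pushed back to LBs/GFEs

    def should_drop(self, source):
        """Consulted by LBs/GFEs before admitting traffic from a source."""
        return source in self.throttled

dos = CentralDoSService(threshold=1000)
dos.report("198.51.100.7", 50)    # normal client, reported by an LB
dos.report("203.0.113.9", 5000)   # attack volume, reported by a GFE
```

Centralizing the decision is what lets application-layer signals seen only at the GFE drive throttling at the upstream LBs, where dropping is cheapest.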
User identities – authentication
- Central identity service
- Provides login page with username/password
- Can challenge with additional info based on history of logins or location
- Phishing is still the biggest problem, so Google invented hardware second-factor strong authentication that forecloses the possibility of phishing.
- Can mandate security key use.
- Malware compromise of the endpoint is also a big problem.
- Choose a client platform inherently designed for security; malware compromise of ChromeOS is very difficult.
- A policy such as “admin access must come from a Chromebook with a security key” brings the probability of a successful admin attack to ~0.
- All Google employees are required to use a security key for strong second-factor auth.
- Strong investment in monitoring client devices.
- OS uses an up-to-date signed image
- Controls what is installed, including applications, extensions, and services.
- Being on the primary LAN is not grounds for granting access.
- Enforce and grant access at the application level
- Can then allow access to sensitive data only to an employee who is strongly authenticated, coming from a healthy corporate device, on expected networks, and in expected geographic locations.
- One policy per application, not reliant on network-by-network or instance-by-instance ACLs.
- Identity-Aware Proxy externalizes this service. IAP will get more options, like Chromebook-only, in the future.
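A sketch of one such per-application policy decision in the zero-trust spirit described above (and externalized as IAP). All field names and the policy shape are invented for illustration.

```python
# Hypothetical policy for one sensitive application: identity strength,
# device health, platform, and location all feed the decision; the
# network the request arrived on plays no role at all.
SENSITIVE_APP_POLICY = {
    "require_security_key": True,
    "allowed_platforms": {"chromeos"},
    "allowed_regions": {"US", "CH"},
}

def allow_access(ctx, policy=SENSITIVE_APP_POLICY):
    """Per-request decision from identity, device health, and context."""
    if policy["require_security_key"] and not ctx.get("security_key_auth"):
        return False
    return (ctx.get("device_healthy") is True
            and ctx.get("platform") in policy["allowed_platforms"]
            and ctx.get("region") in policy["allowed_regions"])

request = {"security_key_auth": True, "device_healthy": True,
           "platform": "chromeos", "region": "US"}
```

One policy object per application replaces the sprawl of network- and instance-level ACLs, which is what makes the model auditable at scale.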