Google Cloud Platform in 2017

Almost two years ago, one of the HIPAA projects I am working on transitioned to Google Cloud Platform (GCP). The decision at the time was largely financially driven. We were using AWS, but our startup accelerator had a deal with Google which included a significant budget of credits for Google Cloud Platform. We also noted that, unlike AWS, where you must pay for dedicated instances in order to run any processes involving protected health information (PHI), GCP allows HIPAA-covered services to run on the same instances, with the same pricing, as other workloads. This amounts to a savings of at least $1,500 a month at current dedicated instance pricing.

There is a learning curve with all the major cloud players, and learning to do things the Google way is a bit of an adjustment. Initially we made a couple of missteps as we tried to transfer our services directly, but we have learned a fair amount in the past two years. I will cover some of the specific differences between my AWS deployments and GCP deployments in this article and discuss my assessments of the state of GCP today and the improvements over the past two years.

Permissions and Access Management

AWS IAM is a very fine-grained permissions management system. Each service offers a dizzying array of permissions, and those permissions can be given for a service in general or for a specific resource within that service. It can often be fairly complicated to determine the group of services required to perform some set of operations on AWS, but there are many samples and templates to help guide an administrator toward making good decisions in this area. Over time I have developed sets of IAM policies which I incorporate in my CloudFormation scripts in order to restrict access in a way which complies with HIPAA rules. In transitioning over to Google Cloud Platform, the lack of such fine-grained permissions led to some initial growing pains.

When I first started using GCP, there were only four classes of permissions for each user: Owner, Editor, Viewer, and Browser. Basically anyone who needed to do work would have to take on the role of Owner or Editor. Luckily, whereas AWS organizes by root accounts, GCP organizes by project, so it is possible to divide an organization's cloud workload into several projects and provide users with access only to the projects they need to access. That unfortunately opened up another problem, which was that each project has its own network. Therefore a decision had to be made about common resources. Should they be packaged for deployment in each project or should they be served as a public service from a separate account? In many cases this depends on both how it is used and the auditing requirements around it. For example, single sign-on is best served as a single public service which allows for easy account locking and auditing, whereas an ontological database may be best deployed along with each project. Figuring out how to divide up your workloads into projects is a useful task to do ahead of your migration. The networking situation is also likely to be resolved soon as cross-project networking is currently in beta.

When logging into the web-based console there are a couple of interesting differences between AWS and GCP. First, they both offer two-factor authentication, which I would say is a must for anyone serious about security. AWS only offers TOTP, but GCP also offers hardware security keys as a second factor, including the popular and inexpensive Yubico FIDO U2F security key. Once logged into the GCP console it is possible to SSH to your instances with a public IP address, or pull up Cloud Shell to manage your project from a command line running in GCP and accessible through the browser. I will discuss Cloud Shell more later. The web login is also used as an intermediary in allowing Google APIs and the command line tool to connect to your account. In contrast, AWS uses a static API key and API secret to connect the command line tools and APIs, and relies on the user to rotate these credentials periodically. I think the GCP console gains the upper hand here in terms of utility as well as the ability to access it from just about any computer, including a Chromebook.

On the server side, AWS offers roles which have just as much IAM specificity as user accounts. Google Cloud Platform originally offered scopes, which were very limited in number, though they offered more flexibility than user permissions. Both platforms use a similar metadata service to provide a server with rotating credentials. As time goes on, Google's identity and access management services are becoming much more sophisticated. Google now has a more fine-grained set of user permissions, and has transitioned from scopes to service accounts for server-side permissions. Service accounts can be granted the same permissions as user accounts. Although still not as sophisticated as AWS IAM roles and policies, GCP is quickly catching up.

System Architectures

On AWS I make heavy use of CloudFormation to maintain and deploy groups of related services. CloudFormation is a great tool on AWS for specifying the details of services down to very fine-grained levels which are not easily accessible from either the web console or the command line tool. GCP has a similar tool called Cloud Deployment Manager which uses YAML to describe an infrastructure for deployment. Cloud Deployment Manager can also process Jinja and Python scripts to allow for more powerful variable substitution and conditional processing. In practice, I have not made use of Cloud Deployment Manager. Google's command line tool, gcloud, is generally updated quickly as new features are implemented and has the ability to manage all the fine-grained features of the platform. There are also many more examples of using gcloud in the documentation than there are Cloud Deployment Manager samples. As a result I tend to build infrastructures using bash scripting and gcloud.

Google Cloud Shell, which I mentioned earlier, is a great tool for maintaining a project's infrastructure on GCP. Essentially it provides a browser-accessible bash shell running on an f1-micro instance with a set of standard tools, including gcloud, gsutil (Google's cloud storage tool), git, Docker and others, along with a 5 GB persistent home directory. There is also a built-in web-based text editor, and the ability to test web services from within the shell. A Cloud Shell boost mode will run your shell on a g1-small instance instead of an f1-micro. This can be useful for building Docker containers which require more memory than the standard Cloud Shell provides. I have gotten used to checking out my deployment scripts using git in Cloud Shell and managing my infrastructure from there. Although each user account only has a single home partition, regardless of the number of projects that user is a member of, each shell is started from within a project page. This means the gcloud command is automatically set up to use the correct project. I also like the fact that if I were traveling and my laptop were stolen, I could still log in from an available computer and manage my GCP projects easily. I also do not have to worry about updating the gcloud tool, as Cloud Shell tools are always up to date.

Network architecture on GCP will be very familiar for anyone coming from AWS. The equivalent to an AWS VPC is a GCP network, and networks are either legacy networks, which use a single IP range across all regions, or subnet networks, which divide private address space into separate subnets for each Google region. AWS regions and GCP regions are similar and are each subdivided into zones. GCP has a somewhat simpler solution for firewalls. Rather than the combination of Network ACLs and Security Groups offered by AWS, GCP offers a set of firewall rules associated with each network. These rules can be applied to all instances in the network, or specifically to tagged instances. Each compute instance created within a network can be tagged with a number of tags signifying which firewall rules to adopt. Firewall rules can filter by source IP address range, protocol, and port. They can also be given priorities, where a lower priority number indicates a higher priority rule. Therefore, if two rules conflict, the one with the higher priority (the lower number) takes precedence. Unlike AWS, GCP does not currently support IPv6 in its compute instance networking options. IPv6 adoption is one area where GCP appears to be lagging behind, even though IPv6 support has been present in App Engine for quite some time. AWS recently leapt ahead in this space with innovations such as egress-only IPv6 internet gateways to create private network support. I am surprised Google is not more of a leader here.
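The priority behaviour described above can be sketched in a few lines of Python. This is a toy model and not Google's implementation: the Rule class, its field names, the sample rules, and the assumption that unmatched ingress traffic is denied are all mine, for illustration only.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network
from typing import List, Set


@dataclass
class Rule:
    """Hypothetical, simplified ingress firewall rule."""
    name: str
    priority: int            # lower number = higher priority
    action: str              # "allow" or "deny"
    source_range: str        # CIDR block, e.g. "203.0.113.0/24"
    protocol: str
    port: int
    target_tags: Set[str] = field(default_factory=set)  # empty = every instance

    def matches(self, src_ip: str, protocol: str, port: int,
                instance_tags: Set[str]) -> bool:
        # A tagged rule only applies to instances carrying one of its tags.
        if self.target_tags and not (self.target_tags & instance_tags):
            return False
        return (ip_address(src_ip) in ip_network(self.source_range)
                and protocol == self.protocol
                and port == self.port)


def decide(rules: List[Rule], src_ip: str, protocol: str, port: int,
           instance_tags: Set[str]) -> str:
    """Return the action of the highest-priority matching rule."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(src_ip, protocol, port, instance_tags):
            return rule.action
    return "deny"  # assumption: ingress is denied when nothing matches


# Hypothetical rules: block SSH everywhere, but let one office range
# reach instances tagged "bastion". The allow rule wins because 900 < 1000.
EXAMPLE_RULES = [
    Rule("deny-ssh-all", 1000, "deny", "0.0.0.0/0", "tcp", 22),
    Rule("allow-ssh-office", 900, "allow", "203.0.113.0/24", "tcp", 22,
         {"bastion"}),
]
```

With these sample rules, an SSH connection from the office range to a bastion-tagged instance is allowed, while the same connection to an untagged instance falls through to the lower-priority deny rule.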

Google Compute Engine

Google's answer to AWS EC2 is Google Compute Engine (GCE), not to be confused with Google Container Engine (GKE). GCE is quite similar to EC2 in that you can start a virtual server with a choice of operating systems and they run within Google's network. There are a number of differences though, which can be somewhat surprising when you delve deeper into the service.

First, my understanding is that AWS EC2 is built on a highly customized Xen-based hypervisor. As such, EC2 is a fairly standard virtual private server system with a lot of surrounding networking and metadata support. Google's system is based on Borg, the internal Kubernetes predecessor they built to distribute containerized workloads on what were likely once highly stripped-down Debian Linux machines, though now they might be Chromium OS machines. These Borg containers were basically statically linked executables working in their own Linux containers. By running a custom KVM executable, it's possible to run an operating system in one of these containers and get a virtual private server. Although this history is not terribly important to the end user, it is interesting and allows for what I consider the killer feature of GCE. As an aside, it's interesting to consider what is happening when you run Kubernetes on GCE. Borg creates several virtual servers running in containers, and you use these servers to create another virtual cluster within what is already a real cluster to run more containers. Although this is clearly possible, and for reasons of portability may even be practical, I still think it's a bit odd. I do run containers within my GCE instances, but I tend to control their deployment using tools which do not create a new cluster of GCE instances. My solution in GCE, which involves the use of Google's Container-Optimized OS, will likely be the subject of another article.

The first thing an AWS user might notice is that when you start your new instance, you do so without getting an SSH key. How will you connect to the new instance? Well, it turns out that Google runs several agents on each GCE instance. These agents include an account agent which will create accounts for you and set the SSH keys for those accounts. If you use the gcloud compute ssh command, then SSH keys will be generated and deployed as needed to your instances. The alternatives are to store SSH keys as part of your project metadata, which means that accounts with those keys will be created on every instance you deploy, or to store the SSH keys as part of your instance metadata. Handling SSH keys is the currently accepted way to manage user accounts on a GCE instance. All users are granted sudo privileges by default. A private beta is currently underway for a different GCE user management strategy. This beta has been going on for a very long time, and I am not sure if it will ever become a public beta or be rolled out into general availability. My recommendation is to use instance-only SSH keys in order to limit access to appropriate resources. This can be especially useful for HIPAA auditing, where you want your access controls to be easily viewed from one location or command.
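For the auditing case, a small helper can turn an SSH key metadata value into a list of who has access. The sketch below assumes the metadata value holds one entry per line in the form "username:key-type key-data comment", which is my understanding of the format; the function name and the sample users are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def users_with_access(ssh_keys_metadata: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each username to the (key type, comment) pairs registered for it.

    Assumes one "username:key-type key-data [comment]" entry per line.
    Malformed lines are skipped rather than guessed at.
    """
    users = defaultdict(list)
    for line in ssh_keys_metadata.splitlines():
        username, separator, key = line.strip().partition(":")
        parts = key.split()
        if not separator or not username or len(parts) < 2:
            continue  # blank or malformed entry
        key_type = parts[0]
        comment = " ".join(parts[2:])  # may be empty
        users[username].append((key_type, comment))
    return dict(users)


# Hypothetical metadata value for a single instance.
SAMPLE_METADATA = """\
alice:ssh-rsa AAAAB3NzaC1yc2EAAA alice@laptop
bob:ssh-ed25519 AAAAC3NzaC1lZDI1 bob@workstation
alice:ssh-rsa AAAAB3NzaC1yc2EBBB alice@desktop
"""

for user, keys in sorted(users_with_access(SAMPLE_METADATA).items()):
    print(user, len(keys), "key(s)")
```

Feeding this the metadata of each instance gives a per-instance access list that can be archived alongside other audit records.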

A paragraph or so back I mentioned a killer feature. This feature is live migration. If you've been on AWS for a while you have likely received notice that your host hardware would be brought down for maintenance at some point in the future. Usually you will have some time to begin the hopefully simple task of manually migrating your instance. In GCE you will not receive any such notice. GCE instances migrate automatically when needed, and it happens routinely. 99.99% of the time this migration is a wonder to behold and even active SSH connections won't be disturbed. Once in a blue moon, however, something goes wrong. For us it was a single process which stopped during the migration: the process which alerts us to issues in the pipeline. Although we have had one bad experience, overall I think this feature is tremendous. I have mitigated the issue by logging migrations and issuing an alert should the monitoring machine migrate, so that I can verify everything is still running. We have never encountered the issue again.
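One way to log migrations is to watch the instance metadata server's maintenance-event value. The sketch below is an illustration under assumptions and not necessarily the setup described above: it assumes the endpoint reports "NONE" when idle and a value such as "MIGRATE_ON_HOST_MAINTENANCE" when a migration is pending, and the watch function and its alert callback are hypothetical names.

```python
import urllib.request
from typing import Callable, Optional

METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event")


def fetch_maintenance_event(wait_for_change: bool = True) -> str:
    """Read the maintenance-event value (only works on a GCE instance).

    With wait_for_change, the request blocks until the value changes.
    """
    url = METADATA_URL + ("?wait_for_change=true" if wait_for_change else "")
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode().strip()


def watch(fetch: Callable[[], str], alert: Callable[[str], None],
          max_events: Optional[int] = None) -> None:
    """Call alert() each time the event changes to something other than NONE."""
    previous = "NONE"
    seen = 0
    while max_events is None or seen < max_events:
        event = fetch()
        seen += 1
        if event != previous and event != "NONE":
            alert(event)  # e.g. write a log line or page someone
        previous = event


# On an instance, something like: watch(fetch_maintenance_event, print)
```

Passing the fetch function in as a parameter keeps the loop testable away from GCE, since a fake fetcher can replay a sequence of events.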

Pricing between AWS and GCE is competitive, but GCE automatically starts discounting instances during the month depending on how much uptime you have with that instance. GCE also offers one-year and multiyear committed use discounts, but does not have a market built around them. GCE offers preemptible instances as a response to AWS spot instances, but once again GCE offers a fixed-price discount rather than relying on a spot market as AWS does. I find this easier to use, though I might be able to get better value from AWS if I invested time in understanding the market pricing structure. As a result, I have used preemptible instances on GCE much more often than spot instances on AWS. In many cases it feels like I am stealing from Google when I use preemptible instances. In a managed instance group, preemption seems fairly rare (1-2 out of 50 per 24-hour period for n1-standard-1 instances), and the instance group manager immediately spins up a new instance to replace the preempted one. The discount for using preemptible instances is quite steep, and I have found myself gravitating toward using them more often for batch processing jobs.
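To make the automatic monthly discount concrete, here is a small worked calculation. The tier table is an assumption based on my recollection of the published sustained use pricing (each successive quarter of the month billed at 100%, 80%, 60%, and 40% of the base rate) and should be checked against the current pricing page before relying on it.

```python
# Assumed tiers: (fraction of the month, multiplier applied to the base rate).
TIERS = [(0.25, 1.0), (0.25, 0.8), (0.25, 0.6), (0.25, 0.4)]


def effective_rate_multiplier(usage_fraction: float) -> float:
    """Average price multiplier for an instance that ran for the given
    fraction of the month (0.0 to 1.0) under the assumed tiers."""
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0 and 1")
    if usage_fraction == 0.0:
        return 1.0  # nothing was billed, so no discount applies
    remaining = usage_fraction
    billed = 0.0
    for width, multiplier in TIERS:
        used = min(remaining, width)
        billed += used * multiplier
        remaining -= used
        if remaining <= 0:
            break
    return billed / usage_fraction


for fraction in (0.25, 0.5, 0.75, 1.0):
    discount = 1.0 - effective_rate_multiplier(fraction)
    print(f"{fraction:.0%} of month -> {discount:.1%} off the base rate")
```

Under these assumed tiers, an instance that runs for the whole month pays (1.0 + 0.8 + 0.6 + 0.4) / 4 = 0.7 of the base rate, a 30% discount, with no upfront commitment.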

The Long Beta...

I tend not to be an early adopter, and I think I am not alone among my enterprise peers. Generally I take the following steps as new technologies appear:

Alpha, Closed Beta technologies: I read a little about the technology and make a note to look into promising ones as they mature.
Open Beta technologies: I run some experiments using the technology to get a feel for how it works and how it can be applied, but do not implement any production pipelines on it.
General Availability: I am willing to use the technology for production work if it proves valuable enough.

Google has a history of keeping services in beta for a very long time. One example in GCP of a solution currently in a long beta process is Cloud Functions. Introduced not very long after AWS introduced Lambda in limited availability, Cloud Functions has only recently graduated from closed to open beta. As a result, when it comes to serverless, GCP is falling increasingly behind AWS and even IBM. Further evidence of the long beta can be seen in the Google Cloud Client Library for Python, where services which have been in general availability for quite some time remain in beta or alpha in the SDK. This can be trying if you wait for general availability to begin using a service in production. It can also mean a long wait for services to join the HIPAA-compliant service list, which usually happens some time after a solution becomes generally available. Luckily, at this point in time there is a critical mass of available features which can enable enterprise, HIPAA-compliant use of GCP.

Elephant in the Room

I remember deciding to organize a conference on Google Wave just after it hit GA. I argued to colleagues that this was a new service by Google which had just migrated to full production and was out of beta. It would not disappear like some fly-by-night company's service and it was worth the investment to learn to use it. Google Wave was discontinued less than 3 months later and before the conference had taken place. Google has discontinued so many services that Wikipedia even has a discontinued Google service page. So, who in their right mind would build on such shifting sands as Google?

I wrestle with this occasionally, and try to build services with the idea that I could move them should I have to. I try not to rely on Google-specific services, but on their more generic services which could be replicated on AWS if necessary. As I have pointed out in a previous article, I make heavy use of AWS KMS. I have created a library which can be configured to use AWS KMS or Google Cloud KMS in order to maintain flexibility in my code and deployments. I think there are more indications this year that Google is committed to GCP. Many people point to the hiring of Diane Greene to head up Google's cloud business as an indication that Google is taking both G Suite and GCP seriously. Other people point to Eric Schmidt's public commitment at Next 2017 that Google is investing a lot of time and money in GCP. We can also look at the fact that although Google has abandoned many services, it has not yet abandoned many services that users pay for directly, and this includes GCP.
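The idea of a configurable key management layer can be sketched as follows. This is not the library mentioned above, only a minimal illustration of the pattern: KeyService, make_key_service, and the in-memory backend are hypothetical names, and the XOR "cipher" is a stand-in that keeps the example self-contained and must never be used for real data. Real backends would wrap the AWS KMS and Google Cloud KMS client libraries behind the same two methods.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class KeyService(ABC):
    """Provider-neutral interface the application codes against."""

    @abstractmethod
    def encrypt(self, plaintext: bytes) -> bytes: ...

    @abstractmethod
    def decrypt(self, ciphertext: bytes) -> bytes: ...


class InMemoryKeyService(KeyService):
    """Test stand-in only. XOR is NOT encryption; do not use for PHI."""

    def __init__(self, key: bytes):
        if not key:
            raise ValueError("key must not be empty")
        self._key = key

    def _xor(self, data: bytes) -> bytes:
        return bytes(b ^ self._key[i % len(self._key)]
                     for i, b in enumerate(data))

    def encrypt(self, plaintext: bytes) -> bytes:
        return self._xor(plaintext)

    def decrypt(self, ciphertext: bytes) -> bytes:
        return self._xor(ciphertext)


# Registry of available backends. Hypothetical "aws" and "gcp" entries
# would be added here, each wrapping its provider's KMS client.
BACKENDS: Dict[str, Type[KeyService]] = {"memory": InMemoryKeyService}


def make_key_service(config: dict) -> KeyService:
    """Build the backend named in config, e.g. loaded from a settings file."""
    backend = config["backend"]
    if backend not in BACKENDS:
        raise ValueError(f"unknown key service backend: {backend!r}")
    return BACKENDS[backend](**config.get("options", {}))


service = make_key_service({"backend": "memory", "options": {"key": b"demo"}})
token = service.encrypt(b"patient record")
print(service.decrypt(token))
```

Because callers only see the KeyService interface, moving a deployment between providers becomes a configuration change and not a code change.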

Why Now?

Several announcements at Google Next 2017 seemed especially geared toward wider adoption by large enterprises. These include Cloud KMS general availability, the acquisition of AppBridge, the development of the Identity-Aware Proxy, Data Loss Prevention, Cloud Spanner, Committed Use Discounts, and engineering support, along with many others.

Prior to this year, I saw GCP as kind of a niche player which caters less to the large enterprise customer and more toward the startup. This year I think GCP has shifted to be a more complete general purpose player. There are still a lot of services in beta, but if they hit GA on all of them, Google is poised to provide a genuine enterprise alternative to AWS and Azure.