Is CDK the right tool for Infrastructure as Code?

Flavour of the month

For the past few years I've worked for a consultancy primarily focused on Amazon Web Services. AWS has been #1 in public cloud market share since that market existed, and depending on who you believe and how you carve it up, they still hold between 30 and 40 percent of that market.

In every customer engagement on which I've worked, our mandate has been to automate as much as possible, using the by-now well-entrenched industry practice of Infrastructure as Code. I'm going to assume readers of this post are familiar with Infrastructure as Code (hereafter IaC) and that I don't need to explain its fundamental principles.

The public cloud market has been large enough for some time to sustain a number of different tools for IaC, including the original AWS-native tool CloudFormation, HashiCorp's Terraform, AWS' Cloud Development Kit (CDK), and Pulumi, along with the usual levels of hype and debate from various pundits and enthusiasts. (I apologise in advance for adding to the noise in this respect.)

For the past couple of years, it has seemed to me (anecdotally) that CDK is the primary tool that our new customer projects have used. Although I've previously worked mostly with Terraform, for the past six months or so I've spent most of my time in existing customer code bases written in CDK, in both its TypeScript and Python forms.

Those familiar with Betteridge's law of headlines should already know my answer to the question which forms the title of this post. For those who aren't: my answer is "No".

So with that out of the way, I want to spend the rest of this post explaining why I think CDK is the wrong tool for managing infrastructure code bases. My views about this are obviously informed by my personal experiences and the situations in which I've encountered CDK. The reader will need to decide whether the reasons I present here are applicable to their situation. I don't expect this post to convince any CDK fans to drop the tool and never touch it again; my aim is only to explain clearly why I continue to advocate for the use of Terraform over CDK.

Before we go on, I would offer a couple of caveats:

  1. My belief that CDK is not fit for purpose as an IaC tool is not a value judgement on the people who wrote and maintain CDK, on my colleagues who wrote the CDK code bases I currently work on, or on anyone who promotes or likes CDK. You are all awesome, and I'm sure you (like me) want to do the right thing by your employer and customers, and work on great technology.
  2. This post does not in any way represent the views of my employer (who would be very happy to discuss helping you with projects involving CDK, Terraform, or any other technology in our core areas of expertise).

What would Bill do?

William of Ockham, a 14th-century philosopher, scientist, and theologian, is known for Occam's razor, a problem-solving principle often stated as "Entities must not be multiplied beyond necessity". As an explanatory principle, it is used as a philosophical or scientific tie-breaker when testing two hypotheses with equal explanatory power.

[Image: William of Ockham]

However, this principle is also valuable as a design guideline for building systems. Whilst the modern form of Occam's razor stated above is not found in Ockham's works, one of the forms he did use was:

"It is futile to do with more things that which can be done with fewer."

The principles behind the Agile Manifesto echo this:

"Simplicity--the art of maximizing the amount of work not done--is essential."

(I hope I can be forgiven for not taking the time here to justify why simplicity is a desirable principle in IaC code bases, or why desiring to minimise the number of components in a solution is a good choice. Many persons more qualified than I have written about failures in complex systems and a quick search with one's favourite Internet search engine should surface plenty of research which substantiates the claim that aggressive pursuit of simplicity helps to make systems more reliable.)

CDK's multiplication of entities (in other words, its lack of simplicity) is my first objection to it. From the perspective of the number of components needed to deploy a resource in AWS, CDK has more than necessary, and more than its competitors. (In this post, I'll use Terraform as the counterpoint, because I'm most familiar with it; the same arguments may or may not apply to other IaC alternatives. OpenTofu may be considered identical to Terraform for these purposes.) Here's a block diagram of the main components of a very simple CDK Python IaC code base to create an EC2 instance:

[Figure: CDK deployment]

And here's a similar example for Terraform:

[Figure: Terraform deployment]

One could certainly quibble over some of the details I've chosen to include or omit, but there are two areas in which Terraform has a clear simplicity advantage over CDK:

  1. CDK Python requires the language ecosystems of Python, TypeScript, and JavaScript. One could obviously eliminate one of these entities by writing only in TypeScript and skipping the transpile step, but this also eliminates one of CDK's key selling points: that developers can create infrastructure in their preferred programming language. By contrast, Terraform's mostly-domain-specific language comes in a single binary.
  2. CDK is essentially a tool to generate and apply CloudFormation templates. All of its access to AWS is mediated via CloudFormation.
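
To make the second point concrete, here is a minimal Python sketch of the "program in, template out" shape. It is a toy model, not the CDK API: the Resource and Stack classes and the synth method are hypothetical stand-ins written for this illustration, and the AMI ID is a placeholder. The only real-world detail is the general layout of a CloudFormation template (a format version plus a Resources map of Type and Properties).

```python
import json


class Resource:
    """Hypothetical stand-in for a construct: a typed bag of properties."""

    def __init__(self, logical_id, cfn_type, **properties):
        self.logical_id = logical_id
        self.cfn_type = cfn_type
        self.properties = properties


class Stack:
    """Hypothetical stand-in for a stack: collects resources, then synthesises."""

    def __init__(self, name):
        self.name = name
        self.resources = []

    def add(self, resource):
        self.resources.append(resource)
        return resource

    def synth(self):
        # The program's only output is a CloudFormation-style template.
        # Nothing here talks to AWS; a separate service would apply it.
        return {
            "AWSTemplateFormatVersion": "2010-09-09",
            "Resources": {
                r.logical_id: {"Type": r.cfn_type, "Properties": r.properties}
                for r in self.resources
            },
        }


stack = Stack("demo")
# The AMI ID and instance type are placeholder values for illustration.
stack.add(Resource("WebServer", "AWS::EC2::Instance",
                   ImageId="ami-00000000", InstanceType="t3.micro"))
template = stack.synth()
print(json.dumps(template, indent=2))
```

However elaborate the program that builds the stack becomes, in this model everything collapses into one declarative document, and a second system is still needed to act on it.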

It is this dependence on CloudFormation (hereafter CFN) to which we turn now.

You had one job

CDK's dependence on CFN means that CDK is bound by CFN's operational model. Whilst CFN is ostensibly an IaC tool (one could say, the original AWS IaC tool), it has one peculiar feature which makes it a very poor choice for this job.

With most other IaC tools, the code is taken to be the source of truth. That is, if the state of the running resource and the state of the code disagree, the resource is changed to match the state dictated by the code. Under Terraform, this is enforced on every deployment when terraform apply is invoked.
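
The "code wins" model can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Terraform's implementation: the plan and apply functions below are hypothetical, resources are modelled as plain dictionaries, and attribute names are chosen only for the example.

```python
def plan(desired, actual):
    """Return the attribute changes needed to make `actual` match `desired`."""
    changes = {}
    for key, want in desired.items():
        have = actual.get(key)
        if have != want:
            changes[key] = (have, want)
    return changes


def apply(desired, actual):
    """Overwrite drifted attributes so the live resource matches the code."""
    changes = plan(desired, actual)
    for key, (_, want) in changes.items():
        actual[key] = want
    return changes


# What the code says the instance should look like.
desired = {"instance_type": "t3.micro", "monitoring": True}
# What is actually running: someone resized the instance by hand.
live = {"instance_type": "t3.large", "monitoring": True}

changes = apply(desired, live)
print(changes)
```

In this model drift is just another difference to be planned and applied; a second run of plan against the same inputs finds nothing left to change.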

However, with CFN the state of the running resource is the source of truth. If we attempt to apply a change to a template when the underlying resource's state is mismatched with the CFN state, the resource's state prevails, and the CFN stack update fails. Or put another way: there's no way to correct drift by reapplying a CFN template. One must manually correct the drift that has occurred before the stack can be updated.
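
For contrast, here is a toy Python model of the behaviour described above, in which a mismatch between the recorded state and the live resource blocks the update instead of being corrected by it. It models the description in this post, not CloudFormation's internals; update_stack and DriftError are hypothetical names.

```python
class DriftError(Exception):
    """Raised when the live resource no longer matches the recorded state."""


def update_stack(recorded, live, new_properties):
    """Toy update: refuses to proceed while the live resource has drifted."""
    drifted = sorted(k for k, v in recorded.items() if live.get(k) != v)
    if drifted:
        # The live resource is left untouched; a human has to fix it first.
        raise DriftError("manual fix needed for: " + ", ".join(drifted))
    live.update(new_properties)
    recorded.update(new_properties)


recorded = {"instance_type": "t3.micro"}   # what the stack believes exists
live = {"instance_type": "t3.large"}       # what is actually running

try:
    update_stack(recorded, live, {"instance_type": "t3.small"})
    outcome = "updated"
except DriftError as err:
    outcome = f"failed: {err}"

print(outcome)
```

The update only succeeds once someone has brought the live resource back into line with the recorded state by other means.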

One could argue that an effective DevOps team should not allow this drift to occur in the first place. And that would certainly be the ideal, but there are numerous legitimate reasons for drift to occur. In our team's case, we're working on AWS accounts which are owned by our customers, and we simply do not have the right (morally or contractually) to tell them what they can and cannot do with their own AWS resources. So any effective IaC solution must allow for drift, and shouldn't prevent work occurring while drift is present.

CFN has ways to deal with drift, but they require the drifted resource to be manually removed from and re-imported into the CFN stack. An AWS blog post from 2020 admitted that:

"stack drift represents a persistent and tedious challenge for organizations managing critical infrastructure with CloudFormation"

However, the "automatic" drift remediation presented in that blog post consists of using a Lambda function to repair the affected resource programmatically, using a predefined snapshot of what that resource should look like. How this qualifies as automatic when compared with terraform apply is hard to fathom.

This lack of drift remediation in the underlying CFN technology makes CDK operationally ineffective and costly (in terms of both money and time) for maintaining infrastructure long-term.

A tool looking for a developer

CDK's big selling point is that developers don't need to learn a new language to maintain their infrastructure; they can just use the language they're comfortable with to control the infrastructure on which their application runs. This is absolutely true as a standalone claim, but not really relevant in a large number of situations.

The DevOps ideals, when they were originally conceived, were that development and operational skills would be combined in cross-functional teams which were responsible for driving both the operational effectiveness and the functionality of services, especially in microservice-focused environments. This seems to have worked well in some Silicon Valley companies, but in our customer base in corporate Australia, DevOps ideals were largely never implemented. They were often just used as a buzzword to justify larger budgets and the corporate restructures that top-down management styles seem to thrive on.

If developers do end up managing the IaC, just because they can manage the infrastructure in their preferred language doesn't make them experts at that infrastructure. Knowing the language in which the controls are expressed does not make them automatically understand those controls or the implications of making a particular change.

A more common scenario among most of the customers I've worked with is that the traditional dev/ops divide still exists: a world in which developers would rather not know how DNS delegation works, and ops folks would rather not have to understand GitHub flow.

In these scenarios the CDK code base tends to be inherited by a few ops folks who have sufficient chops to operate git competently and aren't averse to submitting pull requests. In some cases, these folks are excellent programmers; in other cases, not so much. This can result in code bases which are poorly-conceived from a software engineering perspective and are difficult to extend easily with new functionality. In turn they become harder to understand as people bolt on needed functionality in less-than-elegant ways, which feeds into a vicious cycle of neglect and poor design.

The use of "full" programming languages in CDK lends itself to this vicious cycle, because there's always more than one way to do something, and unless someone in the organisation is championing standards for those programming languages, it's easy for the code base to develop unevenly and idiosyncratically. (This is a feature of all software development projects, not just CDK.) In my experience, it's not uncommon for a CDK code base to be significantly longer and more complex than the CFN template it produces, and five to ten times longer than the corresponding Terraform code that generates the same set of resources.

Terraform's lack of "full" language features is actually an advantage in this respect, because it encourages simplicity of structure. Everything eventually needs to boil down to a declaration of resources and their attributes. Sure, there's scope to divide modules up differently, or to use different functions which result in the same array or string manipulations, but largely Terraform is about setting up data structures and mapping them to resource definitions, and there are usually only one or two good ways to skin that cat.
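
The shape of that work (a data structure expanded into resource definitions) can be sketched in Python terms. This is only an analogy for the pattern, not Terraform code: the expand function is hypothetical, the resource addresses are illustrative, and the AMI ID is a placeholder.

```python
# The data structure: one entry per instance, holding only what differs.
instances = {
    "web": {"instance_type": "t3.micro"},
    "worker": {"instance_type": "t3.small", "monitoring": True},
}

# Attributes shared by every instance unless overridden above.
defaults = {"ami": "ami-00000000", "monitoring": False}


def expand(resource_type, definitions, shared):
    """Map each named entry to a full resource definition."""
    return {
        f"{resource_type}.{name}": {**shared, **attrs}
        for name, attrs in definitions.items()
    }


resources = expand("aws_instance", instances, defaults)
for address, attrs in sorted(resources.items()):
    print(address, attrs)
```

There is little room for cleverness here: the data goes in, a flat set of declarations comes out, and a reader can see both at a glance.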

Where to from here?

In conclusion, I'd like to offer some observations about what might mitigate the concerns I've listed above sufficiently for me to change my view of CDK in the future.

  1. A CDK implementation which does not require any language transpiling, but allows developers to use only their preferred language's native bindings, would be all but essential. The TypeScript/JavaScript/Node.js/npm ecosystem is very unattractive to those not brought up in that particular religious tradition, and a Python CDK version which allowed one to eliminate it (and the hundreds of megabytes of overhead it adds to container images) would be very welcome.

    I rate the probability of AWS or the CDK community doing this as low.

  2. Likewise, a CDK implementation which eliminated the dependency on CFN and directly manipulated resources through the AWS APIs would mitigate many of my operational concerns.

    Given that this would require CDK to manage its own state and transaction system, I rate the probability of this occurring as very low.

  3. AWS could and should change the operational model of CFN to make the code the source of truth, and implement true automatic drift remediation.

    This seems the most viable of the changes I've mentioned so far, and I rate the probability of AWS doing this in the next 2-3 years as medium-low.

  4. CDK could reduce the number of ways in which developers can interact with it, and encourage the use of a more limited range of idioms in the languages it supports.

    This seems to correspond reasonably well with the work CDK has already been doing in the area of higher-level constructs, so I rate its probability as medium-low, although the extent to which it could discourage the growth of complexity in CDK code bases would be very limited.

In the meantime, I'll be hanging out over in #sig-terraform on our work Slack, hoping others will feel the same. 😃