I have really enjoyed Micah’s recent posts, “Embracing NoOps” and “Are you really ready for serverless?” We share a firm belief that all code is a liability, and that includes “infrastructure as code” code. Of course there is a lot more to it than that. As the saying goes, our job as software developers is to manage complexity. DevOps is all about finding the most efficient way that application developers, operations specialists and quality assurance engineers can cooperate to share the complexity needed to create, host and deploy high quality features to users.

I recently had an opportunity to do an elaborate proof of concept for a client with these requirements:

  • We want to move from on-premises servers to “the cloud”
  • Our cloud hosting provider is AWS
  • We prefer managed services over running servers or containers
  • We think we will need an ETL (or ELT) data ingestion pipeline
  • Here is one data source that can POST XML data to an endpoint
  • Go!

I spent a challenging six months working with a team of about four developers writing a data ingestion pipeline using AWS managed services such as Lambda, Kinesis, Firehose, Glue and Athena. We put it all together using Terraform, PySpark and Python, and used CircleCI to implement a build pipeline.

We had no dedicated QA folks. We had little to no support from an infrastructure or operations team, much less an expert in AWS managed services. Our team was lucky to include a Senior Engineer direct hire at this client with previous experience. We started with their “napkin drawing” and a bunch of humans willing to learn, and together we wrote some code that made things happen.

This kind of implementation really blurs the line between application code and infrastructure. On Tuesday you need three S3 buckets and one Lambda function. On Thursday you need five buckets and two Lambdas. You deploy some ETL scripts and push up some schemas for temporary tables on Friday and call it a week well spent.

As it turns out, our client has moved towards a different model. For their particular context they decided that it would not be a great idea to have application developers responsible for everything. They are making a big departure from their previous “ticket based” internal process for requesting things like a new application server or database instance. The goal is “self service” infrastructure for their development teams.

To accomplish this goal they set up a separate team of operations specialists building tools, managing third-party accounts, and providing example implementations that bake in desired best practices. Within certain boundaries application teams can write build pipelines and Kubernetes deployment files to create and deploy the resources they need to meet their requirements.

All of us involved in that data pipeline proof of concept project learned a lot and have been able to leverage what we learned to be empathetic and knowledgeable consumers of operations services and code. Not to mention that we are so glad we don’t have to do everything!

Our data pipeline project really challenged my “simpler is better” principles. I formed some opinions and hypotheses that I will carry with me to my next adventure in DevOps.

In this post I will share some ideas that might interest leaders in engineering organizations. I’ll also discuss some principles and practices that might interest software developers, QA folks and people in operations. During this client project we “learned how to learn” about building AWS infrastructure with Terraform. I’ll talk a bit about that workflow and then go into even more detail with some example code and discussions of specific implementation details.

Roles and responsibilities (“people stuff”)

As a leader in an engineering organization you make decisions about how to divvy up responsibilities and organize teams of people. Of course, at a high level your goal is to efficiently deliver features to market with speed and quality. As with so many things, the devil is in the details.

DevOps at its best is about harmony between operations, development and QA teams implementing useful features with an engaged product team. It is a shared philosophy, a way of doing things, not a job title.

It is pretty easy to say that for small teams the best scenario is that all your developers understand your whole stack from application code through to infrastructure and fill in the role of a QA team as well. It is also easy enough to assert that for larger organizations there is efficiency to be gained by having specialized teams for application development, operations and QA. Every other scenario requires balancing many priorities and weighing the pros and cons carefully for your specific needs and available team.

A larger team will have more specialist roles, each with a very specific set of skills. But that same larger team needs to adopt the shared philosophy that deployment is everyone’s responsibility. To wit, an engineer working on a microservice might not know much about what say, a front end engineer is building. But everyone on those teams should all have a sense of urgency and ownership about the state of the world. That state of the world is your infrastructure and the health of your applications.
  — Micah Adams

Considering how hard it is to hire and retain people in our industry, it makes good sense to make the most of the people who already work for you. Find application developers who are interested in how their code is deployed and care about quality, QA people who like writing code for automated tests, and folks who were hired to maintain infrastructure but ask a lot of questions about the details of the application code that runs on “their” servers.

There are no rules—you can structure your team or teams in a way that matches the available people you have. You can also cross-train folks, let people try out new roles and change their minds. Everyone wants to do good work. As long as leadership lets people know that their efforts are appreciated and that all the aspects that make up the DevOps soup are valuable to your organization, your employees will respond positively.

That said, clear boundaries and responsibilities, both between individual humans in your organization and between teams or departments, are critical. When “iterating on process”, I have seen very smart folks become confused and distraught in short order. Clear communication and an appropriate level of documentation are essential.

It is also important to keep in mind the trope about aligning agency and responsibility. Knowledge workers, really workers of all kinds, respond best when they have the ability to control the outcomes that their managers are asking them to produce. This applies to teams and departments as much as to the staff who fill these roles.

This might seem obvious, but in the trenches of information technology it is pretty easy for this to get lost or twisted up. When a team of application developers responsible for a DevOps gray area—for example writing configuration for a new CI/CD platform—are late for a deployment because of changes made by an operations team that does not participate in their daily stand up, things can quickly devolve into finger pointing.

This is where management needs to trust that all involved are doing their best to deliver features to production. It is also very helpful if a team feels like the first line of engineering management is doing an effective job of explaining “what went wrong” to the less technically savvy leadership in your organization. At the client in question I was fortunate to work with management folks and representatives of their architecture team that did a great job of both praising the team when things went well and helping to convey credible technical details to our stakeholders when things went pear-shaped.

An AWS development workflow

Referring again to our previous articles touting a NoOps approach to deploying application code, what are you supposed to do if you have been given requirements to use a disparate selection of AWS managed services to achieve some technical goal? There is no doubt that AWS has an amazing and feature-rich web interface and great documentation for many of their services.

As a developer who knows that “simpler is better”, one who might refer to the YAGNI principle, you can stick to your principles by keeping in mind that your goal is to set up repeatable processes to deliver features to production. When it comes to actually creating and managing AWS services, you can treat that shiny AWS web console as a “read-only” view of your infrastructure, useful for troubleshooting and validation only.

Where the AWS Web Console really shines is providing easy access to logs for developer debugging and reports and alerts for performance monitoring that give QA and technical leadership insight into their production applications. Dashboards can be created displaying any number of useful metrics. Keep in mind the goal is a repeatable process. If it takes you more than two or three clicks to create a dashboard for an executive stakeholder, I would suggest looking at Terraform and setting up a build pipeline for that reporting.

Of course it is possible to set up complex AWS IAM Roles and credentials that only allow specific users to take specific actions via the web console. For small teams and folks just starting out I recommend having fairly flat permissions and trusting your team—with the aid of documentation and versioned CI/CD configuration—to ensure that bad things don’t happen. For larger organizations, or situations that require enhanced security, AWS IAM has a solution for any scenario you can imagine. Check out Micah’s article for more details on this and other considerations.

When learning how to deploy a new AWS service, start by following the official AWS blog articles and documentation and build things using the AWS web console. Most of the AWS documentation is not specific to Terraform or any other third-party “infrastructure as code” implementation. Once you have a system that meets most of your requirements, go straight to the Terraform documentation and build the same resources using a set of Terraform scripts. Keep iterating until the requirements can be met by the scripted version and carefully delete the resources you created by hand.

Suggested mindset

  • Always keep in mind that the goal is a reproducible process that is as automated as possible.
  • It is OK to use human checklists. Weigh the cost/benefit in terms of complexity for any decision point that could be moved from a checklist to a build process.
  • Even though Amazon is a huge company, don’t assume “everything they do must be correct”. Use your own experience and judgment. As an example, the auto-generated Python Glue scripts do not follow PEP 8, and are not structured in any way to enable automated testing.
  • Even if you are in a context where you have been told to “prefer managed solutions”, keep in mind that Terraform and other scripting solutions make this a matter of degrees. For example, you can often use a carefully managed container to lift up and deploy existing application code while moving to managed serverless database or file storage solutions.

Practices that worked for us

  • When learning new things, we used whatever docs we could find, starting with official AWS documentation and blog articles and then heading out to the internet at large. This usually entailed following checklists and building things “manually” using the AWS web console.
  • When using the AWS web console, give resources a name whenever you have a chance, and include an environment and some other unique text. The default configurations will often make resources on your behalf, so it makes cleaning up “manually” created resources much easier if you can identify them later.
  • The AWS command line is a good way to compare resources created using the web console with resources created using Terraform. When using Terraform and the AWS provider, at a certain level the keys and values in the configuration objects exactly match what the official AWS client returns.
  • You can use the Terraform local-exec provisioner if you need to call a custom script from within your Terraform scripts. In fact, you may risk causing a rift in space and time, but you can even call the AWS command line as a local-exec command if you need to accomplish something that Terraform does not implement.
  • We kept checklists for one time operations in versioned markdown files in the repository.
  • I don’t always write bash scripts, but when I do I use Unofficial Strict Mode and shellcheck.

Some nitty (and/or gritty) details

Of course I can’t share client code with you fine people, so I wrote a sandbox repository that uses some of the practices we learned. This is not exactly “how to” code, but I intend for it to embody the principles and mindset we adopted while working on that data pipeline project.

Some example code you can run yourself!

The pretend requirements for my sandbox project would be something like:

  • We want to move from on-premises servers to the cloud
  • Our cloud hosting provider is AWS
  • We prefer managed services over running servers or containers
  • We need one API endpoint
  • A user can pass in a list of words and the end point will return words that rhyme
  • Go!

If you want to learn about AWS services and Terraform, I encourage you to take advantage of the AWS free one-year trial. Of course “free” gets major air quotes, so please read the fine print.

If you follow the tips and tricks below to create things on AWS you’ll have a much better chance of a free year that is actually free or at least cheap. The project linked below doesn’t make anything very expensive, and everything can be torn down with the Terraform client.

A repo with example code you can run yourself

A lot of this article discusses possible efficiencies to be gained by carefully balancing responsibilities between operations, application developers and QA. Note that the mindset for the proof of concept I built with this recent client was to see what happens when developers are responsible for everything.

This sample code repository is built with that same strategy in mind, but with no judgment about whether or not this is always a good idea.

I told you there would be details …

For a brand new project put a terse checklist in a README.md in the project folder. This might start out as just a single code block with a bunch of shell commands.


# build some stuff (assumes the Lambda source lives in ./lambda/src)
rm -rf ./build
mkdir -p ./build
cp ./lambda/src/*.js ./build
cp ./lambda/package.json ./build
cp ./lambda/package-lock.json ./build
# run in a subshell so we stay in the project root
(cd ./build && npm install)

# make a zip file we can deploy
rm -rf ./tmp
mkdir -p ./tmp
(cd ./build && zip --recurse-paths ../tmp/sandbox.zip .)

# push some code up to our lambda
aws lambda update-function-code \
  --function-name "test-lambda" \
  --zip-file "fileb://./tmp/sandbox.zip" \
  --publish

As the project matures and more scripts and tooling are added, keep the README up to date.

source ./.env
./bin/build
./bin/stage_lambda_zip "$LAMBDA_DEPLOY_BUCKET" "$CI_ENV" "$BUILD_ID"
cd infrastructure
terraform workspace select "$CI_ENV" || terraform workspace new "$CI_ENV"
terraform plan
terraform apply

Eventually, these will be pulled in to the configuration for whatever CI/CD platform you are using. These are snippets from the CircleCI config in my sandbox project.

# ...
- run:
    name: Build code
    command: bin/build
# ...
- run:
    name: Move build folder back where it should be
    command: mv /tmp/workspace/build ./
- run:
    name: Stage lambda code in s3 bucket
    command: ./bin/stage_lambda_zip "$LAMBDA_DEPLOY_BUCKET" "$CI_ENV" "$BUILD_ID"
# ...
- run:
    name: Switch to or create workspace
    command: terraform workspace select "${CI_ENV}" || terraform workspace new "${CI_ENV}"
    working_directory: infrastructure
- run:
    name: Validate Terraform config
    command: terraform validate
    working_directory: infrastructure
- run:
    name: Save Terraform Plan
    command: terraform plan --input=false --out=./circleci.tfplan
    working_directory: infrastructure
- run:
    name: Apply Terraform Plan
    command: terraform apply --input=false ./circleci.tfplan
    working_directory: infrastructure

Projects that leverage managed cloud services are polyglot by nature, so there may not be an obvious choice of general scripting language for CI/CD and developer productivity. You could choose to use rake or npm scripts, or to build a custom command line client in Python. There are a lot of good options, but my favorite minimalist implementation is to use bash scripts. For the majority of applications destined for a Linux or similar environment, bash is a lingua franca of CI/CD, and it runs well on macOS or Linux developer workstations.

#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

DEPLOY_BUCKET=${1:-}
CI_ENV=${2:-}
BUILD_ID=${3:-}

if [ -z "${DEPLOY_BUCKET}" ] || [ -z "${CI_ENV}" ] || [ -z "${BUILD_ID}" ]; then
  echo "Usage:"
  echo "stage_lambda_zip <deploy_bucket> <environment> <build_id>"
  exit 1
fi

DIR=$(cd "$(dirname "${BASH_SOURCE[0]}")"/../ && pwd)

BUILD_DIR="${DIR}/build"

TMP_DIR="${DIR}/tmp"

rm -rf "${TMP_DIR}"

mkdir -p "${TMP_DIR}"

cd "${BUILD_DIR}" && zip --recurse-paths "${TMP_DIR}/sandbox.zip" .

aws s3 cp "${TMP_DIR}/sandbox.zip" "s3://${DEPLOY_BUCKET}/aws-api-sandbox/${CI_ENV}/sandbox-${BUILD_ID}.zip"

It is very productive to have the building blocks of your remote CI/CD scripting available to be run locally in bite size pieces. This makes it nice when you are just getting started and allows for incremental improvements as you iterate on your build.

In combination with some simple shell scripts, a versioned .env.example file that documents a small set of environment variables makes it easy to go from local copy/paste to a scripted command for your remote CI/CD provider.

export CI_ENV="development"
export BUILD_ID="some-unique-id"
export LAMBDA_DEPLOY_BUCKET="some-bucket"
export TF_VAR_lambda_deploy_bucket="$LAMBDA_DEPLOY_BUCKET"
export TF_VAR_build_id="$BUILD_ID"

Since the suggested environment variables documented in our .env.example are also being used in the CI/CD pipeline, it is easy to move back and forth between running these locally and running them in CI. This is super nice when creating or modifying the build pipeline itself.

- run:
    name: Stage lambda code in s3 bucket
    command: ./bin/stage_lambda_zip "$LAMBDA_DEPLOY_BUCKET" "$CI_ENV" "$BUILD_ID"

Whatever implementation you choose for build scripts and scripts needed for local development, consistency and the right amount of documentation are key.

Once you have a working CI/CD pipeline in place, that scripting and configuration—in combination with a set of versioned scripts—should become authoritative for the build process. The manual build and deploy instructions should be deleted from the README to avoid duplicate documentation.

Although the configuration for your CI/CD pipeline is authoritative, it is also fine to have some scripts that exist only for developer productivity. For example, in the sandbox repo I have linked, there are instructions and a small script that allows a developer to deploy a new version of the AWS Lambda code without using the full build process. This really shortens the feedback cycle when working on aspects of the script that cannot be run locally.

# set up your environment
source ./.env

# build
./bin/build

# deploy new version without using Terraform
# this should only be used during feature development
./bin/deploy_lambda "api-sandbox-${CI_ENV}"

Think about the AWS services the same way you would think about a third-party library. Find clear seams between code that can only be run on AWS servers and code that you can test in an isolated manner.

In the case of the Lambda function in my sandbox repo, I decided that one seam is right here in the main script:

const { multiValueQueryStringParameters: params, path, httpMethod } = event

At any point, AWS might change the event that they pass to a Lambda function such that multiValueQueryStringParameters is no longer available or changes in some way. That is essentially a dependency on some third-party code I don’t control. That dependency can be tested with our end-to-end test.

Deciding on this seam means that I can control and unit test everything below that point. I decided that this makes a nice interface for my controller.

controller()(httpMethod, path, params)

Notice that params and path are abstract concepts that have nothing really to do with the fact that this will be executed by AWS. I was able to choose those variable names and the signature of the function. I was even able to control the fact that my controller method returns a Promise.
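To make the seam concrete, here is a minimal sketch of how the entry point and controller might be split. It is not the code from the sandbox repo: the route table, findRhymes, lambdaHandler, and the idea of injecting routes into controller are all hypothetical names and choices made for illustration, and the event shape is assumed to match the destructuring shown above.

```javascript
// Hypothetical rhyme lookup; a real one would consult a dictionary or service.
const findRhymes = (words) =>
  Promise.resolve(words.map((word) => ({ word, rhymes: [] })))

// Hypothetical route table, keyed by "<METHOD> <path>".
const defaultRoutes = {
  'GET /rhymes': (params) =>
    // Assumption: params may be null when no query string is sent.
    findRhymes((params && params.word) || []).then((results) => ({
      statusCode: 200,
      body: JSON.stringify(results)
    }))
}

// Everything below the seam: no AWS types, just a method, a path and params.
// Routes are injectable so unit tests can pass in fakes.
const controller = (routes = defaultRoutes) => (httpMethod, path, params) => {
  const route = routes[`${httpMethod} ${path}`]
  if (!route) {
    return Promise.resolve({
      statusCode: 404,
      body: JSON.stringify({ error: 'Not found' })
    })
  }
  return route(params)
}

// The seam itself: the only place that knows the shape of the AWS event.
const lambdaHandler = (event) => {
  const { multiValueQueryStringParameters: params, path, httpMethod } = event
  return controller()(httpMethod, path, params)
}
```

With a split like this, a unit test can call controller(fakeRoutes)('GET', '/ping', {}) without any AWS event at all, while the end-to-end test is the only one that depends on what AWS actually sends.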

There is of course another seam in the structure of the payload passed to the callback provided by AWS. We are making another assumption that can only be reliably tested with an integration or e2e test.

return handler(params).then(({ statusCode, body }) => {
  return {
    statusCode,
    headers: {
      'Access-Control-Allow-Origin': '*',
      'Access-Control-Allow-Credentials': true,
      'Content-Type': 'application/json; charset=utf-8'
    },
    body
  }
})

If this were a much more complicated application, I might isolate the dependency on that schema into a separate function or some other bit of code. However, all these choices balance a risk of the third-party service or library changing frequently enough to cause changes in your code versus the complexity needed to isolate your code from that possible change.
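As a sketch of what that isolation could look like, the response shape might be pulled into one small function. The name toProxyResponse is hypothetical, and the headers are simply copied from the snippet above rather than being a recommendation.

```javascript
// Hypothetical helper: the one place that knows the response shape the
// Lambda integration is assumed to expect (statusCode, headers, string body).
const toProxyResponse = ({ statusCode, body }) => ({
  statusCode,
  headers: {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Credentials': true,
    'Content-Type': 'application/json; charset=utf-8'
  },
  body
})

// The calling code then no longer mentions headers at all, for example:
//   return handler(params).then(toProxyResponse)
```

If AWS ever changed the expected payload, only this function and the end-to-end test would need to change. Whether that is worth an extra function in a one-endpoint project is exactly the trade-off described above.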

For a project like the AWS Glue and Kinesis data pipeline that I worked on for my client, or for something like this sandbox repo that implements a single end point served by an AWS Lambda function, the balance of what might be considered application code versus infrastructure is weighted heavily towards infrastructure. For a project like this, Terraform makes it possible to create your AWS infrastructure with each build.

The Terraform workspace feature lets you isolate the AWS resources created for one deployed environment from any other. During the height of our development on that data pipeline, we had three pairs of developers working in parallel on different features, all sharing the same AWS account.

Because Terraform is declarative—and does a pretty efficient job of querying AWS to find out which requested resources already exist—build times are very reasonable even when working with large and complicated things like Aurora RDS or an EC2 virtual machine instance. Of course, on a specific build where you add or remove something that takes 10 minutes to deploy, things are slower. But the next build where you assert that an RDS with such and such a configuration already exists and you just need to deploy some new scripts or create something lightweight like an S3 bucket, your builds really fly.

The build pipeline in my sandbox repo allows for this same development technique. The summary of how the build pipeline works gives an idea of the workflow.

CI/CD

This project is set up to be tested and deployed using CircleCI.

The configuration in .circleci/config.yml and the simple bash scripts in bin are the source of truth for the build and deploy process.

For troubleshooting, local testing, and modifying the build process, you can populate the environment variables in the .env files and run individual scripts.

Details

  • Unit tests and linting will be run with every push to any branch
  • Completed pull requests (really all merges into develop) will be deployed to an environment called dev
  • You can choose to deploy a feature branch by pushing a tag that starts with “feature-” plus one or more other characters; your branch name should only contain lowercase characters and dashes
  • Feature deployments will use the exact tag as the name for the environment

Suggested mindset

  • Seek a quick feedback cycle for development of build and deploy scripting.
  • Seek a quick feedback cycle for development of the application code and local testing.
  • Use your instincts to pick a seam between the AWS managed services and your code for testing, and be consistent about it. SAFE tests only need to do enough to cross that boundary; more isolated tests that can be run locally start at the seam.
  • With some managed services (I am looking at you, AWS Glue!) it is very difficult to find that seam—very little can be run locally. Sometimes the best you can do is to orient your code base to leverage your own custom libraries that are unit testable.
  • Value using environment variables to specify difference between lower envs and production during build and at run time—as opposed to having a named configuration file or folders of Terraform per environment.
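At run time, the last point in the list above can be as simple as one function that reads configuration from the environment. This is a sketch only: loadConfig and RHYME_RESULT_LIMIT are hypothetical, and although the sandbox build scripts use CI_ENV, having the Lambda read it at run time is an assumption made for illustration.

```javascript
// Hypothetical run-time configuration loader. The same code runs in every
// environment; only the environment variables differ between deployments.
const loadConfig = (env = process.env) => {
  const environment = env.CI_ENV
  if (!environment) {
    throw new Error('CI_ENV must be set')
  }

  // RHYME_RESULT_LIMIT is an invented setting, defaulting to 10.
  const resultLimit = Number.parseInt(env.RHYME_RESULT_LIMIT || '10', 10)
  if (Number.isNaN(resultLimit)) {
    throw new Error('RHYME_RESULT_LIMIT must be an integer')
  }

  return {
    environment,
    resultLimit,
    isProduction: environment === 'production'
  }
}
```

Because the environment is passed in as an argument, a unit test can call loadConfig({ CI_ENV: 'development' }) without touching the real process environment.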

Practices that worked for us

  • Use the Terraform workspace feature.
  • Terraform resources will accept a count argument. It is very common to see repositories of Terraform code with folders named for different environments. You can avoid this by using environment variables with counts. In combination with the workspace feature you have everything you need to follow the “Twelve Factor” principle of having “one code base, many deployments.”
  • Find a clear seam between code that can only be run remotely in AWS and code that can be run locally. This seam will help you structure end-to-end automated tests, integration tests and isolated unit tests that can be run locally and offline.
  • AWS Glue is a great example. There is actually no way to run their proprietary system locally. It would still be valuable to set up shared libraries that can be tested in isolation. It is possible to write integration tests using what they refer to as a “Dev Endpoint” but the code is still running on AWS and the service is not cheap.
  • Interpolate the workspace name into the names of the resources you create. Those names will be visible in the AWS web console, which helps with debugging and troubleshooting.
  • Write simple bash scripts and put them in a bin folder at the top level of your repo, or in each “sub-project” folder as needed.
  • Tag every resource that can be tagged. The combination of these two tags will help when making AWS billing reports:
    • Make a tag for “Environment” with the value of the Terraform workspace
    • A tag for “cost center” or “application” is also a good idea

Cleaning up after yourself

One of the great benefits of using Terraform and other “Infrastructure as Code” frameworks is the ability to have a fully repeatable process.

For some client build pipelines I have written we include a step in CircleCI that can be run to delete all the created resources. I didn’t go that far for my sandbox repo, but there are instructions in the README.md that explain how to delete everything created during the build of a given environment.

To clean up after yourself, follow these steps.

Copy .env.example to .env and edit appropriately.

If you exactly follow these instructions, the environment named in $CI_ENV will be destroyed. Be sure to read the confirmation message from Terraform before you type yes.

# set up your environment
source ./.env

# destroy a given environment
cd infrastructure
terraform workspace select "$CI_ENV" || terraform workspace new "$CI_ENV"
terraform destroy
# read the confirmation message and follow the instructions

Wow! Thanks for reading this. (That was a lot, right?)

I hope that some of these practices will help you if you are working with AWS and Terraform. The structure of this code, and the way that Terraform and the build pipeline work together, could be applied to almost any AWS service. The principles and ideas should also apply to any other “Infrastructure as Code” implementation.

The point of referencing my sandbox repo is to show the very simplest implementation that I could execute in order to achieve the dream of serverless code in the cloud. Being that everything is about trade-offs, if you look at that repo and say to yourself, “that isn’t really very simple, maybe we should stick with Heroku until we have more traffic,” then my job is done.

On the other hand, if you really need something that AWS provides, and you have done the math to know that it will be much cheaper to host it with them, this example code might give you some ideas to help you and your team balance the risk and reward equation for yourselves. There is no doubt that on-demand pricing, or avoiding taking on the operations complexity of hosting something like Spark and Hadoop, can be very attractive.

I urge you to look not just at the dollar costs of hosting but to include in your equation the required expertise of your team members. It is not just about expertise or knowledge, really it is about how much complexity you will have to manage, and who has to manage what. DevOps being a cooperative effort between your application developers, your operations staff and your QA engineers, it is also worth thinking carefully about where the weight of that complexity will fall for any given decision.

Jason Grosz

Status: Double Agent
Code Name: Agent 0028
Location: Middleton, WI