Diving into DevOps details

Search for a definition of DevOps and you're likely to find something involving a collection of other buzzwords. We think we can do better.

The basic idea behind Dev and Ops is to get the two roles working together. This sounds obvious, but think for a moment about how the roles have traditionally been implemented. Operations is responsible for uptime and reliability; the simplest way to keep systems up and running is to lock 'em down and prevent change. The job of a software developer is to create change. From the beginning, the incentives for one role are misaligned with the other. The first part of DevOps, the very inkling of an idea, is to break down the walls between the roles.

Here's an example.

For years, decades even, system administrators have been writing little scripts in bash, Perl, awk and sed to automate repetitive operation and setup tasks. These scripts are code, yet the code the sysadmins write follows none of the practices the programmers have to follow. It likely has no requirements, no formal test process, no deploy process ... it might not even be in source code control. With no standards for the work, the code is likely in a different programming language than the production code. It may even do things the programmers would like to reuse, but they don't know it exists. If the roles were actually working together, instead of at odds, they could share knowledge, practices and even the codebase.

Automating the repetitive sysadmin tasks is something that programmers are uniquely qualified for. And it turns out to be a classic starting point for DevOps.

DevOps as concept

Builds. Deploys. Rollouts. Rerunning the unit tests. Integration and automated end-to-end checks. Creating virtual machines. All these are straightforward, definable business processes, so well-defined that sysadmins often have runbooks, FAQs and wiki pages from which they can copy, paste and replace variables to do the daily work.

Programmers automate things; that's what they do. So why not combine the role of developer (automation) with operations (deploys and maintenance)? Overlapping these roles, automating the parts of operations and testing that make sense, is core to the idea of DevOps. It also runs counter to the conventional ideas in IT and process engineering, where terms like "separation of duties" would imply DevOps is a bad thing. Then again, all those Waterfall process documents said something like Scrum or agile software development would be a bad thing, too. So there's that.

Two ways to do it

One approach is to simply rotate developers through operations for a few weeks to a few months. The theory goes that the programmer-in-ops is going to check the existing codebase into the right version control, automate routine tasks in a smart way, create reusable code libraries, replace the installation of physical machines with virtual machines created through code and so on. There's certainly some immediate, easy gain from this, but it seems likely that programmers in operations will be crushed by either the workload or the policies, and that actual DevOps adoption will be limited.

Another way to do it is to increase the support responsibilities of the development team. For example, a programming team could take over all responsibility, from build to deploy to support, rotating a programmer through support, two weeks at a time. This would give the developers broad exposure to how the codebase works from the outside, and also help them feel the pain of supporting their own work. (A smaller operations group is still needed to support some infrastructure, like the production servers and databases.)

Automating the build

Continuous Integration (CI) tools are probably the most popular and widely used of the tools discussed here. Some companies continue to perform builds, but the potential for automated builds goes beyond the compile step. It can include logging and tagging builds according to branch, storing the exact build in an archive connected to the commit number and branch, and, possibly, connecting builds to features.
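As a minimal sketch of connecting an archived build to its branch and commit, the Python function below derives an archive file name from those values. The naming scheme, the function name archive_name and the sample project are assumptions made for illustration; they are not a convention of any particular CI tool.

```python
import re


def archive_name(project, branch, commit, build_number):
    """Build an archive file name that records branch, commit and build number."""
    # Branch names such as "feature/login" contain characters that are awkward
    # in file names, so replace each run of unusual characters with a hyphen.
    safe_branch = re.sub(r"[^A-Za-z0-9._-]+", "-", branch).strip("-")
    short_commit = commit[:8]  # abbreviated commit identifier
    return f"{project}-{safe_branch}-{short_commit}-b{build_number}.tar.gz"


# Hypothetical project, branch and commit values.
name = archive_name("shop", "feature/new-checkout", "3f9c2a1b7d4e", 128)
print(name)
```

With a name like this, anyone holding an archive can trace it back to the exact commit and branch that produced it.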

Once the build is complete, the CI tool can also run all the unit tests, every time, and then actually deploy the software to a development, test or staging server. CI tools can then run web services or GUI end-to-end checks to see if something major failed with the new code. This is not continuous delivery; the code is not automatically pushed to production on every commit. (Some organizations build a new, fresh virtual server from scratch for each deploy, instead of "upgrading" a single development environment; some create multiple virtual environments, one or more per programmer.)
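The sequence of build, unit tests, staging deploy and end-to-end checks can be sketched as a list of stages that stops at the first failure. This is only an illustration: the run_pipeline function and the stand-in steps are hypothetical, and real CI tools define their stages in their own configuration formats.

```python
def run_pipeline(stages):
    """Run (name, step) pairs in order and stop at the first failing step.

    Each step is a callable that returns True on success. The result is a
    tuple: (names of the stages that passed, name of the failed stage or None).
    """
    passed = []
    for name, step in stages:
        if not step():
            return passed, name
        passed.append(name)
    return passed, None


# Stand-in steps; a real pipeline would call the compiler, test runner, etc.
stages = [
    ("build", lambda: True),
    ("unit tests", lambda: True),
    ("deploy to staging", lambda: True),
    ("end-to-end checks", lambda: False),  # simulate a failing check
]
passed, failed = run_pipeline(stages)
print(passed, failed)
```

Stopping at the first failure keeps a broken build from reaching later environments, which is the point of running the checks on every build.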

If the damage a defect can do is a function of its severity and how many people see it, then finding the problem soon and fixing it fast can limit the damage. Finding the problem when it has been exposed only to a small percentage of users is not always possible, but visual monitoring and alerting can help.

There's a great deal of information in server logs, from how long requests take to serve to how many 404 and 500 errors the system is experiencing. By visualizing those errors in a graph, perhaps using open source tools like Graphite, the team can see problems as they happen and take action.
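Before anything can be graphed, the numbers have to be pulled out of the logs. The sketch below counts 404 and 500 responses and averages request time. It assumes a simplified, made-up log format of "METHOD path status seconds"; real web server logs differ, and the step of sending the totals to a graphing tool is left out.

```python
import re
from collections import Counter

# Assumed log line format (hypothetical): "<METHOD> <path> <status> <seconds>"
LINE_RE = re.compile(
    r"^(?P<method>\S+) (?P<path>\S+) (?P<status>\d{3}) (?P<seconds>\d+(?:\.\d+)?)$"
)


def summarize(lines):
    """Count responses by status code and compute the mean request time."""
    statuses = Counter()
    total_seconds = 0.0
    parsed = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        statuses[int(match["status"])] += 1
        total_seconds += float(match["seconds"])
        parsed += 1
    return {
        "requests": parsed,
        "not_found": statuses[404],
        "server_errors": statuses[500],
        "mean_seconds": total_seconds / parsed if parsed else 0.0,
    }


sample = [
    "GET /index.html 200 0.120",
    "GET /missing 404 0.030",
    "POST /checkout 500 1.500",
    "not a log line",
]
summary = summarize(sample)
print(summary)
```

Run on a schedule, a summary like this gives the team the data points a graph or an alert threshold needs.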

Once the work is feature complete, the steps of a production rollout can be automated and put behind a Web page or at least a command-line interface. When the CI server does the next build, the continuous deployment system just needs to go one step further, to production.

Push-button deploys lead to push-button fixes, which combine with production monitoring to drastically reduce the time-to-live (TTL) of a defect.

Configuration flags

Config flags are a mechanism to reduce time-to-live even further. The programmer wraps the new code in an "if (feature) { }" block. To roll the feature back, instead of needing to change the production code, the programmer changes a file on disk that stores the flag as on or off. As this is not a code change, it doesn't require a new build/deploy. Noah Sussman's "Config Flags: A Love Story" covers the concept in much more depth.
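Here is a minimal sketch of the flag check, assuming the flags live in a JSON file on disk. The file name, the flag name new_checkout and both functions are hypothetical, chosen only to show the mechanism.

```python
import json
import os
import tempfile


def feature_enabled(flag_file, name):
    """Return True only if the flag file exists and sets the flag to true."""
    try:
        with open(flag_file, encoding="utf-8") as handle:
            flags = json.load(handle)
    except (OSError, json.JSONDecodeError):
        return False  # unreadable or invalid file: treat the feature as off
    return isinstance(flags, dict) and flags.get(name) is True


def checkout(flag_file):
    # The equivalent of the "if (feature) { }" block around the new code.
    if feature_enabled(flag_file, "new_checkout"):
        return "new checkout flow"
    return "old checkout flow"


# Demonstration: flipping the value in the file changes behavior, no rebuild.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "flags.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"new_checkout": True}, handle)
    before = checkout(path)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"new_checkout": False}, handle)
    after = checkout(path)

print(before, "->", after)
```

Treating a missing or unreadable file as "off" is a design choice in this sketch: if the flag store breaks, the system falls back to the old, known behavior.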

Imagine a config flag that is further stratified by type of user, so that the functionality runs only for employees, then for families of employees, then for beta and other customers willing to take risk, then for regular customers, then for "enterprise" and risk-averse customers. The team can roll out a new feature gradually, monitoring the system and asking for frequent user feedback before rolling the feature to a wide user base. While implementations vary, one version of this strategy, called "Testing In Production," is described in some detail in the 2008 book "How We Test Software at Microsoft."
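One way to picture the stratified flag is as an ordered list of audiences, with the rollout stage marking how far down the list the feature has gone. The tier names and the feature_visible function below are assumptions for illustration, not taken from any specific implementation.

```python
# Audiences ordered from most to least tolerant of risk, following the text.
ROLLOUT_ORDER = ["employee", "family", "beta", "regular", "enterprise"]
RANK = {tier: position for position, tier in enumerate(ROLLOUT_ORDER)}


def feature_visible(user_tier, rollout_stage):
    """True if the feature has been rolled out as far as this user's tier.

    rollout_stage is the last tier that should currently see the feature.
    An unknown user tier never sees it; an unknown rollout_stage raises
    KeyError, since that indicates a configuration mistake.
    """
    if user_tier not in RANK:
        return False
    return RANK[user_tier] <= RANK[rollout_stage]


# With the rollout at the "beta" stage:
print(feature_visible("employee", "beta"))    # employees see the feature
print(feature_visible("enterprise", "beta"))  # enterprise customers do not
```

Widening the rollout then means changing a single stored value, from "beta" to "regular" for example, rather than shipping new code.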

The ideas above are mostly concepts; there are many ways to implement them, both commercial and open source. One common tactic is to create a virtual server farm in a public or private cloud. New builds are created on new machines but not yet put into service; then the system switches the load balancer to the new machines. This creates a "flip" of service unknown to the user, and keeps the old system around, for at least a few minutes, in case something goes drastically wrong.
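The "flip" can be modeled in a few lines. This sketch is a toy, with a hypothetical LoadBalancer class and pool names; a real setup would call the cloud provider's or load balancer's own interface, which is not shown here.

```python
class LoadBalancer:
    """Toy model of a load balancer that points at one server pool at a time."""

    def __init__(self, live_pool):
        self.live_pool = live_pool
        self.standby_pool = None  # the old system, kept around after a flip

    def flip(self, new_pool, is_healthy):
        """Send traffic to new_pool if it passes the health check."""
        if not is_healthy(new_pool):
            return False  # keep serving from the current pool
        self.standby_pool = self.live_pool
        self.live_pool = new_pool
        return True

    def rollback(self):
        """Flip back to the old pool if something goes drastically wrong."""
        if self.standby_pool is None:
            raise RuntimeError("no previous pool to roll back to")
        self.live_pool, self.standby_pool = self.standby_pool, self.live_pool


balancer = LoadBalancer("build-41")
flipped = balancer.flip("build-42", is_healthy=lambda pool: True)
live_after_flip = balancer.live_pool
balancer.rollback()
live_after_rollback = balancer.live_pool
print(flipped, live_after_flip, live_after_rollback)
```

Because the old pool is only set aside, not destroyed, rolling back is as quick as the original flip.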

Configuring a set of machines, real or virtual, requires a great deal of scripting and automation. Two tools designed to help with this, and commonly associated with DevOps, are Chef and Puppet.

Perhaps the most disappointing thing to come out of the entire DevOps zeitgeist is the idea that DevOps is a new role, or that this creates a third "team," the DevOps team. While some developers may choose to do build/deploy automation or infrastructure work, the idea was to get the disciplines to work together, not to create yet another specialized role that takes care of details too obscure for other people to understand or care about.

The DevOps Test is a way to quickly assess the extent to which your team has embraced DevOps and what may be next. It's a quick and easy assessment, which means it's imperfect. Still, the test is a start. Give it a try.

The next step is up to you.