
Determinism, declarative configuration, and infrastructure-as-code: why they matter

I'm targeting laymen here, rather than people experienced in the field.

Old hat

Back in the day, you had a server. Solaris, Linux, Windows, BSD, whatever. That machine would run the services you needed directly off the host operating system. You need a database for payroll. You need to host some websites. You have to store a lot of data. This approach worked fine and still has its use cases in the real world.

But people realized this didn't always work. Migrating your setup to another system? Huge pain in the ass when it's currently running on bare metal. Demo-ing unstable releases of your software? Hard to do without breaking current workflows. Keeping your services isolated? Good luck with that when they all have direct access to most resources.

The virtual machine became a crucial tool in software/IT deployment everywhere. In a nutshell, virtual machines simulate a bare metal computer and work like one, with support for virtual networking devices and bridges, peripherals, and drivers. You can often pass through your real storage or network devices to them as well. Most servers became "hypervisors", running operating systems designed to run virtual machines where software would be deployed. This model is way less messy and makes it way easier to back up, deploy, share, and reproduce configuration. VMs still see extremely widespread usage in the industry and are not going anywhere.

Virtual machines still have a lot of the same problems as bare metal systems. For one, you still have to manually configure each one to be the way you want -- it just becomes safer to do so! Virtual machines also end up using a lot more resources than bare metal services at scale, since each one has its own kernel, drivers, and services. You're managing a couple dozen mini-computers!

Nowadays, VMs can be scripted with various tools. Savvy sysadmins have long relied on rugged bash, Perl, or Python scripts to set up their virtual machines, but it ultimately becomes a huge headache to roll these kinds of installer scripts out at scale.

Advances in the Linux kernel and a shifting mindset led to the widespread use of configuration management tools and containers.

Configuration management is exactly what it sounds like. You use a tool like Ansible or SaltStack, and it aims to "bring up" a system to a certain working state. An example goes something like this:

You have 12 servers in your small/medium-sized business. They should all be configured with certain firewall rules, and 3 of them should be set up with a web server. Doing this by hand is nearly impossible to keep track of, and you want to ensure they all reach a certain working state.

An equivalent to this would be sending your little sister out to go shopping for cake ingredients. If you tell her "install a cake", she might grab the needed resources... but maybe not. She might buy a cake. She might not get the flour. She might get too much flour! She could forget the eggs. She could get too many eggs! She has no recipe to follow and as a result her going shopping is a shot in the dark. You have no idea what the end result will be.

Now, if you give her a shopping list, you can approximate a lot of these things. She now knows to pick up milk, flour, etc. and if, say, 2% milk isn't there, she can even go as far as informing you they had to get another kind of milk. This lets you send your sister out to Wal-Mart with a little more confidence.

Using a CM tool, you script a "playbook" in a simple data format like YAML and configure the servers like so:

# Hosts file
[firewall_only]
host1
host2
host3
host4
#...etc.

[web]
host10
host11
host12

and then an actual playbook (Ansible for example):

---
- name: Configure firewall and optionally install nginx
  hosts: all
  become: true
  tasks:
    - name: Ensure firewall is installed
      ansible.builtin.package:
        name: ufw
        state: present
    
    - name: Allow HTTP traffic through the firewall
      ansible.builtin.command: ufw allow 80/tcp
      register: firewall_rule
      # ufw says "Rule added" / "Rules updated" on the first run and
      # "Skipping adding existing rule" on reruns
      changed_when: "'added' in firewall_rule.stdout or 'updated' in firewall_rule.stdout"

    - name: Enable firewall
      # --force skips ufw's interactive confirmation prompt
      ansible.builtin.command: ufw --force enable
      when: firewall_rule.changed

- name: Install and configure Nginx Web Server
  hosts: web
  become: true
  tasks:
    - name: Ensure nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx service is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true

Ansible YAML is a bit verbose and tedious, but it gets the job done. Ansible itself relies on little more than Python and SSH to push changes out. It's a popular tool that is incredibly easy to learn -- making it a great option for modest setups where you don't necessarily want your coworkers to learn an elaborate programming language.

One of the important things Ansible aims for is idempotency. The idea behind idempotent configuration is that operations can be performed as many times as you want without changing the result. If you add a 13th server and deploy with Ansible, you should be able to run this same playbook and have the other 12 not change. It wouldn't make sense to run through all the same configuration steps on servers that are already enjoying a working state.
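
For reference, pushing a playbook out to an inventory looks something like this (the file names here are just placeholders):

# Run the playbook against every host in the inventory
ansible-playbook -i hosts playbook.yml

# Running it a second time should report most tasks as "ok" rather than
# "changed" -- that's idempotency doing its job
ansible-playbook -i hosts playbook.yml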

While this playbook is simple enough to be replaced by a rudimentary Python script (Lord knows I've written shit like this before), anything larger just isn't gonna cut it. Think about making a lot of changes to random things like SSH, adding users and groups, or deploying complex software your team developed. Configuration management tools start to make more sense.

Containers

Containers are the other Big Thing that swept the industry. The Linux kernel has features called namespaces and cgroups which let processes be isolated and sandboxed into self-contained "boxes" with controlled access to resources. As opposed to virtual machines, these containers share the host kernel and as a result can be far more memory and CPU efficient. You're not emulating an entire machine, just a bundle of software! This kernel-level isolation tech has seen many forms, but the one that really drummed up a lot of noise was Docker. Docker prides itself on being easy to set up and abstracting away a lot of the "Linux" involved in a Linux server. Of course, because it relies on Linux kernel features, it doesn't actually work natively on anything else; on Windows or macOS it spins up a virtual machine that THEN runs the containers.

Docker containers exploded in popularity because of how simple they make building software in an isolated environment. Here is an example Dockerfile:

# Start with a lightweight build image
FROM debian:bullseye-slim AS builder

# Install build tools
RUN apt-get update && apt-get install -y build-essential

# Add source code
COPY hello.c /src/hello.c

# Build the program
RUN gcc -o /src/hello /src/hello.c

# Create the runtime image
FROM debian:bullseye-slim

# Copy the compiled binary from the builder stage
COPY --from=builder /src/hello /usr/local/bin/hello

# Default command to run the program
CMD ["hello"]
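
Building and running it would look something like this (the image tag is arbitrary):

# Build the image from the Dockerfile in the current directory
docker build -t hello .

# Run the program inside a throwaway container
docker run --rm hello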

As you can see, it just uses a simple Debian "image" (a prebuilt bundle of Debian's userland and libraries) and builds my simple C program, which I can then deploy with docker run as shown above. You can ship this Dockerfile and build it on any system -- not just a Debian host but Ubuntu, RHEL, or fucking Hannah Montana Linux. It will Just Work. Docker became a quick and dirty method of deploying software that worked on everyone's machine, and the paradigm it introduced shifted mindsets:

Cattle, not pets

Servers started to take on a philosophy of being "ephemeral" -- the idea that you can comfortably destroy and rebuild services quickly based on configuration in a file somewhere (like a Dockerfile or an Ansible playbook)! You manage your servers like cattle: you're not personally sitting down, tweaking, caring for, and protecting months of hand-applied configuration. A pet server is one where you or someone else has been quietly hacking away and fixing things under the hood, and if that one server exploded tomorrow you'd be cooked. You're stressing over a system configuration that is heavily "snowflaked" and untracked.

Cattle are servers and services that you can bring up, destroy, rebuild, and change on a whim, because you know exactly how they're being defined. You use a version control system like Git to keep track of changes, and the goal is for the cattle to meet a certain end state. If they don't do that, you destroy and rebuild. This approach encourages better security and neatness and allows people to keep track of changes more easily.

Obviously some things must be "pets", but the general idea is to avoid single points of failure and "golden systems". Redundant systems and high availability help solve this issue.

The problem with DOCKER (and friends)

I am going to exclude Kubernetes from this conversation because it's out of scope. What you need to know is that Docker is fundamentally still a "hack"! You're still dealing with a traditional Linux system and just automating the commands you'd normally input! You're just abstracting it away a layer so you don't have to think about it.

Containers see incredibly widespread use now, but they're fundamentally just cramming a Linux inside another Linux to avoid the problem of it being really fucking hard to reproduce a setup.

There's gotta be SOMETHING better, right?

Nix

Nix actually predates a lot of this tech. It started in the early 2000s as a PhD project by Eelco Dolstra, and it looks at the fundamental problems in declarative configuration, deployment, and reproducibility and attacks them from the root. Nix is the equivalent of an assembly-line robot making a cake instead of a human. Ignoring how depressing of an image that is when the cooking analogy is used, Nix is a programming language and ecosystem that is designed around giving precise and reproducible recipes.

Rather than having you run a couple of sudo apt install commands on top of a system, Nix configures the system via code.

Deployment of a similar web server to a single system would look like this:

# system-configuration.nix
{ pkgs, ... }:
{
  services.nginx = {
    enable = true;
    virtualHosts."localhost" = {
      default = true;
      # Serve a tiny static page straight out of the Nix store
      root = pkgs.writeTextDir "index.html"
        "<html><h1>Hello from Nix!</h1></html>";
    };
  };
}

In only a few lines, Nix sets up the NGINX web server, and because everything is built in an isolated environment designed from the ground up for reproducibility, it reliably deploys the same configuration every time.

If you added another service, like a media server, it's as simple as this:

# system-configuration.nix
{ pkgs, ... }:
{
  # New service!
  services.jellyfin.enable = true;

  services.nginx = {
    enable = true;
    virtualHosts."localhost" = {
      default = true;
      # Serve a tiny static page straight out of the Nix store
      root = pkgs.writeTextDir "index.html"
        "<html><h1>Hello from Nix!</h1></html>";
    };
  };
}

Running nixos-rebuild will automatically build and install the Jellyfin media server alongside the rest of this configuration, but here's the thing: if you decide you don't like that service, getting rid of it is as simple as removing that line. As opposed to all the tools stated above, Nix keeps track of configuration on a deeper level through its functional programming language and large ecosystem of packages. Doing anything more complex than this absolutely has a high learning curve, but it's actually attacking the fundamental problem in declarative configuration.
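
On NixOS, applying (and undoing) a configuration boils down to a couple of commands, something like:

# Build the new configuration and switch the running system over to it
sudo nixos-rebuild switch

# Changed your mind? Switch back to the previous generation
sudo nixos-rebuild switch --rollback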

Nix has not seen super widespread use in the industry because it's kind of difficult to learn and work with at a high level. Poor documentation (as the community skews towards developers willing to directly read code anyway) leads to an insulated, silo'd off silver bullet that isn't really used much. I personally use Nix a TON now for configuring my own home lab, and being able to neatly configure my system through a few text files is a huge memory and time-saver for me. I no longer stress about "pets", because I can literally wipe my OS and bring it back up in a few minutes if I really want.

"Cloud-native" and big infrastructure

Nix is more of a configuration management tool. You might have noticed I didn't include any configuration or code for deploying to multiple hosts in the Nix code above, and that's partly because the tooling around doing so is very rudimentary. There are better tools for deploying infrastructure -- networking devices, cloud instances, stuff that you want to keep mental track of.

Terraform is an extremely popular tool that does just that. Let's look at how we can integrate our (simplified) Nix configuration with some Amazon Web Services (AWS) nodes defined in Terraform:

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "nixos_nodes" {
  count         = 3
  ami           = "ami-0a6e377f63e5c0bfc"
  instance_type = "t3.micro"
  key_name      = "your-ssh-key"

  # User data to pull and apply the Nix configuration
  user_data = <<EOF
#!/bin/bash
set -e
sudo nix-channel --add https://nixos.org/channels/nixos-23.05 nixos
sudo nix-channel --update
sudo nixos-rebuild switch --flake github:yourname/nixos-config
EOF

  tags = {
    Name = "NixOS-Node-${count.index + 1}"
  }

  # Security group to allow HTTP traffic
  vpc_security_group_ids = [aws_security_group.nixos_sg.id]
}

resource "aws_security_group" "nixos_sg" {
  name_prefix = "nixos-sg"

  ingress {
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

As you can see, we can configure ingress/egress rules, the number of nodes we want, the provider (AWS), and the like very easily from Terraform. This makes our (cloud) infrastructure code rather than a clickops setup where we go into AWS and manually create nodes from a web browser or something. We can easily share this Terraform and Nix configuration around, and there are no weird gray states like with Ansible and the like! There are only working states reflecting our code, or nothing.
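
The day-to-day workflow is just a few commands:

# Download the AWS provider plugin and set up the working directory
terraform init

# Preview what will be created, changed, or destroyed
terraform plan

# Actually create the three instances and the security group
terraform apply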

Granted, Ansible is useful in Terraform deployments too, for one-off provisioning once the nodes are up. But with Nix, I can configure this stuff on a deeper level rather than just running a couple of Python and bash scripts under the hood.

Terraform doesn't make as much sense with physical nodes you can touch like a server in a data center. Instead, it's normally paired with Amazon/Google/Microsoft's cloud services that give you virtual machines to play with.

There are a million tools and a ton of stuff I've glossed over, but if you're still here I appreciate you being a trooper and reading through this rambling. I mean for this article to target technically-minded users that don't necessarily have a good grasp of these concepts. Infrastructure as code and declarative configuration are principles used a ton in modern IT.

Please comment below if you have questions, thoughts, or suggestions for improvement.


/foss/ /networking/ /cloud/ /linux/ /tech/