Monitoring with Grafana

Monitoring is one of those things I’ve seen get overlooked. Throwing money at it doesn’t always make it better; taking the time to sit down, understand your systems, and do proper event management is key. Of course, I say all of that and I don’t always practice it, especially on my own personal infrastructure. I do think I go beyond the basics, however. Originally I would just set up Nagios and get emails and texts when a server didn’t respond to ping. Finding something that could monitor and alert on performance was more challenging. I ended up settling on the Telegraf agent and having it send to an InfluxDB instance. Grafana then interprets the InfluxDB data and turns it into pretty visuals.
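For the curious, the Telegraf side of that pipeline is just a config file. A trimmed sketch of the relevant sections (the URL and database name are placeholders, not my real instance):

# Ship metrics to InfluxDB
[[outputs.influxdb]]
  urls = ["https://influx.example.com:8086"]
  database = "telegraf"

# Collect CPU and memory metrics
[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]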

Docker Swarm Limitations

As I continue toward my goal of using a container orchestration tool to scale this website behind a load balancer, I’m learning all kinds of things about the pitfalls involved with scalable systems. I thought Docker Swarm would be great for this since it is relatively straightforward to set up, but I’ve discovered it has a few limitations.

First of all, it has no mechanism to scale the container hosts themselves. I started down the path of scripting that, and it was fairly successful at adding hosts to a swarm, but then I learned of another limitation: it doesn’t seem to have any way to balance containers among swarm workers after the initial start of the service. That means in a two-worker-node configuration you could end up with two container instances running on the same node. I know there are health checks that would in essence ‘heal’ your application, but it seems silly to have an unused server out there. A smaller issue was that persistent volumes wouldn’t update quite right, even when created inline. If a volume with the same name already existed on one of the worker nodes, Swarm would give no errors and silently use that volume’s settings and paths on that node. That one was difficult to troubleshoot.
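One partial workaround I’ve seen for the balancing problem is forcing a service update. Even with no actual changes, this makes the scheduler re-place the service’s tasks, which usually spreads them back out across the workers (at the cost of restarting the containers):

docker service update --force nginx-test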

Now I move toward the popular alternative, Kubernetes. Its popularity right now makes it worth my time to figure out how it works. I’ve already stood up some basic services on a hosted cluster; setting it up from scratch seems like an extra challenge. Thinking about my goals lately, I wonder if I want to support this website on a complex setup long term. I want to learn it and know it, but it may not make financial and logistical sense. So my plan now is to set it up and see what making changes is like; otherwise, simple one-server setups are in my future for my personal assets.
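For context, the “basic services” I stood up on the hosted cluster amounted to little more than the standard kubectl one-liners, something along these lines:

kubectl create deployment nginx --image=nginx
kubectl expose deployment nginx --port=80 --type=LoadBalancer
kubectl scale deployment nginx --replicas=3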

Docker Complexity

As I continue my plan to adopt Docker for most of my personal infrastructure, including this website, I am learning a valuable lesson: Docker works best with simple images. I imagine there are some complex images out there that work fine, but I believe it’s best to avoid those situations. Supporting changes and updates, and tracking how components interact with each other, just adds to the layers. For instance, I’m a big fan of CentOS. I know it sacrifices running the latest software for dependability. This caused me issues using it in a Docker image with Apache and PHP 7.2: CentOS doesn’t natively support PHP 7.2, so I started down the path of workarounds. Then I realized it would be really simple to use a base image that has native support. In the end, what was becoming more and more complex in CentOS I accomplished very easily with Ubuntu. As an aside, I do hope CentOS 8 is out soon so it can take advantage of some newer package versions.

Here’s my Dockerfile. I know I need to combine my commands to reduce the layers, but it works fine:

FROM ubuntu:latest
ARG DEBIAN_FRONTEND=noninteractive
RUN apt update
RUN apt upgrade -y
RUN apt install apache2 php libapache2-mod-php php-mysql -y

EXPOSE 80

CMD ["/usr/sbin/apache2ctl", "-DFOREGROUND"]

Docker Orchestration

As I work towards my goal of making this website highly available, I’ve hit a couple of roadblocks. As always in IT, it is important to work through these and treat them as learning experiences. Sometimes, late at night when I encounter one of these, it’s tough for me to go to bed even though I know a rested mind the next day would help me work through the problem. I’ve been hitting quite a few blocks in my quest lately.

First, I’d always wanted to give Kubernetes a try. I chose a hosted platform on Digital Ocean to help me learn about it. I’m planning on talking more about it later, but it gave me great insight into how it encompasses more than just compute orchestration and covers much more of IT infrastructure. Right now I don’t think I’m ready to tackle using it for my personal infrastructure.

For now my plan is to use Docker Swarm. It offers application healing through container regeneration, and it can scale how many container instances are running at a given time. My biggest roadblock has been storage. WordPress web content is dynamic; things like media attachments live on the web server. If I want to load balance across containers, the source content needs to be identical everywhere. NFS seemed like the easiest answer.

I tried to create a Docker volume on the manager swarm node based on the NFS path and present it to the Docker service, but it didn’t seem to work right: it mounted an empty directory every time. I troubleshot it from an NFS standpoint, but I think the real gap was my understanding of Docker Swarm. There is a lot of ambiguity out there, and I have to preface this by saying I may have done something wrong, but here’s what I used to create the service with the NFS share, which fixed it.

docker service create \
   --name nginx-test \
   --replicas 2 \
   --mount 'type=volume,src=nginx-test,volume-driver=local,dst=/usr/share/nginx/html,volume-opt=type=nfs,volume-opt=device=:/storage/nfs/nginxhttp,volume-opt=o=addr=192.168.79.181' \
   --constraint 'node.role != manager' \
   --publish 80:80 \
   nginx:latest

My understanding of the above is that I created the volume inline with the service. I wish I could find something definitive that confirms my suspicion, but I think it needs to be inline so each replica knows where to get the source from. I hope this helps someone, because I couldn’t find much out there.
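For anyone using stack files rather than docker service create, my understanding is the equivalent volume definition would look roughly like this (a sketch; I’ve only verified the CLI form above):

version: "3.7"
services:
  nginx-test:
    image: nginx:latest
    ports:
      - "80:80"
    deploy:
      replicas: 2
      placement:
        constraints:
          - node.role != manager
    volumes:
      - nginx-test:/usr/share/nginx/html

volumes:
  nginx-test:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.79.181
      device: ":/storage/nfs/nginxhttp"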

My Goals

With my current enthusiasm for Docker, I’ve recently been thinking about moving my websites (this one included) to a high availability cluster. Now, I could containerize WordPress fairly easily; the complication comes with horizontal scaling. WordPress files are dynamic because media assets are stored on the web server. There are several ways to overcome this, but my favorite plan has been to have shared storage serve the web content to multiple web servers. The problem is choosing which kind: something like Gluster takes multiple nodes, and a straight NFS server is a single point of failure. Also, do I use a managed load balancer or stand up something like HAProxy? I worry that many of these plans either involve single points of failure or unnecessary expense. In the end I may just leave the websites be and stand up a lab for temporary use so I don’t run up a crazy bill. On the plus side, my monitoring has improved; I can alert easily on log file patterns and performance health. I’d like to revamp uptime monitoring, but simple may be better there. Right now I think I’d just like to get comfortable using Docker Swarm and deploying services to it.
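If I do go the HAProxy route, the core of the config is at least pleasantly small. A rough sketch with placeholder backend addresses:

frontend www
    bind *:80
    default_backend wordpress_servers

backend wordpress_servers
    balance roundrobin
    # Placeholder IPs; 'check' enables active health checks on each backend
    server web1 10.0.0.11:80 check
    server web2 10.0.0.12:80 check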

Finally Graylog is working

After fighting with Docker, Graylog, nginx, Elasticsearch, Linux permissions, and Java memory limitations, I finally got Graylog working. It is aggregating a couple of servers’ data. Next, I think I need to stand up an email server so it can send out email alerts. Part of my problem was that my VM didn’t have enough memory and Elasticsearch kept crashing. Of course, it didn’t really leave any good logs to indicate that, but I think Docker or the system was killing the container before it consumed all the host’s memory. I also learned about file locks in Docker volumes and various workarounds. I’ll return to my monitoring/scaling script and configure logging so I get alerts when conditions are met or it takes an action.
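If anyone hits the same wall, explicitly capping the Elasticsearch heap (and the container’s memory) is worth trying, since by default the JVM can ask for more than a small VM can give. A sketch, with example sizes and an example image tag:

docker run -d --name elasticsearch \
  --memory=1g \
  -e "discovery.type=single-node" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:6.8.23

The usual guidance is to set the heap to about half the memory available to the container.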

Scaling Automation

I’m slowly on my way to automating the creation of droplets in Digital Ocean and adding them to my Docker Swarm cluster. I have the creation and configuration working thanks to PowerShell and Ansible. I’m still working on the check logic, but I have the queries I need for InfluxDB; I need to figure out the best way to run them against the database. I also started thinking about guard rails so this thing doesn’t take off and end up spawning 100+ servers overnight by mistake. I’m also trying to set up Graylog to drive alerting for actions taken by this script and to act as a general troubleshooting tool. I’m having trouble getting it to work inside Docker, but I’ll work on it some more tomorrow. Here’s the work-in-progress PowerShell script. The first two functions work, but there are definitely improvements to be made.

#Requirements: Digital Ocean command line, powershell running on linux, ansible
#I can't seem to find a way to find droplets associated with a project so I'm kind of cheating and finding droplets with swarm in the name
Write-Host "Starting script to create and configure new swarm worker node" -ForegroundColor Blue
function Create-Droplet {
Write-Host "Retrieveing existing nodes from DigitalOcean"
$droplets = doctl compute droplet list --output json | ConvertFrom-Json
$swarmIDs = New-Object System.Collections.ArrayList
foreach($droplet in $droplets){
    if($droplet.name -like 'swarm*'){
    [void]$swarmIDs.Add($droplet.id) #void suppresses the index that ArrayList.Add returns
    }
}

#If droplets get destroyed this should allow those numbers to get reused and not over increment
Write-Host "Finding next swarm node number"
$findIncrement = 1
foreach($swarmID in $swarmIDs){
    $swarmNode = doctl compute droplet get $($swarmID) --output json | ConvertFrom-Json
    if($swarmNode.name -like "*$($findIncrement)*"){
        $findIncrement++
    }
}
Write-Host "Found next available number is $($findIncrement)"
#Sets the name of the droplet to be created
$newName = "swarm$($findIncrement).serverhobbyist.net"
Write-Host "Name of node to be created: $($newName)"
#Checks for and deletes last json file for last created droplet to keep it current
$fileCheck = Test-Path -path /storage/ansible/nodescreated/latest.json
if ($fileCheck -eq $true){
    Remove-Item -Path /storage/ansible/nodescreated/latest.json
}
#Run Ansible Playbook to Create Droplet
$command = "ansible-playbook CreateDroplet.yaml -e `"dropletName=$($newName)`""
Write-Host "Running ansible playbook with command: $($command)"
bash -c $command
}
function Setup-Droplet {
#Gets returned data from Create-Droplet function
Write-Host "Discovering data about newly created node"
$createdNode = Get-Content  /storage/ansible/nodescreated/latest.json | ConvertFrom-Json
$IP = $createdNode.data.ip_address
#$IP = "165.22.178.216"
#Adds IP to hosts file
Write-Host "Writing IP $($IP) to Ansible Hosts file"
Add-Content /etc/ansible/hosts "`n$($IP)"
#Runs playbook to configure node: updates, telegraf agent + config, docker, epel, htop
$command = "ansible-playbook SetupDroplet.yaml -e `"IP=$($IP)`""
Write-Host "Running playbook with command: $($command)"
bash -c $command
}
function Check-Droplets {
#Finds current nodes in account with swarm in the name and adds the name to array
Write-Host "Retrieveing existing nodes from DigitalOcean"
$droplets = doctl compute droplet list --output json | ConvertFrom-Json
$swarmNames = New-Object System.Collections.ArrayList
foreach($droplet in $droplets){
    if($droplet.name -like 'swarm*'){
    [void]$swarmNames.Add($droplet.name) #void suppresses the index that ArrayList.Add returns
    }
}

$check = 0
foreach($swarmName in $swarmNames){
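    #Build InfluxQL queries for the mean and count of CPU idle and memory used over the last 5 minutes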
    $influxCheckCPU = "q=`"select mean(usage_idle),count(usage_idle) from `"cpu`" where `"host`" = " + "'" + $($swarmName) + "' " + "and time > now() - 5m`""
    $influxCheckRAM = "q=`"select mean(used_percent),count(used_percent) from `"mem`" where `"host`" =" + "'" + $($swarmName) + "'" +" and time > now() - 5m`""
    $influxCPUResult = curl -G 'https://influx.server.com/query?pretty=true' -u user:pw --data-urlencode `"db=telegraf`" --data-urlencode $influxCheckCPU | ConvertFrom-Json
    $influxRAMResult = curl -G 'https://influx.server.com/query?pretty=true' -u user:pw --data-urlencode `"db=telegraf`" --data-urlencode $influxCheckRAM | ConvertFrom-Json
    if($influxCPUResult -ne $null -and $influxRAMResult -ne $null){
        $CPUUsage = (100 - $($influxCPUResult.results.series.values)[1])
        $RAMUsage  = $($influxRAMResult.results.series.values)[1]
        $CPUUsage = [math]::Round($($CPUUsage),2)
        $RAMUsage = [math]::Round($($RAMUsage),2)
        Write-Host "Results for $($swarmName):" -ForegroundColor Blue
        Write-host "CPU usage: $($CPUUsage)%" -ForegroundColor Blue
        Write-host "RAM usage: $($RAMUsage)%" -ForegroundColor Blue

        if($CPUUsage -ge 40){
            Write-Host "CPU usage is now at $($CPUUsage)% which is high, an action might be taken"
            $command = "logger CPU usage is now at $($CPUUsage)% which is high, an action might be taken"
            $check++
            bash -c $command
            }
        if($RAMUsage -ge 50){
            Write-Host "RAM usage is now at $($RAMUsage)% which is high, an action might be taken"
            $command = "logger RAM usage is now at $($RAMUsage)% which is high, an action might be taken"
            $check++
            bash -c $command
            }
        }
    }
}


#Write-Host "Starting Function: Create-Droplet" -ForegroundColor Green
#Create-Droplet
#Write-Host "Starting Function: Setup-Droplet" -ForegroundColor Green
#Setup-Droplet
Write-Host "Starting Function: Check-Droplet" -ForegroundColor Green
Check-Droplets
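For readability, here’s what those escaped query strings actually send to InfluxDB (with a sample hostname filled in); CPU usage is then just 100 minus the mean idle:

SELECT mean(usage_idle), count(usage_idle) FROM "cpu" WHERE "host" = 'swarm1.serverhobbyist.net' AND time > now() - 5m
SELECT mean(used_percent), count(used_percent) FROM "mem" WHERE "host" = 'swarm1.serverhobbyist.net' AND time > now() - 5m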

Recreating AWS Fargate (lite)

I’ve recently been on a spree building out my infrastructure as Docker containers. I’m often torn in the debate between the cloud and on-prem infrastructure. I think the reason for that is twofold. One: I really like the tangible nature of on-prem hardware. Seeing the servers, storage, and network gear and thinking about all the bits flying all over the place gives me comfort for some reason. Two: I like understanding the logic of how a system works. In the on-prem world with traditional siloed infrastructure you have to understand the inner workings to create automation on top of it. I love knowing the inner workings of IT systems.

Lately, however, I’ve been using more and more cloud providers. I played around with AWS Fargate, where they manage the cluster of container hosts for you. You can set auto scale rules so it adds more compute power as it creates more replicas of your container. It will even register and deregister the containers from the elastic load balancer, make health decisions, and “self heal” your application. It integrates with Route 53 for your DNS entries. It redirects your log files and monitors several metrics for you. I can only imagine Amazon invested a lot in this system, so it may be lofty of me to want to recreate it. I’d really like to remain provider agnostic, but that’s hard when the ecosystem of the provider works so well together right out of the box.

Because my credit card can only take so much from Amazon experiments, I decided to try to create something similar on Digital Ocean. I’m using Ansible to spin up swarm worker nodes, from the inception of the VM through updates, configuration, and joining the swarm. That is all tied together with PowerShell. There may be a better way, but it’s the scripting language I’m most familiar with, and it works quite well on Linux. I’m using InfluxDB and Telegraf to gather performance metrics, and I hope to have a query drive scaling decisions. I still need to figure out the load balancing aspect, and probably other pieces I haven’t thought of yet. Here’s the main part that will drive the creation of new droplets on DigitalOcean.

#Requirements: Digital Ocean command line, powershell running on linux, ansible
#I can't seem to find a way to find droplets associated with a project so I'm kind of cheating and finding droplets with swarm in the name
$droplets = doctl compute droplet list --output json | ConvertFrom-Json  
$swarmIDs = New-Object System.Collections.ArrayList
foreach($droplet in $droplets){
    if($droplet.name -like 'swarm*'){
    [void]$swarmIDs.Add($droplet.id) #void suppresses the index that ArrayList.Add returns
    }
}

#If droplets get destroyed this should allow those numbers to get reused and not over increment
$findIncrement = 1
foreach($swarmID in $swarmIDs){
    $swarmNode = doctl compute droplet get $($swarmID) --output json | ConvertFrom-Json
    if($swarmNode.name -like "*$($findIncrement)*"){
        $findIncrement++
    }
}
#Sets the name of the droplet to be created
$newName = "swarm$($findIncrement).serverhobbyist.net"
#Run Ansible Playbook to Create Droplet
$command = "ansible-playbook CreateDroplet.yaml -e dropletName = $($newName)"
bash -c $command
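I’m not including CreateDroplet.yaml itself here, but a minimal playbook along these lines handles the droplet creation (this is a sketch, not my exact playbook; module parameters vary by Ansible version, and the size, image, and region values are just examples):

- hosts: localhost
  connection: local
  tasks:
    - name: Create a new swarm worker droplet
      digital_ocean_droplet:
        state: present
        name: "{{ dropletName }}"
        oauth_token: "{{ lookup('env', 'DO_API_TOKEN') }}"
        size: s-1vcpu-1gb
        image: centos-7-x64
        region: nyc1
        unique_name: yes
      register: created

    #The setup stage reads .data.ip_address out of this file
    - name: Save the droplet details to a JSON file
      copy:
        content: "{{ created | to_nice_json }}"
        dest: /storage/ansible/nodescreated/latest.json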

DevOps Tools

The DevOps tool chain has always fascinated me. I work for a company that is fairly siloed and sometimes kind of traditional. We use Microsoft System Center Configuration Manager as our CM tool. Now don’t get me wrong, it works, and it is powerful, but boy is it slow. I’ve been told Microsoft wants it to be able to work on environments with millions of servers. I started looking at Chef because another team at work uses it. It was neat to see all of the integrations it had and the amount of documentation out there. I was also enjoying the elegance of structured files to dictate configurations rather than always having to click through SCCM’s sluggish GUI for every modification. Chef uses a pull methodology, as does SCCM. Chef’s check-in interval can be set pretty low, and it also has some neat bootstrap functions for new infrastructure, but I really wanted something with faster feedback.

I started learning about Ansible and began using it in my lab. I set up integrations with Digital Ocean and my on-premises VMware lab to orchestrate the deployment of virtual servers. I created a PHP web app with a form where you fill out your server specs, and VMware uses its operating system customization to set things like the name and IP address. I was finally able to spin up a complete machine at home at the click of a button, like Vultr, AWS, or Digital Ocean.

At work we have a more involved build process, and I got really excited to demo my creation. Some were impressed; some even got excited that we could create a server from start to finish in about 5 minutes, down from about 45. However, I was unable to get management buy-in at my demo. They had concerns about the learning curve. I like stepping outside the comfort zone of the OS GUI and doing things on the command line: PowerShell, Bash, YAML files, etc. Others are not so willing to leave that zone. One teammate approached me and agreed to learn it. Slowly but surely I’ve been creating the Ansible playbooks for my organization’s Windows infrastructure. It’s more complicated than my home lab, but I’m getting close to accounting for all the extra configuration our servers need.
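For a flavor of what those Windows playbooks look like, here’s a heavily trimmed sketch (the host group and the chosen feature are placeholders; the modules are the stock Ansible Windows ones):

- hosts: windows
  tasks:
    - name: Verify WinRM connectivity
      win_ping:

    - name: Ensure IIS is installed
      win_feature:
        name: Web-Server
        state: present

    - name: Apply security and critical updates
      win_updates:
        category_names:
          - SecurityUpdates
          - CriticalUpdates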