awsdevops cloudwatch, custom_metric

I had an issue recently with an EC2 instance running out of disk space. Unfortunately free disk space is not a metric that comes out of the box with AWS CloudWatch. This post is about implementing a custom metric and getting notifications via AWs CloudWatch based on that metric.

Steps to monitor disk space with CloudWatch

Step 1: Download sample config file

AWS provides a sample JSON file at this location: https://s3.amazonaws.com/ec2-downloads-windows/CloudWatchConfig/AWS.EC2.Windows.CloudWatch.json

Download a copy of this file.

Step 2: Set IsEnabled to true

By default it comes disabled so set the value as shown below:

"IsEnabled": true

Step 3: Add the custom metric for disk usage

Add the custom metric to monitor disk space:

{
    "Id": "PerformanceCounterDisk",
    "FullName": "AWS.EC2.Windows.CloudWatch.PerformanceCounterComponent.PerformanceCounterInputComponent,AWS.EC2.Windows.CloudWatch",
    "Parameters": {
        "CategoryName": "LogicalDisk",
        "CounterName": "% Free Space",
        "InstanceName": "C:",
        "MetricName": "FreeDiskPercentage",
        "Unit": "Percent",
        "DimensionName": "InstanceId",
        "DimensionValue": "{instance_id}"
    }
}

Step 4: Add the new metric to flows

After defining the metric we need to add it to the flows so that it can be sent to CloudWatch. To achieve this update the flows section as shown below:

"Flows": {
    "Flows": 
    [
        "(ApplicationEventLog,SystemEventLog),CloudWatchLogs",
        "(PerformanceCounter,PerformanceCounterDisk),CloudWatch"
    ]
}

Step 5: Add IAM role to server

It’s a good practice to manage permissions of EC2 instances via IAM roles assigned to the machine. To enable sending logs to CloudWatch add AmazonEC2RoleForSSM policy to the machine’s role

Without this role SSM agent service gets an access denied error.

Step 6: Restart Amazon SSM Agent service

Either by using Windows Services Manager or running the following command:

Restart-Service AmazonSSMAgent

Once this is all done wait a few minutes and check CloudWatch metrics. Under All -> Windows/Default you should be able to see new metric under InstanceId group (as that’s what we are using to group the logs). And when you click the metric you should be able to see a nice time-based graph of free disk space on the server:

Notes

  • It’s useful to know where SSM Agent’s logs are stored. They can be found in this path:

    %PROGRAMDATA%\Amazon\SSM\Logs\

  • The service reports every 5 minutes. The PollInterval in the JSON file is in seconds and is different than service report interval.

Resources

awssecurity organizations, iam

I have never been a huge fan of AWS Management Console. Some reasons for that being:

  • Inconsistencies: In some services you can search by anything (such as tag value in EC2 dashboard) whereas in others you have to put in the exact start of the object (such as CloudWatch)
  • Regional separation: Some might like it but I find it confusing and error-prone. If you need to work in multiple regions you have to constantly change the region from the dropdown menu. If you accidentally create a resource in another region you wouldn’t see it’s still running until you accidentally switch back to that region again. But S3 seems to be an exception to this as you can select the region while creating the bucket and you can see all in the same list (speaking of inconsistencies…)
  • Flat resource structure: Every resource is mixed together in an account. If you have multiple projects or teams in your company, you would see all the resources they created among yours. Also there is no environment concept. Your test and production resources live side by side.

This post is about AWS Organizations which addresses the 3rd point in the list above.

What is AWS Organizations?

It is a way to centrally manage multiple accounts inside an organization by creating a hierarchy between accounts.

Benefits of having an account structure

  • No need to label everything with project/team/environment name
  • Production and non-production resources don’t live side by side
  • Better access controls: No need to grant access on resource level. It can be done much easily on account-level

Also from a cost point of view it has the following benefits (taken from AWS Account Structure Considerations)

  • Grouping resources that require different payment instruments
  • Providing groups with different levels of administrative control over AWS resources
  • Better controlling Reserved Instances for specific workloads
  • Identifying untaggable costs such as data transfer
  • Using accounts associated with different business units or functional teams

Key Concepts

Account: Your regular AWS account. The first account you create is called a Master Account, the rest are Member Accounts.

Organization: A group of related accounts. The account creating the organization becomes the master account.

The star next to the account indicates it is the master account.

Organizational Unit: You can use organizational units (OUs) to group accounts together to administer as a single unit. This can be any logical grouping such as team, project, environment etc.

Service Control Policies (SCPs): Enables you to restrict, at the account level of granularity, what services and actions the users, groups, and roles in those accounts can do

Managing Projects and Environments

First I was tempted to separate projects as well but I’d end up with too many accounts so abandoned that idea and adopted an environment-based organizational structure. I ended up having these AWS accounts in my organization:

  • Dev
  • Integration
  • UAT (User Acceptance Test)
  • Sandbox
  • Production

Then I created an organizational unit named Stages and moved all these accounts under that OU. This is just one way of structuring projects. Based on organizational needs it can be customized. In my case I decided to keep all shared services (logging, auditing, source code) in the master account.

Logging into accounts

This baffled me at first. Initially I created a test account which I wanted to delete later on. But I wasn’t able to do that until I completed the sign up steps which in turn I wasn’t able to because I didn’t have the credentials to log in!

As stated in this document:

When you create a new account, AWS Organizations initially assigns password 
to the root user that is a minimum of 64 characters long. All characters 
are randomly generated with no guarantees on the appearance of certain 
character sets. You can't retrieve this initial password. To access the
account as the root user for the first time, you must go through the
process for password recovery.

So when you follow the sign-in link it redirects you to IAM login page. I needed to switch to root account login and recover my password by using the Forgot My Password link. On that note: Don’t use fake email addresses as you will need the confirmation email to recover your password.

Removing account issues

This one was tricky. In order to leave an organization first you need to enter a payment method and select a support plan. This way the account becomes eligible to be a standalone account. Only after that, you can Leave organization. But not right away!

After I entered all the data and completed the setup steps, I clicked Leave organization and I got this error:

I waited almost a full day after getting this error but to no avail. I kept getting the same error: “This operation requires a wait period. Try again later.”

I had a chat with a support engineer and created a case for this. Nothing helped at first but after a few days I tried again and it worked! So either the waiting period was very long or they fixed something in my account unbenownst to me.

Deleting the account without removing

Another issue I had was deleting an account before removing from the organization. I was assuming that if the account was closed permanently it would be removed from the organization as well. This was not the case. It remains listed as Suspended

Unfortunately, once this happens there is no way of resolving it using the tools at our disposal. The only solution is to contact AWS support, reactivated the suspended account, leave the organization and close it again!

But you have to do it from the suspended account, not from the master account. Since technically you’re requesting support for another account, they won’t do it (as they told in their response). Good news is that we still have access to support even though the account is suspended. Si I went to support page and created a support request to reinstate my account (so that I could close it again shortly after!)

Another option might be just to wait. I haven’t tried it myself bu in the account closure email it states “After 90 days, you will not be able to reopen your account, and any remaining content in your closed account will be deleted.” So I’m guessing it will be gone completely if I could just for 90 days.

Managing Accounts Programmatically

The coolest thing about AWS Organizations is accounts can be created via command line. It’s easy as this:

JOTUNHEIM:~ volkan$ aws organizations create-account --account-name {NAME}  --email {EMAIL}
{
    "CreateAccountStatus": {
        "Id": "xyz-abcabcabcabcabcabcabcabcabcabcab",
        "AccountName": "{NAME}",
        "State": "IN_PROGRESS",
        "RequestedTimestamp": 1532494396.633
    }
}

Notes

  • Free tier is shared among all accounts in the organization: “If your company creates your AWS account through AWS Organizations, free tier eligibility for all member accounts begins on the day the organization is created.”

Conclusion

This post is meant to be an introduction to AWS Organizations rather than a complete guide. I will post similar ones as I use this as basis of my infrastructure and build on this.

Resources

linuxsysops raid

AWS and cloud computing are awesome but I still enjoy having a server at home. I’ve decided to reinstate my old desktop. Replaced the disks with shiny new ones (A 500GB SSD and 2 4TB drives for data) and installed Ubuntu 18.04.

My primary goals were:

  • Partition SSD drive and use it for data that needs performance
  • Set up RAID1 on 2 4TB disks so that 1 disk failure wouldn’t result in data loss
  • Set up some sort of notifications to monitor SSD disk health
  • Set up some sort of notifications to monitor RAID disk health

Having all these in place was important for me to have a solid, reliable system before I started building stuff on top of it.

Partitioning

By default Ubuntu only adds a 512MB /boot/efi partition and leaves the rest for root (/). But since I recently had some free space issues in the boot drive recently I decided to create a boot partition as well. Also added a 32GB Swap partition. Swap is is virtual memory and it should have the same size as the computer’s memory.

So I ended up allocating 32GB for home as well and left the rest for the root. Going forward now I have more than 300GB to use for application data, Docker images etc.

RAID1

Now it’s time to set up a RAID array to have some redundancy. They are shiny new disks but you can never fully rely on them and they will eventually fail. Probably it’s not the best idea to install 2 identical disks at the same time as their lifecycle will likely end at similar times. So I’ll need to keep an eye on them and set up some monitoring and notifications (more on that later).

As my guide I started using this article called Setting up RAID 1 (Mirroring) using ‘Two Disks’ in Linux – Part 3 to set up my RAID.

This is a very nice article and explains everything step by step already so I’m not going to duplicate it here. But I bumped into an issue while following the guide: 4TB drives. My disks were MBR which only supported up to 2TB and I had to convert them into GPT disks.

The answer was parted. I followed the steps in this answer and managed to partition the drives to 4TB (3.7 to be exact!)

Testing RAID1

Now that I had a RAID it was time to go out and test it. First thing I wanted to see was it mounted automatically after a reboot. It did mount the drives but there was a problem with RAID. At boot the device was showing as md127 instead of md0 and it was failing to sync.

My mdadm.conf file looked like this when this was happening:

ARRAY /dev/md0 level=raid1 num-devices=2 metadata=1.2 name=asgard:0 UUID={device id}
   devices=/dev/sda1,/dev/sdb1

After some Googling I found the answere here

The solution was simply removing the name parameter from the file!. After that, when I rebooted and entered the following commands I was able to see a healthy RAID:

cat /proc/mdstat

and

mdadm -D /dev/md0

Monitoring System Disk with smartmontols

Now that everything looked in place, I needed to ensure it stays that way!

To achieve that goal, I edited this file /etc/smartd.conf and added this line:

DEVICESCAN -o on -H -l error -l selftest -t -m my.email@address.com -M test

Set up sending emails from server

When I first installed smartmontools, it asked Postfix configuration which looks like this:

Since I was interested in getting results fast, I selected No Configuration and moved on. Now it’s time to configure it.

To bring up this screen again I entered the following command:

dpkg-reconfigure postfix

I followed the wizard, entered my email domain. Then followed this documentation from AWS: Integrating Amazon SES with Postfix

As always, I ended up having some issues :-).

First problem was it didn’t work! It wasn’t finding the SES SMTP server as relay server and was always trying to send emails from localhost. The solution was here

As the instructions say, I updated /etc/postfix/main.cf with the values below:

myhostname = localhost
mydestination = $myhostname, localhost.$mydomain, localhost, $mydomain

and I was able to send emails to the SMTP server. But the SES didn’t like my IP address and the email bounced. The solution to that was to create an IP filter in SES and allow the traffic from that address.

Then I restarted the service to test

service smartmontools restart

and received the notification. Actually received 3 emails for some reason.

The service runs at startup so this way I can be notified whenever it reboots too.

Monitoring RAID with mdadm

Took some time to complete synchronizing 4TB disks but finally I was ready to rock:

Apart from smartmontools, mdadm application is also capable of sending emails when a disk fails.

The documentation tells to add MAILADDR followed by an email address to specify target email address but in my tests adding the line didn’t change anything.

In fact, turns out by default it’s sending the notifications. As my server was set up to send emails now, by just entering the following command to send out a test email I was able to receive it

mdadm --monitor --scan --test -1

The problem is it’s now using root@mydomain.com all the time now. I wasn’t able to change it. But as long as that mailbox exists at least I can receive notifications.

Conclusion

All hardware eventually fails. Disk failure is especially annoying because it may cause some precious data loss. Apart from good backup practices, it’s also helpful to have a good monitoring system and redundancy on the disks we use.

It took me some time to set up my system but now at least I have 1 disk redundancy for large amounts of data and the ability to be notified whenever something goes wrong with the disks which gives me some peace of mind (not too much though :-)).

Resources