Optimizing Images to Save Bandwidth on AWS

Last month our finance guy came to me in a bit of a panic to point out that our Amazon Web Services (AWS) bill was way higher than expected – by several thousand dollars.  After the initial shock wore off I started digging to figure out just what was going on.

From Amazon’s bill I could easily determine that our bandwidth costs were way up, but other expenses (like EC2) were in line with previous months. Since I have several S3 buckets that house content served from S3 and CloudFront, and since Amazon doesn’t break the costs down in enough detail, I had to figure this one out on my own.

We serve millions of page views per day, and each page view causes several calls to different elements within our infrastructure.  Each call gets logged, which makes for a lot of log files – hundreds of millions of log lines per day, to be exact.  But this is where the detail I needed would be found, so I had to dig into the log files.

Because I collect so much log information daily I haven’t built processes (yet) to get detailed summary data from the logs.  I do, however, collect the logs and do some high-level analysis for some reports, then zip all the logs and stuff them into a location on S3.  I like to hang on to these because you never know when a) I might need them (like now), and b) I’ll get around to doing a deeper analysis on them (which I could really use, especially in light of what I’ve uncovered tracking down this current issue).

I have a couple of under-utilized servers so I copied a number of the log files from S3 to my servers and went to work analyzing them.

I use a plethora of tools on these log files (generally on Windows) such as S3.exe, 7zip, grep (GNU Win32 grep) and logparser.  One day I’m going to write a post detailing my log collection and analysis processes….

I used logparser to calculate the bandwidth served by each content type (css, html, swf, jpg, etc.) from each bucket on a daily basis.  My main suspect was image files (mostly jpg) because a) we serve a lot every day (100 million plus), and b) they are generally the biggest of the content we serve from S3/CloudFront.
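
Even before reaching for logparser, a quick awk pass can produce the same per-type byte totals.  Here’s a minimal sketch, assuming the object key and bytes-sent have already been reduced to two whitespace-separated columns (real S3/CloudFront log lines carry many more fields and need parsing first, so treat the sample data as hypothetical):

```shell
# Sum bytes served per content type from a "key bytes" listing.
# The inline sample data is made up; real S3/CloudFront logs need
# the object key and bytes-sent fields extracted first.
awk '{
    n = split($1, p, ".")             # split the object key on "."
    ext = (n > 1) ? p[n] : "none"     # last piece is the extension
    bytes[ext] += $2                  # accumulate bytes per type
} END {
    for (e in bytes) printf "%s\t%d\n", e, bytes[e]
}' <<'EOF' | sort -k2 -rn
thumbs/a.jpg 1048576
stills/b.jpg 2097152
css/site.css 20480
index.html 8192
EOF
```

Pointing the same aggregation at a few days of logs on either side of the billing jump makes a per-type bandwidth spike obvious at a glance.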

Since my log files are so voluminous it actually took several days to cull enough information from the logs to get a good picture of just what was happening.  Even though I’d gotten over the initial shock of the increased Amazon bill, I was a bit shocked again when I saw just what was going on: essentially overnight, the daily bandwidth served from the bucket holding my images (thumbs & stills) went up 3–4 times.  This told me either a) we were serving more images daily (but other indicators didn’t point to this), or b) our images were larger starting that day than they had been previously.

Well, that’s exactly what it was – the images were bigger.  Turned out a developer, trying to overcome one bottleneck, stopped “optimizing” our images and the average file size grew by nearly 4 times – just about the amount of bandwidth increase I’d found.  So, in order to save a few minutes per day and even a couple bucks with our encoder this one thing ended up costing us thousands of dollars.

Once I identified the issue I began working with our developers to fast-track a solution that accomplishes all our goals: save time and money encoding/optimizing the images, and get them as small as possible to save on Amazon bandwidth, therefore saving a lot of money.  I also went to work identifying the images that were too big and optimizing them.  In fact, this is an ongoing issue as our dev team hasn’t quite implemented their new image optimization deployment, so I actually grab unoptimized (i.e. too big) images a few times a day, optimize them, then upload them to S3.  This process probably justifies its own post so I’ll do that soon & link to it from here.
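
That interim cleanup pass starts by flagging candidates.  A sketch of the first step, assuming a 200 KB cutoff (the threshold and the current-directory scope are made up for illustration – tune them to your encoder’s normal output):

```shell
# Flag JPEGs over ~200 KB as candidates for re-optimization.
# The 200k threshold is illustrative, not a real cutoff.
find . -type f -name '*.jpg' -size +200k -printf '%s\t%p\n' |
    sort -rn                          # biggest offenders first
```

Each flagged file would then be run through an image optimizer and re-uploaded to S3.  Note that -printf is a GNU find extension, so on Windows this runs under the GNU Win32 tools mentioned earlier.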


Delete Files by Date With DOS Batch File

I have a server that is continuously filling its drive with temporary files left over from a video encoding process we use.  Every week or so I have to manually delete the old files to prevent the drive from filling.  Finally, today I set out to figure out a way to programmatically clean up these old files regularly, so I tooled around the Internet looking for a way to do it from the DOS command line.  Unfortunately others had run into the same issue I had: DOS commands like del don’t have a way to delete by date.  I found a few approaches using complex batch files & even tried one for a few minutes, but when I couldn’t get it to work I went back to the drawing board.

I found a really old post suggesting using xcopy to copy the newest files (presumably the ones to keep) to a temp location, then delete the original files, then move the copies back to their original location.  This approach had some promise, but some real drawbacks too; particularly that I’m dealing with tens of gigabytes (so this would take forever) and with several, varying subdirectories.

Since xcopy has been deprecated and replaced with robocopy in Windows 2008, Windows 7, etc., that’s what I chose to use.  In fact, robocopy has a couple of switches that make it easy to move (/MOV) files older than x days (/MINAGE:x).

What I ended up with was a simple two line batch file that I’m using Windows task scheduler to run once a day.  The first line moves files older than x days to a temporary location & the second line deletes the moved files.

robocopy D:\Original_Location D:\DelOldFiles * /S /MOV /MINAGE:7
del D:\DelOldFiles\* /s /q

The robocopy syntax is ROBOCOPY source destination [file [file]…] [options].  I’m using the following switches (options):

  • /S – include subdirectories
  • /MOV – move (which is technically a copy, then delete original)
  • /MINAGE:x – act on files older than x days

After the files are moved, I delete all files (*) in all subdirectories (/s) of my temp folder, quietly (/q) – i.e., don’t prompt, just do it.
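
For comparison, on a Linux box the same daily cleanup collapses to a single find invocation, since find can select by age and delete in one pass with no temp-folder hop.  A sketch, with an illustrative directory name standing in for the real temp location:

```shell
# Hypothetical directory holding the encoder's leftover temp files.
CLEAN_DIR=${CLEAN_DIR:-./encoder_tmp}
mkdir -p "$CLEAN_DIR"   # so the sketch runs even on a fresh box
# -mtime +7 matches files last modified more than 7 days ago
# (the counterpart of /MINAGE:7); -delete removes them, recursing
# into subdirectories like /S did.
find "$CLEAN_DIR" -type f -mtime +7 -delete
```

Scheduled via cron instead of Task Scheduler, that one line replaces both the robocopy and del steps.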

See also

Using DOS to delete based on date