Five years ago, when STH was significantly smaller, we ran a short series that was almost therapeutic for me. At the time, we were struggling with AWS instances segfaulting due to disk I/O issues, and AWS bandwidth costs were becoming burdensome enough that the switch made sense. Every year or so, the topic comes up of whether it makes sense to self-host versus using AWS. The team these days is bigger, and we operate out of multiple data centers for parts of the business, but we still need to deliver STH to our readers. Many of our readers are likely going through the same discussion themselves or for their clients. We track this data and have the benefit of lessons learned along the way.
Before we get too far into this, assume this analysis is for half-cabinet to 3-5 cabinet installations with single-tenant infrastructure rather than larger installations. Having a small number of machines (<200) simplifies some aspects to the point that lightweight automation tools are good enough. At the same time, there are companies like Snap with enormous AWS bills for not dissimilar reasons. We often hear that at $10,000/month companies start to look at a hybrid private/public model, but we are doing so at a much lower monthly AWS spend.
The Dirty Myth – Physically Servicing Servers is Commonplace
Before we get too far in, note that in the last two years, trips to the web hosting data center have happened for only two reasons:
- There is some cool new technology that we want to try. For example, we recently hosted STH on Optane drives to generate some real-world performance data.
- We have a meeting nearby and something failed or needs an upgrade.
We have had nodes and switches fail and one spectacularly bad event years ago that we learned a lot from. At the same time storage is all clustered / mirrored. We have extra switches available, extra nodes available, and frankly we just overbuilt. There was a point last year where we ran an experiment on how low we could power a Xeon D hosting node. The node did not have enough cooling and was not stable. We simply offlined the node and it did not warrant even a visit to the data center for two quarters since that would have been a 20-minute trip each way.
What has made this possible is the decently reliable SSD. When we had nodes with hard drives, we saw a 4-6% AFR. If you have 30-40 drives, there is a good chance one or more is going to fail in a machine in a given year. Our actual AFR on the SSD population is down to under 0.2% even though we are often using used drives. Check out our piece Used enterprise SSDs: Dissecting our production SSD population for more on that experiment. Moving to SSDs essentially means that we no longer need to count on regularly swapping drives during service calls. Fans rarely fail, which we confirmed with one of the hyper-scale data center teams. Power supplies are pretty good these days, although we did see one fail in the lab this year.
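The drive-failure math above is simple probability. Here is a minimal sketch using the article's figures (4-6% AFR across 30-40 hard drives, under 0.2% on the SSD fleet) and assuming independent failures:

```python
# Probability that at least one drive fails in a year, assuming
# independent failures at a given annualized failure rate (AFR).
def p_any_failure(afr: float, drives: int) -> float:
    return 1 - (1 - afr) ** drives

# Mid-range HDD case from the article: 5% AFR, 35 drives.
print(round(p_any_failure(0.05, 35), 2))   # roughly 0.83

# SSD fleet at under 0.2% AFR over the same drive count.
print(round(p_any_failure(0.002, 35), 3))  # under 0.07
```

With a 4-6% AFR across a few dozen spindles, a failure in any given year is close to a certainty, which is why the move to SSDs changed the service-call calculus so dramatically.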
Overall, the message is to build spare capacity. Expect things to fail. Bias towards more reliable components. Those concepts will allow you to minimize needing to service gear. It is not difficult to calculate what having online spares costs versus cloud and that is how we model our costs. We also assume a shorter lifespan than many installations. Our oldest web hosting machines are currently on the Intel Xeon E5-2600 V3 generation, and those are being evaluated for an upgrade in the next few months.
We have seen a lot of models that assume using ancient hardware (e.g. Xeon 5600 series and older) or provisioning just enough capacity to equal AWS capacity, and those approaches typically result in higher failure rates. We have also seen a number of models rely upon pricey virtualization software versus KVM and Linux-based hypervisors. To compete with the cloud, you will need to be OK using fundamental, well-known, and mature infrastructure technologies that cloud providers also rely upon.
AWS EC2 Costs v. Self-Hosting
We simplified our hosting set down to what we like to think of as a “bare minimum” set of VMs. It takes around 23 VMs to service our primary requirements, and that assumes we are not doing A/B testing or trying out additional services in additional VMs. These 23 VMs are essentially the “core” of running a site that services a few million page views per month plus runs a variety of services for our readers and clients.
We took these 23 VMs and mapped each to the mainstream AWS instance type that most closely resembles its actual RAM usage. Our primary constraint, like in many environments, is RAM. STH does not use a ton of storage, and we can effectively cache pages across the site. Since we are heavy on caching, we generally expect 18-24 month intervals between needing more memory allocated.
We also are not including any temporary infrastructure, such as VMs spun up simply to try a new feature or technology. We are modeling only the base load.
AWS EC2 Cost Breakdown
On the AWS side, it would be completely irresponsible to model using on-demand instance pricing, as is often done. We know from years of experience what a base load looks like, and we know the instance sizes that can handle our modest peak-to-valley traffic. We are using 1-year reserved instance pricing because we do tend to upgrade annually. While this year we are heavy on m5 instances due to their memory-to-compute mix, next year may be something different. We also do opportunistic upgrades on our hosting servers, which is why we already have Intel Optane in the hosting cluster.
Since we know we will have capacity needs at least for a year, we can model using 1-year reserved pricing. We tend to not look at all up-front pricing since that is not what we have in our hosting cluster. Instead, we look at no up-front and partial up-front reserved pricing. We tend to move to newer hardware and newer instances on an annual basis which is why we are not using a 3+ year reserved instance tier.
AWS 1-year No Up-Front Case
If we want to model our costs for a year, we can use the 1-year no up-front instance option. Here is what our breakdown looks like right now:
Our total cost for the 23 node fleet, in a single AZ, is $3029 per month or around $36,350 per year.
One item we should address, and more on this in the discussion, is that we modeled using 4TB of outbound traffic per month. That is a reasonable figure for us for Q1 2018 given our video and ad content is hosted externally.
AWS 1-year Partial Up-Front Case
AWS offers an option for those that know they are going to have a base load. One can pay an up-front sum and have a lower monthly bill. Here is what our breakdown looks like under this scenario:
This option brings our monthly bill down to around $1675 per month, or about $20,100 in recurring costs for the year. We still pay an initial sum of $14,770 at the outset, so our total is $34,870 for the twelve-month period. That reserved model gets us a solid $1480 savings, or around 4%, versus the no up-front option.
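The two reserved-pricing totals above can be sanity-checked with a few lines; the monthly and up-front figures are the ones quoted in this article:

```python
# Annualize the two 1-year reserved instance options from the article.
def annual_cost(monthly: float, upfront: float = 0.0) -> float:
    return monthly * 12 + upfront

no_upfront = annual_cost(3029)              # -> 36348 (~$36,350/yr)
partial_upfront = annual_cost(1675, 14770)  # -> 34870
savings = no_upfront - partial_upfront      # -> 1478, roughly 4%
print(no_upfront, partial_upfront, round(savings / no_upfront * 100, 1))
```

The same two-line model makes it easy to re-run the comparison as instance pricing changes year to year.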
Self-hosting is a different animal. While with AWS the costs of running the internal networking, sourcing data centers and components, dealing with hardware failures, and so on are all factored into the price, if you self-host, these are items that cost real money. We also get (significantly) better VM and storage performance on the self-hosting option, and there is usually a small shadow fleet of VMs online at any given time that are "free" in the self-hosting option but would add costs with a cloud provider.
We have three primary hardware cost lines:
- Hyper-Converged Nodes
- Replacement/ Upgrade Hardware
- Networking and Cabling
For networking gear, we assume a 3-year service life. We simply take the initial purchase price and divide by 36 months. As a result, here is what our budget looks like:
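The straight-line budgeting rule above is trivial to express in code. This is a minimal sketch; the $3,600 purchase price is a hypothetical example, not a figure from our actual budget:

```python
# Straight-line monthly budget line: purchase price spread evenly
# over a 3-year (36-month) service life, as described above.
def monthly_line(purchase_price: float, months: int = 36) -> float:
    return purchase_price / months

# Hypothetical example: a $3,600 switch and cabling purchase.
print(monthly_line(3600))  # -> 100.0 per month
```

Summing these per-category lines for nodes, replacement hardware, and networking gives the monthly self-hosting hardware budget.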
On the flip side, since we moved to all-flash hyper-converged nodes, this budget overstates our actual spending. Actual expenditures in 2017 for web hosting were under $16,000. We have also undertaken a number of proactive upgrades, such as making nodes 40GbE capable instead of 10GbE, which added cost. We have been cycling in Optane and larger capacity flash. There are also a number of cost-cutting measures in the nodes that can cut another $4-5K per month by running at higher utilization. We typically build 4x the capacity we need to ensure that we can handle ad-hoc application needs and failures.
Comparing AWS and Self-Hosting Costs
Comparing our 2018 budget against AWS 1-year partial up-front reserved instances, here is what this year looks like:
This analysis shows a 42% savings by self-hosting versus AWS costs. That is fairly compelling, but there are a few additional points:
- We excluded the 12-30 VMs we have running on the hosting infrastructure at any given time. The incremental cost is essentially 0 for us, but there would be several hundred dollars per month more AWS costs, especially if we had to use on-demand instances.
- Our last major hardware purchase was in early 2017. Since then DDR4 and NAND prices have risen drastically. Since we have so much overcapacity, we can choose to wait for better component pricing.
- We do not have on-site next business day service on the machines. This is usually a huge cost item, but we deal with one drive failure per year at most. Our colocation contract has provisions for “free” remote hands to swap failed hot-swap drives.
- We are using a $120/hour rate to drive to the data center (20 minutes each way) and perform any service tasks, including racking new gear.
- Our budget for 2017 was similarly around $20,000 yet our actual costs were well under $16,000.
- As we decommission nodes, their value is not $0. We can often get over 15-20% of our storage and networking purchase price back selling our used gear if we wanted. This is not included in the modeling but is a reasonable caveat.
- Upgrading intra-node switches and cabling from 10GbE to 40GbE added significant cost that could be avoided. One could also move to single switch fabrics which would save additional dollars.
- AWS can have significant cost reductions each year. If this was a 5% variance we would probably go with AWS. As it stands, our “worst case/ way overbuilt” 2018 budget is a 42% savings over AWS.
- AWS projected 2017 costs (using m4 generation instances) were approximately 135% of our actual self-hosted costs in 2017.
- Moving from a 4x to a 3x or 2x overbuild ratio would have a transformative impact on our self-hosting costs but would increase the need to quickly replace failed hardware.
- Beyond spare compute and memory, we also have excess bandwidth and storage. The excess bandwidth means that STH site growth translates into near zero incremental cost in the self-hosted model but in the AWS model would add significant cost. With STH growing at a triple-digit annual pace, bandwidth savings quickly outpace cloud provider data transfer price cuts.
On balance, we are understating the benefit of self-hosting and still end up 42% ahead. This analysis is simple to do, but it is something we revisit each year to ensure self-hosting remains worthwhile.
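The headline comparison reduces to one division. Here is a sketch using the AWS partial up-front total from above and the roughly $20,000 self-hosted annual budget mentioned in the bullet points:

```python
# Percentage saved by self-hosting versus the AWS reserved total.
def savings_pct(self_hosted: float, aws: float) -> float:
    return (1 - self_hosted / aws) * 100

# ~$20,000 self-hosted budget vs. $34,870 AWS partial up-front total.
print(round(savings_pct(20000, 34870), 1))  # -> 42.6, in line with the 42% above
```

Because actual 2017 spending came in under $16,000 against a similar budget, the realized savings would be even larger than this budgeted figure.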
Why Does Everyone Not Self-Host Versus Cloud?
There are a few key factors behind the big push to the cloud. STH's growth profile is aggressive, but it is entirely manageable. While 1000% annual growth this year would be awesome, it is also something we can handle in our current model at near-zero incremental cost. If you have a larger operation with a more complex set of applications, the economics can change. Likewise, if you are expecting 1000% per month growth, scaling in the cloud is clearly advantageous. Many VCs specifically dictate that their portfolio companies use cloud infrastructure over self-hosting.
There is a marked push for running hybrid cloud models. Many companies are essentially looking at their “base load” that they can run in their data centers, similar to what we are doing. They can then burst using cloud capacity for seasonal events or specific tasks.
AWS in 2018 is the Microsoft of 2000 in that there is huge vendor lock-in as companies build on the AWS services infrastructure. At STH, we use few AWS-specific tools (though we still use some), which also makes the workloads budgeted here metal-agnostic. We also are not handling healthcare data, so having HIPAA-compliant infrastructure, which cloud providers offer, is not a top priority.
One will hear a number of industry analysts push cloud, as that is undoubtedly a major thrust in the market. Hardware vendors who pay for industry analysts look at hyper-scale customers as a major segment. Cloud is still cool. STH self-hosting is decidedly uncool, however it is hardly a trailblazing endeavor. It is an established model, and companies like Box that leave AWS tout massive cost savings.
Perhaps the #1 reason cloud is so popular goes back to just how much VC money has been spent over the past few years. Much of the top-tier IT talent that is capable of managing this type of infrastructure is working on building product and services rather than optimizing on the cost side of the equation.
Everyone has different workloads, so this should simply be taken as an example and a data point. We are starting to see more companies look at the hybrid-cloud model since it provides flexibility benefits while still allowing a company to reap the rewards of better cost models. Our absolute cost savings are relatively small. Saving $15,000-20,000 per year is less than 0.1 FTE in Silicon Valley wages. That is why we generally see this push happen as AWS bills exceed $10,000 per month rather than at a smaller scale. We also have access to a ton of different technologies, so we know exactly where we can make trade-offs. Perhaps instead of reasonably prudent IT buyers we are exceedingly prudent IT buyers, which makes the model work.