It's not easy to build it just right, without building too much of one resource and not enough of another. By resources, I mean CPU power, memory, disk (space, throughput and IO performance), and network throughput. Understanding how these resources are typically used, and how they can be shared, is key to building your virtual infrastructure to the right spec.
If you look at a non-virtual server, the level of resources that it possesses and the level of resources that it uses throughout the day are usually quite different. Commonly, a physical server will use 5 - 15% of its CPU power on average, and perhaps 25 - 50% of its memory. Of course there are plenty of servers that are using all of their CPU or memory, and could use more, but on average, across all of your physical servers, these percentages tend to be fairly typical. Clearly, having all that extra CPU and memory in your data center is money wasted, not to mention the power you're paying for to run it all and keep it cool. It's all lost ROI.
With memory used more heavily than CPU, you can see right away that as you add VMs to a VM host, the memory requirements increase faster than the CPU requirements. This is mostly due to the fact that a VM can use a slice of CPU time, then another VM can take its turn, and so on and so on, by the microsecond. This is not true of memory, where large chunks of memory are allocated to a VM, and the process of reclaiming it for use by another VM is quite slow. So the first rule of high density virtualization is clear: Big RAM is a must.
About the RAM
Some of today's servers support crazy big RAM. You can stuff 1 TB of RAM into an HP BL685 blade server. Sounds awesome, but before you do that you need to understand the downside. To get the maximum amount of RAM into some of these servers, you have to use the largest memory modules (DIMMs). DIMMs are made larger by adding more rows of chips to the DIMM. Each row (referred to as a rank) presents an electrical load on the memory channel. When you have too many ranks installed, the load on the channel increases and forces it to operate at a slower speed.
The biggest DIMMs available today are often 16 GB and 32 GB DIMMs, and are often quad-rank. Filling your server with these DIMMs will likely lower your memory speed pretty substantially. However, there are some 16 GB dual-rank DIMMs that can operate at close to full speed. I've managed to get 512 GB of RAM into a BL685 running at high speed. So, rule number two is: stick to dual-rank DIMMs to keep your memory speed high. Pay close attention to the exact type of memory you're looking at before you buy it; it's critical for performance. That said, I'd rather have enough slow memory than not enough fast memory, because not enough memory means very bad performance or fewer VMs per host.
The more cores the merrier. A higher number of CPU cores per socket means fewer sockets, which means fewer servers and less space, power, and cooling. The latest and greatest is the sixteen-core processor from AMD. You can fit four of these in a BL685, giving you a 64-core server. You can certainly run a lot of VMs on that. With so many cores, the amount of cache on the processor becomes very important, so that each core can do a fair amount of work on cached data between trips out to main memory via the paths it has to share with all the other cores.
Some Quick Math
OK, we can build a 64-core host with 512 GB of RAM. That's huge. If you think about it, that's 8 GB of RAM per core. You could host 64 VMs, each with 8 GB of RAM, and have a one-to-one VM-to-core ratio. That's pretty extreme. If you look at your physical servers, they probably only use 1 - 2 GB of RAM today on average. So you'd probably only end up using 2 x 64 = 128 GB of your 512 GB. This is where you have to think about the future. As you build this host, you have to wonder what will happen to your memory requirements a few years out.
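The arithmetic above can be sketched out quickly. The figures here are the illustrative numbers from this article, not measurements from any real host:

```python
# Back-of-the-envelope sizing for the hypothetical 64-core / 512 GB host.
cores = 64
ram_gb = 512

ram_per_core = ram_gb / cores      # 8 GB of RAM available per core
print(ram_per_core)                # 8.0

# If today's VMs average ~2 GB each (an assumption from the text),
# a one-VM-per-core load only touches a quarter of the RAM.
avg_vm_ram_gb = 2
ram_in_use = cores * avg_vm_ram_gb
print(ram_in_use)                  # 128 of the 512 GB
```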
Looking into the past, say five years ago, we were running Windows 2000 and Windows 2003, and memory requirements were lower back then. A typical Windows 2000 server used around 512 MB of RAM, and a Windows 2003 server used around 1 GB (more depending on what was running on them, obviously). Windows 2008 uses a bit more, and you have to wonder how much the next version will want.
Windows 2008 uses most of its free memory for file caching by default, so the 2008 VM will actually use up a lot of the memory you give it. This is actually a good thing, since any files retrieved from the cache are files you didn't have to read from the disk, using up precious disk IO. So, having extra memory to give your VMs is probably a good thing. Still, you have to figure out how much is the right amount, and not buy too much, because it could be money wasted.
Boot Storm on the Horizon
In VMware you can build several hosts and set up a high availability (HA) cluster. With an HA cluster, if a host fails, its VMs will be restarted on the other hosts. That's great: high availability for all your VMs. Now imagine our 64-core, 512 GB host fails, and now 64 VMs have to be restarted. How hard are your disks going to get pounded during all those reboots? How bad will the performance of your entire cluster be during that time? How long is it going to take before all your VMs are back up and normal performance returns? Really good questions.
This is where you have to wonder if building hosts this big is a good idea or not. Perhaps it's too risky to lose that many VMs at once. Maybe we should build two-socket servers with 256 GB of RAM instead? Then again, how often will a host fail? How fast do you need those VMs to be rebooted? In VMware, you can set restart priorities and time delays, so you can slow down the rush to restart the VMs to control the disk pounding. You'll have to make these decisions based on how much disk performance you have, and how much downtime you can stand.
A host with 64 VMs on it is certainly going to need good disk performance, and an HA cluster with several of these hosts, which must be attached to shared storage, is going to need great disk performance. I'll avoid writing a complete SAN guide here, but that performance will be based on two primary factors: the speed of the connection between the hosts and the disks, and the IO-per-second capacity of the disks.
Common connection types for shared storage include Fibre Channel (FC), iSCSI, and FCoE, each with various speeds available. Now with 64 VMs per host, I think it's fair to say that 1 Gb/sec iSCSI is out of the question. What's needed here is 8 Gb FC, 10 Gb iSCSI, or 10 Gb FCoE. Each will provide decent throughput to the shared storage. A pair of HBAs (in any of these flavors) in each host will provide good performance and fault tolerance. Once that's settled, the other factor in storage performance is the disk IOps (Input/Output operations per second).
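As a quick sanity check on that claim, here's the per-VM share of storage bandwidth at this density, assuming a pair of links per host and ignoring protocol overhead (a rough sketch, not a sizing tool):

```python
# Per-VM storage bandwidth for a 64-VM host with two HBAs/NICs,
# by fabric choice. Overhead and multipathing details are ignored.
vms_per_host = 64
links_per_host = 2

for fabric, gbit in (("1Gb iSCSI", 1), ("8Gb FC", 8), ("10Gb iSCSI/FCoE", 10)):
    total_mbit = links_per_host * gbit * 1000
    print(f"{fabric}: {total_mbit / vms_per_host:.1f} Mb/s per VM")
# 1 Gb iSCSI works out to ~31 Mb/s per VM, which is why it's out of the question
```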
IOps is a function of disk type and the number of disks that your data is striped across. Common disk types include SATA, SAS, Fibre Channel (FC), and SSD. SATA disks are commonly 1 - 2 TB in size and provide around 75 IOps per disk, while SAS and FC disks are usually 300 - 600 GB in size and provide around 180-200 IOps per disk. SSD disks are much faster (in the thousands of IOps range), limited in size and very expensive.
The key to achieving great disk performance, once you've decided on a disk type, is to stripe your data across as many disks as possible (wide striping). Wide striping can be done a few different ways, depending on the features of your storage array. When you create RAID sets, best practice says that you shouldn't create a RAID 5 set with more than around 8 disks (the more disks in a set, the more often a failure may occur). Using a simple storage array, this may be as wide as you can stripe. The IOps of the resulting LUN will be the IOps per disk times the number of disks in the RAID set (minus one for parity). So an 8-disk RAID 5 made of FC disks might be 200 x 7 = 1400 IOps.
With more advanced arrays, you can stripe a LUN across multiple RAID sets (called a MetaLUN in EMC terminology), thus adding the IOps of multiple RAID sets together. Ideally, your LUNs could be striped across every disk in your storage array, providing all of the IOps possible. The underlying RAID sets could be kept pretty small, say 4 or 5 disks per RAID set, and the LUNs would be striped across all of the RAID sets. This is how 3PAR arrays are typically configured, striping data across all of the disks of the array.
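A sketch of the IOps arithmetic from the last two paragraphs, using this article's ballpark per-disk figures (real numbers vary by drive model and workload):

```python
# Rule of thumb from the text: a RAID 5 set delivers roughly the per-disk
# IOps times the number of disks, minus one disk's worth for parity.
def raid5_iops(disks, iops_per_disk):
    return iops_per_disk * (disks - 1)

# The 8-disk FC example from the text
print(raid5_iops(8, 200))               # 1400

# Wide striping: a LUN striped across several small RAID sets
# (e.g. a MetaLUN) adds their IOps together.
raid_sets = 8
print(raid_sets * raid5_iops(5, 200))   # eight 5-disk FC sets -> 6400
```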
It's an important point. Just as we build hosts with a lot of CPU and memory, creating a large pool of shared resources, we also want a large pool of disk performance. Using the widest possible striping means that all our disk performance is shareable by all our VMs. Providing a few smaller stripe sets means that one set may be getting pounded while another is sitting idle. One big stripe set should be the theoretical goal.
Briefly, with a lot of VMs per host, the network throughput requirements obviously go up. The best bet here would be to use a pair of 10GbE network cards. The BL685 blade has two built in, each of which can be carved up into four logical NICs (8 total) presented to the blade. The speed of each logical NIC can be set to use a percentage of the total available throughput (2 x 10 Gb).
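As an illustration of the carve-up, here's one hypothetical split of a 10 Gb port into four logical NICs. The roles and percentages are invented for the example, not vendor defaults:

```python
# Hypothetical allocation of one 10 Gb port across four logical NICs.
port_gbit = 10
shares = {"VM traffic": 0.5, "storage": 0.2, "vMotion": 0.2, "management": 0.1}

assert abs(sum(shares.values()) - 1.0) < 1e-9  # shares must cover the whole port
for role, share in shares.items():
    # the same split would exist on the second port for redundancy
    print(f"{role}: {share * port_gbit:g} Gb/s")
```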
With non-blade servers it may be harder to get 10Gbe cards (at least for now), but you can cram a bunch of 1Gb ports into them. But again, larger shared pipes are better than a bunch of smaller pipes.
Scale Up vs Scale Out
As you've probably noticed, I've been making a case for scale up (a few large servers) vs scale out (many smaller servers) throughout the article. I'm making a philosophical point here, that one large pool is better than many small pools. That's the whole point of virtualization, and that's where the maximum ROI is.
Imagine the CPU and memory graph of your VMs over the course of the day. Each one will go up and down as it receives requests over the network and performs its daily tasks. Some VMs will rise together, either because they're part of the same application or because they often get hit at the same time of day (like around 9 AM on a Monday morning!), while other VMs will rise and fall with no correlation to other VMs.
You can think of the correlation between VMs as being in the range of 0 - 100%, where 0% correlation means the VMs are all entirely independent of each other, while 100% correlation means they all rise and fall together. Clearly, reality is somewhere in between.
Now if you had one giant host with all your VMs on it, you could watch the CPU and memory requirements go up and down, see what the peak usage is, and build your host to support that load, with a little headroom. And if your correlation factor were 0, you could build two hosts half that size and everything would be fine. But we know already that our correlation factor really isn't 0, which means that as you build smaller hosts, the odds go up that correlated workloads will cause a larger spike in utilization (as a percentage of the smaller host's capacity), meaning that we'll have to maintain more headroom per host.
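One way to see this is a toy headroom model (every number in it is invented for illustration): treat each VM's load as partly a shared, correlated signal and partly independent noise. The independent part averages away as you add VMs to a host; the correlated part does not, but the relative peak still shrinks as hosts get bigger:

```python
import math

# Toy model: peak-to-average load ratio for a host running num_vms VMs
# whose loads are equicorrelated. Per-VM average load is normalized to 1.0;
# per_vm_std, correlation, and z (how many sigmas of peak we provision for)
# are illustrative knobs, not measurements.
def headroom_multiple(num_vms, per_vm_std=0.5, correlation=0.1, z=3.0):
    mean_total = num_vms * 1.0
    # variance of a sum of equicorrelated loads: the independent part
    # grows with n, the correlated part with n squared
    var_total = num_vms * per_vm_std**2 * (1 + (num_vms - 1) * correlation)
    return (mean_total + z * math.sqrt(var_total)) / mean_total

for n in (16, 32, 64):
    print(n, round(headroom_multiple(n), 2))  # smaller hosts need more headroom
```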
This is exactly where we were with physical servers: each server had to be built to support its own peak utilization. The larger and fewer hosts you build, the more headroom you can share, and so the less total headroom you need to build.
The maximum host size will be based on risk, as I said before: how many VMs can you stand to lose at one time if a host fails, how long can you wait for them to come back up, and how much pounding can you stand in the meantime. Weigh the risk against how much ROI there is in building the largest hosts possible today, because that's where the money is hiding...