My Big Data Infrastructure Problem

Moving big data to and from that shiny new data warehouse solution is taking too long!  That's what we've discovered as we start ramping up our use of the solution.  We've got 150TB of space on this map/reduce cluster, ready to take some input data from some of our big Oracle servers, but when we try to copy a database over, it takes a long long time.  For example, a 3.5TB database took almost 32 hours to copy.  OK maybe we can live with that, assuming that input data loads are few and far between, but what if data loads are a regular occurrence?

I decided to do some math.  3.5TB is 3584GB, or 28672Gb (gigabits).  32 hours is 115200 seconds.  28672/115200 = 0.249 Gb/sec.  That's well under 1Gb/sec, so it looks like I'm traversing a 1Gb/sec network connection somewhere.  Who in the world would plug in a 150TB data warehouse at 1Gb?  Time to walk into the datacenter and take a look.  After chasing some cables, I discovered that the warehouse appliance was actually connected at 10Gb/sec.  Hmmm.
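For anyone who wants to rerun that arithmetic with their own numbers, here is a minimal Python sketch.  The function name is mine, and it assumes the same binary units used above (1TB = 1024GB).

```python
def effective_gbps(size_tb: float, hours: float) -> float:
    """Average throughput in gigabits per second for a completed copy.

    Assumes binary units (1 TB = 1024 GB) and 8 bits per byte,
    matching the back-of-the-envelope math in the post.
    """
    gigabits = size_tb * 1024 * 8   # TB -> GB -> Gb
    seconds = hours * 3600          # hours -> seconds
    return gigabits / seconds


# The 3.5TB copy that took almost 32 hours
rate = effective_gbps(3.5, 32)
print(f"{rate:.3f} Gb/sec")  # 0.249 Gb/sec
```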

Where are you guys copying the data from?  Via an NFS mount to the Oracle server.  Oh dear, it's got 1Gb connections.  Another copy job is running over another NFS mount to NAS storage, also at 1Gb.  Are we ready for big data?  Turns out we've got a fair amount of equipment connected at 10Gb, including our warehouse appliance, rows of blade enclosures, and a few backup solutions, but few if any servers have 10Gb connections.  The blade enclosures have back-of-enclosure switches with 10Gb uplinks but only 1Gb downlinks to the blades.  Looks like there's practically nothing in the room that can actually do file transfers at 10Gb.  Even our primary backup architecture, although connected at 10Gb, doesn't stream data all that fast for various reasons.

The problem isn't just about data loads either.  What if we want to back this thing up?  Now you may say that there's no need to back the data up.  If data is lost, just get the input data again, and run your map/reduce process again.  But what if business critical applications rely on the output data?  Can you wait 32 hours to copy the data over the network again, then some more hours to process it again?  Can you stand two days of downtime because someone accidentally smoked a file system or blew up a database?

Perhaps I'm not expressing myself very clearly with the mess of details I've written above, so let me put my architect hat on.  Why did we build this big data platform?  So business applications could consume the output data.  Some of those applications will certainly be considered business critical to some degree.  We're going to need a way to recover quickly from any data loss.

Now big data is often pretty static.  You get some input data, process it, and present the output data.  So you might think that you don't need to back up the output data.  If you lose it, you can just get the input data again and process it again.  But if that process takes you two days, that might not be an acceptable recovery SLA for your business applications.  Also, the data might be cyclical.  In other words, you might need to reload fresh input data every month, every week or whatever, then reprocess it and present fresh output data.

To support acceptable data delivery and recovery times, we're going to have to extend our architecture review outside of the big data platform to include our data sources (often these are other database servers in the environment) and our data targets (backup systems, offsite replication targets, etc.).  All of these other systems will probably need to be brought up to 10Gb connectivity in order to establish reasonable data transfer times for multi-TB data.
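To put rough numbers on "reasonable data transfer times", here is a small Python sketch that estimates copy time for a given data size and link speed.  The `efficiency` parameter is my own assumption, a fudge factor for how much of the line rate a copy actually achieves; real numbers depend on NFS, disks, and everything else in the path.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Estimated hours to move size_tb terabytes over a link.

    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of line rate actually achieved (0-1)
    Uses binary units (1 TB = 1024 GB), as in the math earlier in the post.
    """
    gigabits = size_tb * 1024 * 8
    return gigabits / (link_gbps * efficiency) / 3600


# Best case (full line rate) for the 3.5TB database
for link in (1, 10):
    print(f"{link} Gb/sec link: {transfer_hours(3.5, link):.1f} hours")
```

At full line rate, the 3.5TB copy would take roughly 8 hours at 1Gb and under an hour at 10Gb.  The 32 hours we actually saw works out to about a quarter of the 1Gb line rate, so connectivity alone won't tell you the whole story; measure before you promise anyone a number.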

Now it's likely that your data warehouse will contain a number of different data sets, and not all of them will require the same recovery SLAs.  In some cases, recovering the data by getting and reprocessing the input data again will be the best option.  In other cases, recovering the output data from backup may be the way to go.  And for the most critical data, replication to another warehouse appliance at a DR site could be the solution.  It's likely that you could end up with all of the above.

That said, building a big data platform and plugging it into your network at 10Gb is far from the end of your big data architecture solution.  We're going to have to look at each chunk of data, where it comes from and where it needs to go, establish recovery SLAs for each chunk, and make sure we can meet those SLAs.  This will likely require upgrading to 10Gb connectivity to your traditional database servers and your backup servers, and possibly increasing throughput on your WAN links.  If you plan to back up a lot of new data, you're going to have to add capacity to your backup architecture as well.
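Here's one way that per-chunk SLA check could be sketched in Python.  The data set names, sizes, and SLA hours below are made up for illustration, and the check only covers copy time over the network, not reprocessing time.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str                  # hypothetical label for a chunk of data
    size_tb: float             # size in TB (1 TB = 1024 GB)
    recovery_sla_hours: float  # how long the business can wait for a restore


def restore_hours(size_tb: float, effective_gbps: float) -> float:
    """Hours to copy size_tb at a measured (not nominal) throughput."""
    return size_tb * 1024 * 8 / effective_gbps / 3600


def sla_misses(datasets, effective_gbps):
    """Names of data sets whose restore copy alone would blow the SLA."""
    return [
        d.name
        for d in datasets
        if restore_hours(d.size_tb, effective_gbps) > d.recovery_sla_hours
    ]


# Illustrative values only
datasets = [
    Dataset("monthly_reports", 3.5, 48),
    Dataset("critical_feed", 3.5, 8),
]

# At the roughly 0.25 Gb/sec we measured, a 3.5TB restore takes about 32 hours
print(sla_misses(datasets, 0.25))
```

With these made-up numbers, only the data set with the 8 hour SLA fails at our measured throughput.  Rerunning with a higher `effective_gbps` shows what a faster path would buy you.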

Or, I suppose, you could simply wait a day or two for your copy to complete...

