After tinkering with Amazon Redshift I was interested in comparing it with Google BigQuery. Both are built for analyzing huge datasets, so how different can they be? It turns out they have much in common, but in some areas they are very different beasts…
The general capability to easily handle incredible amounts of data is the same; it is truly mind-boggling how these services allow you to work with multi-billion-row datasets. Just as with Redshift, it really helps to have the raw data heavily compressed (to reduce storage cost), easy to process, and stored somewhere the service can access efficiently. For BigQuery, keeping the raw data in Google Cloud Storage makes loading operations simple and fast.
Operating the two solutions is very different. Redshift has a web UI that lets you manage your cluster and all its aspects (node type, number of nodes and so on), whereas BigQuery is more of a managed service that relieves you of all the infrastructure details. The main benefit of the BigQuery model is ease of use and quick scalability (no need to resize clusters); the main benefit of Redshift is that you really feel your data is on your own platform rather than in a shared service (the kind of minor point that still seems to be important in some contexts).
Loading data is done with a batch load command (like the Redshift copy command), with a wizard-like user interface for configuring the details of the load. Although I was seriously impressed with the fantastic performance of my large Redshift clusters, BigQuery was even faster (single-digit minutes instead of double-digit minutes). The batch load wizard is simple to operate, but I miss some of the flexibility of the Redshift copy command, and I really missed the excellent result reports you get after a load operation. Due to a quirk in the internals of Google Cloud Storage and the lack of result feedback I really struggled with data loading initially, but Google support exceeded expectations, quickly gave me a workaround and has since fixed the problem.
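For reference, here is a minimal sketch of what the same kind of batch load looks like done programmatically with the Python BigQuery client instead of the wizard. The bucket, dataset and table names are made up for illustration:

```python
# Sketch: batch-load gzipped CSV files from Google Cloud Storage into a
# BigQuery table. Bucket, dataset and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,       # let BigQuery infer the schema
)

# A wildcard lets one job pick up a whole directory of compressed log files.
load_job = client.load_table_from_uri(
    "gs://my-log-bucket/weblogs/2015/04/*.csv.gz",
    "my_dataset.weblogs",
    job_config=job_config,
)
load_job.result()  # wait for the load to finish
print(f"Loaded {load_job.output_rows} rows")
```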
In terms of performance the services are quite different. On BigQuery performance is very consistent regardless of the size of the dataset; on Redshift you determine the performance by scaling the cluster size (at a cost). In general I think Google has struck a good enough balance for me to not care about it and just be happy that I don't have to think about it. When you factor in the large cluster you need on Redshift to get comparable performance, you are likely to get better performance on BigQuery unless you are willing to spend a lot.
Web UI: a really nice feature of BigQuery for getting going quickly, or for the odd ad-hoc query, is that you don't need any tools; there is a basic SQL query interface built into the web console.
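The same kind of ad-hoc query can of course also be run outside the console; here is a sketch using the Python client, against the hypothetical weblogs table from the load example above (the column names are invented too):

```python
# Sketch of an ad-hoc aggregation query run through the Python client;
# table and column names are placeholders, not from the original post.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT status_code, COUNT(*) AS requests
    FROM my_dataset.weblogs
    GROUP BY status_code
    ORDER BY requests DESC
"""
for row in client.query(query).result():
    print(row.status_code, row.requests)
```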
The pricing of the two services is difficult to compare, since with Redshift you pay for cluster runtime while with BigQuery you pay for storage and queries. For my scenario, with fairly large data volumes and a pattern of short periods of intense querying followed by long periods of little to no querying, BigQuery is more than a factor of 10 cheaper for similar performance. The Redshift cost comes from needing to keep a cluster running continuously for a low volume of ad-hoc queries; you can trade that low-latency access and high cost for higher-latency access at a lower cost (e.g. starting and restoring the cluster only when you need it), but with BigQuery I get the best of both worlds, since paying for the storage you need is still very cheap for huge datasets compared to running a cluster. Also note that with the super fast data loading in BigQuery you can keep even less data loaded and more raw data compressed instead. The largest BigQuery cost is the query cost: paying per amount of data processed, with large datasets and a service that can process terabytes in a few seconds, can hit you unexpectedly, and the feeling of paying for every single select statement is a bit nagging. In the end, though, $5 per terabyte of processed data is fairly cheap, and as long as you don't query all the columns in the table you can make pretty efficient queries. It is probably worthwhile to consider the different pricing models for your specific workload; in some cases (obviously in mine) the difference is huge.
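To make the "don't query all the columns" point concrete, here is a back-of-the-envelope sketch; the table size and column counts are invented numbers, and only the $5 per terabyte rate comes from the pricing discussed above:

```python
# Back-of-the-envelope BigQuery query-cost estimate.
# Only the $5/TB rate is from the post; table/column sizes are invented.
PRICE_PER_TB = 5.00          # USD per terabyte of data processed

table_size_tb = 2.0          # hypothetical table size
columns_in_table = 20        # hypothetical total number of columns
columns_queried = 2          # only the columns a query touches are scanned

# BigQuery is columnar, so cost scales with the columns you reference,
# not the full table size (assuming roughly equal column widths).
scanned_tb = table_size_tb * columns_queried / columns_in_table
print(f"Full scan:   ${table_size_tb * PRICE_PER_TB:.2f}")
print(f"Two columns: ${scanned_tb * PRICE_PER_TB:.2f}")
```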
Friday, 10 October 2014
AWS Redshift tinkering
For a long time I've used a little hack (http://www.albert.nu/programs/filelinestatistics/) written in my spare time to do ad hoc analysis of large amounts of log files. With a decent-sized machine to run it on, it was no problem to dig in and query for aggregates or find details in gigabytes of compressed log files.
But every once in a while you come across that project where the data analysis needs are just that much greater. The last few days I've been doing my data analysis against some different AWS Redshift clusters. Some simple lessons learned are:
Size matters. When working with terabytes of data, even if you can load it into a fairly small cluster, you need dozens of machines to get decent performance. At least for my use case, with log files from web applications, it's best to go for the SSD nodes with less storage but more powerful machines, and to make sure to have as many as possible. You might want to contact Amazon to raise the node limit from the start.
Use the copy command and sacrifice a small bit of quality for shorter lead times. Depending on your options for continuously loading the data you might not need to optimize this, but if you, like me, always have more systems and logs than you'd ever have capacity to keep in your database, it becomes important to load the dataset you want fairly quickly. If you store your logs on S3 it is simple to use the copy command to load surprising amounts of data in a few minutes, provided you have a large enough cluster (a small sketch of this follows after these notes).
Beware of resizing the cluster with tons of data in it; if possible, empty the cluster and reload it after the resize. When loading from S3 you don't pay anything extra for data transfer as long as you keep the cluster in the same region as the log files. If the cluster is empty you can often do a resize in less than half an hour, sometimes closer to fifteen minutes.
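Here is a minimal sketch of what such a copy-based load can look like from Python, driving the Redshift copy command over a regular PostgreSQL connection with psycopg2. The cluster endpoint, table, bucket and credentials are all placeholders, and MAXERROR is the "sacrifice a small bit of quality" knob that lets the load skip a handful of bad rows instead of failing:

```python
# Sketch: run a Redshift COPY from S3 over a plain PostgreSQL connection.
# Cluster endpoint, table, bucket and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.eu-west-1.redshift.amazonaws.com",
    port=5439,
    dbname="logs",
    user="analyst",
    password="...",
)

copy_sql = """
    COPY weblogs
    FROM 's3://my-log-bucket/weblogs/2014/10/'
    CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
    GZIP
    DELIMITER '\\t'
    MAXERROR 100;  -- tolerate a few bad rows for shorter lead times
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift parallelizes the load across the cluster
conn.close()
```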