Friday, 10 October 2014

AWS Redshift tinkering

For a long time I've used a little hack (http://www.albert.nu/programs/filelinestatistics/), written in my spare time, for ad hoc analysis of large volumes of log files. Given a decent-sized machine to run it on, it had no problem digging in and querying for any aggregation, or finding details, in gigabytes of compressed log files.

But every once in a while you come across a project where the data analysis needs are just that much greater. For the last few days I've been running my analysis against a few different AWS Redshift clusters. Some simple lessons learned:

Size matters. When working with terabytes of data, even if you can load it into a fairly small cluster, you need dozens of machines to get decent performance. At least for my use case, with log files from web applications, it's best to go for the SSD nodes, which have less storage but more powerful machines, and to make sure you have as many of them as possible. You might want to contact Amazon to raise the node limit from the start.

Use the COPY command and sacrifice a small bit of quality for shorter lead times. Depending on your options for continually loading the data you might not need to optimize this, but if you, like me, always have more systems and logs than you'd ever have capacity to keep in your database, it becomes important to load the dataset you want fairly fast. If you store your logs on S3, it is simple to use the COPY command to load surprising amounts of data in a few minutes, provided you have a large enough cluster.
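
To make that a bit more concrete, here is a rough sketch in Python of what such a COPY statement could look like. It only builds the SQL text; actually running it against the cluster is left out. The table name, bucket, and IAM role are made-up placeholders, and reading "sacrifice a small bit of quality" as a MAXERROR setting (letting the load skip a limited number of bad rows instead of aborting) is my assumption here, as is the gzip-compressed, space-delimited log format.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_copy_statement(table, s3_prefix, iam_role, max_errors=100, delimiter=" "):
    """Build a Redshift COPY statement that loads gzip-compressed logs from S3.

    `table` is inserted as-is, so it must come from trusted input.
    MAXERROR lets the load continue past up to `max_errors` rejected rows.
    """
    if max_errors < 0:
        raise ValueError("max_errors must be non-negative")
    return (
        f"COPY {table}\n"
        f"FROM {sql_literal(s3_prefix)}\n"
        f"IAM_ROLE {sql_literal(iam_role)}\n"
        f"GZIP\n"
        f"DELIMITER {sql_literal(delimiter)}\n"
        f"MAXERROR {int(max_errors)};"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(build_copy_statement(
        table="weblogs",
        s3_prefix="s3://example-bucket/logs/2014/10/",
        iam_role="arn:aws:iam::123456789012:role/ExampleRedshiftCopy",
    ))
```

Pointing FROM at a key prefix rather than a single file lets Redshift pick up every object under that prefix, so splitting the logs into many files gives the nodes something to load in parallel.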

Beware of resizing a cluster that holds tons of data. If possible, just empty the cluster, resize it, and reload the data. When loading from S3 you don't pay anything extra for data transfer as long as you keep the cluster in the same region as the log files. If the cluster is empty you can often do a resize in less than half an hour, sometimes closer to fifteen minutes.