Wednesday, April 15, 2015

Google BigQuery compared to Amazon Redshift

After tinkering with Amazon Redshift I was curious to compare it with Google BigQuery. Two solutions that are both built for analysis of huge datasets, how different can they be? It turns out that they have much in common, but in some areas they are very different beasts…

The general capability to easily handle incredible amounts of data is the same; it is truly mindboggling how these services allow you to handle multi-billion-row datasets. Just as with Redshift, it really helps to have the raw data available in a heavily compressed format (to reduce storage cost) that is easy to process, in a place the service can access efficiently. For BigQuery, storing the raw data in Google Cloud Storage makes load operations simple and fast.

Operating these two solutions is very different. Where Redshift has a web UI that lets you manage every aspect of your cluster (hardware type, number of nodes and so on), BigQuery is more of a service offering that relieves you of all the infrastructure details. The main benefit of the BigQuery model is ease of use and quick scalability (no need to resize clusters), and the main benefit of Redshift is that you really feel that your data is on your own platform, not in a shared service (the kind of minor point that still seems to be important in some contexts).

Loading data is done with a batch load command (like the Redshift copy command), and there is a wizard-like user interface for configuring the details of the load. Although I was seriously impressed with the fantastic performance of my large Redshift clusters, BigQuery was even faster (single-digit minutes instead of double-digit minutes). The batch load wizard is simple to operate, but I miss some of the flexibility of the Redshift copy command, and I really missed the excellent result reports you get after a load operation. Due to a quirk in the internals of Google Cloud Storage and the lack of result feedback I really struggled with data loading initially, but Google support was beyond expectations: they quickly gave me a workaround and have since fixed the problem.

In terms of performance the services are quite different. On BigQuery the performance is very consistent regardless of the size of the dataset; on Redshift you determine the performance by scaling the cluster size (at a cost, though). In general I think Google has struck a good enough balance for me not to care about it and just be happy that I don't have to think about it. Factoring in the large cluster size you need on Redshift to get comparable performance, you are likely to see better performance on BigQuery unless you are willing to spend a lot.

The web UI is a really nice BigQuery feature for getting going quickly or for running the odd ad-hoc query: you don't need any tools, since a basic SQL query interface is built into the web console.
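Just to illustrate what that feels like, an ad-hoc query in the console can be as simple as the sketch below (the dataset, table and column names are made up for the example):

-- Hypothetical log table: requests per status code
SELECT status_code, COUNT(*) AS requests
FROM logs.access_log
GROUP BY status_code
ORDER BY requests DESC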

The pricing of the services is difficult to compare, since you pay for cluster runtime in the case of Redshift and for storage and queries in the case of BigQuery. For my scenario, with fairly large data volumes and a pattern of short periods of intense querying followed by long periods of little to no querying, BigQuery is more than a factor of 10 cheaper for similar performance. The Redshift cost comes from the need to continuously run a cluster for a low volume of ad-hoc queries. You can trade that low-latency access at high cost for high-latency access at a lower cost (eg: starting and restoring the cluster when you need it), but with BigQuery I get the best of both worlds: paying for the storage you need is still very cheap for huge datasets compared to running a cluster. Also note that with the super fast data loading in BigQuery you can keep even less data loaded and more raw data compressed instead of loaded.

The largest BigQuery cost is the query cost. Paying for data processed, when you have large amounts of data and a service that can chew through terabytes in a few seconds, can hit you unexpectedly, and the feeling of paying for every single select statement is a bit nagging. In the end, though, $5 per terabyte of processed data is fairly cheap, and as long as you don't query all the columns in the table you can make pretty efficient queries. It is probably worthwhile to consider the different pricing models for your specific workload; in some cases (obviously in my case) the difference is huge.
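Since BigQuery is a columnar store and bills only for the columns a query actually reads, the simple way to keep query cost down is to select just the fields you need. A sketch with made-up table and column names:

-- Only the url and bytes_sent columns are scanned, so only they are billed
SELECT url, SUM(bytes_sent) AS total_bytes
FROM logs.access_log
GROUP BY url

-- Avoid this on wide tables: every column is read and billed
-- SELECT * FROM logs.access_log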

Friday, October 10, 2014

AWS Redshift tinkering

For a long time I've used a little hack (http://www.albert.nu/programs/filelinestatistics/) written in my spare time to do ad hoc analysis of large amounts of log files. With a decent-sized machine to run it on there was no problem digging in and querying for any aggregation or finding details in gigabytes of compressed log files.

But every once in a while you come across that project where the data analysis needs are just that much greater. The last few days I've been doing my data analysis against some different AWS Redshift clusters. Some simple lessons learned are:

Size matters: when working with terabytes of data, even if you can load it into a fairly small cluster, you need dozens of machines to get decent performance. At least for my use case, log files from web applications, it's best to go for the SSD nodes with less storage but more powerful machines, and to make sure to have as many of them as possible. You might want to contact Amazon to raise the node limit from the start.

Use the copy command and sacrifice a small bit of quality for shorter lead times. Depending on your options for continually loading the data you might not need to optimize this, but if you, like me, always have more systems and logs than you'd ever have capacity to keep in your database, it becomes important to load the dataset you want fairly fast. If you store your logs on S3 it is simple to use the copy command to load surprising amounts of data in a few minutes, provided you have a large enough cluster.
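Roughly the kind of copy command I mean, as a sketch; the table, bucket and credentials are placeholders, and the exact options depend on your log format:

-- Load gzipped, tab-separated log files straight from S3 into Redshift
COPY access_log
FROM 's3://my-log-bucket/2014/10/'
CREDENTIALS 'aws_access_key_id=<your-key>;aws_secret_access_key=<your-secret>'
GZIP
DELIMITER '\t'
MAXERROR 100;  -- tolerate a few bad rows instead of failing the whole load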

Beware of resizing the cluster when it holds tons of data; if possible, just empty the cluster and reload the new cluster. When loading from S3 you don't have any extra cost for data transfer as long as you keep the cluster in the same region as the log files. If the cluster is empty you can often do a resize in less than half an hour, sometimes closer to fifteen minutes.

Tuesday, August 13, 2013

CDN? The ISPs are doing it for you!

While analyzing the CDN access logs for a site I realized that the ratio of pageviews per visit didn't at all reflect the number of css/js files that were transferred. Considering browser caching I expected roughly one css/js access per visit, perhaps slightly less considering some return visitors.

I found that accesses to css/js were a factor of 10 to 100 fewer than expected. I also found that a small bunch of IPs were causing huge amounts of traffic. The top IPs causing traffic to the site fell into two categories: crawlers (googlebot, bingbot and similar) and ISPs.
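For reference, the second finding boils down to an aggregation like the sketch below; my actual analysis was done with my own log file tool, so the table and column names here are made up:

-- Which client IPs request the most css/js files?
SELECT client_ip, COUNT(*) AS css_js_requests
FROM cdn_access_log
WHERE url LIKE '%.css' OR url LIKE '%.js'
GROUP BY client_ip
ORDER BY css_js_requests DESC
LIMIT 20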

Obviously crawlers don't need to fetch the css/js files over and over again, but why don't the ISPs fetch them when the traffic clearly comes from multiple clients behind a large NAT setup or similar? Thinking about it for a while, my best guess is that they simply do what the CDN does: pass the traffic through a caching proxy and cache everything according to the http headers, with some extra intelligence to figure out what is important enough to stay in the cache.

This has two important implications:

1. If you think you can control caching by controlling your CDN configuration and using the CDN purge function, you are wrong. Just as something stuck in the browser cache is out of your control, the ISP cache is out of your control.
2. If you don't cache bust your resources properly you'll end up with a lot of weird behaviour for users on ISPs that have a cache.

If enough ISPs start doing this, and do it well, I even see it as an important improvement to the overall performance of the Internet. The ISP cache would be close to the user, and in terms of traffic and end-user performance it is a win-win for both the ISP and the site owner. This is really a half-decent content delivery network, and it's free!

If someone has insight into ISPs it would be interesting to hear what technology they are using and how they are thinking. My findings might be specific to a site targeting mobile users; maybe mobile operators are more aggressive in this area? But it would really be beneficial to all ISPs.

Surprised? I've always known that there is potential for all kinds of proxies around the Internet. I had just never seen it in effect, and I certainly didn't expect it to be this effective!

Sunday, December 23, 2012

Arm'ed for 2013

It wasn't planned and it wasn't expected, it just happened: my main computers are now ARM-based. I could have sworn that it would not happen, but it did.

I've had at least 3-5 computers at any given point in time over the last 15 years. They've all been x86 machines, mainly desktops. I still have those, but 80% of my usage is now on ARM devices.

How did it happen?
1. The iPad is my main surf/email/game/tv device.
2. The big file server is now an archive machine (rarely used) and a Synology DiskStation is the main file server, complementing the iPad with storage.

There are only three main things that I do on what used to be the "main rig": media encoding, FPS games and work (coding). Once I'm tired of the FPS games I still run on it, it will go and I'll just have the laptop left to work on.

It is fantastic how far I get now with so little; instead of big bulky computers, two small devices take care of all my computing needs. They don't allow all the tinkering I do love, but they work: silently, always on, always at hand.

Still, I'm writing this on my Windows/Intel computer. Why? It has the best keyboard, but I guess in 2013 I'll buy a nice keyboard for my iPad. Didn't see that one coming, I wonder what 2013 will bring...

Saturday, October 20, 2012

Retina support in CSS4

Retina displays are appearing in more and more devices and web developers really need a flexible solution for supporting both retina and non-retina devices in an efficient way.

Luckily additions to CSS4 propose a solution.

Before you question why you would consider CSS4 when working with the current browsers note that support for features sometimes appears quickly when it is really needed. This is such a case...

As blogged by Jason Grigsby here, there is support in Safari 6 and Chrome version 21 (the most widely used version since late August 2012) for specifying a set of images when defining background images in CSS4.

#test {
  /* fallback for browsers without image-set support */
  background-image: url(assets/no-image-set.png);
  /* let the browser pick the 1x or the 2x (retina) asset */
  background-image: -webkit-image-set(url(assets/test.png) 1x, url(assets/test-hires.png) 2x);
}

Edited example from Jason Grigsby's blog.

Various solutions based on JavaScript or on dynamically generating device-specific html are around, but they all share the same problem: you are trying to solve a presentational problem with code that lacks information about basic things such as user preference, available bandwidth and so on. With this solution you move the problem of selecting which image to load to the browser, which has a better chance of making an informed choice.

Browser compatibility is not great yet, but currently most retina devices are built by Apple. A high portion of those users are likely to use Safari 6 or Chrome, which solves the problem, as long as you remember to set the standard background-image as a fallback for everybody else.

Sunday, June 17, 2012

Browser preloading

A classic optimization on a web site is to configure the cache headers of a page so that the browser can display the page instantly if it has been loaded recently. This works very well when the user hits the back button to go back to the previous page.

What if we could do the same for the next page that the user will request? This is possible if we have two components:
  1. We need to guess which page is going to be requested.
  2. We need to tell the browser to preload it.
Number one can be addressed by gathering statistics of which pages are browsed on your site.
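A sketch of how those statistics could be gathered, assuming the page views end up in a database table with a session id, a url and a timestamp (the names are made up, and the query needs a database that supports window functions):

-- For each page, which page do visitors most often request next?
SELECT url, next_url, COUNT(*) AS transitions
FROM (
    SELECT url,
           LEAD(url) OVER (PARTITION BY session_id ORDER BY viewed_at) AS next_url
    FROM pageviews
) t
WHERE next_url IS NOT NULL
GROUP BY url, next_url
ORDER BY url, transitions DESC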

Number two is solved by adding a specific link tag that is so far supported by Firefox and Chrome, although implemented in slightly different ways.

The html link
<link rel="prefetch prerender" href="url-to-preload">

prefetch is used by Firefox. My testing indicates that the response to Firefox needs to have correct cache headers, otherwise the page will be requested again when the user actually navigates to it. You need to look at the web server logs to see the request; Firebug seems to disable the prefetching.

prerender is used by Chrome. My testing indicates that, regardless of cache headers, the next page load is instant if the user requests this page. The prerendering shows up as a cancelled GET request (screenshot below).

I'm working on a WordPress plugin that will gather usage statistics and generate preloading instructions for the browser.

Thursday, March 8, 2012

One sprite to rule them all?

It is widely known that sprites are a nice way to combine several images into one to make the browser load your web page quicker. But how far can the concept be taken without negative side effects?

In the picture above there is one big image of 300+ KB that contains all the images for an entire site theme. As you can see, the browser correctly starts loading this image early, but it also keeps starting to load additional images. In the end the big sprite is the last to finish, and the visual result is a broken-looking site that loads several images and only at the very end adds all the visually important bits and pieces of the theme.

Clearly a case when the concept was taken too far.

For site themes that have a lot of shadows and large theme graphics it is wise to split the load into multiple sprites. To limit the visual impact, consider moving all the small and colourful minor graphics to one small sprite that can load quickly, because waiting for those items is much more annoying than waiting for a background image of some kind.

Remember to set far-future cache headers on the sprites and your site will be lightning fast once the user has fetched them once.