Chris Anderson's article in this month's Wired seems to be making the case that massive data sets (and our ability to manipulate them) mean that causality is no longer relevant: if you can find a correlation in the data, then that is all that matters.
It depends on what you are trying to do. The (possibly urban legend) Wal*Mart data warehousing story, in which a correlation spotted between beer sales and nappy sales led to the items being stocked together on the shop floor (because new fathers wanted alcoholic relief from their newborn progeny), and Google's AdWords engine don't need to understand motive to take a punt at increasing sales. However, if you want to do something about an issue, correlation without causality is surely a big problem – otherwise, how do you know how to go about changing things?
A great example of this was told to me by a guy I was working with a couple of years ago who came from one of the main electricity supply companies. The firm had noticed that there was a big peak in electricity demand when knock-out cup football matches were being televised, and that the size of the peak seemed to correlate with the length of the game. A game ending after extra time and penalties would see a much greater peak in demand than one that finished after 90 minutes.
The correlation here is simple – between length of match and peak power demand. But the electricity supply companies couldn't approach FIFA to ask for the rules to be changed to shorten games, just to address what was a major supply issue for them. Nor, as it turned out, was match length the real problem.
After much investigation, the cause of the demand appeared to be as follows. The longer the match, the more liquid the viewers consumed. Extra time and penalties increased the tension, and thus increased the liquid (beer, tea, whatever…) intake. At the end of the match, the nation sighed a collective sigh of relief, and then as one pootled off to the bathroom to relieve themselves. The almost synchronised flushing of cisterns put an enormous draw on the water supply, and it was the water pumping stations, having to work at very high capacity, that were seen to be the primary culprit for the peak in electricity demand…