The dust hasn’t yet settled on the UK’s General Election, but one element already getting coverage is the gap between the opinion polling before the ballot, the exit poll, and what appears to be the final outcome.
After weeks of pollsters predicting a result “too close to call”, the exit poll published when the ballot boxes closed at 10pm last night seemed to take many by surprise. The Scottish Nationalists, who were predicted to have a very successful evening, urged caution. Paddy Ashdown, former leader of the Liberal Democrats, claimed he’d eat his hat if his party’s predicted collapse came to pass. Someone pass the ketchup.
Pre-election polls seem mostly to be based on sample sizes of around 1,000, and aggregating those polls over time didn’t make them any more accurate in predicting the outcome. The exit poll took the views of 22,000 voters, crucially as they left polling stations, from around 140 constituencies across the country.
In the world of big data hype there are two things worth noticing here: firstly, the questions and the situation of data gathering are crucial to the value of data (“Who are you going to vote for?” is less valid than “Who did you just vote for?”); and secondly, more data might be helpful, but if your sample or your questions suck, it doesn’t make things any better. The thousands of people surveyed in aggregate across the pre-election polls were far less valuable than the 22,000 on the day of polling. And even that sample of 22,000 is absolutely tiny in the context of Big Data. The entire data set could conceivably be contained within about 44 kilobytes. The whole thing could undoubtedly be crammed onto an old school floppy disk.
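A quick back-of-envelope sketch makes both points. The standard margin-of-error formula (assuming simple random sampling, which real polls only approximate) shows how little extra precision the bigger sample buys, and why no sample size fixes a biased question: sampling error shrinks with n, but bias doesn’t. The formula and the byte estimate below are my own illustration, not anything from the pollsters.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a proportion from n respondents,
    at the worst case p = 0.5 (simple random sampling assumed)."""
    return z * math.sqrt(p * (1 - p) / n)

# A typical pre-election poll vs. the exit poll's sample
print(f"n=1,000:  ±{margin_of_error(1000) * 100:.1f} points")   # ~±3.1
print(f"n=22,000: ±{margin_of_error(22000) * 100:.2f} points")  # ~±0.66

# And the whole exit-poll data set really is tiny: one vote per
# respondent, coded as a couple of bytes each
print(f"22,000 votes x 2 bytes = {22000 * 2 / 1000:.0f} KB")
```

Twenty-two times the data cuts random error roughly fivefold (precision improves with the square root of n), but asking the wrong question shifts every answer by the same bias regardless of how many people you ask.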
The lesson? Take your time getting your method and sampling right, and your data science doesn’t necessarily need to be big. Look at data that is similar to, but not precisely, what you want to know and you can draw false conclusions. Chucking more wrong data at the question doesn’t make the answer any more right.