Much of yesterday was spent at an ERP vendor event listening to the usual combination of analyst (“cloud is good, data is great”), vendor (“you should buy our product”) and vendor client (“you should buy their product because we did”) speakers. One big theme is that AI and machine intelligence will enable companies to achieve more by mining insight from data. Data is the new oil…
Coincidentally, yesterday I found a really interesting number. Machine-based statistical translation is one of the big success stories in the realms of data science. One of the boffins behind Google Translate has determined that for statistical translation between two languages you need a body of training data equivalent to around 2.5 billion words. Google used documents from the UN and the EU, and the net result is a really useful tool that provides results that are intelligible. You wouldn’t, however, use the results from Google Translate in something important like, say, a contract.
How big is 2.5 billion words? Well, on the back of an envelope, it’s around 15 gigabytes of data. In 2017, when the average phone packs eight gigs, that doesn’t sound like an awful lot. But our perceptions of data storage have been seriously skewed in the past decade by the ubiquity of digital images and, particularly, video. A picture might paint a thousand words, but it uses up a crapload of storage.
What’s 2.5 billion words in a more human context? The average novel is around 70,000 words; 2.5 billion is around 36,000 books. That’s a bookshop’s worth.
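For the curious, the back-of-the-envelope sums work out roughly like this (the six-bytes-per-word figure is my own assumption: an average of five characters plus a space):

```python
WORDS = 2_500_000_000      # training corpus size from the Google Translate figure
BYTES_PER_WORD = 6         # assumption: ~5 characters plus a space, one byte each
WORDS_PER_NOVEL = 70_000   # average novel length

gigabytes = WORDS * BYTES_PER_WORD / 1e9
novels = WORDS / WORDS_PER_NOVEL

print(f"{gigabytes:.0f} GB")       # roughly 15 GB
print(f"{novels:,.0f} novels")     # roughly 36,000 novels
```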
Which brings me to my point. The amount of data required by AI techniques to produce a useful but still inaccurate service for translating from one language to another is a bookshop’s worth. How much data does the average company hold about its employees? Maybe a bookshelf’s worth in total?
Now don’t get me wrong. Just by having better-structured data an organisation is likely to be able to make better-informed decisions. But the leap from a few performance reviews and some holiday bookings to magic insight popping out of the back of a data mining robot is huge.
Big data and AI take seriously large amounts of information. There might be external training data sets available, but if a vendor is claiming you’ll find insight merely from your own information you need to ask yourself a hard question: how big is your data lake?
3 thoughts on “How big is your data lake?”
“Sculpture is starting with a lump of rock and taking away all the things that aren’t a person.”
Just because a company has the data doesn’t mean it has the structure to be useful or that the company has the skills to use it.
I think those skills are the more valuable part of the analysis. The data is necessary but it’s far from sufficient.
For a lot of outcomes you might also argue that the data was not strictly necessary; a lot of insight can be reached by approaching specific problems in the right way. Data could be a brute-force way of doing things but don’t over-discount the skills required to wield it.
Reminds me of a slightly left-field post of mine from a couple of years ago… Data Jazz: https://mmitii.mattballantine.com/2015/02/04/data-jazz/