I read “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights.” The write up from the newspaper that does not yet have hot links to the New York Times’ store, has revealed that Big Data involves “janitor work.”
Interesting. I thought that Big Data was a silver bullet, a magic jinni, a miracle, etc. The write up reports that “far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required.”
And who does the work? The MBAs? The venture capitalists? The failed Webmasters? The content management specialists? The faux consultants pitching information governance?
The work is done by data scientists.
The New York Times has learned:
Before a software algorithm can go looking for answers, the data must be cleaned up and converted into a unified form that the algorithm can understand.
Quiet a surprise for the folks at the newspaper.
How much of a data scientist’s time goes to data clean up? The New York Times has learned:
Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
What’s this mean in terms of cost?
Put simply, Big Data is likely to cost more than the MBAs, the venture capitalists, faux information government consultants, et al assumed.
So as the volume of Big Data expands ever larger, doesn’t this mean that the price tag for Big Data grows ever larger. I don’t want to follow this logic too far. Exponentiating costs and falling farther and farther behind the most recent data is likely to make the folks with those fancy, real time predictive models based on Big Data uncomfortable.
Don’t worry. Be happy. The Big Data did miss the Ebola issue, the caliphate, various financial problems, and a handful of trivial events.
Stephen E Arnold, August 18, 2014