Stephen E. Arnold: What Is a Big Data Lake?

IO Impotency
Stephen E. Arnold

Chasing Non Swimmers from the Data Lake

If you are one of the Big Data believers, you will find “Clearing Up Muddied Waters in the ‘Data Lakes’” a reminder about the plasticity of concepts and their connotations. The write up addresses a clever phrase used to describe a storage pool into which

You store raw data at its most granular level so that you can perform any ad-hoc aggregation at any time. The classic data warehouse and data mart approaches do not support this.

The write up points out that the original notion of a data lake has been prodded, stretched, and pulled. Not surprisingly, after the verbal chiropractic, data lake is just not its old self.

Who are the perpetrators of this conceptual improvement? A “real” journalist and—no big surprise—several Big Data experts laboring away at a mid tier consulting firm.

So what? The coiner of the phrase points me and other readers to the original write up about data lakes here. Worth revisiting? Are the “real” journalist or the mid tier consultants likely to read the source document? I would guess not.

Stephen E Arnold, November 22, 2014

Phi Beta Iota: CTOs did not get to be CTOs by actually understanding the guts of Information Technology. They are decades removed from having programmed code, and they have little understanding of the totality of the digital and analog worlds, of holistic analytics, or of true cost economics (for which databases generally do not exist). They also have a limited understanding of open source everything engineering and why that matters in terms of affordability, interoperability at the code and datum levels, and of course scale, crossing all boundaries and borders. The myths and malpractice that most CTOs accept about Big Data are quite astonishing. Below is the key quote from James Dixon’s 2010 blog post that created the Data Lake concept (which is as far removed from Oracle or any other structured “big data” repository as one can get):

If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

See Also:

Analytics @ Phi Beta Iota

Big Data @ Phi Beta Iota

Open Source @ Phi Beta Iota

True Cost @ Phi Beta Iota