Stephen E. Arnold: Startup Tamr Focuses on Automated Data Cleanup for Incoherent Legacy Big Data

Advanced Cyber/IO
Stephen E. Arnold

Rising Startup Tamr Has Big Plans for Data Cleanup

An article on Gigaom is titled Michael Stonebraker’s New Startup, Tamr, Wants to Help Get Messy Data in Shape. With $16 million in funding from Google Ventures and New Enterprise Associates, Stonebraker and partner Andy Palmer are working to crack the ongoing problem of data transformation and normalization. The article explains,

“Essentially, the Tamr tool is a data cleanup automation tool. The machine-learning algorithms and software can do the dirty work of organizing messy data sets that would otherwise take a person thousands of hours to do the same, Palmer said. It’s an especially big problem for older companies whose data is often jumbled up in numerous data sources and in need of better organization in order for any data analytic tool to actually work with it.”

Teaching machines some human-like insight into repetitive cleanup work just might be the trick. Tamr still requires a human in the management seat, known as the data steward: someone who reads the results of a projected comparison between two separate data sets and decides whether the relationship is a good one. Tamr has been compared to Trifacta, but Palmer insists that Tamr is preferable for its ability to compare thousands of data sources under a data steward’s oversight. He also noted that Trifacta co-founder Joe Hellerstein was a PhD student of Stonebraker’s.
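
The division of labor described here, where software proposes relationships between two data sets and a data steward approves or rejects the uncertain ones, can be illustrated with a minimal Python sketch. This is not Tamr's algorithm or API: plain string similarity from the standard library stands in for the machine-learning model, and the function names (similarity, propose_matches, steward_review) and the thresholds are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 string similarity score (a stand-in for a learned model)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def propose_matches(source_a, source_b, low=0.5, high=0.9):
    """Pair each field in source_a with its best candidate in source_b.

    Pairs scoring at or above `high` are accepted automatically, pairs
    between `low` and `high` are queued for the data steward, and the
    rest are dropped. Thresholds are illustrative assumptions.
    Assumes source_b is non-empty.
    """
    accepted, needs_review = [], []
    for a in source_a:
        best = max(source_b, key=lambda b: similarity(a, b))
        score = similarity(a, best)
        if score >= high:
            accepted.append((a, best, score))
        elif score >= low:
            needs_review.append((a, best, score))
    return accepted, needs_review


def steward_review(needs_review, decide):
    """Keep only the queued pairs the steward's `decide` callback approves."""
    return [(a, b, s) for a, b, s in needs_review if decide(a, b, s)]


if __name__ == "__main__":
    legacy = ["customer_name", "zip"]
    warehouse = ["Customer_Name", "postal_code", "zipcode"]
    accepted, queued = propose_matches(legacy, warehouse)
    # A real steward would inspect each pair; here every queued pair is approved.
    approved = steward_review(queued, lambda a, b, s: True)
    print("auto-accepted:", accepted)
    print("steward-approved:", approved)
```

The point of the two thresholds is that the steward only sees the ambiguous middle band, which is what would make oversight of thousands of sources plausible for one person.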

Chelsea Kerwin, June 13, 2014

Sponsored by ArnoldIT.com, developer of Augmentext

Phi Beta Iota: The reality that most big data is the wrong data, collected within the wrong model for the wrong reasons, cannot be overstated. It is helpful to be able to clean up and integrate legacy data, particularly in Earth Science, but as a general rule you are better off creating a new architecture, such as the one Medard Gabel has proposed, that can collect, process, and analyze all data from all sources in all languages and mediums in near real time, without running into the legacy obstacles.
