Stephen E. Arnold: Big Data Tired, Limited Data Wired

IO Sense-Making
Stephen E. Arnold

Advice on Making the Most of Limited Data

The article “How To Do Predictive Analytics with Limited Data” from Datameer on Slideshare suggests that Limited Data may replace Big Data in importance. It presents “semi-supervised learning” as a way to handle the difficulties of making predictions from limited data, such as expense, manageability, and simply missing key data. The overview states,

“As it turns out, recent research on machine learning techniques has found a way to deal effectively with such situations with a technique called semi-supervised learning. These techniques are often able to leverage the vast amount of related, but unlabeled data to generate accurate models. In this talk, we will give an overview of the most common techniques including co-training regularization. We first explain the principles and underlying assumptions of semi-supervised learning and then show how to implement such methods with Hadoop.”

The presentation summarizes possible approaches to semi-supervised learning and the assumptions that can be made about unlabeled data (these include the clustering, low-density, and manifold assumptions). It also covers the concepts of Label Propagation and Nearest Neighbor Join. However, as inviting as it is to forget Big Data and switch to predictive analytics with Limited Data, the suggestion may sound too much like Bayes-Laplace.
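For readers who want a feel for what Label Propagation does, here is a minimal in-memory sketch in Python. It is not the presentation's Hadoop implementation; the function names (knn_graph, propagate_labels), the toy points, and the brute-force neighbor search standing in for a Nearest Neighbor Join are all assumptions made for illustration. The idea: connect each point to its nearest neighbors, then let the few known labels spread along those links to the unlabeled points.

```python
import math


def knn_graph(points, k):
    """Build a symmetric k-nearest-neighbor graph by brute force.

    Hypothetical helper: a stand-in for a nearest neighbor join,
    suitable only for small in-memory data.
    """
    n = len(points)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        by_distance = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in by_distance[:k]:
            neighbors[i].add(j)
            neighbors[j].add(i)  # keep the graph symmetric
    return neighbors


def propagate_labels(points, known, k=2, iterations=50):
    """Spread the labels in `known` (index -> label) to every point.

    Assumes k >= 1 and at least two points, so every point has a neighbor.
    """
    classes = sorted(set(known.values()))
    neighbors = knn_graph(points, k)
    # scores[i][c] is how strongly point i currently belongs to classes[c].
    scores = [
        [1.0 if known.get(i) == c else 0.0 for c in classes]
        for i in range(len(points))
    ]
    for _ in range(iterations):
        updated = []
        for i, nbrs in enumerate(neighbors):
            if i in known:
                updated.append(scores[i])  # labeled points stay fixed
            else:
                # Unlabeled points take the average of their neighbors' scores.
                updated.append(
                    [sum(scores[j][c] for j in nbrs) / len(nbrs)
                     for c in range(len(classes))]
                )
        scores = updated
    # Pick the strongest class per point; ties fall to the first class.
    return [
        classes[max(range(len(classes)), key=lambda c: row[c])]
        for row in scores
    ]


# Toy data: two well-separated groups, with one labeled point in each.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
          (5.0, 5.0), (5.2, 5.1), (5.1, 4.8)]
known = {0: "a", 3: "b"}
labels = propagate_labels(points, known)
print(labels)
```

With this toy data, the first three points come out labeled "a" and the last three "b", even though only two of the six points were labeled to begin with. This works because the points form two tight groups, which is the clustering assumption at work; on data without that kind of structure the spread labels would be far less trustworthy.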

Chelsea Kerwin, February 12, 2014

Sponsored by ArnoldIT.com, developer of Augmentext