Enhancing Discovery in Data Science Through Novelty in Machine Learning

Note: Kirk will present two training sessions at the ODSC Europe 2020 Virtual Conference. One will focus on “Solving the Data Scientist’s Dilemma: the Cold-Start Problem with 10+ Machine Learning Examples” and the other will look at “Atypical Applications of Typical Machine Learning Algorithms.”

I have always appreciated the unusual, unexpected, and surprising things in science and in data. As famous science author Isaac Asimov once said, “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ (I found it) but ‘That’s funny!’” This is the primary reason that I encouraged most of the doctoral students whom I mentored at GMU to work on some variation of Surprise (or Novelty) Discovery for their Ph.D. dissertations.

“Surprise discovery” for me is a much more positive, exciting phrase than “outlier detection” or “anomaly detection”, and it is much richer in meaning, in algorithms, and in new opportunities. Finding the surprising, unexpected thing in your data is what inspires the exclamation “That’s funny!”, which may signal a great discovery about your data’s quality, about your data pipeline’s deficiencies, or about some wholly new scientific concept. As famous astronomer Vera Rubin said, “Science progresses best when observations force us to alter our preconceptions.”

My two training sessions will look at two different topics from a common perspective that reflects the theme of “novelty” through the study of some uncommon examples. Specifically, some (hopefully, most) of these examples may alter participants’ preconceptions (in a positive way) about their data science applications and the typical machine learning algorithms that they use every day. Each training session will present a series of examples (approximately 10 each) to demonstrate the overarching idea represented in the title of that session.

As a simple example, consider the following. In my first lesson in the Data Ethics course that I taught at the university, I asked my students to critique this: a famous politician once said that he was “absolutely shocked that half the students in this country receive below-average standardized test scores.” The responses that I heard from my students over the years ranged from a noncommittal “okay”, to political statements, to “of course, that’s what we would expect! Duh!” The latter students were right, sort of. I explained to them that they were right if the distribution of scores was symmetric (or, more generally, if the mean equals the median), but what happens when the score distribution is skewed? That really got them to think about things, even kurtosis!
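The point about skewness can be checked with a few lines of Python. This is only a minimal sketch: the two score distributions below are made-up illustrations (a normal distribution for the symmetric case and a lognormal for the skewed case), not real test data, and `fraction_below_mean` is a helper name introduced here for the example.

```python
import random
import statistics

def fraction_below_mean(scores):
    """Return the fraction of scores strictly below the arithmetic mean."""
    mean = statistics.fmean(scores)
    return sum(1 for s in scores if s < mean) / len(scores)

rng = random.Random(42)  # fixed seed so the run is repeatable

# Symmetric case: hypothetical scores drawn from a normal distribution.
symmetric = [rng.gauss(500, 100) for _ in range(100_000)]

# Skewed case: hypothetical scores with a long right tail (lognormal).
skewed = [rng.lognormvariate(0, 1) for _ in range(100_000)]

print(f"symmetric: {fraction_below_mean(symmetric):.3f} below the mean")
print(f"skewed:    {fraction_below_mean(skewed):.3f} below the mean")
```

For the symmetric sample, close to half of the scores fall below the mean, as the "of course!" students expected. For the right-skewed sample, the long tail pulls the mean above the median, so well over half of the scores are below average (in theory about 69% for this particular lognormal).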

My training sessions will not focus on those basic statistical concepts, but on other techniques, algorithms, and tools that data scientists commonly use. These include Bayes’ theorem, independent component analysis, Markov modeling, recommender engines, K-means clustering, K-nearest neighbors, neural networks, deep learning, TensorFlow, knowledge graphs, and more.

The machine learning cold-start problem is the focus of my first session. It will explore examples of meta-learning and optimization when there is very little initial knowledge about where to start in model hyperparameter space. This is a frequent challenge in data science applications. We encounter it either when there is too little labeled data to adequately train a supervised learning model or when our goal is to figure out what the data are saying to us (i.e., applying unsupervised learning to explore the data without the added baggage of our preconceptions about what we think they will reveal). We will review backpropagation and TensorFlow in this same context.
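One common way to cope with a cold start, shown here only as an illustrative sketch and not necessarily one of the session's examples, is to begin from random guesses, improve each guess iteratively, and keep whichever result scores best on an objective function. The code below applies that idea to one-dimensional K-means clustering with random restarts. The synthetic data, the function names `kmeans_1d` and `best_of_restarts`, and the parameter defaults are all assumptions made for this example.

```python
import random

def kmeans_1d(points, k, rng, n_iter=50):
    """Run Lloyd's algorithm in 1-D from a random start; return (centers, inertia)."""
    centers = rng.sample(points, k)  # cold start: a random initial guess
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # an empty cluster keeps its center
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    # Objective: sum of squared distances to the nearest center.
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return sorted(centers), inertia

def best_of_restarts(points, k, n_restarts=30, seed=0):
    """Try several random starts and keep the run with the lowest inertia."""
    rng = random.Random(seed)
    runs = (kmeans_1d(points, k, rng) for _ in range(n_restarts))
    return min(runs, key=lambda run: run[1])

# Synthetic, unlabeled data: three well-separated groups near 0, 5, and 10.
data_rng = random.Random(1)
data = [data_rng.gauss(m, 0.5) for m in (0.0, 5.0, 10.0) for _ in range(100)]

centers, inertia = best_of_restarts(data, k=3)
print([round(c, 2) for c in centers], round(inertia, 2))
```

A single random start can settle into a poor local optimum (for example, two centers splitting one group while one center covers the other two). Repeating the cold start many times and comparing the objective values is a cheap way to reduce that risk when there is no prior knowledge to guide the initial guess.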

My second training session will examine atypical applications of some typical machine learning algorithms. These will include predicting tropical storm intensification using retail market basket analysis and predicting solar storm impact on astronauts in space using customer journey mapping techniques. There will even be examples from Formula 1 racing and from the search for a cure for cancer. The most surprising example might be the one where a company achieved a 100,000% ROI on a data analytics investment to reduce customer churn, and they used perhaps the simplest algorithm in the known Universe.

Taking a novel look at the methods and algorithms that we use every day can lead to unexpected and surprising discoveries in data, and that should get us excited for each new day with data.
