Statistical Algorithms for Success in Data Science Competition


Recently, predictive modeling platform Kaggle hosted the Big Data Combine competition to predict short-term changes in stock prices. The competition was hosted by BattleFin, a tournament platform dedicated to crowdsourcing investment analysis talent. Big Data Combine competitors were supplied news and sentiment data by RavenPack and were asked to use that data to build predictive models for forecasting the price changes. With these predictions in hand, traders and investors would have the information they need for improved risk management when making investment decisions.

Dr. Steve Donaho was the winner of the Big Data Combine competition. He has also won three other competitions hosted by Kaggle. In fact, Dr. Donaho’s outstanding performance in Kaggle competitions has earned him a current rank of 3rd out of 250,987 total competitors. At one point, Donaho was the top-ranked competitor on the entire Kaggle platform. This success speaks volumes about Donaho’s ingenuity, acumen, and agility in data science. In an exclusive interview for the Statistics Views website, Donaho discusses his interest in and success with data science and Kaggle competitions.

Over the last few years, I’ve found the GBM algorithm (Generalized Boosted Regression Models) in R to be very useful and applicable to a wide variety of problems. I used GBM for a 2nd place finish in the Allstate Purchase Prediction competition and again for a 3rd place finish in the Deloitte Insurance Churn Prediction competition. More recently, I’ve started to use the XGBoost (eXtreme Gradient Boosting) algorithm, which is similar in nature to GBM but is much faster and has some improved features. Most recently, I’ve also been intrigued by some of the online learning algorithms posted by the user tinrtgu and others in contests for Criteo, Tradeshift, and Avazu. For very large volumes of data, the online learning techniques give pretty good results quickly and without using a lot of memory.
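To make the online learning idea concrete, here is a minimal sketch of one common approach: logistic regression trained one example at a time, with feature hashing so that memory stays fixed no matter how much data streams through. This is an illustration under assumptions, not the code posted in those contests; the class name, field names, and toy data are all hypothetical.

```python
import math
import zlib


class OnlineLogisticRegression:
    """Logistic regression updated one example at a time (plain SGD)."""

    def __init__(self, num_buckets=2 ** 16, learning_rate=0.1):
        self.num_buckets = num_buckets
        self.learning_rate = learning_rate
        # Fixed-size weight table: memory does not grow with the data.
        self.weights = [0.0] * num_buckets

    def _indices(self, row):
        # Hashing trick: each "field=value" string maps to one weight slot.
        return [
            zlib.crc32(f"{field}={value}".encode("utf-8")) % self.num_buckets
            for field, value in row.items()
        ]

    def predict(self, row):
        z = sum(self.weights[i] for i in self._indices(row))
        z = max(min(z, 35.0), -35.0)  # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, row, label):
        # Gradient of log loss for a binary feature is (prediction - label).
        prediction = self.predict(row)
        for i in self._indices(row):
            self.weights[i] -= self.learning_rate * (prediction - label)
        return prediction


# Hypothetical click-through style stream: each row is seen once, then discarded.
stream = [
    ({"site": "news", "device": "mobile"}, 1),
    ({"site": "shop", "device": "desktop"}, 0),
] * 200

model = OnlineLogisticRegression()
for row, label in stream:
    model.update(row, label)

p_click = model.predict({"site": "news", "device": "mobile"})
p_no_click = model.predict({"site": "shop", "device": "desktop"})
```

Because each example is used for a single update and then dropped, the stream could just as well be read line by line from a file too large to fit in memory.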

I usually spend quite a bit of time at the beginning of a contest just sifting through the data. I make sure I get to know it before I apply any learning algorithms. This has sometimes given me a competitive edge. For example, in an Allstate competition, I found that certain combinations of products never occurred in certain U.S. states.
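One way to turn an observation like this into a post-pass is sketched below: record which combinations were ever seen in each state, then pick the model's highest-scoring candidate that is actually allowed there. The details of the real Allstate solution are not given here, so the function names, state codes, combination strings, and scores are all hypothetical.

```python
from collections import defaultdict


def build_seen_combinations(training_rows):
    """Collect, per state, every product combination observed in training data."""
    seen = defaultdict(set)
    for state, combination in training_rows:
        seen[state].add(combination)
    return seen


def best_allowed_combination(state, scored_candidates, seen):
    """Return the highest-scoring candidate that was observed in this state.

    Falls back to the overall top candidate when the state is unknown or
    none of the candidates were ever observed there.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    allowed = seen.get(state, set())
    for combination, _score in ranked:
        if combination in allowed:
            return combination
    return ranked[0][0]


# Hypothetical training data: (state, product combination) pairs.
training_rows = [("FL", "A1-B0"), ("FL", "A0-B0"), ("NY", "A1-B1")]
seen = build_seen_combinations(training_rows)

# Hypothetical model output: candidate combinations with scores.
candidates = [("A1-B1", 0.6), ("A1-B0", 0.3), ("A0-B0", 0.1)]

choice_fl = best_allowed_combination("FL", candidates, seen)
choice_ny = best_allowed_combination("NY", candidates, seen)
choice_unknown = best_allowed_combination("TX", candidates, seen)
```

The filter never changes the model itself; it only overrides predictions that the data says cannot happen, which keeps it easy to switch on and off when measuring its effect.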

Ruling out those combinations as a post-pass to our algorithm gave my partner and me an edge in the competition. Something else I do at the beginning of a competition is to try simple approaches first. I create what I call “improved baselines”: I choose a simple idea and tweak it in a few ways to see how much mileage I can get out of it.

I do this for a few reasons: 1) sometimes I find relatively simple solutions that perform pretty well (complex is not necessarily better); 2) in practice, I’ve found that customers prefer simple solutions that they are able to grasp; and 3) if a solution is doing well, I like to understand what is driving its success, and that is easier to do with simple solutions. If you jump straight to complex solutions, it is hard to know what is driving the success and whether all the complexity is necessary.
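As a rough illustration of the "improved baseline" idea, the sketch below starts from the simplest possible rule (always predict the most common label) and applies one small tweak (predict the most common label within each group, falling back to the overall rule for unseen groups). The specific baselines used in any competition are not described here, so the functions and toy data are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def fit_majority(labels):
    """Baseline: the single most common label overall."""
    return Counter(labels).most_common(1)[0][0]


def fit_group_majority(groups, labels):
    """Tweak: the most common label within each group."""
    counts = defaultdict(Counter)
    for group, label in zip(groups, labels):
        counts[group][label] += 1
    return {group: c.most_common(1)[0][0] for group, c in counts.items()}


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical training and holdout data.
train_groups = ["a", "a", "a", "b", "b", "b"]
train_labels = [1, 1, 0, 0, 0, 0]
holdout_groups = ["a", "b", "a", "c"]
holdout_labels = [1, 0, 1, 0]

overall = fit_majority(train_labels)
per_group = fit_group_majority(train_groups, train_labels)

baseline_preds = [overall for _ in holdout_groups]
# Groups never seen in training fall back to the overall majority.
improved_preds = [per_group.get(g, overall) for g in holdout_groups]

baseline_acc = accuracy(baseline_preds, holdout_labels)
improved_acc = accuracy(improved_preds, holdout_labels)
```

Scoring each tweak on held-out data shows how much of the eventual gain comes from simple structure in the data, which gives a reference point before any complex model is tried.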

I first heard about Kaggle from an article in the Wall Street Journal in 2011. The data science competitions sounded like fun. I had a one-week lull in my normal work, so I entered a contest with about one week left until it finished. I used the pseudonym BreakfastPirate when I signed up because I thought I might not be any good. It turned out I got 10th place in that first contest, and the thrill of placing well got me addicted to Kaggle contests.


First of all, it’s fun! I’m an unabashed data-lover. I love to get my hands on a new set of data and start digging through it and analyzing it. It is fun to learn about industries that I have not worked on before. These include retail sales, airline arrival times, soil composition in Africa, flu forecasting, click-through prediction, etc.

Second, it forces me to learn new techniques and new algorithms. I always sift through the solutions posted by winners, and I often learn clever, new approaches. Kaggle has definitely become more competitive even in the last 12 months. If I see that people are winning using an algorithm that I have not used before, I’m forced to learn about that algorithm in order to stay competitive. That is how I started using XGBoost.

Third, it is fun to be part of a community of data scientists where we are sharing ideas. Yes, it is a competition, but there is a lot of idea sharing that happens on the message boards, and it is fun to contribute to that when possible.

When I was in high school, the only career advice I got was, “You’re good at math. You should be an engineer.” So I went off to college to study to be an engineer. I knew I liked computers so I majored in Computer and Electrical Engineering. While working on my Bachelor’s degree, I found that I was more interested in the software than the hardware. So I went on to work on a Master’s degree and PhD in Computer Science.

About the time I was finishing my PhD, I came to the realization, “I don’t really like computers nearly as much as all the students around me. What I *really* like to do is analyze data; the computers are simply a handy tool for pursuing my passion for analysis.” So it took me all those years and all those degrees to figure out that my real underlying skill is *not* math. My real underlying skill is analysis, and I enjoy analyzing things.

Unfortunately, when I was in high school, analytical skills were not specifically identified, so no one was able to say, “You’ve got good analytical skills, and here is a set of career paths for people who enjoy analysis.” Hopefully schools do a better job these days of identifying skills and going beyond, “You’re good at math. You should be an engineer.” But just in case, maybe there are readers out there whose real passions are in analysis. Those readers should be told that math, computers, etc. are simply supporting skills and tools that are available to help them in their analysis endeavors. They need not be ends in themselves, but rather serve as means to an end.

Dr. Steve Donaho has 20 years of experience architecting solutions for discovering interesting patterns in large quantities of data. He has placed in the Top 10 in multiple Kaggle competitions across a wide variety of areas, including stock market sentiment analysis, insurance, name resolution, retail sales prediction, pharmaceutical sales prediction, and airline arrival times.

Prior to starting Donaho Analytics, he was Director of Research at Mantas (now part of Oracle Financial Services), a leader in delivering business intelligence to the financial services industry. At Mantas he was a driving force behind the creation of much of the company’s new analytics technology and was an inventor on four of its patents. He has published and spoken at multiple Knowledge Discovery in Databases (KDD) conferences on topics including algorithms for detecting fraud and insider trading. His areas of expertise include fraud detection, money laundering detection, financial markets, banking and brokerage, healthcare, telecommunications, and customer analytics.

This is a contribution that I originally made to the Statistics Views website in February 2015. If you’re interested in learning more about the practice of data science, or how you can learn to do it yourself, make sure to check out Data-Mania’s learning resources.
