The Data Daily

5 Habits of Highly Effective Data Scientists

While COVID has negatively impacted many sectors, bringing the global economy to its knees, one sector has not only survived but thrived: Data Science. If anything, the current pandemic has only scaled up demand for data scientists, as the world’s leaders scramble to make sense of the exponentially expanding data streams generated by the pandemic. 

According to Gartner’s 2020 report on AI, 63% of the United States labor force either (i) has already transitioned or (ii) is actively transitioning toward a career in data science. However, the same report shows that only 5% of this cohort eventually lands their dream job in Data Science.

We interviewed top executives in Big Data, Machine Learning, Deep Learning, and Artificial General Intelligence, and distilled these 5 tips to guarantee success in Data Science.

On the internet, you are your brand: you are the sum and average of your Twitter, GitHub, and other social profiles.

If a tree falls in a forest and nobody hears it, was the tree ever really there? If you {learn something, think something, build something, eat something} and don’t share, all that value goes down the drain.

Anything you do, or pretend to do, has the potential to build your brand.

After you reach 1000 followers, haters are going to start to question why you have so much influence. Some people might say you’re not a “real researcher”. This is why you need to publish papers. Worried that your ideas aren’t original enough? Rest assured. Science proceeds by standing on the shoulders of giants. Search and replace on a technical term. A thesaurus can help here. For instance, convert “quantum gate” to “quantum door”. Or substitute “complicated Hilbert space” for “complex Hilbert space”. Post to arXiv. Rinse and repeat. 

Do the rows in your dataset correspond to real people? Yes? How do you know? Have you met these people? Get off your butt and work the phones. Find out who these people are. Cross reference usernames against other datasets. Hire a private investigator. Whatever it takes, find out who they are. Call them on the phone. The primary metric you should be optimizing is customer delight, not predictive accuracy.

Machine learning is in the throes of a reproducibility crisis. According to a Bloomberg industry report in June 2020, over 83% of machine learning results are entirely fabricated or artifacts due to multiple hypothesis testing, excessive hyperparameter optimization, or bugs in the code.
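The multiple hypothesis testing effect mentioned above is easy to see numerically. What follows is a minimal sketch, not anything taken from the report: it assumes the tests are independent and that every null hypothesis is true, so each p-value is uniform on [0, 1]. The function names and the choice of 20 tests are hypothetical, picked only for illustration.

```python
import random


def family_wise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests on pure noise."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_false_positive_rate(num_tests: int, alpha: float = 0.05,
                                 trials: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity.

    Assumption: under a true null hypothesis a p-value is uniform on [0, 1],
    so each test is modeled as a single uniform draw.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One "study" reports success if any of its tests looks significant.
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    k = 20  # hypothetical number of metrics or hyperparameter settings compared
    print("analytic :", family_wise_error_rate(k))
    print("simulated:", simulate_false_positive_rate(k))
    # Bonferroni correction: test each hypothesis at alpha / k instead.
    print("corrected:", family_wise_error_rate(k, alpha=0.05 / k))
```

With 20 tests at the usual 0.05 threshold, the chance of at least one spurious "significant" result is roughly 0.64, even though nothing real is being measured. Dividing the threshold by the number of tests (the Bonferroni correction) brings it back under 0.05.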

The only way to build trust with the scientific community is to commit to radical transparency. Post all of your code to GitHub, use human-readable variable names, and post all of your training runs to a public dashboard via Weights & Biases.

But that’s not enough. Even with public code, you’re still hiding all of the domain knowledge that goes into creating it in the first place. Why was a particular type of layer chosen? Which other ideas were tried but failed to make the cut for a publication? Which podcasts were in heavy rotation when inspiration struck?

Real commitment to transparency requires 24h livestreams of your entire life. Real science happens in real time.

You might be the best data scientist in the world, but how is your next employer supposed to know that? They don’t see the code that you write for yourself or for your boss. The best way to make your work known in the real world is to make an impact in open source.

With so many projects and commits, nobody has time to actually see what you contributed. Therefore, the best strategy is to single out high-impact projects and make a large number of fixes, such as removing whitespace or fixing typos. Make sure that each fix gets its own commit to maximize the visibility your work gets.

The open source community is famous for gatekeepers who decide what is or is not a valid contribution. However, the entire raison d’être of the internet is that it eliminates the need for gatekeepers. Don’t let anyone tell you what you can or cannot accomplish.

Data science has been around far longer than the phrase “data science”. But the field moves so fast that the Data Science of your forebears would hardly be recognizable to our generation. To keep up with this rapidly evolving field, you must evolve your skill set.

Centuries ago, being a great data scientist required mastery of the abacus. In the 20th century, classical statistics took over, demanding command of calculus and measure theory. Today, conquering data science requires virtuosity with package management. Top data scientists can work the full stack of package installation. From apt-get to pip to conda to CRAN, today’s data heroes can install any package on any machine at any time.
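Before installing any package on any machine, it helps to know which rungs of that stack the machine actually has. The following is a minimal sketch under stated assumptions: the executable names are guesses at how the four tools appear on a PATH, with `Rscript` standing in for CRAN (which is a package archive, not a command), and `available_installers` with its injectable `which` parameter is a hypothetical helper, not part of any library.

```python
import shutil
from typing import Callable, List, Optional

# Assumed executable names. Rscript stands in for CRAN, which is an archive,
# not a command you can run.
INSTALLERS = ("apt-get", "pip", "conda", "Rscript")


def available_installers(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the entries of INSTALLERS that resolve to an executable.

    `which` defaults to shutil.which, which returns the executable's path or
    None; it is a parameter only so the lookup can be swapped out in tests.
    """
    return [name for name in INSTALLERS if which(name) is not None]


if __name__ == "__main__":
    # Output depends entirely on the machine this runs on.
    print(available_installers())
```

The result preserves the order of `INSTALLERS`, so a caller could treat it as a preference list and use the first tool that is present.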

And tomorrow? To be a data scientist is to constantly challenge yourself to become a better thinker, writer, mathematician, illuminator and programmer. A data scientist never settles. A data scientist strives to learn every framework that makes it to the front page of Hacker News. A data scientist rejects dogma and reasons from first principles in a singular quest for truth.

Co-authored by Mark Saroufim and Zachary C. Lipton