The Data Daily

Why you should contribute to open-source as a data scientist

This was probably the strongest reason for me. I wanted to understand scikit-learn from a developer's perspective, not only a user's. I am also currently studying applied statistics, and I wanted to connect the dots between what I am learning in class and how those concepts are implemented in this Python package. I had used the preprocessing module before, but contributing to its documentation made me relearn scaling techniques I had previously been exposed to. ‘Looking under the hood’ also helped me appreciate the rationale behind the code.
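To give a flavour of what relearning scaling techniques involves, here is a minimal pure-Python sketch of the arithmetic behind two common ones: standardization and min-max scaling. It is not scikit-learn's implementation. The function names `standardize` and `min_max_scale` and the sample data are hypothetical, chosen only for illustration, and the sketch assumes the population standard deviation, which is what the `StandardScaler` documentation describes.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Center values to zero mean and scale to unit variance.

    Assumption: uses the population standard deviation (divide by n).
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant feature: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample data, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

print(standardize(heights_cm))
print(min_max_scale(heights_cm))
```

Writing the formulas out by hand like this is a useful way to check your classroom understanding against what the library's documentation says it computes.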

For four hours, we collaborated virtually with a pair programming partner. A lot of data science work can be done in isolation, but hearing about the problems a fellow collaborator is facing helps you avoid the same mistakes. You can also unblock each other by working on the same problem. For instance, thanks to my programming partner, I was able to figure out which parts of the Microsoft Visual Studio Build Tools to install, which saved time (you need a C++ compiler to build scikit-learn from source in your local environment). It also felt good for both of us, who had never collaborated on a project on GitHub before, to have our pull requests merged into the main repository.

One of the cool things about open-source development is that it allows you to be both developer and user at the same time. This means that as you grow in knowledge regarding use of the tool, you are able to

The videos below were really helpful in guiding me through setting up the development environment and making a pull request:

The open-source community is a very welcoming space where you can ask ‘stupid’ questions and receive help. As a beginner in the data space, even after learning languages such as Python and R, you may feel intimidated about taking part in a hackathon or, in this case, a scikit-learn sprint, where you have only four hours to come up with a solution and submit a pull request. I would like to reassure you, dear reader, that you will have a community of mentors and representatives from the core development team, as well as your pair programming partner, to help you find relevant issues, change the code, run the tests, create a pull request and have that pull request reviewed. I was glad to see that every issue I worked on with my pair programming partner received encouragement and helpful critiques of our work.