Data science intro for math/phys background

After posting What I do or: science to data science, I got a lot of emails asking how to make this transition.

In this post I try to summarize my advice. I don't intend to write a complete walkthrough, but to provide a starting point, with links to further materials. I target it at people with an academic, quantitative background (e.g. physics, mathematics, statistics), regardless of whether they are undergraduate students, PhDs or after a few postdocs. Some points may be valid for other backgrounds (but then - use it at your own risk).

Here and everywhere else: please don't take the approach of "learn book[s], then play" - start with playing!

All projects required me to learn something new - be it a library, a machine learning model or a software tool.

Analyzing real - and often dirty - data using a mixture of programming and statistics. Or, as Josh Wills put it:

From my perspective, the whole process looks like this:

And everything needs to be done in a reproducible way - so others can interact with your code, or even run it on a server. Depending on the job, there may be more emphasis on one part or the other. Or even look at this tweet - while humorous, it shows a balanced list of typical skills and activities of a data scientist:

If you want to learn more about what data science is, look at the following links:

When you have an academic title, no-one will question your intelligence. But they are justified in questioning your practical skills. From my experience, you need to fulfill two requirements:

Most data science tasks are simple, and once you are able to use R or Python you can start working, gradually increasing your knowledge and experience. That is, after a few months you should be ready to start an entry-level job.

Initially, I was afraid that my lack of 10+ years of experience with C++ and Java would be a problem. How could I compete with serious software engineers who did their computer science major? But it turned out that most of my commercial projects are for IT companies - they have wonderful programmers but often no-one proficient at dealing with real data. So (from Academia to Industry, linked below):

In academia, you are allowed to cherry-pick an artificial problem and work on it for 2 years. The result needs to be novel, and you need to research previous and similar solutions. The solution needs to be perfect, even if not on time.

In industry, you should solve a given problem end-to-end. Things need to work, and there is little difference if it is based on an academic paper, usage of an existing library, your own code or an impromptu hack. The solution needs to be on time, even if just good enough and based on shady and poorly understood assumptions.

So, contrary to its name, it's rarely science. That is, in data science the emphasis is on practical results (like in engineering) - not the proofs, mathematical purity or rigor characteristic of academic science.

In the software industry, a resume plays a different role than a CV in academia. Rather than being a complete record of all positions, awards and publications, it is a short (typically 1 page) summary of your main skills and most important positions/accomplishments. It is used to screen candidates, not as the final judgement. To see the difference, compare and contrast my data science resume with my academic CV.

Applying for a job involves being asked technical questions - on the phone or over Skype. For software engineering, it involves both conceptual questions and whiteboard coding; for data science it may vary. In any case, take a look at:

If you need to learn basic algorithms and data structures, I recommend:

If you get no technical questions, it may be a red flag. If you get only software engineering questions, it may be a sign that they want to hire a programmer, not a data scientist (no matter what their job posting says); and given your background, you want to be a Type A data scientist (i.e. more a statistician than a regular programmer), according to this taxonomy.
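To give a flavour of what "basic algorithms and data structures" means in such interviews, here is a classic whiteboard exercise - binary search - as a Python sketch (my own illustration, not taken from any particular resource):

```python
def binary_search(sorted_list, target):
    """Return the index of target in sorted_list, or -1 if absent.

    Runs in O(log n) by halving the search interval at each step -
    a classic whiteboard interview question.
    """
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1

print(binary_search([1, 3, 5, 7, 9], 7))  # 3
print(binary_search([1, 3, 5, 7, 9], 4))  # -1
```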

When they consider hiring you, this piece is crucial:

Most likely, practical programming is the main skill you are missing. For general data science, the standard tools are Python and R. If you already know some other languages, that will help - still, learn one of the above.

tl;dr: both are good choices. Pick the one you prefer, for any reason; two really good reasons are:

There are use cases where one is better than the other, but for the majority of tasks both are fine. And (some may disagree) they are tools, not religions - there is no need to fight, and no need to use exclusively one.

I won't point to general tutorials - there are tons of them, personal preferences vary (MOOCs, interactive courses, websites, textbooks, ...), and I try to link only to things I recommend myself. When I provide links, they are usually web materials rather than classical books - and that is for a reason:

R is a tool for statistics turned into a language. The standard way of using it is via RStudio (though you can use Jupyter). Be sure to learn the basics of dplyr and ggplot2 (I almost always load them by default; especially dplyr, which makes operations on dataframes much easier, faster and more readable). Then everything else depends on the problems you are solving.

If you go the R way, at least:

Python is a much better general-purpose language (with the pros and cons of not being statistics-oriented).

For Python, I would suggest installing it (Python 3) through Anaconda and using Jupyter Notebook. The main packages are NumPy and SciPy (numerics), Pandas (like R dataframes), matplotlib (plots, though not as nice as ggplot2) and scikit-learn (for machine learning). Learn to be comfortable with Python (installing packages; loading, saving and transforming data; etc.) - the links below may help:
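To make this concrete before the links, here is a minimal sketch of a first session with this stack. It assumes a hypothetical CSV file data.csv with numeric columns x and y - substitute your own data:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# load data into a DataFrame and print quick summary statistics
df = pd.read_csv("data.csv")
print(df.describe())

# histogram of a single column
df["x"].hist()
plt.show()

# fit y ~ x using scikit-learn's uniform fit/predict interface
model = LinearRegression()
model.fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)
```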

You need some basic linear algebra (vectors, matrices, SVD, ...), calculus (exp, log, differentiation, integration, ...) and probability (independence, conditional probability, ...), but if you come from a natural science background, you already know that. It does not mean that you know everything - it just means that right now your mathematical skills are sufficient to be an employable data scientist, and that you are able to read about other methods, algorithms, etc.
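For instance, to check that these concepts translate into code, here is a tiny NumPy sketch of the SVD (nothing here is specific to any dataset - just a random matrix):

```python
import numpy as np

# SVD of a random 5x3 matrix: A = U @ diag(s) @ Vt
A = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# reconstruct A from the factors and verify numerically
A_reconstructed = U @ np.diag(s) @ Vt
print(np.allclose(A, A_reconstructed))  # True
```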

If you need to get a real dataset suitable for working with a given machine learning algorithm, there is a wonderful collection:

For statistics, screw learning various statistical distributions and tests by heart - you can easily look them up later. What is crucial is to understand the ideas of tests, cross-validation, bootstrapping and Bayesian inference. For the latter I recommend:
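Independently of those recommendations, here is a minimal sketch of cross-validation and bootstrapping in Python (using the classic iris dataset bundled with scikit-learn; the model and the statistic are arbitrary choices, just for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: estimate out-of-sample accuracy
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())

# bootstrap: resample with replacement to estimate the uncertainty
# of a statistic (here: the mean of the first feature)
rng = np.random.default_rng(0)
means = [rng.choice(X[:, 0], size=len(X), replace=True).mean()
         for _ in range(1000)]
print(np.percentile(means, [2.5, 97.5]))  # 95% bootstrap interval
```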

It's a fast-changing field - I am constantly tracking new libraries and updates to the ones I am using. I read a lot of academic papers - not just to stretch my intellectual muscles, but to solve particular problems.

Often you will need to install something, collaborate with others and do other such tasks. The crucial point is to know what is possible - especially so as not to reinvent the wheel.

Don't be afraid of learning new technologies (e.g. this data is in MongoDB, a NoSQL database - can you fetch it?) - often you can pick up the basics in a day. Most technologies, from the user's perspective, are easy (at least compared to algebraic geometry or quantum field theory).
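As an example of how little such a task may take, here is a minimal sketch of fetching data from MongoDB with pymongo - the host, database, collection and field names are hypothetical placeholders:

```python
from pymongo import MongoClient
import pandas as pd

# connect to a (hypothetical) local MongoDB instance
client = MongoClient("mongodb://localhost:27017/")
collection = client["mydatabase"]["events"]

# fetch matching documents and load them into a DataFrame
docs = list(collection.find({"status": "active"}).limit(1000))
df = pd.DataFrame(docs)
print(df.head())
```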

Some people recommend Kaggle as a starting point, but I would take that with a grain of salt. Don't get me wrong - it offers great resources, it provides feedback (otherwise it is hard to tell if your solution is any good) and some people find it really engaging. But if you start with the goal of winning, you will end up disappointed, with neither fame nor gold (prized competitions are not beginner-level). Moreover, beware that industrial problems rarely look like that (e.g. in all of my projects data cleaning was a big thing, and in none did a 5% score improvement matter). More on that:

Personally, what I enjoy most is working on data I care about and find genuinely interesting. It drives my motivation much more than any competition could. Also, this way it is complete data science - from asking questions and getting data to presenting the results in a meaningful form.

Making results public, including code, leaves great room for both feedback and building a showcase. It can be an IPython Notebook, or a website, or even just a plot (but then be sure to sign it - if it goes viral you want to get due recognition!). E.g. some of mine (see also Projects):

So, once again, be sure to get a GitHub account (for hosting code, notebooks and websites). Mine looks like this: github.com/stared. And don't be afraid to put up premature code: if it is not good yet, then no-one will notice (or care) anyway. Also, some people like writing about problems they have just learnt (e.g. How gzip uses Huffman coding - Julia Evans). If it is your thing - just do it (see my post on Jekyll)!

EDIT (Feb 2018): If you want to play with data, Kaggle Datasets are a wonderful place to start.

It's totally fine to learn things on your own. But doing a boot camp may be a huge boost - motivational, with access to tutors/experts, and with job opportunities. Here are some camps I am aware of:

If you are still a student, doing an internship may be a great way to get a lot of experience, feedback, confidence and contacts. I did mine during my PhD studies (in Europe it is not common to take a break, and a lot of people in academia dissuaded me, but I consider it a wonderful, life-changing experience).

To search for offers, try googling and visit some job listings (e.g. Indeed). Sometimes it makes sense to mail a company even if they don't use these exact words - especially smaller ones may be flexible. Some bigger tech companies (Facebook, Google, IBM, Microsoft) offer internships, see:

Aim at tech companies (to actually work in data science). In the [San Francisco] Bay Area (i.e. north of Silicon Valley) there are plenty of opportunities to learn data science - it should be your primary destination. To work in the US you need to get a J-1 visa (of course, only after they want you), but it's relatively easy (though it takes ~2-3 months).

Once on-site, start looking for various meetings and hackathons, especially via Meetup. Search for anything that may fit (data science, R communities, big data, etc.) and try to visit a lot of events. In the Bay Area it is an advantage to be "bold". So don't be afraid of asking about or for anything, striking up conversations with people, etc. - on average it will work out much better than taking a passive posture. See also:

And if you have a question, a good place to ask (and search for answers) is:

Since you have a maths background, you may be able to take a shortcut and get into advanced topics. Here is a random list of starting points I consider interesting:

EDIT (Feb 2018) - some of my new introductions:

This blog post started as emails, and went through a stage of being an extract of emails (shared on Google Docs). It took me way more time than I expected to present it in its current form.

Many people helped me with this post at its various stages (starting with asking me the questions!). I would especially like to thank Adam Goliński, Sebastian Jaszczur, Kasia Kulma and Robert Bogucki for their remarks on the final version.

I would love to hear your feedback! Did you find it useful? Or maybe you would recommend another learning strategy? Or additional links?

Or maybe your company needs data science training? I would be happy to provide it! See http://deepsense.ai for the menu (we are happy to make custom workshops too) - fill in the form or contact me directly!
