The Data Daily

Be a more efficient data scientist today, master pandas with this guide

Python is open source. That's great, but it comes with an inherent problem of open source: many packages do (or try to do) the same thing. If you're new to Python, it's hard to know the best package for a specific task. You need someone with experience to tell you. So I'm telling you today: there's one package you absolutely need to learn for data science, and it's called pandas.

What's really interesting about pandas is that many other packages are hidden inside it. Pandas is a core package plus features from a variety of other packages. That's great, because you can work using only pandas.

pandas is like Excel in Python: it uses tables (called DataFrames) and applies transformations to the data. But it can do a lot more.
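To make the Excel comparison concrete, here is a minimal sketch of a DataFrame and one column-wise transformation. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column, like a spreadsheet header.
df = pd.DataFrame({
    "product": ["apple", "banana", "cherry"],
    "price": [1.20, 0.50, 3.00],
    "quantity": [10, 24, 5],
})

# A column-wise transformation, similar to filling a formula down a column.
df["total"] = df["price"] * df["quantity"]

print(df)
```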

If you’re already familiar with Python, you can go straight to the 3rd paragraph

The most common functions: read_csv, read_excel.

Some other great functions: read_clipboard, read_sql.
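As a rough illustration of reading data, the sketch below parses a small CSV with read_csv. The inline text stands in for a real file and the data is invented; the commented lines only hint at how the other readers are typically called, and their arguments are placeholders.

```python
import io

import pandas as pd

# Inline CSV text stands in for a file on disk (hypothetical data).
csv_text = "name,age\nAlice,30\nBob,25\n"

# read_csv accepts a file path or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))
print(people)

# The other readers follow the same pattern (placeholder arguments):
# pd.read_excel("data.xlsx")       # needs an Excel engine such as openpyxl
# pd.read_clipboard()              # parses the table currently on the clipboard
# pd.read_sql(query, connection)   # needs an open database connection
```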

I usually don't use the other export functions, like .to_excel, .to_json, or .to_pickle, since .to_csv does the job very well and CSV is the most common format for saving tables.
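A small sketch of writing a table with .to_csv, assuming an in-memory buffer in place of a real file path so the example has no side effects. The data is invented.

```python
import io

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [30, 25]})

# index=False leaves out the row index so the output holds only the columns.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_output = buffer.getvalue()

# Reading the text back gives the same rows and columns.
restored = pd.read_csv(io.StringIO(csv_output))
print(csv_output)
```

With a real file, df.to_csv("my_table.csv", index=False) works the same way.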

Plotting straight from a DataFrame is made possible by the matplotlib package. As we said in the intro, it's usable directly in pandas.
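Here is one way the matplotlib integration might be used, assuming matplotlib is installed. The sales figures are invented, and the Agg backend is chosen only so the sketch runs without a display.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd

# Hypothetical monthly figures.
sales = pd.DataFrame({"month": [1, 2, 3, 4], "revenue": [100, 120, 90, 150]})

# DataFrame.plot delegates to matplotlib and returns a matplotlib Axes.
ax = sales.plot(x="month", y="revenue", kind="line", title="Revenue per month")

# Render the figure to an in-memory PNG instead of a file on disk.
png_buffer = io.BytesIO()
ax.figure.savefig(png_buffer, format="png")
```

In a notebook, the call to sales.plot(...) alone is usually enough to display the chart.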

Alright, now you can do the things that were easily accessible in Excel. Let's dig into some amazing things that are not doable in Excel.

The .map() operation applies a function to each element of a column.
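A minimal sketch of .map() on a column, using made-up data. It shows the two most common forms: passing a function and passing a lookup dictionary.

```python
import pandas as pd

df = pd.DataFrame({"name": ["alice", "bob"], "grade": ["a", "b"]})

# Pass a function: it is called once per element of the column.
df["name_upper"] = df["name"].map(lambda s: s.upper())

# Pass a dict: each element is replaced by its matching value.
df["points"] = df["grade"].map({"a": 4, "b": 3})

print(df)
```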

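.applymap() applies a function to every cell in the table (DataFrame). Since pandas 2.1 the same operation is available as DataFrame.map(), and .applymap() is deprecated.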
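The sketch below applies one function to every cell, here formatting each number as a string. The hasattr check is an assumption added so the example runs on both older and newer pandas versions; the data is invented.

```python
import pandas as pd

scores = pd.DataFrame({"math": [12.345, 8.5], "physics": [15.0, 9.876]})


def one_decimal(value):
    """Format a single cell as text with one decimal place."""
    return f"{value:.1f}"


# pandas 2.1+ exposes this as DataFrame.map; older versions use applymap.
if hasattr(scores, "map"):
    formatted = scores.map(one_decimal)
else:
    formatted = scores.applymap(one_decimal)

print(formatted)
```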

When working with large datasets, pandas can take some time to run .map(), .apply(), and .applymap() operations. tqdm is a very useful package that shows a progress bar and estimates when these operations will finish (yes, I lied: I said we would use only pandas).
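A rough sketch of the tqdm integration, assuming tqdm is installed. Calling tqdm.pandas() registers progress-aware variants such as progress_apply, which behave like their plain counterparts while drawing a progress bar. The data and the squaring function are placeholders.

```python
import pandas as pd
from tqdm import tqdm

# Register progress_apply and related methods on pandas objects.
tqdm.pandas(desc="squaring")

numbers = pd.Series(range(1000))

# Same result as numbers.apply(...), with a progress bar and time estimate.
squared = numbers.progress_apply(lambda n: n * n)
```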

It's not quite simple at the beginning: you need to master the syntax first, but then you'll find yourself using this feature all the time.

The .iterrows() method yields two values on each iteration: the index of the row and the row itself (typically unpacked as i and row in a for loop).
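As an illustration of that pattern, here is a minimal loop over a small made-up table. Each row comes back as a Series, so its values are read by column name.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [30, 25]},
    index=["a", "b"],
)

lines = []
for i, row in df.iterrows():
    # i is the index label; row is a Series holding that row's values.
    lines.append(f"{i}: {row['name']} is {row['age']}")

print("\n".join(lines))
```

Looping like this is convenient for small tables; on large ones, column-wise operations are usually much faster.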

There are many other interesting pandas features I could have shown, but it’s already enough to understand why a data scientist cannot do without pandas. To sum up, pandas is

It is THE tool that helps data scientists quickly read and understand data and be more efficient in their role.

I hope you found this article useful, and if you did, consider giving at least 50 claps :)