
Notes from JupyterCon — How Project Jupyter is Enabling Large Scale Data Science

Increased enterprise adoption, new tools/extensions, new ways to improve data science.
Credit: O’Reilly Flickr
At the just concluded JupyterCon 2018 conference, it really was all about leveraging the power of Project Jupyter for collaborative, extensible, scalable, and reproducible data science. I got to attend tutorials, talks, keynotes, and a poster session (where I presented updates on my work on automated visualization) across three days. A TL;DR summary of some highlights I found interesting is below:
Jupyter has been embraced as a way to easily share code, analyses, and results in both academic and industry settings. Schools use it to deliver courses, while analysts, journalists, researchers, and large organizations use it as a scalable way to manage their data analysis workflows.
Enterprise use of Jupyter notebooks is changing. As opposed to ad hoc use (i.e., individual employees using notebooks), firms are beginning to explore enterprise-wide, enterprise-managed Jupyter deployments.
Enterprises are extending Jupyter to solve new problems/use cases and have begun to contribute their updates back as products and as open source software (OSS). Examples: IBM (Watson Studio, Jupyter Enterprise Gateway), Netflix (nteract, papermill, commuter, titus), Google (Colaboratory, Dataflow, Kaggle), Amazon (SageMaker), etc.
Access and security are becoming an important area of focus (who can access a notebook, which sections of the notebook, and which data/results within the notebook). As enterprises (with well-defined resource access policies) integrate Jupyter into their data science workflows, it is becoming important to define templates and data access rules that comply with existing policies.
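As a small, simplified illustration of where such access rules can live, a shared JupyterHub deployment lets administrators restrict who can log in at all through its configuration file. This is a minimal sketch assuming a JupyterHub setup, with hypothetical user names; finer-grained controls over notebook sections or data would sit in other layers (authenticators, storage permissions, etc.).

```python
# jupyterhub_config.py -- a minimal sketch of hub-level access control.
# User names below are hypothetical placeholders, not a recommended policy.
c = get_config()  # helper provided by JupyterHub when it loads this config file

# Only these users may log in (older JupyterHub releases call this `whitelist`).
c.Authenticator.allowed_users = {'analyst_a', 'analyst_b'}

# Admin users may manage other users and their servers.
c.Authenticator.admin_users = {'platform_admin'}
```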
A need for more mindful approaches to open source. It can be hard to maintain a large open source project like Jupyter, especially when the burden falls on a few people. We need to explore more ways to support the community and the maintenance process (paid maintainers, enterprise grants, etc.).
Keynotes —
A summary of some keynotes and insights is below:
Will Farr from the Department of Physics and Astronomy at Stony Brook University described how he and a team of over 1,000 astrophysicists (at LIGO) use Jupyter notebooks to share results on gravitational wave astronomy.
Paco Nathan, conference co-chair, highlighted trends and themes in the Jupyter community. He mentioned how Jupyter adoption in large enterprises is enabling rapid progress (students already know Jupyter and don't have to spend time learning proprietary tools), how Jupyter can help future-proof your infrastructure given the rapid changes in the industry (hardware, software, process), and how it might help address privacy issues.
Carol Willing, a leader within Project Jupyter, highlighted how Jupyter makes writing code more approachable to users and encourages experimentation. She also, however, cautioned about challenges that can limit the success of Project Jupyter: selfish users who take more than they give (maximizing personal benefit at the expense of the group), maintainer burnout, lack of recognition for maintainers, complacent bystanders, etc. She also discussed integrating a business model and grant funding as ways to address these issues.
Mark Hansen from the Columbia Journalism School told a rather interesting story of how Jupyter notebooks have become "the tool" for helping journalists integrate data, code, and algorithms into digital journalism. He gave the specific case of a student project that ran experiments to identify the impact of fake Twitter accounts on the spread of information (hint: pretty interesting article). Even more interesting, the results of the experiment led to a large-scale purge of millions of Twitter accounts and a published digital/print New York Times article.
Tracy Teal from the Carpentries discussed their work on helping users build data skills by running hundreds of workshops around the world. She argued that people are the key to converting data into insights and motivated a focus on scaling the number of people with data skills. She shared an interesting quote:
If you want to go fast, go alone; if you want to go far, go together.
Ryan Abernathey, assistant professor of Earth and environmental science at Columbia University, made a case for why we should store data in the cloud. He mentioned that complex problems usually require large datasets (terabytes to petabytes), which can sometimes be locked behind slow FTP portals or other on-prem storage systems. Researchers then have to create dark repositories (local copies of the data), which may be a subset of the original data and can constrain their ability to explore diverse research questions. He advocated for the use of the cloud (and related tools), including Zero to JupyterHub with Kubernetes, parallel computing (Dask, Spark), domain software (xarray, astropy), and cloud-optimized data formats. He mentioned NASA has begun progress in this direction by committing to make petabytes of data available via cloud endpoints over the next 5 years.
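As a small illustration of the parallel computing piece, here is a minimal Dask sketch. The array below is synthetic stand-in data; in the cloud-native workflows he described, the chunks would typically be backed by cloud object storage (e.g., Zarr datasets opened through xarray) rather than generated in memory.

```python
# A tiny Dask sketch: build a large, chunked array lazily and compute in parallel.
import dask.array as da

# A synthetic 100,000 x 10,000 array, split into chunks that fit in memory.
x = da.random.random((100_000, 10_000), chunks=(10_000, 1_000))

# Nothing runs until .compute(); Dask then schedules the work chunk by chunk.
column_means = x.mean(axis=0).compute()
print(column_means.shape)  # (10000,)
```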
Michelle Ufford, Head of Data Engineering and Infrastructure at Netflix, highlighted reasons why Netflix is betting big on Jupyter notebooks. She mentioned that notebooks are the future, that they are useful beyond simply providing interactivity, and, importantly, that:
Notebooks bridge the chasm between technical and non-technical team members.
She discussed how Netflix is betting really big on notebooks and is migrating over 10,000 workflows to them. Netflix has built multiple use cases (and tools) around notebooks, including data access, notebook templates for different tasks, and even notebook scheduling! More on this in the Netflix Tech Blog here.
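Parameterized, scheduled notebook runs of this kind are typically built on papermill (one of the Netflix open source tools mentioned earlier), which executes a notebook with injected parameters and saves the executed copy. A minimal sketch, with hypothetical paths and parameter names:

```python
# Run a template notebook with parameters and save the executed copy -- the
# basic building block behind scheduled notebook runs.
import papermill as pm

pm.execute_notebook(
    'templates/daily_report.ipynb',        # template with a cell tagged "parameters"
    'runs/daily_report_2018-08-24.ipynb',  # executed output notebook
    parameters={'region': 'us-east-1', 'run_date': '2018-08-24'},
)
```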
Talks —
Brian Granger, conference co-chair, gave a talk on "The business case and best practices for leveraging open source". He discussed changes in usage (e.g., high traffic and interest from Asia), rewards of OSS (free 3–5 years of R&D, allowing companies to focus on their core business, supporting flexible customization), and risks of OSS (disruptive effects, slow turnaround, lack of enterprise support). He also suggested best practices for enterprises interested in engaging with OSS projects: take time to understand the real needs of the OSS project, and begin by helping review PRs (actual code writing makes up a small part of the OSS process). Things to avoid: don't hire an OSS project maintainer and assign their time to other projects; this damages the community.
Julia Lane, professor at the NYU Center for Urban Science and Progress, discussed how Jupyter notebooks are used in projects that involve sensitive data necessary for public policy. She described how her work on labour economics requires data from multiple agencies (e.g., data on firms from the IRS and data on employees from the Census Bureau). These contain confidential microdata (e.g., EINs, SSNs) and require high security standards. Furthermore, it is hard to implement the equivalent of a "data clearing house" where multiple agencies can share data; they may not be incentivized to do so. She suggests an approach where, instead of a clearing house, data collection efforts are organized around specific questions, full teams are set up (data analysis, network analysis, machine learning, text analysis, user behaviour, etc.), and appropriate agency staff are trained.
Panel | The current environment — Compliance, ethics, ML model interpretation, GDPR
Panelists explored several questions, including responses to GDPR. Julia Lane shared thoughts on how the GDPR is based on two problematic constructs: informed consent (she argues this is meaningless) and data anonymization (she argues data can be de-identified but not anonymized). She suggests we move more towards the "ethical use of data" (perhaps a Hippocratic oath equivalent for data scientists). Michelle Ufford from Netflix also mentioned that all data is biased, and that rather than asking "is my data biased?" we should ask "how is my data biased?". Others encouraged the design of tools that automate bias testing and the introduction of a model tester/breaker role: individuals who question, test, and break models (the equivalent of the QA engineer in traditional software development).
Learning Resources
Chakri Cherukuri from Bloomberg gave an interesting presentation on visualizations for machine learning algorithms. Notebooks here — https://github.com/ChakriCherukuri/jupytercon_2018
Bruno Goncalves from NYU delivered a tutorial on data visualization with Matplotlib. Notebooks here — https://github.com/bmtgoncalves/DataVisualization
It was interesting (and slightly unexpected) to see financial companies like Two Sigma and Capital One taking on leadership roles in OSS and contributing code/projects.
Conclusion
pronoun and expertise tags at JupyterCon 2018
The organization of the conference was excellent (did I mention the excellent hot meals served at lunch?), with thoughtful gestures to make attendees feel welcome and included: from the Pac-Man rule (leave space for one more person when you gather for conversations) to tags that facilitate communication and interaction among attendees (tags around interests/expertise such as "hire me", "I'm hiring", and "ML/AI", as well as preferred pronouns: he/she/they, etc.). I left this conference feeling inspired by the excellent work being done, the generosity of spirit of the Project Jupyter community members, and its impact in diverse areas. Going forward, I plan to do more with notebooks (sharing demos, sample code, etc.).
Many thanks to the conference co-chairs Paco Nathan, Fernando Perez, and Brian Granger; fantastic work! Thanks to the conference hosts — Project Jupyter, O'Reilly Media Inc., and the NumFOCUS Foundation. Thanks to Capital One for the generous scholarship that allowed me to attend this really interesting conference! Definitely looking forward to JupyterCon 2019!
