The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation

Read original article here

As machine learning (ML) research moves toward large-scale models capable of numerous downstream tasks, a shared understanding of a dataset’s origin, development, intent, and evolution becomes increasingly important for the responsible and informed development of ML models. However, knowledge about datasets, including use and implementations, is often distributed across teams, individuals, and even time. Earlier this year at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), we published Data Cards, a dataset documentation framework aimed at increasing transparency across dataset lifecycles. Data Cards are transparency artifacts that provide structured summaries of ML datasets with explanations of processes and rationale that shape the data and describe how the data may be used to train or evaluate models. At minimum, Data Cards include the following: (1) upstream sources, (2) data collection and annotation methods, (3) training and evaluation methods, (4) intended use, and (5) decisions affecting model performance.

In practice, two critical factors determine the success of a transparency artifact, the ability to identify the information decision-makers use and the establishment of processes and guidance needed to acquire that information. We started to explore this idea in our paper with three “scaffolding” frameworks designed to adapt Data Cards to a variety of datasets and organizational contexts. These frameworks helped us create boundary infrastructures, which are the processes and engagement models that complement technical and functional infrastructure necessary to communicate information between communities of practice. Boundary infrastructures enable dataset stakeholders to find common ground used to provide diverse input into decisions for the creation, documentation, and use of datasets.

Today, we introduce the Data Cards Playbook, a self-guided toolkit for a variety of teams to navigate transparency challenges with their ML datasets. The Playbook applies a human-centered design approach to documentation — from planning a transparency strategy and defining the audience to writing reader-centric summaries of complex datasets — to ensure that the usability and utility of the documented datasets are well understood. We’ve created participatory activities to navigate typical obstacles in setting up a dataset transparency effort, frameworks that can scale data transparency to new data types, and guidance that researchers, product teams and companies can use to produce Data Cards that reflect their organizational principles.

We created the Playbook using a multi-pronged approach that included surveys, artifact analysis, interviews, and workshops. We studied what Googlers wanted to know about datasets and models, and how they used that information in their day-to-day work. Over the past two years, we deployed templates for transparency artifacts used by fifteen teams at Google, and when bottlenecks arose, we partnered with these teams to determine appropriate workarounds. We then created over twenty Data Cards that describe image, language, tabular, video, audio, and relational datasets in production settings, some of which are now available on GitHub. This multi-faceted approach provided insights into the documentation workflows, collaborative information-gathering practices, information requests from downstream stakeholders, and review and assessment practices for each Google team.

Moreover, we spoke with design, policy, and technology experts across the industry and academia to get their unique feedback on the Data Cards we created. We also incorporated our learnings from a series of workshops at ACM FAccT in 2021. Within Google, we evaluated the effectiveness and scalability of our solutions with ML researchers, data scientists, engineers, AI ethics reviewers, product managers, and leadership. In the Data Cards Playbook, we’ve translated successful approaches into repeatable practices that can easily be adapted to unique team needs.

The Data Cards Playbook is modeled after sprints and co-design practices, so cross-functional teams and their stakeholders can work together to define transparency with an eye for real-world problems they experience when creating dataset documentation and governance solutions. The thirty-three available Activities invite broad, critical perspectives from a wide variety of stakeholders, so Data Cards can be useful for decisions across the dataset lifecycle. We partnered with researchers from the Responsible AI team at Google to create activities that can reflect considerations of fairness and accountability. For example, we've adapted Evaluation Gaps in ML practices into a worksheet for more complete dataset documentation.

We’ve formed Transparency Patterns with evidence-based guidance to help anticipate challenges faced when producing transparent documentation, offer best practices that improve transparency, and make Data Cards useful for readers from different backgrounds. The challenges and their workarounds are based on data and insights from Googlers, industry experts, and academic research.

The Playbook also includes Foundations, which are scalable concepts and frameworks that explore fundamental aspects of transparency as new contexts of data modalities and ML arise. Each Foundation supports different product development stages and includes key takeaways, actions for teams, and handy resources.

The Playbook is organized into four modules: (1) Ask, (2) Inspect, (3) Answer, and (3) Audit. Each module contains a growing compendium of materials teams can use within their workflows to tackle transparency challenges that frequently co-occur. Since Data Cards were created with scalability and extensibility in mind, modules leverage divergence-converge thinking that teams may already use, so documentation isn’t an afterthought. The Ask and Inspect modules help create and evaluate Data Card templates for organizational needs and principles. The Answer and Audit modules help data teams complete the templates and evaluate the resulting Data Cards.

In Ask, teams define transparency and optimize their dataset documentation for cross-functional decision-making. Participatory activities create opportunities for Data Card readers to have a say in what constitutes transparency in the dataset’s documentation. These address specific challenges and are rated for different intensities and durations so teams can mix-and-match activities around their needs.

The Inspect module contains activities to identify gaps and opportunities in dataset transparency and processes from user-centric and dataset-centric perspectives. It supports teams in refining, validating, and operationalizing Data Card templates across an organization so readers can arrive at reasonable conclusions about the datasets described.

The Answer module contains transparency patterns and dataset-exploration activities to answer challenging and ambiguous questions. Topics covered include preparing for transparency, writing reader-centric summaries in documentation, unpacking the usability and utility of datasets, and maintaining a Data Card over time.

The Audit module helps data teams and organizations set up processes to evaluate completed Data Cards before they are published. It also contains guidance to measure and track how a transparency effort for multiple datasets scales within organizations.

A data operations team at Google used an early version of the Lenses and Scopes Activities from the Ask modules to create a customized Data Card template. Interestingly, we saw them use this template across their workflow till datasets were handed off. They used Data Cards to take dataset requests from research teams, tracked the various processes to create the datasets, collected metadata from vendors responsible for annotations, and managed approvals. Their experiences of iterating with experts and managing updates are reflected in our Transparency Patterns.

Another data governance group used a more advanced version of the activities to interview stakeholders for their ML health-related initiative. Using these descriptions, they identified stakeholders to co-create their Data Card schema. Voting on Lenses was used to rule out typical documentation questions, and identify atypical documentation needs specific to their data type, and important for decisions frequently made by ML leadership and tactical roles within their team. These questions were then used to customize existing metadata schemas in their data repositories.

We present the Data Cards Playbook, a continuous and contextual approach to dataset transparency that deliberately considers all relevant materials and contexts. With this, we hope to establish and promote practice-oriented foundations for transparency to pave the path for researchers to develop ML systems and datasets that are responsible and benefit society.

In addition to the four Playbook modules described, we’re also open-sourcing a card builder, which generates interactive Data Cards from a Markdown file. You can see the builder in action in the GEM Benchmark project’s Data Cards. The Data Cards created were a result of activities from this Playbook, in which the GEM team identified improvements across all dimensions, and created an interactive collection tool designed around scopes.

We acknowledge that this is not a comprehensive solution for fairness, accountability, or transparency in itself. We’ll continue to improve the Playbook using lessons learned. We hope the Data Cards Playbook can become a robust platform for collaboratively advancing transparency research, and invite you to make this your own.

This work was done in collaboration with Reena Jana, Vivian Tsai, and Oddur Kjartansson. We want to thank Donald Gonzalez, Dan Nanas, Parker Barnes, Laura Rosenstein, Diana Akrong, Monica Caraway, Ding Wang, Danielle Smalls, Aybuke Turker, Emily Brouillet, Andrew Fuchs, Sebastian Gehrmann, Cassie Kozyrkov, Alex Siegman, and Anthony Keene for their immense contributions; and Meg Mitchell and Timnit Gebru for championing this work.

We also want to thank Adam Boulanger, Lauren Wilcox, Roxanne Pinto, Parker Barnes, and Ayça Çakmakli for their feedback; Tulsee Doshi, Dan Liebling, Meredith Morris, Lucas Dixon, Fernanda Viegas, Jen Gennai, and Marian Croak for their support. This work would not have been possible without our workshop and study participants, and numerous partners, whose insights and experiences have shaped this Playbook.

Images Powered by Shutterstock

The Data Daily

The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation