The Data Daily

Using AI to Extract a Knowledge Base of COVID-19 Mechanisms

The web of science related to COVID-19 is immense: scientists in fields ranging from medicine, genetics, microbiology, and zoology all the way to physics, mathematics, computer science, climatology, sociology, and macroeconomics are working to understand different angles of the pandemic and its effects. Can we leverage artificial intelligence to help researchers navigate the eclectic landscape of scientific literature around the disease, which keeps growing by the day?

To help accelerate the pace of discovery, we release our COVID-19 mechanism knowledge base (KB) and online search tool, containing diverse and structured information on causal relations, methods, objectives and activities — coming from any area.

To create our KB, we train AI models to extract information from over 200K scientific papers, old and new, with an approach we discuss in our recent paper. We built our tool for scientists to rapidly search and explore the web of COVID-19 science: not only biomedical phenomena such as mechanisms involved in viral activity or drugs and their effects, but also information on algorithms used for diagnosis, designs for safer air circulation, public policies for pandemic control, models of climatic effects on disease spread, and much more. Current biomedical knowledge bases contain important but limited information on entities such as genes and drugs; in contrast, we’ve designed our KB to have broad reach across all scientific disciplines.

In one important recent example, a group of 239 scientists called attention to the airborne transmissibility of the virus, based on interdisciplinary research spanning virology, aerosol physics, flow dynamics, exposure and epidemiology, medicine, and building engineering. In this scenario, a scientist can use our online search tool to discover, for instance, the use of ceiling-level exhausts for controlling airborne transmission, or optical methods for measuring viral particle size:

The same search also reveals computer simulations used to study droplets:

A researcher looking to find out about applications of AI (perhaps looking for AI solutions to their problem, or new opportunities to apply their AI method) can search for algorithms such as convolutional neural networks, with COVID-19 as an objective/target:

and retrieve a table of structured results, such as applications of CNN models to COVID-19 detection/testing, along with the original context:

Of course, more “conventional” biomedical mechanisms can be searched, such as the effects of Vitamin D on COVID-19:

Finally, for an example beyond STEM sciences, a researcher can quickly find a list of factors impacting society, such as school closures:

Importantly, our focus is not on finding and displaying papers, but on discovering full lists of structured, pinpointed mechanism relationships. Unlike in other search engines, this information can be targeted directly, which also helps scientists cut through the clutter and mitigate information overload by focusing their attention on the information they need.

In Homo Deus: A Brief History of Tomorrow by Yuval Noah Harari, the author refers to the vast web of interdisciplinary science governing the world:

By building a knowledge base with diverse mechanisms across fields, we aim to make progress toward connecting those dots, starting with one of the pressing challenges of our time — the COVID-19 pandemic.

We focus on the fundamental concept of mechanisms, which captures important knowledge across disciplines, including:

Although the concept seems intuitive, the exact definition of a mechanism is subject to debate in the philosophy of science, as discussed in detail in our paper. However, a simple dictionary definition reveals the generality of the concept:

In biomedicine, AI-based Information Extraction (IE) tools have been used to extract mentions of entities such as proteins or chemicals and their relations. Some of these relations correspond to our notion of mechanisms (e.g., chemical-protein regulation, or drug-drug interactions), but capture only a fraction of the full breadth and depth of mechanisms in the literature. Our unified view of mechanisms is designed to help generalize and scale the study of these important relations.

We train an IE model that automatically extracts mechanism information (functional relations) from scientific papers into a KB. Technically, we define mechanisms as relations between spans of text appearing in the literature (such as in paper abstracts). The spans we use are open and free-form, to strike a balance between expressivity and breadth across domains. We formulate two main types of relations: coarse-grained and fine-grained.

Coarse-grained relations are ordered pairs (tuples) of spans capturing mechanism patterns such as (method, goal), (cause, effect), and (agent, action). In the screenshot of our search interface shown before, we can see examples of these pairs, such as (ceiling-level exhausts, controlling airborne transmission).

Fine-grained relations are triples of the form (subject, predicate, object), where the predicate may indicate the type of mechanism, as in the following examples:

While more granular, these relations are also less general, because the natural language that scientific papers use to describe mechanisms often does not conform to this more rigid structure. Our paper discusses this trade-off, along with more details on the dataset and models.
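To make the two relation types concrete, here is a minimal sketch of how they might be represented in Python. The class names, field names, the `USED-FOR` predicate label, and the `to_coarse` helper are all assumptions made for illustration; they are not taken from our KB schema. Only the (ceiling-level exhausts, controlling airborne transmission) pair comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoarseRelation:
    """Ordered pair of free-form text spans, e.g. (method, goal)."""
    first: str   # the method, cause, or agent
    second: str  # the goal, effect, or action


@dataclass(frozen=True)
class FineRelation:
    """Triple (subject, predicate, object); the predicate names the mechanism type."""
    subject: str
    predicate: str
    object: str

    def to_coarse(self) -> CoarseRelation:
        # Assumption for illustration: dropping the predicate gives the
        # looser, more general pair form.
        return CoarseRelation(self.subject, self.object)


# Pair taken from the search-interface example in the text.
pair = CoarseRelation("ceiling-level exhausts", "controlling airborne transmission")

# Hypothetical triple; the predicate label "USED-FOR" is an assumed placeholder.
triple = FineRelation(
    "ceiling-level exhausts", "USED-FOR", "controlling airborne transmission"
)
```

In this sketch the pair form accepts any two spans, while the triple form additionally requires a predicate, which mirrors the trade-off between generality and granularity described above.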

In another example from our tool, a scientist can search for mechanisms referring to cardiovascular effects of COVID-19:

Among the many mechanism results, we discover a complication of COVID-19 associated with arterial disease — “thrombosis of both radial arteries”.

This result is semantically related to the search query of “cardiovascular disease”, even though the result and query do not share any keywords. The result is found using a language model fine-tuned for semantic similarity with the excellent sentence-transformers library, which represents both the query and the KB entries as vectors such that entries with similar meaning have vectors that are close to one another. For fast similarity-based search we use FAISS, a library for efficient similarity search over dense vectors.
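The idea behind this kind of search can be sketched in a few lines of plain Python. This is not our implementation: the hand-written three-dimensional vectors below stand in for embeddings that a sentence-transformers model would produce, and the brute-force scan stands in for a FAISS index. The function names `normalize` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(
    query: Sequence[float],
    entries: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return the k entries most similar to the query, by cosine similarity."""
    q = normalize(query)
    scored = []
    for text, vec in entries:
        v = normalize(vec)
        scored.append((text, sum(a * b for a, b in zip(q, v))))
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors made up for illustration; real embeddings come from a
# language model and have hundreds of dimensions.
kb = [
    ("thrombosis of both radial arteries", [0.9, 0.1, 0.0]),
    ("school closures", [0.0, 0.2, 0.9]),
    ("ceiling-level exhausts", [0.1, 0.9, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stands in for the embedding of "cardiovascular disease"

results = top_k(query_vec, kb, k=2)
```

With these toy vectors, the thrombosis entry ranks first even though it shares no keywords with the query, because ranking depends only on how close the vectors are. A linear scan like this grows slow as the KB grows, which is the reason for using a dedicated vector index.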

We can also further filter the retrieved mechanisms by context — for example, taking the above query for cardiovascular effects of COVID-19 and filtering for a context that explicitly mentions “patients”.
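As a rough sketch of what such a filter might look like, the snippet below keeps only results whose source context mentions a given term. The record layout, the field names `relation` and `context`, the placeholder sentences, and the use of case-insensitive substring matching are all assumptions made for illustration; the tool may match contexts differently.

```python
from typing import Dict, List


def filter_by_context(results: List[Dict[str, str]], term: str) -> List[Dict[str, str]]:
    """Keep results whose source context mentions the term (case-insensitive)."""
    needle = term.lower()
    return [r for r in results if needle in r["context"].lower()]


# Hypothetical records with placeholder context sentences, not real KB entries.
retrieved = [
    {"relation": "relation A", "context": "Placeholder sentence about Patients in a hospital."},
    {"relation": "relation B", "context": "Placeholder sentence about a simulation study."},
]

filtered = filter_by_context(retrieved, "patients")
```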

To assess our tool more quantitatively, we recruit annotators with backgrounds in computer science (AI), medicine, biology, and materials science. Annotators are given two types of tasks:

In both tasks, annotators view search results from our KB, with varying degrees of relevance. Overall, our results indicate that the relations are both accurately extracted and accurately retrieved:

Our hope is that our framework can support research on COVID-19, and boost knowledge discovery more broadly across the sciences.
