
The Data Daily

Making sense of news — the knowledge graph way

How to combine Named Entity Linking with Wikipedia data enrichment to analyze Internet news
Feb 2 · 12 min read
A wealth of information is produced on the Internet every day. Understanding the news and other content-generating websites is becoming increasingly important for running a successful business. It can help you spot opportunities, generate new leads, or provide indicators about the economy. In this blog post, I want to show you how you can create a news monitoring data pipeline that combines Natural Language Processing and knowledge graph technologies.
The data pipeline consists of three parts. In the first part, we scrape articles from an Internet news provider. Next, we run the articles through an NLP pipeline and store the results in the form of a knowledge graph. In the last part of the data pipeline, we enrich our knowledge graph with information from the WikiData API. To demonstrate the benefits of using a knowledge graph to store the information from the data pipeline, we perform a simple network analysis and try to find insights.
Agenda
Graph Model
Internet news scraping
Named entity linking and WikiData enrichment
Network analysis
Graph Model
We use Neo4j to store our knowledge graph. If you want to follow along with this blog post, you need to download Neo4j and install both the APOC and Graph Data Science libraries. All the code is available on GitHub as well.
Graph schema. Image by author
Our graph data model consists of articles and their tags. Each article has many sections of text. Once we run the section text through the NLP pipeline, we extract and store mentioned entities back to our graph.
We start by defining unique constraints for our graph.
Uniqueness constraints are used to ensure data integrity as well as to optimize Cypher query performance.
CREATE CONSTRAINT IF NOT EXISTS ON (a:Article) ASSERT a.url IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS ON (e:Entity) ASSERT e.wikiDataItemId IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS ON (t:Tag) ASSERT t.name IS UNIQUE;
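If you want to verify that the constraints were created, you can, for example, list them with the built-in procedure:
CALL db.constraints();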
Internet news scraping
Next, we scrape the CNET news portal. I have chosen the CNET portal because it has the most consistent HTML structure, making it easier to demonstrate the data pipeline concept without focusing on the scraping element. We use the apoc.load.html procedure for the HTML scraping. It uses jsoup under the hood. Find more information in the documentation.
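As a quick illustration of what the procedure returns (the URL and selector here are only for illustration, not part of the pipeline), each matched element comes back as a map with its text and an attributes map:
// illustrative only: fetch a few anchor elements and inspect their text and href
CALL apoc.load.html("https://www.cnet.com/news/", {links:"a"}) YIELD value
UNWIND value.links[..3] as element
RETURN element.text as text, element.attributes.href as href;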
First, we iterate over popular topics and store the link of the last dozen of articles for each topic in Neo4j.
{topics:"div.tag-listing > ul > li > a"}) YIELD value
UNWIND value.topics as topic
WITH " https://www.cnet.com " + topic.attributes.href as link
CALL apoc.load.html(link, {article:"div.row.asset > div > a"}) YIELD value
UNWIND value.article as article
WITH distinct " https://www.cnet.com " + article.attributes.href as article_link
MERGE (a:Article{url:article_link});
Now that we have the links to the articles, we can scrape their content as well as their tags and publishing date. We store the results according to the graph schema we defined in the previous section.
MATCH (a:Article)
CALL apoc.load.html(a.url,
  {date:"time", title:"h1.speakableText", text:"div.article-main-body > p", tags: "div.tagList > a"}) YIELD value
SET a.datetime = datetime(value.date[0].attributes.datetime)
FOREACH (_ IN CASE WHEN value.title[0].text IS NOT NULL THEN [true] ELSE [] END |
  CREATE (a)-[:HAS_TITLE]->(:Section{text:value.title[0].text})
)
FOREACH (t IN value.tags |
  MERGE (tag:Tag{name:t.text}) MERGE (a)-[:HAS_TAG]->(tag)
)
WITH a, value.text as texts
UNWIND texts as row
WITH a, row.text as text
WHERE text IS NOT NULL
CREATE (a)-[:HAS_SECTION]->(:Section{text:text});
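As a quick spot-check of the stored structure, we can, for example, list a few articles together with their title and the number of text sections:
// sample a few articles and count their text sections
MATCH (a:Article)-[:HAS_TITLE]->(title)
OPTIONAL MATCH (a)-[:HAS_SECTION]->(s)
RETURN a.url AS url, title.text AS title, count(s) AS sections
LIMIT 5;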
I did not want to complicate the Cypher query that stores the article results any further, so we perform a minor cleanup of the tags before we continue.
MATCH (n:Tag)
WHERE n.name CONTAINS "Notification"
DETACH DELETE n;
Let’s evaluate our scraping process and look at how many of the articles have been successfully scraped.
MATCH (a:Article)
RETURN exists((a)-[:HAS_SECTION]->()) as scraped_articles,
count(*) as count
In my case, I have successfully collected the information for 245 articles. Unless you have a time machine, you won’t be able to recreate this analysis identically. I have scraped the website on the 30th of January 2021, and you will probably do it later. I have prepared most of the analysis queries generically, so they work regardless of the date you choose to scrape the news.
Let’s also examine the most frequent tags of the articles.
MATCH (n:Tag)
RETURN n.name as tag, size((n)<-[:HAS_TAG]-()) as articles
ORDER BY articles DESC
LIMIT 10
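Before the network analysis, each section is run through the NLP entity-linking step and the extracted entities are enriched with the WikiData API. A minimal sketch of what such an enrichment step could look like with apoc.load.json is shown below; the description property and the exact API parameters are illustrative assumptions, not necessarily the calls used in the pipeline:
// sketch: fetch the English WikiData description for each entity
// (the `description` property name is an assumption for illustration)
MATCH (e:Entity)
WITH e, "https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&props=descriptions&languages=en&ids=" + e.wikiDataItemId AS url
CALL apoc.load.json(url) YIELD value
SET e.description = value.entities[e.wikiDataItemId].descriptions.en.value;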
To see which person or business entities are mentioned most often in articles tagged "Stock Market", we count the entities that appear in their sections:
MATCH (t:Tag)<-[:HAS_TAG]-(a:Article)-[:HAS_SECTION]->(s:Section)-[:MENTIONS]->(entity)
WHERE t.name = "Stock Market" AND (entity:Person OR entity:Business)
RETURN entity.title as entity, count(*) as mentions
ORDER BY mentions DESC
Results
Image by author
Ok, so GameStop is huge this weekend with more than 40 mentions. Very far behind are Jim Cramer, Elon Musk, and Alexandria Ocasio-Cortez. Let’s try to understand why GameStop is so huge by looking at the co-occurring entities.
MATCH (b:Business{title:"GameStop"})(other_entity)
RETURN other_entity.title as co_occurent_entity, count(*) as mentions
ORDER BY mentions DESC
Results
Image by author
The entities most frequently mentioned in the same section as GameStop are Stock, Reddit, and US dollar. If you look at the news, you will see that the results make sense. I would venture a guess that AMC (TV channel) was wrongly identified and should probably be the AMC Theaters company. There will always be some mistakes in the NLP process. We can filter the results a bit and look for the person or business entities that co-occur most often with GameStop.
MATCH (b:Business{title:"GameStop"})(other_entity:Person)
RETURN other_entity.title as co_occurent_entity, count(*) as mentions
ORDER BY mentions DESC
Results
Image by author
Alexandria Ocasio-Cortez (AOC) and Elon Musk each appear in three sections with GameStop. Let's examine the text where AOC co-occurs with GameStop.
MATCH (b:Business{title:"GameStop"})(p:Person{title:"Alexandria Ocasio-Cortez"})
RETURN section.text as text
Image by author
Graph Data Science
So far, we have only done a couple of aggregations using the Cypher query language. Since we are using a knowledge graph to store our information, let's execute some graph algorithms on it. The Neo4j Graph Data Science library is a plugin for Neo4j that currently has more than 50 graph algorithms available, ranging from community detection and centrality to node embedding and graph neural networks.
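If you want to browse what is available in your installed version, the library ships a listing procedure; for example:
// list a few of the available Graph Data Science procedures
CALL gds.list() YIELD name, description
RETURN name, description
LIMIT 10;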
We have already inspected some co-occurring entities. Next, we infer a co-occurrence network of persons within our knowledge graph. This process translates indirect relationships, where two entities are mentioned in the same section, into direct relationships between those two entities. The following diagram might help you understand the process.
Image by author
The Cypher query for inferring the person co-occurrence network is:
MATCH (s:Person)<-[:MENTIONS]-(:Section)-[:MENTIONS]->(t:Person)
WITH s,t, count(*) as weight
MERGE (s)-[c:CO_OCCURENCE]-(t)
SET c.weight = weight
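To sanity-check the inferred network, we can, for example, look at the strongest co-occurrence pairs:
// inspect the heaviest pairs; id(s) < id(t) avoids listing each pair twice
MATCH (s:Person)-[c:CO_OCCURENCE]-(t:Person)
WHERE id(s) < id(t)
RETURN s.title AS person1, t.title AS person2, c.weight AS weight
ORDER BY weight DESC
LIMIT 5;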
The first graph algorithm we use is the Weakly Connected Components algorithm. It is used to identify disconnected components or islands within the network.
CALL gds.wcc.write({
  nodeProjection: 'Person',
  relationshipProjection: 'CO_OCCURENCE',
  writeProperty: 'wcc'
})
YIELD componentCount, componentDistribution
Results
Image by author
The algorithm found 134 disconnected components within our graph. The p50 value is the 50th percentile of the component size. Most of the components consist of a single node, which implies that they don't have any CO_OCCURENCE relationships. The largest island of nodes consists of 30 members. We mark its members with a secondary label.
MATCH (p:Person)
WITH p.wcc as wcc, collect(p) as members
ORDER BY size(members) DESC LIMIT 1
UNWIND members as member
SET member:LargestWCC
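For example, a quick count confirms how many people received the secondary label:
MATCH (p:LargestWCC)
RETURN count(*) AS members;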
We further analyze the largest component by examining its community structure and trying to find the most central nodes. When you plan to run multiple algorithms on the same projected graph, it is better to use a named graph. The relationships in the co-occurrence network are treated as undirected.
CALL gds.graph.create('person-cooccurence', 'LargestWCC',
  {CO_OCCURENCE: {orientation: 'UNDIRECTED', properties: ['weight']}})
First, we run the PageRank algorithm, which helps us identify the most central nodes.
CALL gds.pageRank.write('person-cooccurence', {relationshipWeightProperty:'weight', writeProperty:'pagerank'})
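To inspect the output, we can, for example, list the most central people by their PageRank score:
MATCH (p:LargestWCC)
RETURN p.title AS person, p.pagerank AS score
ORDER BY score DESC
LIMIT 5;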
Next, we run the Louvain algorithm, which is a community detection algorithm.
CALL gds.louvain.write('person-cooccurence', {relationshipWeightProperty:'weight', writeProperty:'louvain'})
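Similarly, we could, for example, peek at the detected communities and a few of their members:
MATCH (p:LargestWCC)
RETURN p.louvain AS community, count(*) AS size, collect(p.title)[..5] AS sample_members
ORDER BY size DESC;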
Some people say that a picture is worth a thousand words. When you are dealing with smaller networks, it makes sense to create a network visualization of the results. The following visualization was created using Neo4j Bloom.
Node color represents communities and node size represents the PageRank score. Image by author
Conclusion
I really love how NLP and knowledge graphs are a perfect match. Hopefully, I have given you some ideas and pointers on how you can implement your own data pipeline and store the results in the form of a knowledge graph. Let me know what you think!
As always, the code is available on GitHub .