The Data Daily

GitHub - cohere-ai/sandbox-topically: Topic modeling helpers using managed language models from Cohere. Name text clusters using large GPT models.

Topic modeling helpers using managed language models from Cohere. Name text clusters using large GPT models.
Nov 3, 2022
README.md
This project is part of Cohere Sandbox, Cohere's Experimental Open Source offering. This project provides a library, tooling, or demo making use of the Cohere Platform. You should expect (self-)documented, high quality code but be warned that this is EXPERIMENTAL. Therefore, also expect rough edges, non-backwards compatible changes, or potential changes in functionality as the library, tool, or demo evolves. Please consider referencing a specific git commit or version if depending upon the project in any mission-critical code as part of your own projects.

Please don't hesitate to raise issues or submit pull requests, and thanks for checking out this project!
Project maintained until at least: 2023-04-30
A picture is worth a thousand sentences
When you want to explore thousands or millions of texts (messages, emails, news headlines), topic modeling tools help you make sense of them rapidly and visually.
Topically
Topically is a [work-in-progress] suite of tools that help make sense of text collections (messages, articles, emails, news headlines) using large language models.
Topically's first feature is to name clusters of short texts based on their content. For example, here are news headlines from the machinelearning and investing subreddits, and the names suggested for them by topically:
Usage Example
Use Topically to name clusters in the course of topic modeling
    import topically

    app = topically.Topically('cohere_api_key')

    example_texts = [
        # Three headlines from the machine learning subreddit
        "[Project] From books to presentations in 10s with AR + ML",
        "[D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition",
        "[R] First Order Motion Model applied to animate paintings",
        # Three headlines from the investing subreddit
        "Robinhood and other brokers literally blocking purchase of $GME, $NOK, $BB, $AMC; allow sells",
        "United Airlines stock down over 5% premarket trading",
        "Bitcoin was nearly $20,000 a year ago today"]

    # We know the first three texts belong to one topic (topic 0), the last three belong to another topic (topic 1)
    example_topics = [0, 0, 0, 1, 1, 1]

    topics_of_examples, topic_names_dict = app.name_topics((example_texts, example_topics))  # Optional: num_generations=5

    topics_of_examples
    # Run again to get new suggested names. More text examples should result in better names.
Output:
['Text recognition', 'Text recognition', 'Text recognition', 'Stock Market Closing Bell', 'Stock Market Closing Bell', 'Stock Market Closing Bell']
In this simple example, we know the cluster assignments. In actual applications, a topic modeling library like BERTopic can cluster the texts for us, and then we can name them with topically.
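The output above holds one suggested name per input text, in the same order as the inputs. As a minimal sketch of how such a result could be inspected, the helper below groups texts under their assigned names. The function `group_texts_by_name` is hypothetical and not part of topically, and the texts are shortened stand-ins for the six headlines so the snippet runs without an API key.

```python
from collections import defaultdict


def group_texts_by_name(texts, names):
    """Group texts under the topic name assigned to each of them."""
    if len(texts) != len(names):
        raise ValueError("expected exactly one topic name per text")
    grouped = defaultdict(list)
    for text, name in zip(texts, names):
        grouped[name].append(text)
    return dict(grouped)


# Shortened stand-ins for the six headlines (assumption, for illustration only).
example_texts = [
    "books to presentations", "LeCun demo", "animate paintings",
    "brokers blocking purchases", "United Airlines stock", "Bitcoin price",
]
# Names copied from the output shown above, one per text.
topics_of_examples = ["Text recognition"] * 3 + ["Stock Market Closing Bell"] * 3

grouped = group_texts_by_name(example_texts, topics_of_examples)
for name, texts in grouped.items():
    print(f"{name}: {len(texts)} texts")
```

Grouping by name rather than by topic id makes it easy to read each cluster alongside its label and judge whether the suggested name fits.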
Usage Example: Topically + BERTopic
Use Topically to name clusters in the course of topic modeling with tools like BERTopic. Get the cluster assignments from BERTopic, and name the clusters with topically. This improves on the keyword topic labels (and can build upon them).
Here's example code and a colab notebook demonstrating this.
Code excerpt:
    from bertopic import BERTopic
    from sklearn.cluster import KMeans
    from topically import Topically

    # Load and initialize BERTopic to use KMeans clustering with 8 clusters only.
    cluster_model = KMeans(n_clusters=8)
    topic_model = BERTopic(hdbscan_model=cluster_model)

    # df is a dataframe. df['title'] is the column of text we're modeling.
    # embeds holds precomputed embeddings for df['title'] (not defined in this excerpt).
    df['topic'], probabilities = topic_model.fit_transform(df['title'], embeds)

    # Load topically
    app = Topically('cohere_api_key')

    # name clusters
    df['topic_name'], topic_names = app.name_topics((df['title'], df['topic']))

    df[['title', 'topic', 'topic_name']]
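The first usage example notes that more text examples should result in better names. One possible pre-check, sketched here under assumptions, is to count how many texts landed in each cluster before naming them. The helper `small_clusters` and its `min_examples` threshold are hypothetical and not part of topically or BERTopic; the list of topic ids is a stand-in for `df['topic']`.

```python
from collections import Counter


def small_clusters(topics, min_examples=3):
    """Return {topic_id: count} for clusters with fewer than min_examples texts.

    min_examples is an illustrative threshold, not a value taken from topically.
    """
    counts = Counter(topics)
    return {topic: n for topic, n in counts.items() if n < min_examples}


# Stand-in for list(df['topic']) after clustering (assumption, for illustration).
topics = [0, 0, 0, 1, 1, 2, 0, 1, 1]

flagged = small_clusters(topics)
print(flagged)  # clusters that may be too small to name reliably
```

Clusters flagged this way could be merged, re-clustered, or simply treated with more caution when reading their suggested names.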
Installation
