The Data Daily

TidyTuesday Week 12: Programming Languages | R-bloggers

This is my first attempt at Tidy Tuesday. The dataset today is about Programming Languages. The sample visualizations are about the comment codes.

    library(tidytuesdayR)
    library(tidyverse)
    library(skimr)
    library(ggthemes)
    library(gt)
    library(ggrepel)

Load the data first. There has been some cleaning done as outlined on the TidyTuesday ...
    ... %>% count(line_comment_token, sort = TRUE)

    programming_lang2 %>%
      gt() %>%
      tab_header(title = "Most Common Comment Tokens") %>%
      cols_label(line_comment_token = "Token", n = "# of Languages that use token")

Most Common Comment Tokens

    Token   # of Languages that use token
    //      161
    #       70
    ;       49
    --      31
    '       16
    %       12
    !       7
    *       5
    REM     2
    *>      1
    ---     1
    /       1
    NB.     1
    \       1
    \*      1
    __      1
    ~       1
    ⍝       1

There is a language rank, which measures the popularity of the language based on signals such as number of users and number of jobs. Let’s see the average rank of languages for each token.

    programming_lang %>%
      filter(is.na(line_comment_token) == FALSE) %>%
      group_by(line_comment_token) %>%
      summarize(avg_rank = mean(language_rank)) %>%
      ggplot(aes((fct_reorder(line_comment_token, avg_rank)), avg_rank)) +
      geom_col(fill = "dodgerblue2") +
      ylab("Average Rank of Language") +
      xlab("Comment Token") +
      ggtitle("Average rank of languages using different comment tokens") +
      theme_classic() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.25, hjust = 0.25))

The highest (average) ranked token is “*>”. What languages use this?

    programming_lang %>%
      filter(line_comment_token == "*>") %>%
      select(title, language_rank, line_comment_token)

    # A tibble: 1 × 3
      title language_rank line_comment_token
    1 COBOL            19 *>

Only COBOL does, so the rank of this token isn’t diluted by many less popular languages.

We can view the distribution of the language ranks for all the tokens.

    programming_lang %>%
      filter(is.na(line_comment_token) == FALSE) %>%
      ggplot(aes(line_comment_token, language_rank)) +
      geom_boxplot(color = "dodgerblue2") +
      ggtitle("The rank of languages by token.") +
      xlab("Token") +
      ylab("Language Rank") +
      theme_classic()

Okay, let’s clean this up. I’d like it sorted by the median rank. Remember, rank is in reverse numerical order: a low number means a higher rank.
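To see what sorting by the median rank does to the token order, here is a minimal sketch using `fct_reorder()` from forcats. The tokens and ranks are made-up toy values, not the TidyTuesday data, and the variable names are only for illustration.

```r
library(forcats)

# Hypothetical toy data: tokens and ranks are invented for illustration.
token <- c("//", "//", "//", "#", "#", ";")
language_rank <- c(1, 500, 3000, 2, 40, 300)

# Reorder the token levels by the median rank within each token.
# Medians: "#" = 21, ";" = 300, "//" = 500, so ascending order puts "#" first.
by_median <- fct_reorder(token, language_rank, .fun = median, .desc = FALSE)
levels(by_median)
# → "#"  ";"  "//"

# fct_rev() flips the order, putting the largest median first.
levels(fct_rev(by_median))
# → "//" ";"  "#"
```

Because a low number means a higher rank, ascending order places the token with the best-ranked "typical" language at the left of the axis. In the post's pipeline, the same call with `line_comment_token` and `language_rank` sets the order of the x-axis.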
    programming_lang %>%
      filter(is.na(line_comment_token) == FALSE) %>%
      ggplot(aes(fct_reorder(line_comment_token, language_rank, .fun = median, .desc = FALSE), language_rank)) +
      geom_boxplot(color = "dodgerblue2") +
      ggtitle("The rank of languages by token") +
      xlab("Token") +
      ylab("Language Rank") +
      theme_classic()

Let’s see the most popular language for each symbol. There might be a way to do this all at once, but I’m going to pull it out with joins to previous tables I’ve created.

    programming_lang3 <- programming_lang %>%
      filter(is.na(line_comment_token) == FALSE) %>%
      group_by(line_comment_token) %>%
      summarize(highest_rank = min(language_rank))

    join_madness <- programming_lang2 %>%
      left_join(programming_lang3, by = "line_comment_token") %>%
      left_join(programming_lang, by = c("highest_rank" = "language_rank", "line_comment_token" = "line_comment_token"))

    join_madness <- join_madness %>%
      select(line_comment_token, n, highest_rank, title, appeared, number_of_users, number_of_jobs)

So now we have a bunch of summarized data in a single dataframe. Here’s a graph. It is saying something, but I’m not sure what. When you can’t come up with a concise title, then you probably don’t know what you are trying to say…

    join_madness %>%
      ggplot(aes(highest_rank, n, size = log(number_of_users), color = log(number_of_users), label = line_comment_token)) +
      scale_y_log10() +
      scale_x_log10() +
      geom_text_repel(show.legend = FALSE) +
      ggtitle("Popularity of tokens by language rank and usage") +
      xlab("Highest Rank of language using Token") +
      ylab("Number of Languages using token") +
      theme_classic()

This is a visualization of the highest ranked languages for each token. The number of users of the dominant language is also encoded in the size and color of the label. Having it ordered makes it difficult to tell if Java or Python is the most popular (highest ranked) language.
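As an aside on the join-based lookup of the top language per token: one possible way to do it "all at once" is `add_count()` followed by a grouped `slice_min()`. This is only a sketch on a made-up tibble (the language names and ranks are invented), not the approach used in the post.

```r
library(dplyr)

# Hypothetical toy data: titles, ranks, and tokens are invented for illustration.
toy_langs <- tibble(
  title = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  language_rank = c(3, 10, 1, 7, 20),
  line_comment_token = c("//", "//", "#", "#", NA)
)

top_by_token <- toy_langs %>%
  filter(!is.na(line_comment_token)) %>%
  # Number of languages sharing each token, stored in column n.
  add_count(line_comment_token, name = "n") %>%
  group_by(line_comment_token) %>%
  # Keep the single best-ranked (lowest rank number) language per token.
  slice_min(language_rank, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  rename(highest_rank = language_rank)

top_by_token
```

The result has one row per token, carrying the count `n`, the `highest_rank`, and the `title` of that language, which is roughly the shape the two joins produce. Using `with_ties = FALSE` guarantees a single row per token even if two languages share a rank.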
    join_madness %>%
      ggplot(aes(fct_rev(fct_reorder(line_comment_token, highest_rank)), n, size = log(number_of_users), color = log(number_of_users), label = title)) +
      # geom_point() +
      scale_y_log10() +
      geom_text_repel(show.legend = FALSE) +
      ggtitle("The Most Popular Language for Each Comment Token") +
      xlab("Token") +
      ylab("Number of languages using token") +
      theme_classic()

Here is the same graph just ordered “alphabetically” by token.

    join_madness %>%
      ggplot(aes(line_comment_token, n, size = log(number_of_users), color = log(number_of_users), label = title)) +
      # geom_point() +
      scale_y_log10() +
      geom_text_repel(show.legend = FALSE) +
      ggtitle("The Most Popular Language for Each Comment Token") +
      xlab("Token") +
      ylab("Number of languages using token") +
      theme_classic()

Citation

BibTeX citation:

    @online{e.sinks2023,
      author = {Louise E. Sinks},
      title = {TidyTuesday {Week} 12: {Programming} {Languages}},
      date = {2023-03-21},
      url = {https://lsinks.github.io/posts/2023-03-21-tidytuesday-programming-languages/},
      langid = {en}
    }

For attribution, please cite this work as:

Louise E. Sinks. 2023. “TidyTuesday Week 12: Programming Languages.” March 21, 2023. https://lsinks.github.io/posts/2023-03-21-tidytuesday-programming-languages/
