This in-class exercise explores the concepts and methods of Text Visualisation. It visualises and analyses text data from a collection of 20 newsgroups. It introduces the tidytext framework for processing, wrangling, analysing and visualising text data, supported by the tidyverse, widyr, wordcloud, ggwordcloud, textplot, DT, lubridate, hms, tidygraph, ggraph, and igraph packages.
packages = c('tidytext',
             'widyr', 'wordcloud',
             'DT', 'ggwordcloud',
             'textplot', 'lubridate',
             'hms', 'tidyverse',
             'tidygraph', 'ggraph',
             'igraph')

for(p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
news20 <- "data/20news/"
read_lines() of readr package is used to read up to n_max lines from a file.
map() of purrr package is used to transform their input by applying a function to each element of a list and returning an object of the same length as the input.
unnest() of tidyr package is used to flatten a list-column of data frames back out into regular columns.
mutate() of dplyr is used to add new variables and preserves existing ones;
transmute() of dplyr is used to add new variables and drops existing ones.
write_rds() of readr package is used to save the extracted and combined data frame as an rds file for future use, and read_rds() is used to read it back.
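The steps above can be sketched as follows. This is a minimal sketch, not the original import chunk: the folder layout (data/20news/&lt;newsgroup&gt;/&lt;message id&gt;) and the rds path are assumptions.

```r
library(tidyverse)

news20 <- "data/20news/"

# Assumed layout: one folder per newsgroup, one file per message
raw_text <- tibble(folder = dir(news20, full.names = TRUE)) %>%
  mutate(file = map(folder,
                    ~ dir(.x, full.names = TRUE))) %>%  # list files per folder
  unnest(cols = c(file)) %>%                             # flatten list-column
  transmute(newsgroup = basename(folder),                # keep only new columns
            id = basename(file),
            text = map(file, read_lines)) %>%            # read each message
  unnest(cols = c(text))                                 # one row per line

write_rds(raw_text, "data/rds/news20.rds")   # hypothetical path; save for reuse
raw_text <- read_rds("data/rds/news20.rds")  # read back in later sessions
```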
Frequency of messages by newsgroup
raw_text %>%
group_by(newsgroup) %>%
summarize(messages = n_distinct(id)) %>%
ggplot(aes(messages, newsgroup)) +
geom_col(fill = "lightblue") +
labs(y = NULL)
Notice that each message has some structure and extra text that we don’t want to include in our analysis. For example, every message has a header, containing fields such as “from:” or “in_reply_to:” that describe the message. Some also have automated email signatures, which occur after a line like “--”.
cumsum() of base R is used to return a vector whose elements are the cumulative sums of the elements of the argument.
str_detect() from stringr is used to detect the presence or absence of a pattern in a string.
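A sketch of the header- and signature-stripping step these two functions support, assuming raw_text has newsgroup, id and text columns. The header ends at the first blank line, and a signature begins at a line starting with “--”.

```r
library(tidyverse)
library(stringr)

cleaned_text <- raw_text %>%
  group_by(newsgroup, id) %>%
  # cumsum(text == "") stays 0 until the first blank line,
  # so the header lines before it are dropped
  filter(cumsum(text == "") > 0,
         # cumsum over the "--" marker becomes positive once a
         # signature starts, dropping it and everything after
         cumsum(str_detect(text, "^--")) == 0) %>%
  ungroup()
```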
Regular expressions are used to remove nested text representing quotes from other users.
cleaned_text <- cleaned_text %>%
filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
| text == "",
!str_detect(text,
"writes(:|\\.\\.\\.)$"),
!str_detect(text,
"^In article <")
)
filter() of dplyr package is used to subset a data frame, retaining all rows that satisfy the specified conditions.
unnest_tokens() of tidytext package is used to split the dataset into tokens, while the stop_words dataset is used to remove stop words.
usenet_words <- cleaned_text %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$"),
!word %in% stop_words$word)
Find the most common words in the entire dataset, or within particular newsgroups
usenet_words %>%
count(word, sort = TRUE)
# A tibble: 5,542 x 2
word n
<chr> <int>
1 people 57
2 time 50
3 jesus 47
4 god 44
5 message 40
6 br 27
7 bible 23
8 drive 23
9 homosexual 23
10 read 22
# ... with 5,532 more rows
Instead of counting individual words, you can also count words by newsgroup.
words_by_newsgroup <- usenet_words %>%
count(newsgroup, word, sort = TRUE) %>%
ungroup()
wordcloud() of wordcloud package is used to plot a static wordcloud.
wordcloud(words_by_newsgroup$word,
words_by_newsgroup$n,
max.words = 300)
A DT table can be used to complement the visual discovery.
DT::datatable(words_by_newsgroup,
filter = 'top') %>%
formatStyle(0, target = 'row',
lineHeight='25%')
The wordcloud below is plotted by using ggwordcloud package.
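Since the original chunk is not shown, here is one possible ggwordcloud version of the wordcloud above; the 300-word cap mirrors the earlier wordcloud() call, and max_size is an assumed tuning value.

```r
library(tidyverse)
library(ggwordcloud)

words_by_newsgroup %>%
  slice_max(n, n = 300) %>%                # keep the 300 most frequent words
  ggplot(aes(label = word, size = n)) +
  geom_text_wordcloud() +                  # wordcloud layout as a ggplot layer
  scale_size_area(max_size = 15) +         # scale word size by frequency
  theme_minimal()
```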
bind_tf_idf() of tidytext is used to compute the term frequency (tf), inverse document frequency (idf) and tf-idf of a tidy text dataset and bind them to the dataset.
tf_idf <- words_by_newsgroup %>%
bind_tf_idf(word, newsgroup, n) %>%
arrange(desc(tf_idf))
datatable() of DT package is used to create an HTML table that allows pagination of rows and columns.
The filter argument is used to control the filter UI.
formatRound() is used to customise the value format. The digits argument defines the number of decimal places.
formatStyle() is used to customise the output table. In this example, the target and lineHeight arguments are used to reduce the line height to 25%.
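Putting the functions above together, a sketch of the interactive tf-idf table might look like this (the three-decimal-place rounding is an assumption):

```r
library(DT)

DT::datatable(tf_idf,
              filter = 'top') %>%                  # column filter UI on top
  formatRound(columns = c('tf', 'idf', 'tf_idf'),
              digits = 3) %>%                      # round to 3 decimal places
  formatStyle(0, target = 'row',
              lineHeight = '25%')                  # compress row height
```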
The facet bar chart technique is used to visualise the tf-idf values of the science-related newsgroups.
tf_idf %>%
filter(str_detect(newsgroup, "^sci\\.")) %>%
group_by(newsgroup) %>%
slice_max(tf_idf, n = 12) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)) %>%
ggplot(aes(tf_idf, word, fill = newsgroup)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup, scales = "free") +
labs(x = "tf-idf", y = NULL)
widyr package first ‘casts’ a tidy dataset into a wide matrix, performs an operation such as a correlation on it, then re-tidies the result.
pairwise_cor() of widyr package is used to compute the correlation between newsgroups based on the common words found.
newsgroup_cors <- words_by_newsgroup %>%
pairwise_cor(newsgroup,
word,
n,
sort = TRUE)
Visualise the relationship between newsgroups in a network graph.
set.seed(2017)
newsgroup_cors %>%
filter(correlation > .025) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation,
width = correlation)) +
geom_node_point(size = 6,
color = "lightblue") +
geom_node_text(aes(label = name),
color = "red",
repel = TRUE) +
theme_void()
A bigram data frame is created by using unnest_tokens() of tidytext.
bigrams <- cleaned_text %>%
unnest_tokens(bigram,
text,
token = "ngrams",
n = 2)
bigrams
# A tibble: 28,824 x 3
newsgroup id bigram
<chr> <chr> <chr>
1 alt.atheism 54256 <NA>
2 alt.atheism 54256 <NA>
3 alt.atheism 54256 as i
4 alt.atheism 54256 i don't
5 alt.atheism 54256 don't know
6 alt.atheism 54256 know this
7 alt.atheism 54256 this book
8 alt.atheism 54256 book i
9 alt.atheism 54256 i will
10 alt.atheism 54256 will use
# ... with 28,814 more rows
Count and sort the bigram data frame in descending order.
bigrams_count <- bigrams %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)
bigrams_count
# A tibble: 19,885 x 2
bigram n
<chr> <int>
1 of the 169
2 in the 113
3 to the 74
4 to be 59
5 for the 52
6 i have 48
7 that the 47
8 if you 40
9 on the 39
10 it is 38
# ... with 19,875 more rows
Separate each bigram into two words.
bigrams_separated <- bigrams %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("word1", "word2"),
           sep = " ")
bigrams_filtered <- bigrams_separated %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
bigrams_filtered
# A tibble: 4,604 x 4
newsgroup id word1 word2
<chr> <chr> <chr> <chr>
1 alt.atheism 54256 defines god
2 alt.atheism 54256 term preclues
3 alt.atheism 54256 science ideas
4 alt.atheism 54256 ideas drawn
5 alt.atheism 54256 supernatural precludes
6 alt.atheism 54256 scientific assertions
7 alt.atheism 54256 religious dogma
8 alt.atheism 54256 religion involves
9 alt.atheism 54256 involves circumventing
10 alt.atheism 54256 gain absolute
# ... with 4,594 more rows
bigram_counts <- bigrams_filtered %>%
count(word1, word2, sort = TRUE)
A network graph is created by using graph_from_data_frame() of igraph.
bigram_graph <- bigram_counts %>%
filter(n > 3) %>%
graph_from_data_frame()
bigram_graph
IGRAPH 5c81da3 DN-- 40 24 --
+ attr: name (v/c), n (e/n)
+ edges from 5c81da3 (vertex names):
[1] 1 ->2 1 ->3
[3] static ->void time ->pad
[5] 1 ->4 infield ->fly
[7] mat ->28 vv ->vv
[9] 1 ->5 cock ->crow
[11] noticeshell->widget 27 ->1993
[13] 3 ->4 child ->molestation
[15] cock ->crew gun ->violence
+ ... omitted several edges
The ggraph package is used to plot the bigram network.
set.seed(1234)
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name),
vjust = 1,
hjust = 1)
set.seed(1234)
a <- grid::arrow(type = "closed",
length = unit(.15,
"inches"))
ggraph(bigram_graph,
layout = "fr") +
geom_edge_link(aes(edge_alpha = n),
show.legend = FALSE,
arrow = a,
end_cap = circle(.07,
'inches')) +
geom_node_point(color = "lightblue",
size = 5) +
geom_node_text(aes(label = name),
vjust = 1,
hjust = 1) +
theme_void()
For attribution, please cite this work as
Dolit (2021, July 11). Visual Analytics & Applications: Text Visual Analytics with R. Retrieved from https://adolit-vaa.netlify.app/posts/2021-07-11-text-vis-r/
BibTeX citation
@misc{dolit2021text,
  author = {Dolit, Archie},
  title = {Visual Analytics & Applications: Text Visual Analytics with R},
  url = {https://adolit-vaa.netlify.app/posts/2021-07-11-text-vis-r/},
  year = {2021}
}