Text Visual Analytics with R

Text Visualisation R In-class Exercise

This in-class exercise explores the concepts and methods of Text Visualisation. It visualises and analyses the text data from a collection of 20 newsgroups. It introduces the tidytext framework for processing, wrangling, analysing and visualising text data, together with the tidyverse, widyr, wordcloud, ggwordcloud, textplot, DT, lubridate, and hms packages.

Archie Dolit (https://www.linkedin.com/in/adolit/), School of Computing and Information Systems, Singapore Management University (https://scis.smu.edu.sg/)
07-11-2021

Install and Launch R Packages

# Packages used in this exercise
packages <- c('tidytext', 
              'widyr', 'wordcloud',
              'DT', 'ggwordcloud', 
              'textplot', 'lubridate', 
              'hms', 'tidyverse', 
              'tidygraph', 'ggraph',
              'igraph')
# Install any package that is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

Import Multiple Text Files from Multiple Folders

Step 1: Defining the path to the folder that holds the newsgroup sub-folders

news20 <- "data/20news/"

Step 2: Define a function to read all files from a folder into a data frame

read_folder <- function(infolder) {
  tibble(file = dir(infolder, 
                    full.names = TRUE)) %>%
    mutate(text = map(file, 
                      read_lines)) %>%
    transmute(id = basename(file), 
              text) %>%
    unnest(text)
}
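
To see what read_folder() returns, the sketch below builds a throwaway folder with two small files and reads it back. The folder name, file names and file contents are made up for illustration, and the function body is repeated only so that the block runs on its own.

```r
library(tidyverse)

# Same definition as above, repeated so this block is self-contained
read_folder <- function(infolder) {
  tibble(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# Hypothetical demo folder containing two tiny "messages"
demo_dir <- file.path(tempdir(), "demo_newsgroup")
dir.create(demo_dir, showWarnings = FALSE)
write_lines(c("From: a@example.com", "", "Hello world"),
            file.path(demo_dir, "1001"))
write_lines(c("From: b@example.com", "", "Second message", "--", "sig"),
            file.path(demo_dir, "1002"))

# One row per line of text, with the file name stored in `id`
demo <- read_folder(demo_dir)
demo
```

Each file contributes as many rows as it has lines, which is why later steps count messages with n_distinct(id) rather than with the number of rows.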

Step 3: Reading in all the messages from the 20news folder

raw_text <- tibble(folder = 
                     dir(news20, 
                         full.names = TRUE)) %>%
  mutate(folder_out = map(folder, 
                          read_folder)) %>%
  unnest(cols = c(folder_out)) %>%
  transmute(newsgroup = basename(folder), 
            id, text)
write_rds(raw_text, "data/rds/news20.rds")

Initial Exploratory Analysis

Frequency of messages by newsgroup

raw_text %>%
  group_by(newsgroup) %>%
  summarize(messages = n_distinct(id)) %>%
  ggplot(aes(messages, newsgroup)) +
  geom_col(fill = "lightblue") +
  labs(y = NULL)

Cleaning Text Data

Step 1: Removing header and automated email signatures

Notice that each message has some structure and extra text that we don’t want to include in our analysis. For example, every message has a header containing fields such as “from:” or “in_reply_to:” that describe the message. Some also have automated email signatures, which occur after a line beginning with “--”. The code below keeps only the lines after the first blank line (the end of the header) and before the first line that starts with “--”.

cleaned_text <- raw_text %>%
  group_by(newsgroup, id) %>%
  filter(cumsum(text == "") > 0,
         cumsum(str_detect(
           text, "^--")) == 0) %>%
  ungroup()
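
The two cumsum() conditions can be checked on a single made-up message. This is only an illustrative sketch: the newsgroup name, id and text lines below are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical message, one row per line
msg <- tibble(
  newsgroup = "demo.group",
  id = "1001",
  text = c("From: a@example.com",  # header
           "Subject: test",         # header
           "",                      # first blank line ends the header
           "Body line one",
           "Body line two",
           "--",                    # signature delimiter
           "Alice")                 # signature
)

msg_clean <- msg %>%
  group_by(newsgroup, id) %>%
  filter(cumsum(text == "") > 0,                 # TRUE from the first blank line onwards
         cumsum(str_detect(text, "^--")) == 0) %>%  # TRUE until the first "--" line
  ungroup()

msg_clean$text
```

Only the blank separator line and the two body lines survive; the header lines and everything from the "--" line onwards are dropped.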

Step 2: Removing lines with nested text representing quotes from other users.

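Regular expressions are used to remove lines that start with “>” (nested text quoting other users), as well as attribution lines such as those ending in “writes:” or starting with “In article <”.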

cleaned_text <- cleaned_text %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]")
         | text == "",
         !str_detect(text, 
                     "writes(:|\\.\\.\\.)$"),
         !str_detect(text, 
                     "^In article <")
  )
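
As a quick check of what these three patterns keep and drop, the sketch below applies them to a few made-up lines. The example lines and addresses are hypothetical.

```r
library(stringr)

lines <- c("I think this is right.",
           "> quoted reply from someone else",
           "In article <123@example.com> bob@example.com writes:",
           "alice@example.com writes:",
           "",
           ">>> deeper quote")

# Same three conditions as in the filter() call above
keep <- (str_detect(lines, "^[^>]+[A-Za-z\\d]") | lines == "") &
  !str_detect(lines, "writes(:|\\.\\.\\.)$") &
  !str_detect(lines, "^In article <")

lines[keep]
```

Only the ordinary sentence and the blank line pass all three conditions.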

Text Data Processing

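unnest_tokens() of the tidytext package is used to split the dataset into one-word tokens, while the stop_words dataset of tidytext is used to remove stop words. The str_detect() condition keeps only tokens that end in a letter or an apostrophe, which drops tokens ending in a digit.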

usenet_words <- cleaned_text %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)
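
The same pipeline can be tried on two made-up lines to see the effect of each step. The input text below is hypothetical and the sketch assumes the default stop_words dataset shipped with tidytext.

```r
library(dplyr)
library(stringr)
library(tidytext)

toy <- tibble(id = "1001",
              text = c("The drive failed in 1993",
                       "I don't know this book"))

toy_words <- toy %>%
  unnest_tokens(word, text) %>%        # one lower-cased token per row
  filter(str_detect(word, "[a-z']$"),  # keep tokens ending in a letter or apostrophe
         !word %in% stop_words$word)   # drop common English stop words

toy_words
```

The token "1993" is removed by the str_detect() condition, and words such as "the" are removed by the stop-word filter, while content words such as "drive" remain.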

Find the most common words in the entire dataset, or within particular newsgroups

usenet_words %>%
  count(word, sort = TRUE)
# A tibble: 5,542 x 2
   word           n
   <chr>      <int>
 1 people        57
 2 time          50
 3 jesus         47
 4 god           44
 5 message       40
 6 br            27
 7 bible         23
 8 drive         23
 9 homosexual    23
10 read          22
# ... with 5,532 more rows

Instead of counting individual words across the whole dataset, you can also count words within each newsgroup.

words_by_newsgroup <- usenet_words %>%
  count(newsgroup, word, sort = TRUE) %>%
  ungroup()

Visualising Words in newsgroups

wordcloud() of wordcloud package is used to plot a static wordcloud

wordcloud(words_by_newsgroup$word,
          words_by_newsgroup$n,
          max.words = 300)

A DT table can be used to complement the visual discovery.

DT::datatable(words_by_newsgroup, 
              filter = 'top') %>% 
  formatStyle(0, target = 'row', 
              lineHeight='25%')

The wordcloud below is plotted by using ggwordcloud package.

set.seed(1234)
words_by_newsgroup %>%
  filter(n > 5) %>%
  ggplot(aes(label = word,
             size = n)) +
  geom_text_wordcloud() +
  theme_minimal() +
  facet_wrap(~newsgroup)

Computing tf-idf within newsgroups

bind_tf_idf() of tidytext is used to compute the term frequency (tf), inverse document frequency (idf) and tf-idf of a tidy text dataset and bind them to the dataset as new columns. Here each newsgroup is treated as one document.

tf_idf <- words_by_newsgroup %>%
  bind_tf_idf(word, newsgroup, n) %>%
  arrange(desc(tf_idf))
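
To make the three quantities concrete, the base R sketch below computes them by hand on made-up counts, using tf = n / (total words in the document) and idf = ln(number of documents / number of documents containing the word). The counts are hypothetical, and the sketch assumes each (newsgroup, word) pair appears on exactly one row, as it does after count().

```r
# Hypothetical word counts: one row per (newsgroup, word) pair
counts <- data.frame(
  newsgroup = c("sci.space", "sci.space", "rec.autos", "rec.autos"),
  word      = c("orbit", "time", "engine", "time"),
  n         = c(3, 1, 2, 2)
)

# tf: share of a newsgroup's words taken up by this word
counts$tf <- counts$n / ave(counts$n, counts$newsgroup, FUN = sum)

# idf: natural log of (number of newsgroups / newsgroups containing the word)
n_docs <- length(unique(counts$newsgroup))
docs_with_word <- ave(rep(1, nrow(counts)), counts$word, FUN = sum)
counts$idf <- log(n_docs / docs_with_word)

# tf-idf is the product of the two
counts$tf_idf <- counts$tf * counts$idf
counts
```

The word "time" occurs in both newsgroups, so its idf and tf-idf are zero, whereas "orbit" and "engine" each occur in only one newsgroup and receive positive scores.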

Visualising tf-idf as interactive table

datatable() of the DT package is used to create an interactive HTML table that allows pagination of rows and columns.

DT::datatable(tf_idf, filter = 'top') %>% 
  formatRound(columns = c('tf', 'idf', 
                          'tf_idf'), 
              digits = 3) %>%
  formatStyle(0, 
              target = 'row', 
              lineHeight='25%')

Visualising tf-idf within newsgroups

Faceted bar charts are used to visualise the words with the highest tf-idf values in the science-related newsgroups.

tf_idf %>%
  filter(str_detect(newsgroup, "^sci\\.")) %>%
  group_by(newsgroup) %>%
  slice_max(tf_idf, n = 12) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(tf_idf, word, fill = newsgroup)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, scales = "free") +
  labs(x = "tf-idf", y = NULL)

Counting and correlating pairs of words with the widyr package

widyr package first ‘casts’ a tidy dataset into a wide matrix, performs an operation such as a correlation on it, then re-tidies the result.

pairwise_cor() of the widyr package is used to compute the correlation between newsgroups based on the counts of the words they have in common.

newsgroup_cors <- words_by_newsgroup %>%
  pairwise_cor(newsgroup, 
               word, 
               n, 
               sort = TRUE)
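
The cast, operate and re-tidy idea can be sketched in base R on made-up counts. This illustrates the concept only and is not a description of widyr's internals; the newsgroup labels and counts are hypothetical, and filling missing pairs with 0 is an assumption of this sketch.

```r
# Hypothetical tidy counts: one row per (newsgroup, word) pair
counts <- data.frame(
  newsgroup = c("A", "A", "A", "B", "B", "C", "C"),
  word      = c("god", "bible", "drive", "god", "bible", "drive", "disk"),
  n         = c(4, 3, 1, 5, 2, 6, 3)
)

# 1. Cast: wide word-by-newsgroup matrix, missing pairs set to 0
wide <- tapply(counts$n, list(counts$word, counts$newsgroup), sum)
wide[is.na(wide)] <- 0

# 2. Operate: Pearson correlation between the newsgroup columns
cors <- cor(wide)

# 3. Re-tidy: back to one row per pair of newsgroups, self-pairs removed
tidy_cors <- as.data.frame(as.table(cors), stringsAsFactors = FALSE)
names(tidy_cors) <- c("item1", "item2", "correlation")
tidy_cors <- tidy_cors[tidy_cors$item1 != tidy_cors$item2, ]
tidy_cors[order(-tidy_cors$correlation), ]
```

Newsgroups A and B share the words "god" and "bible" and come out positively correlated, while A and C, which overlap only on "drive", come out negatively correlated.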

Visualising correlation as a network

Visualise the relationship between newsgroups as a network graph, keeping only pairs whose correlation is above 0.025.

set.seed(2017)

newsgroup_cors %>%
  filter(correlation > .025) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, 
                     width = correlation)) +
  geom_node_point(size = 6, 
                  color = "lightblue") +
  geom_node_text(aes(label = name),
                 color = "red",
                 repel = TRUE) +
  theme_void()

Bigram

A bigram data frame is created by using unnest_tokens() of tidytext with token = "ngrams" and n = 2.

bigrams <- cleaned_text %>%
  unnest_tokens(bigram, 
                text, 
                token = "ngrams", 
                n = 2)

bigrams
# A tibble: 28,824 x 3
   newsgroup   id    bigram    
   <chr>       <chr> <chr>     
 1 alt.atheism 54256 <NA>      
 2 alt.atheism 54256 <NA>      
 3 alt.atheism 54256 as i      
 4 alt.atheism 54256 i don't   
 5 alt.atheism 54256 don't know
 6 alt.atheism 54256 know this 
 7 alt.atheism 54256 this book 
 8 alt.atheism 54256 book i    
 9 alt.atheism 54256 i will    
10 alt.atheism 54256 will use  
# ... with 28,814 more rows
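
A bigram is simply a pair of adjacent words. The base R sketch below builds the bigrams of one line by hand to show the overlap between consecutive pairs; it is an illustration of the idea rather than of how tidytext tokenises internally. The `<NA>` rows in the output above most likely come from lines with fewer than two words, such as blank lines.

```r
line <- "as i don't know this book"

# Split into lower-cased words
words <- strsplit(tolower(line), " ")[[1]]

# Pair each word with the one that follows it
bigrams_demo <- paste(head(words, -1), tail(words, -1))
bigrams_demo
```

A line of six words yields five overlapping bigrams, from "as i" to "this book".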

Counting bigrams

Count the bigrams and sort them in descending order of frequency

bigrams_count <- bigrams %>%
  filter(!is.na(bigram)) %>%   # drop missing bigrams explicitly
  count(bigram, sort = TRUE)

bigrams_count
# A tibble: 19,885 x 2
   bigram       n
   <chr>    <int>
 1 of the     169
 2 in the     113
 3 to the      74
 4 to be       59
 5 for the     52
 6 i have      48
 7 that the    47
 8 if you      40
 9 on the      39
10 it is       38
# ... with 19,875 more rows

Cleaning bigram

Separate each bigram into two words, then drop bigrams in which either word is a stop word

bigrams_separated <- bigrams %>%
  filter(!is.na(bigram)) %>%   # drop missing bigrams explicitly
  separate(bigram, c("word1", "word2"), 
           sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)

bigrams_filtered
# A tibble: 4,604 x 4
   newsgroup   id    word1        word2        
   <chr>       <chr> <chr>        <chr>        
 1 alt.atheism 54256 defines      god          
 2 alt.atheism 54256 term         preclues     
 3 alt.atheism 54256 science      ideas        
 4 alt.atheism 54256 ideas        drawn        
 5 alt.atheism 54256 supernatural precludes    
 6 alt.atheism 54256 scientific   assertions   
 7 alt.atheism 54256 religious    dogma        
 8 alt.atheism 54256 religion     involves     
 9 alt.atheism 54256 involves     circumventing
10 alt.atheism 54256 gain         absolute     
# ... with 4,594 more rows
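
The effect of the separate-and-filter step can be checked on a handful of made-up bigrams. The input values are hypothetical and the sketch assumes the default stop_words dataset shipped with tidytext.

```r
library(dplyr)
library(tidyr)
library(tidytext)

toy_bigrams <- tibble(bigram = c("of the", "religious dogma", "this book", NA))

toy_filtered <- toy_bigrams %>%
  filter(!is.na(bigram)) %>%                          # drop missing bigrams
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,                 # drop if either word
         !word2 %in% stop_words$word)                 # is a stop word

toy_filtered
```

Only "religious dogma" is kept, because "of the" consists of two stop words and "this book" starts with one.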

Counting the bigram again

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

Network graph from bigram data frame

Network graph is created by using graph_from_data_frame() of igraph.

bigram_graph <- bigram_counts %>%
  filter(n > 3) %>%
  graph_from_data_frame()
bigram_graph
IGRAPH 5c81da3 DN-- 40 24 -- 
+ attr: name (v/c), n (e/n)
+ edges from 5c81da3 (vertex names):
 [1] 1          ->2           1          ->3          
 [3] static     ->void        time       ->pad        
 [5] 1          ->4           infield    ->fly        
 [7] mat        ->28          vv         ->vv         
 [9] 1          ->5           cock       ->crow       
[11] noticeshell->widget      27         ->1993       
[13] 3          ->4           child      ->molestation
[15] cock       ->crew        gun        ->violence   
+ ... omitted several edges
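
In the printed summary above, "DN--" indicates a directed, named graph, and the two numbers that follow are the vertex and edge counts. The sketch below builds a small graph the same way to show how the columns are interpreted; the word pairs are taken from the output above, but the counts are made up for illustration.

```r
library(igraph)

# Hypothetical bigram counts
edges <- data.frame(
  word1 = c("static", "time", "infield"),
  word2 = c("void", "pad", "fly"),
  n     = c(6, 5, 4)
)

# The first two columns become the edge endpoints (word1 -> word2);
# remaining columns are stored as edge attributes
g <- graph_from_data_frame(edges)

vcount(g)       # number of distinct words
ecount(g)       # number of bigrams
is_directed(g)  # edges keep the word1 -> word2 direction
E(g)$n          # the count travels with each edge
```

Because the count is stored as the edge attribute n, it can later be mapped to an aesthetic such as edge transparency when plotting.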

Visualizing a network of bigrams with ggraph

The ggraph package is used to plot the bigram network.

set.seed(1234)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), 
                 vjust = 1, 
                 hjust = 1)

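Revised version, with arrows showing the direction of each bigram, edge transparency mapped to the bigram count, larger light blue nodes and a blank theme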

set.seed(1234)

a <- grid::arrow(type = "closed", 
                 length = unit(.15,
                               "inches"))
ggraph(bigram_graph, 
       layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), 
                 show.legend = FALSE,
                 arrow = a, 
                 end_cap = circle(.07,
                                  'inches')) +
  geom_node_point(color = "lightblue", 
                  size = 5) +
  geom_node_text(aes(label = name), 
                 vjust = 1, 
                 hjust = 1) +
  theme_void()

Citation

For attribution, please cite this work as

Dolit (2021, July 11). Visual Analytics & Applications: Text Visual Analytics with R. Retrieved from https://adolit-vaa.netlify.app/posts/2021-07-11-text-vis-r/

BibTeX citation

@misc{dolit2021text,
  author = {Dolit, Archie},
  title = {Visual Analytics & Applications: Text Visual Analytics with R},
  url = {https://adolit-vaa.netlify.app/posts/2021-07-11-text-vis-r/},
  year = {2021}
}