Visual Detective Assignment

Visual Detective R Assignment

This assignment attempts to solve the 2021 IEEE Visual Analytics Science and Technology (VAST) Challenge: Mini-Challenge 2 by applying different visual analytics concepts, methods, and techniques with relevant R data visualisation and data analysis packages.

Archie Dolit https://www.linkedin.com/in/adolit/ (School of Computing and Information Systems, Singapore Management University)
07-25-2021

1. Overview

The VAST Challenge 2021 is a rerun of the VAST Challenge 2014 with the same story line about the fictitious island country of Kronos, the company GASTech, and the incidents involving missing GAStech employees. However, the data for the VAST Challenge 2021 were modified and new questions were introduced.

Using the two weeks worth of data leading to the disappearance of the GAStech employees, the goal of Mini-Challenge 2 is to analyze the movement and tracking of company vehicles GPS data. Together with GAStech employee’s credit card transactions and Kronos Kares loyalty card data, the main objectives are:

2. Literature Review

2.1 Data Understanding

The 2021 data was compared against 2014 data using the diffr package to better understand the variations of VAST Challenge 2021 from the previous year’s challenge.

library(diffr)
diffr("data/aspatial/car-assignments.csv", "data_2014/aspatial/car-assignments.csv")

diffr("data/aspatial/loyalty_data.csv", "data_2014/aspatial/loyalty_data.csv")

diffr("data/aspatial/cc_data.csv", "data_2014/aspatial/cc_data.csv")

library(tools)
md5sum("data/aspatial/gps.csv") == md5sum("data_2014/aspatial/gps.csv")
data/aspatial/gps.csv 
                 TRUE 
2021 VAST Challenge Map 2014 VAST Challenge Map
md5sum("data/aspatial/MC2-tourist.jpg") == md5sum("data_2014/aspatial/MC2-tourist.jpg")
data/aspatial/MC2-tourist.jpg 
                         TRUE 
md5sum("data/Geospatial/Abila.dbf") == md5sum("data_2014/Geospatial/Abila.dbf")
md5sum("data/Geospatial/Abila.kml") == md5sum("data_2014/Geospatial/Abila.kml")
md5sum("data/Geospatial/Abila.prj") == md5sum("data_2014/Geospatial/Abila.prj")
md5sum("data/Geospatial/Abila.sbn") == md5sum("data_2014/Geospatial/Abila.sbn")
md5sum("data/Geospatial/Abila.sbx") == md5sum("data_2014/Geospatial/Abila.sbx")
md5sum("data/Geospatial/Abila.shp") == md5sum("data_2014/Geospatial/Abila.shp")
md5sum("data/Geospatial/Abila.shx") == md5sum("data_2014/Geospatial/Abila.shx")

md5sum("data/Geospatial/Kronos Island.kmz") == md5sum("data_2014/Geospatial/Kronos Island.kmz")
md5sum("data/Geospatial/Kronos_Island.dbf") == md5sum("data_2014/Geospatial/Kronos_Island.dbf")
md5sum("data/Geospatial/Kronos_Island.prj") == md5sum("data_2014/Geospatial/Kronos_Island.prj")
md5sum("data/Geospatial/Kronos_Island.sbn") == md5sum("data_2014/Geospatial/Kronos_Island.sbn")
md5sum("data/Geospatial/Kronos_Island.sbx") == md5sum("data_2014/Geospatial/Kronos_Island.sbx")
md5sum("data/Geospatial/Kronos_Island.shp") == md5sum("data_2014/Geospatial/Kronos_Island.shp")
md5sum("data/Geospatial/Kronos_Island.shx") == md5sum("data_2014/Geospatial/Kronos_Island.shx")

2.2 Guide Questions

VAST Challenge 2014 focuses about ‘Patterns of Life’ analysis. It asked about the common daily routines of GAStech employees and what does a day in the life of typical GAStech employee look like.

In contrast, VAST Challenge 2021 asks to infer the owners of each credit card and loyalty card since the employee names were replaced by last 4 digits of the credit or debit card number and unique 5-character code loyalty number.

Nevertheless, both challenges want to know about unusual events, anomalies, and evidences of suspicious activities.

2.3 Visualisation Approaches

By reviewing the submissions for VAST Challenge 2014, several approaches were identified to be relevant to the current VAST challenge and reproducible using R data visualisation and data analysis packages.

Mini-Challenge 2 emphasizes the geospatial-temporal data analysis with the financial data from the credit card and loyalty transactions. The common approach from several submissions was to highlight roadway paths of the car and indicate the position and time relationship. The figure below from the Peking University, recipient of Excellent Comprehensive Visual Analysis System Award, shows an example geospatial-temporal visualisation.

This example of movement data visualisation can be achieved using sf, raster, readr, clock and tmap packages. It can also be improved by having an interactive map and tooltip information.

The heatmap visualisation below from Central South University, recipient of Outstanding Visualization and Analysis Award, shows the credit card transactions of general staff which can also be used to identify the most popular spots and when they are popular.

This example of heatmap visualisation can be achieved using gglot2 and plotly packages. It can also be improved by having an interactive tooltip information.

The concept of ‘Point of Interest’ (POI) from Virginia Tech, recipient of Effective Presentation Honorable Mention, shows the location with a diameter of 50 meters where an employee stops for more than 5 minutes. The POI concept can be utilized to correlate the gps tracking data, credit and debit card transactions, and loyalty card data to help in identifying the owners of the credit card and loyalty cards.

This example of POI data visualisation can be achieved by reusing the geospatial-temporal packages.

3. Methodology

3.1 Install and Lauch R Packages

The code chunk below is used to install and load the packages.

packages = c('ggiraph', 'plotly','lobstr',
             'raster','sf', 'tmap', 
             'igraph', 'tidygraph', 
             'ggraph', 'visNetwork', 
             'lubridate', 'clock',
             'widyr', 'wordcloud',
             'ggwordcloud', 'DT',
             'textplot', 'hms',
             'timetk','tidyverse')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

3.2 Import Data

Import the csv files into R using read_csv() of readr package.

car_data <- read_csv("data/aspatial/car-assignments.csv")
cc_data <- read_csv("data/aspatial/cc_data.csv")
loyalty_data <- read_csv("data/aspatial/loyalty_data.csv")
gps_data <- read_csv("data/aspatial/gps.csv")

glimpse(car_data)
Rows: 44
Columns: 5
$ LastName               <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName              <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID                  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType  <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(cc_data)
Rows: 1,490
Columns: 4
$ timestamp  <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty_data)
Rows: 1,392
Columns: 4
$ timestamp  <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
glimpse(gps_data)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id        <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~

Produce a georeference tif file called abila_map.tif from the tourist map MC2-Tourist.jpg and Abila shapefiles using an external open-source geographic information system (GIS) software QGIS.

Import abila_map.tif into R using raster() of Raster package,

bgmap <- raster("data/Geospatial/abila_map.tif")

tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255)

3.3 Prepare Data

There are several employees with the same last name and same first name. Create a new column FullName and combine FirstName and LastName to have unique employee name identifier using mutate() of dplyr package.

Additionally, rename columns CurrentEmploymentType to Department and CurrentEmploymentTitle to Title using rename(). Finally, convert carID field from numerical to factor data type.

car_data <- car_data %>%
  #concatenate first and last name
  mutate(FullName = paste(FirstName, LastName, sep = " ")) %>%
  rename(Deparment = CurrentEmploymentType) %>%
  rename(Title = CurrentEmploymentTitle)

car_data$CarID <- as_factor(car_data$CarID)

glimpse(car_data)
Rows: 44
Columns: 6
$ LastName  <chr> "Calixto", "Azada", "Balas", "Barranco", "Baza", "~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "Isak", "Linnea~
$ CarID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
$ Deparment <chr> "Information Technology", "Engineering", "Engineer~
$ Title     <chr> "IT Helpdesk", "Engineer", "Engineer", "SVP/CFO", ~
$ FullName  <chr> "Nils Calixto", "Lars Azada", "Felix Balas", "Ingr~

Katerina’s Cafe’ causes error when plotting a graph because of special characters. Convert the special characters into string format using mutate() and str_detec() functions.

Additionally, convert the timestamp() from character datatype to date-time format using data-time_parse() of clock package, then get the date, day of the week, and hour of transaction.

#detect and replace Katerina to Katerina's Cafe
cc_data <- cc_data %>%
    mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

#convert to date-time format
cc_data$date <- date_time_parse(cc_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")
cc_data$day <- wday(cc_data$date,
                          label = TRUE,
                          abbr = TRUE)

cc_data$timestamp <- date_time_parse(cc_data$timestamp,
                zone = "",
                format = "%m/%d/%Y %H:%M")

cc_data$hour <- get_hour(cc_data$timestamp)

glimpse(cc_data)
Rows: 1,490
Columns: 7
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ date       <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ hour       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~

Similar to cc_data, convert the special characters of Katerina’s Cafe’ into string format, convert the timestamp from character datatype to date-time format, then get the date and day of the week of transaction. Note that loyalty_data does not include the hour and minutes of the transaction.

#detect and replace Katerina to Katerina's Cafe
loyalty_data <- loyalty_data %>%
    mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

#convert to date-time format
loyalty_data$date <- date_time_parse(loyalty_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")

loyalty_data$timestamp <- date_time_parse(loyalty_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")

loyalty_data$day <- wday(loyalty_data$timestamp,
                          label = TRUE,
                          abbr = TRUE)

glimpse(loyalty_data)
Rows: 1,392
Columns: 6
$ timestamp  <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location   <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price      <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
$ date       <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~

Rename Timestamp to timestamp and id to CarID so it will be consistent with other data frame. Similar to cc_data and loyalty_data, convert the timestamp from character datatype to date-time format using data-time_parse(), then get the date and day of the week.

Convert CarID field from numerical to factor data type. Lastly, convert the gps data frame into a simple feature data frame using st_as_sf() of sf package.

#rename columns for consistency
gps_data <- gps_data %>%
  rename(timestamp = Timestamp) %>%
  rename(CarID = id)

#convert to date-time format
gps_data$date <- date_time_parse(gps_data$timestamp,
                zone = "",
                format = "%m/%d/%Y")

gps_data$day <- as.factor(wday(gps_data$date,
                          label = TRUE,
                          abbr = TRUE))

gps_data$timestamp <- date_time_parse(gps_data$timestamp,
                zone = "",
                format = "%m/%d/%Y %H:%M:%S")

gps_data$hour <- get_hour(gps_data$timestamp)

#convert to factor data type
gps_data$CarID <- as_factor(gps_data$CarID)


glimpse(gps_data)
Rows: 685,169
Columns: 7
$ timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ CarID     <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ date      <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day       <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour      <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
#convert to simple feature 
gps_sf <- st_as_sf(gps_data, 
                   coords = c("long", "lat"),
                       crs= 4326)
gps_sf
Simple feature collection with 685169 features and 5 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 685,169 x 6
   timestamp           CarID date                day    hour
 * <dttm>              <fct> <dttm>              <ord> <int>
 1 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 2 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 3 2014-01-06 06:28:03 35    2014-01-06 00:00:00 Mon       6
 4 2014-01-06 06:28:05 35    2014-01-06 00:00:00 Mon       6
 5 2014-01-06 06:28:06 35    2014-01-06 00:00:00 Mon       6
 6 2014-01-06 06:28:07 35    2014-01-06 00:00:00 Mon       6
 7 2014-01-06 06:28:09 35    2014-01-06 00:00:00 Mon       6
 8 2014-01-06 06:28:10 35    2014-01-06 00:00:00 Mon       6
 9 2014-01-06 06:28:11 35    2014-01-06 00:00:00 Mon       6
10 2014-01-06 06:28:12 35    2014-01-06 00:00:00 Mon       6
# ... with 685,159 more rows, and 1 more variable:
#   geometry <POINT [°]>

3.4 Join Data

Combine the cc_data and loyalty_data based on the purchase information like the location, date and price of transaction using full_join() of dplyr package. Exclude day and timestamp from loyalty_data since these fields are redundant with cc_data. Rearrange the columns into timestamp, date, day, hour, location, price, last4ccnum, loyaltynum.

#combine based on date, location, price, exclude day and timestamp
cc_loyalty_data <- full_join(cc_data %>% select(-c("day")),
                             loyalty_data %>% select(-c("day","timestamp")), 
                             by = c("date" = "date", 
                                    "location" = "location", 
                                    "price" = "price"))

#get day of the joint data
cc_loyalty_data$day <- wday(cc_loyalty_data$date,
                          label = TRUE,
                          abbr = TRUE)

#rearrange columns
cc_loyalty_data <- cc_loyalty_data %>%
  select("timestamp", "date", "day", "hour", "location", "price", "last4ccnum", "loyaltynum")

glimpse(cc_loyalty_data)
Rows: 1,807
Columns: 8
$ timestamp  <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ date       <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day        <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ hour       <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~
$ location   <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price      <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ loyaltynum <chr> "L8566", NA, "L8148", "L5553", "L3800", "L2247", ~

The joint financial data reveals 1,807 entries. Some entries have last4ccnum but without loyaltynum, have loyaltynum but without last4ccnum. Additionally, last4ccnum does not necessarily correpond to only 1 loyaltynum which means the owner may use multiple credit or debit cards for their loyalty card or vice versa.

Combine the car_data and gps_data based CarID using left_join() of dplyr package. Exclude FirstName and LastName from car_data since these fields are redundant with FullName.

#combine based on CarID
car_gps_data <- left_join(gps_data, 
                          car_data %>% select(-c("FirstName", "LastName")),
                          by = "CarID")

glimpse(car_gps_data)
Rows: 685,169
Columns: 10
$ timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ CarID     <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat       <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long      <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ date      <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day       <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour      <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ Deparment <chr> "Executive", "Executive", "Executive", "Executive"~
$ Title     <chr> "Environmental Safety Advisor", "Environmental Saf~
$ FullName  <chr> "Willem Vasco-Pais", "Willem Vasco-Pais", "Willem ~
car_gps_sf <- left_join(gps_sf,
                        car_data %>% select(-c("FirstName", "LastName")),
                        by = "CarID")

car_gps_sf
Simple feature collection with 685169 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS:  WGS 84
# A tibble: 685,169 x 9
   timestamp           CarID date                day    hour
   <dttm>              <fct> <dttm>              <ord> <int>
 1 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 2 2014-01-06 06:28:01 35    2014-01-06 00:00:00 Mon       6
 3 2014-01-06 06:28:03 35    2014-01-06 00:00:00 Mon       6
 4 2014-01-06 06:28:05 35    2014-01-06 00:00:00 Mon       6
 5 2014-01-06 06:28:06 35    2014-01-06 00:00:00 Mon       6
 6 2014-01-06 06:28:07 35    2014-01-06 00:00:00 Mon       6
 7 2014-01-06 06:28:09 35    2014-01-06 00:00:00 Mon       6
 8 2014-01-06 06:28:10 35    2014-01-06 00:00:00 Mon       6
 9 2014-01-06 06:28:11 35    2014-01-06 00:00:00 Mon       6
10 2014-01-06 06:28:12 35    2014-01-06 00:00:00 Mon       6
# ... with 685,159 more rows, and 4 more variables:
#   geometry <POINT [°]>, Deparment <chr>, Title <chr>,
#   FullName <chr>

The joint geospatial data reveals that some CarID cannot be mapped to specific employees. Most probably they are the truck drivers who have no specific car assignment.

Click HERE to view the Visual Detective Assignment Part 2.

Citation

For attribution, please cite this work as

Dolit (2021, July 25). Visual Analytics & Applications: Visual Detective Assignment. Retrieved from https://adolit-vaa.netlify.app/posts/2021-07-18-assignment/

BibTeX citation

@misc{dolit2021visual,
  author = {Dolit, Archie},
  title = {Visual Analytics & Applications: Visual Detective Assignment},
  url = {https://adolit-vaa.netlify.app/posts/2021-07-18-assignment/},
  year = {2021}
}