This assignment attempts to solve the 2021 IEEE Visual Analytics Science and Technology (VAST) Challenge: Mini-Challenge 2 by applying different visual analytics concepts, methods, and techniques with relevant R data visualisation and data analysis packages.
The VAST Challenge 2021 is a rerun of the VAST Challenge 2014, with the same storyline about the fictitious island country of Kronos, the company GAStech, and the incidents involving missing GAStech employees. However, the data for VAST Challenge 2021 were modified and new questions were introduced.
Using the two weeks' worth of data leading up to the disappearance of the GAStech employees, the goal of Mini-Challenge 2 is to analyze the movement and tracking data from the company vehicles' GPS records. Together with the GAStech employees' credit card transactions and Kronos Kares loyalty card data, the main objectives are:
Identify anomalies and suspicious behaviours
Identify which people use which credit cards and loyalty cards
The 2021 data was compared against the 2014 data using the diffr package to better understand how VAST Challenge 2021 differs from the previous year's challenge.
car-assignments.csv contains vehicle assignments by employee. It lists the employee name, Car ID assignment, employee department, and employee job title.
loyalty_data.csv contains the loyalty card transaction data. It lists the date of the transaction, name of the business, price, and the unique 5-character loyalty card number.
diffr("data/aspatial/loyalty_data.csv", "data_2014/aspatial/loyalty_data.csv")
cc_data.csv contains the credit and debit card transaction data. It lists the date and time of the transaction, name of the business, price, and last 4 digits of the credit or debit card number.
diffr("data/aspatial/cc_data.csv", "data_2014/aspatial/cc_data.csv")
gps.csv contains the vehicle tracking data. It lists the date and time, car ID, latitude, and longitude coordinates, with 685,169 entries.
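The gps.csv file appears unchanged from 2014. The TRUE result shown below was likely produced by a checksum comparison along these lines (a sketch, assuming md5sum() of the tools package, the same function used to check the shapefiles later in this section):
md5sum("data/aspatial/gps.csv") == md5sum("data_2014/aspatial/gps.csv")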
data/aspatial/gps.csv
TRUE
MC2-Tourist.jpg is the tourist map of Abila, the capital city of Kronos, with identified locations of interest.
[Figure: side-by-side comparison of the 2021 VAST Challenge map and the 2014 VAST Challenge map]
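The tourist map is likewise identical to the 2014 file; the TRUE result below can be reproduced with a similar checksum comparison (again a sketch, assuming md5sum() of the tools package):
md5sum("data/aspatial/MC2-tourist.jpg") == md5sum("data_2014/aspatial/MC2-tourist.jpg")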
data/aspatial/MC2-tourist.jpg
TRUE
The ESRI shapefiles of Abila and Kronos contain the geospatial vector data for storing geometric locations and their associated attribute information.
md5sum("data/Geospatial/Abila.dbf") == md5sum("data_2014/Geospatial/Abila.dbf")
md5sum("data/Geospatial/Abila.kml") == md5sum("data_2014/Geospatial/Abila.kml")
md5sum("data/Geospatial/Abila.prj") == md5sum("data_2014/Geospatial/Abila.prj")
md5sum("data/Geospatial/Abila.sbn") == md5sum("data_2014/Geospatial/Abila.sbn")
md5sum("data/Geospatial/Abila.sbx") == md5sum("data_2014/Geospatial/Abila.sbx")
md5sum("data/Geospatial/Abila.shp") == md5sum("data_2014/Geospatial/Abila.shp")
md5sum("data/Geospatial/Abila.shx") == md5sum("data_2014/Geospatial/Abila.shx")
md5sum("data/Geospatial/Kronos Island.kmz") == md5sum("data_2014/Geospatial/Kronos Island.kmz")
md5sum("data/Geospatial/Kronos_Island.dbf") == md5sum("data_2014/Geospatial/Kronos_Island.dbf")
md5sum("data/Geospatial/Kronos_Island.prj") == md5sum("data_2014/Geospatial/Kronos_Island.prj")
md5sum("data/Geospatial/Kronos_Island.sbn") == md5sum("data_2014/Geospatial/Kronos_Island.sbn")
md5sum("data/Geospatial/Kronos_Island.sbx") == md5sum("data_2014/Geospatial/Kronos_Island.sbx")
md5sum("data/Geospatial/Kronos_Island.shp") == md5sum("data_2014/Geospatial/Kronos_Island.shp")
md5sum("data/Geospatial/Kronos_Island.shx") == md5sum("data_2014/Geospatial/Kronos_Island.shx")
VAST Challenge 2014 focused on 'Patterns of Life' analysis. It asked about the common daily routines of GAStech employees and what a day in the life of a typical GAStech employee looks like.
In contrast, VAST Challenge 2021 asks participants to infer the owner of each credit card and loyalty card, since the employee names were replaced by the last 4 digits of the credit or debit card number and a unique 5-character loyalty card number.
Nevertheless, both challenges ask about unusual events, anomalies, and evidence of suspicious activities.
A review of the VAST Challenge 2014 submissions identified several approaches that are relevant to the current VAST Challenge and reproducible using R data visualisation and data analysis packages.
Mini-Challenge 2 emphasizes geospatial-temporal data analysis combined with the financial data from the credit card and loyalty card transactions. A common approach across several submissions was to highlight the roadway paths of each car and indicate the relationship between position and time. The figure below from Peking University, recipient of the Excellent Comprehensive Visual Analysis System Award, shows an example geospatial-temporal visualisation.
This example of movement data visualisation can be achieved using the sf, raster, readr, clock and tmap packages. It can also be improved with an interactive map and tooltip information.
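A minimal sketch of such an interactive movement map, assuming the bgmap raster and the gps_sf simple-feature points prepared later in this post:
#sketch: interactive movement map with hover tooltips (assumes bgmap and gps_sf)
tmap_mode("view")                                      #switch tmap to interactive mode
tm_shape(bgmap) +
  tm_rgb(r = 1, g = 2, b = 3, max.value = 255) +
  tm_shape(gps_sf %>% filter(CarID == "1")) +          #one vehicle for readability
  tm_dots(col = "red", size = 0.01, id = "timestamp")  #hover shows the timestamp
tmap_mode("plot")                                      #return to static mode afterwards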
The heatmap visualisation below from Central South University, recipient of the Outstanding Visualization and Analysis Award, shows the credit card transactions of general staff, which can also be used to identify the most popular spots and when they are popular.
This example of heatmap visualisation can be achieved using the ggplot2 and plotly packages. It can also be improved with interactive tooltip information.
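A minimal sketch of such an interactive heatmap, assuming the cleaned cc_data (with its derived hour column) prepared later in this post:
#sketch: heatmap of credit card transactions by location and hour (assumes cc_data)
p <- cc_data %>%
  count(location, hour) %>%
  ggplot(aes(x = hour, y = location, fill = n)) +
  geom_tile() +
  labs(x = "Hour of day", y = "Location", fill = "No. of transactions")
ggplotly(p)   #adds hover tooltips showing location, hour and count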
The concept of 'Point of Interest' (POI) from Virginia Tech, recipient of the Effective Presentation Honorable Mention, identifies locations with a diameter of 50 meters where an employee stops for more than 5 minutes. The POI concept can be used to correlate the GPS tracking data, the credit and debit card transactions, and the loyalty card data to help identify the owners of the credit cards and loyalty cards.
This example of POI data visualisation can be achieved by reusing the geospatial-temporal packages.
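One way to flag candidate POIs can be sketched as follows, assuming the cleaned gps_data prepared later in this post and treating a gap of more than 5 minutes between consecutive GPS pings of the same vehicle as a likely stop:
#sketch: flag gaps of more than 5 minutes between consecutive pings (assumes gps_data)
poi_candidates <- gps_data %>%
  arrange(CarID, timestamp) %>%
  group_by(CarID) %>%
  mutate(gap_mins = as.numeric(difftime(lead(timestamp), timestamp, units = "mins"))) %>%
  ungroup() %>%
  filter(gap_mins > 5)     #vehicle likely parked for more than 5 minutes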
The code chunk below is used to install and load the packages.
packages = c('ggiraph', 'plotly', 'lobstr',
             'raster', 'sf', 'tmap',
             'igraph', 'tidygraph',
             'ggraph', 'visNetwork',
             'lubridate', 'clock',
             'widyr', 'wordcloud',
             'ggwordcloud', 'DT',
             'textplot', 'hms',
             'timetk', 'tidyverse')

for(p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}
Import the csv files into R using read_csv() of the readr package.
car_data <- read_csv("data/aspatial/car-assignments.csv")
cc_data <- read_csv("data/aspatial/cc_data.csv")
loyalty_data <- read_csv("data/aspatial/loyalty_data.csv")
gps_data <- read_csv("data/aspatial/gps.csv")
glimpse(car_data)
Rows: 44
Columns: 5
$ LastName <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
glimpse(cc_data)
Rows: 1,490
Columns: 4
$ timestamp <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(loyalty_data)
Rows: 1,392
Columns: 4
$ timestamp <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
glimpse(gps_data)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
Produce a georeferenced tif file called abila_map.tif from the tourist map MC2-Tourist.jpg and the Abila shapefiles using QGIS, an external open-source geographic information system (GIS) software.
Import abila_map.tif into R using raster() of the raster package.
bgmap <- raster("data/Geospatial/abila_map.tif")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1, g = 2, b = 3,
         alpha = NA,
         saturation = 1,
         interpolate = TRUE,
         max.value = 255)
Several employees share the same last name or first name. Create a new column FullName by combining FirstName and LastName to serve as a unique employee identifier, using mutate() of the dplyr package.
Additionally, rename the columns CurrentEmploymentType to Department and CurrentEmploymentTitle to Title using rename(). Finally, convert the CarID field from numerical to factor data type.
car_data <- car_data %>%
  #concatenate first and last name
  mutate(FullName = paste(FirstName, LastName, sep = " ")) %>%
  #rename columns
  rename(Department = CurrentEmploymentType) %>%
  rename(Title = CurrentEmploymentTitle)

#convert to factor data type
car_data$CarID <- as_factor(car_data$CarID)
glimpse(car_data)
Rows: 44
Columns: 6
$ LastName <chr> "Calixto", "Azada", "Balas", "Barranco", "Baza", "~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "Isak", "Linnea~
$ CarID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,~
$ Department <chr> "Information Technology", "Engineering", "Engineer~
$ Title <chr> "IT Helpdesk", "Engineer", "Engineer", "SVP/CFO", ~
$ FullName <chr> "Nils Calixto", "Lars Azada", "Felix Balas", "Ingr~
The location name Katerina's Café contains special characters that cause errors when plotting graphs. Replace it with the plain string "Katerina's Cafe" using the mutate() and str_detect() functions.
Additionally, convert the timestamp field from character data type to date-time format using date_time_parse() of the clock package, then derive the date, day of the week, and hour of each transaction.
#detect and replace Katerina to Katerina's Cafe
cc_data <- cc_data %>%
  mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

#convert to date-time format
cc_data$date <- date_time_parse(cc_data$timestamp,
                                zone = "",
                                format = "%m/%d/%Y")

cc_data$day <- wday(cc_data$date,
                    label = TRUE,
                    abbr = TRUE)

cc_data$timestamp <- date_time_parse(cc_data$timestamp,
                                     zone = "",
                                     format = "%m/%d/%Y %H:%M")

cc_data$hour <- get_hour(cc_data$timestamp)
glimpse(cc_data)
Rows: 1,490
Columns: 7
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ date <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ hour <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~
Similar to cc_data, replace the special characters of Katerina's Café with a plain string, convert the timestamp from character data type to date-time format, then derive the date and day of the week of each transaction. Note that loyalty_data does not include the hour and minutes of the transaction.
#detect and replace Katerina to Katerina's Cafe
loyalty_data <- loyalty_data %>%
  mutate(location = ifelse(str_detect(location, "Katerina"), "Katerina's Cafe", location))

#convert to date-time format
loyalty_data$date <- date_time_parse(loyalty_data$timestamp,
                                     zone = "",
                                     format = "%m/%d/%Y")

loyalty_data$timestamp <- date_time_parse(loyalty_data$timestamp,
                                          zone = "",
                                          format = "%m/%d/%Y")

loyalty_data$day <- wday(loyalty_data$timestamp,
                         label = TRUE,
                         abbr = TRUE)
glimpse(loyalty_data)
Rows: 1,392
Columns: 6
$ timestamp <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
$ date <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
Rename Timestamp to timestamp and id to CarID for consistency with the other data frames. Similar to cc_data and loyalty_data, convert the timestamp from character data type to date-time format using date_time_parse(), then derive the date and day of the week.
Convert the CarID field from numerical to factor data type. Lastly, convert the gps data frame into a simple feature data frame using st_as_sf() of the sf package.
#rename columns for consistency
gps_data <- gps_data %>%
  rename(timestamp = Timestamp) %>%
  rename(CarID = id)

#convert to date-time format
gps_data$date <- date_time_parse(gps_data$timestamp,
                                 zone = "",
                                 format = "%m/%d/%Y")

gps_data$day <- as.factor(wday(gps_data$date,
                               label = TRUE,
                               abbr = TRUE))

gps_data$timestamp <- date_time_parse(gps_data$timestamp,
                                      zone = "",
                                      format = "%m/%d/%Y %H:%M:%S")

gps_data$hour <- get_hour(gps_data$timestamp)

#convert to factor data type
gps_data$CarID <- as_factor(gps_data$CarID)
glimpse(gps_data)
Rows: 685,169
Columns: 7
$ timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ CarID <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ date <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
#convert to simple feature
gps_sf <- st_as_sf(gps_data,
                   coords = c("long", "lat"),
                   crs = 4326)
gps_sf
Simple feature collection with 685169 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS: WGS 84
# A tibble: 685,169 x 6
timestamp CarID date day hour
* <dttm> <fct> <dttm> <ord> <int>
1 2014-01-06 06:28:01 35 2014-01-06 00:00:00 Mon 6
2 2014-01-06 06:28:01 35 2014-01-06 00:00:00 Mon 6
3 2014-01-06 06:28:03 35 2014-01-06 00:00:00 Mon 6
4 2014-01-06 06:28:05 35 2014-01-06 00:00:00 Mon 6
5 2014-01-06 06:28:06 35 2014-01-06 00:00:00 Mon 6
6 2014-01-06 06:28:07 35 2014-01-06 00:00:00 Mon 6
7 2014-01-06 06:28:09 35 2014-01-06 00:00:00 Mon 6
8 2014-01-06 06:28:10 35 2014-01-06 00:00:00 Mon 6
9 2014-01-06 06:28:11 35 2014-01-06 00:00:00 Mon 6
10 2014-01-06 06:28:12 35 2014-01-06 00:00:00 Mon 6
# ... with 685,159 more rows, and 1 more variable:
# geometry <POINT [°]>
Combine cc_data and loyalty_data based on the purchase information, namely the location, date, and price of the transaction, using full_join() of the dplyr package. Exclude day and timestamp from loyalty_data since these fields are redundant with cc_data. Rearrange the columns into timestamp, date, day, hour, location, price, last4ccnum, loyaltynum.
#combine based on date, location, price, exclude day and timestamp
cc_loyalty_data <- full_join(cc_data %>% select(-c("day")),
                             loyalty_data %>% select(-c("day", "timestamp")),
                             by = c("date" = "date",
                                    "location" = "location",
                                    "price" = "price"))

#get day of the joint data
cc_loyalty_data$day <- wday(cc_loyalty_data$date,
                            label = TRUE,
                            abbr = TRUE)

#rearrange columns
cc_loyalty_data <- cc_loyalty_data %>%
  select("timestamp", "date", "day", "hour", "location", "price", "last4ccnum", "loyaltynum")
glimpse(cc_loyalty_data)
Rows: 1,807
Columns: 8
$ timestamp <dttm> 2014-01-06 07:28:00, 2014-01-06 07:34:00, 2014-0~
$ date <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, ~
$ day <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon,~
$ hour <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <dbl> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
$ loyaltynum <chr> "L8566", NA, "L8148", "L5553", "L3800", "L2247", ~
The joined financial data contains 1,807 entries. Some entries have a last4ccnum but no loyaltynum, while others have a loyaltynum but no last4ccnum. Additionally, a last4ccnum does not necessarily correspond to only one loyaltynum, which means an owner may use multiple credit or debit cards with the same loyalty card, or vice versa.
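For example, the credit cards that pair with more than one loyalty number can be listed with a quick check along these lines:
#sketch: credit cards that pair with more than one loyalty number
cc_loyalty_data %>%
  filter(!is.na(last4ccnum), !is.na(loyaltynum)) %>%
  distinct(last4ccnum, loyaltynum) %>%
  count(last4ccnum, name = "n_loyalty_cards") %>%
  filter(n_loyalty_cards > 1)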
Combine car_data and gps_data based on CarID using left_join() of the dplyr package. Exclude FirstName and LastName from car_data since these fields are redundant with FullName.
#combine based on CarID
car_gps_data <- left_join(gps_data,
                          car_data %>% select(-c("FirstName", "LastName")),
                          by = "CarID")
glimpse(car_gps_data)
Rows: 685,169
Columns: 10
$ timestamp <dttm> 2014-01-06 06:28:01, 2014-01-06 06:28:01, 2014-01~
$ CarID <fct> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
$ date <dttm> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2~
$ day <ord> Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, Mon, ~
$ hour <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,~
$ Department <chr> "Executive", "Executive", "Executive", "Executive"~
$ Title <chr> "Environmental Safety Advisor", "Environmental Saf~
$ FullName <chr> "Willem Vasco-Pais", "Willem Vasco-Pais", "Willem ~
car_gps_sf <- left_join(gps_sf,
                        car_data %>% select(-c("FirstName", "LastName")),
                        by = "CarID")
car_gps_sf
Simple feature collection with 685169 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS: WGS 84
# A tibble: 685,169 x 9
timestamp CarID date day hour
<dttm> <fct> <dttm> <ord> <int>
1 2014-01-06 06:28:01 35 2014-01-06 00:00:00 Mon 6
2 2014-01-06 06:28:01 35 2014-01-06 00:00:00 Mon 6
3 2014-01-06 06:28:03 35 2014-01-06 00:00:00 Mon 6
4 2014-01-06 06:28:05 35 2014-01-06 00:00:00 Mon 6
5 2014-01-06 06:28:06 35 2014-01-06 00:00:00 Mon 6
6 2014-01-06 06:28:07 35 2014-01-06 00:00:00 Mon 6
7 2014-01-06 06:28:09 35 2014-01-06 00:00:00 Mon 6
8 2014-01-06 06:28:10 35 2014-01-06 00:00:00 Mon 6
9 2014-01-06 06:28:11 35 2014-01-06 00:00:00 Mon 6
10 2014-01-06 06:28:12 35 2014-01-06 00:00:00 Mon 6
# ... with 685,159 more rows, and 4 more variables:
# geometry <POINT [°]>, Department <chr>, Title <chr>,
# FullName <chr>
The joined geospatial data reveals that some CarIDs cannot be mapped to specific employees. Most probably these are the trucks used by the truck drivers, who have no specific car assignment.
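The unmatched vehicles can be listed with a quick check along these lines:
#sketch: CarIDs in the GPS data with no matching employee assignment
car_gps_data %>%
  filter(is.na(FullName)) %>%
  distinct(CarID)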
Click HERE to view the Visual Detective Assignment Part 2.
For attribution, please cite this work as
Dolit (2021, July 25). Visual Analytics & Applications: Visual Detective Assignment. Retrieved from https://adolit-vaa.netlify.app/posts/2021-07-18-assignment/
BibTeX citation
@misc{dolit2021visual,
  author = {Dolit, Archie},
  title = {Visual Analytics & Applications: Visual Detective Assignment},
  url = {https://adolit-vaa.netlify.app/posts/2021-07-18-assignment/},
  year = {2021}
}