Visual Detective Assignment Part 2

Visual Detective R Assignment

This assignment attempts to solve the 2021 IEEE Visual Analytics Science and Technology (VAST) Challenge: Mini-Challenge 2 by applying different visual analytics concepts, methods, and techniques with relevant R data visualisation and data analysis packages.

Archie Dolit https://www.linkedin.com/in/adolit/ (School of Computing and Information Systems, Singapore Management University)
07-25-2021

4. Proposed Solutions

Using just the credit and loyalty card data, identify the most popular locations, and when they are popular. What anomalies do you see? What corrections would you recommend to correct these anomalies?

Generate an interactive bar graph in descending order using ggplot and plotly to determine the most popular locations.

popular_combine <- cc_loyalty_data %>%
  group_by(location) %>%
  summarize(total_count=n()) %>%
  ggplot(aes(x=reorder(location, total_count),
             y=total_count,
             text = paste("Location :", location,"\n",
                          "Number of transactions:", total_count))) +
  geom_bar(stat="identity", fill = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions") +
  labs(x = "Locations", y = "Transaction Count") + 
  coord_flip() +
  theme_minimal()

ggplotly(popular_combine, tooltip = "text")

Based on the combined combined credit card and loyalty data, the most popular location is Katerina’s Cafe with a total of 256 transactions, followed by Hippokampos with 213 transactions and Guy’s Gyro with 187 transactions.

Generate an interactive heatmap using ggplot and plotly to determine the date and time when employees visit the locations.

day_location_count <- cc_loyalty_data %>%
  count(location, day) %>%
  rename(count = n)

popular_day_location <- ggplot(data = day_location_count,
                               aes(x=day, y=reorder(location, desc(location)),
                                   fill = count,
                                   text = paste("Location :", location,"\n",
                                                "Day of week:", day,"\n",
                                                "Number of transactions :", count))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions by Day") +
  labs(x = "Day of the Week",y = "Locations") + 
  theme_minimal()

ggplotly(popular_day_location, tooltip = "text")

Based on the combined combined credit card and loyalty data, Brew’ve Been Served is popular on weekdays, Monday to Friday, with no transactions on weekend. Probably this location is only open weekday.

Guy’s Gyro, Hippokampos, and Katerina’s Cafe are very popular throughout the week, Sunday to Monday. Katerina’s Cafe is the most popular location on Saturday with a total of 42 transactions.

Some of the interesting transactions are U-Pump with 2 transactions only on Monday and Desafio Golf Course with only 9 transactions only on Sunday.

hour_location_count <- cc_loyalty_data %>%
  count(location, hour) %>%
  rename(count = n)
  
popular_hour_location <- ggplot(data = hour_location_count,
                               aes(x=hour, y=reorder(location, desc(location)),
                                   fill = count,
                                   text = paste("Location :", location,"\n",
                                                "Hour of the Day:", hour,"\n",
                                                "Number of transactions :", count))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Combined Credit Card & Loyalty Transactions by Hour of Day") +
  labs(x = "Hour of the Day",y = "Locations") + 
  theme_minimal()

ggplotly(popular_hour_location, tooltip = "text")

Based on the time of transaction, Brew’ve Been Served and Hallowed Grounds are popular in the morning around 7AM and 8AM. Most probably the employees visit the place before heading to the office.

Abila Zacharo, Bean There Done That, Brewed Awakenings, Gelatogalore, Guy’s Gyro, Hippokampos, Jack’s Magical Beans, Kalami Kafenion, Katerina’s Kafe, and Ouzera Elian are popular during lunch break around 12NN to 1PM.

Guy’s Gyro, Hippokampos, and Katerina’s Kafe are popular during dinner around 7PM and 8PM. Katerina’s Kafe has the highest transactions at 85 purchases at around 8PM.

Generate an interactive boxplot using plotly to determine the outliers and provide clues on some anomalies.

outlier <- plot_ly(data = cc_loyalty_data,
                   x = ~price,
                   color = I("royalblue4"),
                   alpha = 0.5,
                   boxpoints = "suspectedoutliers") %>%
  add_boxplot(y = ~reorder(location, desc(location))) %>%
  layout(title = "Combined Credit Card & Loyalty Transactions Outliers",
         yaxis = list(title = "Locations"),
         xaxis = list(title = "Price"))

outlier

Based on the price of transaction, it seems that there is unusual expensive purchase at Frydos Autosupply n More amount to 10,000. This is highly suspicious since the mean price for this location is only 161.96 with third quartile value of approximately 250.

Generate an interactive linegraph using plot_anomaly_diagnostics() of plotly to diagnose anomalous points in the cc_data purchase prices. Note that only locations with sufficient number of observations were selected for the anomaly diagnostics.

cc_data %>%
  filter(location %in% c("Abila Airport",
                         "Albert's Fine Clothing",
                         "Carlyle Chemical Inc.",
                         "Chostus Hotel",
                         "Frydos Autosupply n' More",
                         "Gelatogalore",
                         "Nationwide Refinery",
                         "Stewart and Sons Fabrication")) %>%
  group_by(location) %>%
  plot_anomaly_diagnostics(timestamp, price, 
                           .facet_ncol = 2,
                           .y_lab = "Price")

Based on the anomaly diagnostics, there are unusual purchases in Gelatogalore, Frydos Autosupply n’ More, Albert’s Fine Clothing, and Chostus Hotel. Again, the most expensive purchase is from Frydos Autosupply n More amounting to 10,000 on 2014-01-13 19:20:00.

The anomalies will not be removed or corrected. It will be kept in the data since it may lead to more clues in solving the challenge.

Q2: Anomalies in Vehicle, Credit Card and Loyalty Card Data

Add the vehicle data to your analysis of the credit and loyalty card data. How does your assessment of the anomalies in question 1 change based on this new data? What discrepancies between vehicle, credit, and loyalty card data do you find?

Generate an interactive heatmap using ggplot and plotly based on the amount of transactions with missing last4ccnum.

missing_last4ccnum <- cc_loyalty_data %>%
   filter(is.na(last4ccnum)) 

na_last4ccnum  <- ggplot(data = missing_last4ccnum,
                         aes(x=date, y=reorder(location, desc(location)),
                                   fill = price,
                                   text = paste("Location :", location,"\n",
                                                "Date:", date,"\n",
                                                "Total Amount of Transaction:", price))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Transactions with Missing Credit Card Data by Date") +
  labs(x = "Date of Transaction", y = "Locations") +
  theme_minimal()

ggplotly(na_last4ccnum, tooltip = "text")

Based on the total amount of transactions with missing credit card, National Refinery has a transaction on 2014-01-08 with a price of 4367.63. Stewart and Sons Fabrication has a transaction on 2014-01-13 with a price of 4071.95 and another one on 2014-01-15 with a price of 4485.38.

The discrepancies may be due to employees who bought the items with cash instead of credit card but still used the loyalty card to redeem points or rewards.

Generate another interactive heatmap using ggplot and plotly based on the amount of transactions with missing loyaltynum.

missing_loyaltynum <- cc_loyalty_data %>%
   filter(is.na(loyaltynum))

na_loyaltynum  <- ggplot(data = missing_loyaltynum,
                         aes(x=date, y=reorder(location, desc(location)),
                                   fill = price,
                                   text = paste("Location :", location,"\n",
                                                "Date:", timestamp,"\n",
                                                "Total Amount of Transaction:", price))) +
  geom_tile()+
  scale_fill_gradient(low = "lightsteelblue1", high = "royalblue4") +
  ggtitle("Transactions with Missing Loyalty Data by Date") +
  labs(x = "Date of Transaction", y = "Locations") +
  theme_minimal()

ggplotly(na_loyaltynum, tooltip = "text")

Based on the total amount of transactions with missing loyalty card, Frydos Autosupply n More has a transaction on 2014-01-13 19:20:00 with a price of 10,000.

The discrepancy is more suspicious since the person who bought the items did not use his loyalty card which may imply possible misuse of the credit card when making the transaction.

Add the gps and car data by creating a movement path from GPS points using the CarIDs as unique identifier. Filter the data around the time of transaction from 2014-01-13 19:00 to 21:00.

gps_path_0113 <- car_gps_sf %>%
  filter(timestamp >= "2014-01-13 19:00" & timestamp <= "2014-01-13 21:00") %>%
  group_by(CarID, date) %>%
  summarize(m = mean(timestamp), 
            do_union=FALSE) %>%
  st_cast("LINESTRING") 

Plot the gps path on the background tourist map and identity which CarIDs are within the vicinity of Frydos Autosupply n More.

gps_path_selected_0113 <- gps_path_0113 %>%
  filter(CarID %in% c("13" , "15", "16", "34")) 

tmap_mode("plot")
tm_shape(bgmap) +
  tm_rgb(bgmap, r = 1,g = 2,b = 3,
       alpha = NA,
       saturation = 1,
       interpolate = TRUE,
       max.value = 255) +
  tm_shape(gps_path_selected_0113) +
  tm_lines() +
  tm_facets(by = "CarID",  ncol = 4)

From the gps paths, CarID “13” , “15”, “16”, “34” are within the vicinity of Frydos Autosupply n More during suspicions transaction amounting to the price of 10,000.

Create an interactive data table based on the joint gps and car and filter the date to 2014-01-13.

car_gps_0113 <- car_gps_data %>%
  filter(timestamp >= "2014-01-13 19:00" & timestamp <= "2014-01-13 21:00") %>%
  filter(CarID %in% c("13" , "15", "16", "34")) %>%
  group_by(CarID, Deparment, Title, FullName) %>%
  summarise()

DT::datatable(car_gps_0113)

From the interactive table, all CarIDs identified are from the Security Department. Possibly, Isia Vann and Edvard Vann are relatives because of the same Last Name and working together as Perimeter Controller.

Click HERE to view the Visual Detective Assignment Part 3.

Citation

For attribution, please cite this work as

Dolit (2021, July 25). Visual Analytics & Applications: Visual Detective Assignment Part 2. Retrieved from https://adolit-vaa.netlify.app/posts/2021-07-26-assignment-2/

BibTeX citation

@misc{dolit2021visual,
  author = {Dolit, Archie},
  title = {Visual Analytics & Applications: Visual Detective Assignment Part 2},
  url = {https://adolit-vaa.netlify.app/posts/2021-07-26-assignment-2/},
  year = {2021}
}