Grocery Shopping & Personality
If you’re a data fanatic like me, you probably browse Kaggle for data sets to play with in your free time. Since starting grad school I’ve already learned a great deal about regression, psychometric reliability/validity, and getting acquainted with the literature. This post, however, uses an analysis I learned in undergrad called the “Market Basket Analysis” (MB).
I first encountered MB analysis in my Intro to Data and Text Mining class, and ever since I’ve been looking for ways to apply it to things I’m interested in.
There’s an old saying:
“When you have a hammer, everything looks like a nail.” - Unknown
This very much describes my experience with MB analysis; it struck me as a really practical technique.
Let’s first take a look under the hood of MB analysis.
Market Basket Analysis (MB) or Association Rules
One question you might be asking is: “Wesley, why do they call it the ‘Market Basket’ analysis?” Great question!
If we think of a market basket (or shopping cart) we have different items that we purchase throughout our visit to the store.
The main purpose of an MB analysis is to look at which items are bought together; in other words, to find the associations between items within a transaction.
We use association rules to mine “if, then” statements from our data.
Let’s look at an example:
Say I have a data set in which cheese and wine are often bought together. Our algorithm will generate a rule like:
{Wine} => {Cheese}
There are 3 main metrics used to evaluate a rule:
- Support
- Confidence
- Lift
Support
Support is how often an item (or set of items) appears in the data set.
If {Wine} has a support of 0.2, it appears in 20% of the transactions (or shopping carts).
Formula:
$$\text{Support}(X) = \frac{\text{count}(X)}{N}$$
where $N$ is the total number of transactions.
Confidence
Confidence is how often the rule holds: out of the transactions that contain the left-hand side, the proportion that also contain the right-hand side.
If {Wine} => {Cheese} has a confidence of 0.5, then 50% of the carts that contain wine also contain cheese.
Formula:
$$\text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$
Lift
Lift is the ratio of the rule’s observed support to the support we would expect if the items were independent.
If {Wine} => {Cheese} has a lift of 2.5, wine and cheese are bought together 2.5 times more often than we would expect if the two purchases were unrelated.
Formula:
$$\text{Lift}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)}$$
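To make these three metrics concrete, here is a minimal sketch on a handful of made-up baskets (the items and thresholds are hypothetical, just for illustration):
library(arules)

# Five toy baskets (hypothetical data)
toy_baskets <- transactions(list(
  c("wine", "cheese"),
  c("wine", "cheese", "bread"),
  c("wine"),
  c("bread"),
  c("cheese", "bread")
))

# {wine} is in 3/5 baskets         -> support = 0.6
# {wine, cheese} is in 2/5 baskets -> rule support = 0.4
# confidence = 0.4 / 0.6 ≈ 0.67
# lift = 0.67 / support({cheese}) = 0.67 / 0.6 ≈ 1.11
toy_rules <- apriori(toy_baskets,
                     parameter = list(support = 0.1,
                                      confidence = 0.5,
                                      minlen = 2))
inspect(toy_rules)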
What does this have to do with personality?
Good question!
Kaggle has a Big Five Personality Test data set with ~1 million responses to 50 questions (10 per trait).
Let’s take a step back and map MB analysis onto the Big Five personality test.
Imagine we are in a shopping plaza with 5 stores, one for each personality trait (an Extraversion store, an Openness store, …).
I go to the Extraversion store, walk through its 10 aisles, and pick out 1 unique item per aisle.
We can use MB analysis to look at my shopping cart for each store I go to.
Let’s start coding
library(arules) # Association Rules
library(dplyr)  # Data wrangling
Load in data:
(I sampled the original data set for the purpose of this post; n = 1000.)
sampled_data <- readr::read_csv(
  here::here(
    "content",
    "post",
    "2021-09-08-arules_mining",
    "sampled_data.csv"
  )
)
Taking a look at our data:
knitr::kable(head(sampled_data))
id | EXT1 | EXT2 | EXT3 | EXT4 | EXT5 | EXT6 | EXT7 | EXT8 | EXT9 | EXT10 | EST1 | EST2 | EST3 | EST4 | EST5 | EST6 | EST7 | EST8 | EST9 | EST10 | AGR1 | AGR2 | AGR3 | AGR4 | AGR5 | AGR6 | AGR7 | AGR8 | AGR9 | AGR10 | CSN1 | CSN2 | CSN3 | CSN4 | CSN5 | CSN6 | CSN7 | CSN8 | CSN9 | CSN10 | OPN1 | OPN2 | OPN3 | OPN4 | OPN5 | OPN6 | OPN7 | OPN8 | OPN9 | OPN10 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
664928 | 3 | 2 | 4 | 2 | 3 | 2 | 3 | 2 | 3 | 3 | 2 | 4 | 3 | 2 | 4 | 3 | 2 | 2 | 2 | 2 | 2 | 4 | 1 | 3 | 3 | 4 | 2 | 4 | 4 | 4 | 3 | 2 | 3 | 2 | 3 | 2 | 3 | 3 | 4 | 3 | 3 | 2 | 3 | 3 | 4 | 1 | 4 | 3 | 3 | 4 |
814515 | 3 | 1 | 5 | 2 | 4 | 1 | 4 | 2 | 5 | 2 | 4 | 4 | 2 | 3 | 2 | 3 | 4 | 4 | 4 | 3 | 2 | 4 | 1 | 2 | 3 | 5 | 2 | 4 | 3 | 3 | 3 | 5 | 2 | 2 | 2 | 3 | 4 | 3 | 3 | 3 | 3 | 1 | 3 | 2 | 3 | 1 | 4 | 1 | 2 | 3 |
911060 | 2 | 3 | 3 | 4 | 3 | 3 | 3 | 4 | 2 | 4 | 3 | 3 | 4 | 3 | 3 | 4 | 3 | 3 | 4 | 3 | 1 | 4 | 2 | 5 | 2 | 4 | 2 | 4 | 4 | 3 | 4 | 1 | 4 | 2 | 4 | 2 | 4 | 3 | 4 | 4 | 3 | 4 | 2 | 1 | 3 | 3 | 3 | 3 | 3 | 2 |
358416 | 3 | 2 | 2 | 4 | 2 | 2 | 2 | 3 | 2 | 3 | 4 | 3 | 4 | 1 | 4 | 4 | 4 | 4 | 5 | 4 | 2 | 3 | 4 | 4 | 3 | 4 | 2 | 4 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | 4 | 4 | 3 | 3 | 3 | 4 | 2 | 5 | 1 | 5 | 2 | 4 | 3 | 4 | 4 |
689863 | 3 | 3 | 3 | 3 | 4 | 2 | 4 | 2 | 4 | 3 | 1 | 4 | 4 | 3 | 1 | 4 | 3 | 2 | 3 | 3 | 2 | 4 | 1 | 5 | 1 | 3 | 3 | 4 | 5 | 4 | 5 | 2 | 4 | 1 | 1 | 2 | 4 | 3 | 5 | 5 | 3 | 4 | 3 | 2 | 3 | 4 | 5 | 3 | 4 | 2 |
648593 | 2 | 3 | 4 | 3 | 4 | 2 | 3 | 3 | 3 | 4 | 4 | 2 | 5 | 2 | 4 | 4 | 4 | 4 | 4 | 4 | 1 | 4 | 2 | 5 | 1 | 5 | 2 | 4 | 4 | 4 | 4 | 3 | 4 | 2 | 2 | 2 | 4 | 2 | 4 | 4 | 4 | 2 | 2 | 2 | 3 | 4 | 5 | 4 | 4 | 2 |
Reverse Score
Since we are analyzing the item responses themselves, we first need to reverse-code the negatively keyed items.
neg_questions <- c(
"EXT2", "EXT4", "EXT6", "EXT8", "EXT10", # 5
"EST2", "EST4", # 2
"AGR1", "AGR3", "AGR5", "AGR7", # 4
"CSN2", "CSN4", "CSN6", "CSN8", # 4
"OPN2", "OPN4", "OPN6" # 3
)
sampled_datar <-
  sampled_data |>
  mutate(across(where(is.numeric), ~ as.character(.x))) |>
  mutate(across(all_of(neg_questions), ~ dplyr::recode(.x,
    "5" = "1",
    "4" = "2",
    "3" = "3",
    "2" = "4",
    "1" = "5"
  )))
Careless Responses
When working with psychometric data it is usually a good idea to screen out careless responses. To do that we can implement 2 simple methods: Long-String analysis and Intra-individual Response Variability (IRV).
# Long-String: the longest run of identical answers in a row of responses
find_longeststring <- function(row_of_data){
  dat_rle <- rle(row_of_data)
  longest_string <- max(dat_rle$lengths)
  return(longest_string)
}

# Average length of the runs of identical answers in a row
find_longstring_avg <- function(row_of_data){
  dat_rle <- rle(row_of_data)
  longest_string_avg <- mean(dat_rle$lengths)
  return(longest_string_avg)
}
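If rle() is unfamiliar, here is a quick illustration of what both helpers measure, using a toy response vector:
# rle() collapses a vector into runs of identical values
rle(c("3", "3", "3", "2", "2", "4"))
# lengths are 3, 2, 1 -> longest run = 3 (Long-String), mean run length = 2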
Let’s remove our careless responses
# Calculate Long-String and IRV
careless_data <-
  sampled_datar |>
  rowwise() |>
  mutate(
    longest_string = find_longeststring(c_across(EXT1:OPN10)),
    avg_longstring = find_longstring_avg(c_across(EXT1:OPN10)),
    # Responses were recoded to character above, so convert back before taking the SD
    irv = sd(as.numeric(c_across(EXT1:OPN10)))
  ) |>
  ungroup() |>
  mutate(
    scaled_ls = scale(longest_string),
    scaled_avg_ls = scale(avg_longstring),
    scaled_irv = scale(irv)
  )
## Cut-off value of 2 standard deviations
long_string_excluded <-
  careless_data |>
  filter(scaled_ls <= 2) |>
  filter(scaled_avg_ls <= 2) |>
  filter(scaled_irv >= -2)
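A quick count shows how many respondents survived the screen:
nrow(long_string_excluded)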
As we can see, we are left with 962 rows out of the original 1000 (3.8% removed).
A little bit of wrangling
Now, one caveat to the algorithm we are going to use today (apriori() from {arules}) is that each item needs to be unique within a cart. Not only that, but the function requires a specific data format.
To do this I am going to factorize each item as qX_aY, where:
- q = question number (1-10)
- a = answer (1-5)
factorize_item <- function(aisle_number, aisle_item){
item_factor <- paste0("q", aisle_number,"_" ,"a", aisle_item)
return(item_factor)
}
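A quick check of what the helper produces, e.g. aisle 3, item 4:
factorize_item(3, 4)
# [1] "q3_a4"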
The Extraversion Store
ext_store <-
  long_string_excluded |>
  select(EXT1:EXT10)

# Since I was working with sample data
# I noticed sometimes I would get "NULL"
# values, so if you run into that in your
# own analysis:
ext_store <-
  ext_store |>
  filter(
    if_all(
      .cols = everything(),
      .fns = ~ !stringr::str_detect(.x, "NULL")
    )
  )
Since factorize_item() is vectorized over a whole column (thanks to paste0()), a short loop applies it to our Extraversion store:
for (i in 1:10) {
  # Column i holds the answers to question i
  ext_store[[i]] <- factorize_item(i, ext_store[[i]])
}
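As an aside, the same transformation can be written without an explicit loop. This is just a sketch, run instead of (not after) the loop above, and it assumes the column names end in the question number, as EXT1 through EXT10 do:
# readr::parse_number() extracts the 1-10 from names like "EXT10";
# cur_column() gives the name of the column being mutated
ext_store_alt <-
  ext_store |>
  mutate(across(everything(),
                ~ factorize_item(readr::parse_number(cur_column()), .x)))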
Let’s take a look:
knitr::kable(head(ext_store))
EXT1 | EXT2 | EXT3 | EXT4 | EXT5 | EXT6 | EXT7 | EXT8 | EXT9 | EXT10 |
---|---|---|---|---|---|---|---|---|---|
q1_a3 | q2_a4 | q3_a4 | q4_a4 | q5_a3 | q6_a4 | q7_a3 | q8_a4 | q9_a3 | q10_a3 |
q1_a3 | q2_a5 | q3_a5 | q4_a4 | q5_a4 | q6_a5 | q7_a4 | q8_a4 | q9_a5 | q10_a4 |
q1_a2 | q2_a3 | q3_a3 | q4_a2 | q5_a3 | q6_a3 | q7_a3 | q8_a2 | q9_a2 | q10_a2 |
q1_a3 | q2_a4 | q3_a2 | q4_a2 | q5_a2 | q6_a4 | q7_a2 | q8_a3 | q9_a2 | q10_a3 |
q1_a3 | q2_a3 | q3_a3 | q4_a3 | q5_a4 | q6_a4 | q7_a4 | q8_a4 | q9_a4 | q10_a3 |
q1_a2 | q2_a3 | q3_a4 | q4_a3 | q5_a4 | q6_a4 | q7_a3 | q8_a3 | q9_a3 | q10_a2 |
Now that I have that, I just need to combine every row into its own basket.
ext_store_b <-
  ext_store |>
  mutate(basket = paste(EXT1, EXT2, EXT3, EXT4, EXT5,
                        EXT6, EXT7, EXT8, EXT9, EXT10,
                        sep = ","))

baskets <-
  ext_store_b |>
  select(basket)
The apriori() function requires a specific data structure (a transactions object). The code below isn’t pretty, but it does the trick!
transaction_list <- list()

for (i in seq_len(nrow(baskets))) {
  # Each row of the factorized store is one basket of 10 items
  transaction_list[[i]] <- as.character(ext_store[i, ])
}

names(transaction_list) <- paste0("Tr", seq_len(nrow(ext_store)))
transaction_obj <- transactions(transaction_list)
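For what it’s worth, the same transactions object can be built without an explicit loop; this sketch assumes each row of ext_store is one complete basket:
# asplit() splits the character matrix into a list of row vectors
alt_list <- lapply(asplit(as.matrix(ext_store), 1), as.character)
names(alt_list) <- paste0("Tr", seq_len(nrow(ext_store)))
alt_obj <- transactions(alt_list)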
Now that we have our data in the right format, we can create our rules.
m1 <- apriori(transaction_obj,
              parameter = list(support = 0.007,
                               confidence = 0.95,
                               minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.95 0.1 1 none FALSE TRUE 5 0.007 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 6
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 962 transaction(s)] done [0.00s].
## sorting and recoding items ... [50 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.01s].
## writing ... [1470 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_df <-
  DATAFRAME(m1) |>
  mutate(LHS = as.character(LHS),
         RHS = as.character(RHS))
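If you prefer to stay inside {arules}, you can also sort and inspect the mined rules directly, for example the top 3 by lift:
inspect(head(sort(m1, by = "lift"), 3))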
Let’s look at the results, dropping any rules whose right-hand side is an extreme answer (1 or 5):
# Named rules_filtered to avoid masking base::t()
rules_filtered <-
  rules_df |>
  filter(!stringr::str_detect(RHS, "a1")) |>
  filter(!stringr::str_detect(RHS, "a5"))

knitr::kable(head(rules_filtered))
LHS | RHS | support | confidence | coverage | lift | count |
---|---|---|---|---|---|---|
{q10_a4,q4_a3,q7_a4} | {q5_a4} | 0.0083160 | 1 | 0.0083160 | 3.498182 | 8 |
{q10_a4,q3_a5,q7_a2} | {q1_a3} | 0.0083160 | 1 | 0.0083160 | 3.460432 | 8 |
{q1_a4,q2_a3,q5_a3} | {q4_a3} | 0.0072765 | 1 | 0.0072765 | 3.894737 | 7 |
{q10_a3,q2_a2,q6_a4} | {q3_a4} | 0.0072765 | 1 | 0.0072765 | 3.657795 | 7 |
{q2_a2,q7_a1,q9_a4} | {q4_a2} | 0.0072765 | 1 | 0.0072765 | 3.772549 | 7 |
{q3_a2,q5_a4,q8_a2} | {q4_a2} | 0.0072765 | 1 | 0.0072765 | 3.772549 | 7 |