Zombies Apocalypse¶

  • Data Source: Kaggle
  • Tasks: compare humans and zombies to identify differences in supplies
  • Language: R

Context¶

News reports suggest that the impossible has become possible…zombies have appeared on the streets of the US! What should we do? The Centers for Disease Control and Prevention (CDC) zombie preparedness website recommends storing water, food, medication, tools, sanitation items, clothing, essential documents, and first aid supplies. Thankfully, we are CDC analysts and are prepared, but it may be too late for others!

Content¶

Our team decides to identify supplies that protect people and coordinate supply distribution. A few brave data collectors volunteer to check on 200 randomly selected adults who were alive before the zombies. We have recent data for the 200 on age and sex, how many are in their household, and their rural, suburban, or urban location. Our heroic volunteers visit each home and record zombie status and preparedness. Now it's our job to figure out which supplies are associated with safety!

File¶

Because every moment counts when dealing with life and (un)death, we want to get this right! The first task is to compare humans and zombies to identify differences in supplies. We review the data and find the following:

  • zombieid: unique identifier
  • zombie: human or zombie
  • age: age in years
  • sex: male or female
  • rurality: rural, suburban, or urban
  • household: number of people living in household
  • water: gallons of clean water available
  • food: food or no food
  • medication: medication or no medication
  • tools: tools or no tools
  • firstaid: first aid or no first aid
  • sanitation: sanitation or no sanitation
  • clothing: clothing or no clothing
  • documents: documents or no documents

Acknowledgements¶

DataCamp

In [3]:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
In [4]:
theme_set(theme_light())
In [6]:
set.seed(421)
"Sabine Jana Sascha Markus" %>% str_split(" ") %>% unlist() %>% sample() %>% str_c(collapse = " → ")
Out[6]:
'Markus → Sabine → Jana → Sascha'

1. Data loading¶

In [8]:
zombie_data <- read_csv("zombies.csv")
Rows: 200 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): zombie, sex, rurality, food, medication, tools, firstaid, sanitati...
dbl  (4): zombieid, age, household, water
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
In [11]:
zombie_data |> tail()
Out[11]:
A tibble: 6 × 14
zombieid  zombie  age    sex     rurality  household  water  food     medication     tools     firstaid               sanitation     clothing  documents
<dbl>     <chr>   <dbl>  <chr>   <chr>     <dbl>      <dbl>  <chr>    <chr>          <chr>     <chr>                  <chr>          <chr>     <chr>
195       Zombie  67     Female  Suburban  2          0      No food  No medication  No tools  No first aid supplies  No sanitation  NA        NA
196       Zombie  68     Male    Suburban  1          0      Food     No medication  No tools  No first aid supplies  Sanitation     Clothing  Documents
197       Zombie  71     Male    Suburban  1          8      No food  No medication  tools     First aid supplies     No sanitation  Clothing  NA
198       Zombie  76     Female  Urban     1          0      No food  No medication  tools     First aid supplies     Sanitation     Clothing  Documents
199       Zombie  82     Male    Urban     1          0      No food  No medication  No tools  No first aid supplies  No sanitation  NA        NA
200       Zombie  85     Male    Urban     1          0      No food  Medication     No tools  No first aid supplies  Sanitation     Clothing  NA
In [9]:
zombie_data |> summary()
Out[9]:
    zombieid         zombie               age            sex           
 Min.   :  1.00   Length:200         Min.   :18.00   Length:200        
 1st Qu.: 50.75   Class :character   1st Qu.:29.00   Class :character  
 Median :100.50   Mode  :character   Median :42.00   Mode  :character  
 Mean   :100.50                      Mean   :44.41                     
 3rd Qu.:150.25                      3rd Qu.:58.00                     
 Max.   :200.00                      Max.   :85.00                     
   rurality           household        water           food          
 Length:200         Min.   :1.00   Min.   : 0.00   Length:200        
 Class :character   1st Qu.:2.00   1st Qu.: 0.00   Class :character  
 Mode  :character   Median :2.50   Median : 8.00   Mode  :character  
                    Mean   :2.68   Mean   : 8.75                     
                    3rd Qu.:4.00   3rd Qu.: 8.00                     
                    Max.   :6.00   Max.   :40.00                     
  medication           tools             firstaid          sanitation       
 Length:200         Length:200         Length:200         Length:200        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
   clothing          documents        
 Length:200         Length:200        
 Class :character   Class :character  
 Mode  :character   Mode  :character  
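For character columns, summary() reports only length, class, and mode, so the NA entries visible under clothing and documents in the tail() output do not show up in it. A minimal sketch for counting missing values per column follows. The `toy` data frame is made up for illustration and stands in for `zombie_data`.

```r
# Toy data (made up for illustration; not the real zombies.csv)
toy <- data.frame(
  zombie    = c("Human", "Human", "Zombie", "Zombie"),
  clothing  = c("Clothing", NA, NA, "Clothing"),
  documents = c("Documents", "Documents", NA, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts
```

The same call on the real data would be `colSums(is.na(zombie_data))`.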
                                      
                                      
                                      

Split into zombie and non-zombie¶

  • What are the differences in supplies between the two groups?
  • Which supply is the best indicator of zombieness?

Approach: count the levels of each variable per group and compare them in bar plots.
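Before plotting, the per-group counts can also be tabulated directly. The sketch below uses base R on a small made-up data frame (`toy` is hypothetical and only mimics the zombie and sanitation columns). Row-wise proportions make the two groups comparable even when their sizes differ.

```r
# Toy data (made up for illustration; not the real zombies.csv)
toy <- data.frame(
  zombie     = c("Human", "Human", "Human", "Zombie", "Zombie", "Zombie"),
  sanitation = c("Sanitation", "Sanitation", "No sanitation",
                 "No sanitation", "No sanitation", "Sanitation"),
  stringsAsFactors = FALSE
)

# Counts of each sanitation level within each group
counts <- table(toy$zombie, toy$sanitation)

# Row-wise proportions: share of each level among humans and among zombies
shares <- prop.table(counts, margin = 1)

counts
shares
```

A supply whose shares differ strongly between the two rows is a candidate indicator. Whether the difference is more than chance would still need a formal test.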

In [68]:
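# Bar plot of one column (given by name), with bars split by zombie status
zombie_bar <- function(feature) {
    feature <- sym(feature)  # turn the column name into a symbol for aes()
    zombie_data %>%
        ggplot(aes(x = !!feature, fill = zombie)) +
        geom_bar()
}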
In [77]:
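# Reverse view: one bar per zombie status, filled by the levels of the given column
zombie_bar2 <- function(feature) {
    feature <- sym(feature)  # turn the column name into a symbol for aes()
    zombie_data %>%
        ggplot(aes(x = zombie, fill = !!feature)) +
        geom_bar()
}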
In [67]:
#zombie_col = function(feature){
    zombie_data %>% group_by(zombie) %>% count(household) %>%
    ggplot(aes(x = household, y = n/2,fill = zombie)) + geom_col()
#}
Out[67]:
[Figure: column chart of household size, filled by zombie status]
In [70]:
#map(colnames(zombie_data),zombie_col)
In [69]:
map(colnames(zombie_data),zombie_bar)
Out[69]:
[Figures: 14 bar plots, one per column of zombie_data, with bars filled by zombie status]
In [78]:
map(colnames(zombie_data),zombie_bar2)
Warning message:
“The following aesthetics were dropped during statistical transformation: fill.
ℹ This can happen when ggplot fails to infer the correct grouping structure in
  the data.
ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
  variable into a factor?”
[The warning above is printed four times in this output.]
Out[78]:
[Figures: 14 bar plots, one per column of zombie_data, showing counts per zombie status filled by the levels of that column]
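The "aesthetics were dropped" warning points at numeric columns being mapped to fill. One possible way around it is to loop only over the categorical columns and leave out the grouping column itself. The sketch below assumes this is the intent. The `toy` data frame and the name `supply_cols` are made up for illustration, and `.data[[col]]` is the tidy evaluation pronoun that ggplot2 provides for looking up a column by its name.

```r
library(ggplot2)

# Toy data (made up for illustration; not the real zombies.csv)
toy <- data.frame(
  zombieid   = 1:4,
  zombie     = c("Human", "Human", "Zombie", "Zombie"),
  food       = c("Food", "Food", "No food", "Food"),
  sanitation = c("Sanitation", "No sanitation", "No sanitation", "No sanitation"),
  stringsAsFactors = FALSE
)

# Keep only character columns, excluding the grouping column itself
supply_cols <- setdiff(names(toy)[sapply(toy, is.character)], "zombie")

# One stacked bar plot per supply column; .data[[col]] looks the column up by name
plots <- lapply(supply_cols, function(col) {
  ggplot(toy, aes(x = zombie, fill = .data[[col]])) +
    geom_bar() +
    labs(fill = col)
})
names(plots) <- supply_cols
```

Printing `plots$sanitation` would then show a single plot. Numeric columns such as age or water would need a different display, for example a histogram or a box plot per group.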
In [76]:
lm(as.factor(zombie) ~ as.factor(sanitation), data=zombie_data)
Warning message in model.response(mf, "numeric"):
“using type = "numeric" with a factor response will be ignored”
Warning message in Ops.factor(y, z$residuals):
“‘-’ not meaningful for factors”
Out[76]:
Call:
lm(formula = as.factor(zombie) ~ as.factor(sanitation), data = zombie_data)

Coefficients:
                    (Intercept)  as.factor(sanitation)Sanitation  
                         1.5294                          -0.2743  
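The warnings above indicate that lm() is not designed for a factor response. For a two-level outcome such as zombie status, logistic regression is a common alternative. The following is a minimal sketch on made-up data, not a re-run of the analysis: the `toy` data frame and the `is_zombie` column are hypothetical.

```r
# Toy data (made up for illustration; not the real zombies.csv)
toy <- data.frame(
  zombie     = c("Human", "Human", "Human", "Human",
                 "Zombie", "Zombie", "Zombie", "Zombie"),
  sanitation = c("Sanitation", "Sanitation", "Sanitation", "No sanitation",
                 "No sanitation", "No sanitation", "No sanitation", "Sanitation"),
  stringsAsFactors = FALSE
)

# Encode the outcome as 0/1 so the model estimates the probability of being a zombie
toy$is_zombie <- as.integer(toy$zombie == "Zombie")

# Logistic regression: log-odds of zombie status as a function of sanitation
fit <- glm(is_zombie ~ sanitation, data = toy, family = binomial)

# Coefficients are on the log-odds scale; exp() turns them into odds ratios
odds_ratios <- exp(coef(fit))
odds_ratios
```

An odds ratio below 1 for `sanitationSanitation` means that, in this toy data, having sanitation supplies goes along with lower odds of being a zombie compared with the reference level "No sanitation".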
In [35]:
zombie_bar("sanitation")
Out[35]:
[Figure: bar plot of sanitation, filled by zombie status]
In [33]:
zombie_data %>%
    ggplot(aes_string(x = "sanitation", fill  = "zombie")) + geom_bar()
Warning message:
“`aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.”
Out[33]:
[Figure: bar plot of sanitation, filled by zombie status]
In [27]:
zombie_data %>%
    ggplot(aes(x = sanitation, fill  = zombie)) + geom_bar()
Out[27]:
[Figure: bar plot of sanitation, filled by zombie status]
In [23]:
library(GGally)
Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

In [24]:
ggpairs(zombie_data)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message:
“Removed 74 rows containing non-finite outside the scale range
(`stat_g_gally_count()`).”
Out[24]:
[Figure: ggpairs plot matrix of all columns of zombie_data]