Zombies Apocalypse
- Data Source: Kaggle
- Tasks: compare humans and zombies to identify differences in supplies
- Language: R
Context
News reports suggest that the impossible has become possible…zombies have appeared on the streets of the US! What should we do? The Centers for Disease Control and Prevention (CDC) zombie preparedness website recommends storing water, food, medication, tools, sanitation items, clothing, essential documents, and first aid supplies. Thankfully, we are CDC analysts and are prepared, but it may be too late for others!
Content
Our team decides to identify supplies that protect people and coordinate supply distribution. A few brave data collectors volunteer to check on 200 randomly selected adults who were alive before the zombies. We have recent data for the 200 on age and sex, how many are in their household, and their rural, suburban, or urban location. Our heroic volunteers visit each home and record zombie status and preparedness. Now it's our job to figure out which supplies are associated with safety!
File
Because every moment counts when dealing with life and (un)death, we want to get this right! The first task is to compare humans and zombies to identify differences in supplies. We review the data and find the following:
- zombieid: unique identifier
- zombie: human or zombie
- age: age in years
- sex: male or female
- rurality: rural, suburban, or urban
- household: number of people living in household
- water: gallons of clean water available
- food: food or no food
- medication: medication or no medication
- tools: tools or no tools
- firstaid: first aid or no first aid
- sanitation: sanitation or no sanitation
- clothing: clothing or no clothing
- documents: documents or no documents
Acknowledgements
DataCamp
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
theme_set(theme_light())
set.seed(421)
"Sabine Jana Sascha Markus" %>% str_split(" ") %>% unlist %>% sample %>% str_c(collapse=" → ")
1. Data loading
zombie_data=read_csv("zombies.csv")
Rows: 200 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): zombie, sex, rurality, food, medication, tools, firstaid, sanitati...
dbl  (4): zombieid, age, household, water
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
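The readr message above suggests specifying the column types. A minimal sketch of what that could look like, assuming the human/zombie and yes/no columns are better handled as factors. The inline CSV is a made-up three-row stand-in for zombies.csv (not the real data), and the "Human" label is an assumption based on the column description.

```r
library(readr)

# Made-up stand-in for zombies.csv (illustrative rows only, not the real data).
# I() tells read_csv() to treat the string as literal data instead of a path.
csv_text <- I("zombieid,zombie,age,water,food
1,Human,18,8,Food
2,Zombie,67,0,No food
3,Human,42,16,Food
")

toy <- read_csv(
  csv_text,
  col_types = cols(
    zombieid = col_integer(),
    # ASSUMPTION: the two status labels are "Human" and "Zombie"
    zombie   = col_factor(levels = c("Human", "Zombie")),
    age      = col_double(),
    water    = col_double(),
    food     = col_factor()   # levels inferred from the data
  )
)

str(toy)
```

Passing `col_types` also silences the column specification message, because nothing has to be guessed.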
zombie_data |> tail()
zombieid | zombie | age | sex | rurality | household | water | food | medication | tools | firstaid | sanitation | clothing | documents |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
<dbl> | <chr> | <dbl> | <chr> | <chr> | <dbl> | <dbl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> |
195 | Zombie | 67 | Female | Suburban | 2 | 0 | No food | No medication | No tools | No first aid supplies | No sanitation | NA | NA |
196 | Zombie | 68 | Male | Suburban | 1 | 0 | Food | No medication | No tools | No first aid supplies | Sanitation | Clothing | Documents |
197 | Zombie | 71 | Male | Suburban | 1 | 8 | No food | No medication | tools | First aid supplies | No sanitation | Clothing | NA |
198 | Zombie | 76 | Female | Urban | 1 | 0 | No food | No medication | tools | First aid supplies | Sanitation | Clothing | Documents |
199 | Zombie | 82 | Male | Urban | 1 | 0 | No food | No medication | No tools | No first aid supplies | No sanitation | NA | NA |
200 | Zombie | 85 | Male | Urban | 1 | 0 | No food | Medication | No tools | No first aid supplies | Sanitation | Clothing | NA |
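The rows above show NA in clothing and documents, although the column description only mentions "clothing or no clothing" and "documents or no documents". A small sketch of how the missing values could be counted and, if appropriate, recoded. The toy rows are made up to mimic the output above, and reading NA as "item not available" is an assumption that should be checked against the data source first.

```r
# Made-up rows mimicking the tail() output above (not the real data)
toy <- data.frame(
  zombie    = c("Zombie", "Zombie", "Human"),
  clothing  = c(NA, "Clothing", "Clothing"),
  documents = c(NA, "Documents", NA),
  stringsAsFactors = FALSE
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# ASSUMPTION: NA means the household lacked the item; verify before recoding
toy$clothing[is.na(toy$clothing)]   <- "No clothing"
toy$documents[is.na(toy$documents)] <- "No documents"
```

On the real data the same count would be `colSums(is.na(zombie_data))`.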
zombie_data |> summary()
statistic | zombieid | age | household | water |
---|---|---|---|---|
Min. | 1.00 | 18.00 | 1.00 | 0.00 |
1st Qu. | 50.75 | 29.00 | 2.00 | 0.00 |
Median | 100.50 | 42.00 | 2.50 | 8.00 |
Mean | 100.50 | 44.41 | 2.68 | 8.75 |
3rd Qu. | 150.25 | 58.00 | 4.00 | 8.00 |
Max. | 200.00 | 85.00 | 6.00 | 40.00 |
The remaining ten columns (zombie, sex, rurality, food, medication, tools, firstaid, sanitation, clothing, documents) are reported as Length: 200, Class: character, Mode: character.
Split into zombies and non-zombies
- What are the differences in supplies between the two groups?
- Which supply is the best indicator of zombie status?
Plan: count the yes/no values per group and show them as bar plots.
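The counting step in the plan can be sketched with dplyr. This is a minimal example on made-up toy data (not the survey results); in the notebook the same pipeline would start from zombie_data. Adding the share within each group makes the two groups comparable even when they differ in size.

```r
library(dplyr)

# Made-up toy data (illustrative only, not the survey results)
toy <- data.frame(
  zombie = c("Human", "Human", "Human", "Zombie", "Zombie", "Zombie", "Zombie"),
  food   = c("Food", "Food", "No food", "No food", "No food", "Food", "No food"),
  stringsAsFactors = FALSE
)

food_by_status <- toy %>%
  count(zombie, food) %>%        # rows per status/supply combination
  group_by(zombie) %>%
  mutate(prop = n / sum(n)) %>%  # share within humans and within zombies
  ungroup()

print(food_by_status)
```

In a bar plot, `geom_bar(position = "fill")` shows the same within-group shares without a separate counting step.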
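# Bar plot of one column's values, with bars filled by zombie status.
# `feature` is a column name given as a string.
zombie_bar = function(feature){
  feature <- sym(feature)
  zombie_data %>%
    ggplot(aes(x = !!feature, fill = zombie)) + geom_bar()
}
# Reverse view: zombie status on the x-axis, bars filled by the column's values.
zombie_bar2 = function(feature){
  feature <- sym(feature)
  zombie_data %>%
    ggplot(aes(fill = !!feature, x = zombie)) + geom_bar()
}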
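# Draft of a generalized column plot. The function wrapper is still commented
# out, so the body below runs once with `household` hard-coded.
#zombie_col = function(feature){
zombie_data %>% group_by(zombie) %>% count(household) %>%
  # With 200 rows, n/2 is the percentage of the whole sample (n / 200 * 100)
  ggplot(aes(x = household, y = n/2, fill = zombie)) + geom_col()
#}
#map(colnames(zombie_data),zombie_col)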
map(colnames(zombie_data),zombie_bar)
[Output: a list of 14 bar plots, one per column of zombie_data, with bars filled by zombie status]
map(colnames(zombie_data),zombie_bar2)
Warning message: “The following aesthetics were dropped during statistical transformation: fill. ℹ This can happen when ggplot fails to infer the correct grouping structure in the data. ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?”
[Output: a list of 14 bar plots, one per column of zombie_data, with zombie status on the x-axis and bars filled by the column's values]
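The "aesthetics were dropped" warning appears when a numeric column is mapped to fill, which happens here because the function is mapped over every column. One possible way around it, sketched on made-up toy data, is to plot only the non-numeric columns. The helper name `status_bar` and the toy data frame are hypothetical; the alternative suggested by the warning itself is to convert the numeric column to a factor.

```r
library(ggplot2)

# Made-up toy data (illustrative only, not the survey results)
toy <- data.frame(
  zombie    = c("Human", "Zombie", "Zombie", "Human"),
  food      = c("Food", "No food", "No food", "Food"),
  household = c(2, 1, 1, 4),
  stringsAsFactors = FALSE
)

# Hypothetical helper: same idea as zombie_bar2(), but the data is an argument
# and the column is looked up by name through the .data pronoun
status_bar <- function(data, feature) {
  ggplot(data, aes(x = zombie, fill = .data[[feature]])) +
    geom_bar(position = "fill") +
    labs(y = "proportion", fill = feature)
}

# Keep only non-numeric columns, and skip the grouping column itself
is_categorical <- !vapply(toy, is.numeric, logical(1))
features <- setdiff(names(toy)[is_categorical], "zombie")

plots <- lapply(features, function(f) status_bar(toy, f))
names(plots) <- features
```

Numeric columns such as age or water are probably better served by a histogram or box plot per group than by a filled bar plot.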
lm(as.factor(zombie) ~ as.factor(sanitation), data=zombie_data)
Warning message in model.response(mf, "numeric"): “using type = "numeric" with a factor response will be ignored”
Warning message in Ops.factor(y, z$residuals): “‘-’ not meaningful for factors”
Call:
lm(formula = as.factor(zombie) ~ as.factor(sanitation), data = zombie_data)

Coefficients:
                    (Intercept)  as.factor(sanitation)Sanitation
                         1.5294                          -0.2743
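The two warnings indicate that lm() is not designed for a factor response. The coefficients look as if they were fitted to the factor's integer codes (1 and 2); if so, and if "Human" is coded 1 and "Zombie" 2, the intercept would correspond to roughly 53% zombies without sanitation and the slope to about 27 percentage points fewer with sanitation. For a two-level outcome, a logistic regression is the more usual choice. A minimal sketch on made-up toy data (not the survey results), assuming the labels match those shown in the data:

```r
# Made-up toy data (illustrative only, not the survey results).
# ASSUMPTION: labels mirror the data ("Human"/"Zombie", "Sanitation"/"No sanitation").
toy <- data.frame(
  zombie = factor(
    c("Human", "Human", "Human", "Human", "Zombie",
      "Human", "Zombie", "Zombie", "Zombie", "Zombie"),
    levels = c("Human", "Zombie")
  ),
  sanitation = factor(
    rep(c("Sanitation", "No sanitation"), each = 5),
    levels = c("No sanitation", "Sanitation")
  )
)

# Cross-tabulation of supply by status
print(table(toy$sanitation, toy$zombie))

# Logistic regression: with a factor response, glm() treats the first
# level ("Human") as failure, so the model describes the odds of "Zombie"
fit <- glm(zombie ~ sanitation, data = toy, family = binomial)

# Odds ratio of being a zombie for households with sanitation versus without
odds_ratio <- exp(coef(fit))[["sanitationSanitation"]]
print(odds_ratio)
```

An odds ratio below 1 means the supply is associated with lower odds of zombie status. Fitting one such model per supply gives a rough way to compare the supplies, although with only 200 observations the estimates carry considerable uncertainty.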
zombie_bar("sanitation")
# Column names given as strings: the .data pronoun replaces the deprecated aes_string()
zombie_data %>%
ggplot(aes(x = .data[["sanitation"]], fill = .data[["zombie"]])) + geom_bar()
zombie_data %>%
ggplot(aes(x = sanitation, fill = zombie)) + geom_bar()
library(GGally)
Registered S3 method overwritten by 'GGally': method from +.gg ggplot2
ggpairs(zombie_data)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning message: “Removed 74 rows containing non-finite outside the scale range (`stat_g_gally_count()`).”
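Running ggpairs() on all 14 columns, including the identifier, produces a very large plot matrix and one `stat_bin()` message per histogram panel. A sketch of a narrower call, assuming a handful of columns of interest and a bin width of 10. Both choices are illustrative, and the toy data is made up (not the survey results).

```r
library(ggplot2)
library(GGally)

# Made-up toy data (illustrative only, not the survey results)
toy <- data.frame(
  zombie = factor(rep(c("Human", "Zombie"), each = 6)),
  age    = c(18, 25, 31, 40, 52, 60, 45, 58, 67, 71, 76, 85),
  water  = c(8, 16, 8, 24, 8, 40, 0, 0, 8, 0, 0, 0)
)

p <- ggpairs(
  toy,
  columns = c("zombie", "age", "water"),  # leave out identifier-like columns
  mapping = aes(colour = zombie),         # colour every panel by status
  # An explicit binwidth avoids the "Pick better value with `binwidth`" message
  diag  = list(continuous = wrap("barDiag", binwidth = 10)),
  lower = list(combo = wrap("facethist", binwidth = 10))
)

# print(p) draws the plot matrix
```

A single bin width for age and water is a simplification; separate values per variable would need custom panel functions.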