Netflix Titles

TidyTuesday Netflix Titles dataset. The original was published on Kaggle by Shivam Bansal.

Drawing the order of participants

#> sample(c("Franzi", "Selina", "Anne", "Helmut", "Markus"))
#  [1] "Franzi" "Selina" "Markus" "Helmut" "Anne"  
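
The draw above gives a different order on every run. If the order needs to be reproducible, one option is to fix the random seed first. This is a minimal sketch; the seed value 2021 is an arbitrary assumption and not part of the original session.

```r
participants <- c("Franzi", "Selina", "Anne", "Helmut", "Markus")

# Arbitrary seed, chosen only for illustration
set.seed(2021)
order_a <- sample(participants)

# Resetting the same seed reproduces the same permutation
set.seed(2021)
order_b <- sample(participants)

identical(order_a, order_b)
```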

Loading the data

library(tidyverse)
# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2021-04-20')
## --- Compiling #TidyTuesday Information for 2021-04-20 ----
## --- There is 1 file available ---
## --- Starting Download ---
## 
##  Downloading file 1 of 1: `netflix_titles.csv`
## --- Download complete ---
netflix <- tuesdata$netflix

A first look at the data

head(netflix) %>% knitr::kable()
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|
| s1 | TV Show | 3% | NA | João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi & Fantasy | In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join the 3% saved from squalor. |
| s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued while trying desperately to stay alive. |
| s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow soldiers are forced to confront a terrifying secret that’s haunting their jungle island training camp. |
| s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasciore, Alan Oppenheimer, Tom Kane | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi & Fantasy | In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until a brave newcomer joins the group. |
| s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aaron Yoo, Liza Lapira, Jacob Pitts, Laurence Fishburne, Jack McGee, Josh Gad, Sam Golzari, Helen Carey, Jack Gilpin | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-counting experts with the intent of swindling millions out of Las Vegas casinos by playing blackjack. |
| s6 | TV Show | 46 | Serdar Akar | Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan, Saygın Soysal, Berkan Şal, Metin Belgin, Ayça Eren, Selin Uludoğan, Özay Fecht, Suna Yıldızoğlu | Turkey | July 1, 2017 | 2016 | TV-MA | 1 Season | International TV Shows, TV Dramas, TV Mysteries | A genetics professor experiments with a treatment for his comatose sister that blends medical and shamanic cures, but unlocks a shocking side effect. |
dim(netflix)
## [1] 7787   12
count(netflix, country, sort = TRUE) %>% head()
## # A tibble: 6 x 2
##   country            n
##   <chr>          <int>
## 1 United States   2555
## 2 India            923
## 3 <NA>             507
## 4 United Kingdom   397
## 5 Japan            226
## 6 South Korea      183
count(netflix, country, sort = TRUE) %>% tail()
## # A tibble: 6 x 2
##   country                    n
##   <chr>                  <int>
## 1 Uruguay, Guatemala         1
## 2 Uruguay, Spain, Mexico     1
## 3 Venezuela                  1
## 4 Venezuela, Colombia        1
## 5 West Germany               1
## 6 Zimbabwe                   1

Which country produces the longest movies?

Some titles list several countries in the country column. For now, we restrict ourselves to titles that can be assigned to exactly one country. In addition, the duration column is not yet numeric, because it consists of a number and a unit (min).
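
To see what these two steps do before applying them to the full dataset, here is a minimal sketch on a hypothetical toy tibble. The titles and values in `toy` are made up for illustration and are not taken from the dataset; only the column names and the `"<number> <unit>"` format of `duration` follow the data shown above.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data mimicking the relevant netflix columns
toy <- tibble(
  title    = c("A", "B", "C", "D"),
  country  = c("Mexico", "Uruguay, Spain", NA, "India"),
  duration = c("93 min", "110 min", "2 Seasons", "228 min")
)

toy_single <- toy %>%
  # Keep rows without a comma in country; str_detect() returns NA for a
  # missing country, and filter() drops rows where the condition is NA
  filter(!str_detect(country, ",")) %>%
  # Split "93 min" into the number and the unit
  separate(duration, c("duration", "time_unit"), sep = " ") %>%
  # The number is still a character string at this point
  mutate(duration = as.integer(duration))

toy_single
```

Note that the comma filter also removes titles with a missing country, which matches the absence of an `<NA>` row in the single-country counts.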

singleCountryEntry <- netflix %>% filter(!str_detect(country, ","))
singleCountryEntry %>% count(country, sort=TRUE) 
## # A tibble: 69 x 2
##    country            n
##    <chr>          <int>
##  1 United States   2555
##  2 India            923
##  3 United Kingdom   397
##  4 Japan            226
##  5 South Korea      183
##  6 Canada           177
##  7 Spain            134
##  8 France           115
##  9 Egypt            101
## 10 Mexico           100
## # … with 59 more rows
singleCountryEntry %>%
    group_by(country) %>%
    filter(type=="Movie") %>% 
    separate(duration,c("duration","time_unit"),sep=" ") %>%
    mutate(duration = as.integer(duration)) %>%
    summarise(longestFilm = max(duration), n = n()) %>%
    arrange(-n)
## # A tibble: 62 x 3
##    country        longestFilm     n
##    <chr>                <int> <int>
##  1 United States          312  1850
##  2 India                  228   852
##  3 United Kingdom         146   193
##  4 Canada                 140   118
##  5 Egypt                  253    89
##  6 Spain                  163    89
##  7 Turkey                 153    73
##  8 Philippines            150    70
##  9 France                 133    69
## 10 Japan                  151    69
## # … with 52 more rows
netflix %>%
    separate(duration, c("duration", "time_unit"), sep = " ") %>%
    mutate(duration = as.integer(duration)) %>%
    filter(duration == max(duration)) %>%
    select(title)
## # A tibble: 1 x 1
##   title                     
##   <chr>                     
## 1 Black Mirror: Bandersnatch
# Check that all movie durations are given in minutes (0 rows expected)
singleCountryEntry %>%
    group_by(country) %>%
    filter(type == "Movie") %>%
    filter(!str_detect(duration, "min"))
## # A tibble: 0 x 12
## # Groups:   country [0]
## # … with 12 variables: show_id <chr>, type <chr>, title <chr>, director <chr>, cast <chr>, country <chr>, date_added <chr>, release_year <dbl>, rating <chr>,
## #   duration <chr>, listed_in <chr>, description <chr>
singleCountryEntry %>%
    separate(duration, c("duration", "time_unit"), sep = " ") %>%
    mutate(duration = as.integer(duration)) %>%
    filter(type == "Movie") %>%
    group_by(country) %>%
    mutate(n = n()) %>%
    filter(duration == max(duration)) %>%
    select(country, title, duration, n) %>%
    arrange(-duration) %>%
    #filter(country %in% c("Germany", "West Germany", "East Germany")) %>%
    identity
## # A tibble: 62 x 4
## # Groups:   country [62]
##    country       title                      duration     n
##    <chr>         <chr>                         <int> <int>
##  1 United States Black Mirror: Bandersnatch      312  1850
##  2 Egypt         The School of Mischief          253    89
##  3 India         Sangam                          228   852
##  4 Netherlands   Elephants Dream 4 Hour          196    12
##  5 Nigeria       King of Boys                    182    62
##  6 Indonesia     This Earth of Mankind           180    68
##  7 Kuwait        Bye Bye London                  177     4
##  8 Pakistan      Ho Mann Jahaan                  170    14
##  9 Denmark       A Fortunate Man                 168     5
## 10 Spain         Palm Trees in the Snow          163    89
## # … with 52 more rows

Apparently the longest movie, “Black Mirror: Bandersnatch” with a running time of 312 minutes, comes from the United States. IMDb, however, lists its runtime as only 90 minutes, so this could be an error in the dataset.
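
One rough way to judge how unusual the 312-minute value is would be to compare it with the next-longest movies. The sketch below is only an illustration: the four rows in `longest` are copied from the per-country maxima shown above and are not the full dataset, and `gap_to_next` is a hypothetical helper column.

```r
library(dplyr)

# Values copied from the per-country maxima shown above (not the full dataset)
longest <- tibble(
  title    = c("Black Mirror: Bandersnatch", "The School of Mischief",
               "Sangam", "Elephants Dream 4 Hour"),
  duration = c(312L, 253L, 228L, 196L)
)

gaps <- longest %>%
  arrange(desc(duration)) %>%
  # Difference in minutes to the next-longest movie in this short list
  mutate(gap_to_next = duration - lead(duration))

gaps
```

A large gap at the top does not prove an error, but it marks the entry as worth checking against an external source.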

Question 2: How have the dominant genres changed over time?

The genres are stored as a comma-separated list per title and first have to be split into separate rows. Because there are too many genres, all but the 8 most frequent ones are lumped together into “Other”.

#netflix$listed_in

netflix %>%
    separate_rows(listed_in, sep = ", ") %>%
    count(release_year, listed_in, sort = TRUE) %>%
    mutate(listed_in = fct_lump(listed_in, 8, w = n)) %>%
    ggplot(aes(x = release_year, y = n, fill = listed_in)) +
    geom_col() +
    facet_wrap(~ listed_in, scales = "free_y")
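
The split-and-lump step used in the pipeline above is easier to follow on a small example. This is a minimal sketch on a hypothetical toy tibble (the rows in `toy` are made up); it keeps the 2 most frequent genres instead of 8 so that the effect is visible with few rows.

```r
library(dplyr)
library(tidyr)
library(forcats)

# Hypothetical toy data: one comma-separated genre list per title
toy <- tibble(
  release_year = c(2019, 2019, 2020, 2020),
  listed_in    = c("Dramas, Comedies", "Dramas",
                   "Dramas, Thrillers", "Comedies, Horror Movies")
)

genre_counts <- toy %>%
  # One row per title-genre combination
  separate_rows(listed_in, sep = ", ") %>%
  # Number of titles per genre, most frequent first
  count(listed_in, sort = TRUE) %>%
  # Keep the 2 genres with the largest weights n, lump the rest into "Other"
  mutate(genre = fct_lump(listed_in, 2, w = n))

genre_counts
```

The `w = n` argument matters here: after `count()` every genre appears once per group, so without weights the lumping would rank genres by number of rows rather than by number of titles.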

Which movie released after 2000 is already a Classic Movie?

netflix %>% separate_rows(listed_in, sep = ", ") %>% count(release_year, listed_in, sort = TRUE) %>% 
    filter(listed_in=="Classic Movies") %>%
    ggplot(aes(x = release_year, y = n)) +
    geom_col()

netflix %>% filter(str_detect(listed_in, "Classic Movie")) %>% arrange(-release_year) %>% select(title, release_year, listed_in, country)
## # A tibble: 103 x 4
##    title                       release_year listed_in                                        country                    
##    <chr>                              <dbl> <chr>                                            <chr>                      
##  1 The Other Side of the Wind          2018 Classic Movies, Dramas, Independent Movies       France, Iran, United States
##  2 Four Weddings and a Funeral         1994 Classic Movies, Comedies, International Movies   United Kingdom             
##  3 Hum Aapke Hain Koun                 1994 Classic Movies, Dramas, International Movies     India                      
##  4 Pulp Fiction                        1994 Classic Movies, Cult Movies, Dramas              United States              
##  5 Philadelphia                        1993 Classic Movies, Dramas, LGBTQ Movies             United States              
##  6 Schindler's List                    1993 Classic Movies, Dramas                           United States              
##  7 Searching for Bobby Fischer         1993 Children & Family Movies, Classic Movies, Dramas United States              
##  8 Tango Feroz                         1993 Classic Movies, Dramas, International Movies     Argentina, Spain           
##  9 What's Eating Gilbert Grape         1993 Classic Movies, Dramas, Independent Movies       United States              
## 10 Basic Instinct                      1992 Classic Movies, Thrillers                        United States, France      
## # … with 93 more rows

It is “The Other Side of the Wind” (2018), the only title in the Classic Movies listing above with a release year after 2000.
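
Instead of reading the answer off the sorted listing, the year condition can be put directly into the filter. A minimal sketch, assuming a small tibble `classics` whose three rows are copied from the output above rather than the full dataset:

```r
library(dplyr)
library(stringr)

# Rows copied from the output above (subset, for illustration only)
classics <- tibble(
  title        = c("The Other Side of the Wind", "Pulp Fiction", "Schindler's List"),
  release_year = c(2018, 1994, 1993),
  listed_in    = c("Classic Movies, Dramas, Independent Movies",
                   "Classic Movies, Cult Movies, Dramas",
                   "Classic Movies, Dramas")
)

recent_classics <- classics %>%
  # Both conditions must hold: genre contains "Classic Movies" and year > 2000
  filter(str_detect(listed_in, "Classic Movies"), release_year > 2000)

recent_classics
```

On the full data, the same `filter()` call could be applied to `netflix` in place of `classics`.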