Netflix Titles

The TidyTuesday Netflix Titles dataset. The original was published on Kaggle by Shivam Bansal.

Drawing the order of participants

#> sample(c("Franzi", "Selina", "Anne", "Helmut", "Markus"))
#  [1] "Franzi" "Selina" "Markus" "Helmut" "Anne"

Loading the data

library(tidyverse)
# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2021-04-20')

## --- Compiling #TidyTuesday Information for 2021-04-20 ----

## --- There is 1 file available ---

## --- Starting Download ---

## 
##  Downloading file 1 of 1: `netflix_titles.csv`

## --- Download complete ---

netflix <- tuesdata$netflix

A first look at the data

head(netflix) %>% knitr::kable()

show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
s1	TV Show	3%	NA	João Miguel, Bianca Comparato, Michel Gomes, Rodolfo Valente, Vaneza Oliveira, Rafael Lozano, Viviane Porto, Mel Fronckowiak, Sergio Mamberti, Zezé Motta, Celso Frateschi	Brazil	August 14, 2020	2020	TV-MA	4 Seasons	International TV Shows, TV Dramas, TV Sci-Fi & Fantasy	In a future where the elite inhabit an island paradise far from the crowded slums, you get one chance to join the 3% saved from squalor.
s2	Movie	7:19	Jorge Michel Grau	Demián Bichir, Héctor Bonilla, Oscar Serrano, Azalia Ortiz, Octavio Michel, Carmen Beato	Mexico	December 23, 2016	2016	TV-MA	93 min	Dramas, International Movies	After a devastating earthquake hits Mexico City, trapped survivors from all walks of life wait to be rescued while trying desperately to stay alive.
s3	Movie	23:59	Gilbert Chan	Tedd Chan, Stella Chung, Henley Hii, Lawrence Koh, Tommy Kuan, Josh Lai, Mark Lee, Susan Leong, Benjamin Lim	Singapore	December 20, 2018	2011	R	78 min	Horror Movies, International Movies	When an army recruit is found dead, his fellow soldiers are forced to confront a terrifying secret that’s haunting their jungle island training camp.
s4	Movie	9	Shane Acker	Elijah Wood, John C. Reilly, Jennifer Connelly, Christopher Plummer, Crispin Glover, Martin Landau, Fred Tatasciore, Alan Oppenheimer, Tom Kane	United States	November 16, 2017	2009	PG-13	80 min	Action & Adventure, Independent Movies, Sci-Fi & Fantasy	In a postapocalyptic world, rag-doll robots hide in fear from dangerous machines out to exterminate them, until a brave newcomer joins the group.
s5	Movie	21	Robert Luketic	Jim Sturgess, Kevin Spacey, Kate Bosworth, Aaron Yoo, Liza Lapira, Jacob Pitts, Laurence Fishburne, Jack McGee, Josh Gad, Sam Golzari, Helen Carey, Jack Gilpin	United States	January 1, 2020	2008	PG-13	123 min	Dramas	A brilliant group of students become card-counting experts with the intent of swindling millions out of Las Vegas casinos by playing blackjack.
s6	TV Show	46	Serdar Akar	Erdal Beşikçioğlu, Yasemin Allen, Melis Birkan, Saygın Soysal, Berkan Şal, Metin Belgin, Ayça Eren, Selin Uludoğan, Özay Fecht, Suna Yıldızoğlu	Turkey	July 1, 2017	2016	TV-MA	1 Season	International TV Shows, TV Dramas, TV Mysteries	A genetics professor experiments with a treatment for his comatose sister that blends medical and shamanic cures, but unlocks a shocking side effect.

dim(netflix)

## [1] 7787   12

count(netflix, country, sort = TRUE) %>% head()

## # A tibble: 6 x 2
##   country            n
##   <chr>          <int>
## 1 United States   2555
## 2 India            923
## 3 <NA>             507
## 4 United Kingdom   397
## 5 Japan            226
## 6 South Korea      183

count(netflix, country, sort = TRUE) %>% tail()

## # A tibble: 6 x 2
##   country                    n
##   <chr>                  <int>
## 1 Uruguay, Guatemala         1
## 2 Uruguay, Spain, Mexico     1
## 3 Venezuela                  1
## 4 Venezuela, Colombia        1
## 5 West Germany               1
## 6 Zimbabwe                   1

Which country produces the longest movies?

Some titles list several countries in the country column. For now, we restrict ourselves to the titles that can be assigned to exactly one country. In addition, the duration column is not yet numeric, because it consists of a number and a unit (min).

singleCountryEntry <- netflix %>% filter(!str_detect(country, ","))
singleCountryEntry %>% count(country, sort=TRUE)

## # A tibble: 69 x 2
##    country            n
##    <chr>          <int>
##  1 United States   2555
##  2 India            923
##  3 United Kingdom   397
##  4 Japan            226
##  5 South Korea      183
##  6 Canada           177
##  7 Spain            134
##  8 France           115
##  9 Egypt            101
## 10 Mexico           100
## # … with 59 more rows

singleCountryEntry %>%
    group_by(country) %>%
    filter(type=="Movie") %>% 
    separate(duration,c("duration","time_unit"),sep=" ") %>%
    mutate(duration = as.integer(duration)) %>%
    summarise(longestFilm = max(duration), n = n()) %>%
    arrange(-n)

## # A tibble: 62 x 3
##    country        longestFilm     n
##    <chr>                <int> <int>
##  1 United States          312  1850
##  2 India                  228   852
##  3 United Kingdom         146   193
##  4 Canada                 140   118
##  5 Egypt                  253    89
##  6 Spain                  163    89
##  7 Turkey                 153    73
##  8 Philippines            150    70
##  9 France                 133    69
## 10 Japan                  151    69
## # … with 52 more rows

netflix %>% filter(type == "Movie") %>%  # TV shows report seasons, not minutes
    separate(duration, c("duration", "time_unit"), sep = " ") %>% mutate(duration = as.integer(duration)) %>%
    filter(duration == max(duration)) %>% select(title)

## # A tibble: 1 x 1
##   title                     
##   <chr>                     
## 1 Black Mirror: Bandersnatch

# All durations are in min
singleCountryEntry %>% group_by(country) %>% filter(type=="Movie") %>% filter(!str_detect(duration,"min"))

## # A tibble: 0 x 12
## # Groups:   country [0]
## # … with 12 variables: show_id <chr>, type <chr>, title <chr>, director <chr>, cast <chr>, country <chr>, date_added <chr>, release_year <dbl>, rating <chr>,
## #   duration <chr>, listed_in <chr>, description <chr>
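
Since every movie duration ends in "min", the number can also be extracted directly instead of splitting the column. The following is a minimal sketch on made-up toy rows (the `toy` tibble and the `minutes` column are assumptions for illustration, not part of the dataset), using `readr::parse_number()`:

```r
library(dplyr)
library(readr)
library(tibble)

# Toy rows that mimic the `type` and `duration` columns of the dataset
toy <- tribble(
  ~title,   ~type,     ~duration,
  "Film A", "Movie",   "93 min",
  "Film B", "Movie",   "123 min",
  "Show C", "TV Show", "4 Seasons"
)

movie_minutes <- toy %>%
  filter(type == "Movie") %>%               # keep only rows measured in minutes
  mutate(minutes = parse_number(duration))  # "93 min" becomes 93

movie_minutes
```

Filtering on `type` first matters here, because `parse_number()` would also turn "4 Seasons" into 4 and mix seasons with minutes.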

singleCountryEntry %>% filter(type == "Movie") %>%
    separate(duration, c("duration", "time_unit"), sep = " ") %>%
    mutate(duration = as.integer(duration)) %>%
    group_by(country) %>% mutate(n = n()) %>%  # n: number of movies per country
    filter(duration == max(duration)) %>%       # keep the longest movie of each country
    select(country, title, duration, n) %>% arrange(-duration) %>%
    #filter(country %in% c("Germany", "West Germany", "East Germany")) %>%
    identity

## # A tibble: 62 x 4
## # Groups:   country [62]
##    country       title                      duration     n
##    <chr>         <chr>                         <int> <int>
##  1 United States Black Mirror: Bandersnatch      312  1850
##  2 Egypt         The School of Mischief          253    89
##  3 India         Sangam                          228   852
##  4 Netherlands   Elephants Dream 4 Hour          196    12
##  5 Nigeria       King of Boys                    182    62
##  6 Indonesia     This Earth of Mankind           180    68
##  7 Kuwait        Bye Bye London                  177     4
##  8 Pakistan      Ho Mann Jahaan                  170    14
##  9 Denmark       A Fortunate Man                 168     5
## 10 Spain         Palm Trees in the Snow          163    89
## # … with 52 more rows

Apparently the longest movie, “Black Mirror: Bandersnatch” with a running time of 312 minutes, comes from the United States. IMDb, however, lists its length as only 90 minutes, so this may be an error in the dataset.
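
The analysis above only covers titles with a single country. One way to include co-productions would be to split the country column into one row per country with `tidyr::separate_rows()`. This is only a sketch on invented toy rows (the `toy` tibble and its already numeric `minutes` column are assumptions), and each co-production then counts once for every country involved:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy rows with an already numeric duration; one title lists three countries
toy <- tribble(
  ~title,   ~country,                 ~minutes,
  "Film A", "Uruguay, Spain, Mexico", 100,
  "Film B", "Spain",                  163,
  "Film C", "Mexico",                 90
)

longest_by_country <- toy %>%
  separate_rows(country, sep = ", ") %>%   # one row per title-country pair
  group_by(country) %>%
  summarise(longest = max(minutes), n = n()) %>%
  arrange(desc(longest))

longest_by_country
```

Whether counting a co-production for each participating country is appropriate depends on the question, so the single-country restriction remains the more conservative choice.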

Question 2: How have the dominant genres changed over time?

The genres are stored as a comma-separated list per title and first have to be split into separate rows. Because there are too many genres, all but the 8 most frequent ones are lumped into “Other”.

#netflix$listed_in

netflix %>% separate_rows(listed_in, sep = ", ") %>% count(release_year, listed_in, sort = TRUE) %>% 
    mutate(listed_in = fct_lump(listed_in, 8, w = n)) %>%  # keep the 8 most frequent genres, lump the rest into "Other"
    ggplot(aes(x = release_year, y = n, fill = listed_in)) +
    geom_col() + facet_wrap(~ listed_in, scales = "free_y")
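
To see what the weighted lumping does, here is a small sketch on invented genre counts (the numbers in `genre_counts` are made up for illustration). With `w = n`, `fct_lump()` ranks the genres by their summed counts rather than by the number of rows, and here keeps the top 3 instead of 8:

```r
library(dplyr)
library(forcats)
library(tibble)

# Invented counts per genre, standing in for the output of count()
genre_counts <- tribble(
  ~listed_in,       ~n,
  "Dramas",         50,
  "Comedies",       40,
  "Documentaries",  30,
  "Thrillers",      2,
  "Classic Movies", 1
)

lumped <- genre_counts %>%
  mutate(listed_in = fct_lump(listed_in, n = 3, w = n))  # w = n uses the count column as weight

# Sum the counts per (lumped) genre
totals <- lumped %>% count(listed_in, wt = n, name = "total")
totals
```

The two rare genres end up in a single "Other" level whose total is the sum of their counts.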

Which movie released after 2000 is already a Classic Movie?

netflix %>% separate_rows(listed_in, sep = ", ") %>% count(release_year, listed_in, sort = TRUE) %>% 
    filter(listed_in=="Classic Movies") %>%
    ggplot(aes(x = release_year, y = n)) +
    geom_col()

netflix %>% filter(str_detect(listed_in, "Classic Movie")) %>% arrange(-release_year) %>% select(title, release_year, listed_in, country)

## # A tibble: 103 x 4
##    title                       release_year listed_in                                        country                    
##    <chr>                              <dbl> <chr>                                            <chr>                      
##  1 The Other Side of the Wind          2018 Classic Movies, Dramas, Independent Movies       France, Iran, United States
##  2 Four Weddings and a Funeral         1994 Classic Movies, Comedies, International Movies   United Kingdom             
##  3 Hum Aapke Hain Koun                 1994 Classic Movies, Dramas, International Movies     India                      
##  4 Pulp Fiction                        1994 Classic Movies, Cult Movies, Dramas              United States              
##  5 Philadelphia                        1993 Classic Movies, Dramas, LGBTQ Movies             United States              
##  6 Schindler's List                    1993 Classic Movies, Dramas                           United States              
##  7 Searching for Bobby Fischer         1993 Children & Family Movies, Classic Movies, Dramas United States              
##  8 Tango Feroz                         1993 Classic Movies, Dramas, International Movies     Argentina, Spain           
##  9 What's Eating Gilbert Grape         1993 Classic Movies, Dramas, Independent Movies       United States              
## 10 Basic Instinct                      1992 Classic Movies, Thrillers                        United States, France      
## # … with 93 more rows

It is “The Other Side of the Wind” (2018), the only Classic Movie in the table released after 2000.
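
Instead of sorting and reading off the first row, the question could also be answered with an explicit filter on `release_year`. A minimal sketch, assuming a toy tibble built from a few rows shown in the outputs above (the `toy` name is only for illustration):

```r
library(dplyr)
library(stringr)
library(tibble)

# A few rows copied from the outputs above
toy <- tribble(
  ~title,                       ~release_year, ~listed_in,
  "The Other Side of the Wind", 2018,          "Classic Movies, Dramas, Independent Movies",
  "Pulp Fiction",               1994,          "Classic Movies, Cult Movies, Dramas",
  "Schindler's List",           1993,          "Classic Movies, Dramas",
  "21",                         2008,          "Dramas"
)

recent_classics <- toy %>%
  filter(str_detect(listed_in, "Classic Movies"),  # genre list mentions Classic Movies
         release_year > 2000)                      # released after 2000

recent_classics
```

Applied to the full `netflix` data, the same two conditions should return the row discussed above.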