Here, we are going to use the Lemurs data set provided by the TidyTuesday project to illustrate:

Packages and basic set up

R packages

In this analysis, we will rely heavily on packages from the Tidyverse, a collection of packages designed specifically for data science. These packages share data structures and function syntax, so they can be easily combined in a slick, yet comprehensible workflow.

library(tidyverse)
library(lubridate) # easy manipulation of date objects

More specifically, we will work with the following packages:

  • readr: functions to easily, yet reliably, read in rectangular data (e.g. csv, tsv) containing multiple data types (e.g. numeric, logical). By reliably, I mean that it can recognize errors in the table formatting that require checking by the user (e.g. the occurrence of numeric values in a seemingly logical column).

  • tidyr: functions to create and manipulate “tidy data”, i.e., data where each column is a variable, each row is an observation, and each cell holds a single value. The other functions in tidyverse are optimized to work with this type of data.

  • dplyr: functions for data manipulation (e.g. filtering, summarizing). One of the features that make this package particularly good is that functions are named as verbs, indicating the type of data transformation they perform. This makes reading the code considerably easier.

  • stringr: functions for string manipulation, considering that this is not one of base R’s strengths.

  • ggplot2: functions to code graphs following the [“Grammar of Graphics”](https://cfss.uchicago.edu/notes/grammar-of-graphics/). Simply put, the grammar of graphics is a system of rules that allows coding data into visual elements. Reading the linked article and other, more precise definitions is highly recommended, though.
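To make the idea of “tidy data” more concrete, here is a minimal sketch that is not part of the original analysis. It assumes a made-up “wide” table (wide_df, with invented ids and weights) and reshapes it with tidyr::pivot_longer so that each row becomes one observation:

```r
library(tidyr)

# "Wide" toy table: one column per measurement year (made-up weights).
wide_df <- tibble::tibble(
  dlc_id = c("0001", "0002"),
  `2019` = c(2100, 1900),
  `2020` = c(2150, 1950)
)

# Tidy version: one row per individual and year, one column per variable.
long_df <- pivot_longer(wide_df,
                        cols = c(`2019`, `2020`),
                        names_to = "year",
                        values_to = "weight_g")

long_df
```

In the long form, year and weight_g are ordinary variables, which is the shape that dplyr and ggplot2 functions expect.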

Project set up

The files in this project are organized as follows:

data_crunch_wue
|--README.md
|--lemurs.Rmd
|--figures
|--results
|  |--data
|  |  |--raw 
|  |  |--processed
|  |--figures
|  |--tables
|  |--scripts

This file structure adapts the minimal set up I propose for scientific computational projects. The idea is to organize the project around the .Rmd file (the .html version of which you are reading right now). By combining descriptive text, code, and results of the analysis, this “computational notebook” facilitates communication and reproducibility of the work it reports. As part of this set up, inputs and outputs can be accessed with relative paths:

raw_dir <- file.path("results", "data", "raw")
processed_dir <- file.path("results", "data", "processed")
scripts_dir <- file.path("results", "scripts")
figures_dir <- file.path("results", "figures")
tables_dir <- file.path("results", "tables")
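Relative paths like these only resolve if the folders already exist. The following is a minimal sketch, not taken from the original project: it recreates the same sub-folders under a temporary stand-in root (project_root and output_dirs are hypothetical names) so that it can be run anywhere without touching real files.

```r
# Stand-in for the project root; in the real project this would be the
# folder that contains lemurs.Rmd (assumption for illustration).
project_root <- file.path(tempdir(), "data_crunch_wue")

# Same sub-folders as in the tree above, relative to the root.
output_dirs <- c(
  raw       = file.path(project_root, "results", "data", "raw"),
  processed = file.path(project_root, "results", "data", "processed"),
  scripts   = file.path(project_root, "results", "scripts"),
  figures   = file.path(project_root, "results", "figures"),
  tables    = file.path(project_root, "results", "tables")
)

# Create any folder that is missing; existing folders are left untouched.
for (d in output_dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

all(dir.exists(output_dirs))
```

Running something like this once at the top of a notebook avoids write errors later, for example when saving tables or figures.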

Original data

As shown in the dataset page, the original data can be downloaded from the git repository:

lemurs_df <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   taxon = col_character(),
##   dlc_id = col_character(),
##   hybrid = col_character(),
##   sex = col_character(),
##   name = col_character(),
##   current_resident = col_character(),
##   stud_book = col_character(),
##   dob = col_date(format = ""),
##   estimated_dob = col_character(),
##   birth_type = col_character(),
##   birth_institution = col_character(),
##   estimated_concep = col_date(format = ""),
##   dam_id = col_character(),
##   dam_name = col_character(),
##   dam_taxon = col_character(),
##   dam_dob = col_date(format = ""),
##   sire_id = col_character(),
##   sire_name = col_character(),
##   sire_taxon = col_character(),
##   sire_dob = col_date(format = "")
##   # ... with 8 more columns
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 29130 parsing failures.
##  row             col           expected actual                                                                                                       file
## 1324 age_of_living_y 1/0/T/F/TRUE/FALSE  23.77 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv'
## 1325 age_of_living_y 1/0/T/F/TRUE/FALSE  23.77 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv'
## 1326 age_of_living_y 1/0/T/F/TRUE/FALSE  23.77 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv'
## 1327 age_of_living_y 1/0/T/F/TRUE/FALSE  23.77 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv'
## 1328 age_of_living_y 1/0/T/F/TRUE/FALSE  23.77 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv'
## .... ............... .................. ...... ..........................................................................................................
## See problems(...) for more details.

Already at loading, we come across a parsing error: read_csv identified the column age_of_living_y as containing values of type logical (it cites the expected values as 1/0/T/F/TRUE/FALSE), but at row 1324 the value is a double. The simplest explanation is that, with default settings, the read_csv function guesses the type (character, logical, etc.) of each column of the data frame based on the first 1000 rows. We can verify this by checking the contents of the first 1000 rows:

unique(lemurs_df[1:1000, "age_of_living_y"])
## # A tibble: 1 x 1
##   age_of_living_y
##   <lgl>          
## 1 NA

Thus, we see that missing data in these rows led to the wrong type being guessed. We can fix it by explicitly declaring the types of the columns of the data frame. Before we do this, however, let’s verify that age_of_living_y was the only column that raised an issue, using the problems attribute of objects read with the read_* functions from the readr package. This attribute stores parsing problems in a data frame containing the row and col where expected and actual values differ.

unique(problems(lemurs_df)$col)
## [1] "age_of_living_y"
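To see the guessing behaviour in isolation, here is a small sketch that is not part of the original analysis. It writes a made-up file (toy_path and toy_df are hypothetical, and the values are invented) in which a column is empty for the first 1200 rows, and then asks readr to inspect more rows than the default through the guess_max argument. This is one possible workaround, assuming the file is small enough that scanning extra rows is cheap.

```r
library(readr)

# Toy file mimicking the problem: a column that is empty for the first
# 1200 rows and numeric afterwards (made-up values, not the lemur data).
toy_path <- tempfile(fileext = ".csv")
toy_df <- data.frame(
  id = seq_len(1500),
  age_of_living_y = c(rep(NA, 1200), rep(23.77, 300))
)
write.csv(toy_df, toy_path, row.names = FALSE, na = "")

# Let readr inspect 2000 rows (more than the file has) before guessing types.
toy_read <- suppressMessages(read_csv(toy_path, guess_max = 2000))

class(toy_read$age_of_living_y)
```

Raising guess_max keeps the call short, but the result still depends on guessing, which is why declaring the types explicitly remains the more reproducible option.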

When specifying the column types, we can either override only the problematic column and let readr guess the rest, or spell out every type so that the import no longer depends on guessing. With 54 columns, the second option sounds like a lot of typing, but this is where the magic starts: the spec() function lists all column types, and we just need to fix the ones that were read in wrong.

spec(lemurs_df)
## cols(
##   taxon = col_character(),
##   dlc_id = col_character(),
##   hybrid = col_character(),
##   sex = col_character(),
##   name = col_character(),
##   current_resident = col_character(),
##   stud_book = col_character(),
##   dob = col_date(format = ""),
##   birth_month = col_double(),
##   estimated_dob = col_character(),
##   birth_type = col_character(),
##   birth_institution = col_character(),
##   litter_size = col_double(),
##   expected_gestation = col_double(),
##   estimated_concep = col_date(format = ""),
##   concep_month = col_double(),
##   dam_id = col_character(),
##   dam_name = col_character(),
##   dam_taxon = col_character(),
##   dam_dob = col_date(format = ""),
##   dam_age_at_concep_y = col_double(),
##   sire_id = col_character(),
##   sire_name = col_character(),
##   sire_taxon = col_character(),
##   sire_dob = col_date(format = ""),
##   sire_age_at_concep_y = col_double(),
##   dod = col_date(format = ""),
##   age_at_death_y = col_double(),
##   age_of_living_y = col_logical(),
##   age_last_verified_y = col_double(),
##   age_max_live_or_dead_y = col_double(),
##   n_known_offspring = col_double(),
##   dob_estimated = col_character(),
##   weight_g = col_double(),
##   weight_date = col_date(format = ""),
##   month_of_weight = col_double(),
##   age_at_wt_d = col_double(),
##   age_at_wt_wk = col_double(),
##   age_at_wt_mo = col_double(),
##   age_at_wt_mo_no_dec = col_double(),
##   age_at_wt_y = col_double(),
##   change_since_prev_wt_g = col_double(),
##   days_since_prev_wt = col_double(),
##   avg_daily_wt_change_g = col_double(),
##   days_before_death = col_double(),
##   r_min_dam_age_at_concep_y = col_double(),
##   age_category = col_character(),
##   preg_status = col_character(),
##   expected_gestation_d = col_double(),
##   concep_date_if_preg = col_date(format = ""),
##   infant_dob_if_preg = col_date(format = ""),
##   days_before_inf_birth_if_preg = col_double(),
##   pct_preg_remain_if_preg = col_double(),
##   infant_lit_sz_if_preg = col_double()
## )
lemurs_rawdf <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-24/lemur_data.csv',
                          col_types = cols(
                            .default = col_double(),
                            taxon = col_character(),
                            dlc_id = col_character(),
                            hybrid = col_character(),
                            sex = col_character(),
                            name = col_character(),
                            current_resident = col_character(),
                            stud_book = col_character(),
                            dob = col_date(format = ""),
                            birth_month = col_double(),
                            estimated_dob = col_character(),
                            birth_type = col_character(),
                            birth_institution = col_character(),
                            estimated_concep = col_date(format = ""),
                            dam_id = col_character(),
                            dam_name = col_character(),
                            dam_taxon = col_character(),
                            dam_dob = col_date(format = ""),
                            dam_age_at_concep_y = col_double(),
                            sire_id = col_character(),
                            sire_name = col_character(),
                            sire_taxon = col_character(),
                            sire_dob = col_date(format = ""),
                            dod = col_date(format = ""),
                            age_of_living_y = col_double(), ## the column that was typed wrong by default
                            dob_estimated = col_character(),
                            weight_date = col_date(format = ""),
                            age_category = col_character(),
                            preg_status = col_character(),
                            concep_date_if_preg = col_date(format = ""),
                            infant_dob_if_preg = col_date(format = "")
                            )
)
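As an aside, cols() also accepts a partial specification, in which case readr guesses the types of the columns that are not named. The sketch below illustrates this on a made-up file (toy_path and its contents are hypothetical, not the lemur data) and checks that no parsing problems remain:

```r
library(readr)

# Toy file: the numeric column is empty for the first 1200 rows.
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(taxon = rep("LCAT", 1500),
             age_of_living_y = c(rep(NA, 1200), rep(23.77, 300))),
  toy_path, row.names = FALSE, na = ""
)

# Only the problematic column is declared; readr guesses the others.
toy_typed <- suppressMessages(
  read_csv(toy_path, col_types = cols(age_of_living_y = col_double()))
)

# Number of recorded parsing problems for this toy file.
nrow(problems(toy_typed))
```

The full specification used above is longer, but it documents every column type in the notebook itself.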

Let’s also load a data frame with the species full names and abbreviations, to use later for more understandable graphs and tables:

lemurs_sppnames_df <- readr::read_csv("https://raw.githubusercontent.com/ludmillafigueiredo/computational_notebooks/master/examples/datastudy_r/results/data/raw/lemurs_sppnames.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   taxon = col_character(),
##   species = col_character(),
##   common_name = col_character()
## )

Pre-processing

We have a couple of things to work on in the original table, to make it more digestible:

  1. The individuals’ names are written in all capitals, which can be annoying to read. Let’s capitalize only the first letter.
lemurs_smallts <- dplyr::mutate_at(lemurs_rawdf, vars(name, dam_name, sire_name), stringr::str_to_title)
  2. One of the most notable features of this data set is that it started being collected in the 1960s. However, at first glance, there is no clear column stating the date at which the data was collected (I myself expected it to be one of the first columns). After reading the data description in the original repo, we learn that the weight_date and month_of_weight variables report the full date and the month when the weight was measured, respectively. It would be good to have those two easily accessible.
lemurs_smallts <- dplyr::mutate(lemurs_smallts, year = lubridate::year(weight_date))
lemurs_smallts <- dplyr::rename(lemurs_smallts, month = month_of_weight)
  3. In total, 54 variables describing each individual are available in this data set. To start processing it, we do not need all of those, so let’s select the ones that are more relevant (chosen based on the column description). Also, we want to simplify some of the column names.
lemurs_smallts <- dplyr::select(lemurs_smallts,
                                c(year, month, ## time variables
                                  taxon, dlc_id, ## id variables
                                  hybrid, sex, name, birth_month, litter_size, concep_month, ## birth variables
                                  dam_id, dam_name, dam_taxon, ## name of mother
                                  sire_id, sire_name, sire_taxon, ## name of father
                                  age_at_death_y, age_of_living_y, age_last_verified_y, ## age variables
                                  age_max_live_or_dead_y, age_at_wt_y, age_category, ## age variables
                                  weight_g, avg_daily_wt_change_g, ## weight variables
                                  preg_status, 
                                  n_known_offspring, infant_lit_sz_if_preg))

lemurs_smallts <- dplyr::rename(lemurs_smallts,
                                weight = weight_g,
                                avg_d_wt_chg = avg_daily_wt_change_g,
                                n_offspring = n_known_offspring)

However, redefining the same object so many times (lemurs_smallts in this case) is not good practice: if you forget to run one of the steps, errors can appear down the line (e.g. you apply transformations to one of the “intermediate” stages). Having multiple objects is also not great, because one would have to name them, and it would be a waste of creativity on temporary objects. With that in mind, let’s try some magic: we will put all the transformations together in a pipeline, where the transformations are chained in a readable form, and only one data frame is created at the end:

lemurs_smallts <- lemurs_rawdf %>%
  # 1. capitalizing the first letter, only
  dplyr::mutate_at(vars(name, dam_name, sire_name), stringr::str_to_title) %>%
  # 2. extract the year of the measurement, and give a simpler name to the column containing the month
  dplyr::mutate(year = lubridate::year(weight_date)) %>%
  dplyr::rename(month = month_of_weight) %>%
  # 3. select the most relevant variables
  dplyr::select(c(year, month, ## time variables
                  taxon, dlc_id, ## id variables
                  hybrid, sex, name, birth_month, litter_size, concep_month, ## birth variables
                  dam_id, dam_name, dam_taxon, ## name of mother
                  sire_id, sire_name, sire_taxon, ## name of father
                  age_at_death_y, age_of_living_y, age_last_verified_y, ## age variables
                  age_max_live_or_dead_y, age_at_wt_y, age_category, ## age variables
                  weight_g, avg_daily_wt_change_g, ## weight variables
                  preg_status, 
                  n_known_offspring, infant_lit_sz_if_preg)) %>%
  ## simplify the names
  dplyr::rename(weight = weight_g,
                avg_d_wt_chg = avg_daily_wt_change_g,
                n_offspring = n_known_offspring)
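The same chaining idea can be tried on a toy table. The sketch below uses made-up names and weights (toy_df is hypothetical) and swaps mutate_at for across(), which dplyr has offered since version 1.0.0 as the replacement for the scoped verbs; it is an alternative spelling, not the code used in this notebook.

```r
library(dplyr)
library(stringr)

# Toy data with made-up names, only to illustrate the chained steps.
toy_df <- tibble::tibble(
  name = c("KANGA", "ROO"),
  dam_name = c("MIRANDA", "KANGA"),
  weight_g = c(1086, 947)
)

toy_clean <- toy_df %>%
  # across() applies the same function to several columns at once
  mutate(across(c(name, dam_name), str_to_title)) %>%
  # simplify a column name, as in the pipeline above
  rename(weight = weight_g)

toy_clean
```

Each step receives the output of the previous one, so no intermediate object needs a name.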

Exploring the data

Reproduction

If we are trying to protect a species, reproduction is one of the most important aspects to understand. With the DLC lemur data, we can estimate fertility rates, reproductive seasons, and the relationship between age, sizes and offspring production.

Fertility rates per taxon

Let’s have a look at the fertility rates of the species we are trying to save:

fertiilty_df <- lemurs_smallts %>%
  dplyr::right_join(lemurs_sppnames_df,., by = "taxon") %>% ## add species names, so we have a complete table
  dplyr::filter(!is.na(infant_lit_sz_if_preg)) %>% ## filter the animals for which this information was available
  dplyr::group_by(dlc_id, species) %>%
  dplyr::summarize(inflt_mean_ind = mean(infant_lit_sz_if_preg)) %>%
  ungroup() %>%
  dplyr::group_by(species) %>%
  dplyr::summarize(inflt_mean = mean(inflt_mean_ind),
                   inflt_sd = sd(inflt_mean_ind),
                   n = n()) %>%
  dplyr::rename(Species = species,
                "Infant litter size (mean)" = inflt_mean,
                "Infant litter size (sd)" = inflt_sd)

We can save this table:

readr::write_csv(fertiilty_df, file = file.path(tables_dir, "fertility_rates.csv"))

Or have it nicely displayed in our html file:

fertiilty_df %>%
   kableExtra::kbl(caption = "Fertility rates (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.")%>%
  kableExtra::kable_styling(c("striped", "hover")) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
Fertility rates (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.

| Species | Infant litter size (mean) | Infant litter size (sd) | n |
|---|---|---|---|
| Cheirogaleus medius | 1.912775 | 0.8543182 | 13 |
| Daubentonia madagascariensis | 1.000000 | 0.0000000 | 7 |
| Eulemur albifrons | 1.000000 | NA | 1 |
| Eulemur collaris | 1.000000 | 0.0000000 | 7 |
| Eulemur coronatus | 1.108333 | 0.2486072 | 10 |
| Eulemur Eulemur | 1.000000 | 0.0000000 | 5 |
| Eulemur flavifrons | 1.051079 | 0.1270108 | 13 |
| Eulemur fulvus | 1.000000 | 0.0000000 | 4 |
| Eulemur macaco | 1.125000 | 0.3535534 | 8 |
| Eulemur mongoz | 1.010417 | 0.0416667 | 16 |
| Eulemur rubriventer | 1.080000 | 0.1788854 | 5 |
| Eulemur rufus | 1.145833 | 0.2736801 | 8 |
| Eulemur sanfordi | 1.000000 | NA | 1 |
| Galago moholi | 1.357143 | 0.4738035 | 4 |
| Hapalemur griseus griseus | 1.123457 | 0.2055889 | 9 |
| Lemur catta | 1.330122 | 0.3794544 | 32 |
| Loris tardigradus | 1.000000 | 0.0000000 | 10 |
| Mircocebus murinus | 2.326840 | 0.8419061 | 22 |
| Mirza coquereli | 1.733333 | 0.4346135 | 5 |
| Nycticebus coucang | 1.000000 | 0.0000000 | 14 |
| Nycticebus pygmaeus | 1.728566 | 0.2791717 | 7 |
| Otolemur garnettii garnettii | 1.136842 | 0.3336840 | 19 |
| Perodicticus potto | 1.000000 | 0.0000000 | 3 |
| Propithecus coquereli | 1.000000 | 0.0000000 | 22 |
| Varecia rubra | 1.998291 | 0.7132938 | 13 |
| Varecia Varecia | 2.000000 | 0.0000000 | 2 |
| Varecia variegata variegata | 1.959375 | 0.7996520 | 16 |
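The fertility summary averages within each individual first and only then across individuals. A minimal sketch with made-up records (toy_df is hypothetical) shows why the order matters when some individuals contribute many more rows than others:

```r
library(dplyr)

# Made-up records: individual "A" was measured many more times than "B".
toy_df <- tibble::tibble(
  dlc_id = c("A", "A", "A", "A", "B"),
  species = "Toy species",
  infant_lit_sz_if_preg = c(1, 1, 1, 1, 3)
)

# Pooled mean: every row counts, so "A" dominates the result.
pooled_mean <- mean(toy_df$infant_lit_sz_if_preg)

# Two-step mean: one value per individual first, then one per species.
two_step <- toy_df %>%
  group_by(dlc_id, species) %>%
  summarize(ind_mean = mean(infant_lit_sz_if_preg), .groups = "drop") %>%
  group_by(species) %>%
  summarize(sp_mean = mean(ind_mean), n = n(), .groups = "drop")

pooled_mean
two_step
```

Here the pooled mean is 1.4, while the two-step mean is 2 with n = 2 individuals, because each individual carries the same weight regardless of how often it was recorded.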

Question: one could argue that this summary is hiding some valuable information. Any guesses?

Main text figures/tables

Seasonality of species

Many species of lemurs are seasonal breeders, meaning that there are specific times of the year when animals look for partners and reproduce. Let’s see if we can detect this in the data.

First, I know that the data contains the dates in numeric form, but it would be nice to have the names of each month in the data, for later plotting. Remember, we are obeying the Grammar of Graphics, so we cannot simply paste tags with the names of months later on.

So, we start with a simple data frame with the relevant information: individual id, taxon, and month of birth.

births_df <- lemurs_smallts %>%
  dplyr::select(dlc_id, taxon, birth_month) %>%
  dplyr::filter(!is.na(birth_month)) %>%
  dplyr::mutate_at(vars(birth_month), 
                   lubridate::month, label = TRUE, 
                   locale = Sys.getlocale(category = "LC_CTYPE")) %>% ## id months
  dplyr::right_join(lemurs_sppnames_df,., by = "taxon") %>% ## id species
  dplyr::arrange(taxon, birth_month) 
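Month labels produced by lubridate depend on the locale that is passed in. As a locale-independent alternative (a sketch on toy values, not what this notebook uses), base R’s month.abb constant can serve as the set of factor levels, which also preserves the calendar order for plotting:

```r
# Month numbers as they could appear in a birth_month column (toy values).
birth_month_num <- c(3, 3, 11, 1)

# month.abb is a base R constant with English abbreviations; using it as the
# level set keeps the calendar order (Jan < Feb < ... < Dec) when plotting.
birth_month_lab <- factor(month.abb[birth_month_num],
                          levels = month.abb, ordered = TRUE)

birth_month_lab
sort(birth_month_lab)
```

Because the result is an ordered factor, ggplot2 would place the months on the axis in calendar order rather than alphabetically.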

Now, let’s count the number of births that happened per species, per month:

birth_season_countdf <- births_df %>%
  unique() %>%
  dplyr::group_by(species, common_name, taxon, birth_month) %>%
  dplyr::summarize(n_births = n()) %>%
  ungroup() %>%
  dplyr::arrange(species, common_name, taxon, birth_month) 
## `summarise()` regrouping output by 'species', 'common_name', 'taxon' (override with `.groups` argument)
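The message above points at the .groups argument. The sketch below, on a made-up table (toy_births is hypothetical), shows two ways of counting rows per group that return an ungrouped result without the message: summarize() with .groups = "drop", and the count() shortcut.

```r
library(dplyr)

# Toy table: one row per birth (made-up species and months).
toy_births <- tibble::tibble(
  species = c("Sp. one", "Sp. one", "Sp. one", "Sp. two"),
  birth_month = c("Mar", "Mar", "Apr", "Mar")
)

# Option 1: summarize() with an explicit .groups, which silences the message
counts_a <- toy_births %>%
  group_by(species, birth_month) %>%
  summarize(n_births = n(), .groups = "drop")

# Option 2: count() groups, counts and ungroups in one call
counts_b <- toy_births %>%
  count(species, birth_month, name = "n_births")

counts_a
```

Both versions give the same counts, so the choice is mostly a matter of how explicit the code should be.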

Let’s say we would like to plot this:

birth_season_countdf  %>%
  ggplot(aes(x = birth_month, y = n_births, fill = species)) +
  geom_bar(alpha=0.6, stat = "identity") + 
  facet_wrap(~species, ncol = 3) +
  labs(x = "Month", y = "Number of births (mean)") +
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 45))

I can define specific aesthetic values to be applied to my plot:

source("results/scripts/custom_aesthetics.R")
birth_season_countdf  %>%
  ggplot(aes(x = birth_month, y = n_births, fill = species)) +
  geom_bar(alpha=0.6, stat = "identity") + 
  theme_lemurs() + 
  facet_wrap(~species, ncol = 3) +
  labs(x = "Month", y = "Number of births (mean)") +
  theme(legend.position = "none", 
        axis.text.x = element_text(angle = 45))
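The contents of custom_aesthetics.R are not shown here, so the following is only a hypothetical sketch of what a custom theme function could look like. The name theme_sketch and all of its settings are assumptions made for illustration; the real theme_lemurs() may be defined quite differently.

```r
library(ggplot2)

# Hypothetical stand-in for a function like theme_lemurs(); the real one
# lives in custom_aesthetics.R and may look quite different.
theme_sketch <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      strip.text = element_text(face = "italic"),  # facet titles in italics
      panel.grid.minor = element_blank(),
      legend.position = "none"
    )
}

# Toy counts (made-up numbers) to try the theme on.
toy_counts <- data.frame(
  birth_month = factor(c("Mar", "Apr"), levels = month.abb),
  n_births = c(4, 7)
)

toy_plot <- ggplot(toy_counts, aes(x = birth_month, y = n_births)) +
  geom_col(alpha = 0.6) +   # geom_col() is geom_bar(stat = "identity")
  theme_sketch()
```

Wrapping the settings in a function keeps every figure in the notebook consistent and keeps the plotting code short.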

Challenge: there is a mistake in this summary. What is it?

Question: If we were talking to a larger audience, we could include the common names of the species in this graph. How would we go about it?

Offspring production

offspring_df <- lemurs_smallts %>%
  dplyr::right_join(lemurs_sppnames_df,., by = "taxon") %>% ## id species
  dplyr::select(year, month, species, taxon, dlc_id, sex,
                litter_size, ## size of litter it was born into
                age_at_wt_y, weight, ## age and weight
                preg_status, ## pregnancy status
                n_offspring, ## total number of offspring produced until that day
                infant_lit_sz_if_preg) ## size of litter, if pregnant

Individual female weight and offspring production

Let’s see if we can find some relationship between pregnant female weight and the size of the litter it is carrying.

First, let’s look into species separately:

offspring_df  %>%
  ## filter only the pregnant females
  dplyr::filter(preg_status == "P") %>%
  ## get their last measurement  while pregnant
  dplyr::group_by(dlc_id) %>%
  dplyr::filter(age_at_wt_y == max(age_at_wt_y)) %>%
  ungroup() %>%
  ggplot(aes(x = weight, y = litter_size))+
  geom_point(alpha = 0.2) +
  facet_wrap(~species, ncol = 3, scales = "free") +
  labs(x = "Weight (g)", y = "Litter size") +
  theme_lemurs()
## Warning: Removed 95 rows containing missing values (geom_point).

This was not very informative, but let’s see if we can at least have a summarized graph:

offspring_df  %>%
  ## filter only the pregnant females
  dplyr::filter(preg_status == "P") %>%
  ## get their last measurement  while pregnant
  dplyr::group_by(dlc_id) %>%
  dplyr::filter(age_at_wt_y == max(age_at_wt_y)) %>%
  ungroup() %>%
  ggplot(aes(x = log(weight), y = species))+
  geom_point(aes(size = infant_lit_sz_if_preg), alpha = 0.2) +
  labs(x = "Weight (log(g))", y = "Species", size = "Infant litter size") +
  theme_lemurs()
## Warning: Removed 7 rows containing missing values (geom_point).

Individual weight and litter size

Is an individual’s weight affected by the size of the litter it was born into? Let’s get each individual’s weight at its youngest recorded age and plot it against the size of the litter it came from (showing males and females separately).

litterweight_df <- offspring_df %>%
  dplyr::group_by(dlc_id) %>%
  dplyr::filter(age_at_wt_y == min(age_at_wt_y)) %>% ## keep the record at the youngest age of each individual
  ungroup()
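Filtering on the minimum age keeps every row that ties for the minimum. If exactly one row per individual is wanted, dplyr’s slice_min() (and slice_max() for the latest record) is an alternative. A sketch on made-up measurements (toy_weights is hypothetical):

```r
library(dplyr)

# Toy measurements: three records for one individual, two for another.
toy_weights <- tibble::tibble(
  dlc_id = c("0001", "0001", "0001", "0002", "0002"),
  age_at_wt_y = c(0.1, 1.5, 3.0, 0.2, 2.2),
  weight = c(90, 1500, 2100, 80, 1900)
)

# Youngest record per individual; with_ties = FALSE returns a single row
first_wt <- toy_weights %>%
  group_by(dlc_id) %>%
  slice_min(age_at_wt_y, n = 1, with_ties = FALSE) %>%
  ungroup()

# Oldest record per individual, for comparison
last_wt <- toy_weights %>%
  group_by(dlc_id) %>%
  slice_max(age_at_wt_y, n = 1, with_ties = FALSE) %>%
  ungroup()

first_wt
```

With with_ties = FALSE, each individual contributes one row even when two measurements share the same age.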

Mean infant weight

litterweight_df %>%
  dplyr::group_by(species) %>%
  dplyr::summarize(weight_mean = mean(weight),
                   weight_sd = sd(weight)) %>%
  dplyr::arrange(weight_mean)  %>%
   kableExtra::kbl(caption = "Infant size (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.")%>%
  kableExtra::kable_styling(c("striped", "hover")) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
## `summarise()` ungrouping output (override with `.groups` argument)
Infant size (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.

| species | weight_mean | weight_sd |
|---|---|---|
| Mircocebus murinus | 46.55263 | 40.60969 |
| Galago moholi | 56.44510 | 65.37406 |
| Loris tardigradus | 88.04091 | 71.92052 |
| Nycticebus pygmaeus | 140.01418 | 207.20038 |
| Cheirogaleus medius | 140.57398 | 112.29466 |
| Mirza coquereli | 224.13714 | 108.32224 |
| Hapalemur griseus griseus | 300.79592 | 345.94202 |
| Daubentonia madagascariensis | 409.59273 | 779.91112 |
| Perodicticus potto | 544.68182 | 476.33900 |
| Eulemur flavifrons | 551.80286 | 815.14909 |
| Nycticebus coucang | 585.27612 | 511.38793 |
| Propithecus coquereli | 611.64595 | 1307.95988 |
| Otolemur garnettii garnettii | 624.86850 | 451.39680 |
| Eulemur rubriventer | 651.52000 | 855.25426 |
| Eulemur mongoz | 756.73619 | 696.07625 |
| Lemur catta | 849.66735 | 990.33820 |
| Eulemur coronatus | 896.58644 | 693.56773 |
| Varecia Varecia | 943.37879 | 1297.88274 |
| Eulemur Eulemur | 1014.67419 | 970.02145 |
| Eulemur rufus | 1227.22245 | 1014.64928 |
| Eulemur macaco | 1238.90137 | 1091.10120 |
| Varecia rubra | 1248.20649 | 1327.63775 |
| Eulemur collaris | 1424.04828 | 1006.32052 |
| Eulemur sanfordi | 1596.78421 | 696.43490 |
| Eulemur fulvus | 1617.04054 | 1008.32868 |
| Varecia variegata variegata | 1621.11523 | 1567.93244 |
| Eulemur albifrons | 1756.97059 | 969.45562 |

Mean litter size

litterweight_df %>%
  dplyr::group_by(species) %>%
  dplyr::summarize(litter_mean = mean(litter_size),
                   litter_sd = sd(litter_size)) %>%
  dplyr::arrange(litter_mean)  %>%
   kableExtra::kbl(caption = "Litter size (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.")%>%
  kableExtra::kable_styling(c("striped", "hover")) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
## `summarise()` ungrouping output (override with `.groups` argument)
Litter size (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.

| species | litter_mean | litter_sd |
|---|---|---|
| Varecia Varecia | 2.333333 | 0.595119 |
| Cheirogaleus medius | NA | NA |
| Daubentonia madagascariensis | NA | NA |
| Eulemur albifrons | NA | NA |
| Eulemur collaris | NA | NA |
| Eulemur coronatus | NA | NA |
| Eulemur Eulemur | NA | NA |
| Eulemur flavifrons | NA | NA |
| Eulemur fulvus | NA | NA |
| Eulemur macaco | NA | NA |
| Eulemur mongoz | NA | NA |
| Eulemur rubriventer | NA | NA |
| Eulemur rufus | NA | NA |
| Eulemur sanfordi | NA | NA |
| Galago moholi | NA | NA |
| Hapalemur griseus griseus | NA | NA |
| Lemur catta | NA | NA |
| Loris tardigradus | NA | NA |
| Mircocebus murinus | NA | NA |
| Mirza coquereli | NA | NA |
| Nycticebus coucang | NA | NA |
| Nycticebus pygmaeus | NA | NA |
| Otolemur garnettii garnettii | NA | NA |
| Perodicticus potto | NA | NA |
| Propithecus coquereli | NA | NA |
| Varecia rubra | NA | NA |
| Varecia variegata variegata | NA | NA |
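The table above is almost entirely NA because mean() and sd() return NA as soon as a single value in the group is missing. A minimal base R sketch on toy values (not the lemur data) shows the behaviour and the effect of na.rm = TRUE:

```r
litter_size <- c(2, 3, NA, 2)  # toy values; NA = litter size not recorded

mean(litter_size)                # NA: a single missing value propagates
mean(litter_size, na.rm = TRUE)  # mean of the 3 recorded values
sd(litter_size, na.rm = TRUE)    # sd of the 3 recorded values

# Worth reporting alongside the mean: how many values it is based on
sum(!is.na(litter_size))
```

Dropping missing values is convenient, but the number of values actually used is then worth keeping in the summary table.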
litterweight_df %>%
  dplyr::group_by(species) %>%
  dplyr::summarize(litter_mean = mean(litter_size, na.rm = TRUE),
                   litter_sd = sd(litter_size, na.rm = TRUE)) %>%
  dplyr::arrange(litter_mean)  %>%
   kableExtra::kbl(caption = "Litter size (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.")%>%
  kableExtra::kable_styling(c("striped", "hover")) %>%
  kableExtra::scroll_box(width = "100%", height = "300px")
## `summarise()` ungrouping output (override with `.groups` argument)
Litter size (mean +- sd) of the species housed at the Duke Lemur Center, in North Carolina, USA.

| species | litter_mean | litter_sd |
|---|---|---|
| Daubentonia madagascariensis | 1.000000 | 0.0000000 |
| Nycticebus coucang | 1.000000 | 0.0000000 |
| Perodicticus potto | 1.000000 | 0.0000000 |
| Propithecus coquereli | 1.000000 | 0.0000000 |
| Eulemur mongoz | 1.051282 | 0.2220001 |
| Loris tardigradus | 1.057143 | 0.2355041 |
| Eulemur rubriventer | 1.100000 | 0.3077935 |
| Hapalemur griseus griseus | 1.100000 | 0.3038218 |
| Eulemur flavifrons | 1.131868 | 0.3402219 |
| Eulemur sanfordi | 1.133333 | 0.3518658 |
| Eulemur rufus | 1.151163 | 0.3603084 |
| Eulemur fulvus | 1.160000 | 0.3741657 |
| Otolemur garnettii garnettii | 1.196078 | 0.3989892 |
| Eulemur coronatus | 1.282609 | 0.4552432 |
| Eulemur Eulemur | 1.288591 | 0.4972261 |
| Eulemur collaris | 1.380000 | 0.4903144 |
| Galago moholi | 1.543478 | 0.5036102 |
| Lemur catta | 1.548673 | 0.5331858 |
| Eulemur macaco | 1.573770 | 0.4986320 |
| Eulemur albifrons | 1.642857 | 0.4972452 |
| Mirza coquereli | 1.818182 | 0.3892495 |
| Nycticebus pygmaeus | 1.918367 | 0.4931504 |
| Varecia variegata variegata | 2.314516 | 0.8101685 |
| Varecia Varecia | 2.333333 | 0.5951190 |
| Cheirogaleus medius | 2.567251 | 0.9074094 |
| Varecia rubra | 2.602837 | 0.7547888 |
| Mircocebus murinus | 2.625000 | 0.8087458 |
litterweight_df  %>%
  dplyr::filter(sex != "ND") %>%
  ggplot(aes(x = log(weight), y = species))+
  geom_point(aes(size = litter_size), alpha = 0.2) +
  facet_wrap(~sex, ncol = 2) +
  labs(x = "Weight (log(g))", y = "Species", size = "Litter size") + 
  theme_lemurs()
## Warning: Removed 374 rows containing missing values (geom_point).

Let’s try a more summarized version of it, this time differentiating males and females.

litterweight_df  %>%
  dplyr::filter(sex != "ND") %>%
  ggplot(aes(x = log(weight), y = species))+
  geom_point(aes(size = litter_size), alpha = 0.2) +
  facet_wrap(~sex, ncol = 2) +
  labs(x = "Weight (log(g))", y = "Species", size = "Litter size") + 
  theme_lemurs()
## Warning: Removed 374 rows containing missing values (geom_point).

Next steps

Try exploring the flights data set, which is provided by the nycflights13 package (it is not included with the basic R download).

install.packages("nycflights13")
library(nycflights13)

R version, the OS and attached or loaded packages:

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=de_DE.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=de_DE.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] nycflights13_1.0.2 lubridate_1.7.9    forcats_0.5.0      stringr_1.4.0     
##  [5] dplyr_1.0.2        purrr_0.3.4        readr_1.4.0        tidyr_1.1.2       
##  [9] tibble_3.1.1       ggplot2_3.3.3      tidyverse_1.3.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6        assertthat_0.2.1  digest_0.6.27     utf8_1.2.1       
##  [5] R6_2.5.0          cellranger_1.1.0  backports_1.1.10  reprex_0.3.0     
##  [9] evaluate_0.14     httr_1.4.2        highr_0.9         pillar_1.6.0     
## [13] rlang_0.4.11      curl_4.3          readxl_1.3.1      rstudioapi_0.13  
## [17] jquerylib_0.1.4   blob_1.2.1        rmarkdown_2.11    labeling_0.4.2   
## [21] webshot_0.5.2     munsell_0.5.0     broom_0.7.2       compiler_4.0.3   
## [25] modelr_0.1.8      xfun_0.22         pkgconfig_2.0.3   htmltools_0.5.1.1
## [29] tidyselect_1.1.0  fansi_0.4.2       viridisLite_0.4.0 crayon_1.4.1     
## [33] dbplyr_1.4.4      withr_2.4.2       grid_4.0.3        jsonlite_1.7.2   
## [37] gtable_0.3.0      lifecycle_1.0.0   DBI_1.1.1         magrittr_2.0.1   
## [41] scales_1.1.1      pals_1.7          cli_2.5.0         stringi_1.5.3    
## [45] farver_2.1.0      mapproj_1.2.7     fs_1.5.0          xml2_1.3.2       
## [49] bslib_0.2.4       ellipsis_0.3.2    generics_0.0.2    vctrs_0.3.8      
## [53] kableExtra_1.3.1  tools_4.0.3       dichromat_2.0-0   glue_1.4.2       
## [57] maps_3.4.0        hms_0.5.3         yaml_2.2.1        colorspace_2.0-2 
## [61] rvest_0.3.6       knitr_1.33        haven_2.3.1       sass_0.3.1