Global Power Plant Database

https://github.com/wri/global-power-plant-database/

Initial Exploration with Power BI

Thanks to Timo for the nice introduction and demonstration of Power BI. Power BI is freely available for personal use: https://www.microsoft.com/en-US/download/details.aspx?id=58494

Questions:

  • Generic
    • What kind of information is stored in the table?
    • How much data is missing?
    • Is the dataset clean or are there any clear outliers?
  • Specific
    • Which country has the highest number of powerplants (how many, top 10)?
    • Which is the country with the highest power preduction (MW) per continent (how much, top 5)
    • What is the overall proportion of each energy source (by plant number, by MW produced)
    • Which is the greenest country (highest fraction of power from renewable sources)?
    • How does the distribution of power plants look on a worldmap (by type, interactive)?

Rotation

  1. Markus
  2. Stefan
  3. Timo
In [1]:
library(tidyverse)
Warning message in system("timedatectl", intern = TRUE):
“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

 ggplot2 3.3.5      purrr   0.3.4
 tibble  3.1.2      dplyr   1.0.7
 tidyr   1.1.3      stringr 1.4.0
 readr   1.4.0      forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
 dplyr::filter() masks stats::filter()
 dplyr::lag()    masks stats::lag()

In [2]:
power_plants <- readr::read_csv('https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv')
── Column specification ────────────────────────────────────────────────────────
cols(
  .default = col_character(),
  capacity_mw = col_double(),
  latitude = col_double(),
  longitude = col_double(),
  other_fuel3 = col_logical(),
  commissioning_year = col_double(),
  wepp_id = col_double(),
  year_of_capacity_data = col_double(),
  generation_gwh_2013 = col_double(),
  generation_gwh_2014 = col_double(),
  generation_gwh_2015 = col_double(),
  generation_gwh_2016 = col_double(),
  generation_gwh_2017 = col_double(),
  estimated_generation_gwh = col_double()
)
 Use `spec()` for the full column specifications.


Warning message:
“194 parsing failures.
  row     col               expected                    actual                                                                                                                       file
 6041 wepp_id no trailing characters 1071984|1082331           'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'
13964 wepp_id no trailing characters 1013223|1085384           'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'
13992 wepp_id no trailing characters 1057928|1064452           'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'
14567 wepp_id no trailing characters 1024905|1060668           'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'
14733 wepp_id no trailing characters 1074142| 1074143| 1030932 'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'
..... ....... ...................... ......................... ..........................................................................................................................
See problems(...) for more details.
”

What kind of information is in the table?

In [3]:
str(power_plants)
spec_tbl_df [33,643 × 25] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ country                 : chr [1:33643] "AFG" "AFG" "AFG" "AFG" ...
 $ country_long            : chr [1:33643] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ name                    : chr [1:33643] "Kajaki Hydroelectric Power Plant Afghanistan" "Kandahar DOG" "Kandahar JOL" "Mahipar Hydroelectric Power Plant Afghanistan" ...
 $ gppd_idnr               : chr [1:33643] "GEODB0040538" "WKS0070144" "WKS0071196" "GEODB0040541" ...
 $ capacity_mw             : num [1:33643] 33 10 10 66 100 ...
 $ latitude                : num [1:33643] 32.3 31.7 31.6 34.6 34.6 ...
 $ longitude               : num [1:33643] 65.1 65.8 65.8 69.5 69.7 ...
 $ primary_fuel            : chr [1:33643] "Hydro" "Solar" "Solar" "Hydro" ...
 $ other_fuel1             : chr [1:33643] NA NA NA NA ...
 $ other_fuel2             : chr [1:33643] NA NA NA NA ...
 $ other_fuel3             : logi [1:33643] NA NA NA NA NA NA ...
 $ commissioning_year      : num [1:33643] NA NA NA NA NA ...
 $ owner                   : chr [1:33643] NA NA NA NA ...
 $ source                  : chr [1:33643] "GEODB" "Wiki-Solar" "Wiki-Solar" "GEODB" ...
 $ url                     : chr [1:33643] "http://globalenergyobservatory.org" "https://www.wiki-solar.org" "https://www.wiki-solar.org" "http://globalenergyobservatory.org" ...
 $ geolocation_source      : chr [1:33643] "GEODB" "Wiki-Solar" "Wiki-Solar" "GEODB" ...
 $ wepp_id                 : num [1:33643] 1009793 NA NA 1009795 1009797 ...
 $ year_of_capacity_data   : num [1:33643] 2017 NA NA 2017 2017 ...
 $ generation_gwh_2013     : num [1:33643] NA NA NA NA NA NA NA NA NA NA ...
 $ generation_gwh_2014     : num [1:33643] NA NA NA NA NA NA NA NA NA NA ...
 $ generation_gwh_2015     : num [1:33643] NA NA NA NA NA NA NA NA NA NA ...
 $ generation_gwh_2016     : num [1:33643] NA NA NA NA NA NA NA NA NA NA ...
 $ generation_gwh_2017     : num [1:33643] NA NA NA NA NA NA NA NA NA NA ...
 $ generation_data_source  : chr [1:33643] NA NA NA NA ...
 $ estimated_generation_gwh: num [1:33643] NA NA NA NA NA ...
 - attr(*, "problems")= tibble [194 × 5] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:194] 6041 13964 13992 14567 14733 14896 14925 16074 16137 16327 ...
  ..$ col     : chr [1:194] "wepp_id" "wepp_id" "wepp_id" "wepp_id" ...
  ..$ expected: chr [1:194] "no trailing characters" "no trailing characters" "no trailing characters" "no trailing characters" ...
  ..$ actual  : chr [1:194] "1071984|1082331" "1013223|1085384" "1057928|1064452" "1024905|1060668" ...
  ..$ file    : chr [1:194] "'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'" "'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'" "'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'" "'https://raw.githubusercontent.com/wri/global-power-plant-database/master/output_database/global_power_plant_database.csv'" ...
 - attr(*, "spec")=
  .. cols(
  ..   country = col_character(),
  ..   country_long = col_character(),
  ..   name = col_character(),
  ..   gppd_idnr = col_character(),
  ..   capacity_mw = col_double(),
  ..   latitude = col_double(),
  ..   longitude = col_double(),
  ..   primary_fuel = col_character(),
  ..   other_fuel1 = col_character(),
  ..   other_fuel2 = col_character(),
  ..   other_fuel3 = col_logical(),
  ..   commissioning_year = col_double(),
  ..   owner = col_character(),
  ..   source = col_character(),
  ..   url = col_character(),
  ..   geolocation_source = col_character(),
  ..   wepp_id = col_double(),
  ..   year_of_capacity_data = col_double(),
  ..   generation_gwh_2013 = col_double(),
  ..   generation_gwh_2014 = col_double(),
  ..   generation_gwh_2015 = col_double(),
  ..   generation_gwh_2016 = col_double(),
  ..   generation_gwh_2017 = col_double(),
  ..   generation_data_source = col_character(),
  ..   estimated_generation_gwh = col_double()
  .. )

Missing Values

In [4]:
power_plants %>% select(everything()) %>%
  summarise_all(list(~sum(is.na(.)))) %>%
  pivot_longer(country:estimated_generation_gwh)%>%filter(value<1000)
Out[4]:
A tibble: 11 × 2
namevalue
<chr><int>
country 0
country_long 0
name 0
gppd_idnr 0
capacity_mw 0
latitude 0
longitude 0
primary_fuel 0
source 15
url 18
geolocation_source419

Outliers? Distribution

In [5]:
ggplot(data=power_plants) + geom_density(aes(x=capacity_mw)) + scale_x_log10()
Out[5]:

Top power plant and producers (country)

In [6]:
power_plants %>% filter(capacity_mw==max(capacity_mw))
Out[6]:
A spec_tbl_df: 1 × 25
countrycountry_longnamegppd_idnrcapacity_mwlatitudelongitudeprimary_fuelother_fuel1other_fuel2geolocation_sourcewepp_idyear_of_capacity_datageneration_gwh_2013generation_gwh_2014generation_gwh_2015generation_gwh_2016generation_gwh_2017generation_data_sourceestimated_generation_gwh
<chr><chr><chr><chr><dbl><dbl><dbl><chr><chr><chr><chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><chr><dbl>
CHNChinaThree Gorges DamWRI10004522250030.8235111.0032HydroNANANA1012216NANANANANANANA92452.57
In [7]:
power_plants %>% count(country_long, sort=T) %>% head(10)
Out[7]:
A tibble: 10 × 2
country_longn
<chr><int>
United States of America8688
China 4235
United Kingdom 2603
Brazil 2360
France 2155
India 1590
Germany 1309
Canada 1159
Spain 829
Russia 545
In [8]:
power_plants %>% group_by(country_long,)  %>% summarize(capacity = sum(capacity_mw))  %>% arrange(-capacity)  %>% head(20)
Out[8]:
A tibble: 20 × 2
country_longcapacity
<chr><dbl>
China 1415067.38
United States of America1193670.75
India 311713.71
Russia 228220.05
Japan 215365.85
Brazil 147589.27
Canada 143578.70
Germany 112040.37
France 110615.93
South Korea 99472.68
United Kingdom 87946.78
Saudi Arabia 84341.55
Spain 79035.07
Italy 73095.06
Australia 68841.97
Mexico 62365.90
Iran 61362.10
Turkey 51444.70
South Africa 50422.70
Indonesia 48751.42

Include primary fuel type

In [9]:
unique(power_plants$primary_fuel)
Out[9]:
  1. 'Hydro'
  2. 'Solar'
  3. 'Gas'
  4. 'Other'
  5. 'Oil'
  6. 'Wind'
  7. 'Nuclear'
  8. 'Coal'
  9. 'Waste'
  10. 'Biomass'
  11. 'Wave and Tidal'
  12. 'Petcoke'
  13. 'Geothermal'
  14. 'Cogeneration'
  15. 'Storage'
In [10]:
power_plants %>% 
    group_by(country_long,primary_fuel)  %>% 
    summarize(capacity = sum(capacity_mw), number_powerplants=n()) %>% arrange(-capacity) %>% head
`summarise()` has grouped output by 'country_long'. You can override using the `.groups` argument.

Out[10]:
A grouped_df: 6 × 4
country_longprimary_fuelcapacitynumber_powerplants
<chr><chr><dbl><int>
China Coal 955718.0 946
United States of AmericaGas 525687.91741
United States of AmericaCoal 282990.1 338
China Hydro259025.6 947
India Coal 200503.2 254
Russia Gas 106722.6 258
In [11]:
options(repr.plot.res = 250, repr.plot.width = 14)
In [12]:
power_plants %>% 
    mutate(
        country_long = fct_lump(country_long, 9),
        primary_fuel = fct_lump(primary_fuel, 8, w=capacity_mw)
    ) %>%
    group_by(country_long,primary_fuel)  %>% 
    summarize(capacity = sum(capacity_mw), number_powerplants=n()) %>% arrange(-capacity) %>%
    ungroup() %>%
    mutate(country_long = fct_reorder(country_long, capacity, .fun=sum)) %>%
    ggplot(aes(y=country_long, x=capacity, fill=primary_fuel)) + 
       geom_col() +
       scale_fill_viridis_d(option = "F") +
       theme_minimal() +
       labs(y="Country", x = "Capacity") #+ facet_wrap(~primary_fuel, scale="free_x")
`summarise()` has grouped output by 'country_long'. You can override using the `.groups` argument.

Out[12]:
In [13]:
power_plants %>% 
    mutate(
        country_long = fct_lump(country_long, 9),
        #primary_fuel = fct_lump(primary_fuel, 8, w=capacity_mw)
    ) %>%
    group_by(country_long,primary_fuel)  %>% 
    summarize(capacity = sum(capacity_mw), number_powerplants=n()) %>% arrange(-capacity) %>%
    ungroup() %>%
    mutate(primary_fuel = fct_reorder(primary_fuel, capacity, .fun=sum)) %>%
    ggplot(aes(y=primary_fuel, x=capacity, fill=primary_fuel)) + 
       geom_col() +
       scale_fill_viridis_d(option = "F", direction = -1) +
       theme_minimal() +
       labs(y="Country", x = "Capacity")
`summarise()` has grouped output by 'country_long'. You can override using the `.groups` argument.

Out[13]:
In [14]:
power_plants %>% 
    group_by(country_long,primary_fuel)  %>% 
    summarize(number_powerplants=n()) %>%
    pivot_wider(names_from=primary_fuel, values_from=number_powerplants, values_fill=0) %>%
    identity %>% head
`summarise()` has grouped output by 'country_long'. You can override using the `.groups` argument.

Out[14]:
A grouped_df: 6 × 16
country_longGasHydroSolarOtherOilWindCoalNuclearBiomassWasteWave and TidalPetcokeGeothermalCogenerationStorage
<chr><int><int><int><int><int><int><int><int><int><int><int><int><int><int><int>
Afghanistan 1 6 20 0 0000000000
Albania 0 7 01 0 0000000000
Algeria 31 1270 0 0000000000
Angola 3 5 00 6 0000000000
Antarctica 0 0 00 1 1000000000
Argentina 5750 729612930000000