German Electricity Data :electric_plug:¶

  • Data Source: SMARD Web Portal
  • Task: forecasting (using the past to predict the future)
  • Language: julia

First steps:¶

  • load the data (with appropriate column types) :heavy_check_mark:
  • split the data into train/val/test sets
  • plot the time series (at different temporal resolution)
  • calculate the mean production per month
  • is there a seasonal pattern (what kinds of seasonalities)
  • make a prediction for the validation set
  • evaluate the quality of the prediction
In [1]:
ENV["COLUMNS"] = 1000; # print more columns of tables
In [2]:
using Random
In [3]:
Random.seed!(42)
"Andi Kerstin Chris Caro Jana" |> split |> shuffle |> x -> join(x," → ")
Out[3]:
"Andi → Chris → Caro → Kerstin → Jana"

1. Data loading¶

  • load the hourly data from the file Realisierte_Erzeugung_201501010000_202305301300_Stunde.csv as a table into julia :heavy_check_mark:
  • make sure, that data is interpreted and typed correctly :heavy_check_mark:
  • clean header names :heavy_check_mark:
  • create column Time containing the combined info from Datum and Anfang
In [6]:
using Dates
using CSV
using DataFrames
In [7]:
df = CSV.read("Realisierte_Erzeugung_201501010000_202305301300_Stunde.csv",
    DataFrame,
    delim=";",
    decimal=',',
    groupmark='.',
    missingstring="-",
    dateformat=Dict("Datum"=>"dd.mm.yyyy"),
    types=Dict("Datum"=>Date)
);
In [8]:
names(df)
Out[8]:
15-element Vector{String}:
 "Datum"
 "Anfang"
 "Ende"
 "Biomasse [MWh] Berechnete Auflösungen"
 "Wasserkraft [MWh] Berechnete Auflösungen"
 "Wind Offshore [MWh] Berechnete Auflösungen"
 "Wind Onshore [MWh] Berechnete Auflösungen"
 "Photovoltaik [MWh] Berechnete Auflösungen"
 "Sonstige Erneuerbare [MWh] Berechnete Auflösungen"
 "Kernenergie [MWh] Berechnete Auflösungen"
 "Braunkohle [MWh] Berechnete Auflösungen"
 "Steinkohle [MWh] Berechnete Auflösungen"
 "Erdgas [MWh] Berechnete Auflösungen"
 "Pumpspeicher [MWh] Berechnete Auflösungen"
 "Sonstige Konventionelle [MWh] Berechnete Auflösungen"
In [9]:
clean_colname(x) = replace(split(x, " [")[1], " " => "_")
clean_names = clean_colname.(names(df))
Out[9]:
15-element Vector{String}:
 "Datum"
 "Anfang"
 "Ende"
 "Biomasse"
 "Wasserkraft"
 "Wind_Offshore"
 "Wind_Onshore"
 "Photovoltaik"
 "Sonstige_Erneuerbare"
 "Kernenergie"
 "Braunkohle"
 "Steinkohle"
 "Erdgas"
 "Pumpspeicher"
 "Sonstige_Konventionelle"
In [10]:
rename!(df, clean_names)
Out[10]:
73716×15 DataFrame
73691 rows omitted
RowDatumAnfangEndeBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
DateTimeTimeFloat64?Float64?Float64Float64Float64Float64?Float64Float64?Float64?Float64?Float64?Float64?
12015-01-0100:00:0001:00:004024.251158.25516.58128.00.0133.010710.515687.23219.751226.251525.754909.25
22015-01-0101:00:0002:00:003982.751188.0516.258297.50.0122.511086.215321.82351.25870.751079.254932.75
32015-01-0102:00:0003:00:004019.51139.25514.08540.00.093.011026.214817.52227.0809.5787.05041.75
42015-01-0103:00:0004:00:004040.751122.5517.758552.00.086.511027.814075.02339.75821.0287.755084.0
52015-01-0104:00:0005:00:004037.751112.0519.758643.50.086.510962.214115.02461.5831.25346.755070.75
62015-01-0105:00:0006:00:004028.251107.75520.08711.750.086.7510696.013474.22217.75851.0765.55096.75
72015-01-0106:00:0007:00:004013.251111.75521.59167.250.087.010299.512403.82373.25868.25414.55153.0
82015-01-0107:00:0008:00:004012.751113.75520.259811.00.087.010035.212062.52491.0876.0582.55161.0
92015-01-0108:00:0009:00:003999.751107.5525.259683.053.087.010245.812405.02530.25888.25750.55393.5
102015-01-0109:00:0010:00:004016.251121.0527.09501.75773.2585.7510060.212798.82386.25891.5387.05884.0
112015-01-0110:00:0011:00:004007.751122.0525.010025.02116.7582.010082.212728.52557.0888.25176.256064.0
122015-01-0111:00:0012:00:004011.751146.25527.2510862.53364.2582.09949.012451.02657.75876.5884.55914.75
132015-01-0112:00:0013:00:004014.01139.75527.7511575.24198.2582.010119.012464.52937.75874.25705.05778.25
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
737052023-05-3001:00:0002:00:004581.52033.753891.7511488.50.0119.50.03523.51288.53984.75769.251146.0
737062023-05-3002:00:0003:00:004551.01963.53725.510141.00.0119.00.03484.251263.753592.0604.251141.75
737072023-05-3003:00:0004:00:004503.252090.53635.259170.250.0119.00.03490.01272.54030.5578.751132.5
737082023-05-3004:00:0005:00:004454.252105.253635.758735.752.25120.00.03678.251275.04895.5433.251136.0
737092023-05-3005:00:0006:00:004406.752134.752940.08419.75407.75120.250.04087.51289.255723.75404.251142.25
737102023-05-3006:00:0007:00:004413.252227.52272.57531.02888.75128.250.05010.751362.06525.252803.751144.5
737112023-05-3007:00:0008:00:004468.02276.51916.255507.258343.25137.00.05275.51173.756905.755712.751156.75
737122023-05-3008:00:0009:00:004503.52152.51643.253647.7515893.2138.00.05181.751196.256946.55079.51132.0
737132023-05-3009:00:0010:00:004544.52156.251285.02900.523612.5135.00.04707.51114.756384.52122.751115.0
737142023-05-3010:00:0011:00:004568.252085.251241.252729.530341.8121.250.03978.51095.54802.51279.01096.25
737152023-05-3011:00:0012:00:004618.752068.251053.253115.035196.2119.00.03759.751151.53762.25774.01096.5
737162023-05-3012:00:0013:00:004612.251967.25781.253496.2537030.0119.00.03765.01296.753240.0196.751099.75

Todo 2nd session¶

  • Create Time column
  • Split into training/validation/test set
  • Plot the training set with hourly resolution
  • Aggregate by day/month
  • Plot aggregated data
Bonus task
  • did you observe something peculiar with dates and times?
  • look at the hourly data of 2017-10-29 for an example
  • what happened there?
  • are there more days with a similar "problem"?
  • is this a problem?
  • how can we solve this?
In [11]:
using DataFramesMeta
In [63]:
datetime_vec = Dates.DateTime.(df.Datum, df.Anfang)
datetime_data = insertcols!(df,1, :Datetime => datetime_vec ) 
first(datetime_data,2)
ArgumentError: Column Datetime is already present in the data frame which is not allowed when `makeunique=true`

Stacktrace:
 [1] insertcols!(df::DataFrame, col::Int64, name_cols::Pair{Symbol, Vector{DateTime}}; after::Bool, makeunique::Bool, copycols::Bool)
   @ DataFrames C:\Users\ank10ki\.julia\packages\DataFrames\LteEl\src\abstractdataframe\abstractdataframe.jl:3012
 [2] insertcols!(df::DataFrame, col::Int64, name_cols::Pair{Symbol, Vector{DateTime}})
   @ DataFrames C:\Users\ank10ki\.julia\packages\DataFrames\LteEl\src\abstractdataframe\abstractdataframe.jl:2978
 [3] top-level scope
   @ In[63]:2
In [28]:
datetime_clean = select(datetime_data, Not([:Datum, :Anfang, :Ende]))
first(datetime_clean)
Out[28]:
DataFrameRow (13 columns)
RowDatetimeBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
DateTimeFloat64?Float64?Float64Float64Float64Float64?Float64Float64?Float64?Float64?Float64?Float64?
12015-01-01T00:00:004024.251158.25516.58128.00.0133.010710.515687.23219.751226.251525.754909.25
In [29]:
last(datetime_clean)
Out[29]:
DataFrameRow (13 columns)
RowDatetimeBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
DateTimeFloat64?Float64?Float64Float64Float64Float64?Float64Float64?Float64?Float64?Float64?Float64?
737162023-05-30T12:00:004612.251967.25781.253496.2537030.0119.00.03765.01296.753240.0196.751099.75
In [31]:
using CairoMakie
using Chain
In [48]:
data_y = @chain datetime_clean begin
    @transform(:Year = Dates.year.(:Datetime))
    groupby(:Year)
    @combine(:Biomasse = sum(:Biomasse), :Wasserkraft = sum(:Wasserkraft), :Wind_Offshore = sum(:Wind_Offshore), :Wind_Onshore = sum(:Wind_Onshore), :Photovoltaik = sum(:Photovoltaik), :Sonstige_Erneuerbare = sum(:Sonstige_Erneuerbare), :Kernenergie = sum(:Kernenergie), :Braunkohle = sum(:Braunkohle) ,:Steinkohle = sum(:Steinkohle), :Erdgas = sum(:Erdgas), :Pumpspeicher = sum(:Pumpspeicher), :Sonstige_Konventionelle = sum(:Sonstige_Konventionelle))
end
Out[48]:
9×13 DataFrame
RowYearBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
Int64Float64?Float64?Float64Float64Float64Float64?Float64Float64?Float64?Float64?Float64?Float64?
120153.48479e71.37193e78.13438e66.90421e73.48794e79.4563e58.43635e71.33798e88.11387e71.52259e79.93722e65.29686e7
220163.97455e71.80759e71.20931e76.52747e73.45389e71.40346e68.02425e71.30376e88.089e72.28888e78.58761e64.73436e7
320174.03284e71.55901e71.74142e78.51901e73.58832e71.75439e67.2214e71.29287e86.60055e72.55804e79.29078e64.87431e7
420184.00905e71.52577e71.90684e78.92706e74.12343e71.53113e67.18417e71.28362e87.15453e74.28782e79.13598e61.2838e7
520194.03876e71.58305e72.43828e79.98789e74.19148e71.46438e67.10422e71.02729e84.78153e75.46197e79.02058e61.27306e7
620204.08223e71.55834e72.68832e71.03104e84.58216e71.60774e66.09239e78.3374e73.4872e75.9126e71.15287e71.29299e7
720213.94602e71.44572e72.40103e78.94079e74.66069e71.54281e66.5406e79.82022e75.18418e75.24051e78.72144e61.29722e7
820223.94697e71.2382e72.47478e71.00563e85.52988e71.22348e63.28245e71.03526e86.28894e75.34899e71.06033e79.76583e6
92023missingmissing1.00579e75.17461e71.96874e7missing6.74074e6missingmissingmissingmissingmissing
In [64]:
fig1 =lines(data_y.Year, data_y.Biomasse, label = "Biomasse")
for i in names(data_y)[3:end]
    lines!(fig1.axis, data_y.Year, data_y[!,i], label =i)
end
axislegend(fig1.axis, position = :lb)
fig1
Out[64]:
In [50]:
?
Out[50]:
13-element Vector{String}:
 "Year"
 "Biomasse"
 "Wasserkraft"
 "Wind_Offshore"
 "Wind_Onshore"
 "Photovoltaik"
 "Sonstige_Erneuerbare"
 "Kernenergie"
 "Braunkohle"
 "Steinkohle"
 "Erdgas"
 "Pumpspeicher"
 "Sonstige_Konventionelle"
In [0]: