German Electricity Data :electric_plug:¶

  • Data Source: SMARD Web Portal
  • Task: forecasting (using the past to predict the future)
  • Language: julia

First steps:¶

  • load the data (with appropriate column types) :heavy_check_mark:
  • split the data into train/val/test sets
  • plot the time series (at different temporal resolution)
  • calculate the mean production per month
  • is there a seasonal pattern (what kinds of seasonalities)
  • make a prediction for the validation set
  • evaluate the quality of the prediction
In [0]:
ENV["COLUMNS"] = 1000; # print more columns of tables
In [0]:
using Random
In [0]:
Random.seed!(42)
"Andi Kerstin Chris Caro Jana" |> split |> shuffle |> x -> join(x," → ")

1. Data loading¶

  • load the hourly data from the file Realisierte_Erzeugung_201501010000_202305301300_Stunde.csv as a table into julia :heavy_check_mark:
  • make sure, that data is interpreted and typed correctly :heavy_check_mark:
  • clean header names :heavy_check_mark:
  • create column Time containing the combined info from Datum and Anfang
In [6]:
using Dates
using CSV
using DataFrames
using DataFramesMeta
using Chain
using Statistics
using CairoMakie
In [7]:
df = CSV.read("../data_dojo_22/Realisierte_Erzeugung_201501010000_202305301300_Stunde.csv",
    DataFrame,
    delim=";",
    decimal=',',
    groupmark='.',
    missingstring="-",
    dateformat=Dict("Datum"=>"dd.mm.yyyy"),
    types=Dict("Datum"=>Date)
);
In [8]:
clean_colname(x) = replace(split(x, " [")[1], " " => "_")
clean_names = clean_colname.(names(df));
In [9]:
rename!(df, clean_names)
Out[9]:
73716×15 DataFrame
73691 rows omitted
RowDatumAnfangEndeBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
DateTimeTimeFloat64?Float64?Float64Float64Float64Float64?Float64Float64?Float64?Float64?Float64?Float64?
12015-01-0100:00:0001:00:004024.251158.25516.58128.00.0133.010710.515687.23219.751226.251525.754909.25
22015-01-0101:00:0002:00:003982.751188.0516.258297.50.0122.511086.215321.82351.25870.751079.254932.75
32015-01-0102:00:0003:00:004019.51139.25514.08540.00.093.011026.214817.52227.0809.5787.05041.75
42015-01-0103:00:0004:00:004040.751122.5517.758552.00.086.511027.814075.02339.75821.0287.755084.0
52015-01-0104:00:0005:00:004037.751112.0519.758643.50.086.510962.214115.02461.5831.25346.755070.75
62015-01-0105:00:0006:00:004028.251107.75520.08711.750.086.7510696.013474.22217.75851.0765.55096.75
72015-01-0106:00:0007:00:004013.251111.75521.59167.250.087.010299.512403.82373.25868.25414.55153.0
82015-01-0107:00:0008:00:004012.751113.75520.259811.00.087.010035.212062.52491.0876.0582.55161.0
92015-01-0108:00:0009:00:003999.751107.5525.259683.053.087.010245.812405.02530.25888.25750.55393.5
102015-01-0109:00:0010:00:004016.251121.0527.09501.75773.2585.7510060.212798.82386.25891.5387.05884.0
112015-01-0110:00:0011:00:004007.751122.0525.010025.02116.7582.010082.212728.52557.0888.25176.256064.0
122015-01-0111:00:0012:00:004011.751146.25527.2510862.53364.2582.09949.012451.02657.75876.5884.55914.75
132015-01-0112:00:0013:00:004014.01139.75527.7511575.24198.2582.010119.012464.52937.75874.25705.05778.25
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
737052023-05-3001:00:0002:00:004581.52033.753891.7511488.50.0119.50.03523.51288.53984.75769.251146.0
737062023-05-3002:00:0003:00:004551.01963.53725.510141.00.0119.00.03484.251263.753592.0604.251141.75
737072023-05-3003:00:0004:00:004503.252090.53635.259170.250.0119.00.03490.01272.54030.5578.751132.5
737082023-05-3004:00:0005:00:004454.252105.253635.758735.752.25120.00.03678.251275.04895.5433.251136.0
737092023-05-3005:00:0006:00:004406.752134.752940.08419.75407.75120.250.04087.51289.255723.75404.251142.25
737102023-05-3006:00:0007:00:004413.252227.52272.57531.02888.75128.250.05010.751362.06525.252803.751144.5
737112023-05-3007:00:0008:00:004468.02276.51916.255507.258343.25137.00.05275.51173.756905.755712.751156.75
737122023-05-3008:00:0009:00:004503.52152.51643.253647.7515893.2138.00.05181.751196.256946.55079.51132.0
737132023-05-3009:00:0010:00:004544.52156.251285.02900.523612.5135.00.04707.51114.756384.52122.751115.0
737142023-05-3010:00:0011:00:004568.252085.251241.252729.530341.8121.250.03978.51095.54802.51279.01096.25
737152023-05-3011:00:0012:00:004618.752068.251053.253115.035196.2119.00.03759.751151.53762.25774.01096.5
737162023-05-3012:00:0013:00:004612.251967.25781.253496.2537030.0119.00.03765.01296.753240.0196.751099.75

Todo 2nd session¶

  • Create Time column
  • Split into training/validation/test set
  • Plot the training set with hourly resolution
  • Aggregate by day/month
  • Plot aggregated data
Bonus task
  • did you observe something peculiar with dates and times?
  • look at the hourly data of 2017-10-29 for an example
  • what happened there?
  • are there more days with a similar "problem"?
  • is this a problem?
  • how can we solve this?
In [10]:
datetime_vec = Dates.DateTime.(df.Datum, df.Anfang);
datetime_data = insertcols!(df,1, :Datetime => datetime_vec );
datetime_clean = select(datetime_data, Not([:Datum, :Anfang, :Ende]));
In [11]:
train = filter(:Datetime => <(Dates.Date(2021)), datetime_clean);
val = filter(:Datetime => d -> Dates.year(d)==2021, datetime_clean);
test = filter(:Datetime => >(Dates.Date(2021)), datetime_clean);
In [12]:
size(train), size(val), size(test)
Out[12]:
((52608, 13), (8760, 13), (21107, 13))
In [13]:
data_y = @chain train begin
    @transform(:Year = Dates.year.(:Datetime))
    groupby(:Year)
    @combine(:Biomasse = sum(:Biomasse), :Wasserkraft = sum(:Wasserkraft), :Wind_Offshore = sum(:Wind_Offshore), :Wind_Onshore = sum(:Wind_Onshore), :Photovoltaik = sum(:Photovoltaik), :Sonstige_Erneuerbare = sum(:Sonstige_Erneuerbare), :Kernenergie = sum(:Kernenergie), :Braunkohle = sum(:Braunkohle) ,:Steinkohle = sum(:Steinkohle), :Erdgas = sum(:Erdgas), :Pumpspeicher = sum(:Pumpspeicher), :Sonstige_Konventionelle = sum(:Sonstige_Konventionelle))
end
Out[13]:
6×13 DataFrame
RowYearBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
Int64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64Float64
120153.48479e71.37193e78.13438e66.90421e73.48794e79.4563e58.43635e71.33798e88.11387e71.52259e79.93722e65.29686e7
220163.97455e71.80759e71.20931e76.52747e73.45389e71.40346e68.02425e71.30376e88.089e72.28888e78.58761e64.73436e7
320174.03284e71.55901e71.74142e78.51901e73.58832e71.75439e67.2214e71.29287e86.60055e72.55804e79.29078e64.87431e7
420184.00905e71.52577e71.90684e78.92706e74.12343e71.53113e67.18417e71.28362e87.15453e74.28782e79.13598e61.2838e7
520194.03876e71.58305e72.43828e79.98789e74.19148e71.46438e67.10422e71.02729e84.78153e75.46197e79.02058e61.27306e7
620204.08223e71.55834e72.68832e71.03104e84.58216e71.60774e66.09239e78.3374e73.4872e75.9126e71.15287e71.29299e7
In [14]:
fig1 =lines(data_y.Year, data_y.Biomasse, label = "Biomasse")
for i in names(data_y)[3:end]
    lines!(fig1.axis, data_y.Year, data_y[!,i], label =i)
end
axislegend(fig1.axis, position = :lb)
fig1
Out[14]:

Todos 3rd Session¶

  • Select one of the types (e.g. Photovoltaik, Erdgas or Wind_Onshore)
  • What kinds of seasonality are detectable?
  • Is there a general trend on top of the seasonal pattern(s)?
  • Can you use this information to manually craft a prediction model?

Bonus:

  • How accurately does your manually created model predict the values in the validation set?

See the results in the R notebook

Todos 4th Session¶

  • Create a manual prediction model (with hard-coded rules) for "Photovoltaik"
  • How accurately does your manually created model predict the values in the validation set?

Bonus:

  • Employ a more sophisticated prediction model from an existing package for time series analysis
  • How does the prediction of that model compare to your manual model?
In [15]:
photo_test = test.Photovoltaik
Out[15]:
21107-element Vector{Float64}:
     0.0
     0.0
     0.0
     0.0
     0.0
     0.0
     0.0
   194.0
  1363.75
  2986.5
  4237.5
  4826.75
  4234.25
     ⋮
     0.0
     0.0
     0.0
     2.25
   407.75
  2888.75
  8343.25
 15893.25
 23612.5
 30341.75
 35196.25
 37030.0
In [16]:
first(train)
Out[16]:
DataFrameRow (13 columns)
RowDatetimeBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
DateTimeFloat64?Float64?Float64Float64Float64Float64?Float64Float64?Float64?Float64?Float64?Float64?
12015-01-01T00:00:004024.251158.25516.58128.00.0133.010710.515687.23219.751226.251525.754909.25
In [17]:
data = select(train, :Datetime, :Photovoltaik);
In [0]:

In [18]:
md=monthday.(data.Datetime)
h=hour.(data.Datetime)
insertcols!(data, 3, :monthday => md)
insertcols!(data, 4, :hour => h)
Out[18]:
52608×4 DataFrame
52583 rows omitted
RowDatetimePhotovoltaikmonthdayhour
DateTimeFloat64Tuple…Int64
12015-01-01T00:00:000.0(1, 1)0
22015-01-01T01:00:000.0(1, 1)1
32015-01-01T02:00:000.0(1, 1)2
42015-01-01T03:00:000.0(1, 1)3
52015-01-01T04:00:000.0(1, 1)4
62015-01-01T05:00:000.0(1, 1)5
72015-01-01T06:00:000.0(1, 1)6
82015-01-01T07:00:000.0(1, 1)7
92015-01-01T08:00:0053.0(1, 1)8
102015-01-01T09:00:00773.25(1, 1)9
112015-01-01T10:00:002116.75(1, 1)10
122015-01-01T11:00:003364.25(1, 1)11
132015-01-01T12:00:004198.25(1, 1)12
⋮⋮⋮⋮⋮
525972020-12-31T12:00:008398.0(12, 31)12
525982020-12-31T13:00:007127.5(12, 31)13
525992020-12-31T14:00:004330.0(12, 31)14
526002020-12-31T15:00:001372.5(12, 31)15
526012020-12-31T16:00:0052.25(12, 31)16
526022020-12-31T17:00:000.0(12, 31)17
526032020-12-31T18:00:000.0(12, 31)18
526042020-12-31T19:00:000.0(12, 31)19
526052020-12-31T20:00:000.0(12, 31)20
526062020-12-31T21:00:000.0(12, 31)21
526072020-12-31T22:00:000.0(12, 31)22
526082020-12-31T23:00:000.0(12, 31)23
In [20]:
nicename = groupby(data, [:monthday,:hour])
Out[20]:

GroupedDataFrame with 8784 groups based on keys: monthday, hour

First Group (6 rows): monthday = (1, 1), hour = 0
RowDatetimePhotovoltaikmonthdayhour
DateTimeFloat64Tuple…Int64
12015-01-01T00:00:000.0(1, 1)0
22016-01-01T00:00:000.0(1, 1)0
32017-01-01T00:00:000.0(1, 1)0
42018-01-01T00:00:000.0(1, 1)0
52019-01-01T00:00:000.0(1, 1)0
62020-01-01T00:00:000.0(1, 1)0

⋮

Last Group (4 rows): monthday = (3, 29), hour = 2
RowDatetimePhotovoltaikmonthdayhour
DateTimeFloat64Tuple…Int64
12016-03-29T02:00:000.0(3, 29)2
22017-03-29T02:00:000.0(3, 29)2
32018-03-29T02:00:000.0(3, 29)2
42019-03-29T02:00:000.0(3, 29)2
In [21]:
mean_nicename = @combine(nicename, :photo = mean(:Photovoltaik));
mean_nicename = filter(df-> df.monthday != (2,29), mean_nicename)
Out[21]:
8760×3 DataFrame
8735 rows omitted
Rowmonthdayhourphoto
Tuple…Int64Float64
1(1, 1)00.0
2(1, 1)10.0
3(1, 1)20.0
4(1, 1)30.0
5(1, 1)40.0
6(1, 1)50.0
7(1, 1)60.0
8(1, 1)70.125
9(1, 1)8182.792
10(1, 1)91512.79
11(1, 1)103582.58
12(1, 1)115287.5
13(1, 1)125991.92
⋮⋮⋮⋮
8749(12, 31)134837.54
8750(12, 31)143085.29
8751(12, 31)151080.5
8752(12, 31)1654.7083
8753(12, 31)170.0
8754(12, 31)180.0
8755(12, 31)190.0
8756(12, 31)200.0
8757(12, 31)210.0
8758(12, 31)220.0
8759(12, 31)230.0
8760(3, 29)20.0
In [22]:
first(mean_nicename)
Out[22]:
DataFrameRow (3 columns)
Rowmonthdayhourphoto
Tuple…Int64Float64
1(1, 1)00.0
In [23]:
first(val)
Out[23]:
DataFrameRow (13 columns)
RowDatetimeBiomasseWasserkraftWind_OffshoreWind_OnshorePhotovoltaikSonstige_ErneuerbareKernenergieBraunkohleSteinkohleErdgasPumpspeicherSonstige_Konventionelle
DateTimeFloat64?Float64?Float64Float64Float64Float64?Float64Float64?Float64?Float64?Float64?Float64?
12021-01-01T00:00:004597.251205.0383.03934.50.0227.08144.7511608.53443.755649.25347.251638.25
In [24]:
fig = Figure()
ax = Axis(fig[1,1])
scatter!(ax,val.Photovoltaik, color = "red", label = "val", markersize = 4)
scatter!(ax, mean_nicename.photo, color = "blue", label = "predict",markersize = 4)
axislegend(ax)
fig
Out[24]:
In [25]:
using Metrics
In [26]:
mse_own(y_pred, y_true) = sum((y_true .- y_pred).^2) / length(y_true)
Out[26]:
mse_own (generic function with 1 method)
In [27]:
mse_own(mean_nicename.photo,val.Photovoltaik )
Out[27]:
1.1335880219992626e7
In [28]:
mse(mean_nicename.photo,val.Photovoltaik )
Out[28]:
1.1335880219992626e7
In [29]:
mae(mean_nicename.photo,val.Photovoltaik)
Out[29]:
1633.652064307458

QC: do values in prediction and validation set match?

In [30]:
plot(mean_nicename.hour .== hour.(val.Datetime))
Out[30]:
In [32]:
val_data = select(val, :Datetime, :Photovoltaik);
insertcols!(val_data, 3, :monthday => monthday.(val.Datetime))
insertcols!(val_data, 4, :hour => hour.(val.Datetime))
insertcols!(val_data, 5, :predmd => mean_nicename.monthday)
insertcols!(val_data, 5, :predh => mean_nicename.hour)
val_data
Out[32]:
8760×6 DataFrame
8735 rows omitted
RowDatetimePhotovoltaikmonthdayhourpredhpredmd
DateTimeFloat64Tuple…Int64Int64Tuple…
12021-01-01T00:00:000.0(1, 1)00(1, 1)
22021-01-01T01:00:000.0(1, 1)11(1, 1)
32021-01-01T02:00:000.0(1, 1)22(1, 1)
42021-01-01T03:00:000.0(1, 1)33(1, 1)
52021-01-01T04:00:000.0(1, 1)44(1, 1)
62021-01-01T05:00:000.0(1, 1)55(1, 1)
72021-01-01T06:00:000.0(1, 1)66(1, 1)
82021-01-01T07:00:000.0(1, 1)77(1, 1)
92021-01-01T08:00:00194.0(1, 1)88(1, 1)
102021-01-01T09:00:001363.75(1, 1)99(1, 1)
112021-01-01T10:00:002986.5(1, 1)1010(1, 1)
122021-01-01T11:00:004237.5(1, 1)1111(1, 1)
132021-01-01T12:00:004826.75(1, 1)1212(1, 1)
⋮⋮⋮⋮⋮⋮⋮
87492021-12-31T12:00:0010006.8(12, 31)1213(12, 31)
87502021-12-31T13:00:007878.75(12, 31)1314(12, 31)
87512021-12-31T14:00:004845.0(12, 31)1415(12, 31)
87522021-12-31T15:00:001990.75(12, 31)1516(12, 31)
87532021-12-31T16:00:00154.25(12, 31)1617(12, 31)
87542021-12-31T17:00:000.0(12, 31)1718(12, 31)
87552021-12-31T18:00:000.0(12, 31)1819(12, 31)
87562021-12-31T19:00:000.0(12, 31)1920(12, 31)
87572021-12-31T20:00:000.0(12, 31)2021(12, 31)
87582021-12-31T21:00:000.0(12, 31)2122(12, 31)
87592021-12-31T22:00:000.0(12, 31)2223(12, 31)
87602021-12-31T23:00:000.0(12, 31)232(3, 29)
In [0]: