Data Dojo Würzburg 22

DataDojo@Lunch - live

May 2023

When: Wednesday, May 31st, 2023, from 11:00 am to 12:30 pm (90 minutes)
Where: CCTB or online (CCTB Seminar Zoom Link)
Info: DataDojo Website, Repo

Dataset

German Electricity Data :electric_plug:

Data Source: SMARD Web Portal
Task: forecasting (using the past to predict the future)
Language: Julia

First steps:

load the data
split the data into train/val/test sets
plot the time series (at different temporal resolutions)
calculate the mean production per month
check for seasonal patterns (which kinds of seasonality are there?)
make a prediction for the validation set
evaluate the quality of the prediction
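
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code written at the dojo: the session itself used Julia, while this sketch uses Python's standard library only. The data is a synthetic hourly series generated in the code (not real SMARD values), and all function names, the weekly period, and the split sizes are assumptions chosen for the example. Plotting is left out to keep the sketch dependency-free.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

HOURS_PER_WEEK = 7 * 24


def make_synthetic_series(days=56):
    """Hourly stand-in data with a daily and a weekly cycle (NOT real SMARD values)."""
    start = datetime(2023, 1, 2)  # a Monday
    series = []
    for h in range(days * 24):
        ts = start + timedelta(hours=h)
        daily = 10.0 * math.sin(2 * math.pi * ts.hour / 24)
        weekend = -8.0 if ts.weekday() >= 5 else 0.0
        series.append((ts, 50.0 + daily + weekend))
    return series


def chronological_split(series, n_val, n_test):
    """Split in time order (no shuffling), so validation and test lie in the future."""
    n_train = len(series) - n_val - n_test
    return (series[:n_train],
            series[n_train:n_train + n_val],
            series[n_train + n_val:])


def monthly_mean(series):
    """Mean value per (year, month)."""
    buckets = defaultdict(list)
    for ts, value in series:
        buckets[(ts.year, ts.month)].append(value)
    return {month: mean(values) for month, values in buckets.items()}


def seasonal_naive(train_values, horizon, period=HOURS_PER_WEEK):
    """Forecast by repeating the last observed season of the training data."""
    last_season = train_values[-period:]
    return [last_season[k % period] for k in range(horizon)]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


series = make_synthetic_series()
train_set, val_set, test_set = chronological_split(
    series, n_val=2 * HOURS_PER_WEEK, n_test=HOURS_PER_WEEK
)
monthly = monthly_mean(train_set)

train_values = [value for _, value in train_set]
val_values = [value for _, value in val_set]

# Baseline 1: always predict the training mean.
mean_forecast = [mean(train_values)] * len(val_values)
# Baseline 2: repeat the last observed week.
seasonal_forecast = seasonal_naive(train_values, len(val_values))

mae_mean = mean_absolute_error(val_values, mean_forecast)
mae_seasonal = mean_absolute_error(val_values, seasonal_forecast)
print("monthly means:", monthly)
print("MAE (mean baseline):", mae_mean)
print("MAE (seasonal naive):", mae_seasonal)
```

Because the synthetic series repeats exactly every week, the seasonal naive forecast reproduces the validation set perfectly here; real electricity data would show a nonzero error. The test set is deliberately left untouched until a final model has been chosen on the validation set.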

Retro on the machine learning series

In the first dojo of the series, we filtered the full dataset down to three species with reasonable overlap (Fagus sylvatica, Pinus pinaster, Quercus ilex). The goal from there on: try different machine learning methods to classify tree species from traits.

In the second dojo we created our first models: a very simple “Majority Vote” model and some K-Nearest-Neighbor (KNN) models with scikit-learn.

In the third dojo we explored the effect of scaling on the performance of the KNN models.
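
Why scaling matters for KNN can be shown with a toy example. This is only a sketch under stated assumptions: the dojo used scikit-learn on the tree trait data, whereas this example uses plain Python, a hand-written 1-nearest-neighbor rule, and four made-up training points with hypothetical labels "a" and "b". The first feature has a large numeric range, the second a small one.

```python
from math import dist
from statistics import mean, pstdev


def nearest_label(train_x, train_y, query):
    """1-nearest-neighbor prediction using Euclidean distance."""
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[best]


def fit_standardizer(rows):
    """Per-feature (mean, standard deviation), estimated on the training rows."""
    return [(mean(column), pstdev(column)) for column in zip(*rows)]


def standardize(row, params):
    """Rescale one row to zero mean and unit variance per feature."""
    return tuple((value - m) / s for value, (m, s) in zip(row, params))


# Made-up points: (large-scale feature, small-scale feature)
train_x = [(1000.0, 0.1), (1020.0, 0.2), (1005.0, 0.9), (1015.0, 0.8)]
train_y = ["a", "a", "b", "b"]
query = (1006.0, 0.15)

# Without scaling, the large-scale feature dominates the distance.
raw_prediction = nearest_label(train_x, train_y, query)

# With standardization, both features contribute comparably.
params = fit_standardizer(train_x)
scaled_x = [standardize(row, params) for row in train_x]
scaled_prediction = nearest_label(scaled_x, train_y, standardize(query, params))

print("unscaled:", raw_prediction)
print("scaled:", scaled_prediction)
```

In this toy data the nearest neighbor changes once the features are standardized, so the predicted label changes too. Whether scaling improves accuracy on a real dataset has to be checked on a validation set.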

In the fourth dojo we explored Decision Trees as models for classification.

In the fifth dojo we used Support Vector Machines as models for classification.

In the sixth dojo we used ensemble models, including AdaBoost and random forests.

In the seventh dojo we used imputation methods to also make predictions for cases with missing data.

In the eighth dojo we trained some neural networks.

Collaborative Tools and Workflow

For notebooks (R, Python, Julia, JS, …) with real-time collaboration, CoCalc seems to be the best option right now. It worked well the last few times, so we’ll stick with it for now. You need to register an account there (it is free).

Future Suggestions

Add your suggestions to the list, and add a :+1: to the end of any line you are interested in.

Data Sets

Würzburg Baumkataster, Würzburger Klimabäume
Bee Varroa Image Classification :bee:
German Electricity Data :electric_plug:
Mattermost Chat History - e.g. analyze the messages and reactions from the lunch channel
Wordbank - data of children learning to talk
All Birds :bird:
Results of the Bundestagswahl 2021
Weather data throughout Germany over time (incl. temperature, precipitation, …): https://www.dwd.de/DE/leistungen/cdc_portal/cdc_portal.html
German Mikrozensus
Kaggle Titanic or Tabular Playground or Meta Kaggle
World Trade Data (Open Trade Statistics)
Open Citation Data
Top 100 charts + Audio Features
Emoji Usage :hugging_face::heart::laughing:
Observable Curated Datasets
Abgeordnetenwatch - data on German elected officials in the EU Parliament, Bundestag, and Landtag (election history, committee memberships, side jobs, etc.)
Button Men Game Results

Tools/Languages

R/tidyverse
Python
Power BI
Tableau
KNIME
JavaScript
Julia
VisiData

Skills

interactive maps
dashboards
animations

Data Sources

all data types are welcome, including tables, images, videos, sounds, DNA, …

OpenData Bayern
TidyTuesday
Our World in Data (R package: owidR), Sustainable Development Goals
Open Data Initiatives (Würzburg, Germany, Statistisches Bundesamt, Europe, APIs)
Data is plural
Awesome Public Datasets
Kaggle Datasets or Competitions, e.g. SLICED
tsibbledata: Time Series Datasets
R-text-data: Text Datasets, ready to use in R
data.world
Statista - the University of Würzburg has a campus license
Open Legal Data
Bundestag Data (e.g. poll results, deputies, wahl-o-mat, inspirational blog post)
Deutsche Digitale Bibliothek (API, old newspapers from Germany)
Earth Observation: Satellite Image Time Series
Machine Learning Datasets
International (Student) Assessment Data (TIMSS, PIRLS, PISA, …)
(Medical) Imaging Datasets, MedMNIST
Inspirational Notebooks on Observable
Ski resort statistics :skier:
