Data Dojo Würzburg 14
DataDojo@Lunch - live
July 2022
- When: Wednesday, July 13th, 2022, from 12:00 to 13:15 (75 minutes)
- Where: CCTB
- Info: DataDojo Website, Repo
Participants
Please add your name to the list (click the pen icon at the top left to edit) if you plan to come, and remove it again if you cannot make it. Feel free to add your preferred tool or programming language.
- Markus (R, Julia)
- Robin R. (R, Julia)
- Andi (Julia)
Dataset
We want to start a series of Data Dojos on machine learning. The task will be to classify tree species by their traits (e.g. height, stem diameter, geographic location). :deciduous_tree::evergreen_tree::palm_tree: We’ll use the recently published database Tallo.
It contains measurements for almost 500k individual trees from more than 5k species.
The task for the first episode is: Explore and visualize the data. Filter the full set to ~5 species/genera with enough individuals and reasonable overlap to have an interesting classification problem.
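As a starting point for the filtering step, here is a minimal sketch in plain Python. It keeps the most frequent species that have at least a minimum number of individuals. The column names (`species`, `height_m`), the thresholds, and the toy records are assumptions for illustration only; check the actual Tallo table for the real column names before adapting it.

```python
from collections import Counter


def top_species(rows, key="species", n=5, min_count=2):
    """Return up to n most frequent values of `key` with at least min_count rows."""
    # Skip rows where the key is missing or empty.
    counts = Counter(row[key] for row in rows if row.get(key))
    # most_common() sorts by descending count, so filtering afterwards is safe.
    return [name for name, count in counts.most_common(n) if count >= min_count]


def filter_rows(rows, keep, key="species"):
    """Keep only the rows whose `key` value is in `keep`."""
    keep = set(keep)
    return [row for row in rows if row.get(key) in keep]


# Toy records standing in for the real table (hypothetical values).
rows = [
    {"species": "Fagus sylvatica", "height_m": 25.0},
    {"species": "Fagus sylvatica", "height_m": 30.5},
    {"species": "Picea abies", "height_m": 28.0},
    {"species": "Picea abies", "height_m": 33.1},
    {"species": "Picea abies", "height_m": 19.4},
    {"species": "Quercus robur", "height_m": 22.0},
]

selected = top_species(rows, n=5, min_count=2)
subset = filter_rows(rows, selected)
print(selected, len(subset))
```

For the full dataset, `min_count` would need to be much larger than in this toy example, and passing `key="genus"` would group by genus instead, assuming such a column exists.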
Question Pool:
- Generic
- What kind of information is stored in the table(s)?
- How much data is missing?
- Is the dataset clean or are there any clear outliers?
- How can the different datasets be combined?
- How to visualize the results in a suitable way?
- Specific
- TBD
- Add your own questions
- Further Ideas
- TBD
- Add your own ideas
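For the question "How much data is missing?", one possible approach is to compute the share of missing values per column. The sketch below uses plain Python; the column names and the markers treated as missing (empty string, `"NA"`, `None`) are assumptions, since the encoding of missing values in the real table still needs to be checked.

```python
def missing_fraction(rows, columns, missing_markers=("", "NA", None)):
    """Return, for each column, the share of rows in which the value is missing."""
    if not rows:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) in missing_markers) / len(rows)
        for col in columns
    }


# Toy records with a few gaps (hypothetical values).
rows = [
    {"species": "Fagus sylvatica", "height_m": 25.0, "stem_diameter_cm": 40.2},
    {"species": "Picea abies", "height_m": None, "stem_diameter_cm": 35.0},
    {"species": "", "height_m": "NA", "stem_diameter_cm": 22.1},
    {"species": "Quercus robur", "height_m": 22.0, "stem_diameter_cm": "NA"},
]

report = missing_fraction(rows, ["species", "height_m", "stem_diameter_cm"])
for column, share in report.items():
    print(f"{column}: {share:.0%} missing")
```

A column that is absent from a row counts as missing here, because `row.get` returns `None` in that case.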
Collaborative Tools and Workflow
For notebooks (R, Python, Julia, JS, …) with real-time collaboration, CoCalc seems to be the best option right now. It worked great the last couple of times, so we’ll stick with it for now. You need to register an account there (it is free).
Future Suggestions
Add your suggestions to the list and append :+1: to the end of any line you are interested in.
Data Sets
- Tree Sizes :deciduous_tree::evergreen_tree::palm_tree:
- All Birds :bird:
- Results of the Bundestagswahl 2021
- Weather data throughout Germany over time (incl. temperature, precipitation, …): https://www.dwd.de/DE/leistungen/cdc_portal/cdc_portal.html
- German Mikrozensus
- Kaggle Titanic or Tabular Playground or Meta Kaggle
- World Trade Data (Open Trade Statistics)
- Open Citation Data
- Top 100 charts + Audio Features
- Emoji Usage :hugging_face::heart::laughing:
- Observable Curated Datasets
Tools/Languages
Skills
- interactive maps
- dashboards
- animations
Data Sources
All data types are welcome, including tables, images, videos, sounds, DNA, …
- TidyTuesday
- Our World in Data (R package: owidR), Sustainable Development Goals
- Open Data Initiatives (Würzburg, Germany, Statistisches Bundesamt, Europe, APIs)
- Data Is Plural
- Awesome Public Datasets
- Kaggle Datasets or Competitions, e.g. SLICED
- tsibbledata: Time Series Datasets
- R-text-data: Text Datasets, ready to use in R
- data.world
- Statista - the University of Würzburg has a campus license
- Open Legal Data
- Bundestag Data (e.g. poll results, deputies, wahl-o-mat, inspirational blog post)
- Deutsche Digitale Bibliothek (API, old newspapers from Germany)
- Earth Observation: Satellite Image Time Series
- Machine Learning Datasets
- International (Student) Assessment Data (TIMSS, PIRLS, PISA, …)
- (Medical) Imaging Datasets, MedMNIST
- Inspirational Notebooks on Observable
- Ski resort statistics :skier: