Why datasets are crucial to data science: the key to informed decisions

Henk van der Duim
4 min read · Apr 18, 2023
Data Scientist

In the world of business and data, the importance of learning data science cannot be overemphasized. It’s a field that has dramatically changed the way businesses operate. We are in an era of big data, and to succeed, companies need data-driven insights to drive decisions.

But data science isn’t just about programming and algorithms; it’s also about understanding the data being analyzed. That’s where datasets come into play. They are crucial for anyone wanting to learn data science.

Below is an infographic about Big Data:

Big Data (source: European Commission (2020), EPRS (2016))

Interlude

Disclaimer: I am a Data Engineer and not a Data Scientist. My work does mean, however, that I understand what a Data Scientist does and what they can expect from a Data Engineer.

Let me explore how these two fields relate to each other using Monica Rogati’s The Data Science Hierarchy Of Needs. I borrowed the red areas in the image from Christopher Bolard and adapted them to my own situation.

The Data Science Hierarchy Of Needs by Monica Rogati

In my case, I deal with the things in the red area labeled Data Engineer. The Data Scientist/Analyst is concerned with the red area above it. I’ll soon write an article about the how and what of my work as a Data Engineer.

In other cases, you will find that the Data Scientist/Analyst is also involved in the Explore/Transform section in the image. See the article: Data Engineer VS Data Scientist.

Datasets

Datasets are collections of (semi-)structured data specially composed for analysis. They appear in various forms, such as spreadsheets, CSV files, JSON files and databases. Datasets are essential for data science: they enable you to work with real-world data and gain insights that would otherwise be impossible to discover. Learning data science with datasets therefore gives you the necessary skills to turn data into actionable insights. This is something I did at an earlier stage to understand the I amsterdam City Card database.
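To illustrate that the same dataset can arrive in different forms, here is a minimal Python sketch that reads identical records from CSV text and from JSON text using only the standard library. The sample records, the column names (name, visits) and the helper functions load_csv and load_json are invented for this illustration and do not come from any real dataset.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in two formats.
CSV_TEXT = "name,visits\nmuseum_a,120\nmuseum_b,45\n"
JSON_TEXT = '[{"name": "museum_a", "visits": 120}, {"name": "museum_b", "visits": 45}]'


def load_csv(text):
    """Parse CSV text into a list of dicts, converting visits to int."""
    rows = [dict(row) for row in csv.DictReader(io.StringIO(text))]
    for row in rows:
        row["visits"] = int(row["visits"])  # CSV values arrive as strings
    return rows


def load_json(text):
    """Parse JSON text into a list of dicts; JSON keeps numbers as numbers."""
    return json.loads(text)


csv_rows = load_csv(CSV_TEXT)
json_rows = load_json(JSON_TEXT)

# Once loaded, the source format no longer matters for the analysis.
total_visits = sum(row["visits"] for row in csv_rows)
print(csv_rows == json_rows, total_visits)
```

Note the explicit int conversion for the CSV rows: CSV has no notion of types, so every value is read as a string, while JSON already distinguishes numbers from text.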

Advantages

One of the benefits of using datasets in data science is that they provide a real-world context for analysis. Unlike simulated data, real-world datasets are based on actual scenarios and are therefore more representative of the issues facing businesses. This means you can learn how to apply analysis techniques to real-world problems and develop solutions that are relevant and applicable.

Another benefit of learning data science with datasets is that it helps you develop a data-driven mindset. Datasets provide an opportunity to understand the nuances of data, such as data quality issues and data bias.

These are important considerations because they can have a significant impact on the insights generated from data. By working with datasets, you learn to identify these issues, which is critical for decision-making.
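To make data quality issues and data bias a bit more concrete, here is a minimal Python sketch, using only the standard library, that counts missing values per column and shows how the records are spread over one category. The records, the column names and the helper functions count_missing and category_shares are hypothetical and serve only as an illustration; real checks would depend on the dataset at hand.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "country": "NL"},
    {"age": None, "country": "NL"},
    {"age": 51, "country": "NL"},
    {"age": 28, "country": "DE"},
]


def count_missing(rows):
    """Return how many rows lack a value (None or empty string) per column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1
    return dict(missing)


def category_shares(rows, column):
    """Return the fraction of rows per category in the given column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


missing = count_missing(records)
shares = category_shares(records, "country")
print(missing)
print(shares)
```

A skewed share, such as most records coming from one country in this toy example, is only a hint of possible bias. Whether it matters depends on the question the analysis is meant to answer.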

In addition, datasets provide a basis for collaboration and knowledge sharing. In the world of data science, people often work together on the same problems, and datasets give them a common starting point. They also allow you to share your insights with others, which can lead to new discoveries and breakthroughs.

Competitions & Datasets

Below is an overview of the competitions and datasets that I found:

Conclusion

In short, learning data science with datasets is essential for anyone who wants to develop the necessary skills to turn raw data into actionable insights. Datasets provide a real-world context for analysis, help you develop a data-driven mindset, and promote collaboration and knowledge sharing. As data continues to play a critical role in business operations, learning data science with datasets is no longer an option but a necessity. By investing in data science education, companies can equip their workforce with the skills needed to make data-driven decisions and drive growth and success.

About me

I’m Henk van der Duim, Data Engineer at Stichting amsterdam&partners.

Follow me on Medium for regular updates on Data, AI and other tech topics:

I am open to connecting with data enthusiasts across the globe on LinkedIn:

https://www.linkedin.com/in/henkvanderduim/

Henk van der Duim

Data Engineer @ Stichting amsterdam&partners, author of 'Twitter en Personal Branding', sharing stories about data, AI and tech.