Conducting data science projects


Published on: 05.07.2022
Author: Alex Lavrynets

Every data science process (e.g. an innovation project) can be divided into a series of distinct steps, carried out by professionals from different backgrounds. Working carefully through each stage of this iterative process as a team ensures that the model obtained at the end is more effective.

Data science advocates a bottom-up methodological approach: a scientific procedure that aims to explain a general phenomenon on the basis of the observable data collected.

The data science process is an iterative set of steps in the completion of a project or analysis. Each data science project is unique, as are the data needed to initiate and run it. The most important steps are described below:

Problem formulation

The formulation of the problem is an essential first step in any data science process. It is a matter of understanding and framing the issue at hand. The project initiator must therefore have a broad, cross-functional understanding of the business in order to formulate the problem in a targeted way. A project steering document (e.g. the Data Science Project Canvas) makes it possible, in this initial phase, to take stock of the expectations, needs, resources and risks related to the project.

Data collection

Data are essential to any data science process. Their quantity and quality have an impact on the results obtained at the end of each project. The massive amount of data needed for a project, also known as big data, can be collected from various sources. These data can be structured or unstructured, and their size and format may vary: webpages, images, text, geodata, medical data or readings from various connected or isolated sensors. Data sets are then built up from the collected data. The processing of sensitive data is subject to the Confederation's legal framework for data protection.
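As an illustrative sketch (the sources, station names and fields below are all hypothetical), heterogeneous sources such as CSV files and JSON feeds can be gathered into a single, uniform dataset:

```python
import csv
import io
import json

# Two hypothetical data sources delivering the same kind of readings
csv_source = io.StringIO("station,temp\nNE,12.5\nBE,11.9\n")
json_source = '[{"station": "ZH", "temp": 13.2}]'

# Parse the CSV rows, converting the numeric field from text to float
records = [{"station": row["station"], "temp": float(row["temp"])}
           for row in csv.DictReader(csv_source)]

# Append the JSON records: one unified dataset, whatever the original format
records += json.loads(json_source)
```

In practice, each real source (databases, APIs, sensor streams) would get its own reader, but the goal is the same: a single collection of records with a consistent schema.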

Data selection

All raw data collected must be explored and carefully selected to ensure their quality for the rest of the process. Outliers, which may be invalid due to human error or a faulty sensor, can be filtered out according to the chosen methodology.
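One common selection technique (a minimal sketch; the data and threshold below are illustrative, and other methodologies such as interquartile-range filtering exist) is to drop values that lie too many standard deviations from the mean:

```python
import statistics

def filter_outliers(values, z_max=3.0):
    """Keep only values within z_max standard deviations of the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= z_max * stdev]

# Hypothetical sensor readings; 98.7 comes from a faulty sensor
readings = [20.1, 19.8, 20.4, 21.0, 19.9, 98.7]
clean = filter_outliers(readings, z_max=2.0)  # the faulty reading is removed
```

The threshold `z_max` is a methodological choice: too strict and valid but rare observations are lost, too loose and faulty measurements survive into the analysis.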

Data preparation

After careful selection, the data are prepared according to a defined structure. This step makes them accessible and readable for an algorithm. Big data are usually prepared following the FAIR principles: Findable, Accessible, Interoperable, Reusable.
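A minimal sketch of what FAIR-oriented preparation can look like (all identifiers and field names are illustrative, not a prescribed schema): the data are packaged together with the metadata that makes them findable and reusable.

```python
import json

dataset = {
    # Metadata supporting the FAIR principles (names are illustrative)
    "identifier": "dscc-example-0001",         # Findable: a persistent identifier
    "format": "application/json",              # Accessible/Interoperable: open format
    "license": "CC-BY-4.0",                    # Reusable: an explicit licence
    "variables": {"temp": "degrees Celsius"},  # Interoperable: documented units
    "data": [{"station": "NE", "temp": 12.5}],
}

# Serialising to an open, machine-readable format
serialized = json.dumps(dataset, indent=2)
```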

Data analysis

Analysing the data makes it possible to identify, among the variety of data available, those that are significant for the issue at hand, and to establish relationships between them. Quantitative or qualitative causal analysis methods can be supported by statistical and mathematical tools, but also by more modern tools from machine learning and artificial intelligence. This stage is considered a data scientist's core business.
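To make the idea of establishing a relationship between two variables concrete, here is a minimal sketch (with made-up data) of a simple least-squares linear regression, one of the most basic statistical analysis tools:

```python
import statistics

# Hypothetical observations, roughly following y = 2x
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.9]

mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Least-squares estimates: slope = covariance(x, y) / variance(x)
covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
variance = sum((xi - mean_x) ** 2 for xi in x)
slope = covariance / variance
intercept = mean_y - slope * mean_x  # fitted line: y = slope * x + intercept
```

Real analyses typically rely on dedicated R or Python libraries rather than hand-rolled formulas, but the principle, fitting a model that captures the relationship in the data, is the same.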

Data evaluation and interpretation

The results of the data analysis usually take the form of new, linked and aggregated data. These can be evaluated to check that the chosen model works and meets the need identified at the initial stage. Their interpretation also yields new insights and allows the process to be adapted to needs that were previously unknown.
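One simple way to check that a model works is to quantify its prediction error; a minimal sketch using the root-mean-square error (the observed and predicted values below are illustrative):

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between observations and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

# Hypothetical observations vs. a model's predictions
y_true = [2.1, 3.9, 6.2, 8.0, 9.9]
y_pred = [2.0, 4.0, 6.0, 8.0, 10.0]

error = rmse(y_true, y_pred)  # small error: the model tracks the data well
```

Whether a given error is acceptable depends on the need formulated at the initial stage, which is why evaluation loops back to the problem formulation.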

Provision of the findings

The results are finally made available, enabling the work carried out to be traced. They can serve as the starting point for a new data science research project or be used to improve the existing process. A new data science process can then begin.
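A sketch of what such a deliverable might look like (all names and values below are illustrative): the findings are serialised together with a reference to the input data, so that the work remains traceable.

```python
import json

# Hypothetical findings object; names and values are illustrative
findings = {
    "model": "linear_regression",
    "coefficients": {"slope": 1.97, "intercept": 0.11},
    "input_dataset": "sensor_readings_v1",  # traceability back to the source data
}

# A machine-readable report, ready to be archived or shared
report = json.dumps(findings, indent=2)
```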

In carrying out its projects, the Data Science Competence Centre (DSCC) limits itself to minimum viable product (MVP) solutions, usually in the form of code in the R or Python programming languages. The implementation of this code in production, requiring access to and understanding of existing IT systems, is not part of its area of expertise.

Figure legend:

Data science is a rigorous and documented process of data-driven problem-solving and continuous improvement.



Federal Statistical Office
Data Science Competence Center DSCC

Espace de l'Europe 10
CH-2010 Neuchâtel

