Faculty of Mathematics, Physics & Computer Science

Chair for Databases and Information Systems – Prof. Dr.-Ing. Stefan Jablonski

Data Science

Selected Projects | Contact person
Analysis of massive stock data to predict stock prices | Dr. Schönig
Analysing influences of external factors on marketing campaigns | Prof. Dr.-Ing. Jablonski
Web-Shop recommender system based on neural networks | Dr. Schönig
Analysis of massive social media data | Dr. Schönig

Data Science, as the field that “makes the most out of Big Data”, is one of the fastest-growing fields in computer science. We focus on two major aspects of Data Science: the conceptual design of large data sets, and business intelligence as a technique for transforming raw data into meaningful information. We explore these techniques specifically from the perspective of process mining and process analysis. Most of our laboratories and student projects address these research areas.

Data Science is a term established to describe the new challenges that arise from generating and storing massive amounts of data. Hence, it is closely related to the trends of digitalization, Big Data, and IoT technologies.

The need for a new term implies fundamental differences in the way data is computed and handled, as well as an evolution of the data itself. As mentioned, one key challenge lies in the amount of available data. Forty years ago, input data was well defined and algorithms were designed for specific kinds of input. Nowadays, research must deal with gigabytes to petabytes of input data, where transparency can no longer be guaranteed and a clear understanding of the contents is all but impossible. Data Science helps to explore these massive data sets and to extract valuable insights from them. As the characteristics of data changed, algorithms and methods likewise evolved to handle such multivariate and high-dimensional data.

ETL Processes (Extract, Transform, Load) and Storage

First, the input of the analysis pipeline must be defined. This includes identifying and defining the data sources and extracting the data, ideally followed by a step that transforms the data into a uniform format. Finally, the data is loaded into a storage system; common options are SQL or NoSQL databases, data warehouses, flat files, and distributed file systems.
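
The three steps can be illustrated with a minimal sketch. It is not a description of any particular system: the two inline sources, their column names, the `prices` table, and the choice of an in-memory SQLite database as the storage target are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw inputs: two sources with different layouts (assumed data).
SOURCE_A = "id;price;date\n1;10,50;01.02.2017\n2;11,00;02.02.2017\n"
SOURCE_B = "id,price,date\n3,9.75,2017-02-03\n"


def extract(text, delimiter):
    """Extract: read rows from a delimited text source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(row):
    """Transform: normalise decimal commas and DD.MM.YYYY dates to one format."""
    price = float(row["price"].replace(",", "."))
    date = row["date"]
    if "." in date:  # DD.MM.YYYY -> YYYY-MM-DD
        day, month, year = date.split(".")
        date = f"{year}-{month}-{day}"
    return int(row["id"]), price, date


def load(rows, connection):
    """Load: write the uniform rows into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS prices "
        "(id INTEGER PRIMARY KEY, price REAL, date TEXT)"
    )
    connection.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    connection.commit()


def run_pipeline():
    """Run extract, transform, and load, then read the stored rows back."""
    connection = sqlite3.connect(":memory:")
    raw = extract(SOURCE_A, ";") + extract(SOURCE_B, ",")
    load([transform(row) for row in raw], connection)
    query = "SELECT id, price, date FROM prices ORDER BY id"
    return connection.execute(query).fetchall()


if __name__ == "__main__":
    print(run_pipeline())
```

Keeping the three steps in separate functions means a new data source only needs its own `extract` call and, if its layout differs, an extra branch in `transform`.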

According to IDC, a global market research and consulting firm, the volume of generated data will rise from 20 zettabytes in 2017 to 160 zettabytes in 2025. Data Science must therefore be able to handle vast volumes of data, including consistent storage and replication. Various databases and algorithms exist for storing and recovering data, and the data scientist must choose one that suits the application at hand.
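
To put the quoted figures into perspective, the following short calculation derives the implied yearly growth. It assumes constant compound growth between 2017 and 2025, which is a simplification and not a claim made by the forecast itself.

```python
import math

# Figures quoted in the text above.
start_zb, end_zb = 20.0, 160.0
years = 2025 - 2017

# Overall growth factor over the whole period.
growth_factor = end_zb / start_zb

# Compound annual growth rate, assuming the same rate every year.
annual_rate = growth_factor ** (1 / years) - 1

# Time needed for the data volume to double at that rate.
doubling_time = math.log(2) / math.log(1 + annual_rate)

if __name__ == "__main__":
    print(f"factor: {growth_factor:.0f}x")
    print(f"annual growth: {annual_rate:.1%}")
    print(f"doubling time: {doubling_time:.2f} years")
```

Under this assumption the volume grows eightfold, that is by roughly 30 percent per year, and doubles about every two years and eight months.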

Besides these technical challenges, data quality can be another serious challenge. Operators must watch for missing, incomplete, or even corrupted data.
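
A simple quality check might look like the sketch below. The records, the required fields, and the rule that a price must be a non-negative number are hypothetical and serve only to show the three kinds of problems named above.

```python
import math

# Hypothetical records (assumed for illustration).
records = [
    {"id": 1, "price": 10.5, "volume": 100},
    {"id": 2, "price": None, "volume": 250},         # missing value
    {"id": 3, "volume": 80},                          # incomplete: field absent
    {"id": 4, "price": float("nan"), "volume": 5},    # corrupted: not a number
    {"id": 5, "price": -3.0, "volume": 40},           # corrupted: negative price
]

REQUIRED_FIELDS = ("id", "price", "volume")


def check_record(record):
    """Return a list of quality problems found in one record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record:
            problems.append(f"incomplete: {field} absent")
        elif record[field] is None:
            problems.append(f"missing: {field}")
    price = record.get("price")
    # Assumed plausibility rule: a price must be a non-negative number.
    if isinstance(price, float) and (math.isnan(price) or price < 0):
        problems.append("corrupted: price")
    return problems


# Map each record id to its problems, then keep only the clean records.
report = {record["id"]: check_record(record) for record in records}
clean = [record for record in records if not report[record["id"]]]

if __name__ == "__main__":
    for record_id, problems in report.items():
        print(record_id, problems or "ok")
```

Whether flawed records are dropped, repaired, or flagged for review depends on the application; the sketch only separates them from the clean ones.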

Data Analysis: Processing, Mining and Knowledge Generation

Ironically, the core of Data Science is more an art than a science. The main challenge is often to select suitable questions to ask of the available data sets. To return to the metaphor from the introduction: everyone may have the same vocabulary at their disposal as Goethe, but exceedingly few of us are capable of writing literary masterpieces.

Data Science aims to create masterpieces out of data by providing valuable insights into it. These insights differ from data set to data set and from application to application, so they are hard to define generically. However, Data Science provides many different approaches and algorithms that help to investigate a given data set. As stated, the art is to choose the right input data as well as the appropriate algorithm in order to generate useful results. These algorithms come from the fields of data mining, machine learning and neural networks, mathematics, and statistics. One can distinguish between classification (supervised learning), clustering (unsupervised learning), association rule mining, and time series analysis.
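
The difference between the first two categories can be shown on a toy example. The sketch below uses a nearest-centroid classifier and a basic k-means loop as simple stand-ins for classification and clustering; the points, the labels, and the starting centres are hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of 2-D points."""
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def nearest(point, centres):
    """Index of the centre closest to the point (Euclidean distance)."""
    return min(range(len(centres)), key=lambda i: math.dist(point, centres[i]))


# Hypothetical toy data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = ["low", "low", "low", "high", "high", "high"]

# Classification (supervised): the labels are known, so we learn one
# centroid per class and assign a new point to the closest centroid.
classes = sorted(set(labels))
centroids = [mean([p for p, l in zip(points, labels) if l == c]) for c in classes]
predicted = classes[nearest((2.0, 1.0), centroids)]


# Clustering (unsupervised): no labels are used; k-means alternates between
# assigning points to the nearest centre and recomputing the centres.
def k_means(points, centres, iterations=10):
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest(p, centres)].append(p)
        # An empty group keeps its previous centre.
        centres = [mean(g) if g else c for g, c in zip(groups, centres)]
    return [nearest(p, centres) for p in points]


assignments = k_means(points, centres=[points[0], points[3]])

if __name__ == "__main__":
    print("predicted class:", predicted)
    print("cluster assignments:", assignments)
```

The classifier needs the labels to learn from, whereas k-means recovers the two groups from the coordinates alone and returns only anonymous cluster indices.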

Data Visualization and Result Representation

The best results are useless if they do not find their way to a responsible decision-maker. Most of the time, results must be communicated to a group of supervisors who are not trained to work with the raw mathematical formulas or statistics that the analysis produces. They are far more comfortable working with visual representations of data sets.

The challenges data scientists face include selecting the most relevant aspects and building convincing charts that are easy for executives to understand, since executives do not want to spend time working through extensive CSV files, for instance. Interactive representations with different levels of detail are therefore helpful, as is a skilful use of colours and of various presentation forms such as heat maps, tables, or charts. Most conventional charts cannot accommodate such huge amounts of data. Data Science must therefore also provide innovative forms of representation that cover very large data sets in a compact, space-saving design.
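
One common way to fit many data points into a compact chart is to aggregate them into the cells of a grid, which can then be drawn as a heat map at different levels of detail. The sketch below shows only this aggregation step; the function name `bin_counts`, the sample points, and the assumption that all values lie between 0 and 1 are illustrative.

```python
def bin_counts(points, bins, lower=0.0, upper=1.0):
    """Count 2-D points per cell of a bins x bins grid.

    Assumes all coordinates lie within [lower, upper]; values equal to
    upper are placed in the last cell.
    """
    grid = [[0] * bins for _ in range(bins)]
    width = (upper - lower) / bins
    for x, y in points:
        col = min(int((x - lower) / width), bins - 1)
        row = min(int((y - lower) / width), bins - 1)
        grid[row][col] += 1
    return grid


# Hypothetical data points (assumed for illustration).
points = [(0.1, 0.1), (0.2, 0.15), (0.8, 0.9), (0.6, 0.6), (0.95, 0.05)]

overview = bin_counts(points, bins=2)  # coarse level of detail
detail = bin_counts(points, bins=4)    # finer level of detail

if __name__ == "__main__":
    for row in overview:
        print(row)
    print()
    for row in detail:
        print(row)
```

The size of the resulting grid depends only on the number of bins, not on the number of points, so the same chart area can summarise a handful of records or many millions.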

Webmaster: Dr. Stefan Schönig
