Coverage-based rewriting for data preparation

Accinelli, C.; Minisi, S.; Catania, B.

Developing technological solutions that satisfy non-discrimination requirements is currently one of the main challenges in data processing. Concepts such as fairness, i.e., lack of bias, and diversity, i.e., the degree to which different kinds of objects are represented in a dataset, have recently been taken into account in the design of non-discriminating set selection, ranking, and OLAP approaches. However, information extraction also underlies back-end data processing, where data are prepared (e.g., extracted and transformed), usually through SQL queries, before being loaded into a data warehouse for further front-end processing. An unfair data preparation process might have a relevant impact on front-end analysis. For example, a category that is underrepresented in the warehouse might remain underrepresented in most of the subsequent processes. The guarantee that each category is sufficiently represented is known as coverage. In this paper, we start from this consideration and propose an approach for automatically rewriting back-end queries whose results do not satisfy given coverage constraints into the "closest" queries that do satisfy them. Through rewriting, coverage-based modifications of data preparation steps are traced for further processing. We also present some preliminary experimental results and identify some directions for future work.
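To make the idea of coverage-based rewriting concrete, the following is a minimal toy sketch, not the algorithm proposed in the paper. It assumes a single selection condition of the form `income >= min_income`, a coverage constraint requiring at least `k` result rows from a given group, and "closest" interpreted as the smallest decrease of the threshold; the abstract does not specify the actual closeness measure. The table contents and all names (`ROWS`, `select`, `covers`, `rewrite_min_income`) are hypothetical.

```python
from typing import Optional

# Hypothetical toy data: (applicant_id, gender, income) rows.
ROWS = [
    (1, "M", 52_000),
    (2, "M", 61_000),
    (3, "F", 38_000),
    (4, "F", 45_000),
    (5, "M", 70_000),
    (6, "F", 58_000),
]


def select(rows, min_income):
    """Stand-in for: SELECT * FROM applicants WHERE income >= min_income."""
    return [r for r in rows if r[2] >= min_income]


def covers(result, group, k):
    """Coverage constraint: at least k rows of `group` appear in the result."""
    return sum(1 for r in result if r[1] == group) >= k


def rewrite_min_income(rows, min_income, group, k) -> Optional[int]:
    """Return the largest threshold <= min_income whose result satisfies coverage.

    Returns None when even the most relaxed threshold cannot satisfy it.
    This is an assumed notion of "closest", chosen only for illustration.
    """
    if covers(select(rows, min_income), group, k):
        return min_income  # the original query already satisfies the constraint
    # Only incomes present in the data can change the result, so try them
    # from highest to lowest and stop at the first one that satisfies coverage.
    candidates = sorted({r[2] for r in rows if r[2] < min_income}, reverse=True)
    for threshold in candidates:
        if covers(select(rows, threshold), group, k):
            return threshold
    return None
```

For instance, calling `rewrite_min_income(ROWS, 50_000, "F", 2)` relaxes the threshold just enough to admit a second row with gender `"F"`. Recording the returned threshold alongside the original one is one simple way to trace the modification for later processing.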

2020-01-01
