Several research papers accepted for BTW 2025
We are very pleased that several papers have been accepted for publication at the Conference on Database Systems for Business, Technology and Web (BTW) 2025. The 21st BTW will take place from March 3 to 7, 2025, at the University of Bamberg.
The following papers have been accepted:
“Practical Problems in Customer Data – A Use-Case-Driven Classification” (Jan-Lucas Deinhard, Richard Lenz)
This article presents a comprehensive analysis of data quality issues encountered in customer data at large enterprises. The analysis is based on data collected at a large medical technology manufacturer, where the observed problems are clustered into distinct classes. From this classification, nine key prevention requirements are identified that are essential for improving data fitness, including changes to data governance and to data architecture. An evaluation of existing tools against these requirements furthermore highlights notable solutions. Despite the availability of numerous tools, gaps remain, especially regarding the integration of all functionalities. Our findings suggest that while industry-standard solutions are available, integrating them into a cohesive framework posed significant challenges in our use case, necessitating continual adjustments to data architecture and processes to establish and maintain high data quality.
“Relationship Discovery for Heterogeneous Time Series Integration: A Comparative Analysis for Industrial and Building Data” (Lucas Weber, Richard Lenz)
Cyber-physical systems like buildings and power plants are monitored with ever-increasing numbers of sensors, producing massive, heterogeneous time-series datasets that are collected in data lakes. Appropriate meta-data describing both the function and location of each sensor is essential for any profitable use of the data, but is often missing or incomplete. In particular, information about related sensors, i.e., sensors belonging to the same functional subsystem, may be hard to derive if appropriate meta-data is unavailable. While various approaches exist for automatic meta-data extraction from relational databases, the unique characteristics of heterogeneous time-series data necessitate specialized algorithms. Among the general algorithms developed for time-series meta-data inference, only a few are concerned with relationship discovery, despite the critical importance of this information in many meta-data formats. Other domains, however, offer a variety of measures for pairwise relationship discovery in homogeneous time-series collections.
This paper consolidates these measures and evaluates their performance in identifying related but heterogeneous time series from the same functional subsystem within industrial facilities. We evaluate the methods on a collection of different datasets to extract promising relationship measures from the literature and show that several candidates outperform the commonly used Pearson correlation coefficient.
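To make the idea of pairwise relationship measures concrete, here is a minimal sketch (not the paper's code) that scores every sensor pair in a small collection with either the Pearson or the Spearman rank correlation. The toy data, the assumption that all series are already resampled onto a common timeline, and the use of absolute correlation as the score are illustrative choices only:

import numpy as np
from scipy import stats

def pairwise_scores(series: np.ndarray, measure: str = "spearman") -> np.ndarray:
    """Score every sensor pair in `series` (shape: n_sensors x n_samples)."""
    n = series.shape[0]
    scores = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            if measure == "pearson":
                r, _ = stats.pearsonr(series[i], series[j])
            else:  # rank-based, more robust to monotone non-linear relationships
                r, _ = stats.spearmanr(series[i], series[j])
            scores[i, j] = scores[j, i] = abs(r)  # strength only, ignore sign
    return scores

# Toy data: sensor 1 is a lagged, rescaled copy of sensor 0; sensor 2 is noise.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
base = np.sin(t) + 0.1 * rng.standard_normal(t.size)
data = np.stack([base, 2.0 * np.roll(base, 5) + 1.0, rng.standard_normal(t.size)])
print(np.round(pairwise_scores(data, "pearson"), 2))
print(np.round(pairwise_scores(data, "spearman"), 2))

On such data, both measures assign a high score to the lagged, rescaled pair and a near-zero score to the noise sensor; the paper's contribution is a systematic comparison of many such measures on real heterogeneous industrial and building data.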
“Graph-based QSS: A Graph-based Approach to Quantifying Semantic Similarity for Automated Linear SQL Grading” (Leo Köberlein, Dominik Probst, Richard Lenz)
Determining the Quantified Semantic Similarity (QSS) between database queries is a critical challenge with broad applications, from query log analysis to automated SQL skill assessment. Traditional methods often rely solely on syntactic comparisons or are limited to checking for semantic equivalence.
This paper introduces Graph-based QSS, a novel graph-based approach to measuring the semantic dissimilarity between SQL queries. Queries are represented as nodes in an implicit graph, and the transitions between nodes, called edits, are weighted by semantic dissimilarity. We employ shortest-path algorithms to identify the lowest-cost edit sequence between two given queries, thereby obtaining a quantifiable measure of semantic distance. An empirical study of our prototype suggests that our method provides more accurate and comprehensible grading than existing techniques. Moreover, the results indicate that our approach comes close to the quality of manual grading, making it a robust tool for diverse database query comparison tasks.
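The following sketch illustrates the shortest-path idea on an implicit edit graph. It is not the authors' implementation: queries are crudely modelled as sets of clause fragments, and the edit set and weights are hypothetical toy values, but the search itself is standard Dijkstra, expanding neighbours on demand rather than materialising the graph:

import heapq
import itertools

VOCAB = {"SELECT name", "SELECT *", "WHERE age > 18", "ORDER BY name"}

def neighbors(query: frozenset):
    """Yield (edited_query, edit_cost): each edit inserts or deletes a fragment."""
    for fragment in VOCAB - query:
        yield query | {fragment}, 1.0   # insertion (assumed weight)
    for fragment in query:
        yield query - {fragment}, 1.5   # deletion (assumed, penalised higher)

def semantic_distance(source: frozenset, target: frozenset) -> float:
    """Dijkstra over the implicit edit graph: cost of the cheapest edit sequence."""
    tie = itertools.count()              # tie-breaker so the heap never compares sets
    frontier = [(0.0, next(tie), source)]
    best = {source: 0.0}
    while frontier:
        cost, _, query = heapq.heappop(frontier)
        if query == target:
            return cost
        if cost > best.get(query, float("inf")):
            continue  # stale heap entry
        for nxt, weight in neighbors(query):
            if cost + weight < best.get(nxt, float("inf")):
                best[nxt] = cost + weight
                heapq.heappush(frontier, (cost + weight, next(tie), nxt))
    return float("inf")

q1 = frozenset({"SELECT name", "WHERE age > 18"})
q2 = frozenset({"SELECT *", "WHERE age > 18", "ORDER BY name"})
print(semantic_distance(q1, q2))  # 3.5: one deletion (1.5) plus two insertions (2.0)

Because the cheapest edit sequence is recovered explicitly, a grader can inspect which edits separate a student's query from the reference solution, which is what makes the resulting grades comprehensible rather than a bare score.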
Furthermore, the results of the ReProVide project, which was funded as part of the priority program SPP 2037 “Scalable Data Management for Future Hardware”, will be presented at the demo reception:
“ReProVide: Query Optimisation and Near-Data Processing on Reconfigurable SoCs for Big Data Analysis” (Tobias Hahn, Maximilian Langohr, Stefan Meißner, Benedikt Döring, Stefan Wildermann, Klaus Meyer-Wegener, Jürgen Teich)
The goal of ReProVide is to provide novel hardware and optimisation techniques for scalable, high-performance processing of Big Data. The Programmable System-on-Chip (PSoC) architecture of ReProVide includes a reconfigurable FPGA that supports hardware accelerators for various operators on relational and streaming data. Such PSoCs can process data directly at the source, such as data stored on attached NVMe drives, using application-specific accelerators. For example, compute-intensive tasks such as JSON parsing can be offloaded to the hardware accelerators, reducing CPU load. In addition, reducing the volume of data at an early stage avoids unnecessary data movement, resulting in lower energy consumption. This demo illustrates the opportunities and benefits of hardware-reconfigurable, FPGA-based PSoCs for near-data processing. It allows users to run two queries and select which operations should be pushed down to the SoC for near-data hardware acceleration. From no acceleration to maximum acceleration, a 52x improvement in throughput and a 67x reduction in energy consumption can be observed.
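A back-of-the-envelope sketch (not ReProVide code) of why pushing a selective filter down to the data source pays off: only the matching rows have to cross the interconnect. All numbers below are illustrative assumptions:

# All constants here are assumed values for illustration only.
ROW_BYTES = 128          # assumed average row size
N_ROWS = 10_000_000      # assumed number of rows stored at the source
SELECTIVITY = 0.02       # assumed fraction of rows that pass the filter

host_side = N_ROWS * ROW_BYTES                     # ship all rows, filter on the host
near_data = int(N_ROWS * SELECTIVITY) * ROW_BYTES  # filter at the source, ship matches

print(f"without pushdown: {host_side / 1e9:.2f} GB moved")
print(f"with pushdown:    {near_data / 1e9:.2f} GB moved")
print(f"data movement reduced by {host_side / near_data:.0f}x")

Under these assumed numbers, pushdown cuts the data moved by 50x; the demo's measured gains additionally reflect the FPGA accelerators processing operators such as filters and JSON parsing faster and more efficiently than the host CPU.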