SmartDataLake

SmartDataLake: A framework for making data lakes smart

Data lakes are raw data ecosystems, where large amounts of diverse data are retained and coexist. They facilitate self-service analytics for flexible, fast, ad hoc decision making. SmartDataLake enables extreme-scale analytics over sustainable big data lakes.

SmartDataLake provides an adaptive, scalable and elastic data lake management system that offers: (a) data virtualization for abstracting and optimizing access and queries over heterogeneous data, (b) data synopses for approximate query answering and analytics to enable interactive response times, and (c) automated placement of data in different storage tiers based on data characteristics and access patterns to reduce costs.
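
As a rough illustration of point (c), the sketch below assigns datasets to storage tiers from a single access statistic. It is a minimal sketch under stated assumptions, not SmartDataLake's placement logic: the tier names, read thresholds, per-GB prices, and the `DatasetStats`, `choose_tier` and `monthly_cost` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatasetStats:
    name: str
    size_gb: float        # data characteristic: stored volume
    reads_per_day: float  # access pattern: observed read frequency


# Hypothetical tiers, hottest first:
# (tier name, minimum reads per day to qualify, assumed cost per GB per month).
TIERS = [
    ("ssd", 100.0, 0.10),
    ("hdd", 1.0, 0.03),
    ("archive", 0.0, 0.004),
]


def choose_tier(stats: DatasetStats) -> str:
    """Return the first (hottest) tier whose read threshold the dataset meets."""
    for name, min_reads, _cost in TIERS:
        if stats.reads_per_day >= min_reads:
            return name
    return TIERS[-1][0]  # fall back to the coldest tier


def monthly_cost(stats: DatasetStats, tier: str) -> float:
    """Estimated monthly storage cost of keeping the dataset in the given tier."""
    cost_per_gb = {name: cost for name, _min_reads, cost in TIERS}[tier]
    return stats.size_gb * cost_per_gb


if __name__ == "__main__":
    for d in [DatasetStats("clicks", 10, 500),
              DatasetStats("orders", 200, 5),
              DatasetStats("logs_2015", 1000, 0.01)]:
        tier = choose_tier(d)
        print(d.name, tier, monthly_cost(d, tier))
```

A real system would likely weigh more signals (recency, query latency targets, migration cost); the fixed thresholds here only make the cost versus speed trade-off concrete.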

The data lake’s contents are modelled and organised as a heterogeneous information network, containing multiple types of entities and relations. Efficient and scalable algorithms are provided for: (a) similarity search and exploration for discovering relevant information, (b) entity resolution and ranking for identifying and selecting important and representative entities across sources, (c) link prediction and clustering for unveiling hidden associations and patterns among entities, and (d) change detection and incremental update of analysis results to enable faster analysis of new data.

Interactive and scalable visual analytics are provided to include and empower the data scientist in the knowledge extraction loop. This includes functionalities for: (a) visually exploring and tuning the space of features, models and parameters, and (b) enabling large-scale visualizations of spatial, temporal and network data. The results of the project are evaluated in real-world use cases from the business intelligence domain, including scenarios for portfolio recommendation, production planning and pricing, and investment decision making.

SDL-Virt

SDL-Virt: This layer is responsible for delivering efficient data access over a data lake containing large volumes of heterogeneous data. It encapsulates and abstracts all issues related to the efficient placement, distribution, management, and retrieval of data, providing the upper layers with homogeneous access to data through an SQL-like query language. Specifically, the main functionalities offered include: (i) data virtualization over different data types and formats, (ii) automated placement of data over different storage tiers to optimize the trade-off between storage cost and speed of retrieval, and (iii) approximate query processing that can speed up data analysis by returning approximate answers with theoretical guarantees for accuracy.
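
To make the idea behind (iii) concrete, the following sketch estimates an average from a uniform random sample and reports an error bound. It assumes plain uniform sampling with a normal-approximation confidence interval, which is one textbook technique and not necessarily the synopsis method used in SDL-Virt; the function name `approximate_mean` and its parameters are hypothetical.

```python
import math
import random


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean of `values` from a uniform sample drawn without replacement.

    Returns (estimate, half_width). Under the central limit theorem the true
    mean lies within estimate +/- half_width with roughly 95% probability
    when z = 1.96. Requires 2 <= sample_size <= len(values).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = sum(sample) / sample_size
    # Unbiased sample variance (divides by n - 1).
    variance = sum((x - mean) ** 2 for x in sample) / (sample_size - 1)
    half_width = z * math.sqrt(variance / sample_size)
    return mean, half_width


if __name__ == "__main__":
    data = [float(i % 100) for i in range(100_000)]
    estimate, half_width = approximate_mean(data, sample_size=1_000)
    print(f"approximate AVG = {estimate:.2f} +/- {half_width:.2f}")
```

Only 1,000 of the 100,000 values are read, which is where the speed-up comes from; a larger sample narrows the interval at the cost of more work.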

SDL-HIN

SDL-HIN: This layer is dedicated to searching and analysing the contents of a data lake, represented in the form of a Heterogeneous Information Network (HIN), i.e., a graph consisting of entities and relations of different types. The functionalities offered include: (i) discovering similar or near-duplicate entities under different similarity criteria and matching conditions, (ii) ranking entities based on the structure of the graph, as well as on other domain-specific criteria such as properties of geospatial regions, (iii) predicting or suggesting links between entities based on their attributes and position in the network, (iv) detecting communities of entities, including potentially overlapping or hierarchical communities, and (v) detecting changes in evolving communities and in data represented as time series.
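
As an illustration of similarity search over a HIN, the sketch below computes PathSim, a well-known meta-path-based similarity measure from the literature, along a symmetric meta-path A-B-A (for example company-product-company). Whether SDL-HIN uses this particular measure is not stated here; the function name `pathsim`, the edge-list input and the entity labels are assumptions made for the example.

```python
from collections import defaultdict


def pathsim(edges, x, y):
    """PathSim between two type-A entities along the meta-path A-B-A.

    `edges` is an iterable of (a, b) pairs linking a type-A entity to a
    type-B entity. For an unweighted graph without parallel edges, the number
    of A-B-A path instances between u and v equals the number of type-B
    neighbours they share.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)

    def path_count(u, v):
        return len(neighbours[u] & neighbours[v])

    denominator = path_count(x, x) + path_count(y, y)
    if denominator == 0:
        return 0.0  # neither entity has any type-B neighbour
    return 2 * path_count(x, y) / denominator


if __name__ == "__main__":
    # Hypothetical company-product links.
    links = [("c1", "p1"), ("c1", "p2"),
             ("c2", "p1"), ("c2", "p2"),
             ("c3", "p3"),
             ("c4", "p1")]
    for other in ("c2", "c3", "c4"):
        print("c1", other, pathsim(links, "c1", other))
```

The score is 1 when two entities have identical neighbourhoods and 0 when they share none; the normalisation by self-path counts keeps highly connected entities from dominating the results.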

SDL-Vis

SDL-Vis: This layer brings the human into the data analysis loop by offering visual analytics capabilities built on top of the functionalities implemented in the two lower layers. Using a visual analytics model at its core to determine user interactions, it offers visualizations that support data profiling and parameter tuning. It also offers custom visualizations for specific types of data, including graph, spatial, and temporal data. Further, individual components typically also include a UI that lets the user interact with the component directly. This supports the goal of keeping each component capable of working both together with the others and on its own, thus reducing the interdependencies between components.

Links

https://smartdatalake.eu/

https://github.com/smartdatalake

Keywords

Big data, virtualization, data visualization, heterogeneous graphs, meta-path, scalability
