
SmartDataLake: Project on sustainable data lakes for extreme-scale analytics
In the era of Big Data, data-driven decision-making processes are becoming increasingly data-intensive. The data lake approach refers to assembling large amounts of diverse data from a multitude of sources, allowing the collected data to retain their original model and format, and allowing users to query and analyse them in situ. Thus, it promises to enable ad hoc, self-service analytics, to reduce the time from data to insights, and to push premature design choices to the stages where they belong.
Introduction and overview
SmartDataLake aims at designing, developing and evaluating novel approaches, techniques and tools for extreme-scale analytics over Big Data Lakes. It tackles the challenges of reducing costs and extracting value from Big Data Lakes by providing solutions for virtualized and adaptive data access; automated and adaptive data storage tiering; smart data discovery, exploration and mining; monitoring and assessing the impact of changes; and empowering the data scientist in the loop through scalable and interactive data visualizations.
The results of the SmartDataLake project will be evaluated in real-world use cases from the Business Intelligence domain, including scenarios for portfolio recommendation, production planning and pricing, and investment decision making, thus demonstrating the relevance and applicability of the developed solutions both within and beyond the scientific community.
Project objectives
The SmartDataLake project tackles the data lake challenge through three primary layers, each addressed by expert partners within their respective fields:
- Adaptive data virtualization and storage tiering, to improve the efficiency of accessing and working with large datasets.
- Heterogeneous information network mining, to support the user in rapidly gaining insights into the data through entity ranking, similarity evaluation between entities, and community detection.
- Visual analytics over spatial, temporal and graph data, to provide intuitive representations of the outcomes of the above data processing algorithms and to ensure that the user is kept in the loop.
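To make the network-mining layer concrete, the sketch below ranks entities in a tiny heterogeneous graph with a plain PageRank-style iteration. All node names, edges and parameters here are hypothetical illustrations; the project's actual mining algorithms are not shown.

```python
# Illustrative sketch only: PageRank-style entity ranking over a toy
# heterogeneous graph mixing node types (funds, people, companies).
# All names and parameters are hypothetical, not SmartDataLake code.

def pagerank(edges, damping=0.85, iterations=50):
    """Rank nodes of a directed graph given as {node: [successors]}."""
    nodes = set(edges) | {v for succs in edges.values() for v in succs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n, succs in edges.items():
            if succs:
                # Each node passes its damped rank to its successors.
                share = damping * rank[n] / len(succs)
                for v in succs:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[n] / len(nodes)
        rank = new_rank
    return rank

# A toy heterogeneous network with mixed edge semantics
# ("invests in", "works for", "supplies").
graph = {
    "fund:Alpha": ["company:Acme", "company:Beta"],
    "person:Jane": ["company:Acme"],
    "company:Beta": ["company:Acme"],
    "company:Acme": [],
}
scores = pagerank(graph)
top_entity = max(scores, key=scores.get)  # the most "central" entity
```

In a real heterogeneous information network, edge and node types would additionally be weighted or filtered (e.g. via meta-paths), which plain PageRank ignores.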
These high-level domains are further broken down into five specific challenges, namely:
- Challenge 1: Handling data heterogeneity – How to operate on data from different sources and in different formats.
- Challenge 2: Reducing storage costs – How to autonomously optimize how and where infrequently used data is stored.
- Challenge 3: Making sense of the data – How to combine datasets and how to learn from them.
- Challenge 4: Monitoring changes – How to identify how insights change over time.
- Challenge 5: Supporting the human in the loop – How to intuitively keep the user in the loop.
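As a rough illustration of Challenge 2, the sketch below applies a simple rule-based tiering policy that assigns datasets to storage tiers by access recency. The thresholds, tier names and `Dataset` record are hypothetical; the project's adaptive, autonomous tiering goes well beyond such fixed rules.

```python
# Illustrative sketch only: a fixed-threshold storage-tiering policy.
# Tier names, thresholds and the Dataset record are hypothetical.

from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    days_since_last_access: int
    size_gb: float

def choose_tier(ds, hot_days=7, warm_days=90):
    """Map a dataset to a storage tier based on access recency."""
    if ds.days_since_last_access <= hot_days:
        return "hot"   # e.g. local SSD
    if ds.days_since_last_access <= warm_days:
        return "warm"  # e.g. networked HDD
    return "cold"      # e.g. archival object storage

catalog = [
    Dataset("transactions_2021", 2, 120.0),
    Dataset("transactions_2019", 45, 110.0),
    Dataset("raw_logs_2017", 400, 900.0),
]
plan = {ds.name: choose_tier(ds) for ds in catalog}
```

An adaptive system would instead learn thresholds from observed access patterns and weigh migration cost against storage savings, rather than applying static cut-offs.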
Project structure
The SmartDataLake consortium consists of a total of eight partners from five European countries: Switzerland, the Netherlands, Germany, Italy and Austria. It combines five research organisations focusing on the research activities – Athena RC (the project coordinator), EPFL, Eindhoven University of Technology and the University of Konstanz – with three industry partners: SpazioDati, Spring Techno and SYNYO. The industry partners provide requirements derived from their day-to-day data analytics and data processing challenges and, at a later stage of the project, give feedback on the components developed during the project, based on hands-on experience from applying them to the previously mentioned challenges.
The project, which is funded by the European Commission through the Horizon 2020 Framework Programme, runs for 36 months, from January 1st, 2019 until December 2021. In order to achieve its goals, the project is divided into three main phases: a requirements elicitation and elaboration phase, a research and development phase, and finally a long-term piloting phase, during which the developed components are evaluated and improved based on iterative feedback from, primarily, the three industry partners.
From a project structure point of view, the work is broken down according to the main challenges. The vertical requirements engineering is focused in WP1 – Requirements, Architecture and Integration, while the key functional layers are developed in WP2 – Adaptive Data Virtualization and Storage Tiering, WP3 – Heterogeneous Information Network Mining, and WP4 – Scalable and Interactive Visual Analytics, respectively. WP5 – Pilot Testing and Evaluation is responsible for planning, executing and evaluating the piloting activities, while WP6 – Communication, Dissemination and Exploitation and WP7 – Management and Coordination handle project dissemination and project administration, respectively.
Links
SmartDataLake project website: https://smartdatalake.eu/
Keywords
Big data, data lake, entity resolution, graph analysis, scalability, data visualization