

SmartDataLake: Validation of the system and individual components through long-term piloting
Throughout the SmartDataLake project, a strong focus has been on driving research results forward while at the same time collecting industry-oriented feedback. To expand on this feedback and to proactively validate the solution, the project includes a ten-month piloting phase, during which all the individual components are applied to use-cases provided by the project’s industry partners.
Overview
In short, the pilot consists of the three pilot partners, SpazioDati, SPRING TECHNO and SYNYO, applying the various SmartDataLake components to their data-related challenges. In line with the rest of the project structure, the pilot focuses on the three main functionalities of the SmartDataLake project: data management, data exploration and analysis, and interactive visual analytics.
The main use-cases have been defined and detailed by specifying the challenges and the relevant key performance indicators, and each has been mapped to the SmartDataLake components that may be used to realize it. The intermediate use-cases are as follows:
- Assembling entity profiles – As even structured and properly annotated data may be distributed across different systems and instances, this functionality allows users to operate on distributed data as if it were provided by a single database instance.
- Matching company profiles across sources – When working with datasets that have not been linked to each other, the challenge is to accurately combine them into a single dataset by matching the individual entities.
- Computing descriptive analytics – Helpful when working with newly acquired data to gain insights into the data quality.
- Finding similar entity profiles – In contrast to identifying exact matches, finding similar entities is helpful for identifying alternatives or for expanding a selection based on certain known parameters.
- Predicting potential links – Similar to finding similar entities, but uses the relations between entities, rather than the attributes of a given entity, as the decisive metric.
- Storage tiering for historical data – As not all data is relevant at all times, this use-case supports autonomous decisions on when, and which, segments of data may be moved to cold storage.
- Detection of similar and correlated time series – This use-case identifies similar trends in time series data.
- Detection of seasonal patterns – Building on the previous use-case, the focus is on adding meta-information such as seasonality when performing the analysis.
- Detection of changes – As a lot of useful data is constantly evolving, monitoring the changes over time provides insights into which entities are being impacted.
- Community detection and ranking of different types of entities – When investigating large data sets, it is helpful to be able to identify the most prominent entities and the dominating communities. This is particularly interesting when combined with change detection, in order to monitor how these evolve.
- Visual analytics accompanying all the above use-cases to provide visual feedback during the analysis.
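To make one of the use-cases above more concrete, the detection of correlated time series can be illustrated with a minimal sketch. It is not the SmartDataLake component's implementation: it assumes equal-length series held in memory, uses the Pearson correlation coefficient from NumPy as the similarity measure, and the function name `correlated_pairs`, the threshold of 0.9 and the example values are hypothetical choices made for illustration only.

```python
import numpy as np


def correlated_pairs(series, threshold=0.9):
    """Return (name_a, name_b, r) for every pair of series whose
    Pearson correlation r is at least `threshold`.

    `series` maps a name to an equal-length list of numeric values.
    """
    names = sorted(series)
    data = np.array([series[name] for name in names], dtype=float)
    # Each row of `data` is one series; corrcoef returns the full
    # pairwise correlation matrix.
    corr = np.corrcoef(data)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if corr[i, j] >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


# Made-up illustrative values, not project data.
example = {
    "company_a": [1, 2, 3, 4, 5, 6],
    "company_b": [2, 4, 6, 8, 10, 12],
    "company_c": [6, 1, 5, 2, 4, 3],
}
pairs = correlated_pairs(example)
```

In this example only `company_a` and `company_b` are reported, since one series is an exact multiple of the other. Computing the full correlation matrix is quadratic in the number of series, so a sketch like this is only suitable for small inputs.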
Throughout the pilots, the project partners will collaborate on executing and evaluating each of the above use-cases and the components that support their implementation. The final outcomes, insights and results will be summarized and published in a project report at the end of the project, at the beginning of January 2022, together with the revised repositories on GitHub, where applicable, containing the developed components.
Links
https://github.com/smartdatalake
Keywords
Big data, data lake, smart, piloting, feedback