Service Based Data Stacks for querying and analysing data at scale: from specification to deployment
Date: May 30, 2023 - 10:30-12:30
Room: 504, Via Carlo Valvassori Peroni 21
Presenter: Genoveva Vargas-Solar, CNRS, LIRIS, France
Point of contact: Claudio Ardagna
Data science (DS) pipelines are specified and executed to process data using different mathematical tools, including numerical, statistical, and probabilistic methods and artificial intelligence models, to answer transdisciplinary research questions. Beyond the challenge of designing DS pipelines and their associated complexity, executing (i.e., enacting) the pipelines requires configuring execution environments and defining dispatching strategies for assigning computing, memory, and storage resources.
Depending on a project's specific needs, various frameworks are available for enacting data science pipelines; popular ones include Apache Airflow, Prefect, Kubeflow, and MLflow. The enactment of data science pipelines must balance the delivery of different types of services, such as:
1. Hardware (computing, storage, and memory)
2. Communication (bandwidth and reliability) and scheduling
3. Greedy analytics and mining with high in-memory and compute-cycle requirements
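To make the notion of enactment concrete, the following is a minimal, hypothetical sketch (not the API of any of the frameworks named above): tasks are plain functions, their dependencies form a DAG, and a simple "enactor" runs them in topological order, passing results downstream. Real frameworks such as Apache Airflow or Prefect add scheduling, retries, and resource dispatching on top of this core idea.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical pipeline tasks: ingest -> clean -> analyse.
def ingest():
    return [3, 1, 4, 1, 5]

def clean(data):
    # Deduplicate and order the raw records.
    return sorted(set(data))

def analyse(data):
    # Compute a simple summary statistic.
    return sum(data) / len(data)

# The DAG: each task name maps to the set of tasks it depends on.
tasks = {"ingest": ingest, "clean": clean, "analyse": analyse}
deps = {"ingest": set(), "clean": {"ingest"}, "analyse": {"clean"}}

def enact(tasks, deps):
    """Run tasks in topological order, feeding each its upstream results."""
    results = {}
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in sorted(deps[name])]
        results[name] = tasks[name](*upstream)
    return results

results = enact(tasks, deps)
print(results["analyse"])  # prints 3.25, the mean of the cleaned data
```

A production enactor would additionally decide *where* each task runs (which node, how much memory) and *when* (scheduling, backfills), which is precisely the configuration burden the talk discusses.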
DS pipelines' execution environments often run on service-oriented platforms, for example, the cloud, clusters, and even just-in-time architectures. Creating a harmonious context to support DS pipelines' execution and the other phases of their life cycle is still a challenge and often a "manual" task.
This talk will survey existing solutions that wholly or partially address these challenges. We will also explain the challenges and implications of preparing and configuring such environments, show examples of tools, and compare their principles, scope, and "limitations".