Horizon 2020: IOSTACK Project


IOSTACK is a European research project funded by the H2020 initiative. The project is a consortium of several European industrial and research partners, including IBM, MPSTOR, Eurecom, Barcelona Supercomputing Center and Universitat Rovira i Virgili. Several partners will also act as users of the system, including IDIADA (automotive), GRIDPOCKET (IoT) and ARCTUR (HPC).

 


IOSTACK is designed for deployments of data analytics in virtual environments. Virtual environments allow very flexible deployment of analytics frameworks but deliver lower performance than bare-metal deployments. IOSTACK will focus on how software defined storage (SDS) can use its knowledge of the cloud topology and of the real-time dynamic characteristics of the cloud to deploy analytics jobs that run and complete within guaranteed SLAs and timescales.
  

The SDS layer's knowledge of the static topology allows compute and storage locality to be optimised. Understanding the dynamic load of the cloud allows a further optimisation: choosing which resources, paths and devices should be used for a given workload.
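The combination of static topology and dynamic load can be illustrated with a small placement sketch. This is not IOStack code: the `Node` type, the three-level distance (same node, same rack, different rack), the `load` metric and the `load_weight` parameter are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a cloud node (not an IOStack type)."""
    name: str
    rack: str
    load: float  # assumed utilisation in [0, 1], sampled at scheduling time


def topology_distance(a: Node, b: Node) -> int:
    """Static topology: 0 = same node, 1 = same rack, 2 = different rack."""
    if a.name == b.name:
        return 0
    if a.rack == b.rack:
        return 1
    return 2


def placement_cost(candidate: Node, data_node: Node, load_weight: float) -> float:
    """Combine locality (static) with current load (dynamic) into one score."""
    return topology_distance(candidate, data_node) + load_weight * candidate.load


def choose_compute_node(
    candidates: Sequence[Node], data_node: Node, load_weight: float = 2.0
) -> Node:
    """Pick the candidate with the lowest combined cost."""
    return min(candidates, key=lambda n: placement_cost(n, data_node, load_weight))


# Example data (illustrative values only).
storage_1 = Node("storage-1", rack="A", load=0.9)
compute_a = Node("compute-a", rack="A", load=0.2)
compute_b = Node("compute-b", rack="B", load=0.1)
nodes = [storage_1, compute_a, compute_b]

best = choose_compute_node(nodes, data_node=storage_1)
print(best.name)
```

With these sample values the node that holds the data is heavily loaded, so the sketch prefers a lightly loaded node in the same rack. Setting `load_weight=0.0` reduces the choice to pure locality.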

The main objective is to create IOStack: a software defined storage toolkit for Big Data on top of the OpenStack platform. IOStack will enable efficient execution of virtualized analytics applications over virtualized storage resources thanks to flexible, automated, and low-cost data management models based on software defined storage (SDS). The major challenges are:

1) Storage and compute disaggregation and virtualization. Virtualizing data analytics to reduce costs implies disaggregation of existing hardware resources. This requires the creation of a virtual model for compute, storage and networking that allows orchestration tools to manage resources efficiently. We will provide policy-based provisioning tools so that virtual components for the analytics platform are provisioned according to a set of QoS policies.

2) SDS Services for Analytics. The objective is to define, design, and build a stack of SDS data services enabling virtualized analytics services with improved performance and usability. These services include native object store analytics, which will allow analytics to run close to the data without a costly initial data migration, as well as data reduction services, specialized persistent caching mechanisms, advanced prefetching, and data placement.

3) Orchestration and deployment of big data analytics services. The objective is to design and build efficient deployment strategies for virtualized analytics-as-a-service instances (both ephemeral and permanent). In particular, the focus of this work is on data-intensive systems such as Apache Hadoop and Apache Spark, which enable users to define both batch and latency-sensitive analytics. This objective includes the design of scalable algorithms that strive to optimize a service-wide objective function (e.g., maximize performance, minimize cost) under different workloads.
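The policy-based provisioning of challenge 1 and the objective function of challenge 3 can be sketched together as a small constrained choice: keep only the options that satisfy a QoS policy, then minimize cost among them. Everything named here (`QoSPolicy`, `StorageTier`, the tier catalog and its numbers) is hypothetical and chosen for illustration; it does not describe the project's actual tools or algorithms.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class QoSPolicy:
    """Hypothetical QoS policy attached to an analytics deployment."""
    min_iops: int
    max_latency_ms: float


@dataclass(frozen=True)
class StorageTier:
    """Hypothetical virtual storage offering with made-up characteristics."""
    name: str
    iops: int
    latency_ms: float
    cost_per_gb: float


def satisfies(tier: StorageTier, policy: QoSPolicy) -> bool:
    """A tier is feasible when it meets every constraint in the policy."""
    return tier.iops >= policy.min_iops and tier.latency_ms <= policy.max_latency_ms


def provision(tiers: Sequence[StorageTier], policy: QoSPolicy) -> Optional[StorageTier]:
    """Objective: minimize cost per GB over the feasible tiers.

    Returns None when no tier can honour the policy.
    """
    feasible = [t for t in tiers if satisfies(t, policy)]
    return min(feasible, key=lambda t: t.cost_per_gb, default=None)


# Illustrative catalog; the figures are invented for the example.
catalog = [
    StorageTier("hdd", iops=200, latency_ms=12.0, cost_per_gb=0.02),
    StorageTier("ssd", iops=20_000, latency_ms=1.0, cost_per_gb=0.10),
    StorageTier("nvme", iops=200_000, latency_ms=0.1, cost_per_gb=0.30),
]

latency_sensitive = QoSPolicy(min_iops=5_000, max_latency_ms=2.0)
chosen = provision(catalog, latency_sensitive)
print(chosen.name if chosen else "no feasible tier")
```

A different objective, such as maximizing performance under a cost cap, would swap the constraint and the `key` function while keeping the same filter-then-optimize structure.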

The Cloud can help reduce costs across the whole life cycle of big data management, from storage to analysis, including the transport and transformation of data into interoperable data formats. The unit cost of storage has decreased by several orders of magnitude thanks to the Cloud, significantly lowering IT investment costs for all businesses. However, large data volumes require distributing data and workload over many servers, which in turn requires efficient support for parallel processing. This has led big players to enter the scene and offer “analytics as a service” through services like Amazon EMR and Microsoft HDInsight, along with massive object stores like Amazon S3 and Azure Storage, to name a few.

These examples highlight that European companies and SMEs lag behind big players in taking up the Big Data challenges, especially compared to the US, and show the absence of analytical Big Data services for SMEs within Europe. Currently, all of these services are run by US-based enterprises, which prevents the benefits from staying in the EU with European users, developers and entrepreneurs. Although those solutions have partly migrated into open source software products, these products do not yet offer a unified Big Data management infrastructure that covers the entire life cycle. Concretely, open source projects like OpenStack Sahara, which aims to provide users with simple means to provision a Hadoop cluster, are still at too early a stage to realize the promise of delivering “analytics as a service” at a reasonable cost.