Name: Distributed Data Provenance for Large-Scale Data-Intensive Computing
Start: 2013-09-25T13:30:00-0400
End: 2013-09-25T13:55:00-0400


View the full conference website here: IEEE Cluster 2013 Conference


Distributed Data Provenance for Large-Scale Data-Intensive Computing

It has become increasingly important to capture and understand the origins and derivation of data (its provenance). Key issues in evaluating the feasibility of data provenance are its performance, overheads, and scalability. In this paper, we explore the feasibility of a general metadata storage and management layer for parallel file systems, in which metadata includes both file operations and provenance metadata. We experimentally investigate the design optimality: whether provenance metadata should be loosely coupled or tightly integrated with a file metadata storage system. We consider two systems that have applied similar distributed concepts to metadata management, each focusing on a single kind of metadata: (i) FusionFS, which implements distributed file metadata management based on distributed hash tables, and (ii) SPADE, which uses a graph database to store audited provenance data and provides a distributed module for querying provenance. Our results on a 32-node cluster show that FusionFS+SPADE is a promising prototype with negligible provenance overhead and has promise to scale to petascale and beyond. Furthermore, FusionFS with its own storage layer for provenance capture is able to scale up to 1K nodes on a BlueGene/P supercomputer.
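To make the "tightly integrated" design option more concrete, the following is a minimal, in-memory sketch in which file metadata and provenance edges share one hash-partitioned store. It is an illustration only and is not the FusionFS or SPADE implementation: the class names, the method names (`record_write`, `record_read`, `ancestors`), and the SHA-1 modulo placement rule are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


class MetadataNode:
    """One hypothetical storage node holding a shard of the key space."""

    def __init__(self):
        self.file_meta = {}                 # path -> file metadata
        self.prov_edges = defaultdict(set)  # entity -> entities it derives from


class DistributedMetadataStore:
    """Toy hash-partitioned store for file and provenance metadata (assumed design)."""

    def __init__(self, num_nodes):
        self.nodes = [MetadataNode() for _ in range(num_nodes)]

    def _owner(self, key):
        # Hash the key to pick the node responsible for it (assumed placement rule).
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def record_write(self, process, path, size):
        # A write updates file metadata and records "file was generated by process".
        node = self._owner(path)
        node.file_meta[path] = {"size": size}
        node.prov_edges[path].add(process)

    def record_read(self, process, path):
        # A read records "process used file" on the node owning the process key.
        self._owner(process).prov_edges[process].add(path)

    def stat(self, path):
        return self._owner(path).file_meta.get(path)

    def ancestors(self, entity):
        # Walk provenance edges backwards, visiting the owning node of each entity.
        seen = set()
        stack = [entity]
        while stack:
            current = stack.pop()
            for parent in self._owner(current).prov_edges.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


store = DistributedMetadataStore(num_nodes=4)
store.record_write("proc:gen", "/data/raw.dat", 1024)
store.record_read("proc:filter", "/data/raw.dat")
store.record_write("proc:filter", "/data/clean.dat", 512)
lineage = store.ancestors("/data/clean.dat")
```

Here `lineage` contains the two processes and the raw input file. In a loosely coupled design, the `prov_edges` part would instead live in a separate provenance service (for example, a graph database) that is fed by audited file operations.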

Speakers

IEEE Cluster 13 Conference

Dongfang Zhao

Tanu Malik

Ioan Raicu

Chen Shou
