Gathering Our Thoughts on Large Data Sets


Scary Numbers

  • 80 GB/s was the data rate quoted for the 100-feed W-band array. At that rate, the 2.4 PB disk array of the Kraken supercomputer at Oak Ridge would fill in roughly 30,000 seconds, a little over 8 hours. (RC; thanks to Mike for pointing this out.)

  • Current bandwidth in/out of GB is 45 Mbit/s. As for the future, according to Gene Runion:
    Our hope was/is to use the earmarked WVa stimulus money to bring high-speed connectivity to GB and the community. If this were to happen, then for it to be useful WVU would also need to upgrade its link to the outside world, as its current link is saturated.
      • It would take about 17 minutes to move 1 TB of data at 10 Gbps (see the back-of-envelope sketch below). Other people face similar problems moving very large data sets; they are "solving" the problem by moving the compute/visualization machines to the data.
      • NCAR claims it took them a year to move their archive over to their new mass storage system, which holds about 6 petabytes of data with room for expansion. They still use some form of tape for the archive (tapes are mounted automatically) because it is cost-effective.

    If the WVa stimulus initiative doesn't pan out, we could fairly easily provision multiple 45 Mbit/s links or a 155 Mbit/s link. Even higher-speed connections are certainly possible, but the build-out cost and monthly recurring cost (MRC) are very high.
    Compare these numbers to the links used by the LHC Computing Grid (RC; thanks Chris and Gene)
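  • A quick back-of-envelope check of the numbers above may be useful. The short Python sketch below simply redoes the arithmetic (decimal prefixes assumed, no protocol overhead; the helper function is purely illustrative), so the ideal 10 Gbps figure comes out a few minutes under the 17-minute estimate quoted above.

      # Back-of-envelope check of the rates quoted above (illustrative only).
      # Assumes decimal prefixes (1 PB = 1e15 bytes, 1 Gbps = 1e9 bit/s) and
      # ignores protocol overhead, so real transfers will be somewhat slower.

      def transfer_time_s(num_bytes, bits_per_second):
          """Ideal time, in seconds, to move num_bytes over the given link."""
          return (num_bytes * 8) / bits_per_second

      # 100-feed W-band array filling Kraken's 2.4 PB disk array at 80 GB/s:
      fill_s = 2.4e15 / 80e9
      print(f"Kraken fill time: {fill_s:.0f} s ({fill_s / 3600:.1f} hours)")

      # Moving 1 TB over candidate links (the current GB link is 45 Mbit/s):
      for label, rate in [("10 Gbps", 10e9), ("155 Mbit/s", 155e6), ("45 Mbit/s", 45e6)]:
          t = transfer_time_s(1e12, rate)
          print(f"1 TB over {label}: {t / 60:.0f} min ({t / 3600:.1f} h)")

    Even the 155 Mbit/s option puts a single terabyte at well over half a day of transfer time, which is the context for the "move the compute to the data" approach mentioned above.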

What is the Scope?

  • Our political/technical approaches must change … or we risk solving irrelevant problems (Microsoft SC09 presentation)
    • Scary numbers are only scary if we don't have money.
    • Since our budgets are finite, policies like data volume budgets need to be established as part of the observing proposal process.
    • Computational resources need to be proposed in tandem with observing proposals.
  • Is NRAO considering providing the computational facilities for data reduction, or (more likely) just developing the software that users will use to reduce data? (N.B.: HPC software often requires 'tuning to the architecture', so it would be to our advantage to develop a relationship with a 'target' processing center.)
    • Although running a "supercomputing facility" may not be the traditional role of NRAO, we should at least evaluate the NCAR model. NCAR is the supercomputing facility for the atmospheric community and is funded through NSF at roughly $150 million per year. It serves over 60 universities through its backbone and supercomputers, and one of its systems is part of the TeraGrid. If supercomputing becomes a necessity for astronomy, then NRAO, as "the" national facility, should host it. Of course, we need to research how this could be funded, and it must be a separate enterprise from running the telescopes.

Archiving

  • How much to save? Where?
  • The Data Intensive Cyber Environments Center (DICE Center) at the University of North Carolina at Chapel Hill has an open-source project called the Integrated Rule-Oriented Data System (iRODS).

Quick Look/Visualization

Pipelines

Algorithms

  • Explicit message-passing models like MPI have long been 'the' parallel programming environment. However, there is much interest in the Partitioned Global Address Space (PGAS) model, implemented for example in Unified Parallel C (UPC) and its Fortran analog, Co-array Fortran. (A minimal message-passing sketch appears below.)
  • For the most part, scientists are writing the parallel-processing code: one must be a domain expert who understands the data and the required processing in order to do the data partitioning that is crucial to parallel processing. Which scientists do we need to involve? Have Scott and Paul already done groundbreaking work with GUPPI? Bill Cotton with Mustang? What has E-E already explored in this area? Have Nicole, Dana, Gareth, and ? already explored some of these problems?
  • We have a few projects where we could cut our teeth on parallel programming:
    • kfpa pipeline
    • OOF
    • mustang
    • ?
  • Something to consider is getting a TeraGrid allocation for algorithmic R&D. This might help with finding the right type of machine, getting experience with HPC, getting concrete numbers on data-processing needs, etc. A TG allocation would probably also make it easier to move around within the HPC community and meet people. The current TG is really just a group of centers offering CPU cycles for allocation through the TG Resource Allocation Committee (TRAC); it's not really a "grid" in terms of global job submission. At some centers, if the machine you're allocated on is not a good fit, they may be able to move you to a machine that suits your applications better, even if that machine is not on TG. This way NRAO could get started with algorithmic R&D without a machine at NRAO. (MTM)
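  • As a concrete illustration of the explicit message-passing model noted above, here is a minimal sketch using the mpi4py Python bindings (mpi4py, the array shapes, and the per-chunk "processing" are illustrative assumptions, not anything we currently run). It scatters chunks of a spectrum across ranks, has each rank do a trivial local computation, and gathers the partial results; deciding how to partition real data is exactly the domain-expert step raised above.

      # Minimal MPI sketch; assumes the mpi4py package is installed.
      # Run with something like: mpirun -n 4 python mpi_sketch.py
      import numpy as np
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()
      size = comm.Get_size()

      if rank == 0:
          # Stand-in for one integration's worth of spectra: (size, nchan).
          data = np.random.rand(size, 4096)
      else:
          data = None

      # Scatter one chunk to each rank, process it locally, gather the results.
      chunk = comm.scatter(data, root=0)   # each rank receives one (4096,) row
      partial = chunk.mean()               # placeholder for real per-chunk processing
      results = comm.gather(partial, root=0)

      if rank == 0:
          print("per-chunk means:", results)

    A PGAS language such as UPC or Co-array Fortran would express the same idea as reads and writes into a partitioned shared array rather than explicit scatter/gather calls, which is much of its appeal for the data-partitioning problem described above.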

Post Processing

Ties into SKA and Becoming a Pathfinder Project

Similar Efforts

Possible Collaborators

Grant Opportunities

Fall Workshop
