Gathering Our Thoughts on Large Data Sets


Scary Numbers

  • 80 GB/s was the data rate quoted for the 100-feed W-band array. At that rate, the 2.4 PB disk array of the Kraken supercomputer at Oak Ridge would fill in roughly 30,000 seconds, a little over 8 hours. (RC; thanks to Mike for pointing this out.)

  • Current bandwidth in/out of GB is 45 Mbit/s. As for the future, according to Gene Runion:
    Our hope was/is to use the earmarked WVa stimulus money to bring high-speed connectivity to GB and the community. If this were to happen, then for it to be useful WVU would also need to upgrade its link to the outside world, as its current link is saturated.
      • It would take about 17 minutes to move 1 TB of data at 10 Gbps (see the back-of-envelope sketch below). Other people face similar problems moving very large data sets; they are "solving" the problem by moving the compute/visualization machines to the data.
      • NCAR claims it took them a year to move their archive over to their new mass storage system, which holds about 6 petabytes of data with room for expansion. They still use some form of tape for the archive (tapes are mounted automatically) because it is cost-effective.

    If the WVa stimulus initiative doesn't pan out, we could fairly easily provision multiple 45 Mbit/s links or a 155 Mbit/s link. Even higher-speed connections are certainly possible, but the build-out cost and monthly recurring cost (MRC) are very high.
    Compare these numbers to the links used by the LHC Computing Grid (RC; thanks Chris and Gene)
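  • A quick back-of-envelope check of the numbers above may be useful. The short Python sketch below simply redoes the arithmetic (decimal prefixes assumed, no protocol overhead; the helper function is purely illustrative), so the ideal 10 Gbps figure comes out a few minutes under the 17-minute estimate quoted above.

      # Back-of-envelope check of the rates quoted above (illustrative only).
      # Assumes decimal prefixes (1 PB = 1e15 bytes, 1 Gbps = 1e9 bit/s) and
      # ignores protocol overhead, so real transfers will be somewhat slower.

      def transfer_time_s(num_bytes, bits_per_second):
          """Ideal time, in seconds, to move num_bytes over the given link."""
          return (num_bytes * 8) / bits_per_second

      # 100-feed W-band array filling Kraken's 2.4 PB disk array at 80 GB/s:
      fill_s = 2.4e15 / 80e9
      print(f"Kraken fill time: {fill_s:.0f} s ({fill_s / 3600:.1f} hours)")

      # Moving 1 TB over candidate links (the current GB link is 45 Mbit/s):
      for label, rate in [("10 Gbps", 10e9), ("155 Mbit/s", 155e6), ("45 Mbit/s", 45e6)]:
          t = transfer_time_s(1e12, rate)
          print(f"1 TB over {label}: {t / 60:.0f} min ({t / 3600:.1f} h)")

    Even the 155 Mbit/s option puts a single terabyte at well over half a day of transfer time, which is the context for the "move the compute to the data" approach mentioned above.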

What is the Scope?

  • Our political/technical approaches must change … or we risk solving irrelevant problems (Microsoft SC09 presentation)
    • Scary numbers are only scary if we don't have money.
    • Since our budgets are finite, policies like data volume budgets need to be established as part of the observing proposal process.
    • Computational resources need to be proposed in tandem with observing proposals.
  • Is NRAO considering providing the computational facilities for data reduction, or (more likely) just developing the software that users will use to reduce data? (N.B.: HPC software often requires 'tuning to the architecture', so it would be to our advantage to develop a relationship with a 'target' processing center.)
    • Although running a "supercomputing facility" may not be the traditional role of NRAO, we should at least evaluate the NCAR model. NCAR is the supercomputing facility for the atmospheric community and is funded through NSF at roughly $150 million per year. It serves over 60 universities through its backbone and supercomputers, and one of its systems is part of the TeraGrid. If supercomputing becomes a necessity for astronomy, then NRAO, as "the" national facility, should host it. Of course, we need to research how this could be funded, and it must be a separate enterprise from running the telescopes.

Archiving

  • How much to save? Where?
  • The Data Intensive Cyber Environments Center (DICE Center) at the University of North Carolina at Chapel Hill has an open-source project called the Integrated Rule-Oriented Data System (iRODS).

Quick Look/Visualization

Pipelines

Algorithms

  • Explicit message-passing models like MPI have long been 'the' parallel programming environment. However, there is much interest in the Partitioned Global Address Space (PGAS) model, implemented for example in Unified Parallel C (UPC) and its Fortran analog, Co-array Fortran. (A minimal message-passing sketch appears below.)
  • For the most part, scientists are writing the parallel-processing code: one must be a domain expert who understands the data and the required processing in order to do the data partitioning that is crucial to parallel processing. Which scientists do we need to involve? Have Scott and Paul already done groundbreaking work with GUPPI? Bill Cotton with Mustang? What has E-E already explored in this area? Have Nicole, Dana, Gareth, and ? already explored some of these problems?
  • We have a few projects where we could cut our teeth on parallel programming:
    • kfpa pipeline
    • OOF
    • mustang
    • ?
  • Something to consider is getting a TeraGrid allocation for algorithmic R&D. This might help with finding the right type of machine, getting experience with HPC, getting concrete numbers on data-processing needs, etc. A TG allocation would probably also make it easier to move around within the HPC community and meet people. The current TG is really just a group of centers offering CPU cycles for allocation through the TG Resource Allocation Committee (TRAC); it's not really a "grid" in terms of global job submission. At some centers, if the machine you're allocated on is not a good fit, they may be able to move you to a machine that suits your applications better, even if that machine is not on TG. This way NRAO could get started with algorithmic R&D without a machine at NRAO. (MTM)
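  • As a concrete illustration of the explicit message-passing model noted above, here is a minimal sketch using the mpi4py Python bindings (mpi4py, the array shapes, and the per-chunk "processing" are illustrative assumptions, not anything we currently run). It scatters chunks of a spectrum across ranks, has each rank do a trivial local computation, and gathers the partial results; deciding how to partition real data is exactly the domain-expert step raised above.

      # Minimal MPI sketch; assumes the mpi4py package is installed.
      # Run with something like: mpirun -n 4 python mpi_sketch.py
      import numpy as np
      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank = comm.Get_rank()
      size = comm.Get_size()

      if rank == 0:
          # Stand-in for one integration's worth of spectra: (size, nchan).
          data = np.random.rand(size, 4096)
      else:
          data = None

      # Scatter one chunk to each rank, process it locally, gather the results.
      chunk = comm.scatter(data, root=0)   # each rank receives one (4096,) row
      partial = chunk.mean()               # placeholder for real per-chunk processing
      results = comm.gather(partial, root=0)

      if rank == 0:
          print("per-chunk means:", results)

    A PGAS language such as UPC or Co-array Fortran would express the same idea as reads and writes into a partitioned shared array rather than explicit scatter/gather calls, which is much of its appeal for the data-partitioning problem described above.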

Post Processing

Ties into SKA and Becoming a Pathfinder Project

Similar Efforts

Possible Collaborators

Grant Opportunities

Fall Workshop
