A Lightweight System for Flexible VO-Compatible Data Access

Modification Request #9 (C1-C4 2009)

1. Introduction

This MR describes our proposed approach for phasing in a lightweight system for accessing data via the web (data delivery), to be shared by multiple end-to-end systems across NRAO. It describes how the data access system is structured and how we can move from the current system to the new structure, and it identifies potential interfaces with other systems and web pages (e.g. NRAO Papers, proposals.nrao.edu, VO DataScope, project/observation home pages, especially for Large Projects, and the Data Vault).

2. Background

This MR was motivated by some issues with the flexibility of the Data Vault at http://archive.cv.nrao.edu, as well as with the data integrity of the FITS headers in the archive. Specifically, links from the NRAO Papers system (at https://safe.nrao.edu/php/library/search.shtml) to the Data Vault frequently 1) take up to a minute, and 2) turn up no search results, because the proposal code that NRAO Papers uses as a key can't be matched to project codes in the original FITS files. This is a shortcoming of the data integrity of the FITS files; the Data Vault search interface has very tight integrity requirements for matching search terms to data records. Many FITS files do not meet the current integrity requirements, and unfortunately, this can't be changed for all the data already in the archive. So we look for a solution which 1) does not require any changes to the datasets in the archive, 2) does not loosen important integrity constraints, 3) returns results quickly, and 4) returns actual results where data is known to be in the archive.

After a long time working on the archiving problem, we realized that the one common thread unifying all of our stakeholders was data access. Everyone wants quick access to the raw data, either for an entire project or for a part of a project. The most important aspect of end-to-end data access, then, is the ability to download raw data quickly and easily based on a key (such as the proposal/project code). To accomplish this, we have to place data access (not search, nor data mining) at the center of the system. This is a distinct contrast with the current Data Vault system, which places metadata querying and data mining at the center.

3. Requirements

What are the implementation-independent requirements for this MR?

  • The Core: Given a project ID, provide a REST-based interface to return the corresponding data set(s).
  • The Data Cover Sheet:
    • Phase 1: Display the project ID with Prepare/Download links, and a choice of compression options for the returned archive. Do not allow download of proprietary data.
    • Phase 2: Add a list of URLs that correspond to other services or data related to that dataset.
    • Phase 3: Add the project title, abstract and PI/Co-I information. (At this point, the Data Cover Sheets could potentially become the displayed unit of information for proposals.nrao.edu.)
  • At all phases, provide some mechanism for mapping project IDs to, at a minimum, observers.
  • How the external services use the core and data cover sheet:
    • Exposing Data Cover Sheets to VO DataScope
      • We want users to be able to get this information from DataScope results - how to do it?
      • Direct downloader is prerequisite.
    • Things that Large Projects Need (Tony Remijan knows this well for PRIMOS)
    • How does NRAO Papers - Data Vault relationship change?
  • Proprietary Data Download: integrate with the observatory-wide authentication system once ready.
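
As an illustration of the Phase 1 cover sheet requirement (one downloadable archive per project, a choice of compression, and no download of proprietary data), here is a minimal Python sketch. The function name bundle_project, the proprietary flag, and the set of compression options are assumptions made for this example; this MR does not specify them.

```python
import tarfile
from pathlib import Path

# Compression choices offered to the user (assumed set), mapped to
# tarfile write modes and to the suffix of the resulting file.
TAR_MODES = {"none": "w", "gzip": "w:gz", "bzip2": "w:bz2"}
SUFFIXES = {"none": ".tar", "gzip": ".tar.gz", "bzip2": ".tar.bz2"}


def bundle_project(project_dir: Path, out_dir: Path,
                   compression: str = "gzip",
                   proprietary: bool = False) -> Path:
    """Collect one project's directory into a single archive file.

    `proprietary` is a hypothetical flag; how proprietary status is
    determined is outside the scope of this sketch.
    """
    if proprietary:
        raise PermissionError("proprietary data cannot be downloaded")
    if compression not in TAR_MODES:
        raise ValueError(f"unknown compression option: {compression!r}")

    out_path = out_dir / (project_dir.name + SUFFIXES[compression])
    with tarfile.open(str(out_path), TAR_MODES[compression]) as tar:
        # Store files under the project ID rather than the full disk path.
        tar.add(str(project_dir), arcname=project_dir.name)
    return out_path
```

The "Prepare" link would correspond to calling something like bundle_project, and the "Download" link to serving the file it returns.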

4. Design

The primary design concern of Data Vault Lite is to get back to the basics. While Data Vault works primarily from a MySQL database, Data Vault Lite aims to revolve around data on disk. A prerequisite of this is an organized archive machine, where data is easily and systematically found by project ID. The very first cut of Data Vault Lite will work with the archive storage machine at Charlottesville Edgemont Rd to find a data set on the machine by formula (not database pointer), given a project ID and an understanding of the directory tree of the archive.
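
To make "by formula (not database pointer)" concrete, the following Python sketch computes a project's directory directly from its ID. The archive root, the one-directory-per-project layout, the upper-case normalization, and the allowed characters in a project ID are all assumptions for illustration; the real directory tree of the archive machine is not described in this MR.

```python
import re
from pathlib import Path

# Hypothetical archive root; the real mount point is not given in this MR.
ARCHIVE_ROOT = Path("/archive")

# Assumed shape of a project ID: letters and digits, plus '-' or '_'.
# Rejecting anything else also keeps "../" out of the computed path.
_PROJECT_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def project_dir(project_id: str, root: Path = ARCHIVE_ROOT) -> Path:
    """Compute (rather than look up) the directory for a project ID."""
    pid = project_id.strip().upper()
    if not _PROJECT_ID.fullmatch(pid):
        raise ValueError(f"malformed project ID: {project_id!r}")
    # Assumed layout: <root>/<PROJECT_ID>/
    return root / pid


def has_data(project_id: str, root: Path = ARCHIVE_ROOT) -> bool:
    """True if the computed directory exists and is not empty."""
    directory = project_dir(project_id, root)
    return directory.is_dir() and any(directory.iterdir())
```

Because the path is a pure function of the project ID, no database query is needed to answer whether data exists for a project.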

Once data is indexed through such an interface, the downloader will be staged to collect all of a project into a single archive file. With this functionality, the goal will be to increase interaction with a given project's data set. For starters, Data Vault Lite will provide a simple API to give an XML result informing the caller if data is available for a particular project ID, and where users can download that data.
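
A minimal sketch of what such an XML result could look like, built with Python's standard library. The element and attribute names (project, available, download, href) are placeholders invented for this example; the actual response format has not been defined.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def availability_xml(project_id: str, available: bool,
                     download_url: Optional[str] = None) -> str:
    """Build an XML answer to "is there data for this project ID?".

    The schema here is hypothetical and only illustrates the idea.
    """
    root = ET.Element("project", id=project_id)
    ET.SubElement(root, "available").text = "true" if available else "false"
    # Only advertise a download location when data actually exists.
    if available and download_url:
        ET.SubElement(root, "download", href=download_url)
    return ET.tostring(root, encoding="unicode")
```

A caller such as NRAO Papers could parse this result and show a download link only when the available element is "true".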

With a simple API, the design of Data Vault Lite will shift from getting data to getting data services. Users will have access to a data set given a project ID, and over time, users will have more and more linked services to the project ID cover sheet. The focus will be to get various systems talking to each other through simple API (and API-like) interactions.

(We have a photo of a whiteboard drawing. We need to digitize it and place it here.)

If a service can key off project ID, it should be included in the data cover sheet.
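
One way to picture this rule is a small registry of URL templates, each taking only the project ID. The Python sketch below assumes such a registry; the service names and the example.org URLs are hypothetical and do not describe real endpoints.

```python
from dataclasses import dataclass, field
from typing import Dict
from urllib.parse import quote

# Hypothetical URL templates. Any service that can be addressed with
# nothing but a project ID can be registered here.
SERVICE_TEMPLATES: Dict[str, str] = {
    "download": "https://example.org/dvlite/{pid}/download",
    "papers": "https://example.org/papers?project={pid}",
}


@dataclass
class CoverSheet:
    project_id: str
    links: Dict[str, str] = field(default_factory=dict)


def build_cover_sheet(project_id: str,
                      templates: Dict[str, str] = SERVICE_TEMPLATES) -> CoverSheet:
    """Fill every registered template with the (URL-escaped) project ID."""
    pid = quote(project_id, safe="")
    links = {name: tmpl.format(pid=pid) for name, tmpl in templates.items()}
    return CoverSheet(project_id=project_id, links=links)
```

Adding a new linked service to every cover sheet then means adding one entry to the registry, with no change to the cover sheet code.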

5. Supporting Even More User Interaction

Once we have a landing page for each GBT dataset by project ID, and we have systematically linked in all the information we can find (through Phase 3), we can look for more tools that increase user interaction with the data sets. Possibilities include:

  • embed an fv-like FITS viewer in the web browser to view data sets without downloading them
  • support VO Cone Search and Simple Image Access for all data in the archive

6. Deployment

This will be done in at least three phases:

  • Phase 1: Generate the "Data Cover Sheet" with barebones information, plus a working download URL.
  • Phase 2: Add "Additional Resources" information to the Data Cover Sheet, linking to any services that can key off project ID and known information.
  • Phase 3: Add the proposal abstract and PI/Co-I names to the Data Cover Sheet.

7. Test Plan

  • Direct Downloader. Can we get data from the URL in one step, given only the project ID?


APPROVED: To the best of my knowledge, the request in this MR is complete. I have thought through this request, and believe it to be an important feature to implement or bug to fix.
ACCEPTED: I acknowledge that I have validated the completed code according to the acceptance tests.
Written symbol - name - date
Double-Checked symbol - name - date
Approved by Sponsor symbol - name - date
Accepted/Delivered by Sponsor symbol - name - date


-- NicoleRadziwill - 2009-01-15

Topic revision: 2009-01-23, RonDuPlain