A Lightweight System for Flexible VO-Compatible Data Access
Modification Request #9 (C1-C4 2009)
1. Introduction
This MR describes our proposed approach for phasing in a lightweight system for accessing data via the web (data delivery), to be shared by multiple end-to-end systems across NRAO. It describes how the data access system is structured and how we can move from the current system to the new structure, and it identifies potential interfaces with other systems and web pages (e.g. NRAO Papers, proposals.nrao.edu, VO DataScope, project/observation home pages, especially for Large Projects, and the Data Vault).
2. Background
This MR was motivated by issues with the flexibility of the Data Vault at http://archive.cv.nrao.edu, as well as with the data integrity of the FITS headers in the archive. Specifically, links from the NRAO Papers system (at https://safe.nrao.edu/php/library/search.shtml) to the Data Vault frequently 1) take up to a minute, and 2) turn up no search results, because the proposal code that NRAO Papers uses as a key can't be matched to project codes in the original FITS files. This is a shortcoming of the data integrity of the FITS files; the Data Vault search interface has very tight integrity requirements for matching search terms to data records. Many FITS files do not meet the current integrity requirements, and unfortunately, this can't be changed for all the data already in the archive. So we look for a solution which 1) does not require any changes to the datasets in the archive, 2) does not loosen important integrity constraints, 3) returns results quickly, and 4) returns actual results where data is known to be in the archive.
After a long time working on the archiving problem, we realized that the one common thread unifying all of our stakeholders was data access. Everyone wants to be able to get quick access to the raw data, either for an entire project or for a part of a project. The most important aspect of end-to-end data access, then, would be to quickly and easily be able to download raw data based on a key (such as the proposal/project code). To accomplish this, we would have to place data access (not search, nor data mining) at the center of the system. This presents a distinct contrast with the current Data Vault system, which places metadata querying and data mining at the center of the system.
3. Requirements
What are the implementation-independent requirements for this MR?
- The Core: Given a project ID, provide a REST-based interface to return the corresponding data set(s).
- The Data Cover Sheet:
- Phase 1: Display the project ID with Prepare/Download links, and a choice of archive compression options to return. Do not allow download of proprietary data.
- Phase 2: Add a list of URLs that correspond to other services or data related to that dataset.
- Phase 3: Add the project title, abstract and PI/Co-I information. (At this point, the Data Cover Sheets could potentially become the displayed unit of information for proposals.nrao.edu.)
- At all phases, provide some mechanism for mapping project IDs to, at a minimum, observers.
- How the external services use the core and data cover sheet:
- Exposing Data Cover Sheets to VO DataScope
- We want users to be able to get this information from DataScope results - how to do it?
- Direct downloader is prerequisite.
- Things that Large Projects Need (Tony Remijan knows this well for PRIMOS)
- How does NRAO Papers - Data Vault relationship change?
- Proprietary Data Download: integrate with the observatory-wide authentication system once ready.
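One possible shape for the Data Cover Sheet record across the three phases is sketched below in Python. The field names, the compression formats, the base URL, and the URL layout are all assumptions made for illustration; this MR does not specify them. The only behavior taken from the requirements is that proprietary data gets no download link.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple
from urllib.parse import quote

# Hypothetical compression choices; the MR does not name specific formats.
COMPRESSION_OPTIONS: Tuple[str, ...] = ("tar", "tar.gz", "zip")


@dataclass
class DataCoverSheet:
    # Phase 1: project ID, proprietary flag, compression choices.
    project_id: str
    proprietary: bool
    compression_options: Tuple[str, ...] = COMPRESSION_OPTIONS
    # Phase 2: URLs of other services or data related to this dataset.
    related_urls: List[str] = field(default_factory=list)
    # Phase 3: proposal metadata.
    title: Optional[str] = None
    abstract: Optional[str] = None
    investigators: List[str] = field(default_factory=list)

    def download_url(self, base_url: str, compression: str) -> Optional[str]:
        """Return a download link, or None when the data is proprietary."""
        if self.proprietary:
            return None  # requirement: do not allow download of proprietary data
        if compression not in self.compression_options:
            raise ValueError(f"unsupported compression: {compression!r}")
        # Assumed URL layout: <base>/<project ID>/download?format=<compression>
        project = quote(self.project_id, safe="")
        return f"{base_url.rstrip('/')}/{project}/download?format={compression}"
```

A caller would build one record per project ID and render the link only when `download_url` returns a value.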
4. Design
The primary design concern of Data Vault Lite is to get back to the basics. While Data Vault works primarily from a MySQL database, Data Vault Lite aims to revolve around data on disk. A prerequisite of this is an organized archive machine, where data is easily and systematically found by project ID. The very first cut of Data Vault Lite will work with the archive storage machine at Charlottesville Edgemont Rd to find a data set on the machine by formula (not database pointer), given a project ID and an understanding of the directory tree of the archive.
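To illustrate what "by formula" could mean, here is a minimal Python sketch that derives a directory from a project ID with no database lookup. The archive root `/archive`, the one-directory-per-project layout, and the upper-casing rule are assumptions; the real directory tree on the archive machine would determine the actual formula.

```python
import re
from pathlib import PurePosixPath

# Hypothetical archive root; the real mount point is not given in this MR.
ARCHIVE_ROOT = PurePosixPath("/archive")

# Assumed character set for project IDs. Rejecting anything else also
# prevents path traversal such as "../etc".
_PROJECT_ID = re.compile(r"[A-Za-z0-9_-]+")


def project_directory(project_id: str,
                      root: PurePosixPath = ARCHIVE_ROOT) -> PurePosixPath:
    """Compute where a project's data would live, purely from its ID."""
    if not _PROJECT_ID.fullmatch(project_id):
        raise ValueError(f"invalid project ID: {project_id!r}")
    # Assumed formula: one directory per project, named by the upper-cased ID.
    return root / project_id.upper()
```

Because the path is computed rather than looked up, a missing or mismatched database record cannot hide data that is physically present on disk.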
Once data is indexed through such an interface, the downloader will be staged to collect all of a project into a single archive file. With this functionality, the goal will be to increase interaction with a given project's data set. For starters, Data Vault Lite will provide a simple API to give an XML result informing the caller if data is available for a particular project ID, and where users can download that data.
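The XML result could look something like the output of the following Python sketch. The element names (`dataAvailability`, `projectId`, `available`, `downloadUrl`) are invented for illustration and are not a schema defined by this MR.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def availability_xml(project_id: str, available: bool,
                     download_url: Optional[str] = None) -> str:
    """Build a small XML document saying whether data exists for a project."""
    root = ET.Element("dataAvailability")  # hypothetical element names throughout
    ET.SubElement(root, "projectId").text = project_id
    ET.SubElement(root, "available").text = "true" if available else "false"
    # Only advertise a download location when data is actually available.
    if available and download_url:
        ET.SubElement(root, "downloadUrl").text = download_url
    return ET.tostring(root, encoding="unicode")
```

A caller such as NRAO Papers could parse this response and show a download link only when `available` is `true`.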
With a simple API, the design of Data Vault Lite will shift from getting data to getting data services. Users will have access to a data set given a project ID, and over time, users will have more and more linked services to the project ID cover sheet. The focus will be to get various systems talking to each other through simple API (and API-like) interactions.
(We have a photo of a whiteboard drawing. We need to digitize it and place it here.)
If a service can key off project ID, it should be included in the data cover sheet.
5. Supporting Even More User Interaction
Once we have a landing page for each GBT dataset by project ID, and we have systematically linked in all the information we can find (through Phase 3), we can look for more ways for users to interact with the data sets. Possibilities include:
- embed a FITS viewer (an fv clone) in the web browser to view data sets without downloading them
- support VO cone search and Simple Image Access for all data in the archive
6. Deployment
This will be done in at least three phases:
- Phase 1: Generate the "Data Cover Sheet" with barebones information, plus a working download URL.
- Phase 2: Add "Additional Resources" information to Data Cover Sheet, link to any services which can key off project ID and known information.
- Phase 3: Add the proposal abstract and PI/Co-I names to the Data Cover Sheet
7. Test Plan
- Direct Downloader. Can we get data from the URL in one step, given only the project ID?
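The Direct Downloader check could be automated along the lines of the Python sketch below. The `build_url` and `fetch` callables are hypothetical stand-ins for the real URL scheme and HTTP client, so the check can be run against a fake before the service exists.

```python
from typing import Callable


def check_direct_download(project_id: str,
                          build_url: Callable[[str], str],
                          fetch: Callable[[str], bytes]) -> bool:
    """Pass if a single fetch of the URL built from the project ID yields data."""
    url = build_url(project_id)  # the project ID is the only input
    payload = fetch(url)         # one step: no search or intermediate page
    return len(payload) > 0
```

In a real acceptance test, `fetch` would issue an HTTP GET against the deployed service.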
Signatures
APPROVED: To the best of my knowledge, the request in this MR is complete. I have thought through this request, and believe it to be an important feature to implement or bug to fix.
ACCEPTED: I acknowledge that I have validated the completed code according to the acceptance tests.
Written | symbol - name - date |
Double-Checked | symbol - name - date |
Approved by Sponsor | symbol - name - date |
Accepted/Delivered by Sponsor | symbol - name - date |
Symbols:
- Use %<nop>X% if MR is not complete
- Use %<nop>Y% if MR is complete
--
NicoleRadziwill - 2009-01-15