ADASS 2007 Notes

Visualization Tutorials

3-D Visualization in Astronomy

  • View data in three axes: (freq,RA,DEC)
  • Normally we would like to display 3 axes with similar types, i.e. (pos,pos,pos), but at least in Radio Astronomy we observe (vel,pos,pos)
    • It is easy to read false significance into axes of different types. This point was not fully discussed.
    • Even so, better than trying to integrate 'slices' in your head.
  • Interesting: the researchers are focused on medical imaging, but can easily see astronomical applications. Same problems, different data.
  • 3-D intensity is a function of (freq,RA,DEC); coloring can give additional meaning.
  • 4-D (3-D + time)
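
The cube idea above can be sketched in a few lines. This is only an illustration, assuming a tiny invented cube stored as nested lists indexed as cube[channel][dec][ra]; the helper names (channel_map, integrated_map, spectrum) are hypothetical and do not come from any tool mentioned in the talk.

```python
# Toy (freq, DEC, RA) cube: 3 channels of a 2x2 sky patch. Values are invented.
cube = [
    [[0.0, 1.0], [0.0, 0.0]],   # channel 0
    [[0.0, 2.0], [1.0, 0.0]],   # channel 1
    [[0.0, 1.0], [3.0, 0.0]],   # channel 2
]

def channel_map(cube, channel):
    """Return one 2-D 'slice' of the cube at a fixed frequency channel."""
    return cube[channel]

def integrated_map(cube):
    """Sum intensity along the frequency axis to get a single 2-D map."""
    n_dec, n_ra = len(cube[0]), len(cube[0][0])
    return [[sum(plane[d][r] for plane in cube) for r in range(n_ra)]
            for d in range(n_dec)]

def spectrum(cube, dec, ra):
    """Intensity versus channel at one sky position."""
    return [plane[dec][ra] for plane in cube]

print(integrated_map(cube))
print(spectrum(cube, 1, 0))
```

Viewing channel_map one channel at a time is the 'slices' approach; a 3-D viewer works on the whole cube at once.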

  • Requirements for 3-D viewers:
    • data formats
    • extraction tools
    • application domain specific specialization
    • Interactive
    • Quality/Complexity (resolution/output)
    • Few standard applications, but some listed at

Astronomical Medicine Project (Harvard)

  • Processing pipeline for generic visualization
  • Slices vs. Volumes
    • It used to be an exclusive domain of radiologists to read volume data by viewing 'slices'. Now a surgeon can view 3-D data directly.
    • Brains not good at integrating 2-D data into 3-D meaning, but can grasp 3-D data easily.
  • Points vs. Surfaces
  • There is only so much 2-D imaging can do. But true 3-D data has opened up a new class of volume-imaging algorithms.

  • Typical pipeline:
    • Acquire data
    • Enhance
    • Registration of data to some datum
    • Segment/Classify objects/features
    • Display (stereo (3-D), mono (2.5D))
    • Interact and iterate

  • Visualization
    • Computer Graphics + Structures + Application Semantics + Human Perception
      • Example:
                        Isometric surfaces      Volume rendering
          Speed         Fast                    Slow
          Segmentation  application specific    global transforms
    • Similarities between medical and astronomical 3-D:
      • spatially oriented data
      • complex datasets
    • Differences:
      • point sources (astro)
      • non-spatial axes
      • absolute coordinate axes for registration (WCS); doesn't exist in patient data

  • Tools
    • 3-D Slicer
    • VTK library (visualization)
    • ITK library (segmentation)
    • OsiriX (Mac only, Linux under development, open source)
    • Open Inventor/Coin
    • Java 3D & Open-GL

  • Golden Age of Astronomy Visualization
    • A gap currently exists; hardware capability has outstripped what we can think of doing with it

Stereo Display and Imaging

  • Natural mode of perception
  • Spatial localization
  • Allows seeing hidden detail
  • Surface curvature
  • Realistic Perception

  • Helmholtz's Equation:

  • Screen size limits depth presentations possible.
  • Camera Control: non-trivial to setup view mapping process; what type of model to use?
    • suggest parallel frustums

  • Movie

  • Types of 3-D displays
    • Glasses free laptop display (assumes fixed viewing angle)
    • Head-tracked, with glasses
    • Volumetric/Holographic display using "painting" of spinning disk
    • All types have depth limitations

  • Movie "Perseus" from Radio observations

3-D Viewers

  • OsiriX
    • mac-only
    • written by medical community to view data
    • nice interface
  • 3-D Slicer
    • open source project
    • has tools for selecting data
    • understands 'clump-finder' files
    • multi-window display
  • VisIVO
    • Joe: Is this VO the same/compatible with US VO?
    • Tool for particle-like display
    • retrieves data from VO web services
    • can display point sources as spheres for clarity
    • Uses PLASTIC for inter-app communication via VOTables
      • select objects in one program, and the matching selection appears in other apps viewing the same data

3D Links:

3-D Publications

  • Neat concepts web site
  • Volume rendering, particle display library
  • Region selection
  • Statistics on regions
  • 3-D PDF + Javascript to publish interactive 3-D datasets in normal PDF files.
    • Uses VRML encoded data in document
  • Can link items in data to external documents, VO queries, etc.
  • File sizes are larger but entirely manageable
  • Library is a toolbox, not a publication/analysis environment

Data Mining Tutorial/Lecture

  • See handouts. Very well prepared speaker.
  • Recommended text: "Introduction to Data Mining", P. Tan, M. Steinbach, V. Kumar, 2006, Addison-Wesley
  • (Industrial standards for data mining best practices.)
  • I have the handouts for this BOF (see me for them)

Data Preservation

Data Preservation and the V.O.

  • LOCKSS: "Lots Of Copies Keep Stuff Safe"
  • Why preserve data?
    • New science from old data
    • re-confirmation of results
  • Conditions of reuse
    • discovery/access
    • accurate meta-data, data characterization
    • simple structure, queries must be straightforward to prepare
  • Current Model
    • Raw data downloaded
      • Up to astronomer to analyse
      • Move to place analysis at data centre (i.e., store processed data)
  • How long should data be preserved?
    • Decades
      • proper motion analysis, orbital determination
      • long-term variability studies
    • Concerns
      • Data lifetime exceeds careers
      • storage media
      • ((meaning))
  • Who preserves data?
    • varies by country
    • Universities, National facilities, observatories, libraries
    • Agencies find it hard to fund archives
    • DCC (Digital Curation Centre), an interdisciplinary centre
  • Open Archival Information System (OAIS) Reference Model
    • IBM Paper: "Towards OAIS-based preservation-aware storage"
    • Should start taking the OAIS-RM seriously
  • Not all data is digital
    • Plate libraries 500,000 plates
    • ROE library is 19,000, with only 25% digitized.
    • How much should be done about this?
    • Problem of meta-data in historical data
  • Traceability of data
    • processing tools,
    • Standards for meta-data

Trusted Data Repositories

  • Need for trusted repositories
    • Easy -- with money
  • OAIS Reference Model
    • claim to preserve data is "untestable"
    • Can we make it testable?
      • Understandability
      • still meaningless
        • Too inter-related
  • Designated community (who is customer?)
  • Representation Information
    • What do the bits mean?
  • FITS file
    • Need for standard doc
    • Extensions doc
    • software language/virtual machine standards
    • data dictionaries
  • Changes in environments, organizations, standards, schema changes
    • web link "aging"
    • copyrights in data
    • Loss of data knowledge by retirement of people
  • Maintaining Authenticity
    • Trust publisher?
    • Traceable to source?
      • A1..A6 Contracts/Funding
      • B1..B6 Digital Obs Management
      • C1..C3 Technologies, Security
      • ISO-9000 "Do what you say, and say what you do"
  • Question: Is certification really worth it? (An issue of trust.)

Long Term Preservation of Results

  • Data in journals is only a graphical representation
    • re-confirmation of results from journal not possible
    • captures data but not word about data
    • access not integrated
    • data in publication vs. data in archive
      • gave example of an anomaly in data
      • example of what we would like to do
  • Data is useless without accurate meta-data.

Building a Framework for Data Preservation (LSST)

  • LSST will take millions of images, billions of astronomical objects, trillions of sources
  • data rates prohibit data quality monitoring
  • policy based access, (like template policies)
  • policy for storage format, mechanism
  • can be extended
  • project is open source, but work in progress...

Making the Sky Searchable

  • google technology (machine learning)
  • source extraction via the web
  • impressive, finds sky location by searching quadrangles of sources
  • automated meta-data generation (remember those 500,000 plates?)
  • extraction of date/time from object relational information
  • Can id snapshots, plates, amateur photos
  • relies upon quality/contents of USNO-B catalog
  • Has trouble with small field images
  • How does it work?
    • uses quadrangles to id hypothesis
    • tests thousands of hypotheses/sec
    • Uses the two most distant positions in image
    • creates a 4-D hash from (2-D, 2-D) positions
    • uses a balanced K-D tree in quad space
    • K-D tree can be saved/loaded as memory image, making queries very fast
    • Once matched searches for additional quads to verify hypothesis
    • searches in both spatial and temporal (standard epochs) space
  • Plans:
    • currently alpha, beta this winter
    • all code is open-source 'C'
  • Evil Plans:
    • all astronomical data ever taken is on the order of 300 Terabytes
    • take over the world
    • Paper is in Science, Lang et al.
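
The quad-hashing steps listed under "How does it work?" can be sketched as follows. This is a minimal illustration rather than the project's code: it assumes four stars per quad, uses the two most distant stars to define a local frame (A at (0,0), B at (1,1)), and takes the positions of the other two stars in that frame as the 4-D code. The symmetry-breaking rule (flip the frame if xC + xD > 1, then order C and D by x) and the helper names quad_code and similarity are assumptions made for this sketch.

```python
import cmath
from itertools import combinations

def quad_code(stars):
    """Return a 4-D code (xC, yC, xD, yD) for four (x, y) star positions."""
    pts = [complex(x, y) for x, y in stars]
    # The two most distant stars (A, B) define the local frame.
    i, j = max(combinations(range(4), 2),
               key=lambda ij: abs(pts[ij[0]] - pts[ij[1]]))
    a, b = pts[i], pts[j]
    # Map A -> (0, 0) and B -> (1, 1); express the other two stars in that frame.
    inner = [(p - a) / (b - a) * complex(1, 1)
             for k, p in enumerate(pts) if k not in (i, j)]
    # Assumed symmetry breaking: swapping A and B maps z -> (1+1j) - z.
    if inner[0].real + inner[1].real > 1.0:
        inner = [complex(1, 1) - z for z in inner]
    c, d = sorted(inner, key=lambda z: z.real)
    return (c.real, c.imag, d.real, d.imag)

def similarity(p, scale=2.5, angle=0.5, shift=complex(10, -3)):
    """Scale, rotate, and shift a point (what a different image would do)."""
    z = complex(*p) * scale * cmath.exp(1j * angle) + shift
    return (z.real, z.imag)

base = [(0, 0), (4, 4), (1, 2), (2, 1)]
moved = [similarity(p) for p in reversed(base)]
print(quad_code(base))
print(quad_code(moved))  # agrees with the line above to rounding error
```

Because the code does not change under translation, rotation, or scaling, codes from an unknown image can be looked up directly in an index (a k-d tree in the talk) built from catalog quads.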


Automatic Image Registration

  • Relative not absolute
  • image to image
  • Paper in "Image and Vision Computing", 2003
    • Cross-correlation
      • rotationally invariant algorithm
    • catalog matching
      • tunable parameters
    • multi-resolution
      • wavelets
  • An integrated approach (MIRA)
    • chain code source ?borders?

High Energy Physics Computing

Data Acquisition in High Energy Physics

  • Throw away most of the data
  • DAQ systems must filter events
  • Architecture Evolution
    • From single system to distributed/parallel
    • TB/sec data rates
      • 500 readout units
  • Configured using XML
  • Error handling/monitoring is critical
  • traffic shaping
    • token-credit system
    • chain network switches in a 3-D pattern
  • Problem is data-bound, not compute-bound
  • Question: Fault handling of big complex system?
    • Keep spares in stock (doesn't this mean no upgrades?)
  • Question: Power?
    • Cluster management tools

Data Analysis in High Energy Physics Computing

  • Made comments about power/cooling
  • Kilo-spec-2000/million-dollar
  • Uncovering the truth requires measuring probabilities
  • lots of collisions, but very low ?interesting to sky?
  • Large 4-pi acceptance sensors
    • 500-2000 physicists
    • simulation necessary -- on a massive scale
  • HEP data models
    • moved to C++/OO
    • pointers/references describe relations
    • queries require code?
  • History
    • disks used to read/write tapes
    • tapes no longer used but still painful
    • Tapes still a valuable repository
    • Every Physicist is a C++ programmer?
  • Mentioned 'ROOT' Objects in RAM persistent

GLAST LAT Computing: A Fusion of HEP and Astronomy

  • Large FOV ~20% sky
  • Range is 10-100 GeV
  • satellite 10yr life span
  • Filter 97% data using classification trees
  • all C++
  • using ROOT for object persistence
  • Big use of simulation
  • Pipeline:
    • Java
    • 300-400 TB on disk
    • HEASARC tools

Image Processing and Mosaicking

NOAO Pipeline Applications

  • Pipeline
    • Data source
    • Framework
    • Archive
    • Access
  • Node Manager
  • Pipeline Manager
  • Data Manager
  • Pipeline Scheduling Agent
  • Quick reduction pipeline vs. Science pipeline
  • Split up raw data to process in parallel
    • parallel filtering/resampling/cleaning
    • calibrate and mosaic
  • Cluster of 8 nodes, with dual Athlon CPUs running RH9 and RAID-0
    • Showed how performance scaled with # of CPU's

Custom Mosaics on Demand

  • Montage Service
  • code at
  • tools for science-grade mosaicking
  • runs as web service, sign-up service
  • request mosaics from existing datasets
  • Architecture:
    • 15 Xeon 3.2 GHz dual-processor, dual-core nodes (60 cores total)
    • 6 TB of disk
    • No grid technologies involved
    • I/O limited problem
    • Request oriented management environment (ROME)
    • wget/http based

Distributed Processing of Future Radio Observatories (LOFAR)

  • Data volumes in LOFAR: 5 hrs == 12 TB
  • Aussie SKA prototype: 12 hrs == 72 TB
                       Old                        New
      Size             few GB                     TB's ++
      Processing time  weeks                      days
      Archived         all data                   reduced data
      Where processed  desktop                    data center
      # of passes      many                       once
      Tools            AIPS, CASA, MIRIAD, etc.   ???
  • Data distribution
    • bring processing to data, not data to processing
    • data partitioning for parallel processing should be efficient
    • data loss? Partitioning should minimize impact (redundancy?)
  • Architecture
    • Connection types: sockets, MPI, memory, database
    • Worker pattern, with master and slaves
    • one pass through data since I/O dominates
    • Examining GPGPU, FPGA, Cell processors (Rapidmind)
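
As a quick check on why I/O dominates, the data volumes quoted for LOFAR and the SKA prototype can be turned into sustained rates. A small worked calculation, assuming decimal terabytes (the notes do not say TB vs. TiB) and a constant rate over the observation; the function name is hypothetical.

```python
TB = 1e12  # bytes per terabyte (decimal); an assumption

def sustained_rate_mb_s(volume_tb, hours):
    """Average rate in MB/s needed to write volume_tb within the given hours."""
    return volume_tb * TB / (hours * 3600) / 1e6

lofar = sustained_rate_mb_s(12, 5)    # 12 TB in 5 hours
askap = sustained_rate_mb_s(72, 12)   # 72 TB in 12 hours
print(f"LOFAR:         {lofar:.0f} MB/s")
print(f"SKA prototype: {askap:.0f} MB/s")
```

Both come out in the range of several hundred MB/s to more than a GB/s, which is beyond what a single desktop disk can sustain.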

LOFAR self-calibration

  • UV data iteratively calibrated
  • Design
    • data rates near GB/sec
      • too much for one-computer
    • ordering matters for processing locality
    • efficient exchange when necessary
  • Partitioning:
    • Time -- no
    • Baseline -- no
    • Frequency -- yes
    • process locally, communicate globally
  • Shared repository pattern (general)
    • Black board pattern
    • Controller pattern
    • Repository manager pattern
  • Architecture
    • deployed on a 12-node cluster using an RDBMS as blackboard
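
The "partition by frequency, process locally, communicate globally" idea can be illustrated with a toy example. This is a sketch only: the function names are hypothetical, the per-channel "data" are plain numbers, and the local step is a simple sum standing in for whatever each node really computes.

```python
def partition_by_frequency(n_channels, n_nodes):
    """Assign contiguous channel ranges to nodes, as evenly as possible."""
    base, extra = divmod(n_channels, n_nodes)
    ranges, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges

def local_then_global(per_channel_values, n_nodes):
    """Each node reduces its own channels; only n_nodes numbers are exchanged."""
    parts = partition_by_frequency(len(per_channel_values), n_nodes)
    partial_sums = [sum(per_channel_values[ch] for ch in chans)  # local work
                    for chans in parts]
    return sum(partial_sums) / len(per_channel_values)           # global step

print([len(r) for r in partition_by_frequency(10, 4)])
print(local_then_global(list(range(10)), 4))
```

The point of the pattern is that the bulky per-channel data never leave their node; only the small partial results do.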


FITS BOF

  • List of about 12 proposed changes
  • How to obsolete old keywords?
  • FITS users guide now obsolete, updated text in FITS standard.
    • Is that the best place for a users guide???
  • Note: Last ADASS held poll about revising FITS
    • about 22 said revise, but keep most of it
    • about 18 said revise and look to new technologies
    • on the order of 5 said don't touch.
    • What's been done since -- nothing.
    • Perhaps astronomy should consider developments in other fields of science
  • Religious wars:
    • FITS format is easily parsed (Why are these people writing parsers???)
    • XML can't be parsed (has anyone heard of expat, SAX, or any number of other parsers in every conceivable language?)
    • I heard more about format than semantics, when semantics (e.g. FITS keyword definitions) seems like the best advantage of FITS
  • Separate data model from data format
    • FITS is a relational data model, which could be stored in a number of formats
    • The problem is the FUD over tools
    • Any changes will need to be backward-compatible.
  • Call to non-BOF attendees for comments (due mid-Oct to FITS committees)

Algorithms and Image Processing

(Note: joined late due to answering Todd's request for information.)
  • Algorithm smearing gaussian as a function of gradient
  • Imaging pipeline "SPOT" (LSST?)
    • cheetah demo
      • interactive overlays, processing
      • setup pipeline (static)
      • visualization engine
      • pipeline can interact via/with command line

LSST Image Processing challenges

  • 15 TB every night
  • probing cosmology and dark matter
  • must reduce systematic errors
  • one "probe" (science goal?) is weak gravitational lensing
  • cosmic shear and distance vs. red-shift
  • searching for coherent ellipticity on sky (grav lensing)
    • but some galaxies really are elliptical
    • amounts to statistical noise
    • systematics still dominate
  • HW fix: point-spread function
  • SW fix: stack-fit approach
    • but we still throw away data
    • instead register images and examine each
    • develop model and iterate on each image
    • requires 10^22 FLOPS
    • Paper: "Introduction to Multifit Fitting Galaxy Shapes using multiple exposures"
    • glFIT
  • Pipeline will also output a measure of quality
  • Processing challenge, may pursue FPGA, Cell, GPGPU technologies

LSST Database Organization

  • Real-time astro processing
  • 60 sec goal on transient events
  • potentially large reads
  • ave. 40k detections
  • updates often
  • cluster data spatially
  • read only relevant columns
  • keep specialized copy
  • divide sky into 300,000 quadrangles
  • pre-fetch for observing
  • flush updates to disk (often)
  • in memory sort further into zones sorted by RA
  • further divide so computation can be made parallel
  • results promising
  • uses MySQL and C++
  • asked about monet database work
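
The "cluster data spatially" and "zones sorted by RA" items above can be sketched as a simple zone index. This is a toy illustration and not the LSST design: the zone height is an assumed value, the function names are hypothetical, and RA wrap-around at 0/360 degrees is ignored.

```python
import bisect
import math

ZONE_HEIGHT_DEG = 1.0  # assumed zone height; the talk's value was not noted

def zone_of(dec):
    """Index of the declination zone containing dec (degrees)."""
    return int(math.floor((dec + 90.0) / ZONE_HEIGHT_DEG))

def build_zones(sources):
    """Group (ra, dec) sources by zone, sorted by RA within each zone."""
    zones = {}
    for ra, dec in sources:
        zones.setdefault(zone_of(dec), []).append((ra, dec))
    for rows in zones.values():
        rows.sort()
    return zones

def box_search(zones, ra_min, ra_max, dec_min, dec_max):
    """Return sources inside an RA/Dec box, touching only the relevant zones."""
    hits = []
    for z in range(zone_of(dec_min), zone_of(dec_max) + 1):
        rows = zones.get(z, [])
        lo = bisect.bisect_left(rows, (ra_min, -math.inf))
        hi = bisect.bisect_right(rows, (ra_max, math.inf))
        hits.extend(r for r in rows[lo:hi] if dec_min <= r[1] <= dec_max)
    return hits

sources = [(10.0, 0.5), (10.2, 0.6), (200.0, 0.5), (10.1, 45.0)]
zones = build_zones(sources)
print(box_search(zones, 9.5, 10.5, 0.0, 1.0))
```

Each zone is independent of the others, which is what makes the further split for parallel computation straightforward.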

ASKAP Computing

  • Talked about challenges of SKA prototype

Autonomous Observing: The Astronomer's Last Stand

  • (Not really an image processing talk)
  • Intelligent 'agents' - do stuff for you
    • both proactive and reactive
    • learns
    • data processed in pipeline
    • telescope and databases both appear as same to agent, with different resources
  • Two standards
    • HTN and eStar
    • VOEvent
    • Lots of different telescopes in network
  • Negotiation for time
    • google sky URL can display RT-events
    • prioritized monitoring of events
    • events can trigger observing

Vector Gradient Intersection Transform

  • circular detection method
  • Hough transform
    • finds center and radius separately
  • VGIT
    • compute gradient of image
    • pairs of gradients toward center, find intersection
    • now consider radius
      • create a gradient radius histogram
    • group points of similar/perpendicular locations
    • complexity same as Hough transform
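
The VGIT steps above can be sketched roughly as follows. This is my reading of the bullets, not the speaker's implementation: edge points and their gradient directions are assumed to be given already, every pair of gradient lines is intersected (simple, but quadratic in the number of points), the intersections vote for a centre, and a histogram of distances from that centre gives the radius. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def intersect(p1, g1, p2, g2):
    """Intersection of lines p1 + t*g1 and p2 + s*g2, or None if near-parallel."""
    det = g1[0] * g2[1] - g1[1] * g2[0]
    if abs(det) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * g2[1] - dy * g2[0]) / det
    return (p1[0] + t * g1[0], p1[1] + t * g1[1])

def find_circle(edge_points, gradients, bin_size=1.0):
    """Vote for a centre from gradient-line intersections, then for a radius."""
    centre_votes = Counter()
    for (p1, g1), (p2, g2) in combinations(zip(edge_points, gradients), 2):
        hit = intersect(p1, g1, p2, g2)
        if hit is not None:
            centre_votes[(round(hit[0] / bin_size), round(hit[1] / bin_size))] += 1
    (bx, by), _ = centre_votes.most_common(1)[0]
    cx, cy = bx * bin_size, by * bin_size
    # Radius histogram: distance of each edge point from the voted centre.
    radius_hist = Counter(round(math.hypot(x - cx, y - cy) / bin_size)
                          for x, y in edge_points)
    r_bin, _ = radius_hist.most_common(1)[0]
    return (cx, cy), r_bin * bin_size

# Synthetic test circle: centre (20, 30), radius 7, gradients along the radius.
angles = [k * math.pi / 6 for k in range(12)]
points = [(20 + 7 * math.cos(a), 30 + 7 * math.sin(a)) for a in angles]
gradients = [(math.cos(a), math.sin(a)) for a in angles]
centre, radius = find_circle(points, gradients)
print(centre, radius)
```

On real images the gradients would come from an edge detector and be noisy, so the bin size controls how tolerant the voting is.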

LSB Galaxy Detection using Markovian Approach

  • Low surface brightness
  • >22.5 mag/arc-sec^2
  • segmentation using Markovian approach
  • discriminate between background and target
  • Works as an 'inverse' method
  • ?Find the Bayesian?
  • use a Markovian tree
  • quad tree provides in-scale balancing
  • for each frame
    • ave profile along each ellipse radius
    • sort according to profile
  • results promising
  • pipeline implementation

Optimal Extraction of Echelle Spectra

Image Processing and Visualization

Astronomy + Medicine = Understanding

  • Initiative: collaborate between other sciences and astronomy
  • Innovative Computing:
  • Treat image as a function of wavelength, intensity, state (polarization), 2-D position and time
  • Tools:
    • 3-D Slicer
    • OsiriX (mac-only)
    • Volview
    • GAIA GAJA?
  • 3-D publishing
    • doesn't scale to super-sized databases, but big improvement
  • Some say Web-2.0 == Plastic, but speaker notes that we just need to make things easier to share tools

3-D Visualization and Detection of Outflows

  • How people find outflows today: look at each channelized image, and merge information in their brain.
  • 3-D plot allows brain to conceive more data in context, even if some axes have different meaning


S2PLOT

  • 3-D scientific graphing library
  • the new 'PGPLOT'
  • pathways to 3-D PDF's
  • includes both qualitative and quantitative analysis capabilities
  • API's in C, C++, Fortran (Python by extension?)
  • code example: really cool
  • no menus
  • based only on Open-GL constructs, no GTK etc dependence
  • Can open FITS files as 2-D textures (I assume this means FITS files with 2-D image data)
  • Has labeling, selection
  • Full release November 2007
  • closed source, but free
  • derivative applications: s2slides

Accessing eSDO Solar Image Processing through AstroGrid

  • Data from solar orbiting satellite
  • 1.2 TBytes/day, Jan 2009 launch date
  • algorithms for processing are in C and IDL, downloadable as release 10/1/2007
  • mentioned a 'workflow' style editor?
  • Interfaced to VO via Plastic
  • Using Open MPI to run parallel jobs (MPICH2)
  • simple time access protocol (STAP) (A query protocol for finding scans?)

Processing Astronomical Data with Hollywood Tools

  • Mentioned research vs. public outreach
  • Commercial software tools:
    • Photoshop for image cleaning
    • Cinematograph "Shake" (mac)
    • Data pipeline: perl, IRAF, C
    • 3D Models in perl (generation)
    • MAYA 4.5 (Used to be the Photoshop of modeling, but now sells a lite version)
    • MEL code?
  • Visualization Wall for previewing IMAX movies (a 16x16 LCD 'wall')
  • Free Software:
    • Renderman (standard not software)
    • PRMAN
    • BMRT, Aqsis, Pixie
  • Shading:
    • Texture mapping and the 'SPLAT' shader
  • Choreography (important for IMAX)
    • Camera motions must be smooth and well planned
    • background selection important for visual context
  • Custom Software:
    • Rendering efficiency (started out at 29h/frame; now down to 9s/frame at 650,000 particles)
    • other at visualization 'FJS' website

V.O Explorer: Visualizing and Data Discovery

Rich Web Applications

  • Definition: "Rich web apps are just like desktop applications"
  • web 1.0 - based on page-concept
  • web 2.0 - based on desktop concept
    • applications are in constant 'beta', updated each time they run
  • Adobe 'flex'
    • Eclipse-based IDE builder
    • design view
    • code tools
    • flex framework (free); tools are not
    • compiles into .swf (flash) files
    • loads of features
  • NVO spectrum application
    • runs as NVO web service
    • charts with statistical queries 'all in browser'
  • Performance

Data Mining BOF

  • Led by Sabine McConnell, Trent University
  • Methodology:
    • Collect
    • Prepare
    • Build Model
    • Evaluate
  • Definition of 'Large' dataset: millions of records
  • Heterogeneous data
    • instruments have different resolutions
    • surveys have different source lists
  • Data Mining and the VO
    • ??
  • XML-based PMML (Predictive Model Markup Language)
  • Slide depicting search technology integration into astronomical data mining:
                               Used in Astronomy   Available   Needed
      Techniques               30%                 30%         30%
      Parallel Techniques      2%                  80%         18%
      Distributed Techniques   1%                  40%         59%

Preprocessing Data

  • Need to know algorithm before preprocessing data
  • Distributed vs. Large datasets
  • It is key to reduce data dimensionality (curse of dimensionality)
  • sampling, bootstrapping etc.
  • attribute selection
  • Meta-data
    • store statistical data in table as summary to describe common measures
  • Question: VO techniques for addressing distributed data problem
    • estimate query cost
    • Two methods to address distributed data:
      • One is to represent data in a summary form, such as Fourier coefficients; where each site has one or more coefficients
      • Second is to build models at each site, then transmit the models (which tend to be smaller than data) (e.g. a decision tree)
  • Fuzzy logic models
  • Discussion of various standard datasets (i.e. a canonical dataset for testing)
  • VOStat query tool?
  • Summary meta-data can simplify data mining (e.g. statistical min/max/mean of data stored in tables).
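
The summary meta-data idea above (each site keeps min/max/mean, and only the summaries travel) can be sketched like this. The Summary class and function names are hypothetical, and the data values are made up.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    count: int
    minimum: float
    maximum: float
    mean: float

def summarize(values):
    """Build the per-site summary that would be stored alongside the table."""
    return Summary(len(values), min(values), max(values), sum(values) / len(values))

def merge(summaries):
    """Combine per-site summaries without touching the raw data."""
    total = sum(s.count for s in summaries)
    return Summary(
        count=total,
        minimum=min(s.minimum for s in summaries),
        maximum=max(s.maximum for s in summaries),
        mean=sum(s.mean * s.count for s in summaries) / total,  # count-weighted
    )

site_a = summarize([1.0, 2.0, 3.0])
site_b = summarize([10.0])
combined = merge([site_a, site_b])
print(combined)
```

The mean has to be weighted by count when merging; a plain average of the two site means would give the wrong answer for sites of different size.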

Data Mining

Large Pulsar Survey at Arecibo using ALFA

  • Jim Cordes
  • 7 beam Rx at L-band
  • N_detected = f_birthrate x R x T_radio
  • They hope to find 500-1000 new pulsars
  • Will be about 1 petabyte of data
  • Looking for intermittent pulsars and transients
  • traditional analysis: Fourier domain techniques
  • single pulse searches, mentioned McLaughlin 2006 Nature article?
  • save raw data for reprocessing
  • at 4GB/scan survey will be about 1PB at current rates
  • Pipeline: Unisys Itanium cluster
    • M$ SQL-DB
    • sending disks around via FedEx to Cornell, WVU, UVA, and the University of BC
    • database:
  • SKA beam forming
    • 10% of processing goes to beam synthesis
  • Arecibo's future? -- Not closing.

Probabilistic Cross Identification of Astronomical Sources

  • How to match sources between catalogs?
  • 2 way matches currently used, but N-way matches are harder
  • Need measures of quality
  • What is the right question?
    • How good
    • probability
    • observed evidence
    • ?Bayes hypothesis testing?
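
To make the Bayes hypothesis testing idea concrete, here is a sketch of a two-catalog match probability. The notes do not record the speaker's formulas, so this uses an assumed closed form: the Bayes factor for "same source" vs. "unrelated sources" when both catalogs have circular Gaussian positional errors and the separation is small. The prior value in the example is invented.

```python
import math

ARCSEC = math.pi / (180.0 * 3600.0)  # radians per arcsecond

def bayes_factor(sep_arcsec, sigma1_arcsec, sigma2_arcsec):
    """Bayes factor for 'same source' vs 'unrelated sources' (two catalogs).

    Assumed form: B = 2 / (s1^2 + s2^2) * exp(-psi^2 / (2 * (s1^2 + s2^2))),
    with the separation psi and the errors s1, s2 in radians.
    """
    var = (sigma1_arcsec * ARCSEC) ** 2 + (sigma2_arcsec * ARCSEC) ** 2
    psi = sep_arcsec * ARCSEC
    return 2.0 / var * math.exp(-psi ** 2 / (2.0 * var))

def posterior(bayes, prior):
    """Posterior probability of a match from the Bayes factor and a prior."""
    return 1.0 / (1.0 + (1.0 - prior) / (bayes * prior))

prior = 1e-9  # hypothetical prior that a random pair is the same source
for sep in (0.5, 2.0, 10.0):
    b = bayes_factor(sep, 1.0, 1.0)
    print(f"sep={sep:4.1f} arcsec  B={b:.3g}  P(match)={posterior(b, prior):.3g}")
```

This gives a measure of quality for each candidate match rather than a yes/no cut on separation, which is the question the talk was asking.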

A Method for Exploitation of Domain Information in Parameter Estimation

  • discussed a method of 'learning mappings'
  • can fit a forward model, but not necessarily a reverse model
  • estimate and iterate using the estimates
  • use domain information to weight result
  • local interpolation/global solution

Needles in a Haystack - Faceted Browsing at the VO

  • Reviewed user interfaces (e.g. nethack)
  • Virgo Stellarium, google sky
  • Two different levels of inquisition:
    • big: "Show me a plot of all known galaxies with magnitude > 23 and plot as a function of red-shift"
    • small: "Tell me everything known about M51..."
  • (exhibit toolkit?)
  • Vizier Document search browser for IVOA resources
  • (Resource description framework)
  • Norman Gray's RDF at the VO
  • Works
  • VOExplorer
    • Still UI issues. Want OR and NOT, but only have AND
    • Number of Facets
    • choice of facets is naive
    • tabular data
    • standard vocabulary and ontology


Paperscope - Graphically Exploring the ADS

  • ADS format not easy to use -- text only
  • create GUI to organize info
  • creates 2-D graph of notes and references
    • shows referring papers and references
    • shows several levels of references
    • can 'traceback' references
  • Pursuing 'plasticization' to use with VOExplorer
  • Implemented in Java, Swing, prefuse
  • Prefuse library for graphs uses GraphML
  • Need http access to the ADS
  • nice general tool

VIM (not the text editor) A tool to explore your sources

  • source cone searches
  • joins between tables with selectable columns
  • search by position
  • CGI-based script, powered by STILTS (2 million+ rows?)
  • Image displays, larger image on hover, all registered to common frame
  • meta-data?
  • create new columns from other columns by expression
  • results can be cached on server
  • google sky - KML?
  • python based?
  • VO table -> plot with tomcat
  • VIM scripting library
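
For reference, the cone search mentioned above amounts to keeping every source within a given angular radius of a position. A minimal sketch, assuming rows are dictionaries with "ra" and "dec" in degrees; the row layout and function names are hypothetical and unrelated to how VIM or STILTS implement it.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable at small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(rows, ra, dec, radius_deg):
    """Return the rows whose position lies within radius_deg of (ra, dec)."""
    return [r for r in rows
            if angular_separation_deg(ra, dec, r["ra"], r["dec"]) <= radius_deg]

rows = [
    {"name": "a", "ra": 10.00, "dec": 20.0},
    {"name": "b", "ra": 10.05, "dec": 20.0},
    {"name": "c", "ra": 190.0, "dec": -20.0},
]
print([r["name"] for r in cone_search(rows, 10.0, 20.0, 0.1)])
```

A real service would use a spatial index instead of scanning every row, but the selection criterion is the same.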

ADASS 2008 Announcement

  • Quebec City, Quebec, Canada, Nov 2-5
  • Really, it might not be cold

The IAU 2000/2006 Changes to Reference System

  • No more LST (Now have Earth rotation angle, proportional to UT1)
  • New reference RA zero point frozen close to, but not at, J2000 (23 mas off)
  • IAU resolutions
  • ICRS is the new system
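
The Earth Rotation Angle mentioned above is a linear function of UT1. A small sketch using the IAU 2000 defining expression (Resolution B1.8), ERA = 2*pi*(0.7790572732640 + 1.00273781191135448 * Tu) with Tu = JD(UT1) - 2451545.0. The function name is mine, and this simple form loses some precision for dates far from J2000; production libraries split the date into parts to avoid that.

```python
def earth_rotation_angle_deg(jd_ut1):
    """Earth Rotation Angle in degrees for a UT1 Julian Date (IAU 2000 definition)."""
    tu = jd_ut1 - 2451545.0  # days of UT1 since J2000.0
    turns = 0.7790572732640 + 1.00273781191135448 * tu
    return (turns % 1.0) * 360.0

era_j2000 = earth_rotation_angle_deg(2451545.0)
era_next_day = earth_rotation_angle_deg(2451546.0)
print(f"ERA at J2000.0:      {era_j2000:.6f} deg")
print(f"ERA one UT1 day on:  {era_next_day:.6f} deg")
```

The angle advances by a little under one degree more than a full turn per UT1 day, which is the sidereal rate that LST used to carry.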

RISA A Remote Interface for Analysis (ESA)

  • how to allow users to do the best XMM xxxxxx
  • SAS: Science Analysis System
    • Free, no license
    • basic functionality
    • portable
    • GUI-based
    • but can be driven from a script/pipeline
    • for imaging
  • Created as a web service to ease maintenance of installed base
  • Remote computing
  • Workflow
    • Input src,name,author,etc.
    • VO table output
  • Server side
    • computing on a 10-node dual-core machine
    • server-side 'grid'

Software Modelling of Instrument Field Spectrometer

  • data modelling project
  • Why do we need to simulate instruments?
    • increased scale of equipment
    • complexity of the telescope
    • test of data reduction/analysis
    • increased understanding of the instrument
    • modeled data is an effective communicator
  • Common Infrastructure
  • Instrument specified model
  • Spec-sim GUI
    • creates blank sky cube
    • Input from user "targets"
    • ends up with sky model in (RA,DEC,Vel)
    • transmission functions (like IFManager filters)
    • User-defined elements
    • Slice cube into slit
    • add noise
    • text file from user, with C-style groupings
    • creates image of spectro-image
    • used to model/evaluate active optics
    • experiments in calibration
    • developed in IDL

ECSS in the Extreme (Programming)

  • GAIA satellite@L2 launch date Dec 2011
  • Global iterative solution
    • multiple passes over sky ~100
  • Waterfall approach:
    • Even its inventor said that it should be done at least twice
  • Plan once a month:
    • points 2/day
    • buy stories for the month
    • costing done as a group
    • XP tracker plugin tracks actual vs. estimated time
    • gets better at how long things take
  • Long term plan
    • has more detail in 6 month plan
    • concentrate on 1 month tasks
    • immediately notice if features slow us down
    • testing/continuous integration
      • test once a day
      • 80% coverage
    • Lots of reviews
      • document tree, templates for documents
    • customize process

Data Mining II

Morphological Description of Large Scale Cosmic Structure

The IPHAS Early Release

  • Survey of the galactic plane in H-alpha
  • 212 nights, with 25% lost due to weather
  • 200 million objects in 5 Terabytes of images
  • 400 GB of catalog data
  • User access:
    • database with meta-data
      • pointing quality
      • weather quality
      • pointings
  • VO table browser for core search
  • Methods:
    • astrogrid: discover-flow (mac user interface concept?)
    • VO Explorer
    • import astrogrid module into ipython for cone searches

Robust Machine Learning Applied to Terascale Astronomical Datasets

  • multi-band data, multiple surveys
  • Computing:
    • 2580 core Xeons 3.2GHz, 3GB RAM/node
    • shell scripts to run and distribute algorithms (not OpenMP?)
  • kNN single nearest neighbor
  • probability density function
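
For reference, single nearest neighbor classification is about as simple as machine learning gets. A minimal sketch with invented two-band "colour" features and labels; the function name and the training data are hypothetical, and a terascale run would use a spatial index rather than a linear scan.

```python
import math

def nearest_neighbour(train, query):
    """Return the label of the training point closest to query (Euclidean)."""
    features, label = min(train, key=lambda row: math.dist(row[0], query))
    return label

# (feature vector, label) pairs; values are made up for illustration.
train = [
    ((0.2, 0.1), "star"),
    ((0.4, 0.3), "star"),
    ((1.5, 1.2), "galaxy"),
]
print(nearest_neighbour(train, (1.4, 1.0)))
print(nearest_neighbour(train, (0.3, 0.2)))
```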

20 Spatial Queries for an Astronomer's Benchmark

  • [all] Data is or will be online
  • no single management solutions
  • wouldn't it be nice to have a standard test-suite (hence the discussion at the BOF)
  • Benchmarking:
    • Relevance
    • Portable
    • Scalable
    • Simple, test must be understandable
    • Repeatable
  • Benchmarking is hard
  • Scalability not assured
  • different environments

Finding Outliers in Multivariate Data with Measurement Error

  • Parameter space: RA/DEC/Flux/Shape/time...
  • Different data types and encodings (int,float,enum)
  • missing data
  • errors in data entry, instrument error
  • Is it an error? or is it an interesting object?
  • latent variable models (with hidden layers)
  • ?try using a mixture of Gaussians?
  • ?use Student's t distributions with "longer tails"?
  • provides a way of defining "outlierness"
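
To see why the longer tails matter, compare how surprising the same deviation is under a Gaussian and under a Student's t model. This is only an illustration of the idea, assuming standardized one-dimensional data and taking "outlierness" to be the negative log-density; the talk's actual latent variable model was not captured in these notes.

```python
import math

def normal_logpdf(x):
    """Log-density of the standard normal distribution."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)

def student_t_logpdf(x, dof):
    """Log-density of the standard Student's t distribution with dof degrees of freedom."""
    return (math.lgamma((dof + 1) / 2.0) - math.lgamma(dof / 2.0)
            - 0.5 * math.log(dof * math.pi)
            - (dof + 1) / 2.0 * math.log1p(x * x / dof))

def outlierness(x, logpdf):
    """Negative log-density: larger means more surprising under the model."""
    return -logpdf(x)

for x in (1.0, 3.0, 5.0):
    gauss = outlierness(x, normal_logpdf)
    heavy = outlierness(x, lambda v: student_t_logpdf(v, 3))
    print(f"x={x}: gaussian={gauss:.2f}  student-t(3)={heavy:.2f}")
```

Under the Gaussian a 5-sigma point looks essentially impossible, while the heavier-tailed model treats it as unusual but plausible, so fewer measurement glitches get flagged as interesting objects.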

-- JoeBrandt - 23 Sep 2007
Topic revision: r8 - 2007-10-07, JoeBrandt