2012 Progress on Statistical Issues in Searches

Monday, June 4, 2012

Model selection, estimation, and bootstrap smoothing (Brad Efron, Stanford, Statistician)

  • Very detailed statistics talk - interested in including model selection in accuracy estimation (Model selection problems).
  • This "E{y|xj}" means "Expected value of y given each x."
  • Statistical models are used to analyze data. Each model is associated with different criteria and statisticians use those criteria to pick a model.
  • Discussed the following models to analyze cholesterol data (compliance, decrease in cholesterol):
    • Cp model selection (the Cp criterion picked a cubic model)
    • Nonparametric Bootstrap Analysis
      • Methods for bootstrap confidence intervals: Standard, Percentile, Smoothed Standard, BCa/ABC
      • Smoothing increases the computation required to determine the variance - an Accuracy Theorem is used to avoid this (a minimal bootstrap-smoothing sketch follows this list).
  • Discussed analysis of supernova data (absolute magnitude of Type Ia supernova, vector of 10 spectral energies)
    • Parametric Bootstrap Analysis
    • Chose Full Model (Ordinary Least Squares Prediction)
    • Then tried Lasso Model since Full Model "might be too noisy".
    • Used a parametric Accuracy Theorem for bootstrap smoothing in this case.
    • Then tried Bootstrap re-weighting and just plotted results to see the effect.
  • Regressions, Histograms, Box plots, estimators, oh my.
  • Statisticians use experience and heuristics to figure out which model provides the "best" data analysis. This process sometimes uses a priori knowledge of the data.
  • Question - Model selection is impacted by the amount of available data (available data typically increases over time). Answer: Uh, carefully apply model selection.
  • Additional detailed statistics questions.
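
A minimal sketch (my own illustration on synthetic data, not Efron's code) of the bootstrap-smoothing idea referenced above: the model-selection step is repeated inside each nonparametric bootstrap replicate and the resulting estimates are averaged ("bagged"), so the quoted accuracy reflects the selection step as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(-2, 2, n)
y = 1.0 + 0.5 * x - 0.3 * x**2 + rng.normal(0, 0.5, n)  # synthetic data
x0 = 1.5                                                 # point at which we estimate E{y|x0}

def select_and_predict(x, y, x0, max_deg=5):
    """Pick a polynomial degree by a Cp-style criterion, return the fit's prediction at x0."""
    sigma2 = np.var(y - np.polyval(np.polyfit(x, y, max_deg), x))  # rough noise estimate
    best = None
    for d in range(1, max_deg + 1):
        coef = np.polyfit(x, y, d)
        rss = np.sum((y - np.polyval(coef, x)) ** 2)
        cp = rss + 2 * (d + 1) * sigma2                            # Cp-style penalty
        if best is None or cp < best[0]:
            best = (cp, np.polyval(coef, x0))
    return best[1]

# Un-smoothed estimate: model selection applied once to the observed data
t_hat = select_and_predict(x, y, x0)

# Bootstrap smoothing: average (selection + estimation) over nonparametric resamples
B = 2000
t_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    t_star[b] = select_and_predict(x[idx], y[idx], x0)

t_smooth = t_star.mean()       # smoothed ("bagged") estimate
se_boot = t_star.std(ddof=1)   # naive bootstrap standard error
print(t_hat, t_smooth, se_boot)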

Dogs, non-dogs and statistics: (Bayesian) searches in cosmology (Roberto Trotta, Imperial College London, Cosmology)

  • Imperial Centre for Inference and Cosmology, Bayesian work (as opposed to frequentist)
  • Background:
    • Cosmological concordance model: inflation, dark matter, dark energy
    • Model assumptions: Isotropy and homogeneity, Approx. Gaussianity of CMB fluctuations, Adiabaticity
    • Data set: WMAP7
      • The power spectrum contains the full statistical information IF fluctuations are Gaussian.
    • Baryonic Acoustic Oscillation (BAO) - derived from WMAP data.
      • Gets relation between redshift and acoustic scale.
      • This yields constraints on total matter and dark energy.
  • Discussed ways to deviate from the vanilla model:
    • Start with Non-Gaussianity from inflation.
    • Bispectrum, wavelets, skewness, kurtosis, genus statistics, Minkowski functionals, needlets - higher-order statistics used to search for non-Gaussianity from inflation.
    • Search for non-Gaussianity, non-trivial topology (light from distant objects can reach us along multiple paths)
    • Searches - use the WMAP data, apply these different models, plot the results and compare with vanilla model.
  • Generic Departures from the LCDM
    • Search for deviations from the Concordance Model
      • Number of neutrino species and statistical isotropy - unclear if these two are significant deviations from the model.
  • Principled Bayesian model selection:
    • Level 1: Select model M and prior P(θ|M) -> Parameter inference
    • Level 2: Compare several possible models
    • Level 3: Model averaging (none of the models is clearly the best)
    • Bayesian statistics lets you get P(M|d) from P(d|M).
    • Bayesian evidence balances quality of fit against extra model complexity (a toy evidence calculation is sketched after this list).
    • Jaynes - "there is no point in rejecting a model unless one has a better alternative"
    • Showed source detection using Bayesian reconstruction - 7/8 objects correctly identified. Mistake happens b/c two objects are very close.
    • Showed cluster detection from the Sunyaev-Zel'dovich effect in CMB maps using Bayesian model selection.
    • Mentioned MultiNest (Feroz and Hobson, 2007)
  • "Many anomalies/unexpected deviations go away with better data/modeling/insight. Is this evidence the community jumps too soon b/c of statistical flukes?"
  • Great question - What is the scientific conclusion when Bayesian and Frequentist approaches disagree?
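
A toy illustration (my own, not from the talk) of getting P(M|d) from P(d|M): the evidence of a model with one extra free parameter is computed by direct integration over its prior, so the Occam penalty for the extra complexity appears automatically.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(1)
d = rng.normal(0.3, 1.0, size=20)        # toy data: unit-noise measurements of an offset

def loglike(theta):
    return norm.logpdf(d, loc=theta, scale=1.0).sum()

# M0: "vanilla" model, offset fixed to 0 (no free parameter)
log_Z0 = loglike(0.0)

# M1: offset theta free, with a broad prior N(0, 5^2); evidence by quadrature
prior = norm(0.0, 5.0)
Z1, _ = quad(lambda t: np.exp(loglike(t)) * prior.pdf(t), -20.0, 20.0)
log_Z1 = np.log(Z1)

# Bayes factor and posterior model probability (equal prior odds assumed)
B10 = np.exp(log_Z1 - log_Z0)
print(f"ln B10 = {log_Z1 - log_Z0:.2f}, P(M1|d) = {B10 / (1 + B10):.3f}")
```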

Recent developments and current challenges in statistics for particle physics (Kyle Cranmer)

  • Center for Cosmology and Particle Physics
  • LHC
  • This talk attempts to explain what LHC particle physicists do to see if statisticians can help out...
  • Showed Lagrangian of the Standard (Matter) Model
    • energy of configuration of matter fields
    • energy allows for predicting how configuration will evolve with time
    • Feynman diagrams / Quantum Field Theory
  • Method
    • QFT, Monte Carlo/Perturbation Theory, Feynman Diagrams
    • Simulate particle interactions
    • Run algorithms on sim data to detect particular particles
    • 10^14 collisions, looking for a few interesting interactions (e.g. Higgs)
  • Models
    • Poisson Probability Density Function
    • Parametric (exponential, polynomial), non-parametric
    • 20 dimensions of 'nuisance parameters' over several different channels of data.
    • RooFit, RooStats - toolkits for building the likelihood models and performing the statistical inference (a toy profile-likelihood sketch follows this list).
  • Statistical Analysis
    • Primarily frequentist, some Bayesian but there is a general dislike for assigning prior probabilities to theoretical particles not yet observed.
    • "Bayesian probability allows prior knowledge and logic to be applied to uncertain statements. There’s another interpretation called frequency probability, which only draws conclusions from data and doesn’t allow for logic and prior knowledge." (ML in Action)

Inverse problems in X-ray scattering (Stefano Marchesini, Lawrence Berkeley National Laboratory, Photon Science)

  • Inverse Problems background
  • Started with background on X-rays and basic description of interaction with matter.
  • Complicated to calculate model based on observed data, particularly when absorption and reflection are taken into account.
  • Sparse modeling is a powerful method to extract information from noisy data.
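
A concrete (hypothetical) illustration of the sparse-modeling point above: recovering a sparse unknown from noisy linear measurements with an L1 penalty, here via plain iterative soft-thresholding (ISTA).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 60, 200, 8                      # measurements, unknowns, nonzeros
A = rng.normal(size=(n, p)) / np.sqrt(n)
x_true = np.zeros(p)
x_true[rng.choice(p, k, replace=False)] = rng.normal(0, 3, k)
y = A @ x_true + rng.normal(0, 0.05, n)   # noisy linear measurements

lam = 0.02                                # L1 penalty weight
L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
x = np.zeros(p)
for _ in range(500):
    grad = A.T @ (A @ x - y)
    z = x - grad / L
    x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft threshold

print("support recovered:", np.nonzero(np.abs(x) > 0.1)[0])
print("true support:     ", np.sort(np.nonzero(x_true)[0]))
```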

Statistical issues in astrophysical searches for dark matter (Jan Conrad, Stockholm University, Particle Physics)

  • DM interacts with particles and causes a measurable signal which can be detected.
  • Remove instrument background - Multivariate classification, machine learning
  • Exact methods vs asymptotic approximations
  • Marked Poisson?
  • Same point as Kyle's talk - nuisance parameters
  • Hypothesis testing
    • Applicability of Wilks' and Chernoff's theorems (a toy comparison against pseudo-experiments follows this list)
    • Separate families of hypotheses?
  • Interval estimation
    • Methods used: Unified ordering, marginalization
    • Emergence of the profile likelihood
    • Complex likelihood functions: global fit in Supersymmetry
  • Bayesian v Frequentist statistics
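
A toy check (my own illustration) of the Wilks/Chernoff applicability question raised above: compare the asymptotic distribution of the discovery test statistic against pseudo-experiments for a low-count Poisson search with known background.

```python
import numpy as np
from scipy.stats import poisson, chi2

rng = np.random.default_rng(3)
b = 3.0                 # known expected background
n_toys = 20000

def q0(n):
    """-2 ln lambda for H0: s = 0 vs best-fit s >= 0, background known."""
    s_hat = max(n - b, 0.0)
    return 2.0 * (poisson.logpmf(n, b + s_hat) - poisson.logpmf(n, b))

toys = np.array([q0(n) for n in rng.poisson(b, n_toys)])

# Compare tail probabilities: asymptotic (Chernoff: half chi2 with 1 dof) vs toys
for q in (1.0, 4.0, 9.0):
    p_asym = 0.5 * chi2.sf(q, df=1)
    p_toys = np.mean(toys > q)
    print(f"q0 > {q}: asymptotic p = {p_asym:.4f}, toy MC p = {p_toys:.4f}")
```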

The parametric bootstrap and particle physics (Luc Demortier, Rockefeller University, Particle Physics)

  • Started out as a nice, organized talk promising a beginning, a middle and an end. Lost it halfway through... Bataan death march.
  • Bootstrap provides a tractable numerical approach that approximates exact methods.
  • Bootstrap is a frequentist methodology.
  • Two basic ideas:
    • Plug-in (or substitution) principle
    • Re-sampling - generate toy data samples that are statistically equivalent to the observed data samples. Two possibilities:
      • Parametric re-sampling
      • Non-parametric re-sampling
  • Main uses:
    • Bias reduction
    • variance estimation
    • Confidence interval construction
    • Hypothesis testing
  • When can we be sure the bootstrap estimator converges to the true value as the sample size increases?
    • Not when the parameter lies on a boundary of the parameter space
    • Not for extreme order statistics (e.g. the sample maximum)
  • Gave a confidence interval example from particle physics.
  • Bootstrap Confidence Interval lessons
    • Use a pivot or an approximate pivot whenever possible (a pivot is a function of both data and parameters whose distribution does not depend on any unknowns).
    • Test inversion seems to improve the performance of the bootstrap (a minimal bootstrap-t sketch follows this list).
  • Nice summary slide.
  • Bootstrap book
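
A minimal sketch (my own toy example, not from the talk) combining the plug-in principle, parametric re-sampling, and an approximate pivot: a bootstrap-t confidence interval for the mean of an exponential sample.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=30)         # observed sample
theta_hat = data.mean()                            # plug-in estimate of the mean
se_hat = data.std(ddof=1) / np.sqrt(len(data))

# Parametric re-sampling: draw toy samples from the fitted model
B = 5000
t_star = np.empty(B)
for b in range(B):
    toy = rng.exponential(scale=theta_hat, size=len(data))
    t_star[b] = (toy.mean() - theta_hat) / (toy.std(ddof=1) / np.sqrt(len(toy)))

# Invert the bootstrap distribution of the pivot to get a 68% interval
lo, hi = np.quantile(t_star, [0.84, 0.16])
ci = (theta_hat - lo * se_hat, theta_hat - hi * se_hat)
print(f"theta_hat = {theta_hat:.3f}, 68% bootstrap-t interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```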

Two developments in tests for discovery: use of weighted Monte Carlo events and an improved measure (Glen Cowan, Royal Holloway, University of London, Particle Physics)

  • Super fast, hardcore statistics talk.
  • I could not follow the contours of the swamp, but I did learn a new word.

Panel Session: The Development and Use of Public Databases: It's Complicated! (Ioannidis, Scargle, Cartaro, Skinner)

Tuesday, June 5, 2012

Effects of extraneous noise in Cryptotomography (reconstructions from random, unoriented tomograms) (Duane Loh (grad student), PULSE Institute - SLAC National Laboratory, Photon Science)

  • Statistics in nano-particle (3D) imaging using x-rays.
  • Individual particle conformation?
  • Photon counts from diffraction patterns - 100's of photons per particle imaged.
  • Particles can't be held, continual x-ray bombardment destroys particle, don't know orientation of the particle (hence the crypto part).
  • Signal averaging, phase retrieval on random, noisy tomograms.
  • How to proceed:
    • Look for fixed points. Replicate in 8 orientations and compare to data stream. Waved hands on stats used to determine compatibility between data and model.
    • Ab initio reconstruction - expectation maximization (a toy EM sketch follows this list). Uh...
  • Talked about how he might deal with extraneous noise (i.e. signal averaging). Rushed b/c running out of time.
  • Combine orientations and look for statistically improbable representative values.
  • Novice speaker.
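
A heavily simplified sketch of the expectation-maximization idea mentioned above (my own 1-D toy, far from the actual cryptotomography reconstruction): snapshots are noisy Poisson views of a model at unknown cyclic shifts, and EM alternates between assigning shift probabilities and updating the model.

```python
import numpy as np

rng = np.random.default_rng(8)
L = 32
truth = 2.0 + 3.0 * np.exp(-0.5 * ((np.arange(L) - 10) / 3.0) ** 2)   # 1-D "object"

# Simulate snapshots: the object at an unknown cyclic shift, seen with few photons each
n_snap, fluence = 400, 0.5
shifts = rng.integers(0, L, n_snap)
snaps = np.array([rng.poisson(np.roll(truth, s) * fluence) for s in shifts])

model = 1.0 + rng.random(L)             # random start to break the shift symmetry
for _ in range(50):
    # E-step: posterior probability of each shift for each snapshot (Poisson likelihood)
    log_p = np.empty((n_snap, L))
    for s in range(L):
        lam = np.roll(model, s) * fluence
        log_p[:, s] = snaps @ np.log(lam) - lam.sum()
    w = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # M-step: model = weighted average of snapshots "de-rotated" by each candidate shift
    new = np.zeros(L)
    for s in range(L):
        new += np.roll(snaps.T @ w[:, s], -s)
    model = new / (fluence * n_snap)

# The reconstruction is only defined up to a global shift, so align before comparing
best = max(range(L), key=lambda s: np.corrcoef(np.roll(model, s), truth)[0, 1])
print("correlation with truth:", round(np.corrcoef(np.roll(model, best), truth)[0, 1], 3))
```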

Filtering femtosecond x-ray scattering data using Singular Value Decomposition (Mariano Trigo, SLAC, Photon Science)

  • Interested in imaging ultra-fast x-ray diffraction.
  • Self-Amplified Spontaneous Emission (SASE) - electron bunches passing through an undulator emit synchrotron radiation that interacts back with the bunch and self-amplifies, producing coherent x-ray pulses.
  • Scattering experiment that uses SVD to manipulate the data in matrix format to reduce dimensionality and obtain higher signal to noise.
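
A minimal sketch (my own synthetic example) of the SVD filtering described above: arrange the time-resolved scattering data as a matrix, keep the few dominant singular components, and discard the rest as noise.

```python
import numpy as np

rng = np.random.default_rng(7)
q = np.linspace(0, 1, 200)[:, None]     # rows: momentum-transfer (q) bins
t = np.linspace(0, 1, 80)[None, :]      # columns: pump-probe time delays
signal = np.sin(10 * q) * np.cos(6 * t) + 0.5 * np.exp(-q) * np.sin(4 * t)  # rank-2 dynamics
data = signal + rng.normal(0, 0.3, signal.shape)                            # noisy measurement

U, s, Vt = np.linalg.svd(data, full_matrices=False)
k = 2                                                    # number of components kept
filtered = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(f"residual rms: raw = {np.std(data - signal):.3f}, filtered = {np.std(filtered - signal):.3f}")
```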

Machine Science: Distilling Natural Laws from Experimental Data, from nuclear physics to computational biology (Automating discovery of invariants) (Hod Lipson, Cornell University, Statistics)

  • Machine looks at physical system and produces mathematical, symbolic model.
  • Example - double pendulum model produced Hamiltonian invariant.
  • Motivated by making better robots - learning from the environment with as few experiences as possible.
  • Started with Genetic Algorithms to create robots that move as fast as possible (simulated) that also worked in reality.
  • Can more complex robots learn using evolutionary techniques?
  • Also evolving control programs to make robots behave in certain ways.
  • Limited success with programs controlling robot movement. Robots moved but in lame ways.
  • "Simulator" -> evolve robots -> build it -> collect sensor data -> evolve simulator (co-evolution)
    • Emergent model - robot figures out what it is, i.e. how it is configured
    • Then figures out how to move
  • Ideal of self modeling applied to more abstract applications
    • candidate models -> candidate tests -> perturbations -> candidate models in a way that maximizes disagreement between predictions
    • Symbolic regression (e.g. what function describes a data set). Disadvantages: computationally expensive and overfits the data.
    • Combined this with co-evolution using sub-sets of data to maximize disagreement between models.
  • Started to apply this to inferring equations from data (e.g. used bio system equations to produce data, fed data to system, system returns equations)
  • Then took high speed photos of flapping wing, derived representative parameters (e.g. lift, wing size), fed data to system and system reproduced several models, some known, some new.
  • Epic fail on the single-cell domain. Added a time-delay building block and removed sin/cos building blocks. Produced a more elegant model than the human-derived equations.
  • Looking For Invariants
    • We can fit models to data but what do models mean?
    • Pendulum example... energy is constant; the Hamiltonian allows prediction of system behavior. Can this be applied to the evolved models - what is constant in the system?
    • Started using ratios of partial derivatives to find non-trivial invariants (see the sketch after this list).
    • Published code: Eureqa - data mining tool: you give it data and it spits out models.
    • How does it handle noise?
  • Great talk.
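
A minimal sketch of the derivative-ratio idea from the invariant search (my own toy, nothing like the full Eureqa search): for a true conserved quantity f(x, v), the ratio of its partial derivatives must match the ratio -dv/dt : dx/dt measured along the trajectory, which gives a score for ranking candidate invariants.

```python
import numpy as np

# Simulated data: simple harmonic oscillator x'' = -x, so E = x^2 + v^2 is conserved
t = np.linspace(0, 20, 4000)
x, v = np.cos(t), -np.sin(t)
dxdt, dvdt = np.gradient(x, t), np.gradient(v, t)

# Candidate invariants, given as their partial derivatives (df/dx, df/dv)
candidates = {
    "x^2 + v^2": (lambda x, v: 2 * x, lambda x, v: 2 * v),
    "x * v":     (lambda x, v: v,     lambda x, v: x),
    "x + v":     (lambda x, v: np.ones_like(x), lambda x, v: np.ones_like(v)),
}

mask = np.abs(dxdt) > 0.1                    # avoid turning points where dx/dt ~ 0
data_ratio = -dvdt[mask] / dxdt[mask]        # implied df/dx : df/dv along the trajectory
for name, (fx, fv) in candidates.items():
    pred_ratio = fx(x[mask], v[mask]) / fv(x[mask], v[mask])
    err = np.median(np.abs(np.log(np.abs(pred_ratio / data_ratio))))
    print(f"{name:10s}  median log-ratio mismatch = {err:.3f}")
```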

Molecular structure from protein soup: assembling 3-D images of weakly scattering objects Anton Barty

  • Skipped

Confidence limits Philip Stark

Practical Issues in Statistical Interpretation of Tevatron Data (Thomas Junk, Fermilab, Particle Physics)

Trials 'factor' in 'bump' hunts (Arthur Snyder, SLAC, Particle Physics)

  • "I just used a triple Gaussian with a simple E-2 background and a Wengier's bootstrap estimate. I think its right. I need to look into it."
  • "I can do 107 in a few hours without optimizing the code so..."

Some Discovery and Classification Challenges in the Exploration of Massive Data Sets and Data Streams (George Djorgovski, Caltech)

  • Astronomy with massive data sets and data streams
  • Telescope + instrument are "just" a front end to data systems, where the real action is.
  • Volume and complexity (e.g. panchromatic coverage) increase
  • Simulations also generate lots of data - theory expressed as a data set, which must be mined and compared with empirical data sets.
  • Targeted observations vs sky surveys
  • Systematic exploration of observable parameter space (time, morphological, spectral, astrometric domains)
  • raw data -> source detection and attribute measurement -> analyze the data (select interesting targets) -> obtain follow-up spectra
  • Statistical Approaches are Inevitable
  • Fit models of energy distribution for galaxies, for example. Can include probability density distribution which need not be simple.
  • Statistical Descriptors
    • Power spectrum
    • Properties in population studies
  • Transient/variable Sources
    • several sub-significant detections
    • source position can be known or unknown
    • could be heterogeneous surveys, other wavelengths
    • most data may not contain any detections
  • Frank Masci's Method
  • Other cases:
    • Sources on a Clustered Background
    • Significant flare
  • Clustering in Parameter Space
    • multivariate correlations - clusters with a reduction of the statistical dimensionality
    • e.g. color parameter space: g-r, u-g, [3.6]-r (optical)
    • e.g. quasar selection in color parameter space, done by finding outliers relative to the main clusters (see the sketch after this list)
  • Thoughts/Challenges
    • Data mining is statistics expressed as algorithms
    • Scalability issues
    • Maybe learn to live with an incomplete analysis
    • Clusters are seldom Gaussian - beware of the assumptions
    • Can we develop some entropy-like criterion to isolate portions or projections of hyper-dimensional parameter spaces where "something non-random may be going on?"
    • Need to account for heteroskedastic errors
    • Visualization in >>3 dimensions is a huge problem
  • Cosmic Cinematography (Astronomy in Time Domain)
    • Synoptic surveys + time domain
    • Catalina Real-Time Transient Survey (public data policy)
  • Classification should take humans out of the loop b/c data sizes are going to get huge.
    • Talked about classifier for variable stars
    • Currently likes 2D Light Curve Priors (Lead: B. Moghaddam)
  • Marvin Weinstein's dynamic clustering algorithm - finds clusters in very high-dimensional spaces.
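
A minimal sketch (my own toy, not any survey pipeline) of the outlier-in-color-space idea referenced above: fit a mixture model to the dense locus in a two-color diagram and flag the lowest-density objects as candidates for follow-up.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(6)
# Toy "u-g vs g-r" colors: a dominant stellar locus plus a few off-locus objects
locus = rng.normal([1.2, 0.5], [0.3, 0.2], size=(2000, 2))
outliers = rng.uniform([-0.5, -0.5], [0.5, 0.0], size=(20, 2))
colors = np.vstack([locus, outliers])

gmm = GaussianMixture(n_components=3, random_state=0).fit(colors)
log_dens = gmm.score_samples(colors)                  # log density per object
threshold = np.quantile(log_dens, 0.01)               # flag the 1% least likely
candidates = np.nonzero(log_dens < threshold)[0]
print(f"{len(candidates)} outlier candidates flagged out of {len(colors)} objects")
```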

-- MarkWhitehead - 2012-06-04