## 2012 Progress on Statistical Issues in Searches

### Monday, June 4, 2012

#### Model selection, estimation, and bootstrap smoothing (Brad Efron, Stanford, Statistician)

• Very detailed statistics talk - interested in including the effect of model selection in accuracy (standard error) estimation (model selection problems).
• This "E{y|xj}" means "Expected value of y given each x."
• Statistical models are used to analyze data. Each model is associated with different criteria and statisticians use those criteria to pick a model.
• Discussed the following models to analyze cholesterol data (compliance, decrease in cholesterol):
• Cp (Cubic) Model Analysis
• Nonparametric Bootstrap Analysis
• Methods for bootstrap confidence intervals: Standard, Percentile, Smoothed Standard, BCa/ABC
• Smoothing increases the computation required to determine variance - an Accuracy Theorem is used to avoid this.
• Discussed analysis of supernova data (absolute magnitude of Type Ia supernova, vector of 10 spectral energies)
• Parametric Bootstrap Analysis
• Chose Full Model (Ordinary Least Squares Prediction)
• Then tried Lasso Model since Full Model "might be too noisy".
• Used the parametric Accuracy Theorem for bootstrap smoothing in this case.
• Then tried Bootstrap re-weighting and just plotted results to see the effect.
• Regressions, Histograms, Box plots, estimators, oh my.
• Statisticians use experience and heuristics to figure out which model provides the "best" data analysis. This process sometimes uses a priori knowledge of the data.
• Question - Model selection is impacted by the amount of available data (available data typically increases over time). Answer: Uh, carefully apply model selection.
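
The smoothing idea in the talk - averaging a discontinuous, model-selected estimator over bootstrap replicates - can be sketched in a few lines. This is a toy stand-in, not Efron's cholesterol analysis; the hard-thresholding "selection" rule and all numbers here are invented for illustration.

```python
import random
import statistics

random.seed(0)
data = [random.gauss(0.3, 1.0) for _ in range(50)]

def estimator(sample):
    """A 'model-selected' estimator: report the mean only if it is
    significantly nonzero, else report 0 (a discontinuous rule)."""
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return m if abs(m) > 2 * se else 0.0

B = 2000
reps = []
for _ in range(B):
    boot = [random.choice(data) for _ in range(len(data))]  # nonparametric resample
    reps.append(estimator(boot))

theta_hat = estimator(data)           # raw, unsmoothed estimate
theta_smooth = statistics.mean(reps)  # bootstrap-smoothed (bagged) estimate
sd_boot = statistics.stdev(reps)      # bootstrap standard error

print(theta_hat, theta_smooth, sd_boot)
```

The bagged estimate varies smoothly with the data, whereas the raw estimator jumps whenever the selection rule flips; the spread of the replicates gives the bootstrap standard error.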

#### Dogs, non-dogs and statistics: (Bayesian) searches in cosmology (Roberto Trotta, Imperial College London, Cosmology)

• Imperial Centre for Inference and Cosmology, Bayesian work (as opposed to frequentist)
• Background:
• Cosmological concordance model: inflation, dark matter, dark energy
• Model assumptions: Isotropy and homogeneity, Approx. Gaussianity of CMB fluctuations, Adiabaticity
• Data set: WMAP7
• The power spectrum contains the full statistical information IF fluctuations are Gaussian.
• Baryonic Acoustic Oscillation (BAO) - derived from WMAP data.
• Gets relation between redshift and acoustic scale.
• This yields constraints on total matter and dark energy.
• Discussed ways to deviate from the vanilla model:
• Bispectrum, wavelets, skewness, kurtosis, genus statistics, Minkowski functionals, needlets - higher-order statistics that capture deviations from Gaussian inflation.
• Search for non-Gaussianity, non-trivial topology (light from distant objects can reach us along multiple paths)
• Searches - use the WMAP data, apply these different models, plot the results and compare with vanilla model.
• Generic Departures from the LCDM
• Search for deviations from the Concordance Model
• Number of neutrinos species and statistical isotropy - unclear if these two are significant deviations from the model.
• Principled Bayesian model selection:
• Level 1: Select model M and prior P(θ|M) -> parameter inference
• Level 2: Compare several possible models
• Level 3: Model averaging (none of the models is clearly the best)
• Bayesian stats lets you get P(M|d) from P(d|M)
• Bayesian evidence balances quality of fit vs extra model complexity.
• Jaynes - "there is no point in rejecting a model unless one has a better alternative"
• Showed source detection using Bayesian reconstruction - 7/8 objects correctly identified. Mistake happens b/c two objects are very close.
• Showed cluster detection from the Sunyaev-Zel'dovich effect in CMB maps using Bayesian model selection.
• Mentioned Multinest (Feroz and Hobson, 2007)
• "Many anomalies/unexpected deviations go away with better data/modeling/insight. Is this evidence the community jumps too soon b/c of statistical flukes?"
• Great question - What is the scientific conclusion when Bayesian and Frequentist approaches disagree?
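
The Level-2 comparison rests on the Bayesian evidence Z = ∫ P(d|θ,M) P(θ|M) dθ. A minimal sketch, with made-up Gaussian data and a made-up prior width, showing how the evidence balances fit quality against extra model complexity:

```python
import math

# Toy data: n measurements with unit Gaussian noise (invented numbers)
data = [0.8, 1.2, 0.4, 1.0, 0.6]

def log_likelihood(mu):
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in data)

# M0: mu fixed at 0 (no free parameters) -> evidence is just the likelihood
log_Z0 = log_likelihood(0.0)

# M1: mu free, Gaussian prior N(0, tau^2); evidence by simple quadrature
tau = 2.0
Z1 = sum(math.exp(log_likelihood(mu))
         * math.exp(-0.5 * (mu / tau) ** 2) / (tau * math.sqrt(2 * math.pi))
         * 0.01 for mu in [0.01 * i for i in range(-1000, 1001)])
log_Z1 = math.log(Z1)

# Bayes factor B10 > 1 favours the extra-parameter model, but the prior
# width tau penalises the added complexity (the Occam factor)
B10 = math.exp(log_Z1 - log_Z0)
print(B10)
```

For this data the best-fit M1 likelihood beats M0 by a factor of about five, yet the Bayes factor is near one: the Occam penalty from the wide prior eats most of the improvement.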

#### Recent developments and current challenges in statistics for particle physics (Kyle Cranmer)

• Center for Cosmology and Particle Physics
• LHC
• This talk attempts to explain what LHC particle physicists do, to see if statisticians can help out...
• Showed Lagrangian of the Standard (Matter) Model
• energy of configuration of matter fields
• energy allows for predicting how configuration will evolve with time
• Feynman diagrams / Quantum Field Theory
• Method
• QFT, Monte Carlo/Perturbation Theory, Feynman Diagrams
• Simulate particle interactions
• Run algorithms on sim data to detect particular particles
• 10^14 collisions, looking for a few interesting interactions (e.g. Higgs)
• Models
• Poisson Probability Density Function
• Parametric (exponential, polynomial), non-parametric
• 20 dimensions of 'nuisance parameters' over several different channels of data.
• RooFit, RooStats - toolkit for determining expected values of some physical process.
• Statistical Analysis
• Primarily frequentist, some Bayesian but there is a general dislike for assigning prior probabilities to theoretical particles not yet observed.
• "Bayesian probability allows prior knowledge and logic to be applied to uncertain statements. Theres another interpretation called frequency probability, which only draws conclusions from data and doesnt allow for logic and prior knowledge." (ML in Action)

#### Inverse problems in X-ray scattering (Stefano Marchesini, Lawrence Berkeley National Laboratory, Photon Science)

• Inverse Problems background
• Started with background on X-rays and basic description of interaction with matter.
• Complicated to calculate model based on observed data, particularly when absorption and reflection are taken into account.
• Sparse modeling is a powerful method to extract information from noisy data.
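
The sparse-modeling point can be made concrete with the standard iterative soft-thresholding algorithm (ISTA) for the lasso on synthetic data - a generic compressed-sensing sketch, not the speaker's X-ray reconstruction code; matrix sizes and the regularization strength are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Underdetermined linear model y = A x + noise, with x sparse
n_meas, n_feat, k = 40, 100, 4
A = rng.normal(size=(n_meas, n_feat)) / np.sqrt(n_meas)
x_true = np.zeros(n_feat)
x_true[rng.choice(n_feat, k, replace=False)] = rng.normal(0, 5, k)
y = A @ x_true + 0.01 * rng.normal(size=n_meas)

# ISTA: minimize 0.5*||y - A x||^2 + lam*||x||_1 by gradient steps
# on the smooth term followed by soft-thresholding
lam, step = 0.1, 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(n_feat)
for _ in range(1000):
    z = x - step * (A.T @ (A @ x - y))                       # gradient step
    x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

print(np.sum(np.abs(x) > 0.2), np.linalg.norm(x - x_true))
```

Despite having far fewer measurements than unknowns, the L1 penalty recovers the few active coefficients from the noisy data.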

#### Statistical issues in astrophysical searches for dark matter (Jan Conrad, Stockholm University, Particle Physics)

• DM interacts with ordinary particles, causing a measurable signal which can be detected.
• Remove instrument background - Multivariate classification, machine learning
• exact methods vs asymptotic approximations
• Marked Poisson?
• Same point as Kyle's talk - nuisance parameters
• Hypothesis testing
• Applicability of Wilks' and Chernoff's theorems
• Separate families of hypotheses?
• Interval estimation
• Methods used: Unified ordering, marginalization
• Emergence of the profile likelihood
• Complex likelihood functions: global fit in Supersymmetry
• Bayesian v Frequentist statistics

#### The parametric bootstrap and particle physics (Luc Demortier, Rockefeller University, Particle Physics)

• Started out as a nice, organized talk promising a beginning, a middle and an end. Lost it half way through...Bataan death march.
• Bootstrap provides a tractable numerical approach that approximates exact methods.
• Bootstrap is a frequentist methodology.
• Two basic ideas:
• Plug-in (or substitution) principle
• Re-sampling - generate toy data samples that are statistically equivalent to the observed data sample. Two possibilities:
• Parametric re-sampling
• Non-parametric re-sampling
• Main uses:
• Bias reduction
• variance estimation
• Confidence interval construction
• Hypothesis testing
• When can we be sure the bootstrap estimator converges to the true value as the sample size increases?
• Not on boundaries
• Not on maximum values
• Gave a confidence interval example from particle physics.
• Bootstrap Confidence Interval lessons
• Use a pivot or an approx pivot whenever possible (A pivot is a function of both data and parameters, whose distribution does not depend on any unknowns.).
• Test inversion seems to improve performance of bootstrap.
• Nice summary slide.
• Bootstrap book
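
The parametric-resampling branch is easy to demonstrate end to end: fit a model to the data (the plug-in principle), draw toy samples from the fitted model, and read a percentile interval off the bootstrap distribution. The exponential-lifetime setup below is invented for illustration; note that for this model τ̂/τ is an exact pivot, the property the lessons above recommend exploiting.

```python
import random
import statistics

random.seed(1)

# Observed sample from an exponential decay-time distribution
n, tau_true = 50, 2.0
data = [random.expovariate(1.0 / tau_true) for _ in range(n)]
tau_hat = statistics.mean(data)   # plug-in (MLE) estimate of the lifetime

# Parametric re-sampling: draw toy samples from the *fitted* model
B = 2000
boot = []
for _ in range(B):
    toy = [random.expovariate(1.0 / tau_hat) for _ in range(n)]
    boot.append(statistics.mean(toy))

# Percentile confidence interval from the bootstrap distribution
boot.sort()
lo, hi = boot[int(0.025 * B)], boot[int(0.975 * B)]
print(tau_hat, (lo, hi))
```

Swapping `random.choice(data)` for the model draws would turn this into the non-parametric variant.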

#### Two developments in tests for discovery: use of weighted Monte Carlo events and an improved measure (Glen Cowan, Royal Holloway, University of London, Particle Physics)

• Super fast, hardcore statistics talk.
• I could not follow the contours of the swamp, but I did learn a new word.

### Tuesday, June 5, 2012

#### Effects of extraneous noise in Cryptotomography (reconstructions from random, unoriented tomograms) (Duane Loh (grad student), PULSE Institute - SLAC National Laboratory, Photon Science)

• Statistics in nano-particle (3D) imaging using x-rays.
• Individual particle conformation?
• Photon counts from diffraction patterns - 100's of photons per particle imaged.
• Particles can't be held, continual x-ray bombardment destroys particle, don't know orientation of the particle (hence the crypto part).
• Signal averaging, phase retrieval on random, noisy tomograms.
• How to proceed:
• Look for fixed points. Replicate in 8 orientations and compare to data stream. Waved hands on stats used to determine compatibility between data and model.
• Ab initio reconstruction - expectation maximization. Uh...
• Talked about how he might deal with extraneous noise (i.e. signal averaging). Rushed b/c running out of time.
• Combine orientations and look for statistically improbable representative values.
• Novice speaker.
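
The expectation-maximization step mentioned above is easiest to see in its simplest setting. The sketch below fits a two-component 1D Gaussian mixture - not the orientation-recovery EM of the talk, but the same E-step/M-step structure, with the latent component labels standing in for the unknown orientations:

```python
import math
import random

random.seed(2)

# Toy data: two well-separated Gaussian components (equal weights, unit sigma)
data = ([random.gauss(-2.0, 1.0) for _ in range(200)] +
        [random.gauss(3.0, 1.0) for _ in range(200)])

mu = [-1.0, 1.0]   # crude initial guesses
for _ in range(50):
    # E-step: responsibility of component 0 for each point
    r0 = []
    for x in data:
        p0 = math.exp(-0.5 * (x - mu[0]) ** 2)
        p1 = math.exp(-0.5 * (x - mu[1]) ** 2)
        r0.append(p0 / (p0 + p1))
    # M-step: responsibility-weighted mean updates
    w0 = sum(r0)
    mu[0] = sum(r * x for r, x in zip(r0, data)) / w0
    mu[1] = sum((1 - r) * x for r, x in zip(r0, data)) / (len(data) - w0)

print(sorted(mu))
```

EM never observes which component generated a point, yet the alternation converges to the two true means - the same trick lets the reconstruction marginalize over unmeasured particle orientations.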

#### Filtering femtosecond x-ray scattering data using Singular Value Decomposition (Trigo, Mariano, SLAC , Photon Science)

• Interested in imaging ultra-fast x-ray diffraction.
• Self-Amplified Spontaneous Emission (SASE) - electron bunches emit synchrotron radiation that self-amplifies as it co-propagates with the bunch.
• Scattering experiment that uses SVD to manipulate the data in matrix format to reduce dimensionality and obtain higher signal to noise.
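
The SVD filtering step can be sketched on synthetic data: build a pixels × time-delay matrix whose signal is low-rank, add noise, and truncate the SVD. The rank-2 signal and noise level below are invented; in practice the truncation rank k is chosen by inspecting the singular-value spectrum for the noise floor.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "time-resolved scattering" matrix: pixels x time delays,
# a rank-2 signal (two temporal components) buried in noise
n_pix, n_t = 200, 80
t = np.linspace(0, 10, n_t)
signal = (np.outer(rng.normal(size=n_pix), np.exp(-t / 2.0)) +
          np.outer(rng.normal(size=n_pix), np.cos(t)))
noisy = signal + 0.5 * rng.normal(size=(n_pix, n_t))

# SVD filter: keep only the leading components, discard the noise floor
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
k = 2
filtered = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

err_noisy = np.linalg.norm(noisy - signal)
err_filt = np.linalg.norm(filtered - signal)
print(err_noisy, err_filt)
```

The truncation keeps the coherent dynamics while throwing away most of the noise power, which is the signal-to-noise gain the talk describes.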

#### Machine Science: Distilling Natural Laws from Experimental Data, from nuclear physics to computational biology (Automating discovery of invariants) (Lipson, Hod - Cornell University, Statistical)

• Machine looks at physical system and produces mathematical, symbolic model.
• Example - double pendulum model produced Hamiltonian invariant.
• Motivated by making better robots - learning from the environment with as few experiments as possible.
• Started with Genetic Algorithms to create robots that move as fast as possible (simulated) that also worked in reality.
• Can more complex robots learn using evolutionary techniques?
• Also evolving control programs to make robots behave in certain ways.
• Limited success with programs controlling robot movement. Robots moved but in lame ways.
• "Simulator" -> evolve robots -> build it -> collect sensor data -> evolve simulator (co-evolution)
• Emergent model - robot figures out what it is, i.e. how it is configured
• Then figures out how to move
• Ideal of self modeling applied to more abstract applications
• candidate models -> candidate tests -> perturbations -> candidate models in a way that maximizes disagreement between predictions
• Symbolic regression (e.g. what function describes a data set). Disadvantages: computationally expensive and overfits the data.
• Combined this with co-evolution using sub-sets of data to maximize disagreement between models.
• Started to apply this to inferring equations from data (e.g. used bio system equations to produce data, fed data to system, system returns equations)
• Then took high speed photos of flapping wing, derived representative parameters (e.g. lift, wing size), fed data to system and system reproduced several models, some known, some new.
• Epic fail on the single-cell domain. Added a time-delay building block and removed sin/cos building blocks. Produced a more elegant model than the human-produced equations.
• Looking For Invariants
• We can fit models to data but what do models mean?
• Pendulum example...energy is constant, Hamiltonian gives ability to prediction of system behavior. Can this be applied to evolution model - what is constant in the system?
• Started using ratio of derivatives to find non-trivial invariants.
• Published code: Eureqa - data mining tool: you give it data and it spits out models.
• How does it handle noise?
• Great talk.
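
The invariant search hinges on expressions whose time derivative vanishes along trajectories. As a minimal check (not the Eureqa search itself), one can integrate the pendulum mentioned above and verify numerically that the Hamiltonian is conserved - exactly the kind of expression the search is rewarded for finding:

```python
import math

# Simple pendulum: theta' = omega, omega' = -sin(theta).
# Candidate invariant: the Hamiltonian H = omega^2/2 - cos(theta);
# we integrate with RK4 and track the worst-case drift of H.

def step_rk4(theta, omega, h):
    def f(th, om):
        return om, -math.sin(th)
    k1 = f(theta, omega)
    k2 = f(theta + h / 2 * k1[0], omega + h / 2 * k1[1])
    k3 = f(theta + h / 2 * k2[0], omega + h / 2 * k2[1])
    k4 = f(theta + h * k3[0], omega + h * k3[1])
    theta += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    omega += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return theta, omega

theta, omega = 1.0, 0.0
H0 = 0.5 * omega ** 2 - math.cos(theta)
drift = 0.0
for _ in range(10000):
    theta, omega = step_rk4(theta, omega, 0.01)
    drift = max(drift, abs(0.5 * omega ** 2 - math.cos(theta) - H0))

print(drift)   # stays tiny: H is conserved along the trajectory
```

A trivial constant like f = 1 also has zero derivative, which is why the search must additionally reward non-trivial dependence on the state - the "ratio of derivatives" trick mentioned above.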

• Skipped

#### Trials 'factor' in 'bump' hunts (Arthur Snyder, SLAC, Particle Physics)

• "I just used a triple Gaussian with a simple E-2 background and a Wengier's bootstrap estimate. I think its right. I need to look into it."
• "I can do 107 in a few hours without optimizing the code so..."

#### Some Discovery and Classification Challenges in the Exploration of Massive Data Sets and Data Streams (George Djorgovski, Cal Tech)

• Astronomy with massive data sets and data streams
• Telescope + instrument are "just" a front end to data systems, where the real action is.
• Volume and Complexity (e.g. Panchromatic) increases
• Simulations also generate lots of data - theory expressed as a data set, which must be mined and compared with empirical data sets.
• Targeted observations vs sky surveys
• Systematic Exploration of Observable Parameter Space (time, morphological, spectrum, astrometric domains)
• raw data -> source detection and attribute measurement -> analyze the data (select interesting targets) -> obtain follow-up spectra
• Statistical Approaches are Inevitable
• Fit models of energy distribution for galaxies, for example. Can include probability density distribution which need not be simple.
• Statistical Descriptors
• Power spectrum
• Properties in population studies
• Transient/variable Sources
• several sub-significant detections
• source position can be known or unknown
• could be heterogeneous surveys, other wavelengths
• most data may not contain any detections
• Frank Masci's Method
• Other cases:
• Sources on a Clustered Background
• Significant flare
• Clustering in Parameter Space
• multivariate correlations - clusters with a reduction of the statistical dimensionality
• e.g. color: g-r, u-g, [3.6]-r parameter space (optical)
• e.g. quasar selection in color parameter space, found as outliers when clustering in parameter space
• Thoughts/Challenges
• Data mining is statistics expressed as algorithms
• Scalability issues
• Maybe learn to live with an incomplete analysis
• Clusters are seldom Gaussian - beware of the assumptions
• Can we develop some entropy-like criterion to isolate portions or projections of hyper-dimensional parameter spaces where "something non-random may be going on?"
• Need to account for heteroskedastic errors
• Visualization in >>3 dimensions is a huge problem
• Cosmic Cinematography (Astronomy in Time Domain)
• Synoptic surveys + time domain
• Catalina Real-Time Transient Survey (public data policy)
• Classification should take humans out of the loop b/c data sizes are going to get huge.
• Talked about classifier for variable stars
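
The quasar-selection-as-outliers idea can be sketched with a Mahalanobis-distance cut in a toy two-color diagram - the stellar locus, the quasar colors, and the 4-sigma cut below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy colour-colour diagram: a tight stellar locus plus a few off-locus objects
stars = rng.normal([0.5, 1.2], [0.15, 0.15], size=(500, 2))
quasars = rng.normal([1.8, -0.5], [0.2, 0.2], size=(5, 2))
objects = np.vstack([stars, quasars])

# Flag outliers by Mahalanobis distance from the bulk population
mean = objects.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(objects.T))
d2 = np.einsum('ij,jk,ik->i', objects - mean, cov_inv, objects - mean)
outliers = np.where(d2 > 16.0)[0]    # roughly a "4-sigma" cut

print(len(outliers), outliers)
```

This assumes the bulk cluster is roughly Gaussian - exactly the assumption the "clusters are seldom Gaussian" caveat above warns about, which is why real selections use more flexible density estimates.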