Supercomputing 2012 Summary:
Papers can be found at:
http://www.gb.nrao.edu/internal/sc12/SC12Proceedings/SC12/html/Papers.html
Tutorials are at:
http://www.gb.nrao.edu/internal/sc12/SC12_Tutorials/SC12Tutorials.html
Sunday:
Tutorials on:
Monday:
- Multicore tutorial - likwid, numactl (full day)
Tuesday:
Exhibitor Forum on Advanced Architectures:
PCIe as a data center fabric (vendor forum, PLX)
- fewer component parts (except retimer)
- PCIe looks the same for local or across room (software-wise)
- higher end storage going to PCIe interfaces
- "Convergence is the key"
- instead of duplicating functions on every node
- aggregate the storage of all hosts, aggregate hosts, etc.
- Enhancements to PCIe to make it a fabric
- SR-IOV established standard
- Provide easy migration paths (upgrade nodes independently)
- PCIe card 'Express NIC'
- top of rack PCI express switch
- 1U box
- PLX PEX 8796
- software drivers to emulate features
- PCIe to PCIe direct connect or fabric via switch/cards
- BIOS support (will it understand so many devices?)
Paving the road to exascale computing (vendor forum, Mellanox)
Todd Wilde
- switch bridges InfiniBand and Ethernet: ConnectX-3 / SwitchX-2 / Connect-IB
- next tech -> 100 Gbit/sec x dual ports
- off-loads protocol processing to adapter hw
- connection/connection-less
- virtualization ready
- RHEL 6.4 via KVM
- MXM & FCA software integrated into MPI
- HW accelerated scatter gather
- message aggregation
- Mellanox to GPU direct transfers
- Mellanox VPI
- March 2013 release
- Analysis tools for network analysis
Affordable Shared Memory for Big Data (vendor forum, NUMASCALE)
- Supermicro is the hardware partner
- intro to ccNUMA
- turn collections of machines into one big machine
- global shared memory
- numa-connect connection fabric
- numa-chip & remote cache
- 4-6 SERDES links
- remote cache (2-4GB)
- L4 cache
- operates on 64byte cache line size
- L4 cache ~300ns
- Remote memory ~1000ns
- Illustrated torus connection network
- Applications store directly to remote memory
- 72 nodes, 1728 cores, 4.6TB RAM
- Most MPI msgs < 128 bytes
- Supports 256TB of main memory
- Kernel support in Linux
- Opteron only
- Not a PCI card; plugs into the HyperTransport socket on some motherboards
Python in HPC (BOF)
Lightning Talks:
Cloud hosted environment
Peter Wang:
- IP[y] Notebook
- place environment at data
- ipython notebook
- Anaconda
- browser interface
- matplotlib? Custom, but becoming open source
- continuum.io product webpage
Numba
Travis Oliphant
- continuum analytics
- cpython target
- LLVM-PY
- compiles
- decorators autojit jit
- big speedups over python
- not necessary to write c anymore
easybuild
Kenneth Hoste
- build/install procedures (if they exist) are time-consuming
- framework for building software
- sounds like Yocto
- written in python
- various compiler support
- on github
- build is reproducible
- automatic dependency resolution
- easy_install --user easybuild ...
Enthought
Sean Ross-Ross
- Python compiler 'seamless'
- listed several compiled python systems
- templates?
- jit decorator compiles into asm
- Call compiled python from C code
FeniCS and PyOP2
Andy Terrel
- finite element codes
- FFC compiler
- DOLFIN (domain specific lang)
- solver
- PyOP2 - parallel unstructured mesh framework
- integrates with FeniCS
- parallel loops for iteration over sets
- fenicsproject.org
Signal processing library VSIPL
Randall Judd
- OMG standard
- SWIGed version
- omg.org
- interested in help to get info in/out of his library
Open Discussion:
- Getting users off Matlab
- They like GUIs
- iPython notebook
- numfocus.org
- pyhpc.org
GPU Programming Models:
Early Evaluation of Directive based GPU Programming Models:
Oak Ridge / Georgia Institute of Technology
- Motivation for GPUs: power budgets
- CUDA, OpenMPC, OpenACC, PGI Accelerator
- each has different levels of abstractions
- Provide high level of abstraction
- Allow directives to specify code location (GPU/CPU)
- specify memory copies
- kernel loop style 'gang'
- R-stream example
- only single directive (placement CPU/GPU)
- Implicit vs. explicit features (copy, dealloc, GPU-specific opts)
- If the code doesn't fit compiler model ...
- Tested various models (5)
- 2 kernels, 3 NAS parallel benchmarks
- OpenMPC 100%
- R-stream 37.9% (program coverage)
- Case study Jacobi example
- memory access patterns of various parallelization efforts
- NAS parallel benchmark FT (old code)
- lots of function calls
- not trivial to port
- OpenACC is 1st step toward single GPU standard
- interpretation of nested work-sharing loops (issue)
- Scalability
- limited number of GPUs per node
- applicable only at small scale
- Debuggability
- Too much abstraction - little idea how translation works
Automated Generation of Software Pipelines for Heterogeneous Parallel Systems
Jacques Pienaar, Purdue University / NEC Labs
- Data, Task and pipeline parallelism
- expressing pipeline parallelism is hard
- System generates the pipeline from annotated C++ code
- Identify stages, schedule tasks onto processing units
- Balance of parallel pipelines (constrained by slowest)
- Partition to balance pipelines
- iteratively partition and schedule
- Automatic Heterogeneous Pipelines (AHP)
- source to source compiler
#pragma ahp pipeline                  /* marks the pipeline region */
{
    ...
}
#pragma ahp task in(...) out(...)     /* a stage and its data dependencies */
#pragma ahp functions { ... }
#pragma ahp variant target(MIC)       /* like an ifdef for the architecture */
- tasks like gates, dependencies like wires (HW circuit analogies)
- iterative partition, schedule
- Uses TBB pipeline framework
Accelerating Map-reduce on CPU-GPU combination
Ohio State
- trends toward integrated CPU/GPU. Examples:
- AMD fusion
- Intel Ivy Bridge
- Scheduling
- map-dividing scheme
- pipelining scheme
- memory hierarchy of GPU-CPU integrated
- private
- shared
- host (shared with CPU)
- Map-reduce
- map generates key,value pairs
- reduce merges the resulting values associated with key
- Memory overhead
- map, shuffle, reduce
- their implementation omits shuffle stage
- Reduction object based on a hash table (see the sketch below)
- Reduction based on continuous reduction
- Scheduling
- map-dividing
- split CPU/GPU
- static scheduling not sufficient
- use one core as scheduler, rest of cores and GPU are workers
- pipelining scheme
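
A generic C sketch of the reduction-object idea (my own reconstruction; ro_emit and the slot layout are invented names, not their code): each emitted (key, value) pair is combined into a hash table immediately, which is how the shuffle stage gets eliminated.

#define NBUCKETS 4096

/* Hypothetical reduction object: a hash table that combines each emitted
   (key, value) pair in place, so no shuffle stage or per-pair buffering
   is needed. */
typedef struct { int key; double val; int used; } slot_t;
typedef struct { slot_t slots[NBUCKETS]; } reduction_object;

static void ro_emit(reduction_object *ro, int key, double val)
{
    unsigned h = (unsigned)key % NBUCKETS;
    while (ro->slots[h].used && ro->slots[h].key != key)
        h = (h + 1) % NBUCKETS;               /* linear probing */
    if (!ro->slots[h].used) {
        ro->slots[h].used = 1;
        ro->slots[h].key = key;
        ro->slots[h].val = 0.0;
    }
    ro->slots[h].val += val;                  /* reduce in place */
}

Each worker (a CPU core or the GPU) would keep a private reduction object in fast memory and merge the objects at the end; the map-dividing vs. pipelining schemes above only change who runs the map over which input chunk.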
Exhibitor Forum:
How Memory and SSDs can Optimize Data Center Operations
- In-depth description of SSDs: how they operate and their challenges
End of Latency: A new Storage Architecture
- Improving latencies on storage arrays using SSD caches
Expanding the role of Solid State in HPC
- Buy lots of our drives ...
State of OpenMP//OpenMP-4.0 (BOF)
OpenMP Future
- 15th birthday of OpenMP
- 1997 "new std to govern PC's with multiple chips"
- OpenMP mission
- OpenACC is a spinoff from 4 members; possible re-integration into OpenMP planned
- openmp.org/calendar.html
- Tech report 1
- Release candidate 1 for OpenMP 4.0
Tech presentation from lang chair
- OpenMP 3.1: July 2011
- OpenMP 4.0 nearing completion
- RC1 comment draft
- RC2: Feb 2013
- spec release in early June
- feedback from non-members is welcome
- OpenMP 5.0 comment draft @SC14
- OpenMP 4:
- ticket process
- SIMD directives
- extended support for affinity
- additional support for fortran 2003
- user-defined reductions
(Presenter promised to post slides to OpenMP website)
- #pragma omp simd [clause ...]
- (chunks the following loop into SIMD chunks)
- clauses:
- safelen(length) -- limits the vector length the compiler may assume safe
- linear -- list vars
- aligned
- private, lastprivate, reduction, collapse
- firstprivate? (couldn't find a use case)
- What happens if the loop contains function calls?
- #pragma omp declare simd -- decorates a function
- both parallelize and vectorize simultaneously
- #pragma omp parallel for simd (see the sketch after this list)
- Other OpenMP 4 support
- OMP_PLACES to specify threads, cores, sockets
- proc_bind(master | close | spread)
- ignored if OMP_PROC_BIND is false
- omp_get_proc_bind()
- Fortran 2003 support
- taskgroups
- cancel tasks
- new data environment
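
A minimal C sketch of the SIMD directives above, assuming the RC1 spellings from the talk (clause details could shift before the final spec):

#include <stdio.h>

/* Compile a SIMD-callable version of the function so it can be invoked
   from inside vectorized loops. */
#pragma omp declare simd
static float scale(float x) { return 2.0f * x; }

int main(void)
{
    enum { N = 1024 };
    static float a[N], b[N];
    for (int i = 0; i < N; i++) b[i] = (float)i;

    /* Parallelize across threads and vectorize each thread's chunk;
       safelen(8) asserts no loop-carried dependence shorter than 8
       iterations. */
    #pragma omp parallel for simd safelen(8)
    for (int i = 0; i < N; i++)
        a[i] = scale(b[i]);

    printf("%f\n", a[N - 1]);
    return 0;
}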
OpenMP accelerator model (OpenMP 4/5)
- target directives
- target
- target data
- target update
- target mirror
- target linkable
- new runtime funcs:
/* A parallel reduction on the target: */
#pragma omp target device(acc0) map(B,C)
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++)
    sum += B[i] * C[i];

/* Compile a function for the target device: */
#pragma omp declare target
void func(...);
#pragma omp end declare target
- Synchronization capabilities are supported in OpenMP, but GPUs have weak memory-consistency models
Wednesday:
Algorithms
A Framework for Low-communication 1-D FFTs
- DFT: O(N^2)
- FFT: O(N log N)
- applies smaller DFTs a number of times
- communication pattern of FFT
- cost of moving data higher than FP cost
- 3 passes of all-all communication
- new class of DFT factorization
- all to all is needed even if final interest is a subset
- come up with a time sequence representing a subset of F(s)
- modulate by window function, then make periodic by shift add
- Modulation is convolution in t domain
- periodicity in F is sampling in t domain
- Approximate
- convolution is weighted sum (windowed) truncation error
- demodulation step in F window trunc
- periodization stage can produce aliasing
- but we can control these tradeoff for speed
- O(M log M) + O(N*B) <-- more compute but less communication
- Also nicely SIMD friendly
- blocking for caches
- Implementation: MPI/OpenMP + FFT + BLAS
- Run on Intel HW
- Xeon E5-2670
- QDR IB
- 2x faster than FFTW / Intel MKL
- 2.4x theoretical speedup at 10.5-digit accuracy
- SOI FFT is implemented in MKL 11.0.1
Tiling Stencil Computations to Maximize Parallelism
- examples of evaluation order of gridded data
- dependencies between grid data
- validity of tiling constraints:
- tile is a piece of computation which can be executed atomically
- block after block
- Invalid vs valid tiling
- blocking by hyperplanes
- many valid ones -> cost function + validity
- inter tile dependency
- hyperplanes found such that normal to the face is in their cone
- source level transformations used
- Intra tile scanning order
- lower dimensional concurrent start (N-1)
- comparisons of various partitioning schemes
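
For reference, a generic cache-blocked stencil sweep in C (plain spatial tiling of a single Jacobi sweep; this is my own illustration, not the paper's hyperplane/concurrent-start scheme):

#define N 1024
#define T 64                           /* tile edge; tune for cache */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

static double A[N][N], B[N][N];

/* One Jacobi sweep executed tile by tile. Reading A and writing B makes
   every tile an atomic unit with no inter-tile dependence, so the tile
   loops can run in parallel; multi-sweep (time-tiled) schemes like the
   paper's must additionally respect dependences between sweeps. */
void jacobi_sweep_tiled(void)
{
    #pragma omp parallel for collapse(2)
    for (int ii = 1; ii < N - 1; ii += T)
        for (int jj = 1; jj < N - 1; jj += T)
            for (int i = ii; i < MIN(ii + T, N - 1); i++)
                for (int j = jj; j < MIN(jj + T, N - 1); j++)
                    B[i][j] = 0.25 * (A[i-1][j] + A[i+1][j]
                                    + A[i][j-1] + A[i][j+1]);
}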
Efficient Data Restructuring and Aggregation for I/O acceleration in PIDX
- ViSUS vs. PIDX (simulation)
- IDX multi-res uses cache oblivious layout
- PIDX - I/O library to write data
- Data streaming for processing and data visualization
- case study: S3D on 32,000 cores
- Data layout hierarchy of HZ curves (data layout)
- Z-order blocking of data to ensure locality in 2-D (see the bit-interleaving sketch after these notes)
- multi-level hierarchy
- Parallel implementation so that data is written in parallel
- No interleaving
- writing small blocks inefficient, so aggregate into larger blocks
- HZ encoding is based on powers of 2; for non-power-of-2 sizes, indexes are skipped
- Lots of discussion about handling non-powers of 2 sized data. (main focus)
- Multi-phase I/O to further aggregate into large blocks
- Results:
- PNetCDF - 6 GB/sec
- Fortran I/O 7 GB/sec
- PIDX I/O 20 GB/sec
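
The locality idea behind the Z-order step, as a plain 2-D Morton index (my own illustration; PIDX's actual HZ encoding adds a resolution hierarchy on top of this):

#include <stdint.h>

/* Interleave the low 16 bits of (x, y) into a 32-bit Morton (Z-order)
   index: nearby cells map to nearby indices, which is the locality
   property the HZ layout builds on. */
uint32_t morton2d(uint16_t x, uint16_t y)
{
    uint32_t z = 0;
    for (int i = 0; i < 16; i++) {
        z |= (uint32_t)((x >> i) & 1u) << (2 * i);
        z |= (uint32_t)((y >> i) & 1u) << (2 * i + 1);
    }
    return z;
}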
OpenMPI state of the union (BOF)
MPI State of the Union
- MPI 3.0 is out
- Book on sale!
- Open MPI is a combination of many MPI implementations
- odd minor versions (v1.odd) are the feature/experimental series
- even minor versions (v1.even) are the stable releases
- v1.6 current stable
- v1.7 goals
- MPI-3.0 compliance
- thread safety
- better resource exhaustion resilience
- Cray/Gemini support
- memory use at scale/scalability
- v1.7.0:
- better fortran bindings
- Java bindings
- improved locality controls
- improved runtime
- new collectives
- MPI-3 features
- Matched probe (see the sketch after this list)
- non-blocking collective ops
- version query
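
A sketch of the matched-probe feature (these are standard MPI-3 calls, but the function itself is my own example):

#include <stdlib.h>
#include <mpi.h>

/* Probe a message of unknown size, then receive exactly that message.
   Unlike MPI_Probe/MPI_Recv, no other thread can steal the message
   between the two calls. */
void recv_unknown_size(int source, int tag, MPI_Comm comm)
{
    MPI_Message msg;
    MPI_Status status;
    int count;

    MPI_Mprobe(source, tag, comm, &msg, &status);
    MPI_Get_count(&status, MPI_DOUBLE, &count);

    double *buf = malloc(count * sizeof *buf);
    MPI_Mrecv(buf, count, MPI_DOUBLE, &msg, &status);
    /* ... use buf ... */
    free(buf);
}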
- v1.7.1 plans
- one sided interface including shared memory windows
- MPI research
- new collectives
- use hardware features
- aggregate broadcast bandwidth
- runtime
- helpers for launch, connect, control, I/O
- critical for scalability of any paradigm
- binomial broadcast trees: log_2(N) steps (schedule sketch below)
- migrate from tree to binomial tree after startup
- faster than slurm
- CR strategies (checkpointing)
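
A sketch of the binomial-tree schedule behind the log_2(N) claim (illustrative only; it just prints who sends to whom when broadcasting from rank 0):

#include <stdio.h>

/* In round k, every rank r < 2^k that already holds the data sends to
   rank r + 2^k, so N ranks are covered in ceil(log2(N)) rounds. */
int main(void)
{
    const int N = 16;                 /* rank count, illustrative */
    int round = 0;
    for (int step = 1; step < N; step <<= 1, round++)
        for (int r = 0; r < step; r++)
            if (r + step < N)
                printf("round %d: rank %d -> rank %d\n", round, r, r + step);
    return 0;
}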
- Cisco Updates
- Ethernet based company
- Ultra-low latency Ethernet
- userspace with ibverbs usnic
- hardware offload
- back-to-back ping pong with verbs
- Cisco switch port-to-port latency 190ns
- Prototype OpenMPI plugin
- 300-400ns
- total 2.2-2.4 us
- Mo betta Fortran bindings
- use mpi
- prototypes for all MPI subroutines
- F08 bindings now have distinct types for MPI handles
- Thread safety
- tested with Intel, Absoft, Portland
- not gfortran yet, but the GNU folks are actively working on it
- hwloc hardware locality project
- C programmatic interface for NUMA topology queries
- Smallest unit of affinity is a hyper-thread
- mentioned never using hyperthreaded cores
- Affinity is complicated but necessary
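
A minimal example of the kind of topology query hwloc provides (real hwloc C API):

#include <stdio.h>
#include <hwloc.h>

/* Count cores and hardware threads (PUs) on the local machine. */
int main(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int cores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    int pus   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    printf("%d cores, %d hardware threads\n", cores, pus);

    hwloc_topology_destroy(topo);
    return 0;
}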
- VampirTrace
- ships with OpenMPI
- uses the IOFSL I/O forwarding scalability layer (iofsl.org)
- 200,000 procs, 4.2TB trace data, 1T events
- supports MPI + CUDA
- uses CUPTI tool interface
- supports NVIDIA CARMA devices
- Clang compiler extensions
- MOSIX support
Questions:
- GPU support?
- Collaboration with OpenMP community?
- no, but there is some research
- Broken: OpenMP<=>OpenMPI communities overlap, yet little cooperation
Simulations
Simulating the Universe at Extreme Scale
Salman Habib Argonne National Laboratory
- HACC (Hardware/Hybrid Accelerated Cosmology Code) framework
- LSST data deluge
- Cosmology = Physics + statistics
- Can the observable universe fit into a computer?
- 4225 Mpc
- solving the Vlasov-Poisson equation (6-D)
- Million to one resolution
- Can it be run on a one-week timescale?
- Used HACC framework
- lifespan of the code exceeds that of the hardware
- Must be able to do in-situ analysis -- can't save total state
- Split the force - multi-grained - shift into k-space
- RCB tree
- variable time-stepping
- plug-in style solvers
- OpenMP + MPI
Booth Duty 2-4
Runtime-Based Analysis and Optimization
Code Generation for Parallel Execution of a Class of Irregular Loops
Designing a Unified Programming Model for Heterogeneous Machines
Nvidia person
- Programming languages must evolve
- Motivated not by memory, cores, etc., but rather energy efficiency
- How might C++ evolve to address hardware trajectory?
- Simple notation for both CPU/GPU/HW
- explicit constructs for parallelization and locality
- provide a substrate for abstractions
- extend main-stream language (ex. OpenMP)
- implement as library (ex. TBB)
- Make processor heterogeneity explicit using the type system
- task hierarchy of threads
- Examples (Phalanx)
- uses GasNet, OpenMP, and Cuda underneath the hood
- 'place' [where computation should be performed] object
- hierarchy of computing resources
- allocations, task management calls include 'place' object
- default action up to runtime
- or can be made explicit by programmer
- task_entry(thread self)
- Async task graphs
- can include large number of threads
- SIMD
- Memory model
- pointer containers
- restricted by type on what/where it can point to (ex. GPU internal memory)
OpenMP Haters Club (BOF) [AKA: Is OpenMP the best we can hope for?]
5 minute talks
Cilk and OpenMP
- Bradley Kuszmaul:
- parallelism not concurrency (tasks)
- serial semantics, but execute in parallel
- cilk overview
- designed for recursion
- divide and conquer loops
- OpenMP falls short
- too many schedules
- too many knobs
OpenMP
- Michael Wolfe, Portland Group
- uniformly accessible memory
- persistent execution threads
- NUMA
- heterogeneous cores
- HW thread creation
- Language Alternatives
- C++ Parallelism
- Fortran do parallel
- cilk extensions to C
- Move to Embedded market?
- Once languages adopt [cilk], OpenMP is done! (and that's OK)
Robert Geva, Intel
- Programmer personalities
- should default be fastest code, or repeatable??
- cilk much smaller than OpenMP
- cilk likes recursion
- some algorithms work better in recursive forms
- What about vectorization constructs?
OpenMP runtime architect
- Jim Cownie Intel
- features added as required (OpenMP4)
- adoption into the base languages could sink the alternatives
Panel: Cilk vs. TBB
- compiler vs. library implementation
- cilk work stealing vs dedicated threads
- parallelize outer loops, vectorize inner-loops
- openmp introduces simd constructs
- cilk fulfills outer loops
- keywords proposed to C/C++ standards groups
Panel seems to favor the Cilk model but doesn't have good answers on how to get full performance. Cilk gets you 80% of the way there; OpenMP, with more knobs, may get the last 20%.
Thursday:
Invited DOE talk
- Discussed next revolution in HPC: exa-FLOP computing
- Sounded like it would cost exa-dollars
Invited GPU History talk
- Interesting but nothing new
Cosmology Applications:
Optimizing the Computation of N-pt Correlation Functions
- Sloan SDSS: 15 TB
- LSST 10TB/day
- SKA dedicated exascale machine
- comparing sims vs obs
- statistics: NPCF (fundamental spatial statistic)
- O(N^3) work, repeated for each scale
- notion of error
- repeat many times
- Main result: speed up 3-point correlation function
- Algorithms:
- npcf
- 3-point correlation function: triangles of distances between points
- for dist 'r' compute number of pairs near dist r
- counting tuples of distance constraints
- Faster
- Fourier approximations are used by others
- tree-based methods
- Ntropy code
- Fast 2-pt cf (paper this year SC12)
- Direct computation
- key idea: bound by pruning
- quadtrees
- trees vs direct computation
- bitwise interpretation
- compare and store in bitfield 0=no count 1=count
- combine bitfields with logical operations
- result is population count of combined bitfields
- Optimizations
- merge base cases
- floating point phase
- Logical operation phase
- 64- and 128-bit instructions (architectures have population-count instructions)
- popcnt, SSE popcnt
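
My reconstruction of the bitfield trick in C (the 64-point batch size and data layout are assumptions):

#include <stdint.h>

/* Floating-point phase: for one anchor point p and a batch of 64
   candidates q[0..63], set bit k when q[k] falls inside the distance
   bin [rmin2, rmax2) (squared radii). */
uint64_t within_bin(const double p[3], const double q[][3],
                    double rmin2, double rmax2)
{
    uint64_t bits = 0;
    for (int k = 0; k < 64; k++) {
        double dx = p[0] - q[k][0];
        double dy = p[1] - q[k][1];
        double dz = p[2] - q[k][2];
        double d2 = dx*dx + dy*dy + dz*dz;
        if (d2 >= rmin2 && d2 < rmax2)
            bits |= 1ULL << k;
    }
    return bits;
}

/* Logical phase: AND the bitfields of two distance constraints; the
   population count (GCC builtin, maps to the popcnt instruction) is the
   number of candidates satisfying both. */
int count_both(uint64_t a, uint64_t b)
{
    return __builtin_popcountll(a & b);
}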
- Results
- order of magnitude speedup
- 40,000 vs 1500 hours on 10x more data
- MPI implementation + ?
- Questions:
- Trading space or complexity?
- SIMD part not automatic
Adaptive Mesh Refinement Trees [Hierarchical Task Mapping of Cell-Based AMR Trees]
- Parallel cosmology simulation tool
- simulate evolution of universe
- Refines on a cell-by-cell basis
- space filling curve based approach
- (looks like recursive quadtrees)
- Existing work
- Hierarchical approach, objective reduce communications
- map tasks to topology
- internode mapping minimize communications by partitioning
- iterate & recurse task graph onto node/cores
- partitioning example didn't make sense, increased from 40 msgs to 49!
- but perhaps these occur in parallel... 25/24
- compared intra vs inter CPU socket communications
Global Arrays/PGAS BOF
Global Arrays is the leading PGAS model
Overview:
- scalable
- actively supported
- Why GA/PGAS
- decouple data and computations
- segregation allows computation to shift around to balance
- Message Passing
- regular comm patterns
- algorithms have a high degree of synchronization
- PGAS
- irregular communications patterns
- algorithms async
- data consistency explicitly managed
- Benefits of GA
- high level of abstraction
- interops with other libs like MPI
- predictable performance with explicit data movement
- applications ....
- Note: (num of attendees < 10)
- Global trees, sparse data structures
- potential for arbitrary distributed data structures
- GA model is very simple; core functions ~ 11 operations
- went through GA examples vs. MPI examples (minimal sketch below)
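
A minimal GA one-sided put/get in C (real GA calls; a production code would also initialize GA's MA memory allocator):

#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();

    int dims[1] = {1000};
    int chunk[1] = {-1};                    /* let GA pick the blocking */
    int g_a = NGA_Create(C_DBL, 1, dims, "x", chunk);

    if (GA_Nodeid() == 0) {                 /* rank 0 writes one element */
        int lo = 42, hi = 42, ld = 1;
        double v = 3.14;
        NGA_Put(g_a, &lo, &hi, &v, &ld);
    }
    GA_Sync();                              /* make the update visible */

    int lo = 42, hi = 42, ld = 1;
    double v;
    NGA_Get(g_a, &lo, &hi, &v, &ld);        /* any rank reads it back */

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}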
- Current Status
- 5.1.1 current release
- Enhanced support for MPI
- Python numpy support
- Global pointers for arbitrary data types
- a bit C-like & primitive
Memory Systems:
Hardware-Software Protocols for Cache Consistency
- Cache hierarchy
- scalability
- alternatives
- Hybrid memory systems
- power
- programmability
- Hybrid
- local mem alongside cache hierarchy
- programmer writes conventional code
- compiler applies tiling transforms
- regular vs. irregular access patterns
- Three phase
- control
- map global memory to local memory via DMA transfers
- compute iterations
- sync
- work on localized data
- local memory not coherent with global memories
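
A generic sketch of that three-phase pattern (dma_get/dma_put/dma_wait are hypothetical stand-ins for the platform's actual DMA primitives):

#define TILE 256

/* Hypothetical DMA primitives standing in for the real local-memory API: */
void dma_get(void *dst, const void *src, unsigned long nbytes);
void dma_put(void *dst, const void *src, unsigned long nbytes);
void dma_wait(void);

void process(double *global, int n, double local[TILE])
{
    for (int t = 0; t < n / TILE; t++) {
        /* control: map a tile of global memory into local memory */
        dma_get(local, &global[t * TILE], TILE * sizeof(double));
        dma_wait();

        /* compute: iterate over the localized copy only */
        for (int i = 0; i < TILE; i++)
            local[i] = 2.0 * local[i];

        /* sync: local memory is not coherent, so copy-out is explicit */
        dma_put(&global[t * TILE], local, TILE * sizeof(double));
        dma_wait();
    }
}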
Their work:
- idea
- avoid duplication and unnecessary copying of data
- ensure copies are identical
- ensure valid copy of the data is always accessed
- Compiler support
- Phases
- classification of memory references
- irregular access
- guarded accesses, double-store modifications
- only regular accesses handled
- code generation
- Hardware design
- keep track of the local memory contents
- system memory base address translation
- address generation
- base/offset System mem/local mem
- Example ...
- Overhead
- more efficient resource access (seems like read-mostly would be faster, but write-heavy would incur significant double-store overhead)
- more energy efficient?
HW transactional memory
- Sequoia at LLNL
- on manycore systems, multiple threads lead to race conditions and: "wrong answer"
- avoid locks with clever algorithms
- BG/Q offers HW transactional memory
- What apps favor TM?
- TM
- optimistically just update, roll-back on conflict
- #pragma omp tm_atomic
- compiler support on BG/Q
- Tests/results
- parallelization with OpenMP or hybrid OpenMP/MPI
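
A hedged sketch of the transactional region (bucket() is hypothetical; IBM's XL documentation spells the pragma "#pragma tm_atomic", while the slides showed "omp tm_atomic"):

int bucket(double x);                  /* hypothetical binning function */

void histogram(const double *p, const double *w, double *hist, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        int b = bucket(p[i]);
        #pragma tm_atomic
        {
            hist[b] += w[i];           /* optimistic update; conflicting
                                          transactions roll back and retry */
        }
    }
}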
--
JoeBrandt - 2012-12-10