Supercomputing 2012 Summary:

Papers can be found at:
Tutorials are at:


Tutorials on:


  • Multicore tutorial - likwid, numactl (full day)


Exhibitor Forum on Advanced Architectures:

PCIe as a data center fabric (vendor forum, PLX)

  • fewer component parts (except retimer)
  • PCIe looks the same for local or across room (software-wise)
  • higher end storage going to PCIe interfaces
  • "Convergence is the key"
    • instead of duplicating functions on every node
    • aggregate the storage of all hosts, aggregate hosts, etc.
  • Enhancements to PCIe to make it a fabric
  • SR-IOV established standard
  • Provide easy migration paths (upgrade nodes independently)
  • PCIe card 'Express NIC'
  • top of rack PCI express switch
    • 1U box
    • PLX PEX 8796
    • software drivers to emulate features
    • PCIe to PCIe direct connect or fabric via switch/cards
  • BIOS support (will it understand so many devices?)

Paving the road to exascale computing (vendor forum, Mellanox)

Todd Wilde

  • switch bridges InfiniBand and Ethernet (ConnectX-3, SwitchX-2, Connect-IB)
  • next tech -> 100GBit/sec * dual ports
  • off-loads protocol processing to adapter hw
  • connection/connection-less
  • virtualization ready
  • RH6.4 via kvm
  • MXM & FCA software integrated into MPI
    • HW accelerated scatter gather
    • message aggregation
  • Mellanox to GPU direct transfers
    • Mellanox VPI
    • March 2013 release
  • Analysis tools for network analysis

Affordable Shared Memory for Big Data (vendor forum, NUMASCALE)

  • Supermicro is the hardware partner
  • intro into ccNUMA
  • turn collections of machines into one big machine
    • global shared memory
    • NumaConnect connection fabric
    • NumaChip & remote cache
  • 4-6 SERDES links
  • remote cache (2-4GB)
  • L4 cache
  • operates on 64byte cache line size
  • L4 cache ~300ns
  • Remote memory ~1000ns
  • Illustrated torus connection network
  • Applications store directly to remote memory
  • 72 nodes, 1728 cores, 4.6TB RAM
  • Most MPI msgs < 128 bytes
  • Supports 256TB of main memory
  • Kernel support in Linux
  • Opteron only
  • Not a PCI card; plugs into a HyperTransport socket on some motherboards

Python in HPC (BOF)

Lightning Talks:

Cloud hosted environment

Peter Wang:
  • IP[y] Notebook
  • place environment at data
  • ipython notebook
  • Anaconda
  • browser interface
  • matplotlib? Custom, but becoming open source
  • product webpage


Travis Oliphant
  • Continuum Analytics
  • CPython target
  • compiles
  • decorators autojit jit
  • big speedups over python
  • not necessary to write c anymore


Kenneth Hoste
  • build/install procedures (if exist) are time consuming
  • framework for building software
  • sounds like Yocto
  • written in python
  • various compiler support
  • on github
  • build is reproducible
  • automatic dependency resolution
  • easy_install --user easybuild ...


Sean Ross-Ross
  • Python compiler 'seamless'
  • listed several compiled python systems
  • templates?
  • jit decorator compiles into asm
  • Call compiled Python from C code

FEniCS and PyOP2

Andy Terrel
  • finite element codes
  • FFC compiler
  • DOLFIN (domain specific lang)
  • solver
  • PyOP2 - Parallel unstructured mesh framework
  • integrates with FEniCS
  • parallel loops for iteration over sets

Signal processing library VSIPL

Randall Judd
  • OMG standard
  • SWIGed version
  • interested in help to get info in/out of his library

Open Discussion:

  • Getting users off Matlab
  • They like GUIs
  • iPython notebook
    • need our content!

GPU Programming Models:

Early Evaluation of Directive based GPU Programming Models:

Oak Ridge/Georgia Institute of Technology

  • Motivation for GPUs: power budgets
  • CUDA, OpenMPC, OpenACC, PGI Accelerator
    • each has different levels of abstractions
  • Provide high level of abstraction
  • Allow directives to specify code location (GPU/CPU)
    • specify memory copies
    • kernel loop style 'gang'
  • R-stream example
    • only single directive (placement CPU/GPU)
  • Implicit vs. explicit features (copy, dealloc, GPU specific opts)
  • If the code doesn't fit compiler model ...
  • Tested various models (5)
  • 2 kernels, 3 NAS parallel benchmarks
  • OpenMPC 100%
  • R-stream 37.9% (program coverage)
  • Case study Jacobi example
    • memory access patterns of various parallelization efforts
  • NAS parallel benchmark FT (old code)
    • lots of function calls
    • not trivial to port
  • OpenACC is 1st step toward single GPU standard
    • interpretation of nested work-sharing loops (issue)
  • Scalability
    • limited number of GPUs per node
    • applicable only at small scale
  • Debuggability
    • Too much abstraction - little idea how translation works

Automated Generation of Software Pipelines for Heterogeneous Parallel Systems

Jacques Pienaar Purdue University // NEC labs

  • Data, Task and pipeline parallelism
  • expressing pipeline parallelism is hard
  • System generates pipeline from annotated C++ codes
  • Identify stages, schedule tasks onto processing units
  • Balance of parallel pipelines (constrained by slowest)
  • Partition to balance pipelines
    • iteratively partition and schedule
  • Automatic Heterogeneous Pipelines (AHP)
  • source to source compiler
    • annotated C++ code
      • 4 compiler directives:
        #pragma ahp pipeline
        #pragma ahp task in(...) out(...)
        #pragma ahp functions { }
        #pragma ahp variant target(MIC) /* (like ifdef for arch) */

  • tasks like gates, dependencies like wires (HW circuit analogies)
  • iterative partition, schedule
  • Uses TBB pipeline framework

Accelerating Map-reduce on CPU-GPU combination

Ohio State
  • trends toward integrated CPU/GPU. Examples:
    • AMD fusion
    • Intel Ivy Bridge
  • Scheduling
    • map-dividing scheme
    • pipelining scheme
  • memory hierarchy of GPU-CPU integrated
    • private
    • shared
    • host (shared with CPU)
  • Map-reduce
    • map generates key,value pairs
    • reduce merges the resulting values associated with key
  • Memory overhead
    • map, shuffle, reduce
    • their implementation omits shuffle stage
  • Reduction object based on hash table
  • Reduction based on continuous reduction
  • Scheduling
    • map-dividing
      • split CPU/GPU
      • static scheduling not sufficient
    • use one core as scheduler, rest of cores and GPU are workers
  • pipelining scheme
    • dynamic load balancing
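
The reduction-object idea from the map-reduce notes above can be sketched in a few lines of Python. This is only an illustration under assumptions: the word-count workload and the names `map_reduce_with_reduction_object`, `emit` and `word_count_map` are hypothetical and not taken from the Ohio State implementation. Each (key, value) pair is folded into a hash table as soon as the map step emits it, so no intermediate pair list is stored and no shuffle stage is needed.

```python
from collections import defaultdict


def map_reduce_with_reduction_object(records, map_fn, reduce_fn, initial):
    """Fold every emitted (key, value) pair straight into a hash table."""
    reduction_object = defaultdict(lambda: initial)

    def emit(key, value):
        # Continuous reduction: combine immediately instead of storing the pair.
        reduction_object[key] = reduce_fn(reduction_object[key], value)

    for record in records:
        map_fn(record, emit)
    return dict(reduction_object)


def word_count_map(line, emit):
    # Hypothetical map function: one (word, 1) pair per word.
    for word in line.split():
        emit(word, 1)


counts = map_reduce_with_reduction_object(
    ["gpu cpu gpu", "cpu gpu"], word_count_map, lambda a, b: a + b, 0
)
print(counts)
```

In a CPU/GPU split, each worker would presumably keep its own reduction object and the partial tables would be merged at the end; that merge step is not shown here.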

Exhibitor Forum:

How Memory and SSDs can Optimize Data Center Operations

  • In-depth description of SSD, how they operate and challenges

End of Latency: A new Storage Architecture

  • Improving latencies on storage arrays using SSD caches

Expanding the role of Solid State in HPC

  • Buy lots of our drives ...

State of OpenMP//OpenMP-4.0 (BOF)

OpenMP Future
  • 15th birthday of openmp
  • 1997 "new std to govern PC's with multiple chips"

  • OpenMP mission
  • OpenACC is a spinoff from 4 members; possible re-integration into OpenMP is planned
  • Tech report 1
  • Release candidate 1 for openmp-4.0

Tech presentation from lang chair
  • openmp 3.1 July 2011
  • openmp 4.0 nearing completion
    • rc1 comment draft
    • rc2 feb 2013
    • early June spec release
  • feedback from non-members is welcome
  • OpenMP 5.0 comment draft @SC14
  • Openmp4:
    • ticket process
    • SIMD directives
    • extended support for affinity
    • additional support for fortran 2003
    • user-defined reductions

(Presenter promised to post slides to OpenMP website)

  • #pragma omp simd [clause ]
  • (chunk following loop in SIMD chunks)
  • clauses:
    • safelen(length) -- limits
    • linear -- list vars
    • aligned
    • private, lastprivate, reduction, collapse
    • firstprivate? (couldn't find a use case)
  • What happens if loop contains func calls?
    • #pragma omp declare simd -- decorates function
  • both parallelize and vectorize simultaneously
  • #pragma omp parallel for simd
  • Other OpenMP 4 support
    • OMP_PLACES to specify threads, cores, sockets
    • proc_bind(master | close | spread)
      • ignored if OMP_PROC_BIND is false
    • omp_get_proc_bind()
    • Fortran 2003 support
    • taskgroups
    • cancel tasks
    • new data environment

OpenMP accelerator model (OpenMP 4/5)
  • target directives
    • target
    • target data
    • target update
    • target mirror
    • target linkable
  • new runtime funcs:
    • omp_get_device_num()
      /* A parallel reduction on the target: */
      #pragma omp target device(acc0) map(B,C)
      #pragma omp parallel for reduction(+:sum)
      /* Compile a function for the target device: */
      #pragma omp target declare
      void func(...)
  • Synchronization capabilities are supported in OpenMP, but GPUs have weak memory sync models



A Framework for Low-communication 1-D FFTs

  • DFT - O(N^2)
  • FFT - O(N log N)
    • applies a smaller DFT a number of times
  • communication pattern of FFT
  • cost of moving data higher than FP cost
  • 3 passes of all-all communication
    • reduce to just once
  • new class of DFT factorization
    • all to all is needed even if final interest is a subset
  • come up with a time sequence rep subset of F(s)
  • modulate by window function, then make periodic by shift add
  • Modulation is convolution in t domain
  • periodicity in F is sampling in t domain
  • Approximate
    • convolution is weighted sum (windowed) truncation error
    • demodulation step in F window trunc
    • periodization stage can produce aliasing
    • but we can control these tradeoffs for speed
  • O(M log M) + O(N B) <-- more but less communication
  • Also nicely SIMD friendly
  • blocking for caches
  • Implementation: MPI/OpenMP + FFT + BLAS
  • Run on Intel HW
  • Xeon E5-2670
  • QDR IB
  • 2x vs. FFTW/Intel MKL
  • 2.4x theoretical at 10.5 digits
  • SOI is implemented in MKL 11.0.1
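
To make the O(N^2) vs. O(N log N) remark at the top of these notes concrete, here is a minimal pure-Python sketch comparing a direct DFT with a textbook radix-2 Cooley-Tukey FFT. It is not the low-communication SOI algorithm from the talk; the function names and the test signal are made up for illustration, and the FFT assumes the input length is a power of two.

```python
import cmath


def dft(x):
    """Direct DFT: O(N^2) multiply-adds."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft(x):
    """Radix-2 Cooley-Tukey FFT: O(N log N); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # smaller DFT over the even-indexed samples
    odd = fft(x[1::2])    # smaller DFT over the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


signal = [1.0, 2.0, 0.0, -1.0, 0.5, 0.0, -2.0, 1.5]
max_diff = max(abs(a - b) for a, b in zip(dft(signal), fft(signal)))
print(max_diff < 1e-9)
```

The even/odd split is where a distributed FFT needs its all-to-all exchanges, which is the communication cost the talk set out to reduce.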

Tiling Stencil Computations to Maximize Parallelism

  • examples of evaluation order of gridded data
  • dependencies between grid data
  • validity of tiling constraints:
    • tile is a piece of computation which can be executed atomically
    • block after block
  • Invalid vs valid tiling
  • blocking by hyperplanes
    • many valid ones -> cost function + validity
  • inter tile dependency
  • hyperplanes found such that normal to the face is in their cone
  • source level transformations used
    • Intra tile scanning order
    • lower dimensional concurrent start (N-1)
  • comparisons of various partitioning schemes

Efficient Data Restructuring and Aggregation for I/O acceleration in PIDX

  • ViSUS vs. PIDX (simulation)
  • IDX multi-res uses cache oblivious layout
  • PIDX - I/O library to write data
  • Data streaming for processing and data visualization
  • case study: S3D on 32,000 cores

  • Data layout hierarchy of HZ curves (data layout)
    • Z-order blocking of data to assure locality in 2-D
    • multi-level hierarchy
    • Parallel implementation so that data is written in parallel
    • No interleaving
    • writing small blocks inefficient, so aggregate into larger blocks
    • HZ encoding based on powers of 2, For non-powers of 2, indexes are skipped
      • Lots of discussion about handling non-powers of 2 sized data. (main focus)
  • Multi-phase I/O to further aggregate into large blocks
  • Results:
    • PNetCDF - 6 GB/sec
    • Fortran I/O 7 GB/sec
    • PIDX I/O 20 GB/sec
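
The Z-order blocking mentioned in the data layout notes above can be illustrated with a small Python sketch. It shows plain 2-D Z-order (Morton) indexing only, not the full hierarchical HZ encoding used by IDX/PIDX; the function name `z_order_index` and the grid sizes are assumptions chosen for the example. The second half shows why a non-power-of-2 grid leaves gaps in the index space.

```python
def z_order_index(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd)."""
    index = 0
    for i in range(bits):
        index |= ((x >> i) & 1) << (2 * i)
        index |= ((y >> i) & 1) << (2 * i + 1)
    return index


# Walk a 4x4 grid in Z order: neighbours in 2-D stay close in the 1-D order.
cells = sorted(
    ((x, y) for y in range(4) for x in range(4)),
    key=lambda c: z_order_index(*c),
)
print(cells[:4])

# A 3x3 grid is not a power of two, so some indexes are never used.
used = sorted(z_order_index(x, y) for y in range(3) for x in range(3))
skipped = [i for i in range(used[-1] + 1) if i not in used]
print(skipped)
```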

OpenMPI state of the union (BOF)

MPI State of the Union

  • MPI 3.0 is out
  • Book on sale!
  • Open MPI is a combination of many MPI implementations
  • odd-numbered versions are the feature/experimental series v1.
  • even-numbered versions are the stable releases v1.
  • v1.6 current stable
  • v1.7 goals
    • MPI-3.0 compliance
    • thread safety
    • better resource exhaustion resilience
    • Cray/Gemini support
    • memory use at scale/scalability
  • v1.7.0:
    • better fortran bindings
    • Java bindings
    • improved locality controls
    • improved runtime
    • new collectives
  • MPI-3 features
    • Matched probe
    • non-blocking collective ops
    • version query
  • v1.7.1 plans
    • one sided interface including shared memory windows

  • MPI research
    • new collectives
      • use hardware features
      • aggregate broadcast bandwidth
    • runtime
      • helpers for launch, connect, control, I/O
      • critical for scalability of any paradigm
      • binomial broadcast trees log_2(N) steps
      • migrate from tree to binomial tree after startup
      • faster than slurm
    • CR strategies (checkpointing)
  • Cisco Updates
    • Ethernet based company
    • Ultra-low latency Ethernet
    • userspace with ibverbs usnic
    • hardware offload
    • back-to-back ping pong with verbs
    • Cisco switch port-to-port latency 190ns
    • Prototype OpenMPI plugin
      • 300-400ns
      • total 2.2-2.4 us
    • Mo betta Fortran bindings
      • use mpi
      • prototypes for all MPI subroutines
        • ... but not for gfortran
      • F08 bindings now have distinct types for MPI handles
      • Thread safety
      • tested with Intel, Absoft, Portland
        • not gfortran, but the GNU folks are actively working on it
    • hwloc hardware locality project
      • C programmatic interface for NUMA topology queries
      • Smallest unit of affinity is a hyper-thread
        • mentioned never using hyperthreaded cores
      • Affinity is complicated but necessary
    • VampirTrace
      • ships with OpenMPI
      • uses I/O scalability from Berkeley
      • 200,000 procs, 4.2TB trace data, 1T events
      • supports MPI + CUDA
      • uses CUPTI tool interface
      • supports NVIDIA CARMA devices
    • Clang compiler extensions
    • MOSIX support
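
The binomial broadcast tree mentioned under the runtime research notes above finishes in about log_2(N) steps because the set of ranks holding the message doubles every round. A minimal Python sketch of such a schedule follows; it only computes who sends to whom, does no real communication, and the function name is hypothetical rather than anything from Open MPI.

```python
def binomial_broadcast_schedule(num_procs):
    """Return a list of rounds; each round is a list of (sender, receiver)."""
    rounds = []
    have_data = 1  # ranks 0 .. have_data-1 already hold the message
    while have_data < num_procs:
        sends = [
            (rank, rank + have_data)
            for rank in range(have_data)
            if rank + have_data < num_procs
        ]
        rounds.append(sends)
        have_data *= 2  # every holder forwarded once, so holders double
    return rounds


schedule = binomial_broadcast_schedule(8)
for step, sends in enumerate(schedule, start=1):
    print(step, sends)
```

For 8 ranks this gives 3 rounds, and for 1000 ranks 10 rounds, compared with 999 sequential sends from a single root.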

  • GPU support?
  • Collaboration with OpenMP community?
    • no, but there is some research
    • Broken: OpenMP<=>OpenMPI communities overlap, yet little cooperation


Simulating the Universe at Extreme Scale

Salman Habib Argonne National Laboratory

  • HACC (Hardware/Hybrid Accelerated Cosmology Code) framework
  • LSST data deluge
  • Cosmology = Physics + statistics
  • Can the observable universe fit into a computer?
  • 4225 Mpc
  • solving the Vlasov-Poisson equation (6D)
  • Million to one resolution
  • Can it be run in a week timescale?
  • Used HACC framework
    • 13.94 PFlops
  • lifetime of the code exceeds that of the hardware
  • Must be able to do in-situ analysis -- can't save total state
  • Split the force - multi-grained - shift into k-space
  • RCB tree
  • variable time-stepping
  • plug-in style solvers
  • OpenMP + MPI

Booth Duty 2-4

Runtime-Based Analysis and Optimization

Code Generation for Parallel Execution of a Class of Irregular Loops

  • Interesting paper

Designing a Unified Programming Model for Heterogeneous Machines

Nvidia person

  • Programming languages must evolve
    • Motivated not by memory, cores etc., but rather by energy efficiency
  • How might C++ evolve to address hardware trajectory?
  • Simple notation for both CPU/GPU/HW
    • explicit constructs for parallelization and locality
    • provide a substrate for abstractions
    • extend main-stream language (ex. OpenMP)
    • implement as library (ex. TBB)
    • Make processor heterogeneity explicit using the type system
    • task hierarchy of threads
  • Examples (Phalanx)
    • uses GASNet, OpenMP, and CUDA under the hood
    • 'place' [where computation should be performed] object
    • hierarchy of computing resources
    • allocations, task management calls include 'place' object
      • default action up to runtime
      • or can be made explicit by programmer
    • task_entry(thread self)
  • Async task graphs
    • can include large number of threads
    • SIMD
  • Memory model
    • pointer containers
    • restricted by type on what/where it can point to (ex. GPU internal memory)

OpenMP Haters Club (BOF) [AKA: Is OpenMP the best we can hope for?]

5 minute talks
Cilk and OpenMP
  • Bradley Kuszmaul:
  • parallelism not concurrency (tasks)
  • serial semantics, but execute in parallel
  • cilk overview
    • designed for recursion
    • divide and conquer loops
  • OpenMP falls short
  • too many schedules
  • too many knobs

  • Michael Wolfe, Portland Group
  • uniformly accessible memory
  • persistent execution threads
  • NUMA
  • heterogeneous cores
    • fat and thin cores
  • HW thread creation
  • Language Alternatives
    • C++ Parallelism
    • Fortran do parallel
    • cilk extensions to C
  • Move to Embedded market?
  • Once languages adopt [cilk], OpenMP is done! (and that's ok)

Robert Geva, Intel
  • Programmer personalities
  • should default be fastest code, or repeatable??
  • cilk much smaller than OpenMP
  • cilk likes recursion
  • some algorithms work better in recursive forms
    • tree walk example
  • What about vectorization constructs?

OpenMP runtime architect
  • Jim Cownie Intel
  • features added as required (OpenMP4)
  • language adoption could sink alternates.

Panel: Cilk vs. TBB
  • compiler vs. library implementation
  • cilk work stealing vs dedicated threads
  • parallelize outer loops, vectorize inner-loops
    • openmp introduces simd constructs
    • cilk fulfills outer loops
    • keywords proposed to C/C++ standards groups

Panel seems to favor cilk model, but doesn't have good answers to how to get full performance. However, cilk gets you 80% there, OpenMP with more knobs may get last 20%.


Invited DOE talk

  • Discussed next revolution in HPC: exa-FLOP computing
  • Sounded like it would cost exa-dollars

Invited GPU History talk

  • Interesting but nothing new

Cosmology Applications:

Optimizing the Computation of N-pt Correlation Functions

  • SDSS (Sloan Digital Sky Survey) 15TB
  • LSST 10TB/day
  • SKA dedicated exascale machine
  • comparing sims vs obs
  • statistics NPCF (fun. spatial stat.)
  • O(N^3) work, repeated for each scale
    • notion of error
    • repeat many times
  • Main result: speed up 3-point correlation function
  • Algorithms:
    • npcf
    • 3 point correlation function triangle of dist pts
    • for dist 'r' compute number of pairs near dist r
    • counting tuples of distance constraints
  • Faster
    • Fourier approximations are used by others
    • tree-based methods
      • Ntropy code
      • Fast 2-pt cf (paper this year SC12)
  • Direct computation
    • key idea: bound by pruning
    • quadtrees
    • trees vs direct computation
    • bitwise interpretation
    • compare and store in bitfield 0=no count 1=count
    • combine bitfields with logical operations
    • result is population count of combined bitfields
  • Optimizations
    • merge base cases
    • floating point phase
      • SIMD compute distances
  • Logical operation phase
    • 64 and 128 bit instructions (archs have count bits in field instructions)
    • popcnt, sse popcnt
  • Results
    • order of magnitude speedup
      • 40,000 vs 1500 hours on 10x more data
    • MPI implementation + ?
  • Questions:
    • Trading space or complexity?
    • SIMD part not automatic
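
The bitfield and popcount steps from the direct-computation notes above can be sketched in Python for the simpler 2-point case. This is an assumption-laden illustration, not the paper's code: the function names, the sample points and the choice of a distance shell [r_lo, r_hi) are hypothetical, and Python integers stand in for the 64/128-bit registers and hardware popcnt instructions the talk described.

```python
import math


def in_range_mask(point, others, predicate):
    """Bit j is 1 when predicate(distance(point, others[j])) holds."""
    mask = 0
    for j, other in enumerate(others):
        if predicate(math.dist(point, other)):
            mask |= 1 << j
    return mask


def count_pairs_in_shell(points_a, points_b, r_lo, r_hi):
    """Count (a, b) pairs with r_lo <= distance < r_hi via AND + popcount."""
    total = 0
    for a in points_a:
        above_lo = in_range_mask(a, points_b, lambda d: d >= r_lo)
        below_hi = in_range_mask(a, points_b, lambda d: d < r_hi)
        combined = above_lo & below_hi      # combine bitfields logically
        total += bin(combined).count("1")   # population count
    return total


points_a = [(0.0, 0.0), (1.0, 1.0)]
points_b = [(1.0, 0.0), (0.0, 2.0), (3.0, 4.0), (0.5, 0.0)]
print(count_pairs_in_shell(points_a, points_b, 1.0, 3.0))
```

The comparison phase is the floating-point part that benefits from SIMD; the AND and popcount phase is pure integer work.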

Adaptive Mesh Refinement Trees [Hierarchical Task Mapping of Cell-Based AMR Trees]

  • Parallel cosmology simulation tool
    • simulate evolution of universe
  • Refine on a cell-by-cell basis
    • space filling curve based approach
    • (looks like recursive quadtrees)
  • Existing work
  • Hierarchical approach, objective reduce communications
    • map tasks to topology
    • internode mapping minimize communications by partitioning
    • iterate & recurse task graph onto node/cores
  • partitioning example didn't make sense, increased from 40 msgs to 49!
    • but perhaps these occur in parallel... 25/24
    • compared intra vs inter CPU socket communications

Global Arrays/PGAS BOF

Global Arrays is leading model on PGAS

  • scalable
  • actively supported
  • Why GA/PGAS
    • decouple data and computations
    • segregation allows computation to shift around to balance
      • also adds resilience
  • Message Passing
    • regular comm patterns
    • algorithms have high degree of synch
  • PGAS
    • irregular communications patterns
    • algorithms async
      • can be used for sync
    • data consistency explicitly managed
  • Benefits of GA
    • HL of abstraction
    • interops with other libs like MPI
    • predictable performance with explicit data movement
  • applications ....
  • Note: (num of attendees < 10)
  • Global trees, sparse data structures
  • potential for arbitrary distributed data structures
  • GA model very simple, core functions ~ 11 operations
  • went through GA examples vs. MPI examples
    • both are explicit
  • Current Status
    • 5.1.1 current release
    • Enhanced support for MPI
    • Python numpy support
    • Global pointers for arbitrary data types
    • a bit C-like & primitive

Memory Systems:

Hardware-Software Protocols for Cache Consistency

  • Cache hierarchy
    • scalability
    • alternatives
    • Hybrid memory systems
    • power
    • programmability
  • Hybrid
    • local mem alongside cache hierarchy
    • programmer writes conventional code
    • compiler applies tiling transforms
    • regular vs. irregular access patterns
  • Three phase
    • control
      • map global memory to local memory via DMA transfers
      • compute iterations
      • sync
    • work on localized data
    • local memory not coherent with global memories

Their work:
  • idea
    • avoid duplication and unnecessary copying of data
    • ensure copies are identical
    • ensure valid copy of the data is always accessed
  • Compiler support
    • Phases
      • classification of memory references
        • regular access
        • strided
      • irregular access
        • guarded accesses, double store modifications
    • only regular accesses handled
    • code generation
  • Hardware design
    • keep track of the local memory contents
    • system memory base address translation
    • address generation
      • base/offset System mem/local mem
  • Example ...
  • Overhead
    • more efficient resource access (read-mostly workloads seem likely to be faster), but write-heavy workloads would incur significant double-store overhead
    • more energy efficient?

HW transactional memory

  • Sequoia at LLNL
  • on manycore systems multiple threads lead to race conditions and: "wrong answer"
  • avoid locks with clever algorithms
  • BG/Q offers HW transactional memory
  • What apps favor TM?
  • TM
    • optimistically just update, roll-back on conflict
    • #pragma omp tm_atomic
  • compiler support on BG/Q
  • Tests/results
    • parallelization with OpenMP or hybrid OpenMP/MPI
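
The "optimistically just update, roll-back on conflict" behaviour noted above can be mimicked in software with a version counter. The Python sketch below is a single-threaded emulation under stated assumptions: `VersionedCell`, `atomic_update` and the `interfere` hook are hypothetical names, the competing writer is simulated rather than run on another thread, and none of this reflects how the BG/Q hardware or its compiler implement transactions.

```python
class VersionedCell:
    """A value plus a version counter that changes on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def try_commit(self, expected_version, new_value):
        # Commit only if nobody else committed since we read the cell.
        if self.version != expected_version:
            return False  # conflict: caller discards its work and retries
        self.value = new_value
        self.version += 1
        return True


def atomic_update(cell, update_fn, interfere=None):
    """Run update_fn optimistically; retry on conflict. Returns retry count."""
    retries = 0
    while True:
        seen_version, seen_value = cell.version, cell.value
        new_value = update_fn(seen_value)   # speculative work
        if interfere is not None:
            interfere()                     # simulate a competing writer
            interfere = None                # only interfere once
        if cell.try_commit(seen_version, new_value):
            return retries
        retries += 1                        # roll back: drop new_value


counter = VersionedCell(10)
retries = atomic_update(
    counter,
    lambda v: v + 1,
    interfere=lambda: counter.try_commit(counter.version, counter.value + 100),
)
print(counter.value, retries)
```

The first attempt is rolled back because the simulated writer commits first; the retry then sees the new value and commits on top of it, so no update is lost and no lock is held.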

-- JoeBrandt - 2012-12-10
Topic revision: r5 - 2012-12-12, JoeBrandt