Intel Xeon Phi processor high performance programming / by Jim Jeffers, James Reinders, Avinash Sodani.

Bibliographic Details
Online Access: Full Text (via ScienceDirect)
Main Authors: Jeffers, Jim (Computer engineer) (Author), Reinders, James (Author), Sodani, Avinash (Author)
Format: eBook
Language: English
Published: Cambridge, MA : Morgan Kaufmann is an imprint of Elsevier, [2016]
Edition: Knights Landing edition.
Table of Contents:
  • Machine generated contents note: ch. 1 Introduction
  • Introduction to Many-Core Programming
  • Trend: More Parallelism
  • Why Intel® Xeon Phi™ Processors Are Needed
  • Processors Versus Coprocessor
  • Measuring Readiness for Highly Parallel Execution
  • What About GPUs?
  • Enjoy the Lack of Porting Needed but Still Tune!
  • Transformation for Performance
  • Hyper-Threading Versus Multithreading
  • Programming Models
  • Why We Could Skip To Section II Now
  • For More Information
  • ch. 2 Knights Landing Overview
  • Overview
  • Instruction Set
  • Architecture Overview
  • Motivation: Our Vision and Purpose
  • Summary
  • For More Information
  • ch. 3 Programming MCDRAM and Cluster Modes
  • Programming for Cluster Modes
  • Programming for Memory Modes
  • Query Memory Mode and MCDRAM Available
  • SNC Performance Implications of Allocation and Threading
  • How to Not Hard Code the NUMA Node Numbers
  • Approaches to Determining What to Put in MCDRAM
  • Why Rebooting Is Required to Change Modes
  • BIOS
  • Summary
  • For More Information
  • ch. 4 Knights Landing Architecture
  • Tile Architecture
  • Cluster Modes
  • Memory Interleaving
  • Memory Modes
  • Interactions of Cluster and Memory Modes
  • Summary
  • For More Information
  • ch. 5 Intel Omni-Path Fabric
  • Overview
  • Performance and Scalability
  • Transport Layer APIs
  • Quality of Service
  • Virtual Fabrics
  • Unicast Address Resolution
  • Multicast Address Resolution
  • Summary
  • For More Information
  • ch. 6 µarch Optimization Advice
  • Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
  • Memory Subsystem
  • µarch Nuances (Tile)
  • Direct Mapped MCDRAM Cache
  • Advice: Use AVX-512
  • Summary
  • For More Information
  • ch. 7 Programming Overview for Knights Landing
  • To Refactor, or Not to Refactor, That Is the Question
  • Evolutionary Optimization of Applications
  • Revolutionary Optimization of Applications
  • Know When to Hold'em and When to Fold'em
  • For More Information
  • ch. 8 Tasks and Threads
  • OpenMP
  • Fortran 2008
  • Intel TBB
  • hStreams
  • Summary
  • For More Information
  • ch. 9 Vectorization
  • Why Vectorize?
  • How to Vectorize
  • Three Approaches to Achieving Vectorization
  • Six-Step Vectorization Methodology
  • Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
  • Compiler Tips
  • Compiler Options
  • Compiler Directives
  • Use Array Sections to Encourage Vectorization
  • Look at What the Compiler Created: Assembly Code Inspection
  • Numerical Result Variations with Vectorization
  • Summary
  • For More Information
  • ch. 10 Vectorization Advisor
  • Getting Started with Intel Advisor for Knights Landing
  • Enabling and Improving AVX-512 Code with the Survey Report
  • Memory Access Pattern Report
  • AVX-512 Gather/Scatter Profiler
  • Mask Utilization and FLOPS Profiler
  • Advisor Roofline Report
  • Explore AVX-512 Code Characteristics Without AVX-512 Hardware
  • Example
  • Analysis of a Computational Chemistry Code
  • Summary
  • For More Information
  • ch. 11 Vectorization with SDLT
  • What Is SDLT?
  • Getting Started
  • SDLT Basics
  • Example Normalizing 3d Points with SIMD
  • What Is Wrong with AOS Memory Layout and SIMD?
  • SIMD Prefers Unit-Stride Memory Accesses
  • Alpha-Blended Overlay Reference
  • Alpha-Blended Overlay With SDLT
  • Additional Features
  • Summary
  • For More Information
  • ch. 12 Vectorization with AVX-512 Intrinsics
  • What Are Intrinsics?
  • AVX-512 Overview
  • Migrating From Knights Corner
  • AVX-512 Detection
  • Learning AVX-512 Instructions
  • Learning AVX-512 Intrinsics
  • Step-by-Step Example Using AVX-512 Intrinsics
  • Results Using Our Intrinsics Code
  • For More Information
  • ch. 13 Performance Libraries
  • Intel Performance Library Overview
  • Intel Math Kernel Library Overview
  • Intel Data Analytics Library Overview
  • Together: MKL and DAAL
  • Intel Integrated Performance Primitives Library Overview
  • Intel Performance Libraries and Intel Compilers
  • Native (Direct) Library Usage
  • Offloading to Knights Landing While Using a Library
  • Precision Choices and Variations
  • Performance Tip for Faster Dynamic Libraries
  • For More Information
  • ch. 14 Profiling and Timing
  • Introduction to Knights Landing Tuning
  • Event-Monitoring Registers
  • Efficiency Metrics
  • Potential Performance Issues
  • Intel VTune Amplifier XE Product
  • Performance Application Programming Interface
  • MPI Analysis: ITAC
  • HPCToolkit
  • Tuning and Analysis Utilities
  • Timing
  • Summary
  • For More Information
  • ch. 15 MPI
  • Internode Parallelism
  • MPI on Knights Landing
  • MPI Overview
  • How to Run MPI Applications
  • Analyzing MPI Application Runs
  • Tuning of MPI Applications
  • Heterogeneous Clusters
  • Recent Trends in MPI Coding
  • Putting it all Together
  • Summary
  • For More Information
  • ch. 16 PGAS Programming Models
  • To Share or not to Share
  • Why Use PGAS on Knights Landing?
  • Programming with PGAS
  • Performance Evaluation
  • Beyond PGAS
  • Summary
  • For More Information
  • ch. 17 Software-Defined Visualization
  • Motivation for Software-Defined Visualization
  • Software-Defined Visualization Architecture
  • OpenSWR: OpenGL Raster-Graphics Software Rendering
  • Embree: High-Performance Ray Tracing Kernel Library
  • OSPRay: Scalable Ray Tracing Framework
  • Summary
  • Image Attributions
  • For More Information
  • ch. 18 Offload to Knights Landing
  • Offload Programming Model - Using with Knights Landing
  • Processors Versus Coprocessor
  • Offload Model Considerations
  • OpenMP Target Directives
  • Concurrent Host and Target Execution
  • Offload Over Fabric
  • Summary
  • For More Information
  • ch. 19 Power Analysis
  • Power Demand Gates Exascale
  • Power 101
  • Hardware-Based Power Analysis Techniques
  • Software-Based Knights Landing Power Analyzer
  • ManyCore Platform Software Package Power Tools
  • Running Average Power Limit
  • Performance Profiling on Knights Landing
  • Intel Remote Management Module
  • Summary
  • For More Information
  • ch. 20 Optimizing Classical Molecular Dynamics in LAMMPS
  • Molecular Dynamics
  • LAMMPS
  • Knights Landing Processors
  • LAMMPS Optimizations
  • Data Alignment
  • Data Types and Layout
  • Vectorization
  • Neighbor List
  • Long-Range Electrostatics
  • MPI and OpenMP Parallelization
  • Performance Results
  • System, Build, and Run Configurations
  • Workloads
  • Organic Photovoltaic Molecules
  • Hydrocarbon Mixtures
  • Rhodopsin Protein in Solvated Lipid Bilayer
  • Coarse Grain Liquid Crystal Simulation
  • Coarse-Grain Water Simulation
  • Summary
  • Acknowledgment
  • For More Information
  • ch. 21 High Performance Seismic Simulations
  • High-Order Seismic Simulations
  • Numerical Background
  • Application Characteristics
  • Intel Architecture as Compute Engine
  • Highly-Efficient Small Matrix Kernels
  • Sparse Matrix Kernel Generation and Sparse/Dense Kernel Selection
  • Dense Matrix Kernel Generation: AVX2
  • Dense Matrix Kernel Generation: AVX-512
  • Kernel Performance Benchmarking
  • Incorporating Knights Landing's Different Memory Subsystems
  • Performance Evaluation
  • Mount Merapi
  • 1992 Landers
  • Summary and Take-Aways
  • For More Information
  • ch. 22 Weather Research and Forecasting (WRF)
  • WRF Overview
  • WRF Execution Profile: Relatively Flat
  • History of WRF on Intel Many-Core (Intel Xeon Phi Product Line)
  • Our Early Experiences with WRF on Knights Landing
  • Compiling WRF for Intel Xeon and Intel Xeon Phi Systems
  • WRF CONUS12km Benchmark Performance
  • MCDRAM Bandwidth
  • Vectorization: Boost of AVX-512 Over AVX2
  • Core Scaling
  • Summary
  • For More Information
  • ch. 23 N-Body Simulation
  • Parallel Programming for Noncomputer Scientists
  • Step-by-Step Improvements
  • N-Body Simulation
  • Optimization
  • Initial Implementation (Optimization Step 0)
  • Thread Parallelism (Optimization Step 1)
  • Scalar Performance Tuning (Optimization Step 2)
  • Vectorization with SOA (Optimization Step 3)
  • Memory Traffic (Optimization Step 4)
  • Impact of MCDRAM on Performance
  • Summary
  • For More Information
  • ch. 24 Machine Learning
  • Convolutional Neural Networks
  • OverFeat-FAST Results
  • For More Information
  • ch. 25 Trinity Workloads
  • Out of the Box Performance
  • Optimizing MiniGhost OpenMP Performance
  • Summary
  • For More Information
  • ch. 26 Quantum Chromodynamics
  • LQCD
  • The QPhiX Library and Code Generator
  • Wilson-Dslash Operator
  • Configuring the QPhiX Code Generator
  • The Experimental Setup
  • Results
  • Conclusion
  • For More Information