Intel Xeon Phi processor high performance programming / by Jim Jeffers, James Reinders, Avinash Sodani.
Online Access: | Full Text (via ScienceDirect) |
---|---|
Main Authors: | Jim Jeffers, James Reinders, Avinash Sodani |
Format: | eBook |
Language: | English |
Published: | Cambridge, MA : Morgan Kaufmann is an imprint of Elsevier, [2016] |
Edition: | Knights Landing edition. |
Table of Contents:
- Machine generated contents note: ch. 1 Introduction
- Introduction to Many-Core Programming
- Trend: More Parallelism
- Why Intel® Xeon Phi™ Processors Are Needed
- Processors Versus Coprocessor
- Measuring Readiness for Highly Parallel Execution
- What About GPUs?
- Enjoy the Lack of Porting Needed but Still Tune!
- Transformation for Performance
- Hyper-Threading Versus Multithreading
- Programming Models
- Why We Could Skip To Section II Now
- For More Information
- ch. 2 Knights Landing Overview
- Overview
- Instruction Set
- Architecture Overview
- Motivation: Our Vision and Purpose
- Summary
- For More Information
- ch. 3 Programming MCDRAM and Cluster Modes
- Programming for Cluster Modes
- Programming for Memory Modes
- Query Memory Mode and MCDRAM Available
- SNC Performance Implications of Allocation and Threading
- How to Not Hard Code the NUMA Node Numbers
- Approaches to Determining What to Put in MCDRAM
- Why Rebooting Is Required to Change Modes
- BIOS
- Summary
- For More Information
- ch. 4 Knights Landing Architecture
- Tile Architecture
- Cluster Modes
- Memory Interleaving
- Memory Modes
- Interactions of Cluster and Memory Modes
- Summary
- For More Information
- ch. 5 Intel Omni-Path Fabric
- Overview
- Performance and Scalability
- Transport Layer APIs
- Quality of Service
- Virtual Fabrics
- Unicast Address Resolution
- Multicast Address Resolution
- Summary
- For More Information
- ch. 6 μarch Optimization Advice
- Best Performance From 1, 2, or 4 Threads Per Core, Rarely 3
- Memory Subsystem
- μarch Nuances (Tile)
- Direct Mapped MCDRAM Cache
- Advice: Use AVX-512
- Summary
- For More Information
- ch. 7 Programming Overview for Knights Landing
- To Refactor, or Not to Refactor, That Is the Question
- Evolutionary Optimization of Applications
- Revolutionary Optimization of Applications
- Know When to Hold'em and When to Fold'em
- For More Information
- ch. 8 Tasks and Threads
- OpenMP
- Fortran 2008
- Intel TBB
- hStreams
- Summary
- For More Information
- ch. 9 Vectorization
- Why Vectorize?
- How to Vectorize
- Three Approaches to Achieving Vectorization
- Six-Step Vectorization Methodology
- Streaming Through Caches: Data Layout, Alignment, Prefetching, and so on
- Compiler Tips
- Compiler Options
- Compiler Directives
- Use Array Sections to Encourage Vectorization
- Look at What the Compiler Created: Assembly Code Inspection
- Numerical Result Variations with Vectorization
- Summary
- For More Information
- ch. 10 Vectorization Advisor
- Getting Started with Intel Advisor for Knights Landing
- Enabling and Improving AVX-512 Code with the Survey Report
- Memory Access Pattern Report
- AVX-512 Gather/Scatter Profiler
- Mask Utilization and FLOPS Profiler
- Advisor Roofline Report
- Explore AVX-512 Code Characteristics Without AVX-512 Hardware
- Example
- Analysis of a Computational Chemistry Code
- Summary
- For More Information
- ch. 11 Vectorization with SDLT
- What Is SDLT?
- Getting Started
- SDLT Basics
- Example Normalizing 3d Points with SIMD
- What Is Wrong with AOS Memory Layout and SIMD?
- SIMD Prefers Unit-Stride Memory Accesses
- Alpha-Blended Overlay Reference
- Alpha-Blended Overlay With SDLT
- Additional Features
- Summary
- For More Information
- ch. 12 Vectorization with AVX-512 Intrinsics
- What Are Intrinsics?
- AVX-512 Overview
- Migrating From Knights Corner
- AVX-512 Detection
- Learning AVX-512 Instructions
- Learning AVX-512 Intrinsics
- Step-by-Step Example Using AVX-512 Intrinsics
- Results Using Our Intrinsics Code
- For More Information
- ch. 13 Performance Libraries
- Intel Performance Library Overview
- Intel Math Kernel Library Overview
- Intel Data Analytics Library Overview
- Together: MKL and DAAL
- Intel Integrated Performance Primitives Library Overview
- Intel Performance Libraries and Intel Compilers
- Native (Direct) Library Usage
- Offloading to Knights Landing While Using a Library
- Precision Choices and Variations
- Performance Tip for Faster Dynamic Libraries
- For More Information
- ch. 14 Profiling and Timing
- Introduction to Knights Landing Tuning
- Event-Monitoring Registers
- Efficiency Metrics
- Potential Performance Issues
- Intel VTune Amplifier XE Product
- Performance Application Programming Interface
- MPI Analysis: ITAC
- HPCToolkit
- Tuning and Analysis Utilities
- Timing
- Summary
- For More Information
- ch. 15 MPI
- Internode Parallelism
- MPI on Knights Landing
- MPI Overview
- How to Run MPI Applications
- Analyzing MPI Application Runs
- Tuning of MPI Applications
- Heterogeneous Clusters
- Recent Trends in MPI Coding
- Putting it all Together
- Summary
- For More Information
- ch. 16 PGAS Programming Models
- To Share or not to Share
- Why Use PGAS on Knights Landing?
- Programming with PGAS
- Performance Evaluation
- Beyond PGAS
- Summary
- For More Information
- ch. 17 Software-Defined Visualization
- Motivation for Software-Defined Visualization
- Software-Defined Visualization Architecture
- OpenSWR: OpenGL Raster-Graphics Software Rendering
- Embree: High-Performance Ray Tracing Kernel Library
- OSPRay: Scalable Ray Tracing Framework
- Summary
- Image Attributions
- For More Information
- ch. 18 Offload to Knights Landing
- Offload Programming Model: Using with Knights Landing
- Processors Versus Coprocessor
- Offload Model Considerations
- OpenMP Target Directives
- Concurrent Host and Target Execution
- Offload Over Fabric
- Summary
- For More Information
- ch. 19 Power Analysis
- Power Demand Gates Exascale
- Power 101
- Hardware-Based Power Analysis Techniques
- Software-Based Knights Landing Power Analyzer
- ManyCore Platform Software Package Power Tools
- Running Average Power Limit
- Performance Profiling on Knights Landing
- Intel Remote Management Module
- Summary
- For More Information
- ch. 20 Optimizing Classical Molecular Dynamics in LAMMPS
- Molecular Dynamics
- LAMMPS
- Knights Landing Processors
- LAMMPS Optimizations
- Data Alignment
- Data Types and Layout
- Vectorization
- Neighbor List
- Long-Range Electrostatics
- MPI and OpenMP Parallelization
- Performance Results
- System, Build, and Run Configurations
- Workloads
- Organic Photovoltaic Molecules
- Hydrocarbon Mixtures
- Rhodopsin Protein in Solvated Lipid Bilayer
- Coarse-Grain Liquid Crystal Simulation
- Coarse-Grain Water Simulation
- Summary
- Acknowledgment
- For More Information
- ch. 21 High Performance Seismic Simulations
- High-Order Seismic Simulations
- Numerical Background
- Application Characteristics
- Intel Architecture as Compute Engine
- Highly-Efficient Small Matrix Kernels
- Sparse Matrix Kernel Generation and Sparse/Dense Kernel Selection
- Dense Matrix Kernel Generation: AVX2
- Dense Matrix Kernel Generation: AVX-512
- Kernel Performance Benchmarking
- Incorporating Knights Landing's Different Memory Subsystems
- Performance Evaluation
- Mount Merapi
- 1992 Landers
- Summary and Take-Aways
- For More Information
- ch. 22 Weather Research and Forecasting (WRF)
- WRF Overview
- WRF Execution Profile: Relatively Flat
- History of WRF on Intel Many-Core (Intel Xeon Phi Product Line)
- Our Early Experiences with WRF on Knights Landing
- Compiling WRF for Intel Xeon and Intel Xeon Phi Systems
- WRF CONUS12km Benchmark Performance
- MCDRAM Bandwidth
- Vectorization: Boost of AVX-512 Over AVX2
- Core Scaling
- Summary
- For More Information
- ch. 23 N-Body Simulation
- Parallel Programming for Noncomputer Scientists
- Step-by-Step Improvements
- N-Body Simulation
- Optimization
- Initial Implementation (Optimization Step 0)
- Thread Parallelism (Optimization Step 1)
- Scalar Performance Tuning (Optimization Step 2)
- Vectorization with SOA (Optimization Step 3)
- Memory Traffic (Optimization Step 4)
- Impact of MCDRAM on Performance
- Summary
- For More Information
- ch. 24 Machine Learning
- Convolutional Neural Networks
- OverFeat-FAST Results
- For More Information
- ch. 25 Trinity Workloads
- Out of the Box Performance
- Optimizing MiniGhost OpenMP Performance
- Summary
- For More Information
- ch. 26 Quantum Chromodynamics
- LQCD
- The QPhiX Library and Code Generator
- Wilson-Dslash Operator
- Configuring the QPhiX Code Generator
- The Experimental Setup
- Results
- Conclusion
- For More Information