Performance Counter - Operating Systems and Middleware Group at

Performance Counter
Non-Uniform Memory Access Seminar
Karsten Tausche
2014-12-10
Performance Counter
Hardware Unit for event measurements
“Performance Monitoring Unit” (PMU)
Originally for CPU-Debugging
used by manufacturers
Model Specific Register (MSR)
Program and access PMUs
Categories
Total instruction count
Branch instruction (total/conditional)
Load and Store instructions
Arithmetic instructions
Cache events
Uncore events
Motivation
Analyze CPU behavior on instruction level
High Performance Computing
CPU simulators
Embedded
Unmodified Code
Does not affect cache contents
Performance Counter vs. Instrumentation
No performance impact with active PMUs
Outline
• Accessing PMUs
• Performance Counters on Intel Xeon
Definitions
Core/Uncore events
• Accuracy
Analysis
Dependencies
Reducing Inaccuracies
• Tools
Intel Performance Counter Monitor
Perf
PAPI
Intel Perf
Counter Access
Select counted events
Either fixed function PMU
Or multiple countable events per PMU
Configure counters
Start/stop
Count in kernel/user-mode
Read/write values
Programming via MSR bitfields
CPU model specific field definitions
MSRs accessible only in kernel mode
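Because MSRs can only be read in kernel mode, a user-space tool needs a driver in between. On Linux, the msr kernel module exposes the registers of each CPU as /dev/cpu/<n>/msr, with the file offset selecting the register address. The following is a minimal sketch, assuming that module is loaded and the script runs with root privileges. The helper name read_msr is hypothetical; the address 0x309 (IA32_FIXED_CTR0, the fixed-function counter for retired instructions) is taken from the Intel SDM.

```python
import os
import struct

IA32_FIXED_CTR0 = 0x309  # fixed-function counter: instructions retired (Intel SDM)

def read_msr(path, address):
    """Read one 64-bit register; the file offset selects the MSR address."""
    fd = os.open(path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read from " + path)
    return struct.unpack("<Q", raw)[0]

if __name__ == "__main__":
    # Assumption: 'modprobe msr' was run and we have root privileges.
    try:
        print(hex(read_msr("/dev/cpu/0/msr", IA32_FIXED_CTR0)))
    except OSError as err:
        print("cannot read MSR:", err)
```

The value is only meaningful if the counter was enabled beforehand; this sketch reads the register and does not configure it.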
Counter Access
PMU event counting per core/hyper thread
Kernel mode driver and user space library/tool
Counter per software thread
Generalized PMU programming with event IDs
Access in user mode, without root privileges
[Abstraction from platform/operating system]
Performance Counter on Intel Xeon
“instructions_retired”
Executed by speculative out-of-order pipeline
Handled by Retirement Unit
Results visible to user
Performance Counter on Intel Xeon
“Uncore”
Intel Xeon E5-2600 Family (Sandy Bridge EP)
per socket resources
“Box”: modular uncore unit
Core Performance Counter on Intel Xeon
Instruction counts: total, branches, arithmetic, …
Core Performance Counter on Intel Xeon
Intel Turbo Boost frequencies
Core Performance Counter on Intel Xeon
L1, L2 cache hits and misses
Uncore Performance Counter on Intel Xeon
Last Level Cache (shared L3 cache): hits and misses
Uncore Performance Counter on Intel Xeon
Home Agent: memory controller and cache coherency
memory read/write, local/remote, conflicts, directory/snooping
Pbox: physical connection between cores or sockets
Uncore Performance Counter on Intel Xeon
Integrated Memory Controller
DRAM access: read/write queues, ECC correctable errors, refreshes, thermal throttling
Uncore Performance Counter on Intel Xeon
QPI: Ring ↔ Link Layer (socket interconnect)
Filter event counts: physical address, Home Node ID, instruction
Uncore Performance Counter on Intel Xeon
QPI: Ring ↔ Link Layer (socket interconnect)
link speed, transfers, total link utilization
Uncore Performance Counter on Intel Xeon
Power Controller Unit
Socket energy usage
DRAM energy usage
Uncore Performance Counter on Intel Xeon
Power Controller Unit
per core temperature
time spent in power states
Performance Counters per Box
QPI: 4 per port + 3
PCU: 4
Ubox (system config): 2
PCIe: 4
CBox: 4
Home Agent: 4
iMC: 4 per channel
Performance Counters on Intel Xeon
ubuntu-numa0101.fsoc: Linux 3.13, 2x Intel Xeon E5-2620
• 57 (+x) Performance Monitoring Units per socket
• 634 countable events
• Allowing comprehensive runtime analysis
• Mostly focused on a few context specific events
Accuracy
Not defined/guaranteed by manufacturer
At least not for Intel/AMD
Speculative architecture
Out-of-order pipeline, serving multiple computing units
Branch prediction
Hardware parallelization
CPU behavior depending on timings, cache-contents, etc.
Accuracy Analysis
[Weaver2013]
Non-determinism
Identical run, different result
Overcount
Same (wrong) result for identical runs
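The two error classes can be told apart by repeating an identical run and comparing the counts against a known expected value. The following is a minimal sketch under that assumption. The function name and the sample counts are hypothetical, and the "undercount" case is added here for completeness; it is not a term used on the slide.

```python
def classify_counts(counts, expected):
    """Classify repeated measurements of one event taken from identical runs."""
    if len(set(counts)) > 1:
        return "non-deterministic"      # identical runs, different results
    if counts[0] > expected:
        return "overcount"              # same (wrong) result in every run
    if counts[0] < expected:
        return "undercount"             # added case, not named in the source
    return "exact"

# Hypothetical counts for a benchmark with 1000 expected events.
print(classify_counts([1000, 1003, 1001], 1000))
print(classify_counts([1010, 1010, 1010], 1000))
```

A deterministic overcount can be corrected by subtracting a constant offset, whereas a non-deterministic event cannot be corrected this way.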
Accuracy Analysis
[Weaver2013]
One deterministic event without overcount on Sandy Bridge EP:
BR_INST_RETIRED_CONDITIONAL
(executed conditional branch instructions)
PMU hardly usable for deterministic implementations
Deterministic replay
Deterministic threading libraries
CPU simulators
Don’t use micro-operation-counter
System specific, undocumented low-level instructions in CISC processors
Inaccuracy Sources
[Weaver2013]
Hardware Interrupts: increment most events
Nondeterministic overcount
Wrong/unintuitive counter behavior
Floating point instructions with “wait for exception” – count twice?
Count µOPs instead of retired instructions (e.g., load/store events)
Accessing counters
Requires system call (interrupt, context switch)
Reduce Inaccuracies
Carefully controlled test environment
Kernel version, tool versions
Running processes
BIOS/Power saving settings
Compiler/Runtime configuration
E.g., Address space layout randomization
Reduce Inaccuracies
Prefer low-level APIs
Prefer dynamic library over command line tool
Compare tools/libraries
[Read papers]
[Check errors with well known Assembly]
Error Rates
Counted instructions in the micro benchmark used by [Weaver2013]
Integer divides
Nehalem: 11.2%
Nehalem-EX: 1.1%
Sandy Bridge EP: n/a
Ivy Bridge: 2.8%
Floating point instructions
Nehalem: 0.0%
Nehalem-EX: 0.0%
Sandy Bridge EP: 0.003%
Ivy Bridge: 0.08%
Using Performance Counters
Use Profilers first
Using automated tests
Partly implemented with Performance Counter
Optimize as much as possible
Low level analysis with Performance Counters
Optimize problematic code sections
Platform specific optimization
Benefit from minimal overhead
Tools and Libraries
• Intel Performance Counter Monitor
• Perfmon/libpfm
• Linux Perf
• PAPI
Intel Performance Counter Monitor
Tools and C++ programming interface
Full support for Intel core/uncore events
Supports newer Intel Xeon, Core i, Atom
Uncore mainly available on server platforms
Intel Performance Counter Monitor
Provided by Intel as source code
Driver; GUI/command line tools; C++ library
Linux
built upon MSR kernel module
KDE: ksysguard plug-in
Windows
compile/modify sample driver
Perfmon plug-in
Intel Performance Counter Monitor
Ksysguard (KDE System Monitor)
Intel Performance Counter Monitor
Windows Performance Monitor
perf – “Performance Counters for Linux”
Part of Linux since v2.6.31 (2009)
before: patch and compile your kernel
Command-line tool “perf” (userspace)
Debian/Ubuntu-Package “linux-tools”
perf list
Detects supported events
But no support for finding relevant events
perf stat -e [eventName] command
Run command and count eventName
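For scripted measurements, the counts can be collected programmatically. The following is a minimal sketch, assuming perf is installed and the requested events are supported. It uses the -x option of perf stat, which prints the counts as separator-delimited lines on stderr. The column layout differs between perf versions, so the parser only assumes that the count is the first field and that the event name appears verbatim in a later field. Both function names are hypothetical.

```python
import subprocess

def parse_perf_csv(text, events, sep=","):
    """Extract {event: count} from 'perf stat -x' output."""
    counts = {}
    for line in text.splitlines():
        fields = line.strip().split(sep)
        for event in events:
            if event in fields[1:]:
                try:
                    counts[event] = int(fields[0])
                except ValueError:
                    counts[event] = None   # e.g. "<not supported>"
    return counts

def perf_stat(command, events):
    """Run command under 'perf stat' and return the parsed counts."""
    args = ["perf", "stat", "-x", ","]
    for event in events:
        args += ["-e", event]
    # perf writes its counts to stderr; the command's own stdout is discarded.
    result = subprocess.run(args + list(command), stderr=subprocess.PIPE,
                            stdout=subprocess.DEVNULL, text=True)
    return parse_perf_csv(result.stderr, events)
```

A call such as perf_stat(["./cache-miss"], ["instructions:u", "cache-misses:u"]) would then return a dictionary with one entry per event.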
Linux: perfmon2 / libpfm
Originated from own kernel subsystem
Kernel driver: perfmon2, userspace library: libpfm
Superseded by Linux’ perf_events interface
Part of kernel since v2.6.31, 2009
Libpfm3 – IA64-subsystem in Linux by HP
Libpfm4 – complete rewrite by Google
Linux: libpfm4
Using Linux perf_events interface
libpfm4
Retrieve supported events per source
Translating event IDs and names
Program events
Architecture support
Intel x86: since Pentium P6, Core Duo/Solo, Atom, Nehalem
AMD64 x86: K7, K8 and newer; uncore since Bulldozer
Some ARM, SPARC, IBM Power, MIPS models
Libpfm4 on ubuntu-numa0101.fsoc
Source code: examples and tools
showevtinfo (example tool)
Lists supported and detected PMUs
Xeon E5-2620 (Sandy Bridge EP):
Lists 4196 available events, 634 supported
Per event: PMU, index, parameters, description
Libpfm4 on ubuntu-numa0101.fsoc
showevtinfo
IDX : 37748741
PMU name : ix86arch (Intel X86 architectural PMU)
Name : BRANCH_INSTRUCTIONS_RETIRED
Equiv : None
Flags : None
Desc : count branch instructions at retirement. Specifically, this event counts the retirement of the last micro-op of a branch instruction
Code : 0xc4
Modif-00 : 0x00 : PMU : [k] : monitor at priv level 0 (boolean)
Modif-01 : 0x01 : PMU : [u] : monitor at priv level 1, 2, 3 (boolean)
Modif-02 : 0x02 : PMU : [e] : edge level (may require counter-mask >= 1) (boolean)
Modif-03 : 0x03 : PMU : [i] : invert (boolean)
Modif-04 : 0x04 : PMU : [c] : counter-mask in range [0-255] (integer)
Modif-05 : 0x05 : PMU : [t] : measure any thread (boolean)
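The Code and Modif fields in this listing correspond to bitfields of the event-select MSR. The following sketch packs them into a register value, assuming the IA32_PERFEVTSELx layout described in the Intel SDM for architectural performance monitoring. The function name and the mapping of the libpfm modifier letters to parameter names are illustrative assumptions, not part of libpfm.

```python
def encode_perfevtsel(event, umask=0, u=True, k=False, e=False,
                      i=False, c=0, t=False, enable=True):
    """Build an IA32_PERFEVTSELx value from libpfm-style modifiers."""
    if not 0 <= event <= 0xFF or not 0 <= umask <= 0xFF or not 0 <= c <= 0xFF:
        raise ValueError("event, umask and counter-mask are 8-bit fields")
    value = event                 # bits 0-7:   event select (the 'Code' field)
    value |= umask << 8           # bits 8-15:  unit mask
    value |= int(u) << 16         # bit 16:     count at priv level 1, 2, 3
    value |= int(k) << 17         # bit 17:     count at priv level 0
    value |= int(e) << 18         # bit 18:     edge detect
    value |= int(t) << 21         # bit 21:     any thread
    value |= int(enable) << 22    # bit 22:     enable counter
    value |= int(i) << 23         # bit 23:     invert counter-mask comparison
    value |= c << 24              # bits 24-31: counter-mask
    return value

# BRANCH_INSTRUCTIONS_RETIRED (code 0xc4), counted in user mode only.
print(hex(encode_perfevtsel(0xC4)))  # 0x4100c4
```

Libraries such as libpfm4 perform this translation internally, so that callers can work with event names instead of raw register values.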
PAPI
Performance Application Programming Interface
Platform independent PMU interface
Provide standard definitions for performance metrics
“easy to use, well documented, freely available”
Windows support discontinued after XP
Preset Events
Supported across [nearly] all platforms
High Level API
Simplified access to Preset Events
Low Level API
Adds access to native events
PAPI
Performance Application Programming Interface
High Level API: core events only
Uncore still requiring Low Level API
papi_avail
Check available events
papi_event_chooser
List events that can be combined with a given list
papi_native_avail
Detailed information about native events
Demo
perf
• perf stat -e instructions:u -e cache-misses:u ./cache-miss
count total instructions and cache-misses, both in user mode
• perf stat -r 10 …
run command 10 times, print average and stddev
libpfm4 examples in libpfm/libpfm-4.5.0/examples
• ./showevtinfo | less
first lists the PMUs supported by libpfm and detected on your system,
then lists all supported events with description and parameters
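The aggregation over repeated runs can also be done by hand, for example when counts are collected through a library instead of the perf tool. The following is a minimal sketch with hypothetical sample counts. It reports the sample standard deviation, which is not necessarily the exact formula perf uses for its own output.

```python
import statistics

def summarize(counts):
    """Return (mean, sample standard deviation, relative deviation in %)."""
    mean = statistics.mean(counts)
    stddev = statistics.stdev(counts) if len(counts) > 1 else 0.0
    rel = 100.0 * stddev / mean if mean else 0.0
    return mean, stddev, rel

# Hypothetical cache-miss counts from five identical runs.
mean, stddev, rel = summarize([10412, 10398, 10455, 10401, 10420])
print("mean=%.1f stddev=%.1f (+- %.2f%%)" % (mean, stddev, rel))
```

A large relative deviation across identical runs is a hint that the chosen event is non-deterministic or that the test environment is not sufficiently controlled.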
Demo sources
/NUMASem/demo_PerformanceCounter/
• cache-miss / cache-miss2
use perf stat to check cache misses with different array iteration patterns
• libpfm_cache-miss
Uses libpfm4 library calls to count cache events
List of measured events is defined by “eventNames” string vector
This is a simplified version of libpfm examples:
/NUMASem/libpfm-4.5.0/perf_examples/self.c
• libpfm_qpi_remote
Uses libpfm4 to count QPI events and libnuma to pin the task and allocated memory to different sockets.
See line 75 to switch between allocating memory on local or remote socket.
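Why the array iteration pattern changes the cache-miss count can be illustrated with a toy model. The following sketch is not the demo source; it simulates a direct-mapped cache with assumed parameters (32 KiB capacity, 64-byte lines) and a hypothetical 256 x 256 array of 8-byte elements stored in row-major order.

```python
def count_misses(addresses, cache_bytes=32 * 1024, line_bytes=64):
    """Count misses of a direct-mapped cache for a sequence of byte addresses."""
    num_sets = cache_bytes // line_bytes
    resident = [None] * num_sets      # line number currently held by each set
    misses = 0
    for address in addresses:
        line = address // line_bytes
        index = line % num_sets
        if resident[index] != line:
            resident[index] = line    # evict the old line, load the new one
            misses += 1
    return misses

N, ELEM = 256, 8   # hypothetical N x N array of 8-byte elements, row-major

row_wise = ((i * N + j) * ELEM for i in range(N) for j in range(N))
col_wise = ((i * N + j) * ELEM for j in range(N) for i in range(N))

print("row-wise misses:   ", count_misses(row_wise))
print("column-wise misses:", count_misses(col_wise))
```

In this model the row-wise walk misses once per cache line (8192 misses), while the column-wise walk evicts each line before it is reused and misses on every access (65536 misses). Real hardware with associative caches and prefetchers behaves less extremely, which is what the measured counter values show.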
Sources
• Zaparanuks, D.; Jovic, M.; Hauswirth, M., “Accuracy of performance counter measurements”, 2009
• Weaver, V.M.; Terpstra, D.; Moore, S., “Non-determinism and overcount on modern hardware performance counter implementations”, 2013
• “Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2”, September 2014
• https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization
• http://perfmon2.sourceforge.net/
• http://icl.cs.utk.edu/projects/papi/wiki/Main_Page
• Linux man pages