B.2 Tutorial: Introduction to perf


The perf framework, also referred to as perf_events, is a performance monitoring tool and event tracer closely integrated with the Linux kernel. Its primary functionality is based on the sys_perf_event_open system call introduced in the 2.6 series of Linux. On some systems, collecting counter values without root permission requires setting the paranoid level to a less restrictive value:

$ sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'

The KTH lab computers already expose a subset of counters, so this step is not needed there.
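Before changing the setting, it can help to check which level is currently in effect. The following is a minimal sketch: the helper name describe_paranoid is hypothetical, and the descriptions paraphrase the kernel documentation for levels -1 through 2. Some distributions define stricter levels above 2, which the sketch only reports as distribution-specific.

```shell
#!/bin/sh
# Hypothetical helper: map a perf_event_paranoid level to a short description.
# Each level includes the restrictions of the levels below it.
describe_paranoid() {
    case "$1" in
        -1) echo "no restrictions for unprivileged users" ;;
        0)  echo "raw tracepoint access restricted" ;;
        1)  echo "system-wide (per-CPU) monitoring also restricted" ;;
        2)  echo "kernel-space profiling also restricted; user-space only" ;;
        *)  echo "stricter, distribution-specific policy" ;;
    esac
}

paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    echo "perf_event_paranoid = $level: $(describe_paranoid "$level")"
else
    # The file is absent on non-Linux systems or kernels built without perf_events.
    echo "$paranoid_file is not readable on this system"
fi
```

Reading the file needs no special permission; only writing a new value requires root, as in the sudo command above.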

The system call enables access to special-purpose registers of the CPU that may be configured to collect the counts of specific hardware-level events. These events may vary from processor to processor, but their main categories include the following:

    • Cache related: Misses and references issued. These may be further grouped by cache level (L1 through L3), cache type (instruction and data), and access type (loads and stores).
    • Translation lookaside buffer related: These may also be subdivided into instruction and data categories, and by access type (load/store).
    • Branch statistics: These include counts of overall branch occurrences and missed branch target loads.
    • Instructions and cycles: Perf can provide the number of executed instructions or the count of CPU cycles that occurred during program execution.
    • Stalled or idle cycles: These further subdivide into front-end and back-end stalls. Front-end stalls indicate an inability to completely fill the available capacity of the first stages of the execution pipeline and may be caused by instruction cache or translation lookaside buffer (TLB) misses, mispredicted branches, or unavailability of translation into micro-operations for specific instruction(s). Back-end stalls may be caused by inter-instruction dependencies (e.g., a long-latency instruction, such as division, delaying the execution of other dependent instructions) or by the unavailability of memory units.
    • Node-level statistics: prefetches, loads and stores, and misses. Prefetch misses are counted separately to avoid false inflation of statistics describing actual data accesses generated by the monitored code.
    • Data collected by the processor’s performance monitoring unit (PMU): These counters provide the aggregate values for the whole CPU, including primarily uncore-related events. Uncore is a term coined by Intel to describe segments of CPU logic that are not part of the core execution pipeline and thus are shared by the cores. They include memory controllers and their interfaces, a node-level interconnect bus that provides NUMA functionality, a last-level cache, a coherency traffic monitor, and power management.

The perf tool also provides access to many software-level kernel events that may be of great use for performance analysis. They comprise counts of context switches, CPU migrations, and data alignment faults; major, minor, and aggregate page faults; accurate time measurements; and custom events defined using the Berkeley Packet Filter framework. The complete list of events supported on the local system is obtained with:

$ perf list
...
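Because the full list can be long, perf list also accepts a category filter. The sketch below prints the first few entries of the hardware, software, and cache categories. The variable name categories and the five-line limit are illustrative choices, and the events actually listed depend on the processor and kernel.

```shell
#!/bin/sh
# Event categories accepted by "perf list": hw (hardware), sw (software), cache.
categories="hw sw cache"

if command -v perf >/dev/null 2>&1; then
    for category in $categories; do
        echo "== $category =="
        # Show only the first few entries of each category.
        perf list "$category" | head -n 5
    done
else
    echo "perf is not installed on this system"
fi
```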

Perf may be invoked in several modes of operation selected by the first argument on the command line. The frequently used commands are:

  • stat, which executes the provided application with arguments while collecting the counts of specified events or a default event set
  • record, which enables per-thread, per-process, or per-CPU profiling
  • report, which performs analysis of data collected by record
  • annotate, which correlates the gathered profiling data to assembly code
  • top, which displays the statistics in real time using a format resembling that of the Unix top utility for the visualization of process activity
  • bench, which invokes a number of predefined kernel benchmarks.
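A typical session chains several of these commands. The sketch below assumes that the test application ./mvmult used in this tutorial is present in the current directory. The event list passed to stat and the 40-line output limit are illustrative choices, not requirements.

```shell
#!/bin/sh
app=./mvmult       # test application from this tutorial; adjust the path as needed
data=perf.data     # default file name written by "perf record"

if command -v perf >/dev/null 2>&1 && [ -x "$app" ]; then
    # stat: count an explicit set of events instead of the default set
    perf stat -e cycles,instructions,cache-misses,context-switches "$app" 20000

    # record: sample the run and store the profile in $data
    perf record -o "$data" "$app" 20000

    # report: per-symbol summary of the recorded samples as plain text
    perf report -i "$data" --stdio | head -n 40

    # annotate: recorded samples correlated with assembly code
    perf annotate -i "$data" --stdio | head -n 40
else
    echo "perf or $app not found; nothing to profile"
fi
```

Without --stdio, report and annotate open an interactive text interface when the terminal supports it.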

Simple Counter Collection

To test this functionality in practice, we can profile our test application. The result for a matrix-vector multiplication code written in C is presented below.

$ perf stat ./mvmult 20000
tcmalloc: large alloc 3200000000 bytes == 0x55f63d330000 @  0x7f8725444680 0x7f87254642ec 0x55f63b9ef315 0x55f63b9ef148 0x7f87252480b3 0x55f63b9ef21e
Size 20000; abs. sum: 10000.000000 (expected: 10000)

 Performance counter stats for './mvmult 20000':

          3 774,74 msec task-clock                #    2,458 CPUs utilized          
            52 824      context-switches          #    0,014 M/sec                  
                4      cpu-migrations            #    0,001 K/sec                  
           783 166      page-faults               #    0,207 M/sec                  
     12 376 900 881      cycles                    #    3,279 GHz                    
      5 955 837 348      instructions              #    0,48  insn per cycle         
      1 117 217 291      branches                  #  295,972 M/sec                  
         8 612 181      branch-misses             #    0,77% of all branches        

       1,535663165 seconds time elapsed

       2,017880000 seconds user
       1,757637000 seconds sys
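The derived metrics in the right-hand column follow directly from the raw counts. As a sanity check, the sketch below recomputes four of them with awk from the values printed above, with locale-specific separators removed. The variable names are illustrative.

```shell
#!/bin/sh
# Raw values copied from the perf stat output above.
task_clock_ms=3774.74
elapsed_s=1.535663165
cycles=12376900881
instructions=5955837348
branches=1117217291
branch_misses=8612181

# awk performs the floating-point arithmetic; LC_ALL=C keeps "." as decimal mark.
calc() { LC_ALL=C awk "BEGIN { printf \"$1\", $2 }"; }

ipc=$(calc '%.2f' "$instructions / $cycles")               # instructions per cycle
miss_pct=$(calc '%.2f' "100 * $branch_misses / $branches")  # branch miss rate in %
cpus=$(calc '%.3f' "$task_clock_ms / ($elapsed_s * 1000)")  # CPU time / wall time
ghz=$(calc '%.3f' "$cycles / ($task_clock_ms * 1e6)")       # cycles per ns of CPU time

echo "insn per cycle:  $ipc"
echo "branch misses:   $miss_pct% of all branches"
echo "CPUs utilized:   $cpus"
echo "clock frequency: $ghz GHz"
```

The results (0.48, 0.77%, 2.458, and 3.279 GHz) match the values reported by perf. A utilization above 1 means the run kept more than one CPU busy on average, which is consistent with the user and sys times summing to more than the elapsed time.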


Note that, using a technique called multiplexing, perf may accommodate more events in a single invocation than there are hardware counter slots available in the processor. At any given moment only a subset of the requested events is configured on the processor; this subset is periodically replaced with one that contains other requested events. This is repeated cyclically so that all specified events are active for an approximately equal share of time during application execution. Information collected with perf record may be analyzed using the “perf report” command.
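Because a multiplexed event is counted only part of the time, its raw count is extrapolated to the full run by the ratio of the time the event was enabled to the time it was actually running on a counter. The sketch below illustrates that scaling with made-up numbers: three events sharing one counter slot over a 1500 ms run. None of the values come from the measurement above.

```shell
#!/bin/sh
# Hypothetical values for one multiplexed event.
raw_count=600000000     # events counted while the event occupied a counter
time_enabled_ms=1500    # time the event was requested (whole run)
time_running_ms=500     # time the event was actually counting

# Scaled estimate: raw_count * time_enabled / time_running
scaled=$(LC_ALL=C awk -v n="$raw_count" -v e="$time_enabled_ms" -v r="$time_running_ms" \
    'BEGIN { printf "%.0f", n * e / r }')

# Share of the run during which the event was counting, in percent
coverage=$(LC_ALL=C awk -v e="$time_enabled_ms" -v r="$time_running_ms" \
    'BEGIN { printf "%.2f", 100 * r / e }')

echo "estimated count: $scaled (counted during $coverage% of the run)"
```

When multiplexing occurs, perf stat prints this coverage percentage in parentheses at the end of the affected event's line. A scaled count is an estimate and is less reliable when the program's behavior changes between phases.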