CoreSight, Perf and the OpenCSD Library

 In Core Dump

core-dump

Introduction

In this article we explain how the CoreSight framework found in the Linux kernel has been integrated with the standard Perf core, both at the kernel and user space level.  In the latter part the newly introduced Open CoreSight Decoding Library (OpenCSD) is used to assist with trace decoding.  The topic of trace decoding with openCSD will be covered in an upcoming post.

All examples presented in this post have been collected on a juno-R0 platform using code that is public and accessible to everyone.

Background on Perf and the Performance Management Units

The standard Perf core is a performance analysis tool found in the Linux kernel.  It comes with a complement user space tool, simply called perf, that provides a suite of sub-commands to control and present trace profiling sessions.  Perf is most commonly used to access SoC performance counters, but over the years it has grown well beyond that and now covers tracepoints, software performance counters and dynamic probes.  

The perf core is generic and caters to many architectures.  To hide variations between HW implementation and profiling metrics the concept of Performance Monitoring Unit (PMU) is used.  A PMU is a structure providing a well defined set of interfaces that PMU drivers implement in order to carry action on behalf of the Perf core.  The actions carried out the by the PMU drivers are not relevant to the Perf core itself, as long as the semantic of the API is respected.

Every time a process is installed on a CPU for execution, the scheduler invokes the Perf core.   From there Perf will see if any event is associated with that process and if so, the PMU API performing HW specific operations is invoked.  The same happens when the process is removed from a CPU.  That way statistics and performance counters are collected for that process only and aren’t impacted by other activities concurrently happening in the system.  Traces collected during a session are transferred to user space using a mmap’ed area and made available to users in the perf.data file.  The latter is then read by the various perf sub-command for rendering in human readable format.  

Integrating the CoreSight drivers with the Perf core was advantageous on many fronts.  On the kernel side it streamlined the configuration of trace sessions – with hundreds of parameters per CPU this was certainly not something to pass on.  It also offered a way to easily transfer massive amounts of trace data to user space with little overhead.  In user space the metadata pertaining to each trace session could be embedded in the perf.data file and perf sub-commands like report and script used to decode trace data.  Last but not least most of the upstream code can be re-used in the PMU abstraction.

Integration of CoreSight with the Perf Framework

The kernel side

To bridge the gap between the CoreSight framework and the Perf core, CoreSight tracers (ETMv3/4 and PTM) are modelled as PMUs.  At boot time the newly introduced function etm_perf_init() registers an etm_pmu with the perf core:

#define CORESIGHT_ETM_PMU_NAME “cs_etm”

static struct pmu etm_pmu;

static int __init etm_perf_init(void)
{
         int ret;
 
         etm_pmu.capabilities    = PERF_PMU_CAP_EXCLUSIVE;
 
         etm_pmu.attr_groups     = etm_pmu_attr_groups;
         etm_pmu.task_ctx_nr     = perf_sw_context;
         etm_pmu.read            = etm_event_read;
         etm_pmu.event_init      = etm_event_init;
         etm_pmu.setup_aux       = etm_setup_aux;
         etm_pmu.free_aux        = etm_free_aux;
         etm_pmu.start           = etm_event_start;
         etm_pmu.stop            = etm_event_stop;
         etm_pmu.add             = etm_event_add;
         etm_pmu.del             = etm_event_del;
         etm_pmu.get_drv_configs = etm_get_drv_configs;
         etm_pmu.free_drv_configs = etm_free_drv_configs;
                                
        ret = perf_pmu_register(&etm_pmu, CORESIGHT_ETM_PMU_NAME, -1);
        if (ret == 0)
                 etm_perf_up = true;
 
         return ret;
 }
device_initcall(etm_perf_init);

Calling perf_pmu_register() creates a new PMU with the characteristics found in the struct pmu given as a parameter.  When a successful registration has completed the new PMU can be found alongside the other PMUs catalogued at boot time:

linaro@linaro-nano:~$
linaro@linaro-nano:~$ ls /sys/bus/event_source/devices/
breakpoint  cs_etm  software  tracepoint
linaro@linaro-nano:~$
linaro@linaro-nano:~$ ls /sys/bus/event_source/devices/cs_etm
cpu0  cpu1  cpu2  cpu3  cpu4  cpu5  format  perf_event_mux_interval_ms  power  subsystem  type  uevent
linaro@linaro-nano:~$

The astute reader will notice that cpu[0… 5] are not part of the typical sysFS entries associated with PMUs, and they will be correct.  Upon successful registration with the CoreSight core, the ETMv3/PTM and ETMv4 drivers create a symbolic link between their sysFS entries and the new cs_etm PMU, allowing the Perf user space API to quickly retrieve the metadata associated with a tracer:

linaro@linaro-nano:~$ ls -l /sys/bus/event_source/devices/cs_etm/cpu0
lrwxrwxrwx 1 root root 0 Jun  1 20:19 /sys/bus/event_source/devices/cs_etm/cpu0 -> ../platform/23040000.etm/23040000.etm
linaro@linaro-nano:~$
linaro@linaro-nano:~$ ls /sys/bus/event_source/devices/cs_etm/cpu0/trcidr/
trcidr0  trcidr1  trcidr10  trcidr11  trcidr12  trcidr13  trcidr2  trcidr3  trcidr4  trcidr5  trcidr8  trcidr9
linaro@linaro-nano:~$
linaro@linaro-nano:~$ ls /sys/bus/event_source/devices/cs_etm/cpu0/mgmt/
trcauthstatus  trcdevid    trclsr    trcpdcr  trcpidr0  trcpidr2  trctraceid trcconfig      trcdevtype  trcoslsr  trcpdsr  trcpidr1  trcpidr3
linaro@linaro-nano:~$

The user space side

In user space integration is done around three tools: perf record, perf report and perf script, which are the perf sub-commands we have been referring to.  The first deals with event configuration and creation while the latter two assist in rendering trace data collected during a session in a human readable format.

perf record

Integration in the perf record sub-command is done by providing an architecture specific function that return a struct auxtrace_record.  As with the kernel PMU abstraction the auxtrace_record structure allows the generic core to perform architecture-specific operations without losing genericity.  That way it is possible to process traces data generated by IntePT and CoreSight without changing anything to the common core.

struct auxtrace_record *cs_etm_record_init(int *err)
{
       struct perf_pmu *cs_etm_pmu;
       struct cs_etm_recording *ptr;

       cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);

       [clip…]


       ptr->cs_etm_pmu                 = cs_etm_pmu;
       ptr->itr.parse_snapshot_options = cs_etm_parse_snapshot_options;
       ptr->itr.recording_options      = cs_etm_recording_options;
       ptr->itr.info_priv_size         = cs_etm_info_priv_size;
       ptr->itr.info_fill              = cs_etm_info_fill;
       ptr->itr.find_snapshot          = cs_etm_find_snapshot;
       ptr->itr.snapshot_start         = cs_etm_snapshot_start;
       ptr->itr.snapshot_finish        = cs_etm_snapshot_finish;
       ptr->itr.reference              = cs_etm_reference;
       ptr->itr.free                   = cs_etm_recording_free;
       ptr->itr.read_finish            = cs_etm_read_finish;

       *err = 0;
       return &ptr->itr;
out:
       return NULL;
}

Among other things, functions provided to the struct auxtrace_record deal with how to find tracer specific metadata, the presentation and formatting of the metadata in the perf.data file along with specifics related to the size and mapping of the ring buffer shared between the kernel and user space.  That ring buffer is then used to retrieve trace data from the kernel.

perf report and perf script

The decompression and rendering of trace data is done in the report and script utilities.  The process starts by reading the perf.data file and parsing each of the events that were generated during a trace session.  The AUXTRACE_INFO and PERF_RECORD_MMAP2 are especially important.  The first event carries a wealth of information about how the tracers were configured, the so called metadata, and a list of offsets in the perf.data file where lumps of trace data are located.  These offsets are recorded for later processing.  

PERF_RECORD_MMAP2 events carry the name and path of the binary and libraries that were loaded/executed during the trace session.  Those are commonly called Dynamic Shared Object, or DSO.  Having a handle on the DSOs is important for trace decoding since some branch point don’t carry the destination address, only that the branch point was taken or not.  In those cases the code needs to be read to find out where execution resumed.  

Once all that information has been tallied decoding of the trace data can begin.  The process is done by feeding the previously recorded trace data offsets to the decoder.  The decoder is an instantiated object provided by the openCSD companion library.  It decodes trace data lumps in steps, calling a user provided callback function with each successful round .

Un-synthesised output will look like this:

mpoirier@t430:~/work/linaro/coresight/bkk16/jun01-kernel$ ../../kernel-cs-pm/tools/perf/perf report –stdio –dump
. … CoreSight ETM Trace data: size 162416 bytes
 0: I_ASYNC : Alignment Synchronisation.
 12: I_TRACE_INFO : Trace Info.
 17: I_TRACE_ON : Trace On.
 18: I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0xFFFFFFC000531720; Ctxt: AArch64,EL1, NS;
 28: I_ATOM_F2 : Atom format 2.; NE
 29: I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0xFFFFFFC000536038;
 39: I_ATOM_F2 : Atom format 2.; EE
 40: I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFFFFC0005366CC ~[0x166CC]
 43: I_ATOM_F1 : Atom format 1.; E
 44: I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFFFFC000531BC0 ~[0x11BC0]
 48: I_ATOM_F3 : Atom format 3.; NEE
 49: I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0xFFFFFFC000531F54 ~[0x11F54]
 52: I_ATOM_F1 : Atom format 1.; E
 53: I_ADDR_L_32IS0 : Address, Long, 32 bit, IS0.; Addr=0x0016BB60;
 58: I_ATOM_F3 : Atom format 3.; NEE
 59: I_ATOM_F3 : Atom format 3.; NNE
 60: I_ATOM_F6 : Atom format 6.; EEEEEE
 61: I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0x0016BBF4 ~[0x1F4]
 64: I_ATOM_F1 : Atom format 1.; E
 65: I_ADDR_S_IS0 : Address, Short, IS0.; Addr=0x0016BD44 ~[0xBD44]
 68: I_ATOM_F3 : Atom format 3.; NNE
 69: I_ATOM_F1 : Atom format 1.; E

This raw trace packet output, ETMv4 in this case, is great for infrastructure debugging but of little value for system troubleshooting scenarios. These packets are further decoded by the OpenCSD library into a set of generic packets, describing core state and instruction ranges executed. The report and script commands will filter the packets they get back from the decoder and the packets related to executed instruction ranges will be accounted for and submitted for synthesis.  In Perf terminology, the synthesis process deals with how decoded and relevant events are presented to users.

When using the report utility packets are synthesises to form a flame graph, where hot spots can be identified quickly:

mpoirier@t430:~/work/linaro/coresight/jun01-user$ perf report –stdio

 

[clip …]


# Children      Self  Command  Shared Object     Symbol                
# ……..  ……..  …….  …………….  ………………….
#
    4.13%     4.13%  uname    libc-2.21.so      [.] 0x0000000000078758
    3.74%     3.74%  uname    libc-2.21.so      [.] 0x0000000000078e50
    2.06%     2.06%  uname    libc-2.21.so      [.] 0x00000000000fcaf4
    1.65%     1.65%  uname    libc-2.21.so      [.] 0x00000000000fcae4
    1.59%     1.59%  uname    ld-2.21.so        [.] 0x000000000000a7f4
    1.50%     1.50%  uname    libc-2.21.so      [.] 0x0000000000078e40
    1.43%     1.43%  uname    libc-2.21.so      [.] 0x00000000000fcac4
    1.31%     1.31%  uname    libc-2.21.so      [.] 0x000000000002f0c0
    1.26%     1.26%  uname    ld-2.21.so        [.] 0x0000000000016888
    1.24%     1.24%  uname    libc-2.21.so      [.] 0x00000000000fcab8
    1.19%     1.19%  uname    ld-2.21.so        [.] 0x0000000000008eb8
    1.18%     1.18%  uname    libc-2.21.so      [.] 0x0000000000078e7c
    1.17%     1.17%  uname    libc-2.21.so      [.] 0x0000000000078778
    1.08%     1.08%  uname    libc-2.21.so      [.] 0x0000000000078e98
    1.04%     1.04%  uname    libc-2.21.so      [.] 0x0000000000072520
    1.04%     1.04%  uname    libc-2.21.so      [.] 0x0000000000078e84
    0.90%     0.90%  uname    libc-2.21.so      [.] 0x0000000000072368
    0.86%     0.86%  uname    libc-2.21.so      [.] 0x00000000000fcac8
    0.83%     0.83%  uname    libc-2.21.so      [.] 0x0000000000071624
    0.81%     0.81%  uname    ld-2.21.so        [.] 0x00000000000084b4
    0.80%     0.80%  uname    libc-2.21.so      [.] 0x0000000000074900
    0.80%     0.80%  uname    libc-2.21.so      [.] 0x00000000000726c0
    0.79%     0.79%  uname    libc-2.21.so      [.] 0x0000000000078e54
    0.79%     0.79%  uname    libc-2.21.so      [.] 0x00000000000728d0
    0.75%     0.75%  uname    libc-2.21.so      [.] 0x0000000000078e74

The above shows that 4.13% of all the instruction ranges started in library libc-2.21.so at address 0x0000000000078758.  Using the source code, the DSO file and an objdump utility it is possible to quickly identify the function that was referenced.  It is important to keep in mind that flame graphs are generated using the entry point only.  Nothing can be deduced about the path through the code that was taken after that.

From more accurate results it is suggested to work with the script command where a user supplied script can take advantage of all the information conveyed by synthesised events by way of the perf_sample structure.  An example is the cs-trace-disasm.py script produced by Linaro:

FILE: /lib/aarch64-linux-gnu/ld-2.21.so CPU: 0
         7f9175cd80:   910003e0        mov     x0, sp
         7f9175cd84:   94000d53        bl      7f917602d0 <free@plt+0x3790>
FILE: /lib/aarch64-linux-gnu/ld-2.21.so CPU: 0
         7f917602d0:   d11203ff        sub     sp, sp, #0x480
         7f917602d4:   a9ba7bfd        stp     x29, x30, [sp,#-96]!
         7f917602d8:   910003fd        mov     x29, sp
         7f917602dc:   a90363f7        stp     x23, x24, [sp,#48]
         7f917602e0:   9101e3b7        add     x23, x29, #0x78
         7f917602e4:   a90573fb        stp     x27, x28, [sp,#80]
         7f917602e8:   a90153f3        stp     x19, x20, [sp,#16]
         7f917602ec:   aa0003fb        mov     x27, x0
         7f917602f0:   910a82e1        add     x1, x23, #0x2a0
         7f917602f4:   a9025bf5        stp     x21, x22, [sp,#32]
         7f917602f8:   a9046bf9        stp     x25, x26, [sp,#64]
         7f917602fc:   910102e0        add     x0, x23, #0x40
         7f91760300:   f800841f        str     xzr, [x0],#8
         7f91760304:   eb01001f        cmp     x0, x1
         7f91760308:   54ffffc1        b.ne    7f91760300 <free@plt+0x37c0>
FILE: /lib/aarch64-linux-gnu/ld-2.21.so CPU: 0
         7f91760300:   f800841f        str     xzr, [x0],#8
         7f91760304:   eb01001f        cmp     x0, x1
         7f91760308:   54ffffc1        b.ne    7f91760300 <free@plt+0x37c0>
FILE: /lib/aarch64-linux-gnu/ld-2.21.so CPU: 0
         7f91760300:   f800841f        str     xzr, [x0],#8
         7f91760304:   eb01001f        cmp     x0, x1
         7f91760308:   54ffffc1        b.ne    7f91760300 <free@plt+0x37c0>
FILE: /lib/aarch64-linux-gnu/ld-2.21.so CPU: 0

Here we can see exactly the path a processor took through the code.  The first field is the address in the DSO, the second the OPcode as found in the DSO at that specific address while the remaining of the line depicts an assembly language representation of the instructions as provided by objdump.  Instructions on how to setup an environment capable of producing the above output can be found on the openCSD website.

Conclusion

In this post we presented the main elements used to integrate the CoreSight framework with the Linux Perf core.  In kernel space CoreSight tracer configuration and control functions are folded in the PMU interface, allowing the Perf core to control trace generation the same way it does with any other system monitoring metrics.  In user space the very valuable metadata, along with trace session blobs, are extracted from the perf.data file and submitted to the decoder for packet extraction.  Different synthesis methods are offered depending on the level of details needed, i.e the popular flame graph is generated using perf report command while more detailed analysis can be rendered by python or perl scripts.

An upcoming post on this blog will feature the OpenCSD library in detail.  It will introduce the different components, how these are used to decode trace,  and the C++ and C APIs allowing integration with various standalone programs.  The library example and test programs, which demonstrate using the library will also be presented.

Recommended Posts