Instruction Based Sampling

Instruction Based Sampling (IBS)

Overview

 Instruction Based Sampling is a performance observability feature available as of AMD family 0x10 processors (e.g. Barcelona). Many modern processors offer performance counters for counting performance-relevant events, but this data often lacks the specificity needed to gain an accurate understanding of performance (or the lack thereof). For example, many performance counter facilities can count memory references, but the counts do not show which memory is being accessed.

 In many ways, AMD's Instruction Based Sampling facility bridges this gap. It works by periodically sampling instructions (or instruction ops) from an instruction stream (program execution). Detailed information about the sampled instruction/op is then collected as it makes its way through the pipeline. The information is then made available through the IBS facility.

 IBS provides the performance analyst with a mechanism for effectively observing:

  • Virtual / physical memory access patterns and utilization
  • Cache / TLB utilization
  • Instruction fetch / execution latencies
  • Branch prediction effectiveness
  • ...and more.

 IBS is described in Appendix G of the Software Optimization Guide for AMD Family 10h Processors. This article provides an example of how IBS can be used (with matrix multiplication as the example).

IBS Dynamic Tracing (DTrace) Provider

 A prototype DTrace provider has been developed that allows one to interface with the IBS feature through DTrace. The provider exports a set of ibs DTrace probes that (when enabled) fire after IBS samples an instruction / op.

 The information IBS provides about the sampled op/instruction is available both in the body of the DTrace probe, as well as the probe's predicate. DTrace allows one to easily build predicates to filter for the performance events of interest, and its data aggregation features provide a powerful mechanism for managing, analyzing, and visualizing the stream of performance data the IBS feature provides.

Status

 A fairly full featured prototype is available.

Using the provider

 The purpose of the IBS DTrace provider is to provide convenient access to the IBS functionality. Currently the provider offers two kinds of probes:

  • ibs-fetch-x: For programming the fetch control. The x in the probe name is the sampling interval, expressed as a number of instruction fetches, after which IBS picks an instruction and records data about it. Currently x may be any value between 500 and 65535.
  • ibs-exec-x: For programming the execution control. The x in the probe name is the sampling interval, expressed as a number of executed micro-ops, after which IBS picks a micro-op and records data about it. Currently x may be any value between 500 and 65535.
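 The x in a probe name is not the raw hardware period: it supplies bits [4:19] of a 20-bit counter whose low four bits are zero. A minimal sketch of that scaling is shown here; ibs_effective_period() and the IBS_X_MIN / IBS_X_MAX names are hypothetical helpers for illustration and are not part of the provider.

```c
#include <stdint.h>

#define	IBS_X_MIN	500u	/* lowest interval the provider accepts */
#define	IBS_X_MAX	65535u	/* highest interval (fits in 16 bits) */

/*
 * Hypothetical helper (not part of the provider): compute the effective
 * sampling period for ibs-fetch-x / ibs-exec-x.
 * Returns 0 and stores the period, or -1 if x is outside the accepted range.
 */
static inline int
ibs_effective_period(uint32_t x, uint32_t *period)
{
	if (x < IBS_X_MIN || x > IBS_X_MAX)
		return (-1);
	/* x occupies bits [4:19] of the count; bits [0:3] are zero */
	*period = x << 4;
	return (0);
}
```

 With this sketch, x = 1000 yields a period of 16000, and the largest accepted value, x = 65535, yields 1048560.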

 Note: The x in the probe name goes into bits [4:19] of the 20-bit count of instruction fetches/micro-ops executed (bits [0:3] are always 0). The actual number of instruction fetches/micro-ops executed before IBS selects one for recording is therefore 16 times x. For instance, x = 1000 corresponds to 16000 instruction fetches/micro-ops executed.

 When the probe fires, the recorded data is returned in a data structure as args[0]. The data structures are defined as follows:


#define	IBS_REG_BITFIELD(name, ...)			\
	union {						\
		uint64_t reg;				\
		struct {				\
			uint64_t __VA_ARGS__;		\
		} bit;					\
	} name

struct ibs_fetch_data {
	uint64_t cpu_id;

	IBS_REG_BITFIELD(IbsFetchCtl,
	    IbsFetchMaxCnt:16,
	    IbsFetchCnt:16,
	    IbsFetchLat:16,
	    IbsFetchEn:1,
	    IbsFetchVal:1,
	    IbsFetchComp:1,
	    IbsIcMiss:1,
	    IbsPhyAddrValid:1,
	    IbsL1TlbPgSz:2,
	    IbsL1TlbMiss:1,
	    IbsL2TlbMiss:1,
	    IbsRandEn:1,
	    IbsReserved:6);

	uint64_t IbsFetchLinAd;
	uint64_t IbsFetchPhysAd;
};

struct ibs_exec_data {
	uint64_t cpu_id;

	uint64_t IbsOpRip;

	IBS_REG_BITFIELD(IbsOpData,
	    IbsCompToRetCtr:16,
	    IbsTagToRetCtr:16,
	    IbsOpBrnResync:1,
	    IbsOpMispReturn:1,
	    IbsOpReturn:1,
	    IbsOpBrnTaken:1,
	    IbsOpBrnMisp:1,
	    IbsOpBrnRet:1,
	    reserved:26);

	IBS_REG_BITFIELD(IbsOpData2,
	    NbIbsReqSrc:3,
	    reserved:1,
	    NbIbsReqDstProc:1,
	    NbIbsReqCacheHitSt:1,
	    reserved2:58);

	IBS_REG_BITFIELD(IbsOpData3,
	    IbsLdOp:1,
	    IbsStOp:1,
	    IbsDcL1tlbMiss:1,
	    IbsDcL2tlbMiss:1,
	    IbsDcL1tlbHit2M:1,
	    IbsDcL1tlbHit1G:1,
	    IbsDcL2tlbHit2M:1,
	    IbsDcMiss:1,
	    IbsDcMisAcc:1,
	    IbsDcLdBnkCon:1,
	    IbsDcStBnkCon:1,
	    IbsDcStToLdFwd:1,
	    IbsDcStToLdCan:1,
	    IbsDcUcMemAcc:1,
	    IbsDcWcMemAcc:1,
	    IbsDcLockedOp:1,
	    IbsDcMabHit:1,
	    IbsDcLinAddrValid:1,
	    IbsDcPhyAddrValid:1,
	    IbsDcL2tlbHit1G:1,
	    reserved:12,
	    IbsDcMissLat:16,
	    reserved2:16);

	uint64_t IbsDcLinAd;
	uint64_t IbsDcPhysAd;
};


 The names of the fields correspond to the register/bitfield names described in the family 0x10 BKDG. For bitfields, a union is used to simplify access to the individual bits: the reg member holds the raw 64-bit register value, and the bit member exposes the named fields. For more details, refer to the Family 10h optimization guide (above).
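 To illustrate how the union gives two views of the same register, the following sketch reuses the IBS_REG_BITFIELD macro with the IbsOpData2 layout from above. The wrapper struct ibs_op_data2_demo and the function ibs_decode_op_data2() are hypothetical names used only for this example. It also assumes the compiler allocates bit-fields starting at the least significant bit, as compilers for x86 do; the C standard leaves that order implementation-defined.

```c
#include <stdint.h>

#define	IBS_REG_BITFIELD(name, ...)			\
	union {						\
		uint64_t reg;				\
		struct {				\
			uint64_t __VA_ARGS__;		\
		} bit;					\
	} name

/* Hypothetical wrapper holding only the IbsOpData2 register. */
struct ibs_op_data2_demo {
	IBS_REG_BITFIELD(IbsOpData2,
	    NbIbsReqSrc:3,
	    reserved:1,
	    NbIbsReqDstProc:1,
	    NbIbsReqCacheHitSt:1,
	    reserved2:58);
};

/*
 * Store a raw register value through the reg member; the named fields
 * can then be read through the bit member of the same union.
 */
static inline struct ibs_op_data2_demo
ibs_decode_op_data2(uint64_t raw)
{
	struct ibs_op_data2_demo d;

	d.IbsOpData2.reg = raw;
	return (d);
}
```

 Under the stated assumption, decoding the raw value 0x23 gives NbIbsReqSrc = 3 (bits 0 to 2), NbIbsReqDstProc = 0 (bit 4) and NbIbsReqCacheHitSt = 1 (bit 5). In a D script the same fields are reached as args[0]->IbsOpData2.bit.NbIbsReqSrc and so on.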

Sample D scripts

 The following simple script sums the data cache (dcache) misses observed for each executable. The result is not a precise total, because IBS records only the sampled micro-ops rather than accounting for every instruction or micro-op. It still gives a reasonable indication of how each executable is doing in terms of cache misses.


#!/usr/sbin/dtrace -s

#pragma D option quiet

ibs-exec-2000
{
        @exec[execname] = sum(args[0]->IbsOpData3.bit.IbsDcMiss);
}

END
{
        printf("\nDcache misses per exec:\n");
        printa(@exec);
}

 The following script adds more functionality and observes only an executable called "memtest":


#!/usr/sbin/dtrace -s

#pragma D option quiet

ibs:::ibs-fetch-500
/execname == "memtest"/
{
        @fetch[execname] = sum(args[0]->IbsFetchCtl.bit.IbsL2TlbMiss);
}

ibs-exec-1000
/execname == "memtest"/
{
        @exec[execname, args[0]->cpu_id] = sum(args[0]->IbsOpData3.bit.IbsDcMiss);
}

ibs-exec-1000
/execname == "memtest" && args[0]->IbsOpData3.bit.IbsDcMiss == 1 && args[0]->IbsOpData3.bit.IbsDcLinAddrValid == 1/
{
        @linadr[args[0]->IbsDcLinAd] = count();
}

END
{
        printf("\nNumber of L2 TLB misses:\n");
        printa(@fetch);
        printf("\nDcache misses per core:\n");
        printa(@exec);
        trunc(@linadr, 10);
        printf("\nTop 10 VA that caused dcache misses:\n");
        printa("%16x   %@10d\n", @linadr);
}

Limitations and Known Issues

  • The IBS module depends on the dtrace and pcplusmp modules. Make sure they are loaded on the system.
  • Only one period can be programmed at a time for each probe type. For instance, ibs-fetch-2000 and ibs-fetch-5000 cannot be used together. The fetch and execution probes can use different periods, though, as shown in the sample scripts.
  • Ideally the execution probe should not be programmed with a period below 1000, since shorter periods cause a tremendous number of interrupts (remember that the execution unit counts micro-ops, whose count goes up much faster than that of instructions). In our experiments with the second sample script, ibs-exec-1000 slowed down "memtest" (the executable being observed) by around 40%.
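 To get a rough feel for the interrupt load behind the last point, the sample rate can be estimated by dividing the rate of counted events by the effective period (16 times x). The sketch below is a back-of-the-envelope helper only: ibs_samples_per_sec() is a hypothetical name, and the event rate passed to it is an assumed figure, not a measurement.

```c
#include <stdint.h>

/*
 * Hypothetical estimate (not part of the provider): number of IBS samples,
 * and therefore interrupts, per second on one core, given an assumed rate
 * of counted events per second and the x from the probe name.
 */
static inline uint64_t
ibs_samples_per_sec(uint64_t events_per_sec, uint32_t x)
{
	uint64_t period = (uint64_t)x << 4;	/* x fills bits [4:19] */

	if (period == 0)
		return (0);	/* avoid dividing by zero for x == 0 */
	return (events_per_sec / period);
}
```

 For an assumed rate of one billion counted events per second, x = 1000 gives about 62500 samples per second, and halving x to 500 doubles that to about 125000.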

IBS DTrace Provider Source Repository

  • ibs-gate: Anonymous pull is allowed. You must either be a leader of this project or a committer for the ibs-gate repository to push. Please read these instructions on how to use Mercurial repositories. The repository can also be browsed using the OpenSolaris Source Browser.
    Gate Status: The repository is synced against build 131.
    Closed Binaries tarballs: Build 131 closed bins tarballs (needed for nightly(1)) can be downloaded here.
    To clone from the ibs-gate repository
    $ hg clone ssh://your-login@hg.opensolaris.org/hg/amd/ibs-gate

 For help with using Mercurial, or the ON tools, you can:

    $ cd ibs
    $ /opt/onbld/bin/bldenv -d /opt/onbld/bin/opensolaris.sh 
    $ cd usr/src/tools
    $ dmake install
    $ cd $CODEMGR_WS/usr/src/uts
    $ dmake install
  • To create a kernel tarball to install (x86)...
    $ /opt/onbld/bin/Install -G my_ibs_kernel -k i86pc
  • To build BFU archives, you need to get (and extract) the "closed bins" tarball(s) into your workspace. See above for current pointers (you must use versions appropriate for the build of onnv against which your repo is synced).
    $ cd ibs
    $ tar xf on-closed-bins.i386.tar
    $ /opt/onbld/bin/nightly /opt/onbld/bin/opensolaris.sh

IBS DTrace Provider Standalone Source Package

 Alternatively, you can use a standalone source package that contains just the files necessary to build the IBS provider:


$ gzcat dtrace-ibs.tar.gz | tar xf -
$ cd dtrace-ibs
$ make
$ make install
$ add_drv ibs

File                Last updated        Solaris versions supported
dtrace-ibs.tar.gz   2010-01-28 18:20    build 131 and later

IBS DTrace Provider Binary Package

 To ease testing of the provider, preliminary binary packages are available for download. These packages contain the IBS provider module and a special devfsadm link module that creates the device link in the /dev filesystem. To install one of these packages, extract the tarball into a directory and use pkgadd(1M) to add it:


$ cd /tmp
$ mkdir ibs
$ cd ibs
$ gzcat SUNWibs.tar.gz | tar xf - 
$ pkgadd -d .

File                Last updated        Solaris versions supported
SUNWibs.tar.gz      2010-01-14 17:30    build 130 and later
Created by admin on 2009/10/26 12:11
Last modified by hrosenfe on 2010/01/28 17:20