Back to basics – gprof PT1 (generating a profile)

A while ago at work we were discussing the pros and cons of the classic Linux profiler, gprof.

What are profiles useful for?

  • Profiles allow you to observe what a program is doing at runtime without runtime access to the source code (although source is needed to build a profile-enabled binary).
  • A performance regression or improvement in a given program will often show up in a profile as a change in either duration or number of calls to some set of functions that changed between software builds.

The output of a profile run against a very simple program looks like the example below.

                     Call graph (explanation follows)


granularity: each sample hit covers 4 byte(s) for 0.17% of 5.82 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]    100.0    2.38    3.44                 main [1]
                0.88    0.55 666666667/666666667     calcme2 [2]
                0.73    0.28 333333333/333333333     calcme1 [3]
                0.99    0.00 1000000000/1000000000     increment [4]
-----------------------------------------------
                0.88    0.55 666666667/666666667     main [1]
[2]     24.6    0.88    0.55 666666667         calcme2 [2]
                0.55    0.00 666666667/1000000000     return3 [5]
-----------------------------------------------
                0.73    0.28 333333333/333333333     main [1]
[3]     17.3    0.73    0.28 333333333         calcme1 [3]
                0.28    0.00 333333333/1000000000     return3 [5]
-----------------------------------------------
                0.99    0.00 1000000000/1000000000     main [1]
[4]     17.1    0.99    0.00 1000000000         increment [4]
-----------------------------------------------
                0.28    0.00 333333333/1000000000     calcme1 [3]
                0.55    0.00 666666667/1000000000     calcme2 [2]
[5]     14.3    0.83    0.00 1000000000         return3 [5]
-----------------------------------------------

It tells us that of the 5.82 seconds of profiled runtime:

  • 2.38 seconds were spent in main() itself, and
  • 3.44 seconds in functions called by main() (either directly or indirectly).

It also tells us that the function return3() was called 1,000,000,000 times:

  • 1/3 of the time by calcme1() and
  • 2/3 by calcme2().
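To make those numbers concrete, a toy program with the same call structure might look like the sketch below. The actual source isn't shown in this post, so the function bodies here are assumptions; only the call pattern is made to match the profile above.

/* calc.c – hypothetical sketch matching the call counts in the
 * profile above; the function bodies are assumptions. */
static volatile long total;   /* volatile so the work isn't optimized away */

static int  return3(void)   { return 3; }
static void increment(void) { total += 1; }
static void calcme1(void)   { total += return3(); }
static void calcme2(void)   { total -= return3(); }

int main(void)
{
    for (long i = 1; i <= 1000000000L; i++) {
        increment();          /* 1,000,000,000 calls                 */
        if (i % 3 == 0)
            calcme1();        /* 333,333,333 calls (1/3 of the loop) */
        else
            calcme2();        /* 666,666,667 calls (2/3 of the loop) */
    }
    return 0;
}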

Profiles can give a good overview of where time is spent – albeit with some overhead.

In the profile output, the number of calls is exact but the time spent is the result of sampling. The relative time spent in each function should therefore be correct, but the absolute times will normally be much lower than the runtime/execution time reported by e.g. running $ time <some-binary>.

Generating a profile for gprof to use

There are three pieces to using a profile:

  • ask gcc (via the -pg flag) to insert additional code (a call to the profiling hook __mcount) into the compiled binary
  • let the profiling runtime, set up at program start by __monstartup, sample where the program counter (PC) is at regular intervals while the program runs
  • use the gprof tool to process the gmon.out file the program writes when it exits.
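Putting those pieces together, a minimal session might look like this (using the hypothetical calc.c sketched earlier; gmon.out appears in the current directory when the program exits normally):

gary@linux:~/git/unixfun$ gcc -pg -o calc calc.c
gary@linux:~/git/unixfun$ ./calc
gary@linux:~/git/unixfun$ gprof calc gmon.out > profile.txt

The call graph at the top of this post is one section of that output (gprof also prints a flat profile by default).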

There is no need for any special hardware support – profiles run fine in a VM, container, etc., whereas some other tools need direct access to CPU performance counters (e.g. via pass-through) or require root privileges.

Continue reading

Linux memory monitoring (allocations vs usage)

How to use some of Linux’s standard tools and how different types of memory usage show up.

Examples of using malloc and writing to memory, with three use-cases for a simple process.

In each case we run the example with a 64MB allocation so that we can see the usage from standard Linux tools.

We do something like this:

gary@linux:~/git/unixfun$ ./malloc_and_write 65536
Allocating 65536 KB
Allocating 67108864 bytes
The address of your memory is 0x7fa2829ff010
Hit <return> to exit
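
For reference, a sketch of what such a test program might look like is below. The actual malloc_and_write source isn't shown here, so the details are assumptions based on the output above; this version covers the allocate-then-write use-case.

/* malloc_and_write.c – hypothetical sketch based on the output above.
 * Takes the allocation size in KB as its argument. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    size_t kb    = (argc > 1) ? strtoul(argv[1], NULL, 10) : 65536;
    size_t bytes = kb * 1024;

    printf("Allocating %zu KB\n", kb);
    printf("Allocating %zu bytes\n", bytes);

    char *p = malloc(bytes);   /* allocation alone only grows virtual size */
    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    printf("The address of your memory is %p\n", (void *)p);

    memset(p, 1, bytes);       /* writing faults the pages in, so RSS grows */

    printf("Hit <return> to exit\n");
    getchar();                 /* pause here to inspect with ps, top, etc. */

    free(p);
    return 0;
}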
Continue reading

Using iperf multi-stream may not work as expected

Running iperf with parallel threads

TL;DR – when running iperf with parallel threads/workers, the -P option must be specified after the -c <target-IP> option. This is mentioned in the manpage, but some options (-t for instance) work in any order, while others (specifically -P for parallel threads) definitely do not, which is a bit confusing.

For example, these two invocations of iperf give very different results:

  • iperf -P 5 -c 10.56.68.97 (the -P before the -c) – yields 20.4 Gbits/sec
  • iperf -c 10.56.68.97 -P 5 (the -P after the -c) – yields 78.3 Gbits/sec

Continue reading

mpstat has an option to show utilization per NUMA node

Not sure how long this has been a thing, but I recently discovered that mpstat takes a -N option for “NUMA node” that works in the same way as -P does for “processor”. For example, $ mpstat -N 0,1 1 will show stats for NUMA nodes 0 and 1 every second. Just as mpstat -P ALL shows all processors, mpstat -N ALL shows all NUMA nodes (and is easier to type).

The output looks like this:

05:09:17 PM NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
05:09:18 PM    0    1.13    0.00    9.30    0.00    0.28    0.15    0.00   31.78    0.00   57.21
05:09:18 PM    1    0.40    0.00    8.03    0.00    0.28    1.05    0.00   31.34    0.00   58.78
^C

Average:    NODE    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:       0    0.80    0.00    8.56    0.00    0.27    0.11    0.00   36.49    0.00   53.78
Average:       1    0.49    0.00   10.02    0.00    0.32    1.01    0.00   25.13    0.00   63.03

Using mpstat -N it is quite easy to see how the workload is distributed among the NUMA nodes of a multi-socket machine.