Monday 24 November 2014

Measuring stalled instructions with perf stat

Recently I was playing around with CPU loading and was trying to estimate the number of compute operations being executed on my machine.  In particular, I was interested in how many instructions per cycle and how many stall cycles I was hitting on the more demanding instructions.  Fortunately, perf stat allows one to gather detailed processor statistics to measure this.

In my first test, I wanted to see how the Intel rdrand instruction performed with 2 CPUs loaded, each also running a hyper-thread (hence 4 stressor instances):

$ perf stat stress-ng --rdrand 4 -t 60 --times
stress-ng: info: [7762] dispatching hogs: 4 rdrand
stress-ng: info: [7762] successful run completed in 60.00s
stress-ng: info: [7762] for a 60.00s run time:
stress-ng: info: [7762]     240.01s available CPU time
stress-ng: info: [7762]     231.05s user time   ( 96.27%)
stress-ng: info: [7762]       0.11s system time (  0.05%)
stress-ng: info: [7762]     231.16s total time  ( 96.31%)

 Performance counter stats for 'stress-ng --rdrand 4 -t 60 --times':

     231161.945062      task-clock (msec)         #    3.852 CPUs utilized          
            18,450      context-switches          #    0.080 K/sec                  
                92      cpu-migrations            #    0.000 K/sec                  
               821      page-faults               #    0.004 K/sec                  
   667,745,260,420      cycles                    #    2.889 GHz                    
   646,960,295,083      stalled-cycles-frontend   #   96.89% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
    13,702,533,103      instructions              #    0.02  insns per cycle        
                                                  #   47.21  stalled cycles per insn
     6,549,840,185      branches                  #   28.334 M/sec                  
         2,352,175      branch-misses             #    0.04% of all branches        

      60.006455711 seconds time elapsed

stress-ng's rdrand test performs a 64 bit rdrand read and loops on this until the data is ready, repeating the sequence 32 times in an unrolled loop.  Perf stat shows that each rdrand + loop sequence consumes about 47 stall cycles on average, suggesting that rdrand is probably just waiting for the random number generator block to produce random data.

My next experiment was to run the stress-ng ackermann stressor; this performs a lot of recursion, hence one should see a large amount of branching.

$ perf stat stress-ng --cpu 4 --cpu-method ackermann -t 60 --times
stress-ng: info: [7796] dispatching hogs: 4 cpu
stress-ng: info: [7796] successful run completed in 60.03s
stress-ng: info: [7796] for a 60.03s run time:
stress-ng: info: [7796]     240.12s available CPU time
stress-ng: info: [7796]     226.69s user time   ( 94.41%)
stress-ng: info: [7796]       0.26s system time (  0.11%)
stress-ng: info: [7796]     226.95s total time  ( 94.52%)

 Performance counter stats for 'stress-ng --cpu 4 --cpu-method ackermann -t 60 --times':

     226928.278602      task-clock (msec)         #    3.780 CPUs utilized          
            21,752      context-switches          #    0.096 K/sec                  
               127      cpu-migrations            #    0.001 K/sec                  
               927      page-faults               #    0.004 K/sec                  
   594,117,596,619      cycles                    #    2.618 GHz                    
   298,809,437,018      stalled-cycles-frontend   #   50.29% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
   845,746,011,976      instructions              #    1.42  insns per cycle        
                                                  #    0.35  stalled cycles per insn
   298,414,546,095      branches                  # 1315.017 M/sec                  
        95,739,331      branch-misses             #    0.03% of all branches        

      60.032115099 seconds time elapsed

..so about 35% of the instructions executed are branches, and we're getting about 1.42 instructions per cycle with very few stall cycles per instruction, so the code is most probably executing from the instruction cache, which isn't surprising because the test is rather small.

My final experiment was to measure the stall cycles when performing complex long double floating point math operations, again with stress-ng.

$ perf stat stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times
stress-ng: info: [7854] dispatching hogs: 4 cpu
stress-ng: info: [7854] successful run completed in 60.00s
stress-ng: info: [7854] for a 60.00s run time:
stress-ng: info: [7854]     240.00s available CPU time
stress-ng: info: [7854]     225.15s user time   ( 93.81%)
stress-ng: info: [7854]       0.44s system time (  0.18%)
stress-ng: info: [7854]     225.59s total time  ( 93.99%)

 Performance counter stats for 'stress-ng --cpu 4 --cpu-method clongdouble -t 60 --times':

     225578.329426      task-clock (msec)         #    3.757 CPUs utilized          
            38,443      context-switches          #    0.170 K/sec                  
                96      cpu-migrations            #    0.000 K/sec                  
               845      page-faults               #    0.004 K/sec                  
   651,620,307,394      cycles                    #    2.889 GHz                    
   521,346,311,902      stalled-cycles-frontend   #   80.01% frontend cycles idle   
   <not supported>      stalled-cycles-backend   
    17,079,721,567      instructions              #    0.03  insns per cycle        
                                                  #   30.52  stalled cycles per insn
     2,903,757,437      branches                  #   12.873 M/sec                  
        52,844,177      branch-misses             #    1.82% of all branches        

      60.048819970 seconds time elapsed

The complex math operations take some time to complete, stalling on average over 30 cycles per instruction.  Instead of using 4 concurrent processes, I re-ran this using just the 2 CPUs, eliminating 2 of the hyper-threads.  This resulted in 25.4 stall cycles per instruction, showing that the hyper-threaded processes were stalling because of contention on the shared floating point units.

Perf stat is an incredibly useful tool for examining performance issues at a very low level.  It is simple to use and yet provides excellent statistics that allow one to identify issues and fine-tune performance-critical code.  Well worth using.

Saturday 8 November 2014

Systems Performance: Enterprise and the Cloud

Before I started some analysis benchmarking various popular file systems on Linux, I was recommended to read "Systems Performance: Enterprise and the Cloud" by Brendan Gregg.

In today's modern server and cloud based systems, the multi-layered complexity can make it hard to pinpoint performance issues and bottlenecks. This book is packed full of useful analysis techniques covering tracing, kernel internals, tools and benchmarking.

Getting a well balanced and tuned system depends on all its different components, and the book has chapters covering CPU optimisation (cores, threading, caching and interconnects), memory optimisation (virtual memory, paging, swapping, allocators, buses), file system I/O, storage, networking (protocols, sockets, physical connections) and typical issues facing cloud computing.

The book is full of very useful examples and practical instructions on how to drill down and discover performance issues in a system and also includes some real-world case studies too.

It has helped me become even more focused on how to analyse performance issues and consider how to do deep system instrumentation to understand where and why performance regressions occur.

All-in-all, a most systematic and well written book that I'd recommend to anyone running large complex servers and cloud computing environments.