Understanding Linux Perf sched-switch and context-switches

5,153

I know this question is pretty old (Feb 16) but here a response in case it helps someone else. The problem is that you've entered the '-F 999' indicating that you want to sample the events at a frequency of 999 times a second. For 'trace' events, you don't generally want to do sampling. For instance, when I select sched:sched_switch, I want to see every context switch. If you enter -F 999 then you will get a sampling of the context switches... If you look at the output of your 'perf record' cmd with something like:

perf script --verbose -I --header -i perf.dat -F comm,pid,tid,cpu,time,period,event,trace,ip,sym,dso > perf.txt

then you would see that the 'period' (the number between the timestamp and the event name) would not (usually) be == 1.

If you use a 'perf record' cmd like below, you'll see a period of 1 in the 'perf script' output like:

Binder:695_5   695/2077  [000] 16231.700440:          1         sched:sched_switch: prev_comm=Binder:695_5 prev_pid=2077 prev_prio=120 prev_state=S ==> next_comm=kworker/u16:17 next_pid=7665 next_prio=120

A long winded explanation but basically: don't do that (where 'that' is '-F 999').

If you just do something like:

perf record -a -g -e sched:sched_switch -e sched:sched_blocked_reason -e sched:sched_stat_sleep -e sched:sched_stat_wait sleep 5

then the output would show every context switch with the call stack for each event. And you might need to do:

echo 1 > /proc/sys/kernel/sched_schedstats

to get the sched_stat events.

Share:
5,153

Related videos on Youtube

user6256
Author by

user6256

Updated on September 18, 2022

Comments

  • user6256
    user6256 over 1 year

    I've been trying to enable context switch events on perf and use perf script's dump from perf.data to investigate thread blocked time.

    So far the only two recording options that seem to be helpful are context switch and all the sched events.

    Here's the command I'm running on perf:

    perf record -g -a -F 999 -e cpu-clock,sched:sched_stat_sleep,sched:sched_switch,sched:sched_process_exit,context-switches
    

    However, both seem to be incomplete, usually a sched_switch event looks something like this:

    comm1   0/0     [000] 0.0:          1 sched:sched_switch: prev_comm=comm1 prev_pid=0 prev_prio=0 prev_state=S ==> next_comm=comm2 next_pid=1 next_prio=1
        stacktrace...
    

    From my understanding, the prev_comm is always the thread that is going to be blocked, and the next_comm is the thread that is going to be unblocked. Is this a correct assumption? If it is, I can't seem to get complete data on the events since there are many threads that get blocked on prev_comm, but never seem to get a corresponding next_comm.

    Enabling context switches doesn't seem to do much since there is no information on the thread being blocked or unblocked (unless I'm completely missing something, in which I would appreciate an explanation on how they work).

    Here's how a typical context switch event looks like:

    comm1   0/0     [000] 0.0:          1 context-switch:
        stacktrace...
    

    tl;dr, how can I do blocked time investigations on linux through perf script's output and what options need to be enabled on perf record?

    Thanks.

  • maxschlepzig
    maxschlepzig over 5 years
    Perf fails with No permissions to read /sys/kernel/debug/tracing/events/sched/sched_stat_sleep - on a Fedora 27 system - doing echo 1 > /proc/sys/kernel/sched_schedstats and mount -o remount,mode=755 /sys/kernel/debug doesn't help - any idea who to enable this for a normal user?
  • maxschlepzig
    maxschlepzig over 5 years
    Hm, after a chmod 755 /sys/kernel/debug/tracing/ and echo -1 > /proc/sys/kernel/perf_event_paranoid it works for a normal user - perhaps the chmod is superfluous when you decrease the paranoid level.