Understanding Linux Perf sched-switch and context-switches
I know this question is pretty old (Feb 16) but here a response in case it helps someone else. The problem is that you've entered the '-F 999' indicating that you want to sample the events at a frequency of 999 times a second. For 'trace' events, you don't generally want to do sampling. For instance, when I select sched:sched_switch, I want to see every context switch. If you enter -F 999 then you will get a sampling of the context switches... If you look at the output of your 'perf record' cmd with something like:
perf script --verbose -I --header -i perf.dat -F comm,pid,tid,cpu,time,period,event,trace,ip,sym,dso > perf.txt
then you would see that the 'period' (the number between the timestamp and the event name) would not (usually) be == 1.
If you use a 'perf record' cmd like below, you'll see a period of 1 in the 'perf script' output like:
Binder:695_5 695/2077 [000] 16231.700440: 1 sched:sched_switch: prev_comm=Binder:695_5 prev_pid=2077 prev_prio=120 prev_state=S ==> next_comm=kworker/u16:17 next_pid=7665 next_prio=120
A long winded explanation but basically: don't do that (where 'that' is '-F 999').
If you just do something like:
perf record -a -g -e sched:sched_switch -e sched:sched_blocked_reason -e sched:sched_stat_sleep -e sched:sched_stat_wait sleep 5
then the output would show every context switch with the call stack for each event. And you might need to do:
echo 1 > /proc/sys/kernel/sched_schedstats
to get the sched_stat events.
Related videos on Youtube
user6256
Updated on September 18, 2022Comments
-
user6256 over 1 year
I've been trying to enable context switch events on perf and use perf script's dump from perf.data to investigate thread blocked time.
So far the only two recording options that seem to be helpful are context switch and all the sched events.
Here's the command I'm running on perf:
perf record -g -a -F 999 -e cpu-clock,sched:sched_stat_sleep,sched:sched_switch,sched:sched_process_exit,context-switches
However, both seem to be incomplete, usually a sched_switch event looks something like this:
comm1 0/0 [000] 0.0: 1 sched:sched_switch: prev_comm=comm1 prev_pid=0 prev_prio=0 prev_state=S ==> next_comm=comm2 next_pid=1 next_prio=1 stacktrace...
From my understanding, the prev_comm is always the thread that is going to be blocked, and the next_comm is the thread that is going to be unblocked. Is this a correct assumption? If it is, I can't seem to get complete data on the events since there are many threads that get blocked on prev_comm, but never seem to get a corresponding next_comm.
Enabling context switches doesn't seem to do much since there is no information on the thread being blocked or unblocked (unless I'm completely missing something, in which I would appreciate an explanation on how they work).
Here's how a typical context switch event looks like:
comm1 0/0 [000] 0.0: 1 context-switch: stacktrace...
tl;dr, how can I do blocked time investigations on linux through perf script's output and what options need to be enabled on perf record?
Thanks.
-
maxschlepzig over 5 yearsPerf fails with
No permissions to read /sys/kernel/debug/tracing/events/sched/sched_stat_sleep
- on a Fedora 27 system - doingecho 1 > /proc/sys/kernel/sched_schedstats
andmount -o remount,mode=755 /sys/kernel/debug
doesn't help - any idea who to enable this for a normal user? -
maxschlepzig over 5 yearsHm, after a
chmod 755 /sys/kernel/debug/tracing/
andecho -1 > /proc/sys/kernel/perf_event_paranoid
it works for a normal user - perhaps the chmod is superfluous when you decrease the paranoid level.