[LWN Logo]

From: torvalds@transmeta.com (Linus Torvalds)
Subject: Re: Thread creation/switch times on Linux and NT (was Re: Linux users working at Microsoft!)
Date: 8 Mar 1998 01:31:03 GMT

In article <slrn6g3jpd.85.galexand@sietch.sietch.bloomington.in.us>,
Greg Alexander <galexand@sietch.sietch.bloomington.in.us> wrote:
>My biggest suggestion is to try kernel profiling.  Check if any notable
>amount of time is actually spent in goodness before worrying about changing
>it.

Also, check out 2.1.x - there are some changes to various details of the
scheduler that were brought on by the finer locking granularity, but
that were sometimes also related to performance. 

I do obviously agree with the basic points above - I wrote most of the
scheduler.  Usually there aren't all that many runnable processes even
under heavy load, and having a very simple linear queue is a win for
almost all situations in my opinion.  For example, if there are lots of
processes doing IO, the process list tends to be fairly short and you
really want a very simple scheduler for latency reasons.  In contrast,
if there are lots of CPU-bound processes, there may be lots of runnable
processes, but it very seldom results in a re-schedule (because they
keep running until the timeslot ends), so again there is no real reason
to try to be complex. 

So yes, under certain circumstances the current scheduler uses more CPU
than strictly necessary - and the "40 processes doing a sched_yield()
all the time" example is one of the worst (because it implies a lot of
runnable processes but still implies continuous thread switching). 

Personally I don't think it's a very realistic benchmark (it tells you
_something_, but I don't think it tells you anything you need to know),
which is one reason why Linux isn't maybe the best system out there for
that particular benchmark.  But it would be easy enough to make Linux
perform better on it, so I'll think about it. 

[ Even when I don't find benchmarks very realistic I really hate arguing
  against hard numbers: hard numbers are still usually better than just
  plain "intuition".  And I may well be wrong, and maybe there _are_
  circumstances where the benchmark has some real-world implications,
  which is why I wouldn't just dismiss the thing out-of-hand.  It's just
  too easy to ignore numbers you don't like by saying that they aren't
  relevant, and I really try to avoid falling into that trap. ]

The particular problem with "sched_yield()" is that the Linux scheduler
_really_ isn't able to handle it at all, which is why the Linux
sched_yield() implementation sets the counter to zero - I well know that
it's not the best thing to do for performance reasons, and I think it
unduly penalizes people who want to yield some CPU time, but as it
stands the scheduler can't handle it any other way (the "decrement
counter by one" approach that Ingo suggested is similarly broken - it
just happens to not show it quite as easily as the more drastic "zero
the counter", and it has some other problems - mainly that it doesn't
guarantee that we select another process even if another one were to be
runnable). 

I should probably add a "yield-queue" to the thing - it should be rather
easy to do, and it would get rid of the current scheduler wart with
regard to sched_yield().  My reluctance is purely due to the fact that I
haven't heard of any real applications that it would matter for, but I
suspect we need it for stuff like "wine" etc that need to get reasonable
scheduling in threaded environments that look different from pthreads(). 

		Linus