
Comments (67)

hamadmarri avatar hamadmarri commented on June 3, 2024 4

Hi @MoisesMH , @raykzhao , @JohnyPeaN , @ltsdw , @ptr1337 , @SoongVilda

I am planning to rework RDB and start it over from the beginning. I need to review how the nohz idle wakeup mechanism works first. I am also thinking of adding an extra feature where some CPUs are assigned to serve interactive tasks (they give more priority to interactive tasks but can still run non-interactive tasks at the same time). This idea is based on this paper (https://www.researchgate.net/profile/Julien-Soula/publication/254213707_ARTiS_an_Asymmetric_Real-Time_Scheduler_for_Linux_on_Multi-Processor_Architectures/links/00b495350104a70d19000000/ARTiS-an-Asymmetric-Real-Time-Scheduler-for-Linux-on-Multi-Processor-Architectures.pdf)

The next RDB must take all the nohz work into account, and maybe use a global queue of candidate tasks holding one task from each CPU (the task that has the highest IS but is not running). Each CPU will have one slot in the global queue, and it must guarantee that the task advertised in its slot is ready to migrate at any time, unless the slot holds a null value.

The amount of locking could increase, but the queue is not very big: it only contains nproc items.
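
The global queue idea can be sketched roughly as follows. This is only an illustrative Python model, not RDB code: the names `Candidate`, `GlobalCandidateQueue`, `advertise` and `take_best` are hypothetical, `score` stands in for the IS under the assumption that a higher value wins, and the locking a real implementation would need is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A queued (not running) task advertised by its CPU."""
    pid: int
    score: int  # stand-in for the IS; assumption: higher value wins


class GlobalCandidateQueue:
    """One slot per CPU; None means that CPU advertises nothing."""

    def __init__(self, nproc: int) -> None:
        self.slots: List[Optional[Candidate]] = [None] * nproc

    def advertise(self, cpu: int, task: Optional[Candidate]) -> None:
        # Each CPU overwrites only its own slot.
        self.slots[cpu] = task

    def take_best(self, this_cpu: int) -> Optional[Candidate]:
        """Remove and return the best candidate advertised by another CPU."""
        best_cpu = None
        for cpu, task in enumerate(self.slots):
            if cpu == this_cpu or task is None:
                continue
            if best_cpu is None or task.score > self.slots[best_cpu].score:
                best_cpu = cpu
        if best_cpu is None:
            return None
        task = self.slots[best_cpu]
        self.slots[best_cpu] = None  # empty until that CPU advertises again
        return task
```

A scan touches at most nproc slots, which is why the queue stays cheap to search even if it adds some locking.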

Thanks

from cacule-cpu-scheduler.

hamadmarri avatar hamadmarri commented on June 3, 2024 3

I suspect it is related to RCU calls and softirq. I will post some fixes to try soon.

Thank you


raykzhao avatar raykzhao commented on June 3, 2024 2

Hi @ltsdw @hamadmarri

I think the compile error occurs because nohz_newidle_balance is not defined when CONFIG_NO_HZ_COMMON=n and CONFIG_CACULE_RDB=y. Please try the following fix:

--- a/kernel/sched/fair.c	2021-08-18 22:39:26.513174343 +1000
+++ b/kernel/sched/fair.c	2021-08-18 22:38:19.322803092 +1000
@@ -11084,9 +11084,9 @@
 {
 	return false;
 }
+#endif
 
 static inline void nohz_newidle_balance(struct rq *this_rq) { }
-#endif
 
 #endif /* CONFIG_NO_HZ_COMMON */
 

fix.patch.zip


hamadmarri avatar hamadmarri commented on June 3, 2024 1

> Yes, I can confirm, disabling the RDB did the trick, no more hangs, thank you @hamadmarri.
>
> Also, not related to this issue but may I ask you, is there any straightforward tool to benchmark which of these tunable configs performs better?

Hi @ltsdw ,

Good to hear it's working fine now. However, I would really like to troubleshoot why RDB causes these freezes.

Regarding tuning, there is no specific way to test. I tried to make the defaults work well in general, but if you have an issue you can change them. You need some background in CPU scheduling so that you can read about each CacULE sysctl and change it accordingly.

I would like to keep this issue open until we see why RDB performs badly with wine.

Thank you


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024 1

@hamadmarri, you might be onto something. This game does ~160k context switches, which might have something to do with it. But BMQ handles it, so it's doable. I'm looking forward to those fixes. Keep up the good work.


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024 1

@hamadmarri I needed to set PREEMPT=n too, else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.


hamadmarri avatar hamadmarri commented on June 3, 2024 1

Could you please try this fix

rdb-nohz-fix.zip

Please test with either no_hz_idle or no_hz_full

Thank you


hamadmarri avatar hamadmarri commented on June 3, 2024 1

Hi @ALL

Since the current version of RDB is broken, I will disable it by default. Until a fix is found, you can still use the older RDB versions, which have no autogroup support.

Thank you


ltsdw avatar ltsdw commented on June 3, 2024 1

@hamadmarri

I don't know if it was to put an image here or something else, but here:

Screenshot-20-08-2021_09-45-25

Thank you for your support!


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024 1

@hamadmarri
I have made a discovery. The lagging is caused by the compositor, not by the game engine (I noticed that mangohud was showing 60fps constantly). So if I disable the Plasma compositor, the game is fluid even with RDB.
With the compositor enabled:

cacule = no lags
cacule + rdb = heavy lags
cacule + rdb + fix = very short, but frequent and noticeable lags
cacule + rdb + periodic = no lags

So it seems that the compositor gets neglected under certain circumstances, and although the game renders its images, they are not shown.

Here is the top of a perf session:

41.15%  swapper          [kernel.vmlinux]                      [k] acpi_idle_enter
10.13%  swapper          [kernel.vmlinux]                      [k] acpi_processor_ffh_cstate_enter
 1.42%  RDR2.exe         ntdll.so                              [.] __fsync_wait_objects
 1.03%  RDR2.exe         ntdll.so                              [.] __wine_syscall_dispatcher
 1.02%  RDR2.exe         [kernel.vmlinux]                      [k] native_sched_clock


hamadmarri avatar hamadmarri commented on June 3, 2024 1

> Hey @hamadmarri
>
> This is the machine I'm testing:
> AMD Ryzen 5 3600 6-core processor
> 2x8GB DDR4 2666 RAM
> 256GB NVMe M.2 SSD
> 2TB HDD Drive
> 4GB GDDR6 VRAM RX 5500 XT
>
> lstopo

Hi @MoisesMH

Just to double-check, could you please try with CONFIG_HZ_PERIODIC=y without the fix patch? I recommend using make menuconfig to enable CONFIG_HZ_PERIODIC, since it sets the corresponding configs automatically, so you don't need to worry about the other CONFIG_NO_HZ_* settings.

What I am thinking is that you and @JohnyPeaN have many CPUs, so there is a high probability that some of them go idle and the no_hz wakeup does not work with RDB. Also, I am afraid that @ltsdw needs to retry with CONFIG_HZ_PERIODIC=y, make sure there are no compilation errors, and check that CONFIG_HZ_PERIODIC=y is enabled after installation.

Another suspicion is that the RDB-r3 balancer tries to pick from all tasks in the rq, where some of them have an RT policy! In contrast, the previous RDB version used only rq->cfs tasks to balance. So it could be that the Plasma compositor runs under an RT policy (not sure), but if that is the case, then RDB keeps balancing RT tasks (since it moves one task at a time) and the cfs tasks are not balanced at all (during the freezes). Could @JohnyPeaN please check which policy the Plasma compositor uses?
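
One way to check a process's scheduling policy on Linux is sketched below, using Python's `os.sched_getscheduler`. The helper names `policy_name` and `is_realtime` are made up for this example, and policies that Python's `os` module does not expose (such as SCHED_DEADLINE) are simply reported as unknown.

```python
import os

# Policy constants that Python's os module exposes on Linux.
_NAMES = ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
POLICY_NAMES = {getattr(os, n): n for n in _NAMES if hasattr(os, n)}
# This flag may be OR'ed into the policy value, so mask it off first.
RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0)


def policy_name(pid: int = 0) -> str:
    """Return the policy name of `pid` (0 means the calling process)."""
    policy = os.sched_getscheduler(pid) & ~RESET_ON_FORK
    return POLICY_NAMES.get(policy, f"unknown ({policy})")


def is_realtime(pid: int = 0) -> bool:
    """True if the process runs under SCHED_FIFO or SCHED_RR."""
    return policy_name(pid) in ("SCHED_FIFO", "SCHED_RR")
```

Calling `policy_name(pid)` with the compositor's PID would show whether it is an RT task; a regular desktop process normally reports SCHED_OTHER.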

I am 100% sure that RDB does not consider the nohz kick that wakes up idle CPUs, and if setting the periodic tick works for all of you, then we know that it is about the nohz wakeup kicker. However, if @ltsdw still has freezes while using the periodic tick, then we might have another issue as well.

Please make sure that the following are set during testing:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

This ensures that the freezes are not related to the cache or starve scores.
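
A small sketch for confirming both values before a test run. It parses text in the `name = value` form that `sysctl kernel.sched_cache_factor kernel.sched_starve_factor` prints; the function names are hypothetical.

```python
from typing import Dict

REQUIRED_ZERO = ("kernel.sched_cache_factor", "kernel.sched_starve_factor")


def parse_sysctl(text: str) -> Dict[str, str]:
    """Parse 'name = value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def scores_disabled(text: str) -> bool:
    """True only if both factors are present and set to 0."""
    settings = parse_sysctl(text)
    return all(settings.get(key) == "0" for key in REQUIRED_ZERO)
```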

Thank you


ltsdw avatar ltsdw commented on June 3, 2024 1

> Also, I'm not recompiling to test autogroup on/off. Just to confirm does
> echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
> switch it off?

@JohnyPeaN
Yeah, I think so. You can also use the kernel command-line parameter noautogroup.
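
To verify the result after writing to that file, the value can simply be read back. A minimal sketch, with hypothetical helper names, assuming the file holds `1` when autogroup is on and is absent on kernels built without CONFIG_SCHED_AUTOGROUP:

```python
from pathlib import Path

AUTOGROUP_SYSCTL = Path("/proc/sys/kernel/sched_autogroup_enabled")


def parse_autogroup(value: str) -> bool:
    """Interpret the file contents: '1' means autogroup is enabled."""
    return value.strip() == "1"


def autogroup_state(path: Path = AUTOGROUP_SYSCTL) -> str:
    """Return 'enabled', 'disabled', or 'unavailable' if the file is missing."""
    try:
        value = path.read_text()
    except FileNotFoundError:
        return "unavailable"
    return "enabled" if parse_autogroup(value) else "disabled"
```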


MoisesMH avatar MoisesMH commented on June 3, 2024 1

Nice! Thanks to you. Lately I've tried the liquorix kernel with the MuQSS scheduler (CONFIG_HZ_100=y is the default), android modules, ntfs3 and uksm. What surprises me is the CPU usage. At the game menu of Star Wars Battlefront II, your scheduler with linux-tkg (CONFIG_HZ_1000=y is the default) uses 24% to 32-33% CPU, but with this new kernel it reached a whopping 54% to 59%. I don't believe uksm is what increases the CPU usage, because its main function is memory deduplication. Also, during gameplay, your scheduler was at around 54% to 62% CPU usage, while lqx-kernel with MuQSS ranged from 66% up to 79%. It's impressive how optimized the linux-tkg kernel is compared to liquorix. I haven't tried linux-tkg with uksm yet. I'm going to compile it now and see how it does with CacULE, with and without RDB. Keep it up!


MoisesMH avatar MoisesMH commented on June 3, 2024 1

Hey @hamadmarri
That article seems interesting; I'll read more of it later. On the other hand, I've made a discovery too, haha. I had never thought of tweaking the values you introduced in the discussion at #43, cache_factor and starve_factor. First, I changed cache_factor to 0, as you suggested, and starve_factor to 15944. When playing SWBF 2, all of the hangs were apparently gone on one map. Then the match moved to another map, which I guess is more resource hungry because of its more complex graphics (terrain, leaves, ambient occlusion, etc.), and two or three hangs appeared. Then I changed cache_factor to 8192, since you suggested increasing it. The performance was the same until one big hang appeared (about 5 secs, I guess). Then it went back to normal and the game was fluid. What's important here is that playing with those settings affected the way RDB performed.

I'm not sure whether I have to lower starve_factor to avoid those spikes. You mentioned that raising it makes the system run smoother, but I suspect that raising it too much will starve more groups of applications. I'm not sure about that, but I'll try a better value to see if it's true. I also have a question: after finding a starve_factor that fits me, why do you say we have to raise cache_factor as much as we can? On a less intensive map, with cache_factor=0 and starve_factor=15944, the game ran with no spikes at all and the framerate was stable. But on this new, more intensive map, I've seen one or two spikes with cache_factor at either 0 or 8192. Indeed, with 8192 I've seen one or two spikes more than with 0. How would you explain this, so that I can understand the tweaking I've done? Greetings, and take care!

EDIT: my current RDB value is 15, running with CONFIG_NO_HZ_IDLE=y and CONFIG_NO_HZ=y. Also, I've disabled the compositor with the Alt+Shift+F12 key combination.

EDIT 2: These are the combinations which gave me almost no spikes on Star Wars Battlefront II:

  1. One or two lag spikes (the framerate was constant, even on a graphically intensive map such as Ajan Kloss)
    sched_cache_factor = 7972
    sched_starve_factor = 19930

  2. No spikes at all (I don't remember if I tested the map Ajan Kloss with this configuration, but other maps were running flawlessly)
    sched_cache_factor = 3986
    sched_starve_factor = 17937

Increasing the cache factor only made things worse and did not allow a decent gameplay experience; in my opinion, too high a value can be the cause of those lag spikes. My numbers were inspired by my total installed memory, as reported by the "free" command in the console: a total of 15944 MB (16GB). The numbers I got: 7972 1993 17937 19930 3986 5979 4983.
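
As an observation only (the comment does not say how the numbers were derived), most of the listed values are whole multiples of one eighth of the reported total, 15944 / 8 = 1993. The arithmetic can be checked like this:

```python
total_mb = 15944       # "total" reported by free, as quoted in the comment
unit = total_mb // 8   # 1993; assumed step, not stated by the author

values = [7972, 1993, 17937, 19930, 3986, 5979, 4983]

# Ratio of each listed value to the assumed step.
multiples = {v: v / unit for v in values}

# Values that are exact multiples of the step.
exact = sorted(v for v in values if v % unit == 0)
```

Under this assumption every value except 4983 is an exact multiple (1x, 2x, 3x, 4x, 9x and 10x); 4983 is about 2.5x.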


ltsdw avatar ltsdw commented on June 3, 2024

Also all the tunable configs are the default.


raykzhao avatar raykzhao commented on June 3, 2024

Hi @ltsdw

Based on #43, have you tried reducing kernel.sched_cache_factor to a lower value, e.g. 0? Also, from my experience, you may try setting kernel.sched_cacule_yield to 0, since it may cause freezes due to some I/O issues; see #35.


ltsdw avatar ltsdw commented on June 3, 2024

Hi there @raykzhao

Thank you for your suggestion I'll try.


ltsdw avatar ltsdw commented on June 3, 2024

Sadly it didn't work. I tried:
kernel.sched_cache_factor=0
kernel.sched_cacule_yield=0

but the hangs are still there.


hamadmarri avatar hamadmarri commented on June 3, 2024

> kernel.sched_cache_factor=0

Could you please also set
kernel.sched_starve_factor=0

Is RDB enabled?


ltsdw avatar ltsdw commented on June 3, 2024

> Could you please also set
> kernel.sched_starve_factor=0

The hangs still happen with kernel.sched_starve_factor=0.

> Is RDB enabled?

As I think it's enabled by default with the patch, I believe so.


hamadmarri avatar hamadmarri commented on June 3, 2024

>> Could you please also set
>> kernel.sched_starve_factor=0
>
> The hangs still happen with kernel.sched_starve_factor=0.
>
>> Is RDB enabled?
>
> As I think it's enabled by default with the patch, I believe so.

Could you please try without RDB?


ltsdw avatar ltsdw commented on June 3, 2024

> Could you please try without RDB?

I don't think there is a way to disable it at runtime, so it's necessary to recompile, right?


hamadmarri avatar hamadmarri commented on June 3, 2024

>> Could you please try without RDB?
>
> I don't think there is a way to disable it at runtime, so it's necessary to recompile, right?

Yes, you need to recompile. I think the version that was working for you was without RDB. Could you please attach the .config too?

Also provide all technical information and versions like kernel, wine, which game, and what settings.

Thanks


ltsdw avatar ltsdw commented on June 3, 2024

> Yes, you need to recompile. I think the version that was working for you was without RDB. Could you please attach the .config too?
>
> Also provide all technical information and versions like kernel, wine, which game, and what settings.
>
> Thanks

Sure, this one is from my last compile on 5.13.8: config.txt.

CPU: i5 5200U
GPU: Intel(R) HD Graphics 5500 (using iris)
RAM: 8 GB
Mesa: 21.3.0 (commit c0fc745b78b)
Wine: 6.13 (with some patches from tkg)
Games that I tested with: NovaRO, GTA San Andreas, Path of Exile (for this one I'll blame my GPU more than anything else), but it also happens out of nowhere when watching some videos, or when I'm compiling something.

And when you say settings, which ones do you mean? The CacULE ones? If so, they are all at the defaults.

Now let me recompile it, will take some time.


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024

I have such lags in RDR2 (only), and setting kernel.sched_interactivity_factor=50 seems to help. They don't happen without RDB, but without RDB background load has stronger negative effects. I will test kernel.sched_starve_factor=0, too.


ltsdw avatar ltsdw commented on June 3, 2024

Yes, I can confirm, disabling the RDB did the trick, no more hangs, thank you @hamadmarri.

Also, not related to this issue but may I ask you, is there any straightforward tool to benchmark which of these tunable configs performs better?


hamadmarri avatar hamadmarri commented on June 3, 2024

> @hamadmarri, you might be onto something. This game does ~160k context switches, which might have something to do with it. But BMQ handles it, so it's doable. I'm looking forward to those fixes. Keep up the good work.

Hi @JohnyPeaN , @ltsdw

To narrow down the troubleshooting, could you please try RDB with:
CONFIG_HZ_PERIODIC=y
to see if it is actually related to no_hz_{idle, full} balancing?
I remember that I had nohz_balancer_kick(rq); in RDB before, but for some reason that I have forgotten I removed it from the RDB trigger_load_balance function.

Also, can you try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Or try vice versa: in case most of your RCU configs are disabled, try enabling them.

Based on the RDB code review I did two minutes ago, I suspect it is because of nohz balancing. I assume that you are using no_hz_full?

Please let me know if any of the above changes fix the freezes so I can propose a fix based on your feedback. If none of the above configs has any positive effect, then I can investigate something else.

Thank you


hamadmarri avatar hamadmarri commented on June 3, 2024

> @hamadmarri I needed to set PREEMPT=n too, else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

Hi @JohnyPeaN

I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if it has no effect, then try the RCU options.

Thank you


ltsdw avatar ltsdw commented on June 3, 2024

@hamadmarri

ok, I'll try too, but I'll need some time, thank you!


ltsdw avatar ltsdw commented on June 3, 2024

@hamadmarri

while compiling I noticed this:

kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance' [-Werror,-Wimplicit-function-declaration]
                nohz_newidle_balance(this_rq);
                ^
kernel/sched/fair.c:11324:3: note: did you mean 'nohz_run_idle_balance'?
kernel/sched/sched.h:2439:20: note: 'nohz_run_idle_balance' declared here
static inline void nohz_run_idle_balance(int cpu) { }
                   ^
1 error generated.
make[2]: *** [scripts/Makefile.build:273: kernel/sched/fair.o] Error 1
make[1]: *** [scripts/Makefile.build:516: kernel/sched] Error 2
make[1]: *** Waiting for unfinished jobs....

and the building failed.


ltsdw avatar ltsdw commented on June 3, 2024

Nah, I think it was my fault, let me try again.


ltsdw avatar ltsdw commented on June 3, 2024

Strange: kernel/sched/fair.c does in fact have a declaration of nohz_newidle_balance at line 11050. I don't know what could be wrong here. Why is it not visible when called at line 11324?


ltsdw avatar ltsdw commented on June 3, 2024

> I needed to set PREEMPT=n too, else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.

Oh, now I see it.

> I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if it has no effect, then try the RCU options.

@hamadmarri

But now I'm confused: should I use PREEMPT=n or not? Apparently it can't be compiled without that!


hamadmarri avatar hamadmarri commented on June 3, 2024

>> I needed to set PREEMPT=n too, else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.
>
> Oh, now I see it.
>
>> I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if it has no effect, then try the RCU options.
>
> @hamadmarri
>
> But now I'm confused: should I use PREEMPT=n or not? Apparently it can't be compiled without that!

Hi @ltsdw

Please try first with CONFIG_HZ_PERIODIC=y only. Keep the rest as it was.

Thank you


ltsdw avatar ltsdw commented on June 3, 2024

>>> I needed to set PREEMPT=n too, else the RCU settings weren't applicable and a compile error occurred. I will test it a bit later.
>>
>> Oh, now I see it.
>>
>>> I would advise you to first try with CONFIG_HZ_PERIODIC=y only, and if it has no effect, then try the RCU options.
>>
>> @hamadmarri
>> But now I'm confused: should I use PREEMPT=n or not? Apparently it can't be compiled without that!
>
> Hi @ltsdw
>
> Please try first with CONFIG_HZ_PERIODIC=y only. Keep the rest as it was.
>
> Thank you

@hamadmarri

But now there is a compile error: kernel/sched/fair.c:11324:3: error: implicit declaration of function 'nohz_newidle_balance'


ltsdw avatar ltsdw commented on June 3, 2024

@hamadmarri @raykzhao

Ok, I tested with CONFIG_HZ_PERIODIC=y and, at least for me, the hangs are still there.
Now I'll try with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

Just a question, should I still use the CONFIG_HZ_PERIODIC=y or not?


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024

@hamadmarri CONFIG_HZ_PERIODIC=y removes the random lags, and the game is smooth even with RDB. I also tried the other suggested config options, but nothing noticeable happened.


ltsdw avatar ltsdw commented on June 3, 2024

@hamadmarri @JohnyPeaN

Just tested with:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

and it also didn't work; the hangs are still happening for me.


raykzhao avatar raykzhao commented on June 3, 2024

Hi @ltsdw

Another thing I would suspect is autogroup. Have you tried disabling it? You may try adding noautogroup to your kernel boot command-line parameters.


ltsdw avatar ltsdw commented on June 3, 2024

hi @raykzhao

thank you for your suggestion, I'll try it out later as I cannot right now


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024

@raykzhao I have autogroup enabled, but before CONFIG_HZ_PERIODIC=y even disabling it didn't help. I was using no_hz_full, but lately without the kernel command-line parameter, which, if I understand correctly, results in no_hz_idle behavior.


ltsdw avatar ltsdw commented on June 3, 2024

@hamadmarri @raykzhao

So I just tested with noautogroup and the hangs are gone again.

To summarize, for me, neither:

CONFIG_HZ_PERIODIC=y

nor:

PREEMPT_RCU=n
RCU_BOOST=n
CONFIG_RCU_FAST_NO_HZ=n
TASKS_RCU=n
TASKS_RCU_GENERIC=n

worked so far, but disabling autogroup did the trick. Thank you!


ltsdw avatar ltsdw commented on June 3, 2024

@hamadmarri

Just tested the fix, compiled with no_hz_full, and the hangs persist.


ltsdw avatar ltsdw commented on June 3, 2024

also, I don't know if it's relevant, but I noticed that when the hangs (freezes) happen, the cpu usage usually drops to 5-10%, from like 60-70%. In other words it drops from 70% to 5% and hangs for like 5-10 seconds and comes back to the normal cpu usage before the hang (around the ~70%).


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024

@hamadmarri I can confirm. The lags are still happening. They seem to be much shorter, but are still noticeable. With periodic ticks, it's completely fluid for me.


MoisesMH avatar MoisesMH commented on June 3, 2024

Hi there. I was wondering if having the option CONFIG_NO_HZ=y enabled is really necessary to activate the option CONFIG_NO_HZ_IDLE=y. I've read that the first one is really meant for older kernels, but since I'm running 5.13, I guess it's not necessary at all. Also, I've read that CONFIG_NO_HZ in recent kernels has been split into CONFIG_NO_HZ_IDLE, CONFIG_HZ_PERIODIC and CONFIG_NO_HZ_FULL. I've thought about running CONFIG_HZ_PERIODIC=y, but it would drain unnecessary energy from the CPU, even when it's idle. These are the sources I've read:

https://www.linuxquestions.org/questions/linux-kernel-70/timer-tick-handling-4175468487/
https://github.com/torvalds/linux/blob/master/kernel/time/Kconfig
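
In the linked Kconfig, CONFIG_HZ_PERIODIC, CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_FULL are alternatives within one "Timer tick handling" choice, while CONFIG_NO_HZ is the old option kept for compatibility. A rough sketch for checking which one a built kernel uses, given the text of its config file (often found at /boot/config-$(uname -r) or /proc/config.gz, depending on the distribution); the function name `tick_mode` is made up for this example:

```python
TICK_OPTIONS = ("CONFIG_HZ_PERIODIC", "CONFIG_NO_HZ_IDLE", "CONFIG_NO_HZ_FULL")


def tick_mode(config_text: str) -> str:
    """Return the enabled tick-handling option, or 'unknown' if none is set."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_X is not set".
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value == "y":
            enabled.add(key)
    for option in TICK_OPTIONS:
        if option in enabled:
            return option
    return "unknown"
```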

Also, I've read reducing the timer frequency could improve the performance of a kernel. I'm currently running at CONFIG_HZ_1000=y, but I can try RDB with a lower number (CONFIG_HZ_500=y) to see if I notice an improvement.

On the other hand, thanks for your scheduler. It runs incredibly smoothly compared to CFS. Also, CPU usage is reduced by a lot and the framerates are solid in resource-hungry games. I'm here because I've experienced the same issue: hangs in Star Wars Battlefront II every 5 secs on average since I activated the RDB feature. I'll stay tuned for improvements, since the RDB feature is really interesting. Also, sorry if my question is obvious, but could you explain to me how the RDB interval works, and what the difference is between running at a lower and a higher interval? I'd appreciate that.


hamadmarri avatar hamadmarri commented on June 3, 2024

> Also, sorry if my question is something obvious, but could you explain me how the rdb interval works, and what's the difference between running at a lower and a higher interval. I'd appreciate that.

Hi @MoisesMH

The interval is a number of milliseconds; each CPU runs the load balancer once per interval:
0: load balancing runs every tick
4: load balancing runs every 4ms
and so on.

A low value balances more often, at the cost of more runqueue locking.
A high value balances less often, but reduces the time spent locking runqueues.
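
The trade-off can be illustrated with a toy model. This is not the RDB implementation; `should_balance` and `count_balances` are hypothetical names, and a 1ms tick (as with CONFIG_HZ_1000) is assumed in the usage below.

```python
def should_balance(now_ms: int, last_balance_ms: int, interval_ms: int) -> bool:
    """Decide on a tick whether this CPU runs the balancer."""
    if interval_ms == 0:
        return True  # 0 means: balance on every tick
    return now_ms - last_balance_ms >= interval_ms


def count_balances(duration_ms: int, tick_ms: int, interval_ms: int) -> int:
    """Count balancer runs on one CPU over `duration_ms` worth of ticks."""
    runs = 0
    last = 0
    for now in range(tick_ms, duration_ms + 1, tick_ms):
        if should_balance(now, last, interval_ms):
            runs += 1
            last = now
    return runs
```

Over one second with a 1ms tick, `count_balances(1000, 1, 0)` gives 1000 runs and `count_balances(1000, 1, 4)` gives 250, so each step up in the interval directly cuts how often runqueues have to be locked.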

Thanks


ptr1337 avatar ptr1337 commented on June 3, 2024

@hamadmarri

You can just create a separate RDB.patch like you did earlier, or move the current patch into experimental,

since I have faced no issues with autogroup in the current RDB.


SoongVilda avatar SoongVilda commented on June 3, 2024

My experience with linux-cacule-rdb-autogroup:

Firefox, telegram, steam and playing Xonotic, no issues stable and high fps.


raykzhao avatar raykzhao commented on June 3, 2024

Hi @hamadmarri,

Since the majority of the issues reported here happen during wine/gaming, maybe it is a good idea to look at the locking. I suspect there may be some issues in the latest rdb/autogroup with futex2. Also, some game developers are known to use locking mechanisms in ways they are not supposed to be used.


ptr1337 avatar ptr1337 commented on June 3, 2024

Even when using games with futex2, I don't face any issues.

I'm going to test again, but I'm sure there is no problem.


hamadmarri avatar hamadmarri commented on June 3, 2024

> Hi @hamadmarri,
>
> Since the majority of the issues reported here happen during wine/gaming, maybe it is a good idea to look at the locking. I suspect there may be some issues in the latest rdb/autogroup with futex2. Also, some game developers are known to use locking mechanisms in ways they are not supposed to be used.

Hi @raykzhao

I am not sure, actually, because most of the feedback is not strongly related. None of the fixes worked for @ltsdw, but for @JohnyPeaN changing to periodic hz worked. Also, I have tested my proposed fix, and it reduces performance to below that of the CFS balancer. I guess the best way is to make RDB work with periodic hz and without {auto, fair}_group. The locking issues in games could be a reason, but why is it fine with the CFS balancer and bad with RDB? I thought it was because the CFS balancer goes through softirq, but even the fix where I made the RDB balancer use softirq didn't fix the freezes. I am afraid it is due to something else that RDB doesn't take care of.

If you don't mind, @ALL, could you please attach your CPU topology from lstopo? It could be related to shared-core balancing, or to the number of CPUs, where heavy locking is an issue.

Thank you


raykzhao avatar raykzhao commented on June 3, 2024

Hi @hamadmarri,

This is my laptop:
Screenshot_2021-08-21_01-04-06


JohnyPeaN avatar JohnyPeaN commented on June 3, 2024

@hamadmarri
this is the machine on which I'm testing:

Screenshot_lstopo_2


MoisesMH avatar MoisesMH commented on June 3, 2024

Hey @hamadmarri

This is the machine I'm testing:
AMD Ryzen 5 3600 6-core processor
2x8GB DDR4 2666 RAM
256GB NVMe M.2 SSD
2TB HDD Drive
4GB GDDR6 VRAM RX 5500 XT

lstopo


hamadmarri avatar hamadmarri commented on June 3, 2024

> @hamadmarri
> I have made a discovery. The lagging is caused by the compositor, not by the game engine (I noticed that mangohud was showing 60fps constantly). So if I disable the Plasma compositor, the game is fluid even with RDB.
> With the compositor enabled:
>
> cacule = no lags
> cacule + rdb = heavy lags
> cacule + rdb + fix = very short, but frequent and noticeable lags
> cacule + rdb + periodic = no lags
>
> So it seems that the compositor gets neglected under certain circumstances, and although the game renders its images, they are not shown.
>
> Here is the top of a perf session:
>
> 41.15%  swapper          [kernel.vmlinux]                      [k] acpi_idle_enter
> 10.13%  swapper          [kernel.vmlinux]                      [k] acpi_processor_ffh_cstate_enter
>  1.42%  RDR2.exe         ntdll.so                              [.] __fsync_wait_objects
>  1.03%  RDR2.exe         ntdll.so                              [.] __wine_syscall_dispatcher
>  1.02%  RDR2.exe         [kernel.vmlinux]                      [k] native_sched_clock

Hi @JohnyPeaN

I think it is related to the tick update, where RDB-r3 needs to update the highest IS task on every tick. The previous RDB version used a slightly different approach, since enqueue was sorted.

Do the lags happen on the previous RDB version (which has no sched_group support)?

Thank you for the observation 👍


MoisesMH avatar MoisesMH commented on June 3, 2024

@ltsdw mentioned he used rdb without autogroup and it gave him no spikes.

At the moment, I've compiled the kernel I've used without the patch and these parameters:

RDB Interval: 19 (default).
CONFIG_HZ_1000=y
CONFIG_SCHED_AUTOGROUP=n
CONFIG_NO_HZ=y (I've read it's used for old configs, so I kept it enabled)
CONFIG_NO_HZ_IDLE=y (Tickless idle)

Also I've tweaked some options for the kernel configuration. I'll post it here just in case you want to take a look:

https://drive.google.com/file/d/1eR6NIPe88lc1SCz_nqGjNXPPRPOSavv0/view?usp=sharing

For me it's weird, because 15 minutes ago I was testing about 30 minutes of gameplay in Star Wars Battlefront II. I was using the latest Mangohud version from the AUR (not the mangohud-git one). For approximately the first 15 minutes I experienced no spikes at all and the framerate was constant and smooth, but after that I encountered some small ones, about every 5 minutes I guess, which lasted 2 seconds each. Then it seemed the spikes were gone, until my game froze for 5 secs, just like when I had autogroup enabled. After the freeze, audio and video were out of sync for a second, and then everything went back to normal. So it's more related to heavy workload, as the title of this thread suggests. My CPU usage was about 54 to 59% during gameplay and the GPU was at 99%, which is expected because the graphics card is rendering the shaders and everything else. I was using RDB-r2, I guess, because it's included in the linux-tkg kernel provided by @TkGlitch. I put the links below:

Linux-tkg kernel configuration (he also quoted the cacule link he's using, which refers to the "latest commit 6f2ede5 on May 20"):
https://github.com/Frogging-Family/linux-tkg/blob/master/customization.cfg
https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/patches/CacULE/RDB/rdb.patch#L56

So that is the cacule-rdb version I'm using. Should I test RDB-r3, or is RDB-r2 fine? I'm not sure exactly which version that commit belongs to, but I can test by compiling it manually. I don't know if the AUR version linux-cacule-rdb has these problems too, but I'll first try the one included in the linux-tkg kernel. I prefer it because it has more patches that can increase performance and improve CPU efficiency. However, I'm starting to think one of those patches could be causing the problem too.

On the other hand, the theories you mention are possible. I haven't tested without the compositor. I don't know how to deactivate it; I'll search for that and test without it too. Currently I'm using OpenGL 2.0. There's also OpenGL 3.1 available. I've read that many people suggest Compton as a replacement. I could test it too. That's my progress so far. I'll keep testing and I'll let you know if CONFIG_HZ_PERIODIC=y and the parameters kernel.sched_cache_factor = 0 and kernel.sched_starve_factor = 0 make any difference. Thanks for the reply!


ltsdw avatar ltsdw commented on June 3, 2024

hi @hamadmarri

Just recompiled here, with CONFIG_HZ_PERIODIC=y and tested with:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

but no difference, I'm still experiencing the hangs.

The compositor was also mentioned here. I don't know if disabling the compositor worked for you, @JohnyPeaN, but I tried disabling it here and it didn't make any difference (though I'm using picom, not Plasma).

So far what worked was disabling RDB altogether or using noautogroup.

JohnyPeaN avatar JohnyPeaN commented on June 3, 2024

@hamadmarri I'm not sure which process is responsible for compositing, but I think it's kwin_x11. Anyway, it has normal priority (0), like the rest of the desktop. I will try changing its priority to see if that has an effect.

Earlier RDB versions had these problems for me. Maybe they changed a little. Earlier RDB couldn't utilize all cores during compilation with #threads=#cores. This seems to be better now. As for these in-game lags, it was similar.

I'm also testing whether foreground processes are affected by heavy background processes, like the mentioned compilation with nice -19. This doesn't work well for me on anything except BMQ (but BMQ changes process priorities on the fly, which is maybe a little bit cheating).

Also, I'm not recompiling to test autogroup on/off. Just to confirm, does
echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled
switch it off?
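
As a side note on the autogroup question: according to the sched(7) manual page, a value of 0 in this procfs file disables autogrouping and 1 enables it. The current setting can be read back from the same file. The Python sketch below only reads the value; the function name and the path parameter are illustrative, and it assumes a kernel built with autogroup support so that the file exists.

```python
from pathlib import Path

# Runtime switch for autogrouping, as used in the command above.
AUTOGROUP_PATH = "/proc/sys/kernel/sched_autogroup_enabled"


def autogroup_enabled(path=AUTOGROUP_PATH):
    """Return True if the file holds a non-zero value, False if it holds 0.

    Returns None when the file is missing (for example, on a kernel
    built without autogroup support).
    """
    try:
        text = Path(path).read_text()
    except FileNotFoundError:
        return None
    return int(text.strip()) != 0


print("autogroup enabled:", autogroup_enabled())
```

Running it before and after the `echo 0 | sudo tee ...` command should show whether the change took effect without rebooting or recompiling.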

MoisesMH avatar MoisesMH commented on June 3, 2024

hey @hamadmarri
I've tried compiling the kernel as you suggested (with CONFIG_HZ_PERIODIC=y instead of CONFIG_NO_HZ_IDLE=y), but some modules gave errors during the build, so I was afraid it wasn't building properly. Besides, the compilation finished in less than 7 minutes, whereas the kernels I've compiled usually take 15 to 20 minutes. That's suspicious; maybe the error interrupted the whole process. Here is a fragment of the output showing the errors when CONFIG_HZ_PERIODIC=y:

CC kernel/sched/clock.o
CC fs/crypto/keysetup_v1.o
CC fs/verity/signature.o
CC arch/x86/events/amd/uncore.o
CC fs/notify/notification.o
CC mm/maccess.o
AR fs/verity/built-in.a
CC mm/page-writeback.o
CC fs/crypto/policy.o
CC fs/notify/group.o
CC kernel/sched/cputime.o
CC kernel/sched/idle.o
CC arch/x86/events/amd/ibs.o
CC fs/crypto/bio.o
CC fs/notify/mark.o
CC arch/x86/events/amd/iommu.o
CC fs/crypto/inline_crypt.o
CC kernel/sched/fair.o
CC kernel/sched/rt.o
CC fs/notify/fdinfo.o
CC mm/readahead.o
CC [M] arch/x86/events/amd/power.o
AR fs/crypto/built-in.a
CC mm/swap.o
AR fs/notify/built-in.a
CC fs/nfs_common/nfs_ssc.o
kernel/sched/fair.c: In function ‘newidle_balance’:
kernel/sched/fair.c:11324:17: error: implicit declaration of function ‘nohz_newidle_balance’; did you mean ‘nohz_run_idle_balance’? [-Werror=implicit-function-declaration]
11324 | nohz_newidle_balance(this_rq);
| ^~~~~~~~~~~~~~~~~~~~
| nohz_run_idle_balance
CC [M] fs/nfs_common/nfsacl.o
AR arch/x86/events/amd/built-in.a
CC arch/x86/events/intel/core.o
CC arch/x86/events/intel/bts.o
CC arch/x86/events/zhaoxin/core.o
CC [M] fs/nfs_common/grace.o
LD [M] fs/nfs_common/nfs_acl.o
CC mm/truncate.o
AR fs/nfs_common/built-in.a
CC fs/iomap/trace.o
CC mm/vmscan.o
AR arch/x86/events/zhaoxin/built-in.a
CC mm/shmem.o
CC fs/iomap/apply.o
CC arch/x86/events/intel/ds.o
CC fs/iomap/buffered-io.o
cc1: some warnings being treated as errors
make[2]: *** [scripts/Makefile.build:273: kernel/sched/fair.o] Error 1
make[1]: *** [scripts/Makefile.build:516: kernel/sched] Error 2
make: *** [Makefile:1862: kernel] Error 2
make: *** Waiting for unfinished jobs....
CC fs/iomap/direct-io.o
CC arch/x86/events/intel/knc.o
CC mm/util.o
CC mm/mmzone.o
CC arch/x86/events/intel/lbr.o

On the other hand, when I compile with tickless idle only (CONFIG_NO_HZ_IDLE=y), the kernel compiles without any errors. I've only applied the cacule, uksm, futex2, security, and more-uarches patches. I also got errors after applying fsync, the predecessor of the more advanced futex2 approach, but those concern other functionality. The PKGBUILD provided by TkGlitch gave me the same errors when CONFIG_HZ_PERIODIC=y is set, and I don't know why. I think you should inspect those lines. I don't think the other patches are causing the problem, since I haven't integrated any other scheduler, and CacULE replaces CFS. That's all the information I can provide. Greetings!

NOTE: maybe your scheduler is designed to work only with full tickless and tickless idle kernels? Maybe I'm confused hehe

hamadmarri avatar hamadmarri commented on June 3, 2024

Hi @MoisesMH

Could you please try this fix #47 (comment)

I will update the fix on GitHub soon.

Thanks

EDIT:

bb77376

MoisesMH avatar MoisesMH commented on June 3, 2024

hey @hamadmarri
I've compiled a kernel with your latest commit and applied some additional patches, but the kernel wasn't working properly: when gaming, the framerate was uneven and CPU usage was too high (I guess that happened because of esync; futex2 was not working, even though I patched it). So I went on to test the fix you suggested in your last message, applied to TkGlitch's linux-tkg kernel, which I believe has an earlier version of your scheduler:

Linux-tkg kernel configuration (he also quoted the cacule link he's using, which refers to the "latest commit 6f2ede5 on May 20"):
https://github.com/Frogging-Family/linux-tkg/blob/master/customization.cfg
https://github.com/hamadmarri/cacule-cpu-scheduler/blob/master/patches/CacULE/RDB/rdb.patch#L56

I have to say that on my system, even with CONFIG_HZ_PERIODIC=y, there are still lag spikes, but they're less frequent than with CONFIG_NO_HZ_IDLE=y. I also set the variables you suggested in my /etc/sysctl.conf:

kernel.sched_cache_factor = 0
kernel.sched_starve_factor = 0

and then ran "sudo sysctl --system" to apply the settings from that file, but the hangs are still present. Disabling autogroup (kernel.sched_autogroup_enabled=0) helped a little, reducing both the frequency of the lag spikes and their duration (each hang now lasts up to 2 seconds). Before, with CONFIG_NO_HZ_IDLE=y, they lasted 5 seconds on average. In the menu everything is smooth, and during gameplay, when there are no hangs, the game runs butter-smooth. Another detail: while a hang is happening, the Mangohud overlay shows CPU usage rising by about 10 percentage points on average (from 55% to 65%; it even reached 74%). It's odd that this only happens under heavy workload. For other tasks, like running Audacious or Lutris, it's noticeably faster than without RDB; I'm surprised by how quickly different applications open. For those tasks it's butter-smooth; the problem only appears during intensive gameplay. That's all I've got. I really have no idea why it only happens under intensive workload. Maybe the code is adapted to ordinary tasks but not to that. I'll stay around for more news. Thanks for the effort. Keep it up!

EDIT: could you apply the two last commits to the cacule-5.14.patch please? I want to apply it for testing with the futex2-dev kernel from Collabora. Thanks!

hamadmarri avatar hamadmarri commented on June 3, 2024

Hi @MoisesMH
Updated 5.14
https://github.com/hamadmarri/cacule-cpu-scheduler/tree/master/patches/CacULE/v5.14

Thank you

hamadmarri avatar hamadmarri commented on June 3, 2024

The issue could be related to kernel.sched_cacule_yield.
Can you please try with

kernel.sched_cacule_yield = 0

Thank you

MoisesMH avatar MoisesMH commented on June 3, 2024

Hey @hamadmarri
I've set kernel.sched_cacule_yield=0 in my sysctl.conf, but it didn't help. Instead, things became unstable and I saw more lag spikes during co-op gameplay, though not in the game menu. So it performs noticeably better with kernel.sched_cacule_yield=1. Oh, I compiled with CONFIG_NO_HZ_IDLE=y and an RDB interval of 15, and I was also testing with UKSM. I noticed fewer lag spikes with this configuration. I don't know which change helped neutralize some of the hangs: the new RDB interval with CONFIG_NO_HZ_IDLE=y, or UKSM could be helping too. I wonder what the results would be with periodic ticks and kernel.sched_cacule_yield=0. Cheers!

hamadmarri avatar hamadmarri commented on June 3, 2024

Hey @hamadmarri
That article seems interesting; I'll read more of it later. On the other hand, I've made a discovery too haha. I had never thought of tweaking the values you provided when you introduced them in a discussion thread. I'm referring to cache_factor and starve_factor (at #43). First, I changed the cache_factor to 0 as you suggested and the starve_factor to 15944. When playing SWBF 2, all of the hangs were apparently gone on one map. Then the match moved to another map, which I guess was more resource hungry because of its more complex graphics (terrain, leaves, ambient occlusion, etc.), and two or three hangs appeared. Then I raised the cache_factor to 8192, as you had suggested increasing it. The performance was the same until one big hang appeared (about 5 seconds, I guess). Then it went back to normal and the game was fluid. What's important here is that playing with those settings affected how RDB performed. I'm not sure whether I have to lower the starve_factor to avoid those peaks. You mentioned that raising it makes the system run smoother, but I suspect raising it too much will starve more groups of applications. I'm not sure about that, but I'll try a better value to see if it's true. I also have a question: after finding a starve_factor that fits me, why do you say we have to raise the cache_factor as much as we can? On a less intensive map, with cache_factor=0 and starve_factor=15944, the game ran with no peaks at all and the framerate was stable. But on this new, more intensive map, I saw just one or two peaks with cache_factor at either 0 or 8192. In fact, with 8192 I saw one or two more peaks than with 0. How would you explain this, so I can understand the tweaking I've done? Greetings and take care!

EDIT: my current RDB value is 15, and I'm running with CONFIG_NO_HZ_IDLE=y and CONFIG_NO_HZ=y. I've also disabled the compositor with the Alt+Shift+F12 key combination.

EDIT 2: These are the combinations which gave me almost no spikes on Star Wars Battlefront II:

  1. One or two lag spikes (the framerate was constant, even on a graphically intensive map such as Ajan Kloss)
    sched_cache_factor = 7972
    sched_starve_factor = 19930
  2. No spikes at all (I don't remember if I tested the map Ajan Kloss with this configuration, but other maps were running flawlessly)
    sched_cache_factor = 3986
    sched_starve_factor = 17937

Increasing the cache factor only made things worse and prevented a decent gameplay experience; in my opinion, too high a value can be the cause of those lag spikes. My numbers were inspired by my total installed memory, as reported by the "free" command in the console: 15944 MB in total (16 GB). The numbers I came up with: 7972 1993 17937 19930 3986 5979 4983.
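
The numbers listed above line up with eighths of the reported memory total. The Python sketch below expresses each one in units of 15944 / 8 = 1993. Treating one eighth as the step is an observation about the values, not something the post states, and the helper name is made up for illustration.

```python
from fractions import Fraction

TOTAL_MB = 15944                 # total memory reported by `free`, per the post
STEP = Fraction(TOTAL_MB, 8)     # 1993; assumed unit, not stated by the poster

values = [7972, 1993, 17937, 19930, 3986, 5979, 4983]


def steps(value):
    """Express value in units of STEP as an exact fraction."""
    return Fraction(value) / STEP


ratios = {v: steps(v) for v in values}
for v, r in ratios.items():
    print(f"{v:>6} = {float(r):.4f} x {int(STEP)}")
```

All values except 4983 come out as whole multiples of 1993; 4983 is the odd one out at roughly 2.5 steps.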

Hi @MoisesMH

The cache factor doesn't seem to work well with the RDB design. I need to troubleshoot the cache and starve factors too.

Thank you

MoisesMH avatar MoisesMH commented on June 3, 2024

Yeah, it seems to be generating the issue. I've discovered another combination, which is close to the default I think:

kernel.sched_cache_factor = 10629
kernel.sched_starve_factor = 21258

With this configuration I've had almost no spikes: one peak happened at the beginning of gameplay, but I haven't noticed any freezes or big hangs. I think I'll stay with this configuration. The two values sum to less than the sched_interactivity_factor, and one is double the other (1/3 * 31888, 2/3 * 31888). Hope your investigation and development are going great!
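
The arithmetic behind that last combination can be sketched in Python. This assumes, as the post implies, that sched_interactivity_factor is 31888 and that the fractions are truncated to integers; the function name is hypothetical.

```python
def split_interactivity_factor(interactivity_factor):
    """Return (cache_factor, starve_factor) as 1/3 and 2/3 of the input.

    Integer division truncates, so the two results can sum to slightly
    less than the input.
    """
    cache_factor = interactivity_factor // 3
    starve_factor = 2 * interactivity_factor // 3
    return cache_factor, starve_factor


cache, starve = split_interactivity_factor(31888)
print(f"kernel.sched_cache_factor = {cache}")
print(f"kernel.sched_starve_factor = {starve}")
```

For 31888 this reproduces the two values quoted in the post, and the truncation is why their sum falls just short of the interactivity factor.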
