Giter Site home page Giter Site logo

Comments (7)

vvaltchev avatar vvaltchev commented on August 24, 2024

Hi, nice to meet you!
Thank you for your kind words and for appreciating my work.
I'm glad to hear you're using Tilck to test a new emulator, that's awesome.

Now, back to technical stuff. I've took a look at the conversation on vogons.org (great place, btw).
I'd like to start by explaining how i8042_io_wait() works: for years, it was just an empty inline function because hypervisors don't need any I/O delays: they directly intercept the IN/OUT instructions and emulate the behavior expected from the device, blocking the guest system. On real hardware machines, because of the way 8042 is implemented/emulated today, io_wait() is almost never needed too, but if you wanna be really sure your code will work reliably, it's mandatory to follow the specification and do proper IO waits of 1 microsecond. Now, because wasting CPU cycles is pointless while running on a hypervisor, I added that if(in_hypervisor()), as an optimization. It looks like that on UniPCemu the hypervisor bit in CPUID[1].ecx is simply not set.

Still, even if that bit was set, having __bogo_loops = 0 is bad, because loops_per_us will be 0 as well and delay_us() won't work properly in the general case. Now, straight to the point: how that could happen? Well, that's pretty weird and it should never happen. How the whole thing works:

  1. During boot, in init_timer() a special temporary IRQ0 handler is registered: measure_bogomips_irq_handler().
  2. After that, a kernel worker thread runs do_bogomips_loop()
  3. In that function ctx->started is set.
  4. In the meanwhile all calls to measure_bogomips_irq_handler() do nothing because of the !ctx->started
  5. The first that handler founds ctx->started, resets __bogo_loops and set pass_start = true.
  6. From now, on each timer tick (IRQ 0), ctx->ticks++ will be incremented. At the same time, the CPU spends all the rest of its time iterating in asm_do_bogomips_loop, which increments __bogo_loops once every BOGOMIPS_CONST (10,000) iterations.
  7. After a fixed amount of ticks, determined by MEASURE_BOGOMIPS_TICKS, we decide that's enough so the extra IRQ handler is uninstalled, we calculate loops_per_us and set __bogo_loops = -1: this makes the asm_do_bogomips_loop assembly function to end.
  8. delay_us() as you know uses loops_per_us under the hood

Now, given that, in order to __bogo_loops to remain 0 one of two things have to happen:

  1. We should never get an IRQ 0 again. If that's the case, we'll be stuck in do_bogomips_loop() and never see the Tilck bogoMips: message on the screen. In my understanding that's not the case.
  2. The emulator is so slow that in 1/10th of the second (MEASURE_BOGOMIPS_TICKS) we couldn't finish to perform 10,000 BOGOMIPS_CONST iterations. So, everything seems to go well but we end up with loops_per_us = 0.

In my understanding of the comments on vogons.org, you do see the Tilck bogoMips: message on the screen, so I believe we're in the case 2: the emulator is too slow. In other words, the emulator performs less than 10k iterations in 1/10th of the second, or less than 100k iterations in 1 second. So, the problem could be fixed by optimizing the emulator: generally pure software emulators (without dynamic translation) have no more than 1,000x slowdown. Starting from a realistic value of 2,000 - 3,000 bogoMips on QEMU + KVM on modern machines, a pure software emulator should show a value >= 2.

Now, given your EDIT comment, it looks like you've been able to reach __bogo_loops = 8. That means that the just stated theory of case 2 (less than 100k / second) is not quite correct. What happens is that the emulator performs less than 1 million iterations / second (which is better, of course).

loops_per_tick = 8 * 10,000 / 10 = 8,000 (with TIMER_HZ=100) 
loops_per_us = 8,000 / (1,000,000 / 100) = 0.8 = 0 (truncated)

And loops_per_us == 0 because of truncation. After that, asm_nop_loop() looks like hanging because it ends after reaching 0, but the first instruction is a SUB, so we get to -1 (= 2^32-1) and that's a lot.

So, it looks like the slowdown of the emulator (compared to physical hardware) is more than 3,000x. That's a lot, but it's not surprising. Are you compiling with it -O3? (Btw, what bogoMIPS value Tilck shows when it's run on QEMU with KVM ?)

Anyway, a fix could be to set:

loops_per_tick = 
   MAX(1, __bogo_loops) * BOGOMIPS_CONST/MEASURE_BOGOMIPS_TICKS;

In the measure_bogomips_irq_handler() function. If that turns out to be a good solution for your case, would like to make a pull-request with the fix?

from tilck.

vvaltchev avatar vvaltchev commented on August 24, 2024

An alternative solution might be to:

  1. Rename loops_per_us to loops_per_ns
  2. Initialize loops_per_us = 5 (instead of 5,000)
  3. In measure_bogomips_irq_handler(): loops_per_ns = loops_per_tick / (1000 / TIMER_HZ);
  4. In do_bogomips_loop(): printk("Tilck bogoMips: %u.%03u\n", loops_per_ns/1000, loops_per_ns % 1000);
  5. In delay_us() do:
   if (loops_per_ns >= 1000)
      asm_nop_loop(us * loops_per_ns / 1000);
  1. Optionally, introduce a delay_ns() function

With such a solution, you should see something like:

Tilck bogoMips: 0.800

When running Tilck on the emulator.

from tilck.

misutoneko avatar misutoneko commented on August 24, 2024

OK thx, I'll check those.
Actually the one in the first post won't work unless you use 10 instead of 1 (otherwise it'll still get truncated to 0).
It'd probably be better to use MAX() on loops_per_us, to avoid introducing an extra constant which may cause breakage later on.
(Unless there's some other reason to do it this way - Can't see any, but...)
As for PR - In fact I wasn't planning on doing any PRs when I first joined GitHub, but I may get around to it, eventually...

The emulation speed may be partly because of my lack of understanding with UniPCemu configuration.
The hardware on the host isn't the fastest either lol (AMD FX-series)...
qemu-system-x86_64 (4.2.1) with kvm gives me bogoMips of 1719 and yes, UniPCemu does use -O3 by default.
But to really gain speed the interpretion code would need an overhaul I guess (plus use some dirty tricks to cut corners whenever possible :)

from tilck.

vvaltchev avatar vvaltchev commented on August 24, 2024

Ah yeah, you're right. My first suggestion won't work with 1, and 10 is bad because depends on TIMER_HZ and related.
The problem is that we need the following to hold: loops_per_tick >= (1000000 / TIMER_HZ), so the correct fix should be:

loops_per_tick =
   MAX(1000000 / TIMER_HZ, __bogo_loops * BOGOMIPS_CONST/MEASURE_BOGOMIPS_TICKS);

But that's pretty ugly, I know. The problem with using MAX on just loops_per_us is that it will create an inconsistency with loops_per_tick. Another super-simple solution is to just fix delay_us() this way:

void delay_us(u32 us)
{
   FASTCALL void asm_nop_loop(u32 iters);
   u32 loops = us * loops_per_us;

   if (loops > 0)
      asm_nop_loop(loops);
}

Anyway, I won't insist at all for you to make a pull-request, as it's super easy to fix it myself. Just, it seemed to me that you spent a considerable amount of time playing with this project & debugging it, so I thought you would be happy to put your name on the commit with the fix :-)

from tilck.

misutoneko avatar misutoneko commented on August 24, 2024

The one in your second post did the trick. Here it is in patch form:
Tilck_timer_ns_patch.diff.gz

That's a good point about attribution, and for bigger patches it's a good idea for maintenance's sake. Especially with larger projects with many contributors...
My logic is that the authors will always be more fluent with their own code than me, and the patches will mostly be quite small since I mostly only do testing/bug-hunting.

from tilck.

vvaltchev avatar vvaltchev commented on August 24, 2024

Thanks for the patch!
Unfortunately, at the end I had to modify it heavily mostly because some of my suggestions were not completely correct. For example, the variable loops_per_ns is calculated correctly but represents the loops per millisecond, not nanosecond. Also, after some serious thinking I realized that it's not enough to just check for loops > 0 in delay_us(). That would mean that on slower machines (emulators), delay_us() won't do any work, no matter the value of us. In the patch I've prepared, I believe I've handled all the cases :-)

from tilck.

vvaltchev avatar vvaltchev commented on August 24, 2024

I've merged the patch. Let me know if something still doesn't work.
Thanks for reporting this issue, man! 👍

from tilck.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.