
Comments (50)

cvb941 avatar cvb941 commented on June 22, 2024 12

A speedup from 65ms to 28ms isn't very much on an 8-core machine. The profiler shows a lot of time being spent in the Random::Vec3 function, where the random number generator probably synchronizes the threads, causing a bottleneck.

Just using a simple rand() in the Random::Float() function reduces the frame time from 28ms to 10ms. In theory this can be reduced further with an even faster generator (returning a constant achieves 7ms).

static float Float()
{
    return static_cast<float>(rand()) / static_cast<float>(RAND_MAX);
    // return (float)s_Distribution(s_RandomEngine) / (float)std::numeric_limits<uint32_t>::max();
}
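A thread-safe alternative is to keep the std::mt19937 engine but make its state thread_local, so each thread owns its own generator instead of contending on shared state (note that rand() itself is not guaranteed to be thread-safe). A minimal sketch, assuming C++11:

```cpp
#include <random>

static float Float()
{
    // One engine and distribution per thread: no sharing, no contention.
    thread_local std::mt19937 s_RandomEngine{ std::random_device{}() };
    thread_local std::uniform_real_distribution<float> s_Distribution(0.0f, 1.0f);
    return s_Distribution(s_RandomEngine);
}
```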

from raytracing.

cvb941 avatar cvb941 commented on June 22, 2024 8

@TheCherno Hi, I don't know whether you already know it, but you should definitely check out and play around with OpenMP (it would be a great topic for a future video too).

Although my personal experience with it is very limited, it is widely used in industry, and it makes it easy to parallelize (and also vectorize) code using just one or more #pragmas above for loops and other parts of the code.

It's really surprising how simple and almost magical this API is, considering how verbose things in C++ usually are compared to other languages.

After enabling OpenMP in Project properties > C/C++ > Language, I have added the #pragma omp parallel for clause to the outer loop:

#pragma omp parallel for // MSVC's OpenMP 2.0 requires a signed loop index
for (int y = 0; y < (int)m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

This gets me from 65ms single-threaded down to 35ms. However, the for_each implementation runs even faster, at 28ms.

The difference comes from OpenMP's default static schedule, which splits the loop into equal chunks, one per CPU thread.
As a result, the threads that render the sky finish sooner than the ones rendering the ball.
Adding schedule(dynamic, 1) distributes the work row by row to the threads, making it behave like the for_each version and bringing the render time down to 28ms.

#pragma omp parallel for schedule(dynamic, 1) // MSVC's OpenMP 2.0 requires a signed loop index
for (int y = 0; y < (int)m_FinalImage->GetHeight(); y++)
{
    for (uint32_t x = 0; x < m_FinalImage->GetWidth(); x++)
    {
        glm::vec4 color = PerPixel(x, y);
        m_AccumulationData[x + y * m_FinalImage->GetWidth()] += color;
        ...
    }
}

Code is in my fork at https://github.com/cvb941/RayTracing


leddoo avatar leddoo commented on June 22, 2024 7

oof

i mean, do i need to say anything? :D

was pretty confused about the 8x increase in physical core count only getting a 2x speedup. (i'm getting around 3x on my i7-8750h -- 180ms -> 65ms -- which seems high, but that's beside the point)

for this kind of simple parallelization of a compute heavy task, i'd expect a speedup of 90% * #physical-cores. (hyperthreading doesn't tend to help much, when you're compute bound. and there are always some losses, hence the 90%)
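For reference, the i7-8750H has 6 physical cores (12 hardware threads), so the arithmetic behind that expectation looks like this (a sketch using the timings quoted above):

```cpp
// Rough speedup/efficiency arithmetic for the numbers quoted above.
static double Speedup(double serialMs, double parallelMs)
{
    return serialMs / parallelMs; // 180 / 65 is roughly 2.77x
}

static double Efficiency(double serialMs, double parallelMs, int physicalCores)
{
    // Fraction of the ideal linear speedup actually achieved.
    return Speedup(serialMs, parallelMs) / physicalCores; // roughly 0.46, well below the ~0.9 target
}
```

By that measure the 180ms -> 65ms run is only about 46% efficient on 6 physical cores, which is consistent with a shared-state bottleneck such as the synchronized PRNG.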

but yeah, get some thread local RNGs going :D


athieman avatar athieman commented on June 22, 2024 5

Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission.

Update: @leddoo already covered it.


xLoge avatar xLoge commented on June 22, 2024 2

With this code I get around 6-7 ms on my i5 12400F. I also use 'thread_local' and 'xoshiro128**' for the random class.

std::vector<std::future<void>> threads;
threads.reserve(m_ImageVerticalIter.size());

for (const uint32_t& y : m_ImageVerticalIter)
{
    auto async_thread = [this, y]() -> void
    {
        for (const uint32_t& x : m_ImageHorizontalIter)
        {
            glm::vec4 pixel = PerPixel(x, y);
            m_AccumulationData[x + y * m_FinalImage->GetWidth()] += pixel;

            glm::vec4 accumulatedColor = m_AccumulationData[x + y * m_FinalImage->GetWidth()] / (float)m_FrameIndex;

            accumulatedColor = glm::clamp(accumulatedColor, glm::vec4(0.0f), glm::vec4(1.0f));

            m_ImageData[x + y * m_FinalImage->GetWidth()] = Utils::ConvertToRGBA(accumulatedColor);
        }
    };

    threads.emplace_back(std::async(std::launch::async, async_thread));
}

#define MT 0
#if MT
std::for_each(std::execution::par, threads.begin(), threads.end(), [](const std::future<void>& thread) {
    thread.wait();
});
#else
for (const std::future<void>& thread : threads)
    thread.wait();
#endif

The xoshiro128** is around 1ms faster than std::mt19937, and running the threads with std::async is around 0.5ms faster than std::for_each.
https://github.com/xLoge/RayTracing


mmerabet42 avatar mmerabet42 commented on June 22, 2024 2

I stumbled onto your video, and I think there is a better way to use iterators with std::for_each.
Basically, you can make your own iterator with all the appropriate operators (following the LegacyForwardIterator requirements) that simply increments an integer. That way you don't have to initialize a whole vector with N elements going from X to Y. Instead you have two iterators, one representing the moving index and one representing the end, and thanks to the iterator's != operator, std::for_each knows when to stop.

template <typename T>
class ForEachIterator
{
public:
  // Member typedefs so std::iterator_traits can introspect this iterator.
  using iterator_category = std::forward_iterator_tag;
  using value_type        = T;
  using difference_type   = std::ptrdiff_t;
  using pointer           = T*;
  using reference         = T&;

  ForEachIterator() = default; // forward iterators must be default-constructible
  explicit ForEachIterator(const T &p_value) : _value(p_value) {}

  ForEachIterator<T> &operator++()
  {
    ++this->_value;
    return *this;
  }

  T &operator*()
  {
    return this->_value;
  }

  bool operator==(const ForEachIterator<T> &p_it) const
  {
    return this->_value == p_it._value;
  }

  bool operator!=(const ForEachIterator<T> &p_it) const
  {
    return this->_value != p_it._value;
  }

private:
  T _value{};
};

And you can use it this way:

std::for_each(ForEachIterator<uint32_t>(0), ForEachIterator<uint32_t>(width), [](const uint32_t &i) {
  ...
});


soerenfi avatar soerenfi commented on June 22, 2024 1

I believe that is what @TheCherno is aiming to do, and if so, I am looking forward to it very much. Since Vulkan takes care of building BVHs and (HW-)accelerating ray-triangle intersections, there is not much point in spending time implementing them on the CPU (though they are still worth understanding).

Meanwhile you can check out the NVIDIA guide to VK ray tracing (which uses NVIDIA HW-specific support libraries) and this amazing implementation of Peter Shirley's Ray Tracing in One Weekend using Vulkan ray tracing.


Paulo-D2000 avatar Paulo-D2000 commented on June 22, 2024 1

Has anyone tried Intel OneAPI TBB for the threads parallelization?


BenBE avatar BenBE commented on June 22, 2024 1

I'm not in the discord channel, thus cannot say too much about the discussion there. So, if there are any important new findings I'd appreciate hearing of them here.


mgradwohl avatar mgradwohl commented on June 22, 2024 1

For release/distro I am using the following preprocessor defines:
GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED;GLM_FORCE_DEPTH_ZERO_TO_ONE

Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED, and FORCE_DEPTH_ZERO_TO_ONE are for.

For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure any benefit).

I also enabled full link time code generation, string pooling, /O2, /Ot, /GT, /GL and I'm going to try /Ob2

You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj


mgradwohl avatar mgradwohl commented on June 22, 2024 1

Just an update: I tried two other random number generators, Xoshiro256++ and std::ranlux48, and Xoshiro wins. I am at 10 bounces, 60fps, and my CPU is not pegged at all. All three are in my repo and easy to try out (look for // PERF TEST comments). The PRNG in the original code really hits the CPU.



scorpion-26 avatar scorpion-26 commented on June 22, 2024

@TheCherno I checked out the newest commit and tried to run it under x64 Linux, using clang as the compiler.
Aside from some difficulties compiling under Linux, std::for_each() by default only runs on 1 thread. Apparently, for multithreading to work with libstdc++, one needs to install Intel TBB and link to it with -ltbb (Docs). With TBB installed and linked, it now runs fairly slowly (Ryzen 7 3800X, default camera angle, 1600x900 -> 40-50ms). CPU usage hovers around 90%, so not perfect.
I'll try to implement a manual thread-pool style approach to see if that is any better.


HildingLinden avatar HildingLinden commented on June 22, 2024

For fast, non-cryptographically-secure PRNGs I usually go with the xoshiro family of generators, since they are robust enough for my applications and very fast. I've often implemented one myself, but a wrapper like Xoshiro-cpp seems quite practical.
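For anyone who wants to try it without pulling in a wrapper library, xoshiro128** itself is only a few lines (a sketch based on the published Blackman/Vigna algorithm; the seed constants below are arbitrary placeholders, so seed properly, e.g. with splitmix64, in real use):

```cpp
#include <cstdint>

static inline uint32_t Rotl(uint32_t x, int k)
{
    return (x << k) | (x >> (32 - k));
}

// Minimal xoshiro128** sketch, one state per thread.
static uint32_t Xoshiro128ss()
{
    // Placeholder seed; any nonzero state works for a demo.
    thread_local uint32_t s[4] = { 0x9E3779B9u, 0x243F6A88u, 0xB7E15162u, 0x13198A2Eu };
    const uint32_t result = Rotl(s[1] * 5, 7) * 9;
    const uint32_t t = s[1] << 9;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = Rotl(s[3], 11);
    return result;
}

// Map the top 24 bits to a float in [0, 1).
static float XoshiroFloat()
{
    return (Xoshiro128ss() >> 8) * (1.0f / 16777216.0f);
}
```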


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

> Adding the thread_local keyword to s_RandomEngine and s_Distribution in Walnut::Random in both the header and cpp files significantly improved the performance on my machine. The last render time went from 25-26 ms to 8-9 ms on my laptop with an AMD 5800H. I tried to push a new branch so I could make a pull request in Walnut, but I didn't have permission.

I tried this and it worked quite well. My render times were consistently around 5-6ms, and CPU usage was well below 100%, closer to 10%. It seems that once RNG is thread local, the bottleneck is no longer the CPU, though I don't know how to verify that or figure out what the new bottleneck is.


LeoMash avatar LeoMash commented on June 22, 2024

@Rilazy The new "bottleneck" is not a bottleneck; it's vsync, enabled by default in the ImGui renderer. After your CPU finishes calculating the new frame image, it waits for the GPU vsync event. To test that, you can increase the viewport size (on my CPU the current scene renders slower than 60fps in a 4K viewport), or disable vsync in ImGui by uncommenting the IMGUI_UNLIMITED_FRAME_RATE define in Application.cpp.


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

Interesting, thanks!


soerenfi avatar soerenfi commented on June 22, 2024

So what are you all using for profiling? I set up Tracy, but it's not really usable due to the large number of threads being created and destroyed with the current architecture.
I am going to look into creating a thread pool with the number of threads matching the available cores. Does that make sense?


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

I've just been using Visual Studio's built-in profiler when I need proper profiling, but mostly I'm looking at the frame times and seeing what they hover around.


soerenfi avatar soerenfi commented on June 22, 2024

Alright, it's done here. Added a thread pool and a queue for the rendering jobs. Every line of the image is submitted to the queue individually. Here is how it looks in Tracy:
[Tracy screenshot: image_2022-12-29_232710033]

Performance is worse than the std::for_each method though. I was expecting some improvement from removing the overhead of thread creation and destruction.


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

Have you implemented the optimization making random number generation thread-local? If not, it's possible your thread pool just isn't handling synchronization as well as the std::for_each method.


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

@soerenfi Disregard my previous comment for now; it looks like it's the same thing I ran into with vsync, since your framerate looks to be locked at pretty much exactly 60fps. What's happening is that ImGui is set up by default to wait for the next monitor refresh before it starts drawing the new image. LeoMash explained it here, along with ways to get around it, when I ran into the same thing.


soerenfi avatar soerenfi commented on June 22, 2024

@Rilazy yeah, I saw that. For now I just compared CPU load between the approaches and varied the amount of threads in the pool.

Profiler shows that the workers are idling some of the time so that's probably the difference. Did not expect to write a better scheduler than the OS's in one evening.

No point in optimising this further since the ray tracing algorithm itself is so inefficient right now. We could do the PerPixel() intersection calculations on the GPU using CUDA though, just as an exercise..


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

Another option for moving work to the GPU would be the Vulkan ray tracing pipeline itself. I'm not super familiar with GPU programming, but from reading up on the Vulkan API, it looks like we could still use implicit functions for our spheres by passing AABBs that enclose them, then doing the more expensive intersection testing in the intersection shader. From there, I think all of our existing code could be ported pretty much directly to shader code.

I'd have tested it myself already if it weren't for the fact that I'm not familiar enough with graphics programming to have the first clue on how to actually set up Vulkan to draw to the framebuffer of our viewport, I've only ever done the "Draw a triangle" Vulkan tutorial. I've been reading the docs, but there are a lot of docs.

Of course, that all does depend on a graphics driver that supports the Vulkan Ray Tracing Extensions.


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

Neat! I'll have to check that out.


soerenfi avatar soerenfi commented on June 22, 2024

I'll also add this more simplified path tracing example that I'm about to dive into.


WhoseTheNerd avatar WhoseTheNerd commented on June 22, 2024

> With this code I get around 6-7 ms on my i5 12400F. I also use 'thread_local' and 'xoshiro128**' for the random class. [...] The xoshiro128** is around 1ms faster than std::mt19937, and running the threads with std::async is around 0.5ms faster than std::for_each. https://github.com/xLoge/RayTracing

SHISHUA is faster than xoshiro256+x8.


denizdiktas avatar denizdiktas commented on June 22, 2024

Initializing a vector of random unit normals gave me another 30-40% performance improvement on top of making the random number generator thread_local , as @TheCherno described in his corresponding YouTube episode.


denizdiktas avatar denizdiktas commented on June 22, 2024

OK, here is what I have done since: I added a timer that measures just the code that renders the image (basically a timer surrounding just the code guarded by #define MT), taking the average render time (accumulate the timer and divide by the total number of frames so far). By caching the normals and measuring performance this way, my render time dropped from about 30ms to 24ms on average. There are about 1 million cached random directions, so the image quality more or less stays the same (at least visually).
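A sketch of that caching idea (names are illustrative, and a minimal Vec3 stands in for glm::vec3 so the snippet is self-contained): build the table once with rejection sampling, then have each thread walk it with its own cursor.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

struct Vec3 { float x, y, z; }; // stand-in for glm::vec3

// Build a table of random unit vectors once; reuse them during rendering.
class CachedUnitVectors
{
public:
    explicit CachedUnitVectors(std::size_t count = 1u << 20)
    {
        std::mt19937 rng{ 1234 };
        std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
        m_Dirs.reserve(count);
        while (m_Dirs.size() < count)
        {
            Vec3 v{ dist(rng), dist(rng), dist(rng) };
            float len2 = v.x * v.x + v.y * v.y + v.z * v.z;
            if (len2 > 1e-6f && len2 <= 1.0f) // rejection-sample inside the unit ball
            {
                float inv = 1.0f / std::sqrt(len2);
                m_Dirs.push_back({ v.x * inv, v.y * inv, v.z * inv });
            }
        }
    }

    // Cheap lookup; each thread keeps its own cursor (sketch simplification:
    // the thread_local cursor is shared across all instances of this class).
    const Vec3& Next()
    {
        thread_local std::size_t cursor = 0;
        return m_Dirs[cursor++ % m_Dirs.size()];
    }

private:
    std::vector<Vec3> m_Dirs;
};
```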


denizdiktas avatar denizdiktas commented on June 22, 2024

Regarding the different threading and parallelization methods: I think we should first measure the performance of the rendering code scheduled inside each thread, and then the overall threading method together with the rendering code. The difference then tells us how much room there is for improvement. To do this, I left the code as it is and measured the time taken by the rendering alone (basically, the code inside the lambda), associating the measured time with its thread. The natural approach would be an associative container (std::map or std::unordered_map), but I didn't want the extra overhead of a lock around the map, so instead I used a preallocated array of float values to keep track of the total rendering time inside each thread. This eliminates the overhead of locking and unlocking a map, so it has minimal impact on the measurements.

Now, why am I talking about threads here, given that @TheCherno uses for_each(std::execution::par, ...) anyway? The std implementation must be using some kind of thread pool in the background. I tried this in a separate project: I submitted a bunch of trivial jobs and looked at the unique thread IDs. It turns out that the number of actual worker threads is less than the number of submitted jobs, and if some jobs take more time than others, the number of threads tends to increase.
Applying this method to the ray tracing code, I found that there are about 8-9 threads (I have 8 cores) and very little difference between the time taken just for rendering and the overall time for rendering plus scheduling. The average time per thread varies, of course, but in my case the variation was about 1ms out of 22ms, and the gap between the per-thread rendering times and the overall rendering time was about 1-2ms (the overall was about 23-24ms). So I am not sure the overhead of for_each(std::execution::par) is significant.

To be honest, I first implemented my own thread pool and quickly realized there were lots of improvements that could be made to it. Then I wondered whether it was worth it for this particular rendering problem (thinking about potential improvements is always a good idea regardless of the use case, of course; I am just talking in the context of this problem). That's when I thought of this method for finding out how much room there is for improvement. I would love to hear your thoughts, and let me know if something is not clear.
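The lock-free bookkeeping described above can be sketched like this (illustrative names): each thread claims an array slot once via an atomic counter and thereafter writes only its own slot, so no map or mutex is needed.

```cpp
#include <atomic>
#include <chrono>

constexpr int kMaxThreads = 64;
static float g_ThreadRenderMs[kMaxThreads] = {};
static std::atomic<int> g_NextSlot{ 0 };

// Each thread computes its slot exactly once (thread_local initialization),
// so subsequent calls are just a read. Wrapping with % kMaxThreads is a
// sketch simplification for the case of more threads than slots.
static int ThreadSlot()
{
    thread_local int s_Slot = g_NextSlot.fetch_add(1) % kMaxThreads;
    return s_Slot;
}

// Accumulate elapsed render time into this thread's private slot; no lock
// is needed because no other thread ever writes this slot.
static void AccumulateRenderTime(std::chrono::steady_clock::time_point start)
{
    auto end = std::chrono::steady_clock::now();
    float ms = std::chrono::duration<float, std::milli>(end - start).count();
    g_ThreadRenderMs[ThreadSlot()] += ms;
}
```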


denizdiktas avatar denizdiktas commented on June 22, 2024

@mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it?


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

> @mmerabet42 Hi Mohammed, I saw your post on this page. It works for for_each() with 2 arguments but when I tried to use it with std::execution::par, it failed to compile. Would you happen to know how to solve it?

I'm not sure, but I think it probably has to do with it being a forward iterator as opposed to a random-access iterator. Because the iterator doesn't provide random access, it must be iterated sequentially, since knowing the index of each pixel/column technically requires knowing the index of the previous one, as computed by a mutable, non-shareable object.
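One way to make it compile (a sketch, not verified against every standard library): give the counting iterator random-access semantics plus the member typedefs that std::iterator_traits looks for, so the parallel implementation can partition the range. The name CountingIterator is illustrative; Boost's counting_iterator or C++20's std::views::iota are battle-tested alternatives. One caveat: operator* returns by value here, which strictly falls short of the ForwardIterator requirements, though common implementations accept it.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>

// Counting iterator with random-access support, usable as e.g.
// std::for_each(std::execution::par, CountingIterator(0), CountingIterator(height), ...);
class CountingIterator
{
public:
    using iterator_category = std::random_access_iterator_tag;
    using value_type        = uint32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const uint32_t*;
    using reference         = uint32_t; // by value; see caveat above

    CountingIterator() = default;
    explicit CountingIterator(uint32_t v) : _value(v) {}

    uint32_t operator*() const { return _value; }
    uint32_t operator[](difference_type n) const { return _value + static_cast<uint32_t>(n); }

    CountingIterator& operator++()    { ++_value; return *this; }
    CountingIterator  operator++(int) { auto t = *this; ++_value; return t; }
    CountingIterator& operator--()    { --_value; return *this; }
    CountingIterator  operator--(int) { auto t = *this; --_value; return t; }
    CountingIterator& operator+=(difference_type n) { _value += static_cast<uint32_t>(n); return *this; }
    CountingIterator& operator-=(difference_type n) { _value -= static_cast<uint32_t>(n); return *this; }

    friend CountingIterator operator+(CountingIterator it, difference_type n) { return it += n; }
    friend CountingIterator operator+(difference_type n, CountingIterator it) { return it += n; }
    friend CountingIterator operator-(CountingIterator it, difference_type n) { return it -= n; }
    friend difference_type  operator-(const CountingIterator& a, const CountingIterator& b)
    { return static_cast<difference_type>(a._value) - static_cast<difference_type>(b._value); }

    friend bool operator==(const CountingIterator& a, const CountingIterator& b) { return a._value == b._value; }
    friend bool operator!=(const CountingIterator& a, const CountingIterator& b) { return a._value != b._value; }
    friend bool operator<(const CountingIterator& a, const CountingIterator& b)  { return a._value < b._value; }
    friend bool operator>(const CountingIterator& a, const CountingIterator& b)  { return a._value > b._value; }
    friend bool operator<=(const CountingIterator& a, const CountingIterator& b) { return a._value <= b._value; }
    friend bool operator>=(const CountingIterator& a, const CountingIterator& b) { return a._value >= b._value; }

private:
    uint32_t _value = 0;
};
```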


denizdiktas avatar denizdiktas commented on June 22, 2024

Hi @Rilazy,
Ok thanks, I'll look into it. I was trying to do a tiled-rendering version and didn't want to change the usage of for_each(execution::par). So I had to implement my own tile-iterators, otherwise I have to define a long list of tiles and then for_each() needs to touch each memory location, which seems inefficient. It is also a good exercise to implement such an iterator compatible with for_each, just for its own sake 🙂


denizdiktas avatar denizdiktas commented on June 22, 2024

Another idea: why not sample the ray over just the hemisphere, in tangent space? This has two benefits. First, we only need two random samples (in spherical coordinates, two angles are sufficient to determine a unit direction vector). Second, since we are sampling over the hemisphere looking outwards, all of the rays go out into the scene, while in @TheCherno's implementation directions are sampled over the whole unit sphere, so on average half of the bounced rays will hit the surface they emanate from and won't contribute to the final result. To measure the effectiveness of this, we should define a performance metric that captures how long it takes the final image to settle to its 'equilibrium' state. I say equilibrium in quotation marks because there is no true equilibrium reached in a finite amount of time, but a visually pleasing one should be sufficient. What do you think?
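A sketch of that tangent-space sampling (self-contained, with a minimal Vec3 standing in for glm::vec3): two uniform random numbers u1, u2 in [0, 1) give a uniformly distributed direction on the hemisphere around the surface normal.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

static float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 Cross(Vec3 a, Vec3 b)
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}
static Vec3 Normalize(Vec3 v)
{
    float l = std::sqrt(Dot(v, v));
    return { v.x / l, v.y / l, v.z / l };
}

// Uniformly sample the hemisphere around `normal` from u1, u2 in [0, 1).
static Vec3 SampleHemisphere(Vec3 normal, float u1, float u2)
{
    float z = u1; // cos(theta), uniform in [0, 1) => never points below the surface
    float r = std::sqrt(std::max(0.0f, 1.0f - z * z));
    float phi = 6.28318530718f * u2; // 2*pi*u2

    // Build an orthonormal tangent-space basis (t, b, normal).
    Vec3 helper = std::fabs(normal.x) > 0.9f ? Vec3{ 0.0f, 1.0f, 0.0f }
                                             : Vec3{ 1.0f, 0.0f, 0.0f };
    Vec3 t = Normalize(Cross(helper, normal));
    Vec3 b = Cross(normal, t);

    float sx = r * std::cos(phi), sy = r * std::sin(phi);
    return { t.x * sx + b.x * sy + normal.x * z,
             t.y * sx + b.y * sy + normal.y * z,
             t.z * sx + b.z * sy + normal.z * z };
}
```

Because (t, b, normal) is orthonormal and sx^2 + sy^2 + z^2 = 1, the result is always a unit vector on the correct side of the surface.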


BenBE avatar BenBE commented on June 22, 2024

You could define this equilibrium state as the point where the overall (absolute) change of pixel values ((old-new)^2) drops below a certain threshold. The metric would then be two-fold:

  1. How many iterations does it take for the threshold to be reached?
  2. How much (absolute) time does it take to render that many frames?

The first metric favors implementations that converge to their equilibrium quickly, while the second favors implementations that avoid overhead and thus offer high render performance. Incidentally, implementations good at 1 also have an advantage in 2, as fewer iterations have to be computed.

A third approach would divide the time taken to reach equilibrium by the number of frames needed, giving a time per frame. While high single-frame performance might be of interest, this is by far the least interesting aspect, as bad-looking results at high performance are like watching static generated at random in response to a request to paint the Mona Lisa.
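That threshold test could look like this (a sketch, assuming the accumulated image is stored as a flat float array and that "change" is the mean of (old-new)^2 over all channels):

```cpp
#include <cstddef>
#include <vector>

// Mean squared change between two successive frames (e.g. flat RGBA floats).
static double MeanSquaredChange(const std::vector<float>& oldFrame,
                                const std::vector<float>& newFrame)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < oldFrame.size(); i++)
    {
        double d = static_cast<double>(oldFrame[i]) - static_cast<double>(newFrame[i]);
        sum += d * d;
    }
    return sum / static_cast<double>(oldFrame.size());
}

// Metric 1 is then the first frame index where
// MeanSquaredChange(prev, curr) < threshold; metric 2 is the wall-clock
// time accumulated up to that frame.
```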

That said, a quick note re optimizations: one should also take into account the specifics of the hardware things run on. When working with large blocks of memory it can be good practice to give the CPU hints for prefetching, as well as, e.g., timing memory accesses to avoid DRAM refreshes, or working from cache when memory write-back could be a bottleneck. Have a look at CPU and GPU hardware effects for other things that might be relevant for structuring your workloads. HTH.


denizdiktas avatar denizdiktas commented on June 22, 2024

Hi @BenBE, thank you. I had similar things in mind. I will post here once I have initial implementations. Also, it looks like most of the discussion happens in the Discord channel, right?


denizdiktas avatar denizdiktas commented on June 22, 2024

Btw, you can view my fork at the link below (let me know if there are any problems):
https://github.com/denizdiktas/RayTracing/


denizdiktas avatar denizdiktas commented on June 22, 2024

Hi everyone. I have modified the code to schedule the tasks at different granularity levels. At the beginning of the Renderer.cpp file there are some preprocessor switches that begin with MT_TASK_GRANULARITY_.
Comment in the one you want to activate and comment out the others. On my system,
MT_TASK_GRANULARITY_ROW seems to be the best performer.

Regarding the timers: the overall average rendering time and the time it takes to execute the code inside each worker thread are measured individually and printed to the console at 1-second intervals. This is to give an idea of the following:

  • How much variation there is among the worker threads
  • The overhead of scheduling the tasks

I would be glad if you could try it and share your results and thoughts.
You can checkout the code at the following link:
https://github.com/denizdiktas/RayTracing/


denizdiktas avatar denizdiktas commented on June 22, 2024

Hi again. I made a further improvement by eliminating whole BUNDLES (BEAMS) of rays by testing them against each sphere in the scene. There is a new preprocessor define called 'USE_TILE_BEAM_INTERSECTION_TEST'; comment it in if you want to use it. However, it can only be used with tiled rendering for now. Note that I have deliberately used red to show the parts of the screen where the intersection test fails and just the sky color should be drawn. I am aware of it, so don't kill me for it please 😅


mgradwohl avatar mgradwohl commented on June 22, 2024

Has anyone taken a look at https://learn.microsoft.com/en-us/windows/win32/dxmath/directxmath-portal to see if that's faster?


NickPhilomath avatar NickPhilomath commented on June 22, 2024

I learned a lot from this series, I want to thank TheCherno. You're the best :)
I also learned to render triangles with this program.
Look how beautiful they are:

photo_2023-03-15_15-55-32
photo_2023-03-15_15-55-37


denizdiktas avatar denizdiktas commented on June 22, 2024

> Has anyone taken a look at https://learn.microsoft.com/en-us/windows/win32/dxmath/directxmath-portal to see if that's faster?

You mean in Cherno's code?


mgradwohl avatar mgradwohl commented on June 22, 2024

> Alright, it's done [here]

> performance is worse than the std::for_each method though. I was expecting some improvement by removing the overhead of thread creation and destruction.

I tried the same thing expecting a speedup for the same reason and didn't get it either.


denizdiktas avatar denizdiktas commented on June 22, 2024

@mgradwohl I added code to measure the time for each separate thread as well as the total time, to find out how much overhead there is (although I have to admit there are some issues with it). I implemented a thread pool anyway (as an exercise in itself), but again there was not much improvement. It looks like after a certain point, further performance gains come mostly from algorithmic improvements.


mgradwohl avatar mgradwohl commented on June 22, 2024

> @mgradwohl I added code to measure the time for each separate thread as well as the total time, to find out how much overhead there is (although I have to admit there are some issues with it). I implemented a thread pool anyway (as an exercise in itself), but again there was not much improvement. It looks like after a certain point, further performance gains come mostly from algorithmic improvements.

Yes, I found that caching values, adding const, moving things out of loops when they didn't need to be there, changing compiler settings, etc. helped. I think one big one was GLM_FORCE_INTRINSICS.


denizdiktas avatar denizdiktas commented on June 22, 2024

@mgradwohl Cool. I don't think I tried compiler settings or GLM intrinsics. Which settings did you use? Could you share them, if that's OK with you?


LilyIsTrans avatar LilyIsTrans commented on June 22, 2024

> For release/distro I am using the following preprocessor defines: GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED;GLM_FORCE_DEPTH_ZERO_TO_ONE
>
> Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED, and FORCE_DEPTH_ZERO_TO_ONE are for.
>
> For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure any benefit).
>
> I also enabled full link time code generation, string pooling, /O2, /Ot, /GT, /GL and I'm going to try /Ob2
>
> You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj

Given the massively parallel nature of the task, I would expect that pushing to the widest possible SIMD constructs available would always have some benefit. Am I wrong to think that AVX512 would likely go even faster?


mgradwohl avatar mgradwohl commented on June 22, 2024

> For release/distro I am using the following preprocessor defines: GLM_FORCE_INTRINSICS; GLM_FORCE_LEFT_HANDED;GLM_FORCE_DEPTH_ZERO_TO_ONE
> Remember, I have ported this to use WinUI/WinAppSDK and the image is on a Win2D canvas, so that's what the LEFT_HANDED, and FORCE_DEPTH_ZERO_TO_ONE are for.
> For C++ Code Generation I have Advanced Vector Extensions 2 (X86/X64) (/arch:AVX2) for my machines. I haven't tried pushing it higher (not sure any benefit).
> I also enabled full link time code generation, string pooling, /O2, /Ot, /GT, /GL and I'm going to try /Ob2
> You should be able to see it all here https://github.com/mgradwohl/Butternut/blob/main/Butternut.vcxproj
>
> Given the massively parallel nature of the task, I would expect that pushing to the widest possible SIMD constructs available would always have some benefit. Am I wrong to think that AVX512 would likely go even faster?

I'll test again, my PC at work is only AVX2.


mgradwohl avatar mgradwohl commented on June 22, 2024

I tried AVX2 vs. AVX512 and saw no difference. However, this may mean:

  1. The compiler does not do a good job of taking advantage of AVX512.
  2. GLM does not do a good job of taking advantage of AVX512.
  3. The compiler and GLM do a good job, but the code is not suited to it.


duckdoom4 avatar duckdoom4 commented on June 22, 2024

Has anyone tried Intel OneAPI TBB for the threads parallelization?

I'm also curious about this

