
Comments (5)

valebes commented on May 12, 2024

Nested parallelism can indeed introduce complexity and unexpected behavior, depending on the parallel library used. I recently worked on a parallel application involving matrices, similar to your project. In our case, the total number of threads exceeded what we expected when a parallel_for loop called matrix multiplication from OpenBLAS (built with OpenMP enabled).

In our scenario, we didn't use OpenMP to parallelize the loop; we used another library, FastFlow. To fix the higher-than-expected thread count, we disabled multithreading in OpenBLAS. This avoided conflicts between the two threading layers.
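One way to switch off BLAS-level threading without rebuilding the library is to start the program with the thread-count environment variables set to 1. The sketch below is only an illustration of that idea: it builds a launcher command with `OPENBLAS_NUM_THREADS` and `OMP_NUM_THREADS` set, which are the variables OpenBLAS documents for this purpose. The binary name `./matrix_app` is hypothetical, and this is not necessarily how we disabled threading in our own project.

```rust
use std::process::Command;

/// Build a command that starts `program` with BLAS threading limited to one thread.
/// The variables are set for the child process only, so they are in place
/// before OpenBLAS is loaded and initialized there.
fn single_threaded_blas_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    // Read by OpenBLAS builds that use its own pthread backend.
    cmd.env("OPENBLAS_NUM_THREADS", "1");
    // Read by OpenMP builds; note that this also limits any other OpenMP code.
    cmd.env("OMP_NUM_THREADS", "1");
    cmd
}

fn main() {
    // "./matrix_app" is a hypothetical binary name used for illustration.
    let cmd = single_threaded_blas_command("./matrix_app");
    for (key, value) in cmd.get_envs() {
        println!("{:?} = {:?}", key, value);
    }
    // Calling `cmd.status()` here would actually start the program.
}
```

With this setup the outer parallel loop (FastFlow, PPL, Rayon, and so on) is the only layer that creates threads, and each worker calls a sequential gemm.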

I also recommend the following resource: OpenBLAS FAQ on Using OpenBLAS in Multithreaded Applications. It gives additional details on how to use OpenBLAS with multithreading enabled, especially within applications that are already multithreaded.

I also took a look at the library you suggested. PPL has a function similar to push_loop, called par_for(&mut self, range: Range, chunk_size: usize, mut f: F). You can generate its documentation with cargo doc, and in the next few days I'll try to publish the library on crates.io so that the documentation is available on docs.rs. Unlike in languages such as C++, though, it is not trivial in Rust to implement a parallel_for similar to push_loop as a safe function. That said, in many scenarios like yours such a function is really convenient and provides a more straightforward way to parallelize operations on matrices.
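To give an idea of what a chunked parallel loop can look like in safe Rust, here is a minimal sketch built only on the standard library's scoped threads. It is not PPL's par_for: the name `par_for_chunks`, its signature, and the one-thread-per-chunk strategy are assumptions made for illustration, whereas a real library would reuse a thread pool.

```rust
use std::ops::Range;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Call `f(i)` for every `i` in `range`, handing each chunk of
/// `chunk_size` indices to its own scoped thread.
/// Hypothetical helper, not part of PPL.
fn par_for_chunks<F>(range: Range<usize>, chunk_size: usize, f: F)
where
    F: Fn(usize) + Sync,
{
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let f = &f; // shared reference, so every thread can call the same closure
    thread::scope(|scope| {
        let mut start = range.start;
        while start < range.end {
            let end = start.saturating_add(chunk_size).min(range.end);
            scope.spawn(move || {
                for i in start..end {
                    f(i);
                }
            });
            start = end;
        }
        // All spawned threads are joined when the scope ends.
    });
}

/// Example use: sum the indices 0..n with a shared atomic counter.
fn sum_of_range(n: usize) -> usize {
    let total = AtomicUsize::new(0);
    par_for_chunks(0..n, 4, |i| {
        total.fetch_add(i, Ordering::Relaxed);
    });
    total.into_inner()
}

fn main() {
    println!("sum of 0..10 = {}", sum_of_range(10));
}
```

The `Fn(usize) + Sync` bound is what keeps this safe: the closure can only touch shared state through thread-safe types such as atomics. Handing each iteration a mutable slot of a matrix, as push_loop-style code in C++ often does, requires splitting the data up front or extra care, which is where the difficulty mentioned above comes from.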

I hope this information is helpful in solving the problem you're experiencing.

I'm also curious to know whether you see similar issues with Rayon/Crossbeam or whether this problem is specific to PPL. Also, can you confirm which version of multithreaded OpenBLAS you're using?

from ppl.

mert-kurttutan commented on May 12, 2024

I checked again this morning, and the same problem actually occurs with the C++ library as well.
The only time it does not occur is when the multithreading is done with OpenMP.

Another problem I have is performance. When I call the function that uses OpenBLAS (compiled with OpenMP) sequentially (i.e. within a usual for loop), my script takes around 50 seconds to complete. But when I call it within a scope method from either PPL or Rayon, it slows down significantly and takes around 70 seconds to complete.

However, with the C++ multithreading library, the runtime improves by 5-7 seconds.


mert-kurttutan commented on May 12, 2024

By the way, the OpenBLAS version is 0.3.21, built from source with the following command:

make BUILD_SINGLE=1 NO_LAPACK=1 ONLY_CBLAS=1 USE_OPENMP=1 NUM_THREADS=32 NUM_PARALLEL=8


valebes commented on May 12, 2024

> I checked again this morning, and the same problem actually occurs with the C++ library as well. The only time it does not occur is when the multithreading is done with OpenMP.
>
> Another problem I have is performance. When I call the function that uses OpenBLAS (compiled with OpenMP) sequentially (i.e. within a usual for loop), my script takes around 50 seconds to complete. But when I call it within a scope method from either PPL or Rayon, it slows down significantly and takes around 70 seconds to complete.
>
> However, with the C++ multithreading library, the runtime improves by 5-7 seconds.

The fact that the issue doesn't occur when using OpenMP is likely due to how OpenMP handles nested parallelism. By default, OpenMP disables nested parallelism, meaning that even if parallelism is enabled in OpenBLAS, it won't be used when the call is made from inside an existing parallel region. More details about this behavior can be found in the Oracle documentation: Nested Parallelism in OpenMP.
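As a rough illustration of why the thread count differs, the sketch below models the two cases with simple arithmetic. It is a deliberately simplified model and not a measurement: the `OuterRuntime` enum and `estimated_blas_threads` function are hypothetical names, the figure of 8 outer workers is made up, and the real numbers depend on the OpenMP runtime and on how OpenBLAS was built.

```rust
/// How the outer loop is parallelized (hypothetical model type).
#[derive(Clone, Copy, Debug, PartialEq)]
enum OuterRuntime {
    /// The outer loop is an OpenMP parallel region. With nesting disabled,
    /// each inner BLAS call is assumed to run on the calling thread only.
    OpenMp,
    /// The outer loop uses threads OpenMP does not know about (for example
    /// Rayon, PPL, or std::thread). Each worker is assumed to start its own
    /// team of BLAS threads.
    Independent,
}

/// Estimate how many threads may be running BLAS work at the same time.
fn estimated_blas_threads(outer: OuterRuntime, outer_workers: usize, blas_threads: usize) -> usize {
    match outer {
        OuterRuntime::OpenMp => outer_workers,
        OuterRuntime::Independent => outer_workers * blas_threads,
    }
}

fn main() {
    // 32 matches NUM_THREADS in the build command above; 8 workers is illustrative.
    let (workers, blas) = (8, 32);
    for outer in [OuterRuntime::OpenMp, OuterRuntime::Independent] {
        println!("{:?}: up to {} threads", outer, estimated_blas_threads(outer, workers, blas));
    }
}
```

Under these assumptions, an outer loop with independent threads could end up with far more runnable threads than cores, which would match both the unexpected thread count and the slowdown.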

Regarding the performance difference you're observing, it's important to consider the overhead introduced by parallelism frameworks like PPL and Rayon. The parallelization process itself requires additional resources and synchronization mechanisms that can impact the overall execution time. This overhead can become more significant for smaller tasks or when the parallel regions are not large enough to offset the cost.

Depending on the size of the matrices you're working with, an alternative approach could be to parallelize only the gemm function itself and not the outer loop.
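A minimal sketch of that alternative, assuming row-major single-precision matrices: the outer loop over the workload stays sequential, and only the matrix product itself is split across threads. The naive `gemm_parallel` below is a hypothetical stand-in written with the standard library, not OpenBLAS's implementation; with a multithreaded OpenBLAS you would simply call its gemm from a plain for loop.

```rust
use std::thread;

/// Compute C = A * B for row-major matrices (A is m x k, B is k x n),
/// splitting the rows of C across up to `workers` scoped threads.
/// Hypothetical stand-in for a multithreaded BLAS gemm.
fn gemm_parallel(a: &[f32], b: &[f32], m: usize, k: usize, n: usize, workers: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must be m x k");
    assert_eq!(b.len(), k * n, "B must be k x n");
    let mut c = vec![0.0f32; m * n];
    if m == 0 || n == 0 {
        return c;
    }
    let workers = workers.max(1);
    let rows_per_worker = ((m + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Each thread owns a disjoint block of rows of C, so no locking is needed.
        for (block, c_block) in c.chunks_mut(rows_per_worker * n).enumerate() {
            let first_row = block * rows_per_worker;
            scope.spawn(move || {
                for (r, c_row) in c_block.chunks_mut(n).enumerate() {
                    let row = first_row + r;
                    let a_row = &a[row * k..(row + 1) * k];
                    for (p, &a_val) in a_row.iter().enumerate() {
                        let b_row = &b[p * n..(p + 1) * n];
                        for (c_val, &b_val) in c_row.iter_mut().zip(b_row) {
                            *c_val += a_val * b_val;
                        }
                    }
                }
            });
        }
    });
    c
}

fn main() {
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];
    // Sequential outer loop: parallelism lives only inside gemm_parallel.
    for step in 0..3 {
        let c = gemm_parallel(&a, &b, 2, 2, 2, 2);
        println!("step {}: {:?}", step, c);
    }
}
```

This keeps a single threading layer, so the thread count stays bounded by `workers`. Whether it beats parallelizing the outer loop depends on the matrix sizes: small matrices may not offer enough work per call to pay for the thread startup.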


valebes commented on May 12, 2024

@mert-kurttutan
If there are no further questions, I will go ahead and close this issue.
