
Comments (31)

radfordneal commented on July 23, 2024

Does it say it's using helper threads? The startup message should say "Using 2 helper threads".

By the way, I've just released a new version, and commented on your earlier issue post.

Thanks for trying this out!

 Radford

from pqr.

armgong commented on July 23, 2024

Merged 2013-06-28, and it doesn't print a "Using 2 helper threads" message; it just starts up like this:

[screenshot: Rgui startup output]

I already added some printf calls to helpers.c in the helpers_startup() function, and they show helpers_num=2, but I can't find why pqR uses only the master thread. I have also set options(helpers_disable=FALSE).

radfordneal commented on July 23, 2024

Actually, if there were no helper threads, the trace output would be a bit different. That's also the case if you had accidentally done options(helpers_disable=TRUE). So my next theory is that it's because of the large size of the matrices. When a full garbage collection is done, pqR currently waits for all tasks to complete (so any inputs they're using can be freed), which could maybe result in the master getting to the tasks before the helper threads. (I plan on doing something cleverer soon, like only waiting for tasks that actually are using inputs that are not otherwise referenced.) You could try it on smaller matrices.

radfordneal commented on July 23, 2024

There's something wrong with that startup message, maybe related to the locale, since the missing text seems to be the part that would be translated into something other than English in non-English locales.

The number of helper threads (if not zero) should be output right after the "Platform:" line.

armgong commented on July 23, 2024

Tried a 100x100 matrix but nothing happened. Rgui shows this, still with no startup message:

[screenshot: Rgui output]

radfordneal commented on July 23, 2024

Have you tried setting the number of helper threads with the --helpers=2 argument? Maybe setting the R_HELPERS environment variable doesn't work for some reason.
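
For reference, the two ways of requesting helper threads mentioned here would look roughly like this (a sketch; the binary name and thread count are examples, and on Windows the flag would go on the Rgui/Rterm command line instead):

```shell
# Ask for 2 helper threads via the command-line argument:
R --helpers=2

# Or via the environment variable before starting R:
export R_HELPERS=2
R
```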

armgong commented on July 23, 2024

Now, after compiling 2013-06-28 and setting --helpers=2, it's even worse; it shows this:

[screenshot: Rgui output]

radfordneal commented on July 23, 2024

You shouldn't use options(helpers_disable=TRUE). That disables helper threads. You want them enabled.

armgong commented on July 23, 2024

Sorry for the error. After changing it to options(helpers_disable=FALSE), the result is still always computed by the master, not the helpers.

Thank you so much for the help. It's 1:20 am now and I need some sleep; I'll dig deeper next week.

Thank you

radfordneal commented on July 23, 2024

I think I know what's happening with the lack of message about the number of threads. The "Using N helper threads" message (or lack of it, indicating N=0) says what the actual number of helper threads is, which may be less than the number asked for with R_HELPERS or --helpers. I get no "Using 2 helper threads" line and trace output like yours when I run pqR with --helpers=2 but with the OMP_THREAD_LIMIT environment variable set to 1. Perhaps 1 is the default limit in your Windows environment, and you need to set it to N+1 if you want to use N helper threads. (The default in Linux seems to be no limit.)
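
The interaction with OMP_THREAD_LIMIT described above can be sketched like this (hypothetical shell session; assumes a Unix-style shell with pqR's R binary on the PATH):

```shell
# With the OpenMP thread limit at 1, only the master thread is allowed,
# so pqR silently starts with 0 helper threads despite asking for 2:
OMP_THREAD_LIMIT=1 R --helpers=2

# Setting the limit to N+1 (N helpers plus the master) lets them start:
OMP_THREAD_LIMIT=3 R --helpers=2
```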

armgong commented on July 23, 2024

I found why this happens: Makefile.win in extra/helper doesn't add -fopenmp to CFLAGS.
Changing it to

DEFS=-DHAVE_CONFIG_H -fopenmp

solves the problem. I've also pushed the fix to my repo; please merge it into yours.

[screenshot: Rgui startup output]

radfordneal commented on July 23, 2024

Glad you've found the problem! I'll be interested in how well the helper threads work. There could be a different amount of overhead for thread operations in Windows versus Linux.

radfordneal commented on July 23, 2024

I notice that A+B and A-B weren't actually done in parallel in your screen shot above (though computation of A-B and of f() may have been done in parallel). This is probably because allocating space for the result of A-B causes a full garbage collection, and the way I do it at the moment, that results in a wait for A+B to be computed, as I remarked earlier. So I'd recommend trying somewhat smaller examples as well.

armgong commented on July 23, 2024

Yes, I changed the matrix to 2000x2000:

no helpers: 0.08
with 3 helpers: 0.06

and ran a more complex sample as follows:

rm(list=ls(all=TRUE))
gc()
HELPERS: No wait for all tasks to complete
          used (Mb) gc trigger (Mb) max used  (Mb)
Ncells  175349 11.4     407500 24.9   350000  21.8
Vcells  218968  1.7    3350726 25.6 20196985 154.1
options(helpers_disable=TRUE)
HELPERS: No wait for all tasks to complete
options(helpers_trace=FALSE)
f = function(n){
  for(i in 1:n){
    A = matrix(rnorm(1000^2), 1000,1000)
    B = matrix(rnorm(1000^2), 1000,1000)
    list(A%*%B,B%*%A)
  }
}
system.time(f(10))
   user  system elapsed
  68.64    0.22   69.01

rm(list=ls(all=TRUE))
gc()
          used (Mb) gc trigger (Mb) max used  (Mb)
Ncells  176006 11.5     407500 24.9   350000  21.8
Vcells  218968  1.7    5173127 39.5 20196985 154.1
options(helpers_disable=FALSE)
options(helpers_trace=FALSE)
f = function(n){
  for(i in 1:n){
    A = matrix(rnorm(1000^2), 1000,1000)
    B = matrix(rnorm(1000^2), 1000,1000)
    list(A%*%B,B%*%A)
  }
}
system.time(f(10))
   user  system elapsed
 135.62    0.23   52.25

And in R 3.1 svn trunk the time is:
   user  system elapsed
  34.26    0.23   34.82

radfordneal commented on July 23, 2024

For this test, I get the following, on a 3.33 GHz Intel Xeon X5680 with six cores, running Ubuntu 12.04 Linux, compiled with gcc 4.6.3, with options -O2 -march=native -mtune=native:

R-3.0.1:

user system elapsed
21.181 0.028 21.276

pqR-2013-06-28 with --helpers=3:

user system elapsed
22.809 0.012 12.280

pqR-2013-06-28 with --helpers=3 but with options(helpers_disable=TRUE):

user system elapsed
17.326 0.020 17.400

Could you tell me what system you are using? In particular, how many cores does it have? And are you comparing against R-3.0.1 as originally released (which I did), or against the latest patched version of R-3.0.1? Did you compile both R-3.0.1 and pqR yourself, with the same compilation options? (I'm puzzled why your R-3.0.1 is so much faster than pqR.)

armgong commented on July 23, 2024

My laptop has an i3 M370, which has 2 cores and 4 threads; the system is Windows 7 Pro 64-bit.

R 3.1 (svn trunk) and pqR were both compiled by myself with mingw-w64 gcc 4.8.0 (http://sourceforge.net/projects/mingw-w64/files/Toolchains%20targetting%20Win64/Personal%20Builds/rubenvb/gcc-4.8-release).

The pqR C compiler flags are -O3 -Wall -pedantic -mtune=core2 -DENABLE_ISNAN_TRICK -fopenmp

The R version is svn trunk (2013-06-29 r63090), not 3.0.1; its C compiler flags are -O3 -Wall -pedantic -mtune=core2

radfordneal commented on July 23, 2024

Do the four threads then come from hyperthreading on the two cores? If so, it may be better to use only one helper thread, since hyperthreading effectively gives only about 15% of an extra core, and there is some spin-waiting in the helper threads that could impact another thread on the same core.

Just in case that impacts the test you did, you could compare with the timing you get with --helpers=0. Using options(helpers_disable=TRUE) ought to be equivalent (and prevent the other threads from spin-waiting), but this might possibly depend on the OpenMP implementation you're using, so it would be good to check.

armgong commented on July 23, 2024

I downloaded the vanilla versions from CRAN and recompiled pqR twice. Here are the results on my laptop:

R 3.0.1 64-bit vanilla
user system elapsed
32.78 0.23 33.20

R 2.15.0 64-bit vanilla
user system elapsed
31.89 0.34 32.33

R svn trunk 64-bit with -O3 -Wall -pedantic -mtune=core2
user system elapsed
32.78 0.31 33.45

pqR recompiled with -O3 -Wall -pedantic -mtune=core2 -fopenmp, with --helpers=1
user system elapsed
85.20 0.27 44.18

pqR recompiled with -O3 -Wall -pedantic -mtune=core2 -fopenmp, but with helpers disabled
user system elapsed
68.19 0.21 68.57

pqR recompiled with -O3 -Wall -pedantic -mtune=core2, with all HELPER and OpenMP code completely undefined
user system elapsed
68.31 0.22 68.95

So pqR on Windows is indeed slower than vanilla R and R trunk. I'm confused; maybe something in the code is degrading performance?

radfordneal commented on July 23, 2024

I tried R-devel r63090, and get

user system elapsed
21.201 0.024 21.292

pretty much the same as for R-3.0.1, and slower than pqR (even with helpers disabled).

By default, pqR uses its own C routines for %*%, rather than the BLAS routines. On my system, the two are close to being the same speed. But perhaps the BLAS is faster on your system. You can get pqR to use the BLAS for %*% with options(mat_mult_with_BLAS=TRUE), and this will be the default if you configure with --enable-mat-mult-with-BLAS-by-default=yes.

pqR will do %*% in helper threads when using the BLAS in your example, but it won't when it would need to pipeline. For example, A %*% (B %*% A) will be done with both %*% operations happening in parallel when mat_mult_with_BLAS=FALSE, but not when mat_mult_with_BLAS=TRUE.

For your test totally disabling helpers, did you use the --disable-helper-threads option to configure? That's how to really get them disabled, with possible speed improvement (which should be slight, but who knows here...).

Could your Fortran compiler somehow be much better than your C compiler? That would make the BLAS routines faster. Also, all my tests have been with -O2. It's possible that -O3 actually makes things worse.
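
The BLAS-related settings discussed in this comment can be summarized as follows (a sketch; option and flag names as quoted above):

```shell
# At build time, make BLAS the default for %*%:
./configure --enable-mat-mult-with-BLAS-by-default=yes

# At run time, from within a pqR session:
#   options(mat_mult_with_BLAS=TRUE)
```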

armgong commented on July 23, 2024

I didn't use --disable-helper-threads, because I'm on Windows; instead I manually edited config.h and Makefile.win to remove OpenMP and R_HELPER_THREADS:

/* Define if you have C OpenMP support. */
#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR >= 2
// has it, but it is too slow to be usable
// #define HAVE_OPENMP 1
// #define SUPPORT_OPENMP 1
#endif
//#undef R_MEMORY_PROFILING
//#undef HELPERS_DISABLED
//#define R_HELPER_THREADS 1
//#define R_MAT_MULT_WITH_BLAS_IN_HELPERS_OK 1

This should be equivalent to --disable-helper-threads, so --disable-helper-threads wouldn't help, as the latest test in my report shows.

R's default flags are -O3 for R itself and -O2 for packages. I didn't change this; it's the default setting on Windows, and R svn trunk uses it too, so it shouldn't be the cause of the performance downgrade.

radfordneal commented on July 23, 2024

You could try some other operations that can be done in helper threads, to see whether the poor performance is general, or just for %*%. That would narrow down what might be happening.

For example, for

v<-rep(1.1,100000); u<-0; system.time(for (i in 1:100) u <- u+exp(v^1.1)); print(u[100000])

I get elapsed times of 1.040 for pqR with helpers disabled, 0.627 for pqR with one helper thread, and 1.204 for R-devel r63090.

armgong commented on July 23, 2024

Yeah, after defining R_MAT_MULT_WITH_BLAS_BY_DEFAULT 1, everything is OK.
Thank you so much; I wasted so much time just because I forgot to define this.

v<-rep(1.1,100000); u<-0; system.time(for (i in 1:100) u <- u+exp(v^1.1)); print(u[100000])

pqR with one helper
2.85 0.05 1.58

pqR without helpers
2.46 0.03 2.53

R 2.15.0
2.62 0.07 2.73

R 3.01
2.83 0.08 2.90

R trunk
2.74 0.03 2.78

and

f = function(n){
  for(i in 1:n){
    A = matrix(rnorm(1000^2), 1000,1000)
    B = matrix(rnorm(1000^2), 1000,1000)
    list(A%*%B,B%*%A)
  }
}

system.time(f(10))

The results are:

without helpers
user system elapsed
28.41 0.24 29.03

with one helper
user system elapsed
49.00 0.31 26.05

armgong commented on July 23, 2024

I also saw a message about HAVE_OPENMP in config.h from the R core team. It says:

#if defined(__MINGW64_VERSION_MAJOR) && __MINGW64_VERSION_MAJOR >= 2
// has it, but it is too slow to be usable
//#undef HAVE_OPENMP

Maybe this is the reason for the performance downgrade? Is mingw-w64's OpenMP slow? No idea.

armgong commented on July 23, 2024

After changing the compiler flags to -O1, I did some tests with options(mat_mult_with_BLAS=FALSE); the result is still bad.
So now I'm fairly sure the problem is caused by mingw-w64's OpenMP (in the helpers C code).
It seems we should avoid OpenMP code on mingw-w64.



radfordneal commented on July 23, 2024

Have you tried compiling with -O2? That's what I've been using. I've been using gcc 4.6.3 and gcc 4.7.3, so if you're using gcc 4.8.0 that's another possible source of the difference. It's quite possible that compiler options and versions could make a difference, since it all comes down to the code generated for a single inner loop in the matrix multiply routine.

It doesn't look like OpenMP efficiency is a major issue, since you get a reasonable speed up on the u <- u + exp(v^1.1) test. It's possible that the comment about OpenMP being slow is out of date, or applies only to OpenMP operations that pqR doesn't use.

Using the BLAS, you get a speed up in pqR by using one helper thread of 29.03/26.05=1.11, which is rather small. One possible explanation is that on your system the multiplies may be limited by bandwidth to main memory, rather than by processor speed. On my system, with mat_mult_with_BLAS=TRUE, one helper gives a speed up by a factor of 17.3/12.7=1.36 on your test.

armgong commented on July 23, 2024

I tried -O2 -mtune=core2 but no luck. Maybe next I need to change gcc to 4.6.3?

radfordneal commented on July 23, 2024

How about -march=native -mtune=native? That's what I use. Plus -mfpmath=sse for a 32-bit configuration.

gcc 4.7.3 works fine for me, so going all the way back to 4.6.3 doesn't seem necessary.

Of course it's possible that there's just something about my C matrix multiply routine that doesn't work as well as the BLAS routine (also modified from R's a bit) on your system, even though it works as well on mine. It could be some difference in the processor's cache architecture, for instance. Though the algorithms used are pretty similar, so this doesn't seem all that likely...

armgong commented on July 23, 2024

After compiling with gcc -O3 -march=native -mtune=native, the results finally became normal. Test results as follows:

options(helpers_disable=TRUE)
options(mat_mult_with_BLAS=FALSE)
options(helpers_trace=FALSE)
user system elapsed
27.50 0.20 27.71

options(helpers_disable=FALSE)
options(mat_mult_with_BLAS=FALSE)
options(helpers_trace=FALSE)
user system elapsed
48.22 0.44 25.67

options(helpers_disable=TRUE)
options(mat_mult_with_BLAS=TRUE)
options(helpers_trace=FALSE)
user system elapsed
27.66 0.28 28.04

options(helpers_disable=FALSE)
options(mat_mult_with_BLAS=TRUE)
options(helpers_trace=FALSE)
user system elapsed
48.40 0.39 25.79

radfordneal commented on July 23, 2024

Glad you've found good compiler options! It's a bit disturbing that the speed of the matrix multiply routine is so sensitive to the exact compiler options. The speedup from using a helper thread is still disappointingly small. You could try smaller matrices to see if there's a larger speedup with them, as might be expected if the problem is with bandwidth to main memory (since the smaller matrices might fit in the cache).

Thanks for trying this out!

armgong commented on July 23, 2024

Yeah. Changing the matrix to 300x300, with system.time(f(100)):

options(mat_mult_with_BLAS=FALSE): speedup 6.27/4.58 = 1.37
options(mat_mult_with_BLAS=TRUE): speedup 6.21/4.61 = 1.35

So your guess is correct.

radfordneal commented on July 23, 2024

Interesting. There are probably ways of changing the multiply routine to have better cache behaviour (probably speeding it up even when only a single core is used). Of course, there are much more sophisticated BLAS packages that try to do such things. So interfacing well with such BLAS packages is probably the best approach. It should be possible to write a matrix multiply task procedure that can do some pipelining even when using the BLAS matrix multiply rather than my C routine, for instance.

Thanks again for working on this. This thread is getting long, so I'll close it now. But you're welcome to open new issues as you find them. (Or comment on my blog.)

