sable / ostrich Goto Github PK

Benchmark suite for studying the performance of JavaScript and WebCL for numerical computation

License: Other

Python 7.00% C 49.16% JavaScript 13.50% C++ 19.34% Shell 0.01% Cuda 0.21% Objective-C 1.10% Makefile 2.74% HTML 6.90% MATLAB 0.05%

ostrich's Introduction

Ostrich Benchmark Suite

Ostrich is a benchmark suite developed in the Sable Lab at McGill University with the objective of studying the performance of languages used for numerical computing.

We aim to make the suite:

Consistent and Correct by providing self-checking runners for every language that automatically ensure that the computation result of the benchmarks are consistent across all language implementations and correct with regard to the algorithm for known inputs;
Representative of the most important and popular numerical algorithms with a proper choice of representative input data with a wide range of benchmarks across known numerical categories (Dwarfs) ;
Extensible across numerical languages and benchmarks;
Friendly to language implementation research by factorizing the core computation from the runners to minimize the non-core functions necessary to validate the output of compilers;
Easy to use by automating the deployment of benchmarks, their test on virtual (web browser and others) and native platforms, as well as the gathering and reporting of relative performance data;
Fast by making the setup (data generation and loading) and teardown as quick as possible so that most of the time is spent in the core computation in every language;
Small by minimizing the amount of data needed to download the suite;
Simple by minimizing the amount of external dependencies and tools required to run the suite;

Getting Started

Please read our wiki for more details on obtaining the suite, description of the benchmarks and instruction on running the benchmarks.

Citation

   @software{ostrich,
     author = {{Khan, Faiz and Foley-Bourgon, Vincent and Kathrotia, Sujay and Lavoie, Erick}},
     title = {Ostrich Benchmark Suite},
     url = {https://github.com/Sable/Ostrich},
     version = {1.0.0},
     date = {2014-06-09},
   }

Copyright and License

Ostrich: MIT Licence
OpenDwarfs: GNU Lesser General Public License
Rodinia: Rodinia Licence
V8: BSD 3 Licence

ostrich's People

Contributors

Stargazers

Watchers

Forkers

srutis90 mathiasbourgoin mfwei larskotthoff shi27feng sudosalim akshaj5 elavoie rusiqe

ostrich's Issues

Simplify the makefiles by abstracting the common parts amongst all benchmarks

Add support to run JS benchmarks on a standalone VM outside a browser

nqueens js implementation: Investigate performance bug

The JS versions of nqueens are 4-20x slower than the C version rather than the expected factor of 2:

benchmark,implementation,compiler,platform,environments,mean,std,times
nqueens,c,gcc,mba-2011,native,3.7895943333333335,0.03254402787507002,3.766584,3.826829,3.77537
nqueens,js,none-js,mba-2011,chrome,55.388,0.6129510584051558,54.741,55.96,55.463
nqueens,js,none-js,mba-2011,firefox,16.397666666666666,0.1831429314315263,16.519,16.187,16.487
nqueens,js,none-js,mba-2011,safari,36.24433333333334,0.4209659527008464,36.714,36.118,35.901

Maybe it is caused by the usage of eval by the environment/run script to run the benchmark?

Remove two-level directories for benchmarks and add the dwarf that a benchmark belongs to a meta-file inside the benchmark

Establish directory structure for benchmarks

Runner requires Python 2

Python 2 is becoming increasingly unavailable. I think it also depends on an unspecified older version of psutils for the asm.js benchmarks.

nqueens matlab implementation: performance issue

The matlab implementation is a direct port of the C version and uses bitwise operations that are really fast in C but really slow in matlab. The slowdown factor is too high to make any kind of meaningful comparison between the two languages.

We need to investigate alternative implementations for matlab that implement the same algorithm with different primitives (non bitwise...).

Port nqueens to MATLAB

Question about hmm in CUDA

Is the CUDA version of the benchmark hmm available ?

Thanks

Remove Hot Swap of 'createMatrixFromRandom' from matjuice run_template.html

Port page-rank to MATLAB

Improve speed of data generation in FFT's Matlab implementation by generating data from the C version

Port lud to MATLAB

Convert all the MATLAB code for benchmarks to CamelCase rather than under_score.

Moreover, we should choose a consistent convention for function names to make things regular.

I propose:

runName

xi_sum initialization c

File : main_bwa_hmm.c

Line 115: float *xi_sum;

Line 537: xi_sum = malloc(sizeof(*xi_sum)*nstates*nstates);
// sizeof(*xi_sum) is same as sizeof(float)
//xi_sum is a 2D array which is stored as 1D array

Line 336: memset(xi_sum 0, sizeof(float *)*nstates);
// xi_sum is initialized to zero, but it doesn't seem correct because it only initializes some bytes ( nstates * size of pointer) of xi_sum array instead of whole array (nstates * nstates * sizeof(float) )

Port back-propagation to MATLAB

Clean output of makefiles to output only what is strictly necessary, while enabling simple debugging

none-matlab compiler: add validation of source files to find and possibly automatically fix common mistakes when preparing benchmarks

Common mistake 1: Forgetting a newline at the end of each matlab source file to make concatenation in a single script easy

Fork of https://github.com/vtsynergy/OpenDwarfs

45eed25

This is the first commit that uses code from the OpenDwarfs without the associated license file.

The Dwarfs are now exclusively licensed using the LGPL license, you may wish to replace some of the comment-block VT Non-commercial licenses you've added with the LGPL versions of those files for broader usability.

We would appreciate if the repository history could be corrected/rebased to reflect the origin and more accurately document the changes and additions made to our originals.

Port bfs to MATLAB

bitcmp issue

Recently, I have upgraded MATLAB on my laptop from v2009a to v2015b. I found the bitcmp has been changed in the new version. So far, nqueens has been affected. I will fix the problem after Erick's work is done. Please find details below.

bitcmp: bit-wise complement (see http://www.mathworks.com/help/matlab/ref/bitcmp.html)

The second parameter of bitcmp has been changed to an integer type name (see type name list ).

Old:

» bitcmp(12,8)

ans =

   243

New:

» bitcmp(12,'uint8')

ans =

   243

Note: the old code will run into an error in the new MATLAB.

Enable inlining for FFT

The C implementation could be improved considerably if the compiler was able to inline the complex_* functions into the loops. Could the implementations in common/complex_simple.c be moved to fft.c? The inlining, as well as removing the calls, enables vectorization too (at least by clang) and I see a 17% speedup on my Ryzen box. I'd like to make use of this suite for benchmarking WASM, with particular interest in autovec SIMD code, so it would be great if the suite was more friendly to traditional compiler optimisations that aid this.

Port fft to MATLAB

bug: setting values out of Uint32Array boundary

If you have size > 32 you can end up setting values out of the boundary of forbidden = new Uint32Array(32);
This operation is useless and bumps time of this specific for-loop from O(const) to O(N)

Ostrich/branch-and-bound/nqueens/js/nqueens.js

Line 126 in dd06ca4

for(k = 0; k < size; k++) {

Ensure correctness and empirical validity of the benchmarks and tools of the suite

Assess the suite according to the dos and donts from this page:

https://www.cse.unsw.edu.au/~gernot/benchmarking-crimes.html

(copied here for convenience and archival, original content from Gernot Heiser)

Systems Benchmarking Crimes
Gernot Heiser

Contents

Benchmarking Crimes
Selective Benchmarking
Not evaluating potential performance degradation
Subsetting
Selective data set
Microbenchmarks
Throughput only
Downplaying overheads
Calibration = evaluation set
No significance
Benchmarking simulated system
Inappropriate/misleading benchmarks
Relative numbers only
Improper baselinebase
Evaluate against self
Unfairly evaluating competitors
Arithmetic mean
Exercise for the reader
Best practice
Further information
Benchmarking Crimes

When reviewing systems papers (and sometimes even when reading published papers) I frequently come across highly misleading use of benchmarks. I'm not saying that the authors intend to mislead the reader, it's just as likely incompetence. But that isn't an excuse.

I call such cases benchmarking crimes. Not because you can go to jail for them (but probably should?) but because they undermine the integrity of the scientific process. Rest assured, if I'm a reviewer of your paper, and you commit one of those, you're already most of the way into rejection territory. The rest of the work must be pretty damn good to be forgiven a benchmarking crime (and even then you'll be asked to fix it up in the final version).

The following list is work in progress, I'll keep adding to it as I come across (or remember) more systems benchmarking crimes...

Selective benchmarking
This is the mother of all benchmarking crimes: using a biased set of benchmarks to (seemingly) prove a point, which might be contradicted by a broader coverage of the evaluation space. It's a clear indication of at best gross incompetence or at worst an active attempt to deceive.

There are several variants of this crime, I list the most prominent ones. Obviously, not all instances of this are equally bad, in some cases it may just be a matter of degree of thoroughness, but in its most blatant form, this is a truly hideous crime.

Not evaluating potential performance degradation
A fair evaluation of a technique/design/implementation that is supposed to improve performance must actually demonstrate two things:

Progressive criterion: Performance actually does improve significantly in the area of interest
Conservative criterion: Performance does not significantly degrade elsewhere
Both are important! You cannot easily argue that you've gained something if you improve performance at one end and degrade it at another.

Reality is that techniques that improve performance generally require some degree of extra work: extra bookkeeping, caching, etc. These things always have a cost, and it is dishonest to pretend otherwise. This is really at the heart of systems: it's all about picking the right trade-offs. A new technique will therefore almost always introduce some overheads, and you need to demonstrate that they are acceptable.

If you innovation does lead to a degree of degradation, then you need to analyse it, and build a case that it is acceptable given the other benefits. If, however, you only evaluate the scenarios where your approach is beneficial, you are being deceptive. No ifs, no buts.

Benchmark sub-setting without strong justification
I see this variant (which can actually be an instance of the previous one) frequently with SPEC benchmarks. These suites have been designed as suites for a reason: to be representative of a wide range of workloads, and to stress various aspects of a system.

However, it is also true that it is often not possible to run all of SPEC on an experimental system. Some SPEC programs require large memories (they are designed to stress the memory subsystem!) and it may be simply impossible to run them on a particular platform, particularly an embedded system. Others are FORTRAN programs, and a compiler may not be available.

Under such circumstances, it is unavoidable to pick a subset of the suite. However, it must then be clearly understood that the results are of limited value. In particular, it is totally unacceptable to quote an overall figure of merit (such as average speedup) for SPEC if a subset is used!

If a subset is used, it must be well justified. There must be convincing explanation for each missing program. And the discussion must be careful not to read too much into the results, keeping in mind that it is conceivable that any trend observed by the subset used could be reverted by programs not in the subset.

Where the above rules are violated, the reader is bound to suspect that the authors are trying to hide something. I am particularly allergic to formulations like “we picked a representative subset” or “typical results are shown”. There is no such thing as a “representative” subset of SPEC, and the ”typical” results are most likely cherry-picked to look most favourable. Expect no mercy for such a crime!

Lmbench is a bit of a special case. Its license actually forbids reporting partial results, but a complete lmbench run produces so many results that it is impossible to report in a conference paper. On the other hand, as this is a collection of micro-benchmarks which are probing various aspects of the OS, one generally understands what each measures, and may only be interested in a subset for good reasons. In that case, running the particular lmbench test has the advantage of measuring a particular system aspect in a well-defined, standardised way. This is probably OK, as long as not too much is being read into the results (and Larry McVoy doesn't sue you for license violation...)

A variant of this crime is arbitrarily picking benchmarks from a huge set. For example, when describing an approach to debug or optimise Linux drivers, there are obviously thousands of candidates. It may be infeasible to use them all, and you have to pick a subset. However, I want to understand why you picked the particular subset. Note that arbitrary is not the same as random, so a random pick would be fine. However, if your selection contains many obscure or outdated devices, or is heavily biased towards serial and LED drivers, then I suspect that you have something to hide.

Selective data set hiding deficiencies Selective data set crime
This variant can again be viewed as an example of the first. Here the range of the input parameter is picked to make the system look good, but the range is not representative of a real workload. For example, the diagram on the right shows pretty good scalability of throughput as a function of load, and without any further details this looks like a nice result.

Things look a bit different when we put the graph into context. Say this is showing the throughput (number of transactions per second) of a database system with a varying number of clients. So far so good.

Is it still so good if I'm telling you that this was measured on a 32-core machine? What we see then is that the throughput scales almost linearly as long as there is at most one client per core. Now that is not exactly a typical load for a database. A single transaction is normally insufficient for keeping a core busy. In order to get the best of your hardware, you'll want to run the database so that there are in average multiple clients per core.

Selective data set crime
So, the interesting data range starts where the graph ends! What happens if we increase the load into the really interesting range is shown in the graph on the left. Clearly, things no longer look so rosy, in fact, scalability is appalling!

Note that, while somewhat abstracted and simplified, this is not a made-up example, it is taken from a real system, and the first diagram is equivalent to what was in a real publication. And the second diagram is essentially what was measured independently on the same system. Based on a true story, as they say...

Pretending micro-benchmarks represent overall performance
Micro-benchmarks specifically probe a particular aspect of a system. Even if they are very comprehensive, they will not be representative of overall system performance. Macro-benchmarks (representing real-world workloads) must be used to provide a realistic picture of overall performance.

In rare cases, there is a particular operation which is generally accepted to be critical, and where significant improvements are reasonably taken as an indication of real progress. An example is microkernel IPC, which was long known to be a bottleneck, and reducing cost by an order of magnitude can therefore be an important result. And for a new microkernel, showing that it matches the best published IPC performance can indicate that it is competitive.

Such exceptions are rare, and in most cases it is unacceptable to make arguments on system performance based only on micro-benchmarks.

Throughput degraded by x% ⇒ overhead is x%
This vicious crime is committed by probably 10% of papers I get to review. If the throughput of a system is degraded by a certain percentage, it does not at all follow that the same percentage represents the overhead that was added. Quite to the contrary, in many cases the overhead is much higher. Why?

Assume you have a network stack which under certain circumstances achieves a certain throughput, and a modified network stack achieves 10% less throughput. What's the overhead introduced by the modification?

Without further information, it is impossible to answer that question. Why is throughput degraded? In order to answer that question, we need to understand what determines throughput in the first place. Assuming that there's more than enough incoming data to process, the amount of data the stack can handle depends mostly on two factors: processing (CPU) cost and latency.

Changes to the implementation (not protocols!) will affect processing cost as well as latency, but their effect on throughput is quite different. As long as CPU cycles are available, processing cost should have negligible effect on throughput, while latency may (packets will be dropped if not processed quickly enough). On the other hand, if the CPU is fully loaded, increasing processing cost will directly translate into latency.

Networks are actually designed to tolerate a fair amount of latency, so they shouldn't really be very sensitive to it. So, what's going on when throughput drops?

The answer is that either latency has grown substantially to show up in reduced throughput (likely much more than the observed degradation in throughput), or the CPU has maxed out. And if a doubling of latency results in a 10% drop of throughput, calling that “10% overhead” is probably not quite honest, is it?

If throughput was originally limited by CPU power (fully-loaded processor) then a 10% throughput degradation can be reasonably interpreted as 10% increased CPU cost, and that can be fairly called “10% overhead”. However, what if on the original system the CPU was 60% loaded, and on the modified system it's maxed out at 100% (and that leading to the performance degradation)? Is that still “10% overhead”?

Clearly not. A fair way to calculate overhead in this case would be to look at the processing cost per bit, which is proportional to CPU load divided by throughput. And on that measure, cost has gone up by 85%. Consequently, I would call that an 85% overhead!

The bottom line is that incomplete information was presented which prevented us from really assessing the overhead, and lead to a huge under-estimation. Throughput comparisons must always be accompanied by a comparison of CPU load, and the proper way to compare is in terms of processing time per bit!

Downplaying overheads
There are several ways people use to try to make their overheads look smaller than they are.

6% → 13% overhead is a 7% increase
This one is confusing percentage with percentage points, regularly practiced (out of incompetence) by the media. That doesn't excuse doing the same in technical publications.

So the authors' modified system increases processing overheads from 6% (for the original system) to 13% (for theirs) and they sheepishly claim they only added 7% overhead. Of course, that's complete bullocks! They more than doubled the overhead, their system is less than half as good as the original!

Similarly, if your baseline system has a CPU utilisation of 26%, and your changes result in a utilisation of 46%, you haven't increased load by 20%, you almost doubled it! The dishonesty in the 20% claim becomes obvious if you consider what would happen if the same experiments were run on a machine exactly half as powerful: load would go from 52% to 92%, clearly not a 20% increase!

Other creative overhead accounting
A particularly brazen example of creative accounting of overheads is in this paper (published in Usenix ATC, a reputable conference). In Table 3, the latency of the stat system call goes up from 0.39μs to 2.28μs, almost a six-fold increase. Yet the authors call it an “82.89% slowdown”!

Same dataset for calibration and validation
This is a fairly widespread crime, and it's frankly an embarrassment for our discipline.

Systems work frequently uses models which have to be calibrated to operating conditions (eg. platform, workloads, etc). This is done with some calibration workloads. Then the system is evaluated, running an evaluation workload, to show how accurate the model is.

It should go without saying, but apparently doesn't, that the calibration and evaluation workloads must be different! In fact, they must be totally disjoint. It's incredible how many authors blatantly violate this simple rule.

Of course, the results of using the same data for calibration and validation are likely that the model appears accurate, after all, it's been designed to fit the experimental results. But all such an experiment can show is how well the model fits the existing data. It implies nothing about the predictive power of the model, yet prediction of future measurements is what models are all about!

No indication of significance of data
Raw averages, without any indication of variance, can be highly misleading, as there is no indication of the significance of the results. Any difference between results from different systems might be just random.

In order to indicate significance, it is essential that at least standard deviations are quoted. Systems often behave in a highly deterministic fashion, in which case the standard deviation of repeated measurements may be very small. In such a case it might be sufficient to state that, for example, “all standard deviations were below 1%”. In such a case, if the effect we are looking at is, say, 10%, the reader can be reasonably comfortable with the significance of the results.

If in doubt use Student's t-test to check the significance.

Also, if you fit a line to data, quote at least a regression coefficient (unless it's obvious that there are lots of points and the line passes right through all of them).

Benchmarking of simplified simulated system
It is sometimes unavoidable to base an evaluation on a simulated system. However, this is extremely dangerous, as a simulation is always a model, and contains a set of assumptions.

It is therefore essential to ensure that the simulation model does not make any simplifying assumption which will impact the performance aspect you are looking for. And, it is equally important to make it clear to the reader/reviewer that you really have made sure that the model is truly representative with respect to your benchmarks.

It is difficult to give general advice on how to do this. My best advice is to put yourself into the shoes of the reader, and even better to get an outsider to read your paper and check whether you have really made a convincing case.

Inappropriate and misleading benchmarks
I see people using benchmarks that are supposed to prove the point, when in fact they say almost nothing (and the only thing they could possibly show is truly aweful performance). Examples:

Using uniprocessor benchmarks for multiprocessor scalability
This one seems outright childish, but that doesn't mean you don't see it in papers submitted by (supposedly) adults. Someone is trying to demonstrate the multiprocessor scalability of their system by running many copies of SPEC CPU benchmarks.

Of course, these are uniprocessor programs which do not communicate. Further more, they perform very few system calls, and thus do not exercise the OS or underlying communication infrastructure. They should scale perfectly (at least for low processor counts). If not, there's serious brokenness in the OS or hardware. Real scalability tests would run workloads which actually communicate across processors and use system calls.

Using a CPU-intensive benchmark to show networking overheads
Again, this seems idiotic (or rather, is idiotic) but I've seen it nevertheless. People trying to demonstrate that their changes to a NIC driver or networking stack has low performance impact, by measuring the performance degradation of a CPU-intensive benchmark. Again, the only thing this can possibly prove is that performance sux, namely if there is any degradation at all!

Relative numbers only
Always give complete result, not just ratios (unless the denominator is a standard figure). At best, seeing only relative numbers leaves me with a doubt as to whether the figures make sense at all, I'm robbed of a simple way to perform a sanity check. At worst, it can cover up that a result is really bad, or really irrelevant.

One of the worst instances I've seen of this crime was not in a paper I was reviewing, but one that was actually published. It compared the performance of two systems by showing the ratio of overheads: a ratio of two relative differences. This is too much relativity to read anything out of the numbers.

For example, assume that the overhead of one system is twice that of another. By itself, that tells us very little. Maybe we are comparing a tenfold with a twentyfold overhead. If so, who cares? Both are most likely unusable. Or maybe the overhead of one system is 0.1%, who cares if the other one has 0.2% overhead? The bottom line is we have no idea how significant the result is, yet the representation implies that it is highly significant.

No proper baseline
This crime is related to the above. A typical case is comparing different virtualization approaches by only showing the performance of the two virtualized systems, without showing the real baseline case, which obviously is the native system. It's comparison against native which determines what's good or bad, not comparison against an arbitrary virtualization solution!

Consider the baseline carefully. Often it is the state-of-the-art solution. Often it is the optimal (or theoreticaly best) solution or a hardware limit (assuming zero software overhead). The optimal solution is usually impossible to implement in a system, because it requires knowledge of the future or magic zero-cost software, but it can often be computed “outside” the system and is an excellent base for comparison. In other cases the correct baseline it is in some sense an unperturbed system (as in the virtualization example above).

Only evaluate against yourself
This is a variant of the above crime, but that doesn't make it rare. It might be exciting to you that you have improved the performance of your system over last year's paper, but I find it much less exciting. I want to see the significance, and that means comparing against some accepted standard.

At least this crime is less harmful that others in that it is pretty obvious, and rarely will a reviewer fall for it.

There's a variant of this crime which is more subtle: evaluating a model against itself. Someone builds a model of a system, making a number of simplifying assumptions, not all of them obviously valid. They build a solution for that problem, and then evaluate that solution on a simulated system that contains the exact same assumptions. The results look nice, of course, but they are also totally worthless, as there they are lacking the most basic reality check. This one I find a lot in papers which are already published. Depressing...

Unfair benchmarking of competitors
Doing benchmarks on your competitors yourself is tricky, and you must go out of your way to ensure that you do not treat them unfairly. I'm sure you tweaked your system as well as you could, but did you really go through the same effort with the alternative?

In order to reassure the reader/reviewer that you have been fair, describe clearly what you have done with the competitor system, e.g. fully describe all configuration parameters, etc. Be particularly circumspect if your results do not match any published data about the competitor system. If in doubt, contact the authors of that system to confirm that your measurements are fair.

Again, I have seen a case of this case of benchmarking abuse in a published paper, in that case the “competitor” system was mine. The authors of the paper failed to present any data on how they ran my system, and I strongly suspect that they got it wrong. For example, the default configuration of our open-source release had debugging enabled. Turning off that option (which, of course, you would in any production setting and any serious performance evaluation) improves performance massively.

The bottom line is that extreme care must be taken when doing your own benchmarking of a competitor system. It is easy to run someone else's system sub-optimally, and using sub-optimal results as a basis for comparison is highly unethical and probably constitutes scientific misconduct. And sloppiness is no excuse in such a case!

Arithmetic mean for averaging across benchmark scores
The arithmetic mean is generally not suitable for deriving an overall score from a set of different benchmarks (except where the absolute execution times of the various benchmarks have real significance). In particular the arithmetic mean has no meaning if individual benchmark scores are normalised (eg against a baseline system).

The proper way to average (i.e. arrive at a single figure of merit) is to use the geometric mean of the normalised scores [Fleming & Wallace, CACM (29), p 218].

Exercise for the Reader

Count the number of benchmarking crimes in this paper (published in IEEE CCNC'09).
Benchmarking Best Practice

The below benchmarking rules is what I tell my students. It's somewhat OS-oriented, but the basic principles apply generally.

General rules

Make sure that the system is really quiescent when starting an experiment, leave enough time to ensure all previous data is flushed out.
Make your benchmarking rig part of our regression testing suite.
Document what you're doing.
Test data and results

Always verify the data you are transferring. When writing something to disk or network, read it back and compare to what you've written. When reading, check that what you're reading is correct.
There are cases where this would unreasonably lengthen the time a benchmark takes. If that's the case, then at make sure that you least check the data for one complete run, before continuing. Also, prior to collecting final numbers, check again!
Never use the same data over and over. Make sure that each run uses different data. For example, have a timestamp or other unique identifier (like the coordinate and label in the graph) in the data. This is to ensure that you're actually reading the correct data, not some stale cache contents, wrong block, etc.
Use a combination of successive and separate runs for the same data point. E.g., do the same point at least twice in a row (helps to identify caching effects that shouldn't be there) and twice more after some other points were taken (to identify cases of caching where there shouldn't be any). Have a good look at the standard deviations.
Invert the order of measurements. This helps to identify interference between measurements. This and the previous point can together be achieved by traversing the set of data points in both directions.
Don't only use regular strides or powers of two. You may be hitting pathological cases without noticing it. Throwing in some random points might be a good idea. However, don't use only random points, you might be missing pathological cases. Good candidates for pathological cases are 2n, 2n-1, 2n+1.
When comparing measurements of different configurations (which is what you normally do) make sure you use exactly the same points, don't just compare graphs over the same interval.
When getting funny results, check that you are comparing apples with apples. For example, make sure that the system is in as much as possible the same state between runs you want to compare. For example, we had cases where benchmark results on Linux were affected by where the OS allocated them in physical memory, which differed between successive runs (and had massive effects on conflict misses in physically-addressed caches).
Statistics

Always do several runs, and check the standard deviation. Watch out for abnormal variance. In the sort of measurements we do, standard deviations are normally expected to be less than 0.1%. If you see >1% this should ring alarm bells.
In some cases it is reasonable to ignore the highest or lowest point (but this really should only be done after a proper statistical outlayer-detection procedure) or only look at the floor of the points. However, only use such selective use of data if you really know what you are doing, and also state it explicitly in your paper/report.
Timing

Use lots of iterations in order to improve statistics and remove clock granularity.
Run sufficient warm-up iterations which aren't timed.
Isolate the thing you want to time into a function (already done if you're timing system calls).
Eliminate loop overhead (don't just rely on it being small, eliminate it). The most reliable way of doing this is two run two versions of the benchmark, identical except for replacing the actual invocation (function or system call) by a noop. Run the loop without any compiler optimisations (which is why it's important to have the thing you want to time in a function, which however may require you to separately deal with function overhead).
Perform static analysis of the syscall loop above and verify that the timing numbers match your predictions.
Use proper statistics, even if they are not used in the final paper, checking for variance is an important sanity check.

Port nw to MATLAB

Automatically check for all required dependencies when running the tools

Add JSON file for meta properties of benchmarks

Properties such as:

Dwarf
Language supported
Version
Input size supported
etc.

xi_sum initialization js

File : bwa_hmm.js

Line 51: var xi_sum;

Line 457: xi_sum = new Array(nstates*nstates);
\ 2D array stored as 1D array
\ in case of nota, Array replaces Float32Array, and memory is initialized to undefined instead of 0

Line 268: for(t=0; t<nstates;++t)xi_sum[t]=0;
\ xi_sum is initialized to zero only for only one row (nstates), whereas it should be initialized for all rows (nstates * nstates)

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.