
Comments (28)

dbeyer avatar dbeyer commented on May 31, 2024 1

The remaining question is whether we need another sub-category for unreach-call
(I assume not, because that was the conclusion of @mdangl's analysis)
and whether we need another sub-category for no-overflow
(I assume yes, because we wanted to have overflow checks for all programs).

Should we open a new issue on that and close this one, or continue the discussion here?

I would like to hear the opinion of @PhilippWendler, @mdangl, and @zvonimir.


mdangl avatar mdangl commented on May 31, 2024

Sure. Off the top of my head, I remember the following issues:

  1. The tasks usually expect command-line parameters and were prepared such that instead, these parameters are nondeterministically generated. However, in this preparation step, there are uninitialized local variables, causing undefined behavior. This is easy to fix.
  2. Also in the part where the artificial command-line parameters are generated, the array of parameters is null-terminated (as it should be), but the individual strings are not (but they should be). This is also easy to fix.
  3. The tasks are labelled, but the original labels were just guesses. In the meantime, some of the labels have been adjusted to match the verdicts of verification tools (mainly CPAchecker, I think), but no authoritative answer exists. This is harder to fix.

I'll try to fix issues 1) and 2) (a sketch of the idea follows below) and continue contributing towards 3), but for the latter we really need more people to produce and discuss verdicts; otherwise there is a high risk of bias.
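For illustration only, here is a minimal sketch of what a preparation step with 1) and 2) fixed could look like. The function name make_argv and the buffer sizes are made up for this example, and malloc failure handling is omitted; only the __VERIFIER_* intrinsics are the standard SV-COMP ones.

#include <stdlib.h>

extern int __VERIFIER_nondet_int(void);
extern char __VERIFIER_nondet_char(void);
extern void __VERIFIER_assume(int);

/* Hypothetical preparation step: build a nondeterministic argv whose
   strings are fully initialized (issue 1) and null-terminated (issue 2). */
static char **make_argv(int *argc_out)
{
  int argc = __VERIFIER_nondet_int();
  __VERIFIER_assume(argc >= 1 && argc <= 4);   /* at least the program name */

  char **argv = malloc((unsigned long)(argc + 1) * sizeof(char *));
  for (int i = 0; i < argc; i++) {
    argv[i] = malloc(11);                      /* malloc failure handling omitted */
    for (int j = 0; j < 10; j++)
      argv[i][j] = __VERIFIER_nondet_char();   /* initialized, not left undefined */
    argv[i][10] = '\0';                        /* null-terminate each string */
  }
  argv[argc] = (char *)0;                      /* null-terminate the array itself */

  *argc_out = argc;
  return argv;
}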


mdangl avatar mdangl commented on May 31, 2024

Now, let me try to analyse some calls to __VERIFIER_error in these tasks:
In several tasks, at the beginning of the main function there is the following assertion (or similar):

argv = argv + (signed long int)optind;
if(!(argv == ((char **)((void *)0))))
  (void)0;
else
  __VERIFIER_error();

Since optind seems to be 0 at that point in some tasks (and in some other tasks, there is not even a statement modifying argv), in these cases this is basically a complex way of saying that the array of command-line arguments, argv, must not be NULL.
Since we explicitly initialize the command-line arguments beforehand, this check does not make any sense and should either be removed or be replaced by __VERIFIER_assume(argv != ((char **)((void *)0))).
I don't think the call to __VERIFIER_error makes sense here.
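For illustration, the check above rewritten with the suggested assumption would look roughly like this (just a sketch of the proposed change, not an actual patch):

argv = argv + (signed long int)optind;
/* constrain the harness instead of reporting a spurious error */
__VERIFIER_assume(argv != ((char **)((void *)0)));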

In some tasks, the code then continues as follows (example):

if(!(*(1l + argv) == ((char *)((void *)0))))
  {
    if(!(argv == ((char **)((void *)0))))
      (void)0;
    else
      __VERIFIER_error();
    src_fd=xopen(argv[(signed long int)0], 0);
    fstat(src_fd, &stat_buf);
    mode = stat_buf.st_mode & (unsigned int)(256 | 128 | 64 | (256 | 128 | 64) >> 3 | ((256 | 128 | 64) >> 3) >> 3);
    argv = argv + 1l;
  }
if(!(argv == ((char **)((void *)0))))
  (void)0;
else
  __VERIFIER_error();

Now, the first if condition here I think I understand: check that argv[1] != NULL, i.e. that we have not yet reached the end of the null-terminated array. Right? Note that there is no check that (argv + 1) != NULL before dereferencing it, unless I misunderstood the previous code and optind is 1 instead of 0.
However, the next lines (the if/else up to the call to __VERIFIER_error) are the same as discussed above. Also note that at this point, argv still has not been modified. So if my analysis so far is correct, this part is redundant and should either be removed or dealt with as above.

On to the next lines. There is some seemingly relevant stuff related to the actual program,
followed by incrementing the argv pointer to point to the next element of the array.
Note that we only get there if the dereference of this address in the condition of the surrounding if block succeeded and pointed to a non-NULL element.
Then the code leaves the block, and again there is a NULL-check for argv.
But it cannot be NULL here, because if it was NULL, we would have dereferenced NULL above.

And that's it. Those are all the calls to __VERIFIER_error in this task, meaning the rest of the code in these tasks is basically boilerplate and not relevant to the verification at all.
And looking at some other tasks, there seem to be a lot of nonsensical calls to __VERIFIER_error, and I am not sure how many, if any, are actually genuine checks for bugs in the BusyBox code.

In #64, I already mentioned these issues briefly for some tasks where I first noticed them and later fixed them by changing their labels in #109 when nobody else had any comments.
I don't think this resolved the underlying problem, though, because if I was right, those tasks are trivial, and if I was wrong, their labels may be incorrect.

Any thoughts on how to proceed with these tasks?


mdangl avatar mdangl commented on May 31, 2024

Another thing I just thought about: Some tasks seem to (reasonably) expect argc to be at least 1, because as far as I know, there is always at least the name of the executable in the array of command-line parameters in C. So __VERIFIER_assume(argc>=0); should probably be __VERIFIER_assume(argc>=1); in all of these tasks.
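For illustration (the surrounding harness code here is my assumption; only the intrinsics are the standard SV-COMP ones):

int argc = __VERIFIER_nondet_int();
__VERIFIER_assume(argc >= 1);   /* argv[0] always holds at least the program name */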


PhilippWendler avatar PhilippWendler commented on May 31, 2024

@tautschnig?


dbeyer avatar dbeyer commented on May 31, 2024

What does "and add them to some category" mean? They are a subcategory of the category "SoftwareSystems" (just like what we started with last year).

On 2016-11-03 08:45 AM, Philipp Wendler wrote:

@tautschnig https://github.com/tautschnig?




PhilippWendler avatar PhilippWendler commented on May 31, 2024

The task dirname_true-unreach-call does not have a call to __VERIFIER_error.


zvonimir avatar zvonimir commented on May 31, 2024

What are we doing with BusyBox in the end?
We are still observing numerous problems. The problem that I attempted to fix in pull request #204 still persists.


tautschnig avatar tautschnig commented on May 31, 2024

Apologies for joining the party late. So busybox is just off-the-shelf software and the properties that had been included were those that 1) CBMC could auto-generate and 2) were expressible in C. If anyone has better infrastructure for adding properties to code I'd be very happy to see that applied here.


tautschnig avatar tautschnig commented on May 31, 2024

The main problem is that the "interesting" properties on these benchmarks would be all about checking string accesses/array bounds. And those properties are a lot harder to express in C...
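To illustrate what I mean, guarding even a single string access directly in C already requires something like the following; checked_read is a hypothetical helper written for this example, not part of the benchmarks:

extern void __VERIFIER_error(void);

/* Hypothetical instrumentation: the bound has to be known and passed
   explicitly at every access site for the property to be expressible. */
char checked_read(const char *buf, unsigned long len, unsigned long i)
{
  if (i >= len)
    __VERIFIER_error();   /* an out-of-bounds access would be the violation */
  return buf[i];
}

In off-the-shelf code like busybox, the relevant bound is usually not available at the access site, which is why such manual instrumentation does not scale.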


mdangl avatar mdangl commented on May 31, 2024

To sum up my interpretation of this discussion, so that a decision can be made for SVCOMP'17:

A verification task consists of two things: a system and a specification.
Here in "BusyBox", we have an interesting system, so the first hurdle is cleared.
However, we do not have a sensible specification to check, so "BusyBox" right now does not qualify as a set of verification tasks.

Coming up with a new specification and weaving it into the code does not seem to be feasible before the deadline and the submitter of the system (@tautschnig) seems to agree with me here.

Therefore, I (regrettably) recommend not using the category "BusyBox" in SVCOMP'17.


tautschnig avatar tautschnig commented on May 31, 2024

I do completely agree with @mdangl's summary. Thus: Close #204. I would suggest tagging this issue "SVCOMP'18". @dbeyer I'm not sure what the mechanics of removing a full category are.


dbeyer avatar dbeyer commented on May 31, 2024

I am not in favor of removing those nice programs from the competition.
When I created the .prp file for this category, I think I added an incomplete
or inappropriate set of properties, because @tautschnig did not really
mean to check (only) for calls to the error function.
I added the usual properties with 152bf01.
@tautschnig @mdangl Could you please tell me which of the properties are not appropriate?


tautschnig avatar tautschnig commented on May 31, 2024

@dbeyer I suppose in addition to 152bf01 we also need to label each of the files? We could mark them _true-no-overflow_true-valid-memsafety until proven otherwise?


dbeyer avatar dbeyer commented on May 31, 2024


dbeyer avatar dbeyer commented on May 31, 2024

I changed the property file (commit 3ed8408) to include only the three MemSafety properties,
because I am not sure if everybody would understand combined specifications.
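For reference, a property file restricted to the three MemSafety properties would look roughly like this in the usual SV-COMP property syntax (quoted from memory, so please check against the actual commit):

CHECK( init(main()), LTL(G valid-free) )
CHECK( init(main()), LTL(G valid-deref) )
CHECK( init(main()), LTL(G valid-memtrack) )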


tautschnig avatar tautschnig commented on May 31, 2024

In #265 I'm claiming that this issue would be fixed (by merging that PR). I can amend 0aac33a if people don't think that's the case (or we can manually re-open this issue).


zvonimir avatar zvonimir commented on May 31, 2024

I cannot follow this discussion; there are too many things going on. So I also cannot offer any kind of informed opinion. BusyBox in its current state is pretty messed up, and I have no idea how adding more properties to check will make it more usable. Having overflow checks for all programs will be a demo category, I guess. The main thing I am against is making radical changes to categories/properties so late in the process. In my opinion, this should have been done 2 months ago, before the first deadline, since it can have a significant impact on the competition.


mdangl avatar mdangl commented on May 31, 2024

  1. I think all calls to __VERIFIER_error() should be removed from the tasks, because we agreed that right now, it does not make much sense to check the unreach-call property anyway, and having these calls only serves to confuse, as witnessed by the initial discussion.
  2. Checking other properties, such as no-overflow and valid-memsafety, is a good idea in general. However, I fear that the quality of these verification tasks will be too low for inclusion in (non-demo) SVCOMP'17, simply because there is no time for quality assurance: For all other tasks, many teams spent a lot of effort checking and discussing the validity of their verdicts. If we now blindly add these tasks as true-no-overflow and true-valid-memsafety, we risk penalizing tools that find real bugs and rewarding those that miss them. I know that there is a phase for discussing controversial results, but it cannot replace the sort of quality assurance we aimed to achieve in the past few weeks, and in particular, it will likely be simply too short for the number of affected tasks. What do we do if tools disagree on a vast majority of tasks? Count only those where tools agree? If so, why count them at all? Remove the BusyBox category afterwards again? How does this reflect on SVCOMP and on the overall quality of the benchmark set we want to establish?
    I agree that we should use this system, because it represents real-world software. I also agree that we need to actually include it in the competition in some way to determine the verdicts in the first place, otherwise the tasks will lie dormant for another twelve months before the same problem arises again. But I do think that we need to either downgrade it to some kind of "demo" category or define a course of action for handling the potentially large amount of issues that may arise:
    Starting out with true seems to be a sane starting point (and I certainly would not prefer starting out with false either), but it may also prove to be a naïve assumption.


dbeyer avatar dbeyer commented on May 31, 2024


tautschnig avatar tautschnig commented on May 31, 2024

I am happy with either course of action, as I am not in a position to fight strongly for having those benchmarks included, given my late contribution. Yet if we go with the setup established in the last few hours, then any bugs found (which, as Dirk says, should come with a reliable result) would actually be contributions towards improving the busybox project.


zvonimir avatar zvonimir commented on May 31, 2024

I share the same concerns as @mdangl. Discussing labels of benchmarks after the competition seems like a bad idea. I suggest we run them as a demo category, and we leverage the outcome of this demo run to properly label them for next year.
@dbeyer : In the past, we observed benchmarks labeled as true for which SMACK generates an error witness that gets happily verified by CPAchecker. We submitted those as proof that the benchmark is labeled wrongly. However, you (or someone from the CPAchecker team) actually said that CPAchecker can indeed verify wrong error witnesses, that this is expected, and that a witness verified by CPAchecker cannot serve as a proof that a benchmark is false. Now you are suggesting that we should use exactly that strategy, which I think contradicts what you claimed in the past.


dbeyer avatar dbeyer commented on May 31, 2024

@zvonimir This is a misunderstanding: I did not suggest to rely on the verdict of CPAchecker's validator.
I suggested to look at the error path that the witness describes.

Last year, when you found bugs that were not accepted as bugs by the jury,
I think that was due to the Simple memory model, wasn't it?
(Or it was after the freezing.)
And as far as I remember that was even after the final runs when the results tables were done already.

Relabeling is permitted until this Friday without any problem.
We agreed to freeze the benchmarks this Friday.

The BusyBox benchmarks were submitted. So they are in, and our task now is to fix them.
They are not easily removable or downgradable to demo, because they were in long before ...


dbeyer avatar dbeyer commented on May 31, 2024

In my opinion, this issue should be closed and a new one should be opened as soon as a concrete problem is found.


zvonimir avatar zvonimir commented on May 31, 2024

@dbeyer : Above, you asked 3 people for opinions about these benchmarks, myself included. 2/3 people (myself and @mdangl) said that we should not include these benchmarks this year, and yet you still go ahead and include them. I will not waste my time giving my opinions in the future. Feel free to proceed as you wish.


dbeyer avatar dbeyer commented on May 31, 2024

@zvonimir Sorry for the misunderstanding:

@mdangl said "I agree that we should use this system, because it represents real-world software."
and that a course of action should be set. The course of action is to fix the labels by Friday.

@zvonimir said "Discussing labels of benchmarks after the competition seems like a bad idea"
and I agree. The labels are not yet frozen. (Why should they be?)

What exactly is your remaining problem with BusyBox?


zvonimir avatar zvonimir commented on May 31, 2024

Got it, all is good. Let's proceed. Thanks!


dbeyer avatar dbeyer commented on May 31, 2024

Thanks. I just sent a mail to the mailing list to remind people that category removal is not an option here.

