Hi,
I've built a minimal test which demonstrates that a tag/attribute having no value (i.e. it is not present) is confused with it having an empty value (i.e. it is present in the document but with an empty String(*) as its value).
I believe this is a bug, in the same way that List a = null in Java is different from List a = new ArrayList().
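To make the analogy concrete, here is a minimal sketch in plain Java. The class and method names are made up for illustration and have nothing to do with DOMTagger's own code; the point is only that "absent" and "present but empty" are two distinct states that a caller can tell apart.

```java
import java.util.ArrayList;
import java.util.List;

class NullVsEmpty {
    // Describes a list while keeping "no list at all" distinct
    // from "a list that exists but holds no elements".
    static String describe(List<String> values) {
        if (values == null) {
            return "absent";
        }
        return values.isEmpty()
                ? "present but empty"
                : "present with " + values.size() + " value(s)";
    }

    public static void main(String[] args) {
        List<String> a = null;              // no list at all
        List<String> b = new ArrayList<>(); // a list, but empty
        System.out.println(describe(a));
        System.out.println(describe(b));
    }
}
```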
Also, combined with the "new" defaultValue feature of DOMTagger, this makes things worse. Since a tag that is present but has an empty String as its value is considered not present, it is replaced by the defaultValue, thus losing its original (empty) value.
The provided example files demonstrate the problem: they extract separately FirstNames, MiddleNames and LastNames, then merge First&Middle Names into a new FirstName. To ensure we have the same number of values for all attributes of an author, we use the defaultValue option.
If you run this example you get:
EXP_FIRST-NAME-AUX=firstname11^|~firstname12^|~firstname13^|~firstname14^|~firstname21^|~firstname22^|~firstname23^|~firstname24^|~firstname31^|~firstname32^|~firstname33^|~firstname34
EXP_MIDDLE-NAME-AUX=middlename11^|~middlename12^|~middlename13^|~middlename14^|~middlename21^|~NO_MIDDLE-NAME^|~NO_MIDDLE-NAME^|~middlename24^|~NO_MIDDLE-NAME^|~middlename32^|~NO_MIDDLE-NAME^|~middlename34
EXP_FIRST-NAME=firstname11 middlename11^|~firstname12 middlename12^|~firstname13 middlename13^|~firstname14 middlename14^|~firstname21 middlename21^|~firstname22 NO_MIDDLE-NAME^|~firstname23 NO_MIDDLE-NAME^|~firstname24 middlename24^|~firstname31 NO_MIDDLE-NAME^|~firstname32 middlename32^|~firstname33 NO_MIDDLE-NAME^|~firstname34 middlename34
As can be seen, the empty Strings used for middlename22 and middlename23 are replaced by "NO_MIDDLE-NAME", in the same manner as for the non-present middlename31 and middlename33, which is not what I would expect.
This is also not consistent with what JSoup returns:
$ ./test_csssel_expr.sh crawlers/testDefVal/example2.xml "publication authors author middlename" # I can send you this script if necessary
middlename11
middlename12
middlename13
middlename14
middlename21


middlename24
middlename32
middlename34
As can be seen, JSoup distinguishes between a tag that is present and set to an empty string (empty lines are returned in place of middlename22 and middlename23) and a tag that is not present (no line is returned for middlename31 and middlename33).
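For reference, here is a small sketch of the distinction I would expect, written with the JDK's built-in DOM parser rather than JSoup or DOMTagger so that it runs without extra dependencies. The XML snippet is a hypothetical reduction of the attached example (only the publication/authors/author/middlename nesting is taken from the selector above), and the class name is made up.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class MiddleNameExtractor {
    // Returns one entry per <author>: null when <middlename> is absent,
    // "" when it is present but empty, otherwise its text content.
    static List<String> middleNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList authors = doc.getElementsByTagName("author");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < authors.getLength(); i++) {
                Element author = (Element) authors.item(i);
                NodeList middle = author.getElementsByTagName("middlename");
                result.add(middle.getLength() == 0 ? null : middle.item(0).getTextContent());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: a filled tag, an empty tag, and a missing tag.
        String xml = "<publication><authors>"
                + "<author><firstname>firstname21</firstname><middlename>middlename21</middlename></author>"
                + "<author><firstname>firstname22</firstname><middlename></middlename></author>"
                + "<author><firstname>firstname31</firstname></author>"
                + "</authors></publication>";
        for (String m : middleNames(xml)) {
            System.out.println(m == null ? "(absent)" : "[" + m + "]");
        }
    }
}
```

With a representation like this, the second author keeps an empty value and only the third one is a candidate for a default.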
By the way, I think this is a general problem in DOMTagger (and maybe in other Norconex products?), since this confusion also occurs in the handling of the "defaultValue" itself.
Indeed, in the same example files, if you replace the defaultValue with an empty string (as in the first commented-out line), then the defaultValue replaces a non-existent value with another non-existent (thus dropped) value, instead of an empty String(*). As a consequence, the defaultValue mechanism has absolutely no effect and we end up with a different number of MiddleNames than First and LastNames, which is not what I would expect.
(*) Finally, note that, since you seem to trim() all the entries in the XML files, the problem occurs not only for empty Strings but also for any sequence of space characters. E.g., setting the defaultValue to any (possibly empty) sequence of spaces results in this option having no effect (this is what the 2nd commented-out line demonstrates).
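The two behaviours can be contrasted with a short sketch. This is not DOMTagger's actual code: resolveConflated is only my guess at logic that would produce the results described above (trim, then treat blank as missing, for both the extracted value and the default), and resolve is what I would expect instead. All names here are hypothetical.

```java
class DefaultValueSketch {
    // Expected behaviour: the default is used only when the tag is absent (null).
    // An empty or blank value that is present in the document is kept as is.
    static String resolve(String extracted, String defaultValue) {
        return extracted == null ? defaultValue : extracted;
    }

    // Assumed current behaviour: values are trimmed and blank ones are
    // treated as missing, and the same is applied to the default itself.
    static String resolveConflated(String extracted, String defaultValue) {
        String value = blankToNull(extracted);
        return value != null ? value : blankToNull(defaultValue);
    }

    // Trims the input and maps an empty result to null.
    static String blankToNull(String s) {
        if (s == null) {
            return null;
        }
        String trimmed = s.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    public static void main(String[] args) {
        // Present-but-empty middlename: conflated logic substitutes the default.
        System.out.println("[" + resolveConflated("", "NO_MIDDLE-NAME") + "]");
        System.out.println("[" + resolve("", "NO_MIDDLE-NAME") + "]");
        // Absent middlename with a blank default: conflated logic drops the value.
        System.out.println("[" + resolveConflated(null, "   ") + "]");
        System.out.println("[" + resolve(null, "   ") + "]");
    }
}
```

In the conflated variant a null result stands for a dropped value, which is why the number of MiddleNames no longer matches the number of authors.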
example2.xml.txt
testDefVal2.xml.txt