
adsfulltext's Issues

HTML extractor cannot decode certain characters

Some HTML files contain characters that codecs can't decode, which results in extraction failure. See 2010A&A...514A..98C for an example. Characters throwing this error include 0xf3, 0xf6, 0xe1, 0xe0, etc.
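A hedged sketch of a more tolerant decode step; the helper name and the candidate-encoding list are assumptions, not the pipeline's actual code:

```python
# Sketch only: try a list of encodings; as a last resort, replace bad bytes.
def tolerant_decode(raw_bytes, candidates=("utf-8", "latin-1")):
    for enc in candidates:
        try:
            return raw_bytes.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw_bytes.decode("utf-8", errors="replace")

# 0xf3, 0xf6, 0xe1, 0xe0 are all valid latin-1 accented letters, so the
# fallback recovers text that a strict utf-8 decode rejects.
assert tolerant_decode(b"Garc\xeda P\xe9rez") == "Garc\u00eda P\u00e9rez"
```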

Improve namespace handling in XML extraction

Currently the XPath expressions for extracting fulltext do not take any namespaces into account, and thus fail to extract the body from an XML document that uses a namespaced element such as <ja:body>.

One way to fix this is to maintain a list of namespaced XPaths as we find them (e.g., for the case above, //{http://www.elsevier.com/xml/ja/schema}body), or to attempt a wildcard match such as //*[local-name()="body"].
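The pipeline uses lxml, but stdlib ElementTree is enough to sketch both options; the sample document below is illustrative, only the Elsevier namespace URI comes from the issue:

```python
import xml.etree.ElementTree as ET

doc = '''<article xmlns:ja="http://www.elsevier.com/xml/ja/schema">
  <ja:body><p>Full text here.</p></ja:body>
</article>'''
root = ET.fromstring(doc)

# An un-namespaced path misses the element entirely:
assert root.find('.//body') is None
# Option 1: spell out the namespace per publisher as we find them.
body = root.find('.//{http://www.elsevier.com/xml/ja/schema}body')
# Option 2 (Python 3.8+): wildcard the namespace, matching any {ns}body.
any_body = root.find('.//{*}body')
assert body is not None and any_body is not None
```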

XML articles with valid empty body should end up with an empty body field in solr

When an XML document is well formed and has no content within its body tag, the expected result for the pipeline is an empty fulltext/body field.

With our current implementation, we instead end up with the full XML content (stripped of tags, so the title, authors, affiliations, etc.) in the body Solr field, even though the XML document was well formed and its body was empty.

We need to:

  • Get from the logs how many documents show the "Parsing XML in non-standard way" log message
  • From those, check how many do really have empty bodies (inspect them all or a random sample if it's too large)

If the majority are well-formed XML documents with body tags that contain no useful content, consider changing our current implementation:

for parser_name in preferred_parser_names:
    parsed_xml = self._parse_xml(parser_name)
    logger.debug("Checking if the parser '{}' succeeded".format(parser_name))
    success = False
    for xpath in META_CONTENT[self.meta_name].get('fulltext', {}).get('xpath', []):
        fulltext = None
        fulltext_elements = parsed_xml.xpath(xpath)
        if len(fulltext_elements) > 0:
            fulltext = u" ".join(map(unicode.strip, map(unicode, fulltext_elements[0].itertext())))
            fulltext = TextCleaner(text=fulltext).run(decode=False, translate=True, normalise=True, trim=True)
        if not fulltext:
            continue
        else:
            logger.debug("The parser '{}' succeeded".format(parser_name))
            success = True
            break
    if not success:
        logger.debug("The parser '{}' failed".format(parser_name))
    else:
        break
if not success:
    logger.warn('Parsing XML in non-standard way')
    parsed_xml = lxml.html.document_fromstring(self.raw_xml.encode('utf-8'))

Instead, iterate through all the parsers (as we do now) but, if all fail, leave the body empty rather than falling back to that "non-standard way". The check could also be expanded beyond verifying that the body is non-empty: if the acknowledgements or the facilities field is non-empty, we can probably consider that the parser succeeded as well.
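The proposed selection logic might be sketched like this; the helper and the per-parser result dicts are hypothetical, not the pipeline's API:

```python
# Hypothetical helper: a parser "succeeds" if any of several extracted
# fields is non-empty; if every parser fails, return an empty body
# instead of falling back to the non-standard re-parse.
def choose_extraction(parser_results,
                      checked_fields=('fulltext', 'acknowledgements', 'facility')):
    for result in parser_results:
        if any(result.get(field) for field in checked_fields):
            return result
    # All parsers found nothing: treat as a valid empty body.
    return {'fulltext': ''}

assert choose_extraction([{'fulltext': ''},
                          {'fulltext': 'Body text'}]) == {'fulltext': 'Body text'}
assert choose_extraction([{'fulltext': ''}]) == {'fulltext': ''}
```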

Do not remove facilities from acknowledgements

We are removing facilities from acknowledgements:

# move facilities out of acknowledgments
for e in parsed_xml.xpath(" | ".join(META_CONTENT['xml']['facility']['xpath'])):
self._append_tag_outside_parent(e)

but it is useful to keep this information there so that users who search the ack field can find the facilities (if that's what they are looking for). It would be better to extract the facilities without removing them from the acknowledgments.
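An illustrative sketch of reading facility names out of the acknowledgements without moving the elements (the sample markup is made up; only the named-content/content-type convention comes from the issues below):

```python
import xml.etree.ElementTree as ET

doc = ('<back><ack><p>We thank everyone. '
       '<named-content content-type="facility">HST</named-content>'
       '</p></ack></back>')
root = ET.fromstring(doc)

# Copy the facility text instead of detaching the elements, so the
# acknowledgements field keeps mentioning the facilities.
facilities = [e.text for e in root.iter('named-content')
              if e.get('content-type') == 'facility']
ack_text = ''.join(root.find('.//ack').itertext())
assert facilities == ['HST']
assert 'HST' in ack_text  # still searchable in the ack field
```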

PDFBox TrueTypeFont exceptions

A number of exceptions appear in the PDF extraction log file which indicate a bug in reading true font files. The issue is discussed here: https://issues.apache.org/jira/browse/PDFBOX-2428

We should try to see if a PDFBox upgrade fixes the problem for us.

2016-12-01 12:03:49 ERROR TrueTypeFont:286 - An error occured when reading table hmtx
java.io.EOFException
        at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
        at org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
        at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
        at org.apache.fontbox.ttf.TrueTypeFont.getHorizontalMetrics(TrueTypeFont.java:204)
        at org.apache.fontbox.ttf.TrueTypeFont.getAdvanceWidth(TrueTypeFont.java:346)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:677)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
        at org.adslabs.adsfulltext.PDFExtract.extract(PDFExtract.java:112)
        at org.adslabs.adsfulltext.PDFExtractList.f(PDFExtractList.java:53)
        at org.adslabs.adsfulltext.Worker.process(Worker.java:144)
        at org.adslabs.adsfulltext.Worker.subscribe(Worker.java:187)
        at org.adslabs.adsfulltext.Worker.run(Worker.java:279)
        at org.adslabs.adsfulltext.App.main(App.java:82)

Facilities not properly tokenized

I'm finding that some Solr records have facility fields whose contents are not properly tokenized, for example:

        "facility":["PROBA2 ; SDO (AIA",
          "HMI); SOHO ; STEREO (SECCHI",
          "IMPACT",
          "PLASTIC",
          "WAVES); Wind (MFI",
          "SWE",
          "WAVES); NDA; NRH."],
        "bibcode":"2019ApJ...878...37P"

Facilities appear outside of acknowledgements in XML

For the following 23 bibcodes, the facility tags are placed outside of the acknowledgment tags. We need to move the facility tags into the acknowledgments during extraction/parsing if we want this information to be available to users.

2011ApJ...738..120A
<p>
<italic>Facilities:</italic> <named-content content-type="facility" xlink:href="HST"><italic>HST</italic> (COS, STIS)</named-content>, <named-content content-type="facility" xlink:href="ROSAT"><italic>ROSAT</italic> (PSPC)</named-content></p>
</sec>
</body>
<back>
<ack>
<p>This work was supported by grant HST-GO-11687.01-A from STScI, and has made use of public databases hosted by SIMBAD and VizieR, both maintained by CDS, Strasbourg, France, and the High Energy Astrophysics Science Archive Research Center, at the NASA Goddard Space Flight Center.</p>
</ack>

2011ApJ...740..109O
2011ApJS..192....4A
2011ApJ...738...27B
2010ApJ...721.1933S
2013ApJ...773...15W
2012ApJ...744...20S
2012ApJ...746..128S
2010ApJS..189...37E
2015ApJ...805...92O
2015ApJ...799...52W
2014ApJ...782...74H
2015ApJ...798...61T
2010ApJ...708..868C
2014ApJS..211....9L
2011ApJ...726...95D
2016ApJ...824...11G
2012ApJ...750...99T
2010ApJS..191..376S
2011ApJ...735...76K
2011ApJ...735...48S
2014ApJS..213....1T
2011ApJ...739....5W

long text

sizes over 32766 chars cause Solr to reject the document

incidentally, I'm seeing that the extraction is full of \u00a0 (non-breaking spaces) in place of spaces; perhaps that could be treated

example: 2009arXiv0906.5028K

this\\u00a0research\\u00a0project\\u00a0was\\u00a0conducted\\u00a0at\\u00a0St.\\u00a0Louis\\u00a0University\\u00a0School\\u00a0of\\u00a0Medicine.\\u00a0\\nTK\\u00a0received\\u00a0salary\\u00a0support\\u00a0from\\u00a0Sankyo\\u00a0Co,\\u00a0Ltd
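A possible cleanup for the non-breaking spaces: NFKC normalisation folds U+00A0 (and other compatibility characters) to their plain equivalents. Illustrative only; TextCleaner's normalise step may already cover part of this.

```python
import unicodedata

raw = 'this\u00a0research\u00a0project'
# NFKC maps the no-break space U+00A0 to an ordinary space.
cleaned = unicodedata.normalize('NFKC', raw)
assert cleaned == 'this research project'
```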

Extract facilities from AAS XML

For ApJ, ApJS, ApJL, AJ, we should look for and extract the following tags:

<named-content content-type="facility" xlink:href="foo"> 

These should then be used to populate the "facility" field and facets.
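A sketch of pulling the facility identifiers out of this markup; the sample snippet is made up around the tag shown above, and only the xlink namespace URI is standard:

```python
import xml.etree.ElementTree as ET

XLINK_HREF = '{http://www.w3.org/1999/xlink}href'
doc = ('<p xmlns:xlink="http://www.w3.org/1999/xlink">'
       '<named-content content-type="facility" xlink:href="HST">'
       'HST (COS, STIS)</named-content></p>')
root = ET.fromstring(doc)

# The xlink:href value is what would populate the "facility" field/facets.
facilities = [e.get(XLINK_HREF) for e in root.iter('named-content')
              if e.get('content-type') == 'facility']
assert facilities == ['HST']
```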

Fulltext query for "Mars" and "HiRISE" returns fewer papers than before

A user reported that querying for fulltext:"Mars" and fulltext:"HiRISE" returned 1,649 papers in Fall 2019, but as of January 2020 the same query returns only 1,250 papers. We are currently waiting for the user to respond with an example of a paper that used to show up but no longer does. I am documenting this here in case they don't respond or can't find an example paper.

Multi-file parse

It appears fulltext extraction generates an error when a line in /proj/ads/abstracts/config/links/fulltext/all.links contains multiple files, for example, a single line from all.links:

2003A&A...402..531C /proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/aah3724.right.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/tableE.1.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table2.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table1.html EDP

Force override

Force the pipeline to do full text extraction regardless of the CheckIfExtract worker.

Should we include an extra field 'appendix'

Should we extract the appendix of the paper as a separate entity that would allow it to be indexed and searched by the users, or stick it on to the end of the full text?

Wiley separation of acknowledgements from body content

Hi Jonny,

Didn't know where to create an issue for this, so I'm just letting you know about it. It looks like the XML parser for content we got from Wiley is not working as well as it should. As an example, see the parsed content from 2012MNRAS.419.3018C, which is in /proj/ads/articles/fulltext/extracted/20/12/MN/RA/S,/41/9,/30/18/C/

The fulltext clearly includes all of the acknowledgments as well as the bibliography, because the acknowledgements are defined within the body tags.

Update parser for Nature XML

The DTD used in Nature articles does not match any of the extraction rules currently used to find an article body.

More content extraction

This is a general comment on content extraction. It should be possible to extract more meaningful content from the files that are ingested. For example, PDF, OCR, and TXT files do not differentiate between types of content, unlike HTML and XML files. Once a schema is in place, it can be applied to all of the plain-text files that are extracted.

Extract useful data from tables

A user requested to be able to search the contents of a table using full:, for example, in this table the user wants to be able to search full:"2007 TK422" from the "Designation/Names" column. Currently, our protocol is not to extract the contents of tables, so for this table we extract only the title and footer, which means the only thing we have in fulltext for it is "Table 3 Colors of Centaurs" and "Note. a Two different epochs. See Table 2". See paper here.

The reason our protocol does not extract and index table data is to avoid needlessly increasing the load on Solr. Tables can be formatted poorly, or not at all, so the intent was to avoid indexing chunks of numbers and characters that lose their meaning upon extraction.

XML for this table:

<table>
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="center"/>
<col align="char" char="."/>
<col align="char" char="."/>
<col align="char" char="."/>
<col align="char" char="."/>
</colgroup>
<thead>
<tr>
<th align="left">Number</th>
<th align="left">Designation/Name</th>
<th align="center">UT Date</th>
<th align="center">Span</th>
<th align="center"><italic>r</italic></th>
<th align="center">&Delta;</th>
<th align="center"><italic>&agr;</italic></th>
</tr>
<tr>
<th align="left"/>
<th align="left"/>
<th align="center"/>
<th align="center">(hr)</th>
<th align="center">(au)</th>
<th align="center">(au)</th>
<th align="center">(deg)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">469372</td>
<td align="left">2001 QF298</td>
<td align="center">2003 Nov 22.18</td>
<td align="char" char=".">1.4</td>
<td align="char" char=".">42.704</td>
<td align="char" char=".">42.373</td>
<td align="char" char=".">1.25</td>
</tr> 

The output of this table would look like this:

Table 3 Colors of Centaurs Number Designation/Name Telescope H V obs 167P CINEOS VATT 9.81 21.06 0.76 ± 0.03 0.53 ± 0.02 1.29 ± 0.03 55576 Amycus VATT 8.09 19.94 1.11 ± 0.02 0.70 ± 0.02 1.82 ± 0.03 60558 Echeclus VATT 9.66 ⋯ a 0.85 ± 0.04 0.55 ± 0.03 1.39 ± 0.04 120181 2003 UR292 VATT 7.51 22.03 1.03 ± 0.06 0.64 ± 0.05 1.67 ± 0.06 136204 2003 WL7 VATT 8.86 21.06 0.74 ± 0.04 0.49 ± 0.02 1.23 ± 0.04 145486 2005 UJ438 VATT 11.57 21.01 1.01 ± 0.03 0.63 ± 0.03 1.64 ± 0.04 248835 2006 SX368 VATT 9.50 20.54 0.74 ± 0.02 0.47 ± 0.02 1.22 ± 0.02 250112 2002 KY14 VATT 10.41 20.15 1.06 ± 0.02 0.69 ± 0.02 1.75 ± 0.02 281371 2008 FC76 DCT 9.69 19.75 0.97 ± 0.02 0.63 ± 0.02 1.60 ± 0.03 309139 2006 XQ51 VATT 10.33 21.47 0.74 ± 0.03 0.41 ± 0.03 1.15 ± 0.03 309737 2008 SJ236 VATT 12.50 20.10 1.01 ± 0.02 0.58 ± 0.02 1.60 ± 0.02 309741 2008 UZ6 VATT 11.23 21.46 0.92 ± 0.04 0.59 ± 0.03 1.52 ± 0.04 315898 2008 QD4 VATT 11.41 19.17 0.74 ± 0.02 0.46 ± 0.02 1.20 ± 0.02 336756 2010 NV1 VATT 10.71 20.85 0.74 ± 0.03 0.50 ± 0.02 1.24 ± 0.02 341275 2007 RG283 VATT 8.77 21.26 0.79 ± 0.03 0.47 ± 0.03 1.26 ± 0.03 342842 2008 YB3 VATT 9.59 18.34 0.77 ± 0.02 0.46 ± 0.02 1.23 ± 0.02 346889 Rhiphonos VATT 11.99 20.05 0.82 ± 0.02 0.55 ± 0.02 1.37 ± 0.02 349933 2009 YF7 DCT 11.00 20.88 0.72 ± 0.02 0.46 ± 0.02 1.18 ± 0.03 382004 2010 RM64 VATT 11.24 19.52 1.00 ± 0.02 0.55 ± 0.02 1.56 ± 0.02 447178 2005 RO43 VATT 7.26 21.29 0.77 ± 0.03 0.47 ± 0.03 1.24 ± 0.03 449097 2012 UT68 DCT 9.81 20.92 1.02 ± 0.02 0.66 ± 0.02 1.68 ± 0.03 459865 2013 XZ8 DCT 9.77 20.73 0.72 ± 0.02 0.45 ± 0.02 1.17 ± 0.03 459971 2014 ON6 DCT 12.10 20.53 0.97 ± 0.02 0.58 ± 0.02 1.55 ± 0.03 463368 2012 VU85 DCT 8.73 22.96 1.07 ± 0.06 0.63 ± 0.04 1.70 ± 0.07 471339 2011 ON45 DCT 11.94 22.32 1.11 ± 0.03 0.71 ± 0.02 1.81 ± 0.04 2002 PQ152 DCT 9.91 23.52 1.13 ± 0.04 0.72 ± 0.05 1.85 ± 0.06 2002 QX47 VATT 8.85 22.15 0.70 ± 0.04 0.38 ± 0.04 1.08 ± 0.04 2007 RH283 VATT 8.60 21.38 0.72 ± 0.03 0.43 ± 0.02 1.15 ± 0.03 2007 TJ422 DCT 11.55 24.25 
1.74 ± 0.06 2007 TK422 DCT 9.35 22.89 0.71 ± 0.03 0.51 ± 0.02 1.22 ± 0.04 2007 UM126 VATT 10.26 21.20 0.74 ± 0.03 0.39 ± 0.03 1.13 ± 0.03 2007 VH305 VATT 11.91 21.02 0.69 ± 0.03 0.49 ± 0.02 1.18 ± 0.02 2010 BK118 VATT 10.42 18.47 0.81 ± 0.02 0.50 ± 0.02 1.32 ± 0.02 2010 TH DCT 9.27 21.43 0.72 ± 0.02 0.46 ± 0.02 1.18 ± 0.03 2013 UL10 DCT 13.46 21.60 0.97 ± 0.02 0.67 ± 0.02 1.64 ± 0.03 Note. a Two different epochs. See Table 2 .

doesn't detect when a broker is not running

while running the following command root@adsvm05:/app# python run.py -s -f /proj/ads/abstracts/config/links/fulltext/all.links

the script didn't give any warning that messages were being thrown away: because of a wrong config, Celery was not connected to any existing broker (this is likely the case in our other pipelines too; I just noticed it now)

I think we should print warnings (and stop if there are too many errors, say 100)
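A minimal pre-flight check could fail fast when nothing is listening at the broker's address; this is an assumption about how to add such a check, not existing pipeline code:

```python
import socket
from urllib.parse import urlparse

# Refuse to start (or at least warn) if the broker port is unreachable.
# 5672 is RabbitMQ's default AMQP port.
def broker_reachable(broker_url, timeout=2.0):
    parsed = urlparse(broker_url)
    try:
        socket.create_connection((parsed.hostname, parsed.port or 5672),
                                 timeout=timeout).close()
        return True
    except OSError:
        return False

# e.g. broker_reachable('amqp://guest:guest@localhost:5672//')
```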

Java tests

Modify the tests so that they can be run as independent suites. More specifically, each of the following should be declarable as its own suite:

1. Unit tests
2. Integration tests

This is low priority.

Java workers

Have a way for the workers to sleep for a certain time if they cannot connect to the pipeline, or modify the timeout for supervisord. In any case, the workers should behave sensibly when RabbitMQ goes offline on ADSX.

Publish list of updated bibcodes to ADSimportpipeline

If a fulltext file has been updated (either meta.json or any of the text files), send the bibcode to a queue that causes it to be reprocessed by the ingest pipeline.
This will save us from generating the json fingerprint with fulltext file information.

Enhancements to the settings.yml file

It would be nicer to have settings.yml next to settings.py, and to allow for a local_settings.yml, to be consistent with the Python pipeline. However, this is a low-priority enhancement.

Invalid empty body resulting from <!-- body endbody --> syntax

This structure

<body>
    <!-- body
        <p>content</p>
    endbody -->
</body>

is used in AGU articles and results in empty body tags, because comments are removed before extraction (although comments would still be ignored during extraction anyway). See this google doc for more details and this article's XML file (1942TeMAE..47..251B) for an example. Based on a random sample, probably over 4,000 articles are affected by this issue.
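One possible recovery step, run before comments are stripped; the pattern below is a guess at the AGU "<!-- body ... endbody -->" convention shown above, not tested against the actual corpus:

```python
import re

snippet = '''<body>
    <!-- body
        <p>content</p>
    endbody -->
</body>'''

# Unwrap the commented-out body content so it survives comment removal.
unwrapped = re.sub(r'<!--\s*body(.*?)endbody\s*-->', r'\1',
                   snippet, flags=re.DOTALL)
assert '<p>content</p>' in unwrapped
assert '<!--' not in unwrapped
```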

Handle facility tags without xlink:href

Some articles have facilities but are missing the xlink:href attribute, which results in "None"; see 2016PASP..128e4201C for an example.

Facility tags are expected to have the following structure:

<named-content content-type="facility" xlink:href="facilityid">
facility name
</named-content>

however, it is now apparent that there are at least 48 instances where the xlink:href is missing. Per Sergi, we need to keep extracting xlink:href as we are doing now, but modify the code so that if xlink:href is empty or does not exist (see note below), we store nothing instead of "None" (as if the article did not have any facilities).

note: xpath returns None when xlink:href does not exist, but if it is empty (xlink:href="") it returns an empty string, so we need to make sure to handle both cases.
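Since both a missing attribute (None) and an empty xlink:href="" are falsy in Python, one filter handles both cases; the helper name is hypothetical:

```python
# Keep only real identifiers; skip both None (attribute absent) and
# '' (empty xlink:href=""), so nothing is stored instead of "None".
def collect_facilities(hrefs):
    return [h for h in hrefs if h]

assert collect_facilities(['HST', None, '']) == ['HST']
assert collect_facilities([None, '']) == []  # no facilities stored
```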

Document Similarity

Calculate the difference between the file associated with the original bibcode and the file associated with the new bibcode for all bibcode changes/updates.

MNRAS XML acronyms

MNRAS seems to be using markup to identify acronyms, which are otherwise entered in lowercase. In the visible text (HTML/PDF) you see "TOPCAT", but the XML says "topcat", so I guess TOPCAT is not being picked up as an acronym during indexing. See e.g.

/proj/ads/articles/fulltext/sources/MNRAS/0452/stv1276.xml

fulltext contains serialized json in the body

example:

[2017-11-15 00:54:32,498: ERROR/ForkPoolWorker-2] Error sending data to solr
url=http://adsvm02.cfa.harvard.edu:9983/solr/collection1/update
response=[{"read_count": 0, "doctype_facet_hier": ["0/Article", "1/Article/Proceedings Article"], "isbn": ["9789290922575"], "update_timestamp": "2017-11-14T07:43:53.448036Z", "first_author": "Altemose, George", "abstract": "This paper describes a new Battery Interface and Electronics (BIE) assembly used to monitor battery and cell voltages, as well as provide overvoltage (overcharge) protection for Lithium Ion batteries with up to 8-cells in series. The BIE performs accurate measurement of the individual cell voltages, the total battery voltage, and the individual cell temperatures. In addition, the BIE provides an independent over-charge protection (OCP) circuit that terminates the charging process by isolating the battery from the charging source in the event that the voltage of any cell exceeds a preset limit of +4.500V. The OCP circuit utilizes dual redundancy, and is immune to single-point failures in the sense that no single-point failure can cause the battery to become isolated inadvertently. A typical application of the BIE in a spacecraft electrical power subsystem is shown in Figure 1. The BIE circuits have been designed with Chip On Board (COB) technology. Using this technology, integrated circuit die, Field Effect Transistors (FETs) and diodes are mounted and wired directly on a multi-layer printed wiring board (PWB). For those applications where long term reliability can be achieved without hermeticity, COB technology provides many benefits such as size and weight reduction while lowering production costs. The BIE was designed, fabricated and tested to meet the specifications provided by Orbital Sciences Corporation (OSC) for use with Lithium-Ion batteries in the Commercial Orbital Transportation System (COTS). COTS will be used to deliver cargo to the International Space Station at low earth orbit (LEO). 
Aeroflex has completed the electrical and mechanical design of the BIE and fabricated and tested the Engineering Model (EM), as well as the Engineering Qualification Model (EQM). Flight units have also been fabricated, tested and delivered to OSC.", "citation": [], "links_data": ["{\"access\": \"open\", \"instances\": \"\", \"title\": \"\", \"type\": \"gif\", \"url\": \"http://articles.adsabs.harvard.edu/full/2011ESASP.690E..24A\"}", "{\"access\": \"open\", \"instances\": \"\", \"title\": \"\", \"type\": \"article\", \"url\": \"http://articles.adsabs.harvard.edu/full/2011ESASP.690E..24A?defaultprint=YES\"}"], "page": ["24"], "doctype": "inproceedings", "date": "2011-10-01T00:00:00.000000Z", "nedid": [], "id": "14588779", "bibstem": ["ESASP", "ESASP.690"], "simbid": [], "bibcode": "2011ESASP.690E..24A", "classic_factor": 0.0, "reference": [], "data_count": 0, "grant": [], "aff": ["Aeroflex Plainview Plainview, New York 11803", "Aeroflex Plainview Plainview, New York 11803"], "orcid_pub": ["-", "-"], "metrics_mtime": "2017-11-10T07:25:23.740661Z", "esources": ["ADS_PDF", "ADS_SCAN"], "entry_date": "2015-07-18T00:00:00.000000Z", "reader": [], "email": ["-", "-"], "body": "{\"body\": \"\\u0003\\n\\nOVERCHARGE PROTECTION AND CELL VOLTAGE MONITORING FOR\\nLITHIUM-ION BATTERIES\\n\\n\\u0003\\n\\u0003\\n\\n*HRUJH\\u0003$OWHPRVH\\u000f\\u00036HQLRU\\u00036WDII\\u0003(QJLQHHU\\n$EEDV\\u00036DOLP\\u000f\\u00036HQLRU\\u00036FLHQWLVW\\u0003\\nAeroflex Plainview\\nPlainview, New York 11803\\n$EVWUDFW\\n\\nEffect Transistors (FETs
) and diodes are\\nmounted and wired directly on a multi-layer\\nprinted wiring board (PWB). For those\\napplications where long term reliability can\\nbe achieved without hermeticity, COB\\ntechnology provides many benefits such as\\nsize and weight reduction while lowering\\nproduction costs.\\n\\nThis paper describes a new Battery\\nInterface and Electronics (BIE) assembly\\nused to monitor battery and cell voltages, as\\nwell as provide overvoltage (overcharge)\\nprotection for Lithium Ion batteries with up to\\n8-cells in series. The BIE performs accurate\\nmeasurement of the individual cell voltages,\\nthe total battery voltage, and the individual\\ncell temperatures.\\nIn addition, the BIE\\nprovides an independent over-charge\\nprotection (OCP) circuit that terminates the\\ncharging process by isolating the battery\\nfrom the charging source in the event that\\nthe voltage of any cell exceeds a preset limit\\nof +4.500V. The OCP circuit utilizes dual\\nredundancy, and is immune to single-point\\nfailures in the sense that no single-poin
t\\nfailure can cause the battery to become\\nisolated inadvertently. A typical application\\nof the BIE in a spacecraft electrical power\\nsubsystem is shown in Figure 1.\\n\\nThe BIE was designed, fabricated an
d\\ntested to meet the specifications provided by\\nOrbital Sciences Corporation (OSC) for use\\nwith Lithium-Ion batteries in the Commercial\\nOrbital Transportation System (COTS).\\nCOTS will be used to deliver
 cargo to the\\nInternational Space Station at low earth orbit\\n(LEO).\\nAeroflex has completed the electrical and\\nmechanical design of the BIE and fabricated\\nand tested the Engineering Model (EM), as\\nwell as the Engineering Qualification Model\\n(EQM). Flight units have also been\\nfabricated, tested and delivered to OSC.\\n\\nThe BIE circuits have been designed with\\nChip On Board (COB) technology. Using this\\ntechnology, integrated circuit die, Field\\n\\u0003\\n\\nPower\\nRegulation\\n& Control\\n\\nSolar\\nArray\\n\\nLoad\\n\\nBIE\\nInterface &\\nControl\\nElectronics\\n\\nBattery\\nIsolation\\nSwitch\\n\\nLithium-Ion\\nBattery\\nAssembly\\n\\nBalancing\\nElectronics\\nUnit\\n\\n\\u0003\\n\\n\\u0003\\n\\n)LJXUH\\u0003\\u0014\\u0003\\u00b1\\u00037\\\\SLFDO\\u0003%,(\\u0003$SSOLFDWLRQ\\u0003LQ\\u0003D\\u00036SDFHFUDIW\\u0003(OHFWULFDO\\u00033RZHU\\u00036\\\\VWHP\\u0003\\n_________________________________________________\\nProc. \\u20189th European Space Power Conference\\u2019, Saint Rapha\\u00ebl, France,\\n6\\u201310 June 2011 (ESA SP-690, October 2011)\\n\\u0003\\n\\n\\f\\u0003\\n,QWURGXFWLRQ\\n\\nLithium-ion batteries are now oft
....

Improve text extraction of large documents

Here is a list of records which have large fulltext contents. Due to limitations in Solr, we currently throw away anything beyond 32k bytes. Under these circumstances, it would be nice to be more sensible when generating fulltext, throwing away things which are not interesting (e.g. numeric tables) and keeping the text that we want.

['1998astro.ph..7308B', '/proj/ads/articles/fulltext/extracted/19/98/as/tr/o,/ph/,,/73/08/B/fulltext.txt', 105938]
['2003astro.ph..4480S', '/proj/ads/articles/fulltext/extracted/20/03/as/tr/o,/ph/,,/44/80/S/fulltext.txt', 62708]
['2004ADNDT..88...83L', '/proj/ads/articles/fulltext/extracted/20/04/AD/ND/T,/,8/8,/,,/83/L/fulltext.txt', 319850]
['2004JMoSp.228..593B', '/proj/ads/articles/fulltext/extracted/20/04/JM/oS/p,/22/8,/,5/93/B/fulltext.txt', 113945]
['2005ADNDT..89....1G', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,,/,1/G/fulltext.txt', 204757]
['2005ADNDT..89..101Z', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/01/Z/fulltext.txt', 126118]
['2005ADNDT..89..139E', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/39/E/fulltext.txt', 238924]
['2005ADNDT..89..195L', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/95/L/fulltext.txt', 280702]
['2005ADNDT..89..267G', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,2/67/G/fulltext.txt', 90409]
['2005ADNDT..90..177L', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,9/0,/,1/77/L/fulltext.txt', 313130]
['2005ADNDT..90..259Z', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,9/0,/,2/59/Z/fulltext.txt', 248951]
['2005NewA...10..325Z', '/proj/ads/articles/fulltext/extracted/20/05/Ne/wA/,,/,1/0,/,3/25/Z/fulltext.txt', 127388]
['2006ADNDT..92..105B', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,1/05/B/fulltext.txt', 269501]
['2006ADNDT..92..305L', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,3/05/L/fulltext.txt', 272532]
['2006ADNDT..92..481Z', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,4/81/Z/fulltext.txt', 550232]
['2006JMoSt.780..182L', '/proj/ads/articles/fulltext/extracted/20/06/JM/oS/t,/78/0,/,1/82/L/fulltext.txt', 68805]
['2006math......9485D', '/proj/ads/articles/fulltext/extracted/20/06/ma/th/,,/,,/,,/94/85/D/fulltext.txt', 127583]
['2007ADNDT..93....1L', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,,/,1/L/fulltext.txt', 207798]
['2007ADNDT..93..275B', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,2/75/B/fulltext.txt', 303136]
['2007ADNDT..93..615A', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,6/15/A/fulltext.txt', 468786]
['2007ADNDT..93..742B', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,7/42/B/fulltext.txt', 143947]
['2007ADNDT..93..864H', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,8/64/H/fulltext.txt', 169796]
['2007cmd..book....1C', '/proj/ads/articles/fulltext/extracted/20/07/cm/d,/,b/oo/k,/,,/,1/C/fulltext.txt', 52302]
['2008ADNDT..94....1L', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,,/,1/L/fulltext.txt', 128320]
['2008ADNDT..94..561D', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,5/61/D/fulltext.txt', 189230]
['2008ADNDT..94..807L', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,8/07/L/fulltext.txt', 410355]
['2009ADNDT..95....1S', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,,/,1/S/fulltext.txt', 285454]
['2009ADNDT..95..547L', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,5/47/L/fulltext.txt', 121969]
['2009ADNDT..95..607A', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,6/07/A/fulltext.txt', 920690]
['2009arXiv0904.2782S', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/90/4,/27/82/S/fulltext.txt', 352583]
['2009arXiv0910.1690A', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/91/0,/16/90/A/fulltext.txt', 79066]
['2009arXiv0910.5784S', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/91/0,/57/84/S/fulltext.txt', 918828]
['2010ADNDT..96....1T', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,,/,1/T/fulltext.txt', 175280]
['2010ADNDT..96..123A', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,1/23/A/fulltext.txt', 678572]
['2010ADNDT..96..481H', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,4/81/H/fulltext.txt', 219803]
['2010ADNDT..96..759S', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,7/59/S/fulltext.txt', 315417]
['2011ADNDT..97...50B', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,,/50/B/fulltext.txt', 308131]
['2011ADNDT..97..225A', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,2/25/A/fulltext.txt', 702513]
['2011ADNDT..97..587L', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,5/87/L/fulltext.txt', 309957]
['2011ESASP.690E..24A', '/proj/ads/articles/fulltext/extracted/20/11/ES/AS/P,/69/0E/,,/24/A/fulltext.txt', 80391]
['2011PhDT........92J', '/proj/ads/articles/fulltext/extracted/20/11/Ph/DT/,,/,,/,,/,,/92/J/fulltext.txt', 918724]
['2012ADNDT..98..149M', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,1/49/M/fulltext.txt', 185016]
['2012ADNDT..98..437D', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,4/37/D/fulltext.txt', 163161]
['2012ADNDT..98..779W', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,7/79/W/fulltext.txt', 104213]
['2013ADNDT..99..249T', '/proj/ads/articles/fulltext/extracted/20/13/AD/ND/T,/,9/9,/,2/49/T/fulltext.txt', 362192]
['2013ADNDT..99..459O', '/proj/ads/articles/fulltext/extracted/20/13/AD/ND/T,/,9/9,/,4/59/O/fulltext.txt', 141759]
['2013arXiv1308.5199C', '/proj/ads/articles/fulltext/extracted/20/13/ar/Xi/v1/30/8,/51/99/C/fulltext.txt', 579391]
['2013arXiv1312.4478L', '/proj/ads/articles/fulltext/extracted/20/13/ar/Xi/v1/31/2,/44/78/L/fulltext.txt', 443883]
['2014ADNDT.100..651M', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,6/51/M/fulltext.txt', 437029]
['2014ADNDT.100..802F', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,8/02/F/fulltext.txt', 132232]
['2014ADNDT.100..986T', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,9/86/T/fulltext.txt', 335214]
['2014ADNDT.100.1156T', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/11/56/T/fulltext.txt', 135543]
['2014ADNDT.100.1292F', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/12/92/F/fulltext.txt', 132797]
['2014ADNDT.100.1357X', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/13/57/X/fulltext.txt', 142772]
['2014ADNDT.100.1399A', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/13/99/A/fulltext.txt', 515880]
['2014ADNDT.100.1519L', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/15/19/L/fulltext.txt', 339602]
['2014ADNDT.100.1603A', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/16/03/A/fulltext.txt', 716084]
['2015ADNDT.101...41Z', '/proj/ads/articles/fulltext/extracted/20/15/AD/ND/T,/10/1,/,,/41/Z/fulltext.txt', 631607]
['2015arXiv150309147G', '/proj/ads/articles/fulltext/extracted/20/15/ar/Xi/v1/50/30/91/47/G/fulltext.txt', 61666]
['2016ADNDT.107..140A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/7,/,1/40/A/fulltext.txt', 295055]
['2016ADNDT.107..221A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/7,/,2/21/A/fulltext.txt', 554068]
['2016ADNDT.108...15W', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/8,/,,/15/W/fulltext.txt', 145479]
['2016ADNDT.111..280A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/11/1,/,2/80/A/fulltext.txt', 204495]
['2016JPhCS.758a2002D', '/proj/ads/articles/fulltext/extracted/20/16/JP/hC/S,/75/8a/20/02/D/fulltext.txt', 19531]
['2016JPhCS.761a2034C', '/proj/ads/articles/fulltext/extracted/20/16/JP/hC/S,/76/1a/20/34/C/fulltext.txt', 30861]
['2016JQSRT.168..102S', '/proj/ads/articles/fulltext/extracted/20/16/JQ/SR/T,/16/8,/,1/02/S/fulltext.txt', 121291]
['2016JVGR..311...79G', '/proj/ads/articles/fulltext/extracted/20/16/JV/GR/,,/31/1,/,,/79/G/fulltext.txt', 99296]
['2016NJPh...18j3050C', '/proj/ads/articles/fulltext/extracted/20/16/NJ/Ph/,,/,1/8j/30/50/C/fulltext.txt', 22097]
['2016Tectp.677....1L', '/proj/ads/articles/fulltext/extracted/20/16/Te/ct/p,/67/7,/,,/,1/L/fulltext.txt', 137039]

False positive OCR files for "NASA" queries

All of our OCR files carry text at the bottom indicating the scan was "provided by the NASA Astrophysics Data System". This produces false positives for queries like full:"NASA", which is especially apparent because NASA was not founded until 1958, yet we have ~41,000 results for this query before then. See (full:"NASA" AND year:1543-1957) for example.

There are ~2,000 files where the OCR garbles this string, for example "Provided by the NASA Astrophysicsflata System" (1957AnLun..14B...1R) or "ProvidelYby the" (1957MNSSA..16...44T). In these cases removing the string in the fulltext pipeline is more complicated.
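One way to attack the garbled cases is a deliberately tolerant pattern rather than an exact string match. The sketch below is a hypothetical pre-processing helper (the name and pattern are ours, not the pipeline's); it tolerates OCR noise after "Provide" and inside the "Astrophysics Data" run:

```python
import re

# Tolerant pattern for the ADS attribution footer: allows OCR noise such as
# "Astrophysicsflata System" or "ProvidelYby the". Hypothetical sketch, not
# the pipeline's actual code.
ADS_FOOTER = re.compile(
    r'provide\w*\s*by\s*the\s*NASA\s*Astrophysics\W*\w*ata\s*System',
    re.IGNORECASE,
)

def strip_ads_footer(text):
    """Remove the OCR'd ADS attribution line from extracted fulltext."""
    return ADS_FOOTER.sub('', text)
```

An exact-match pass could still handle the clean cases cheaply, with the fuzzy pattern reserved for the remainder.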

Fix broken tests on TravisCI

The tests are failing mainly because TravisCI now uses a newer version of RabbitMQ; the relevant command should be updated for the new release.

There are also additional errors such as:

ERROR: test_that_we_can_extract_acknowledgments (__main__.TestTEIXMLExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_unit/test.py", line 698, in test_that_we_can_extract_acknowledgments
    parsed_xml = self.extractor.parse_xml()
  File "/home/travis/build/adsabs/ADSfulltext/lib/StandardFileExtract.py", line 407, in parse_xml
    parsed_content = soupparser.fromstring(self.raw_xml)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 33, in fromstring
    return _parse(data, beautifulsoup, makeelement, **bsargs)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 79, in _parse
    root = _convert_tree(tree, makeelement)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 151, in _convert_tree
    converted = convert_node(e)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 212, in convert_node
    return handler(bs_node, parent)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 269, in convert_pi
    res = etree.ProcessingInstruction(*bs_node.split(' ', 1))
  File "src/lxml/lxml.etree.pyx", line 3039, in lxml.etree.ProcessingInstruction (src/lxml/lxml.etree.c:75920)
ValueError: Invalid PI name 'xml'
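The ValueError comes from soupparser handing the document's <?xml ...?> declaration to etree.ProcessingInstruction, which rejects the reserved PI name "xml". A minimal workaround, assuming the declaration only ever appears at the top of the file, is to strip it before parsing (hypothetical helper, not the pipeline's current code):

```python
import re

# "<?xml ...?>" is a reserved processing-instruction name that lxml's
# soupparser route rejects; dropping it before parsing avoids the ValueError.
XML_DECL = re.compile(r'^\s*<\?xml[^>]*\?>', re.IGNORECASE)

def strip_xml_declaration(raw_xml):
    """Remove a leading <?xml ...?> declaration so soup-based parsing accepts it."""
    return XML_DECL.sub('', raw_xml, count=1)
```

parse_xml could then call soupparser.fromstring(strip_xml_declaration(self.raw_xml)); adjusting the pinned lxml/BeautifulSoup versions may also change the behavior.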

Some parsers are only extracting the acknowledgements from XML files

The definition of success was changed in PR #106: extraction now succeeds if any part of the XML (e.g. acknowledgments, facilities) is extracted, rather than only if the body is extracted. As a result, 1,909 files are throwing errors because checker.check_if_extract(message, app.conf['FULLTEXT_EXTRACT_PATH']) cannot find the corresponding fulltext.txt files; only the acknowledgments were extracted for these files.

Not all of the XML-based parsers are able to extract both the body and the acknowledgments.

Parsers that are only extracting the acknowledgments:
html5lib, lxml-html, direct-lxml-html

Parsers that are extracting the body and acknowledgements:
html.parser, lxml-xml, direct-lxml-xml

The body not being extracted for certain parsers is due to attributes being listed in the body tag, for example:
<body xml:id="asna201913710-body-0001" sectionsNumbered="yes">

This can be resolved with another regex or by reordering the parsers. A complication of the regex approach is that in some cases we use these attributes to identify the body and acknowledgments, so we will need to be careful not to interfere with that.
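As a sketch of the regex option: an opening body tag can be matched whether or not it carries attributes, while leaving the attribute values untouched for any later checks that need them. The helper below is hypothetical and deliberately simple; it ignores nesting and namespaced tags such as <ja:body>:

```python
import re

# Match an opening <body> tag with or without attributes, e.g.
# <body xml:id="asna201913710-body-0001" sectionsNumbered="yes">.
# Hypothetical sketch, not the pipeline's actual extraction code.
BODY_OPEN = re.compile(r'<body(?:\s[^>]*)?>', re.IGNORECASE)

def extract_body(raw_xml):
    """Return the raw content between the first <body ...> and </body>, or None."""
    match = BODY_OPEN.search(raw_xml)
    if match is None:
        return None
    end = raw_xml.find('</body>', match.end())
    if end == -1:
        return None
    return raw_xml[match.end():end]
```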

This problem was also found in the XML Parser Analysis.

Incorrect parsing of A&A XML

At least for this record: 2017A&A...597A..55S, the fulltext parser is skipping over parts of the text:

<p>One spectrum of HSS 348 was taken with the <inline-formula specific-use="simple-math">2 × 8.4</inline-formula> m Large Binocular Telescope (LBT) during commissioning of the PEPSI spectrograph (Potsdam Echelle Polarimetric and Spectroscopic Instrument; Strassmeier et al. <xref id="InR66"/><xref ref-type="bibr" rid="R66">2015b</xref>).

generates the following parsed text:

One spectrum of HSS 348 was taken with the 2015b).
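Whatever the root cause in this particular parser, the symptom matches how inline children split a paragraph's text: the content following elements like <inline-formula> and <xref> lives in each child's .tail, so reading only the parent's .text (or only element text, without tails) drops it. A simplified stdlib illustration, with the markup and formula abbreviated from the record above:

```python
import xml.etree.ElementTree as ET

# Inline children split the paragraph into .text plus per-child .tail
# fragments; reading only .text stops at the first inline element.
snippet = ('<p>One spectrum of HSS 348 was taken with the '
           '<inline-formula>2 x 8.4</inline-formula> m Large Binocular '
           'Telescope (LBT) (Strassmeier et al. <xref>2015b</xref>).</p>')

para = ET.fromstring(snippet)
truncated = para.text            # stops at the first inline element
full = ''.join(para.itertext())  # walks the text and tail of every node
```

Walking itertext() over the matched element, as the XML extractor's fulltext xpath handling does, recovers the full sentence.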

Fix data set to extract all

Currently, only one data set is extracted per record.

adsabs/ADSExports should be updated to expect lists.

html5lib parser failing due to less than symbol in XML files

243 errors such as "Invalid HTML tag name", "Invalid tag name", "Empty tag name", and "Invalid namespace URI" occur because "<" symbols are mistaken by the html5lib parser for the beginning of an XML tag.

Part of the problem is due to LaTeX formulas that slip past the regular expression we use to remove them. See 2015GeoJI.200.1466Z for example.

A number of these errors are harder to resolve because they originate from images. See this Google doc for more details.
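One mitigation, sketched here as a hypothetical pre-processing step, is to escape any "<" that cannot start a well-formed tag before handing the document to html5lib:

```python
import re

# A "<" not followed by a tag-name start character, "/", "!" or "?" cannot
# begin a tag; escaping it keeps html5lib from treating formula debris like
# "p < 0.05" as a malformed element. Hypothetical sketch.
STRAY_LT = re.compile(r'<(?![A-Za-z/!?])')

def escape_stray_lt(text):
    return STRAY_LT.sub('&lt;', text)
```

This leaves genuine opening and closing tags untouched, though an inequality written flush against a letter (e.g. "a<b") would still be misread.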

Fulltext extraction problem?

This query returns 0 results for some reason. Can you please figure out where things break down?

bibcode:2015ApJ...801..127V body:DAOPHOT

Fix missing appendix text

A user reports they cannot search for text in the appendix portion of 2012A&A...539A..88C. For at least some types of XML files we are not extracting text from appendices and appending it to the body field. For another example see 2012A&A...539A.119F.

For at least some pdf files, we do extract text from appendices and append it to the body field.
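A plausible fix is to extend the fulltext Xpath list in META_CONTENT with appendix locations. The element names below are assumptions about the publishers' schemas (JATS-style back matter plus a namespace-agnostic fallback) and would need to be verified against the failing records:

```python
# Candidate Xpaths for appendix content, to be appended to the existing
# META_CONTENT fulltext Xpath list. Element names are unverified assumptions.
APPENDIX_XPATHS = [
    '//back/app-group',              # JATS-style back matter container
    '//back/app',
    '//appendix',
    '//*[local-name()="appendix"]',  # namespace-agnostic fallback
]
```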
