
adsfulltext's Introduction


ADSfulltext

Article full text extraction pipeline. Set of workers that check the filesystem and convert binary files into text.

What does the pipeline extract?
  • body
    • includes table and figure captions, appendixes and supplements
  • acknowledgements
  • any dataset(s)
  • any facilities

We do not include the list of references because those are processed separately by the reference resolver to generate citations.

Where does this data go?

All of these fields are sent to ADSMasterPipeline, and all fields except the dataset are sent to Solr. Each bibcode has a fulltext.txt and meta.json file in the live folder in a directory constructed from the actual bibcode.
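The per-bibcode directory appears to be derived from the bibcode itself: the extracted-file paths quoted in the issues below (e.g. /proj/ads/articles/fulltext/extracted/20/12/MN/RA/S,/41/9,/30/18/C/) suggest that non-alphanumeric characters are replaced by commas and the result is split into two-character chunks. A minimal sketch of that assumed rule (the base path and function name are illustrative, not the pipeline's actual code):

import os
import re

def bibcode_to_path(bibcode, base='/path/to/live'):
    # Assumed rule: non-alphanumeric characters become commas and the bibcode
    # is split into two-character chunks, e.g.
    # 2012MNRAS.419.3018C -> 20/12/MN/RA/S,/41/9,/30/18/C
    cleaned = re.sub(r'[^A-Za-z0-9]', ',', bibcode)
    chunks = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return os.path.join(base, *chunks)

# bibcode_to_path('2012MNRAS.419.3018C')
# -> '/path/to/live/20/12/MN/RA/S,/41/9,/30/18/C'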

Facilities and datasets are not in use by any other pipeline.

A note on PDFs:

When the fulltext is extracted from PDFs, we don't necessarily have the different portions of the article properly fielded, and we end up throwing everything that comes out into the body field. This will include things such as the title, author list, abstract, keywords, bibliography, etc. This is a bug and not a feature, of course; if we could use GROBID to properly segment the source document, we would only pick the relevant pieces.

Dependencies

In GNU/Linux (Debian based) we need the following packages in order to be able to compile the lxml package specified in the requirements:

apt install libxml2-dev libxslt1-dev

Purpose

To extract text from the source files of our articles, which may be in any of the following formats:

  • PDF
  • XML (often malformed)
  • OCR
  • HTML
  • TXT

The extracted text then gets sent to SOLR to be indexed.

Text Extraction

The main file for this pipeline is extraction.py. This file takes a message (in the form of a dictionary) containing the bibcode, directory of the source file to be extracted, file format, and provider/publisher. This information determines which parser will be used to extract the contents, for example this message:

m = {
  'bibcode': '2019arXiv190105463B',
  'ft_source': '/some/directory/file.xml',
  'file_format': 'xml',
  'provider': 'Elsevier'
}

would use the Elsevier XML parser because XML files from that publisher need to be extracted using different methods than regular XML files. In this case Elsevier requires the use of lxml.html.document_fromstring() instead of lxml.html.soupparser.fromstring().
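A minimal sketch of the kind of provider/format dispatch this implies; the function name and routing below are illustrative, the real logic lives in extraction.py:

import lxml.html
import lxml.html.soupparser

def parse_source(message, raw_content):
    # Illustrative routing only; the real dispatch lives in extraction.py.
    if message['file_format'] == 'xml' and message['provider'] == 'Elsevier':
        # Elsevier XML needs lxml's stricter document parser
        return lxml.html.document_fromstring(raw_content)
    elif message['file_format'] == 'xml':
        # other XML goes through the lenient BeautifulSoup-backed parser
        return lxml.html.soupparser.fromstring(raw_content)
    # PDF, HTML, OCR and TXT are handled by their own extractors
    raise ValueError('unsupported format: %s' % message['file_format'])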

XML Files

We utilize the lxml.html.soupparser library to extract content from our XML files; it is an lxml interface to the BeautifulSoup HTML parser. By default, when using BeautifulSoup3 (the version we currently use), this library uses the lxml.html parser. This parser is fast but, more importantly, lenient enough for our data, since a lot of our XML files are not valid XML. You can find a breakdown of the different types of parsers here.

Functions:

  • open_xml()
    • This function opens/reads an XML file and stores its content as a string. To avoid losing data we decode this string using the encoding detected by UnicodeDammit. It is important to do this before the next step: before decoding, our string is a byte string, and the next step inserts unicode - mixing the two will cause nothing but problems. The next step converts HTML entities into unicode, for example &Aring; -> Å. We do this even though soupparser has HTML entity conversion capabilities, because our list is much more exhaustive and there is currently no built-in way to pass a customized HTML entity map/dictionary as a parameter to this parser. Our dictionary of HTML entities can be found in entitydefs.py.
  • parse_xml()
    • Here we pass the string returned by open_xml() to soupparser's fromstring() function. We then remove some tags to get rid of potential garbage/nonsense strings using the xpath function which lxml makes available to us.
  • extract_string()
    • Here we use the xpath function to get all matches for a specific tag, and return the text of the first one found.
  • extract_list()
    • This function is similar to extract_string() but returns a list and is only used for datasets.
  • extract_multi_content()
    • This is essentially the main function for this class. It loops through the xpaths found in rules.py and collects the content for each one using extract_string() and extract_list(). It returns a dictionary containing fulltext, acknowledgments, and optionally dataset(s). A simplified sketch of this flow follows the list.
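A simplified sketch of the flow described above, assuming placeholder xpath rules (the real ones live in rules.py) and condensing extract_string() and extract_multi_content() into minimal forms:

from lxml.html import soupparser

# Placeholder rules; the real xpaths are defined in rules.py.
EXAMPLE_RULES = {
    'fulltext': ['//body'],
    'acknowledgements': ['//ack'],
}

def extract_string(parsed_xml, xpaths):
    # return the text of the first matching element, as described above
    for xpath in xpaths:
        matches = parsed_xml.xpath(xpath)
        if matches:
            return " ".join(t.strip() for t in matches[0].itertext())
    return ""

def extract_multi_content(raw_xml):
    parsed_xml = soupparser.fromstring(raw_xml)
    return {field: extract_string(parsed_xml, xpaths)
            for field, xpaths in EXAMPLE_RULES.items()}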

In the past we have used regular expressions, string.replace() and re.sub() to fix issues that should really be fixed inside the parser. For example, parsers may try to wrap our XML files with html and body tags to reconcile the invalid/broken HTML. This is normal behavior for a lenient parser, but in our case it results in the content of the entire file being returned as the fulltext instead of just the content inside the body. We could rename the body tag before parsing and extract the string from that renamed tag instead, but this is more of a workaround than a solution. Sometimes a workaround is the only way, since editing the lxml/BeautifulSoup code itself can cause a lot of complications down the line, but if it can be avoided I highly recommend not using regular expressions and string replacements to fix these types of issues. I defer to this humorous stackoverflow answer to deter you.

Eventually we will need to upgrade to BeautifulSoup4 as python3 is not compatible with BeautifulSoup3.

PDF Files

Our PDF extractor is mainly built around the pdfminer tool pdf2txt. We are exploring other options such as GROBID to see if we can improve the performance of this extractor, but as of right now pdf2txt is our best option. The main downfall of pdf2txt is that it does not allow for easy handling of elements like figures, formulas and tables, which are known to produce useless strings and garbage. Some documentation on the journey to improve other parsers to outperform pdf2txt can be found here.
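A minimal sketch of shelling out to the pdf2txt script, which is roughly what the extractor relies on; the exact command-line options used by the pipeline are not shown here:

import subprocess

def extract_pdf_text(pdf_path):
    # pdf2txt.py ships with pdfminer and writes the extracted text to stdout
    return subprocess.check_output(['pdf2txt.py', pdf_path])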

Development

For development/debugging purposes, it can be useful to run the whole pipeline in synchronous mode on our local machine. This can be achieved by copying config.py to local_config.py and enabling the following lines:

### Testing:
# When 'True', it converts all the asynchronous calls into synchronous,
# thus no need for rabbitmq, it does not forward to master
# and it allows debuggers to run if needed:
CELERY_ALWAYS_EAGER = True
CELERY_EAGER_PROPAGATES_EXCEPTIONS = True

When these two variables are set to True, we can run the pipeline (via run.py) without running workers (so no RabbitMQ is needed) or the master pipeline (no messages are forwarded outside this pipeline). This makes debugging easier (e.g., import pudb; pudb.set_trace()), and we can explore the output in the live/ directory or the logs in the logs/ directory.

Time-Capsule

If you stop here, oh tired traveller, please don't judge us too harshly, mere mortals. We tried to simplify the chaos we didn't create. Blame the universe for its affinity for chaos.


adsfulltext's Issues

Multi-file parse

It appears fulltext generates an error when the /proj/ads/abstracts/config/links/fulltext/all.links contains multiple files, for example, a single line from all.links:

2003A&A...402..531C /proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/aah3724.right.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/tableE.1.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table2.html,/proj/ads/fulltext/sources/A+A/backdata/2003/17/aah3724/table1.html EDP

long text

sizes over 32766 chars cause solr to reject the document

incidentally, I'm seeing that the extraction is full of \u00a0 in place of spaces; perhaps that could be treated

example: 2009arXiv0906.5028K

this\\u00a0research\\u00a0project\\u00a0was\\u00a0conducted\\u00a0at\\u00a0St.\\u00a0Louis\\u00a0University\\u00a0School\\u00a0of\\u00a0Medicine.\\u00a0\\nTK\\u00a0received\\u00a0salary\\u00a0support\\u00a0from\\u00a0Sankyo\\u00a0Co,\\u00a0Ltd

Some parsers are only extracting the acknowledgements from XML files

The definition of success was changed in PR #106 to be: if any part of the XML is extracted (e.g. acknowledgments, facilities, etc.) then we have succeeded, rather than only if we have extracted the body. As a result, there are 1,909 files throwing errors because checker.check_if_extract(message, app.conf['FULLTEXT_EXTRACT_PATH']) is not able to find the corresponding fulltext.txt files, since only the acknowledgments were extracted for these files.

Not all of the XML-based parsers are able to extract both the body and the acknowledgments.

Parsers that are only extracting the acknowledgments:
html5lib, lxml-html, direct-lxml-html

Parsers that are extracting the body and acknowledgements:
html.parser, lxml-xml, direct-lxml-xml

The body not being extracted for certain parsers is due to attributes being listed in the body tag, for example:
<body xml:id="asna201913710-body-0001" sectionsNumbered="yes">

This can be resolved with another regex, or by reordering the parsers (a sketch of such a regex is shown below). A complication of using a regex is that we use the attributes to identify the body and acknowledgments in some cases, so we will need to be careful not to interfere with that.
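A hypothetical regex of the kind mentioned above, which strips the attributes from the opening body tag; as noted, a real fix would have to avoid breaking the cases where those attributes are used to identify the body and acknowledgments:

import re

def normalise_body_tag(raw_xml):
    # Hypothetical: strip attributes from the first opening <body ...> tag
    return re.sub(r'<body\b[^>]*>', '<body>', raw_xml, count=1)

# normalise_body_tag('<body xml:id="asna201913710-body-0001" sectionsNumbered="yes">...')
# -> '<body>...'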

This problem was also found in the XML Parser Analysis.

doesn't detect situation when a broker is not running

while running the following command: root@adsvm05:/app# python run.py -s -f /proj/ads/abstracts/config/links/fulltext/all.links

the script didn't give any warning that messages were being thrown away - because of a wrong config, celery was not connected to any existing broker (this is likely the case in our other pipelines, I just noticed it now)

I think we should print warnings (and stop if there are too many errors; say 100)

PDFBox TrueTypeFont exceptions

A number of exceptions appear in the PDF extraction log file which indicate a bug in reading true font files. The issue is discussed here: https://issues.apache.org/jira/browse/PDFBOX-2428

We should try to see if a PDFBox upgrade fixes the problem for us.

2016-12-01 12:03:49 ERROR TrueTypeFont:286 - An error occured when reading table hmtx
java.io.EOFException
        at org.apache.fontbox.ttf.MemoryTTFDataStream.readSignedShort(MemoryTTFDataStream.java:139)
        at org.apache.fontbox.ttf.HorizontalMetricsTable.initData(HorizontalMetricsTable.java:62)
        at org.apache.fontbox.ttf.TrueTypeFont.initializeTable(TrueTypeFont.java:280)
        at org.apache.fontbox.ttf.TrueTypeFont.getHorizontalMetrics(TrueTypeFont.java:204)
        at org.apache.fontbox.ttf.TrueTypeFont.getAdvanceWidth(TrueTypeFont.java:346)
        at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.getFontWidth(PDTrueTypeFont.java:677)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getFontWidth(PDSimpleFont.java:231)
        at org.apache.pdfbox.pdmodel.font.PDSimpleFont.getSpaceWidth(PDSimpleFont.java:533)
        at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
        at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
        at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:557)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
        at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
        at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
        at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
        at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:385)
        at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:344)
        at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
        at org.adslabs.adsfulltext.PDFExtract.extract(PDFExtract.java:112)
        at org.adslabs.adsfulltext.PDFExtractList.f(PDFExtractList.java:53)
        at org.adslabs.adsfulltext.Worker.process(Worker.java:144)
        at org.adslabs.adsfulltext.Worker.subscribe(Worker.java:187)
        at org.adslabs.adsfulltext.Worker.run(Worker.java:279)
        at org.adslabs.adsfulltext.App.main(App.java:82)

Handle facility tags without xlink:href

Some articles have facilities but are missing the xlink:href field, which results in "None"; see 2016PASP..128e4201C for example.

Facility tags are expected to have the following structure:

<named-content content-type="facility" xlink:href="facilityid">
facility name
</named-content>

however, it is now apparent that there are at least 48 instances where the xlink:href is missing. Per Sergi, we need to keep extracting xlink:href as we are doing now but modify it so that if xlink:href is empty or does not exist (see note below), then we store nothing instead of None (as if the article did not have any facilities).

note: xpath returns None when xlink:href does not exist, but if it is empty xlink:href="" it will return an empty string, so we need to make sure to handle both cases.
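An illustrative helper covering both cases (the function name is hypothetical, not the pipeline's actual code):

def collect_facilities(hrefs):
    # hrefs may contain None (xlink:href absent) or '' (xlink:href empty);
    # skip both so that nothing is stored, as if the article had no facilities
    return [href.strip() for href in hrefs if href and href.strip()]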

Java tests

Modify so that they can be run as independent suites. More specifically, that each suite of the following can be declared:

1. Unit tests
2. Integration tests

This is low priority.

Publish list of updated bibcodes to ADSimportpipeline

If a fulltext file has been updated (either meta.json or any of the text files), send the bibcode to a queue that causes it to be reprocessed by the ingest pipeline.
This will save us from generating the json fingerprint with fulltext file information.

fulltext contains serialized json in the body

example:

[2017-11-15 00:54:32,498: ERROR/ForkPoolWorker-2] Error sending data to solr
url=http://adsvm02.cfa.harvard.edu:9983/solr/collection1/update
response=[{"read_count": 0, "doctype_facet_hier": ["0/Article", "1/Article/Proceedings Article"], "isbn": ["9789290922575"], "update_timestamp": "2017-11-14T07:43:53.448036Z", "first_author": "Altemose, George", "abstract": "This paper describes a new Battery Interface and Electronics (BIE) assembly used to monitor battery and cell voltages, as well as provide overvoltage (overcharge) protection for Lithium Ion batteries with up to 8-cells in series. The BIE performs accurate measurement of the individual cell voltages, the total battery voltage, and the individual cell temperatures. In addition, the BIE provides an independent over-charge protection (OCP) circuit that terminates the charging process by isolating the battery from the charging source in the event that the voltage of any cell exceeds a preset limit of +4.500V. The OCP circuit utilizes dual redundancy, and is immune to single-point failures in the sense that no single-point failure can cause the battery to become isolated inadvertently. A typical application of the BIE in a spacecraft electrical power subsystem is shown in Figure 1. The BIE circuits have been designed with Chip On Board (COB) technology. Using this technology, integrated circuit die, Field Effect Transistors (FETs) and diodes are mounted and wired directly on a multi-layer printed wiring board (PWB). For those applications where long term reliability can be achieved without hermeticity, COB technology provides many benefits such as size and weight reduction while lowering production costs. The BIE was designed, fabricated and tested to meet the specifications provided by Orbital Sciences Corporation (OSC) for use with Lithium-Ion batteries in the Commercial Orbital Transportation System (COTS). COTS will be used to deliver cargo to the International Space Station at low earth orbit (LEO). Aeroflex has completed the electrical and mechanical design of the BIE and fabricated and tested the Engineering Model (EM), as well as the Engineering Qualification Model (EQM). Flight units have also been fabricated, tested and delivered to OSC.", "citation": [], "links_data": ["{\"access\": \"open\", \"instances\": \"\", \"title\": \"\", \"type\": \"gif\", \"url\": \"http://articles.adsabs.harvard.edu/full/2011ESASP.690E..24A\"}", "{\"access\": \"open\", \"instances\": \"\", \"title\": \"\", \"type\": \"article\", \"url\": \"http://articles.adsabs.harvard.edu/full/2011ESASP.690E..24A?defaultprint=YES\"}"], "page": ["24"], "doctype": "inproceedings", "date": "2011-10-01T00:00:00.000000Z", "nedid": [], "id": "14588779", "bibstem": ["ESASP", "ESASP.690"], "simbid": [], "bibcode": "2011ESASP.690E..24A", "classic_factor": 0.0, "reference": [], "data_count": 0, "grant": [], "aff": ["Aeroflex Plainview Plainview, New York 11803", "Aeroflex Plainview Plainview, New York 11803"], "orcid_pub": ["-", "-"], "metrics_mtime": "2017-11-10T07:25:23.740661Z", "esources": ["ADS_PDF", "ADS_SCAN"], "entry_date": "2015-07-18T00:00:00.000000Z", "reader": [], "email": ["-", "-"], "body": "{\"body\": \"\\u0003\\n\\nOVERCHARGE PROTECTION AND CELL VOLTAGE MONITORING FOR\\nLITHIUM-ION BATTERIES\\n\\n\\u0003\\n\\u0003\\n\\n*HRUJH\\u0003$OWHPRVH\\u000f\\u00036HQLRU\\u00036WDII\\u0003(QJLQHHU\\n$EEDV\\u00036DOLP\\u000f\\u00036HQLRU\\u00036FLHQWLVW\\u0003\\nAeroflex Plainview\\nPlainview, New York 11803\\n$EVWUDFW\\n\\nEffect Transistors (FETs
) and diodes are\\nmounted and wired directly on a multi-layer\\nprinted wiring board (PWB). For those\\napplications where long term reliability can\\nbe achieved without hermeticity, COB\\ntechnology provides many benefits such as\\nsize and weight reduction while lowering\\nproduction costs.\\n\\nThis paper describes a new Battery\\nInterface and Electronics (BIE) assembly\\nused to monitor battery and cell voltages, as\\nwell as provide overvoltage (overcharge)\\nprotection for Lithium Ion batteries with up to\\n8-cells in series. The BIE performs accurate\\nmeasurement of the individual cell voltages,\\nthe total battery voltage, and the individual\\ncell temperatures.\\nIn addition, the BIE\\nprovides an independent over-charge\\nprotection (OCP) circuit that terminates the\\ncharging process by isolating the battery\\nfrom the charging source in the event that\\nthe voltage of any cell exceeds a preset limit\\nof +4.500V. The OCP circuit utilizes dual\\nredundancy, and is immune to single-point\\nfailures in the sense that no single-poin
t\\nfailure can cause the battery to become\\nisolated inadvertently. A typical application\\nof the BIE in a spacecraft electrical power\\nsubsystem is shown in Figure 1.\\n\\nThe BIE was designed, fabricated an
d\\ntested to meet the specifications provided by\\nOrbital Sciences Corporation (OSC) for use\\nwith Lithium-Ion batteries in the Commercial\\nOrbital Transportation System (COTS).\\nCOTS will be used to deliver
 cargo to the\\nInternational Space Station at low earth orbit\\n(LEO).\\nAeroflex has completed the electrical and\\nmechanical design of the BIE and fabricated\\nand tested the Engineering Model (EM), as\\nwell as the Engineering Qualification Model\\n(EQM). Flight units have also been\\nfabricated, tested and delivered to OSC.\\n\\nThe BIE circuits have been designed with\\nChip On Board (COB) technology. Using this\\ntechnology, integrated circuit die, Field\\n\\u0003\\n\\nPower\\nRegulation\\n& Control\\n\\nSolar\\nArray\\n\\nLoad\\n\\nBIE\\nInterface &\\nControl\\nElectronics\\n\\nBattery\\nIsolation\\nSwitch\\n\\nLithium-Ion\\nBattery\\nAssembly\\n\\nBalancing\\nElectronics\\nUnit\\n\\n\\u0003\\n\\n\\u0003\\n\\n)LJXUH\\u0003\\u0014\\u0003\\u00b1\\u00037\\\\SLFDO\\u0003%,(\\u0003$SSOLFDWLRQ\\u0003LQ\\u0003D\\u00036SDFHFUDIW\\u0003(OHFWULFDO\\u00033RZHU\\u00036\\\\VWHP\\u0003\\n_________________________________________________\\nProc. \\u20189th European Space Power Conference\\u2019, Saint Rapha\\u00ebl, France,\\n6\\u201310 June 2011 (ESA SP-690, October 2011)\\n\\u0003\\n\\n\\f\\u0003\\n,QWURGXFWLRQ\\n\\nLithium-ion batteries are now oft
....

Facilities not properly tokenized

I'm finding that some SOLR records have fields for facilities but those contents are not properly fielded, for example:

        "facility":["PROBA2 ; SDO (AIA",
          "HMI); SOHO ; STEREO (SECCHI",
          "IMPACT",
          "PLASTIC",
          "WAVES); Wind (MFI",
          "SWE",
          "WAVES); NDA; NRH."],
        "bibcode":"2019ApJ...878...37P"

XML articles with valid empty body should end up with an empty body field in solr

When an XML document is well formatted and it does not have content within its body tag, the expected result for the pipeline is to have an empty fulltext/body field.

With our current implementation, we would end up with the full xml (without the tags, thus the title, authors, affiliations...) in the body solr field, even though the xml document was well formatted and had an empty body.

We need to:

  • Get from the logs how many documents show the Parsing XML in non-standard way log message
  • From those, check how many really have empty bodies (inspect them all, or a random sample if the set is too large)

If the majority are well formatted XML with body tags that do not have any useful content, consider changing our current implementation:

for parser_name in preferred_parser_names:
    parsed_xml = self._parse_xml(parser_name)
    logger.debug("Checking if the parser '{}' succeeded".format(parser_name))
    success = False
    for xpath in META_CONTENT[self.meta_name].get('fulltext', {}).get('xpath', []):
        fulltext = None
        fulltext_elements = parsed_xml.xpath(xpath)
        if len(fulltext_elements) > 0:
            fulltext = u" ".join(map(unicode.strip, map(unicode, fulltext_elements[0].itertext())))
            fulltext = TextCleaner(text=fulltext).run(decode=False, translate=True, normalise=True, trim=True)
        if not fulltext:
            continue
        else:
            logger.debug("The parser '{}' succeeded".format(parser_name))
            success = True
            break
    if not success:
        logger.debug("The parser '{}' failed".format(parser_name))
    else:
        break
if not success:
    logger.warn('Parsing XML in non-standard way')
    parsed_xml = lxml.html.document_fromstring(self.raw_xml.encode('utf-8'))

And instead iterate through all the parsers (as we do now), but if all fail, let the body be empty rather than using that "non-standard way". Also, the success check could be expanded to look not only at the body but also at the acknowledgements or the facilities, so that if any of these fields is non-empty we can probably consider that the parser succeeded. A sketch of this proposal is shown below.
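A sketch of that proposal, with illustrative helper names rather than the pipeline's actual structure:

def parser_succeeded(extracted):
    # success if any of these fields has content, not just the body
    return any(extracted.get(field)
               for field in ('fulltext', 'acknowledgements', 'facility'))

def extract_with_parsers(preferred_parser_names, parse_and_extract):
    # parse_and_extract is a stand-in for the per-parser extraction step
    for parser_name in preferred_parser_names:
        extracted = parse_and_extract(parser_name)
        if parser_succeeded(extracted):
            return extracted
    # every parser failed: return an empty body instead of falling back to
    # the "non-standard" whole-document extraction
    return {'fulltext': ''}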

Update parser for Nature XML

The DTD used in Nature articles does not match any of the extraction rules currently used to find an article body.

Enhancements to the settings.yml file

It would be nicer to have the settings.yml next to the settings.py, and allow for a local_settings.yml to be consistent with the python pipeline. However, this is a low priority enhancement.

Java workers

Have a way for the workers to sleep for a certain time if they cannot connect to the pipeline, or modify the timeout for supervisord. Either way, it should behave sensibly when RabbitMQ goes offline on ADSX.

False positive OCR files for "NASA" queries

All of our OCR files have text at the bottom of the file that indicates it was "provided by the NASA Astrophysics Data System". This is resulting in false positives for queries like full:"NASA", and is especially apparent because NASA was not founded until 1958 yet we have ~41,000 results for this query before 1957. See (( full:"NASA") AND year:1543-1957) for example.

There are about ~2,000 cases where the OCR produced undesired results and this string is rendered incorrectly, for example, "Provided by the NASA Astrophysicsflata System" (1957AnLun..14B...1R) or "ProvidelYby the" (1957MNSSA..16...44T). In these cases removing the string from the fulltext pipeline would be more complicated.
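For the well-behaved cases, the cleanup could be as simple as the sketch below (illustrative only; it would not catch the garbled variants mentioned above):

import re

OCR_FOOTER = re.compile(r'provided by the NASA Astrophysics Data System',
                        re.IGNORECASE)

def strip_ocr_footer(fulltext):
    # removes only the exact footer string, not its OCR-garbled variants
    return OCR_FOOTER.sub('', fulltext)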

Force override

Force the pipeline to do full text extraction regardless of the CheckIfExtract worker.

Extract facilities from AAS XML

For ApJ, ApJS, ApJL, AJ, we should look for and extract the following tags:

<named-content content-type="facility" xlink:href="foo"> 

These should then be used to populate the "facility" field and facets.

Improve namespace handling in XML extraction

Currently the Xpath expressions for extracting fulltext do not take any namespaces into account, so they fail to extract the body in an XML document that uses a namespaced element such as <ja:body>.

One way to fix this is to come up with a list of namespaced Xpaths as we find them, e.g. for the case above //{http://www.elsevier.com/xml/ja/schema}body, or to attempt a wildcard match such as //*[local-name()="body"]. A sketch of the wildcard approach is below.
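A minimal sketch of the wildcard approach using lxml (illustrative; the real xpaths are configured in rules.py):

from lxml import etree

def extract_body_text(raw_xml_bytes):
    tree = etree.fromstring(raw_xml_bytes)
    # match a body element in any namespace, e.g. <body> or <ja:body>
    bodies = tree.xpath('//*[local-name()="body"]')
    if bodies:
        return " ".join(t.strip() for t in bodies[0].itertext())
    return ""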

Incorrect parsing of A&A XML

At least for this record: 2017A&A...597A..55S, the fulltext parser is skipping over parts of the text:

<p>One spectrum of HSS<E2><80><89>348 was taken with the <inline-formula specific-use="simple-math">2  <C3>
<97>  8.4</inline-formula> m Large Binocular Telescope (LBT) during commissioning of the PEPSI spectrograph (Potsdam Echelle Polarimetric and Spectroscopic Instrument; Strassmeier et al. <xref id="InR66"/><xref ref-type="bibr" rid="R66">2015b</xref>).

generates the following parsed text:

One spectrum of HSS 348 was taken with the 2015b).

Do not remove facilities from acknowledgements

We are removing facilities from acknowledgements:

# move facilities out of acknowledgments
for e in parsed_xml.xpath(" | ".join(META_CONTENT['xml']['facility']['xpath'])):
    self._append_tag_outside_parent(e)

but this information is useful to keep there so that users who search the ack field can find the facilities (if that's what they are looking for). I think it would be good to extract the facilities without removing them from the acknowledgments.

Facilities appear outside of acknowledgements in XML

For the following 23 bibcodes the facility tags are placed outside of the acknowledgment tags. We need to move the facility tags into the acknowledgments during extraction/parsing if we want this information available to users.

2011ApJ...738..120A
<p>
<italic>Facilities:</italic> <named-content content-type="facility" xlink:href="HST"><italic>HST</italic> (COS, STIS)</named-content>, <named-content content-type="facility" xlink:href="ROSAT"><italic>ROSAT</italic> (PSPC)</named-content></p>
</sec>
</body>
<back>
<ack>
<p>This work was supported by grant HST-GO-11687.01-A from STScI, and has made use of public databases hosted by SIMBAD and VizieR, both maintained by CDS, Strasbourg, France, and the High Energy Astrophysics Science Archive Research Center, at the NASA Goddard Space Flight Center.</p>
</ack>

2011ApJ...740..109O
2011ApJS..192....4A
2011ApJ...738...27B
2010ApJ...721.1933S
2013ApJ...773...15W
2012ApJ...744...20S
2012ApJ...746..128S
2010ApJS..189...37E
2015ApJ...805...92O
2015ApJ...799...52W
2014ApJ...782...74H
2015ApJ...798...61T
2010ApJ...708..868C
2014ApJS..211....9L
2011ApJ...726...95D
2016ApJ...824...11G
2012ApJ...750...99T
2010ApJS..191..376S
2011ApJ...735...76K
2011ApJ...735...48S
2014ApJS..213....1T
2011ApJ...739....5W

Fix broken tests on TravisCI

The tests are failing mainly due to a newer version of RabbitMQ that TravisCI is using. The relevant command should be updated for the new release.

There are also additional errors such as:

ERROR: test_that_we_can_extract_acknowledgments (__main__.TestTEIXMLExtractor)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_unit/test.py", line 698, in test_that_we_can_extract_acknowledgments
    parsed_xml = self.extractor.parse_xml()
  File "/home/travis/build/adsabs/ADSfulltext/lib/StandardFileExtract.py", line 407, in parse_xml
    parsed_content = soupparser.fromstring(self.raw_xml)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 33, in fromstring
    return _parse(data, beautifulsoup, makeelement, **bsargs)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 79, in _parse
    root = _convert_tree(tree, makeelement)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 151, in _convert_tree
    converted = convert_node(e)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 212, in convert_node
    return handler(bs_node, parent)
  File "/home/travis/virtualenv/python2.7.9/lib/python2.7/site-packages/lxml/html/soupparser.py", line 269, in convert_pi
    res = etree.ProcessingInstruction(*bs_node.split(' ', 1))
  File "src/lxml/lxml.etree.pyx", line 3039, in lxml.etree.ProcessingInstruction (src/lxml/lxml.etree.c:75920)
ValueError: Invalid PI name 'xml'

Fulltext extraction problem?

This query returns 0 results for some reason. Can you please figure out where things break down?

bibcode:2015ApJ...801..127V body:DAOPHOT

Fix missing appendix text

A user reports they cannot search for text in the appendix portion of 2012A&A...539A..88C. For at least some types of XML files we are not extracting text from appendices and appending it to the body field. For another example, see 2012A&A...539A.119F.

For at least some pdf files, we do extract text from appendices and append it to the body field.

Wiley separation of acknowledgements from body content

Hi Jonny,

Didn't know where to create an issue for this, so I'm just letting you know about it. It looks like the XML parser for content we got from Wiley is not working as well as it should. As an example, see the parsed content from 2012MNRAS.419.3018C, which is in /proj/ads/articles/fulltext/extracted/20/12/MN/RA/S,/41/9,/30/18/C/

The fulltext clearly includes all of the acknowledgments as well as the bibliography, because the acknowledgements are defined within the body tags.

Invalid empty body resulting from <!-- body endbody --> syntax

This structure

<body>
    <!-- body
        <p>content</p>
    endbody -->
</body>

is used in AGU articles and results in empty body tags due to the removal of comments before extraction (although comments would still be ignored during extraction even if they were not removed). See this google doc for more details and this article's XML file (1942TeMAE..47..251B) for an example. Based on a random sample, it's probable that over 4,000 articles are affected by this issue. A hypothetical recovery step is sketched below.
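A hypothetical recovery step, applied before parsing, that unwraps the commented-out markup so it is not discarded with the other comments:

import re

BODY_COMMENT = re.compile(r'<!--\s*body(.*?)endbody\s*-->', re.DOTALL)

def unwrap_commented_body(raw_xml):
    # replace the whole comment with the markup it wraps
    return BODY_COMMENT.sub(lambda match: match.group(1), raw_xml)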

Document Similarity

Calculate the difference between the file associated with the original bibcode and the file associated with the new bibcode for all bibcode changes/updates.

MNRAS XML acronyms

MNRAS seems to be using markup to identify acronyms, which are otherwise entered in lowercase. In the visible text (HTML/PDF) you see "TOPCAT", but in the XML it says "topcat", so I guess TOPCAT is not being picked up as an acronym when indexing. See e.g.

/proj/ads/articles/fulltext/sources/MNRAS/0452/stv1276.xml

Extract useful data from tables

A user requested to be able to search the contents of a table using full:, for example, in this table the user wants to be able to search full:"2007 TK422" from the "Designation/Names" column. Currently, our protocol is not to extract the contents of tables, so for this table we only extract the title and footer, which means the only text we have in the fulltext for this table is "Table 3 Colors of Centaurs" and "Note. a Two different epochs. See Table 2". See paper here.

The reason our protocol is not to extract and index table data is to avoid needlessly increasing the load on solr. Tables can be formatted poorly, or not at all, so the intent was to avoid indexing chunks of numbers and characters that lose their meaning upon extraction.

XML for this table:

<table>
<colgroup>
<col align="left"/>
<col align="left"/>
<col align="center"/>
<col align="char" char="."/>
<col align="char" char="."/>
<col align="char" char="."/>
<col align="char" char="."/>
</colgroup>
<thead>
<tr>
<th align="left">Number</th>
<th align="left">Designation/Name</th>
<th align="center">UT Date</th>
<th align="center">Span</th>
<th align="center"><italic>r</italic></th>
<th align="center">&Delta;</th>
<th align="center"><italic>&agr;</italic></th>
</tr>
<tr>
<th align="left"/>
<th align="left"/>
<th align="center"/>
<th align="center">(hr)</th>
<th align="center">(au)</th>
<th align="center">(au)</th>
<th align="center">(deg)</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">469372</td>
<td align="left">2001 QF298</td>
<td align="center">2003 Nov 22.18</td>
<td align="char" char=".">1.4</td>
<td align="char" char=".">42.704</td>
<td align="char" char=".">42.373</td>
<td align="char" char=".">1.25</td>
</tr> 

The output of this table would look like this:

Table 3 Colors of Centaurs Number Designation/Name Telescope H V obs 167P CINEOS VATT 9.81 21.06 0.76 ± 0.03 0.53 ± 0.02 1.29 ± 0.03 55576 Amycus VATT 8.09 19.94 1.11 ± 0.02 0.70 ± 0.02 1.82 ± 0.03 60558 Echeclus VATT 9.66 ⋯ a 0.85 ± 0.04 0.55 ± 0.03 1.39 ± 0.04 120181 2003 UR292 VATT 7.51 22.03 1.03 ± 0.06 0.64 ± 0.05 1.67 ± 0.06 136204 2003 WL7 VATT 8.86 21.06 0.74 ± 0.04 0.49 ± 0.02 1.23 ± 0.04 145486 2005 UJ438 VATT 11.57 21.01 1.01 ± 0.03 0.63 ± 0.03 1.64 ± 0.04 248835 2006 SX368 VATT 9.50 20.54 0.74 ± 0.02 0.47 ± 0.02 1.22 ± 0.02 250112 2002 KY14 VATT 10.41 20.15 1.06 ± 0.02 0.69 ± 0.02 1.75 ± 0.02 281371 2008 FC76 DCT 9.69 19.75 0.97 ± 0.02 0.63 ± 0.02 1.60 ± 0.03 309139 2006 XQ51 VATT 10.33 21.47 0.74 ± 0.03 0.41 ± 0.03 1.15 ± 0.03 309737 2008 SJ236 VATT 12.50 20.10 1.01 ± 0.02 0.58 ± 0.02 1.60 ± 0.02 309741 2008 UZ6 VATT 11.23 21.46 0.92 ± 0.04 0.59 ± 0.03 1.52 ± 0.04 315898 2008 QD4 VATT 11.41 19.17 0.74 ± 0.02 0.46 ± 0.02 1.20 ± 0.02 336756 2010 NV1 VATT 10.71 20.85 0.74 ± 0.03 0.50 ± 0.02 1.24 ± 0.02 341275 2007 RG283 VATT 8.77 21.26 0.79 ± 0.03 0.47 ± 0.03 1.26 ± 0.03 342842 2008 YB3 VATT 9.59 18.34 0.77 ± 0.02 0.46 ± 0.02 1.23 ± 0.02 346889 Rhiphonos VATT 11.99 20.05 0.82 ± 0.02 0.55 ± 0.02 1.37 ± 0.02 349933 2009 YF7 DCT 11.00 20.88 0.72 ± 0.02 0.46 ± 0.02 1.18 ± 0.03 382004 2010 RM64 VATT 11.24 19.52 1.00 ± 0.02 0.55 ± 0.02 1.56 ± 0.02 447178 2005 RO43 VATT 7.26 21.29 0.77 ± 0.03 0.47 ± 0.03 1.24 ± 0.03 449097 2012 UT68 DCT 9.81 20.92 1.02 ± 0.02 0.66 ± 0.02 1.68 ± 0.03 459865 2013 XZ8 DCT 9.77 20.73 0.72 ± 0.02 0.45 ± 0.02 1.17 ± 0.03 459971 2014 ON6 DCT 12.10 20.53 0.97 ± 0.02 0.58 ± 0.02 1.55 ± 0.03 463368 2012 VU85 DCT 8.73 22.96 1.07 ± 0.06 0.63 ± 0.04 1.70 ± 0.07 471339 2011 ON45 DCT 11.94 22.32 1.11 ± 0.03 0.71 ± 0.02 1.81 ± 0.04 2002 PQ152 DCT 9.91 23.52 1.13 ± 0.04 0.72 ± 0.05 1.85 ± 0.06 2002 QX47 VATT 8.85 22.15 0.70 ± 0.04 0.38 ± 0.04 1.08 ± 0.04 2007 RH283 VATT 8.60 21.38 0.72 ± 0.03 0.43 ± 0.02 1.15 ± 0.03 2007 TJ422 DCT 11.55 24.25 1.74 ± 0.06 2007 TK422 DCT 9.35 22.89 0.71 ± 0.03 0.51 ± 0.02 1.22 ± 0.04 2007 UM126 VATT 10.26 21.20 0.74 ± 0.03 0.39 ± 0.03 1.13 ± 0.03 2007 VH305 VATT 11.91 21.02 0.69 ± 0.03 0.49 ± 0.02 1.18 ± 0.02 2010 BK118 VATT 10.42 18.47 0.81 ± 0.02 0.50 ± 0.02 1.32 ± 0.02 2010 TH DCT 9.27 21.43 0.72 ± 0.02 0.46 ± 0.02 1.18 ± 0.03 2013 UL10 DCT 13.46 21.60 0.97 ± 0.02 0.67 ± 0.02 1.64 ± 0.03 Note. a Two different epochs. See Table 2 .

Should we include an extra field 'appendix'

Should we extract the appendix of the paper as a separate entity that would allow it to be indexed and searched by the users, or stick it on to the end of the full text?

Fulltext query for "Mars" and "HiRISE" returns fewer papers than before

A user reported that querying for fulltext:"Mars" and fulltext:"HiRISE" previously returned 1,649 papers in Fall 2019, but as of Jan 2020 only 1,250 papers are returned for the same query. We are currently waiting on the user to respond back with an example of a paper that used to show up but now doesn't. I am documenting this here in case they don't respond or are not able to find an example paper.

Improve text extraction of large documents

Here is a list of records which have large fulltext contents. Due to limitations in SOLR, we currently throw away anything beyond 32k bytes. Under these circumstances, it would be nice to be more sensible when generating the fulltext so that we throw away things which are not interesting (e.g. numeric tables) and keep the text that we want.

['1998astro.ph..7308B', '/proj/ads/articles/fulltext/extracted/19/98/as/tr/o,/ph/,,/73/08/B/fulltext.txt', 105938]
['2003astro.ph..4480S', '/proj/ads/articles/fulltext/extracted/20/03/as/tr/o,/ph/,,/44/80/S/fulltext.txt', 62708]
['2004ADNDT..88...83L', '/proj/ads/articles/fulltext/extracted/20/04/AD/ND/T,/,8/8,/,,/83/L/fulltext.txt', 319850]
['2004JMoSp.228..593B', '/proj/ads/articles/fulltext/extracted/20/04/JM/oS/p,/22/8,/,5/93/B/fulltext.txt', 113945]
['2005ADNDT..89....1G', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,,/,1/G/fulltext.txt', 204757]
['2005ADNDT..89..101Z', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/01/Z/fulltext.txt', 126118]
['2005ADNDT..89..139E', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/39/E/fulltext.txt', 238924]
['2005ADNDT..89..195L', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,1/95/L/fulltext.txt', 280702]
['2005ADNDT..89..267G', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,8/9,/,2/67/G/fulltext.txt', 90409]
['2005ADNDT..90..177L', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,9/0,/,1/77/L/fulltext.txt', 313130]
['2005ADNDT..90..259Z', '/proj/ads/articles/fulltext/extracted/20/05/AD/ND/T,/,9/0,/,2/59/Z/fulltext.txt', 248951]
['2005NewA...10..325Z', '/proj/ads/articles/fulltext/extracted/20/05/Ne/wA/,,/,1/0,/,3/25/Z/fulltext.txt', 127388]
['2006ADNDT..92..105B', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,1/05/B/fulltext.txt', 269501]
['2006ADNDT..92..305L', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,3/05/L/fulltext.txt', 272532]
['2006ADNDT..92..481Z', '/proj/ads/articles/fulltext/extracted/20/06/AD/ND/T,/,9/2,/,4/81/Z/fulltext.txt', 550232]
['2006JMoSt.780..182L', '/proj/ads/articles/fulltext/extracted/20/06/JM/oS/t,/78/0,/,1/82/L/fulltext.txt', 68805]
['2006math......9485D', '/proj/ads/articles/fulltext/extracted/20/06/ma/th/,,/,,/,,/94/85/D/fulltext.txt', 127583]
['2007ADNDT..93....1L', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,,/,1/L/fulltext.txt', 207798]
['2007ADNDT..93..275B', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,2/75/B/fulltext.txt', 303136]
['2007ADNDT..93..615A', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,6/15/A/fulltext.txt', 468786]
['2007ADNDT..93..742B', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,7/42/B/fulltext.txt', 143947]
['2007ADNDT..93..864H', '/proj/ads/articles/fulltext/extracted/20/07/AD/ND/T,/,9/3,/,8/64/H/fulltext.txt', 169796]
['2007cmd..book....1C', '/proj/ads/articles/fulltext/extracted/20/07/cm/d,/,b/oo/k,/,,/,1/C/fulltext.txt', 52302]
['2008ADNDT..94....1L', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,,/,1/L/fulltext.txt', 128320]
['2008ADNDT..94..561D', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,5/61/D/fulltext.txt', 189230]
['2008ADNDT..94..807L', '/proj/ads/articles/fulltext/extracted/20/08/AD/ND/T,/,9/4,/,8/07/L/fulltext.txt', 410355]
['2009ADNDT..95....1S', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,,/,1/S/fulltext.txt', 285454]
['2009ADNDT..95..547L', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,5/47/L/fulltext.txt', 121969]
['2009ADNDT..95..607A', '/proj/ads/articles/fulltext/extracted/20/09/AD/ND/T,/,9/5,/,6/07/A/fulltext.txt', 920690]
['2009arXiv0904.2782S', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/90/4,/27/82/S/fulltext.txt', 352583]
['2009arXiv0910.1690A', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/91/0,/16/90/A/fulltext.txt', 79066]
['2009arXiv0910.5784S', '/proj/ads/articles/fulltext/extracted/20/09/ar/Xi/v0/91/0,/57/84/S/fulltext.txt', 918828]
['2010ADNDT..96....1T', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,,/,1/T/fulltext.txt', 175280]
['2010ADNDT..96..123A', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,1/23/A/fulltext.txt', 678572]
['2010ADNDT..96..481H', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,4/81/H/fulltext.txt', 219803]
['2010ADNDT..96..759S', '/proj/ads/articles/fulltext/extracted/20/10/AD/ND/T,/,9/6,/,7/59/S/fulltext.txt', 315417]
['2011ADNDT..97...50B', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,,/50/B/fulltext.txt', 308131]
['2011ADNDT..97..225A', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,2/25/A/fulltext.txt', 702513]
['2011ADNDT..97..587L', '/proj/ads/articles/fulltext/extracted/20/11/AD/ND/T,/,9/7,/,5/87/L/fulltext.txt', 309957]
['2011ESASP.690E..24A', '/proj/ads/articles/fulltext/extracted/20/11/ES/AS/P,/69/0E/,,/24/A/fulltext.txt', 80391]
['2011PhDT........92J', '/proj/ads/articles/fulltext/extracted/20/11/Ph/DT/,,/,,/,,/,,/92/J/fulltext.txt', 918724]
['2012ADNDT..98..149M', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,1/49/M/fulltext.txt', 185016]
['2012ADNDT..98..437D', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,4/37/D/fulltext.txt', 163161]
['2012ADNDT..98..779W', '/proj/ads/articles/fulltext/extracted/20/12/AD/ND/T,/,9/8,/,7/79/W/fulltext.txt', 104213]
['2013ADNDT..99..249T', '/proj/ads/articles/fulltext/extracted/20/13/AD/ND/T,/,9/9,/,2/49/T/fulltext.txt', 362192]
['2013ADNDT..99..459O', '/proj/ads/articles/fulltext/extracted/20/13/AD/ND/T,/,9/9,/,4/59/O/fulltext.txt', 141759]
['2013arXiv1308.5199C', '/proj/ads/articles/fulltext/extracted/20/13/ar/Xi/v1/30/8,/51/99/C/fulltext.txt', 579391]
['2013arXiv1312.4478L', '/proj/ads/articles/fulltext/extracted/20/13/ar/Xi/v1/31/2,/44/78/L/fulltext.txt', 443883]
['2014ADNDT.100..651M', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,6/51/M/fulltext.txt', 437029]
['2014ADNDT.100..802F', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,8/02/F/fulltext.txt', 132232]
['2014ADNDT.100..986T', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/,9/86/T/fulltext.txt', 335214]
['2014ADNDT.100.1156T', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/11/56/T/fulltext.txt', 135543]
['2014ADNDT.100.1292F', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/12/92/F/fulltext.txt', 132797]
['2014ADNDT.100.1357X', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/13/57/X/fulltext.txt', 142772]
['2014ADNDT.100.1399A', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/13/99/A/fulltext.txt', 515880]
['2014ADNDT.100.1519L', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/15/19/L/fulltext.txt', 339602]
['2014ADNDT.100.1603A', '/proj/ads/articles/fulltext/extracted/20/14/AD/ND/T,/10/0,/16/03/A/fulltext.txt', 716084]
['2015ADNDT.101...41Z', '/proj/ads/articles/fulltext/extracted/20/15/AD/ND/T,/10/1,/,,/41/Z/fulltext.txt', 631607]
['2015arXiv150309147G', '/proj/ads/articles/fulltext/extracted/20/15/ar/Xi/v1/50/30/91/47/G/fulltext.txt', 61666]
['2016ADNDT.107..140A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/7,/,1/40/A/fulltext.txt', 295055]
['2016ADNDT.107..221A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/7,/,2/21/A/fulltext.txt', 554068]
['2016ADNDT.108...15W', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/10/8,/,,/15/W/fulltext.txt', 145479]
['2016ADNDT.111..280A', '/proj/ads/articles/fulltext/extracted/20/16/AD/ND/T,/11/1,/,2/80/A/fulltext.txt', 204495]
['2016JPhCS.758a2002D', '/proj/ads/articles/fulltext/extracted/20/16/JP/hC/S,/75/8a/20/02/D/fulltext.txt', 19531]
['2016JPhCS.761a2034C', '/proj/ads/articles/fulltext/extracted/20/16/JP/hC/S,/76/1a/20/34/C/fulltext.txt', 30861]
['2016JQSRT.168..102S', '/proj/ads/articles/fulltext/extracted/20/16/JQ/SR/T,/16/8,/,1/02/S/fulltext.txt', 121291]
['2016JVGR..311...79G', '/proj/ads/articles/fulltext/extracted/20/16/JV/GR/,,/31/1,/,,/79/G/fulltext.txt', 99296]
['2016NJPh...18j3050C', '/proj/ads/articles/fulltext/extracted/20/16/NJ/Ph/,,/,1/8j/30/50/C/fulltext.txt', 22097]
['2016Tectp.677....1L', '/proj/ads/articles/fulltext/extracted/20/16/Te/ct/p,/67/7,/,,/,1/L/fulltext.txt', 137039]

More content extraction

This is a general comment on content extraction. It should be possible to extract more meaningful content from the files that are ingested. For example, PDF, OCR and TXT files do not differentiate between their content the way HTML and XML files do. Once a schema is in place, it can be applied to all of the simple-text files that are extracted.

HTML extractor cannot decode certain characters

Some HTML files contain characters that codecs can't decode, which results in extraction failure. See 2010A&A...514A..98C for example. Characters throwing this error include 0xf3, 0xf6, 0xe1, 0xe0, etc.
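An illustrative fallback using UnicodeDammit to guess the encoding instead of assuming UTF-8 (shown with the BeautifulSoup4 import path, which is an assumption; open_xml() already uses UnicodeDammit for XML files):

from bs4 import UnicodeDammit  # assumes BeautifulSoup4

def decode_html(raw_bytes):
    # let UnicodeDammit detect the encoding (e.g. Latin-1) before extraction
    return UnicodeDammit(raw_bytes).unicode_markup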

Fix data set to extract all

Currently, only one data set is being extracted.

adsabs/ADSExports should be updated to expect lists.

html5lib parser failing due to less than symbol in XML files

243 errors such as "Invalid HTML tag name", "Invalid tag name", "Empty tag name" and "Invalid namespace URI" are occurring because "<" symbols are mistaken for the beginning of an XML tag by the html5lib parser.

Part of the problem is due to LaTeX formulas that are not caught by the regular expression we use to remove them. See 2015GeoJI.200.1466Z for example.

A number of these errors are slightly more difficult to resolve as they occur from images. See this google doc for more details.
