
Science Parse

Science Parse parses scientific papers (in PDF form) and returns them in structured form. As of today, it supports these fields:

  • Title
  • Authors
  • Abstract
  • Sections (each with heading and body text)
  • Bibliography, each with
    • Title
    • Authors
    • Venue
    • Year
  • Mentions, i.e., places in the paper where bibliography entries are mentioned

The output is JSON; sample outputs (one with sections, one without) are linked from the project page. The easiest way to get started is to use the output from the SP server.
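As a rough sketch of the shape of that JSON (field names here are inferred from the field list above, not an authoritative schema; check a real server response for the exact keys):

```json
{
  "title": "...",
  "authors": ["..."],
  "abstractText": "...",
  "sections": [ { "heading": "...", "text": "..." } ],
  "references": [
    { "title": "...", "authors": ["..."], "venue": "...", "year": 2017 }
  ],
  "referenceMentions": [ "..." ]
}
```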

New version: SPv2

There is a new version of Science Parse out that works in a completely different way. It has fewer features, but produces higher-quality output. Check out the details at https://github.com/allenai/spv2.

Get started

There are three different ways to get started with SP. Each has its own document:

  • Server: This contains the SP server. It's useful for PDF parsing as a service. It's also probably the easiest way to get going.
  • CLI: This contains the command line interface to SP. That's most useful for batch processing.
  • Core: This contains SP as a library. It has all the extraction code, plus training and evaluation. Both server and CLI use this to do the actual work.

How to include into your own project

The current version is 3.0.0. If you want to include it in your own project, use this:

For SBT:

libraryDependencies += "org.allenai" %% "science-parse" % "3.0.0"

For Maven:

<dependency>
  <groupId>org.allenai</groupId>
  <artifactId>science-parse_2.12</artifactId>
  <version>3.0.0</version>
</dependency>
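Once the dependency is on the classpath, library usage looks roughly like the following. This is a hedged sketch: Parser.getInstance() appears in the project's own stack traces, but the exact parse-method name and the ExtractedMetadata field names are assumptions here and should be checked against the Core documentation.

```java
import org.allenai.scienceparse.ExtractedMetadata;
import org.allenai.scienceparse.Parser;

import java.io.FileInputStream;
import java.io.InputStream;

public class ParseOnePdf {
    public static void main(String[] args) throws Exception {
        // Downloads and caches the default model files on first use,
        // so the first run is slow.
        Parser parser = Parser.getInstance();
        try (InputStream pdf = new FileInputStream("paper.pdf")) {
            // doParse and the public title field are assumed;
            // consult the Core docs for the exact API.
            ExtractedMetadata em = parser.doParse(pdf);
            System.out.println(em.title);
        }
    }
}
```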

The first time you run it, SP will download some rather large model files. Don't be alarmed! The model files are cached, and startup is much faster the second time.

For licensing reasons, SP does not include libraries for some image formats. Without these libraries, SP cannot process PDFs that contain images in these formats. If you have no licensing restrictions in your project, we recommend you add these additional dependencies to your project as well:

  "com.github.jai-imageio" % "jai-imageio-core" % "1.2.1",
  "com.github.jai-imageio" % "jai-imageio-jpeg2000" % "1.3.0", // For handling jpeg2000 images
  "com.levigo.jbig2" % "levigo-jbig2-imageio" % "1.6.5", // For handling jbig2 images

Development

This project is a hybrid between Java and Scala. The interaction between the languages is fairly seamless, and SP can be used as a library in any JVM-based language.

Our build system is sbt. To build science-parse, you have to have sbt installed and working. You can find details about that at https://www.scala-sbt.org.

Once you have sbt set up, just start sbt in the main project folder to launch sbt's shell. There are many things you can do in the shell, but here are the most important ones:

  • +test runs all the tests in all the projects across Scala versions.
  • cli/assembly builds a runnable superjar (i.e., a jar with all dependencies bundled) for the project. You can run it (from bash, not from sbt) with java -Xmx10g -jar <location of superjar>.
  • server/assembly builds a runnable superjar for the webserver.
  • server/run starts the server directly from the sbt shell.

Lombok

This project uses Lombok, which requires you to enable annotation processing in your IDE. For IntelliJ, install the Lombok plugin and enable annotation processing in the settings.

Lombok has a lot of useful annotations that give you some of the nice things in Scala:

  • val is equivalent to final plus the type of the right-hand side; it gives you Scala-style type inference via some tricks.
  • Check out @Data.
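As a small illustration of both features (requires Lombok on the classpath; the class and field names below are made up for the example):

```java
import lombok.Data;
import lombok.val;

// @Data generates getters, equals, hashCode, toString, and a
// constructor for the final fields.
@Data
class Point {
    private final int x;
    private final int y;
}

class Demo {
    static String describe() {
        // val infers the type from the right-hand side, like Scala's val.
        val p = new Point(1, 2);
        return p.toString();
    }
}
```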

Thanks

Special thanks goes to @kermitt2, whose work on kermitt2/grobid inspired Science Parse, and helped us get started with some labeled data.

Releasing new versions

This project releases to Bintray. To make a release:

  1. Pull the latest code on the master branch that you want to release.
  2. Tag the release: git tag -a vX.Y.Z -m "Release X.Y.Z", replacing X.Y.Z with the correct version.
  3. Push the tag back to origin: git push origin vX.Y.Z
  4. Release the build on Bintray: sbt +publish (the "+" is required to cross-compile).
  5. Verify the publication on bintray.com.
  6. Bump the version in build.sbt on master (and push!) to X.Y.Z+1 (e.g., 2.5.1 after releasing 2.5.0).
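The release steps above can be sketched as a shell session (X.Y.Z is a placeholder for the actual version; the final verification step is manual):

```shell
git checkout master && git pull            # step 1: latest master
git tag -a vX.Y.Z -m "Release X.Y.Z"       # step 2: tag the release
git push origin vX.Y.Z                     # step 3: push the tag
sbt +publish                               # step 4: cross-compile and publish to Bintray
# step 5: verify on bintray.com, then bump the version in build.sbt and push
```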

If you make a mistake, you can roll back the release with sbt bintrayUnpublish and retag the version to a different commit as necessary.

People

Contributors

amosjyng, aria42, bbstilson, chrisc36, dcdowney, dirkgr, dirkraft, jpowerwa, nalourie-ai2, rjpower, rodneykinney, rreas


Issues

Timeout

I have followed the instructions in the CLI documentation, but I get the following error and I am not sure why. Could you please help me with this issue?

java -Xmx6g -jar science-parse-cli-assembly-1.3.2-SNAPSHOT.jar -o /home/usr/papers/ -f ox.json /home/usr/papers/OX.pdf
00:05:29.631 [main] DEBUG com.amazonaws.AmazonWebServiceClient - Internal logging successfully configured to commons logger: true
00:05:30.001 [main] DEBUG com.amazonaws.metrics.AwsSdkMetrics - Admin mbean registered under com.amazonaws.management:type=AwsSdkMetrics
00:05:30.159 [main] DEBUG c.a.internal.config.InternalConfig - Configuration override awssdk_config_override.json not found.
00:05:31.776 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.Parser - Loading gazetteer from /home/mahmoud/.ai2/datastore/public/org.allenai.scienceparse/gazetteer-v5.json
00:05:31.777 [ModelLoaderThread] INFO org.allenai.scienceparse.Parser - Loading model from /home/mahmoud/.ai2/datastore/public/org.allenai.scienceparse/productionModel-v9.dat
00:05:31.781 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.Parser - Loading bib model from /home/mahmoud/.ai2/datastore/public/org.allenai.scienceparse/productionBibModel-v7.dat
00:05:31.828 [ForkJoinPool-1-worker-1] INFO org.allenai.scienceparse.Parser - Creating gazetteer cache at /tmp/gazetteer-v5.json-57500763.gazetteerCache.bin
00:05:45.620 [ForkJoinPool-1-worker-1] INFO o.a.scienceparse.ParserGroundTruth - Read 1609659 papers.
00:14:28.071 [ModelLoaderThread] INFO org.allenai.datastore.Datastore - Starting to wait on /home/mahmoud/.ai2/datastore/public/org.allenai.scienceparse/Word2VecModel-v1.bin.lock
00:21:09.760 [ForkJoinPool-1-worker-1] INFO o.a.scienceparse.ExtractReferences - could not load kermit gazetter
Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [15 minutes]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:116)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:116)
at org.allenai.scienceparse.RunSP$$anonfun$main$1.apply(RunSP.scala:195)
at org.allenai.scienceparse.RunSP$$anonfun$main$1.apply(RunSP.scala:83)
at scala.Option.foreach(Option.scala:257)
at org.allenai.scienceparse.RunSP$.main(RunSP.scala:83)
at org.allenai.scienceparse.RunSP.main(RunSP.scala)

How do I change the default port 8080 within the docker container

Hi,
How can I change the port of the service inside the docker container from 8080 to something different, say 8081? I want this because another container in my docker network runs on port 8080, which causes problems for intercommunication.

Thanks
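One common workaround, assuming the server listens on 8080 inside the container (the image name below is hypothetical): remap a different host port to the container's 8080 instead of changing the service itself:

```shell
# Host port 8081 forwards to port 8080 inside the container;
# other containers on the host network reach it at 8081.
docker run -p 8081:8080 science-parse-server
```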

Parsing mathematics in pdf

Using science-parse to extract text from technical literature does not output mathematical symbols in a structured manner; it mostly maps them to ASCII symbols.

We are interested in obtaining the mathematical formulas in LaTeX or MathML format and are curious whether the science-parse tool could provide this with minor adjustments. If it is not that straightforward, are you aware of any other open-source tool for this task?

Extract more field and meta data?

Hi,

Are there any plans to extract more fields in the foreseeable future?

For example:

  • the body text could include more markup
  • there is an affiliations field that seems to be always empty (but is also not documented as a field)
  • references may include more properties
  • coordinates to the location in the source document would help to reconcile it with other extraction tools

Another related question: do you have a bit more information on the scope of the venue field? It looks like it could be a conference name or a journal, for example. Perhaps anything that doesn't fit into the other reference fields (title, author, year)?

Thank you

Can't use the parser if I have the libraries locally

I have the project set up locally, and it created the libraries it needed.
When I try to use the Parser, I get this error:

00:13:12.723 [run-main-4] DEBUG com.amazonaws.metrics.AwsSdkMetrics - Admin mbean registered under com.amazonaws.management:type=AwsSdkMetrics/1
00:13:13.090 [run-main-4] DEBUG c.a.internal.config.InternalConfig - Configuration override awssdk_config_override.json not found.
[error] (run-main-4) java.lang.NoClassDefFoundError: scala/Product$class
[error] java.lang.NoClassDefFoundError: scala/Product$class
[error] at org.allenai.datastore.Datastore$Locator.(Datastore.scala:96)
[error] at org.allenai.datastore.Datastore.filePath(Datastore.scala:343)
[error] at org.allenai.scienceparse.Parser.getDefaultProductionModel(Parser.java:107)
[error] at org.allenai.scienceparse.Parser.(Parser.java:127)
[error] at org.allenai.scienceparse.Parser.getInstance(Parser.java:122)
[error] at ai.lum.paperreader.Test$.main(test.scala:21)
[error] at ai.lum.paperreader.Test.main(test.scala)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.lang.reflect.Method.invoke(Method.java:498)
[error] at sbt.Run.invokeMain(Run.scala:89)
[error] at sbt.Run.run0(Run.scala:83)
[error] at sbt.Run.execute$1(Run.scala:61)
[error] at sbt.Run.$anonfun$run$4(Run.scala:73)
[error] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error] at sbt.util.InterfaceUtil$$anon$1.get(InterfaceUtil.scala:10)
[error] at sbt.TrapExit$App.run(TrapExit.scala:252)
[error] at java.lang.Thread.run(Thread.java:748)
[error] Caused by: java.lang.ClassNotFoundException: scala.Product$class

Few Queries

Is it possible to parse a PDF into text and get the following details of the paper?
"""
Title
Authors
Abstract
Sections (each with heading and body text)
"""
Is your PDF reader similar to Grobid?

The reason I cannot run the following steps myself is network issues:

  1. Downloading the training models always fails:
08:15:03.951 [main] WARN  org.allenai.datastore.Datastore - java.net.SocketTimeoutException: Read timed out while downloading org.allenai.scienceparse/productionBibModel-v7.dat. 6 retries left.
  2. Then, I wanted to try your service. However, the URL in this domain does not work for me: http://scienceparse.allenai.org
    e.g http://scienceparse.allenai.org/v1/498bb0efad6ec15dd09d941fb309aa18d6df9f5f?skipFields=sections
504 Gateway Time-out
nginx/1.4.6 (Ubuntu)

Science Parse Server - .jar not able to download model files

I created the jar with sbt server/assembly and got the super jar science-parse-server-assembly-1.3.3-SNAPSHOT.jar.
As mentioned in the server documentation, I need to run
java -Xmx6g -jar science-parse-server-assembly-1.3.3-SNAPSHOT.jar

When I run that command, I get a timeout:

WARN  org.allenai.datastore.Datastore: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to public.store.dev.allenai.org.s3.amazonaws.com:443 [public.store.dev.allenai.org.s3.amazonaws.com/52.218.209.58] failed: connect timed out while downloading org.allenai.scienceparse/productionModel-v9.dat. 1 retries left.
Exception in thread "main" com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to public.store.dev.allenai.org.s3.amazonaws.com:443 [public.store.dev.allenai.org.s3.amazonaws.com/54.231.184.226] failed: connect timed out
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1113)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1063)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4247)
	at com.amazonaws.services.s3.AmazonS3Client.getBucketRegionViaHeadRequest(AmazonS3Client.java:5008)
	at com.amazonaws.services.s3.AmazonS3Client.fetchRegionFromCache(AmazonS3Client.java:4982)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4231)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4194)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1398)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1259)
	at org.allenai.datastore.Datastore$$anonfun$org$allenai$datastore$Datastore$$getS3Object$1.apply(Datastore.scala:215)
	at org.allenai.datastore.Datastore$$anonfun$org$allenai$datastore$Datastore$$getS3Object$1.apply(Datastore.scala:215)
	at org.allenai.datastore.Datastore.org$allenai$datastore$Datastore$$accessDeniedWrapper(Datastore.scala:202)
	at org.allenai.datastore.Datastore.org$allenai$datastore$Datastore$$getS3Object(Datastore.scala:214)
	at org.allenai.datastore.Datastore$$anonfun$path$1.apply$mcV$sp(Datastore.scala:389)
	at org.allenai.datastore.Datastore$$anonfun$path$1.apply(Datastore.scala:387)
	at org.allenai.datastore.Datastore$$anonfun$path$1.apply(Datastore.scala:387)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:48)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.withRetries(Datastore.scala:61)
	at org.allenai.datastore.Datastore.path(Datastore.scala:386)
	at org.allenai.datastore.Datastore.filePath(Datastore.scala:343)
	at org.allenai.scienceparse.Parser.getDefaultProductionModel(Parser.java:99)
	at org.allenai.scienceparse.SPServer$$anonfun$main$1$$anonfun$9.apply(SPServer.scala:73)
	at org.allenai.scienceparse.SPServer$$anonfun$main$1$$anonfun$9.apply(SPServer.scala:73)
	at scala.Option.getOrElse(Option.scala:121)
	at org.allenai.scienceparse.SPServer$$anonfun$main$1.apply(SPServer.scala:73)
	at org.allenai.scienceparse.SPServer$$anonfun$main$1.apply(SPServer.scala:71)
	at scala.Option.foreach(Option.scala:257)
	at org.allenai.scienceparse.SPServer$.main(SPServer.scala:71)
	at org.allenai.scienceparse.SPServer.main(SPServer.scala)
Caused by: org.apache.http.conn.ConnectTimeoutException: Connect to public.store.dev.allenai.org.s3.amazonaws.com:443 [public.store.dev.allenai.org.s3.amazonaws.com/54.231.184.226] failed: connect timed out
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:150)
	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353)
	at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
	at com.amazonaws.http.conn.$Proxy3.connect(Unknown Source)
	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380)
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1235)
	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055)
	... 42 more
Caused by: java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:337)
	at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:132)
	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:141)
	... 57 more
INFO  org.allenai.datastore.TempCleanup$: Cleaning up file at /home/aman/.ai2/datastore/tmp/ai2-datastore-org.allenai.scienceparse%productionModel-v9.dat4393750941617116906.tmp

My proxy is working fine; I also tried on a different system, but I get the same error.

Appreciate your time.

Api

Hello, how can I use this from Python?

Extract tables in structured format

Hello!

As far as I understand, science-parse extracts metadata and raw text from articles. However, I would also like to get structured information from tables, for example in a JSON-like format:

[['dataset', 'model', 'metrics'], ['MNIST', 'CNN', '0.98'], ...]

I understand that some tables are difficult to extract, e.g., when some rows/columns are merged. However, I believe that in the simplest case this can be done quite easily.

Any ideas?

Error when building by sbt outside of a git repository

There is an error when building a super-jar with sbt, in both server and cli.

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
[info] Loading project definition from D:\science-parse-master\project
Missing bintray credentials C:\Users\v-chutan\.bintray\.credentials. Some bintray features depend on this.
Missing bintray credentials C:\Users\v-chutan\.bintray\.credentials. Some bintray features depend on this.
Missing bintray credentials C:\Users\v-chutan\.bintray\.credentials. Some bintray features depend on this.
Missing bintray credentials C:\Users\v-chutan\.bintray\.credentials. Some bintray features depend on this.
[info] Set current project to science-parse-master (in build file:/D:/science-parse-master/)
[warn] Credentials file C:\Users\v-chutan\.bintray\.credentials does not exist
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
[warn] Credentials file C:\Users\v-chutan\.bintray\.credentials does not exist
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
fatal: not a git repository (or any of the parent directories): .git
java.lang.RuntimeException: Nonzero exit code: 128
        at scala.sys.package$.error(package.scala:27)
        at sbt.Streamed$.sbt$Streamed$$next$1(ProcessImpl.scala:429)
        at sbt.Streamed$$anonfun$apply$10.apply(ProcessImpl.scala:432)
        at sbt.Streamed$$anonfun$apply$10.apply(ProcessImpl.scala:432)
        at sbt.AbstractProcessBuilder.lines(ProcessImpl.scala:151)
        at sbt.AbstractProcessBuilder.lines(ProcessImpl.scala:141)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$7.apply(VersionInjectorPlugin.scala:88)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$7.apply(VersionInjectorPlugin.scala:87)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$4.work(System.scala:63)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
        at sbt.Execute.work(Execute.scala:237)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
        at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
java.lang.RuntimeException: Nonzero exit value: 128
        at scala.sys.package$.error(package.scala:27)
        at sbt.AbstractProcessBuilder.getString(ProcessImpl.scala:134)
        at sbt.AbstractProcessBuilder.$bang$bang(ProcessImpl.scala:136)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$6.apply(VersionInjectorPlugin.scala:86)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$6.apply(VersionInjectorPlugin.scala:86)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$4.work(System.scala:63)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
        at sbt.Execute.work(Execute.scala:237)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
        at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
java.lang.RuntimeException: Nonzero exit value: 128
        at scala.sys.package$.error(package.scala:27)
        at sbt.AbstractProcessBuilder.getString(ProcessImpl.scala:134)
        at sbt.AbstractProcessBuilder.$bang$bang(ProcessImpl.scala:136)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$1.apply$mcJ$sp(VersionInjectorPlugin.scala:85)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$1.apply(VersionInjectorPlugin.scala:85)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$1.apply(VersionInjectorPlugin.scala:85)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$4.work(System.scala:63)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
        at sbt.Execute.work(Execute.scala:237)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
        at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
java.lang.RuntimeException: Nonzero exit code: 128
        at scala.sys.package$.error(package.scala:27)
        at sbt.Streamed$.sbt$Streamed$$next$1(ProcessImpl.scala:429)
        at sbt.Streamed$$anonfun$apply$10.apply(ProcessImpl.scala:432)
        at sbt.Streamed$$anonfun$apply$10.apply(ProcessImpl.scala:432)
        at sbt.AbstractProcessBuilder.lines(ProcessImpl.scala:151)
        at sbt.AbstractProcessBuilder.lines(ProcessImpl.scala:141)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$7.apply(VersionInjectorPlugin.scala:88)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$7.apply(VersionInjectorPlugin.scala:87)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$4.work(System.scala:63)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
        at sbt.Execute.work(Execute.scala:237)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
        at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
java.lang.RuntimeException: Nonzero exit value: 128
        at scala.sys.package$.error(package.scala:27)
        at sbt.AbstractProcessBuilder.getString(ProcessImpl.scala:134)
        at sbt.AbstractProcessBuilder.$bang$bang(ProcessImpl.scala:136)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$6.apply(VersionInjectorPlugin.scala:86)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$6.apply(VersionInjectorPlugin.scala:86)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$4.work(System.scala:63)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
        at sbt.Execute.work(Execute.scala:237)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
        at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
java.lang.RuntimeException: Nonzero exit value: 128
        at scala.sys.package$.error(package.scala:27)
        at sbt.AbstractProcessBuilder.getString(ProcessImpl.scala:134)
        at sbt.AbstractProcessBuilder.$bang$bang(ProcessImpl.scala:136)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$1.apply$mcJ$sp(VersionInjectorPlugin.scala:85)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$1.apply(VersionInjectorPlugin.scala:85)
        at org.allenai.plugins.VersionInjectorPlugin$$anonfun$1.apply(VersionInjectorPlugin.scala:85)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$3$$anonfun$apply$2.apply(System.scala:44)
        at sbt.std.Transform$$anon$4.work(System.scala:63)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
        at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
        at sbt.Execute.work(Execute.scala:237)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
        at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
        at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
[error] (cli/*:gitRemotes) Nonzero exit code: 128
[error] (cli/*:gitSha1) Nonzero exit value: 128
[error] (cli/*:gitCommitDate) Nonzero exit value: 128
[error] (core/*:gitRemotes) Nonzero exit code: 128
[error] (core/*:gitSha1) Nonzero exit value: 128
[error] (core/*:gitCommitDate) Nonzero exit value: 128
[error] Total time: 1 s, completed Aug 3, 2018 8:18:55 AM

Headers and Footers appearing in text

I am using the web API to run Science Parse on a PDF that has headers and footers. The header and footer text appears interleaved with the body text, making it impossible to remove with any pattern.

Below is an example of it.

3.2. The GUI main window:The flowchart in the main GUI of LS-OPT (LS-OPT Version 5.2 20CHAPTER 3: Graphical User InterfaceThe control bar menus are described in Table 3-1.CHAPTER 3: Graphical User InterfaceSummary Report Open the lsopt_report fileWarnings Open the WARNING_MESSAGE fileErrors Open the EXIT_STATUS fileOpens up the working directoryOther file… Option to open any other text fileAdd Sampling Add additional Sampling. The name of the sampling will be used as the name of a subdirectory used for sampling related databases such as Experiments_n.csv

The text in bold actually belongs in the header; it is just the heading of the chapter. Page numbers from the footers are also making their way into the text. Could somebody solve this problem, so that data from headers and footers is not included in the output?
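Until the extractor itself separates headers and footers, one possible workaround is to post-process the text: lines that repeat verbatim on many pages are likely running headers or footers. The sketch below is illustrative only (the function name and threshold are made up), and it only helps when the footer text survives as its own line rather than being stitched mid-sentence, as in the example above:

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that appear verbatim on most pages (likely headers/footers).

    `pages` is a list of per-page text strings. This is a heuristic: it
    cannot recover footer text that was merged into the middle of a line.
    """
    counts = Counter()
    for page in pages:
        for line in set(page.splitlines()):
            counts[line] += 1
    threshold = max(2, int(len(pages) * min_fraction))
    return [
        "\n".join(l for l in page.splitlines() if counts[l] < threshold)
        for page in pages
    ]

pages = [
    "Some body text on page 1\nCHAPTER 3: Graphical User Interface",
    "Some body text on page 2\nCHAPTER 3: Graphical User Interface",
    "Some body text on page 3\nCHAPTER 3: Graphical User Interface",
]
assert strip_repeated_lines(pages)[0] == "Some body text on page 1"
```

This catches running headers like the "CHAPTER 3: Graphical User Interface" text above, but not the page numbers that get fused into body lines.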

fontbox error

Hi,
I built the cli/assembly version and successfully downloaded the model. However, when I try to parse documents by running
java -jar science-parse-cli-assembly-2.0.3.jar D:\Documents\pdfs, I get the following error:

java.lang.NoClassDefFoundError: Could not initialize class org.apache.pdfbox.pdmodel.font.PDType1Font
        at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:62)
        at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
        at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:60)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:848)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:503)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:477)
        at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:150)
        at org.apache.pdfbox.text.LegacyPDFStreamEngine.processPage(LegacyPDFStreamEngine.java:139)
        at org.apache.pdfbox.text.PDFTextStripper.processPage(PDFTextStripper.java:391)
        at org.apache.pdfbox.text.PDFTextStripper.processPages(PDFTextStripper.java:319)
        at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:266)
        at org.apache.pdfbox.text.PDFTextStripper.getText(PDFTextStripper.java:227)
        at org.allenai.scienceparse.pdfapi.PDFExtractor.extractResultFromPDDocument(PDFExtractor.java:158)
        at org.allenai.scienceparse.Parser.doParse(Parser.java:1030)
        at org.allenai.scienceparse.Parser.doParse(Parser.java:976)
        at org.allenai.scienceparse.Parser.doParseWithTimeout(Parser.java:951)
        at org.allenai.scienceparse.RunSP$$anonfun$main$1$$anonfun$22.apply(RunSP.scala:204)
        at org.allenai.scienceparse.RunSP$$anonfun$main$1$$anonfun$22.apply(RunSP.scala:200)
        at org.allenai.common.ParIterator$ParIteratorEnrichment$$anonfun$parForeach$extension$1.apply$mcV$sp(ParIterator.scala:47)
        at org.allenai.common.ParIterator$ParIteratorEnrichment$$anonfun$parForeach$extension$1.apply(ParIterator.scala:46)
        at org.allenai.common.ParIterator$ParIteratorEnrichment$$anonfun$parForeach$extension$1.apply(ParIterator.scala:46)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
        at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
        at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I checked my .ivy2 cache and found the correct jar: .ivy2\cache\org.apache.pdfbox\fontbox\bundles\fontbox-2.0.9.jar.

I am running 64-bit Windows 10.

Any help would be greatly appreciated.
Thank you.

Font Information Missing

There are a few cases where visually different fonts will have identical PDFFontMetrics. The ones that come to mind are:

- Fonts as stored in PDFBox have a variety of flags that can be set and that change the font's appearance (for example, Bold and Italic flags).
- I believe in some cases effects like italics can be implemented by changing the FontMatrix.

There might be others as well; I am not super familiar with the depths of PDFBox's font processing. Not sure if you are already aware of this, but I wanted to flag it in case of confusion later when comparing fonts.
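In other words, metrics alone under-determine a font's appearance; any comparison key would also need the style flags (and ideally the matrix scale). A tiny illustrative sketch, not PDFBox API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FontKey:
    """Hypothetical comparison key: two fonts with identical metrics can
    still render differently if their style flags or matrix differ."""
    name: str
    bold: bool
    italic: bool
    matrix_scale: float

regular = FontKey("NimbusRomNo9L", bold=False, italic=False, matrix_scale=1.0)
emphas = FontKey("NimbusRomNo9L", bold=True, italic=False, matrix_scale=1.0)
assert regular != emphas  # same name and metrics, visually distinct fonts
```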

Which encoding are science-parse output files written in?

I am using science-parse to get the sections of PDF papers.
I assumed that the resulting JSON files are in UTF-8,
but I have encountered several cases of surrogate pairs encoded as "\uxxxx\uxxxx",
which is characteristic of UTF-16.

In https://en.wikipedia.org/wiki/JSON I found the following paragraph:

JSON exchange in an open ecosystem must be encoded in UTF-8.[18] The encoding supports the full Unicode character set, including those characters outside the Basic Multilingual Plane (U+10000 to U+10FFFF). However, if escaped, those characters must be written using UTF-16 surrogate pairs, a detail missed by some JSON parsers. For example, to include the Emoji character U+1F602 😂 FACE WITH TEARS OF JOY in JSON:

{ "face": "😂" }
// or
{ "face": "\uD83D\uDE02" }

So, how does science-parse encode these out-of-BMP Unicode code points?
Is this configurable?
Thanks.
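As context for the question: `\uXXXX\uXXXX` escapes describe the UTF-16 code units of one code point, while the file encoding can still be UTF-8; the two are orthogonal. A minimal Python sketch (not science-parse code) showing both forms are equivalent:

```python
import json

# An escaped surrogate pair in JSON text...
text = '{"face": "\\uD83D\\uDE02"}'

# ...decodes to the single code point U+1F602, regardless of file encoding.
face = json.loads(text)["face"]
assert ord(face) == 0x1F602

# Serializers can emit either form; both are valid JSON in a UTF-8 file.
escaped = json.dumps({"face": face})                      # \ud83d\ude02
raw = json.dumps({"face": face}, ensure_ascii=False)      # the emoji itself
assert json.loads(escaped) == json.loads(raw)
```

So seeing `\uxxxx\uxxxx` escapes in the output does not mean the file is UTF-16; it is a choice of the JSON serializer, not of the file encoding.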

build failing

I am getting the following error when running sbt cli/assembly (platform: CentOS):

[error] (*:update) sbt.ResolveException: unresolved dependency: org.allenai.plugins#allenai-sbt-plugins;1.5.2: not found
[error] unresolved dependency: com.eed3si9n#sbt-assembly;0.14.3: not found
[error] unresolved dependency: com.jsuereth#sbt-pgp;1.0.1: not found

Connection error

Hi

I'm wondering if the server is up, because I'm getting a connection error while trying to get it working.

Command:
curl -v -H "Content-type: application/pdf" --data-binary 45601881.pdf "http://scienceparse.allenai.org/v1"

Error:
curl: (7) Failed to connect to scienceparse.allenai.org port 80: Connection refused

Capturing Footers/Metadata in Section Text

After some brief experimentation, it seems science-parse does a great job dealing with Unicode issues and with parsing chemical formulae for full-text extraction. However, there are multiple cases of extracted text including footers. For example, searching for

"NATURE COMMUNICATIONS | 5:3949 | DOI:"

in the parsed output (txt) yields multiple results stitched into the section content. The source of this text can easily be seen in this pdf. This behavior appears in most PDFs I tried (seemingly independent of publisher/date), but it is quite difficult to create rules that remove these inclusions post-parsing. Is this a known issue/limitation of science-parse, and if so, is there a workaround to intelligently ignore footers for different publishers/paper types?

I can upload more PDFs/parsed examples if that would be helpful. Thanks!

Edit: Reuploaded txt file with improved readability/formatting.

Pdfbox throws an exception for a paper

For this paper (https://homes.cs.washington.edu/~mausam/papers/aimag10.pdf), Pdfbox throws an exception:

java.lang.NullPointerException
    at org.allenai.scienceparse.pdfapi.PDFExtractor$PDFCaptureTextStripper.endPage(PDFExtractor.java:132)
    at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:460)
    at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:383)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:342)
    at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:275)
    at org.allenai.scienceparse.pdfapi.PDFExtractor.extractFromInputStream(PDFExtractor.java:203)

Why does it keep downloading?

I'm confused about this when running the following command to parse a PDF:

java -Xmx6g -jar science-parse-cli-assembly-3.0.1.jar my_pdf.pdf -f parsed.json

It seems some packages are still missing, so it must download them first?

production model-v9 not downloading

I am not able to download the production model-v9 when executing the jar file with a PDF. This is the error I am getting:

WARN org.allenai.datastore.Datastore - com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to public.store.dev.allenai.org.s3.amazonaws.com:443 [public.store.dev.allenai.org.s3.amazonaws.com/52.218.144.30] failed: connect timed out while downloading org.allenai.scienceparse/productionModel-v9.dat

mvn dependency, missing dependencies

Using this dependency from the README.md file:

  <dependency>
       <groupId>org.allenai</groupId>
       <artifactId>science-parse_2.11</artifactId>
       <version>2.0.1</version>  
  </dependency>

This will result in a build error. I think the resources defined in this project are not uploaded to the Maven repository.

[WARNING] The POM for org.allenai.common:common-core_2.11:jar:1.4.9 is missing, no dependency information available
[WARNING] The POM for org.allenai:ml:jar:0.16 is missing, no dependency information available
[WARNING] The POM for org.allenai.datastore:datastore_2.11:jar:1.0.9 is missing, no dependency information available
[WARNING] The POM for com.medallia.word2vec:word2vecjava_2.11:jar:1.0-ALLENAI-4 is missing, no dependency information available
[WARNING] The POM for org.allenai:pdffigures2_2.11:jar:0.0.11 is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.477 s
[INFO] Finished at: 2018-09-07T12:14:39+02:00
[INFO] Final Memory: 18M/303M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project loarindexer: Could not resolve dependencies for project dk.kb.loar.loarindexer:loarindexer:war:1.0-SNAPSHOT: The following artifacts could not be resolved: org.allenai.common:common-core_2.11:jar:1.4.9, org.allenai:ml:jar:0.16, org.allenai.datastore:datastore_2.11:jar:1.0.9, com.medallia.word2vec:word2vecjava_2.11:jar:1.0-ALLENAI-4, org.allenai:pdffigures2_2.11:jar:0.0.11:....
[ERROR]
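One likely cause (an assumption, not confirmed by the project): these artifacts were published to the AllenAI Bintray repository rather than Maven Central, so a build that only resolves against Central cannot find the transitive dependencies. Adding that repository, as the pom.xml shown further down this page does, may help:

```xml
<repositories>
  <repository>
    <id>AllenAI</id>
    <url>https://dl.bintray.com/allenai/maven/</url>
  </repository>
</repositories>
```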

Non-working links

You have a few non-working links here.

e.g. in the sentence: "In JSON format, the output looks like this (or like this, if you want sections). The easiest way to get started is to use the output from this server."
'output looks like this' gives ERR_EMPTY_RESPONSE
'this, if you want sections' gives ERR_EMPTY_RESPONSE

on https://github.com/allenai/science-parse/blob/master/server/README.md
'http://scienceparse.allenai.org/v1/498bb0efad6ec15dd09d941fb309aa18d6df9f5f' gives ERR_EMPTY_RESPONSE

It looks like http://scienceparse.allenai.org is generally not available.

Regards

Colin Goldberg

failing to import SP maven dependency

hi dirkgr,
I'm trying to use SP in a Maven project. When I add the Maven dependency, the IDE warns me about a missing artifact. It seems this dependency is not in the Maven repository.
Could you tell me how to fix this problem?
Thank you.

Error While running sbt cli/assembly

science-parse/cli/build.sbt:5: error: not found: value assembly
mainClass in assembly := Some("org.allenai.scienceparse.RunSP")
^
[error] Type error in expression

Release 2.0.3 tries to fetch a paper from S3 instead of parsing the PDF

According to --help, this should parse the provided PDF:

$ java -Xmx6g -jar science-parse-cli-assembly-2.0.3.jar 1910.05346.pdf 

However, SP tries to download a publication from S3. I tried renaming the file to _1910.05346.pdf and foo.pdf, and providing an absolute path, in case a regular expression decides where to find the PDF. None of these worked.

Error message:

11:23:30.342 [main] DEBUG com.amazonaws.AmazonWebServiceClient - Internal logging successfully configured to commons logger: true
11:23:30.384 [main] DEBUG com.amazonaws.metrics.AwsSdkMetrics - Admin mbean registered under com.amazonaws.management:type=AwsSdkMetrics
11:23:30.433 [main] DEBUG c.a.internal.config.InternalConfig - Configuration override awssdk_config_override.json not found.
Exception in thread "main" org.allenai.datastore.Datastore$AccessDeniedException: You don't have access to the public datastore. Check https://github.com/allenai/wiki/wiki/Getting-Started#setting-up-your-developer-environment for information about configuring your system to get access.
    at org.allenai.datastore.Datastore.org$allenai$datastore$Datastore$$accessDeniedWrapper(Datastore.scala:205)
    at org.allenai.datastore.Datastore.org$allenai$datastore$Datastore$$getS3Object(Datastore.scala:214)
    at org.allenai.datastore.Datastore$$anonfun$path$1.apply$mcV$sp(Datastore.scala:389)
    at org.allenai.datastore.Datastore$$anonfun$path$1.apply(Datastore.scala:387)
    at org.allenai.datastore.Datastore$$anonfun$path$1.apply(Datastore.scala:387)
    at org.allenai.datastore.Datastore.withRetries(Datastore.scala:56)
    at org.allenai.datastore.Datastore.path(Datastore.scala:386)
    at org.allenai.datastore.Datastore.filePath(Datastore.scala:343)
    at org.allenai.scienceparse.Parser.getDefaultProductionModel(Parser.java:99)
    at org.allenai.scienceparse.RunSP$$anonfun$main$1$$anonfun$11.apply(RunSP.scala:84)
    at org.allenai.scienceparse.RunSP$$anonfun$main$1$$anonfun$11.apply(RunSP.scala:84)
    at scala.Option.getOrElse(Option.scala:121)
    at org.allenai.scienceparse.RunSP$$anonfun$main$1.apply(RunSP.scala:84)
    at org.allenai.scienceparse.RunSP$$anonfun$main$1.apply(RunSP.scala:83)
    at scala.Option.foreach(Option.scala:257)
    at org.allenai.scienceparse.RunSP$.main(RunSP.scala:83)
    at org.allenai.scienceparse.RunSP.main(RunSP.scala)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The AWS Access Key Id you provided does not exist in our records. (Service: Amazon S3; Status Code: 403; Error Code: InvalidAccessKeyId; Request ID: F43FC3F6585745DC; S3 Extended Request ID: 0YcyReyCUU0uhojQxcVukuje4m+8qqvUwfBfRrEkPvqzD6Ir6buLVehyOHC3MXMp9AOwoEB5+Ck=), S3 Extended Request ID: 0YcyReyCUU0uhojQxcVukuje4m+8qqvUwfBfRrEkPvqzD6Ir6buLVehyOHC3MXMp9AOwoEB5+Ck=
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1638)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1303)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1055)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4247)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4194)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1398)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1259)
    at org.allenai.datastore.Datastore$$anonfun$org$allenai$datastore$Datastore$$getS3Object$1.apply(Datastore.scala:215)
    at org.allenai.datastore.Datastore$$anonfun$org$allenai$datastore$Datastore$$getS3Object$1.apply(Datastore.scala:215)
    at org.allenai.datastore.Datastore.org$allenai$datastore$Datastore$$accessDeniedWrapper(Datastore.scala:202)
    ... 16 more
11:23:33.407 [Thread-1] INFO  org.allenai.datastore.TempCleanup$ - Cleaning up file at /home/herbert/.ai2/datastore/tmp/ai2-datastore-org.allenai.scienceparse%productionModel-v9.dat52491073689376676.tmp

Access Docker

Can anyone provide documentation on calling the API via Docker?
My docker.bintray.io/s2/scienceparse:1.3.2 container is running, but the README doesn't have enough documentation.

Thanks

memory usage of docker container is increasing ~10GB

Hi,
I am using Science Parse (Docker) for my work, and I found that memory consumption increases every time it parses some PDFs.

Is the Docker container saving any data when a parsing (PDF -> JSON) request is made?

Thanks

Font size pt quality

Running PDFExtractor on test/resources/P14-1059.pdf:

The font size in pt returned by PDFExtractor is the same for both title and authors, but in the PDF source the title appears in a noticeably larger font. It seems likely this is a PDFBox limitation, but can we work around it?

Specifics:

Page 0, lines 2 and 3 (title lines) have font size 14.
Page 0, line 4 (authors) has font size 14 for the author names, with size 8 for the single token "and".

The font name (ZPNUSQ+NimbusRomNo9L-Medi) is the same for all lines except the "and".

error at gazetteer-v5.json-84292700.gazetteerCache.bin

Exception in thread "main" java.nio.file.FileSystemException: C:\Users\merha\AppData\Local\Temp\gazetteer-v5.json-84292700.gazetteerCache.bin: The process cannot access the file because it is being used by another process.

Get this error on windows 10 with the newest sbt and java sdk 8.

Incorrect Page Numbers

I ran into another issue when trying to build a visualization of the CRF output.

Page numbers for the headers can be incorrect. The error is in PDFToCRFInput, in the getSequence method: the pg counter is not correctly incremented.

Website down

I cannot seem to check out the online service. When I try to access the website I get the following error:
504 Gateway Time-out
nginx/1.4.6 (Ubuntu)

When I try to access it through my terminal, I get the following errors:

curl -v -H "Content-type: application/pdf" --data-binary 126579076.pdf "http://scienceparse.allenai.org/v1"

* Trying 54.200.64.26:80...
* TCP_NODELAY set
* Connected to scienceparse.allenai.org (54.200.64.26) port 80 (#0)
> POST /v1 HTTP/1.1
> Host: scienceparse.allenai.org
> User-Agent: curl/7.65.2
> Accept: */*
> Content-type: application/pdf
> Content-Length: 13
>
* upload completely sent off: 13 out of 13 bytes
* Mark bundle as not supporting multiuse
< HTTP/1.1 504 Gateway Time-out
< Server: nginx/1.4.6 (Ubuntu)
< Date: Wed, 23 Oct 2019 16:08:57 GMT
< Content-Type: text/html
< Content-Length: 191
< Connection: close
<
<title>504 Gateway Time-out</title>
504 Gateway Time-out
nginx/1.4.6 (Ubuntu)
* Closing connection 0

Exception in thread "ModelLoaderThread" java.lang.OutOfMemoryError: Java heap space

I'm trying to run the parser as a library in Java. However, every time I try to get a parser instance with final Parser parser = Parser.getInstance(); I get an OutOfMemoryError, even after increasing the heap space to 8GB. Are the models bigger than that? What is the code doing when loading tokens like backgroundBow: definierende = 2.0? Do you have any other suggestions apart from increasing heap space? I'm not completely sure that will solve this issue. Part of the error I get is:

10:58:53.047 [ModelLoaderThread] DEBUG o.a.scienceparse.ParserLMFeatures - backgroundBow: definierende = 2.0
10:58:58.121 [ModelLoaderThread] DEBUG o.a.scienceparse.ParserLMFeatures - backgroundBow: definierenden = 2.0
Exception in thread "ModelLoaderThread" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664)
at java.lang.String.<init>(String.java:207)
at java.lang.String.substring(String.java:1969)
at ch.qos.logback.classic.pattern.TargetLengthBasedClassNameAbbreviator.abbreviate(TargetLengthBasedClassNameAbbreviator.java:58)

Exception in thread "main" java.lang.NoSuchMethodError: org.allenai.common.Enum: method <init>()V not found

I am wrapping this in a Maven project, and I've encountered this error after model generation. Enclosed is a copy of the main class, the pom, and the error I received.

Error

09:12:37.194 [main] DEBUG o.a.p.SectionTitleExtractor$ - Number section titles detected, pruning sections titles that were not numbered
Exception in thread "main" java.lang.NoSuchMethodError: org.allenai.common.Enum: method <init>()V not found
	at org.allenai.pdffigures2.FigureType.<init>(Figure.scala:6)
	at org.allenai.pdffigures2.FigureType$Figure$.<init>(Figure.scala:9)
	at org.allenai.pdffigures2.FigureType$Figure$.<clinit>(Figure.scala)
	at org.allenai.pdffigures2.CaptionDetector$.findCaptions(CaptionDetector.scala:130)
	at org.allenai.pdffigures2.FigureExtractor.parseDocument(FigureExtractor.scala:121)
	at org.allenai.pdffigures2.FigureExtractor.getFiguresWithText(FigureExtractor.scala:52)
	at org.allenai.scienceparse.Parser.doParse(Parser.java:1147)
	at org.allenai.scienceparse.Parser.doParse(Parser.java:976)
	at app.Main.main(Main.java:27)

Main

package app;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.InputStream;

import org.allenai.scienceparse.Parser;
import org.allenai.scienceparse.ExtractedMetadata;

public class Main {

	public static void main(String[] args) throws Exception {
		final String sourceFolder = "U:\\Workspaces\\techknowledgist\\example\\data\\2018\\";
		File initialFile = new File(sourceFolder + "GP-004.pdf");
	        InputStream inputStream = new FileInputStream(initialFile);
		
		String modelFile = "X:\\tools\\science-parse-wrapper\\default_datastore\\public\\org.allenai.scienceparse\\productionModel-v9.dat";
		String gazetteerFile = "X:\\tools\\science-parse-wrapper\\default_datastore\\public\\org.allenai.scienceparse\\gazetteer-v5.json";
		String bibModelFile = "X:\\tools\\science-parse-wrapper\\default_datastore\\public\\org.allenai.scienceparse\\productionBibModel-v7.dat";
		
		final Parser parser = new Parser(modelFile, gazetteerFile, bibModelFile);

		// Parse without timeout
		final ExtractedMetadata em = parser.doParse(inputStream);
		
		String result = em.toString();
	}
}

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<groupId>edu.vanderbilt.mc.cphi.nlp</groupId>
	<artifactId>science-parse-wrapper</artifactId>
	<version>0.0.1-SNAPSHOT</version>

	<properties>
		<maven.compiler.source>1.8</maven.compiler.source>
		<maven.compiler.target>1.8</maven.compiler.target>
		<encoding>UTF-8</encoding>
		<scala.tools.version>2.10</scala.tools.version>
		<scala.version>2.11.7</scala.version>
	</properties>

	<repositories>
		<repository>
			<id>Central Repo</id>
			<url>https://repo1.maven.org/maven2/</url>
		</repository>
		<repository>
			<id>AllenAI</id>
			<url>https://dl.bintray.com/allenai/maven/</url>
		</repository>
		<repository>
			<id>cloudera-repo-releases</id>
			<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
		</repository>
		<repository>
			<id>spring plugins</id>
			<url>https://repo.spring.io/plugins-release/</url>
		</repository>
		<repository>
			<id>Sonatype</id>
			<url>https://oss.sonatype.org/content/repositories/releases/</url>
		</repository>
	</repositories>

	<dependencies>

		<dependency>
			<groupId>org.allenai.common</groupId>
			<artifactId>common-core_2.11</artifactId>
			<version>2015.04.01-0</version>
		</dependency>

		<dependency>
			<groupId>org.allenai</groupId>
			<artifactId>pdffigures2_2.11</artifactId>
			<version>0.0.11</version>
		</dependency>

		<dependency>
			<groupId>com.fasterxml.jackson.core</groupId>
			<artifactId>jackson-core</artifactId>
			<version>2.7.9</version>
		</dependency>

		<dependency>
			<groupId>com.fasterxml.jackson.core</groupId>
			<artifactId>jackson-databind</artifactId>
			<version>2.7.9</version>
		</dependency>

		<dependency>
			<groupId>com.fasterxml.jackson.module</groupId>
			<artifactId>jackson-module-scala_2.12</artifactId>
			<version>2.7.9</version>
		</dependency>

		<dependency>
			<groupId>org.apache.pdfbox</groupId>
			<artifactId>pdfbox</artifactId>
			<version>2.0.9</version>
			<scope>compile</scope>
			<exclusions>
				<exclusion>
					<groupId>commons-logging</groupId>
					<artifactId>commons-logging</artifactId>
				</exclusion>
			</exclusions>
		</dependency>

		<dependency>
			<groupId>org.apache.pdfbox</groupId>
			<artifactId>fontbox</artifactId>
			<version>2.0.9</version>
			<scope>compile</scope>
			<exclusions>
				<exclusion>
					<groupId>commons-logging</groupId>
					<artifactId>commons-logging</artifactId>
				</exclusion>
			</exclusions>
		</dependency>

		<dependency>
			<groupId>org.slf4j</groupId>
			<artifactId>jcl-over-slf4j</artifactId>
			<version>1.7.7</version>
			<scope>compile</scope>
		</dependency>
		<dependency>
			<groupId>org.allenai</groupId>
			<artifactId>ml</artifactId>
			<version>0.16</version>
			<scope>compile</scope>
			<exclusions>
				<exclusion>
					<groupId>args4j</groupId>
					<artifactId>args4j</artifactId>
				</exclusion>
				<exclusion>
					<groupId>org.slf4j</groupId>
					<artifactId>slf4j-simple</artifactId>
				</exclusion>
			</exclusions>

		</dependency>

		<dependency>
			<groupId>org.projectlombok</groupId>
			<artifactId>lombok</artifactId>
			<version>1.16.20</version>
			<scope>compile</scope>
		</dependency>
		<dependency>
			<groupId>com.goldmansachs</groupId>
			<artifactId>gs-collections</artifactId>
			<version>6.1.0</version>
			<scope>compile</scope>
		</dependency>

		<dependency>
			<groupId>org.scalatest</groupId>
			<artifactId>scalatest_2.11</artifactId>
			<version>2.2.1</version>
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.testng</groupId>
			<artifactId>testng</artifactId>
			<version>6.8.1</version>
			<scope>compile</scope>
		</dependency>

		<dependency>
			<groupId>org.allenai.common</groupId>
			<artifactId>common-testkit_2.12</artifactId>
			<version>2.0.0</version>
			<scope>test</scope>
		</dependency>

		<dependency>
			<groupId>org.allenai.datastore</groupId>
			<artifactId>datastore_2.11</artifactId>
			<version>1.0.9</version>
		</dependency>

		<dependency>
			<groupId>org.bouncycastle</groupId>
			<artifactId>bcprov-jdk15on</artifactId>
			<version>1.54</version>
			<scope>compile</scope>
		</dependency>
		
		<dependency>
			<groupId>org.bouncycastle</groupId>
			<artifactId>bcmail-jdk15on</artifactId>
			<version>1.54</version>
			<scope>compile</scope>
		</dependency>
		
		<dependency>
			<groupId>org.bouncycastle</groupId>
			<artifactId>bcpkix-jdk15on</artifactId>
			<version>1.54</version>
			<scope>compile</scope>
		</dependency>
		
		<dependency>
			<groupId>org.jsoup</groupId>
			<artifactId>jsoup</artifactId>
			<version>1.8.1</version>
			<scope>compile</scope>
		</dependency>
		
		<dependency>
			<groupId>org.apache.commons</groupId>
			<artifactId>commons-lang3</artifactId>
			<version>3.4</version>
			<scope>compile</scope>
		</dependency>
		
		<dependency>
			<groupId>commons-io</groupId>
			<artifactId>commons-io</artifactId>
			<version>2.4</version>
			<scope>compile</scope>
		</dependency>

		<dependency>
			<groupId>com.amazonaws</groupId>
			<artifactId>aws-java-sdk-s3</artifactId>
			<version>1.11.213</version>
			<scope>compile</scope>
			<exclusions>
				<exclusion>
					<groupId>commons-logging</groupId>
					<artifactId>commons-logging</artifactId>
				</exclusion>
			</exclusions>
		</dependency>

		<dependency>
			<groupId>com.medallia.word2vec</groupId>
			<artifactId>word2vecjava_2.11</artifactId>
			<version>1.0-ALLENAI-4</version>
			<scope>compile</scope>
			<exclusions>
				<exclusion>
					<groupId>log4j</groupId>
					<artifactId>log4j</artifactId>
				</exclusion>
				<exclusion>
					<groupId>commons-logging</groupId>
					<artifactId>commons-logging</artifactId>
				</exclusion>
			</exclusions>
		</dependency>

		<dependency>
			<groupId>com.google.guava</groupId>
			<artifactId>guava</artifactId>
			<version>18.0</version>
			<scope>compile</scope>
		</dependency>

		<dependency>
			<groupId>org.scala-lang.modules</groupId>
			<artifactId>scala-java8-compat_2.12</artifactId>
			<version>0.8.0</version>
		</dependency>

		<dependency>
			<groupId>org.scala-lang.modules</groupId>
			<artifactId>scala-xml_2.12</artifactId>
			<version>1.0.6</version>
		</dependency>

		<dependency>
			<groupId>org.scalaj</groupId>
			<artifactId>scalaj-http_2.12</artifactId>
			<version>2.3.0</version>
		</dependency>

		<dependency>
			<groupId>io.spray</groupId>
			<artifactId>spray-json_2.13</artifactId>
			<version>1.3.5</version>
		</dependency>

		<dependency>
			<groupId>de.ruedigermoeller</groupId>
			<artifactId>fst</artifactId>
			<version>2.47</version>
			<scope>compile</scope>
		</dependency>

		<dependency>
			<groupId>org.apache.opennlp</groupId>
			<artifactId>opennlp-tools</artifactId>
			<version>1.7.2</version>
			<scope>compile</scope>
		</dependency>

		<dependency>
			<groupId>org.allenai.scienceparse</groupId>
			<artifactId>science-parse-core_2.11</artifactId>
			<version>3.0.0</version>
		</dependency>
				
		<dependency>
			<groupId>org.allenai.common</groupId>
			<artifactId>common-core_2.12</artifactId>
			<version>2.0.0</version>
		</dependency>
	</dependencies>
</project>

Smaller docker image and memory footprint?

Hi,

Thank you for providing Science Parse and a docker image.
I am currently evaluating it. I was wondering whether there are any plans to reduce the size of the docker image and the memory footprint? Is all of that needed for the extraction?

Thank you

Timeout issue

I have successfully created the jar file for Science Parse, but when I run the command java -Xmx6g -jar jarfile.jar 18bc3569da037a6cb81fb081e2856b77b321c139 (or the same with any of my PDF papers),

this error comes:
C:\Users\HARSH dangerous\science-parse\cli\target\scala-2.11>java -Xmx6g -jar jarfile.jar triclustering.pdf
09:08:41.621 [main] DEBUG com.amazonaws.AmazonWebServiceClient - Internal logging successfully configured to commons logger: true
09:08:42.309 [main] DEBUG com.amazonaws.metrics.AwsSdkMetrics - Admin mbean registered under com.amazonaws.management:type=AwsSdkMetrics
09:08:42.512 [main] DEBUG c.a.internal.config.InternalConfig - Configuration override awssdk_config_override.json not found.
09:08:45.543 [ModelLoaderThread] INFO org.allenai.scienceparse.Parser - Loading model from C:\Users\HARSH dangerous.ai2\datastore\public\org.allenai.scienceparse\productionModel-v9.dat
09:08:45.543 [ForkJoinPool-1-worker-5] INFO org.allenai.scienceparse.Parser - Loading gazetteer from C:\Users\HARSH dangerous.ai2\datastore\public\org.allenai.scienceparse\gazetteer-v5.json
09:08:45.575 [ForkJoinPool-1-worker-5] INFO org.allenai.scienceparse.Parser - Loading bib model from C:\Users\HARSH dangerous.ai2\datastore\public\org.allenai.scienceparse\productionBibModel-v7.dat
09:08:45.684 [ForkJoinPool-1-worker-5] INFO org.allenai.scienceparse.Parser - Creating gazetteer cache at C:\Users\HARSHD~1\AppData\Local\Temp\gazetteer-v5.json-9ee1217b.gazetteerCache.bin
09:09:33.355 [ForkJoinPool-1-worker-5] INFO o.a.scienceparse.ParserGroundTruth - Read 1609659 papers.
09:19:48.757 [ForkJoinPool-1-worker-5] INFO o.a.scienceparse.ExtractReferences - could not load kermit gazetter
Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [15 minutes]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:116)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:116)
        at org.allenai.scienceparse.RunSP$$anonfun$main$1.apply(RunSP.scala:186)
        at org.allenai.scienceparse.RunSP$$anonfun$main$1.apply(RunSP.scala:78)
        at scala.Option.foreach(Option.scala:257)
        at org.allenai.scienceparse.RunSP$.main(RunSP.scala:78)
        at org.allenai.scienceparse.RunSP.main(RunSP.scala)

Please help me resolve this issue as soon as possible.

Prefer getFontSizePt to getFontSize

Randomly jumping in since I saw the emails and took a look through the code:

TextPosition's getFontSize method does not always correspond well to the size of the text as displayed on screen. As I understand it, this is because it does not account for some other font properties that can change the displayed size of the text (for example, a PDFont can include an arbitrary matrix transform that scales the text up). Happily, TextPosition also has a getFontSizePt method that returns a font size accounting for these transformations. In my experience it does a much better (though still not perfect) job of producing reasonable font sizes. Switching all getFontSize calls to getFontSizePt might make a big improvement to title extraction.
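A minimal sketch of the arithmetic behind the point above (this is not PDFBox code; the class and method names are hypothetical): the nominal font size from the content stream only matches the rendered size when no matrix scaling is applied, which is roughly the gap a point-based accessor closes.

```java
// Hypothetical illustration: effective rendered size vs. nominal font size.
public class EffectiveFontSize {
    // nominalSize: the font size declared in the PDF content stream
    // matrixScaleY: vertical scale applied by the text/transformation matrices
    static float effectiveSizePt(float nominalSize, float matrixScaleY) {
        return nominalSize * matrixScaleY;
    }

    public static void main(String[] args) {
        // A "size 1" font scaled 24x renders the same as a size-24 font,
        // so a raw size accessor would report 1 where 24 is the useful value.
        System.out.println(effectiveSizePt(1f, 24f));  // prints 24.0
        System.out.println(effectiveSizePt(12f, 1f));  // prints 12.0
    }
}
```

A title set as a size-1 font under a 24x transform would look huge on screen but tiny to any heuristic keyed on the nominal size, which is why the point-based accessor helps title extraction.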

[main] INFO org.allenai.datastore.Datastore - Starting to wait on C:\Users\13838157862\.ai2\datastore\public\org.allenai.scienceparse\productionModel-v9.dat.lock


Hi,
I have successfully built the jar file for Science Parse, but when I run the command `java -Xmx6g -jar jarfile.jar 18bc3569da037a6cb81fb081e2856b77b321c139`, or the same command with any of my own PDF papers, I get this output:
PS D:\Tools\science-parse\cli\target\scala-2.11> java -Xmx6g -jar jarfile.jar 18bc3569da037a6cb81fb081e2856b77b321c139
20:18:33.586 [main] DEBUG com.amazonaws.AmazonWebServiceClient - Internal logging successfully configured to commons logger: true
20:18:33.652 [main] DEBUG com.amazonaws.metrics.AwsSdkMetrics - Admin mbean registered under com.amazonaws.management:type=AwsSdkMetrics
20:18:33.723 [main] DEBUG c.a.internal.config.InternalConfig - Configuration override awssdk_config_override.json not found.
20:18:35.231 [main] INFO org.allenai.datastore.Datastore - Starting to wait on C:\Users\13838157862.ai2\datastore\public\org.allenai.scienceparse\productionModel-v9.dat.lock

It stays stuck at this point and never makes progress.

Windows Docker Issue

Hi,

@dirkgr
I am running Docker on Windows (Docker version 17.03.1 CE).
An error occurs when I run the Science Parse image.

(screenshot attached: docker_error)

My system has 64 GB of RAM and 8 physical processors. Why is the image getting killed?

Thanks

Incorrect Text Coordinates

I randomly noticed this while browsing the code.

PDFToken stores slightly incorrect 'y' coordinates. The reason is that TextPosition's getY method actually returns the lower/maximum y coordinate, not the upper/minimum y coordinate as you might expect (I am not sure whether this is intended behavior in PDFBox or not). PDFExtractor assumes the reverse, so the coordinates it builds are slightly offset from where they should be. When writing TextExtractor for pdffigures I had to correct for this effect manually.

An easy way to check this is to parse the attached PDF and see whether the 'y' coordinate of the character is close to 0.
m.pdf
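The correction described above can be sketched as follows (a hypothetical helper, not the science-parse API): if the extractor reports the lower edge of a glyph in screen coordinates (where y grows downward, so the lower edge is the numerically larger value), the upper edge is recovered by subtracting the glyph height.

```java
// Hypothetical illustration of the y-coordinate correction.
public class YCoordFix {
    // lowerY: bottom edge reported by the extractor (screen coords, y grows down)
    // glyphHeight: height of the character's bounding box
    static float upperY(float lowerY, float glyphHeight) {
        return lowerY - glyphHeight;
    }

    public static void main(String[] args) {
        // A 10pt-tall character whose bottom edge is at y=710
        // has its top edge at y=700.
        System.out.println(upperY(710f, 10f));  // prints 700.0
    }
}
```

Without this correction, every token's box is shifted down by one glyph height, which is exactly the kind of small offset the test PDF above would expose.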
