lfoppiano / grobid-quantities

GROBID extension for identifying and normalizing physical quantities.

Home Page: https://grobid-quantities.readthedocs.io

License: Apache License 2.0

Java 37.27% JavaScript 46.76% HTML 1.75% CSS 3.96% XSLT 0.85% Python 4.91% Dockerfile 0.29% Kotlin 4.21%
physical-quantities scientific-articles crf deep-learning measurements research science tdm

grobid-quantities's Issues

Unit inside the <num> tag

How to proceed with 92°.5?
Is it ok to embed a <measure> tag in a <num> tag?

<measure type="value"><num>92<measure type="ANGLE" unit="°">°</measure>.5</num></measure>

?

Annotation of Time (as in 20:00)

Sometimes times are mentioned without dates, example:

Mars will miss the comet's orbit (...) at 20:10 UTC

  1. Should we use the <time> element ? -> <measure type="value"><time when="20:10Z">20:10</time> UTC</measure>

  2. Related question, we could distinguish two types of times

    • times not linked to a date, for example a sentence like "To sleep well, relax between 22:00 and 23:00"
    • times linked to an implicit date, for example in the 1408.2792.pdf paper, we know the date is October 19, 2014 because the whole paper is about the encounter of a comet with Mars on this day
      -> in that case, should we annotate 20:10 with a <date> element, like this: <measure type="value"><date when="2014-10-19T20:10Z">20:10</date> UTC</measure> ?

Training generation - offset adjustment

When we generate the training data, for lists we have the following exception:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1967)
    at org.grobid.core.engines.QuantityParser.trainingExtraction(QuantityParser.java:718)
    at org.grobid.core.engines.QuantityParser.createTrainingPDF(QuantityParser.java:452)
    at org.grobid.core.engines.QuantityParser.createTraining(QuantityParser.java:255)
    at org.grobid.core.engines.QuantityParser.createTrainingBatch(QuantityParser.java:502)
    at org.grobid.core.main.batch.QuantityMain.main(QuantityMain.java:176)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.simontuffs.onejar.Boot.run(Boot.java:340)
    at com.simontuffs.onejar.Boot.main(Boot.java:166)

This probably happens because the adjustment of the endOffset decreases it too much. To be verified.

Expression of a resolution

I have doubts about this example:

The high spatial resolution of the images (40 mas per pixel, corresponding to ≥ 100 km per pixel) resolve the inner coma, and allow investigations of the dust grain expansion velocities.

I would tend to annotate the ANGLE unit and the LENGTH unit independently from the "per pixel" part (because mas/pixel or km/pixel doesn't seem to be a known unit, but I'm not sure at all about that):

- <measure type="value"><num>40</num> <measure type="ANGLE" unit="mas">mas</measure>
</measure> per pixel

- ≥ <measure type="interval"><num atLeast="100">100</num> <measure type="LENGTH" 
unit="km">km</measure></measure> per pixel

If we were to annotate mas/pixel and km/pixel as plain units, what would be their type? (create a RESOLUTION type? UNKNOWN ? DENSITY ?)

What do you think?

Time expressions that are possessive

I reviewed the annotation document for grobid and couldn't find a solution for how to interpret and annotate instances in which time expressions are possessive. How should a phrase like this one be annotated (if at all) in regards to time?: "this year's minimum extent is lower than last year's." (This example is paraphrased from a cryology blog.)

Some PDF containing special characters are breaking the parser/normaliser

Annotating this PDF:
hal-00924047.pdf

There is the following exception:

javax.servlet.ServletException: systems.uom.ucum.internal.format.TokenMgrError: Lexical error at line 1, column 1.  Encountered: "\u00b5" (181), after : ""
	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:420)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
	at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:699)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:837)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:584)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.eclipse.jetty.server.Server.handle(Server.java:534)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
	at java.lang.Thread.run(Thread.java:745)
Caused by: systems.uom.ucum.internal.format.TokenMgrError: Lexical error at line 1, column 1.  Encountered: "\u00b5" (181), after : ""
	at systems.uom.ucum.internal.format.UCUMTokenManager.getNextToken(UCUMTokenManager.java:412)
	at systems.uom.ucum.internal.format.UCUMFormatParser.jj_scan_token(UCUMFormatParser.java:464)
	at systems.uom.ucum.internal.format.UCUMFormatParser.jj_3R_3(UCUMFormatParser.java:268)
	at systems.uom.ucum.internal.format.UCUMFormatParser.jj_3R_4(UCUMFormatParser.java:248)
	at systems.uom.ucum.internal.format.UCUMFormatParser.jj_3R_2(UCUMFormatParser.java:274)
	at systems.uom.ucum.internal.format.UCUMFormatParser.jj_3_1(UCUMFormatParser.java:240)
	at systems.uom.ucum.internal.format.UCUMFormatParser.jj_2_1(UCUMFormatParser.java:219)
	at systems.uom.ucum.internal.format.UCUMFormatParser.Component(UCUMFormatParser.java:112)
	at systems.uom.ucum.internal.format.UCUMFormatParser.Term(UCUMFormatParser.java:76)
	at systems.uom.ucum.internal.format.UCUMFormatParser.parseUnit(UCUMFormatParser.java:66)
	at systems.uom.ucum.format.UCUMFormat$Parsing.parse(UCUMFormat.java:513)
	at systems.uom.ucum.format.UCUMFormat$Parsing.parse(UCUMFormat.java:532)
	at org.grobid.core.data.normalization.QuantityNormalizer.normalizeNonSIQuantities(QuantityNormalizer.java:76)
	at org.grobid.core.data.normalization.QuantityNormalizer.normalizeQuantity(QuantityNormalizer.java:66)
	at org.grobid.core.engines.QuantityParser.normalizeQuantity(QuantityParser.java:346)
	at org.grobid.core.engines.QuantityParser.normalizeMeasurements(QuantityParser.java:300)
	at org.grobid.core.engines.QuantityParser.processLayoutTokenSequence(QuantityParser.java:269)
	at org.grobid.core.engines.QuantityParser.processDocumentPart(QuantityParser.java:223)
	at org.grobid.core.engines.QuantityParser.extractQuantitiesPDF(QuantityParser.java:193)
	at org.grobid.service.QuantityProcessFile.processPDFAnnotation(QuantityProcessFile.java:74)
	at org.grobid.service.QuantityRestService.processPDFAnnotation(QuantityRestService.java:82)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
	at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
	at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
	at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
	at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
	at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
	at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
	at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
	at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
	... 28 more

There are other symbols that are recognised and parsed.
The error should at least not crash the whole process.
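
One way to keep a single bad unit from aborting the whole request is sketched below. It rests on two assumptions that are not confirmed by the issue: that the micro sign can be rewritten to `u`, which is the case-sensitive UCUM prefix for micro, before the string reaches the UCUM parser, and that `TokenMgrError` extends `java.lang.Error`, as JavaCC-generated token managers usually do, so that a plain `catch (Exception e)` would not intercept it. `SafeUnitParsing`, `toUcumAscii` and `parseOrEmpty` are hypothetical names, and the parser is passed in as a function so the sketch stays independent of the UCUM library.

```java
import java.util.Optional;
import java.util.function.Function;

final class SafeUnitParsing {

    /** Rewrites the micro sign (U+00B5) and Greek small mu (U+03BC) to the UCUM prefix "u". */
    static String toUcumAscii(String rawUnit) {
        return rawUnit.replace('\u00B5', 'u').replace('\u03BC', 'u');
    }

    /**
     * Applies the given parser to the cleaned unit string.
     * Any Exception or Error raised by the parser yields an empty result
     * instead of propagating up to the service layer.
     * Catching Error is deliberately blunt here; catching the specific
     * lexer error type would be the narrower choice.
     */
    static <T> Optional<T> parseOrEmpty(String rawUnit, Function<String, T> parser) {
        try {
            return Optional.ofNullable(parser.apply(toUcumAscii(rawUnit)));
        } catch (Exception | Error e) {
            return Optional.empty();
        }
    }
}
```

A caller could then skip normalization for measurements whose unit comes back empty and still return the rest of the annotations.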

Feedback and new test data

Hi @khundman and @chrismattmann, I'm currently working with @kermitt2 on the GROBID quantity model.

I was wondering if you had time to have a look at the demo. If you spot any bugs or have any remarks, feel free to open new issues. ;-)

On the other hand, we still have to introduce more training data. If you have some relevant documents we could use (perhaps not necessarily in the scientific domain), that would be great :) You can push them to the dataset or a subdirectory of it.

Cheers

Value expressed with alphabetic characters

For example "twenty kilos" - currently the recognition is very bad and there is no normalization into numerical values.

We should:

  • add a matching feature for this very limited vocabulary in the quantity model (it will generalize the examples in the training data),
  • add a dedicated normalization.
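
The dedicated normalization could look roughly like the following minimal sketch, which converts English number words into an integer. The class `NumberWords` and its vocabulary are illustrative assumptions, not the project's implementation, and only values below one million are covered.

```java
import java.util.Locale;
import java.util.OptionalInt;

final class NumberWords {
    private static final String[] SMALL = {
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
        "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
        "seventeen", "eighteen", "nineteen"};
    // Index i holds the word for i * 10; the first two slots are unused.
    private static final String[] TENS = {
        "", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};

    private static int indexOf(String[] words, String token) {
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(token)) return i;
        }
        return -1;
    }

    /** Returns the numeric value of a number-word phrase, or empty if a token is unknown. */
    static OptionalInt parse(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).trim().split("[\\s-]+");
        int total = 0;    // completed thousands
        int current = 0;  // value accumulated since the last "thousand"
        for (String token : tokens) {
            int small = indexOf(SMALL, token);
            int tens = indexOf(TENS, token);
            if (token.equals("and")) {
                continue;
            } else if (small >= 0) {
                current += small;
            } else if (tens >= 2) {
                current += tens * 10;
            } else if (token.equals("hundred")) {
                current = Math.max(current, 1) * 100;
            } else if (token.equals("thousand")) {
                total += Math.max(current, 1) * 1000;
                current = 0;
            } else {
                return OptionalInt.empty();
            }
        }
        return OptionalInt.of(total + current);
    }
}
```

With this sketch, `NumberWords.parse("twenty")` yields 20, while a token outside the vocabulary such as "kilos" yields an empty result, so the unit has to be split off before the value is normalized.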

output in TEI

Implement the TEI output. I am opening a separate task for it.

Support for date and time expression

Dates are time measurements, so they are in the scope of the tool. Currently the recognition of dates and time expressions is limited and there is no normalization for these expressions.

  • Dates could already be recognized well by the existing grobid-ner (a standard Named Entity Recognizer based on grobid). Dates can be reliably normalized with the existing GROBID date model. So we can simply integrate/use this existing stuff.
  • Time expressions require a specific normalization, in particular to recognize the time zone.
  • Expressions combining date and time must be correctly handled and normalized too.
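
As a rough illustration of the last two points, the sketch below uses the standard `java.time` API to turn a time with an explicit zone, optionally combined with a date, into an ISO 8601 string. `TimeNormalizer` and its method names are hypothetical, and the sketch assumes the time has already been isolated as `HH:mm` and the zone is an identifier that `ZoneId.of` accepts (for example "UTC").

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.OffsetTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

final class TimeNormalizer {

    /** Normalizes a time that is not linked to any date, e.g. "20:10" at a fixed offset. */
    static String timeOnly(String time, ZoneOffset offset) {
        return OffsetTime.of(LocalTime.parse(time), offset).toString();
    }

    /** Normalizes a time combined with a (possibly implicit) date and a zone identifier. */
    static String dateTime(LocalDate date, String time, String zoneId) {
        return ZonedDateTime.of(date, LocalTime.parse(time), ZoneId.of(zoneId))
                .toOffsetDateTime()
                .toString();
    }
}
```

For instance, `TimeNormalizer.timeOnly("20:10", ZoneOffset.UTC)` gives `20:10Z`, and `TimeNormalizer.dateTime(LocalDate.of(2014, 10, 19), "20:10", "UTC")` gives `2014-10-19T20:10Z`. Time scales that are not civil time zones would need separate handling.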

Multilingual support

Currently we support only English (although we also have some pieces of training data in German and French for patent-related content). Let's make the tool multilingual!

  • add lexical resources (units, number words) for other languages,
  • adapt the QuantityLexicon to work with different locales,
  • use GROBID language recognition for the input to be parsed and set the correct locale accordingly for the rest of the process.
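
For the numeric part of a value, the JDK's `NumberFormat` already shows what setting the correct locale means in practice. The sketch below is only an illustration under that assumption; `LocalizedNumbers` is a hypothetical helper and says nothing about how QuantityLexicon itself would be adapted.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;
import java.util.OptionalDouble;

final class LocalizedNumbers {

    /** Parses a number using the decimal and grouping conventions of the given locale. */
    static OptionalDouble parse(String text, Locale locale) {
        String trimmed = text.trim();
        ParsePosition position = new ParsePosition(0);
        Number parsed = NumberFormat.getNumberInstance(locale).parse(trimmed, position);
        // Reject partial matches such as "1,5abc".
        if (parsed == null || position.getIndex() != trimmed.length()) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(parsed.doubleValue());
    }
}
```

Under a French or German locale the string "1,5" is read as one and a half, which is why the detected language has to be known before values are normalized.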

Unit type of a flow measurement

In this example, the quantity unit is something like "the number of molecules flowing per square meter per second":

The gas coma will reach the upper atmosphere of Mars with peak fluxes of order 10 12 molecules m −2 s −1

 <measure type="value"><num>10 12</num> <measure type="?" unit="mol m^-2 s^-1">molecules
 m −2 s −1</measure></measure> 

What would be its type? I don't think it's already in UnitUtilities.java, it would be something like "Molecule flow rate" or "Molecular diffusion"?

What do you think?

Is meteor/h a unit?

In the following sentence with a zenithal hourly rate quantity:

the meteor shower at Mars is an Earth-equivalent zenith hourly rate 600 h −1

Should we annotate:

(1) zenith hourly rate <measure type="value"><num>600</num></measure> h −1
or
(2) zenith hourly rate <measure type="value"><num>600</num> <measure type="?" 
unit="meteor/h">h −1</measure></measure>
?

If (2), what type would that be? It's an hourly rate but I don't see anything like that in the UnitUtilities.java file.

Reference markers, formulas and other irrelevant numbers

This is related to issue #22.

I have a file (1404.4640.training.tei.xml) where there are a few reference markers, for example:

  • lower than those derived by Vaubaillon et al. (2014) and Moorhead et al. (2014)
  • computing the corresponding impact probabilities (Milani et al. 2005)

There are also figure/table titles, and other numbers that don't quantify anything, for example:

  • Figure 1 shows the residuals of C/2013 A1's observations
  • [Figure 1 about here.]
  • Table 1 contains the orbital elements of the computed solution.
  • our new orbit solution (JPL solution 46)

There are also some inline formulas, like:

  • a minimum point of ∆v 2 = |∆v| 2 under the constraint that the particle reaches Mars, i.e., (ξ, ζ)(r, β, ∆v) = (0, 0).

None of these numbers are annotated.

(+addToDoc)

Unit parsing: full names unit, full names with inflections

As we moved from "lexical mapping" (not to say rules!) to a CRF parser to process and normalize the unit expressions, full unit names are no longer covered by the unit parser, e.g. hours in "2 hours".

[WARN ] org.grobid.core.engines.QuantityParser: Could not normalize the value: 2. 
org.grobid.core.data.normalization.NormalizationException: The unit Unit{rawName='hours', offsets=661   666, productBlock=null} cannot be normalized. It is either not a valid unit or it is not recognized from the available parsers.
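
A possible fallback, sketched here under the assumption that a small lookup applied before (or after) the CRF unit parser is acceptable, maps full unit names and their plural forms to symbols. `UnitNameLookup` and its table are hypothetical and cover only a handful of names with regular English plurals.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

final class UnitNameLookup {
    // Illustrative subset; a real table would come from the unit lexicon.
    private static final Map<String, String> FULL_NAMES = Map.of(
            "hour", "h",
            "minute", "min",
            "second", "s",
            "metre", "m",
            "meter", "m",
            "gram", "g",
            "kilogram", "kg");

    /** Returns the symbol for a full unit name, also accepting a trailing plural "s". */
    static Optional<String> toSymbol(String rawName) {
        String key = rawName.trim().toLowerCase(Locale.ROOT);
        String symbol = FULL_NAMES.get(key);
        if (symbol == null && key.endsWith("s")) {
            symbol = FULL_NAMES.get(key.substring(0, key.length() - 1));
        }
        return Optional.ofNullable(symbol);
    }
}
```

With this table, "hours" resolves to "h", so the value 2 in "2 hours" could then be normalized through the regular symbol path.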

Generic parsing of values

Values can be entirely numerical, use exponent of 10s (see #7) or exponent symbol (0.2E-4), number words ("twenty") (see #8), dates/time expressions ("October 19, 2014 at 20:09 TDB") (see #12).

Currently the treatment of all these cases is ad hoc. We should introduce a value parser to recognize what kind of value we have and to use the right parser/normalization.
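
The first step of such a value parser, deciding which kind of value a string is, could be sketched with a few regular expressions. Everything here is an assumption made for illustration: the class `ValueKinds`, the `Kind` names, and the patterns themselves. Date and time expressions are left out and fall into `UNKNOWN`.

```java
import java.util.regex.Pattern;

final class ValueKinds {
    enum Kind { NUMERIC, EXPONENT_SYMBOL, POWER_OF_TEN, ALPHABETIC, UNKNOWN }

    // \u2212 is the Unicode minus sign that often appears in text extracted from PDFs.
    private static final Pattern NUMERIC =
            Pattern.compile("[-+\u2212]?\\d+(?:\\.\\d+)?");
    private static final Pattern EXPONENT_SYMBOL =
            Pattern.compile("[-+\u2212]?\\d+(?:\\.\\d+)?[eE][-+\u2212]?\\d+");
    // Optional mantissa and multiplication sign, then 10 and its (possibly detached) exponent.
    private static final Pattern POWER_OF_TEN =
            Pattern.compile("(?:\\d+(?:\\.\\d+)?\\s*[x\u00D7]\\s*)?10\\s*\\^?\\s*[-+\u2212]?\\d+");
    private static final Pattern ALPHABETIC =
            Pattern.compile("\\p{L}+(?:[\\s-]\\p{L}+)*");

    /** Classifies a raw value string; the order of the checks matters for strings like "100". */
    static Kind classify(String raw) {
        String text = raw.trim();
        if (NUMERIC.matcher(text).matches()) return Kind.NUMERIC;
        if (EXPONENT_SYMBOL.matcher(text).matches()) return Kind.EXPONENT_SYMBOL;
        if (POWER_OF_TEN.matcher(text).matches()) return Kind.POWER_OF_TEN;
        if (ALPHABETIC.matcher(text).matches()) return Kind.ALPHABETIC;
        return Kind.UNKNOWN;
    }
}
```

Each kind would then be routed to its own normalizer, which keeps the per-format logic out of the calling code.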

What is the type of this unit: rad.m^-2

I'm looking for the type of the unit rad.m^-2, found here:

They found a large-scale rotation measure of ∼ −21 rad m −2 that they attribute to the interstellar medium, and from the observed depolarization they inferred absolute rotation measure values of a few hundred rad m −2 in unresolved filaments.

Should it be a new type ROTATION?

Loss of exponents for the powers of ten from pdf

Exponents are lost in the xml file, for example 10 power -6 in pdf becomes 10 &#x2212;6 (10 −6).

  1. is it a problem?

  2. we agreed we should add the exponent in the attribute when there is one, for example in intervals:
    <measure type="interval"><num atMost="10^-6">10 −6</num></measure>
    Please confirm

I'm not sure but this may be related to issue #7

Problem with running on some data

I am taking a class at USC under Prof. Mattmann. We are trying to run grobid-quantities on a dataset given to us. I am not able to run it on several files. I get an error saying "IMPLEMENTATION ERROR: tokens got dissynchronized with tokenizations"

I have uploaded a sample file on which I am trying to run grobid-quantities here

Could you please explain the error to me so that I can help find a fix?

Windows 64 bit pdftoxml issue

if (withAnnotations) {
    pdf2xml += " -annotation ";
}

In the Windows 64-bit version, GROBID uses pdftoxml version 3.01 (on Linux, however, it uses 3.02):

pdftoxml version 1.0
(Based on Xpdf version 3.01, Copyright 1996-2005 Glyph & Cog, LLC)
Copyright 2004-2006 XEROX XRCE

3.01 doesn't have the option -annotation but only -annots.

Either the pdftoxml version needs to be upgraded for Windows 64-bit, or the above Java code has to be updated.

Interval markers inside or outside the <measure> tag

Should interval markers such as >, <, "more than", "less than", etc. be included in the tag ?

more than <measure type="interval"> <num atLeast="2">2</num> </measure> 
or
<measure type="interval"> more than <num atLeast="2">2</num> </measure> 

Unit to add: Jy

In this sentence, there's a mention of µJy:

If the nucleus is on the order of 1km in radius, it would contribute a few tenths of a µJy to the fluxes observed during the 2014 Jan. visit when the comet was least active and the nucleus likely contributed the largest fraction of the light.

The Jansky is a Non-SI unit of Spectral flux density. Should we add it to the units.json with a new type SPECTRAL_FLUX_DENSITY ?

Capturing vague units of time

I recently annotated a cryology-related blog and noticed that grobid doesn't allow for vague or inexplicit units of time to be captured. Examples of these include: late July, early August, end of the month, this week, through April, recent decades etc.

I also noticed that it ignores mentions of seasons like spring, fall, summer, summertime, winter, wintertime. Cryology has its own unique terms to denote seasons, like melt season or ice growth period.

It would be very useful if grobid 1) could capture these vague time expressions, 2) could link them to the publication date of the document/blog/article, and 3) allowed prototypical seasons (and possibly the season terms unique to cryology) to be captured as a kind of time expression.

Annotations of Numbers alone

Often there are numbers that are mentioned but don't refer to a measured quantity:

There are five planets with sufficient signal-to-noise for analysis.

We decided to annotate them as <measure type="value"><num>five</num></measure>. Although this overlaps with grobid-ner, those are also quantities and we identify the quantified substance/objects.

Does this make sense?

@unit types and format discussion, addition of new units

While rechecking @Unit in the files, I have this one with a ? type:

values of A 1 are on average ∼ 10 −8 au/d 2

au/d^2 seems to convert to m/s^2 so I annotate it like this:

values of A 1 are on average ∼ <measure type="value"><num>10 −8</num> <measure 
type="ACCELERATION" unit="au/d^2">au/d 2</measure></measure>

au/d^2 and m/s^2 are not yet in the units.json file, how do we proceed to add them?

Create training data

  • annotation of the prime training data
  • correction of 10 automatically annotated training data (check performances after that)

Imprecise quantifiers: few, several

Should quantifiers like few or several be annotated?

e.g. in this case several millimeter:
At higher velocities, younger grains from sub-millimeter to several millimeter can reach Mars too, although an even smaller fraction of grains is expected to have these velocities, with negligible effect on the peak timing.

Interval bounded with quantities expressed in different unit multiples

This interval is delimited by bounds with different multiples of the unit:

radii between 10 µm and 1 cm

Is it enough to annotate it like this:

grains with radii between <measure type="interval"><num atLeast="10">10</num> 
<measure type="LENGTH" unit="µm">µm</measure> and <num atMost="1">1</num>
 <measure type="LENGTH" unit="cm">cm</measure></measure>

?
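
Whichever annotation is chosen, normalization has to bring both bounds to the same base unit before they can be compared. The sketch below shows that step for lengths only; `LengthBounds` and its prefix table are hypothetical and not taken from the project.

```java
import java.math.BigDecimal;
import java.util.Map;

final class LengthBounds {
    // Power of ten relating each unit to the metre; \u00B5 is the micro sign.
    private static final Map<String, Integer> POWER_OF_TEN = Map.of(
            "\u00B5m", -6,
            "mm", -3,
            "cm", -2,
            "m", 0,
            "km", 3);

    /** Converts a value expressed in a metric length unit to metres. */
    static BigDecimal toMetres(String value, String unit) {
        Integer power = POWER_OF_TEN.get(unit);
        if (power == null) {
            throw new IllegalArgumentException("Unknown length unit: " + unit);
        }
        return new BigDecimal(value).scaleByPowerOfTen(power);
    }
}
```

For the example above, the lower bound 10 µm becomes 0.00001 m and the upper bound 1 cm becomes 0.01 m, so the interval is well ordered once both bounds are normalized.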

Interval notation

In the following example:

closest heliocentric distances (3 AU ≤ r h 5 AU)

The unit is repeated twice. Should we annotate both "AU" or only one:

<measure type="interval"><num atLeast="3">3</num> <measure type="LENGTH" 
unit="AU">AU</measure> ≤ r h <num atMost="5">5</num> <measure type="LENGTH" 
unit="AU">AU</measure></measure>

or 
<measure type="interval"><num atLeast="3">3</num> AU ≤ r h <num atMost="5">5</num>
 <measure type="LENGTH" unit="AU">AU</measure></measure>
?

Space characters break the unit parser

As mentioned in the code ;)

//Remove spaces. It's a workaround (to be check whether it is working) because spaces are causing troubles
        text = text.replaceAll(" ", "");

However we have later:

text = text.replace("\n", " ");

And indeed in the CRF input, the space character is the separator, so it would lead to an invalid vector, e.g.:

input string -> µJy\none -> µJy one ->

µ 0 0 1 1 NOPUNCT 0
J 1 0 0 0 NOPUNCT 0
y 0 0 1 1 NOPUNCT 0
  1 0 0 0 NOPUNCT 0
o 0 0 0 0 NOPUNCT 0
n 0 0 1 1 NOPUNCT 0
e 0 0 1 0 NOPUNCT 0

It means that for a character level CRF, we need to replace the space character (and the tabulation) by a default UTF8 non-space character, and reverse back the change at decoding.
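
A minimal sketch of that replacement is given below. The placeholder characters (U+2423 OPEN BOX for the space, U+21E5 for the tab) and the class name `SpaceEscaping` are arbitrary assumptions; any non-whitespace character that cannot occur in the unit text would do.

```java
final class SpaceEscaping {
    // Hypothetical placeholders; they must never occur in the original unit text.
    static final char SPACE_PLACEHOLDER = '\u2423';
    static final char TAB_PLACEHOLDER = '\u21E5';

    /** Prepares text for a character-level CRF: no whitespace is left in the result. */
    static String encode(String text) {
        return text.replace('\n', ' ')
                .replace(' ', SPACE_PLACEHOLDER)
                .replace('\t', TAB_PLACEHOLDER);
    }

    /** Restores spaces and tabs after decoding; newlines come back as spaces. */
    static String decode(String text) {
        return text.replace(SPACE_PLACEHOLDER, ' ')
                .replace(TAB_PLACEHOLDER, '\t');
    }
}
```

With this encoding, the fourth row of the feature vector above would start with the placeholder instead of an empty field, so every row keeps the same number of columns.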

Handling Jr., Sr. in names

I am using latest version 0.4.2 and checked the following issues in Windows 7 as well as CentOS 7
Reference Citation Sample checked:
Clow GD, McKay CP, Simmons Jr. GM, and Wharton RA, Jr. 1988. Climatological observations and predicted sublimation rates at Lake Hoare, Antarctica. Journal of Climate 1:715-728.

Issue 1. It changes the forename "GD" as "Gd"; "CP" as "Cp" etc.
Issue 2. Captures Jr. as surname and tags "GM" as separate surname without a forename

<author>
	<persName>
		<forename type="first">Cp</forename>
		<surname>Mckay</surname>
	</persName>
</author>    
<author>     
	<persName>
		<forename type="first">Simmons</forename>
		<surname>Jr</surname>
	</persName>
</author>    
<author>     
	<persName>
		<surname>Gm</surname>
	</persName>
</author>
  1. How can the forename (initials) be retained as is, without converting the case?
  2. There are enough datasets in "name\header\corpus" with Jr (for example), and I don't know why it is not captured in a suffix tag. This happens in the header part as well as in the citation part.

Regards
Dominic

Interval with complex boundaries (x by powers of ten)

In the following sentences with quantities expressed by a power of ten multiplication, how can we specify the interval boundaries?:

  • A1 (Siding Spring) will pass Mars with a close approach distance of 1.35 ± 0.05 × 10 5 km

the interval is from (1.30 x 10^5) to (1.40 x 10^5), which can't be expressed only with tags here...

  • The gas production rates, Q(CO 2 ) = (3.52 ± 0.03) × 10 26 molecules s −1
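
Even if the tags cannot express it directly, the bounds can be computed at normalization time. The sketch below assumes the power of ten applies to both the central value and the uncertainty, as in the first example; `UncertaintyInterval` is a hypothetical class.

```java
import java.math.BigDecimal;

final class UncertaintyInterval {
    final BigDecimal atLeast;
    final BigDecimal atMost;

    private UncertaintyInterval(BigDecimal atLeast, BigDecimal atMost) {
        this.atLeast = atLeast;
        this.atMost = atMost;
    }

    /** Builds the interval (value - uncertainty, value + uncertainty) x 10^exponent. */
    static UncertaintyInterval of(String value, String uncertainty, int exponent) {
        BigDecimal centre = new BigDecimal(value);
        BigDecimal delta = new BigDecimal(uncertainty);
        return new UncertaintyInterval(
                centre.subtract(delta).scaleByPowerOfTen(exponent),
                centre.add(delta).scaleByPowerOfTen(exponent));
    }
}
```

For the close approach distance, `UncertaintyInterval.of("1.35", "0.05", 5)` gives bounds equal to 130000 and 140000, which matches the interval from 1.30 x 10^5 to 1.40 x 10^5 km described above.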

Units as not part of a quantity

Question: Should we annotate cm and AU as unit in this paragraph?

where A is the Bond albedo of the dust at the phase angle of observation, f is the filling factor of the dust grains within the aperture, ρ is the aperture size in cm, ∆ and r H are the geocentric and heliocentric distances in cm and AU, respectively, and F comet and F ⊙ are the flux from the comet and the Solar flux

Answer: As they don't appear with a value, they shall be ignored, because this is only the description of the unit used in the graph/document/figure.

Numerical value as exponent on 10s

The quantity CRF model recognizes numerical expressions with exponents on 10 (in particular distorted one due to PDF text extraction):

[Image: example_exponent]

However we are not currently parsing it (in their "noisy" form) to actual BigDecimal values.
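
A minimal sketch of such a parser follows. It assumes the noisy form is an optional mantissa with a multiplication sign, the literal 10, and then the exponent, possibly detached by spaces and written with a Unicode minus (U+2212). `PowerOfTenParser` is a hypothetical name, and a string like "10 12" is inherently ambiguous, since it could also be two separate numbers.

```java
import java.math.BigDecimal;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PowerOfTenParser {
    // group 1: optional mantissa, group 2: optional sign, group 3: exponent digits
    private static final Pattern POWER_OF_TEN = Pattern.compile(
            "(?:(\\d+(?:\\.\\d+)?)\\s*[x\u00D7]\\s*)?10\\s*\\^?\\s*([-+\u2212]?)\\s*(\\d{1,4})");

    /** Parses strings such as "10 −6" or "3.52 × 10 26" into a BigDecimal. */
    static Optional<BigDecimal> parse(String text) {
        Matcher matcher = POWER_OF_TEN.matcher(text.trim());
        if (!matcher.matches()) {
            return Optional.empty();
        }
        BigDecimal mantissa = matcher.group(1) == null
                ? BigDecimal.ONE
                : new BigDecimal(matcher.group(1));
        int exponent = Integer.parseInt(matcher.group(3));
        String sign = matcher.group(2);
        if (sign.equals("-") || sign.equals("\u2212")) {
            exponent = -exponent;
        }
        return Optional.of(mantissa.scaleByPowerOfTen(exponent));
    }
}
```

Using `BigDecimal.scaleByPowerOfTen` keeps the result exact, so no precision is lost for large or small exponents.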

Support to change the port number

Hello

I am a student at USC and am currently using Grobid Quantities for one of my projects. I am not able to change the default port number 8080. No processes are running on this port as of now apart from the normal HTTP traffic. Whenever I use mvn -Dmaven.test.skip=true jetty:run-war, GROBID throws an exception with 'Address already in use'. The entire stack trace is attached for reference as well.

Is there any possibility to change the default port number without having to rebuild? Any help is highly appreciated.

Thanks
Grobit_Error.txt

Help with training quantities and units

Hello,

When I run these commands to train :

mvn -DskipTests generate-resources -Ptrain_quantities
mvn -DskipTests generate-resources -Ptrain_units
I get these two errors respectively, and the models are not generated. Any idea how to fix it?

EP-0505067-B1.training.tei.xml
EP-0505085-B2.training.tei.xml
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
Warning: unknown measure type, ?
epsilon: 1.0E-6
window: 20
nb threads: 1
error: too much input files on command line
1001.4731.units.training.tei.xml
1404.4640.units.training.tei.xml
1404.7168.units.training.tei.xml
1412.2117.units.training.tei.xml
generated.training.1460634625418.tei.xml
trainingdata1.tei.xml
trainingdata2.tei.xml
trainingdata4.tei.xml
epsilon: 1.0E-7
window: 20
nb threads: 1
error: too much input files on command line

My Configuration is the following:

Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /Users/username/Downloads/apache-maven-3.3.9
Java version: 1.8.0_71, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_71.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"

Thank you
