patentpublicdata's Introduction

Patent Public Bulk Files

Toolkit to download, read, and use open patent data provided to the public.

Notice

This source code is a work in progress and has not been fully vetted for a production environment.

Two main modules

  • Bulk Downloader automates downloading of public bulk patent data
  • Patent Document iterates over and reads patents directly from the large bulk download files. It supports patent documents from 1976 to the present (Greenbook, SGML, PAP, and all Redbook XML formats) and reads them into a normalized Patent Object Model.

Features

  • Download bulk patent grants and applications, as well as additional resources
  • View individual patent documents directly from the large bulk files
  • Read patent documents directly from the large bulk files; supports documents from 1976 to the present (formats: Greenbook, SGML, PAP, Redbook XML), read into a normalized Patent Object Model
  • Extract patent documents from bulk files
  • Normalize and transform patent data before loading into a data resource
  • Company Synonyms generated by removing prefixes and suffixes, e.g. (Corp.|Corporation|Co.|Company|...)
  • Extraction of US patent IDs from NPL citations
  • Patent Claim Tree to facilitate analysis
  • Update Classifications from Master CPC File (current CPC classification for patents starting from patent number 1)
  • Include classification definitions from CPC Scheme
  • Build a corpus using Corpus Builder, which automates building a corpus by downloading and extracting patents/applications matching specified classifications, one bulk file at a time for a date range.
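
The Company Synonyms feature listed above can be illustrated with a small sketch. It is not the project's implementation: the class name, the method name, and the regular expression are assumptions made for illustration. It covers only the four suffixes named in the feature list and does not handle prefixes.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of the PatentPublicData API. */
class CompanySynonymSketch {

    // Trailing legal-form suffixes named in the feature list; the real list is longer.
    private static final Pattern SUFFIX = Pattern.compile(
            "[,\\s]+(Corp\\.?|Corporation|Co\\.?|Company)$", Pattern.CASE_INSENSITIVE);

    /** Returns the original name plus each shorter form produced by stripping one suffix at a time. */
    static Set<String> synonyms(String name) {
        Set<String> result = new LinkedHashSet<>();
        String current = name.trim();
        result.add(current);
        while (true) {
            String stripped = SUFFIX.matcher(current).replaceFirst("").trim();
            // Stop when nothing was removed or nothing would be left.
            if (stripped.isEmpty() || stripped.equals(current)) {
                break;
            }
            result.add(stripped);
            current = stripped;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(synonyms("Acme Widget Company"));
        System.out.println(synonyms("Acme Holding Co., Corp."));
    }
}
```

Requiring a separator before the suffix keeps names that merely end in the same letters (for example a name ending in "co") from being shortened.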

Public Patent Data

  • Rate of Release: Every Tuesday, a new bulk file is released, which contains around two to five thousand patents granted on the same day as the release.
  • Releases are available on both the USPTO Bulkdata and Reedtech websites.
  • Receiving changes to patents after publication: bulk files are not updated once published. Updates can be received by indexing additional supplemental files, which are also publicly available. The following fields are periodically updated after publication:
    Field            Update available
    Assignee         daily, within Patent Assignment XML Dump files
    Classifications  monthly, within Master Classification File Dumps

Other Information

The United States Department of Commerce (DOC) and the United States Patent and Trademark Office (USPTO) GitHub project code is provided on an ‘as is’ basis without any warranty of any kind, either expressed, implied or statutory, including but not limited to any warranty that the subject software will conform to specifications, any implied warranties of merchantability, fitness for a particular purpose, or freedom from infringement, or any warranty that the documentation, if provided, will conform to the subject software. DOC and USPTO disclaim all warranties and liabilities regarding third party software, if present in the original software, and distribute it as is. The user or recipient assumes responsibility for its use. DOC and USPTO have relinquished control of the information and no longer have responsibility to protect the integrity, confidentiality, or availability of the information.

User and recipient agree to waive any and all claims against the United States Government, its contractors and subcontractors as well as any prior recipient, if any. If user or recipient’s use of the subject software results in any liabilities, demands, damages, expenses or losses arising from such use, including any damages from products based on, or resulting from recipient’s use of the subject software, user or recipient shall indemnify and hold harmless the United States government, its contractors and subcontractors as well as any prior recipient, if any, to the extent permitted by law. User or recipient’s sole remedy for any such matter shall be immediate termination of the agreement. This agreement shall be subject to United States federal law for all purposes including but not limited to the validity of the readme or license files, the meaning of the provisions and rights and the obligations and remedies of the parties. Any claims against DOC or USPTO stemming from the use of its GitHub project will be governed by all applicable Federal law. “User” or “Recipient” means anyone who acquires or utilizes the subject code, including all contributors. “Contributors” means any entity that makes a modification.

This agreement or any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not in any manner constitute or imply their endorsement, recommendation or favoring by DOC or the USPTO, nor does it constitute an endorsement by DOC or USPTO or any prior recipient of any results, resulting designs, hardware, software products or any other applications resulting from the use of the subject software. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, including USPTO, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC, USPTO or the United States Government.



CC0
To the extent possible under law, https://github.com/USPTO/PatentPublicData has waived all copyright and related or neighboring rights to Patent Public Data. This work is published from: United States.

patentpublicdata's People

Contributors

bgfeldm, figyelmesi, maduraimad, mustberuss

patentpublicdata's Issues

USPC Classification Ranges

Better parsing and handling of USPC Classification Ranges

Sample USPC Classification Ranges:

'340 31- 332'
' 63 57- 158'
'715700-866'
'473251-356'
' 52574-5822'
' 52480-4811'
' 73862041-86208'
' 73862381-86269'
' 81 60- 632'
'383 34- 341'
'400621-6212'
'604 9601-113'
'623 2351- 236'
'435 46- 692'
'269 475- 4752'
'370403-517'
'707100-1041'
'718 1-108'
'438296-977'
'438142-378'
'438400-757'
'257661-786'
'3242072-20724'
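
A cautious first step for the samples above is sketched below. It assumes, based only on these samples, that the first three characters hold the class and that the hyphen separates two subclass endpoints. The class and method names are hypothetical. The endpoints are deliberately kept as raw digit strings, because where the decimal point belongs in each one is not settled by the samples.

```java
/** Hypothetical splitter for fixed-width USPC range strings; not the project's parser. */
class UspcRangeSketch {

    /** Class plus the two subclass endpoints, kept as raw digit strings. */
    static final class Range {
        final String mainClass;
        final String rawStart;
        final String rawEnd;

        Range(String mainClass, String rawStart, String rawEnd) {
            this.mainClass = mainClass;
            this.rawStart = rawStart;
            this.rawEnd = rawEnd;
        }

        @Override
        public String toString() {
            return mainClass + " [" + rawStart + " .. " + rawEnd + "]";
        }
    }

    /** Splits on the assumed layout: 3-character class, start subclass, '-', end subclass. */
    static Range split(String field) {
        if (field == null || field.length() < 4) {
            throw new IllegalArgumentException("Too short for a class and subclass: " + field);
        }
        // Search for the separator only after the assumed 3-character class.
        int dash = field.indexOf('-', 3);
        if (dash < 0) {
            throw new IllegalArgumentException("No range separator found: " + field);
        }
        return new Range(field.substring(0, 3).trim(),
                field.substring(3, dash).trim(),
                field.substring(dash + 1).trim());
    }

    public static void main(String[] args) {
        for (String sample : new String[] {"340 31- 332", " 73862041-86208", "718 1-108"}) {
            System.out.println(split(sample));
        }
    }
}
```

Interpreting the raw endpoints (for example, whether "862041" means subclass 862.041) would need the field layout from the bulk-file documentation and is left out here.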

Plain fields often empty: Raw, Plain, Normalized fields

Five fields have raw, plain, and normalized versions. First, can you give me some information about the normalized and the plain versions? Second, many of the plain fields appear to be empty (often containing only a newline).

The fields are:

abstract
description.REL_APP_DESC
description.DRAWING_DESC
description.BRIEF_SUMMARY
description.DETAILED_DESC

The plain version appears to be empty a not-insignificant percentage of the time.

As measured by (len(plain) / len(raw) < 50%), when examining a 2010 grant file (specifically ipg100105.bulk) about 0.5% of the abstracts appear to be empty, 2% of the DETAILED_DESC, and approximately 90% of the DRAWING_DESC.

I don't have any further information about the cause of these blank fields.

Any insight is appreciated.
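
For anyone trying to reproduce the measurement, the check described above (plain length under 50% of raw length) could be sketched as follows. The class and method names are hypothetical, and the raw and plain strings are assumed to have been extracted from the records already. The sketch also trims the plain text first, which the original measurement may not have done.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical check; the raw/plain strings are assumed to be extracted beforehand. */
class PlainFieldCheckSketch {

    /** True when the trimmed plain text is shorter than half of a non-empty raw text. */
    static boolean looksEmpty(String raw, String plain) {
        if (raw == null || raw.isEmpty()) {
            return false; // nothing to compare against
        }
        int plainLength = plain == null ? 0 : plain.trim().length();
        return plainLength * 2 < raw.length();
    }

    /** Fraction of {raw, plain} pairs flagged by looksEmpty; 0.0 for an empty list. */
    static double suspectFraction(List<String[]> rawPlainPairs) {
        if (rawPlainPairs.isEmpty()) {
            return 0.0;
        }
        int suspects = 0;
        for (String[] pair : rawPlainPairs) {
            if (looksEmpty(pair[0], pair[1])) {
                suspects++;
            }
        }
        return (double) suspects / rawPlainPairs.size();
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"<p>Some abstract text</p>", "Some abstract text"},
                new String[] {"<p>Drawing description text</p>", "\n"});
        System.out.println(suspectFraction(pairs));
    }
}
```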

Multiple versions of JSoup causing exception

The BulkDownloader POM declares an older version of Jsoup, so both versions end up in the dependency jar directory. This leads to the following exception:

Exception in thread "main" java.lang.NoSuchMethodError: org.jsoup.parser.Parser.settings(Lorg/jsoup/parser/ParseSettings;)Lorg/jsoup/parser/Parser;	
	at gov.uspto.patent.PatentReader.fixTagsJDOM(PatentReader.java:116)
	at gov.uspto.patent.PatentReader.getJDOM(PatentReader.java:99)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:76)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)
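
One way to confirm which jar a class is actually loaded from is a small diagnostic like the one below. It is a sketch using only JDK calls; the class name `ClassOriginSketch` is hypothetical. Passing `org.jsoup.parser.Parser` as the argument assumes it is run with the same classpath as `TransformerCli`.

```java
import java.net.URL;
import java.security.CodeSource;

/** Hypothetical diagnostic: prints where a class was loaded from. */
class ClassOriginSketch {

    /** Returns the jar or directory URL a class came from, or a placeholder if unknown. */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null) {
            // JDK classes loaded by the bootstrap loader typically have no code source.
            return "(bootstrap or unknown)";
        }
        URL location = source.getLocation();
        return location == null ? "(unknown)" : location.toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name, e.g. org.jsoup.parser.Parser.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + originOf(Class.forName(name)));
    }
}
```

If the printed location is the older Jsoup jar, that jar is winning on the classpath, which would be consistent with the `NoSuchMethodError` above.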

TransformerCli throws an NPE and stops processing with pftaps19931130_wk48.zip

2017-01-12 13:37:00,537 INFO  [main] TransformerCli - Record: 'US5266142A' from pftaps19931130_wk48.zip:1102			
Exception in thread "main" java.lang.NullPointerException				
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)	
	at gov.uspto.patent.serialize.JsonMapper.mapAssignees(JsonMapper.java:320)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:102)		
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)	
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:215)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:199)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)						
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

Can't compile commit 7fdebf2189eb662308707ee0a3fd2eef15cc8568

New directory:

git clone https://github.com/USPTO/PatentPublicData.git

then

mvn clean package

Results in:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO] 
[INFO] PatentPublicData
[INFO] Common
[INFO] PatentDocument
[INFO] BulkDownloader
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building PatentPublicData 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ PatentPublicData ---
[INFO] 
[INFO] >>> maven-javadoc-plugin:2.10.3:aggregate (aggregate) > generate-sources @ PatentPublicData >>>
[INFO] 
[INFO] <<< maven-javadoc-plugin:2.10.3:aggregate (aggregate) < generate-sources @ PatentPublicData <<<
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:aggregate (aggregate) @ PatentPublicData ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:jar (attach-javadocs) @ PatentPublicData ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-dependency-plugin:2.10:copy-dependencies (copy-dependencies) @ PatentPublicData ---
[INFO] 
[INFO] --- maven-assembly-plugin:2.2-beta-5:single (default) @ PatentPublicData ---
[INFO] Building zip: /Users/patrick/dev/repos/trashthis/PatentPublicData/target/PatentPublicData-0.0.1-SNAPSHOT.zip
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building Common 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ Common ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ Common ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ Common ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 24 source files to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/classes
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ Common ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile) @ Common ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 3 source files to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/test-classes
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/src/test/java/gov/uspto/common/text/StopWordTest.java: Some input files use or override a deprecated API.
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/src/test/java/gov/uspto/common/text/StopWordTest.java: Recompile with -Xlint:deprecation for details.
[INFO] 
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ Common ---
[INFO] Surefire report directory: /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running gov.uspto.common.text.StopWordTest
Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.203 sec
Running gov.uspto.common.text.StringCaseTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running gov.uspto.common.text.WordUtilTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec

Results :

Tests run: 44, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ Common ---
[INFO] Building jar: /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/Common-0.0.1-SNAPSHOT.jar
[INFO] 
[INFO] >>> maven-javadoc-plugin:2.10.3:aggregate (aggregate) > generate-sources @ Common >>>
[INFO] 
[INFO] <<< maven-javadoc-plugin:2.10.3:aggregate (aggregate) < generate-sources @ Common <<<
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:aggregate (aggregate) @ Common ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:jar (attach-javadocs) @ Common ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-dependency-plugin:2.10:copy-dependencies (copy-dependencies) @ Common ---
[INFO] Copying commons-compress-1.11.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/commons-compress-1.11.jar
[INFO] Copying jopt-simple-5.0.2.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/jopt-simple-5.0.2.jar
[INFO] Copying commons-io-2.5.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/commons-io-2.5.jar
[INFO] Copying slf4j-api-1.7.21.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/slf4j-api-1.7.21.jar
[INFO] Copying slf4j-log4j12-1.7.21.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/slf4j-log4j12-1.7.21.jar
[INFO] Copying log4j-1.2.17.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/log4j-1.2.17.jar
[INFO] Copying guava-19.0.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/guava-19.0.jar
[INFO] 
[INFO] --- maven-assembly-plugin:2.2-beta-5:single (default) @ Common ---
[INFO] Building zip: /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/Common-0.0.1-SNAPSHOT.zip
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building PatentDocument 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ PatentDocument ---
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ PatentDocument ---
[INFO] Using 'Cp1252' encoding to copy filtered resources.
[INFO] Copying 17 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.6.0:compile (default-compile) @ PatentDocument ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 209 source files to /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/target/classes
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/sgml/items/DescriptionFigures.java: Some input files use unchecked or unsafe operations.
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/sgml/items/DescriptionFigures.java: Recompile with -Xlint:unchecked for details.
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ PatentDocument ---
[INFO] Using 'Cp1252' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/test/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.6.0:testCompile (default-testCompile) @ PatentDocument ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 27 source files to /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/target/test-classes
[INFO] 
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ PatentDocument ---
[INFO] Surefire report directory: /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running gov.uspto.document.parser.dom4j.PatentParserTest
2017-01-12 09:55:18,015 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2017-01-12 09:55:18,078 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2017-01-12 09:55:18,100 WARN  [       main] US7654321            DescriptionNode - Patent does not have a Description.
2017-01-12 09:55:18,111 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2017-01-12 09:55:18,137 WARN  [       main] US3930584            ApplicationIdNode - Invalid application-id 'APN' field not found
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.832 sec
Running gov.uspto.patent.doc.greenbook.FormattedTextTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running gov.uspto.patent.doc.greenbook.GreenbookTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.255 sec
Running gov.uspto.patent.doc.greenbook.items.DescriptionFiguresTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running gov.uspto.patent.doc.pap.PatentAppPubParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.674 sec
Running gov.uspto.patent.doc.sgml.SgmlTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.222 sec
Running gov.uspto.patent.doc.xml.ApplicationParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.424 sec
Running gov.uspto.patent.doc.xml.FormattedTextTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.017 sec
Running gov.uspto.patent.doc.xml.GrantParserTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.172 sec
Running gov.uspto.patent.mathml.MathMLTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running gov.uspto.patent.model.classification.ClassificationPredicateTest
false
false
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.093 sec
Running gov.uspto.patent.model.classification.ClassificationTokenizerTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.006 sec
Running gov.uspto.patent.model.classification.CpCClassificationTest
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.007 sec
Running gov.uspto.patent.model.classification.IpcClassificationTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running gov.uspto.patent.model.classification.PatentClassificationTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.027 sec
Running gov.uspto.patent.model.classification.UspcClassificationTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.028 sec
Running gov.uspto.patent.model.DocumentIdTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec
Running gov.uspto.patent.model.entity.NamePersonTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running gov.uspto.patent.model.PatentTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.serialize.JsonMapperTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.03 sec
Running gov.uspto.patent.validate.AbstractRuleTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running gov.uspto.patent.validate.ClaimRuleTest
Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.035 sec <<< FAILURE!
nullClaimListFail(gov.uspto.patent.validate.ClaimRuleTest)  Time elapsed: 0.034 sec  <<< ERROR!
java.lang.NullPointerException
	at gov.uspto.patent.model.Patent.setClaim(Patent.java:157)
	at gov.uspto.patent.validate.ClaimRuleTest.nullClaimListFail(ClaimRuleTest.java:32)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

Running gov.uspto.patent.validate.ClassificationRuleTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.validate.DescriptionRuleTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.validate.TitleRuleTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec

Results :

Tests in error: 
  nullClaimListFail(gov.uspto.patent.validate.ClaimRuleTest)

Tests run: 71, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] PatentPublicData ................................... SUCCESS [  5.361 s]
[INFO] Common ............................................. SUCCESS [  5.228 s]
[INFO] PatentDocument ..................................... FAILURE [ 11.048 s]
[INFO] BulkDownloader ..................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21.818 s
[INFO] Finished at: 2017-01-12T09:55:24+01:00
[INFO] Final Memory: 30M/280M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project PatentDocument: There are test failures.
[ERROR] 
[ERROR] Please refer to /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :PatentDocument

Test failures for PatentDocument on 'mvn clean install'

I updated my local repo to the latest master (75597ea) and ran mvn clean install; PatentDocument fails with the test failures below:

 T E S T S
-------------------------------------------------------
Running gov.uspto.document.model.classification.CpCClassificationTest
[0/D, 1/D/D07, 2/D/D07/D07B, 3/D/D07/D07B/D07B2201, 4/D/D07/D07B/D07B2201/D07B22012051]
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.072 sec
Running gov.uspto.document.model.classification.IpcClassificationTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.document.model.classification.UspcClassificationTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.051 sec
Running gov.uspto.document.model.DocumentIdTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.097 sec
Running gov.uspto.document.parser.dom4j.PatentParserTest
2016-12-01 13:37:31,065 WARN  [       main] UNDEFINED1234567     AbstractTextNode - Patent does not have an Abstract.
2016-12-01 13:37:31,069 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2016-12-01 13:37:31,100 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2016-12-01 13:37:31,118 WARN  [       main] US7654321            AbstractTextNode - Patent does not have an Abstract
2016-12-01 13:37:31,118 WARN  [       main] US7654321            DescriptionNode - Patent does not have a Description.
2016-12-01 13:37:31,128 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2016-12-01 13:37:31,137 WARN  [       main] US3930584            ApplicationIdNode - Invalid document-id can not be Null.
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.296 sec
Running gov.uspto.patent.doc.greenbook.FormattedTextTest
Tests run: 3, Failures: 2, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec <<< FAILURE!
normalizedSimpleHtml(gov.uspto.patent.doc.greenbook.FormattedTextTest)  Time elapsed: 0.007 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:<...2>SECTION TITLE</h2>[ 
<p>Paragraph text, referenceing <a class="figref">FIG. 1</a> is a side elevational view</p> 
<p>More text now referenceing <a class="figref">FIG. 2B</a> is a top view</p>]> but was:<...2>SECTION TITLE</h2>[
<p>Paragraph text, referenceing <a class="figref">FIG. 1</a> is a side elevational view</p>
<p>More text now referenceing <a class="figref">FIG. 2B</a> is a top view</p>
]>
	at org.junit.Assert.assertEquals(Assert.java:115)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at gov.uspto.patent.doc.greenbook.FormattedTextTest.normalizedSimpleHtml(FormattedTextTest.java:82)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

createRefs(gov.uspto.patent.doc.greenbook.FormattedTextTest)  Time elapsed: 0.001 sec  <<< FAILURE!
org.junit.ComparisonFailure: expected:<...ss="figref">FIG. 1(a[)</a>] is> but was:<...ss="figref">FIG. 1(a[</a>)] is>
	at org.junit.Assert.assertEquals(Assert.java:115)
	at org.junit.Assert.assertEquals(Assert.java:144)
	at gov.uspto.patent.doc.greenbook.FormattedTextTest.createRefs(FormattedTextTest.java:62)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

Running gov.uspto.patent.doc.greenbook.GreenbookTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.096 sec
Running gov.uspto.patent.doc.greenbook.items.DescriptionFiguresTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.doc.pap.PatentAppPubParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.206 sec
Running gov.uspto.patent.doc.sgml.SgmlTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.147 sec
Running gov.uspto.patent.doc.xml.ApplicationParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.225 sec
Running gov.uspto.patent.doc.xml.FormattedTextTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.01 sec
Running gov.uspto.patent.doc.xml.GrantParserTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.548 sec
Running gov.uspto.patent.mathml.MathMLTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running gov.uspto.patent.model.entity.NamePersonTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.serialize.JsonMapperTest
{"patentCorpus":"PGPUB","patentType":"UTILITY","productionDate":{"raw":"20160101","iso":"2016-01-01T00:00:00Z"},"publishedDate":{"raw":"20160202","iso":"2016-02-02T00:00:00Z"},"documentId":"US123456789","documentDate":{"raw":"","iso":""},"applicationId":"","applicationDate":{"raw":"","iso":""},"relatedIds":[],"otherIds":[],"agent":[],"applicant":[],"inventors":[{"sequence":"","name":{"type":"person","raw":"Inventee, Bob","prefix":"","firstName":"Bob","middleName":"","lastName":"Inventee","suffix":"","abbreviated":"Inventee, B.","synonyms":[]},"address":{"street":"123 Main St","city":"Alexandria","state":"VA","zipCode":"22314","country":"US","email":"","fax":"","phone":""},"residency":"","nationality":""}],"assignees":[{"name":{"type":"org","raw":"Inventee Inc.","suffix":"","synonyms":[]},"address":{"street":"123 Main St","city":"Alexandria","state":"VA","zipCode":"22314","country":"US","email":"","fax":"","phone":""},"role":"","roleDefinition":""}],"examiners":[],"title":"Test Patent","abstract":{"raw":"This is the Abstract Section.","normalized":"   This is the Abstract Section. \n","plain":" This is the Abstract Section. "},"description":{"full_raw":"Drawing Desc Text\nRell App Desc Text\nBrief Summar Desc Text\nDetailed Description Text\n","REL_APP_DESC":{"raw":"Rell App Desc Text","normalized":"   Rell App Desc Text \n","plain":" Rell App Desc Text "},"DRAWING_DESC":{"raw":"Drawing Desc Text","normalized":"   Drawing Desc Text \n","plain":" Drawing Desc Text "},"BRIEF_SUMMARY":{"raw":"Brief Summar Desc Text","normalized":"   Brief Summar Desc Text \n","plain":" Brief Summar Desc Text "},"DETAILED_DESC":{"raw":"Detailed Description Text","normalized":"   Detailed Description Text \n","plain":" Detailed Description Text "}},"claims":[],"citations":[],"classification":{"ipc":[],"uspc":[],"cpc":[]}}
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.024 sec
Running gov.uspto.patent.validate.AbstractRuleTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.validate.ClaimRuleTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec
Running gov.uspto.patent.validate.ClassificationRuleTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.validate.DescriptionRuleTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.004 sec
Running gov.uspto.patent.validate.TitleRuleTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec

Results :

Failed tests:   normalizedSimpleHtml(gov.uspto.patent.doc.greenbook.FormattedTextTest): expected:<...2>SECTION TITLE</h2>[ (..)
  createRefs(gov.uspto.patent.doc.greenbook.FormattedTextTest): expected:<...ss="figref">FIG. 1(a[)</a>] is> but was:<...ss="figref">FIG. 1(a[</a>)] is>

Tests run: 67, Failures: 2, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] PatentPublicData ................................... SUCCESS [  3.146 s]
[INFO] Common ............................................. SUCCESS [  2.813 s]
[INFO] PatentDocument ..................................... FAILURE [  5.830 s]
[INFO] BulkDownloader ..................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 11.939 s
[INFO] Finished at: 2016-12-01T13:37:34-05:00
[INFO] Final Memory: 32M/385M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project PatentDocument: There are test failures.
[ERROR] 

TransformerCli throws an NPE on a number of Redbook SGML archives

pg010306.zip pg020924.zip pg021001.zip pg021231.zip pg030204.zip pg030930.zip pg031230.zip pg040803.zip pg040928.zip

Stack traces for a random sample of these follow:

2017-01-20 11:42:09,513 INFO  [main] TransformerCli - Record: 'US6456731B1' from pg020924.zip:3846	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapAssignees(JsonMapper.java:320)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:102)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)
	
2017-01-20 11:43:38,636 INFO  [main] TransformerCli - Record: 'US6500452B2' from pg021231.zip:1720	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapAssignees(JsonMapper.java:320)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:102)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)
	
2017-01-20 11:46:59,340 INFO  [main] TransformerCli - Record: 'US6771588B2' from pg040803.zip:2801	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapAssignees(JsonMapper.java:320)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:102)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)

46mb single patent file

After TransformerCli writes this patent to an individual JSON file, the file is 42 MB (other files are ~100 KB). It is filled with the following section of the JSON file (trimmed).

The Google copy of the patent looks "normal", without this type of data. Is this an error?

Year: 2010
file: ipg100105.zip
US7640662B2.json

   ```
 {
            "type":"main",
            "raw":"2989003-890054",
            "normalized":"029/100000000,100001000,100002000,100003000,100004000,100005000,100006000,100007000,100008000,100009000,100010000,100011000,100012000,100013000,100014000,100015000,100016000,100017000,100018000,100019000,100020000,100021000,100022000,100023000,100024000,100025000,100026000,100027000,100028000,100029000,100030000,10003100....trimmed......00,995450000,995460000,995470000,995480000,995490000,995500000,995510000,995520000,995530000,995540000,995550000,995560000,995570000,995580000,995590000,995600000,995610000,995620000,asdf",
            "facets":[
                "1/029/029287426000",
                "1/029/029267810000",
                "1/029/029148404000",
                "1/029/029484454000",
                "1/029/029365048000",
                "1/029/029345432000",

....trimmed....
   ```


TransformerCli throws an NPE and stops processing on 32 GreenBook archives

The 32 failing archives are:
pftaps19880426_wk17.zip pftaps19880830_wk35.zip pftaps19881004_wk40.zip pftaps19890214_wk07.zip pftaps19890627_wk26.zip pftaps19890801_wk31.zip pftaps19890815_wk33.zip pftaps19890829_wk35.zip pftaps19900130_wk05.zip pftaps19900925_wk39.zip pftaps19910212_wk07.zip pftaps19910409_wk15.zip pftaps19910917_wk38.zip pftaps19911008_wk41.zip pftaps19911231_wk53.zip pftaps19920324_wk12.zip pftaps19920714_wk28.zip pftaps19920721_wk29.zip pftaps19921020_wk42.zip pftaps19930518_wk20.zip pftaps19931130_wk48.zip pftaps19940322_wk12.zip pftaps19940524_wk21.zip pftaps19940705_wk27.zip pftaps19940809_wk32.zip pftaps19941227_wk52.zip pftaps19950328_wk13.zip pftaps19950404_wk14.zip pftaps19950815_wk33.zip pftaps19951017_wk42.zip pftaps19951212_wk50.zip pftaps20010508_wk19.zip

A stacktrace for a random sample of 5 of the 32 archives follows:

2017-01-20 11:09:43,093 INFO  [main] TransformerCli - Record: 'US4804179A' from pftaps19890214_wk07.zip:558	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapInventors(JsonMapper.java:336)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)
	
	
	
2017-01-20 11:10:57,420 INFO  [main] TransformerCli - Record: 'US4860908A' from pftaps19890829_wk35.zip:683	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapInventors(JsonMapper.java:336)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)
	
	
	
2017-01-20 11:12:18,532 INFO  [main] TransformerCli - Record: 'US4959248A' from pftaps19900925_wk39.zip:1036	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapInventors(JsonMapper.java:336)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)
	
	
2017-01-20 11:13:35,608 INFO  [main] TransformerCli - Record: 'US5376511A' from pftaps19941227_wk52.zip:1477	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapInventors(JsonMapper.java:336)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)
	
	
2017-01-20 11:15:50,493 INFO  [main] TransformerCli - Record: 'US6228925A' from pftaps20010508_wk19.zip:2566	
Exception in thread "main" java.lang.NullPointerException	
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapInventors(JsonMapper.java:336)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)
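
Because a single bad record stops the whole archive here, one possible mitigation is to isolate failures per record: catch the exception, log the record id, and continue. The sketch below only illustrates that pattern. `ContinueOnErrorDemo`, `Result`, and `processAll` are hypothetical names, and the serializer is a stand-in function, not the project's `JsonMapper`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch: process records one by one and keep going when one fails.
final class ContinueOnErrorDemo {

    // Collects what was written and which record ids failed.
    static final class Result {
        final List<String> written = new ArrayList<>();
        final List<String> failedIds = new ArrayList<>();
    }

    static Result processAll(List<String> recordIds, Function<String, String> serializer) {
        Result result = new Result();
        for (String id : recordIds) {
            try {
                result.written.add(serializer.apply(id));
            } catch (RuntimeException e) {
                // Log and continue instead of letting the exception end the run.
                System.err.println("Skipping record " + id + ": " + e);
                result.failedIds.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in serializer that fails on one id, the way a null name might.
        Function<String, String> serializer = id -> {
            if (id.equals("US4804179A")) {
                throw new NullPointerException("name is null");
            }
            return "{\"documentId\":\"" + id + "\"}";
        };
        Result r = processAll(Arrays.asList("US4804178A", "US4804179A", "US4804180A"), serializer);
        System.out.println("written=" + r.written.size() + " failed=" + r.failedIds);
    }
}
```

Recording the failed ids keeps the problem visible, so the underlying null name can still be investigated after the run completes.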

ClaimRuleTest nullClaimListFail test fails

-------------------------------------------------------------------------------
Test set: gov.uspto.patent.validate.ClaimRuleTest
-------------------------------------------------------------------------------
Tests run: 3, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.004 sec <<< FAILURE!
nullClaimListFail(gov.uspto.patent.validate.ClaimRuleTest)  Time elapsed: 0 sec  <<< ERROR!
java.lang.NullPointerException
	at gov.uspto.patent.model.Patent.setClaim(Patent.java:157)
	at gov.uspto.patent.validate.ClaimRuleTest.nullClaimListFail(ClaimRuleTest.java:32)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
	at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
	at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)

Test Failure in PatentAppPubParserTest

I cloned the PatentPublicData repository today, and the build failed because one test failed:

Running gov.uspto.patent.doc.pap.PatentAppPubParserTest
!! This Patent.class method is returning null: 'getDocumentDate()' for Patent id: US20010000943A1
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.314 sec <<< FAILURE!

TransformerCli throws an NPE when processing ipa090702.zip

The following stacktrace results:

2017-01-11 14:20:39,877 INFO  [main] TransformerCli - Record: 'US20090165183A1' from ipa090702.zip:1
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.model.classification.PatentClassification.compareTo(PatentClassification.java:144)
	at gov.uspto.patent.model.classification.PatentClassification.compareTo(PatentClassification.java:15)
	at java.util.TreeMap.compare(TreeMap.java:1294)
	at java.util.TreeMap.put(TreeMap.java:538)
	at java.util.TreeSet.add(TreeSet.java:255)
	at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1548)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
	at gov.uspto.patent.model.classification.PatentClassification.filter(PatentClassification.java:150)
	at gov.uspto.patent.model.classification.PatentClassification.filterByType(PatentClassification.java:155)
	at gov.uspto.patent.serialize.JsonMapper.mapClassifications(JsonMapper.java:138)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:113)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:215)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:199)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

Thanks.
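
The trace shows the NPE surfacing inside `compareTo` while a `TreeSet` is being filled, which suggests a field used for ordering is null. A minimal sketch of a null-tolerant ordering follows. It assumes a single nullable text field; `NullSafeCompareDemo` and `Cls` are hypothetical and do not reflect the real `PatentClassification`.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Hypothetical sketch: a Comparable whose sort key may be null.
final class NullSafeCompareDemo {

    static final class Cls implements Comparable<Cls> {
        final String text; // may be null for incomplete records

        Cls(String text) {
            this.text = text;
        }

        // Null keys sort first instead of throwing a NullPointerException.
        private static final Comparator<Cls> ORDER =
                Comparator.comparing((Cls c) -> c.text,
                        Comparator.nullsFirst(Comparator.<String>naturalOrder()));

        @Override
        public int compareTo(Cls other) {
            return ORDER.compare(this, other);
        }
    }

    public static void main(String[] args) {
        TreeSet<Cls> set = new TreeSet<>();
        set.add(new Cls("B01"));
        set.add(new Cls(null)); // would throw if compareTo called text.compareTo(...) directly
        set.add(new Cls("A01"));
        for (Cls c : set) {
            System.out.println(c.text);
        }
    }
}
```

Whether null keys should sort first, sort last, or be filtered out before the set is built is a design choice this sketch does not settle.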

TransformerCli throws an NPE and stops processing with pftaps19860826_wk34.zip

2017-01-20 11:03:44,700 INFO  [main] TransformerCli - Record: 'US4607439A' from pftaps19860826_wk34.zip:176
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapInventors(JsonMapper.java:336)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:217)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:201)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)

TransformerCli throws NumberFormatException and stops processing with ipg060110.zip

2017-01-12 13:18:22,562 INFO  [main] TransformerCli - Record: 'US6985272B2' from ipg060110.zip:1903
Exception in thread "main" java.lang.NumberFormatException: For input string: ""
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Integer.parseInt(Integer.java:592)
	at java.lang.Integer.valueOf(Integer.java:766)
	at gov.uspto.patent.model.classification.IpcClassification.standardize(IpcClassification.java:123)
	at gov.uspto.patent.model.classification.IpcClassification.toString(IpcClassification.java:288)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at java.util.AbstractCollection.toString(AbstractCollection.java:462)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at gov.uspto.patent.model.Patent.toString(Patent.java:429)
	at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:168)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:74)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)
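
The exception message (`For input string: ""`) shows that `Integer.valueOf` received an empty string. One way to tolerate that is a small parse helper that returns null for blank or non-numeric input. The helper below is a hypothetical sketch, not the code in `IpcClassification.standardize`.

```java
// Hypothetical sketch: parse an integer field that may be empty in the source data.
final class SafeParseDemo {

    // Returns null instead of throwing when the input is null, blank, or not a number.
    static Integer parseOrNull(String s) {
        if (s == null) {
            return null;
        }
        String trimmed = s.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        try {
            return Integer.valueOf(trimmed);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrNull(""));     // empty field, as in the trace above
        System.out.println(parseOrNull(" 12 "));
        System.out.println(parseOrNull("1a"));
    }
}
```

The caller then has to decide what a missing value means for the standardized output, for example omitting that part of the classification.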

Cannot parse 1980 Grant to JSON - null pointer

A freshly downloaded (with BulkDownloader) file from 1980 crashes TransformerCli with a NullPointerException:

$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/* gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/1980/pftaps19800101_wk01.zip" --outdir="BulkDownloader/download/grants/1980/expanded/" --limit=10
log4j:WARN No appenders could be found for logger (gov.uspto.patent.TransformerCli).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.bulk.DumpFile.close(DumpFile.java:82)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:192)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:115)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:270)
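
An NPE thrown from `close()` often means the underlying reader was never opened, for instance because opening failed earlier. That is only a guess about `DumpFile`. The sketch below shows a `close()` that is safe to call in that state; `GuardedCloseDemo` is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical sketch: a resource whose close() tolerates "never opened".
final class GuardedCloseDemo implements AutoCloseable {

    private BufferedReader reader; // stays null until open() succeeds

    void open(String content) {
        reader = new BufferedReader(new StringReader(content));
    }

    boolean isOpen() {
        return reader != null;
    }

    @Override
    public void close() {
        if (reader == null) {
            return; // nothing to close, so no NullPointerException
        }
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            reader = null; // makes repeated close() calls harmless
        }
    }

    public static void main(String[] args) {
        try (GuardedCloseDemo dump = new GuardedCloseDemo()) {
            // Never opened: close() still runs at the end of this block without throwing.
            System.out.println("open? " + dump.isOpen());
        }
    }
}
```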

SAX error 2001 Application

Running TransformerCli on the 2001 application file pa010322.zip and the 2002 application file pa020103.zip produces the following fatal error:

2017-01-08 09:22:17,666 ERROR [       main] pa010322.zip         TransformerCli - Patent Reader error: 
gov.uspto.patent.PatentReaderException: Failed to load XML
	at gov.uspto.patent.PatentReader.getJDOM(PatentReader.java:94)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:75)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:167)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:277)
Caused by: org.dom4j.DocumentException: Error on line 372 of document  : The entity "ldquo" was referenced, but not declared. Nested exception: The entity "ldquo" was referenced, but not declared.
	at org.dom4j.io.SAXReader.read(SAXReader.java:482)
	at org.dom4j.io.SAXReader.read(SAXReader.java:365)
	at gov.uspto.patent.PatentReader.getJDOM(PatentReader.java:92)
	... 4 more
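
XML predefines only five named entities (`amp`, `lt`, `gt`, `apos`, `quot`), so `&ldquo;` must be declared by a DTD before a parser will accept it. One general way to handle this is to supply the declarations in an internal DTD subset. The sketch below uses the JDK's built-in DOM parser rather than dom4j, declares just two entities, and assumes the input has no XML declaration or DOCTYPE of its own. `EntityDeclDemo` and its methods are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: declare missing named entities in an internal DTD subset.
final class EntityDeclDemo {

    // Prepends a DOCTYPE that declares the entities the document references.
    // Assumes xmlBody has no XML declaration or DOCTYPE of its own.
    static String withEntityDecls(String xmlBody, String rootName) {
        return "<!DOCTYPE " + rootName + " [\n"
                + "  <!ENTITY ldquo \"&#8220;\">\n"
                + "  <!ENTITY rdquo \"&#8221;\">\n"
                + "]>\n"
                + xmlBody;
    }

    // Parses the XML and returns the text content of the root element.
    static String parseText(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Failed to load XML", e);
        }
    }

    public static void main(String[] args) {
        String body = "<p>&ldquo;quoted&rdquo;</p>";
        System.out.println(parseText(withEntityDecls(body, "p")));
    }
}
```

A full fix would need the complete entity set used by the application files, which normally comes from the DTD that accompanies that format.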

Issue in PctRegionalIdNode

When TransformerCli processes the 1981 grant file pftaps19811006_wk40.zip to generate pftaps19811006_wk40.bulk, a number of patents raise errors in the function below, at the setDate lines "filindDocId.setDate(filingDate);" and "pubDocId.setDate(pubDate);".

As a workaround, I have modified the function as follows: it adds a general exception catch and skips adding to docIds when the resulting filindDocId or pubDocId is null.

    public List<DocumentId> read() {
        List<DocumentId> docIds = new ArrayList<DocumentId>();

        Node pctGroupN = document.selectSingleNode(FRAGMENT_PATH);
        if (pctGroupN == null) {
            return docIds;
        }

        Node pctFilingIdN = pctGroupN.selectSingleNode("PCN");
        if (pctFilingIdN != null) {
            DocumentId filindDocId = buildDocId(pctFilingIdN.getText());

            Node pctFilingDateN = pctGroupN.selectSingleNode("PD3"); // filing date.
            if (pctFilingDateN != null) {
                try {
                    DocumentDate filingDate = new DocumentDate(pctFilingDateN.getText());
                    filindDocId.setDate(filingDate);
                } catch (InvalidDataException e) {
                    LOGGER.warn("Invalid PCT Filing Date: {}", pctFilingDateN.getText());
                } catch (Exception e) {
                    LOGGER.warn("filindDocId: {}", pctFilingDateN.getText());
                }
            }
            if (filindDocId != null) {
                docIds.add(filindDocId);
            }
        }

        Node pctPubIdN = pctGroupN.selectSingleNode("PCP");
        if (pctPubIdN != null) {
            DocumentId pubDocId = buildDocId(pctPubIdN.getText());

            Node pctPubDateN = pctGroupN.selectSingleNode("PCD"); // publication date
            if (pctPubDateN != null) {
                try {
                    DocumentDate pubDate = new DocumentDate(pctPubDateN.getText());
                    pubDocId.setDate(pubDate);
                } catch (InvalidDataException e) {
                    LOGGER.warn("Invalid PCT Publication Date: {}", pctPubDateN.getText());
                } catch (Exception ex) {
                    LOGGER.warn("pubDocId: {}", pctPubIdN.getText());
                }
            }
            if (pubDocId != null ) {
                docIds.add(pubDocId);
            }
        }

        return docIds;
    }

build fails with errors like `incompatible types: java.util.List<capture#8 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>`

Running mvn compile on a clean clone of this project results in the following errors (truncated for brevity) with both JDK 1.7 and JDK 1.8, on both OpenJDK and the Oracle JDK, on Ubuntu 16.04:

[INFO] 8 errors 
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] PatentPublicData ................................... SUCCESS [  0.002 s]
[INFO] Common ............................................. SUCCESS [  0.384 s]
[INFO] PatentDocument ..................................... SUCCESS [  0.042 s]
[INFO] BulkDownloader ..................................... FAILURE [  0.835 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1.355 s
[INFO] Finished at: 2016-10-04T17:28:32-04:00
[INFO] Final Memory: 18M/356M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.5.1:compile (default-compile) on project BulkDownloader: Compilation failure: Compilation failure:
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationPatent.java:[42,53] incompatible types: java.util.List<capture#1 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationPatent.java:[43,54] incompatible types: java.util.List<capture#2 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationPatent.java:[66,74] incompatible types: java.util.List<capture#3 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationPatent.java:[81,75] incompatible types: java.util.List<capture#4 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationXPath.java:[41,75] incompatible types: java.util.List<capture#5 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationXPath.java:[49,76] incompatible types: java.util.List<capture#6 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationXPathSGML.java:[39,75] incompatible types: java.util.List<capture#7 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>
[ERROR] /home/abucci/dev/USPTO/PatentPublicData/BulkDownloader/src/main/java/gov/uspto/bulkdata/corpusbuilder/MatchClassificationXPathSGML.java:[47,76] incompatible types: java.util.List<capture#8 of ? extends gov.uspto.patent.model.classification.Classification> cannot be converted to java.util.List<gov.uspto.patent.model.classification.Classification>

I had a look at the error on line 42 as an exemplar. It looks like MatchClassificationPatent.setup() is passing a List<Classification> into public static List<? extends Classification> getByType(Collection<? extends Classification> classes, ClassificationType type), declared in Classification. This call returns List<? extends Classification>, which setup() attempts to assign to a variable declared as List<Classification>. The compiler rejects this because a wildcard list is not assignable to a list of the concrete type. The error on line 66 looks similar. I have not looked through all the errors, but I imagine they stem from the same Java generics issue.
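
The mismatch described above can be reproduced in isolation. The sketch below uses a hypothetical stand-in `Classification` class and a `String` type tag instead of the project's `ClassificationType`. It shows two call-site options that compile: keep the wildcard type, or copy the result into a concrete list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for the project's classification type.
class Classification {
    final String type;
    final String code;

    Classification(String type, String code) {
        this.type = type;
        this.code = code;
    }
}

final class WildcardDemo {

    // Same shape as the signature quoted above: the return type is a wildcard list.
    static List<? extends Classification> getByType(
            Collection<? extends Classification> classes, String type) {
        List<Classification> matches = new ArrayList<>();
        for (Classification c : classes) {
            if (c.type.equals(type)) {
                matches.add(c);
            }
        }
        return matches;
    }

    // Option 1: keep the wildcard at the call site (enough for read-only use).
    static int countByType(Collection<? extends Classification> classes, String type) {
        List<? extends Classification> found = getByType(classes, type);
        return found.size();
    }

    // Option 2: copy into a concrete List<Classification>.
    static List<Classification> copyByType(
            Collection<? extends Classification> classes, String type) {
        return new ArrayList<Classification>(getByType(classes, type));
    }

    public static void main(String[] args) {
        List<Classification> all = Arrays.asList(
                new Classification("CPC", "H04L"),
                new Classification("USPC", "029/592"),
                new Classification("CPC", "G06F"));

        // List<Classification> bad = getByType(all, "CPC"); // does not compile: incompatible types

        System.out.println(countByType(all, "CPC"));
        System.out.println(copyByType(all, "USPC").size());
    }
}
```

A third option would be to change the method to a generic signature that returns a list of the caller's element type, but that would alter the project's API.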

Fatal errors processing full files

Processing samples of full files from 1980, 1990, 2000 and 2010 resulted in fatal errors after hundreds of files (earlier testing used limit=10). Three different errors occurred (based on the dumps). The specific files (downloaded for the cited year with limit=2) are listed below. One example of each error type is included in the post.

Specific files:
1980: pftaps19800101_wk01.zip Error type 1
1990: pftaps19900102_wk01.zip Error type 1
2000: pftaps20000104_wk01.zip Error type 2
2010: ipg100105.zip Error type 3

patrick$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip" --outBulk=false --outdir="BulkDownloader/download/grants/1990/expanded/"
2016-11-08 21:49:27,321 INFO  [main] TransformerCli - --- Start ---
2016-11-08 21:49:27,356 INFO  [main] TransformerCli - Dump File[1]: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip
2016-11-08 21:49:27,358 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: Greenbook
2016-11-08 21:49:27,361 INFO  [main] ZipReader - Reading zip file: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip
2016-11-08 21:49:27,410 INFO  [main] ZipReader - Found 1 file[FileFilter [matchRules=[]]]: pftaps19900102_wk01.txt
2016-11-08 21:49:27,412 INFO  [main] PatentDocFormatDetect - PatentType fromContent: Greenbook
2016-11-08 21:49:27,610 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,715 INFO  [main] TransformerCli - Record: 'US305275' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:1
2016-11-08 21:49:27,776 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,780 INFO  [main] TransformerCli - Record: 'US305277' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:2
2016-11-08 21:49:27,812 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,815 INFO  [main] TransformerCli - Record: 'US305279' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:3
2016-11-08 21:49:27,838 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,841 INFO  [main] TransformerCli - Record: 'US305281' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:4
2016-11-08 21:49:27,858 WARN  [main] AbstractTextNode - Patent does not have an Abstract
(trimmed)
2016-11-08 21:49:41,038 INFO  [main] TransformerCli - Record: 'US4891829' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:806
2016-11-08 21:49:41,064 INFO  [main] TransformerCli - Record: 'US4891831' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:807
2016-11-08 21:49:41,075 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G21K  104' from : <CLAS><OCL>378145</OCL><XCL>378146</XCL><XCL>378151</XCL><XCL>378152</XCL><EDF>4</EDF><ICL>G21K  104</ICL><FSC>378</FSC><FSS>8;19;145-147;95;150;151;152;205</FSS></CLAS>
2016-11-08 21:49:41,077 INFO  [main] TransformerCli - Record: 'US4891833' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:808
2016-11-08 21:49:41,086 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H04M  164' from : <CLAS><OCL>379 88</OCL><XCL>379 45</XCL><XCL>379 51</XCL><XCL>379 73</XCL><EDF>4</EDF><ICL>H04M  164</ICL><FSC>379</FSC><FSS>41;45;48;51;67;84;88;89;110;73</FSS><FSC>369</FSC><FSS>2;14;24;83</FSS><FSC>360</FSC><FSS>32</FSS></CLAS>
2016-11-08 21:49:41,088 INFO  [main] TransformerCli - Record: 'US4891835' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:809
2016-11-08 21:49:41,113 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H04M  160' from : <CLAS><OCL>379390</OCL><XCL>379388</XCL><XCL>381106</XCL><EDF>4</EDF><ICL>H04M  160</ICL><FSC>381</FSC><FSS>68;684;94;96;102;106;107;111;72;113</FSS><FSC>379</FSC><FSS>388;389;390;395;387</FSS></CLAS>
2016-11-08 21:49:41,114 INFO  [main] TransformerCli - Record: 'US4891837' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:810
2016-11-08 21:49:41,133 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H04S  300' from : <CLAS><OCL>381 22</OCL><EDF>4</EDF><ICL>H04S  300</ICL><FSC>381</FSC><FSS>17;18;19;20;21;22;23</FSS></CLAS>
2016-11-08 21:49:41,135 INFO  [main] TransformerCli - Record: 'US4891839' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:811
2016-11-08 21:49:41,180 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H03G  500' from : <CLAS><OCL>351 98</OCL><XCL>333 28T</XCL><EDF>4</EDF><ICL>H03G  500</ICL><FSC>328</FSC><FSS>28 T;28 R</FSS><FSC>381</FSC><FSS>98;103</FSS></CLAS>
2016-11-08 21:49:41,181 INFO  [main] TransformerCli - Record: 'US4891841' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:812
2016-11-08 21:49:41,190 INFO  [main] TransformerCli - Record: 'US4891843' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:813
Exception in thread "main" java.util.NoSuchElementException
	at gov.uspto.patent.bulk.DumpFileAps.read(DumpFileAps.java:50)
	at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:92)
	at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:23)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:160)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:276)
patrick$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip" --outBulk=false --outdir="BulkDownloader/download/grants/2000/expanded/"
2016-11-08 21:50:34,345 INFO  [main] TransformerCli - --- Start ---
2016-11-08 21:50:34,379 INFO  [main] TransformerCli - Dump File[1]: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip
2016-11-08 21:50:34,382 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: Greenbook
2016-11-08 21:50:34,386 INFO  [main] ZipReader - Reading zip file: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip
2016-11-08 21:50:34,418 INFO  [main] ZipReader - Found 1 file[FileFilter [matchRules=[]]]: pftaps20000104_wk01.txt
2016-11-08 21:50:34,420 INFO  [main] PatentDocFormatDetect - PatentType fromContent: Greenbook
2016-11-08 21:50:34,599 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0205' from : <CLAS><OCL>D 2602</OCL><XCL>D2608</XCL><XCL>D2627</XCL><EDF>6</EDF><ICL>0205</ICL><FSC>D 2</FSC><FSS>600;602;608;609;624;627;639;859;891</FSS><FSC>2</FSC><FSS>236;237;232;269;338;124;125;115</FSS></CLAS>
2016-11-08 21:50:34,608 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,709 INFO  [main] TransformerCli - Record: 'US418273' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1
2016-11-08 21:50:34,779 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0207' from : <CLAS><OCL>D 2639</OCL><EDF>6</EDF><ICL>0207</ICL><FSC>D11</FSC><FSS>200-201;261;55-66</FSS><FSC>D 2</FSC><FSS>639</FSS><FSC>297</FSC><FSS>482</FSS><FSC>24</FSC><FSS>633</FSS><FSC>D21</FSC><FSS>604</FSS><FSC>428</FSC><FSS>100</FSS></CLAS>
2016-11-08 21:50:34,782 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,784 INFO  [main] TransformerCli - Record: 'US418275' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:2
2016-11-08 21:50:34,810 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0202' from : <CLAS><OCL>D 2743</OCL><EDF>6</EDF><ICL>0202</ICL><FSC>D 2</FSC><FSS>728;743;746;753;830;838</FSS><FSC>D 5</FSC><FSS>62</FSS><FSC>2</FSC><FSS>69;94;1;40;227</FSS></CLAS>
2016-11-08 21:50:34,813 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,818 INFO  [main] TransformerCli - Record: 'US418277' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:3
2016-11-08 21:50:34,841 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0203' from : <CLAS><OCL>D 2882</OCL><EDF>6</EDF><ICL>0203</ICL><FSC>D 2</FSC><FSS>865;866;872;876;879;882;883;884;886;887</FSS><FSC>D29</FSC><FSS>102;104</FSS><FSC>2</FSC><FSS>171;181;195.1;412;419;209.13</FSS></CLAS>
2016-11-08 21:50:34,850 WARN  [main] AbstractTextNode - Patent does not have an Abstract
(Trimmed)
2016-11-08 21:50:58,508 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G02F  11335' from : <CLAS><OCL>349113</OCL><XCL>349139</XCL><EDF>6</EDF><ICL>G02F  11335</ICL><ICL>G02F  11343</ICL><FSC>349</FSC><FSS>113;149;151;152;139</FSS></CLAS>
2016-11-08 21:50:58,509 INFO  [main] TransformerCli - Record: 'US6011605' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1224
2016-11-08 21:50:58,521 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G02F  11339' from : <CLAS><OCL>349153</OCL><XCL>349151</XCL><EDF>6</EDF><ICL>G02F  11339</ICL><FSC>349</FSC><FSS>149;151;153</FSS></CLAS>
2016-11-08 21:50:58,522 INFO  [main] TransformerCli - Record: 'US6011607' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1225
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.substring(String.java:1963)
	at gov.uspto.patent.model.entity.NamePerson.getAbbreviatedName(NamePerson.java:75)
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:370)
	at gov.uspto.patent.serialize.JsonMapper.mapAssignees(JsonMapper.java:319)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:68)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:56)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:203)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:188)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:276)
patrick$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/2010/ipg100105.zip" --outBulk=false --outdir="BulkDownloader/download/grants/2010/expanded/"
2016-11-08 21:51:36,523 INFO  [main] TransformerCli - --- Start ---
2016-11-08 21:51:36,560 INFO  [main] TransformerCli - Dump File[1]: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip
2016-11-08 21:51:36,562 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: RedbookGrant
2016-11-08 21:51:36,569 INFO  [main] ZipReader - Reading zip file: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip
2016-11-08 21:51:36,601 INFO  [main] ZipReader - Found 1 file[FileFilter [matchRules=[SuffixFilter [suffixes=[xml]]]]]: ipg100105.xml
2016-11-08 21:51:36,604 INFO  [main] PatentDocFormatDetect - PatentType fromContent: RedbookGrant
2016-11-08 21:51:37,205 INFO  [main] TransformerCli - Record: 'USD0607176S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1
2016-11-08 21:51:37,351 INFO  [main] TransformerCli - Record: 'USD0607177S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:2
2016-11-08 21:51:37,463 INFO  [main] TransformerCli - Record: 'USD0607178S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:3
2016-11-08 21:51:37,601 INFO  [main] TransformerCli - Record: 'USD0607179S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:4
2016-11-08 21:51:37,738 INFO  [main] TransformerCli - Record: 'USD0607180S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:5

.......(trimmed)
2016-11-08 21:55:32,444 INFO  [main] TransformerCli - Record: 'US7642067B2' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1937
2016-11-08 21:55:32,555 INFO  [main] TransformerCli - Record: 'US7642068B2' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1938
2016-11-08 21:55:32,618 INFO  [main] TransformerCli - Record: 'US7642069B2' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1939
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.model.entity.NamePerson.getAbbreviatedName(NamePerson.java:75)
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:370)
	at gov.uspto.patent.serialize.JsonMapper.mapApplicant(JsonMapper.java:306)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:99)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:68)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:56)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:203)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:188)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:276)
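
Error types 2 and 3 both end in NamePerson.getAbbreviatedName (NamePerson.java:75). I have not checked the source, so this is only a guess: the traces would fit a first name that is empty (StringIndexOutOfBoundsException from substring) or null (NullPointerException). A defensive version could look roughly like the sketch below. NameAbbreviation and abbreviate are hypothetical names for illustration, and the "Last, F." output format is my assumption, not the project's.

```java
class NameAbbreviation {
    // Hypothetical helper, not the project's NamePerson code.
    // Builds "LastName, F." and tolerates a null or blank first or last name
    // instead of calling substring(0, 1) on it unconditionally.
    static String abbreviate(String firstName, String lastName) {
        String last = lastName == null ? "" : lastName.trim();
        String first = firstName == null ? "" : firstName.trim();
        if (first.isEmpty()) {
            // Nothing to abbreviate: fall back to the last name (possibly empty).
            return last;
        }
        String initial = first.substring(0, 1) + ".";
        return last.isEmpty() ? initial : last + ", " + initial;
    }

    public static void main(String[] args) {
        System.out.println(abbreviate("Jane", "Doe"));
        System.out.println(abbreviate("", "Doe"));
        System.out.println(abbreviate(null, "Doe"));
    }
}
```

With a guard like this, a record with an incomplete name would still be serialized instead of aborting the whole bulk file.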

Old vs. New Downloader

I am getting a 404 error when requesting 2000 or earlier using gov.uspto.bulkdata.cli2.BulkData. (see bottom)

I see the data formats change at this time, with 2001 in SGML and 2002+ in XML. Is one to use gov.uspto.bulkdata.cli.Download for years before 2001?

It appears Download does not take years as a parameter. Does it just download all years? It appears to have started downloading 2015 (file names ipa150101.zip, etc.)...

Thanks.

gov.uspto.bulkdata.cli2.BulkData --type=application --years=2000 --outdir="${PROJECTPATH}/download/patents2000"

Exception in thread "main" java.io.IOException: Unexpected server response code 404
at gov.uspto.bulkdata.PageLinkScraper.fetchLinks(PageLinkScraper.java:88)
at gov.uspto.bulkdata.PageLinkScraper.fetchLinks(PageLinkScraper.java:41)
at gov.uspto.bulkdata.cli2.BulkData.fetchLinks(BulkData.java:145)
at gov.uspto.bulkdata.cli2.BulkData.fetchLinks(BulkData.java:152)
at gov.uspto.bulkdata.cli2.BulkData.download(BulkData.java:73)
at gov.uspto.bulkdata.cli2.BulkData.main(BulkData.java:243)

CLI

Sorry, I am having trouble making sense of the Example Usage for the BulkDownloader.

How is it called? How are the arguments passed? I have tried many forms of "java -cp ..." and "java -jar ..." with no luck.

stopwords.txt file is missing?

Hi, when I run the tests, I get

Tests in error: 
  parsePatentSGML(gov.uspto.document.parser.dom4j.PatentParserTest)
  parsePatentApplication(gov.uspto.document.parser.dom4j.PatentParserTest)
  parsePatentPAP(gov.uspto.document.parser.dom4j.PatentParserTest)
  parsePatentGrant(gov.uspto.document.parser.dom4j.PatentParserTest)
  parsePatentGreenbook(gov.uspto.document.parser.dom4j.PatentParserTest)
  stopWordTrailing(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveTrailingString(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordLeadingLocation(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveTrailingStringArrayLocation(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveLeadingStringLocation(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordEdgeStringFalse(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordEdge(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordContainedStringArray(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordIsTest(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordEdgeStringTrue(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveLeadingString(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordContainedList(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordEdgeArrayFalse(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordEdgeArrayTrue(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveString(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveEdgeStringArray(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordLeading(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordTrailingStringFalse(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveArray(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordNotContainedString(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveEdgeString(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordContainedString(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveArrayLocation(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordTrailingStringTrue(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveTrailingStringLocation(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  fileExistCheck(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordTrailingArrayFalse(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveEdgeStringLocation(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveContainsStringLocationThusKill(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordRemoveTrailingStringArray(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt
  stopWordTrailingArrayTrue(gov.uspto.text.StopWordTest): StopWord file does not exist: src/main/resources/lst/stopwords.txt

Tests run: 52, Failures: 0, Errors: 36, Skipped: 0

Is this expected?
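
The message suggests the tests look for the list at a path relative to the working directory (src/main/resources/lst/stopwords.txt), so the result depends on where the tests are started from. As a rough sketch of an alternative, the list could be resolved from the classpath instead. StopWordLoader and its methods are hypothetical names, and the one-word-per-line file format is my assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

class StopWordLoader {
    // Reads one word per line (assumed format), skipping blank lines.
    // The caller owns the Reader and is responsible for closing it.
    static Set<String> read(Reader source) {
        Set<String> words = new LinkedHashSet<>();
        try {
            BufferedReader reader = new BufferedReader(source);
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    // Resolves the list from the classpath, e.g. "/lst/stopwords.txt", so the
    // lookup does not depend on the directory the JVM was started in.
    static Set<String> fromClasspath(String resourcePath) {
        InputStream in = StopWordLoader.class.getResourceAsStream(resourcePath);
        if (in == null) {
            throw new IllegalStateException(
                    "StopWord resource not found on classpath: " + resourcePath);
        }
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            return read(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This would not help if the file is simply missing from the repository, but it would rule out the working directory as the cause.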

TransformerCli throws an NPE and stops processing on pftaps19790619_wk25.zip

2017-01-18 14:00:24,305 INFO  [main] TransformerCli - Record: 'US4158475A' from pftaps19790619_wk25.zip:297
2017-01-18 14:00:24,309 WARN  [main] DocumentIdNode - Invalid document-id, field 'WKU' not found
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.doc.greenbook.Greenbook.parse(Greenbook.java:92)
	at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:49)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:70)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

Greenbook file + --limit option -> last file null

Command:

java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli
--input="/PatentPublicData/BulkDownloader/download/grants/1980/pftaps19800101_wk01.zip" --outBulk=false --outdir="/PatentPublicData/BulkDownloader/download/grants/1980/expanded/" --limit=10

The 10th file will have zero bytes.

The filenumber < filenumberlimit check appears to be faulty. I don't know where yet.
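
I have not located the faulty code, so the following is only an assumption about the cause: a zero-byte last file is the kind of symptom you get when the output file is created before the limit check (or before a record is actually available). A loop shaped like this sketch avoids it. LimitedWriter, writeRecords and the record-N.txt naming are hypothetical and not taken from TransformerCli.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;

class LimitedWriter {
    // Writes at most `limit` records, one file per record, and returns the count.
    // The output file is created only after the limit check has passed and a
    // record is in hand, so stopping early cannot leave an empty file behind.
    static int writeRecords(Iterator<String> records, Path outDir, int limit)
            throws IOException {
        int written = 0;
        while (written < limit && records.hasNext()) {
            String record = records.next();
            Path out = outDir.resolve("record-" + (written + 1) + ".txt");
            try (Writer writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
                writer.write(record);
            }
            written++;
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("limit-demo");
        Iterator<String> records = Arrays.asList("a", "b", "c").iterator();
        System.out.println(writeRecords(records, dir, 2) + " files in " + dir);
    }
}
```

Checking hasNext() in the loop condition also avoids calling next() on an exhausted source.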

Latest push not compiling

Running:

git clone https://github.com/USPTO/PatentPublicData.git
and then:

mvn clean package

Results in the following compilation error:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO] 
[INFO] PatentPublicData
[INFO] Common
[INFO] PatentDocument
[INFO] BulkDownloader
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building PatentPublicData 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ PatentPublicData ---
[INFO] Deleting /Users/patrick/dev/repos/trashthis/PatentPublicData/target
[INFO] 
[INFO] >>> maven-javadoc-plugin:2.10.3:aggregate (aggregate) > generate-sources @ PatentPublicData >>>
[INFO] 
[INFO] <<< maven-javadoc-plugin:2.10.3:aggregate (aggregate) < generate-sources @ PatentPublicData <<<
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:aggregate (aggregate) @ PatentPublicData ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:jar (attach-javadocs) @ PatentPublicData ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-dependency-plugin:2.10:copy-dependencies (copy-dependencies) @ PatentPublicData ---
[INFO] 
[INFO] --- maven-assembly-plugin:2.2-beta-5:single (default) @ PatentPublicData ---
[INFO] Building zip: /Users/patrick/dev/repos/trashthis/PatentPublicData/target/PatentPublicData-0.0.1-SNAPSHOT.zip
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building Common 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ Common ---
[INFO] Deleting /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ Common ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] skip non existing resourceDirectory /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/src/main/resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ Common ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 24 source files to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/classes
[INFO] 
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ Common ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 1 resource
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile) @ Common ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 3 source files to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/test-classes
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/src/test/java/gov/uspto/common/text/StopWordTest.java: Some input files use or override a deprecated API.
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/src/test/java/gov/uspto/common/text/StopWordTest.java: Recompile with -Xlint:deprecation for details.
[INFO] 
[INFO] --- maven-surefire-plugin:2.12.4:test (default-test) @ Common ---
[INFO] Surefire report directory: /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/surefire-reports

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running gov.uspto.common.text.StopWordTest
Tests run: 31, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.249 sec
Running gov.uspto.common.text.StringCaseTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running gov.uspto.common.text.WordUtilTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.014 sec

Results :

Tests run: 44, Failures: 0, Errors: 0, Skipped: 0

[INFO] 
[INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ Common ---
[INFO] Building jar: /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/Common-0.0.1-SNAPSHOT.jar
[INFO] 
[INFO] >>> maven-javadoc-plugin:2.10.3:aggregate (aggregate) > generate-sources @ Common >>>
[INFO] 
[INFO] <<< maven-javadoc-plugin:2.10.3:aggregate (aggregate) < generate-sources @ Common <<<
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:aggregate (aggregate) @ Common ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-javadoc-plugin:2.10.3:jar (attach-javadocs) @ Common ---
[INFO] Skipping javadoc generation
[INFO] 
[INFO] --- maven-dependency-plugin:2.10:copy-dependencies (copy-dependencies) @ Common ---
[INFO] Copying commons-compress-1.11.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/commons-compress-1.11.jar
[INFO] Copying jopt-simple-5.0.2.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/jopt-simple-5.0.2.jar
[INFO] Copying commons-io-2.5.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/commons-io-2.5.jar
[INFO] Copying slf4j-api-1.7.21.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/slf4j-api-1.7.21.jar
[INFO] Copying slf4j-log4j12-1.7.21.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/slf4j-log4j12-1.7.21.jar
[INFO] Copying log4j-1.2.17.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/log4j-1.2.17.jar
[INFO] Copying guava-19.0.jar to /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/dependency-jars/guava-19.0.jar
[INFO] 
[INFO] --- maven-assembly-plugin:2.2-beta-5:single (default) @ Common ---
[INFO] Building zip: /Users/patrick/dev/repos/trashthis/PatentPublicData/Common/target/Common-0.0.1-SNAPSHOT.zip
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building PatentDocument 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ PatentDocument ---
[INFO] Deleting /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/target
[INFO] 
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ PatentDocument ---
[WARNING] Using platform encoding (UTF-8 actually) to copy filtered resources, i.e. build is platform dependent!
[INFO] Copying 17 resources
[INFO] 
[INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @ PatentDocument ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 210 source files to /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/target/classes
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/sgml/items/DescriptionFigures.java: Some input files use unchecked or unsafe operations.
[INFO] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/sgml/items/DescriptionFigures.java: Recompile with -Xlint:unchecked for details.
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/CountryCodeHistory.java:[116,61] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/NplCitation.java:[8,69] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/NplCitation.java:[8,81] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/NplCitation.java:[8,91] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[96,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[100,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[103,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[106,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[109,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
[INFO] 9 errors 
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] PatentPublicData ................................... SUCCESS [  6.167 s]
[INFO] Common ............................................. SUCCESS [  5.398 s]
[INFO] PatentDocument ..................................... FAILURE [  3.470 s]
[INFO] BulkDownloader ..................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 15.231 s
[INFO] Finished at: 2017-01-10T21:17:02+01:00
[INFO] Final Memory: 28M/272M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.5.1:compile (default-compile) on project PatentDocument: Compilation failure: Compilation failure:
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/CountryCodeHistory.java:[116,61] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/NplCitation.java:[8,69] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/NplCitation.java:[8,81] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/model/NplCitation.java:[8,91] unmappable character for encoding UTF-8
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[96,46] cannot find symbol
[ERROR] symbol:   method setKindCode(java.lang.String)
[ERROR] location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[100,46] cannot find symbol
[ERROR] symbol:   method setKindCode(java.lang.String)
[ERROR] location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[103,46] cannot find symbol
[ERROR] symbol:   method setKindCode(java.lang.String)
[ERROR] location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[106,46] cannot find symbol
[ERROR] symbol:   method setKindCode(java.lang.String)
[ERROR] location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] /Users/patrick/dev/repos/trashthis/PatentPublicData/PatentDocument/src/main/java/gov/uspto/patent/doc/greenbook/Greenbook.java:[109,46] cannot find symbol
[ERROR] symbol:   method setKindCode(java.lang.String)
[ERROR] location: variable publicationId of type gov.uspto.patent.model.DocumentId
[ERROR] -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :PatentDocument

TransformerCli throws PatentReaderException related to HTML entities with pg010102.zip

TransformerCli throws the following exception while processing pg010102.zip:

2017-01-12 12:59:56,303 ERROR [main] TransformerCli - Patent Reader error: 
gov.uspto.patent.PatentReaderException: Failed to Fix and Parse Docuemnt
	at gov.uspto.patent.PatentReader.fixTagsJDOM(PatentReader.java:124)
	at gov.uspto.patent.PatentReader.getJDOM(PatentReader.java:99)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:76)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)
Caused by: org.dom4j.DocumentException: Error on line 284 of document  : The entity "nbsp" was referenced, but not declared. Nested exception: The entity "nbsp" was referenced, but not declared.
	at org.dom4j.io.SAXReader.read(SAXReader.java:482)
	at org.dom4j.io.SAXReader.read(SAXReader.java:365)
	at gov.uspto.patent.PatentReader.fixTagsJDOM(PatentReader.java:122)
	... 5 more
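
The trace above points at an HTML named entity (`&nbsp;`) that an XML parser rejects when no DTD declares it. One possible workaround, sketched here under the assumption that the raw document text is available as a String before parsing, is to rewrite known HTML entities as numeric character references. The class `EntityFixer` and its small entity table are hypothetical and not part of PatentPublicData, and the sketch uses the JDK's built-in parser instead of dom4j.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of PatentPublicData.
final class EntityFixer {
    // Small illustrative table; a real fix would need the full HTML entity list.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", "&#160;");
        ENTITIES.put("&copy;", "&#169;");
        ENTITIES.put("&reg;", "&#174;");
    }

    // Replaces each known named entity with its numeric character reference.
    static String replaceHtmlEntities(String xml) {
        String fixed = xml;
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            fixed = fixed.replace(e.getKey(), e.getValue());
        }
        return fixed;
    }

    // Parses XML text with the JDK parser, wrapping checked exceptions.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Failed to parse document", e);
        }
    }

    public static void main(String[] args) {
        String raw = "<p>A&nbsp;B</p>"; // would fail to parse as-is
        Document doc = parse(replaceHtmlEntities(raw));
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

Numeric references need no declaration, so the document parses without a DTD. Entities missing from the table would still fail.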

TransformerCli throws NumberFormatException while processing pftaps19760309_wk10.zip

The stack trace is as follows (the exception stops processing of the remaining documents in the archive):

Exception in thread "main" java.lang.NumberFormatException: null
	at java.lang.Integer.parseInt(Integer.java:542)
	at java.lang.Integer.valueOf(Integer.java:766)
	at gov.uspto.patent.model.classification.IpcClassification.standardize(IpcClassification.java:121)
	at gov.uspto.patent.model.classification.IpcClassification.toString(IpcClassification.java:297)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at java.util.AbstractCollection.toString(AbstractCollection.java:462)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at gov.uspto.patent.model.Patent.toString(Patent.java:421)
	at gov.uspto.patent.doc.greenbook.Greenbook.parse(Greenbook.java:153)
	at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:49)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:71)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

Thanks!

TransformerCli hanging

TransformerCli hangs on the ipg120626.zip file, at patent US8207315B2.

I have not yet found the cause.

Last output line:

2017-01-25 15:36:04,812 INFO [ main] US8207315B2 TransformerCli - Record: 'US8207315B2' from /2012/ipg120626.zip:2479

Question: citation patent numbers

Brian,

A question: in 2001 the USPTO started publishing WIPO ST.16 kind codes on patents and "recommended" their use. It looks like additional codes were added in 2011. (https://www.uspto.gov/learning-and-resources/support-centers/electronic-business-center/kind-codes-included-uspto-patent)

In the JSON documents created by TransformerCli, is there referential integrity between the US patent citations in a patent and the other published US patents? That is, if a 2014 patent cites a 1990 patent in the citations collection, will the US patent number in citations[ an item ].text match the documentId of the referenced patent?

As an example, I see US7640606B1.json (2010) citation[ num=00001 ].text = "US329663A" (maybe too early for electronic publication) and citation[ num=00015 ].text = "US4391545A" (published 1983).

The ones I have spot checked appear to match. Is the integrity guaranteed?

Thanks,

Patrick

Downloader date range

BulkDownloader does not use inclusive comparisons on date ranges. To request all files for 1985, the intuitive parameter would be:

--date 19850101-19851231

However, BulkDownloader will miss the January 1 and December 31 files. One or the other problem (the 1st or the 31st) occurs at least in the years 1980, 1985, 1991, 1996, and 2002.

A cumbersome workaround is to widen the range by one day on each side:

--date 19841231-19860101

Clearly not an elegant solution.

The DateRange code is in gov.uspto.common.DateRange, but I am not sure what other code assumes non-inclusive comparisons.
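
For illustration, an inclusive comparison could look like the sketch below. `InclusiveDateRange` is a hypothetical stand-in written from scratch, not the project's gov.uspto.common.DateRange, and it assumes the `yyyyMMdd-yyyyMMdd` form of the `--date` parameter shown above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical stand-in for illustration; not gov.uspto.common.DateRange.
final class InclusiveDateRange {
    // BASIC_ISO_DATE parses the compact yyyyMMdd form.
    private static final DateTimeFormatter BASIC = DateTimeFormatter.BASIC_ISO_DATE;

    private final LocalDate start;
    private final LocalDate end;

    InclusiveDateRange(LocalDate start, LocalDate end) {
        if (end.isBefore(start)) {
            throw new IllegalArgumentException("End date is before start date");
        }
        this.start = start;
        this.end = end;
    }

    // Parses text such as "19850101-19851231".
    static InclusiveDateRange parse(String text) {
        String[] parts = text.split("-");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected yyyyMMdd-yyyyMMdd: " + text);
        }
        return new InclusiveDateRange(
                LocalDate.parse(parts[0], BASIC),
                LocalDate.parse(parts[1], BASIC));
    }

    // Inclusive on both ends: "not before start" means date >= start,
    // and "not after end" means date <= end.
    boolean contains(LocalDate date) {
        return !date.isBefore(start) && !date.isAfter(end);
    }

    public static void main(String[] args) {
        InclusiveDateRange year = parse("19850101-19851231");
        System.out.println(year.contains(LocalDate.of(1985, 1, 1)));
        System.out.println(year.contains(LocalDate.of(1985, 12, 31)));
    }
}
```

Writing the check as a negated `isBefore`/`isAfter` pair keeps both boundary dates inside the range, which a plain `isAfter(start) && isBefore(end)` test would exclude.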

name = null

While processing the 1978 grant file pftaps19780321_wk12.zip, the code crashes on a null name.

In JsonMapper.java, I have updated the mapName method to check for null. I don't know whether this is the optimal solution, but the file processes completely with this change.

private JsonObject mapName(Name name) {
    JsonObjectBuilder jsonObj = Json.createObjectBuilder();
    if (name instanceof NamePerson) {
        NamePerson perName = (NamePerson) name;
        jsonObj.add("type", "person");
        jsonObj.add("raw", valueOrEmpty(name.getName()));
        jsonObj.add("prefix", valueOrEmpty(perName.getPrefix()));
        jsonObj.add("firstName", valueOrEmpty(perName.getFirstName()));
        jsonObj.add("middleName", valueOrEmpty(perName.getMiddleName()));
        jsonObj.add("lastName", valueOrEmpty(perName.getLastName()));
        // Was getPrefix(), which copied the prefix into "suffix"; assumes NamePerson exposes getSuffix().
        jsonObj.add("suffix", valueOrEmpty(perName.getSuffix()));
        jsonObj.add("abbreviated", valueOrEmpty(perName.getAbbreviatedName()));
        jsonObj.add("synonyms", toJsonArray(perName.getSynonyms()));
    } else if (name != null) {
        // Null check added: a null name now yields an empty JSON object instead of crashing.
        NameOrg orgName = (NameOrg) name;
        jsonObj.add("type", "org");
        jsonObj.add("raw", valueOrEmpty(name.getName()));
        jsonObj.add("suffix", valueOrEmpty(orgName.getSuffix()));
        jsonObj.add("synonyms", toJsonArray(orgName.getSynonyms()));
    }
    return jsonObj.build();
}

Cannot parse Grant to json - null name/value pair

I am having difficulty parsing a freshly downloaded (via BulkDownloader) grant file; applications appear to parse to JSON without a problem. This occurs on files from multiple years. Any hints as I start searching with Eclipse?

(The log4j.properties file is in PatentPublicData/resources, but is apparently not found. Where should it be placed? I will open a separate issue if you prefer.)

(Also, the PatentDocument/README.md file lists the output parameter as --outDir=".." but the program only recognizes --outdir=".."; the docs need updating. Again, let me know if you'd like a separate issue opened.)

$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/* gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/2009/ipg090106.zip" --outdir="BulkDownloader/download/grants/2009/expanded/" --limit=10
log4j:WARN No appenders could be found for logger (gov.uspto.patent.TransformerCli).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.NullPointerException: Value in JsonObjects name/value pair cannot be null
	at org.glassfish.json.JsonObjectBuilderImpl.validateValue(JsonObjectBuilderImpl.java:164)
	at org.glassfish.json.JsonObjectBuilderImpl.add(JsonObjectBuilderImpl.java:74)
	at gov.uspto.patent.serialize.JsonMapper.mapExaminers(JsonMapper.java:330)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:67)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:55)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:197)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:182)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:115)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:270)
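
The exception message says a null value reached `JsonObjectBuilder.add`, which rejects nulls. A minimal sketch of the usual guard follows, assuming the missing examiner fields are plain strings. A `Map` stands in for the JSON builder so the example runs without the javax.json dependency, and the class and field names are hypothetical, not the project's actual examiner schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; a plain Map stands in for javax.json's JsonObjectBuilder.
final class NullSafeFields {
    // Maps null to "" so that no null value ever reaches the builder.
    static String valueOrEmpty(String value) {
        return value == null ? "" : value;
    }

    // Field names are illustrative only.
    static Map<String, String> examinerFields(String name, String department) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", valueOrEmpty(name));
        fields.put("department", valueOrEmpty(department));
        return fields;
    }

    public static void main(String[] args) {
        // A missing department becomes an empty string instead of null.
        System.out.println(examinerFields("Doe, Jane", null));
    }
}
```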

Compilation error with revision 4a31817

While attempting to build revision 4a31817, I get the following errors. I believe the "unmappable character" errors have not prevented successful compilation in the past.

gov/uspto/patent/model/NplCitation.java:[8,69] unmappable character for encoding UTF-8
gov/uspto/patent/model/NplCitation.java:[8,81] unmappable character for encoding UTF-8
gov/uspto/patent/model/NplCitation.java:[8,91] unmappable character for encoding UTF-8
gov/uspto/patent/model/CountryCodeHistory.java:[116,61] unmappable character for encoding UTF-8
gov/uspto/patent/doc/greenbook/Greenbook.java:[98,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
gov/uspto/patent/doc/greenbook/Greenbook.java:[102,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
gov/uspto/patent/doc/greenbook/Greenbook.java:[105,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
gov/uspto/patent/doc/greenbook/Greenbook.java:[108,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
gov/uspto/patent/doc/greenbook/Greenbook.java:[111,46] cannot find symbol
  symbol:   method setKindCode(java.lang.String)
  location: variable publicationId of type gov.uspto.patent.model.DocumentId
Thanks

TransformerCli throws NoSuchElementException while processing pg010102.zip

2017-01-10 17:08:20,011 INFO  [main] TransformerCli - --- Start ---
2017-01-10 17:08:20,041 INFO  [main] TransformerCli - Dump File[1]: pg010102.zip
2017-01-10 17:08:20,043 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: Sgml
2017-01-10 17:08:20,046 INFO  [main] ZipReader - Reading zip pg010102.zip
Exception in thread "main" java.util.NoSuchElementException
	at gov.uspto.common.file.archive.ZipReader.next(ZipReader.java:122)
	at gov.uspto.patent.bulk.DumpFile.open(DumpFile.java:65)
	at gov.uspto.patent.bulk.DumpFileXml.open(DumpFileXml.java:31)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:166)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

Dates question

What is the meaning of these date fields?

documentDate
publishedDate
productionDate

I presume documentDate is the day the patent was granted and publishedDate is when the application became public, but I have no idea what productionDate means.

Do these fields differ between grants and applications?

Thank you.

getting test failure 'readSamples(gov.uspto.patent.doc.greenbook.GreenbookTest): No match available' on maven clean install

When I run mvn clean install in the top-level PatentPublicData directory, I get the following failure:

-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running gov.uspto.document.model.classification.cpc.CpCClassificationTest
[0/D, 1/D/D07, 2/D/D07/D07B, 3/D/D07/D07B/D07B2201, 4/D/D07/D07B/D07B2201/D07B22012051]
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.08 sec
Running gov.uspto.document.model.classification.ipc.IpcClassificationTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.document.model.classification.uspc.UspcClassificationTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.063 sec
Running gov.uspto.document.model.DocumentIdTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.116 sec
Running gov.uspto.document.parser.dom4j.PatentParserTest
2016-11-20 10:29:17,191 WARN  [       main] UNDEFINED1234567     AbstractTextNode - Patent does not have an Abstract.
2016-11-20 10:29:17,193 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2016-11-20 10:29:17,232 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2016-11-20 10:29:17,246 WARN  [       main] US7654321            AbstractTextNode - Patent does not have an Abstract
2016-11-20 10:29:17,247 WARN  [       main] US7654321            DescriptionNode - Patent does not have a Description.
2016-11-20 10:29:17,255 WARN  [       main] UNDEFINED1234567     DescriptionNode - Patent does not have a Description.
2016-11-20 10:29:17,260 WARN  [       main] UNDEFINED1234567     DocumentIdNode - Invalid document-id can not be Null.
2016-11-20 10:29:17,261 WARN  [       main] UNDEFINED1234567     ApplicationIdNode - Invalid document-id can not be Null.
2016-11-20 10:29:17,266 WARN  [       main] UNDEFINED1234567     AbstractTextNode - Patent does not have an Abstract
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.336 sec
Running gov.uspto.patent.doc.greenbook.FormattedTextTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.005 sec
Running gov.uspto.patent.doc.greenbook.GreenbookTest
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.067 sec <<< FAILURE!
readSamples(gov.uspto.patent.doc.greenbook.GreenbookTest)  Time elapsed: 0.067 sec  <<< ERROR!
java.lang.IllegalStateException: No match available
	at java.util.regex.Matcher.end(Matcher.java:415)
	at gov.uspto.patent.doc.greenbook.items.DescriptionFigures.findFigures(DescriptionFigures.java:61)
	at gov.uspto.patent.doc.greenbook.items.DescriptionFigures.read(DescriptionFigures.java:40)
	at gov.uspto.patent.doc.greenbook.fragments.DescriptionNode.read(DescriptionNode.java:42)
	at gov.uspto.patent.doc.greenbook.Greenbook.parse(Greenbook.java:113)
	at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:37)
	at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:30)
	at gov.uspto.patent.doc.greenbook.GreenbookTest.readSamples(GreenbookTest.java:27)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:252)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:141)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:112)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:189)
	at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:165)
	at org.apache.maven.surefire.booter.ProviderFactory.invokeProvider(ProviderFactory.java:85)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:115)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:75)

Running gov.uspto.patent.doc.greenbook.items.DescriptionFiguresTest
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0 sec
Running gov.uspto.patent.doc.pap.PatentAppPubParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.257 sec
Running gov.uspto.patent.doc.sgml.SgmlTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.166 sec
Running gov.uspto.patent.doc.xml.ApplicationParserTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.282 sec
Running gov.uspto.patent.doc.xml.FormattedTextTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.018 sec
Running gov.uspto.patent.doc.xml.GrantParserTest
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.025 sec
Running gov.uspto.patent.mathml.MathMLTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec
Running gov.uspto.patent.model.entity.NamePersonTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.001 sec
Running gov.uspto.patent.serialize.JsonMapperTest
{"patentCorpus":"PGPUB","patentType":"UTILITY","productionDate":{"raw":"20160101","iso":"2016-01-01T00:00:00Z"},"publishedDate":{"raw":"20160202","iso":"2016-02-02T00:00:00Z"},"documentId":"US123456789","documentDate":{"raw":"","iso":""},"applicationId":"","applicationDate":{"raw":"","iso":""},"relatedIds":[],"otherIds":[],"agent":[],"applicant":[],"inventors":[{"sequence":"","name":{"type":"person","raw":"Inventee, Bob","prefix":"","firstName":"Bob","middleName":"","lastName":"Inventee","suffix":"","abbreviated":"Inventee, B.","synonyms":[]},"address":{"street":"123 Main St","city":"Alexandria","state":"VA","zipCode":"22314","country":"US","email":"","fax":"","phone":""},"residency":"","nationality":""}],"assignees":[{"name":{"type":"org","raw":"Inventee Inc.","suffix":"","synonyms":[]},"address":{"street":"123 Main St","city":"Alexandria","state":"VA","zipCode":"22314","country":"US","email":"","fax":"","phone":""},"role":"","roleDefinition":""}],"examiners":[],"title":"Test Patent","abstract":{"raw":"This is the Abstract Section.","normalized":"   This is the Abstract Section. \n","plain":"   This is the Abstract Section. \n"},"description":{"full_raw":"Drawing Desc Text\nRell App Desc Text\nBrief Summar Desc Text\nDetailed Description Text\n","REL_APP_DESC":{"raw":"Rell App Desc Text","normalized":"   Rell App Desc Text \n","plain":"   Rell App Desc Text \n"},"DRAWING_DESC":{"raw":"Drawing Desc Text","normalized":"   Drawing Desc Text \n","plain":"   Drawing Desc Text \n"},"BRIEF_SUMMARY":{"raw":"Brief Summar Desc Text","normalized":"   Brief Summar Desc Text \n","plain":"   Brief Summar Desc Text \n"},"DETAILED_DESC":{"raw":"Detailed Description Text","normalized":"   Detailed Description Text \n","plain":"   Detailed Description Text \n"}},"claims":[],"citations":[],"classification":{"ipc":[],"uspc":[],"cpc":[]}}
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.026 sec

Results :

Tests in error: 
  readSamples(gov.uspto.patent.doc.greenbook.GreenbookTest): No match available

Tests run: 51, Failures: 0, Errors: 1, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] PatentPublicData ................................... SUCCESS [ 19.928 s]
[INFO] Common ............................................. SUCCESS [  7.860 s]
[INFO] PatentDocument ..................................... FAILURE [  6.423 s]
[INFO] BulkDownloader ..................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 34.324 s
[INFO] Finished at: 2016-11-20T10:29:21-05:00
[INFO] Final Memory: 34M/524M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12.4:test (default-test) on project PatentDocument: There are test failures.

Do you have any advice on how to correct this?

Can't compile package

Hi!

I am not really a direct Maven user, so I might be doing something wrong, but when I run mvn compile I get:

[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO] 
[INFO] PatentPublicData
[INFO] BulkDownloader
[INFO] PatentDocument
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building PatentPublicData 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO]                                                                         
[INFO] ------------------------------------------------------------------------
[INFO] Building BulkDownloader 0.0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] PatentPublicData ................................... SUCCESS [  0.003 s]
[INFO] BulkDownloader ..................................... FAILURE [  0.525 s]
[INFO] PatentDocument ..................................... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.690 s
[INFO] Finished at: 2016-06-14T13:39:06+02:00
[INFO] Final Memory: 6M/150M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project BulkDownloader: Could not resolve dependencies for project gov.uspto:BulkDownloader:jar:0.0.1-SNAPSHOT: Could not find artifact gov.uspto:PatentPublicData:jar:0.0.1-SNAPSHOT -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :BulkDownloader

End of file error?

I am splitting the error types reported yesterday into separate issues.

I now have the code up and running in a debugger, but I am not yet familiar enough with the program flow and data to be confident of correct behaviors. However, I will help try to debug.

Regarding the error shown below, I am testing with this file: 1980 pftaps19800101_wk01.zip

It is crashing on an explicitly thrown exception on gov.uspto.patent.bulk.DumpFileAps.read(DumpFileAps.java:50).

As far as I can tell so far: this is thrown at the end of a line-by-line read of the input file, which is designed to break the file into individual blobs of text, each representing a single patent. Looking at the data, each patent is separated by a line containing PATN. This is the start tag. The code runs until it finds the next PATN and sends back the preceding accumulated text as the patent. Then it starts accumulating text for the next patent.

Since there is no end-of-file marker, when super.getReader().readLine() returns null the while loop exits, the last accumulated patent data is not returned, and the exception on line 50 is thrown. That is: it appears that (at least for files of this type; I haven't looked at other files yet) this code will ALWAYS crash on the last patent, and will NEVER return the last patent's text.

Commenting out line 50 and inserting a return of the accumulated text allows the file to return the last patent. The function is called one more time after that, so content.length() has to be checked; if it is zero, the function returns null to stop program flow. (Again, I don't know the data well enough to know whether this is correct.)

With the following change, this 1980 file completes.

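		// throw new NoSuchElementException();
		if (content.length() == 0) {
			// Nothing accumulated: signal the end of input.
			return null;
		}
		// End of file reached: return the last accumulated patent.
		return startTag + "\n" + content.toString();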
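
The record-splitting behavior described above can be sketched in isolation as follows. `PatnRecordReader` is a hypothetical class written for illustration, not the project's DumpFileAps, and it assumes that each record starts with a line containing only `PATN`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical reader for illustration; not the project's DumpFileAps.
final class PatnRecordReader {
    private static final String START_TAG = "PATN";
    private final BufferedReader reader;
    private boolean seenStart = false; // true once a PATN line opened a record

    PatnRecordReader(BufferedReader reader) {
        this.reader = reader;
    }

    // Returns the next record including its PATN line, or null when input is exhausted.
    String next() {
        StringBuilder content = new StringBuilder();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().equals(START_TAG)) {
                    if (seenStart) {
                        // The next record begins: hand back what was accumulated.
                        return START_TAG + "\n" + content;
                    }
                    seenStart = true; // first PATN line in the input
                } else if (seenStart) {
                    content.append(line).append('\n');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // End of input: flush the final record instead of throwing.
        if (seenStart && content.length() > 0) {
            seenStart = false;
            return START_TAG + "\n" + content;
        }
        return null;
    }

    public static void main(String[] args) {
        String data = "PATN\nWKU 1\nPATN\nWKU 2\n";
        PatnRecordReader records = new PatnRecordReader(new BufferedReader(new StringReader(data)));
        for (String rec = records.next(); rec != null; rec = records.next()) {
            System.out.println(rec.trim().replace("\n", " | "));
        }
    }
}
```

Returning null at end of input, as in the change above, lets the caller's loop end cleanly after the last record has been delivered.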

Error messages with current code below:

patrick$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip" --outBulk=false --outdir="BulkDownloader/download/grants/1990/expanded/"
2016-11-08 21:49:27,321 INFO  [main] TransformerCli - --- Start ---
2016-11-08 21:49:27,356 INFO  [main] TransformerCli - Dump File[1]: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip
2016-11-08 21:49:27,358 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: Greenbook
2016-11-08 21:49:27,361 INFO  [main] ZipReader - Reading zip file: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip
2016-11-08 21:49:27,410 INFO  [main] ZipReader - Found 1 file[FileFilter [matchRules=[]]]: pftaps19900102_wk01.txt
2016-11-08 21:49:27,412 INFO  [main] PatentDocFormatDetect - PatentType fromContent: Greenbook
2016-11-08 21:49:27,610 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,715 INFO  [main] TransformerCli - Record: 'US305275' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:1
2016-11-08 21:49:27,776 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,780 INFO  [main] TransformerCli - Record: 'US305277' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:2
2016-11-08 21:49:27,812 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,815 INFO  [main] TransformerCli - Record: 'US305279' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:3
2016-11-08 21:49:27,838 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:49:27,841 INFO  [main] TransformerCli - Record: 'US305281' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:4
2016-11-08 21:49:27,858 WARN  [main] AbstractTextNode - Patent does not have an Abstract
(trimmed)
2016-11-08 21:49:41,038 INFO  [main] TransformerCli - Record: 'US4891829' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:806
2016-11-08 21:49:41,064 INFO  [main] TransformerCli - Record: 'US4891831' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:807
2016-11-08 21:49:41,075 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G21K  104' from : <CLAS><OCL>378145</OCL><XCL>378146</XCL><XCL>378151</XCL><XCL>378152</XCL><EDF>4</EDF><ICL>G21K  104</ICL><FSC>378</FSC><FSS>8;19;145-147;95;150;151;152;205</FSS></CLAS>
2016-11-08 21:49:41,077 INFO  [main] TransformerCli - Record: 'US4891833' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:808
2016-11-08 21:49:41,086 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H04M  164' from : <CLAS><OCL>379 88</OCL><XCL>379 45</XCL><XCL>379 51</XCL><XCL>379 73</XCL><EDF>4</EDF><ICL>H04M  164</ICL><FSC>379</FSC><FSS>41;45;48;51;67;84;88;89;110;73</FSS><FSC>369</FSC><FSS>2;14;24;83</FSS><FSC>360</FSC><FSS>32</FSS></CLAS>
2016-11-08 21:49:41,088 INFO  [main] TransformerCli - Record: 'US4891835' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:809
2016-11-08 21:49:41,113 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H04M  160' from : <CLAS><OCL>379390</OCL><XCL>379388</XCL><XCL>381106</XCL><EDF>4</EDF><ICL>H04M  160</ICL><FSC>381</FSC><FSS>68;684;94;96;102;106;107;111;72;113</FSS><FSC>379</FSC><FSS>388;389;390;395;387</FSS></CLAS>
2016-11-08 21:49:41,114 INFO  [main] TransformerCli - Record: 'US4891837' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:810
2016-11-08 21:49:41,133 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H04S  300' from : <CLAS><OCL>381 22</OCL><EDF>4</EDF><ICL>H04S  300</ICL><FSC>381</FSC><FSS>17;18;19;20;21;22;23</FSS></CLAS>
2016-11-08 21:49:41,135 INFO  [main] TransformerCli - Record: 'US4891839' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:811
2016-11-08 21:49:41,180 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'H03G  500' from : <CLAS><OCL>351 98</OCL><XCL>333 28T</XCL><EDF>4</EDF><ICL>H03G  500</ICL><FSC>328</FSC><FSS>28 T;28 R</FSS><FSC>381</FSC><FSS>98;103</FSS></CLAS>
2016-11-08 21:49:41,181 INFO  [main] TransformerCli - Record: 'US4891841' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:812
2016-11-08 21:49:41,190 INFO  [main] TransformerCli - Record: 'US4891843' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/1990/pftaps19900102_wk01.zip:813
Exception in thread "main" java.util.NoSuchElementException
	at gov.uspto.patent.bulk.DumpFileAps.read(DumpFileAps.java:50)
	at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:92)
	at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:23)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:160)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:276)

TransformerCli throws an NPE and stops processing with ipg050308.zip

2017-01-18 14:11:23,091 INFO  [main] TransformerCli - Record: 'US6863028B2' from ipg050308.zip:522
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.doc.xml.fragments.CitationNode.readNplCitations(CitationNode.java:71)
	at gov.uspto.patent.doc.xml.fragments.CitationNode.read(CitationNode.java:48)
	at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:112)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:74)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

TransformerCli throws an NPE and stops processing pftaps19920714_wk28.zip

2017-01-26 15:06:08,005 INFO  [main] TransformerCli - Record: 'US5129494A' from pftaps19920714_wk28.zip:612	
Exception in thread "main" java.lang.NullPointerException	
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770)
	at com.google.common.base.Splitter.splitToList(Splitter.java:408)
	at gov.uspto.patent.doc.greenbook.items.NameNode.read(NameNode.java:29)
	at gov.uspto.patent.doc.greenbook.fragments.InventorNode.readInventor(InventorNode.java:41)
	at gov.uspto.patent.doc.greenbook.fragments.InventorNode.read(InventorNode.java:31)
	at gov.uspto.patent.doc.greenbook.Greenbook.parse(Greenbook.java:132)
	at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:65)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:70)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:180)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:298)

A similar stack trace occurs with:
pftaps19880426_wk17.zip
pftaps19881004_wk40.zip
pftaps19890214_wk07.zip
pftaps19890627_wk26.zip
pftaps19890801_wk31.zip
pftaps19890815_wk33.zip
pftaps19890829_wk35.zip
pftaps19900130_wk05.zip
pftaps19900925_wk39.zip
pftaps19910212_wk07.zip
pftaps19910409_wk15.zip
pftaps19911008_wk41.zip
pftaps19920324_wk12.zip
pftaps19921020_wk42.zip
pftaps19930518_wk20.zip
pftaps19931130_wk48.zip
pftaps19940322_wk12.zip
pftaps19940524_wk21.zip
pftaps19940705_wk27.zip
pftaps19940809_wk32.zip
pftaps19941227_wk52.zip
pftaps19950328_wk13.zip
pftaps19950404_wk14.zip
pftaps19950815_wk33.zip
pftaps19951017_wk42.zip
pftaps19951212_wk50.zip

Year 2000 grant: getAbbreviatedName index out of bounds error

Splitting out earlier post with multiple error types:

Running Y2000 file pftaps20000104_wk01.zip, to limit. Error follows:

patrick$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip" --outBulk=false --outdir="BulkDownloader/download/grants/2000/expanded/"
2016-11-08 21:50:34,345 INFO  [main] TransformerCli - --- Start ---
2016-11-08 21:50:34,379 INFO  [main] TransformerCli - Dump File[1]: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip
2016-11-08 21:50:34,382 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: Greenbook
2016-11-08 21:50:34,386 INFO  [main] ZipReader - Reading zip file: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip
2016-11-08 21:50:34,418 INFO  [main] ZipReader - Found 1 file[FileFilter [matchRules=[]]]: pftaps20000104_wk01.txt
2016-11-08 21:50:34,420 INFO  [main] PatentDocFormatDetect - PatentType fromContent: Greenbook
2016-11-08 21:50:34,599 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0205' from : <CLAS><OCL>D 2602</OCL><XCL>D2608</XCL><XCL>D2627</XCL><EDF>6</EDF><ICL>0205</ICL><FSC>D 2</FSC><FSS>600;602;608;609;624;627;639;859;891</FSS><FSC>2</FSC><FSS>236;237;232;269;338;124;125;115</FSS></CLAS>
2016-11-08 21:50:34,608 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,709 INFO  [main] TransformerCli - Record: 'US418273' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1
2016-11-08 21:50:34,779 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0207' from : <CLAS><OCL>D 2639</OCL><EDF>6</EDF><ICL>0207</ICL><FSC>D11</FSC><FSS>200-201;261;55-66</FSS><FSC>D 2</FSC><FSS>639</FSS><FSC>297</FSC><FSS>482</FSS><FSC>24</FSC><FSS>633</FSS><FSC>D21</FSC><FSS>604</FSS><FSC>428</FSC><FSS>100</FSS></CLAS>
2016-11-08 21:50:34,782 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,784 INFO  [main] TransformerCli - Record: 'US418275' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:2
2016-11-08 21:50:34,810 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0202' from : <CLAS><OCL>D 2743</OCL><EDF>6</EDF><ICL>0202</ICL><FSC>D 2</FSC><FSS>728;743;746;753;830;838</FSS><FSC>D 5</FSC><FSS>62</FSS><FSC>2</FSC><FSS>69;94;1;40;227</FSS></CLAS>
2016-11-08 21:50:34,813 WARN  [main] AbstractTextNode - Patent does not have an Abstract
2016-11-08 21:50:34,818 INFO  [main] TransformerCli - Record: 'US418277' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:3
2016-11-08 21:50:34,841 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: '0203' from : <CLAS><OCL>D 2882</OCL><EDF>6</EDF><ICL>0203</ICL><FSC>D 2</FSC><FSS>865;866;872;876;879;882;883;884;886;887</FSS><FSC>D29</FSC><FSS>102;104</FSS><FSC>2</FSC><FSS>171;181;195.1;412;419;209.13</FSS></CLAS>
2016-11-08 21:50:34,850 WARN  [main] AbstractTextNode - Patent does not have an Abstract
(Trimmed)
2016-11-08 21:50:58,508 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G02F  11335' from : <CLAS><OCL>349113</OCL><XCL>349139</XCL><EDF>6</EDF><ICL>G02F  11335</ICL><ICL>G02F  11343</ICL><FSC>349</FSC><FSS>113;149;151;152;139</FSS></CLAS>
2016-11-08 21:50:58,509 INFO  [main] TransformerCli - Record: 'US6011605' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1224
2016-11-08 21:50:58,521 WARN  [main] ClassificationNode - Failed to Parse IPC Classification: 'G02F  11339' from : <CLAS><OCL>349153</OCL><XCL>349151</XCL><EDF>6</EDF><ICL>G02F  11339</ICL><FSC>349</FSC><FSS>149;151;153</FSS></CLAS>
2016-11-08 21:50:58,522 INFO  [main] TransformerCli - Record: 'US6011607' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2000/pftaps20000104_wk01.zip:1225
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 1
	at java.lang.String.substring(String.java:1963)
	at gov.uspto.patent.model.entity.NamePerson.getAbbreviatedName(NamePerson.java:75)
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:370)
	at gov.uspto.patent.serialize.JsonMapper.mapAssignees(JsonMapper.java:319)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:68)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:56)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:203)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:188)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:276)

Cannot read bulk

Current version of TransformerCli generates .bulk files that cannot be iterated over with Python 3.5.

Previously, the Python code below was able to iterate over a bulk file generated with TransformerCli. With the currently pulled version of the Java code, the same script crashes on the generated bulk file with the following message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1164: ordinal not in range(128)

The error generally occurs several patents into the file; in the files I tested, as many as 100 patents were processed before it was raised.

This has been shown to occur on the following files:

pftaps19780103_wk01.bulk
pftaps19840103_wk01.bulk
ipg100112.bulk

Python 3.5; to run, type (on my Anaconda setup):

python3 fileName.py /Users/patrick/pftaps19780103_wk01.bulk

Put the following code into fileName.py:

import os
import sys
import json

class objIterate(object):
    
    def IterateOverFile(self, fileName):
        with open(fileName) as file:
            for patentJson in file:
                patent = json.loads(patentJson)
                documentID = patent['documentId']
                print("{}".format(documentID))
                
if __name__ == '__main__' and __package__ is None:
    
    inserter = objIterate()
    fileName = sys.argv[1]
    inserter.IterateOverFile(fileName)
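
A possible explanation, not confirmed in this thread: byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (typographic quotes, for example), and open() without an encoding argument falls back to the locale's preferred encoding, which the error message shows to be ASCII on this setup. The sketch below assumes the .bulk file holds one UTF-8 encoded JSON record per line; the function name iter_document_ids is hypothetical and only used for illustration.

```python
import json
import sys


def iter_document_ids(file_name, encoding="utf-8"):
    """Yield the documentId of each JSON record, one record per line.

    Assumption: the bulk file is UTF-8 encoded JSON lines.
    """
    # Passing the encoding explicitly avoids the locale default,
    # which is ASCII on some setups and fails on bytes such as 0xe2.
    with open(file_name, encoding=encoding) as bulk_file:
        for line in bulk_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            yield json.loads(line)["documentId"]


if __name__ == "__main__" and len(sys.argv) == 2:
    for document_id in iter_document_ids(sys.argv[1]):
        print(document_id)
```

If this version reads the file without errors, the difference between the old and new bulk files may simply be that non-ASCII characters are now written unescaped, rather than the files being malformed.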

TransformerCli throws an NPE and stops processing ipa140206

TransformerCli does the same thing with many other archive files.

2017-01-18 12:28:04,705 INFO  [main] TransformerCli - Record: 'US20140036026A1' from ipa140206.zip:2639
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:376)
	at gov.uspto.patent.serialize.JsonMapper.mapInventors(JsonMapper.java:336)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:101)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:69)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:57)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:215)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:199)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

TransformerCli throws an InvalidDataException and stops processing with pa020124.zip

2017-01-12 13:09:56,836 INFO  [main] TransformerCli - Record: 'US20020008037A1' from pa020124.zip:532

2017-01-12 13:09:56,861 WARN  [main] DocumentIdNode - Invalid CountryCode ' XP
                                   ', from: <document-id>
<doc-number> PCT/US99/14459
                                   </doc-number>
<kind-code> A1
                                   </kind-code>
<document-date> 19990624
                                   </document-date>
<country-code> XP
                                   </country-code>
</document-id>
gov.uspto.patent.InvalidDataException: Invalid Code:  XP
                                   
	at gov.uspto.patent.model.CountryCode.fromString(CountryCode.java:305)
	at gov.uspto.patent.doc.pap.items.DocumentIdNode.read(DocumentIdNode.java:49)
	at gov.uspto.patent.doc.pap.fragments.RelatedIdNode.readIds(RelatedIdNode.java:115)
	at gov.uspto.patent.doc.pap.fragments.RelatedIdNode.contionationOf(RelatedIdNode.java:27)
	at gov.uspto.patent.doc.pap.fragments.RelatedIdNode.read(RelatedIdNode.java:93)
	at gov.uspto.patent.doc.pap.PatentAppPubParser.parse(PatentAppPubParser.java:96)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:78)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)
Exception in thread "main" java.lang.NullPointerException: CountryCode can not be set to Null
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:228)
	at gov.uspto.patent.model.DocumentId.<init>(DocumentId.java:39)
	at gov.uspto.patent.doc.pap.items.DocumentIdNode.read(DocumentIdNode.java:61)
	at gov.uspto.patent.doc.pap.fragments.RelatedIdNode.readIds(RelatedIdNode.java:115)
	at gov.uspto.patent.doc.pap.fragments.RelatedIdNode.contionationOf(RelatedIdNode.java:27)
	at gov.uspto.patent.doc.pap.fragments.RelatedIdNode.read(RelatedIdNode.java:93)
	at gov.uspto.patent.doc.pap.PatentAppPubParser.parse(PatentAppPubParser.java:96)
	at gov.uspto.patent.PatentReader.read(PatentReader.java:78)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

Y2010 file getAbbreviatedName null pointer exception

Splitting out earlier post with multiple error types:

Running Y2010 file ipg100105.zip, to limit. Error follows:

patrick$ java -cp PatentDocument/target/*:PatentDocument/target/dependency-jars/*:resources gov.uspto.patent.TransformerCli --input="BulkDownloader/download/grants/2010/ipg100105.zip" --outBulk=false --outdir="BulkDownloader/download/grants/2010/expanded/"
2016-11-08 21:51:36,523 INFO  [main] TransformerCli - --- Start ---
2016-11-08 21:51:36,560 INFO  [main] TransformerCli - Dump File[1]: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip
2016-11-08 21:51:36,562 INFO  [main] PatentDocFormatDetect - PatentDocFormat fromFileName: RedbookGrant
2016-11-08 21:51:36,569 INFO  [main] ZipReader - Reading zip file: /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip
2016-11-08 21:51:36,601 INFO  [main] ZipReader - Found 1 file[FileFilter [matchRules=[SuffixFilter [suffixes=[xml]]]]]: ipg100105.xml
2016-11-08 21:51:36,604 INFO  [main] PatentDocFormatDetect - PatentType fromContent: RedbookGrant
2016-11-08 21:51:37,205 INFO  [main] TransformerCli - Record: 'USD0607176S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1
2016-11-08 21:51:37,351 INFO  [main] TransformerCli - Record: 'USD0607177S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:2
2016-11-08 21:51:37,463 INFO  [main] TransformerCli - Record: 'USD0607178S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:3
2016-11-08 21:51:37,601 INFO  [main] TransformerCli - Record: 'USD0607179S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:4
2016-11-08 21:51:37,738 INFO  [main] TransformerCli - Record: 'USD0607180S1' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:5

.......(trimmed)
2016-11-08 21:55:32,444 INFO  [main] TransformerCli - Record: 'US7642067B2' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1937
2016-11-08 21:55:32,555 INFO  [main] TransformerCli - Record: 'US7642068B2' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1938
2016-11-08 21:55:32,618 INFO  [main] TransformerCli - Record: 'US7642069B2' from /Users/patrick/dev/repos/PatentPublicData/BulkDownloader/download/grants/2010/ipg100105.zip:1939
Exception in thread "main" java.lang.NullPointerException
	at gov.uspto.patent.model.entity.NamePerson.getAbbreviatedName(NamePerson.java:75)
	at gov.uspto.patent.serialize.JsonMapper.mapName(JsonMapper.java:370)
	at gov.uspto.patent.serialize.JsonMapper.mapApplicant(JsonMapper.java:306)
	at gov.uspto.patent.serialize.JsonMapper.buildJson(JsonMapper.java:99)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:68)
	at gov.uspto.patent.serialize.JsonMapper.write(JsonMapper.java:56)
	at gov.uspto.patent.TransformerCli.write(TransformerCli.java:203)
	at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:188)
	at gov.uspto.patent.TransformerCli.process(TransformerCli.java:116)
	at gov.uspto.patent.TransformerCli.main(TransformerCli.java:276)

NameNode

When processing the following 1995 files:

pftaps19951017_wk42.zip
pftaps19950214_wk07.zip

In NameNode.read, called from AssigneeName, the variable fullName comes up null and TransformerCli crashes.

The current code:

String fullName = nameN != null ? nameN.getText() : null;

List<String> nameParts = Splitter.onPattern("[,;]").limit(2).trimResults().splitToList(fullName);

The code appears to expect fullName to be null at times, but the com.google.common code behind Splitter.splitToList rejects a null argument (it calls Preconditions.checkNotNull). Is there a version of splitToList that accepts nulls?

The workaround I am currently using is:

if (fullName == null) return null;
