pcolar / ocr-parser
Programming project for INLS560
License: MIT License
Modify the code to accept a range of input files:
-- files with the same date will count up the page sequence number
-- open and close each input file cleanly
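The two requirements above could be sketched as follows. This is a minimal sketch, not the project's code: the function names `assign_page_sequences` and `read_text_cleanly` are hypothetical, and it assumes the date of each file is already known when the files are listed.

```python
from collections import defaultdict
from pathlib import Path


def assign_page_sequences(dated_files):
    """Take (date, filename) pairs in input order and return
    (date, filename, sequence) triples.

    Files sharing a date are numbered 1, 2, 3, ... in order of appearance.
    """
    counters = defaultdict(int)
    result = []
    for date, name in dated_files:
        counters[date] += 1
        result.append((date, name, counters[date]))
    return result


def read_text_cleanly(path):
    """Read one OCR text file; the with-block closes it even on error."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Keeping the counter in a dictionary keyed by date means the input files do not need to be sorted by date first.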
In the current version under development, the code writes a fixed set of data elements to the CSV file. Ideally, the code would read the header of the CSV file to determine which data elements to parse.
Practical application would require standardized element names as part of an extensible controlled vocabulary.
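One way the header-driven approach might look with the standard `csv` module is sketched below. The helper names `read_field_names` and `write_record` are hypothetical, and the sketch assumes the first row of the CSV file holds the element names.

```python
import csv


def read_field_names(csv_file):
    """Return the column names from the first row of an open CSV file."""
    reader = csv.reader(csv_file)
    return next(reader, [])


def write_record(csv_file, field_names, parsed):
    """Write one row in header order.

    Elements the parser did not find are left blank; parsed keys that are
    not in the header are ignored.
    """
    writer = csv.DictWriter(
        csv_file, fieldnames=field_names, restval="", extrasaction="ignore"
    )
    writer.writerow(parsed)
```

With this shape, adding a column to the CSV header is enough to request a new element, provided the parser knows a pattern for that name.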
For Windows and Mac OS X
-- call the parse program on a range of input files based on the selection in Finder or the Windows file manager
Possible levels:
Collection
Digitization batch
Reel
Item
Derived metadata
Parsed metadata
Using the CSV library will simplify the parsing program and make it easier to maintain.
Complete partial metadata from the OCR file.
Data may be incomplete: 'January 12, 199'
Fields may lack separation: 'Page1', 'PageOne'
A high priority is moving the regex parsing code into a library, where it can be extended without cluttering the main code.
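A pattern library of this kind could be as small as a table of compiled expressions plus one lookup function. The sketch below is an assumption about how such a module might be organized: the name `FIELD_PATTERNS`, the function `parse_fields`, and the patterns themselves are illustrative and have not been tuned against real OCR output.

```python
import re

# Hypothetical pattern table: one compiled regex per field name.
# \s* (rather than \s+) tolerates missing separation, e.g. 'Page1', 'PageOne'.
FIELD_PATTERNS = {
    "Volume": re.compile(r"\bvol(?:ume)?\.?\s*([A-Za-z0-9]+)", re.IGNORECASE),
    "Issue": re.compile(r"\b(?:issue|no\.)\s*([A-Za-z0-9]+)", re.IGNORECASE),
    "Page Number": re.compile(r"\bpage\s*([A-Za-z0-9]+)", re.IGNORECASE),
}


def parse_fields(text, patterns=FIELD_PATTERNS):
    """Return {field: first matched value, or None when not found}."""
    found = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found
```

New fields are added by extending the table, and a field that is absent from the OCR text comes back as `None` instead of raising an error.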
Several spreadsheets and data files exist from separate levels of the workflow.
Define matching elements and inheritance
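Inheritance between levels could be modeled by layering dictionaries, with more specific levels overriding more general ones. This is only a sketch under assumptions: the function name `inherit_metadata` is hypothetical, each level is assumed to be a plain dictionary keyed by element name, and blank values are assumed to mean "inherit from the level above".

```python
from collections import ChainMap


def inherit_metadata(*levels):
    """Merge metadata dictionaries ordered from most general to most specific.

    Blank values (None or '') are dropped first, so a missing element at a
    specific level falls back to the value from a more general level.
    """
    filled = [
        {key: value for key, value in level.items() if value not in (None, "")}
        for level in levels
    ]
    # ChainMap looks up keys in its first mapping first, so reverse the order
    # to give the most specific level priority.
    return dict(ChainMap(*reversed(filled)))
```

For example, a collection-level record, a reel-level record, and an item-level record can be passed in that order to produce one complete row per item.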
A function to convert date strings into a standardized date format:
'JUNE 15, 1946' -> 19460615
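A possible shape for this function is sketched below. The name `standardize_date` and the regular expression are assumptions; the sketch handles only full English month names in the 'MONTH DAY, YEAR' layout and returns `None` for anything it cannot complete, such as a truncated year.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Month name, day, and a four-digit year, e.g. 'JUNE 15, 1946'.
DATE_RE = re.compile(r"([A-Za-z]+)\.?\s+(\d{1,2}),?\s+(\d{4})\b")


def standardize_date(text):
    """Return the date as a 'YYYYMMDD' string, or None if it cannot be parsed."""
    match = DATE_RE.search(text)
    if not match:
        return None
    month = MONTHS.get(match.group(1).lower())
    if month is None:
        return None
    try:
        parsed = date(int(match.group(3)), month, int(match.group(2)))
    except ValueError:  # impossible day, e.g. February 30
        return None
    return parsed.strftime("%Y%m%d")
```

A lookup table is used instead of `datetime.strptime` with `%B` so that the result does not depend on the locale of the machine running the parser.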
One or more metadata elements are not found in the OCR file
Document how to become involved and submit code or other work to the project.
Code to parse the text files output by the digitization workflow.
Currently identified fields include:
Volume
Issue
Date
Page Number
A function to convert numbers written as text to digits:
one -> 1
twenty seven -> 27
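The conversion could be sketched as below. The function name `words_to_number` is hypothetical, and the sketch deliberately covers only zero through ninety-nine, which is an assumption about the range of page, volume, and issue numbers; larger numbers would need additional handling.

```python
UNITS = {
    word: value
    for value, word in enumerate(
        ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    )
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(text):
    """Convert 'one' or 'twenty seven' (0-99) to an int; None if unrecognized."""
    tokens = text.lower().replace("-", " ").split()
    if len(tokens) == 1:
        word = tokens[0]
        if word in UNITS:
            return UNITS[word]
        return TENS.get(word)
    if len(tokens) == 2:
        tens, unit = tokens
        # Only a tens word followed by one..nine is a valid two-word number.
        if tens in TENS and 1 <= UNITS.get(unit, 0) <= 9:
            return TENS[tens] + UNITS[unit]
    return None
```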
Some of our command line parameters can be loaded from shell environment variables using the 'os' library: os.environ["shell_variable"]
A small Python script can query the user for these values and print the corresponding 'export' lines; once the shell evaluates that output, the variables persist for the life of the terminal session.
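Reading defaults from the environment might look like the sketch below. The variable names `OCR_INPUT_DIR` and `OCR_OUTPUT_CSV`, the option names, and the fallback values are all hypothetical; the sketch uses `os.environ.get` so that an unset variable falls back to a default instead of raising `KeyError`.

```python
import argparse
import os


def build_parser(environ=os.environ):
    """Build a command line parser whose defaults come from shell variables."""
    parser = argparse.ArgumentParser(description="Parse OCR text files.")
    parser.add_argument(
        "--input-dir",
        default=environ.get("OCR_INPUT_DIR", "."),
        help="directory holding the OCR text files",
    )
    parser.add_argument(
        "--output-csv",
        default=environ.get("OCR_OUTPUT_CSV", "metadata.csv"),
        help="CSV file to write parsed metadata to",
    )
    return parser
```

An explicit command line option still overrides the environment, so the shell variables act only as session-wide defaults.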
A function to convert Roman numerals to digits,
e.g. IX -> 9
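A minimal sketch of such a function follows, using the usual subtractive rule: a symbol smaller than the one after it is subtracted, otherwise it is added. The name `roman_to_int` is hypothetical, and the sketch does not reject malformed numerals such as 'IIII'; it only returns `None` for characters that are not Roman symbols.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral string to an int; None for non-Roman input."""
    text = numeral.strip().upper()
    if not text or any(symbol not in ROMAN_VALUES for symbol in text):
        return None
    total = 0
    for index, symbol in enumerate(text):
        value = ROMAN_VALUES[symbol]
        following = ROMAN_VALUES[text[index + 1]] if index + 1 < len(text) else 0
        # Subtract when a smaller symbol precedes a larger one (the I in IX).
        total += -value if value < following else value
    return total
```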