Comments (3)
Yes, both actually :)
from nougat.
Thank you,
Yes, we removed the headers from the training data. It works pretty well for PDFs; however, scanned documents are out of distribution and might show unwanted artifacts.
It's a bit weird that the first line is missing. I'm not 100% sure, but it could be due to an unrelated issue with the page splitting algorithm.
Thanks, it indeed works really well. I'm curious about the motivation for removing headers when preparing the training image-LaTeX pairs: is it beneficial for page splitting, for easier post-processing, or something else? Appreciate it.