Comments (11)
Thanks for the kind words, and I'll take a look. You're right: at the moment the header implementation is very basic, relying mostly on font size comparisons. The next step would probably be examining some of the built-in styles that are included in Word XML by default (like Header1/Nadpis1); those make it easy to identify headers, but there's no guarantee that they will be used in any given document, which is why there's a bit of a guessing game here.
It's interesting that the names of those classes are localized, however. Hopefully there's a reliable way to identify those styles no matter what they happen to be named.
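One possible approach, sketched under an assumption that is worth verifying against real documents: in styles.xml, a built-in style's w:styleId may be localized (Nadpis1), while its w:name value seems to keep the English form ("heading 1"). If that holds, the level can be read from the name instead of the ID. The styles hash and the heading_level helper below are hypothetical and not part of swordfish.

```ruby
# Assumption: `styles` maps each w:styleId found in styles.xml to its w:name
# value, e.g. { "Nadpis1" => "heading 1" }. Building that hash is left out.
HEADING_NAME = /\Aheading ([1-9])\z/i

# Hypothetical helper: returns the heading level (1-9) for a paragraph's
# style ID, or nil when the style is unknown or is not a built-in heading.
def heading_level(style_id, styles)
  name = styles[style_id]
  match = HEADING_NAME.match(name.to_s.strip)
  match && match[1].to_i
end

styles = { "Nadpis1" => "heading 1", "Nadpis2" => "heading 2", "Normln" => "Normal" }
puts heading_level("Nadpis2", styles).inspect
puts heading_level("Normln", styles).inspect
```

Documents that never apply the built-in styles would still need the font size guess as a fallback.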
from swordfish.
Thanks for the information. Well, if it is built the way you have just described, how is it supposed to work? If I make, let's say, a paragraph in a 10 pt font, the next one in 13 pt, and the next one in 10 pt again, nothing happens. The output is completely flattened; everything comes out as a plain paragraph. It would be very helpful to have such a basic function :)
from swordfish.
Are you enabling the appropriate settings flag? The call should look something like this:
Swordfish.open('~/Documents/my_word_doc.docx').settings(:guess_headers => true).to_html
However, like you said, it's far from perfect. I'll look into improving the reliability.
from swordfish.
Thank you for the hint, I had forgotten that. However, it throws this in both cases, when using styles as well as when using just different font sizes:
NoMethodError: undefined method `[]' for nil:NilClass
from ... /lib/swordfish/document.rb:108:in `block in find_headers!'
from ... /lib/swordfish/document.rb:106:in `each'
from ... /lib/swordfish/document.rb:106:in `each_with_index'
from ... /lib/swordfish/document.rb:106:in `find_headers!'
from ... /lib/swordfish/document.rb:65:in `settings'
I am using a slightly customized fork now, but I haven't even touched this part of the app...
from swordfish.
Well, it works now; I had probably used a bad file. I will test it thoroughly and let you know.
from swordfish.
I am still struggling with this; it throws the error too often, always pointing to line 108 in document.rb. It definitely seems to occur when there are NO headers at all.
from swordfish.
Well, something like this seems to help, though I really doubt it is a clean solution:
header_sizes = []
font_sizes.each_with_index do |f, idx|
  if idx == 0
    header_sizes << f[:size] if f[:size] > font_sizes[idx+1][:size]
  elsif idx != font_sizes.length - 1
    header_sizes << f[:size] if (f[:size] > font_sizes[idx-1][:size] && f[:size] > font_sizes[idx+1][:size])
  end
end
rescue # < !
  nil # < here anything helps... string as well as nil
header_sizes = header_sizes.uniq.sort.reverse
font_sizes.each do |f|
  level = header_sizes.find_index(f[:size])
  ...
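A guess at the cause, based only on the snippet and stack trace above: when font_sizes holds a single entry (a document with one uniform font size, so no headers), font_sizes[idx+1] is nil for idx == 0, and calling [:size] on nil raises the NoMethodError. Below is a minimal sketch of a bounds-safe variant that avoids the rescue. It assumes font_sizes is an array of hashes with a :size key, as in the snippet; guess_header_sizes is a hypothetical helper name, not part of swordfish.

```ruby
# Hypothetical helper: returns the distinct sizes that are larger than their
# neighbouring runs, largest first. Mirrors the snippet's rules: the first run
# is compared only with the next one, and the last run is never a header.
def guess_header_sizes(font_sizes)
  sizes = font_sizes.map { |f| f[:size] }
  header_sizes = []

  sizes.each_with_index do |size, idx|
    following = sizes[idx + 1]
    next if following.nil? # last (or only) run: nothing to compare against

    preceding = idx.zero? ? nil : sizes[idx - 1]
    if size > following && (preceding.nil? || size > preceding)
      header_sizes << size
    end
  end

  header_sizes.uniq.sort.reverse
end

puts guess_header_sizes([{ size: 10 }, { size: 13 }, { size: 10 }]).inspect
puts guess_header_sizes([{ size: 12 }]).inspect
```

With an empty result, find_index returns nil for every run, so each one stays a plain paragraph instead of raising.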
from swordfish.
If you take a look at the new better_headings branch (https://github.com/voikya/swordfish/tree/better_headings), do you see any improvement when the :guess_headers flag is set? This should hopefully both prevent the error you saw and improve the general reliability of headers.
from swordfish.
I am still testing that, will let you know. Again, really nice tool, with a tiny monkey patch I can parse into, let's say, valid DocBook!
from swordfish.
Well, it helped. It no longer throws the error, but its results are still very inconsistent. For example, I have two docs with very similar, simple styling (h1 at 20 points, body text at 12 points). For the first one it works very well; for the second it still produces plain paragraphs instead of headers. I can't figure out where the problem or the difference could be. Unfortunately, I still can't rely on this tiny (but much needed :) ) part.
from swordfish.
Could you upload a minimal test case? A Word doc that still isn't having headers parsed correctly, but with any content you don't want shared removed? (Even better if you can create a new document and replicate the problem)
from swordfish.