Headers about swordfish HOT 11 CLOSED

voikya commented on July 2, 2024
Headers

from swordfish.

Comments (11)

voikya commented on July 2, 2024

Thanks for the kind words, and I'll take a look. You're right: at the moment the header implementation is very basic and relies mostly on font size comparisons. The next step would probably be to examine the built-in styles that Word XML includes by default (like Heading1/Nadpis1); those make headers easy to identify, but there's no guarantee that any given document actually uses them, which is why there's a bit of a guessing game here.

It's interesting that the names of those classes are localized, however. Hopefully there's a reliable way to identify those styles no matter what they happen to be named.
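One possible approach, sketched here only as an illustration: in the styles.xml files I have seen, the localized part is the `w:styleId` (e.g. `Nadpis1`), while the `w:name` of a built-in heading style stays `heading 1`, `heading 2`, and so on. Assuming that holds, a lookup table from style ID to heading level could be built as below. The helper name `heading_style_levels` is hypothetical and not part of Swordfish, and the sketch uses REXML rather than whatever parser the library uses internally.

```ruby
require 'rexml/document'

# WordprocessingML main namespace, bound to the "w" prefix in Word XML parts.
W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

# Hypothetical helper: maps each style ID (possibly localized) to a heading
# level, based on the assumption that w:name remains "heading N".
def heading_style_levels(styles_xml)
  doc = REXML::Document.new(styles_xml)
  levels = {}
  REXML::XPath.each(doc, '//w:style', { 'w' => W_NS }) do |style|
    name = REXML::XPath.first(style, 'w:name', { 'w' => W_NS })
    next unless name

    match = name.attributes['w:val'].to_s.match(/\Aheading (\d)\z/i)
    levels[style.attributes['w:styleId']] = match[1].to_i if match
  end
  levels
end

# Tiny hand-written sample standing in for a real styles.xml.
SAMPLE_STYLES = <<~XML
  <w:styles xmlns:w="#{W_NS}">
    <w:style w:type="paragraph" w:styleId="Nadpis1"><w:name w:val="heading 1"/></w:style>
    <w:style w:type="paragraph" w:styleId="Nadpis2"><w:name w:val="heading 2"/></w:style>
    <w:style w:type="paragraph" w:styleId="Normln"><w:name w:val="Normal"/></w:style>
  </w:styles>
XML

p heading_style_levels(SAMPLE_STYLES)
```

Paragraphs whose style reference matches a key in that table could then be treated as headers of the given level, with the font size guess kept as a fallback for documents that don't use the built-in styles.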

welblaud commented on July 2, 2024

Thanks for the information. Well, if that's how it is built, as you have just confirmed, how is it supposed to work? If I make, let's say, a paragraph in a 10 pt font, the next one in 13 pt, and the next one in 10 pt again, nothing happens. The output is completely flattened; everything comes out as a plain paragraph. It would be very helpful to have such a basic function :)

voikya commented on July 2, 2024

Are you enabling the appropriate settings flag? The call should look something like this:

Swordfish.open('~/Documents/my_word_doc.docx').settings(:guess_headers => true).to_html

However, like you said, it's far from perfect. I'll look into improving the reliability.

welblaud commented on July 2, 2024

Thank you for the hint, I had forgotten that. However, it throws this (in both cases, when using styles as well as just different font sizes):

NoMethodError: undefined method `[]' for nil:NilClass
        from ... /lib/swordfish/document.rb:108:in `block in find_headers!'
        from ... /lib/swordfish/document.rb:106:in `each'
        from ... /lib/swordfish/document.rb:106:in `each_with_index'
        from ... /lib/swordfish/document.rb:106:in `find_headers!'
        from ... /lib/swordfish/document.rb:65:in `settings'

I am using a slightly customized fork now, but I haven't even touched this part of the app...

welblaud commented on July 2, 2024

Well, it works, I probably used a bad file... I will test it thoroughly and let you know.

welblaud commented on July 2, 2024

Yay, still struggling with that; it throws this error too often, always pointing to line 108 in document.rb. It definitely seems to occur when there are NO headers at all.

welblaud commented on July 2, 2024

Well, something like this seems to help; however, I really doubt it is a clean solution:

header_sizes = []
font_sizes.each_with_index do |f, idx|
   if idx == 0
      header_sizes << f[:size] if f[:size] > font_sizes[idx+1][:size]
   elsif idx != font_sizes.length - 1
      header_sizes << f[:size] if (f[:size] > font_sizes[idx-1][:size] && f[:size] > font_sizes[idx+1][:size])
   end
end
rescue # < !
   nil # < here anything helps... string as well as nil
header_sizes = header_sizes.uniq.sort.reverse
font_sizes.each do |f|
   level = header_sizes.find_index(f[:size])
...
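For comparison, a minimal sketch of a guard-based variant that avoids the rescue. It assumes `font_sizes` is an array of hashes with a `:size` key, as in the snippet above, and that one plausible cause of the error is a list with a single entry, where `font_sizes[idx+1]` is nil. The method name `header_sizes_for` is hypothetical; this is not the code from the library.

```ruby
# Returns the distinct sizes that are larger than their neighbours,
# largest first. As in the snippet above, the first entry is compared only
# with the one after it, and the last entry is never treated as a header.
def header_sizes_for(font_sizes)
  sizes = font_sizes.map { |f| f[:size] }
  peaks = []
  sizes.each_with_index do |size, idx|
    following = sizes[idx + 1]
    next if following.nil? # last (or only) entry: nothing to compare against

    preceding = idx.zero? ? nil : sizes[idx - 1]
    peaks << size if size > following && (preceding.nil? || size > preceding)
  end
  peaks.uniq.sort.reverse
end

p header_sizes_for([{ size: 10 }, { size: 13 }, { size: 10 }])
p header_sizes_for([{ size: 12 }])
```

With a single font size or an empty list the method simply returns an empty array, so the later `find_index` lookup would yield nil for every paragraph instead of raising.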

voikya commented on July 2, 2024

If you take a look at the new better_headings branch (https://github.com/voikya/swordfish/tree/better_headings), do you see any improvement when the :guess_headers flag is set? This should hopefully both prevent the error you saw and improve the general reliability of header detection.

welblaud commented on July 2, 2024

I am still testing that, will let you know. Again, really nice tool, with a tiny monkey patch I can parse into, let's say, valid DocBook!

welblaud commented on July 2, 2024

Well, it helped. It no longer throws the error; however, its results are still very inconsistent. For example, I have two docs with very similar, simple styling (h1 at 20 points, body text at 12 points). For the first one it works very well; for the second it still produces plain paragraphs instead of headers. I can't figure out where the problem/difference could be. Unfortunately, I still can't rely on this tiny (but much needed :) ) part.

voikya commented on July 2, 2024

Could you upload a minimal test case? A Word doc that still isn't having headers parsed correctly, but with any content you don't want shared removed? (Even better if you can create a new document and replicate the problem)
