Comments (21)

jsfenfen commented on September 2, 2024

hey @krishnakt031990 I don't think so, though the version I did of it is still here: https://github.com/jsfenfen/pdfplumber/tree/master . I guess there's a minor release that's been added since, I will update when I've got a sec.
@jsvine it looks like the pr doesn't have squashed commits? this isn't a big change, though would be clearer if I could squash those. Hmm.

from pdfplumber.

jsfenfen commented on September 2, 2024

I got a different sample of the docs with the font height thing! Going through them, uh, soonish.

jsfenfen commented on September 2, 2024

I guess with word heights I'm going back and forth on averaging them or taking the mode; left the latter in for the moment.
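
To make the averaging-versus-mode trade-off concrete, here is a minimal sketch using made-up character heights. The function names and the sample values are assumptions for illustration, not code from the branch discussed here.

```python
from statistics import mean, mode

# Hypothetical heights for the chars of one word; the last char is an outlier.
char_heights = [12.0, 12.0, 12.0, 12.0, 14.5]


def word_height_mean(heights):
    """Average of the char heights; every outlier shifts the result."""
    return mean(heights)


def word_height_mode(heights):
    """Most common char height; a single outlier is ignored."""
    return mode(heights)


print(word_height_mean(char_heights))  # 12.5
print(word_height_mode(char_heights))  # 12.0
```

With the mode, one oddly sized char does not change the height reported for the word. With the mean it does, but the result reflects every char.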

jsvine commented on September 2, 2024

Thanks! I like this. For testing's sake: Do you have shareable examples of PDFs where chars that should belong to the same word either have different heights or fontnames?

jsfenfen commented on September 2, 2024

So I still haven't heard back about the files that originally required this. I could pretty easily just make up a sample PDF that fails the font height test, though obviously having a real example would be better... The other time this can come up is when the word tolerance is set too high and words run together inadvertently, though only if adjacent cells have different fonts. Will look around a bit.

jsvine commented on September 2, 2024

No worries. Thinking through this a bit. I'm tempted to, by default, group words by fonts, size, and color. (Yes, upcoming versions of pdfplumber will include font color!) Boolean params could turn them off. I.e., defaults would be:

def extract_words(chars,
  x_tolerance=DEFAULT_X_TOLERANCE,
  y_tolerance=DEFAULT_Y_TOLERANCE,
  keep_blank_chars=False,
  match_fontsize=True,
  match_fontcolor=True,
  match_fontname=True
)

That'd mean losing some of the flexibility of, e.g., DEFAULT_FONT_HEIGHT_TOLERANCE, but might make the options clearer. It'd also mean avoiding having to calculate the average/mode values for tolerance-ed attributes. For instance, this ...

page.extract_words()

... might return ...

[ {
  "text": "Hello",
  "fontsize": 12,
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

... while ...

page.extract_words(match_fontsize=False)

... would return ...

[ {
  "text": "Hello",
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

What do you think? Too inflexible?
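
One way the proposed flags could behave is sketched below in plain Python, working on simple char dicts rather than a real PDF. This is only an illustration under stated assumptions, not pdfplumber's implementation. The function name `group_chars_into_words`, the helper `_make_word`, and the char keys (`x0`, `x1`, `fontsize`, `fontname`) are hypothetical. `y_tolerance`, `keep_blank_chars` and `match_fontcolor` are left out for brevity.

```python
def _make_word(chars, keys):
    """Build a word dict; matched attributes are copied from the first char."""
    word = {"text": "".join(c["text"] for c in chars)}
    for key in keys:
        word[key] = chars[0][key]
    return word


def group_chars_into_words(chars, x_tolerance=3,
                           match_fontsize=True, match_fontname=True):
    """Group the chars of a single text line into words.

    A new word starts when the horizontal gap to the previous char exceeds
    x_tolerance, or when any attribute being matched differs.
    """
    # Attributes that must agree inside a word and are reported on the word.
    keys = []
    if match_fontsize:
        keys.append("fontsize")
    if match_fontname:
        keys.append("fontname")

    words = []
    current = []
    for char in sorted(chars, key=lambda c: c["x0"]):
        if current:
            prev = current[-1]
            gap_too_big = char["x0"] - prev["x1"] > x_tolerance
            attrs_differ = any(char[k] != prev[k] for k in keys)
            if gap_too_big or attrs_differ:
                words.append(_make_word(current, keys))
                current = []
        current.append(char)
    if current:
        words.append(_make_word(current, keys))
    return words


# Made-up chars: "Hi" in ArialBold followed closely by "yo" in Arial.
chars = [
    {"text": "H", "x0": 0, "x1": 6, "fontsize": 12, "fontname": "ArialBold"},
    {"text": "i", "x0": 6, "x1": 9, "fontsize": 12, "fontname": "ArialBold"},
    {"text": "y", "x0": 10, "x1": 16, "fontsize": 12, "fontname": "Arial"},
    {"text": "o", "x0": 16, "x1": 22, "fontsize": 12, "fontname": "Arial"},
]

print(group_chars_into_words(chars))
print(group_chars_into_words(chars, match_fontname=False))
```

With the defaults, the font change splits the chars into "Hi" and "yo". With `match_fontname=False`, the 1-unit gap is within the tolerance, so the chars merge into "Hiyo" and the word no longer carries a `fontname`.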

jsfenfen commented on September 2, 2024

I think that's great!

Also, I think whatever adjustments might be needed will become more obvious the more pdfs we trawl through...

jsfenfen commented on September 2, 2024

Ok, I have this working in the word_fonts branch here using made up pdfs as tests. Trying to dig up the sample observed in the wild.

Am doing this with a custom WordFontError subclassed from RuntimeError, but am open to suggestions...

No idea if this will be at all helpful ahead of 0.60 rewrite, but...
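
For readers unfamiliar with the pattern, a custom error subclassed from RuntimeError might look roughly like the sketch below. Only the class name `WordFontError` comes from this comment. The condition that raises it, the helper `word_fontname`, and the `fontname` key are assumptions for illustration and may differ from the word_fonts branch.

```python
class WordFontError(RuntimeError):
    """Raised when the chars of one word do not share a single font name."""


def word_fontname(chars):
    """Return the font name shared by all chars, or raise WordFontError."""
    names = {c["fontname"] for c in chars}
    if len(names) != 1:
        raise WordFontError(f"expected one font name, found {sorted(names)}")
    return names.pop()


print(word_fontname([{"fontname": "Arial"}, {"fontname": "Arial"}]))

try:
    word_fontname([{"fontname": "Arial"}, {"fontname": "ArialBold"}])
except WordFontError as err:
    print("mixed fonts:", err)
```

Subclassing RuntimeError means callers that already catch RuntimeError keep working, while callers that only care about font mismatches can catch the narrower type.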

jsvine commented on September 2, 2024

Ooh, thanks! Will definitely aim to incorporate this (or something close to it) into the next big release.

problemsniper commented on September 2, 2024

Is this in the current version? I am looking for font name and font size per word and not per letter.

problemsniper commented on September 2, 2024

Works perfectly! thanks @jsfenfen. Just have another question regarding the document. Did you try to reverse engineer to build a pdf out of the extracted properties of text? Just wanted some tips to create one if you did look into doing it.

jsfenfen commented on September 2, 2024

"Did you try to reverse engineer to build a pdf out of the extracted properties of text?"
No.... I'm not sure I get the use case--couldn't you just use the original pdf? But if you really want to create a pdf from objects of your choosing, maybe https://bitbucket.org/rptlab/reportlab ?

jsfenfen commented on September 2, 2024

@krishnakt031990 is this a pdf that's been OCR'ed? Fonts aren't very reliable in most of the OCR I've seen--could this have been set there? Also possible this is a pdfminer thing? Can you share a doc that does this?

problemsniper commented on September 2, 2024

For the font size: the point size is about 4-5 pts more than the actual font size. I can give an example with an image here.

image

See that extra spacing on top of My?

Saqhas commented on September 2, 2024

@jsvine Is this issue resolved and the functionality added?

jsvine commented on September 2, 2024

This functionality has not yet been added. I'm certainly open to adding it, but haven't had the time quite yet.

Saqhas commented on September 2, 2024

I wanted this functionality in one of my projects. I have made some changes to the repo code to support it. Should I push them to a branch and create a pull request, so that we can discuss and add it?

jsvine commented on September 2, 2024

Thanks, @Saqhas! It's definitely worth a discussion and opening a pull request. I'm not certain I'll use your code, but it could definitely be helpful inspiration and I would certainly credit you for that.

ibrahimshuail commented on September 2, 2024

Can we capture words based on font size? E.g., if my font size is 12, I need the words at that size.
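
As a rough illustration of this kind of filtering, the sketch below assumes word dicts that already carry a `fontsize` key, as in the sample output proposed earlier in this thread. The function name `words_with_fontsize`, the tolerance value and the sample data are hypothetical.

```python
def words_with_fontsize(words, target, tolerance=0.1):
    """Keep the words whose font size is within `tolerance` of `target`."""
    return [w for w in words if abs(w["fontsize"] - target) <= tolerance]


# Made-up extracted words; sizes read from a PDF are often not exact integers.
words = [
    {"text": "Title", "fontsize": 18.0},
    {"text": "body", "fontsize": 12.0},
    {"text": "text", "fontsize": 11.96},
]

print([w["text"] for w in words_with_fontsize(words, 12)])
```

A small tolerance is used instead of strict equality because a nominal 12 pt font may be stored as a slightly different float.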

jsvine commented on September 2, 2024

@ibrahimshuail See my response to the separate issue you opened, #234

jsvine commented on September 2, 2024

Closing this now-done issue. Per merged PR above, this feature was added last year! 🎉
