Having a font for an entire word helps parsing. A lot. Height also helps some.


word-level font names and heights about pdfplumber HOT 21 CLOSED

jsvine commented on September 2, 2024

word-level font names and heights

from pdfplumber.

Comments (21)

jsfenfen commented on September 2, 2024

Hey @krishnakt031990, I don't think so, though my version of it is still here: https://github.com/jsfenfen/pdfplumber/tree/master . I guess there's been a minor release since; I'll update when I've got a sec.
@jsvine it looks like the PR doesn't have squashed commits? This isn't a big change, though it would be clearer if I could squash those. Hmm.


jsfenfen commented on September 2, 2024

I got a different sample of the docs with the font height thing! Going through them, uh, soonish.


jsfenfen commented on September 2, 2024

I guess with word heights I'm going back and forth on averaging them or taking the mode; left the latter in for the moment.
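To make the mean-versus-mode trade-off concrete, here is a minimal sketch. It assumes each character is a dict with a numeric `height` key; the `word_height` helper and its `method` parameter are hypothetical names for illustration, not code from the branch.

```python
from collections import Counter
from statistics import mean


def word_height(chars, method="mode"):
    """Collapse per-character heights into a single word-level height.

    `chars` is assumed to be a non-empty list of dicts with a "height" key.
    """
    # Round first so tiny floating-point differences don't split the mode.
    heights = [round(c["height"], 2) for c in chars]
    if method == "mean":
        return mean(heights)
    # Mode: the most common height; ties go to the height seen first.
    return Counter(heights).most_common(1)[0][0]


# Hypothetical word whose last character is a small superscript.
chars = [
    {"text": "M", "height": 12.0},
    {"text": "y", "height": 12.0},
    {"text": "2", "height": 7.0},
]
print(word_height(chars, method="mode"))
print(word_height(chars, method="mean"))
```

The mode ignores a single outlier such as a superscript, while the mean is pulled toward it, which is one reason to prefer the mode for words that are mostly uniform.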


jsvine commented on September 2, 2024

Thanks! I like this. For testing's sake: Do you have shareable examples of PDFs where chars that should belong to the same word either have different heights or fontnames?


jsfenfen commented on September 2, 2024

So I still haven't heard back about the files that originally required this. I could pretty easily just make up a sample PDF that failed the font height test, though obviously having a real example would be better... The other time this can come up is when the word tolerance is set too high and words run together inadvertently, though only if adjacent cells have different fonts. Will look around a bit.


jsvine commented on September 2, 2024

No worries. Thinking through this a bit. I'm tempted to, by default, group words by fonts, size, and color. (Yes, upcoming versions of pdfplumber will include font color!) Boolean params could turn them off. I.e., defaults would be:

def extract_words(chars,
  x_tolerance=DEFAULT_X_TOLERANCE,
  y_tolerance=DEFAULT_Y_TOLERANCE,
  keep_blank_chars=False,
  match_fontsize=True,
  match_fontcolor=True,
  match_fontname=True
)

That'd mean losing some of the flexibility of, e.g., DEFAULT_FONT_HEIGHT_TOLERANCE, but might make the options clearer. It'd also mean avoiding having to calculate the average/mode values for tolerance-ed attributes. For instance, this ...

page.extract_words()

... might return ...

[ {
  "text": "Hello",
  "fontsize": 12,
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

... while ...

page.extract_words(match_fontsize=False)

... would return ...

[ {
  "text": "Hello",
  "fontname": "ArialBold",
  "fontcolor": "#000000"
} ]

What do you think? Too inflexible?
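A rough sketch of how the proposed boolean flags could behave, for discussion only. It assumes the characters are already sorted left to right on one line and are dicts with `text`, `x0`, `x1`, `fontname`, and `size` keys. The `group_words` function is hypothetical and is not pdfplumber's implementation; font color is left out to keep it short.

```python
def group_words(chars, x_tolerance=3, match_fontname=True, match_fontsize=True):
    """Split one line of chars into words on gaps or attribute changes."""
    # Map each output key to the char key it is read from (an assumption).
    attrs = {}
    if match_fontname:
        attrs["fontname"] = "fontname"
    if match_fontsize:
        attrs["fontsize"] = "size"

    groups, current = [], []
    for char in chars:
        if current:
            prev = current[-1]
            gap = char["x0"] - prev["x1"]
            differs = any(char[k] != prev[k] for k in attrs.values())
            # Start a new word on a wide gap or a change in a matched attribute.
            if gap > x_tolerance or differs:
                groups.append(current)
                current = []
        current.append(char)
    if current:
        groups.append(current)

    words = []
    for group in groups:
        word = {"text": "".join(c["text"] for c in group)}
        # Matched attributes are uniform within a word, so report the first.
        for out_key, char_key in attrs.items():
            word[out_key] = group[0][char_key]
        words.append(word)
    return words


# Hypothetical chars: "Hi" in Arial run directly into "yo" in ArialBold.
chars = [
    {"text": "H", "x0": 0, "x1": 5, "fontname": "Arial", "size": 12},
    {"text": "i", "x0": 5, "x1": 7, "fontname": "Arial", "size": 12},
    {"text": "y", "x0": 7, "x1": 12, "fontname": "ArialBold", "size": 12},
    {"text": "o", "x0": 12, "x1": 17, "fontname": "ArialBold", "size": 12},
]
print(group_words(chars))
print(group_words(chars, match_fontname=False))
```

Because a word is only ever built from characters that agree on every matched attribute, no average or mode is needed, and an attribute that is not matched is simply omitted from the output, as in the examples above.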


jsfenfen commented on September 2, 2024

I think that's great!

Also, I think whatever adjustments might be needed will become more obvious the more pdfs we trawl through...


jsfenfen commented on September 2, 2024

Ok, I have this working in the word_fonts branch here using made up pdfs as tests. Trying to dig up the sample observed in the wild.

Am doing this with a custom WordFontError subclassed from RuntimeError, but am open to suggestions...

No idea if this will be at all helpful ahead of 0.60 rewrite, but...
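For readers who haven't seen the branch, a custom error of this kind might look roughly like the sketch below. Only the `WordFontError` name and its `RuntimeError` base come from the comment above; the `word_fontname` helper and the `fontname` char key are assumptions for illustration.

```python
class WordFontError(RuntimeError):
    """Raised when the characters of one word disagree on their font."""


def word_fontname(chars):
    """Return the single font name shared by all chars in a word."""
    if not chars:
        raise ValueError("a word needs at least one character")
    names = {c["fontname"] for c in chars}
    if len(names) > 1:
        # Mixed fonts usually mean two words were merged by mistake.
        raise WordFontError(f"mixed fonts in word: {sorted(names)}")
    return names.pop()


try:
    word_fontname([{"fontname": "Arial"}, {"fontname": "ArialBold"}])
except WordFontError as err:
    print(err)
```

Subclassing `RuntimeError` lets callers catch font mismatches specifically while existing `except RuntimeError` handlers keep working.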


jsvine commented on September 2, 2024

Ooh, thanks! Will definitely aim to incorporate this (or something close to it) into the next big release.


problemsniper commented on September 2, 2024

Is this in the current version? I am looking for font name and font size per word and not per letter.


problemsniper commented on September 2, 2024

Works perfectly! Thanks, @jsfenfen. I just have another question regarding the document. Did you try to reverse engineer to build a pdf out of the extracted properties of text? I wanted some tips on creating one, if you did look into doing it.


jsfenfen commented on September 2, 2024

"Did you try to reverse engineer to build a pdf out of the extracted properties of text?"
No... I'm not sure I get the use case; couldn't you just use the original PDF? But if you really want to create a PDF from objects of your choosing, maybe https://bitbucket.org/rptlab/reportlab ?


jsfenfen commented on September 2, 2024

@krishnakt031990 is this a pdf that's been OCR'ed? Fonts aren't very reliable in most of the OCR I've seen--could this have been set there? Also possible this is a pdfminer thing? Can you share a doc that does this?


problemsniper commented on September 2, 2024

For the font size: the reported point size is about 4-5 pt larger than the actual font size. I can give an example with an image here.

See that extra spacing on top of My?


Saqhas commented on September 2, 2024

@jsvine Is this issue resolved and the functionality added?


jsvine commented on September 2, 2024

This functionality has not yet been added. I'm certainly open to adding it, but haven't had the time quite yet.


Saqhas commented on September 2, 2024

I wanted this functionality in one of my projects, so I have made some changes to the repo code to support it. Should I push them to a branch and create a pull request, so that we can discuss and add it?


jsvine commented on September 2, 2024

Thanks, @Saqhas! It's definitely worth a discussion and opening a pull request. I'm not certain I'll use your code, but it could definitely be helpful inspiration and I would certainly credit you for that.


ibrahimshuail commented on September 2, 2024

Can we capture words based on font size? E.g., if my font size is 12, I need the words that have that size.
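One way to approach this, once each word carries a size, is a plain filter. The sketch below assumes words are dicts with a numeric `size` key; the `words_with_size` helper and its `tolerance` parameter are hypothetical names, not part of pdfplumber.

```python
def words_with_size(words, size, tolerance=0.5):
    """Keep the words whose font size is within `tolerance` of `size`."""
    # Extracted sizes are often fractional (e.g. 11.8), so compare loosely.
    return [w for w in words if abs(w["size"] - size) <= tolerance]


# Hypothetical extracted words with word-level sizes.
words = [
    {"text": "Title", "size": 18.0},
    {"text": "body", "size": 12.0},
    {"text": "text", "size": 11.8},
]
print([w["text"] for w in words_with_size(words, 12)])
```

A small tolerance is usually safer than exact equality, since the size reported for a font can differ slightly from the nominal point size.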


jsvine commented on September 2, 2024

@ibrahimshuail See my response to the separate issue you opened, #234


jsvine commented on September 2, 2024

Closing this now-done issue. Per merged PR above, this feature was added last year! 🎉
