Hi! Not sure if this is a bug or a feature, but I'd love to use the ...

`ai_extraction=True` not working locally (thepipe issue, open)

sisyga commented on September 25, 2024

`ai_extraction=True` not working locally

Comments (2)

emcf commented on September 25, 2024

Hi @sisyga, the `ai_extraction` parameter is only available from the API at the moment.

When running locally on PDFs with lots of pages, I experience this problem too. That is a reasonable workaround, although I don't think it is sufficient for the reasons you mentioned.

I am actually not sure what would be sufficient -- I am toying with the idea of training a page-image classifier to filter pages without visuals/tables, but this is quite demanding. If you had any additional ideas I would love to hear them!

sisyga commented on September 25, 2024

Hey, thanks for working to open-source the AI classifier. In the meantime, I use the following workaround:

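import json
from typing import List, Optional

import requests
from PIL import Image

# Chunk, SourceTypes, API_URL, THEPIPE_API_KEY and create_chunks_from_messages
# are assumed to be in scope already (they come from thepipe's own module).
def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: Optional[int] = None) -> List[Chunk]: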
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz  # PyMuPDF, imported lazily because only the local path needs it
        # extract the text of each page, plus a snapshot for pages that look visual
        with fitz.open(file_path) as doc:  # the context manager closes the document
            for page in doc:
                text = page.get_text()
                image = None
                if not text_only:
                    image_list = page.get_image_info()
                    drawing_count = len(page.get_drawings())
                    # only make a snapshot if there is an image or more than 5 drawing commands
                    if image_list or drawing_count > 5:
                        pix = page.get_pixmap()
                        image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                chunks.append(Chunk(path=file_path, text=text, image=image, source_type=SourceTypes.PDF))
    return chunks

Basically, I count the drawing commands on each page and take an image snapshot if the count exceeds a threshold (here 5, which could be exposed as an option) or if the page contains an embedded image. This works well because complex formulas and table lines also count toward the drawing commands, which is what I want.
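
Exposing the threshold as an option could look roughly like the sketch below. The names `needs_snapshot`, `pages_needing_snapshot`, and `drawing_threshold` are hypothetical and not part of thepipe; the PyMuPDF calls are the same ones used in the workaround above.

```python
from typing import List


def needs_snapshot(image_count: int, drawing_count: int, drawing_threshold: int = 5) -> bool:
    """Return True if a page has an embedded image or more drawing commands than the threshold."""
    return image_count > 0 or drawing_count > drawing_threshold


def pages_needing_snapshot(file_path: str, drawing_threshold: int = 5) -> List[int]:
    """Return the 0-based indices of pages that the heuristic flags for a snapshot."""
    import fitz  # PyMuPDF; imported lazily so needs_snapshot has no dependencies

    flagged = []
    with fitz.open(file_path) as doc:
        for index, page in enumerate(doc):
            image_count = len(page.get_image_info())
            drawing_count = len(page.get_drawings())
            if needs_snapshot(image_count, drawing_count, drawing_threshold):
                flagged.append(index)
    return flagged
```

Keeping the decision in a small pure function makes it easy to try different thresholds on a document before wiring the option into `extract_pdf`.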
