Comments (2)
Hi @sisyga, the ai_extraction parameter is currently only available through the API.
When running locally on PDFs with lots of pages, I experience this problem too. That is a reasonable workaround, although I don't think it is sufficient for the reasons you mentioned.
I am actually not sure what would be sufficient -- I am toying with the idea of training a page-image classifier to filter pages without visuals/tables, but this is quite demanding. If you had any additional ideas I would love to hear them!
from thepipe.
Hey, thanks for working to open-source the AI classifier. In the meantime, I use the following workaround:
def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz
        # extract text and images of each page from the PDF
        with open(file_path, 'rb') as file:
            doc = fitz.open(file_path)
            for page in doc:
                text = page.get_text()
                image_list = page.get_image_info()
                drawing_commands = page.get_drawings()
                drawing_count = len(drawing_commands)
                if text_only:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
                elif image_list or drawing_count > 5:  # only make a snapshot if there is an image or more than 5 lines drawn
                    pix = page.get_pixmap()
                    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                    chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))
                else:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
            doc.close()
    return chunks
Basically, I count the drawing commands on each page, and if the count exceeds a threshold (here 5, which could be exposed as an option), I take an image snapshot of that page. This works well enough, since complex formulas and table lines also count as drawing commands, which is what I want.
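To make the threshold configurable, the decision could be pulled out into a small helper. This is only a sketch: `page_needs_snapshot` and `drawing_threshold` are hypothetical names, not part of thepipe, and the helper assumes the page object exposes `get_image_info()` and `get_drawings()` as the PyMuPDF pages in the workaround above do.

```python
from typing import Any


def page_needs_snapshot(page: Any, drawing_threshold: int = 5) -> bool:
    """Decide whether a page should be rendered to an image.

    Hypothetical helper (not part of thepipe). `page` is assumed to be a
    PyMuPDF page, or any object with the same two methods.
    """
    # Any embedded raster image is enough to warrant a snapshot.
    has_images = len(page.get_image_info()) > 0
    # Vector drawing commands cover table lines, formula strokes, plots, etc.
    drawing_count = len(page.get_drawings())
    return has_images or drawing_count > drawing_threshold
```

Inside the page loop, the hard-coded check would then become `elif page_needs_snapshot(page, drawing_threshold=5):`, and the threshold could be passed down from an extra argument of `extract_pdf`.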