[Question]: SmartPDFLoader does not work as a file_extractor (llama_index, closed)

chewbm05 commented on July 17, 2024

[Question]: SmartPDFLoader does not work as a file_extractor

Comments (5)

dosubot commented on July 17, 2024

Hello, @chewbm05! I'm a bot here to help you with any issues or questions you may have. Let's see how we can solve this problem together.

The error "'WindowsPath' object has no attribute 'decode'" occurs because the pdf_path_or_url argument in the load_data method of SmartPDFLoader is expected to be a string, but a WindowsPath object is being passed instead. To fix this, you can convert the WindowsPath object to a string before passing it to the load_data method.

Here's how you can modify your code to ensure the pdf_path_or_url argument is a string:

from pathlib import Path
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.smart_pdf import SmartPDFLoader

# Convert the WindowsPath object to a string
file_path = Path("Armenia - Croatia Treaty.pdf")
file_path_str = str(file_path)

# Initialize SmartPDFLoader
# (llmsherpa_api_url is assumed to be defined earlier as the URL of your LLMSherpa parsing endpoint)
pdf_reader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)

# Define the file extractor with the appropriate reader for PDF files
file_extractor = {".pdf": pdf_reader}

# Initialize SimpleDirectoryReader with the file_extractor
reader = SimpleDirectoryReader(
    input_files=[file_path_str],
    required_exts=[".pdf"],
    file_extractor=file_extractor,
)

# Load data from the listed files
documents = reader.load_data()

This ensures that the pdf_path_or_url argument is a string, which should prevent the error from occurring ^[1].
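For context, the error message itself can be reproduced with the standard library alone. This is a minimal sketch, assuming the loader's string-only code path ends up in `urllib.parse.urlparse` (an assumption, since the thread does not include a traceback). `urlparse` treats any non-`str` argument as bytes and calls `.decode` on it, which a `Path` object does not have. The helper `try_parse` is hypothetical and exists only for this illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def try_parse(path_or_url):
    """Return the parsed URL scheme, or the error message if parsing fails."""
    try:
        return urlparse(path_or_url).scheme
    except AttributeError as exc:
        return f"error: {exc}"


pdf_path = Path("Armenia - Croatia Treaty.pdf")

# A Path is not a str, so urlparse tries to decode it as bytes and fails.
path_result = try_parse(pdf_path)

# Converting to str first lets urlparse succeed; a plain file name has no scheme.
str_result = try_parse(str(pdf_path))

print(path_result)
print(repr(str_result))
```

On Windows the message names `WindowsPath`; on other systems it names `PosixPath`.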

logan-markewich commented on July 17, 2024

@chewbm05 the PDF reader class just isn't handling the path objects that SimpleDirectoryReader passes in. Easy bug fix in a PR.
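Until such a fix is available, one possible stopgap is a thin adapter that converts whatever path object it receives to a string before delegating to the real reader. The sketch below is hypothetical: `StrPathReader` is not part of llama_index, `FakePDFReader` is a stand-in used only so the example runs without the library, and the `load_data(path, extra_info=None)` signature is an assumption about what a string-only reader such as SmartPDFLoader accepts.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional


class StrPathReader:
    """Hypothetical adapter: coerces the file argument to str, then delegates."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner

    def load_data(
        self, file: Any, extra_info: Optional[Dict] = None, **kwargs: Any
    ) -> List[Any]:
        # Path objects become plain strings; strings and URLs pass through unchanged.
        # Extra keyword arguments are accepted but not forwarded.
        return self._inner.load_data(str(file), extra_info=extra_info)


class FakePDFReader:
    """Stand-in for a string-only reader (for demonstration only)."""

    def load_data(
        self, pdf_path_or_url: str, extra_info: Optional[Dict] = None
    ) -> List[str]:
        if not isinstance(pdf_path_or_url, str):
            raise TypeError("expected a string path or URL")
        return [f"parsed:{pdf_path_or_url}"]


reader = StrPathReader(FakePDFReader())
docs = reader.load_data(Path("ArmeniaCroatiaTreaty.pdf"))
print(docs)
```

If SimpleDirectoryReader only requires a `load_data` method on its extractors (untested here), the adapter could be registered as `file_extractor = {".pdf": StrPathReader(pdf_reader)}`.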

chewbm05 commented on July 17, 2024

Thanks! I managed to do a workaround by creating a class based on SimpleDirectoryReader:

from llama_index.core import SimpleDirectoryReader

class CustomSimpleDirectoryReader(SimpleDirectoryReader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def load_data(self):
        documents = []
        for input_file in self.input_files:
            input_file_str = str(input_file)  # Ensure the path is a string
            try:
                # Call the PDF reader directly instead of the default per-file dispatch
                doc = self.file_extractor[".pdf"].load_data(input_file_str)
                documents.extend(doc)
            except Exception as e:
                print(f"Failed to load file {input_file} with error: {e}. Skipping...")
        return documents

chewbm05 commented on July 17, 2024

However, this poses another issue:

# Use the CustomSimpleDirectoryReader
reader = CustomSimpleDirectoryReader(
    input_dir=data_path,
    required_exts=[".pdf"],
    file_extractor=file_extractor,
    recursive=True,
)

# Load documents
documents = reader.load_data()

Failed to load file C:\Users\chewb\DTA\data\Countries_samples\Armenia\Armenia-Cyprus Treaty.pdf with error: No host specified.. Skipping...
Failed to load file C:\Users\chewb\DTA\data\Countries_samples\Armenia\ArmeniaCroatiaTreaty.pdf with error: No host specified.. Skipping...

Does anyone know what the issue is?

logan-markewich commented on July 17, 2024

@chewbm05 I think you need to specify the llmsherpa_api_url? (I have no idea how this reader works.)
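"No host specified" is the kind of message an HTTP client raises when the request URL has no host part, which would fit an empty or malformed llmsherpa_api_url. That reading is an assumption, since the thread does not confirm the cause. A minimal sketch of a check that could run before constructing the loader follows; `check_api_url` and `is_valid` are hypothetical helpers, and the example.com address is a placeholder rather than a real endpoint.

```python
from urllib.parse import urlparse


def check_api_url(url):
    """Raise ValueError unless url is an http(s) URL that includes a host."""
    parsed = urlparse(url or "")
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"llmsherpa_api_url needs a scheme and a host, got: {url!r}")
    return url


def is_valid(url):
    """Convenience wrapper that reports validity as a boolean."""
    try:
        check_api_url(url)
        return True
    except ValueError:
        return False


print(is_valid("https://example.com/api/parseDocument"))  # full URL
print(is_valid(""))                                        # empty string
print(is_valid(None))                                      # unset value
print(is_valid("example.com/api/parseDocument"))           # missing scheme
```

Failing early with a clear message avoids the per-file "Skipping..." output, which hides the fact that every file fails for the same reason.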
