Comments (5)

dosubot commented on July 17, 2024

Hello, @chewbm05! I'm a bot here to help you with any issues or questions you may have. Let's see how we can solve this problem together.

The error "'WindowsPath' object has no attribute 'decode'" occurs because the pdf_path_or_url argument in the load_data method of SmartPDFLoader is expected to be a string, but a WindowsPath object is being passed instead. To fix this, you can convert the WindowsPath object to a string before passing it to the load_data method.

Here's how you can modify your code to ensure the pdf_path_or_url argument is a string:

from pathlib import Path
from llama_index.readers.smart_pdf_loader.base import SmartPDFLoader
from llama_index.core import SimpleDirectoryReader

# Convert the WindowsPath object to a string
file_path = Path("Armenia - Croatia Treaty.pdf")
file_path_str = str(file_path)

# Initialize SmartPDFLoader
# (llmsherpa_api_url must already be defined and point at your llmsherpa parsing endpoint)
pdf_reader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)

# Define the file extractor with the appropriate reader for PDF files
file_extractor = {".pdf": pdf_reader}

# Initialize SimpleDirectoryReader with the file_extractor
reader = SimpleDirectoryReader(input_files=[file_path_str], required_exts=[".pdf"], file_extractor=file_extractor)

# Load data from the directory
documents = reader.load_data()

This ensures that the pdf_path_or_url argument is a string, which should prevent the error from occurring [1].

from llama_index.

logan-markewich commented on July 17, 2024

@chewbm05 the PDF reader class just isn't handling the Path objects that SimpleDirectoryReader passes in. Easy bug fix in a PR.
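
The "no attribute 'decode'" message can be reproduced with the standard library alone. This is a minimal sketch, assuming (not confirmed in this thread) that the reader passes its argument to urllib.parse.urlparse to decide whether it is a URL. The helper name looks_like_url is hypothetical.

```python
from pathlib import Path
from urllib.parse import urlparse


def looks_like_url(path_or_url):
    """Hypothetical helper: treat anything with a URL scheme as a URL."""
    return urlparse(path_or_url).scheme != ""


pdf_path = Path("Armenia - Croatia Treaty.pdf")

try:
    looks_like_url(pdf_path)
except AttributeError as exc:
    # urlparse assumes any non-str argument is bytes and calls .decode() on it.
    error_message = str(exc)
    print(error_message)

# Converting to str first avoids the error for a relative path like this one.
is_url = looks_like_url(str(pdf_path))
```

If that assumption holds, the string conversion has to happen inside the reader (or in a wrapper around it), because SimpleDirectoryReader hands Path objects to the reader regardless of what was passed in input_files.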

chewbm05 commented on July 17, 2024

Thanks! I managed to do a workaround by creating a class based on SimpleDirectoryReader:

class CustomSimpleDirectoryReader(SimpleDirectoryReader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def load_data(self):
        documents = []
        for input_file in self.input_files:
            input_file_str = str(input_file)  # Ensure the path is a string
            try:
                doc = self.file_extractor[".pdf"].load_data(input_file_str)
                documents.extend(doc)
            except Exception as e:
                print(f"Failed to load file {input_file} with error: {e}. Skipping...")
        return documents
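
An alternative to subclassing the directory reader is to wrap the PDF reader so it always receives a string. The sketch below is self-contained and uses a stand-in reader. StringPathReader and StubReader are hypothetical names, not llama_index classes, and whether SimpleDirectoryReader accepts a file_extractor value that does not subclass its own reader base class is not verified here.

```python
import os
from pathlib import Path


class StringPathReader:
    """Hypothetical adapter: converts path-like arguments to str before delegating."""

    def __init__(self, reader):
        self._reader = reader

    def load_data(self, file, **kwargs):
        # os.fspath accepts str and any os.PathLike object, and returns str for both.
        return self._reader.load_data(os.fspath(file), **kwargs)


class StubReader:
    """Stand-in for a reader that only accepts str paths (not a real llama_index class)."""

    def load_data(self, pdf_path_or_url, extra_info=None):
        if not isinstance(pdf_path_or_url, str):
            raise TypeError("expected str")
        return [pdf_path_or_url]


wrapped = StringPathReader(StubReader())
documents = wrapped.load_data(Path("report.pdf"), extra_info={"source": "demo"})
print(documents)
```

Under those assumptions, the wrapped reader would replace pdf_reader in the file_extractor dictionary, leaving the rest of SimpleDirectoryReader's behavior (recursion, extension filtering, metadata) untouched.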

chewbm05 commented on July 17, 2024

However, this poses another issue:

# Use the CustomSimpleDirectoryReader
reader = CustomSimpleDirectoryReader(
    input_dir=data_path,
    required_exts=[".pdf"],
    file_extractor=file_extractor,
    recursive=True,
)

# Load documents
documents = reader.load_data()
Failed to load file C:\Users\chewb\DTA\data\Countries_samples\Armenia\Armenia-Cyprus Treaty.pdf with error: No host specified.. Skipping...
Failed to load file C:\Users\chewb\DTA\data\Countries_samples\Armenia\ArmeniaCroatiaTreaty.pdf with error: No host specified.. Skipping...

Does anyone know what the issue is?

logan-markewich commented on July 17, 2024

@chewbm05 I think you need to specify the llmsherpa_api_url? (I have no idea how this reader works)
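
Another possible cause worth ruling out, offered as a guess rather than a confirmed diagnosis: if the reader decides between "URL" and "local file" by checking for a URL scheme with urllib.parse.urlparse, an absolute Windows path given as a string looks like a URL, because the drive letter parses as a scheme. The sketch below uses one of the paths from the log above. That the reader works this way is an assumption.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# PureWindowsPath keeps Windows semantics on any operating system.
windows_path = str(
    PureWindowsPath(
        r"C:\Users\chewb\DTA\data\Countries_samples\Armenia\ArmeniaCroatiaTreaty.pdf"
    )
)

parsed = urlparse(windows_path)

# The drive letter is read as the scheme, and there is no host part at all.
print(repr(parsed.scheme), repr(parsed.netloc))
```

If the reader does take the download branch for such a string, an HTTP client would have a scheme but no host to connect to, which would fit the "No host specified" message. Checking that llmsherpa_api_url is set to a valid endpoint, as suggested above, is still the first thing to verify.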
