Comments (5)
Hello, @chewbm05! I'm a bot here to help you with any issues or questions you may have. Let's see how we can solve this problem together.
The error "'WindowsPath' object has no attribute 'decode'" occurs because the pdf_path_or_url argument in the load_data method of SmartPDFLoader is expected to be a string, but a WindowsPath object is being passed instead. To fix this, you can convert the WindowsPath object to a string before passing it to the load_data method.
Here's how you can modify your code to ensure the pdf_path_or_url argument is a string:
from pathlib import Path

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.smart_pdf_loader.base import SmartPDFLoader

# Convert the WindowsPath object to a string
file_path = Path("Armenia - Croatia Treaty.pdf")
file_path_str = str(file_path)

# Initialize SmartPDFLoader (llmsherpa_api_url must already be defined)
pdf_reader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)

# Define the file extractor with the appropriate reader for PDF files
file_extractor = {".pdf": pdf_reader}

# Initialize SimpleDirectoryReader with the file_extractor
reader = SimpleDirectoryReader(
    input_files=[file_path_str],
    required_exts=[".pdf"],
    file_extractor=file_extractor,
)

# Load data from the listed input files
documents = reader.load_data()
This ensures that the pdf_path_or_url argument is a string, which should prevent the error from occurring [1].
from llama_index.
@chewbm05 the PDF reader class just isn't handling the Path objects that SimpleDirectoryReader passes in. It should be an easy bug fix in a PR.
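The kind of fix described here can be sketched without llama_index at all. The snippet below is only an illustration: `FakePDFReader` and `StrPathReader` are hypothetical names, and the use of `urlparse` inside the fake reader is an assumption about where the 'decode' error comes from, not something confirmed in this thread. It shows that `urlparse` fails on a path object and that coercing with `os.fspath` first avoids the failure.

```python
import os
from pathlib import PurePath
from urllib.parse import urlparse


class FakePDFReader:
    """Hypothetical stand-in for a reader that only accepts str paths."""

    def load_data(self, pdf_path_or_url, extra_info=None):
        # urlparse() treats a non-str argument as bytes and calls .decode()
        # on it, so a pathlib object raises AttributeError here.
        scheme = urlparse(pdf_path_or_url).scheme
        return [f"loaded {pdf_path_or_url} (scheme={scheme!r})"]


class StrPathReader:
    """Hypothetical wrapper: coerce path-like input to str, then delegate."""

    def __init__(self, inner):
        self.inner = inner

    def load_data(self, file, **kwargs):
        # os.fspath() returns a str for pathlib objects and leaves str as-is.
        return self.inner.load_data(os.fspath(file), **kwargs)


def try_load(reader, path):
    """Return the loaded documents, or the error text if loading fails."""
    try:
        return reader.load_data(path)
    except AttributeError as exc:
        return f"AttributeError: {exc}"


path = PurePath("treaty.pdf")
print(try_load(FakePDFReader(), path))                 # fails on the path object
print(try_load(StrPathReader(FakePDFReader()), path))  # succeeds after coercion
```

In an actual patch the coercion would presumably live inside the reader's own load_data rather than in a wrapper; the wrapper form is used here only to keep the sketch self-contained.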
Thanks! I managed to do a workaround by creating a class based on SimpleDirectoryReader:
class CustomSimpleDirectoryReader(SimpleDirectoryReader):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def load_data(self):
        documents = []
        for input_file in self.input_files:
            input_file_str = str(input_file)  # Ensure the path is a string
            try:
                # Always use the reader registered for ".pdf"
                doc = self.file_extractor[".pdf"].load_data(input_file_str)
                documents.extend(doc)
            except Exception as e:
                print(f"Failed to load file {input_file} with error: {e}. Skipping...")
        return documents
However, this poses another issue:
# Use the CustomSimpleDirectoryReader
reader = CustomSimpleDirectoryReader(
    input_dir=data_path,
    required_exts=[".pdf"],
    file_extractor=file_extractor,
    recursive=True,
)
# Load documents
documents = reader.load_data()
Failed to load file C:\Users\chewb\DTA\data\Countries_samples\Armenia\Armenia-Cyprus Treaty.pdf with error: No host specified.. Skipping...
Failed to load file C:\Users\chewb\DTA\data\Countries_samples\Armenia\ArmeniaCroatiaTreaty.pdf with error: No host specified.. Skipping...
Does anyone know what the issue is?
@chewbm05 I think you need to specify the llmsherpa_api_url? (I have no idea how this reader works)
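Another possibility worth checking, offered only as a hypothesis: if the underlying client decides between "URL" and "local file" by testing whether the input has a URL scheme (an assumption, not confirmed in this thread), then an absolute Windows path is misclassified, because `urllib.parse.urlparse` reads the drive letter as a scheme. An attempt to download from a "URL" with no host would be consistent with the "No host specified." message. The helper names below are hypothetical.

```python
from urllib.parse import urlparse


def looks_like_url(path_or_url: str) -> bool:
    """Naive check: anything with a non-empty scheme counts as a URL."""
    return urlparse(path_or_url).scheme != ""


def looks_like_http_url(path_or_url: str) -> bool:
    """Stricter check: only http and https schemes count as URLs."""
    return urlparse(path_or_url).scheme in ("http", "https")


windows_path = r"C:\Users\chewb\DTA\data\Countries_samples\Armenia\ArmeniaCroatiaTreaty.pdf"

print(urlparse(windows_path).scheme)      # 'c' (the drive letter, lowercased)
print(looks_like_url(windows_path))       # True: misclassified as a URL
print(looks_like_http_url(windows_path))  # False: treated as a local file
```

If this is what is happening, reading the file yourself and handing the bytes to the parser, or using a stricter scheme check like the second helper, would sidestep the problem; whether the reader exposes a way to do either is not established here.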