
dr-doc-search's Introduction

dr-doc-search's People

Contributors

klingefjord, namuan


dr-doc-search's Issues

Windows machine: 'C:\Program' is not recognized as an internal or external command

What happened: I installed dr-doc-search in a new environment and set up the path as described. The only difference is that I have a more recent version of ImageMagick, so IMCONV looks like this:
%PROGRAMFILES%\ImageMagick-7.1.0-Q16-HDRI\

Windows finds magick as a command, but when I run
dr-doc-search --train -i "pdfs\my_pdf.pdf" --embedding huggingface

I get this error:

'C:\Program' is not recognized as an internal or external command,
operable program or batch file.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\user\code_folder\dr_doc_test\venv\Scripts\dr-doc-search.exe\__main__.py", line 7, in <module>
  File "C:\Users\user\code_folder\dr_doc_test\venv\Lib\site-packages\doc_search\app.py", line 68, in main
    run_workflow(context, training_workflow_steps())
  File "C:\Users\user\code_folder\dr_doc_test\venv\Lib\site-packages\py_executable_checklist\workflow.py", line 36, in run_workflow
    __run_step(step, context)
  File "C:\Users\user\code_folder\dr_doc_test\venv\Lib\site-packages\py_executable_checklist\workflow.py", line 29, in __run_step
    returned_context = step_instance.execute() or {}
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\code_folder\dr_doc_test\venv\Lib\site-packages\doc_search\workflow\__init__.py", line 142, in execute
    run_command(convert_command)
  File "C:\Users\user\code_folder\dr_doc_test\venv\Lib\site-packages\py_executable_checklist\workflow.py", line 9, in run_command
    return subprocess.check_output(command, shell=True).decode("utf-8")  # nosemgrep
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'C:\Program Files\ImageMagick-7.1.0-Q16-HDRI\magick convert -density 150 -trim -background white -alpha remove -quality 100 -sharpen 0x1.0 C:\Users\user\OutputDir\dr-doc-search\my_pdf\my_pdf.pdf[1] -quality 100 C:\Users\user\OutputDir\dr-doc-search\my_pdf\images\output-1.png' returned non-zero exit status 1.

It seems to me that this is the problem: 'C:\Program' is not recognized as an internal or external command, operable program or batch file.

On Windows the folder is called Program Files, and the space in the name is a reliable source of errors in my scripts. I don't really understand where it comes from in this case.
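The message about 'C:\Program' is what cmd.exe typically prints when an unquoted executable path containing a space is passed as one command string: the shell cuts the path at the first space. A minimal sketch of one way around this, assuming the command is built as an argument list instead of a single string. `build_convert_command` is a hypothetical helper, not part of dr-doc-search; the flags and paths are copied from the traceback above.

```python
import subprocess


def build_convert_command(magick_exe, input_page, image_path):
    """Return the ImageMagick call as a list, one element per argument.

    Hypothetical helper for illustration only.
    """
    return [
        magick_exe, "convert",
        "-density", "150", "-trim",
        "-background", "white", "-alpha", "remove",
        "-quality", "100", "-sharpen", "0x1.0",
        input_page,
        "-quality", "100",
        image_path,
    ]


command = build_convert_command(
    r"C:\Program Files\ImageMagick-7.1.0-Q16-HDRI\magick",
    r"C:\Users\user\OutputDir\dr-doc-search\my_pdf\my_pdf.pdf[1]",
    r"C:\Users\user\OutputDir\dr-doc-search\my_pdf\images\output-1.png",
)

# list2cmdline shows how subprocess joins a list into a Windows command line:
# the executable path is wrapped in double quotes because it contains a space.
print(subprocess.list2cmdline(command))

# To actually run it (requires ImageMagick to be installed):
# subprocess.run(command, check=True)
```

Passing a list without `shell=True` lets `subprocess` do the quoting, so the space in Program Files no longer splits the executable path.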

Error when ConvertPDFToImages.execute is run

I get the error below when running dr-doc-search --train -i filename.pdf on Windows 11.
I noticed that convert_command is set to this:

convert_command = f"""convert -density 150 -trim -background white -alpha remove -quality 100 -sharpen 0x1.0 {input_file_page} -quality 100 {image_path}"""

I think convert might be an alias on Linux: https://linux.die.net/man/1/convert
but on Windows that alias is taken by an existing command: https://www.wikiwand.com/en/Convert_(command)
Is there a way to allow the use of magick convert for Windows machines?

Invalid Parameter - 150
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Python311\Scripts\dr-doc-search.exe\__main__.py", line 7, in <module>
  File "C:\Python311\Lib\site-packages\doc_search\app.py", line 48, in main
    run_workflow(context, training_workflow_steps())
  File "C:\Python311\Lib\site-packages\py_executable_checklist\workflow.py", line 36, in run_workflow
    __run_step(step, context)
  File "C:\Python311\Lib\site-packages\py_executable_checklist\workflow.py", line 29, in __run_step
    returned_context = step_instance.execute() or {}
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\site-packages\doc_search\workflow\__init__.py", line 102, in execute
    run_command(convert_command)
  File "C:\Python311\Lib\site-packages\py_executable_checklist\workflow.py", line 9, in run_command
    return subprocess.check_output(command, shell=True).decode("utf-8")  # nosemgrep
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\subprocess.py", line 465, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Python311\Lib\subprocess.py", line 569, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'convert -density 150 -trim -background white -alpha remove -quality 100 -sharpen 0x1.0 C:\Users\iqaco\OutputDir\dr-doc-search\wind-up-bird-chronicle\wind-up-bird-chronicle.pdf[1] -quality 100 C:\Users\iqaco\OutputDir\dr-doc-search\wind-up-bird-chronicle\images\output-1.png' returned non-zero exit status 4.
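
On Windows, a bare `convert` usually resolves to the built-in volume conversion utility, which would explain the "Invalid Parameter - 150" message. A minimal sketch of how the command prefix could be chosen at runtime, assuming the ImageMagick 7 `magick` launcher is on the PATH. `convert_prefix` is a hypothetical helper, not the project's actual code.

```python
import shutil


def convert_prefix(which=shutil.which):
    """Return the leading tokens for an ImageMagick convert call.

    Prefers the `magick` launcher (ImageMagick 7) and falls back to the
    bare `convert` name used in the command shown above.
    `which` is injectable so the choice can be tested without ImageMagick.
    """
    if which("magick"):
        return ["magick", "convert"]
    return ["convert"]


# Prints ['magick', 'convert'] or ['convert'] depending on this machine's PATH.
print(convert_prefix())
```

The remaining arguments (`-density 150 ...`) would be appended to this prefix unchanged.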

Feature Request: Instead of `--embedding huggingface` allow path to model

Hi, I'd like to suggest a feature: using a local model loaded from a models path.

Example

dr-doc-search --train -i my_pdf.pdf --path "my_models/ggml-model-14_0.bin"

It'd make the reuse of models easier and allow people with a restricted internet connection (company proxy in my case) to download models the way they can and use them later on.

Having that would be awesome!
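
A rough sketch of how such an option could be parsed, assuming an `argparse`-based command line. The `--path` flag is the one proposed in this request and does not exist in dr-doc-search; the parser below is a stand-alone illustration, not the tool's real argument handling.

```python
import argparse
from pathlib import Path


def parse_args(argv):
    """Parse a cut-down, hypothetical version of the command line."""
    parser = argparse.ArgumentParser(prog="dr-doc-search-sketch")
    parser.add_argument("--train", action="store_true")
    parser.add_argument("-i", dest="input", required=True)

    # Either name a hosted embedding provider or point at a local model file.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--embedding", default=None)
    source.add_argument("--path", type=Path, default=None)  # proposed flag, hypothetical

    args = parser.parse_args(argv)
    if args.path is not None and not args.path.is_file():
        parser.error(f"model file not found: {args.path}")
    return args


args = parse_args(["--train", "-i", "my_pdf.pdf", "--embedding", "huggingface"])
print(args.embedding, args.path)
```

Making the two options mutually exclusive keeps the existing `--embedding huggingface` behaviour intact while rejecting ambiguous combinations.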

Add feature to take text directly.

I want to run the indexing on an available text file instead of a PDF file. Basically, bypass the ImageMagick + OCR workflow and load the text file directly.
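
One way to picture the request is a step planner that skips the image and OCR stages for text input. This is only a sketch: `ConvertPDFToImages` is the step name mentioned in another issue on this page, while every other step name and the `plan_steps` function are hypothetical.

```python
from pathlib import Path

# Steps only needed when the input is a PDF ("RunOCR" is a hypothetical name).
PDF_ONLY_STEPS = ["ConvertPDFToImages", "RunOCR"]
# Steps shared by both input types (all hypothetical names).
SHARED_STEPS = ["LoadText", "SplitText", "BuildIndex"]


def plan_steps(input_file):
    """Return the ordered step names for the given input file."""
    suffix = Path(input_file).suffix.lower()
    if suffix == ".txt":
        return list(SHARED_STEPS)
    if suffix == ".pdf":
        return PDF_ONLY_STEPS + SHARED_STEPS
    raise ValueError(f"unsupported input type: {suffix!r}")


print(plan_steps("notes.txt"))
print(plan_steps("book.pdf"))
```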

Include and output sources

Use VectorDBQAWithSourcesChain and return sources along with the answer.

Check if we can get the page number so that a book page image can be displayed along with the answer.
Even better if we can highlight the interesting sentences.
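
The page-number idea can be sketched by storing the page alongside each text chunk so that it travels with whatever passage is retrieved. The example below is an assumption-laden illustration: `Chunk` and `answer_with_sources` are hypothetical names, and a simple word-overlap ranking stands in for the vector search the tool would actually use.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    page: int  # page of the source document the text came from


def answer_with_sources(question, chunks, top_k=1):
    """Rank chunks by words shared with the question; return passages and pages.

    Word overlap is a stand-in for vector similarity, used here only so the
    example runs without any external library.
    """
    words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(words & set(c.text.lower().split())),
        reverse=True,
    )
    best = ranked[:top_k]
    return {
        "passages": [c.text for c in best],
        "pages": sorted({c.page for c in best}),
    }


chunks = [
    Chunk("the well was dry", 12),
    Chunk("a cat went missing", 3),
    Chunk("the cat came back", 40),
]
result = answer_with_sources("where did the cat go", chunks)
print(result["pages"])
```

With the page numbers in hand, the matching page image could be looked up and shown next to the answer.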

Received error on running Huggingface LLM on a Windows 10 machine

After installing your library and trying the given example on my Windows 10 machine, I got the following error:
Invalid Parameter - 150
Traceback (most recent call last):
  File "C:\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Python310\Scripts\dr-doc-search.exe\__main__.py", line 7, in <module>
  File "C:\Python310\lib\site-packages\doc_search\app.py", line 68, in main
    run_workflow(context, training_workflow_steps())
  File "C:\Python310\lib\site-packages\py_executable_checklist\workflow.py", line 36, in run_workflow
    __run_step(step, context)
  File "C:\Python310\lib\site-packages\py_executable_checklist\workflow.py", line 29, in __run_step
    returned_context = step_instance.execute() or {}
  File "C:\Python310\lib\site-packages\doc_search\workflow\__init__.py", line 120, in execute
    run_command(convert_command)
  File "C:\Python310\lib\site-packages\py_executable_checklist\workflow.py", line 9, in run_command
    return subprocess.check_output(command, shell=True).decode("utf-8")  # nosemgrep
  File "C:\Python310\lib\subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "C:\Python310\lib\subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'convert -density 150 -trim -background white -alpha remove -quality 100 -sharpen 0x1.0 C:\Users\XXXX\OutputDir\dr-doc-search\parable_monetary_economy\parable_monetary_economy.pdf[1] -quality 100 C:\Users\XXXX\OutputDir\dr-doc-search\parable_monetary_economy\images\output-1.png' returned non-zero exit status 4.

Any help fixing this issue?

Different providers?

It would be nice to be able to use backends other than OpenAI, or even to train a model locally.

An error occurs when the file name contains Chinese characters

dr-doc-search --web-app -i ~/Downloads/吴晓波:勇敢者的方法论.pdf --llm huggingface
error:

Traceback (most recent call last):
  File "/Users/boyer/hsg/dr-doc-search/venv/bin/dr-doc-search", line 8, in <module>
    sys.exit(main())
  File "/Users/boyer/hsg/dr-doc-search/venv/lib/python3.10/site-packages/doc_search/app.py", line 66, in main
    run_web(context)
  File "/Users/boyer/hsg/dr-doc-search/venv/lib/python3.10/site-packages/doc_search/web.py", line 77, in run_web
    run_workflow(global_context, inference_workflow_steps())
  File "/Users/boyer/hsg/dr-doc-search/venv/lib/python3.10/site-packages/py_executable_checklist/workflow.py", line 36, in run_workflow
    __run_step(step, context)
  File "/Users/boyer/hsg/dr-doc-search/venv/lib/python3.10/site-packages/py_executable_checklist/workflow.py", line 29, in __run_step
    returned_context = step_instance.execute() or {}
  File "/Users/boyer/hsg/dr-doc-search/venv/lib/python3.10/site-packages/doc_search/workflow/__init__.py", line 254, in execute
    raise FileNotFoundError(f"FAISS DB file not found: {self.faiss_db}")
FileNotFoundError: FAISS DB file not found: /Users/boyer/OutputDir/dr-doc-search/index/index.pkl

After renaming the file to demo.pdf: is the question-and-answer content also incompatible with Chinese?

Screenshot 2023-03-19 22 34 19
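
The FAISS path in the traceback has no per-document folder between dr-doc-search and index, which would be consistent with the document name being reduced to an empty string when non-ASCII characters are dropped. This is a guess, not something the issue confirms. The sketch below contrasts an ASCII-only cleanup with a Unicode-aware one; both functions are hypothetical and are not the project's code.

```python
import re
from pathlib import PurePosixPath


def ascii_only_name(path):
    """Keep only ASCII letters and digits (the suspected failure mode)."""
    return re.sub(r"[^A-Za-z0-9]+", "", PurePosixPath(path).stem)


def unicode_safe_name(path):
    """Keep Unicode word characters; replace other runs with one hyphen."""
    return re.sub(r"[^\w-]+", "-", PurePosixPath(path).stem).strip("-")


pdf = "~/Downloads/吴晓波:勇敢者的方法论.pdf"
print(repr(ascii_only_name(pdf)))    # empty string: the folder name vanishes
print(repr(unicode_safe_name(pdf)))  # Chinese characters are preserved
```

In Python 3, `\w` in a `str` pattern matches Unicode letters by default, so the second function keeps the Chinese title while still removing punctuation such as the colon.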

text_splitter error

Hi, thanks for your work. I tried to run a test, but got this error:

pythondev1-ubuntu@pythondev1-ubuntu:~$ dr-doc-search --train -i ~/dr-doc-search/tests/data/kh.pdf
2023-02-07 20:57:32 - text_splitter.py:59 - Created a chunk of size 1339, which is longer than the specified 1000
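
A message like this usually appears when a separator-based splitter meets a single piece of text that is longer than the target size and contains no separator to cut at. The sketch below is a simplified stand-in, not the library's implementation; `split_on_separator` and its parameters are hypothetical names chosen for illustration.

```python
def split_on_separator(text, separator="\n\n", chunk_size=1000):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size.

    A piece that is itself longer than chunk_size cannot be cut, so it is
    emitted as one oversized chunk.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


# One 1339-character paragraph without a blank line, then a short one.
text = "a" * 1339 + "\n\n" + "b" * 10
chunks = split_on_separator(text)
print([len(c) for c in chunks])
```

Under this reading the oversized chunk is a consequence of the input's layout, and the run may still proceed with that chunk included.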
