

PaperQA


This is a minimal package for doing question answering from PDFs or text files (which can be raw HTML). It strives to give very good answers, with no hallucinations, by grounding responses with in-text citations.

By default, it uses OpenAI Embeddings with a simple numpy vector DB to embed and search documents. However, via langchain you can use open-source models or embeddings (see details below).

paper-qa uses the process shown below:

  1. embed docs into vectors
  2. embed query into vector
  3. search for top k passages in docs
  4. create summary of each passage relevant to query
  5. score and select only relevant summaries
  6. put summaries into prompt
  7. generate answer with prompt

See our paper for more details:

@article{lala2023paperqa,
  title={PaperQA: Retrieval-Augmented Generative Agent for Scientific Research},
  author={L{\'a}la, Jakub and O'Donoghue, Odhran and Shtedritski, Aleksandar and Cox, Sam and Rodriques, Samuel G and White, Andrew D},
  journal={arXiv preprint arXiv:2312.07559},
  year={2023}
}

Output Example

Question: How can carbon nanotubes be manufactured at a large scale?

Carbon nanotubes can be manufactured at a large scale using the electric-arc technique (Journet6644). This technique involves creating an arc between two electrodes in a reactor under a helium atmosphere and using a mixture of a metallic catalyst and graphite powder in the anode. Yields of 80% of entangled carbon filaments can be achieved, which consist of smaller aligned SWNTs self-organized into bundle-like crystallites (Journet6644). Additionally, carbon nanotubes can be synthesized and self-assembled using various methods such as DNA-mediated self-assembly, nanoparticle-assisted alignment, chemical self-assembly, and electro-addressed functionalization (Tulevski2007). These methods have been used to fabricate large-area nanostructured arrays, high-density integration, and freestanding networks (Tulevski2007). 98% semiconducting CNT network solution can also be used and is separated from metallic nanotubes using a density gradient ultracentrifugation approach (Chen2014). The substrate is incubated in the solution and then rinsed with deionized water and dried with N2 air gun, leaving a uniform carbon network (Chen2014).

References

Journet6644: Journet, Catherine, et al. "Large-scale production of single-walled carbon nanotubes by the electric-arc technique." Nature 388.6644 (1997): 756-758.

Tulevski2007: Tulevski, George S., et al. "Chemically assisted directed assembly of carbon nanotubes for the fabrication of large-scale device arrays." Journal of the American Chemical Society 129.39 (2007): 11964-11968.

Chen2014: Chen, Haitian, et al. "Large-scale complementary macroelectronics using hybrid integration of carbon nanotubes and IGZO thin-film transistors." Nature communications 5.1 (2014): 4097.

What's New?

Version 4 removed langchain from the package because it no longer supports pickling. This also simplifies the package a bit, especially the prompts. Langchain can still be used, but it's not required. You can use any LLM from langchain, but you will need to wrap the model with the LangchainLLMModel class.

Install

Install with pip:

pip install paper-qa

You need to have an LLM to use paper-qa. You can use OpenAI, llama.cpp (via Server), or any LLMs from langchain. OpenAI just works, as long as you have set your OpenAI API key (export OPENAI_API_KEY=sk-...). See instructions below for other LLMs.

Usage

To use paper-qa, you need to have a list of paths/files/urls (valid extensions include: .pdf, .txt). You can then use the Docs class to add the documents and then query them. Docs will try to guess citation formats from the content of the files, but you can also provide them yourself.

from paperqa import Docs

my_docs = ...# get a list of paths

docs = Docs()
for d in my_docs:
    docs.add(d)

answer = docs.query("What manufacturing challenges are unique to bispecific antibodies?")
print(answer.formatted_answer)

The answer object has the following attributes: formatted_answer, answer (the answer alone), question, and context (the summaries of passages found for the answer).
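For example, continuing the snippet above (attribute names taken from the description here), you can inspect the pieces of the answer individually:

# inspect the Answer object returned by Docs.query
print(answer.question)          # the question that was asked
print(answer.answer)            # the answer text alone
print(answer.formatted_answer)  # the answer plus references
print(answer.context)           # the passage summaries used to build the answer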

Async

paper-qa is written to be used asynchronously. The synchronous API is just a wrapper around the async. Here are the methods and their async equivalents:

Sync Async
Docs.add Docs.aadd
Docs.add_file Docs.aadd_file
Docs.add_url Docs.aadd_url
Docs.get_evidence Docs.aget_evidence
Docs.query Docs.aquery

The synchronous versions just call the async versions in an event loop. Most modern Python environments support async natively (including Jupyter notebooks!), so you can do this in a Jupyter notebook:

from paperqa import Docs

my_docs = ...# get a list of paths

docs = Docs()
for d in my_docs:
    await docs.aadd(d)

answer = await docs.aquery("What manufacturing challenges are unique to bispecific antibodies?")

Adding Documents

add will add from paths. You can also use add_file (expects a file object) or add_url to work with other sources.
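As a rough sketch (the file names and URL are placeholders, and it is assumed here that add_file and add_url accept the same citation/docname keywords as add):

from paperqa import Docs

docs = Docs()

# add from a path on disk
docs.add("my_paper.pdf")

# add from an already-open file object
with open("another_paper.pdf", "rb") as f:
    docs.add_file(f, citation="Author et al., 2023", docname="Author2023")

# add directly from a URL
docs.add_url(
    "https://example.com/third_paper.pdf",
    citation="Other et al., 2024",
    docname="Other2024",
)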

Choosing Model

By default, it uses a hybrid of gpt-3.5-turbo and gpt-4-turbo. You can adjust this:

docs = Docs(llm='gpt-3.5-turbo')
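For instance, a minimal sketch (the model strings are just examples, and it is assumed here that a separate summary_llm parameter is also accepted for the per-passage summaries, as in earlier versions):

from paperqa import Docs

# one model for everything
docs = Docs(llm="gpt-3.5-turbo")

# or, assuming summary_llm is accepted: a cheaper model for per-passage
# summaries and a stronger model for the final answer
docs = Docs(llm="gpt-4-turbo-preview", summary_llm="gpt-3.5-turbo")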

or you can use any other model available in langchain:

from paperqa import Docs
from langchain_community.chat_models import ChatAnthropic
docs = Docs(llm="langchain",
            client=ChatAnthropic())

Note that we split the model into the wrapper and the client, which is ChatAnthropic here. This is because the client stores the non-pickleable part, and langchain LLMs are only sometimes serializable/pickleable. The paper-qa Docs object must always be serializable, so we split the model into two parts.

import pickle
docs = Docs(llm="langchain",
            client=ChatAnthropic())
model_str = pickle.dumps(docs)
docs = pickle.loads(model_str)
# but you have to set the client after loading
docs.set_client(ChatAnthropic())

Locally Hosted

You can use llama.cpp as the LLM. Note that you should be using relatively large models, because paper-qa requires following a lot of instructions. You won't get good performance with 7B models.

The easiest way to get set up is to download a llamafile and execute it with -cb -np 4 -a my-llm-model --embedding, which will enable continuous batching and embeddings.

from paperqa import Docs, LlamaEmbeddingModel, OpenAILLMModel
from openai import AsyncOpenAI

# start the llama.cpp server as described above, then point a client at it

local_client = AsyncOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",
)

docs = Docs(client=local_client,
            embedding_model=LlamaEmbeddingModel(),
            llm_model=OpenAILLMModel(config=dict(model="my-llm-model", temperature=0.1, frequency_penalty=1.5, max_tokens=512)))

Changing Embedding Model

You can use langchain embedding models or SentenceTransformer models. For example:

from paperqa import Docs, OpenAILLMModel, SentenceTransformerEmbeddingModel
from openai import AsyncOpenAI

# again pointing at a local llama.cpp server

local_client = AsyncOpenAI(
    base_url="http://localhost:8080/v1",
    api_key="sk-no-key-required",
)

docs = Docs(client=local_client,
            embedding_model=SentenceTransformerEmbeddingModel(),
            llm_model=OpenAILLMModel(config=dict(model="my-llm-model", temperature=0.1, frequency_penalty=1.5, max_tokens=512)))

Just like in the above examples, we have to split a langchain embedding model into a client and model to keep Docs serializable:

from paperqa import Docs, LangchainEmbeddingModel
from langchain_openai import OpenAIEmbeddings

docs = Docs(embedding_model=LangchainEmbeddingModel(), embedding_client=OpenAIEmbeddings())

Adjusting number of sources

You can adjust the number of sources (passages of text) to reduce token usage or add more context. k refers to the top k most relevant and diverse passages (they may come from different sources). Each passage is sent to the LLM to be summarized, or judged irrelevant. After this step, a limit of max_sources is applied so that the final answer can fit into the LLM's context window. Thus, k > max_sources, and max_sources is the number of sources used in the final answer.

docs.query("What manufacturing challenges are unique to bispecific antibodies?", k = 5, max_sources = 2)

Using Code or HTML

You do not need to use papers -- you can use code or raw HTML. Note that this tool is focused on answering questions, so it won't do well at writing code. Also note that the tool cannot infer citations from code, so you will need to provide them yourself.

import glob
import os

from paperqa import Docs

source_files = glob.glob('**/*.js', recursive=True)

docs = Docs()
for f in source_files:
    # this assumes the file names are unique in code
    docs.add(f, citation='File ' + os.path.basename(f), docname=os.path.basename(f))
answer = docs.query("Where is the search bar in the header defined?")
print(answer)

Using External DB/Vector DB and Caching

You may want to cache parsed texts and embeddings in an external database or file. You can then build a Docs object from those directly:

docs = Docs()

for ... in my_docs:
    doc = Doc(docname=..., citation=..., dockey=...)
    texts = [Text(text=..., name=..., doc=doc) for ... in my_texts]
    docs.add_texts(texts, doc)
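As a slightly fuller sketch, assuming Doc and Text are importable from the top-level package as used above, and that you have your own loader for previously parsed chunks (load_rows_from_my_db, the row fields, and the embedding keyword below are placeholders/assumptions):

from paperqa import Doc, Docs, Text

rows = load_rows_from_my_db()  # your own loader; each row has the chunk text and (optionally) a cached embedding

docs = Docs()
doc = Doc(
    docname="Wiki2023",
    citation="WikiMedia Foundation, 2023, Accessed now",
    dockey="wiki-2023",
)
texts = [
    # drop the embedding keyword to have paper-qa re-embed the chunks
    Text(text=row.text, name=f"Wiki2023 chunk {i}", doc=doc, embedding=row.embedding)
    for i, row in enumerate(rows)
]
docs.add_texts(texts, doc)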

If you want to use an external vector store, you can also do that directly via langchain. For example, to use the FAISS vector store from langchain:

from paperqa import LangchainVectorStore, Docs
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

my_index = LangchainVectorStore(cls=FAISS, embedding_model=OpenAIEmbeddings())
docs = Docs(texts_index=my_index)

Where do I get papers?

Well that's a really good question! It's probably best to just download PDFs of papers you think will help answer your question and start from there.

Zotero

If you use Zotero to organize your personal bibliography, you can use the paperqa.contrib.ZoteroDB to query papers from your library, which relies on pyzotero.

Install pyzotero to use this feature:

pip install pyzotero

First, note that paperqa parses the PDFs of papers to store in the database, so all relevant papers should have PDFs stored inside your database. You can get Zotero to automatically do this by highlighting the references you wish to retrieve, right clicking, and selecting "Find Available PDFs". You can also manually drag-and-drop PDFs onto each reference.

To download papers, you need to get an API key for your account.

  1. Get your library ID, and set it as the environment variable ZOTERO_USER_ID.
    • For personal libraries, this ID is shown on your Zotero API keys page (https://www.zotero.org/settings/keys), at the part "Your userID for use in API calls is XXXXXX".
    • For group libraries, go to your group page https://www.zotero.org/groups/groupname, and hover over the settings link. The ID is the integer after /groups/. (h/t pyzotero!)
  2. Create a new API key on the same settings page (https://www.zotero.org/settings/keys) and set it as the environment variable ZOTERO_API_KEY.
    • The key will need read access to the library.

With this, we can download papers from our library and add them to paperqa:

from paperqa import Docs
from paperqa.contrib import ZoteroDB

docs = Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, docname=item.key)

which will download the first 20 papers in your Zotero database and add them to the Docs object.

We can also do specific queries of our Zotero library and iterate over the results:

for item in zotero.iterate(
        q="large language models",
        qmode="everything",
        sort="date",
        direction="desc",
        limit=100,
):
    print("Adding", item.title)
    docs.add(item.pdf, docname=item.key)

You can read more about the search syntax by typing zotero.iterate? in IPython.

Paper Scraper

If you want to search for papers outside of your own collection, I've found an unrelated project called paper-scraper that looks like it might help. But beware: this project looks like it uses some scraping tools that may violate publishers' rights or be in a gray area of legality.

import paperscraper

import paperqa

keyword_search = 'bispecific antibody manufacture'
papers = paperscraper.search_papers(keyword_search)
docs = paperqa.Docs()
for path, data in papers.items():
    try:
        docs.add(path)
    except ValueError as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print('Could not read', path, e)
answer = docs.query("What manufacturing challenges are unique to bispecific antibodies?")
print(answer)

PDF Reading Options

By default, PyPDF is used since it's pure Python and easy to install. For faster PDF reading, paper-qa will detect and use PyMuPDF (fitz):

pip install pymupdf

Callbacks Factory

To execute a function on each chunk of LLM completions, you need to provide a function that, when called with the name of the step, produces a list of functions to execute on each chunk. For example, to get a typewriter view of the completions, you can do:

def make_typewriter(step_name):
    def typewriter(chunk):
        print(chunk, end="")
    return [typewriter] # <- note that this is a list of functions
...
docs.query("What manufacturing challenges are unique to bispecific antibodies?", get_callbacks=make_typewriter)

Caching Embeddings

In general, embeddings are cached when you pickle a Docs regardless of what vector store you use. See above for details on more explicit management of them.

Customizing Prompts

You can customize any of the prompts, using the PromptCollection class. For example, if you want to change the prompt for the question, you can do:

from paperqa import Docs, Answer, PromptCollection

my_qaprompt = "Answer the question '{question}' "
    "Use the context below if helpful. "
    "You can cite the context using the key "
    "like (Example2012). "
    "If there is insufficient context, write a poem "
    "about how you cannot answer.\n\n"
    "Context: {context}\n\n"
prompts=PromptCollection(qa=my_qaprompt)
docs = Docs(prompts=prompts)

Pre and Post Prompts

Following the syntax above, you can also include prompts that are executed before the query (pre) and after the query (post). For example, you can use the post prompt to critique the answer, as in the sketch below.
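A minimal sketch, assuming the PromptCollection fields are named pre and post and that, like the qa prompt above, they are format strings (with {question} available):

from paperqa import Docs, PromptCollection

my_post_prompt = (
    "We just answered the question '{question}'. "
    "Critique the answer above: point out any claims that are not supported "
    "by the cited context and rewrite the answer if needed."
)
prompts = PromptCollection(post=my_post_prompt)
docs = Docs(prompts=prompts)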

FAQ

How is this different from LlamaIndex?

It's not that different! This is similar to the tree response method in LlamaIndex. I have just included some prompts I find useful, readers that give page numbers/line numbers, and am focused on one task: answering technical questions with cited sources.

How is this different from LangChain?

There has been some great work on retrievers in langchain and you could say this is an example of a retriever.

Can I save or load?

The Docs class can be pickled and unpickled. This is useful if you want to save the embeddings of the documents and then load them later.

import pickle

# save
with open("my_docs.pkl", "wb") as f:
    pickle.dump(docs, f)

# load
with open("my_docs.pkl", "rb") as f:
    docs = pickle.load(f)

docs.set_client()  # defaults to OpenAI


paper-qa's Issues

Demo screenshots

Super cool project!

@whitead Could you upload a few screenshots for showcasing demo scenarios? It'd be great to put them in the README.

Using a local model has issues

Hi @whitead ,

I used the following code block and used a quantized LLM from here: https://huggingface.co/Pi3141

The model I used: https://huggingface.co/Pi3141/gpt4-x-alpaca-native-13B-ggml

import paperscraper
from paperqa import Docs
from langchain.embeddings import LlamaCppEmbeddings
from langchain.llms import LlamaCpp

llm = LlamaCpp(model_path="ggml-model-q4_1.bin")
embeddings = LlamaCppEmbeddings(model_path="ggml-model-q4_1.bin")
docs = Docs(llm=llm, embeddings=embeddings)

keyword_search = 'bispecific antibody manufacture'
papers = paperscraper.search_papers(keyword_search, limit=2)
for path,data in papers.items():
    try:
        docs.add(path,chunk_chars=500)
    except ValueError as e:
        print('Could not read', path, e)

answer = docs.query("What manufacturing challenges are unique to bispecific antibodies?")
print(answer)
  • This did run for nearly 1.5 hours and crashed.

Error log:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Input In [4], in <cell line: 12>()
     10         print('Could not read', path, e)
     11 print("Im here")
---> 12 answer = docs.query("What manufacturing challenges are unique to bispecific antibodies?")
     13 print(answer)
     14 end_time = time.time()

File ~/.local/lib/python3.10/site-packages/paperqa/docs.py:374, in Docs.query(self, query, k, max_sources, length_prompt, marginal_relevance, answer, key_filter)
    372     loop = asyncio.new_event_loop()
    373     asyncio.set_event_loop(loop)
--> 374 return loop.run_until_complete(
    375     self.aquery(
    376         query,
    377         k=k,
    378         max_sources=max_sources,
    379         length_prompt=length_prompt,
    380         marginal_relevance=marginal_relevance,
    381         answer=answer,
    382         key_filter=key_filter,
    383     )
    384 )

File ~/.local/lib/python3.10/site-packages/nest_asyncio.py:89, in _patch_loop.<locals>.run_until_complete(self, future)
     86 if not f.done():
     87     raise RuntimeError(
     88         'Event loop stopped before Future completed.')
---> 89 return f.result()

File /usr/lib/python3.10/asyncio/futures.py:201, in Future.result(self)
    199 self.__log_traceback = False
    200 if self._exception is not None:
--> 201     raise self._exception.with_traceback(self._exception_tb)
    202 return self._result

File /usr/lib/python3.10/asyncio/tasks.py:234, in Task.__step(***failed resolving arguments***)
    232         result = coro.send(None)
    233     else:
--> 234         result = coro.throw(exc)
    235 except StopIteration as exc:
    236     if self._must_cancel:
    237         # Task is cancelled right before coro stops.

File ~/.local/lib/python3.10/site-packages/paperqa/docs.py:406, in Docs.aquery(self, query, k, max_sources, length_prompt, marginal_relevance, answer, key_filter)
    404         answer.tokens += cb.total_tokens
    405         answer.cost += cb.total_cost
--> 406     answer = await self.aget_evidence(
    407         answer,
    408         k=k,
    409         max_sources=max_sources,
    410         marginal_relevance=marginal_relevance,
    411         key_filter=keys if key_filter else None,
    412     )
    413 context_str, contexts = answer.context, answer.contexts
    414 bib = dict()

File ~/.local/lib/python3.10/site-packages/paperqa/docs.py:311, in Docs.aget_evidence(self, answer, k, max_sources, marginal_relevance, key_filter)
    308     return None
    310 with get_openai_callback() as cb:
--> 311     contexts = await asyncio.gather(*[process(doc) for doc in docs])
    312 answer.tokens += cb.total_tokens
    313 answer.cost += cb.total_cost

File /usr/lib/python3.10/asyncio/tasks.py:304, in Task.__wakeup(self, future)
    302 def __wakeup(self, future):
    303     try:
--> 304         future.result()
    305     except BaseException as exc:
    306         # This may also be a cancellation.
    307         self.__step(exc)

File /usr/lib/python3.10/asyncio/tasks.py:232, in Task.__step(***failed resolving arguments***)
    228 try:
    229     if exc is None:
    230         # We use the `send` method directly, because coroutines
    231         # don't have `__iter__` and `__next__` methods.
--> 232         result = coro.send(None)
    233     else:
    234         result = coro.throw(exc)

File ~/.local/lib/python3.10/site-packages/paperqa/docs.py:299, in Docs.aget_evidence.<locals>.process(doc)
    294 if doc.metadata["key"] in [c.key for c in answer.contexts]:
    295     return None
    296 c = Context(
    297     key=doc.metadata["key"],
    298     citation=doc.metadata["citation"],
--> 299     context=await self.summary_chain.arun(
    300         question=answer.question,
    301         context_str=doc.page_content,
    302         citation=doc.metadata["citation"],
    303     ),
    304     text=doc.page_content,
    305 )
    306 if "Not applicable" not in c.context:
    307     return c

File ~/.local/lib/python3.10/site-packages/langchain/chains/base.py:237, in Chain.arun(self, *args, **kwargs)
    234     return (await self.acall(args[0]))[self.output_keys[0]]
    236 if kwargs and not args:
--> 237     return (await self.acall(kwargs))[self.output_keys[0]]
    239 raise ValueError(
    240     f"`run` supported with either positional arguments or keyword arguments"
    241     f" but not both. Got args: {args} and kwargs: {kwargs}."
    242 )

File ~/.local/lib/python3.10/site-packages/langchain/chains/base.py:154, in Chain.acall(self, inputs, return_only_outputs)
    152     else:
    153         self.callback_manager.on_chain_error(e, verbose=self.verbose)
--> 154     raise e
    155 if self.callback_manager.is_async:
    156     await self.callback_manager.on_chain_end(outputs, verbose=self.verbose)

File ~/.local/lib/python3.10/site-packages/langchain/chains/base.py:148, in Chain.acall(self, inputs, return_only_outputs)
    142     self.callback_manager.on_chain_start(
    143         {"name": self.__class__.__name__},
    144         inputs,
    145         verbose=self.verbose,
    146     )
    147 try:
--> 148     outputs = await self._acall(inputs)
    149 except (KeyboardInterrupt, Exception) as e:
    150     if self.callback_manager.is_async:

File ~/.local/lib/python3.10/site-packages/langchain/chains/llm.py:135, in LLMChain._acall(self, inputs)
    134 async def _acall(self, inputs: Dict[str, Any]) -> Dict[str, str]:
--> 135     return (await self.aapply([inputs]))[0]

File ~/.local/lib/python3.10/site-packages/langchain/chains/llm.py:123, in LLMChain.aapply(self, input_list)
    121 async def aapply(self, input_list: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    122     """Utilize the LLM generate method for speed gains."""
--> 123     response = await self.agenerate(input_list)
    124     return self.create_outputs(response)

File ~/.local/lib/python3.10/site-packages/langchain/chains/llm.py:67, in LLMChain.agenerate(self, input_list)
     65 """Generate LLM result from inputs."""
     66 prompts, stop = await self.aprep_prompts(input_list)
---> 67 return await self.llm.agenerate_prompt(prompts, stop)

File ~/.local/lib/python3.10/site-packages/langchain/llms/base.py:113, in BaseLLM.agenerate_prompt(self, prompts, stop)
    109 async def agenerate_prompt(
    110     self, prompts: List[PromptValue], stop: Optional[List[str]] = None
    111 ) -> LLMResult:
    112     prompt_strings = [p.to_string() for p in prompts]
--> 113     return await self.agenerate(prompt_strings, stop=stop)

File ~/.local/lib/python3.10/site-packages/langchain/llms/base.py:229, in BaseLLM.agenerate(self, prompts, stop)
    227     else:
    228         self.callback_manager.on_llm_error(e, verbose=self.verbose)
--> 229     raise e
    230 if self.callback_manager.is_async:
    231     await self.callback_manager.on_llm_end(
    232         new_results, verbose=self.verbose
    233     )

File ~/.local/lib/python3.10/site-packages/langchain/llms/base.py:223, in BaseLLM.agenerate(self, prompts, stop)
    217     self.callback_manager.on_llm_start(
    218         {"name": self.__class__.__name__},
    219         missing_prompts,
    220         verbose=self.verbose,
    221     )
    222 try:
--> 223     new_results = await self._agenerate(missing_prompts, stop=stop)
    224 except (KeyboardInterrupt, Exception) as e:
    225     if self.callback_manager.is_async:

File ~/.local/lib/python3.10/site-packages/langchain/llms/base.py:334, in LLM._agenerate(self, prompts, stop)
    332 generations = []
    333 for prompt in prompts:
--> 334     text = await self._acall(prompt, stop=stop)
    335     generations.append([Generation(text=text)])
    336 return LLMResult(generations=generations)

File ~/.local/lib/python3.10/site-packages/langchain/llms/base.py:315, in LLM._acall(self, prompt, stop)
    313 async def _acall(self, prompt: str, stop: Optional[List[str]] = None) -> str:
    314     """Run the LLM on the given prompt and input."""
--> 315     raise NotImplementedError("Async generation not implemented for this LLM.")

NotImplementedError: Async generation not implemented for this LLM.

how to "force" it to consider all papers?

hello! i'm loving this tool, thanks a lot for all the effort!

Do you guys have any tips on how to force the bot to consider all papers being passed to formulate the response?

I notice that it picks which papers it responds from somewhat at random, and the others fall behind.

thanks!

Implement CLI Functionality

It would be nice to have a CLI that can be run to automatically enter the question-answering routine. The app should be installable through the setup.py and be created using the click package. Example of desired behavior pasted below:

qa-pdf [pdf_file]
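A minimal sketch of what such a CLI could look like (the qa-pdf entry point and this file layout are hypothetical, not part of the package):

# hypothetical qa_pdf.py implementing the desired `qa-pdf [pdf_file]` behavior
import click

from paperqa import Docs


@click.command()
@click.argument("pdf_files", nargs=-1, type=click.Path(exists=True))
def main(pdf_files):
    """Answer questions about the given PDF files interactively."""
    docs = Docs()
    for path in pdf_files:
        docs.add(path)
    while True:
        question = click.prompt("Question (Ctrl+C to quit)")
        answer = docs.query(question)
        click.echo(answer.formatted_answer)


if __name__ == "__main__":
    main()

This could then be exposed as a qa-pdf console script via the package's entry points.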

What if existing content get updated and new contents created?

Thanks for sharing the code.

What happens when existing content gets updated or new content is created? Does it need to create embeddings for all the content again? The current approach is not ideal, as creating embeddings costs money.

Would it be possible to progressively update the vector store?

Please advise. Thank you.

Slow query times

During the query process, the time to obtain the ChatGPT response as well as to retrieve evidence can be significant. It would be nice to set up the context retrieval and answer creation process more efficiently, similar to the waiting times on askyourpdf.com.

Slow PDF Reading Times

Using pypdf to read PDF files when creating the context chunks is orders of magnitude slower than packages such as fitz and pdftotext. The fitz package cannot be used due to licensing conflicts; however, pdftotext is under the MIT license and may be a potential solution.

Directory structure for ZoteroDB?

I'm on Zotero v6.0.23, on Mac OS, and attachments get stored in the following format:

/Users/myusername/Zotero/storage/key/papertitle.pdf

If I initialize a ZoteroDB with storage=/Users/myusername/Zotero/storage/, it seems like ZoteroDB will look for and store attachments as

/Users/myusername/Zotero/storage/key.pdf

On my system this leads to some unnecessary redownloading that can be avoided by modifying the behavior of ZoteroDB.get_pdf().

Am I understanding this correctly? And do you know if Zotero uses the same directory structure across platforms?

Update README

  1. async updates for colab/python notebook
  2. add notes on agent
  3. doc filter notes
  4. evidence notes - how to gather that independently

Method for adding PDFs to Docs from webpage link

It would be nice to be able to pass a webpage link to a PDF in order to begin the query process. The method should fetch the PDF from the provided URL and then save it to a temporary file which can then be added as normal.
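A rough sketch of that behavior using requests and a temporary file (the URL and citation are placeholders):

import tempfile

import requests

from paperqa import Docs

url = "https://example.com/some_paper.pdf"  # placeholder URL

response = requests.get(url, timeout=30)
response.raise_for_status()
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as f:
    f.write(response.content)
    tmp_path = f.name

docs = Docs()
docs.add(tmp_path, citation="Author et al., 2023", docname="Author2023")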

`Docs.add` not safe for using inside a `try-except`

Currently the README recommends using a try-except to skip over errors in adding to the Docs() object:

for path, data in papers.items():
    try:
        docs.add(path)
    except ValueError as e:
        # sometimes this happens if PDFs aren't downloaded or readable
        print('Could not read', path, e)

Problem:

However, this is not safe as the add method of Docs() mutates its properties at different parts of the method:

    def add(...):

        ...
            raise ValueError(f"Document {path} already in collection.")
        ...
                raise ValueError(
                    f"Could not parse key from citation {citation}. Consider just passing key explicitly - e.g. docs.py (path, citation, key='mykey')"
                )
        ...
        key += suffix
        self.keys.add(key)
        ...
            raise ValueError(
                f"This does not look like a text document: {path}. Path disable_check to ignore this error."
            )
        ...
        self.docs[path] = dict(texts=texts, metadata=metadata, key=key)
        if self._faiss_index is not None:
            self._faiss_index.add_texts(texts, metadatas=metadata)
        if self._doc_index is not None:
            self._doc_index.add_texts([citation], metadatas=[{"key": key}])

This means that if you skip over items with a try-except, the self.keys object may already have been updated. Or perhaps keys and docs are updated, but something errors in the _faiss_index.

Solution:

Only update self.keys, self.docs, and self._faiss_index at the very end of the method. All errors should be raised before that point (i.e., manually check whether key is in self.keys beforehand). Also, self._faiss_index.add_texts should be run before self.keys and self.docs are updated.

Gpt-4

How to switch to gpt-4 in the code?
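Following the "Choosing Model" section in the README above, the model is set when constructing Docs (the exact model string depends on what your OpenAI account exposes):

from paperqa import Docs

docs = Docs(llm="gpt-4")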

'Docs' object has no attribute '_build_faiss_index'

Trying to run the huggingface app.py locally and see this error: 'Docs' object has no attribute '_build_faiss_index'

Tried to install faiss-cpu locally, but that doesn't fix the issue.

It doesn't appear the uploaded papers are indexed as they are on the huggingface version;
could this have something to do with the path of the gradio app?

Support BetterBibTeX citekeys in ZoteroDB

It would be nice to be able to use the citekeys provided by Better BibTeX if the user has it installed. For Better BibTeX users, these citekeys are probably the ones that will be used in writing with citeproc.

This can be done by querying the Better BibTeX RPC server.

def _get_citation_key(item: dict, better_bibtex: bool = False) -> str:
    if better_bibtex:
        import requests

        url = "http://localhost:23119/better-bibtex/json-rpc"
        headers = {"Content-Type": "application/json", "Accept": "application/json"}
        payload = {
            "jsonrpc": "2.0",
            "method": "item.citationkey",
            "params": [[item["key"]]],
        }

        response = requests.post(url, headers=headers, json=payload)

        return response.json()["result"][item["key"]]

    if (
        "data" not in item
        or "creators" not in item["data"]
        or len(item["data"]["creators"]) == 0
        or "lastName" not in item["data"]["creators"][0]
        or "title" not in item["data"]
        or "date" not in item["data"]
    ):
        return item["key"]

    last_name = item["data"]["creators"][0]["lastName"]
    short_title = "".join(item["data"]["title"].split(" ")[:3])
    date = item["data"]["date"]

    # Delete non-alphanumeric characters:
    short_title = "".join([c for c in short_title if c.isalnum()])
    last_name = "".join([c for c in last_name if c.isalnum()])
    date = "".join([c for c in date if c.isalnum()])

    return f"{last_name}_{short_title}_{date}_{item['key']}".replace(" ", "")

Support other embeddings in Docs() method

Currently, both the llm and summary_llm can be customized in the Docs() object, but the Embeddings provider is not customizable.

Allowing the Docs() object to be initialized with an embeddings_model would allow paper-qa to be run fully locally.
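As described in the "Changing Embedding Model" section above, the embedding model is now customizable via embedding_model; a minimal sketch of a local-embedding setup:

from paperqa import Docs, SentenceTransformerEmbeddingModel

# embeddings are computed locally; pair this with a local LLM client
# (see "Locally Hosted" above) for a fully offline setup
docs = Docs(embedding_model=SentenceTransformerEmbeddingModel())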

Elaborate on citations

To use paper-qa, you need to have a list of paths (valid extensions include: .pdf, .txt) and a list of citations (strings) that correspond to the paths. You can then use the Docs class to add the documents and then query them.

Just a bit confused about what citations means and how to provide the right input here. I see in your examples you provided citations like WikiMedia Foundation, 2023, Accessed now.

Currently I have a scraper to grab PDFs from various sources based on a query ("vix volatility") and dump all PDFs into a dir labeled as such.

Can you provide some other examples or how citations could be derived from various sources?
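The citation is just a free-form string that ends up in the reference list. Following the Usage section above, a sketch with placeholder paths and citation strings for a scraped "vix volatility" directory:

from paperqa import Docs

docs = Docs()

docs.add(
    "vix_volatility/some_paper.pdf",
    citation="Author, A. et al. An example paper on VIX volatility. Some Journal, 2020.",
    docname="Author2020",
)
docs.add(
    "vix_volatility/some_page.html",
    citation="WikiMedia Foundation, 2023, Accessed now",
    docname="Wiki2023",
)

If you omit citation, paper-qa will try to guess one from the content of the file, as noted in the Usage section.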

Potential Licensing Conflicts

I looked up the licenses for several of the libraries this code base depends on, and there appears to be some potential conflicts. I've listed them below for discussion:

  1. html2text: GPL 3.0 - requires all publicly distributed derivative work retain the GPL 3.0 license
  2. paper-scraper: GNU GPL v3.0 - requires all publicly distributed derivative work retain the GNU GPL v3.0 license

Regarding converting HTML to text, the gazpacho repository may be a good alternative (MIT license). For the paper-scraper tool, I have not been able to find anything yet.

ZoteroDB import issue

I'm trying to run

paper-qa/README.md, lines 172 to 180 in b20f4e9:

from paperqa.contrib import ZoteroDB

docs = paperqa.Docs()
zotero = ZoteroDB(library_type="user")  # "group" if group library

for item in zotero.iterate(limit=20):
    if item.num_pages > 30:
        continue  # skip long papers
    docs.add(item.pdf, key=item.key)

I've installed

pip install paper-qa
pip install pyzotero

I added a .env and tested it.

I get this error:

Traceback (most recent call last):
  File "C:\Users\mclic\AppData\Local\Programs\Python\Python39\lib\site-packages\paperqa\contrib\zotero.py", line 9, in <module>
    from pyzotero import zotero
  File "C:\Users\mclic\Desktop\Projects\dev\paper-qa\pyzotero.py", line 4, in <module>
    from paperqa.contrib import ZoteroDB
ImportError: cannot import name 'ZoteroDB' from partially initialized module 'paperqa.contrib' (most likely due to a circular import) (C:\Users\mclic\AppData\Local\Programs\Python\Python39\lib\site-packages\paperqa\contrib\__init__.py)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\mclic\Desktop\Projects\dev\paper-qa\pyzotero.py", line 4, in <module>
    from paperqa.contrib import ZoteroDB
  File "C:\Users\mclic\AppData\Local\Programs\Python\Python39\lib\site-packages\paperqa\contrib\__init__.py", line 1, in <module>
    from .zotero import ZoteroDB
  File "C:\Users\mclic\AppData\Local\Programs\Python\Python39\lib\site-packages\paperqa\contrib\zotero.py", line 11, in <module>
    raise ImportError("Please install pyzotero: `pip install pyzotero`")
ImportError: Please install pyzotero: `pip install pyzotero`

Any ideas what I should try next? 😀

token price

        formatted_answer += f"\nTokens Used: {tokens} Cost: ${tokens/1000 * 0.02:.2f}"

based on https://openai.com/pricing
gpt-3.5-turbo | $0.002 / 1K tokens

I guess it should be

        formatted_answer += f"\nTokens Used: {tokens} Cost: ${tokens/1000 * 0.002:.2f}"

less scary

Add ability to add PDF to Docs object from Zotero Clipboard Data

In Zotero, you can select a publication entry (or multiple) and press CTRL+SHIFT+C to copy the BibTeX data for each highlighted publication to your clipboard. This is handy because these BibTeX entries contain both the citation data, as well as the absolute paths to the PDF files on your computer.

It would be nice to have a method, perhaps on the Docs class, to load PDFs and add them to the Docs from the Zotero BibTeX data in the user's clipboard (rather than finding the paths of files manually).

Cache embeddings and OCR results?

Hey @whitead,

I think it would be really nice if I could spin up a paperqa instance without waiting for all papers to be OCR'd and embedded each time. Would it be okay to cache a few more things? Specifically:

  1. The embeddings used for FAISS, and
  2. The OCR of a PDF.

I think both of these could be stored in a separate cache file. The langchain.cache.SQLiteCache makes this pretty easy, as there are only two methods: lookup for getting a result, and update for updating the cache. Both use strings as keys and values, but you could serialize the metadata into a string.

Here's an example for the OCR. This would go into readers.py and wrap by parse_pdf:

OCR_CACHE = langchain.cache.SQLiteCache(CACHE_PATH.parent / "ocr_cache.db")

Then, parse_pdf would have [1]:

def parse_pdf(...):
    cache_key = dict(prompt=str(pdf_path), llm_string="")
    test_out = OCR_CACHE.lookup(**cache_key)

    out = _parse_pdf(...) if test_out is None else deserialize(test_out)

    if test_out is None:
        OCR_CACHE.update(**cache_key, return_val=serialize(out))

    return out

def _parse_pdf(...):
    # Regular _parse_pdf

What do you think? For serialization I would use json.dumps and json.loads [2]. One might argue that it's better to use a custom database for this, but why not keep it simple, as you are already using the langchain database.

Cheers,
Miles

Footnotes

  1. You might want to hash the PDF file itself, or maybe the first N bytes, rather than the filename. But I think for starters the filename is simpler and fine.

  2. For posterity, it would be wise to also serialize the paperqa.__version__ in each cache entry, and, if it's a different version, then ignore the cache and overwrite it.
