
mpaepper / content-chatbot


Build a chatbot or Q&A bot of your website's content

Home Page: https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your-website-using-langchain/

Language: Python (100.00%)
Topics: deep-learning, llm, machine-learning

content-chatbot's Introduction

This repo reached the front page of Hacker News on March 22nd, 2023; see the discussion here.

Your website content -> chatbot / Q&A agent

Turn your website content into a question-answering bot which can cite your document sources.

Alternatively, use it in an interactive chatbot fashion.

All this can be achieved with a tool called LangChain, which in turn uses the OpenAI API.

This simple repository showcases how to apply it to your own website content.

To do so, there are three scripts:

  • create_embeddings.py: the main script, which loops over your website's sitemap.xml to create embeddings (vectors representing the semantics of your data) of your content
  • ask_question.py: once you have the embeddings (a file called faiss_store.pkl was created), this script can be used to ask a question directly. It will answer the question and return the URLs of your website which were used as the source.
  • start_chat_app.py: starts a simple chat interface where you can ask a question and then follow up on the answer. If the bot is uncertain, it will say so. Note that you can tune the query in this script to be more relevant for your content; in my case, I prompted it to be specific to machine learning and technical topics.

To install the dependencies, simply run pip install -r requirements.txt.

Create your embeddings

Overview of the embedding process: each blog post is split into N documents, and each document yields a vector representation.

This is the most important step and you will need to obtain an OpenAI API key to use it.

Once you have your $api_key, run export OPENAI_API_KEY='$api_key' in your terminal.

Then simply run python create_embeddings.py --sitemap https://path/to/your/sitemap.xml --filter https://path/to/your/blog/posts.

This will create your embeddings in a file called faiss_store.pkl. Point the script at your website's sitemap.xml; the --filter argument restricts processing to URLs that start with the given prefix. If you want to include all pages of your site, just set --filter https://.
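
For orientation, here is what the script does in outline. This is a minimal sketch, not the repo's exact code: the sitemap parsing, splitter settings, and URLs below are illustrative, while the FAISS.from_texts call matches the one visible in the traceback quoted in the issues further down this page.

 import pickle
 import requests
 from bs4 import BeautifulSoup
 from langchain.text_splitter import CharacterTextSplitter
 from langchain.embeddings import OpenAIEmbeddings
 from langchain.vectorstores import FAISS

 sitemap = "https://www.paepper.com/sitemap.xml"   # your --sitemap argument
 prefix = "https://www.paepper.com/blog/posts/"    # your --filter argument

 # 1) Collect page URLs from the sitemap, keeping only those under the filter.
 soup = BeautifulSoup(requests.get(sitemap).text, "html.parser")
 urls = [loc.text for loc in soup.find_all("loc") if loc.text.startswith(prefix)]

 # 2) Fetch each page, strip it to plain text, and split it into chunks.
 splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
 docs, metadatas = [], []
 for url in urls:
     text = BeautifulSoup(requests.get(url).text, "html.parser").get_text()
     chunks = splitter.split_text(text)
     docs.extend(chunks)
     metadatas.extend([{"source": url}] * len(chunks))

 # 3) Embed all chunks via the OpenAI API and persist the FAISS index.
 store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
 with open("faiss_store.pkl", "wb") as f:
     pickle.dump(store, f)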

For more details about this, please check this blog post.

Answering a question while getting the answer source documents

Overview of the Q&A process: first we find the closest matches of our documents from the FAISS store, and then we ask the question to the GPT-3 API.

With the embeddings set up, ask a question like this: python ask_question.py "How to detect objects in images?"

Answer:

Object detection in images can be done using algorithms such as R-CNN, Fast R-CNN, and data augmentation techniques such as shifting, rotations, elastic deformations, and gray value variations.

Sources:

https://www.paepper.com/blog/posts/deep-learning-on-medical-images-with-u-net/
https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/
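
Under the hood, ask_question.py is essentially a load-and-query script. A minimal sketch in the spirit of the accompanying blog post; the chain class below is the LangChain API of that era and has since been renamed/replaced in newer versions:

 import pickle
 import sys
 from langchain.llms import OpenAI
 from langchain.chains import VectorDBQAWithSourcesChain

 # Load the FAISS store created by create_embeddings.py.
 with open("faiss_store.pkl", "rb") as f:
     store = pickle.load(f)

 # Build a Q&A chain that answers from the store and reports its sources.
 chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=store)
 result = chain({"question": sys.argv[1]})
 print("Answer:", result["answer"])
 print("Sources:", result["sources"])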

Starting a chatbot on your content

With the embeddings set up, start the chatbot like this: python start_chat_app.py. Once it's running, ask your questions and follow-ups.
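
Internally this is a small REPL around a conversational chain. A rough sketch, assuming the era's ChatVectorDBChain (newer LangChain versions use ConversationalRetrievalChain instead):

 import pickle
 from langchain.llms import OpenAI
 from langchain.chains import ChatVectorDBChain

 with open("faiss_store.pkl", "rb") as f:
     store = pickle.load(f)

 chain = ChatVectorDBChain.from_llm(OpenAI(temperature=0), store)
 history = []  # (question, answer) tuples fed back in so follow-ups have context
 while True:
     question = input("You: ")
     result = chain({"question": question, "chat_history": history})
     history.append((question, result["answer"]))
     print("Bot:", result["answer"])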

Zendesk Content Embedding

This repository includes an enhancement to the LangChain chatbot project: a Zendesk mode for create_embeddings. It uses the Zendesk API to retrieve website content and construct a FAISS knowledge base for improved chatbot responses.

How it Works

The create_embeddings script performs the following steps:

  1. Zendesk API Integration: Retrieves articles using the Zendesk API.
  2. Text Cleaning: Parses HTML content, extracting and cleaning text for embedding.
  3. Text Splitting: Breaks down the content into smaller chunks for efficient embedding processing.
  4. Embedding Creation: Utilizes OpenAI Embeddings to create embeddings for the text chunks.
  5. FAISS Knowledge Base Construction: Constructs a FAISS store with the generated embeddings, enabling efficient similarity search.
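
As an illustration of steps 1 and 2, fetching and cleaning articles might look roughly like this. This is a hedged sketch: the endpoint is Zendesk's standard Help Center articles route, auth follows the email/token convention, and the function name is illustrative, not the repo's.

 import requests
 from bs4 import BeautifulSoup

 def fetch_zendesk_articles(base_url, email, api_token):
     """Yield (url, plain_text) for every Help Center article, following pagination."""
     endpoint = f"{base_url}/api/v2/help_center/articles.json"
     auth = (f"{email}/token", api_token)  # Zendesk token-auth convention
     while endpoint:
         resp = requests.get(endpoint, auth=auth)
         resp.raise_for_status()
         data = resp.json()
         for article in data["articles"]:
             # Strip the article's HTML body down to clean text for embedding.
             text = BeautifulSoup(article["body"] or "", "html.parser").get_text()
             yield article["html_url"], text
         endpoint = data.get("next_page")  # None once the last page is reached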

Usage

  • Zendesk API Credentials: To create a FAISS knowledge base from Zendesk content, obtain and configure your Zendesk API credentials.

Running the Script

Execute the script: run create_embeddings.py to generate the FAISS store.

Example:

 python create_embeddings.py -m zendesk -z "https://your.zendesk.api/"   # replace the link

content-chatbot's People

Contributors

mpaepper · yonghaozhao722


content-chatbot's Issues

Problem with length

Traceback (most recent call last):
File "/Users/pelleri/Desktop/dahu/content-chatbot-main/create_embeddings.py", line 49, in <module>
store = FAISS.from_texts(docs, OpenAIEmbeddings(), metadatas=metadatas)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/langchain/vectorstores/faiss.py", line 250, in from_texts
embeddings = embedding.embed_documents(texts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 254, in embed_documents
response = embed_with_retry(
^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 53, in embed_with_retry
return _completion_with_retry(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 289, in wrapped_f
return self(f, *args, **kw)
^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 379, in __call__
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 314, in iter
return fut.result()
^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.2_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/opt/homebrew/lib/python3.11/site-packages/tenacity/__init__.py", line 382, in __call__
result = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/langchain/embeddings/openai.py", line 51, in _completion_with_retry
return embeddings.client.create(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/openai/api_resources/embedding.py", line 33, in create
response = super().create(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/openai/api_resources/abstract/engine_api_resource.py", line 153, in create
response, _, api_key = requestor.request(
^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/openai/api_requestor.py", line 226, in request
resp, got_stream = self._interpret_response(result, stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/lib/python3.11/site-packages/openai/api_requestor.py", line 619, in _interpret_response
self._interpret_response_line(
File "/opt/homebrew/lib/python3.11/site-packages/openai/api_requestor.py", line 679, in _interpret_response_line
raise self.handle_error_response(
openai.error.InvalidRequestError: This model's maximum context length is 8191 tokens, however you requested 8951 tokens (8951 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
pelleri-macpro14:content-chatbot-main pelleri$
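
For context: this error means a single chunk sent to the embeddings endpoint exceeded the model's 8191-token context. A plausible cause is that CharacterTextSplitter only splits on its separator, so a long stretch of text without that separator yields an oversized chunk. One workaround is a more aggressive splitter; a sketch, with illustrative values:

 from langchain.text_splitter import RecursiveCharacterTextSplitter

 # Falls back through finer separators ("\n\n", "\n", " ", "") so no chunk
 # can silently exceed the size limit; the numbers here are illustrative.
 splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
 chunks = splitter.split_text(page_text)  # page_text: the raw text of one page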

How to get good results, including sources?

Thanks for this project, it's very helpful and inspiring!

I've tried running it on your website, but the results I'm getting aren't as great as I'd hoped, especially w.r.t. finding sources. I'm running the chatbot version and copy-pasted code from the earlier "Q&A" version in order to get sources from there.

Example 1 - Object detection

Question:
How to detect objects in images?

Chatbot result:
Object detection is a task that involves finding and labeling objects of interest in an image. A common approach is to use a model such as Faster R-CNN, which uses a selective search algorithm to find regions of interest in a bottom-up manner. Each region is then cropped and fed to a CNN backbone which extracts features. Finally, support vector machines are used to determine if the features are predictive of one of the classes. Non-maximum suppression is used to get rid of duplicate detections. There are also end-to-end approaches which directly output the final object predictions when presented with an input image. Data augmentation is often used to generate more data when there is a low amount of data available. DreamBooth is a tool that is trending in social media which is similar to textual inversion in the sense that you provide images of a concept and then afterwards you can use a special prompt token to generate new images of your concept.

Q&A result:
Object detection approaches usually generate many object predictions candidates either by using a neural network or by using an algorithm. Object classification is done for each candidate and regression is used to refine the bounding box coordinates.

Sources:
https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/
https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/

Problem:

Instead of getting two relevant links like you did, I'm getting one relevant link, but twice.
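
One way to work around the duplicate links is to deduplicate the source URLs after retrieval. A sketch, assuming sources are taken from document metadata as this repo stores them:

 # Collect the source URL of each retrieved chunk, then deduplicate
 # while preserving retrieval order.
 docs = store.similarity_search(question, k=4)
 sources = [doc.metadata["source"] for doc in docs]
 unique_sources = list(dict.fromkeys(sources))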


Example 2 - Hyperparameter tuning

Question:
How to tune hyperparameters?

Chatbot result:
The best methods for tuning hyperparameters in machine learning vary depending on the type of model and the data set. Generally, grid search, random search, and Bayesian optimization are popular methods for hyperparameter tuning. Grid search involves systematically exploring a range of hyperparameter values, while random search involves randomly sampling from a range of hyperparameter values. Bayesian optimization is a more sophisticated approach that uses a probabilistic model to identify the best hyperparameter values.

Q&A result: Hyperparameter tuning can be done using PyTorch Lightning and Weights & Biases.

Sources: https://www.paepper.com/blog/posts/hyperparameter-tuning-on-numerai-data-with-pytorch-lightning-and-wandb/

Followup question:
Can I tune hyperparameters with fastai?

Chatbot result:
Numerai is a machine learning competition platform that uses data science to predict stock market movements. It provides a data set of anonymized financial features and a tournament where data scientists can submit their models to compete for rewards. The goal is to create a model that can accurately predict the stock market movements. To do this, data scientists must use machine learning algorithms to optimize their models and tune the hyperparameters. Fastai is a deep learning library that can be used to quickly build and train models for the Numerai tournament. It also provides tools for hyperparameter tuning, such as the weights and biases library, which can be used to run sweeps to test different combinations of hyperparameters.

Q&A result: Yes, you can tune hyperparameters with fastai.

Sources:
https://www.paepper.com/blog/posts/hyperparameter-tuning-on-numerai-data-with-fastai-and-wandb/

Problem

The question is clearly relevant to both posts, but only one is retrieved. The follow-up question, which adds an extra keyword, surfaced more.

Large-scale costs

At large scale, the cost of this ends up being fairly high; what do you suggest for these cases?

We have a messaging product, and the high volume of messages we receive drives the cost up.

Would you have any suggestions?

4 most similar docs

If I understand correctly, the default is to search for the 4 most relevant docs. Is it possible to change this somehow?
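
For what it's worth, in LangChain's FAISS wrapper the retrieval count is the k argument of similarity_search, so something like the following should work; how k is threaded through to the Q&A chain depends on your LangChain version (the era's VectorDBQAWithSourcesChain also exposed a k field):

 # Retrieve 8 nearest chunks instead of the default 4.
 docs = store.similarity_search(query, k=8)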

argparse==1.3.1 in requirements.txt not found

Good afternoon. When installing the requirements via pip install -r requirements.txt, pip gave me an error:

ERROR: Could not find a version that satisfies the requirement argparse==1.3.1 (from versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0, 0.8.0, 0.9.0, 0.9.1, 1.0, 1.0.1, 1.1, 1.2.1, 1.2.2, 1.3.0, 1.4.0)
ERROR: No matching distribution found for argparse==1.3.1

When I look at argparse on PyPI, 1.3.1 is not available. I was able to fix this issue by using argparse 1.4.0.

RateLimitError

I was testing the create_embeddings.py file with the default values and it got stuck. It was the first time I had ever used the OpenAI API on my account.

# python3 create_embeddings.py
Split https://www.paepper.com/blog/posts/how-to-build-a-chatbot-out-of-your-website-content/ into 4 chunks
Split https://www.paepper.com/blog/posts/ into 3 chunks
Split https://www.paepper.com/blog/posts/build-q-and-a-bot-of-your-website-using-langchain/ into 6 chunks
Split https://www.paepper.com/blog/posts/interactive-visualization-of-stable-diffusion-image-embeddings/ into 9 chunks
Split https://www.paepper.com/blog/posts/semantic-segmentation-with-prototype-based-consistency-regularization/ into 6 chunks
Split https://www.paepper.com/blog/posts/everything-you-need-to-know-about-stable-diffusion/ into 15 chunks
Split https://www.paepper.com/blog/posts/frame-decisions-as-bets-to-improve-decision-making/ into 7 chunks
Split https://www.paepper.com/blog/posts/how-and-why-stable-diffusion-works-for-text-to-image-generation/ into 10 chunks
Split https://www.paepper.com/blog/posts/rethinking-depthwise-separable-convolutions/ into 8 chunks
Split https://www.paepper.com/blog/posts/creating-pleasant-plots-with-seaborn/ into 5 chunks
Split https://www.paepper.com/blog/posts/dino-emerging-properties-in-self-supervised-vision-transformers/ into 10 chunks
Split https://www.paepper.com/blog/posts/rethinking-batch-in-batchnorm/ into 7 chunks
Split https://www.paepper.com/blog/posts/hyperparameter-tuning-on-numerai-data-with-pytorch-lightning-and-wandb/ into 13 chunks
Split https://www.paepper.com/blog/posts/hyperparameter-tuning-on-numerai-data-with-fastai-and-wandb/ into 10 chunks
Split https://www.paepper.com/blog/posts/p-diff-learning-classifier-with-noisy-labels-based-on-probability-difference-distributions/ into 5 chunks
Split https://www.paepper.com/blog/posts/git-config/ into 4 chunks
Split https://www.paepper.com/blog/posts/bash-string-manipulation/ into 4 chunks
Split https://www.paepper.com/blog/posts/metalearning-from-noisy-labels/ into 12 chunks
Split https://www.paepper.com/blog/posts/refactoring-machine-learning-code-namedtuple/ into 2 chunks
Split https://www.paepper.com/blog/posts/refactoring-machine-learning-code-einops/ into 4 chunks
Split https://www.paepper.com/blog/posts/refactoring-machine-learning-code-comments-as-code/ into 4 chunks
Split https://www.paepper.com/blog/posts/depthwise-separable-convolutions-in-pytorch/ into 8 chunks
Split https://www.paepper.com/blog/posts/pyramidal-convolution-rethinking-convolutional-neural-networks-for-visual-recognition/ into 9 chunks
Split https://www.paepper.com/blog/posts/how-to-properly-manage-ssh-keys-for-server-access/ into 8 chunks
Split https://www.paepper.com/blog/posts/end-to-end-object-detection-with-transformers/ into 12 chunks
Split https://www.paepper.com/blog/posts/accurate-large-minibatch-sgd-training-image-net-in-1-hour/ into 8 chunks
Split https://www.paepper.com/blog/posts/swift_as_a_viable_python_alternative/ into 5 chunks
Split https://www.paepper.com/blog/posts/eigenvectors_eigenvalues_machine_learning/ into 5 chunks
Split https://www.paepper.com/blog/posts/parsing-json-data-from-aws-lambda-functions-in-python/ into 2 chunks
Split https://www.paepper.com/blog/posts/pytorch-multi-gpu-training-for-faster-machine-learning-results/ into 7 chunks
Split https://www.paepper.com/blog/posts/plot-pytorch-tensors-with-matplotlib/ into 2 chunks
Split https://www.paepper.com/blog/posts/do-you-know-which-inputs-your-neural-network-likes-most/ into 8 chunks
Split https://www.paepper.com/blog/posts/shapeshifting-pytorch/ into 3 chunks
Split https://www.paepper.com/blog/posts/what-are-embeddings-in-machine-learning/ into 4 chunks
Split https://www.paepper.com/blog/posts/bash-keep-script-running-restart-on-crash/ into 2 chunks
Split https://www.paepper.com/blog/posts/writing-command-line-tools-in-python-argument-parsing/ into 3 chunks
Split https://www.paepper.com/blog/posts/better-unit-tests-django-using-mommy/ into 4 chunks
Split https://www.paepper.com/blog/posts/pytorch-gpu-inference-with-docker/ into 8 chunks
Split https://www.paepper.com/blog/posts/unfreeze-vim/ into 2 chunks
Split https://www.paepper.com/blog/posts/pytorch-model-in-production-as-a-serverless-rest-api/ into 12 chunks
Split https://www.paepper.com/blog/posts/graphical-explanation-of-neural-networks-and-gradients-with-python-and-numpy/ into 10 chunks
Split https://www.paepper.com/blog/posts/debugging-tensorflow/ into 10 chunks
Split https://www.paepper.com/blog/posts/deep-learning-on-medical-images-with-u-net/ into 9 chunks
Split https://www.paepper.com/blog/posts/build-your-own-javascript-game-in-5-minutes/ into 6 chunks
Split https://www.paepper.com/blog/posts/monte-carlo-for-better-time-estimates-of-your-work/ into 6 chunks
Split https://www.paepper.com/blog/posts/play-video-games-using-neural-networks/ into 13 chunks
Split https://www.paepper.com/blog/posts/vim-commands-you-should-know-about/ into 5 chunks
Split https://www.paepper.com/blog/posts/database-backup-in-the-cloud-using-aws/ into 4 chunks
Split https://www.paepper.com/blog/posts/training-a-neural-network-with-numpy/ into 5 chunks
Split https://www.paepper.com/blog/posts/domain-driven-design-ddd/ into 7 chunks
Split https://www.paepper.com/blog/posts/clean-code-book/ into 5 chunks
Split https://www.paepper.com/blog/posts/javascript-bitwise-operations/ into 6 chunks
Split https://www.paepper.com/blog/posts/mod-rewrite-operations-you-should-know-about/ into 5 chunks
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._completion_with_retry in 8.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Retrying langchain.embeddings.openai.embed_with_retry.<locals>._completion_with_retry in 10.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..
Traceback (most recent call last):
...

It would be interesting to know how to prevent this. I have checked https://platform.openai.com/account/usage and so far I've used... $0. I rechecked an hour later for the metrics to stabilize, but I still got the same usage.
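
Despite the exception class, the message "You exceeded your current quota" typically indicates an exhausted or missing billing quota (e.g. an expired free trial) rather than too many requests, so retrying will not help; the fix is to check the account's plan and billing settings. Genuine rate limits, by contrast, can be softened by batching and pausing. A sketch, with illustrative batch size and pause:

 import time
 from langchain.embeddings import OpenAIEmbeddings

 embedder = OpenAIEmbeddings()
 vectors = []
 for i in range(0, len(docs), 100):                 # embed 100 chunks per request
     vectors.extend(embedder.embed_documents(docs[i:i + 100]))
     time.sleep(1)                                  # back off between batches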

Site map problem

Looking at create_embeddings.py, it appears that the code does not crawl what it finds in the sitemap; it only uses the URLs directly referenced there. If the sitemap links to other sitemaps (a sitemap index), the script does not work. See the example attached here: sitemap_index.xml.txt

This should be made clear in the README.
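
Until the script supports this, sitemap indexes can be expanded up front. A hedged sketch (the function name is mine, not the repo's): if the root document is a sitemap index, recurse into each child sitemap; otherwise collect the page URLs directly.

 import requests
 import xml.etree.ElementTree as ET

 NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

 def collect_urls(sitemap_url):
     """Return page URLs, recursing into child sitemaps of a sitemap index."""
     root = ET.fromstring(requests.get(sitemap_url).content)
     if root.tag.endswith("sitemapindex"):
         urls = []
         for loc in root.findall("sm:sitemap/sm:loc", NS):
             urls.extend(collect_urls(loc.text))  # descend into child sitemap
         return urls
     return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]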
