
BioPyAssistant

AI-powered conversational agent designed to assist biology students
in learning the Python programming language.


Installation

To install BioPyAssistant and its dependencies, run the following commands:

Clone the repository:

git clone https://github.com/pierrepo/biopyassistant.git
cd biopyassistant

Install Conda:

To install Conda, follow the instructions provided in the official Conda documentation.

Create a Conda environment:

conda env create -f environment.yml

Usage

Step 1: Activate the Conda Environment

Activate the Conda environment by running:

conda activate biopyassistantenv

Step 2: Process the course content

Process the course content by running:

python src/parse_clean_markdown.py --in data/markdown_raw --out data/markdown_processed

This command will process Markdown files located in the data/markdown_raw directory and save the processed files to the data/markdown_processed directory.

Step 3: Set up OpenAI API key

Create a .env file with a valid OpenAI API key:

OPENAI_API_KEY=<your-openai-api-key>

Remark: This .env file is ignored by git.
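The scripts are then expected to read this key from the environment. As a minimal sketch (not the project's actual code, assuming python-dotenv is available in the Conda environment), a script could load it like this:

import os

from dotenv import load_dotenv

# Load variables from the .env file into the process environment.
load_dotenv()

# The OpenAI client libraries pick up OPENAI_API_KEY automatically,
# but it can also be read explicitly:
openai_api_key = os.getenv("OPENAI_API_KEY")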

Step 4: Create the Vector Database

Create the Vector database by running:

python src/create_database.py --data-path [data-path] --chroma-path [chroma-path] --chunk-size [chunk-size] --chunk-overlap [chunk-overlap] 

Where:

  • [data-path] (mandatory): Directory containing the processed Markdown files.
  • [chroma-path] (mandatory): Output path where the Chroma vector database will be saved.
  • [chunk-size] (optional): Size of the text chunks to create. Default: 1000.
  • [chunk-overlap] (optional): Overlap between text chunks. Default: 200.

Example:

python src/create_database.py --data-path data/markdown_processed --chroma-path chroma_db

This command will create a Chroma vector database from the processed Markdown files located in the data/markdown_processed directory. The text will be split into chunks of 1000 characters with an overlap of 200 characters, and the resulting database will be saved to the chroma_db directory.

Remark: The vector database is saved on disk.
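For reference, this step presumably boils down to splitting the documents and embedding them with LangChain. Here is a minimal, self-contained sketch (not the actual create_database.py; paths and parameters are taken from the example above):

from langchain_chroma import Chroma
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the processed course files.
loader = DirectoryLoader("data/markdown_processed", glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()

# Split the text into overlapping chunks (defaults from the README).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# Embed the chunks and persist the Chroma database on disk.
Chroma.from_documents(chunks, OpenAIEmbeddings(), persist_directory="chroma_db")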

Step 5: Query the chatbot

You can query the chatbot using either the command line or the graphical interface:

Command Line

python src/query_chatbot.py --query "Your question here" [--model "model_name"]  [--include-metadata]

Customization options:

  • Model Selection: Choose among gpt-4o, gpt-4-turbo, gpt-4, and gpt-3.5-turbo to suit your needs and preferences. Default: gpt-3.5-turbo.

  • Include Metadata: Include metadata in the response, such as the sources of the answer. By default, metadata is excluded.

Example:

python src/query_chatbot.py --query "What is the difference between list and set?" --model gpt-4-turbo --include-metadata

This command will query the chatbot with the question "What is the difference between list and set?" using the gpt-4-turbo model and include metadata in the response.

Output:

Query:
What is the difference between list and set?

Response:
A list is an ordered collection of elements, while a set is an unordered collection of unique elements. In a list, the order of elements is preserved, and duplicate elements are allowed. In contrast, a set does not preserve the order of elements, and duplicate elements are not allowed. Additionally, a set is optimized for membership testing and eliminating duplicate elements, making it more efficient for certain operations than a list.

For more information, you can refer to the following sources:
- Chapter ... (Link to the source : ...)
- Chapter ... (Link to the source : ...)
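Under the hood, the query step follows the usual RAG pattern: retrieve the most relevant chunks from the Chroma database and pass them to the chat model as context. A minimal sketch of that flow (not the actual query_chatbot.py; the prompt wording is illustrative):

from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Open the vector database persisted in Step 4.
db = Chroma(persist_directory="chroma_db", embedding_function=OpenAIEmbeddings())

query = "What is the difference between list and set?"

# Retrieve the most relevant course chunks.
docs = db.similarity_search(query, k=3)
context = "\n\n".join(doc.page_content for doc in docs)

# Ask the chat model, grounding the answer in the retrieved context.
llm = ChatOpenAI(model="gpt-3.5-turbo")
answer = llm.invoke(f"Answer using this course material:\n\n{context}\n\nQuestion: {query}")
print(answer.content)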

Graphical Interface

Streamlit App:

Run the following command:

streamlit run src/streamlit_app.py

This will launch the Streamlit app in your browser, where you can start interacting with the RAG model.

Gradio App:

Run the following command:

python src/gradio_app.py

This will launch the Gradio app in your browser, where you can start interacting with the RAG model.


Contributors

Essmaw, pierrepo


Issues

Update create_database.py to allow specifying chunk size and path

This pull request seeks feedback from @pierrepo regarding proposed updates to the create_database.py script. The changes aim to introduce options for specifying the chunk size and the path to save it. This enhancement would facilitate storing chunks of various sizes for analysis, eliminating the need to recreate them every time.
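A sketch of what those options could look like with argparse (argument names mirror the README; defaults are assumptions):

import argparse

parser = argparse.ArgumentParser(description="Create the Chroma vector database.")
parser.add_argument("--data-path", required=True,
                    help="Directory containing the processed Markdown files.")
parser.add_argument("--chroma-path", required=True,
                    help="Output path for the Chroma vector database.")
parser.add_argument("--chunk-size", type=int, default=1000,
                    help="Size of the text chunks to create.")
parser.add_argument("--chunk-overlap", type=int, default=200,
                    help="Overlap between text chunks.")
args = parser.parse_args()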

Conda env is not reproducible

I can't create the Conda env:

$ conda env create -f environment.yml
Channels:
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: failed

PackagesNotFoundError: The following packages are not available from current channels:

  - chromadb
  - langchain
  - langchain-chroma
  - langchain-community
  - langchain-core
  - langchain-openai
  - langchain-prompts
  - langchain-text-splitters
  - langchain-vectorstores
  - tiktoken

Current channels:

  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/r/linux-64

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Could you please have a look, @Essmaw?
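These packages are published on PyPI rather than on the default Anaconda channels, so one possible fix is to install them through pip inside the Conda environment. A sketch of such an environment.yml (the Python version and package list are assumptions and should be matched against the project's actual imports):

name: biopyassistantenv
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.11
  - pip
  - pip:
      - chromadb
      - langchain
      - langchain-chroma
      - langchain-community
      - langchain-core
      - langchain-openai
      - langchain-text-splitters
      - tiktoken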

Update README a couple of things

@Essmaw could you please:

Create directories data/markdown_raw and data/markdown_processed. Git cannot track empty directories, so create and commit an empty .gitkeep file in both directories.

In the README, invert Steps 3 and 4, since you need an embedding model (hence access to the OpenAI API) to create the Chroma database.

In Step 2, the create_database.py script should take an --in and an --out argument pointing to the input and output directories (in our case data/markdown_raw and data/markdown_processed).

Improve interface for `markdown_parser.py` script

  • Add explicit arguments: --in for the input Markdown directory and --out for the output Markdown directory. These arguments are mandatory and have no default values (see the sketch after this list).
  • Do not store default values for directories in the script.
  • In the README file, provide a real life example: python src/markdown_parser.py --in data/markdown_raw --out data/markdown_processed
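A minimal sketch of that interface (assuming argparse; the dest names are illustrative, since --in clashes with the Python keyword in):

import argparse

parser = argparse.ArgumentParser(description="Parse and clean course Markdown files.")
# 'in' is a Python keyword, so the value is stored under an explicit dest.
parser.add_argument("--in", dest="input_dir", required=True,
                    help="Input directory containing raw Markdown files.")
parser.add_argument("--out", dest="output_dir", required=True,
                    help="Output directory for processed Markdown files.")
args = parser.parse_args()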

Get best 3 (or 4) chunks

Instead of returning chunks above a given threshold:

https://github.com/pierrepo/biopyassistant/blob/06393688f56c18f8e109215bf47faaae9d803282/src/query_chatbot.py#L219C5-L219C91

Could you retrieve the k best chunks (k should be specified as an argument of the search_similarity_in_database() function, with a default value of 3)?

Then print, as usual, the chunk id, score and number of tokens... (and ideally the first 20 or 30 characters of the chunk).

Before returning the results, filter out chunks based on a score_threshold (also an argument of the search_similarity_in_database() function, with a default value of 0.35), and return an empty list if needed.

With this design, we could see which chunks are retrieved and their corresponding scores for all questions, including the bad ones.
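A sketch of what the proposed function could look like on top of a LangChain Chroma store (the metadata key and printed fields are assumptions; token counting is omitted):

def search_similarity_in_database(db, query, k=3, score_threshold=0.35):
    """Return the k best chunks above score_threshold, or an empty list."""
    # Relevance scores are normalized between 0 and 1.
    results = db.similarity_search_with_relevance_scores(query, k=k)
    for doc, score in results:
        print(f"id={doc.metadata.get('id')} score={score:.2f} "
              f"preview={doc.page_content[:30]!r}")
    return [(doc, score) for doc, score in results if score >= score_threshold]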

Simplify `src` directory structure

At this point, we do not need subfolders in the src directory.

Could you please move all scripts into src and remove the README files? Could you also check that the main README file contains all the needed documentation and references?

Process markdown file to add chapter, section and sub-section numbers

In the process_md_files function, global counters do not look necessary. The only things this function needs to pass to the renumber_headers_courses function are the content of the file and the number of the current chapter or annex being processed.

Also renumber_headers_annexes is redundant and should be removed.

In process_md_files, you can get the number of the chapter or the letter of the annex with:

if filename.startswith("annexe"):
    # We only have one annex so far.
    content = renumber_headers(content, "A")
if re.match(r"\d{2}_", filename):
    chapter_number = int(filename.split("_")[0])
    content = renumber_headers(content, chapter_number)

Then, in renumber_headers, we could have:

logger.info("Renumbering headers...")
# Regex pattern to match headers with leading '#' and no following '#'.
header_pattern = r'^(#+)\s+([^#]*)$'
# Define default header levels.
# We should have no more than 4 levels of headers.
headers = {
    1: chapter_number,  # Level 1: chapter / annex.
    2: 0,  # Level 2: section.
    3: 0,  # Level 3: sub-section.
    4: 0,  # Level 4: sub-sub-section.
}
# Stores the file content with renumbered headers.
processed_content = []

for line in content.split("\n"):
    match = re.match(header_pattern, line)
    if match:
        header_level = len(match.group(1))
        header_text = match.group(2)
        # Report an error if we are beyond level 4.
        if header_level > 4:
            logger.error("Header level beyond level 4!")
            logger.error(line)
            processed_content.append(line)
            continue
        # Increment the appropriate header level,
        # if below chapter / annex level:
        if header_level != 1:
            headers[header_level] += 1
        # Reset subsequent levels.
        for level in range(header_level + 1, len(headers) + 1):
            headers[level] = 0

        # Create the header with numbers.
        header_numbers = list(headers.values())[:header_level]
        header_numbers_as_str = ".".join([str(number) for number in header_numbers])
        line = f"{'#' * header_level} {header_numbers_as_str} {header_text}"

    processed_content.append(line)

logger.success("Headers renumbered successfully.")

return "\n".join(processed_content)
