
boldigger's Introduction

IMPORTANT UPDATE

BOLDigger has been replaced with BOLDigger2 - check it out here: https://github.com/DominikBuchner/BOLDigger2

BOLDigger


Python program to query .fasta files against the different databases of www.boldsystems.org

Introduction

DNA metabarcoding data sets often consist of hundreds of Operational Taxonomic Units (OTUs), which need to be queried against databases to assign taxonomy. The Barcode of Life Data system (BOLD) offers such a database and is used by many biologists. Unfortunately, only batches of 50 sequences can be identified at once. Using BOLD's API does not solve the problem completely, since it does not grant access to private and early-release data. BOLDigger aims to solve this problem. As a pure Python program with a user-friendly GUI, it not only gives automated access to the identification engine but can also download additional metadata for each sequence and help to choose the top hit from the returned results.

For the command-line version please visit https://github.com/DominikBuchner/BOLDigger-commandline

Installation

BOLDigger requires Python version 3.6 or higher and can be easily installed by using pip in any command line:

pip install boldigger

will install BOLDigger as well as all needed dependencies. BOLDigger can be started by typing:

boldigger or python -m boldigger

When a new version is released you can update by typing:

pip install --upgrade boldigger

How to cite

Buchner D, Leese F (2020) BOLDigger – a Python package to identify and organise sequences with the Barcode of Life Data systems. Metabarcoding and Metagenomics 4: e53535. https://doi.org/10.3897/mbmg.4.53535

Login to your account

The identification engine requires an account at www.boldsystems.org. A login is required to query more than one sequence with the identification engine. User data can be saved by ticking "Remember me". Note that your password will be saved unencrypted. Don't use this option if this is not okay.
Only one login per session is required.

Use the BOLD identification engine for COI, ITS, and rbcL & matK

Once logged into the account, the identification engine of BOLD can be used. An output folder needs to be selected where the results will be saved, as well as an input file in .fasta format. Three different databases can be selected (COI, ITS, or rbcL & matK), as well as a batch size. The latter controls how many sequences are identified in one request. 50 is the maximum value and the default for COI. The workable batch size depends on various parameters such as the internet connection, the availability of the BOLD database, and the length of the requested sequences, and it needs to be adjusted if a lot of ConnectionErrors occur. A batch size of 50 is recommended for COI, 10 for ITS, and < 5 for rbcL & matK.
The results will be written to the output folder and will always be named "BOLDResults_fastaname.xlsx". In case a workbook with that name already exists in the output folder the results will be appended to this file.
In version 1.2.3 an option was added to check the fasta file for invalid headers and sequences before running the identification engine. If an invalid header (too long name) or invalid sequences are found (invalid characters) a modified version of this fasta will be saved in the same space as the original with a modified name. Invalid headers will be cropped to a length of 99 characters and invalid sequence characters will be replaced with N's. The modified fasta file can be used to directly run BOLDigger or the original fasta file can be checked again and edited manually.
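The header and sequence clean-up described above can be sketched roughly as follows. This is a minimal illustration, not BOLDigger's actual code: the function name, the set of accepted characters, and the choice to measure the header without the leading ">" are all assumptions.

```python
from typing import Tuple

# Assumption: only these characters count as valid; BOLDigger's real check may
# accept a different alphabet (e.g. IUPAC ambiguity codes).
VALID_BASES = set("ACGTN")
MAX_HEADER_LENGTH = 99  # cropping length stated in the text above


def sanitize_record(header: str, sequence: str) -> Tuple[str, str]:
    """Crop an over-long header and replace invalid sequence characters with N."""
    cropped = header[:MAX_HEADER_LENGTH]
    cleaned = "".join(
        base if base in VALID_BASES else "N" for base in sequence.upper()
    )
    return cropped, cleaned
```

Records that come back unchanged from such a function would not need a modified fasta file at all.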
After every batch, the requested sequences will be removed from the input file and written to a new file named "fastaname_done.fasta" in the same folder as the input file. This is to prevent running input files twice: If BOLDigger crashes it can just be restarted with the same output folder and input file and will continue where the crash occurred.
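The resume behaviour can be pictured with a small sketch. It assumes the "done" file is named by appending "_done" to the input file's stem, as in the example name above; the helper names are hypothetical and the real implementation may differ.

```python
from pathlib import Path
from typing import List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(path: Path) -> List[Record]:
    """Read a fasta file, joining wrapped sequence lines and skipping empty lines."""
    records: List[Record] = []
    header, chunks = None, []
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(path: Path, records: List[Record], append: bool = False) -> None:
    """Write records in two-line fasta format."""
    with path.open("a" if append else "w") as handle:
        for header, sequence in records:
            handle.write(f">{header}\n{sequence}\n")


def mark_batch_done(fasta_path: Path, batch_size: int) -> List[Record]:
    """Move the first batch from the input file to '<name>_done.fasta'."""
    records = read_fasta(fasta_path)
    batch, remaining = records[:batch_size], records[batch_size:]
    done_path = fasta_path.with_name(f"{fasta_path.stem}_done.fasta")
    write_fasta(done_path, batch, append=True)  # record finished sequences first
    write_fasta(fasta_path, remaining)          # then shrink the input file
    return batch
```

Because the input file only ever contains unprocessed sequences, restarting with the same file naturally continues at the first unfinished batch.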

The BOLD server will take some time to respond to the request. The output window will freeze during this time and update once a response is sent.
Please make sure there are no additional empty lines or invalid sequences (containing letters that don't code for bases) in your .fasta file.

Test input files can be found here

Download additional data from BOLD

The standard output of the identification engine returns information about the taxonomy (Phylum, Class, Order, Family, Genus, Species and Subspecies), a similarity score for each hit in the database, the status of the record (public, private or early-access), and the BOLD Process ID.
Additional data can be downloaded via the BOLD API by providing the output of the identification engine. Additional data are BOLD Record ID, BOLD BIN, Sex, Life stage, Country, Identifier, Identification method, the institution storing the sample, and a link to the specimen page. Note that opening the specimen page requires a login to boldsystems.org.

Find the best fitting hit from the top 20 (COI) and top 99 (ITS / rbcL & matK)

There are three options available to determine the best fitting hit:

  • First hit
  • JAMP Pipeline
  • BOLDigger

First hit

This option uses the first hit and can be used for all markers supported by BOLDigger.

JAMP Pipeline

This option reproduces the output from the JAMP Pipeline. It uses different similarity thresholds for the taxonomic levels (98%: species level, 95%: genus level, 90%: family level, 85%: order level, <85%: class level) to find the best fitting hit.
After determining the threshold for all hits, the most common hit above the threshold is selected. Note that for all hits below the threshold the taxonomic resolution is adjusted accordingly (e.g. for a 96% hit the species-level information is discarded and genus-level information is used as the lowest taxonomic level).

The algorithm works as follows:

  1. Find the maximum similarity value of the top 20 hits currently looked at.
  2. Set the threshold to this level, remove all hits with a similarity below this threshold (e.g. if the highest hit has 100% Similarity, the threshold will be set to 98% and all hits below will be removed for the moment.)
  3. Count all individual classifications, sort by abundance.
  4. Drop all classifications that contain missing data (e.g. if the most common hit is "Arthropoda --> Insecta" with a similarity of 100% but missing values for Order, Family, Genus and Species)
  5. Look for the most common hit that has no missing values
  6. If one is found, return that hit.
  7. If none is found, move the threshold to the next higher taxonomic level (e.g. from species level at 98% to genus level at 95%) and repeat the process until a hit is found.
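The steps above can be sketched in a few lines of Python. This is one possible reading of the description, not the code BOLDigger ships: the dictionary keys, the function name, and the decision to truncate classifications to the level implied by the current threshold are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional

LEVELS = ["Phylum", "Class", "Order", "Family", "Genus", "Species"]
# (minimum similarity, lowest taxonomic level kept), taken from the thresholds above
THRESHOLDS = [(98, "Species"), (95, "Genus"), (90, "Family"), (85, "Order"), (0, "Class")]


def jamp_like_hit(hits: List[Dict]) -> Optional[Dict[str, str]]:
    """Return the most common complete classification at the best possible level."""
    if not hits:
        return None
    best = max(hit["Similarity"] for hit in hits)
    # Steps 1-2: start at the threshold matching the highest similarity.
    start = next(i for i, (minimum, _) in enumerate(THRESHOLDS) if best >= minimum)
    for minimum, level in THRESHOLDS[start:]:
        kept_levels = LEVELS[: LEVELS.index(level) + 1]
        counts: Counter = Counter()
        for hit in hits:
            if hit["Similarity"] < minimum:
                continue  # below the current threshold
            classification = tuple(hit.get(name) for name in kept_levels)
            if all(classification):  # steps 4-5: skip entries with missing data
                counts[classification] += 1  # step 3: count classifications
        if counts:  # step 6: a complete classification exists at this level
            return dict(zip(kept_levels, counts.most_common(1)[0][0]))
        # step 7: otherwise fall through to the next threshold
    return None
```

With this reading, a set of hits whose best similarity is 96% can never return species-level information, which matches the example given for the 96% hit.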

BOLDigger - requires additional data

This option is similar to the JAMP option but flags suspicious hits. Make sure additional data was downloaded. There are currently 5 flags implemented, which will be updated if needed:

  1. Reverse BIN taxonomy has been used for any of the top 20 hits representing the selected match. (Reverse BIN taxonomy: Deposited sequences on BOLD that lack species information may be assigned by BIN affiliation. By that, a species name is shown for deposited sequences even when no morphological identification down to species level was carried out.)

  2. There are two or more entries with differing taxonomic information above the selected threshold (e.g. two species above 98%). Note that this is sometimes the case due to deposited sequences bearing invalid epithet information such as "sp." or even a blank entry.

  3. All of the top 20 hits representing the top hit are private or early-release hits.

  4. The top hit result represents a unique hit of the top 20 hits.

  5. The hit has been corrected via the API correction tool.

A closer look at all flagged hits is advised since they represent a certain degree of uncertainty for the selected hit.
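As an illustration of how flags 2 to 4 could be derived from the top hits, here is a hedged sketch. The field names ("Taxonomy", "Similarity", "Status"), the status strings, and the function name are hypothetical; flags 1 and 5 depend on additional data and on the API correction and are left out.

```python
from typing import Dict, List


def flag_selected_hit(selected_taxonomy: str, hits: List[Dict], threshold: float) -> List[int]:
    """Return the flag numbers (2, 3, 4) that apply to the selected hit."""
    flags = []
    above = [hit for hit in hits if hit["Similarity"] >= threshold]
    supporting = [hit for hit in hits if hit["Taxonomy"] == selected_taxonomy]

    # Flag 2: differing taxonomic information above the selected threshold.
    if len({hit["Taxonomy"] for hit in above}) >= 2:
        flags.append(2)
    # Flag 3: every hit representing the selected match is private or early-release.
    if supporting and all(hit["Status"] != "public" for hit in supporting):
        flags.append(3)
    # Flag 4: the selected match is represented by a single hit only.
    if len(supporting) == 1:
        flags.append(4)
    return flags
```

A selected hit that collects several flags at once, as in a single private record competing with a second species above the threshold, is the kind of case that deserves manual inspection.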

Correction of top hits via BOLD API

This option scans the BOLDigger top hits for hits with high similarity (>= 98%) and a missing species-level assignment. This can happen if the top 20 hits are dominated by high-similarity hits without a species-level assignment, which "hide" better hits. These sequences are queried against the BOLD identification API and corrected with the most common name among the published hits. The results are saved in a new tab in the BOLDResults file for further inspection.
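The selection and correction logic can be sketched without touching the BOLD API itself. The function names and dictionary keys below are assumptions made for illustration; the list of published names would come from the API response.

```python
from collections import Counter
from typing import Dict, List, Optional


def needs_correction(top_hit: Dict, min_similarity: float = 98.0) -> bool:
    """True for hits with high similarity but no species-level assignment."""
    species = (top_hit.get("Species") or "").strip()
    return top_hit["Similarity"] >= min_similarity and not species


def most_common_species(published_names: List[str]) -> Optional[str]:
    """Pick the most frequent non-empty species name among published hits."""
    names = [name.strip() for name in published_names if name and name.strip()]
    if not names:
        return None
    return Counter(names).most_common(1)[0][0]
```

If no published hit carries a species name, the sketch returns None and the original top hit would simply be left unchanged.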

Still to do

  • Implement the identification engine API for quick analyses
  • Rework the command-line version with the new methods - will take some time
  • Add a failsafe to detect fasta format (2line, line-wrapping) and correct it before starting the search
  • Add failsafes for ITS and rbcL downloads
  • Add metadata async download to speed up the process


boldigger's Issues

xlsx output file error?

Hi @DominikBuchner

I am having issues making sense of the xlsx output file. I have a fasta with 99 seqs, ASV_1 to ASV_99.

However, the first column of the xlsx output file only contains odd numbered ASV IDs and some raw sequences in between. Where are my even numbered ASVs and what do the raw sequences mean? Further, the file only contains ASVs down to ASV_87, don't know what happened to the rest. I checked my input fasta, all 99 ASVs are definitely there and have the regular format >ASV_x followed by the raw sequence on the next line.

My output file is attached.

BOLDResults_queryFile.fa.1.0.560931114014238.xlsx

Thanks for having a look at it. I really would like to use your tool, but given the current output this seems a bit difficult ;-)

cheers
Nauras

Limit PySimpleGUI requirement to version 4

PySimpleGUI 5 is out, but is now closed-source and subscription-based

Although the subscription is free for non-commercial users, which includes most of us, it requires you to acquire a free license once a year by ticking a pop-up box. This creates problems in automatic installations of BOLDigger, like when setting up venvs or conda environments, and is especially annoying for people only using BOLDigger-commandline, where the PySimpleGUI requirement is completely superfluous. It also means with PySimpleGUI 5 BOLDigger cannot be called completely open-source anymore...

I would suggest to just limit the range of acceptable PySimpleGUI versions in the requirements spec for BOLDigger.
"PySimpleGUI >= 4.18.2, < 5.*.*"
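One caveat on the exact spelling: under PEP 440 the ".*" wildcard is only valid with the == and != operators, so an upper bound written as "< 5.*.*" may be rejected by packaging tools. A hedged sketch of how the pin could look in a setup.py follows; the variable name is illustrative and the rest of the setup call is omitted.

```python
# Hypothetical excerpt of a setup.py dependency list; only the PySimpleGUI pin
# is shown, other requirements are left out.
INSTALL_REQUIRES = [
    "PySimpleGUI>=4.18.2,<5",  # stay on the open-source 4.x series
]
```

The same specifier string works on the command line, e.g. pip install "PySimpleGUI>=4.18.2,<5".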

Dataframe index issues in boldblast_coi.py, with workaround

Kudos for creating BOLDigger! This is the smoothest way to use BOLD I have encountered so far.
I encountered this error while trying to launch the Boldigger GUI from PowerShell:

PS C:\Users\username> boldigger
Traceback (most recent call last):
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\username\AppData\Local\Programs\Python\Python38-32\Scripts\boldigger.exe\__main__.py", line 7, in <module>
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\__main__.py", line 70, in main
boldblast_coi.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 247, in main
dataframes = save_as_df(html_list, sequences_names[querys.index(query)])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 105, in save_as_df
cols = dataframes[index].columns.tolist()
UnboundLocalError: local variable 'index' referenced before assignment

The same issue occurred under Windows 10 and Ubuntu 18.04 with Python 3.8.5 and a fresh install of Boldigger 1.2.1. Looking at line 105 in boldblast_coi.py, there is potential for a variable scope error:

    ## save columns before to sort them after
    cols = dataframes[index].columns.tolist()

This 'index' is the iterator variable in the preceding for...range loop. I inserted an explicit 'index = 0' before the loop, which lets the GUI launch. However, when I try to run the IDS query, I get a new error in the same statement:

PS C:\Users\username> boldigger
Traceback (most recent call last):
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\username\appdata\local\programs\python\python38-32\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\username\AppData\Local\Programs\Python\Python38-32\Scripts\boldigger.exe\__main__.py", line 7, in <module>
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\__main__.py", line 70, in main
boldblast_coi.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 248, in main
dataframes = save_as_df(html_list, sequences_names[querys.index(query)])
File "c:\users\username\appdata\local\programs\python\python38-32\lib\site-packages\boldigger\boldblast_coi.py", line 106, in save_as_df
cols = dataframes[index].columns.tolist()
IndexError: list index out of range

I think the root of the problem this time was that my COI FASTA file included some sequences that are too short for IDS; six sequences out of 3747 were shorter than 30 nt, and any batch of 100 sequences containing these failed to process. Somehow that led to 'dataframes' being assigned an empty list (verified by 'print(len(dataframes))' ) instead of 'nomatch_df'. This issue was partially fixable by adding a check for 'dataframes' being empty:

    ## add process IDs to published sequence
    **index = 0**
    for index in range(len(dataframes)):
        {...}

    ## save columns before to sort them after
    cols = dataframes[index].columns.tolist() **if len(dataframes) > 0 else []**
    ## add sequence names to dataframes
    for index in range(len(dataframes)):
        {...}

With the additions in boldface, BOLDigger completed successfully, although batches including reads too short for IDS led to empty dataframes, effectively skipping that batch. Adding additional data and selecting best hits for the downloaded hits worked without issues.

I have not investigated the scripts for the other databases in detail, but if the same code was used for 'save_as_df', the same problem could occur for ITS and rbcL/Matk.
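Until such batches are handled inside BOLDigger, one way to avoid them is to separate very short reads before submitting the file. The sketch below assumes records are already available as (header, sequence) pairs; the 30 nt cut-off is taken from the observation above and has not been checked against BOLD's documentation.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def split_by_length(records: List[Record], min_length: int = 30) -> Tuple[List[Record], List[Record]]:
    """Split records into those long enough to submit and those that are too short."""
    keep: List[Record] = []
    too_short: List[Record] = []
    for header, sequence in records:
        if len(sequence) >= min_length:
            keep.append((header, sequence))
        else:
            too_short.append((header, sequence))
    return keep, too_short
```

Writing the short reads to a separate file keeps a record of what was excluded, so no sequence disappears silently from the analysis.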

API verification fails

Hi

API verification fails in version 2.0.5, just like in the command line version. Very briefly (for a split second), something like the following pops up:

14:55:20: Starting API verification.
14:55:20: Collection OTUs without species level identification and high similarity.

Then Boldigger just terminates. Below the error message I got when running the command line version. Guess it is applicable for the GUI version as well:

Traceback (most recent call last):
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Nauras\Programs\Python\Python39\Scripts\boldigger-cline.exe\__main__.py", line 7, in <module>
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\__main__.py", line 73, in main
api_verification.main(args.xlsx_path, args.fasta_path)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\api_verification.py", line 17, in main
raw_data, data_to_check, seq_dict = extract_data(xlsx_path, fasta_path)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger\api_verification.py", line 31, in extract_data
raw_data = pd.read_excel(xlsx_path, sheet_name = 'BOLDigger hit')
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\util\_decorators.py", line 296, in wrapper
return func(*args, **kwargs)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel\_base.py", line 304, in read_excel
io = ExcelFile(io, engine=engine)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel\_base.py", line 867, in __init__
self._reader = self._engines[engine](self._io)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 22, in __init__
super().__init__(filepath_or_buffer)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel\_base.py", line 353, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "c:\users\nauras\programs\python\python39\lib\site-packages\pandas\io\excel\_xlrd.py", line 37, in load_workbook
return open_workbook(filepath_or_buffer)
File "c:\users\nauras\programs\python\python39\lib\site-packages\xlrd\__init__.py", line 170, in open_workbook
raise XLRDError(FILE_FORMAT_DESCRIPTIONS[file_format]+'; not supported')
xlrd.biffh.XLRDError: Excel xlsx file; not supported

Seems like a problem with reading Excel files? BOLDigger results file with BOLDigger top hits attached.
trial.xlsx

Cheers

Nauras

Hymenoptera issue

Hey Dominic,

First: BOLDIGGER is truly a great tool. We do a lot of insect metabarcoding at the moment and wouldn't get such high-quality results without it, so thank you so much for constructing it!

One small issue keeps coming up that I think you should know: Hymenoptera hits don't get transported correctly to the results. Here's an example in the BOLDResults file:
image

And here's the hits on BOLD website:
image

I also notice that the order of the ASVs in the 'BOLDigger hit' tab of the BOLDResults excel file is skewed whenever a Hymenopteran comes up:
image
versus the order in the first tab (starting with 'Run dd/mm/yyyy tt.tt'):

0cc778207a38b7a6e881d96d2ae63bbf
1a994f0b7d83ca1fd64f85e5463ab493
20f61de5ad7c87a8320d5a7c6c403aae
28052a092ec236890bf2034336b82950 (this one is different, but also a Hymenopteran)
4a8e00925b4fcd21335a0ee102ffcba2

I'm not sure what could be causing it, but just wanted to flag it. The primer pair I use (mlCOIintF, HCO2198) is not great at Hymenopterans (or so I understand from literature) but I'm wondering if it could have to do with the bioinformatics..

Thanks for your thoughts in advance, hope I explained it well to you,

Best regards,

Marcel Polling

BOLD did not respond! several weeks?

Dear Dominik and BOLDigger users,

I wonder if you also experience the same cycle of:

10:19:50: Requesting BOLD. This will take a while.
10:19:51: Downloading results.
10:19:51: Parsing html.
10:19:51: BOLD did not respond! Retrying.

these days. I thought it was due to the BOLD maintenance some days ago, but perhaps not?

I tried decreasing the batch size to 10 and then to 1, which was finally successful, but after a while it got stuck again.
Now I also noticed that the single sequence in *_done.fasta is incomplete and its last line remained in the original fasta file (attached). Maybe here we have a new format error to include in the checks?
The fasta file passed the BOLDigger check and when I copied the sequences directly into ID engine on BOLD website, I got results normally.
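If line-wrapped sequences are indeed the cause, converting the file to two-line format before the run might avoid the problem. The following is a minimal sketch of such a conversion, not an existing BOLDigger function; the function name is made up for illustration.

```python
def unwrap_fasta(text: str) -> str:
    """Convert line-wrapped fasta text to two-line format (header, sequence)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # drop empty lines
        if line.startswith(">"):
            records.append([line, ""])
        elif records:
            records[-1][1] += line  # join wrapped sequence lines
    return "".join(f"{header}\n{sequence}\n" for header, sequence in records)
```

After this conversion every record occupies exactly two lines, so moving a record between files cannot split a sequence.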

Thanks,
Ondrej

P.S.:
With this issue I realized how wonderful it would be if BOLDigger did not crash every time I close its process window or when the pc goes to sleep unintentionally. This way I have to enter the folder, file and account info again and again (for some reason Ctrl+C works only once there or not even once in case I click elsewhere before pasting :D )
boldigger_issue.zip

Best fitting hit incorrect

Hi,

I used classification for COI and then ran the program to find the best fitting hit both with the JAMP and the BOLDigger method. I found a case where both BOLDigger and the JAMP method choose a hit even though another publicly available hit with better taxonomic classification and similarity was found.

Please check the entry for ASV308 in the attached file. BOLDigger and JAMP choose a published hit with a classification to class level and a similarity of 95.12. However, in the sheet showing the 20 best hits, there are several published hits with higher similarity and better classification. The top hit has a similarity of >97% and classification down to species level (although given the similarity value, only genus-level classification should be trusted, obviously).

BOLDResults_COI_cluster_reps_curated_no_contam.xlsx

A bug?

Cheers

Nauras

Broken login

Hi, Dominik,

Recently, I've encountered an issue where the Boldigger GUI becomes unresponsive when attempting to log in to BOLD. This problem occurs not only on my computer (macOS, python 3.9.18) but also on those of my colleagues (Ubuntu 22.04, python 3.9.18), even on fresh installs of Boldigger.

I am unsure if this issue is related to specific versions of required Python modules or if it is a problem with the BOLD webpage itself. It appears that the POST request to "https://boldsystems.org/index.php/Login" is not functioning as expected.

Best,
Víctor

Finding best fitting hit fails for JAMP and BOLDigger method

Hi,

I am having issues getting the top hit for the JAMP and BOLDigger methods. I am using the most recent version.

Error message (BOLDigger method, JAMP method fails with the same error):

$ boldigger-cline digger_hit BOLDResults_COI_cluster_reps_curated_no_contam.xlsx
12:08:01: Opening resultfile.
12:08:23: Filtering data for JAMP hits.
Traceback (most recent call last):
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "c:\users\nauras\programs\python\python39\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Nauras\Programs\Python\Python39\Scripts\boldigger-cline.exe\__main__.py", line 7, in <module>
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\__main__.py", line 70, in main
digger_sort.main(args.xlsx_path)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\digger_sort.py", line 15, in main
jamp_hits = [jamp_hit(otu) for otu in otu_dfs]
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger_cline\digger_sort.py", line 15, in <listcomp>
jamp_hits = [jamp_hit(otu) for otu in otu_dfs]
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger\jamp_hit.py", line 55, in jamp_hit
threshold, level = get_threshold(df)
File "c:\users\nauras\programs\python\python39\lib\site-packages\boldigger\jamp_hit.py", line 13, in get_threshold
elif threshold >= 98:
TypeError: '>=' not supported between instances of 'str' and 'int'

There seems to be an error with how similarity values are being assessed? The first hit method works fine. My results file is attached.
BOLDResults_COI_cluster_reps_curated_no_contam.xlsx. The same issue appears with the commandline version.
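The TypeError suggests that at least one similarity value reached the comparison as text rather than as a number. One plausible workaround, shown here only as a sketch with a hypothetical helper name, is to coerce the values before comparing them against the thresholds.

```python
from typing import Optional


def to_similarity(value) -> Optional[float]:
    """Return the value as a float, or None if it cannot be parsed as a number."""
    if isinstance(value, (int, float)):
        return float(value)
    try:
        # Tolerate surrounding whitespace and a decimal comma.
        return float(str(value).strip().replace(",", "."))
    except ValueError:
        return None
```

Rows for which the helper returns None (for example a placeholder string in the similarity column) could then be skipped or treated as "no match" instead of crashing the threshold check.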

Cheers

Nauras

trouble running jamp_hit and digger_hit

Hello @DominikBuchner ,

I am struggling with getting jamp and boldigger methods to run, both in command line and interface. This is the first dataset I am running since I have updated to the new version, and it keeps failing when creating a new sheet:

13:27:20: Opening resultfile.
13:27:47: Filtering data for JAMP hits.
13:30:35: Extracting additional data.
13:31:41: Flagging the hits.
13:34:59: Saving result to new tab.
Traceback (most recent call last):
File "/home/filipamsmartins/.local/bin/boldigger-cline", line 8, in <module>
sys.exit(main())
File "/home/filipamsmartins/.local/lib/python3.8/site-packages/boldigger_cline/__main__.py", line 70, in main
digger_sort.main(args.xlsx_path)
File "/home/filipamsmartins/.local/lib/python3.8/site-packages/boldigger_cline/digger_sort.py", line 30, in main
save_results(xlsx_path, output)
File "/home/filipamsmartins/.local/lib/python3.8/site-packages/boldigger/digger_sort.py", line 128, in save_results
writer.book = wb
AttributeError: can't set attribute

Thank you in advance for your help. Best,
Filipa

windows: crash fixed by updating beautifulsoup4

BOLDigger crashes on windows 10, claiming Pandas needs beautifulsoup4 v 4.11.1 or newer (my machine had 4.10 installed).

Fixed by forcing an upgrade of bs4 using:
pip install --upgrade beautifulsoup4

Is BOLD blocking BOLDigger

Hi Dom!

Me and some lab mates have been trying to use BOLDigger this week without much luck, unfortunately. I haven't had any trouble in the past, but now I can't get the program to run without crashing. It always produces the same error: "too many 500 error responses". I assume that this means the connection is timing out - or that BOLD itself is blocking my IP from accessing the ID Engine. I've tried multiple different batch sizes in case the issue is a time-out error related to a too-large batch size. But alas, I get the same error even with a batch size of 1. I have also logged into BOLD and manually submitted ID Engine requests ranging from 1-20 sequences at a time with no issue, so I don't think it's a problem with my firewall or internet connection. I've even tried it from various locations (different networks), different computers, on VPN vs off. None of it works!

All of this to say - are you experiencing the same problems? Do you have any idea what's going on? Perhaps I'm missing something simple!

BOLD Not Responding: Attempting Retry

Hi, I ran the rbcL analysis in BOLDigger with a batch size of 3 (since it is reported that it should be less than 5). However, I encountered the error "BOLD did not respond! Retrying." How can I solve this issue?

Thanks for your time.
Anil

I added my fasta file if you want to look at it.
M6_1-2-3_C10r005_twoline.zip

Discrepancy between BOLDigger output and BOLD's identification engine

Hi,

I came across something weird by chance while going through my BOLDigger output file.

I have ~8,000 COI metabarcoding sequences which I classified with BOLDigger. I was using boldigger-cline v2.1.2 at that time, which was in July 2023. I opened this issue here because I don't think this is an issue specific to the command-line tool.

Below are the top 20 hits for ASV17731:

grafik

When I manually check this sequence on BOLD's website against the All Barcode Records on BOLD database, I get the following nearest matches:

grafik

What bothers me is that these 20 matches in BOLD and BOLDigger are almost identical. But BOLDigger says this ASV has 87.38% similarity with the insect family Chironomidae, while on BOLD, this similarity value is attributed to a taxon of Ochrophyta.

I checked this now, in August 2023, so a month after the initial classification. But I honestly don't think that this has anything to do with it. Or am I wrong?

How can this sequence, according to BOLDigger, have the exact same similarity value for an insect family as well as an alga, while the former is not even listed in the output when I manually consult the BOLD identification engine?

This is the sequence in case you would like to reproduce the problem:

ASV17731
ATTATCATCTATTCAAGCGCATTCAGGGCCTTCAGTAGATATGGCGATTTTTAGTTTACATTTATCAGGTGCAGGTTCTATTTTAGGAGCAATTAATTTTATTGTAACTATCTTTAACATGCGTGCCCCAGGACTTTTCTTACATAAAATGCCTCTTTTTGTATGATCTGTATTAGTAACTGCATTTTTACTTTTATTATCTTTACCAGTTTTCGCTGGAGCAATTACTATGCTTTTAACAGATCGTAACTTTAATACAAGCTTTTATGATCCTGCCGGAGGAGGAGATCCAGTATTATACCAACATCTTTTC

Cheers

nauras

Bold not responding

Hi, and thank you for developing this tool.
We have tried to run our fasta file with BOLDigger, but unfortunately we get an error message saying that BOLD is not responding when it tries to download the results.

We have run the cline pipeline with your test database and get the same results. Any suggestions on how to solve this?
This is the command we have tried:
boldigger_cline ie_coi username password C:\Users\elbr2874\Downloads\COI.fasta C:\Users\elbr2874\Desktop\boldresults
Best
Francisco

Passing literal html to 'read_html' is deprecated and will be removed in a future version.

Hello @DominikBuchner,

Thank you for creating this awesome tool!

Just to note that I've received the following warning when running BOLDIGGER via the command line (boldigger-cline ie_coi). It seems to run fine at this point, but might become an issue in the future?

boldigger/boldblast_coi.py:69: FutureWarning: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
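The warning text itself names the remedy: wrap the literal HTML string in a StringIO object before handing it to read_html. A minimal sketch follows; the helper name is made up, and the pandas call is shown only as a comment so the snippet stays standard-library only.

```python
from io import StringIO


def as_html_buffer(html: str) -> StringIO:
    """Wrap a literal HTML string in a file-like object."""
    return StringIO(html)


# Assumed usage inside the parsing code (pandas imported as pd):
#   tables = pd.read_html(as_html_buffer(response_text))
```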

Thanks,
Gert-Jan

Assigning BIN to top hits

Getting Barcode Index Numbers (BINs) could be useful for certain analyses as an extra source of information, especially in datasets with many undescribed species.

I imagine the procedure for assigning BIN as something along the lines of:

  • Initially use the same algorithm as for assigning JAMP hits.
  • If the threshold is set at 98% at step 2, then attempt to assign a BIN.
  • For this, use the same subset of hits as used in step 5 of the JAMP algorithm (“Look for the most common hit that has no missing values”) and check for the number of unique BINs within this subset.
  • If this number is 1, assign that unique BIN.
  • If more than one BIN is found in the subset, do not assign a BIN (i.e., leave the BIN field empty). Perhaps these cases could be flagged (with a number and a list of the unique BINs found).
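The last three points of this proposal could look roughly like the sketch below. It assumes the supporting hits are dictionaries with a "BIN" key, which is an illustrative data layout rather than BOLDigger's internal one.

```python
from typing import Dict, List, Optional, Tuple


def assign_bin(supporting_hits: List[Dict]) -> Tuple[Optional[str], Optional[List[str]]]:
    """Return (assigned BIN, conflicting BINs) for the hits supporting the selected match."""
    bins = sorted({hit["BIN"] for hit in supporting_hits if hit.get("BIN")})
    if len(bins) == 1:
        return bins[0], None   # exactly one BIN: assign it
    if len(bins) > 1:
        return None, bins      # several BINs: leave empty, report them for flagging
    return None, None          # no BIN information available
```

Returning the conflicting BINs alongside the empty assignment makes it easy to add the suggested flag with the number and list of BINs found.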

BOLD Not Responding: Attempting Retry

Description

The messages displayed in the screenshot (attached) are repeating every second. I'm uncertain on how long I should wait for the messages to load efficiently. Do you have any insights on this situation?

I am encountering the following log messages:

2023-01-29 19:04:06.902 python3.9[50142:2472810] +[CATransaction synchronize] called within transaction
2023-01-29 19:04:26.391 python3.9[50142:2472810] +[CATransaction synchronize] called within transaction

Additionally, I've included an example of my "*.fasta" file below. The file contains header information, followed by the quality score and the actual sequence data, shortened here as "SEQUENCESEQUENCE"; the original length of the DNA sequence is about 190 bp.

My original fasta file contains many entries, but for now, I'm testing with just like 10 lines. Can you suggest a specific fasta file to use following a step such as Obitools?

***:50:HL725AFX3:1:11101:7839:1038 count=1128; obiclean_count={'XXX': 1296}; obiclean_head=True; obiclean_cluster={'XXX': 'NB552469:50:HL725AFX3:1:11101:7839:1038'}; obiclean_internalcount=0; obiclean_status={'XXX': 'h'}; obiclean_samplecount=1; obiclean_headcount=1; obiclean_singletoncount=0; 1:N:0:ATTGTAAT+NAACCGCG
SEQUENCESEQUENCE...

Screenshots

image

image

Environment

python3.9

Thanks for your time.
Onur.

First hit on corrupt excel files

The first hit sorting method fails on some excel files.
Not yet sure what causes the error since the other options work fine.
Fix needed.

Include BOLD API extra data for corrected list of top hits?

Hello!

I've been using BOLDigger for a while and I am so appreciative to have this tool at my fingertips! I'm recently exploring the option to generate a top hit list using the BOLDigger settings with the associated flags and correcting identifications using the BOLD API. I really like that the program has built in the functions to look at conflicting taxonomy and missing species names among the closely related (>98% similarity) top hits.

However, I'm finding that the loss of additional data for these lists (like the process ID, BIN, and location of the top hit) is unfortunate, as those pieces can really help guide the exploration of taxonomic assignments for my data.

So, that said, this isn't really an "issue" or a "bug", but rather a request to add this feature. Is this possible at all?

Thanks again for creating such an awesome program!
Monica

Long fasta files (>50 seqs) stop the identification engine

For fasta files with many sequences, the identification engine stops after 50 or 100 records and enters a loop in which no data is retrieved. If the name of the fasta file is changed, it works again for another 50-100 records. We solved the problem by submitting a different fasta file per sequence (or small batch of sequences) and then merging the information at the end, but it would be good if BOLDigger did that automatically. Could BOLD be blocking the retrieval of too many records from the same request file?
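
The workaround described above (one small fasta file per batch, merged afterwards) could be scripted roughly as follows. This is a minimal sketch and not part of BOLDigger: the function names `read_fasta` and `split_fasta`, the `_part_N.fasta` naming and the default batch size of 50 are all assumptions made for illustration.

```python
from pathlib import Path


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_fasta(fasta_path, out_dir, batch_size=50):
    """Write the records of fasta_path into numbered files of at most batch_size records."""
    fasta_path, out_dir = Path(fasta_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    records = read_fasta(fasta_path.read_text())
    written = []
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        # Hypothetical naming scheme: <original stem>_part_<n>.fasta
        part = out_dir / f"{fasta_path.stem}_part_{start // batch_size + 1}.fasta"
        part.write_text("".join(f">{h}\n{s}\n" for h, s in batch))
        written.append(part)
    return written
```

Each file returned by `split_fasta` can then be submitted on its own, which also gives every request a different file name.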

Damaged BOLD records stop the identification engine

For unknown reasons, some of BOLD's records are broken, so the Top 20 hit table never loads. BOLDigger goes through the maximum number of retries for those specific entries until it stops working and crashes.

Fix needed:

  • maybe assign "broken record" to those hits and retrieve the top hit later on via API?
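
The idea in the bullet above could look roughly like this. It is a sketch under assumptions, not BOLDigger's code: `fetch` stands for whatever callable retrieves the hit table, and the dictionary layout and the default of three retries are invented for illustration.

```python
def fetch_with_fallback(fetch, record_id, max_retries=3):
    """Call fetch(record_id) up to max_retries times; return a placeholder on failure."""
    for _ in range(max_retries):
        try:
            return {"id": record_id, "status": "ok", "hits": fetch(record_id)}
        except Exception:
            # Any failure counts as one used retry; try again.
            continue
    # All retries failed: mark the entry instead of crashing,
    # so it can be resolved later (for example via the API).
    return {"id": record_id, "status": "broken record", "hits": None}
```

Entries whose status is "broken record" can be collected after the run and looked up separately, so a single damaged record no longer stops the whole batch.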

Upload fasta with more than 100 seqs?

Hi @DominikBuchner

very cool tool and very easy to use. fantastic. really appreciate it.

The documentation didn't really say, but is your tool actually meant to be used with fastas containing more than 100 sequences, with the algorithm automatically splitting them into batches of at most 100 seqs, or does it only work with fastas containing a maximum of 100 seqs?

cheers
Nauras

Ubuntu: rbcl fasta file run error

Hi, I'm using BOLDigger v1.5.6 with Python 3.8 and got the following error:

Traceback (most recent call last):
File "/home/yao/.local/bin/boldigger", line 8, in <module>
sys.exit(main())
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/__main__.py", line 108, in main
boldblast_rbcl.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/boldblast_rbcl.py", line 70, in main
tables = asyncio.run(as_session(links))
File "/usr/lib/python3.8/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
return future.result()
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/boldblast_its.py", line 70, in as_session
return await asyncio.gather(*tasks)
File "/home/yao/.local/lib/python3.8/site-packages/boldigger/boldblast_its.py", line 46, in as_request
table = pd.DataFrame([[0], ['No Match'] * 7 + [np.nan] * 3 + [''] * 2] * 20,
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 721, in __init__
arrays, columns, index = nested_data_to_arrays(
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 519, in nested_data_to_arrays
arrays, columns = to_arrays(data, columns, dtype=dtype)
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 883, in to_arrays
content, columns = _finalize_columns_and_data(arr, columns, dtype)
File "/home/yao/.local/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 985, in _finalize_columns_and_data
raise ValueError(err) from err
ValueError: 13 columns passed, passed data had 12 columns

My input file is a .fas file containing 100 sequences, and the batch size was set to 5.
Does anyone know how to fix it? Thanks a lot!
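
One possible reading of the traceback, offered as an assumption rather than a confirmed diagnosis: in the `pd.DataFrame(...)` line as printed, the comma after `[0]` makes `[0]` a row of its own, so the widest row has 12 values while 13 column names are passed. The plain-Python sketch below only compares row widths; the variable names are hypothetical, the line may have been altered when the page was copied, and the concatenated variant is a guess at the intent.

```python
NAN = float("nan")
EXPECTED_COLUMNS = 13  # taken from the error message "13 columns passed"

# As printed in the traceback: the comma makes [0] a separate one-element row.
rows_as_printed = [[0], ["No Match"] * 7 + [NAN] * 3 + [""] * 2] * 20

# Hypothetical variant: concatenating with + yields one 13-element row per hit.
rows_concatenated = [[0] + ["No Match"] * 7 + [NAN] * 3 + [""] * 2] * 20


def row_widths(rows):
    """Return the distinct row lengths found in a list of rows, sorted."""
    return sorted({len(row) for row in rows})


print(row_widths(rows_as_printed))    # [1, 12]
print(row_widths(rows_concatenated))  # [13]
```

If this reading is right, the error only appears when the placeholder "No Match" table is built, that is, when a query returns no hits.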

ID description

Hi Dominik,

I'm working with BOLDigger v.2.0.3 (and boldigger-cline v.1.0.0) and have the issue that the "ID" in the results file starts with the greater-than (">") symbol (inherited from the fasta file description line). The fasta file is generated with Qiime2 and the ">" symbol unfortunately causes an error when using the BOLDigger taxonomy table together with the Qiime2 read table in TaxonTableTools.
So a fix of this issue would be quite helpful to streamline the usage.

Thanks!
Sascha

BOLDResults_test-dna-sequences_part_1.xlsx
test-dna-sequences.txt
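
Until this is changed in BOLDigger itself, the IDs can be cleaned before the table is passed on. A minimal sketch, assuming the IDs have already been read into a list of strings; `clean_id` is a hypothetical helper, not a BOLDigger function.

```python
def clean_id(raw_id):
    """Strip whitespace and any leading '>' characters from a FASTA-derived ID."""
    return raw_id.strip().lstrip(">").strip()


ids = [">OTU_1", "OTU_2", "> OTU_3 "]
print([clean_id(i) for i in ids])  # ['OTU_1', 'OTU_2', 'OTU_3']
```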

need to check end of fasta file

Hello, first of all, big thanks and thumbs up for this wonderful tool!

If my fasta file comes directly from JAMP's <0.01% read abundance filtering (as attached error_1.fasta), I need to remove the last 2 lines:

>below_0.01
 NA

Otherwise BOLDigger crashes and the terminal shows the error UnboundLocalError: local variable 'index' referenced before assignment
I also had to make sure there is an empty line after the last sequence: if the number of OTUs in the last batch equals the batch size and there is no empty line after the last OTU sequence, I get the same error.
While playing around with the issue, it also sometimes crashed with IndexError: list index out of range and the same traceback (see below). This seems to happen when the empty line is the only thing in the last batch, which is no problem because all OTUs have already been processed at that point.

I am new to Python and have no time right now to suggest a fix. So in case you decide that it's not worth fixing in the code, it would be useful at least to tell users through the README that they might need to edit their fastas as I did. Good luck!

Traceback (most recent call last):
File "/home/ono/.local/bin/boldigger", line 11, in <module>
sys.exit(main())
File "/home/ono/.local/lib/python3.6/site-packages/boldigger/__main__.py", line 60, in main
boldblast_coi.main(session, values['fasta_path'], values['output_folder'], values['batch_size'])
File "/home/ono/.local/lib/python3.6/site-packages/boldigger/boldblast_coi.py", line 246, in main
dataframes = save_as_df(html_list, sequences_names[querys.index(query)])
File "/home/ono/.local/lib/python3.6/site-packages/boldigger/boldblast_coi.py", line 104, in save_as_df
cols = dataframes[index].columns.tolist()
UnboundLocalError: local variable 'index' referenced before assignment
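
The manual edits described in this report (dropping the trailing "below_0.01 / NA" entry and ending the file with a newline) can be automated before the file is handed to BOLDigger. A minimal sketch under assumptions: `clean_fasta_text` is a hypothetical helper, it joins multi-line sequences onto one line, and it treats any record whose sequence is empty, is exactly "NA" or contains non-IUPAC characters as a placeholder to drop.

```python
import re

# IUPAC nucleotide codes plus the gap character.
VALID_SEQUENCE = re.compile(r"^[ACGTURYSWKMBDHVN-]+$", re.IGNORECASE)


def clean_fasta_text(text):
    """Return FASTA text without placeholder records and with a final newline."""
    kept = []
    # Split at every '>' that starts a line; the first chunk precedes any header.
    for block in re.split(r"^>", text, flags=re.MULTILINE)[1:]:
        lines = [line.strip() for line in block.splitlines()]
        if not lines:
            continue
        header = lines[0]
        sequence = "".join(line for line in lines[1:] if line)
        # "NA" passes the IUPAC check (N and A are valid codes), so test it explicitly.
        if header and VALID_SEQUENCE.match(sequence) and sequence.upper() != "NA":
            kept.append(f">{header}\n{sequence}\n")
    return "".join(kept)
```

Writing the returned text to a new file leaves the original JAMP output untouched.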

Correction of top hits via BOLD API not initiated

Hi @DominikBuchner

The correction of the top hits is exactly what I need. There are quite a few cases where the top hit has no species level assignment and higher resolution assignment is masked (or "hidden" as you called it).

The problem is that the correction is just not being initiated. I am using the GUI for the BOLD API correction (as it doesn't seem to be implemented yet in the command line version). So I did my identification, downloaded the additional data and got my BOLDigger top hits. All good. I checked the fasta again, and BOLDigger says it's fine. But when I select my results file and the corresponding fasta at the bottom of the GUI and click run, the small window that always appears when a BOLDigger process starts and shows what's going on pops up for about half a second, and then both the GUI and the boldigger.exe window disappear.

Any idea what is going on here? I would really like to use it, as it's a pain to sort this out manually.

Cheers

nauras

Private entry as top hit despite 100% similarity published entry available

Hi,

I am running BOLDigger v2.2.0 on the COI sequences in the txt file attached.

I am performing all the steps:

  1. Running the identification engine with a batch size of 5.
  2. Searching additional data
  3. Adding top hits with BOLDigger method.
  4. Download additional identification data via BOLD API for correction.

I encountered a weird behavior for MOTU23. See the top 20 hits for this MOTU:

grafik

BOLDigger hit sheet:

grafik

BOLDigger hit - API corrected sheet:

grafik

Why does BOLDigger end up with a private entry as the chosen hit? Why doesn't it choose a published entry? Why doesn't it choose Botrylloides leachi(i) as the top hit?

Sequences attached for you to reproduce the problem.

Cheers

nauras
seqs_test.txt

BOLDigger hit type 2 overflagging

Many records in the BOLD System carry information in their specific epithet that does not correspond exactly to a species name,
e.g.: sp. a AK-2021, sp. (Johor), communis A1A2, cf. alpium.

Because of this, many hits are labelled as type 2 in the BOLDigger hit pipeline.

I therefore think it could be useful to add a species name cleaning step at the beginning of the BOLDigger hit process, as follows:

Delete the species name completely if it contains:
"sp." missing species name (e.g. sp. CFJS-2021b)
"cf." doubtful species name (e.g. cf. micrura)
"aff." doubtful species name (e.g. aff. hornsundi)
"grp." group, doubtful species name (e.g. pedellus grp.)
" / " doubtful species name

Erase everything after (to leave only the species name):
" ssp." subspecies name
" var." variant name (e.g. australogibba var. subcapitata)
" " additional information appended after the species name (e.g. bilobata CEA)

After this, I would delete the entries containing numbers (e.g. sp0949C, Malaise3164).

One element I am unsure about deleting is hybrids. It might be sensible to remove them by default but offer an option not to do so.
" x " hybrids (e.g. pennsylvanicus x firmus)

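The cleaning rules proposed in the overflagging request above could be sketched as follows. This is an illustration of the proposal, not BOLDigger's behaviour: `clean_epithet`, `DOUBT_MARKERS` and the `drop_hybrids` switch are hypothetical names, and markers are matched as whole whitespace-separated tokens so that "ssp." is not mistaken for "sp.".

```python
# Tokens that mark a missing or doubtful species name.
DOUBT_MARKERS = {"sp.", "cf.", "aff.", "grp.", "/"}


def clean_epithet(epithet, drop_hybrids=True):
    """Return a cleaned specific epithet, or '' if it should be discarded."""
    tokens = epithet.split()
    if not tokens:
        return ""
    # Rule 1: discard names carrying a doubt marker.
    if DOUBT_MARKERS.intersection(tokens):
        return ""
    # Hybrids ("a x b"): discarded by default, kept unchanged on request.
    if "x" in tokens[1:-1]:
        return "" if drop_hybrids else " ".join(tokens)
    # Rule 2: keep only the first word, which removes " ssp.", " var."
    # and any other information appended after the species name.
    first = tokens[0]
    # Rule 3: discard what is left if it contains digits.
    if any(ch.isdigit() for ch in first):
        return ""
    return first


for name in ["sp. a AK-2021", "communis A1A2", "australogibba var. subcapitata", "Malaise3164"]:
    print(repr(name), "->", repr(clean_epithet(name)))
```

Matching on tokens rather than substrings is a design choice of this sketch; a substring test for "sp." would also discard every name containing " ssp.".
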
Filter possible NUMTS from BOLDigger assignments

In the metabarcoding amplification process, it is possible that not only the mitochondrial gene but also nuclear copies of that gene (NUMTs) are amplified. These can lead to false-positive detections and identifications in the BOLDigger and JAMP pipeline.

NCBI marks such copies as UNVERIFIED in its database (GenBank) if it detects internal stop codons or indels inside the gene sequence, where there should not be any.

It might not be hard to incorporate the detection of these sequences into BOLDigger by adding a new flag when internal stop codons are detected in a sequence after it has been assigned (the assignment is needed first because the stop codons depend on the taxonomic group to which each OTU belongs).

Thank you very much for this wonderful pipeline, best regards!
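
A stop-codon check along the lines suggested above might look like this. It is a sketch under assumptions: the function names are hypothetical, only the three forward reading frames are tested because the frame of an amplicon is not known in advance, and a sequence is flagged only if every frame contains a stop codon. The stop codon sets follow NCBI translation tables 1, 2 and 5.

```python
STOP_CODONS = {
    "standard": {"TAA", "TAG", "TGA"},                # NCBI translation table 1
    "vertebrate_mito": {"TAA", "TAG", "AGA", "AGG"},  # NCBI translation table 2
    "invertebrate_mito": {"TAA", "TAG"},              # NCBI translation table 5
}


def stop_free_frames(sequence, stop_codons):
    """Return the forward reading frames (0, 1, 2) that contain no stop codon."""
    seq = sequence.upper().replace("-", "")
    frames = []
    for frame in range(3):
        # Walk through complete codons only; a trailing partial codon is ignored.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in stop_codons for codon in codons):
            frames.append(frame)
    return frames


def flag_possible_numt(sequence, code="invertebrate_mito"):
    """Flag a sequence if none of its forward frames is free of stop codons."""
    return not stop_free_frames(sequence, STOP_CODONS[code])
```

The genetic code would be chosen from the taxonomy assigned to each OTU, which is why such a flag can only be set after the identification step.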

Mac OSX tkinter.TclError - fix inside

Hi
On some OSX systems the GUI does not start because of a tkinter error:

Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/boldigger/__main__.py", line 119, in <module>
main()
File "/usr/local/lib/python3.7/site-packages/boldigger/__main__.py", line 47, in main
event, values = window.read(timeout = 100)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 6957, in Read
results = self._read(timeout=timeout, timeout_key=timeout_key)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 6995, in _read
self._Show()
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 6831, in _Show
StartupTK(self)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 11301, in StartupTK
ConvertFlexToTK(my_flex_form)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 11203, in ConvertFlexToTK
PackFormIntoFrame(MyFlexForm, master, MyFlexForm)
File "/usr/local/lib/python3.7/site-packages/PySimpleGUI/PySimpleGUI.py", line 10577, in PackFormIntoFrame
photo = tk.PhotoImage(data=element.Data)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/tkinter/__init__.py", line 3545, in __init__
Image.__init__(self, 'photo', name, cnf, master, **kw)
File "/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/tkinter/__init__.py", line 3501, in __init__
self.tk.call(('image', 'create', imgtype, name,) + options)
_tkinter.TclError: couldn't recognize image data

This seems to be a problem with the Tcl/Tk version 8.5 that comes with the python3 version installed with 'brew'. It can be fixed by uninstalling python3 and manually reinstalling it with the latest version from https://www.python.org/downloads/.
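
To see which Tcl/Tk version a given Python installation is linked against before reinstalling, the version constants exposed by `tkinter` can be printed. A small sketch; `tk_versions` is a hypothetical helper name, and it returns None if tkinter is not available at all.

```python
def tk_versions():
    """Return (TclVersion, TkVersion) for this Python, or None if tkinter is missing."""
    try:
        import tkinter
    except ImportError:
        return None
    return tkinter.TclVersion, tkinter.TkVersion


print(tk_versions())
```

A result of 8.5 for both values would match the situation described in this report.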
