Hi I would like to try out pubtator and was running execute.sh and i

error processing bioconcepts2pubtator_offsets.gz about pubtator HOT 7 OPEN

greenelab commented on July 28, 2024

error processing bioconcepts2pubtator_offsets.gz

from pubtator.

Comments (7)

danich1 commented on July 28, 2024

Greetings,

It looks like there are one or more stray NULL characters within the bioconcepts2pubtator_offsets.gz file, which cause the csv module to throw an error. I am not sure whether this is a version issue or a file reader issue, but a quick fix is to replace the following line of code in the pubtator_to_xml.py script:

annts = csv.DictReader(lines[2:], fieldnames=['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id'], delimiter="\t", quoting=csv.QUOTE_NONE)

with

fixed_lines = [str_with_null.replace('\x00', '') for str_with_null in lines[2:]]
annts = csv.DictReader(fixed_lines, fieldnames=['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id'], delimiter="\t", quoting=csv.QUOTE_NONE)

This fix assumes that the null byte comes at the end of the line. If the error occurs again, I will look into other possible solutions.
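The quick fix above can also be sketched as a small generator, so the cleaned lines are produced lazily instead of being copied into a second list. This is only an illustration: the helper names `strip_nulls` and `read_annotations` and the sample rows are hypothetical and do not come from pubtator_to_xml.py. Note that `str.replace` removes NUL characters wherever they occur in a line, not only at the end.

```python
import csv

# Column names taken from the DictReader call quoted above.
FIELDNAMES = ['pubmed_id', 'start', 'end', 'term', 'type', 'tag_id']


def strip_nulls(lines):
    """Yield each line with every NUL character removed (hypothetical helper)."""
    for line in lines:
        yield line.replace('\x00', '')


def read_annotations(lines):
    """Build the same DictReader as in the fix, but over NUL-free lines."""
    return csv.DictReader(
        strip_nulls(lines),
        fieldnames=FIELDNAMES,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )


# Made-up annotation lines: one NUL at the end of a line, one in the middle.
sample = [
    "123\t0\t5\tBRCA1\tGene\t672\x00\n",
    "123\t10\t\x0016\tcancer\tDisease\tD009369\n",
]
rows = list(read_annotations(sample))
```

In the script itself, `read_annotations(lines[2:])` would stand in for the original `csv.DictReader(lines[2:], ...)` call, assuming `lines` is the same list of strings as in the snippet above.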

commented on July 28, 2024

Great, it works! Is there a way to use pubtator to parse a text into its tags and their frequencies? Thank you.

danich1 commented on July 28, 2024

Take a look at the extract_tags.py script. This script is designed to extract tags from the pubtator XML file. The command to use is:

python scripts/extract_tags.py \
  --input data/pubtator-docs.xml.xz \
  --output data/pubtator-tags.tsv.xz

Once the process has finished you can easily count the frequency of tags.
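Counting the frequencies afterwards could look roughly like the sketch below. It assumes, without having checked the real output, that pubtator-tags.tsv.xz is a tab-separated file with a header row; the function name `count_tags` and the `column` argument are hypothetical, so inspect the actual header before choosing which column to count.

```python
import csv
import lzma
from collections import Counter


def count_tags(path, column):
    """Count how often each value of `column` appears in an xz-compressed TSV.

    Assumption: the file has a header row that contains `column`.
    """
    counts = Counter()
    with lzma.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            counts[row[column]] += 1
    return counts
```

For example, `count_tags("data/pubtator-tags.tsv.xz", "type").most_common(10)` would list the ten most frequent values, provided the file really has a column called `type`.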

commented on July 28, 2024

Hi David, thank you. I saw that script. I was wondering if there's a way to process non-XML text or a string?

danich1 commented on July 28, 2024

You mean the tag extraction part, correct? Currently, we don't have a pure text parser implemented. We were concerned only with extracting tags from pubtator; however, that doesn't rule out a future extension.

commented on July 28, 2024

Would you be able to point me to the appropriate functions to look at? Basically, I want to perform the annotation function in pubtator: i) identify bio-entities and ii) identify relationships between entities. Thank you.

danich1 commented on July 28, 2024

It should be pretty straightforward if you look at the function within the extract_tags.py script. All it does is open the compressed file and then have the etree package do the parsing. In your case you won't need the etree library, as it is specifically designed to parse XML-like tags. Instead, you will just handle the raw text and parse it according to your needs.
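For reference, the "open the compressed file, then let etree parse it" pattern might look roughly like this. It is a sketch under stated assumptions, not the code in extract_tags.py: it uses the standard library's `xml.etree.ElementTree` rather than whichever etree package the script imports, and the element names (`document`, `id`, `annotation`, `infon`, `text`) are assumed for illustration and may not match the real file.

```python
import lzma
import xml.etree.ElementTree as ET


def iter_annotations(path):
    """Yield (document id, tag type, tagged text) from an xz-compressed XML file.

    Assumption: each <document> holds an <id> plus nested <annotation> elements,
    and each annotation has <infon key="type"> and <text> children.
    """
    with lzma.open(path, "rb") as handle:
        # Stream the file so a large archive is not loaded into memory at once.
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag != "document":
                continue
            doc_id = elem.findtext("id")
            for annotation in elem.iter("annotation"):
                infons = {i.get("key"): i.text for i in annotation.findall("infon")}
                yield doc_id, infons.get("type"), annotation.findtext("text")
            # Drop the finished document's children to keep memory use low.
            elem.clear()
```

For plain text there is nothing equivalent to stream through: the tags in the XML are already present in the pubtator data, so raw text would need its own way of locating entities.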

ii) identify relationships between entities.

I want to clarify that this project is only designed to identify tags. It doesn't have the capability to detect relationships between entities. You would have to look elsewhere for that kind of detection or manually inspect the results.
