This repository hosts a script which extracts frequencies of words from a given word list in text segments. As input a JSON object is passed over an API which saves the text segment and their IDs and optionally a URL to the word list on which basis the freqeuncies are extracted. It returns a JSON in which for each word the frequency per text segment is saved.
It is required that the keys in the Input and the WordList are as below:
- Input Format:
{"segments": [{"segment_id":"xyz", "text":"asd ..."}, ...], "language":"de", "word_list":{"url":"http://..."}}
- WordList Format:
[{"word":abc, "wikidata_ID":"Q12345","category":"plant"}]
- Output Format:
[{"word":"xyz", "wikidata_ID":"Q12345","category":"plant", "overall_frequency":1234, "segment_frequencies":{segment_id:1234,...}}]
This scripts requires spacy
and FastAPI
to be installed. Additionally the spacy models
for English and German must be downloaded: de_core_news_sm
, en_core_web_sm
The script was tested using uvicorn.
Unittests are provided in test/
and can be executed with python -m unittest test/test_extractor.py
The following curl command can be used to post a JSON file to the extractor service:
curl -X POST -H "Content-Type: application/json" 127.0.0.1:8000/extractor --data-binary @test/test.json
. The items in the response should match test/result.json
.