maxent-ai / zeroshot_topics

Topic Inference with Zeroshot models

License: Apache License 2.0

Python 62.61% Jupyter Notebook 37.39%
zeroshot-learning nlp machine-learning data-science transformers huggingface keybert bert keyword-extraction nli

zeroshot_topics's Introduction

zeroshot_topics

Introduction

Hand-labelled training sets are usually expensive and time-consuming to create. Some datasets call for domain expertise (e.g. medical or finance datasets). Given the cost and inflexibility of hand-labelling, it would be nice to have tools that help us get started quickly with a minimal labelled dataset: enter weak supervision.

But what if you do not have any labelled data at all? Is there still a way to label your data automatically? That's where zeroshot_topics might be useful, helping you get up and running quickly.

zeroshot_topics lets you do exactly that! It leverages the power of zeroshot classifiers, transformers and knowledge graphs to automatically suggest labels/topics from your text data. All you need to do is point it towards your data.

Algorithm

The algorithm has 4 stages:

assets/zstm.png

  1. Keyword & keyphrase extraction: This is done with the help of KeyBERT, but really any sort of keyword extractor can be used.
  2. Keyword/keyphrase expansion via knowledge graphs/taxonomy: We then expand the important keywords we discovered using a taxonomy or knowledge graph such as WordNet or ConceptNet.
  3. Trace the hypernyms for the keywords: Identify the hypernyms (the root/parent word) and use them as the pseudo-labels for the zeroshot classifier.
  4. Zeroshot classification: Use the hypernyms and the documents to assign labels via zeroshot classifiers.
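
Stages 2 and 3 can be illustrated with a minimal sketch. It is not the library's implementation: `TOY_TAXONOMY`, `hypernym_chain` and `candidate_labels` are hypothetical names, and the hand-written dictionary stands in for a real knowledge graph such as WordNet. The idea is only to show how walking up a parent relation turns keywords into candidate labels.

```python
from collections import Counter

# Hypothetical toy taxonomy mapping each term to its parent (hypernym).
# A real pipeline would query WordNet or ConceptNet instead.
TOY_TAXONOMY = {
    "president": "head of state",
    "general": "military officer",
    "head of state": "leader",
    "military officer": "leader",
    "battle": "war",
    "war": "conflict",
}


def hypernym_chain(word, taxonomy, max_depth=2):
    """Walk up the taxonomy from `word`, returning up to `max_depth` ancestors."""
    chain = []
    current = word
    while current in taxonomy and len(chain) < max_depth:
        current = taxonomy[current]
        chain.append(current)
    return chain


def candidate_labels(keywords, taxonomy, depth=1):
    """Return the ancestor found `depth` levels up for each keyword, most frequent first."""
    counts = Counter()
    for keyword in keywords:
        chain = hypernym_chain(keyword, taxonomy, max_depth=depth)
        if chain:
            # Use the highest ancestor reached as the pseudo-label.
            counts[chain[-1]] += 1
    return [label for label, _ in counts.most_common()]


keywords = ["president", "general", "battle"]  # assumed output of stage 1
labels_one_up = candidate_labels(keywords, TOY_TAXONOMY, depth=1)
labels_two_up = candidate_labels(keywords, TOY_TAXONOMY, depth=2)
print(labels_one_up)
print(labels_two_up)
```

Going one level up keeps the labels specific, while going two levels up merges "president" and "general" under a shared parent. How far to climb is a design choice: labels that are too generic tell the zeroshot classifier very little.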

Note: Currently, this tends to work well on short texts in general. In the future I intend to experiment and see how we can support long texts as well.

Installation

zeroshot_topics is distributed on PyPI as a universal wheel, is available on Linux, macOS and Windows, and supports Python 3.7+ and PyPy.

$ pip install zeroshot_topics

Usage

from zeroshot_topics import ZeroShotTopicFinder

zsmodel = ZeroShotTopicFinder()

text = """can you tell me anything else okay great tell me everything you know about George_Washington.
he was the first president he was well he I'm trying to well he fought in the Civil_War he was a general
in the Civil_War and chopped down his father's cherry tree when he was a little boy he that's it."""

zsmodel.find_topic(text, n_topic=2)

# Output - Topics: ['War', 'Head Of State']

Roadmap

Some things that I plan to add in the coming days, if there is interest in this work from the community.

  • Support custom keyword extractors.
  • Support custom knowledge graphs & taxonomies.
  • Support custom zeroshot classifiers in the pipeline.
  • Add use-case examples & improve documentation.
  • Optimise the overall library and make it faster.
  • Support long text documents.

License

zeroshot_topics is distributed under the terms of the Apache License 2.0.

zeroshot_topics's People

Contributors

anjanarita, charleneleong-ai

zeroshot_topics's Issues

Add size to lru_cache

/usr/local/lib/python3.7/dist-packages/zeroshot_topics/__init__.py in <module>()
      1 __version__ = '0.1.0'
      2 
----> 3 from .zeroshot_tm import ZeroShotTopicFinder

/usr/local/lib/python3.7/dist-packages/zeroshot_topics/zeroshot_tm.py in <module>()
      1 import attr
      2 from keybert import KeyBERT
----> 3 from .utils import load_zeroshot_model
      4 from nltk.corpus import wordnet as wn
      5 

/usr/local/lib/python3.7/dist-packages/zeroshot_topics/utils.py in <module>()
      4 
      5 @lru_cache
----> 6 def load_zeroshot_model(model_name="valhalla/distilbart-mnli-12-6"):
      7     classifier = pipeline("zero-shot-classification", model=model_name)
      8     return classifier

/usr/lib/python3.7/functools.py in lru_cache(maxsize, typed)
    488             maxsize = 0
    489     elif maxsize is not None:
--> 490         raise TypeError('Expected maxsize to be an integer or None')
    491 
    492     def decorating_function(user_function):

TypeError: Expected maxsize to be an integer or None

I assume you have to pass the maxsize parameter to lru_cache. It worked for me once I provided the parameter.
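
The traceback is consistent with `@lru_cache` being applied without parentheses. That form is only accepted from Python 3.8 onwards. On Python 3.7 the decorated function itself is passed as `maxsize`, which raises this TypeError. A minimal sketch of the workaround follows, where `load_model` and `calls` are hypothetical stand-ins for the real loader in utils.py and a dictionary replaces the actual classifier:

```python
from functools import lru_cache

calls = []  # records how often the loader body actually runs


# Calling lru_cache with explicit arguments works on Python 3.7 and later.
@lru_cache(maxsize=None)
def load_model(model_name="valhalla/distilbart-mnli-12-6"):
    # Hypothetical stand-in for the expensive pipeline(...) call.
    calls.append(model_name)
    return {"model": model_name}


first = load_model()
second = load_model()
print(first is second, len(calls))
```

The second call is served from the cache, so the loader body runs only once. `maxsize=None` makes the cache unbounded, which seems reasonable here since only a handful of model names are likely to be loaded.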

Error when I run the sample code

I get this when I try to run the sample code:

Traceback (most recent call last):
  File "zerotopics.py", line 1, in <module>
    from zeroshot_topics import ZeroShotTopicFinder
  File "/Users/scharlesworth/opt/anaconda3/envs/text_analytics/lib/python3.7/site-packages/zeroshot_topics/__init__.py", line 3, in <module>
    from .zeroshot_tm import ZeroShotTopicFinder
  File "/Users/scharlesworth/opt/anaconda3/envs/text_analytics/lib/python3.7/site-packages/zeroshot_topics/zeroshot_tm.py", line 3, in <module>
    from .utils import load_zeroshot_model
  File "/Users/scharlesworth/opt/anaconda3/envs/text_analytics/lib/python3.7/site-packages/zeroshot_topics/utils.py", line 6, in <module>
    def load_zeroshot_model(model_name="valhalla/distilbart-mnli-12-6"):
  File "/Users/scharlesworth/opt/anaconda3/envs/text_analytics/lib/python3.7/functools.py", line 490, in lru_cache
    raise TypeError('Expected maxsize to be an integer or None')
TypeError: Expected maxsize to be an integer or None

Specifics:
Python version 3.7.9

pip freeze gives (yeh this virtualenv is getting big :):

absl-py==1.0.0
aiohttp==3.8.1
aiosignal==1.2.0
alabaster==0.7.12
aniso8601==9.0.1
antlr4-python3-runtime==4.8
appnope @ file:///opt/concourse/worker/volumes/live/4f734db2-9ca8-4d8b-5b29-6ca15b4b4772/volume/appnope_1606859466979/work
async-timeout==4.0.2
asynctest==0.13.0
attrs==20.3.0
Babel==2.9.1
backcall @ file:///home/ktietz/src/ci/backcall_1611930011877/work
bertopic==0.6.0
blis @ file:///opt/concourse/worker/volumes/live/cd6a6bea-d063-4b62-4c10-fcc89b17d0ac/volume/cython-blis_1594246851083/work
boto3==1.17.86
botocore==1.20.86
brotlipy==0.7.0
cachetools==4.2.1
catalogue==2.0.6
certifi==2020.12.5
cffi @ file:///opt/concourse/worker/volumes/live/2aa8abfe-8b8d-4889-78d9-837b74c3cd64/volume/cffi_1606255119410/work
chardet @ file:///opt/concourse/worker/volumes/live/9efbf151-b45b-463d-6340-a5c399bf00b7/volume/chardet_1607706825988/work
charset-normalizer==2.0.9
click==7.1.2
colorama==0.4.4
coloredlogs==15.0.1
commonmark==0.9.1
cryptography @ file:///opt/concourse/worker/volumes/live/41c3d62a-f1f8-46ce-414a-9adaf4ea7d96/volume/cryptography_1607636752064/work
cycler==0.10.0
cymem @ file:///opt/concourse/worker/volumes/live/3e8d7428-f57d-4000-44e7-34ac8a744f13/volume/cymem_1605062299053/work
Cython==0.29.23
dataclasses==0.6
datasets==1.17.0
decorator @ file:///home/ktietz/src/ci/decorator_1611930055503/work
dill==0.3.4
docformatter==1.4
docutils==0.15.2
emoji==1.6.1
en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.2.0/en_core_web_lg-3.2.0-py3-none-any.whl
en-core-web-md @ https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl
en-core-web-trf @ https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.2.0/en_core_web_trf-3.2.0-py3-none-any.whl
et-xmlfile==1.1.0
fairscale==0.4.4
Faker==8.16.0
fasttext @ file:///Users/scharlesworth/fastText-0.9.2
filelock==3.0.12
flake8==4.0.1
flake8-bugbear==21.11.29
Flask==2.0.2
Flask-Cors==3.0.10
Flask-RESTful==0.3.9
frozenlist==1.2.0
fsspec==2021.11.1
future==0.18.2
gitdb==4.0.9
gitdb2==4.0.2
GitPython==3.1.24
google-api-core==1.26.2
google-api-python-client==2.0.2
google-auth==1.28.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
googleapis-common-protos==1.53.0
grpcio==1.43.0
hdbscan==0.8.27
httplib2==0.19.0
huggingface-hub==0.2.1
humanfriendly==10.0
hydra-core==1.1.1
idna @ file:///tmp/build/80754af9/idna_1593446292537/work
imagesize==1.3.0
importlib-metadata @ file:///tmp/build/80754af9/importlib-metadata_1602276842396/work
importlib-resources==5.4.0
iniconfig==1.1.1
iopath==0.1.9
ipykernel @ file:///opt/concourse/worker/volumes/live/73e8766c-12c3-4f76-62a6-3dea9a7da5b7/volume/ipykernel_1596206701501/work/dist/ipykernel-5.3.4-py3-none-any.whl
ipython @ file:///opt/concourse/worker/volumes/live/ac685347-76d6-4904-4b88-886c6a434f22/volume/ipython_1614616430264/work
ipython-genutils @ file:///tmp/build/80754af9/ipython_genutils_1606773439826/work
itsdangerous==2.0.1
jedi @ file:///opt/concourse/worker/volumes/live/5006b7b5-a924-4788-6cfe-ae05d8be8830/volume/jedi_1606932947370/work
Jinja2==3.0.1
jmespath==0.10.0
joblib==1.0.1
jsonlines==3.0.0
jsonschema==3.0.2
jupyter-client @ file:///tmp/build/80754af9/jupyter_client_1601311786391/work
jupyter-core @ file:///opt/concourse/worker/volumes/live/a699b83f-e941-4170-5136-bf87e3f37756/volume/jupyter_core_1612213304212/work
keybert==0.5.0
kiwisolver==1.3.1
langcodes==3.3.0
llvmlite==0.36.0
loguru==0.5.3
Markdown==3.3.4
markdown-it-py==0.5.8
MarkupSafe==2.0.1
matplotlib==3.4.0
mccabe==0.6.1
mkl-fft==1.2.0
mkl-random==1.1.1
mkl-service==2.3.0
mock==4.0.3
multidict==5.2.0
multiprocess==0.70.12.2
murmurhash @ file:///opt/concourse/worker/volumes/live/9a0582f9-9097-4dab-6d7a-fcf62b4968ae/volume/murmurhash_1607456116622/work
myst-parser==0.12.10
nltk==3.6.5
numba==0.53.1
numpy==1.20.2
oauthlib==3.1.1
omegaconf==2.1.1
openai==0.6.3
openpyxl==3.0.9
packaging==20.9
pandas==1.2.1
parlai==1.5.1
parquet==1.3.1
parso==0.7.0
pathy==0.6.1
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
pickleshare @ file:///tmp/build/80754af9/pickleshare_1606932040724/work
Pillow==8.2.0
plac @ file:///opt/concourse/worker/volumes/live/a94b6881-2d18-4055-5a3c-f24036f05ef6/volume/plac_1594259982880/work
pluggy==1.0.0
ply==3.11
portalocker==2.3.2
praw==7.1.0
prawcore==1.5.0
preshed @ file:///opt/concourse/worker/volumes/live/952fa955-acc7-4aa0-6766-86f802ea8ef1/volume/preshed_1608233410312/work
prompt-toolkit @ file:///tmp/build/80754af9/prompt-toolkit_1616415428029/work
protobuf==3.15.6
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
py==1.11.0
py-gfm==1.0.2
py-rouge==1.1
py4j==0.10.7
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.6.1
pycodestyle==2.8.0
pycparser @ file:///tmp/build/80754af9/pycparser_1594388511720/work
pydantic==1.8.2
pyee==8.2.2
pyflakes==2.4.0
Pygments @ file:///tmp/build/80754af9/pygments_1615143339740/work
PyJWT==2.3.0
pynndescent==0.5.2
pyodbc==4.0.32
pyOpenSSL @ file:///tmp/build/80754af9/pyopenssl_1608057966937/work
pyparsing==2.4.7
pyrsistent @ file:///opt/concourse/worker/volumes/live/656e0c1b-ef87-4251-4a51-1290b2351993/volume/pyrsistent_1600141745371/work
PySocks @ file:///opt/concourse/worker/volumes/live/ef943889-94fc-4539-798d-461c60b77804/volume/pysocks_1605305801690/work
pytest==6.2.5
pytest-datadir==1.3.1
pytest-regressions==2.2.0
python-dateutil @ file:///home/ktietz/src/ci/python-dateutil_1611928101742/work
python-slugify==5.0.2
pytorch-transformers==1.2.0
pytz==2020.5
PyYAML==6.0
pyzmq==20.0.0
regex==2021.11.10
requests @ file:///tmp/build/80754af9/requests_1608241421344/work
requests-mock==1.9.3
requests-oauthlib==1.3.0
requests-toolbelt==0.9.1
rich==10.16.2
rsa==4.7.2
s3transfer==0.4.2
sacremoses==0.0.44
scikit-learn==0.24.1
scipy==1.6.2
seaborn==0.11.1
sentence-transformers==1.0.4
sentencepiece==0.1.91
seqeval==0.0.5
sh==1.14.2
six @ file:///opt/concourse/worker/volumes/live/f983ba11-c9fe-4dff-7ce7-d89b95b09771/volume/six_1605205318156/work
sklearn==0.0
slack-bolt==1.11.1
slack-sdk==3.13.0
slackclient==2.9.3
slackeventsapi==3.0.1
smart-open==5.2.1
smmap==5.0.0
snowballstemmer==2.2.0
spacy==3.2.0
spacy-alignments==0.8.4
spacy-legacy==3.0.8
spacy-loggers==1.0.1
spacy-sentence-bert==0.1.2
spacy-transformers==1.1.2
spark-nlp==3.0.2
Sphinx==2.2.2
sphinx-autodoc-typehints==1.10.3
sphinx-rtd-theme==1.0.0
sphinxcontrib-applehelp==1.0.2
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
srsly==2.4.2
subword-nmt==0.3.8
tensorboard==2.7.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorboardX==2.4.1
text-unidecode==1.3
thinc==8.0.13
threadpoolctl==2.1.0
thriftpy2==0.4.14
tokenizers==0.10.2
toml==0.10.2
torch==1.10.1
torchtext==0.11.1
tornado @ file:///opt/concourse/worker/volumes/live/d531d395-893c-4ca1-6a5f-717b318eb08c/volume/tornado_1606942307627/work
tqdm==4.62.3
traitlets @ file:///home/ktietz/src/ci/traitlets_1611929699868/work
transformers==4.11.0
typer==0.4.0
typing-extensions==3.7.4.3
umap-learn==0.5.1
Unidecode==1.3.2
untokenize==0.1.1
update-checker==0.18.0
uritemplate==3.0.1
urllib3==1.26.7
wasabi==0.8.2
wcwidth @ file:///tmp/build/80754af9/wcwidth_1593447189090/work
webexteamsbot==0.1.4.2
webexteamssdk==1.6
websocket-client==0.57.0
websocket-server==0.6.4
Werkzeug==2.0.1
xlrd==2.0.1
xxhash==2.0.2
yarl==1.7.2
zeroshot-topics==0.1.0
zipp @ file:///tmp/build/80754af9/zipp_1604001098328/work
