cardiffnlp / tweetnlp

Stars: 280 · Watchers: 10 · Forks: 22 · Size: 473 KB

TweetNLP for all the NLP enthusiasts working on Twitter! The Python library tweetnlp provides a collection of useful tools to analyze and understand tweets, such as sentiment analysis, emoji prediction, and named entity recognition, powered by state-of-the-art language models specialised for Twitter.

Home Page: https://tweetnlp.org/

License: MIT License

Language: Python 100.00%

Topics: computational-social-science, language-model, natural-language-processing, python, sentiment-analysis, sns, twitter

tweetnlp's People

Contributors: antypasd, asahi417, ashyam95, pedrada88, rmotsar

tweetnlp's Issues

Proposal: Add Support for Apple Silicon Chips Using MPS for GPU Acceleration

I'm writing to propose an enhancement that would significantly improve the efficiency of the tweetnlp library on machines with Apple Silicon chips, by leveraging the Metal Performance Shaders (MPS) backend for GPU acceleration.

The proposal involves a change at line 76 of tweetnlp/text_classification/model.py, where the processing device is selected.
Note: I am focusing on classification for now because it's the feature I use, but I think this change could be beneficial in other areas too.

Currently, the library checks for CUDA availability to utilize GPU acceleration:

# GPU setup
self.device = 'cuda' if torch.cuda.device_count() > 0 else 'cpu'

However, CUDA is not available on Apple's M1 machines, which use Apple's integrated GPU instead. A workaround is the MPS (Metal Performance Shaders) backend, which is supported on these machines. I propose adding an additional check for MPS to the device selection:

# GPU setup: prefer CUDA, then MPS, then fall back to CPU
if torch.cuda.is_available() and torch.cuda.device_count() > 0:
    self.device = torch.device('cuda')
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    self.device = torch.device('mps')
else:
    self.device = torch.device('cpu')

This provides an optimized computational path for users running tweetnlp on Apple Silicon machines, and falls back to the CPU when neither CUDA nor MPS is available.

Quick benchmark:
On a first-generation M1, this modification cut the classification task time by more than half, a significant improvement in efficiency.

I believe this enhancement would benefit many users and improve the experience of using tweetnlp on the latest generation of Apple machines.

Sentiment Classification for Multiple Sentences

I understand how to classify a single sentence's sentiment with tweetnlp.load('sentiment').
However, how can I classify multiple sentences and get their separate sentiments with fast processing?
I know a for loop works for this case, but it is very slow.
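
A minimal sketch of batched prediction, assuming model.predict accepts a list of texts and a batch_size argument (as used in the load_model question further down this page); the example texts are illustrative:

import tweetnlp

# Load the model once, outside any loop.
model = tweetnlp.load_model('sentiment')

texts = [
    'What a great day!',
    'This is the worst.',
    "It's okay, I guess.",
]

# Passing the whole list lets the model batch inputs on the device,
# which is much faster than calling predict() once per sentence.
results = model.predict(texts, batch_size=32)
print(results)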

pip install tweetnlp -> UnicodeDecodeError: 'charmap'

Trying to pip install, I get this error:

Windows 10 64 - Spanish
Python 3.8

pip install tweetnlp
Collecting tweetnlp
Using cached tweetnlp-0.0.8.tar.gz (23 kB)
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [8 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "XXXXXX\AppData\Local\Temp\pip-install-z3zj14sa\tweetnlp_eddf59379cc14f689008a08e6b2f0cb1\setup.py", line 4, in
readme = f.read()
File "XXXXX\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3458: character maps to
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Edit:

Fixed the issue (under the conditions described above) by changing setup.py, line 3, from

with open('README.md', 'r') as f:

to

with open('README.md', 'r', encoding='utf-8') as f:

Now the installation works.

Use CPU instead of unsupported GPU

I have an older NVIDIA GPU that is unfortunately not supported by PyTorch. As such, I was unable to use tweetnlp, because it tries to use the GPU whenever torch.cuda.is_available() returns True.

One way I was able to work around this was by modifying model.py as follows:

        # GPU setup (https://github.com/cardiffnlp/tweetnlp/issues/15)
        if use_gpu: 
            if torch.cuda.is_available() and torch.cuda.device_count() > 0:
                self.device = torch.device('cuda')
                print('Note: Using GPU') 
            elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available() and torch.backends.mps.is_built():
                self.device = torch.device("mps")
                print('Note: Using MPS') 
            else:
                self.device = torch.device('cpu')
                print('Note: Using CPU') 
        else:
            self.device = torch.device('cpu')
            print('Note: Using CPU') 

Sharing it, just in case it helps someone else encountering a similar situation.


OSError in loading model

Hi, thanks for the great library!

I have been trying to install tweetnlp in a Python 3.8 virtual environment on Ubuntu 20.04 64-bit / Linux 5.4.0-139-generic.

Here's the error:

Traceback (most recent call last):
  File "tryloading.py", line 3, in <module>
    model = tweetnlp.load_model('sentiment')
  File "/home/ebergam/mlenv/lib/python3.8/site-packages/tweetnlp/loader.py", line 45, in load_model
    model_class = model_loader(model_name=model_name, max_length=max_length, *args, **kwargs)
  File "/home/ebergam/mlenv/lib/python3.8/site-packages/tweetnlp/text_classification/model.py", line 163, in __init__
    super().__init__(model_name, max_length=max_length, use_auth_token=use_auth_token)
  File "/home/ebergam/mlenv/lib/python3.8/site-packages/tweetnlp/text_classification/model.py", line 70, in __init__
    self.config, self.tokenizer, self.model = load_model(
  File "/home/ebergam/mlenv/lib/python3.8/site-packages/tweetnlp/util.py", line 81, in load_model
    config = AutoConfig.from_pretrained(model, **config_argument)
  File "/home/ebergam/mlenv/lib/python3.8/site-packages/transformers/models/auto/configuration_auto.py", line 725, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/ebergam/mlenv/lib/python3.8/site-packages/transformers/configuration_utils.py", line 561, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/ebergam/mlenv/lib/python3.8/site-packages/transformers/configuration_utils.py", line 648, in _get_config_dict
    raise EnvironmentError(
OSError: cardiffnlp/twitter-roberta-base-sentiment-latest does not appear to have a file named config.json.

Installing without a virtual environment results in the same error.
Could this be due to compatibility issues with huggingface-hub?

Train on my dataset

I want to use the cardiffnlp/tweet-topic-21-multi model from Hugging Face, but I want to add more labels/categories to the output. My understanding is that I would need to train the model myself. Any guidance on how to do so?
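
A minimal sketch of one way to do this with the Hugging Face transformers API directly (this is not tweetnlp's own training interface, and num_new_labels is a hypothetical value): loading the checkpoint with a new num_labels re-initializes the classification head, which you then fine-tune on your labelled data:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'cardiffnlp/tweet-topic-21-multi'
num_new_labels = 25  # hypothetical size of your extended label set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_new_labels,
    problem_type='multi_label_classification',
    ignore_mismatched_sizes=True,  # the old classification head has a different shape
)
# From here, fine-tune on your labelled data, e.g. with transformers.Trainer.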

AttributeError: module 'enum' has no attribute 'IntFlag'

Traceback (most recent call last):
  File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.6/site-packages/pip/__main__.py", line 27, in <module>
    "ignore", category=DeprecationWarning, module=".*packaging\.version"
  File "/usr/lib64/python3.6/warnings.py", line 131, in filterwarnings
    import re
  File "/usr/lib64/python3.6/re.py", line 142, in <module>
    class RegexFlag(enum.IntFlag):
AttributeError: module 'enum' has no attribute 'IntFlag'

libcublas.so.*[0-9] not found in the system path on CPU-only machine

I'm trying to make use of this package on a fly.io machine. The image is built with Docker:

# Using the python:3.11-slim image
FROM python:3.11-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 PYTHONUNBUFFERED=1 PIP_NO_CACHE_DIR=1 PIPENV_VENV_IN_PROJECT=1 CUDA_VISIBLE_DEVICES=""

# Update and install dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc libffi-dev libssl-dev && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Install pip and pipenv
RUN pip install --upgrade pip && \
    pip install pipenv

# Copy Pipfile and Pipfile.lock
COPY Pipfile Pipfile.lock ./

# Install dependencies using pipenv
RUN pipenv install --deploy --ignore-pipfile

# Copy the rest of the application files
COPY . .

ENTRYPOINT ["pipenv", "run"]

The image builds successfully, but when I run it I get an error from Torch:

   INFO Preparing to run: `pipenv run python worker.py` as root
   INFO [fly api proxy] listening at /.fly/api
  2023/09/13 17:25:40 listening on [fdaa:1:291c:a7b:173:c2e5:763f:2]:22 (DNS: [fdaa::3]:53)
  Loading .env environment variables...
  Traceback (most recent call last):
    File "/app/.venv/lib/python3.11/site-packages/torch/__init__.py", line 168, in _load_global_deps
      ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
    File "/usr/local/lib/python3.11/ctypes/__init__.py", line 376, in __init__
      self._handle = _dlopen(self._name, mode)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
  OSError: libcurand.so.10: cannot open shared object file: No such file or directory
  During handling of the above exception, another exception occurred:
  Traceback (most recent call last):
    File "/app/worker.py", line 5, in <module>
      from workers.classification import process as classify
    File "/app/workers/classification.py", line 2, in <module>
      import tweetnlp
    File "/app/.venv/lib/python3.11/site-packages/tweetnlp/__init__.py", line 2, in <module>
      from .loader import load_model, load_dataset
    File "/app/.venv/lib/python3.11/site-packages/tweetnlp/loader.py", line 3, in <module>
      from .text_classification.model import Sentiment, Offensive, Irony, Hate, Emotion, Emoji,\
    File "/app/.venv/lib/python3.11/site-packages/tweetnlp/text_classification/model.py", line 7, in <module>
      import torch
    File "/app/.venv/lib/python3.11/site-packages/torch/__init__.py", line 228, in <module>
      _load_global_deps()
    File "/app/.venv/lib/python3.11/site-packages/torch/__init__.py", line 189, in _load_global_deps
      _preload_cuda_deps(lib_folder, lib_name)
    File "/app/.venv/lib/python3.11/site-packages/torch/__init__.py", line 154, in _preload_cuda_deps
      raise ValueError(f"{lib_name} not found in the system path {sys.path}")
  ValueError: libcublas.so.*[0-9] not found in the system path ['/app', '/usr/local/lib/python311.zip', '/usr/local/lib/python3.11', '/usr/local/lib/python3.11/lib-dynload', '/app/.venv/lib/python3.11/site-packages']

Looking at #18, CPU-only use appears to be supported, and the machine won't have any GPUs available, so I'm not sure why it's not following the CPU init path.
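
For what it's worth, a common workaround on CPU-only machines (an assumption here, untested on fly.io) is to install the CPU-only PyTorch wheels, e.g. pip install torch --index-url https://download.pytorch.org/whl/cpu, so the CUDA shared libraries are never pulled in at import time.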

I don't know why I can't load model in tweetnlp

Is it that the version changed?

AttributeError                            Traceback (most recent call last)
Cell In[4], line 10
      7 # RoBERTa model Sentiment analysis
      8 # Loading pre-trained RoBERTa model
      9 raw_data_text=[text for text in raw_data['text'][0:5]]
---> 10 model = tweetnlp.load_model('sentiment') 
     11 sentence=model.predict(raw_data_text,batch_size=4096,return_probability=True)
     12 roberta_score=[]

AttributeError: module 'tweetnlp' has no attribute 'load_model'
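
Note: the batching question above calls tweetnlp.load('sentiment'), while newer examples use tweetnlp.load_model, so the API has changed across releases. This error likely means the installed version predates load_model; upgrading with pip install -U tweetnlp should help. (This is an inference from the API difference visible across these issues.)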

freeze model

Is it possible to freeze specific layers when fine-tuning the model?
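
A minimal PyTorch sketch, assuming the loaded wrapper exposes the underlying transformers model as .model (as the tracebacks elsewhere on this page suggest): setting requires_grad to False excludes those parameters from fine-tuning:

import tweetnlp

wrapper = tweetnlp.load_model('sentiment')
hf_model = wrapper.model  # assumption: the underlying transformers model

# Freeze everything except the classification head.
for name, param in hf_model.named_parameters():
    if not name.startswith('classifier'):
        param.requires_grad = False

# Sanity check: list the parameters that remain trainable.
print([n for n, p in hf_model.named_parameters() if p.requires_grad])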

String preprocessor account regex catches only fully-capitalized/numeric account handles

In tweetnlp/tweetnlp/util.py, the get_preprocessor function, when the preprocessor type is unknown, replaces user handles like so:

text = re.sub(r"@[A-Z,0-9]+", "@user", text)

but this only catches handles made of capital letters and digits (the comma inside the character class also matches a literal comma, which is probably unintended). A possible fix is sketched below.
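
A sketch of the fix, assuming handles follow Twitter's username rules (letters of either case, digits, and underscores):

import re

def replace_handles(text: str) -> str:
    # Match letters of either case, digits, and underscores.
    return re.sub(r'@[A-Za-z0-9_]+', '@user', text)

print(replace_handles('thanks @JohnDoe_99 and @alice!'))
# -> 'thanks @user and @user!'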

I have never submitted a pull request before, so if you want to let me try it out I can submit one - if of course you agree with the problem!

"Some weights of the model checkpoint were not used"

Hello,
When I load the sentence-embedding model with tweetnlp.load_model('sentence_embedding'), the following messages pop up.

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

How should I fix it? I appreciate any help. Thanks!
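
For what it's worth, the warning's own wording indicates this case is expected: the checkpoint was trained for another task, and the sentence-embedding use only needs the encoder, so the unused pooler weights are harmless and the message can be ignored.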
