
A deep learning system for demographic inference (gender, age, and individual-vs-organization status) trained on a massive Twitter dataset using profile images, screen names, names, and biographies

Home Page: http://www.euagendas.org

License: GNU Affero General Public License v3.0

Python 100.00%

m3inference's Introduction

M3-Inference

This is a PyTorch implementation of the M3 (Multimodal, Multilingual, and Multi-attribute) system described in the WebConf (WWW) 2019 paper Demographic Inference and Representative Population Estimates from Multilingual Social Media Data.


About

M3 is a deep learning system for demographic inference that was trained on a massive Twitter dataset. It features three major attributes:

  • Multimodal

    • M3 takes both vision and text inputs. In particular, the input may contain a profile image, a name (e.g., a natural-language first and last name), a user name (e.g., the Twitter screen_name), and a short self-descriptive text (e.g., a Twitter biography).
  • Multilingual

    • M3 operates in 32 major languages spoken in Europe, but note that these are not all "European" languages (e.g., Arabic is supported). They are ['en', 'cs', 'fr', 'nl', 'ar', 'ro', 'bs', 'da', 'it', 'pt', 'no', 'es', 'hr', 'tr', 'de', 'fi', 'el', 'ru', 'bg', 'hu', 'sk', 'et', 'pl', 'lv', 'sl', 'lt', 'ga', 'eu', 'mt', 'cy', 'rm', 'is', 'un'] in ISO 639-1 two-letter codes (un stands for languages that are not in the list). A list with the full names of the languages is on the wiki.
  • Multi-attribute

    • Thanks to multi-task learning, the model can predict three demographic attributes (gender, age, and human-vs-organization status) at the same time.

Install

TL;DR

pip install m3inference

  • If there is an error with the installation of torch, you may install it with conda (see here); an example command is shown after this list. Alternatively, you could create a conda environment - see the instructions below.
  • Please ensure you have Python 3.6.6 or higher installed.
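If pip cannot install torch on your platform, a typical conda command looks like the one below. This is only an example; check the official PyTorch installation page for the exact command matching your OS, Python version, and CUDA setup.

conda install pytorch torchvision -c pytorch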

Manually Install

With pip

You must have Python>=3.6.6 and pip ready to use. Then you can:

  1. Install dependency packages: pip install -r requirements.txt
  2. Install the package: python setup.py install

As a conda environment

  1. Simply run conda-env create -f env_conda.yml; you should then have an "m3env" environment available, which you can enter with conda activate m3env. Run everything else from within it.
  2. Install the package: python setup.py install

How to use

With M3

M3 takes as input a jsonl file containing a list of json (dict) objects (or a Python object containing the data itself) and outputs predictions for the three attributes.

Demo with test dir:

  1. Clone this package (git clone https://github.com/zijwang/m3inference.git) and follow Manually Install to install the package.

  2. Preprocess the images to resize them to the correct shape. To do this, from the same (root) dir, run

    python scripts/preprocess.py --source_dir test/pic/ --output_dir test/pic_resized/ --jsonl_path test/data.jsonl --jsonl_outpath test/data_resized.jsonl --verbose
    

    You may also run python scripts/preprocess.py --help to see detailed usage. Further, see the FAQs for more information on images.

  3. In Python, run:

from m3inference import M3Inference
import pprint
m3 = M3Inference() # see docstring for details
pred = m3.infer('./test/data_resized.jsonl') # also see docstring for details
pprint.pprint(pred)

You should see results like the following:

OrderedDict([('720389270335135745',
              {'age': {'19-29': 0.1546,
                       '30-39': 0.114,
                       '<=18': 0.0481,
                       '>=40': 0.6833},
               'gender': {'female': 0.0066, 'male': 0.9934},
               'org': {'is-org': 0.7508, 'non-org': 0.2492}}),
             ('21447363',
              {'age': {'19-29': 0.0157,
                       '30-39': 0.9837,
                       '<=18': 0.0004,
                       '>=40': 0.0002},
               'gender': {'female': 0.9866, 'male': 0.0134},
               'org': {'is-org': 0.0002, 'non-org': 0.9998}}),
    ...
  ...

Each entry of the input file (./test/data.jsonl) should have the following keys: id, name, screen_name, description, lang, img_path (an illustrative entry is shown after the list below).

  • The first four keys can be extracted directly from the Twitter JSON entry.
  • For lang, even if the official Twitter JSON entry contains this field, we recommend using our cld2 wrapper method (from m3inference import get_lang) to detect the language from either the user's biography/description or the user's tweets. You can also hard-code the language if you know the ground truth from other sources.
  • Images should be downloaded from Twitter as 400x400 pixel images and resized to 224x224 pixels using the preprocess code above.
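For illustration only, a single input line could look like the following (the user and all values here are made up, not real data):

{"id": "1234567890", "name": "Jane Doe", "screen_name": "jdoe_example", "description": "PhD student. Tweets about data, maps, and coffee.", "lang": "en", "img_path": "test/pic_resized/1234567890_224x224.jpg"}

A sketch of filling lang with the cld2 wrapper, assuming get_lang takes a text string and returns an ISO 639-1 code ('un' for unknown):

from m3inference import get_lang
lang = get_lang("PhD student. Tweets about data, maps, and coffee.")  # expected to return something like 'en'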

The output is a dict in which the ids are the keys and the predictions are the nested values. Each value represents the probability of the corresponding category (in [0, 1]).

For other model settings (e.g., output format, GPU setting, batch_size, etc.), please use the file test/data.jsonl as a sample input file and see the docstrings of the M3Inference initializer and the infer method for details.
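As a rough sketch of tuning those settings (assuming the batch_size and num_workers keyword arguments described in the docstrings; the values below are arbitrary, and writing the result to JSON is just one way to persist the output):

from m3inference import M3Inference
import json

m3 = M3Inference()                       # defaults to the full (image + text) model
pred = m3.infer('test/data_resized.jsonl', batch_size=16, num_workers=4)

# persist the OrderedDict of predictions for later analysis
with open('predictions.json', 'w') as f:
    json.dump(pred, f, indent=2)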

With M3 Twitter Wrapper

Existing JSON Twitter data

If you have a Twitter JSON object representing a user but do not have images ready, you can use our M3Twitter class to:

  • Download and resize the images
  • Add a detected language using CLD2 over the biography text
  • Transform the JSON into the input structure required for M3.
from m3inference import M3Twitter
import pprint

m3twitter=M3Twitter(cache_dir="twitter_cache") #Change the cache_dir parameter to control where profile images are downloaded
m3twitter.transform_jsonl(input_file="test/twitter_cache/example_tweets.jsonl",output_file="test/twitter_cache/m3_input.jsonl")

pprint.pprint(m3twitter.infer("test/twitter_cache/m3_input.jsonl")) #Same method as M3Inference.infer(...)

If you already have images locally, please include the img_path_key parameter and set it to the key in your JSON object containing the local path to the image. Similarly, if you have already detected languages, you can use the lang_key parameter. An example is given in test/test_transform_jsonl.py.
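For instance, if each of your JSON objects already stores a local image path under a key called "img" and a detected language under "language" (both key names are hypothetical here, as are the file names), the call might look like:

m3twitter.transform_jsonl(input_file="my_users.jsonl",
                          output_file="m3_input.jsonl",
                          img_path_key="img",
                          lang_key="language")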

Nothing but a screen_name or numeric id

You can also run the Twitter wrapper directly for a Twitter screen_name or numeric id.

  • Please download the "scripts" folder from this repository.
  • To run these examples, you need Twitter API credentials. Please create a Twitter app at https://developer.twitter.com/en/apps . Once you have an app, copy scripts/auth_example.txt to auth.txt and insert the API key, API secret, access token, and access token secret into this file.

Then you can run the following commands:

#If you have a screen_name, use
$ python m3twitter.py --screen-name=computermacgyve --auth auth.txt --skip-cache

#If you have a numeric id, use
$ python m3twitter.py --id=19854920 --auth auth.txt --skip-cache

The --skip-cache option ensures fresh results are retrieved rather than served from the cache. This is useful for debugging but usually undesirable in a real-world setting, so remove it as needed.

FAQs

What if I just have a Twitter screen name or id?

You can use the M3Twitter class to get all the needed profile information (and image) from the Twitter website. Please note this function should only be used for a small number of screen_names or numeric ids. If you have a large list, please use the Twitter API to get the required information (apart from the profile photo, which can be downloaded separately using the .transform_jsonl(...) method described above).

import pprint
from m3inference import M3Twitter
m3twitter=M3Twitter()

# initialize twitter api
m3twitter.twitter_init(api_key=...,api_secret=...,access_token=...,access_secret=...)
# alternatively, you may do
m3twitter.twitter_init_from_file('auth.txt')

pprint.pprint(m3twitter.infer_id("2631881902"))

The .infer_screen_name(...) method does the same for a Twitter screen name. All results are stored/cached in "~/m3/cache/". This directory can be changed in the M3Twitter constructor and you can skip/update the cache for a single request by setting skip_cache=True on the .infer_id(...) or .infer_screen_name(...) method.
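Putting these options together, a small sketch (the cache directory name and screen name are arbitrary examples):

from m3inference import M3Twitter
import pprint

m3twitter = M3Twitter(cache_dir="my_m3_cache")   # override the default ~/m3/cache/
m3twitter.twitter_init_from_file('auth.txt')
pprint.pprint(m3twitter.infer_screen_name("computermacgyve", skip_cache=True))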

You can also run these examples directly from the terminal to try things out:

python scripts/m3twitter.py --screen-name=barackobama --auth auth.txt

How should I get the images?

If you have nothing but a screen name or numeric id, you can use the M3Twitter.infer_screen_name(...) or M3Twitter.infer_id(...) methods. Please note, however, that these methods directly access the Twitter website, not the API, and are therefore suitable only for small lists. With a large list of screen_names/ids, please use the Twitter API to get user information.

Once you have Twitter JSON, you can use the M3Twitter.transform_jsonl(...) to download images, run language detection, and transform the data to the M3 input format.

What if I cannot have image data?

The package also provides a standalone text-based model. You can set use_full_model=False when initializing the M3Inference object (i.e., m3 = M3Inference(use_full_model=False)). You then do not need to provide the img_path field in the input json file.

Warning: the M3 model is optimized for the case where both image and text inputs are available, so you may experience lower performance with the text-based model. We recommend using image data whenever possible to get the most accurate predictions.
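A minimal sketch of the text-only setup (entries may simply omit the img_path field):

from m3inference import M3Inference
import pprint

m3 = M3Inference(use_full_model=False)   # text-only model; no images required
pred = m3.infer('./test/data.jsonl')     # entries without img_path are accepted here
pprint.pprint(pred)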

Citation

Please cite our WWW 2019 paper if you use this package in your project.

@inproceedings{wang2019demographic,
  title={Demographic inference and representative population estimates from multilingual social media data},
  author={Wang, Zijian and Hale, Scott and Adelani, David Ifeoluwa and Grabowicz, Przemyslaw and Hartman, Timo and Fl{\"o}ck, Fabian and Jurgens, David},
  booktitle={The World Wide Web Conference},
  pages={2056--2067},
  year={2019},
  organization={ACM}
}

More Questions

We use issues on this GitHub for all questions or suggestions. For specific inquiries, please contact us at [email protected]. Please note that we are unable to release or provide training data for this model due to existing terms of service.

License

This source code is licensed under the GNU Affero General Public License, which allows for non-commercial re-use of this software. For commercial inquiries, please contact us directly. Please see the LICENSE file in the root directory of this source tree for details.

m3inference's People

Contributors

computermacgyver, davidjurgens, faflo, simone-alghisi, zhuowei, zijwang


m3inference's Issues

Problems with installation

Hello,

I am trying to install m3inference through "pip install m3inference", but I get an error (see below). I tried several things to fix this, but nothing resolved the issue. I believe it has something to do with "pycld2": when I tried to install it separately, it also did not work.

Thanks in advance:

(base) C:\Users\Rude>pip install m3inference
Collecting m3inference
  Using cached m3inference-1.1.5-py3-none-any.whl (58 kB)
Requirement already satisfied: tqdm in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from m3inference) (4.28.1)
Collecting pycld2>=0.31
  Using cached pycld2-0.41.tar.gz (41.4 MB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: torch>=1.0.0 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from m3inference) (1.10.1)
Requirement already satisfied: Pillow in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from m3inference) (5.3.0)
Requirement already satisfied: pandas>=0.20 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from m3inference) (1.3.4)
Requirement already satisfied: torchvision>=0.2.2 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from m3inference) (0.11.2)
Requirement already satisfied: rauth in c:\users\rude\appdata\roaming\python\python37\site-packages (from m3inference) (0.7.3)
Requirement already satisfied: requests in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from m3inference) (2.21.0)
Requirement already satisfied: numpy>=1.13 in c:\users\rude\appdata\roaming\python\python37\site-packages (from m3inference) (1.21.4)
Requirement already satisfied: pytz>=2017.3 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from pandas>=0.20->m3inference) (2018.7)
Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from pandas>=0.20->m3inference) (2.7.5)
Requirement already satisfied: typing-extensions in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from torch>=1.0.0->m3inference) (4.0.1)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from requests->m3inference) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from requests->m3inference) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from requests->m3inference) (2021.5.30)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from requests->m3inference) (1.24.1)
Requirement already satisfied: six>=1.5 in c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages (from python-dateutil>=2.7.3->pandas>=0.20->m3inference) (1.12.0)
Building wheels for collected packages: pycld2
  Building wheel for pycld2 (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\rude\appdata\local\continuum\anaconda3\python.exe' -u -c '... pycld2 setup.py ...' bdist_wheel -d 'C:\Users\Rude\AppData\Local\Temp\5\pip-wheel-7as7f8gd'
       cwd: C:\Users\Rude\AppData\Local\Temp\5\pip-install-3n3kiofw\pycld2_04bebb99f5e4481caa01025a1abb1b1f
  Complete output (10 lines):
  running bdist_wheel
  The [wheel] section is deprecated. Use [bdist_wheel] instead.
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.7
  creating build\lib.win-amd64-3.7\pycld2
  copying pycld2\__init__.py -> build\lib.win-amd64-3.7\pycld2
  running build_ext
  error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

  ERROR: Failed building wheel for pycld2
  Running setup.py clean for pycld2
Failed to build pycld2
Installing collected packages: pycld2, m3inference
  Running setup.py install for pycld2 ... error
    ERROR: Command errored out with exit status 1:
     command: 'c:\users\rude\appdata\local\continuum\anaconda3\python.exe' -u -c '... pycld2 setup.py ...' install --record 'C:\Users\Rude\AppData\Local\Temp\5\pip-record-3t8c0_3x\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\rude\appdata\local\continuum\anaconda3\Include\pycld2'
         cwd: C:\Users\Rude\AppData\Local\Temp\5\pip-install-3n3kiofw\pycld2_04bebb99f5e4481caa01025a1abb1b1f
    Complete output (11 lines):
    running install
    c:\users\rude\appdata\local\continuum\anaconda3\lib\site-packages\setuptools\command\install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      setuptools.SetuptoolsDeprecationWarning,
    running build
    running build_py
    creating build
    creating build\lib.win-amd64-3.7
    creating build\lib.win-amd64-3.7\pycld2
    copying pycld2\__init__.py -> build\lib.win-amd64-3.7\pycld2
    running build_ext
    error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
    ----------------------------------------
ERROR: Command errored out with exit status 1: 'c:\users\rude\appdata\local\continuum\anaconda3\python.exe' -u -c '... pycld2 setup.py ...' install --record 'C:\Users\Rude\AppData\Local\Temp\5\pip-record-3t8c0_3x\install-record.txt' --single-version-externally-managed --compile --install-headers 'c:\users\rude\appdata\local\continuum\anaconda3\Include\pycld2' Check the logs for full command output.

(base) C:\Users\Rude>

Feature: Support v2 API data as input

Hi! I'm doing a research project about Twitter analysis.

I fetched user data with the Twitter Academic API (v2), and after using M3Twitter.transform_jsonl(...) I got the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-5-23da1cf5d317> in <module>
      5 ,access_token=' ',access_secret=' ')
      6 
----> 7 m3twitter.transform_jsonl(input_file="test.jsonl", output_file="test_result.jsonl")

~/opt/anaconda3/lib/python3.8/site-packages/m3inference/m3twitter.py in transform_jsonl(self, input_file, output_file, img_path_key, lang_key, resize_img, keep_full_size_img)
     48             with open(output_file, "w") as fhOut:
     49                 for line in fhIn:
---> 50                     m3vals = self.transform_jsonl_object(line, img_path_key=img_path_key, lang_key=lang_key,
     51                                                          resize_img=resize_img, keep_full_size_img=keep_full_size_img)
     52                     fhOut.write("{}\n".format(json.dumps(m3vals)))

~/opt/anaconda3/lib/python3.8/site-packages/m3inference/m3twitter.py in transform_jsonl_object(self, input, img_path_key, lang_key, resize_img, keep_full_size_img)
     80             else:
     81                 img_file_resize = img_path
---> 82         elif user["default_profile_image"]:
     83             # Default profile image
     84             img_file_resize = TW_DEFAULT_PROFILE_IMG

KeyError: 'default_profile_image'

I also ran the example data provided in m3inference/test/twitter_cache/ and the function runs perfectly.

Then I double-checked the jsonl file; it looks like the two versions of the Twitter API (v1/v2) return (slightly) different JSON (I suppose the example data were made with the v1 API). For details, please see: https://developer.twitter.com/en/docs/twitter-api/migrate/data-formats/standard-v1-1-to-v2

I'm not sure if my comment makes sense, maybe you could have a look?
Thanks in advance!

Output file options

I am using M3 for a research project and will be combining the output with other data from Twitter. Is it possible to output the results into something more manageable than the on-screen printed output after running the code? The readme does reference the output format, but I'm not sure where to look next.

Improve requests speed

Hello
I have already used m3inference and it works well, but at a large scale of data it is not fast enough: it takes an average of 1.3 seconds per account while running on a Kaggle GPU.
Is there any advice or technique to speed up its progress?

Incompatibility with Torch 1.7.0

Hi there,

On a new installation, pip will attempt to pull the latest version of all the dependencies. Since Torch released the latest version of their package (1.7.0), M3 inference seems to misbehave.
I get the following error when attempting to import the preprocessing package.

AttributeError: module 'torch.utils.data' has no attribute "'BatchSamplerDistributedSamplerDataset'"

The whole issue is resolved by downgrading Torch to version 1.6.0.

I hope this helps!
Keep up with the amazing work!

PS: For context I tried this on two OS with two different Python version and got the same result.
Test 1: Linux 5.4.0-1031-azure; Python 3.6.7-1
Test 2: Linux 5.4.39-linuxkit; Python 3.8.2-0
Both are running on x86_64 architecture

Commercial use

Best regards. My compliments.
Is it possible to use the python code or the python library for a commercial project? What are the restrictions or requirements?
Thanks.

Predicting...0%

Hi,

I tried using the library with text_mode and it works fine.
When I use the full model, prediction doesn't work but I don't get any error. It just gets stuck.

This is basically my code:

m3 = M3Inference(use_full_model=True)
preprocess.download_resize_img(pic_url, "profile_pic.jpg", "profile_pic_fs.jpg")

with open('data.jsonl', 'w') as outfile:
    for entry in data_set:
        json.dump(entry, outfile)
        outfile.write('\n')

pred = m3.infer('data.jsonl')

This is the output:

10/04/2021 15:47:05 - INFO - m3inference.m3inference -   Version 1.1.5
10/04/2021 15:47:05 - INFO - m3inference.m3inference -   Running on cpu.
10/04/2021 15:47:05 - INFO - m3inference.m3inference -   Will use full M3 model.
10/04/2021 15:47:06 - INFO - m3inference.m3inference -   Model full_model exists at /Users/vv/m3/models/full_model.mdl.
10/04/2021 15:47:06 - INFO - m3inference.utils -   Checking MD5 for model full_model at /Users/vv/m3/models/full_model.mdl
10/04/2021 15:47:06 - INFO - m3inference.utils -   MD5s match.
10/04/2021 15:47:06 - INFO - m3inference.m3inference -   Loaded pretrained weight at /Users/vv/m3/models/full_model.mdl
10/04/2021 15:47:06 - INFO - m3inference.dataset -   1 data entries loaded.
Predicting...:   0%|          | 0/1 [00:00<?, ?it/s]

Any idea about the issue?

Thanks.

infer_id and infer_screen_name don't work properly

I have a problem while running the program; my code is:

# The API first needs to validate your Twitter App's credentials
m3twitter.twitter_init_from_file('/content/drive/My Drive/Ibu Avi/Last/User/auth-sample.txt')

The output is : True

And
# sample run
pprint.pprint(m3twitter.infer_id("3138075595"))

The output is :

04/27/2022 09:40:21 - INFO - m3inference.m3twitter - Results not in cache. Fetching data from Twitter for id 3138075595.
04/27/2022 09:40:21 - INFO - m3inference.m3twitter - GET /users/show.json?id=3138075595
04/27/2022 09:40:21 - WARNING - m3inference.m3twitter - Could not retreive screen_name
04/27/2022 09:40:21 - WARNING - m3inference.m3twitter - Could not retreive id_str
04/27/2022 09:40:21 - WARNING - m3inference.m3twitter - Could not retreive description
04/27/2022 09:40:21 - WARNING - m3inference.m3twitter - Could not retreive name
04/27/2022 09:40:21 - WARNING - m3inference.m3twitter - Could not retreive profile_image_url
04/27/2022 09:40:21 - WARNING - m3inference.m3twitter - Unable to extract image from Twitter. Using default image.
04/27/2022 09:40:21 - INFO - m3inference.dataset - 1 data entries loaded.
Predicting...: 100%|██████████| 1/1 [00:00<00:00,  1.39it/s]
{'input': {'description': '',
           'id': 'dummy',
           'img_path': '/usr/local/lib/python3.7/dist-packages/m3inference/data/tw_default_profile.png',
           'lang': 'un',
           'name': '',
           'screen_name': ''},
 'output': {'age': {'19-29': 0.2393,
                    '30-39': 0.0793,
                    '<=18': 0.1746,
                    '>=40': 0.5067},
            'gender': {'female': 0.2809, 'male': 0.7191},
            'org': {'is-org': 0.0873, 'non-org': 0.9127}}}

It's always like that, even after changing to other active user ids and usernames. What should I do?

Error fetching images will fail the infer method

I am trying to run transform_jsonl (to download images and prepare the m3 json file) and, right after, the infer method. The issue occurs when transform_jsonl does not find some images but still writes the path to the m3 json file, causing infer to fail with:
FileNotFoundError: [Errno 2] No such file or directory

ValueError: semaphore or lock released too many times

Hi, I am working with Professor Przemek and Mattia on a project. I am using m3inference to run predictions, but I am encountering a ValueError which says "semaphore or lock released too many times" while running the infer method of the M3Inference module. This seems to have something to do with multiprocessing, but I am unable to fix the error. Attaching a screenshot for your reference.

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 3 and 1 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612

I have the following error when trying to predict the demographics of a list of twitter users.

Predicting...:   0%|                                                                                                                                                        | 36/54307 [04:36<107:30:38,  7.13s/it]
File ".../src/utils/demographic_detector.py", line 43, in infer                                                                                                           [5/1807]
    predictions = self.m3twitter.infer(user_objs)                                                                                                                                                                  
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/m3inference/m3inference.py", line 125, in infer                                                                                        
    for batch in tqdm(dataloader, desc='Predicting...'):                                                                                                                                                           
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/tqdm/std.py", line 1108, in __iter__                                                                                                   
    for obj in iterable:                                                                                                                                                                                           
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in default_collate
    return [default_collate(samples) for samples in transposed]
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 79, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File ".../.conda/envs/twcovid/lib/python3.7/site-packages/torch/utils/data/_utils/collate.py", line 55, in default_collate
    return torch.stack(batch, 0, out=out)
RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 0. Got 3 and 1 in dimension 1 at /pytorch/aten/src/TH/generic/THTensor.cpp:612

The list of users can be found here

Error in infer_id()

The following code

from m3inference import M3Twitter
m3 = M3Twitter()
m3.infer_id(243344789)

Led to the following error

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/winston/.local/lib/python3.7/site-packages/m3inference/m3twitter.py", line 179, in infer_id
    output = self.process_twitter(data, id=id)
  File "/home/winston/.local/lib/python3.7/site-packages/m3inference/m3twitter.py", line 229, in process_twitter
    pred = self.infer(data, batch_size=1, num_workers=1)
  File "/home/winston/.local/lib/python3.7/site-packages/m3inference/m3inference.py", line 125, in infer
    for batch in tqdm(dataloader, desc='Predicting...'):
  File "/home/winston/.local/lib/python3.7/site-packages/tqdm/std.py", line 1119, in __iter__
    for obj in iterable:
  File "/home/winston/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/winston/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 856, in _next_data
    return self._process_data(data)
  File "/home/winston/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/home/winston/.local/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/winston/.local/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/winston/.local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/winston/.local/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/winston/.local/lib/python3.7/site-packages/m3inference/dataset.py", line 37, in __getitem__
    return self._preprocess_data(data)
  File "/home/winston/.local/lib/python3.7/site-packages/m3inference/dataset.py", line 43, in _preprocess_data
    fig = self._image_loader(img_path)
  File "/home/winston/.local/lib/python3.7/site-packages/m3inference/dataset.py", line 91, in _image_loader
    image = Image.open(image_name)
  File "/home/winston/.local/lib/python3.7/site-packages/PIL/Image.py", line 2809, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/home/winston/m3/cache/indiealehouse_224x224.png'

Potential m3twitter.infer_id bug

Hello, first time GitHub issuer here!

When I try to process certain user id_str's I get a FileNotFound error. Here is a user id_str I chose at random - '238173039'. When I run m3twitter.infer_id, I receive the following error:

Traceback (most recent call last):
  File "is_organization.py", line 24, in <module>
    org = m3twitter.infer_id(id_str)['output']['org']
  File "/home/ndbhagwa/miniconda3/lib/python3.8/site-packages/m3inference/m3twitter.py", line 208, in infer_id
    output=self._twitter_api(id=id)
  File "/home/ndbhagwa/miniconda3/lib/python3.8/site-packages/m3inference/m3twitter.py", line 187, in _twitter_api
    return self.process_twitter(r.json())
  File "/home/ndbhagwa/miniconda3/lib/python3.8/site-packages/m3inference/m3twitter.py", line 245, in process_twitter
    download_resize_img(img, img_file_resize, img_file_full)
  File "/home/ndbhagwa/miniconda3/lib/python3.8/site-packages/m3inference/preprocess.py", line 28, in download_resize_img
    with open(img_out_path_fullsize, "wb") as fh:
FileNotFoundError: [Errno 2] No such file or directory: '/home/ndbhagwa/m3/cache/TheWhaleShark.com/profile_images/2602602416/m8su11Vx_400x400'

I am not sure what the cause of this error might be. I originally thought it was a rate limit error since it does not occur consistently, but for other rate errors, I see warnings like this:

<dt> - INFO - m3inference.m3twitter -   Results not in cache. Fetching data from Twitter for id <#>
<dt> - INFO - m3inference.m3twitter -   GET /users/show.json?id=<#>
<dt> - WARNING - m3inference.m3twitter -   Could not retreive screen_name
<dt> - WARNING - m3inference.m3twitter -   Could not retreive id_str
<dt> - WARNING - m3inference.m3twitter -   Could not retreive description
<dt>  - WARNING - m3inference.m3twitter -   Could not retreive name
<dt> - WARNING - m3inference.m3twitter -   Could not retreive profile_image_url
<dt> - WARNING - m3inference.m3twitter -   Unable to extract image from Twitter. Using default image.
<dt> - INFO - m3inference.dataset -   1 data entries loaded

Error with using infer_id()

Hi! I'm using this code for a research project, thank you for providing it.

I am trying to make an inference based on infer_id and I just replicated the example in the FAQ. Here's what my code looks like:

from m3inference import M3Twitter
load_dotenv()

# authentication
twitter_app_auth = {
    'consumer_key': os.getenv('TWITTER_API_KEY'),
    'consumer_secret': os.getenv('TWITTER_API_SECRET'),
    'access_token': os.getenv('TWITTER_ACCESS_TOKEN'),
    'access_token_secret': os.getenv('TWITTER_ACCESS_SECRET'),
}

# init the api
inferenceTwitter.twitter_init(api_key=twitter_app_auth['consumer_key'],
                              api_secret=twitter_app_auth['consumer_secret'],
                              access_token=twitter_app_auth['access_token'],
                              access_secret=twitter_app_auth['access_token_secret'])

pprint.pprint(inferenceTwitter.infer_id("2631881902"))

The traceback that I received was pretty confusing

`RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.`

RuntimeError: DataLoader worker (pid(s) 57016) exited unexpectedly

I'm not sure where to find the freeze_support() function call and how to deal with using the fork() child processes.

Add streaming option to infer

For inferring a large number of users, it would be fantastic if infer would have an option to stream the results to a file as it finishes, rather than returning the values. This behavior is particularly helpful for big inference jobs that need a few hours (days?) to finish and where intermediate results would be useful.

Question about training procedure

Hi,
First of all thank you for your great work.

I was wondering what you used as the ground-truth label for age and gender when user profiles are organizations. You wouldn't want the model to learn to recognize a gender/age on an organization profile.
I believe this is not mentioned in the article, or maybe I misunderstood something about the training procedure?

"NameError: name 'torch' is not defined" but torch is installed and imported

After creating a virtual environment, I tried to install and import m3inference:

pip install m3inference
import m3inference

But I get the following error, how could I fix it?

NameError                                 Traceback (most recent call last)
<ipython-input-9-50ee37ff85fa> in <module>
----> 1 import m3inference

3 frames
/usr/local/lib/python3.8/dist-packages/m3inference/full_model.py in M3InferenceModel()
     10 
     11 class M3InferenceModel(nn.Module):
---> 12     def __init__(self, device='cuda' if torch.cuda.is_available() else 'cpu'):
     13         super(M3InferenceModel, self).__init__()
     14 

NameError: name 'torch' is not defined

I tried to install and import torch before doing the same with m3inference.

Thanks!

Support Different Languages Outside the EU?

Hey, thank you for making this project. What awesome and incredible research.
Is the project also supported for languages outside the EU? If not, which parts of the project would need to change to support this? I am interested in researching this project.

Segmentation Fault w/ transform_jsonl()

I believe I've installed m3-inference correctly, but running transform_jsonl() on a json lines file of tweets seems to fetch the first profile picture in the list and then terminate with a segmentation fault.

I believe the file is structured appropriately, in the format below:
{json object}\n
{json object}\n
...

Any idea what I might be running into?

Incompatibility with pytorch 1.8.0

Hi,
I've been enjoying this project a lot for my research, but recently I'm having issues using it on our machine, which has PyTorch 1.8.0 installed. The error happens when I try to use any of the available models with a GPU:

from m3inference import M3Inference
import pprint
m3 = M3Inference() # see docstring for details
pred = m3.infer('./test/data_resized.jsonl') # also see docstring for details
pprint.pprint(pred)

where it produces the following error:

pred = m3.infer('./test/data_resized.jsonl') # also see docstring for details
03/19/2021 12:13:13 - INFO - m3inference.dataset - 7 data entries loaded.
Predicting...: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "", line 1, in
File "/home/minje/libraries/m3inference/m3inference/m3inference.py", line 127, in infer
pred = self.model(batch)
File "/opt/anaconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/minje/libraries/m3inference/m3inference/full_model.py", line 99, in forward
username_pack, username_unsort = pack_wrapper(username_embed, username_len)
File "/home/minje/libraries/m3inference/m3inference/utils.py", line 47, in pack_wrapper
packed = pack_padded_sequence(sents_sorted, lengths_sorted, batch_first=True)
File "/opt/anaconda/lib/python3.8/site-packages/torch/nn/utils/rnn.py", line 245, in pack_padded_sequence
_VF._pack_padded_sequence(input, lengths, batch_first)
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

I think this is related to Pytorch's update on the pack_padded_sequence only accepting lengths as CPU form when inputted as tensors [link]. I would appreciate it a lot if you could look into this. Thanks!

Efficient collection of large list of screen-names/ids via Twitter API

Currently the infer_screen_name and infer_id methods in M3Twitter accept one screen-name/id and call the Twitter API to get information for that single user. This is inefficient since the endpoint can get up to 100 users at a time.

New methods should be included in the M3Twitter class to handle a long list of users. These methods should break the list into chunks of 100, respect the rate limit, and gracefully handle any API errors.

(This was previously not needed as the class was scraping profiles from HTML and was designed simply as a demonstration method rather than something to be used at scale. The change recently made to use the API opens up this opportunity, which would make the library even more user-friendly)

Question about training procedure code

Hi team,
Thank you so much for your great work.
I was wondering whether you could please upload the training procedure code? I read the released code but didn't find the code for the multi-task classification procedure; I only found the code for evaluation.
Thanks in advance.

Segmentation fault for certain ids in Apple M1 computers - no problem in Apple Intel computers

There may be an incompatibility issue for those running M3Inference with Apple M1 computers.

I just converted to a newer Apple M1 laptop and tried running m3. For certain ids, there are no problems. However, for most ids, I get "segmentation fault" (see error Output for A below).

I tried running it on my old laptop (Apple Intel). There are no problems for any ids; it runs smoothly. Examples can be found below:

Both (A) and (B) run fine for my Apple Intel laptop:
(A)

python3 scripts/m3twitter.py --skip-cache --id 7259022 --auth scripts/auth.txt

Output for (A)

{'input': {'description': 'Techonomist who runs International Development '
                          'Projects and works on Technology Platforms in the '
                          'Philippines, specifically @gloryreborn & @symphco',
           'id': '7259022',
           'img_path': '/Users/szoriac/m3/cache/7259022_224x224.jpg',
           'lang': 'en',
           'name': 'Dave Overton',
           'screen_name': 'daveove'},
 'output': {'age': {'19-29': 0.0087,
                    '30-39': 0.8318,
                    '<=18': 0.0002,
                    '>=40': 0.1593},
            'gender': {'female': 0.0004, 'male': 0.9996},
            'org': {'is-org': 0.0001, 'non-org': 0.9999}}}

(B)

python3 scripts/m3twitter.py --skip-cache --id 373269437 --auth scripts/auth.txt

Output for (B)

{'input': {'description': '',
           'id': '373269437',
           'img_path': '/Users/szoriac/m3/cache/373269437_224x224.jpg',
           'lang': 'un',
           'name': 'BANISCH Dominique',
           'screen_name': 'Nasch57'},
 'output': {'age': {'19-29': 0.0013,
                    '30-39': 0.0003,
                    '<=18': 0.0052,
                    '>=40': 0.9932},
            'gender': {'female': 0.021, 'male': 0.979},
            'org': {'is-org': 0.1689, 'non-org': 0.8311}}}

But only (B) works for my Apple M1 laptop:

I get this error for (A)

11/20/2021 16:22:54 - INFO - m3inference.m3inference -   Version 1.1.5
11/20/2021 16:22:54 - INFO - m3inference.m3inference -   Running on cpu.
11/20/2021 16:22:54 - INFO - m3inference.m3inference -   Will use full M3 model.
11/20/2021 16:22:54 - INFO - m3inference.m3inference -   Model full_model exists at /Users/wdwg/m3/models/full_model.mdl.
11/20/2021 16:22:54 - INFO - m3inference.utils -   Checking MD5 for model full_model at /Users/wdwg/m3/models/full_model.mdl
11/20/2021 16:22:55 - INFO - m3inference.utils -   MD5s match.
11/20/2021 16:22:55 - INFO - m3inference.m3inference -   Loaded pretrained weight at /Users/wdwg/m3/models/full_model.mdl
11/20/2021 16:22:55 - INFO - m3inference.m3twitter -   skip_cache is True. Fetching data from Twitter for id 7259022.
11/20/2021 16:22:55 - INFO - m3inference.m3twitter -   GET /users/show.json?id=7259022
[1]    22412 segmentation fault  python3 scripts/m3twitter.py --skip-cache --id 7259022 --auth scripts/auth.tx

But not for (B)

{'input': {'description': '',
           'id': '373269437',
           'img_path': '/Users/wdwg/m3/cache/373269437_224x224.jpg',
           'lang': 'un',
           'name': 'BANISCH Dominique',
           'screen_name': 'Nasch57'},
 'output': {'age': {'19-29': 0.0013,
                    '30-39': 0.0003,
                    '<=18': 0.0052,
                    '>=40': 0.9932},
            'gender': {'female': 0.021, 'male': 0.979},
            'org': {'is-org': 0.1689, 'non-org': 0.8311}}}

Is anyone else encountering the same problem? Am I doing something wrong? Is there a way to fix this?

Possibly helpful information: I used m3inference 1.1.5 on both laptops. The Python version on my Apple M1 is 3.9.7, while the Intel machine runs 3.8.5 (the M1 does not support 3.8.5). It may or may not be a version issue.
