mehdigolzadeh / bodegha
A Python tool to predict the identity type (Human or Bot) in GitHub activities.
License: Other
@mehdigolzadeh, could you add git version tags to the commits of BoDeGa that correspond to a specific version? This would make it easier to keep track of which version is being used, or to track bug reports that may apply to a specific version. Currently, I have not seen any way of knowing which version of BoDeGa is being used.
I would therefore suggest:
(1) to use git version tags whenever a new version becomes available;
(2) to add the version number information to the bodega tool (when you run it, or ask for help with bodega -h, or perhaps even bodega --version, it could report the version number);
(3) to mention the version number in the README file, and update it each time a new version becomes available.
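Suggestion (2) could be sketched as follows, assuming the CLI uses argparse (the version string and parser layout here are illustrative, not BoDeGHa's actual code):

```python
import argparse

# Hypothetical: this constant would be kept in sync with the git tag
# of each release.
__version__ = "1.0.0"

def build_parser():
    parser = argparse.ArgumentParser(prog="bodega")
    # argparse's built-in "version" action prints the version and exits.
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {__version__}",
    )
    return parser
```

With this in place, `bodega --version` would print e.g. `bodega 1.0.0`, and `bodega -h` automatically documents the flag.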
Would it be possible to run the tool on multiple repositories at once? I do not mean running the tool on each of these repositories separately, with a separate output for each of them. What I mean is to treat the set of considered repositories as a single virtual repository, by considering all accounts that have contributed to at least one of these repositories. If an account has contributed to more than one repository, then all PR comments and all issue comments associated with this account will be considered by the tool, regardless of which repository they come from. This way, even if an account was less active in some repository, it will be easier to reach the minimum threshold (number of comments) required to classify an account. In addition, given that the number of comments per account will likely be higher, it might further improve the accuracy of the analysis.
I think that this kind of "use case" is relevant for software projects that tend to break down their development into multiple separate repositories, but that still may wish to do an analysis for the software project as a whole, considering all its repositories together.
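The merging step described above amounts to pooling each account's comments across repositories before classification. A minimal sketch of that idea (the function name and data shape are hypothetical, not BoDeGHa's actual data model):

```python
from collections import defaultdict

def merge_comments_by_account(per_repo_comments):
    # per_repo_comments: dict mapping repository name -> dict mapping
    # account login -> list of that account's comments in the repository.
    # Returns one pool of comments per account across all repositories,
    # as if they formed a single virtual repository.
    merged = defaultdict(list)
    for repo, accounts in per_repo_comments.items():
        for login, comments in accounts.items():
            merged[login].extend(comments)
    return dict(merged)
```

The classifier would then be run on the merged pools, so an account with 10 comments in each of three repositories contributes 30 comments to its classification.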
Hello all, I checked on GitHub and there is a huge number of projects (and probably tools) called bodega. Perhaps not very surprising, now that I come to think of it. Perhaps we should consider renaming the tool at some point?
Dear BoDeGHa team, thanks for this interesting tool to detect bots on GitHub; your approach sounds really promising and helpful. However, when I tried to run the tool on just a single project, I identified a number of issues:
(1) Misclassifications: First of all, I tried the tool with its default parameters on the GitHub project owncloud/core. As I am interested in the complete project history, I used --start-date 01-01-2015. This resulted in 20 bots and almost 600 humans. However, many of those 20 bots aren't bots: just checking a few issues or pull requests shows that they are not bots. I won't paste their names here (for data privacy reasons), but if you run your tool with the same parameters yourself, you will be able to check that. I assume that they either have opened many pull requests following the pull-request template of owncloud, or they have only a few comments (e.g., having 22 comments, some of them pull-request templates, may not be enough and may lead to the misclassification). But also for some of the other misclassified accounts, it might be a problem to look at only 100 comments and ignore all previous ones, which leads to the next issue...
(2) In a second step, I tried to increase the number of comments analyzed per account to circumvent the problems listed above. To be precise, I wanted to set maxComments to 2000 (as I think 2000 might be an appropriate number to get rid of the accounts classified as bots which actually are humans). However, there are several bugs in your implementation: it is unclear whether the option is called --maxComments or --maxComment (singular or plural), as the Usage section of your README contains both versions. Moreover, whether maxComments or maxComment, this parameter is never used in your code: the parameters minComments and maxComments are passed to your process_comments function, but within this function they are not used, and are thus useless. Please fix this to pass them to the right places. Instead, I identified 5 lines in which the number 100 is hard-coded, so I guess all those 5 lines should make use of the maxComments parameter. (The same holds for minComments.)
(3) To circumvent the bugs described above, I manually changed the hard-coded number 100 in your code to another number, at all 5 lines where 100 occurs. Unfortunately, whenever one of those 5 numbers is replaced by a number greater than 100, the tool runs into an error right at the beginning of downloading the comments (i.e., the error occurs already if just one of those 5 numbers is set to 101). I don't know what the problem is. Maybe there are too many requests for one single API token, but I don't think that is the problem, as reducing all the other numbers to very small values (e.g., 5) still produces the error if one of them exceeds 100 -- so I am actually not sure what's going on when there is one number greater than 100.
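A plausible explanation for issue (3), offered here as an assumption rather than a confirmed diagnosis: GitHub's APIs return at most 100 items per page (both REST's per_page and GraphQL's first/last arguments are capped at 100), so a single request for 101 comments fails regardless of rate limits. The fix would be to keep every request at 100 or fewer and paginate until maxComments items have been fetched. A sketch of that page-size planning (hypothetical helper, not BoDeGHa's actual code):

```python
GITHUB_PAGE_CAP = 100  # GitHub APIs return at most 100 items per page

def page_sizes(max_comments):
    # Split a desired total into successive request sizes, each <= 100.
    # E.g. 2000 comments would be fetched in 20 pages of 100,
    # while 250 would need pages of 100, 100 and 50.
    sizes = []
    remaining = max_comments
    while remaining > 0:
        take = min(remaining, GITHUB_PAGE_CAP)
        sizes.append(take)
        remaining -= take
    return sizes
```

Each page request would then pass the page's cursor (GraphQL `after`) or page number (REST) from the previous response.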
Please try to fix these issues to make your tool applicable in the wild 🙂 The second issue should be easy to fix, but the third one, which is the most important to me, looks kind of strange. And in the end, I am still not sure how to treat the first one: since I could not try using 2000 comments, I don't know whether that would solve part (1) or not.
Given that BoDeGa was created as a research project that was carried out as part of, and financed by research funds received by, the University of Mons, I think it makes sense to move the GitHub repo to the github.com/University-of-Mons domain. Can this be done without losing any historic information, such as all commits, issues and pull requests? (I guess so...)
If we go for this solution, we should do it soon, before the tool gains traction. What are your thoughts on that?
The name Bodega is already used by many other repositories.
Moreover, the name "BoDeGHa" is better, since it is an acronym for "Bot Detection in GitHub accounts".
Can you therefore rename this git repository, the tool itself, and also all references to this tool in all our publications referring to it? Now is the time to do so, before we have an official publication of the tool...
@mehdigolzadeh can you update the readme file to:
Up to now, I used BoDeGHa with scikit-learn version 0.22, as stated in requirements.txt (line 3 in ac8a5d6). However, when installing BoDeGHa freshly, it uses scikit-learn version 1.0.1, since this is the version given in setup.py (line 34 in ac8a5d6). But using 1.0.1 leads to warnings when running BoDeGHa, as the pretrained model was trained with 0.22:
bodegha/lib/python3.10/site-packages/sklearn/base.py:324: UserWarning:
Trying to unpickle estimator DecisionTreeClassifier from version 0.22 when using version 1.0.1.
This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
warnings.warn(
bodegha/lib/python3.10/site-packages/sklearn/base.py:324: UserWarning:
Trying to unpickle estimator RandomForestClassifier from version 0.22 when using version 1.0.1.
This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
warnings.warn(
bodegha/lib/python3.10/site-packages/sklearn/base.py:438: UserWarning:
X has feature names, but RandomForestClassifier was fitted without feature names
warnings.warn(
So, as there is a mismatch of the scikit-learn versions in your repository, this needs to be fixed somehow: using a pretrained model that was not trained with the current scikit-learn version could lead to wrong results.
To fix this, one can either set the scikit-learn version in setup.py back to 0.22, or provide a new pretrained model for 1.0.1 in the repository.
I tried to set the version of scikit-learn in setup.py back to 0.22, but without success: scikit-learn 0.22 is not compatible with the current version of numpy anymore (AttributeError: module 'numpy' has no attribute 'float'. `np.float` was a deprecated alias for the builtin 'float'). Downgrading numpy to version 1.19.5 (the version before the deprecation of np.float) was not possible, as numpy 1.19.5 does not work with Python 3.10. Installing numpy 1.21.2 (which is compatible with Python 3.10) results in another error (ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject). I also tried other versions of numpy between 1.19.5 and 1.21.2, also without success.
So, finally, I did not manage to install scikit-learn version 0.22, on which your pretrained model was trained, on Python 3.10.
Could you please update the pretrained model in this repository to work with scikit-learn 1.0.1? Or could you show that using your 0.22-pretrained model with 1.0.1 is still correct, and prevent the corresponding warnings somehow?
Thanks in advance! This would help a lot and would increase the reliability of your tool if these risk warnings disappeared 🙂
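For the second option, once correctness of the old model under the new scikit-learn has actually been verified, the specific unpickling warnings could be silenced with a targeted filter rather than a blanket one. A sketch of that workaround (not an endorsement of shipping a mismatched model, and the function name is made up):

```python
import warnings

def suppress_unpickle_version_warnings():
    # Silence only scikit-learn's "Trying to unpickle estimator ..."
    # UserWarning; all other warnings still surface. This should only
    # be done after verifying that the 0.22-trained model still behaves
    # correctly under the installed scikit-learn version.
    warnings.filterwarnings(
        "ignore",
        message=r"Trying to unpickle estimator",
        category=UserWarning,
    )
```

The cleaner long-term fix remains retraining the model and committing a pickle produced by the pinned scikit-learn version.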
Hello Mehdi, when running bodega, I get the following warning; can this be fixed please?
FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead
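This warning points at code that reaches numpy through pandas' deprecated pd.np alias (which has since been removed in pandas 2.0). The fix is mechanical: import numpy directly and replace every pd.np. prefix with np.. An illustrative before/after, with made-up variable names:

```python
import numpy as np

# Before (deprecated, triggers the FutureWarning, and removed in
# pandas >= 2.0):
#   mean_gap = pd.np.mean(time_gaps)
# After: import numpy directly and use it instead of the pandas alias:
time_gaps = [1.0, 2.0, 6.0]
mean_gap = np.mean(time_gaps)
```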
By default, when running bodega on a GitHub repository, all accounts in that repository are included in the analysis. It is also possible to run bodega on a set of accounts provided as input. However, this is very impractical if the set of accounts is very big. What if you want to run bodega on all accounts in a repository, except for a small number of known accounts? In that case, it would be much more practical to specify which accounts should be excluded, instead of listing all of those that should be included. What do you think?
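The proposed exclusion behaviour amounts to a set difference over account logins; a minimal sketch (the helper name is hypothetical, and a real CLI flag such as an --exclude option would feed its second argument):

```python
def accounts_to_analyse(all_accounts, excluded):
    # Keep the repository's account order, dropping every excluded login.
    excluded = set(excluded)
    return [login for login in all_accounts if login not in excluded]
```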
In those rare cases where bodega misclassifies a human as a bot, or a bot as a human, it would be nice to have a way to make bodega aware of this, to avoid having the tool report the same misclassification over and over again. I do not know what would be the best way to achieve this, but I can see different possibilities: whenever a misclassification is found, it could be recorded as such (in a file with a specific format and filename), and when the tool is run, it checks this file for known misclassifications. It will then be up to the user of bodega to decide whether to include the misclassified accounts when re-running bodega.
Where should such a file be stored? Different solutions can be envisioned:
(1) On the bodega GitHub repository itself, we could have a file containing all known misclassifications (i.e. all cases that have been reported to us, and verified by us, of accounts that were misclassified when running bodega). When running bodega, this file can then be consulted to report the correct classification of the account.
(2) On the GitHub repository that is being analysed by bodega. Again, when running bodega, this file can then be consulted to report the correct classification of the account.
(3) In the directory of the user that is actually using bodega to run the analysis (e.g. if that user does not have write access to be able to use solution (2) and if that user does not want to share the misclassification for whatever reason).
If we want to combine these multiple solutions, we should probably set a precedence order.