mehdigolzadeh / bodegha
A Python tool to predict the identity type (Human or Bot) in GitHub activities.
License: Other
@mehdigolzadeh, could you add git version tags to the commits of BoDeGa that correspond to a specific version? This would make it easier to keep track of which version is being used, or to track bug reports that may apply to a specific version. Currently, I have not seen any way of knowing which version of BoDeGa is being used.
I would therefore suggest:
(1) to use git version tags whenever a new version becomes available;
(2) to add the version number information to the bodega tool (when you run it, or ask for help with bodega -h, or perhaps even bodega --version, it could report the version number);
(3) to mention the version number in the README file, and update it each time a new version becomes available.
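Suggestion (2) could be sketched as follows, assuming the CLI uses argparse (the version string and parser layout here are illustrative, not BoDeGHa's actual code):

```python
import argparse

# Hypothetical: this constant would be kept in sync with the git tag
# of each release.
__version__ = "1.0.0"

def build_parser():
    parser = argparse.ArgumentParser(prog="bodega")
    # argparse's built-in "version" action prints the version and exits.
    parser.add_argument(
        "--version",
        action="version",
        version=f"%(prog)s {__version__}",
    )
    return parser
```

With this in place, `bodega --version` would print e.g. `bodega 1.0.0`, and `bodega -h` automatically documents the flag.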
Would it be possible to run the tool on multiple repositories at once? I do not mean running the tool on each of these repositories separately, with a separate output for each of them. What I mean is to treat the set of considered repositories as a single virtual repository, by considering all accounts that have contributed to at least one of these repositories. If an account has contributed to more than one repository, then all PR comments and all issue comments associated with this account will be considered by the tool, regardless of which repository they come from. This way, even if an account was less active in some repository, it will be easier to reach the minimum threshold (number of comments) required to classify an account. In addition, given that the number of comments per account will likely be higher, it might further improve the accuracy of the analysis.
I think that this kind of "use case" is relevant for software projects that tend to break down their development into multiple separate repositories, but that still may wish to do an analysis for the software project as a whole, considering all its repositories together.
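The merging step described above amounts to pooling each account's comments across repositories before classification. A minimal sketch of that idea (the function name and data shape are hypothetical, not BoDeGHa's actual data model):

```python
from collections import defaultdict

def merge_comments_by_account(per_repo_comments):
    # per_repo_comments: dict mapping repository name -> dict mapping
    # account login -> list of that account's comments in the repository.
    # Returns one pool of comments per account across all repositories,
    # as if they formed a single virtual repository.
    merged = defaultdict(list)
    for repo, accounts in per_repo_comments.items():
        for login, comments in accounts.items():
            merged[login].extend(comments)
    return dict(merged)
```

The classifier would then be run on the merged pools, so an account with 10 comments in each of three repositories contributes 30 comments to its classification.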
Hello all, I checked on GitHub and there is a huge number of projects (and probably tools) called bodega. Perhaps not very surprising, now that I come to think of it. Perhaps we should consider renaming the tool at some point?
Dear BoDeGHa team, thanks for this interesting tool to detect bots on GitHub; your approach sounds really promising and helpful. However, when I tried to run the tool on just a single project, I identified a number of issues:
(1) Misclassifications: First of all, I tried the tool with its default parameters on the GitHub project owncloud/core. As I am interested in the complete project history, I used --start-date 01-01-2015. This resulted in 20 bots and almost 600 humans. However, many of those 20 bots aren't bots: just checking a few issues or pull requests shows that they are not bots. I won't paste their names here (for data privacy reasons), but if you run your tool with the same parameters yourself, you will be able to check that. I assume that they either have opened many pull requests following the pull-request template of owncloud, or they have only a few comments (e.g., having 22 comments, some of them pull-request templates, may not be enough and may lead to the misclassification). But also for some of the other misclassified accounts, it might be a problem to look at only 100 comments and ignore all previous ones, which leads to the next issue...
(2) In a second step, I tried to increase the number of comments analyzed per account to circumvent the problems listed above. To be precise, I wanted to set maxComments to 2000 (as I think 2000 might be an appropriate number to get rid of the accounts classified as bots which actually are humans). However, there are several bugs in your implementation: it is unclear whether the option is called --maxComments or --maxComment (singular or plural), as the Usage section of your README contains both versions. Moreover, whether maxComments or maxComment, this parameter is never used in your code: the parameters minComments and maxComments are passed to your process_comments function, but within this function they are not used, and are thus useless. Please fix this to pass them to the right places. Instead, I identified 5 lines in which the number 100 is hard-coded, so I guess all those 5 lines should make use of the maxComments parameter. (The same holds for minComments.)
(3) To circumvent the bugs described above, I manually changed the hard-coded number 100 in your code to another number, at all 5 lines where 100 occurs. Unfortunately, whenever one of those 5 numbers is replaced by a number greater than 100, the tool runs into an error right at the beginning of downloading the comments (i.e., the error occurs already if just one of those 5 numbers is set to 101). I don't know what the problem is. Maybe there are too many requests for one single API token, but I don't think that is the problem, as reducing all the other numbers to very small values (e.g., 5) still produces the error if one of them exceeds 100 -- so I am actually not sure what's going on when there is one number greater than 100.
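A plausible explanation for issue (3), offered here as an assumption rather than a confirmed diagnosis: GitHub's APIs return at most 100 items per page (both REST's per_page and GraphQL's first/last arguments are capped at 100), so a single request for 101 comments fails regardless of rate limits. The fix would be to keep every request at 100 or fewer and paginate until maxComments items have been fetched. A sketch of that page-size planning (hypothetical helper, not BoDeGHa's actual code):

```python
GITHUB_PAGE_CAP = 100  # GitHub APIs return at most 100 items per page

def page_sizes(max_comments):
    # Split a desired total into successive request sizes, each <= 100.
    # E.g. 2000 comments would be fetched in 20 pages of 100,
    # while 250 would need pages of 100, 100 and 50.
    sizes = []
    remaining = max_comments
    while remaining > 0:
        take = min(remaining, GITHUB_PAGE_CAP)
        sizes.append(take)
        remaining -= take
    return sizes
```

Each page request would then pass the page's cursor (GraphQL `after`) or page number (REST) from the previous response.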
Please try to fix these issues to make your tool applicable in the wild 🙂 The second issue should be easy to fix, but the third one, which is the most important to me, looks kind of strange. And in the end, I am still not sure how to treat the first one: since I could not try using 2000 comments, I don't know whether that would solve part (1) or not.
Given that BoDeGa was created as a research project that was carried out as part of, and financed by research funds received by, the University of Mons, I think it makes sense to move the GitHub repo to the github.com/University-of-Mons domain. Can this be done without losing any historic information, such as all commits, issues and pull requests? (I guess so...)
If we go for this solution, we should do it soon, before the tool gains traction. What are your thoughts on that?
The name Bodega is already used by many other repositories.
Moreover, the name "BoDeGHa" is better, since it is an acronym for "Bot Detection in GitHub accounts".
Can you therefore rename this git repository, the tool itself, and also all references to this tool in all our publications referring to it? Now is the time to do so, before we have an official publication of the tool...
@mehdigolzadeh can you update the readme file to:
Up to now, I used BoDeGHa with scikit-learn version 0.22, as stated in requirements.txt (line 3 in ac8a5d6). However, when installing BoDeGHa freshly, it uses scikit-learn version 1.0.1, since this is the version given in setup.py (line 34 in ac8a5d6). But using 1.0.1 leads to warnings when running BoDeGHa, as the pretrained model was trained with 0.22:
bodegha/lib/python3.10/site-packages/sklearn/base.py:324: UserWarning:
Trying to unpickle estimator DecisionTreeClassifier from version 0.22 when using version 1.0.1.
This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
warnings.warn(
bodegha/lib/python3.10/site-packages/sklearn/base.py:324: UserWarning:
Trying to unpickle estimator RandomForestClassifier from version 0.22 when using version 1.0.1.
This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/modules/model_persistence.html#security-maintainability-limitations
warnings.warn(
bodegha/lib/python3.10/site-packages/sklearn/base.py:438: UserWarning:
X has feature names, but RandomForestClassifier was fitted without feature names
warnings.warn(
So, as there is a mismatch of the scikit-learn versions in your repository, this needs to be fixed somehow: using a pretrained model that was not trained with the current scikit-learn version could lead to wrong results.
To fix this, one can either set the scikit-learn version in setup.py back to 0.22, or provide a new pretrained model for 1.0.1 in the repository.
I tried to set the version of scikit-learn in setup.py back to 0.22, but without success: scikit-learn 0.22 is not compatible with the current version of numpy anymore (AttributeError: module 'numpy' has no attribute 'float'. `np.float` was a deprecated alias for the builtin 'float'). Downgrading numpy to version 1.19.5 (the version before the deprecation of np.float) was not possible, as numpy 1.19.5 does not work with Python 3.10. Installing numpy 1.21.2 (which is compatible with Python 3.10) results in another error (ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject). I also tried other versions of numpy between 1.19.5 and 1.21.2, also without success.
So, finally, I did not manage to install scikit-learn version 0.22, on which your pretrained model was trained, on Python 3.10.
Could you please update the pretrained model in this repository to work with scikit-learn 1.0.1? Or could you show that using your 0.22-pretrained model with 1.0.1 is still correct, and prevent the corresponding warnings somehow?
Thanks in advance! This would help a lot and would increase the reliability of your tool if these risk warnings disappeared 🙂
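For the second option, once correctness of the old model under the new scikit-learn has actually been verified, the specific unpickling warnings could be silenced with a targeted filter rather than a blanket one. A sketch of that workaround (not an endorsement of shipping a mismatched model, and the function name is made up):

```python
import warnings

def suppress_unpickle_version_warnings():
    # Silence only scikit-learn's "Trying to unpickle estimator ..."
    # UserWarning; all other warnings still surface. This should only
    # be done after verifying that the 0.22-trained model still behaves
    # correctly under the installed scikit-learn version.
    warnings.filterwarnings(
        "ignore",
        message=r"Trying to unpickle estimator",
        category=UserWarning,
    )
```

The cleaner long-term fix remains retraining the model and committing a pickle produced by the pinned scikit-learn version.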
Hello Mehdi, when running bodega, I get the following warning; can this be fixed please?
FutureWarning: The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead
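This warning points at code that reaches numpy through pandas' deprecated pd.np alias (which has since been removed in pandas 2.0). The fix is mechanical: import numpy directly and replace every pd.np. prefix with np.. An illustrative before/after, with made-up variable names:

```python
import numpy as np

# Before (deprecated, triggers the FutureWarning, and removed in
# pandas >= 2.0):
#   mean_gap = pd.np.mean(time_gaps)
# After: import numpy directly and use it instead of the pandas alias:
time_gaps = [1.0, 2.0, 6.0]
mean_gap = np.mean(time_gaps)
```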
By default, when running bodega on a GitHub repository, all accounts in that repository are included in the analysis. It is also possible to run bodega on a set of accounts provided as input. However, this is very impractical if the set of accounts is very big. What if you want to run bodega on all accounts in a repository, except for a small number of known accounts? In that case, it would be much more practical to specify which accounts should be excluded, instead of listing all of those that should be included. What do you think?
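The proposed exclusion behaviour amounts to a set difference over account logins; a minimal sketch (the helper name is hypothetical, and a real CLI flag such as an --exclude option would feed its second argument):

```python
def accounts_to_analyse(all_accounts, excluded):
    # Keep the repository's account order, dropping every excluded login.
    excluded = set(excluded)
    return [login for login in all_accounts if login not in excluded]
```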
In those rare cases where bodega misclassifies a human as a bot, or a bot as a human, it would be nice to have a way to make bodega aware of this, to avoid having the tool report the same misclassification over and over again. I do not know what would be the best way to achieve this, but I can see different possibilities: whenever a misclassification is found, it could be recorded as such (in a file with a specific format and filename), and when the tool is run, it checks this file for known misclassifications. It will then be up to the user of bodega to decide whether to include the misclassified accounts when re-running bodega.
Where should such a file be stored? Different solutions can be envisioned:
(1) On the bodega GitHub repository itself, we could have a file containing all known misclassifications (i.e. all cases that have been reported to us, and verified by us, of accounts that were misclassified when running bodega). When running bodega, this file can then be consulted to report the correct classification of the account.
(2) On the GitHub repository that is being analysed by bodega. Again, when running bodega, this file can then be consulted to report the correct classification of the account.
(3) In the directory of the user that is actually using bodega to run the analysis (e.g. if that user does not have write access to be able to use solution (2) and if that user does not want to share the misclassification for whatever reason).
If we want to combine these multiple solutions, we should probably set a precedence order.