Comments (8)
I didn't understand the second option, and the third option can only improve the result for that specific user. The first one, however, seems feasible. Could you please elaborate on this idea a bit more?
from bodegha.
I think it's better to ask users (or to have a semi-automatic way to do it) to report those misclassified cases, so we can add them to the training set and release a new version of the tool with an improved model.
From a reusability point of view, it's better to improve the model than to maintain a list of "edge cases" whose target class overrides the one of the model.
We already mention in the README that users can report misclassified cases to us if they find any. As for a semi-automatic way: would it be feasible to add support in the tool itself for reporting misclassified cases to us? I do not immediately see how.
Since an API key has to be provided anyway, we could add a subcommand to report invalid cases (e.g., enter usernames that are misclassified in a given repository) that automatically opens an issue in this repository with them?
I'm not convinced we need something like this, since we can simply ask/expect/hope users to report misclassified cases "manually".
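The subcommand proposed above could be quite thin. Here is a minimal sketch using the GitHub REST API "create an issue" endpoint and only the standard library; the target repository slug and all function names are hypothetical, not BoDeGHa's actual code:

```python
import json
import urllib.request

# Hypothetical target repository that would receive the reports.
REPORT_REPO = "mehdigolzadeh/BoDeGHa"

def build_report_issue(version, repository, accounts):
    """Assemble the issue title and body for a misclassification report."""
    lines = [f"BoDeGHa version: {version}",
             f"Analyzed repository: {repository}",
             ""]
    lines += [f"- {account}" for account in accounts]
    return {"title": f"Misclassification report for {repository}",
            "body": "\n".join(lines)}

def open_report_issue(api_key, version, repository, accounts):
    """POST the issue to the GitHub REST API, reusing the user's API key."""
    payload = json.dumps(build_report_issue(version, repository, accounts)).encode()
    request = urllib.request.Request(
        f"https://api.github.com/repos/{REPORT_REPO}/issues",
        data=payload,
        headers={"Authorization": f"token {api_key}",
                 "Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["html_url"]
```

Since the user's token is already on the command line, no extra authentication step would be needed; the open question is what to put in the issue body, which the later comments address.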
It is probably too optimistic to think that people will report misclassified cases manually just because it is mentioned in our README.
I think that any support that helps automate the process will reduce the workload, both for the user who wants to report the misclassification and for us when keeping track of reported misclassifications. Therefore, if it is possible and not too difficult to implement such a reporting scheme as part of the tool, one that automatically opens an issue on the bodega GitHub repository, that could be a nice solution.
Any built-in possibility to report misclassified cases as GitHub issues will require a second execution of the tool (since it is not interactive, and it won't be, given we want to keep it as a reusable CLI). Why is a second execution needed? Because we should be able to reproduce the example, so we need the exact set of comments that were considered by the model (or, at least, the exact set of features that were computed for that specific case).
One "easy" possibility would be to add an extra "--report" flag accepting a list of accounts that are misclassified. E.g., if the tool was run with
bodega request/request --key <my token> --start-date 01-01-2017 --verbose
(example taken from the README), one could use
bodega request/request --key <my token> --start-date 01-01-2017 --verbose --report greenkeeperio-bot hktalent
to automatically report these two accounts as misclassified. This should create an issue in the bodega repository with enough information for each account so that we can check and confirm the misclassification. I believe we only need the version of bodega that was used and, for each account (accounts do NOT have to be provided), a list of the considered comments. That way, we can download them, compute the features, predict the class, add this example with the "opposite" class to the training set, rebuild the model, and release a new version of bodega.
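Such a flag could be wired in with argparse. A minimal sketch, where the positional argument and option names mirror the README invocation quoted above but are otherwise assumptions about BoDeGHa's real CLI:

```python
import argparse

# Minimal parser mirroring the example invocation; BoDeGHa's actual
# option handling may differ (these names are assumptions).
parser = argparse.ArgumentParser(prog="bodega")
parser.add_argument("repository")
parser.add_argument("--key", required=True)
parser.add_argument("--start-date")
parser.add_argument("--verbose", action="store_true")
# New flag: one or more account names to report as misclassified.
parser.add_argument("--report", nargs="+", default=[], metavar="ACCOUNT")

args = parser.parse_args([
    "request/request", "--key", "<my token>", "--start-date", "01-01-2017",
    "--verbose", "--report", "greenkeeperio-bot", "hktalent",
])

if args.report:
    # Second execution: re-classify just these accounts, keep the exact
    # set of comments the model saw, and include them in the opened issue.
    print(f"Reporting {len(args.report)} account(s): {', '.join(args.report)}")
```

With nargs="+", everything after --report is collected into a list, so the flag composes with an otherwise unchanged invocation.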
Btw, doing all of this manually could be very time-consuming for us, but if that's the case, we can still try to implement all these steps as part of a CI (e.g., let's dream of a bot we would develop that downloads the comments, computes the features and the prediction, and posts all of this in the corresponding issue, so that one of us can "confirm" the misclassified case by putting a "confirmed" label on the issue; the CI then rebuilds the model and pushes it to the repository, with an incremented version of bodega and a tag for the new release). But honestly, given the work all of this represents, I think it's too much for a "research tool" ;)
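For what it's worth, the "confirm by label" gate of such a CI would be small: the GitHub REST API returns an issue's labels as objects with a "name" field, so the bot only needs to look for "confirmed" before retraining. A sketch under the same assumptions as before (repository slug and function names are hypothetical):

```python
import json
import urllib.request

REPORT_REPO = "mehdigolzadeh/BoDeGHa"  # hypothetical target repository

def is_confirmed(labels):
    """True once a maintainer has put the "confirmed" label on the report."""
    return any(label["name"] == "confirmed" for label in labels)

def should_retrain(api_key, issue_number):
    """Fetch a report issue's labels and tell the CI whether to retrain."""
    request = urllib.request.Request(
        f"https://api.github.com/repos/{REPORT_REPO}/issues/{issue_number}",
        headers={"Authorization": f"token {api_key}",
                 "Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        return is_confirmed(json.load(response)["labels"])
```

The expensive parts (feature computation, retraining, tagging a release) would sit behind this check, so unconfirmed or spurious reports cost nothing.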
Notice that we can ask a student to do this (e.g., as an M1 project).
Yes, this looks like an interesting master student project to pursue. Let's try that. If you want, you can close this issue for now (or leave it open until we have a working implementation, but this can take quite a while).