mapbox / gabbar

Guarding OpenStreetMap from harmful edits using machine learning

License: MIT License

Python 0.26% JavaScript 2.30% Jupyter Notebook 97.44%
machine-learning openstreetmap scikit-learn jupyter-notebook vandalism banished

gabbar's Introduction

gabbar

EXPERIMENTAL: UNDER DEVELOPMENT

Gabbar guards OpenStreetMap from invalid or suspicious edits. It is an alpha package containing a pre-trained binary classifier (problematic/not problematic) trained on manually labelled changesets from OpenStreetMap.

https://en.wikipedia.org/wiki/Gabbar_Singh_(character)

Installation

pip install gabbar

Setup

# Setup a virtual environment with Python 3.
mkvirtualenv --python=$(which python3) gabbar_py3

# Install in locally editable (``-e``) mode.
pip install -e .[test]

# Install node dependencies.
npm install

Prediction

screen shot 2017-06-30 at 4 17 46 pm

# A prediction of "-1" represents that this feature is an anomaly (outlier).
gabbar 49172351
[
    {
        "attributes": {
            "action_create": 0,
            "action_delete": 0,
            "action_modify": 1,
            "area_of_feature_bbox": 109591.9146,
            "feature_name_touched": 0,
            "feature_version": 17,
            "highway_tag_created": 41,
            "highway_tag_deleted": 0,
            "highway_value_difference": 0,
            "length_of_longest_segment": 0.1577,
            "primary_tags_difference": 1
        },
        "changeset_id": "49172351",
        "feature_id": "124863896",
        "feature_type": "way",
        "prediction": -1,
        "score": -0.1493,
        "timestamp": "2017-07-10 10:33:02.925012",
        "version": "0.6.2"
    }
]

Testing

npm test


gabbar's People

Contributors

bkowshik, joneskoo, kapadia, pratikyadav, regisb, sgillies


gabbar's Issues

Use model_selection instead of deprecated cross_validation

Presently, when training a model, we see the following deprecation message.

$ python training/datatrain.py
/Users/demo/.virtualenvs/gabbar/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
training samples: 12364
[testing] good samples: 5299
[testing] problematic samples: 671
precision = 0.915625
recall = 0.442348
f1_score = 0.596514
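
For reference, a minimal sketch of the import change, assuming the training script only needs the split and cross-validation helpers (the exact names used in datatrain.py may differ):

# Deprecated; sklearn.cross_validation is removed in scikit-learn 0.20:
# from sklearn.cross_validation import train_test_split, cross_val_score

# Replacement from the model_selection module:
from sklearn.model_selection import train_test_split, cross_val_score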

sklearn.neighbors.LocalOutlierFactor

Per chat with @jcsg, I briefly tried the LocalOutlierFactor model. Posting early; I will spend some more time on a detailed analysis of both the model and the results.

Per https://stackoverflow.com/a/36869611/3453958

The NearestNeighbors class is unsupervised and can not be used for classification but only for nearest neighbor searches.
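
A minimal, self-contained sketch of LocalOutlierFactor on toy data standing in for highway attributes; it is illustrative only and not the actual training code:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy feature matrix standing in for highway attributes; real features would
# come from the labelled dataset.
rng = np.random.RandomState(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
X_outliers = rng.uniform(low=-8, high=8, size=(5, 5))
X = np.vstack([X_inliers, X_outliers])

lof = LocalOutlierFactor(n_neighbors=20)
predictions = lof.fit_predict(X)   # 1 for inliers, -1 for outliers
print(predictions[-5:])            # the injected points should mostly be -1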

Neighbors for an inlier (good highway)

screen shot 2017-07-14 at 9 14 11 pm

Neighbors for an outlier (harmful highway)

screen shot 2017-07-14 at 9 18 41 pm


Building a classifier on changeset comments

Changeset comments can be super interesting! Can a model be trained to learn what the changeset comments of 👍 changesets and 👎 changesets look like?

minor edits / repetition / redundancy / abbreviation / duplication / contraction / shortening / cleaning up overbloating /

Beautiful Fountain nice place for tourist. and a nice grass park where you could sit down and enjoy nature in the city
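
A possible starting point is a scikit-learn text pipeline; the comments and labels below are toy placeholders, not real osmcha data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled comments (0 = good, 1 = problematic); real labels would come
# from changesets reviewed on osmcha.
comments = [
    "Added a building and a footpath",
    "Fixed road names from survey",
    "asdfgh test test",
    "qwerty qwerty qwerty",
]
labels = [0, 0, 1, 1]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(comments, labels)
print(model.predict(["minor edits, cleaning up"]))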


cc: @anandthakker @geohacker @batpad

Bag of Tags

Ref: #69

In the field of Natural Language Processing (NLP), the Bag of Words technique is a popular one. Basically, text is represented as a bag of words, disregarding grammar and even word order but keeping multiplicity.

Along these lines is the concept of a Bag of Tags: all property tags from all samples in the training dataset form the Bag of Tags. Ex:

NOTE: harmful=0 represents a good changeset and harmful=1 a problematic changeset.

Changeset   harmful   highway   name   oneway   surface   maxspeed   vehicle   ...
47514474    0         1         1      1        1         0          0         ...
46429851    0         1         1      0        0         0          0         ...
47349936    0         1         1      1        1         1          1         ...

We collect all tags from changesets labelled with a 👎 and one-hot encode them. Then, we use these as attributes to train a classifier to learn and predict whether changesets are good or problematic based on the occurrence of tags.
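
A toy sketch of one way to build such a matrix, using scikit-learn's DictVectorizer to one-hot encode tag presence (the data below mirrors the table above and is illustrative only):

from sklearn.feature_extraction import DictVectorizer

# Each changeset is represented by the tag keys it touched (toy data).
changesets = [
    {"highway": 1, "name": 1, "oneway": 1, "surface": 1},                               # 47514474
    {"highway": 1, "name": 1},                                                          # 46429851
    {"highway": 1, "name": 1, "oneway": 1, "surface": 1, "maxspeed": 1, "vehicle": 1},  # 47349936
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(changesets)   # one column per tag key
print(vectorizer.feature_names_)
print(X)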


cc: @anandthakker @geohacker @batpad

Feature engineering for Gabbar

Changeset metadata

  • Changeset source - Local knowledge
  • Changeset comment - Added a building
  • changeset_imagery_used - Mapbox
  • Words in the changeset comment - 3

Features in changeset

  • Counts of primary tags for all changeset features - highway: 5, building: 20
  • Primary tags created, modified and deleted
  • Features of type node, way and relation
  • Do features in the changeset overlap with each other?

User name

  • Number of digits or special characters in username
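
A minimal sketch of computing the user-name attributes listed above; the feature names are illustrative and not necessarily the ones Gabbar uses:

import re

def username_features(username):
    """Toy feature extraction for a user name (illustrative only)."""
    return {
        "username_length": len(username),
        "username_digits_count": sum(ch.isdigit() for ch in username),
        "username_special_characters_count": len(re.findall(r"[^A-Za-z0-9]", username)),
    }

print(username_features("mapper_2017!"))
# {'username_length': 12, 'username_digits_count': 4, 'username_special_characters_count': 2}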

Harmful changesets of types not manually seen before

With the current supervised learning based classifier, we train the model on changesets labelled 👍 and 👎. Soon, the classifier will start predicting new changesets based on its training on the labelled dataset. But we have not manually 👀 all types of harmful changesets, and maybe we never will, as new kinds of problematic edits come along.

So, ideally, in the future we will need some kind of unsupervised classifier which is not limited by the subset of labelled samples in the dataset but instead can make use of each and every changeset that comes along on OpenStreetMap.


cc: @anandthakker @batpad @geohacker

Review a random sample of highways

Ref: #69 and #80

I prepared a random sample of touched highway features to manually 👀 and identify good and harmful highways. We can then use this knowledge to make the highway classifier better.

With @amishas157's help, I created a To-Fix task with 9,533 randomly selected highways. I used the Not an error button for good highways and the Fixed button for harmful highways.

To start with, I reviewed 100 highways and did not find any harmful ones.

screen shot 2017-07-03 at 8 01 01 pm


cc: @anandthakker @batpad @geohacker

Increase training size for feature level classifier

Ref #43


  • We currently use 5,269 changesets for training our feature level classifier.
  • From changesets reviewed on osmcha with one-feature modifications, it looks like we can potentially add up to 4,000 changesets.
  • This increase in the number of samples in the training dataset should in turn improve the model.

Next actions

  • Update dataset with the additional 4,000 changesets - @bkowshik

cc: @batpad @geohacker

Increasing number of changesets flagged

For the last two days that gabbar has been live on osmcha-staging, it has flagged fewer than 30 changesets as problematic every day. In the real world, there could potentially be many more changesets that are harmful. So:

  • How can we increase the number of changesets flagged by gabbar?
  • How can we give up a little accuracy so that we bring the number of false negatives down?

cc: @rodowi

Weekly update from Gabbarland

17th Apr - 23rd Apr, 2017

Datasets for training and testing the model are now on S3.

Workflow

Command line API

  • Package is now wired up to take a changeset ID and output predictions.
  • python gabbar/scripts/cli.py --changeset 47734592

Model performance metrics

NOTE: This is our very first weekly update! 🎉


cc: @anandthakker @geohacker @batpad

Bot to catch simple invalid capitalization on OpenStreetMap

From @planemad's post here:

Validation is a good angle to have some bots running to catch simple issues like invalid capitalization in a tag like Highway=residential


I ran a tile-reduce script looking for invalid capitalization in the 26 primary tags below:

aerialway, aeroway, amenity, barrier, boundary, building, craft, emergency, geological,
highway, historic, landuse, leisure, man_made, military, natural, office, place, power,
public_transport, railway, route, shop, sport, tourism, waterway

I eyeballed a few from the list and the results were true positives. Some of the invalid capitalizations were: Building, Highway, etc.

Ex: The feature way/455096754 has an invalid Building tag. So, as soon as changeset 43871967 was created, a bot keeping an 👁️ on the stream corrects the capitalization and leaves a changeset discussion comment informing the user about it, along with some documentation links and the corrected changeset ID.

screen shot 2017-03-08 at 11 19 13 pm

Invalid capitalizations do happen often on OpenStreetMap and are corrected by other community members. Ex: For node/426859638, Highway was corrected to highway after a month.

screen shot 2017-03-08 at 11 32 07 pm

I love the idea of letting the user who created the invalid capitalization know about the simple mistake and making an appropriate change automatically. @planemad, what next actions do you see to make this a real thing on OpenStreetMap?

Choosing a versioning scheme for packaging gabbar on PyPI

Per https://packaging.python.org/distributing/#choosing-a-versioning-scheme

Different Python projects may use different versioning schemes based on the needs of that particular project, but all of them are required to comply with the flexible public version scheme specified in PEP 440 in order to be supported in tools and libraries like pip and setuptools.

Here are some examples of compliant version numbers:

1.2.0.dev1  # Development release
1.2.0a1     # Alpha Release
1.2.0b1     # Beta Release
1.2.0rc1    # Release Candidate
1.2.0       # Final Release
1.2.0.post1 # Post Release
15.10       # Date based release
23          # Serial release

cc: @rodowi @sgillies

Flag changesets predicted problematic on osmcha

We previously had Gabbar as part of osmcha to predict, as changesets come in, whether they are problematic or not. Changesets flagged by Gabbar would get the label Flagged by gabbar.

With Gabbar predictions now easily accessible on the gabbar-frontend, we should get the connection with osmcha back up and running.

As the work on the feature level classifier gets more interesting, it would be good to start sending changesets that Gabbar predicts as problematic to osmcha. This will make it super easy to consume predictions from Gabbar. 😃


cc: @batpad @geohacker

Gabbar 0.5

In preparation for releasing Gabbar 0.5, we will be 👀 changesets flagged as potentially problematic by the latest trained model.

2017-05-15 (Mon)

  • Changesets predicted harmful by model: 385
  • Changesets reviewed: 50
  • Changesets actually problematic: 2
  • Changesets unsure if problematic: 7

Notes

  • Potential bias in the predictions:
    • Changesets by new users.
    • Changesets with few features created (Ex: 1 building created with 5 nodes)

Automate preparation of changesets for manual review

Per #43 (comment)

With the current workflow, every time we have a new trained model, we generate two csv files for manual review:

  1. Fifty unlabelled changesets predicted good
  2. Another fifty unlabelled changesets predicted problematic

Current workflow

  1. Sort changesets by descending order of Gabbar predictions.
  2. Select top 50 rows - changesets with prediction 1, denoting problematic.
  3. Select bottom 50 rows; changesets with prediction of 0, denoting good.

The challenge here is that changesets are by default ordered by changeset ID, so we don't get good variety in the results for manual 👀.

Let's automate this step so that when the notebook is run, changesets for manual review are automatically generated.
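
A toy pandas sketch of that automation; the column names and the shuffling step are assumptions, not the notebook's actual code:

import pandas as pd

# Toy predictions frame; in the real workflow this comes from the notebook's
# prediction step (columns are illustrative).
changesets = pd.DataFrame({
    "changeset_id": range(49172300, 49172500),
    "prediction": [1 if i % 4 == 0 else 0 for i in range(200)],
})

# Shuffle first so the selection is not biased by changeset ID ordering.
shuffled = changesets.sample(frac=1, random_state=0)

shuffled[shuffled["prediction"] == 1].head(50).to_csv("review_problematic.csv", index=False)
shuffled[shuffled["prediction"] == 0].head(50).to_csv("review_good.csv", index=False)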

Using the new osmcha API to download changeset labels

The new osmcha is here! 🎉

With it come some changes to the API, especially render_csv=True being deprecated. 😞 The documentation of the API is at the link below:

Changes

  • The API is paginated, so we will have to make multiple requests
  • The format of the csv is different and changeset_id is not the first column anymore:
geometry.coordinates
geometry.type
id
properties.area
properties.check_date
properties.check_user
properties.checked
properties.comment
properties.create
properties.date
properties.delete
properties.editor
properties.harmful
properties.imagery_used
properties.is_suspect
properties.modify
properties.source
properties.uid
properties.user
type
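
Since the API is paginated, a sketch of pulling multiple pages from Python; the base URL and response keys below are placeholders to be replaced per the osmcha API documentation:

import requests

BASE_URL = "https://osmcha.example.org/api/changesets/"   # placeholder URL

def fetch_labelled_changesets(max_pages=3):
    results = []
    url = BASE_URL
    pages = 0
    while url and pages < max_pages:
        response = requests.get(url)
        response.raise_for_status()
        payload = response.json()
        results.extend(payload.get("features", []))   # GeoJSON-style records
        url = payload.get("next")                     # follow the pagination link
        pages += 1
    return results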

cc: @anandthakker @batpad @geohacker

Convert binary attributes into rich numericals

Ref #43


In https://osmcha.mapbox.com/47414802/, a place=village was converted to place=town.

screen shot 2017-05-30 at 12 56 22 pm

At present, the context we give the machine learning model about this modification, along with other attributes, is:

  • place: 1
  • place_old: 1
  • place_modification: 1
  • harmful = 1

But the model has no knowledge of what the modification actually was, which it needs to make an effective prediction on whether the feature modification was a 👍 or a 👎. So, how about we convert the binary value representing the modification into richer numerical values to help the model make a more informed decision?

Popularity from TagInfo

TagInfo provides values for what percentage of place features have, say, city as the value. Ex: 0.43% of all place objects on OpenStreetMap are place=city. With this, the model would get the following attributes:

  • place_new: 24.92% - Percentage of place=village
  • place_old: 2.21% - Percentage of place=town
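
A sketch of looking the percentages up programmatically; the taginfo endpoint and field names below are assumptions and should be checked against the taginfo API documentation:

import requests

def place_value_fraction(value):
    # Assumed endpoint and response fields; verify against the taginfo API docs.
    response = requests.get(
        "https://taginfo.openstreetmap.org/api/4/key/values",
        params={"key": "place"},
    )
    response.raise_for_status()
    for row in response.json().get("data", []):
        if row.get("value") == value:
            return row.get("fraction")   # share of all place=* objects
    return None

print(place_value_fraction("village"))
print(place_value_fraction("town"))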

cc: @batpad @geohacker

Understanding validation and vandalism detection work on Wikipedia

NOTE: This is a work in progress. Posting here to start a discussion around the topic.


Wikimedia uses Artificial Intelligence for the following broad categories:

  • Vandalism detector. Use edit statistics to find correlations and predict if an edit is problematic.
  • Article edit recommender. Use a user's edit history to predict which articles they could edit next.
  • Article quality prediction. To assess quality of articles on Wikipedia.

On Wikipedia there are 160k edits, 50k new articles and 1,400 new editors every day. The goal is to split the 160k edits into:

  1. Probably OK, almost certainly not vandalism
  2. Needs manual review, might possibly be vandalism

Themes for validation

  • Points of view or standpoint:
    • Wikipedia is a firehose
    • Bad edits must be reverted
    • Minimize manual effort wasted on quality control work
    • Socialize and train newcomers
  • Design tools for Empowerment vs. Power over.
    • Empowerment: I want to hear you, so I'll make space for you to speak and listen.
    • Power over: I want to set the tone of our conversation by talking first.
  • A flipped publication model: Publish first and review later.
  • Given enough eyeballs, all bugs are shallow. If we have a large enough group of people looking at something, somebody will know the right way to solve the problem.

Welcoming newcomers

More newcomers is a major Wikimedia goal and new spaces have been developed to support newcomers. Quality control in Wikipedia is being designed with newcomer socialization in mind so that newcomers (especially those who don't conform) are not marginalized and good-faith newcomers are retained. Although anonymous edits on Wikipedia are twice as likely to be vandalism, 90% of anonymous edits are good.

From this Slate article:

Most people first get involved with Wikipedia, one of the largest social movements in history, by making some minor corrections or starting a small article that is missing. If their contributions get deleted, especially if there is no sufficient explanation why, they are likely to quit. It is quite destructive to the community's long-term survival, as Wikipedia has struggled for quite a while with editor retention.

Popular validation tools

There are around 20 volunteer-developed tools and 3 major Wikimedia product initiatives. Some popular ones are:

  • Objective Revision Evaluation Service (ORES) is intended to provide a generalized service to support quality control and curation work in all wikis.
    • Edit quality models for predicting whether or not an edit causes damage, was saved in good faith, or will eventually be reverted.
    • Article quality models that help gauge progress and identify missed opportunities (popular articles that are low quality); see the Wikipedia 1.0 assessment.
  • Huggle, a diff browser intended for dealing with vandalism and other un-constructive edits on Wikimedia projects.
  • STiki, a tool used to detect and revert vandalism or other un-constructive edits on Wikipedia, available to trusted users.
  • User:ClueBot NG, an anti-vandal bot that tries to detect and revert vandalism quickly and automatically. It has a 0.1% false-positive rate and is able to detect 40% of all vandalism.

There is a basic web interface for ORES at https://ores.wikimedia.org/ui. Some of the features used to classify a revision as problematic or not are: whether the user is anonymous, the number of characters/words added, modified and removed, and the number of repeated characters and bad words added. Prediction scores for a problematic revision look like the following:

https://ores.wmflabs.org/scores/enwiki/damaging/642215410

{
  "642215410": {
    "prediction": true,
    "probability": {
      "false": 0.11271979528262599,
      "true": 0.887280204717374
    }
  }
}
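
Fetching the same score from Python is straightforward with requests; keys are read defensively since the service's response shape may change:

import requests

url = "https://ores.wmflabs.org/scores/enwiki/damaging/642215410"
scores = requests.get(url).json()
revision = scores.get("642215410", {})
print(revision.get("prediction"), revision.get("probability"))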

There has been quite a lot of research in this field, evident from the number of results on Google Scholar for Wikipedia vandalism detection.



cc: OpenStreetMap Community

Feature level classifier in Gabbar

Gabbar has traditionally been a changeset level classifier, which means that given a changeset ID, Gabbar extracts features at the changeset level to predict whether the changeset is harmful or not. Let's try a feature level classifier as part of Gabbar.

Why a feature level classifier?

  • On osmcha, users review changesets and label them as either good or harmful. This is a little too binary for a machine learning model. The question arises: when a changeset is labelled harmful, does that mean all features touched in the changeset are harmful?
  • We have accurate information at the feature level on why a feature modification is a 👍 or a 👎, which gets generalized away at the changeset level.

Feature level dataset

Thanks to osmcha's filters, we can filter changesets reviewed where the maximum number of features created, modified and deleted is one or less.

  • Looks like there are 14,314 changesets. Yay!!!
One feature   Number of changesets reviewed   Harmful changesets
Created       3,333                           413
Modified      9,727                           2,264
Deleted       321                             20

cc: @batpad

Feature selection

From https://en.wikipedia.org/wiki/Feature_selection

In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction.

Feature selection techniques are used for three reasons:

  • Simplification of models to make them easier to interpret by researchers/users
  • Shorter training times
  • Enhanced generalization by reducing overfitting
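
A minimal scikit-learn sketch of univariate feature selection on toy data standing in for changeset attributes:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data standing in for changeset attributes and labels.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                      # (500, 5)
print(selector.get_support(indices=True))    # indices of the retained attributes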

Model learning a pattern incorrectly during training

Ref: #43

There were 5 changesets in the training dataset that the model was not able to learn correctly. They were labelled 👍 on osmcha but somehow the model was predicting them to be 👎.

                   Predicted good   Predicted harmful
Labelled good      4850             5
Labelled harmful   0                437

Curious to understand why, I 👀 the results myself. 4 out of the 5 had a pattern: in each of them, a natural=water feature got a new property, water=marsh.

water-is-marsh

All attributes except the following are the same for all of these samples.

  • changeset_bbox_area
  • feature_area
  • feature_area_old

Next actions

  • Why is the model learning this incorrect behavior?
  • How do we re-train the model to predict such changesets as 👍?

cc: @anandthakker @geohacker @batpad

Make this repository public

@rodowi, you had brought this up in one of our voice conversations.

Current setup for osmcha

  • Copy over changeset_to_data and predict utility functions
  • Copy over the trained model, autovandal.pkl

Benefits

  • autovandal can be added to requirements.txt in osmcha
  • We can version the model to measure progress over time
  • We only need to update the package version of autovandal in osmcha instead of copying over everything.

cc: @batpad @geohacker

Prototype an anomaly detection model for highways

Ref: #80 and #69

tumblr_inline_o6kjvapgbs1ta78fg_540

We all know labelled data is gold in machine learning land. But, in the context of OpenStreetMap and osmcha, there are two things:

1. Labelled harmful highways

On osmcha, labelling happens at the changeset level: a changeset is either good or harmful. But there are scenarios where not all features of a changeset are harmful, so we should not assume all features of a harmful changeset are harmful. In Gabbar, we worked with changesets where only one feature was touched; thus, if the changeset was good, the only feature was good, and if the changeset was harmful, the only feature was harmful, since there was only one feature in the changeset.

This worked OK for a generic classifier, but for the highway classifier the size of the dataset is too low. For example, the latest highway classifier was trained on 2,217 good highways and a mere 55 harmful highways. Yes, the number of harmful highways is low. This means supervised learning algorithms might not be fed enough to be strong and healthy.

2. Labelled good highways

But we have a comparative abundance of labelled highways that are good. The 2,217 changesets from above are there, but there are even more. When a changeset is labelled good, it is safe to assume all features in the changeset are good, which in turn means all the highway features in it are good too. Yay!

There are 50,000+ changesets labelled on osmcha, and assuming every changeset has at least one highway, as highways are among the most frequently edited features on OpenStreetMap, we could potentially have around 50,000+ labelled good highways. This might be an interesting scenario to try anomaly detection models on.

From https://en.wikipedia.org/wiki/Anomaly_detection

anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.

Another potentially big advantage of anomaly detection models is that they flag when things are different from expected. This means we are not limited by the different types of harmful edits we have seen or given the model for training, but are in a way ready for new and unknown types of anomalies. One important thing about anomaly detection is that these models don't tell you whether a changeset is good or bad; they tell you whether it is something expected or something different.
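
A toy sketch of that idea with scikit-learn's IsolationForest, trained only on stand-in "good highway" vectors; the data and parameters are illustrative, not the actual highway attributes:

import numpy as np
from sklearn.ensemble import IsolationForest

# Train only on (toy) good highway attribute vectors; at prediction time the
# model returns 1 for expected samples and -1 for anomalies, mirroring the
# prediction output at the top of this README.
rng = np.random.RandomState(0)
good_highways = rng.normal(loc=0.0, scale=1.0, size=(2217, 6))

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(good_highways)

unseen = np.vstack([rng.normal(size=(3, 6)), rng.uniform(-10, 10, size=(2, 6))])
print(model.predict(unseen))   # e.g. [ 1  1  1 -1 -1]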


cc: @anandthakker @geohacker @batpad

Datasets: Training, Validation and Testing

1. Training

  • Labelled changesets from osmcha between January and April 2017
  • Model will initially be trained on 20% of this dataset called the sample
  • Before publishing, model will be trained on 100% of this dataset

2. Validation

  • To estimate how well your model has been trained
  • Using labelled changesets from osmcha from May, 2017

3. Testing

  • All changesets from OpenStreetMap on 1st May, 2017

Model baseline performance

A model baseline will help in understanding and measuring the progress we are making with the model in terms of its performance. scikit-learn, the package we use in Gabbar, has a model just for that: the DummyClassifier.

I trained the DummyClassifier on the training dataset and got predictions on the validation dataset. Baselines look close to what a model generating random predictions would give.
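
A minimal sketch of producing such a baseline with DummyClassifier on toy, imbalanced data (the real run uses the labelled changeset attributes):

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy stand-in for the labelled changeset attributes (0 = good, 1 = harmful).
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test)))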

Confusion matrix

                   Predicted good   Predicted harmful
Labelled good      2086             247
Labelled harmful   223              27

Classification report

                precision   recall      f1-score    support

0.0             0.90        0.89        0.90        2333
1.0             0.10        0.11        0.10        250

avg / total     0.83        0.82        0.82        2583

roc_auc

  • Score: 0.49 (0.02) - mean(std dev)

These look very close to what I was expecting. No next actions.


cc: @anandthakker @batpad @geohacker

Using reverted changesets for model training

Per text with @batpad,

Changeset comment has revert

There are a total of 13,125 changesets on osmcha with revert in the changeset comment. Interestingly, 2,505 (20%) of them are one-feature-modification changesets, which is what we use in the latest version of Gabbar.

Assuming mappers revert a problematic or wrong feature in these one-feature-modification changesets, this could be an additional dataset we could make use of for the current iteration of Gabbar's feature level classifier. I manually 👀 a couple of these changesets and they are definitely what we want to catch with Gabbar.

screen shot 2017-06-15 at 7 20 14 pm

screen shot 2017-06-15 at 7 23 52 pm

Changesets from revert user accounts

Mappers and the DWG sometimes maintain a separate account for reverts. Changesets from these accounts will be interesting to look at as well. Ex:

screen shot 2017-06-15 at 7 27 42 pm


cc: @anandthakker @geohacker

Regression test suite for automated testing

Per http://machinelearningmastery.com/deploy-machine-learning-model-to-production/

Develop Automated Tests For Your Model

Write regression tests for your model.

  • Collect or contribute a small sample of data on which to make predictions.
  • Use the production algorithm code and configuration to make predictions.
  • Confirm the results are expected in the test.

We could start with a 2x manually verified dump of 100 changesets that contains:

  • 50 changesets that are good, and
  • 50 changesets that are problematic
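
A pytest-style sketch of such a regression test; the fixture path and the gabbar.get_prediction helper are hypothetical names used for illustration:

# test_regression.py
import json

import gabbar


def test_known_changesets_keep_their_predictions():
    # Hypothetical fixture file of labelled changesets and expected predictions.
    with open("tests/fixtures/labelled_changesets.json") as f:
        fixtures = json.load(f)   # [{"changeset_id": ..., "expected": ...}, ...]

    for fixture in fixtures:
        # gabbar.get_prediction is a placeholder for whatever prediction entry
        # point the package exposes.
        assert gabbar.get_prediction(fixture["changeset_id"]) == fixture["expected"]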

cc: @anandthakker @geohacker @batpad

Effect of attributes on the feature level classifier

Similar to the work on training size, we have questions about the effect of the number of attributes on the model:

  • Does the model have enough attributes?
  • Which attributes contribute how much to the model metrics?
  • Can fewer attributes be better in the long term?

Workflow

  • Get a list of all attributes available for training
  • Increase the training attributes appending one at a time from the attributes list
  • Train a model with these attributes from the training dataset
  • Get predictions from the model on this subset of attributes from the validation dataset
  • Store model metrics on the validation dataset and plot
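
A toy version of this workflow on synthetic data; the model and metric below are stand-ins for whatever Gabbar actually trains, as sketched here:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Add one attribute (column) at a time and record the validation f1 score.
X, y = make_classification(n_samples=2000, n_features=25, n_informative=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

scores = []
for n_attributes in range(1, X.shape[1] + 1):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train[:, :n_attributes], y_train)
    scores.append(f1_score(y_val, model.predict(X_val[:, :n_attributes])))

for n_attributes, score in enumerate(scores, start=1):
    print(n_attributes, round(score, 3))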

Notes

index

  • There are interesting dips in metrics when the following attributes are added to the list of attributes:
    • user_changesets_with_discussions_count
    • old_user_name_special_characters_count
    • feature_version
    • feature_has_website_old
    • iD
    • Vespucci
  • The metrics somewhat reach their maximum around the 20-attribute mark, except for the occasional dips
  • I am not sure what else to read out of this graph.

cc: @anandthakker @batpad @geohacker

Translating names to English for validation using external APIs

NOTE: Posting here to document a potential idea.


In changeset 48269805, there was one feature that was modified:

  • The name of the marketplace was modified from English to Chinese

screen shot 2017-05-23 at 12 56 03 pm

I gave the Google Translate API a try to translate the new name, 亚庇**巴刹, back to English. The result was Kota Kinabalu. These two words match the English name in the previous version of the feature, Pasar Kota Kinabalu Central.

'use strict';

const Translate = require('@google-cloud/translate');
const projectId = 'Insert project ID';
const translateClient = Translate({
    projectId: projectId
});

// The name to translate back to English.
const text = '센트럴마켓';

translateClient.translate(text, 'en')
.then((results) => {
    const translation = results[0];

    console.log(`Text: ${text}`);
    console.log(`Translation: ${translation}`);
})
.catch((err) => {
    console.error('ERROR:', err);
});

Explore GradientBoosting algorithm used by Wikimedia's ORES

From https://meta.wikimedia.org/wiki/Objective_Revision_Evaluation_Service

The Objective Revision Evaluation Service (ORES) is a web service that provides machine learning as a service for Wikimedia Projects. The system is designed to help automate critical wiki-work -- for example, vandalism detection and removal.

It looks like Wikimedia's Objective Revision Evaluation Service (ORES) makes use of the GradientBoosting algorithm. I am curious about:

  • Why the choice of GradientBoosting
  • How would it be useful in the context of gabbar
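
A minimal scikit-learn sketch of what trying GradientBoosting could look like on stand-in data; it is not a claim about how ORES configures the algorithm:

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Toy comparison point: a GradientBoostingClassifier on stand-in changeset data.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
print(cross_val_score(model, X, y, cv=5, scoring="f1").mean())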


Effect of cross validation parameter on model metrics

In cross-validation, the cv parameter determines the cross-validation splitting strategy. Ex: If cv=3, it is a 3-fold cross-validation. I was curious to see the impact of the value of cv on the model metrics.

Workflow

  • Load up a trained model
  • Vary cv from 1 to 320 and run cross validation
  • On each run, record the model metrics: precision, recall and f1 score
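
A toy sketch of this sweep on synthetic data (note that scikit-learn requires cv >= 2, so the sweep below starts at 2):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative data and model, not the actual Gabbar training set.
X, y = make_classification(n_samples=1000, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

for cv in [2, 3, 5, 10, 20]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(cv, round(scores.mean(), 3), round(scores.std(), 3))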

The following is the graph I got.

index

Questions

  • What should be a good value of cv to use to consistently measure model performance?

cc: @anandthakker @batpad @geohacker

Detect changesets that are very likely to have problems

The 2 Parts

There are two parts to the problem:

  1. High precision
    • A high percentage of correct vs. incorrect predictions, i.e. fewer false positives.
    • Ex: Predictions are right about 80% of the time, but the model finds less than 20% of all the problematic edits.
  2. High recall
    • Find all or most of the problematic edits.
    • Ex: Finds 80% of all problematic edits but is right only 20% of the time.

Ideally, we want a model that has both high precision and high recall. 😇 But practically, we can only hit one at a time. And once we hit one well, we work on the other problem too.
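
One common way to aim for the high-precision end first is to raise the decision threshold on predicted probabilities; a toy sketch on synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Illustration of trading recall for precision by raising the decision threshold.
X, y = make_classification(n_samples=3000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]

for threshold in [0.5, 0.8, 0.95]:
    predictions = (probabilities >= threshold).astype(int)
    print(threshold,
          round(precision_score(y_test, predictions), 2),
          round(recall_score(y_test, predictions), 2))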


With this ticket, I would like to propose:

  • Building a model that can predict changesets that are VERY LIKELY TO HAVE PROBLEMS.
  • Tackle the first problem of HIGH PRECISION
  • Model is right about 80% of the time but finds less than 20% of all the problematic changesets.

cc: @anandthakker @geohacker @batpad

Prototyping Gabbar for highway features

One of the popular problems in machine learning is dogs vs cats: given a picture, predict whether it is of a dog or a cat. Coming from this initial experience with machine learning, I kept thinking the problem of classifying changesets as good or problematic was something similar. But today I did an exercise where I wanted to identify the one attribute about a changeset that makes it good or problematic. I started with:

screen shot 2017-06-16 at 9 15 25 am

The following questions came to mind

  • What could be the source of knowledge to modify?
  • Isn't residential better than unclassified; I mean something is better than nothing right?
  • At version 15, this is quite a mature feature. So, is that alright?
  • What is the length of the highway; smaller should be residential and longer unclassified?
  • Why is source=google maps? Really?

From https://wiki.openstreetmap.org/wiki/Key:highway

  • highway=unclassified

The least important through roads in a country's system, i.e. minor roads of a lower classification than tertiary, but which serve a purpose other than access to properties. Often link villages and hamlets.

  • highway=residential

Roads which serve as an access to housing, without function of connecting settlements.

From https://osmlab.github.io/osm-deep-history/#/way/103217436

  • The feature has mostly been highway=unclassified since creation in 2011.

screen shot 2017-06-16 at 9 19 59 am

Looking deeper into other changesets where a highway=residential gets modified into highway=unclassified, I found this user, Порфирий, who has lots of changesets with the same behavior. Interestingly, the user who added highway=residential is Порфирий too.

screen shot 2017-06-16 at 9 30 27 am

Eureka!

When a highway modification has so many questions to answer and attributes to look at, what will the scale be when we look at all 26 primary tags together? What about features that don't have any primary tags? Too many questions! Too many attributes! Right?

  • This does not look like the traditional cats vs dogs problem. It is a little something else.
  • How about we try something different? How about we build one machine learning model for each object type?
  • How would it look if there were a model trained on highways to classify whether a new/modified highway is a 👍 or a 👎?
  • Another trained on buildings, another on water bodies, etc., with each knowing what a good feature of its type looks like and what a problematic one looks like?
  • Is this it?

cc: @anandthakker @geohacker @batpad

Metrics for measuring Gabbar performance

NOTE: The numbers below are for the purpose of illustration only.

Every day

  • Total number of changesets that day: 25,000
  • Number of changesets flagged as problematic by Gabbar: 500 (2%)
  • Number of changesets manually reviewed on osmcha: 250 (1%)
  • Confusion matrix between manually reviewed and flagged as problematic changesets:
                   Predicted good   Predicted harmful
Labelled good      200              5
Labelled harmful   10               35

On the validation set (changesets labelled on osmcha in May 2017)

  • Total number of changesets: 5,000
  • Confusion matrix between manually reviewed and flagged as problematic changesets:
                   Predicted good   Predicted harmful
Labelled good      4500             50
Labelled harmful   100              350

cc: @batpad
