best-practice-se-text-mining's Introduction

Best-Practice-SE-text-mining

For BIGDSE '16

Notes

  • Understand the relevance of SE processes for ML/Big Data

  • There has been a constant push for using ML in SE. But what about SE for ML?

  • We'd like to explore what SE can teach ML

  • Big data and ML practitioners have a variety of tools at their disposal; as these grow in size, a validation team becomes necessary.

  • The Mythical Man-Month: 1/2 of the time is spent on testing.

  • Coding takes only 1/6 of the entire time.

  • Industrial data mining has taught us the significance of the goal of a given task.

  • Key takeaway: Your goals are not my goals.

DM at LN

  • The problem is indeed unique. No dearth of data, but labeling data is quite expensive
  • Emulating real-world data is hard; forums such as Stack Exchange can be used to address this issue.
  • TAR is primarily a binary classification task.
  • Using StackEx data at site-level granularity produces a satisfactory analogy to the real problem at hand.
  • Binary classification of this sort is vastly different from other techniques. This enables us to take shortcuts.
  • These lessons are by no means general; we only endeavor to highlight the challenges in industrial data mining.
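
One way to read the site-level analogy above is sketched here: posts from one Stack Exchange site are treated as "relevant" and posts from every other site as "not relevant", which yields a binary classification task. This is only an illustration under stated assumptions; the function name `label_by_site`, the `(site, text)` tuple layout, and the toy posts are hypothetical and do not come from the project.

```python
from typing import Iterable, List, Tuple


def label_by_site(posts: Iterable[Tuple[str, str]],
                  target_site: str) -> List[Tuple[str, int]]:
    """Turn (site, text) pairs into (text, label) pairs.

    label is 1 when the post comes from target_site, otherwise 0.
    """
    return [(text, 1 if site == target_site else 0) for site, text in posts]


# Hypothetical toy corpus; real posts would come from a Stack Exchange dump.
posts = [
    ("physics", "Why is the sky blue?"),
    ("tex", "How do I center a table?"),
    ("physics", "What is a photon?"),
    ("scifi", "Who built the Death Star?"),
]

labeled = label_by_site(posts, target_site="physics")
for text, label in labeled:
    print(label, text)
```

Choosing a different `target_site` changes how many posts count as relevant, so the same corpus can stand in for review tasks of differing difficulty.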

Structure: feel free to modify this

Abstract

Introduction

  • Motivations and background
  • Description of Data
  • Related works

Technology Assisted Review

  1. My goals are not your goals.

  2. My data isn't your data

  3. Describe Prec/Recall and their importance

  4. StackEx data

  5. Prevalence

  6. Sampling - Stratified sampling, Unequal Sampling

  7. Big data sometimes isn't

  8. Challenges in EDISC: See p13 of the reference.
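
As a rough illustration of items 3, 5, and 6 above, the sketch below computes precision, recall, and prevalence for a binary task and draws a stratified sample that keeps each class at roughly its original share. It is a minimal sketch, not the project's implementation: the function names, the 0/1 label encoding, and the toy labels are all assumptions made for the example, and unequal sampling is not covered.

```python
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def precision_recall_prevalence(y_true: Sequence[int],
                                y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, prevalence) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Prevalence: fraction of all documents that are truly relevant.
    prevalence = sum(y_true) / len(y_true) if y_true else 0.0
    return precision, recall, prevalence


def stratified_sample(labels: Sequence[int], fraction: float,
                      seed: int = 0) -> List[int]:
    """Sample indices class by class so each class keeps roughly its share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    chosen: List[int] = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * fraction))  # keep at least one per class
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)


# Hypothetical low-prevalence example: 2 relevant documents out of 10.
y_true = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

p, r, prev = precision_recall_prevalence(y_true, y_pred)
sample = stratified_sample(y_true, fraction=0.5)
print(p, r, prev, sample)
```

With low prevalence, a plain random sample can easily contain no relevant documents at all; sampling within each class avoids that, which is one reason stratification matters for this kind of data.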

Experiments

  • Best Decision
  • All other decisions, compared with the best one

Discussion

  • some words to justify the best decision
  • lessons learnt from this project
  • The role of validation teams:
    • Mining large industrial data offers significant lessons that can be learnt from SE practices
  • validity threats

Conclusion

best-practice-se-text-mining's People

Contributors

ai4se, rahlk, timm


best-practice-se-text-mining's Issues

Introduction2: Data Description

Describe why we are using StackExchange data and what it looks like.
For the SExx data sets in particular, describe how we obtain them and their characteristics.

Data sets that will be used in this paper:

tex drupal academia apple gamedev rpg english electronics physics scifi SE0 SE1 SE2 SE3 SE4 SE5 SE6 SE7 SE8 SE9 SE10 SE11 SE12 SE13 SE14
