For BIGDSE '16
-
Understand the relevance of SE processes for ML/Big Data
-
There has been a constant push for using ML in SE. But what about SE for ML?
-
We'd like to explore what SE can teach ML
-
Big data and ML practitioners have a variety of tools at their disposal; as these grow in size, a validation team becomes necessary.
-
The Mythical Man-Month: 1/2 of the time is used for testing.
-
Coding takes only 1/6 of the entire time.
-
Industrial data mining has taught us the significance of the goal of a given task.
-
Key takeaway: Your goals are not my goals.
- The problem is indeed unique. There is no dearth of data, but labeling data is quite expensive.
- Emulating real-world data is hard; forums such as Stack Exchange can be used to address this issue.
- TAR is primarily a binary classification task.
- Using StackEx at site-level granularity produces a satisfactory analogy to the real problem at hand.
- Binary classification of this sort is vastly different from other techniques. This enables us to take shortcuts.
- These lessons are by no means general; we only endeavor to highlight the challenges in industrial data mining.
- Motivations and background
- Description of Data
- Related works
-
My goals are not your goals.
-
My data isn't your data
-
Describe precision/recall and their importance
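As a reminder of what the two measures mean for a binary task such as TAR, here is a minimal sketch that computes them from true and predicted labels. The function name `precision_recall` and the toy labels are illustrative assumptions, not taken from the project data.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # relevant and retrieved
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # retrieved but not relevant
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # relevant but missed
    # Guard against empty denominators (no predictions or no positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels: 3 relevant documents out of 8.
y_true = [1, 0, 0, 1, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
precision, recall = precision_recall(y_true, y_pred)
```

Precision is the fraction of retrieved items that are relevant, while recall is the fraction of relevant items that were retrieved; which of the two matters more depends on the goal of the task.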
-
StackEx data
-
Prevalence
-
Sampling - Stratified sampling, Unequal Sampling
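One way to picture the difference between the two sampling schemes is the sketch below. It assumes "stratified" means drawing the same fraction from each class and "unequal" means drawing each class at its own rate, for example keeping every rare positive. The helper `sample_by_stratum`, the rates, and the synthetic documents are hypothetical.

```python
import random
from collections import defaultdict


def sample_by_stratum(items, key, rates, seed=0):
    """Draw a without-replacement sample from each stratum at its own rate."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for label, members in strata.items():
        k = round(rates[label] * len(members))
        sample.extend(rng.sample(members, k))
    return sample


def label_of(doc):
    return doc[1]


# Synthetic corpus with 10% prevalence: 10 positives, 90 negatives.
docs = [("doc%d" % i, 1 if i < 10 else 0) for i in range(100)]

# Stratified (proportional): the same 20% rate in both classes preserves prevalence.
proportional = sample_by_stratum(docs, label_of, {1: 0.2, 0: 0.2})

# Unequal: keep every positive but only 10% of the negatives.
unequal = sample_by_stratum(docs, label_of, {1: 1.0, 0: 0.1})
```

With unequal rates the sample no longer reflects the true prevalence, so any estimate computed on it has to be reweighted by the inverse of each stratum's sampling rate.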
-
Big data sometimes isn't
-
Challenges in EDISC: See p13 of the reference.
- Best Decision
- All other decisions compared with the best one
- Some words to justify the best decision
- Lessons learnt from this project
- The role of validation teams:
- Mining large industrial data can draw significant lessons from SE practices
- Validity threats