DsProfiling – Dataset Profiling

License: GNU Affero General Public License v3.0

dsprofiling's Introduction

DsProfiling – Dataset Profiling: A Research Compendium of New Techniques in Profiling Big Datasets for Machine Learning with A Concise Review of Android Mobile Malware Datasets

This repository is a research compendium for the academic publication below.

Gürol Canbek, Seref Sagiroglu, and Tugba Taskaya Temizel. “New Techniques in Profiling Big Datasets for Machine Learning with A Concise Review of Android Mobile Malware Datasets”, International Congress on Big Data, Deep Learning & Fighting Cyber Terrorism (IBIGDELFT 2018), 3–4 December 2018: IEEE.

The full text is available at ResearchGate and IEEE Xplore. This repository contains:

  • Tables/Extra Materials (OpenDocument Spreadsheet)*
  • dsprofiling.R, a new R script that calculates some of the profiling criteria (will be updated later)
  • The presentation given at the conference

* Best viewed with LibreOffice.

*Example figure visualizing a profiling criterion (the colored version is in the article)*

Contents of Tables/Extra Material

  • Table I. Reviewed Android Mobile Malware Datasets
  • Table II. Basic Profiling
  • Table III. Time Line Profiling
  • Table IV. Density/Sparsity Profiling
  • Table V. Overall Profiling Results
  • Table Extra I. Datasets Usage in the Literature
  • Table Extra II. Dataset Sizes
  • Table Extra III. Android Mobile Malware Family and Variants

Please refer to the article for more information and the methodology (a link will be provided).
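
As a rough illustration of the kind of measure behind density/sparsity profiling (Table IV), the sketch below computes the share of non-zero and zero cells in a feature matrix. It is an assumption-laden example, not the article's definition and not code from dsprofiling.R: the function name `profile_density`, the toy matrix, and the reading of "density" as the non-zero ratio are all hypothetical.

```r
# Hypothetical helper (not part of dsprofiling.R).
# Assumes density = share of non-zero cells and sparsity = 1 - density,
# and that the matrix has no missing values.
profile_density <- function(x) {
  x <- as.matrix(x)
  total <- length(x)
  if (total == 0) stop("empty matrix")
  nonzero <- sum(x != 0)
  density <- nonzero / total
  list(density = density, sparsity = 1 - density)
}

# Toy data (invented): 4 samples (rows) x 5 binary features (columns)
toy <- matrix(c(1, 0, 0, 1, 0,
                0, 0, 0, 1, 0,
                1, 1, 0, 0, 0,
                0, 0, 0, 0, 0),
              nrow = 4, byrow = TRUE)

result <- profile_density(toy)
print(result)
```

The criteria actually used in the study are defined in the article and may differ from this simple ratio.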

Abstract

As the volume, variety, and velocity aspects of big data increase, other aspects such as veracity, value, variability, and venue cannot be interpreted easily by data owners or researchers. These aspects are also unclear when the data is to be used in machine learning studies such as classification or clustering. This study proposes four techniques with fourteen criteria to systematically profile datasets collected from different sources, in order to distinguish them from one another and to see their strong and weak aspects. The proposed approach is demonstrated on five Android mobile malware datasets from the literature and the security industry, namely Android Malware Genome Project, Drebin, Android Malware Dataset, Android Botnet, and Virus Total 2018. The results show that the proposed profiling methods reveal remarkable comparative insight into the datasets and direct researchers toward big but more visible, qualitative, and internalized datasets.

Keywords

Data profiling, data quality, big data, malware detection, mobile malware, machine learning, classification, Android, feature engineering
