File Type Identification

A work sample from BlueOptima: identify a file's type from several data sources, using only the file's name and extension.

Problem Statement:

With the enormous number of languages and file types used for writing logical source or for data purposes, it is very important for a product like BlueOptima to effectively identify and categorize a file into its type. This has to be done solely based on the extension and name of the file itself. This work sample requires you to identify different sources that could be used to identify details of a file type, such as the following (but not limited to):

  • Short Description (explaining the usage of the file type)
  • Category (i.e. Logical Source, Configuration, Data, etc.)
  • Language Family (Java, Python, Perl, etc.)
  • Programming Paradigm (Procedural, OOP, Dynamic, etc.)
  • Associated applications

Solution (Execution Flow)

  • Deliverable 1 - Identification and Analysis of Data Sources.

    • Identify at least 5 different Data sources.
    • Expand on the rationale for using the Data source.
  • Deliverable 2 - Implementation and Presentation of information about the given input file types.

    • Scrape data from Fileinfo.com with a Python script and store it in sourceFileInfo.json.
    • Parse tika.xml with a Java parser and store the result in sourceTika.json.
    • Scrape data from the IANA source with a Python script and store it in a .json file.
    • Create an input file, input.csv, listing all the inputs.
    • Implement the main program, FileTypeIdentification.java:
      • Store all the input filenames in a list.
      • Load each data source (extracted previously into .json files) into main memory as hash maps: fileInfoHM, tikaHM.
      • For each input file extension, look it up in the hash maps in priority order to find the required data.
      • Write the information for each input file to output.txt.
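
The lookup step above can be sketched as follows. This is a minimal Python illustration, not the Java implementation: the dictionaries stand in for the fileInfoHM and tikaHM hash maps, the miniature records are made up, and the priority order (Fileinfo first, then Tika) is an assumption.

```python
import os

def extension_of(filename):
    """Return the lower-cased extension without the dot, or '' if there is none."""
    _, ext = os.path.splitext(filename)
    return ext[1:].lower()

def lookup(filename, sources):
    """Search the sources in priority order and return (source_name, record)
    for the first source that knows the extension, or (None, None)."""
    ext = extension_of(filename)
    for name, table in sources:
        record = table.get(ext)
        if record is not None:
            return name, record
    return None, None

# Hypothetical miniature data; the real maps are loaded from the .json files.
file_info = {"cpp": {"Type": "C++ Source Code File", "Category": "Developer File"}}
tika = {"cpp": {"Type": "text/x-c++src"}, "py": {"Type": "text/x-python"}}

# Assumed priority: Fileinfo first, Tika as the fallback.
sources = [("fileInfo", file_info), ("tika", tika)]

for filename in ("binarySort.CPP", "scrape.py", "about.txt"):
    print(filename, lookup(filename, sources))
```

Lower-casing the extension lets binarySort.CPP and linkList.cpp resolve to the same entry.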

Input

The input file is in the input directory of the project. It is a CSV file listing file names with their extensions, one per line, as shown below.

/input/input.csv

      binarySort.CPP
      linkList.cpp
      Readme.pdf
      fibonacci.XCODEPROJ
      about.txt
      scrape.py
      xmlParser.java
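
Reading such a file into a list of file names could look like the sketch below. It is written in Python for illustration only and reads the sample from an in-memory string so that it runs on its own; the function name read_filenames is hypothetical.

```python
import csv
import io

# The sample input from above, indentation included.
SAMPLE = """\
      binarySort.CPP
      linkList.cpp
      Readme.pdf
      fibonacci.XCODEPROJ
      about.txt
      scrape.py
      xmlParser.java
"""

def read_filenames(stream):
    """Return the first field of every non-empty row, with whitespace stripped."""
    names = []
    for row in csv.reader(stream):
        if row and row[0].strip():
            names.append(row[0].strip())
    return names

filenames = read_filenames(io.StringIO(SAMPLE))
print(filenames)
```

To read a file on disk instead, pass an open file object, for example `open("input/input.csv", newline="")`.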

Output

The program writes its output to the text file output.txt in the main directory. A sample is given below.

output.txt

      ______________________________________________________________________________________________________________
 
      File: binarySort.CPP
      ______________________________________________________________________________________________________________
 
	        Category	: Developer File
	        Type		: C++ Source Code File
	        Description	: A CPP file is a source code file written in C++, a popular programming language that adds features such as object-oriented programming to C.  It may be a standalone program, containing all the code or one of many files referenced in a development project.  CPP files must be compiled by a C++ compiler for the target platform before the code can be run.
	        Programs	: File Viewer Plus, Microsoft Visual Studio 2017, Microsoft Visual Studio Code, Eclipse CDT, Code::Blocks, Embarcadero Technologies C++ Builder, ES-Computing EditPlus, BloodshedSoftware Dev-C++, Apple Xcode, GNU Compiler Collection (GCC), MacroMates TextMate, Freescale CodeWarrior Development Tools, File Viewer for Android
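
A record in roughly this layout could be produced as in the following Python sketch. The rule width, the column padding and the "Not found" placeholder for missing fields are assumptions made for illustration; the Java program may format its output differently.

```python
RULE = "_" * 110  # assumed width of the separator line
FIELDS = ("Category", "Type", "Description", "Programs")

def format_record(filename, record):
    """Format one file's details as a block of text in the style of output.txt."""
    lines = [RULE, "", f"File: {filename}", RULE, ""]
    for field in FIELDS:
        # "Not found" is a hypothetical placeholder for a missing field.
        value = record.get(field, "Not found")
        if isinstance(value, list):
            value = ", ".join(value)
        lines.append(f"\t{field:<12}: {value}")
    return "\n".join(lines) + "\n"

record = {
    "Category": "Developer File",
    "Type": "C++ Source Code File",
    "Programs": ["Apple Xcode", "Eclipse CDT"],
}
print(format_record("binarySort.CPP", record))
```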

Steps to Run the Program

  1. In /input/, create your input file in CSV, following the input format above, or use the prebuilt one.
  2. Execute the main program: /src/fileTypeIdentification/FileTypeIdentification.java
  3. Enter the input file name in the console, for example input1.csv. Pressing Return without a name uses the default, input0.csv.
  4. Check the output in the output.txt file in the main directory.
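
The default-on-Return behaviour in step 3 amounts to the small rule sketched below in Python. The function name resolve_input_name is hypothetical, and the sketch only mirrors what the steps describe, not the Java code itself.

```python
def resolve_input_name(entered, default="input0.csv"):
    """Return the name the user typed, or the default when only Return was pressed."""
    name = entered.strip()
    return name if name else default

# Typical use at the console (commented out so the sketch does not block):
# input_name = resolve_input_name(input("Input file name: "))

print(resolve_input_name(""))            # falls back to the default
print(resolve_input_name("input1.csv"))  # uses the name as typed
```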

java version "12.0.2" 2019-07-16 https://www.oracle.com/technetwork/java/javase/downloads/jdk12-downloads-5295953.html

Developers

  • Mohammed Ataaur Rahaman
  • Siddharth Singh
  • Shivani Bangalore
