
Atarashi scans for license statements in open source software, focusing on text statistics. Designed to work stand-alone and with FOSSology.

Home Page: http://fossology.github.io/atarashi

License: GNU General Public License v2.0


atarashi's Introduction

Atarashi


Open source software is licensed using open source licenses. There are many open source licenses around and, adding to that, open source software packages sometimes involve multiple licenses for different files.

Atarashi provides different methods for scanning for license statements in open source software. Unlike existing rule-based approaches - such as the Nomos license scanner from the FOSSology project - atarashi implements multiple text statistics and information retrieval algorithms.

The anticipated advantage is improved precision, while keeping it as easy as possible to add new license texts or new license references.

Atarashi is designed to work stand-alone and with FOSSology. More info at https://fossology.github.io/atarashi

Requirements

  • Python >= v3.5
  • pip >= 18.1

Steps for Installation

Install

Install from PyPi

  • pip install atarashi

Source install

  • pip install .
  • This will download all required dependencies and trigger the build as well.
  • The build will generate 3 new files in your current directory:
    1. data/Ngram_keywords.json
    2. licenses/<SPDX-version>.csv
    3. licenses/processedList.csv
  • These files will be placed in their appropriate locations by the install script.

Installing just dependencies

  • pip install -r requirements.txt

Build (optional)

  • $ python3 setup.py build

How to run

Get help by running atarashi -h or atarashi --help

Example

  • Running DLD agent

    atarashi -a DLD /path/to/file.c

  • Running wordFrequencySimilarity agent

    atarashi -a wordFrequencySimilarity /path/to/file.c

  • Running tfidf agent

    • With Cosine similarity

      atarashi -a tfidf /path/to/file.c

      atarashi -a tfidf -s CosineSim /path/to/file.c

    • With Score similarity

      atarashi -a tfidf -s ScoreSim /path/to/file.c

  • Running Ngram agent

    • With Cosine similarity

      atarashi -a Ngram /path/to/file.c

      atarashi -a Ngram -s CosineSim /path/to/file.c

    • With Dice similarity

      atarashi -a Ngram -s DiceSim /path/to/file.c

    • With Bigram Cosine similarity

      atarashi -a Ngram -s BigramCosineSim /path/to/file.c

  • Running in verbose mode

    atarashi -a DLD -v /path/to/file.c

  • Running with custom CSVs and JSONs

    • Please refer to the build instructions to generate the CSV and JSON files understood by atarashi.
    • atarashi -a DLD -l /path/to/processedList.csv /path/to/file.c
    • atarashi -a Ngram -l /path/to/processedList.csv -j /path/to/ngram.json /path/to/file.c

Running Docker image

  1. Pull Docker image

    docker pull fossology/atarashi:latest

  2. Run the image

    docker run --rm -v <path/to/scan>:/project fossology/atarashi:latest <options> /project/<path/to/file>

Since Docker cannot access the host filesystem directly, we mount the directory containing the files to scan as a volume at /project in the container. Simply pass the options and the path to the file relative to the mounted path, as in the example below.
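For example, to scan src/main.c (a hypothetical path) under the current directory with the tfidf agent:

    docker run --rm -v $(pwd):/project fossology/atarashi:latest -a tfidf /project/src/main.c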

Test

  • Run imtihaan (meaning Exam in Hindi) with the name of the Agent.
  • e.g. python atarashi/imtihaan.py /path/to/processedList.csv <DLD|tfidf|Ngram> <testfile>
  • See python atarashi/imtihaan.py --help for more

Creating Debian packages

  • Install dependencies
# apt-get install python3-setuptools python3-all debhelper
# pip install stdeb
  • Create Debian packages
$ python3 setup.py --command-packages=stdeb.command bdist_deb
  • Locate the files under deb_dist

License

SPDX-License-Identifier: GPL-2.0

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

How to generate the documentation using sphinx

  1. Go to project directory 'atarashi'.

  2. Install Sphinx and m2r: pip install sphinx m2r (since this project is based on Python, pip is already installed).

  3. Initialise docs/ directory with sphinx-quickstart

    mkdir docs
    cd docs/
    sphinx-quickstart
    • Root path for the documentation [.]: .
    • Separate source and build directories (y/n) [n]: n
    • autodoc: automatically insert docstrings from modules (y/n) [n]: y
    • intersphinx: link between Sphinx documentation of different projects (y/n) [n]: y
    • For everything else, use the default options.
  4. Set up conf.py and include README.md

    • Enable the following lines and change the insert path:

      import os
      import sys
      sys.path.insert(0, os.path.abspath('../'))
    • Enable m2r to insert .md files in Sphinx documentation:

      [...]
      extensions = [
        ...
        'm2r',
      ]
      [...]
      source_suffix = ['.rst', '.md']
    • Include README.md by editing index.rst

      .. toctree::
          [...]
          readme
      
      .. mdinclude:: ../README.md
  5. Auto-generate the .rst files in docs/source which will be used to generate documentation

    cd docs/
    sphinx-apidoc -o source/ ../atarashi
  6. cd docs

  7. make html

This will generate the HTML files in docs/_build/html. Open index.html.

You can change the theme of the documentation by changing html_theme in the conf.py file in the docs/ folder. You can choose from {'alabaster', 'classic', 'sphinxdoc', 'scrolls', 'agogo', 'traditional', 'nature', 'haiku', 'pyramid', 'bizstyle'}, as in the snippet below. Reference
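For example, to switch to the nature theme, set in docs/conf.py:

    html_theme = 'nature'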

atarashi's People

Contributors

ag4ums, aman-codes, amanjain97, gmishx, hastagab, its-sushant, kaushl2208, mcjaeger, singhshreya05, tanweerulhaque, vasudevmaduri, xavierfigueroav


atarashi's Issues

Build fails due to import error from nirjas

I get the following error:

Traceback (most recent call last):
  File "setup.py", line 26, in <module>
    from atarashi.build_deps import download_dependencies
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/build_deps.py", line 32, in <module>
    from atarashi.license.licensePreprocessor import LicensePreprocessor
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/license/licensePreprocessor.py", line 31, in <module>
    from atarashi.libs.commentPreprocessor import CommentPreprocessor
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/libs/commentPreprocessor.py", line 30, in <module>
    from nirjas import extract as commentExtract, LanguageMapper
ImportError: cannot import name 'LanguageMapper' from 'nirjas' (/home/xavierfigueroav/Documents/atarashi-project/atarashi/.env/lib/python3.8/site-packages/nirjas/__init__.py)

This happens when running python setup.py build, after cloning the repository and installing the dependencies.

ModuleNotFoundError occurs when running after installing with pip.

I tried running it after installing it with pip on Python 3.6.9, and the following error occurred.
Are there additional modules I need to install?

$pip install atarashi 
$atarashi -h
Traceback (most recent call last):
  File "/home/soimkim/test/venv/bin/atarashi", line 5, in <module>
    from atarashi.atarashii import main
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/atarashii.py", line 26, in <module>
    from atarashi.agents.cosineSimNgram import NgramAgent
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/agents/cosineSimNgram.py", line 30, in <module>
    from atarashi.agents.atarashiAgent import AtarashiAgent
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/agents/atarashiAgent.py", line 27, in <module>
    from atarashi.libs.commentPreprocessor import CommentPreprocessor
  File "/home/soimkim/test/venv/lib/python3.6/site-packages/atarashi/libs/commentPreprocessor.py", line 23, in <module>
    import code_comment  # https://github.com/amanjain97/code_comment/
ModuleNotFoundError: No module named 'code_comment'

Problem with identifying the short license text

Generally, the license contained in a source code file is either the short license itself or a large license block, which makes it difficult for the information retrieval and similarity-finding algorithms to classify efficiently.

Please suggest how this should be resolved before implementing other IR (Information retrieval) algorithms.

Removing third party module in dameruLevenDist agent

Right now atarashi uses damerau_levenshtein_distance imported from pyxdameraulevenshtein in the dameruLevenDist agent. The function is not that long, and we do not know if it will get removed. So we can drop the dependency and write our own damerau_levenshtein_distance function to increase the overall speed of the dameruLevenDist agent and make atarashi less dependent on other repositories.
I have already started working on it. Can I proceed further?
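For reference, here is a minimal pure-Python sketch of the restricted (optimal string alignment) variant of the distance; this is not the code proposed in the issue, just an illustration of how small the function is:

def damerau_levenshtein_distance(s1, s2):
    # Edits counted: insertion, deletion, substitution, and
    # transposition of two adjacent characters.
    d = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        d[i][0] = i
    for j in range(len(s2) + 1):
        d[0][j] = j
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
    return d[len(s1)][len(s2)]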

Invalid File Path in Atarashi

Whenever an invalid file path is provided to atarashi, it generates the following error:

(env) akshay@akshay-VirtualBox:~/atarashi/atarashi/evaluator$ atarashi -a tfidf Testfiles/APSL-style.html
Traceback (most recent call last):
  File "/home/akshay/atarashi/env/bin/atarashi", line 8, in <module>
    sys.exit(main())
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/atarashii.py", line 123, in main
    result = atarashii_runner(inputFile, processedLicense, agent_name, similarity, ngram_json, verbose)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/atarashii.py", line 83, in atarashii_runner
    result = scanner.scan(inputFile)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/agents/tfidf.py", line 140, in scan
    return self.__tfidfcosinesim(filePath)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/agents/tfidf.py", line 112, in __tfidfcosinesim
    processedData1 = super().loadFile(inputFile)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/agents/atarashiAgent.py", line 44, in loadFile
    self.commentFile = CommentPreprocessor.extract(filePath)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/libs/commentPreprocessor.py", line 129, in extract
    data1 = licenseComment(data)
  File "/home/akshay/atarashi/env/lib/python3.8/site-packages/atarashi/libs/commentPreprocessor.py", line 42, in licenseComment
    for id, item in enumerate(data[0]["multi_line_comment"]):
IndexError: list index out of range

Instead, it should generate a simple error message saying that the provided file path is wrong.
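A minimal sketch of the kind of guard being requested (the function name and message are illustrative, not atarashi's actual API):

import os
import sys

def validate_path(input_file):
    # Fail early with a readable message instead of a deep traceback.
    if not os.path.isfile(input_file):
        sys.exit("atarashi: error: '%s' is not a valid file path" % input_file)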

FEAT: Increasing the overall performance of Atarashi

We can improve the performance of atarashi, nirjas, and others by using Numba and RAPIDS by Nvidia. Regular NumPy, pandas, and other libraries are slow: most of the time is wasted in serialization, deserialization, pre-processing, and memory transfer between the CPU and other devices. We can make it fast using Numba's parallel processing, JIT compilation, and built-in features, which work even on a CPU (a toy sketch follows below). Also, most of the programs can be made even faster using RAPIDS' cuML, cuDF, dask, etc., by executing everything on a GPU: pre-processing, vectorization, database queries, serialization, deserialization, parallel processing, etc. The entire codebase can be translated without much hassle, resulting in computational efficiency, higher accuracy, and lower memory usage.
This can ensure Atarashi's integration with FOSSology.
I have somewhat started with the work. Can I proceed with the same??
@hastagAB @GMishx
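As a rough illustration of the Numba part of this proposal (not code from atarashi; the array layout and names are made up), a JIT-compiled, multi-threaded cosine similarity over a document-term matrix could look like:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def cosine_sims(docs, query):
    # One similarity score per document row, computed across CPU threads.
    out = np.empty(docs.shape[0])
    qnorm = np.sqrt((query * query).sum())
    for i in prange(docs.shape[0]):
        row = docs[i]
        dot = (row * query).sum()
        out[i] = dot / (np.sqrt((row * row).sum()) * qnorm + 1e-12)
    return out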

Shift from argparse module to plac command line parser.

Description

argparse is a parser for command-line options, arguments, and sub-commands.
Read the docs: https://docs.python.org/3/library/argparse.html
Currently, argparse is used as the command-line parser in atarashi, and we're planning to shift to plac.
plac does the same thing with far fewer lines of code.
Repo: https://github.com/micheles/plac

How to Solve

Read the plac documentation: http://micheles.github.io/plac/

Files to be changed

Comment extraction not working properly

Multi-line comment extraction is still not working for the following file types: JS, PHP, and Python.

For example in Python

print """Some long
print message
"""
print 'Some different message'
"""
Actual comment
"""

In this case, the script returns print 'Some different message' as a comment and leaves out the Actual comment.

Create a unified entry point

Create a unified entry point for every file using command-line arguments instead of adding __main__ to every file. This will help keep the code concise, maintainable, and readable. A sketch follows.
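A sketch of what this could look like in setup.py; atarashi already exposes one such console script for atarashii.py (visible in the tracebacks elsewhere on this page), and the idea is to make that pattern the only entry point:

# setup.py (fragment): one console_scripts entry point instead of
# per-file `if __name__ == '__main__':` blocks.
from setuptools import setup

setup(
    name='atarashi',
    entry_points={
        'console_scripts': [
            'atarashi = atarashi.atarashii:main',
        ],
    },
)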

List Index Out of Range

When running atarashi -a <agent> -s <similarity>, the following error is produced:

[screenshot: 'list index out of range' traceback]

Take a reference from here: #80

Steps To Reproduce

  • Previously @Aman-Codes made changes in the atarashii.py file (#L83), which were working fine, but after this PR the same error started appearing again.
  • I think we either need to implement the scan function as done [here], or update the atarashii.py file.

Ability to scan directories

Currently atarashi can scan only files. If a directory is provided as input, it should be able to find all files under it and run the selected agent on them.
The results of each scan can be stored in a list and printed as a JSON array.

It would be preferable, however, to print results as they come while maintaining the validity of the JSON array, so that someone running a scan in an interactive terminal does not get the feeling that nothing is happening.
This can be emulated by printing a starting [, followed by each scan result object {...} and a ,; the last result gets no trailing , and a ] is printed at the end of the scan. This approach eliminates the need for an additional list to hold temporary results. A sketch follows.
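A minimal sketch of the streaming approach described above (the scanner object and its scan method are stand-ins for whichever agent is selected):

import json
import os

def scan_directory(root, scanner):
    # Print a valid JSON array incrementally: '[' first, then each result
    # object, with a comma before every result except the first.
    print('[', end='')
    first = True
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            result = scanner.scan(os.path.join(dirpath, name))
            if not first:
                print(',', end='')
            print(json.dumps(result), end='', flush=True)
            first = False
    print(']')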

Comment extraction not working on curly quotes

Curly quotes (‘, ’, “, ”) are not filtered in the comment extractor, which leads to some wrong results (a normalization sketch follows the character table below).

A more extensive listing of problematic word characters:

Character   UTF-8    ASCII   Name
–           \u2013   -       En dash
—           \u2014   -       Em dash
―           \u2015   -       Horizontal bar
‘           \u2018   '       Left single quotation mark
’           \u2019   '       Right single quotation mark
‚           \u201a   ,       Single low-9 quotation mark
‛           \u201b   '       Single high-reversed-9 quotation mark
“           \u201c   "       Left double quotation mark
”           \u201d   "       Right double quotation mark
„           \u201e   "       Double low-9 quotation mark
…           \u2026   ...     Horizontal ellipsis
′           \u2032   '       Prime
″           \u2033   "       Double prime
©           \u00a9   (c)     Copyright sign
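A sketch of one way to filter these before matching (the mapping mirrors the ASCII column above; names are illustrative, not atarashi's actual code):

# Map each problematic character to its ASCII stand-in from the table.
UNICODE_TO_ASCII = str.maketrans({
    '\u2013': '-', '\u2014': '-', '\u2015': '-',
    '\u2018': "'", '\u2019': "'", '\u201a': ',', '\u201b': "'",
    '\u201c': '"', '\u201d': '"', '\u201e': '"',
    '\u2026': '...', '\u2032': "'", '\u2033': '"',
    '\u00a9': '(c)',
})

def normalize(text):
    return text.translate(UNICODE_TO_ASCII)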

Error in CommentPreprocessor

I get the following error when running python atarashii.py -a wordFrequencySimilarity <file>:

Traceback (most recent call last):
  File "atarashii.py", line 213, in <module>
    main()
  File "atarashii.py", line 167, in main
    result = run_scan(scanner_obj, inputPath)
  File "atarashii.py", line 116, in run_scan
    return scanner.scan(inputFile)
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/../atarashi/agents/wordFrequencySimilarity.py", line 41, in scan
    processedData = super().loadFile(filePath)
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/../atarashi/agents/atarashiAgent.py", line 44, in loadFile
    self.commentFile = CommentPreprocessor.extract(filePath)
  File "/home/xavierfigueroav/Documents/atarashi-project/atarashi/atarashi/../atarashi/libs/commentPreprocessor.py", line 131, in extract
    data = json.loads(data_file)
  File "/usr/lib/python3.8/json/__init__.py", line 341, in loads
    raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not dict

I got this error when running the wordFrequencySimilarity agent, but since commentPreprocessor is used by all the others, this may be affecting the whole package.

The error occurs because the result of the function extract from Nirjas is passed to the method loads of the module json, but extract returns a dictionary while loads expects a string. See lines 129 and 130 in commentPreprocessor.py.

with open(outputFile, 'w') as outFile:
    # if the file extension is supported
    if fileType in supportedFileExtensions:
        data_file = commentExtract(inputFile)
        data = json.loads(data_file)
        data1 = licenseComment(data)
        outFile.write(data1)

So the fix consists of removing line 130 and passing data_file to licenseComment instead of data, as sketched below.
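With that change applied, following the reporter's own description, the block would read:

with open(outputFile, 'w') as outFile:
    # if the file extension is supported
    if fileType in supportedFileExtensions:
        # nirjas' extract already returns a dict, so no json.loads is needed
        data_file = commentExtract(inputFile)
        data1 = licenseComment(data_file)
        outFile.write(data1)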

[Proposal] Improve the speed of matching

WHAT

Atarashi can use a lot of agents and a lot of similarity types.
But according to previous tests, we observed that there is a need to improve the speed of license scanning.

Proposal

to be decided

Pipfile: the replacement for requirements.txt

We can migrate our current requirements.txt file to a Pipfile for several reasons:

  • TOML syntax for declaring all types of Python dependencies.
  • One Pipfile (as opposed to multiple requirements.txt files).
  • A Pipfile is inherently ordered.

Or refer to: Why?

Build fails

Hi, I'm getting an error when trying to build (master branch) on Python 3.7:

Installing collected packages: code-comment
  Running setup.py develop for code-comment
    Complete output from command /home/rob/projects/atarashi/.venv/bin/python -c "import setuptools, tokenize;__file__='/home/rob/projects/atarashi/.venv/src/code-comment/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" develop --no-deps --user --prefix=:
    usage: -c [global_opts] cmd1 [cmd1_opts] [cmd2 [cmd2_opts] ...]
       or: -c --help [cmd1 cmd2 ...]
       or: -c --help-commands
       or: -c cmd --help
    
    error: option --user not recognized

I'm able to build when I run the command with sudo, but this shouldn't be necessary per my understanding.

Parallelize the evaluator algorithm

Description

There is a script to evaluate the algorithms for Atarashi: evaluator.py

Currently, it scans the test files sequentially (one by one).
We have to parallelize the script using multiprocessing, multithreading, or something else to reduce the effective scanning time.

How to solve

Use multiprocessing or multithreading in the main loop of evaluator.py, as sketched below.
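A minimal sketch of what the parallel main loop could look like (evaluate_file is a stand-in for the per-file work evaluator.py currently does sequentially):

from multiprocessing import Pool

def evaluate_file(test_file):
    # Stand-in: run the selected agent on one test file and return
    # whatever the evaluator records (e.g. detected license, timing).
    pass

def evaluate_all(test_files, workers=4):
    # Fan the sequential loop out over worker processes; imap_unordered
    # yields each result as soon as its file finishes scanning.
    with Pool(processes=workers) as pool:
        return list(pool.imap_unordered(evaluate_file, test_files))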

Make evaluation.py more informative

The evaluation script should

  • Allow printing a comparison table with all the algorithms supported by atarashi. You can find examples of comparison tables in #95 and #65.
  • Allow printing a confusion matrix so that we can easily do error analysis to make decisions on how to improve current agents or implement new ones.

Run the evaluator command without any 'similarity' parameter

Description

The evaluator commands are set for two parameters, i.e. agent_name and similarity, but some agents also run without a similarity type.

Example: for the tfidf agent there are three commands:

  1. With cosine similarity : atarashi -a tfidf -s CosineSim /path/to/file.c
  2. With Score similarity : atarashi -a tfidf -s ScoreSim /path/to/file.c
  3. Without any similarity : atarashi -a tfidf /path/to/file.c

The evaluator covers the first two cases but not the third. The same goes for other agents.

How to fix

  1. Go to the getCommand function of the evaluator.
  2. Write separate conditions as desired, or manipulate the existing ones (see the sketch below).
  3. Test and verify that it works.
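A hypothetical reconstruction of what the adjusted getCommand could do (names follow the issue, not necessarily the actual evaluator code):

def getCommand(agent_name, similarity=None):
    # Build the atarashi invocation, omitting -s when no similarity
    # type is given so the agent's default path is exercised too.
    command = ['atarashi', '-a', agent_name]
    if similarity:
        command += ['-s', similarity]
    return command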

Improve TF-IDF agent by tuning matches threshold

Hello.

I've been playing around with some parameters of the TF-IDF agent.

I've found that if we stop using a threshold (cosine similarity >= 0.30) to filter the match results, the accuracy improves by up to 3 points. However, filtering helps reduce the compute time, since the results are sorted at the end of the search. See the piece of code I am talking about (especially lines 126 and 133):

for counter, value in enumerate(all_documents_matrix, start=0):
    sim_score = self.__cosine_similarity(value, search_martix)
    if sim_score >= 0.3:
        matches.append({
            'shortname': self.licenseList.iloc[counter]['shortname'],
            'sim_type': "TF-IDF Cosine Sim",
            'sim_score': sim_score,
            'desc': ''
        })
matches.sort(key=lambda x: x['sim_score'], reverse=True)
if self.verbose > 0:
    print("time taken is " + str(time.time() - startTime) + " sec")
return matches

Using the evaluation.py script, I've carried out some experiments:

    Algorithm                                   Time elapsed (s)   Accuracy
1   tfidf (CosineSim) (thr=0.30)                30.19              59.0%
2   tfidf (CosineSim) (thr=0.17)                35.29              61.0%
3   tfidf (CosineSim) (thr=0.16, max_df=0.10)   27.34              62.0%
4   tfidf (CosineSim) (thr=0.16)                36.42              62.0%
5   tfidf (CosineSim) (thr=0.15)                38.45              62.0%
6   tfidf (CosineSim) (thr=0.10)                39.91              62.0%
7   tfidf (CosineSim) (thr=0.00)                61.49              62.0%
8   Ngram (CosineSim)                           -                  57.0%
9   Ngram (BigramCosineSim)                     -                  56.0%
10  Ngram (DiceSim)                             -                  55.0%
11  wordFrequencySimilarity                     -                  23.0%
12  DLD                                         -                  17.0%
13  tfidf (ScoreSim)                            -                  13.0%
  • Row 1 shows the performance (speed and accuracy) of the current configuration of the TF-IDF agent using CosineSim as the similarity measure.
  • Row 7 shows how we can reach an accuracy of 62.0% just by removing the threshold (cosine similarity >= 0.00). However, just removing the threshold makes the agent 2x slower, so I continued tuning the threshold, holding the last value that still produces 62.0% accuracy, which is 0.16, shown in row 4.
  • In order to continue decreasing the execution time while keeping the accuracy, I tuned some parameters of the TfidfVectorizer. Setting max_df to 0.10 (default is 1.0) keeps the accuracy at 62.0% but makes the agent 1.1x faster, shown in row 3 (see the sketch after this list).
    • Why does decreasing the max_df value increase the speed? Because the vectorizer ignores all the terms that appear in more than the max_df fraction of the documents (see docs), i.e., it ignores the more frequent terms, so each document vector is shorter, making the cosine similarity easier to compute.
    • Why does decreasing the max_df value keep the accuracy high? My explanation is that the terms that appear in most licenses do not help the algorithm distinguish licenses; rare terms are the ones that make licenses differ from each other, so they are enough for the algorithm to do a good job.
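A stand-alone sketch of the row-3 configuration using scikit-learn (not atarashi's actual agent code; the function and argument names are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_matches(license_texts, input_text, threshold=0.16, max_df=0.10):
    # max_df=0.10 drops terms appearing in more than 10% of the license
    # corpus; threshold=0.16 keeps the accuracy of thr=0.00 at lower cost.
    vectorizer = TfidfVectorizer(max_df=max_df)
    license_matrix = vectorizer.fit_transform(license_texts)
    query = vectorizer.transform([input_text])
    scores = cosine_similarity(license_matrix, query).ravel()
    matches = [(score, idx) for idx, score in enumerate(scores) if score >= threshold]
    return sorted(matches, reverse=True)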

I will be opening a PR for you to reproduce the results in row 3 and merge the changes if you consider them relevant.

Important notes:

  • I've left out the speed times for all the other algorithms, because I ran those experiments in another context, so the comparison of time wouldn't be fair.
  • All the results differ from the last report I could find out there. I do not fully understand why some of them are so different; probably changes in the test files or changes in the algorithms. Anyway, 62.0% is the new best result in both reports.
  • My findings may help improve other agents that use thresholds, such as Ngram.
  • This new state-of-atarashi performance 😅 may also push the goals of future agents implementations, since it would be the new baseline.
