
nexb / scancode.io


ScanCode.io is a server to script and automate software composition analysis with ScanPipe pipelines. This project is sponsored by the NLnet project https://nlnet.nl/project/vulnerabilitydatabase/, Google Summer of Code, nexB, and other generous sponsors!

Home Page: https://scancodeio.readthedocs.io

License: Apache License 2.0

Makefile 0.34% Python 84.90% HTML 13.57% Dockerfile 0.20% JavaScript 0.91% Java 0.06% C++ 0.02%
sca software-composition-analysis open-source license scancode docker virtual-machine cyclonedx package-url purl

scancode.io's Introduction

ScanCode.io

ScanCode.io is a server to script and automate software composition analysis with ScanPipe pipelines.

The first application is Docker container and VM composition analysis.

Getting started

The ScanCode.io documentation is available here: https://scancodeio.readthedocs.io/

If you have questions that are not covered by our Documentation or FAQs, please ask them in Discussions.

If you want to contribute to ScanCode.io, start with our Contributing page.

A new GitHub action is now available at scancode-action to run ScanCode.io pipelines from your GitHub Workflows. Visit https://scancodeio.readthedocs.io/en/latest/automation.html to learn more about automation.

Build and tests status

[CI Tests Status badge]
[Documentation Build Status badge]

License

SPDX-License-Identifier: Apache-2.0

The ScanCode.io software is licensed under the Apache License version 2.0. Data generated with ScanCode.io is provided as-is without warranties. ScanCode is a trademark of nexB Inc.

You may not use this software except in compliance with the License. You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Data generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. No content created from ScanCode.io should be considered or used as legal advice. Consult an attorney for any legal advice.

scancode.io's People

Contributors

0xmpij, aalexanderr, avishrantssh, ayansinhamahapatra, divyansh044, hritik14, hyounes4560, jayanth-kumar-morem, jonoyang, keshav-space, lf32, philcali, pombredanne, swastkk, tdruez, tg1999, xerrni


scancode.io's Issues

Provide a way to use scanpipe with an in-process database

Depending on the environment, using Postgres can be a little unwieldy (requires a daemon, root access, etc.).

Is the DB used for much more than just storing pipeline statuses and results? Is there any reason a solution like SQLite couldn't work here?
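
For illustration, a minimal sketch of what pointing Django at SQLite could look like, assuming nothing in the scanpipe models or queries is PostgreSQL-specific (the settings snippet below follows stock Django, not necessarily ScanCode.io's actual settings module):

    # A minimal, illustrative SQLite-backed DATABASES setting.
    from pathlib import Path

    BASE_DIR = Path(__file__).resolve().parent

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.sqlite3",
            # In-process, file-based database: no daemon or root access needed.
            "NAME": BASE_DIR / "scancodeio.sqlite3",
        }
    }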

Assignment of Download URL to a detected package needs to be improved

A recent scan (using ScanCode.io) of vscode-1.33.1.tar.gz (from https://github.com/microsoft/vscode/archive/1.33.1.tar.gz ) resulted in the assignment of invalid download URLs to detected packages. An example is the detected package clojure-1.0.0.tgz, which was assigned a download URL of https://registry.npmjs.org/clojure/-/clojure-1.0.0.tgz that does not work.

An archive of the scan output is attached.

vscode-1.33.1.tar.gz_scan.json.zip

Add support for "failed" task_output in Run.get_run_id method

The following may occur when a task has failed early, before running a pipeline. The task_output will then have some content but no "run-id 123456789" string to be found, raising an exception on get_run_id calls.

Exception raised in callable attribute "get_run_id"; 
original exception was: 'NoneType' object has no attribute 'group'
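
A defensive sketch of the extraction, assuming the run id is pulled out of task_output with a regular expression as the error suggests (the pattern and names below are illustrative):

    import re

    # The "run-id 123456789" string only appears once a pipeline has started.
    RUN_ID_PATTERN = re.compile(r"run-id (\d+)")

    def get_run_id(task_output):
        """Return the run id from `task_output`, or None for tasks that
        failed before a pipeline was started."""
        match = RUN_ID_PATTERN.search(task_output or "")
        return match.group(1) if match else None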

Support newer versions of python

Is there a reason python3.6 is specifically required? On some Debian environments, it is impossible to use python -m venv without the newest version of Python.

design-needed: Create app to display licenses and license rules

These should be the scancode-toolkit licenses.
Ideally we should also have an easy way to add new records, which should then trigger a PR in scancode-toolkit.
There should also ideally be a simple API that could be used by aboutcode-toolkit to fetch license texts for attribution generation.

Improve the short description of this scancode.io project

Here is a suggested short description for the scancode.io project:

ScanCode.io is a server that manages and performs scancode-toolkit scans, and it enables you to automate ScanCode analysis with ScanPipe pipelines and Container Analysis.

Scan a single text for license, report detection details and quality

Using scancode-toolkit I would like to

  1. have a screen where I can paste a text to trigger a license detection

  2. in the results, see which parts of the text were detected, with the matching text highlighted

  3. the quality of the detection should be analyzed by https://github.com/nexB/scancode-results-analyzer/

  4. if the quality or results are not good I should have an option to:
    4.1 automatically create a ticket at GitHub filled with the issue data (and the possible corrected suggestion)
    4.2 OR create a PR with new license rules and/or a rule update with the fix to resolve this issue

First Resource in scanpipe JSON results does not have a path

I set up scanpipe locally, ran the scan_codebase pipeline on https://registry.npmjs.org/@uifabric/charting/-/charting-2.7.5.tgz, and downloaded the results in JSON when it was done. I tried to upload the scan directly to matchcode to see what happens, and matchcode runs into an error because of the first Resource in the results:

{
  "for_packages": [],
  "path": "",
  "size": 0,
  "sha1": "",
  "md5": "",
  "copyrights": [],
  "holders": [],
  "authors": [],
  "licenses": [],
  "license_expressions": [],
  "emails": [],
  "urls": [],
  "status": "",
  "type": "directory",
  "extra_data": {},
  "name": "codebase",
  "extension": "",
  "programming_language": "",
  "mime_type": "",
  "file_type": ""
},

I'd expect there to be a path value for the codebase directory (which all the files sit in), and I would also expect all the subsequent Resource paths to be prefixed by codebase as well. Conversely, since the rest of the paths are not prefixed with codebase, you could just opt to remove this Resource; I wouldn't think it would mess anything up in the results.
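
For what it's worth, a minimal sketch of that removal workaround, assuming the resources sit under a top-level "files" key as in SCTK JSON output (file names are illustrative):

    import json

    with open("charting-2.7.5_scan.json") as f:
        results = json.load(f)

    # Drop any Resource with an empty path, such as the codebase root entry.
    results["files"] = [r for r in results["files"] if r.get("path")]

    with open("charting-2.7.5_scan-cleaned.json", "w") as f:
        json.dump(results, f, indent=2)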

Add a basic extract and scan pipeline

This would eventually replace the scanner API.

  • We have today an older "scanner" API that can scan a single package, and this is NOT using the scanpipe pipelines. It would make sense to use pipelines for this as well.

Select codebase content from SCIO-DB for JSON or CSV output

When we have a fully extracted codebase in SCIO-DB, we will need to be able to extract subsets of that codebase for (at least) two reasons:

  1. The size of output files - the current practical limits for the number of files that you can effectively manage (search, filter, etc.) in Excel/Calc (CSV) or SCWB (JSON) are about 500K and 150K rows, respectively. These limits are also relative to the amount of column/field data in a file, but the primary constraint seems to be the number of rows.
  2. The purpose of an analysis step - for the current D2D tracing of Deploy code to Devel code you need to create separate CSV files for the Deploy and Devel subsets of the codebase.

The general principle for defining the codebase file data (row) to be extracted is top-down - i.e. by specifying higher-level directories. It would be ideal to have some tree view of the codebase where you can check off the subsets of the codebase that you want to extract for analysis in Excel/Calc or SCWB.

Assignment of package filenames needs to be improved

A recent scan (using ScanCode.io) of bootstrap-4.3.1.tar.gz (from https://github.com/twbs/bootstrap/archive/v4.3.1.tar.gz ) resulted in the identification of a couple of packages that are unique (different URLs), but the assigned filenames are confusingly the same:

Package 1:
Filename 4.3.0
Download URL https://www.nuget.org/api/v2/package/bootstrap/4.3.0
Package URL pkg:nuget/bootstrap@4.3.0

Package 2:
Filename 4.3.0
Download URL https://www.nuget.org/api/v2/package/bootstrap.sass/4.3.0
Package URL pkg:nuget/bootstrap.sass@4.3.0

In each case, the Download URL and Package URL appear to be quite good, but the Filename could be improved to reflect what that filename would actually be if downloaded.

Attaching an archive of the scan results output.

bootstrap-4.3.1.tar.gz_scan.json.zip

Select field/column content from SCIO-DB for XLSX output

In addition to selecting codebase subsets from a Project in SCIO-DB (see #48), we often want to extract a specific subset of columns for a particular type of analysis in Excel/Calc. For example, you might want just the Copyright and License fields/columns for Files and Packages without the license match_rule data and without the other Package data.
It would be ideal to have some way to select the fields/columns you want to extract from a list with check boxes or similar UI. We will, of course, want to combine this with the capability to select Codebase rows in one UI. That combination would essentially be the first version of an SCIO Reports module.

design-needed: Comprehensive pipeline with reports

From a chat with @daniel-eder

In fact, it may be very interesting to create components in the pipeline that produce such output: SBOMs, lists of to-dos and don'ts, a full "compliance" file including all copyrights, licenses, notices, disclaimers, etc. as required by each license, with each step handling one specific type of output.

Add new pipeline for basic codebase scan

The input to that pipeline would be code archive(s).
The pipeline would:

  1. extract the archives (not recursively at first)
  2. run the equivalent of a scancode -clipeu scan on that code
  3. create one or more inventory analysis "workfile" reports as CSV, JSON (or XLS, TBD) listing all the captured resources and packages in a format TBD.

The reports would be created by the pipeline and stored on disk to be retrieved by the API
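
A minimal sketch of step 2, assuming the scancode CLI from scancode-toolkit is available on the PATH (the paths and output file name are illustrative):

    import subprocess

    def scan_extracted_code(codebase_dir, output_file):
        """Run the equivalent of a `scancode -clipeu` scan on the extracted code."""
        subprocess.run(
            [
                "scancode",
                "-clipeu",  # copyrights, licenses, info, packages, emails, urls
                "--json-pp", str(output_file),
                str(codebase_dir),
            ],
            check=True,
        )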

Failed to scan Debian Docker image

While scanning https://hub.docker.com/_/mongo with the docker.py pipeline we get:

        "01-debian-agpl-sspl-mongo-latest.tar"
    ],
    "next_run": null,
    "runs": [
        {
            "url": "http://127.0.0.1:8001/api/runs/c9337b56-04b6-45c4-a1e8-aa82c85edb19/",
            "pipeline": "scanpipe/pipelines/docker.py",
            "description": "A pipeline to analyze a Docker image.",
            "project": "http://127.0.0.1:8001/api/projects/28f6738c-f2f8-49d9-8299-ab6ff9e9987f/",
            "uuid": "c9337b56-04b6-45c4-a1e8-aa82c85edb19",
            "run_id": "1599755581029165",
            "created_date": "2020-09-10T16:32:58.966347Z",
            "task_id": "7771f598-6b51-434e-b663-96b09b4488e1",
            "task_start_date": "2020-09-10T16:32:59.017750Z",
            "task_end_date": "2020-09-10T16:38:01.455060Z",
            "task_exitcode": 1,
            "task_output": [
                "Validating your flow...",
                "    The graph looks good!",
                "Running pylint...",
                "    Pylint is happy!",
                "2020-09-10 16:33:01.031 Workflow starting (run-id 1599755581029165):",
                "2020-09-10 16:33:01.035 [1599755581029165/start/1 (pid 26030)] Task is starting.",
                "2020-09-10 16:33:01.990 [1599755581029165/start/1 (pid 26030)] Task finished successfully.",
                "2020-09-10 16:33:01.995 [1599755581029165/extract_images/2 (pid 26037)] Task is starting.",
                "2020-09-10 16:33:04.129 [1599755581029165/extract_images/2 (pid 26037)] Task finished successfully.",
                "2020-09-10 16:33:04.134 [1599755581029165/extract_layers/3 (pid 26043)] Task is starting.",
                "2020-09-10 16:33:05.773 [1599755581029165/extract_layers/3 (pid 26043)] Task finished successfully.",
                "2020-09-10 16:33:05.838 [1599755581029165/find_images_linux_distro/4 (pid 26049)] Task is starting.",
                "2020-09-10 16:33:06.812 [1599755581029165/find_images_linux_distro/4 (pid 26049)] Task finished successfully.",
                "2020-09-10 16:33:06.816 [1599755581029165/collect_images_information/5 (pid 26055)] Task is starting.",
                "2020-09-10 16:33:07.782 [1599755581029165/collect_images_information/5 (pid 26055)] Task finished successfully.",
                "2020-09-10 16:33:07.786 [1599755581029165/collect_and_create_codebase_resources/6 (pid 26062)] Task is starting.",
                "2020-09-10 16:33:54.536 [1599755581029165/collect_and_create_codebase_resources/6 (pid 26062)] Task finished successfully.",
                "2020-09-10 16:33:54.541 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)] Task is starting.",
                "2020-09-10 16:38:00.696 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)] <flow DockerPipeline step collect_and_create_system_packages> failed:",
                "2020-09-10 16:38:00.697 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     Internal error",
                "2020-09-10 16:38:00.697 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)] Traceback (most recent call last):",
                "2020-09-10 16:38:01.054 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/metaflow/cli.py\", line 883, in main",
                "2020-09-10 16:38:01.054 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     start(auto_envvar_prefix='METAFLOW', obj=state)",
                "2020-09-10 16:38:01.054 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/click/core.py\", line 829, in __call__",
                "2020-09-10 16:38:01.054 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     return self.main(args, kwargs)",
                "2020-09-10 16:38:01.054 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/click/core.py\", line 782, in main",
                "2020-09-10 16:38:01.054 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     rv = self.invoke(ctx)",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/click/core.py\", line 1259, in invoke",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     return _process_result(sub_ctx.command.invoke(sub_ctx))",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/click/core.py\", line 1066, in invoke",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     return ctx.invoke(self.callback, ctx.params)",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/click/core.py\", line 610, in invoke",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     return callback(args, kwargs)",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/click/decorators.py\", line 33, in new_func",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     return f(get_current_context().obj, args, kwargs)",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/metaflow/cli.py\", line 444, in step",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     max_user_code_retries)",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/metaflow/task.py\", line 394, in run_step",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     self._exec_step_function(step_func)",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/metaflow/task.py\", line 47, in _exec_step_function",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     step_function()",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"scanpipe/pipelines/docker.py\", line 87, in collect_and_create_system_packages",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     docker_pipes.scan_image_for_system_packages(self.project, image)",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/scanpipe/pipes/docker.py\", line 105, in scan_image_for_system_packages",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     for i, (purl, package, layer) in enumerate(installed_packages):",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/container_inspector/image.py\", line 329, in get_installed_packages",
                "2020-09-10 16:38:01.055 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     for purl, package in layer.get_installed_packages(packages_getter):",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/scanpipe/pipes/debian.py\", line 16, in package_getter",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     for package in packages:",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/packagedcode/debian.py\", line 211, in get_installed_packages",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     for package in parse_status_file(base_status_file_loc, distro=distro):",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]   File \"/tmp/scancode.io/lib/python3.6/site-packages/packagedcode/debian.py\", line 228, in parse_status_file",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)]     raise FileNotFoundError('[Errno 2] No such file or directory: {}'.format(repr(location)))",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)] FileNotFoundError: [Errno 2] No such file or directory: '/tmp/scancode.io/var/projects/sspl-28f6738c/codebase/01-debian-agpl-sspl-mongo-latest.tar-extract/6f90c94ad68f6b08882985f9884f3154469709ca8af796d52726ac7562f7ff1c/var/lib/dpkg/status'",
                "2020-09-10 16:38:01.056 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)] ",
                "2020-09-10 16:38:01.057 [1599755581029165/collect_and_create_system_packages/7 (pid 26104)] Task failed.",
                "2020-09-10 16:38:01.057 Workflow failed.",
                "2020-09-10 16:38:01.057 Terminating 0 active tasks...",
                "2020-09-10 16:38:01.057 Flushing logs...",
                "    Step failure:",
                "    Step collect_and_create_system_packages (task-id 7) failed.",
                ""
            ],
            "execution_time": 302

Create license scan quality improvement campaigns for specific ecosystems

Doing massive scans of all the packages of a given ecosystem (say Maven, PyPI, etc.), I would like to:

Some candidates for these could be:

Create/generate documentation for "pipes"

To help a pipeline creator, knowing which "pipes" are available, what they do, and how to use them would be mighty useful.
Since a "pipe" is just a plain Python function (that we organize by module for clarity), the best approach would be to use docstrings and generate clean documentation from them.

Use project name as argument to run a pipeline

We already use the name in all the management commands, but the UUID is used on the Pipeline class.

scanpipe run --project <NAME> and pipelines/docker.py run --project <UUID>

Let's use name everywhere for consistency.

Process rootfs one at a time

In the root_filesystems.py pipeline in the step:

    @step
    def match_not_analyzed_to_system_packages(self):
        """
        Match not-yet-analyzed files to files already related to system packages.
        """
        rootfs.match_not_analyzed(
            self.project,
            reference_status="system-package",
            not_analyzed_status="",
        )
        self.next(self.match_not_analyzed_to_application_packages)

... we should consider processing one rootfs at a time if there are several rootfs at once in the project, e.g. for rfs in self.root_filesystems: ..., as in the sketch below.
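
A sketch of that per-rootfs variant, assuming rootfs.match_not_analyzed can be scoped to a single root filesystem (the extra keyword argument is hypothetical):

    @step
    def match_not_analyzed_to_system_packages(self):
        """
        Match not-yet-analyzed files to files already related to system
        packages, one root filesystem at a time.
        """
        for rfs in self.root_filesystems:
            rootfs.match_not_analyzed(
                self.project,
                reference_status="system-package",
                not_analyzed_status="",
                root_filesystem=rfs,  # hypothetical scoping argument
            )
        self.next(self.match_not_analyzed_to_application_packages)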

Improve/add documentation

The documentation we have is sparse. We would need:

  • a basic and descriptive README that explains what this is
  • installation documentation
  • usage documentation
  • scanpipe pipelines tutorials and documentation

Generate standard JSON files from SCIO-DB

When you use ScanPipe to analyze a codebase and load the Scan data into the SCIO-DB, you should also be able to extract the Scan data from the SCIO-DB according to any combination of the basic ScanCode runtime parameters:
--info
--copyright
--license
--package
--email
--url
The primary output format should be standard SCTK JSON.
I am not sure what we should put in the output file header, but we would at least want to know what version of SCTK was used for the original Scan. In any case, this JSON file must be compatible with ScanCode Workbench, since the primary use case is to view Scan data there.

Track which Docker image/layer a resource or package is found in

In a docker analysis pipeline, we have the layer information, as this is always part of the resource path. We may also have the image name in the path, but that's not guaranteed. We need a way to explicitly attach to a discovered package and a codebase resource which Docker image it is found in, and also to determine what the base image/base image layers are.
In the simplest way, this would be stored as extra_data attached to each of these objects and extra_data should also be returned with the JSON results.
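
A minimal sketch of that extra_data approach; the helper, field names, and values are illustrative assumptions:

    def tag_with_image_provenance(obj, image_name, layer_id, is_base_layer):
        """Record on `obj` (a discovered package or codebase resource) which
        Docker image/layer it was found in, via its extra_data field."""
        obj.extra_data.update(
            {
                "image": image_name,        # e.g. "library/mongo:latest"
                "layer_id": layer_id,       # e.g. "6f90c94ad68f"
                "is_base_image_layer": is_base_layer,
            }
        )
        obj.save(update_fields=["extra_data"])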

Generate streamlined analysis workfile (CSV) from SCIO-DB

A streamlined CSV workfile will be very useful for SCA planning. The columns we need are listed below by SCTK runtime option, using current SCTK CSV output column names. If there are multiple values in a JSON field, we want all of the values in one cell ("flattened"), as in the sketch after the column list.

Info:
Resource
type
name
base_name
extension
size
sha1
mime_type
file_type
programming_language

Copyrights:
copyright
copyright_holder
author

Licenses:
license_expression
license__key
license__score
license__category
license__owner

Emails and URLs:
email
url

Packages:
package__type
package__namespace
package__name
package__version
package__primary_language
package__description
package__release_date
package__homepage_url
package__download_url
package__size
package__sha1
package__vcs_url
package__copyright
package__license_expression
package__declared_license
package__notice_text
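
A minimal sketch of the "flattened" convention; the separator is an illustrative choice:

    def flatten(values, separator="\n"):
        """Return a single CSV cell value from a multi-valued JSON field."""
        return separator.join(str(v) for v in values or [])

    flatten(["mit", "apache-2.0"])  # one cell containing "mit\napache-2.0"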

Add support for RPM-based distros in the docker and rootfs images scanpipe pipelines

There is no easy way to access the RPM database but through librpm and the rpm executable.
The installed RPMs database comes in three formats:

  1. bdb: a legacy Berkeley DB hash used as a key/value store where the value is a binary blob that contains all the RPM data. The format of this blob should be the same as the RPM header format, and scancode-toolkit can parse the headers. This is the format that was/is used in older RH, CentOS, Fedora and nearly every other RPM distro.
  2. sqlite: a SQLite database where one table is used as a key/value store where the value is a binary blob that contains all the RPM data in the same binary format as in 1., the RPM header format. This is the format used in newer RH, CentOS and Fedora versions.
  3. ndb: a new key/value store that is built into librpm. This is the format used by newer openSUSE distros.

librpm provides support for each of these formats and also contains a built-in read-only handler for the bdb format (1.), such that librpm can be built without Berkeley DB and still read an older RPM db (for instance, to convert it to a newer format).

It needs to be built with specific flags to enable all these formats (typically, a given build of a distro does not need to support all the formats).

The installed DBs locations are:

Distro | Path | Format
CentOS 8 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 9, native byte-order)
CentOS 5 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 8, native byte-order)
Fedora 30 | /var/lib/rpm/rpmdb.sqlite | SQLite 3.x database
Fedora 20 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 9, native byte-order)
openMandriva | /var/lib/rpm/Packages | Berkeley DB (Hash, version 10, native byte-order)
RHEL 8 | /var/lib/rpm/Packages | Berkeley DB (Hash, version 9, native byte-order)
openSUSE 20200528 | /usr/lib/sysimage/rpm/Packages.db | ndb format (file(1) reports it only as "data")

In addition, on Fedora distros there are files under /etc/yum.repos.d/* that contain base and mirror URLs for the repos used to install RPMs. Each file is in .ini format. On openSUSE and SLES, these are under /etc/zypp/repos.d.

The licenses (when not deleted as in some CentOS Docker images) are found in /usr/share/licenses/<package name>/<license files> or /usr/share/doc/<package name>/<license files>

If using the rpm CLI, this can create XML-like output:
./rpm --query --all --qf '[%{*:xml}\n]' --rcfile=./rpmrc --dbpath=<path to>/var/lib/rpm > somefile.xml
The --rcfile option may not be needed, but when using a fresh RPM build it is.

The RPM db may need to be rebuilt first when it is in a bdb format from an older version than the bdb with which librpm was built.
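
A minimal wrapper around that rpm invocation (the binary location and paths are illustrative):

    import subprocess

    def dump_rpmdb_xml(output_path, dbpath, rpm_bin="./rpm", rcfile=None):
        """Dump all installed RPMs from `dbpath` as XML-like output, using
        the query format shown above."""
        # "[%{*:xml}\n]" is passed with a literal backslash-n, as in the CLI example.
        args = [rpm_bin, "--query", "--all", "--qf", "[%{*:xml}\\n]"]
        if rcfile:
            args.append(f"--rcfile={rcfile}")
        args.append(f"--dbpath={dbpath}")
        with open(output_path, "w") as output:
            subprocess.run(args, stdout=output, check=True)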

Add reporting features

Being able to report in CSV, SPDX, and so on would be useful. We could either:

  1. support all scancode-toolkit plugins
  2. add a reporting module
