
copydetect's Introduction

copydetect

Screenshot of copydetect code comparison output

Overview

Copydetect is a code plagiarism detection tool based on the approach proposed in "Winnowing: Local Algorithms for Document Fingerprinting" and used by the popular MOSS platform. Copydetect takes a list of directories containing code as input and generates an HTML report displaying copied slices as output. The implementation takes advantage of fast numpy functions for efficient generation of results. Code tokenization is handled by Pygments, so all 500+ languages that Pygments can detect and tokenize are in turn supported by copydetect.

Note that, like MOSS, copydetect is designed to detect likely instances of plagiarism; it is not guaranteed to catch cheaters dedicated to evading it, and it does not provide a guarantee that plagiarism has occurred.

Installation

Copydetect can be installed using pip install copydetect. Note that Python version 3.7 or greater is required. You can then generate a report using the copydetect command (copydetect.exe on Windows; if your Scripts folder is not in your PATH, the code can also be run using py.exe -m copydetect).

Usage

The simplest usage is copydetect -t DIRS, where DIRS is a space-separated list of directories to search for input files. This will recursively search for all files in the provided directories and compare every file with every other file. To look only at specific file extensions, use -e followed by another space-separated list (for example, copydetect -t student_code -e cc cpp h).

If the files you want to compare to are different from the files you want to check for plagiarism (for example, if you want to also compare to submissions from previous semesters), use -r to provide a list of reference directories. For example, copydetect -t PA01_F20 -r PA01_F20 PA01_S20 PA01_F19. To avoid matches with code that was provided to students, use -b to specify a list of directories containing boilerplate code.

There are several options for tuning the sensitivity of the detector. The noise threshold, set with -n, is the minimum number of matching characters between two documents that is considered plagiarism. Note that this is AFTER tokenization and filtering, where variable names have been replaced with V, function names with F, etc. If you change -n (default value: 25), you will also have to change the guarantee threshold, -g (default value: 30). This is the number of matching characters for which the detector is guaranteed to detect the match. If speed isn't an issue, you can set this equal to the noise threshold. Finally, the display threshold, -d (default value: 0.33), is used to determine what percentage of code similarity is considered interesting enough to display on the output report. The distribution of similarity scores is plotted on the output report to assist selection of this value.

There are several other command line options for different use cases. If you only want to check for "lazy" plagiarism (direct copying without changing variable names or reordering code), -f can be used to disable code filtering. If you don't want to compare files in the same leaf directory (for example, if code is split into per-student directories and you don't care about self plagiarism), use -l. For a complete list of configuration options, see the following section.
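
Putting several of these options together, a typical invocation might look like the following (the directory names are placeholders):

copydetect -t PA01_F20 -r PA01_F20 PA01_S20 PA01_F19 -b PA01_starter_code -e cc cpp h -d 0.5 -l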

Configuration Options

Configuration options can be provided either by using the command line arguments or by using a JSON file. If a JSON file is used, specify it on the command line using -c (e.g., copydetect -c configuration.json). A sample configuration file is available here. The following list provides the names of each JSON configuration key along with its associated command line arguments; an illustrative configuration file is sketched after the option lists.

  • test_directories (-t, --test-dirs): a list of directories to recursively search for files to check for plagiarism.
  • reference_directories (-r, --ref-dirs): a list of directories to search for files to compare the test files to. This should generally be a superset of test_directories. If not provided, the test directories are used as reference directories.
  • boilerplate_directories (-b, --boilerplate-dirs): a list of directories containing boilerplate code. Matches between fingerprints present in the boilerplate code will not be considered plagiarism.
  • extensions (-e, --extensions): a list of file extensions containing code the detector should look at.
  • noise_threshold (-n, --noise-thresh): the smallest sequence of matching characters between two files which should be considered plagiarism. Note that tokenization and filtering replaces variable names with V, function names with F, object names with O, and strings with S so the threshold should be lower than you would expect from the original code.
  • guarantee_threshold (-g, --guarantee-thresh): the smallest sequence of matching characters between two files for which the system is guaranteed to detect a match. This must be greater than or equal to the noise threshold. If computation time is not an issue, you can set guarantee_threshold = noise_threshold.
  • display_threshold (-d, --display-thresh): the similarity percentage cutoff for displaying similar files on the detector report.
  • force_language (-o, --force-language): forces the tokenizer to tokenize input as a specific language, rather than automatically detecting the language using the file extension.
  • same_name_only (-s, --same-name): if true, the detector will only compare files that have the same name (for example, decision_tree.py will not be compared to k_nn.py). Note that this also means that, for example, bryson_k_nn.py will not be compared to sara_k_nn.py.
  • ignore_leaf (-l, --ignore-leaf): if true, the detector will not compare files located in the same leaf directory.
  • disable_filtering (-f, --disable-filter): if true, the detector will not tokenize and filter code before generating file fingerprints.
  • disable_autoopen (-a, --disable-autoopen): if true, the detector will not automatically open a browser window to display the report.
  • truncate (-T, --truncate): if true, highlighted code will be truncated to remove non-highlighted regions from the displayed output (sections not within 10 lines of highlighted code will be replaced with "...").
  • out_file (-O, --out-file): path to save output report to. A '.html' extension will be added to the path if not provided. If a directory is provided instead of a file, the report will be saved to that directory as report.html.
  • encoding (--encoding): encoding to use for reading files (the default is UTF-8). If files use varying encodings, --encoding DETECT can be used to detect the encoding of all files (note: encoding detection requires the chardet package).

Advanced options:

  • css (--css): Optional list of CSS files that will be linked in the generated HTML report file. These will overwrite the styling of the default report.
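
Putting these keys together, a minimal configuration file might look like the following sketch (the directory names are placeholders and the threshold values shown are simply the defaults):

{
  "test_directories": ["PA01_F20"],
  "reference_directories": ["PA01_F20", "PA01_S20", "PA01_F19"],
  "boilerplate_directories": ["PA01_starter_code"],
  "extensions": ["cc", "cpp", "h"],
  "noise_threshold": 25,
  "guarantee_threshold": 30,
  "display_threshold": 0.33
}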

API

Copydetect can also be run via the Python API. An example of basic usage is provided below. API documentation is available here.

>>> from copydetect import CopyDetector
>>> detector = CopyDetector(test_dirs=["tests"], extensions=["py"],
...                         display_t=0.5)
>>> detector.add_file("copydetect/utils.py")
>>> detector.run()
  0.00: Generating file fingerprints
   100%|████████████████████████████████████████████████████| 8/8
  0.31: Beginning code comparison
   100%|██████████████████████████████████████████████████| 8/8
  0.31: Code comparison completed
>>> detector.generate_html_report()
Output saved to report/report.html
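
For lower-level use, individual files can also be fingerprinted and compared directly. The sketch below mirrors the calls that appear in the issues further down (the file names are placeholders; the 25 and 1 arguments mirror the values used in those examples):

>>> import copydetect
>>> fp1 = copydetect.CodeFingerprint("file1.py", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("file2.py", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> code1, _ = copydetect.utils.highlight_overlap(fp1.raw_code, slices[0], ">>", "<<")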

copydetect's People

Contributors

amaennel, ankostis, blingenf, chaseleif, ilya-ilya, kjjohnsen


copydetect's Issues

report.html does not print and saves variable names instead of values.

Hi

I am using python 3.7.3

and my report HTML looks like the attached screenshot, with template variable names shown instead of their values.

I guess it is a Python version issue.

Also, the report.html was saved in pkg_resources.resource_filename('copydetect', 'data/')
rather than the specified output directory. I specified the output directory both in the constructor for CopyDetector and later in the generate_html_report method.

Feature Request: Ignore Pattern Regex

Feature Proposal:
Introduce an "ignore pattern regex" option to the Copydetect command line interface, allowing users to specify regex patterns for files or directories to be excluded from the plagiarism detection process.

Use Case:
In complex codebases, there are often files or directories that are irrelevant to plagiarism detection, introducing unnecessary noise. For instance, excluding Rust's target directory or specific project artifacts could improve the precision of plagiarism detection.

Proposed Implementation:
Add a new command line option, e.g., --ignore-pattern, where users can input a regex pattern. This pattern would be used to identify and exclude files or directories during plagiarism detection.

Example Usage:

copydetect -t DIRS --ignore-pattern "target|build|common_artifact"

Awaiting community feedback.

Why are we ignoring the duplicate hashes

While I was trying to solve the common score problem (having a single score for each test file), I figured out that I have to keep the duplicate hashes. In the current code, at every step (while removing the boilerplate and while finding the common hashes), we keep only the unique hashes.

Is this on purpose? Or did we face any issues while trying to keep the repeated hashes?

The merged pull request on Feb 3 is breaking ~> slice_matrix

I could fix this issue by going back to the commit before the merge:
"only run CI on pull requests and master pushes" 62d4bfd

I have used copydetect multiple times before with no issues.

I had deleted the repo and installation files to save space, so when I re-cloned it I got the merged changes from February.

I have tried varying the values in my config for "noise_threshold", "guarantee_threshold" and "display_threshold".

Each time I run it I get different indices in the KeyError but the same trace otherwise:

Traceback (most recent call last):
  File ".local/bin/copydetect", line 11, in <module>
    sys.exit(main())
  File ".local/lib/python3.6/site-packages/copydetect/__main__.py", line 117, in main
    detector.generate_html_report()
  File ".local/lib/python3.6/site-packages/copydetect/detector.py", line 637, in generate_html_report
    code_list = self.get_copied_code_list()
  File ".local/lib/python3.6/site-packages/copydetect/detector.py", line 598, in get_copied_code_list
    slices_test = self.slice_matrix[(y[idx], x[idx])][1]
KeyError: (9, 16)

In detector.py (line numbers may vary slightly due to my modifications; some lines are skipped).

There is a statement, if (x[idx], y[idx]) in self.slice_matrix:
Then there was just an else, assuming the key would be present reversed, which was not happening.
I can get the script to complete and produce my report.html by adding the elif and the final else:

560     def get_copied_code_list(self):
594             if (x[idx], y[idx]) in self.slice_matrix:
595                 slices_test = self.slice_matrix[(x[idx], y[idx])][0]
596                 slices_ref = self.slice_matrix[(x[idx], y[idx])][1]
597             elif (y[idx], x[idx]) in self.slice_matrix:
598                 slices_test = self.slice_matrix[(y[idx], x[idx])][1]
599                 slices_ref = self.slice_matrix[(y[idx], x[idx])][0]
600             else:
601                 print(f'({x[idx]}, {y[idx]}) not in self.slice_matrix')
602                 slices_test = np.array([])
603                 slices_ref = np.array([])

By adding in the elif and the final else I can just get this warning:

.local/lib/python3.6/site-packages/copydetect/utils.py:167: UserWarning: empty slices array
  warnings.warn("empty slices array")
(11, 0) not in self.slice_matrix

I notice here that self.slice_matrix[] is only conditionally assigned:

485     def _comparison_loop(self):
515         for i, test_f in enumerate(tqdm(self.test_files,
517             for j, ref_f in enumerate(self.ref_files):
531                 else:
532                     overlap, (sim1, sim2), (slices1, slices2) = compare_files(
533                         self.file_data[test_f], self.file_data[ref_f]
534                     )
535                     comparisons[(test_f, ref_f)] = (i, j)
536                     if slices1.shape[0] != 0:
537                         self.slice_matrix[(i, j)] = [slices1, slices2]

I notice here that len(slices2[0]) is not checked; adding a check doesn't help, but it seems odd:

100 def compare_files(file1_data, file2_data):
130     slices1 = get_copied_slices(idx1, file1_data.k)
131     slices2 = get_copied_slices(idx2, file2_data.k)
132     if len(slices1[0]) == 0:
133         return 0, (0,0), (np.array([]), np.array([]))
156     return token_overlap1, (similarity1,similarity2), (slices1,slices2)

if len(slices1[0]) == 0 or len(slices2[0]) == 0:

I saw in the commit history that changes happened to the slice_matrix, so I reverted to the commit before the merge and now it is working again.

Not many lines were changed; the problem should be in here: 16874d6

Is similarity wrong ??

Inside the compare_files method, we are using

..
..
slices1 = get_copied_slices(idx1, file1_data.k)
.. 
token_overlap1 = np.sum(slices1[1] - slices1[0])
..
..
if len(file1_data.filtered_code) > 0:
        similarity1 = token_overlap1 / file1_data.token_coverage
..

to find the similarity.

But here, I think the slices are actual character (not token) slices. When we find the similarity, shouldn't we be dividing by len(file_data.raw_code) instead? (Assuming token_coverage is actually the number of tokens, not characters, in the code file.)

(0, 0) similarity

Hello! While using the method described in your API documentation "For advanced use cases", I have encountered some examples that give an "IndexError: index 1 is out of bounds for axis 0 with size 0" error when trying to highlight, along with (0, 0) similarities and (array([], dtype=float64), array([], dtype=float64)) slices. It seems as if they are processed incorrectly by copydetect, since there are some similar lines in them. In the case of solutions which have only one line (like "print(0)" in Python), adding something (like "import numpy as np") to them helps. I would like to know whether this is a bug or whether your tool is supposed to work this way.

>>> fp6 = copydetect.CodeFingerprint("solution 2.py", 25, 1)
>>> fp8 = copydetect.CodeFingerprint("solution 4.py", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp6, fp8)
>>> print(similarities, slices)
(0, 0) (array([], dtype=float64), array([], dtype=float64))
>>> code1, _ = copydetect.utils.highlight_overlap(fp6.raw_code, slices[0], ">>", "<<")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/copydetect/utils.py", line 163, in highlight_overlap
    hl_percent = np.sum(slices[1] - slices[0])/len(doc)
IndexError: index 1 is out of bounds for axis 0 with size 0

solution 2.py:

a1 = []  # new list
n = int(input())  # length of list
for i in range(n):  
    new_element = int(input())  # new element
    a1.append(new_element)  # 

    
b1 = [] # new list
m = int(input())  # length of list
for i in range(m):  
    new_element = int(input())  # new element
    b1.append(new_element)  # 

s1 = set(b1)
ans = []
for i in range(n):
    if a1[i] in s1:
        ans.append(a1[i])
        
print(ans)

solution 4.py:

n = int(input())
#writing something here#writing something here#writing something here
lst = []#writing something here#writing something here#writing something here
for i in range(n):#writing something here#writing something here
    lst.append(int(input()))#writing something here
m = int(input())

lst2 = []#writing something here
for i in range(m):#writing something here
    lst2.append(int(input()))#writing something here

set_int = set(lst).intersection(set(lst2))#writing something here
res_lst = []#writing something here
for i in range(n):#writing something here#writing something here#writing something here#writing something here
    if (lst[i] in set_int):#writing something here#writing something here
        res_lst.append(lst[i])#writing something here#writing something here#writing something here

print(res_lst)#writing something here#writing something here#writing something here#writing something here



The same also happens for the following solutions:

#include<bits/stdc++.h>

using namespace std;
#define forn(i, n) for(int i = 0; i < int(n); i++) 
#define f first	
#define s second
#define ll long long
#define db double
#define pb push_back
typedef pair<int, int> pii;

void inppp(){
    freopen("input.txt", "r", stdin);
    freopen("output.txt", "w", stdout);
}


const int M = 1e9 + 7;
const int MAXN = 1e3 + 5;


//double ans = 0;



int main() {
	inppp();
	//setprecision(12);
	double t = 1;
	//cin >> t;
	int n;
	cin >> n;
	vector<int> arr(n);
	for(int i = 0; i < n; i++)
		cin >> arr[i];
	sort(arr.begin(), arr.end());

	int c = 1;
	for(int i = 1; i < n; i++) {
		if(arr[i] - arr[i - 1] > 1) c = 0;
	}

	cout << c << endl;
	
	return 0;
}

#include <iostream>
#include <algorithm>
#include <vector>
#include <map>
#include <set>
#include <iomanip>
#include <fstream>
#include <cmath>

using namespace std;

int check(ifstream &fin, int n)
{
  vector<int> vec;
  int a;

  while(fin >> a)
    vec.push_back(a);

  sort(vec.begin(), vec.end());
  
  for (int i = 0; i < n-1; ++i)
  {
    if (!(abs(vec[i]-vec[i + 1]) <= 1))
      return 0;
  }

  return 1;
}

int main()
{
  ifstream fin("input.txt");
  ofstream fout("output.txt");

  int a0;

  if (fin >> a0)
    fout << check(fin, a0);

  fin.close();
  fout.close();
  
  return 0;
}

feat: include version-id and parameters in the report

With significant changes in the scores from 0.4.2+, it's important for the provenance of HTML report files to contain the version of the program that generated the match ratios.
The cli-options used to generate the report are also useful when comparing results.

So different results

Hi! We compared two files in different versions of copydetect and got completely different similarity results. What could be the cause of this?
Version 0.3.0:

>>> print(copydetect.__version__)
0.3.0
>>> fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities[0],similarities[1])
0.851063829787234 0.8439716312056738

Version 0.4.0:

>>> print(copydetect.__version__)
0.4.0
>>> fp1 = copydetect.CodeFingerprint("solution1.c", 25, 1)
>>> fp2 = copydetect.CodeFingerprint("solution2.c", 25, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities[0],similarities[1])
0.41025641025641024 0.41025641025641024

solution1.c:

#include <stdio.h>
 int main() {
    int f, g;
    int tmp ;
    int a[10];
    
    // read the number of values n
    f=0;
    // build the array of n numbers
    for(f = 0 ; f < 10; f++) { 
        a[f]=f;
    }
    for(f = 0 ; f < 9; f++)
       for(g = 0 ; g < 9- f  ; g++)
           if(a[g] > a[g+1]) {           
              // if they are in the wrong order,
              //  swap them.
              tmp = a[g];
              a[g] = a[g+1] ;
              a[g+1] = tmp; 
           }
 }

solution2.c:

#include <stdio.h>
 int main() {
    int k, l;
    int tmp ;
    int a[10];
    // read the number of values n

    // build the array of n numbers
    for(k = 0 ; k < 10; k++) { 
        a[k]=k;
    }
    for(k = 0 ; k < 9; k++) { 
       // compare two adjacent elements.
       for(l = 0 ; l < 9- k  ; l++) {  
           if(a[l] > a[l+1]) {           
              // if they are in the wrong order,
              //  swap them.
              tmp = a[l];
              a[l] = a[l+1] ;
              a[l+1] = tmp; 
           }
        }
    }
 }

Taking high memory for large set of files

I was trying to test this on a large code base of around 4000 files. To push the limits, I took 3 files and made 1500 copies of each. I went through the source code. It was taking around 7 GB of memory; after some digging, I found out that most of it came from the slice_matrix.

It is a nested list. I tried converting it to a 2D defaultdict, but no luck: it took even more time with the same 7 GB of memory.

Any ideas on how to reduce the memory footprint?

And for such a large set of files, generating the HTML report doesn't make sense. Just the information like token overlap, similarities and slices would be enough, I feel.
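
A rough sketch of collecting only those values through the lower-level API while discarding the slice arrays (the file paths are placeholders and the 25/1 arguments mirror the other examples in this document):

import itertools
import copydetect

files = ["sub_a.py", "sub_b.py", "sub_c.py"]  # placeholder paths
fps = {f: copydetect.CodeFingerprint(f, 25, 1) for f in files}

results = []
for f1, f2 in itertools.combinations(files, 2):
    token_overlap, similarities, _slices = copydetect.compare_files(fps[f1], fps[f2])
    # keep only the numeric results, not the slice arrays, to limit memory use
    results.append((f1, f2, token_overlap, similarities))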

feat: report match totals for each test-file based on all reference files

Currently (v0.4.5) the tool reports match ratios between all pairs of test <--> ref files. I will focus here on match ratios for test files, but the same applies for ref files, reversed.

Let's assume these are the reported match-ratios for test files:

  graph LR;
      T1--70%-->R1;
      T1--65%-->R2;
      T1--55%-->R3;
      T2--3%-->R1;
      T2--2%-->R2;
      T2--8%-->R3;

What I'm missing is a new summary section with the grand-total match for each test file vs the whole ref codebase,
i.e. how many lines are copied, regardless of which specific ref file matched them,
something like this:

  graph LR;
      T1--98%-->R[R1+R2+R3] ;
      T2--11%-->R[R1+R2+R3] ;

When expanding the ratios in these sections I would expect to see only the "left" diff pane with the copied test code, like a code-coverage report, reporting the number of matches for each line of code (see the attached code-coverage-style screenshot).

Does that make sense?

Workaround

Currently I have to concatenate all ref files into a single one with a command like:

mkdir /tmp/all
find ref_project/ -name '*.cs' | xargs cat > /tmp/all/all.cs

... and then run against the new ref-folder:

copydetect -t test_project/ -r /tmp/all/ -e cs

UnicodeEncodeError in report generation

I'm getting the following traceback after "Code comparison completed":

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\omer\AppData\Roaming\Python\Python311\site-packages\copydetect\__main__.py", line 120, in <module>
    main()
  File "C:\Users\omer\AppData\Roaming\Python\Python311\site-packages\copydetect\__main__.py", line 117, in main
    detector.generate_html_report()
  File "C:\Users\omer\AppData\Roaming\Python\Python311\site-packages\copydetect\detector.py", line 668, in generate_html_report
    report_f.write(output)
  File "C:\Program Files\Python311\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode characters in position 112029-112032: character maps to <undefined>

I believe this has to do with some of the code (strings) containing non-Latin characters, because when I remove these strings the report is generated as expected.

Highlight problem for cpp code

Thanks for such an awesome tool! It's really creative.

The generated HTML seems to miss some tokens. Could you please have a look? (The DO_NOT_TAKE_OWNERSHIP token should be marked as copied.)

[Screenshot of the generated report, 2022-04-15]

The left file content:

// Copyright 2001 - 2003 Google, Inc.
//
// Google-specific types

#ifndef BASE_BASICTYPES_H_
#define BASE_BASICTYPES_H_

#include "kudu/gutil/integral_types.h"
#include "kudu/gutil/macros.h"

// Argument type used in interfaces that can optionally take ownership
// of a passed in argument.  If TAKE_OWNERSHIP is passed, the called
// object takes ownership of the argument.  Otherwise it does not.
enum Ownership {
  DO_NOT_TAKE_OWNERSHIP,
  TAKE_OWNERSHIP
};

// Used to explicitly mark the return value of a function as unused. If you are
// really sure you don't want to do anything with the return value of a function
// that has been marked WARN_UNUSED_RESULT, wrap it with this. Example:
//
//   scoped_ptr<MyType> my_var = ...;
//   if (TakeOwnership(my_var.get()) == SUCCESS)
//     ignore_result(my_var.release());
//
template<typename T>
inline void ignore_result(const T&) {
}


#endif  // BASE_BASICTYPES_H_

The right one:

// Copyright 2001 - 2003 Google, Inc.
//
// The following only applies to changes made to this file as part of YugaByte development.
//
// Portions Copyright (c) YugaByte, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
// in compliance with the License.  You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software distributed under the License
// is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
// or implied.  See the License for the specific language governing permissions and limitations
// under the License.
//
// Google-specific types

#ifndef BASE_BASICTYPES_H_
#define BASE_BASICTYPES_H_

#include "yb/gutil/integral_types.h"
#include "yb/gutil/macros.h"

// Argument type used in interfaces that can optionally take ownership
// of a passed in argument.  If TAKE_OWNERSHIP is passed, the called
// object takes ownership of the argument.  Otherwise it does not.
enum Ownership {
  DO_NOT_TAKE_OWNERSHIP,
  TAKE_OWNERSHIP
};

// Used to explicitly mark the return value of a function as unused. If you are
// really sure you don't want to do anything with the return value of a function
// that has been marked WARN_UNUSED_RESULT, wrap it with this. Example:
//
//   scoped_ptr<MyType> my_var = ...;
//   if (TakeOwnership(my_var.get()) == SUCCESS)
//     ignore_result(my_var.release());
//
template<typename T>
inline void ignore_result(const T&) {
}


#endif  // BASE_BASICTYPES_H_

detector.py only reads utf-8 encoded files

copydetect was throwing "file not ASCII text" for multiple files I am working on. In detector.py it expects all files to be UTF-8 encoded, so I used chardet to automatically detect file encodings and set the appropriate encoding when opening a file.

Fix 1.1 - add to imports:
import chardet

Fix 1.2 - detect character encoding:
In "CodeFingerPrint.init"

change:
with open(file, encoding="utf-8") as code_fp:

to:
with open(file, "rb") as raw_fp:
    rawdata = raw_fp.read()
encoding = chardet.detect(rawdata)['encoding']
with open(file, encoding=encoding) as code_fp:

This will allow copydetect to work with files of any encoding. I had this problem on my system and had to fix it.

Reference directories when using config file

The documentation says that if the reference_directories list is empty, it is set equal to the test_directories list (which is strictly required in any case).

But it is not stated (or at least not made clear) that when using the tool with the -c flag, the reference_directories list must be provided non-empty in the config file; it will not be copied from the test_directories list.

This confusing behaviour was confirmed when I inspected the code here (copy) and here (no copy): an empty reference_directories is only allowed for a fully-optioned CLI run, but not with a -c run.

Can you state this restriction more clearly in the docs or, probably better, update the -c mode to behave the same as a fully-optioned run?

Code highlighting and similarities for small code

Hello! While using copydetect on small solutions, I have found some cases where the highlighting seemed to miss some matching parts and the similarities were less than 70% (and less than 50% if more constructs are changed); changing the thresholds didn't help. At first I thought that this was the fault of {} brackets and of changing ++ to += 1, but that does not explain the highlighting. I would like to know whether this is a limitation of the winnowing algorithm or something that can be fixed.

With sol and sol2, I renamed the variables, added an unused one and changed n++ to n += 1. This led to the following results:

>>> fp1 = copydetect.CodeFingerprint("sol.c", 15, 1)
>>> fp2 = copydetect.CodeFingerprint("sol2.c", 15, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities)
(0.47058823529411764, 0.4444444444444444)
>>> code1, _ = copydetect.utils.highlight_overlap(fp1.raw_code, slices[0], ">>", "<<")
>>> code2, _ = copydetect.utils.highlight_overlap(fp2.raw_code, slices[1], ">>", "<<")
>>> print(code1)
int num = 0;

if (a0<0)
    num++;

double a>>;

while (fscanf(fin, "%lf", &a) != EOF)
{
    if (a<0)
        num+<<+;
}

return num;

>>> print(code2)
int n = 0;
    double b, c;
    if (a0 < 0)
        n += 1>>;
    while (fscanf(fin, "%lf", &b) != EOF)
    {
        if (b < 0)
            n +<<= 1;
    }
    return n;

When I changed n += 1 to n++ in sol3, the similarities got higher, but the highlighting seemed to miss some important parts:

>>> fp1 = copydetect.CodeFingerprint("sol.c", 15, 1)
>>> fp2 = copydetect.CodeFingerprint("sol3.c", 15, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities)
(0.6323529411764706, 0.6142857142857143)
>>> code1, _ = copydetect.utils.highlight_overlap(fp1.raw_code, slices[0], ">>", "<<")
>>> code2, _ = copydetect.utils.highlight_overlap(fp2.raw_code, slices[1], ">>", "<<")
>>> print(code1)
int num = 0;

if (a0<0)
    num++;

double a>>;

while (fscanf(fin, "%lf", &a) != EOF)
{
    if (a<0)
        num++;
}

return num;<<

>>> print(code2)
int n = 0;
    double b, c;
    if (a0 < 0)
        n ++>>;
    while (fscanf(fin, "%lf", &b) != EOF)
    {
        if (b < 0)
            n ++;
    }
    return n;<<

When I added { and } to if constructions in sol4, the result became this:

>>> fp1 = copydetect.CodeFingerprint("sol.c", 15, 1)
>>> fp2 = copydetect.CodeFingerprint("sol4.c", 15, 1)
>>> token_overlap, similarities, slices = copydetect.compare_files(fp1, fp2)
>>> print(similarities)
(0.4264705882352941, 0.3918918918918919)
>>> code1, _ = copydetect.utils.highlight_overlap(fp1.raw_code, slices[0], ">>", "<<")
>>> code2, _ = copydetect.utils.highlight_overlap(fp2.raw_code, slices[1], ">>", "<<")
>>> print(code1)
int num = 0;

if (a0<0)
    num++;

double a;

>>while (fscanf(fin, "%lf", &a) != EOF)
{
    if (a<0)
        <<num++;
}

return num;

>>> print(code2)
int n = 0;
    double b, c;
    if (a0 < 0)
    {
        n ++;
    }
    >>while (fscanf(fin, "%lf", &b) != EOF)
    {
        if (b < 0)
        <<{
            n ++;
        }
    }
    return n;

Raw code used here:
sol.c:

int num = 0;

if (a0<0)
    num++;

double a;

while (fscanf(fin, "%lf", &a) != EOF)
{
    if (a<0)
        num++;
}

return num;

sol2.c:

int n = 0;
    double b, c;
    if (a0 < 0)
        n += 1;
    while (fscanf(fin, "%lf", &b) != EOF)
    {
        if (b < 0)
            n += 1;
    }
    return n;

sol3.c:

int n = 0;
    double b, c;
    if (a0 < 0)
        n ++;
    while (fscanf(fin, "%lf", &b) != EOF)
    {
        if (b < 0)
            n ++;
    }
    return n;

sol4.c:

int n = 0;
    double b, c;
    if (a0 < 0)
    {
        n ++;
    }
    while (fscanf(fin, "%lf", &b) != EOF)
    {
        if (b < 0)
        {
            n ++;
        }
    }
    return n;

file not ASCII text

Hi there

When I run copydetect -t <folder> I get the error "file not ASCII text".

I have tried with a folder containing pdf files and a folder containing txt files, with the same result.

Any clues on what I'm doing wrong?

Feature Request: Ignore n levels of "leafs"

Hi!

I am using copydetect together with DOMjudge submissions. Effectively, the situation is that students can submit attempts at a problem multiple times. I want to check for plagiarism across students, but not within their own attempts. Here, ignore_leaf comes into play, since I can organize submissions by problem / student / <student attempts>. But if one attempt could contain multiple files, I'd like to organize submissions by problem / student / submission / <submission files>. Then, ignore_leaf doesn't help much. (Of course, I could still flatten all the student submissions into one directory, but I think there is an "easier" way)

Suggestion: Instead of making ignore_leaf a boolean flag, make it an integer so that two files are not compared to each other when their n-th parent directory is the same (see the sketch after the code below).

This should be easily doable in

if (test_f not in self.file_data
        or ref_f not in self.file_data
        or test_f == ref_f
        or (self.conf.same_name_only
            and (Path(test_f).name != Path(ref_f).name))
        or (self.conf.ignore_leaf
            and (Path(test_f).parent == Path(ref_f).parent))):
    continue
I can provide a PR if you like!
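
For illustration, the n-th-parent check could be factored into a small helper like the sketch below (the name and exact semantics are placeholders, not a final design):

from pathlib import Path

def same_nth_parent(test_f, ref_f, n):
    """Return True if the n-th parent directory of both files is identical.

    n=1 reproduces the current ignore_leaf behaviour (same leaf directory);
    larger n also skips pairs that only share a higher-level parent.
    """
    try:
        return Path(test_f).parents[n - 1] == Path(ref_f).parents[n - 1]
    except IndexError:
        # one of the paths is too shallow to have an n-th parent
        return False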

Code comparison for files not being done correctly for xml files.

Hi,
I have used this library to compare xml files. Even though two files have the same code, the library does not show it as copied. I have used the default configuration (see the attached screenshot).

Please let me know if there are some changes to be done.

Note: the same issue persists in a few .properties and java files as well.

Thank you in advance
