decuser / decuser_python_playground
A repository for python scripts of interest
License: BSD 3-Clause "New" or "Revised" License
I'm not sure this is a 100% great idea, but it's worth looking into. Instead of completely processing files larger than 100 MB, grab a 100 MB sample from each and use that to calculate a digest. Seed the random number generator with a magic number so the table of seek offsets is the same on every run. It should be fairly simple to do and, if performance sucks, easy to remove. The hope is that this will make digests of multi-gigabyte files faster to compute while maintaining a degree of confidence that two files are the same if they have the same shallow digest. The thinking is that the shallow digest version can be run frequently on directories that aren't expected to change much, and the deep version can be run as needed for increased certainty. I may play with the 100 MB minimum and with the size of the sample.
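A minimal sketch of the shallow-digest idea described above, not the dircmp implementation. The function name `shallow_sha1`, the seed value, the 1 MB chunk size, and the choice to mix the file size into the sampled digest are all assumptions made for illustration.

```python
import hashlib
import os
import random

MAGIC_SEED = 20191210              # hypothetical magic number; any fixed value works
THRESHOLD = 100 * 1024 * 1024      # files at or below this size are hashed in full
SAMPLE_BYTES = 100 * 1024 * 1024   # approximate amount of data to sample
CHUNK = 1024 * 1024                # bytes read per seek (assumed)


def shallow_sha1(path, threshold=THRESHOLD, sample_bytes=SAMPLE_BYTES,
                 chunk=CHUNK, seed=MAGIC_SEED):
    """Return a SHA-1 hex digest: full for small files, sampled for large ones."""
    size = os.path.getsize(path)
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        if size <= threshold:
            # Deep digest: read the whole file in chunks.
            for block in iter(lambda: f.read(chunk), b""):
                digest.update(block)
            return digest.hexdigest()

        # Shallow digest: a freshly seeded generator yields the same offsets
        # for the same file size on every run.
        chunk = min(chunk, size)
        rng = random.Random(seed)
        n_chunks = max(1, sample_bytes // chunk)
        offsets = sorted(rng.randrange(0, size - chunk + 1)
                         for _ in range(n_chunks))
        # Assumption: mixing in the size cheaply separates files of
        # different lengths that happen to agree on the sampled bytes.
        digest.update(str(size).encode("ascii"))
        for offset in offsets:
            f.seek(offset)
            digest.update(f.read(chunk))
    return digest.hexdigest()
```

Two files with equal shallow digests agree on size and on every sampled chunk, but bytes outside the sample are never read, so a difference there goes unnoticed. That is the trade-off the deep version exists to cover.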
Some considerations
Issue with effectively empty src:
Calculating sha1 digests in src Traceback (most recent call last):
  File "/Users/wsenn/bin/dircmp.py", line 394, in <module>
    [src_files_dict, revidx_src_files] = calculate_sha1s(args['srcdir'], "src", src_files)
  File "/Users/wsenn/bin/dircmp.py", line 70, in calculate_sha1s
    display_progress(current_progress, src_files_bytes, 50)
  File "/Users/wsenn/bin/dircmp.py", line 165, in display_progress
    per_progress = int((total / curr) * 100)
ZeroDivisionError: division by zero
$ tree dst
dst
0 directories, 0 files
$ tree src
src
└── a
1 directory, 0 files
$ ls -a dst
. .. .DS_Store .DS_Store.o
$ ls -a src
. .. a
$ ls -a src/a
. .. .empty .emptytoo
$
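The traceback above ends in a division by zero inside `display_progress`. A minimal sketch of a guard follows, assuming progress is tracked in bytes, so a tree holding only empty files has a total of zero. `percent_complete` and its parameter names are hypothetical and not taken from dircmp.

```python
def percent_complete(done, total):
    """Return progress as an integer percentage in the range 0..100."""
    if total <= 0:
        # Nothing to process (for example, only zero-byte files):
        # report the work as complete instead of dividing by zero.
        return 100
    return min(100, int(done / total * 100))
```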
Add support for scanning a single directory and/or tree.
Without breaking anything, let's clean up the code in anticipation of automating some tests. Use a branch for this so there aren't so many piddly versions. Prior to merging, do a regression test for all combinations of arguments and directory scenarios.
dircmp -crf ./notes ~/sandboxes/notes
+----------------------------------+
| Welcome to dircmp version 0.7.3 |
| Created by Will Senn on 20191210 |
| Last updated 20210805 |
+----------------------------------+
Arguments: -crf ./notes /Users/wsenn/sandboxes/notes
Digest: sha1
Source (src): ./notes/
Destination (dst): /Users/wsenn/sandboxes/notes/
Compact mode: True
Single directory mode: False
Show all files: False
Recurse subdirectories: True
Calculate shallow digests: True
Traceback (most recent call last):
  File "/Users/wsenn/bin/dircmp", line 482, in <module>
    [src_files, src_files_bytes, num_src_dirs, num_src_files] = get_files(args['srcdir'], "src")
  File "/Users/wsenn/bin/dircmp", line 345, in get_files
    [num_dirs, num_files, files] = recurse_subdir(dir_to_analyze, args['recurse'], args['all'])
  File "/Users/wsenn/bin/dircmp", line 379, in recurse_subdir
    if tfiles[0] == ".":
IndexError: list index out of range
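The `IndexError` above means `tfiles` was empty when its first element was read. What `tfiles` holds isn't shown here, so the following is only a sketch of the guard, assuming it is a list of strings; the helper name `has_leading_dot_entry` is hypothetical.

```python
def has_leading_dot_entry(tfiles):
    """Return True only when the list is non-empty and its first entry is '.'."""
    # An empty list is falsy, so the index is never evaluated for it.
    return bool(tfiles) and tfiles[0] == "."
```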
The current progress bar appears to print a dot for every file processed. This is annoying when you've got more than 100 files, and it's obnoxious with 400k.
Consider printing brackets 50 chars apart, then filling with dots. First pass, a regular dot; then back up and overwrite with a bold dot, etc.
[.................................................]
Stick to the simple version for now: display a dot '.' for every 1% complete, recalculated after every file processed.
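A minimal sketch of the simple version, assuming progress is counted in files. The class name `DotProgress` and its interface are hypothetical. It tracks how many dots have already been printed and emits only the missing ones, so 400k files still produce at most 100 dots.

```python
import sys


class DotProgress:
    """Print one '.' per 1% of work completed."""

    def __init__(self, total, out=sys.stdout):
        self.total = total
        self.out = out
        self.dots = 0  # dots printed so far

    def update(self, done):
        # Integer percentage; an empty job counts as fully complete.
        if self.total <= 0:
            target = 100
        else:
            target = min(100, done * 100 // self.total)
        if target > self.dots:
            self.out.write("." * (target - self.dots))
            self.out.flush()
            self.dots = target
```

Calling `update(n)` after each processed file, with `n` the running count, is enough. Calls that do not cross a new percentage boundary print nothing.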
dircmp appears to be counting subdirectories as well as regular files. Something needs to be done: either keep separate counts or reconcile the count some other way.
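One way to keep separate counts is to let `os.walk` split each directory's entries into subdirectories and files. This is a sketch only; `count_tree` and its `recurse` flag are hypothetical names, not dircmp's own.

```python
import os


def count_tree(root, recurse=True):
    """Return (num_dirs, num_files) below root, counted separately."""
    num_dirs = 0
    num_files = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        num_dirs += len(dirnames)    # subdirectories only
        num_files += len(filenames)  # non-directory entries only
        if not recurse:
            break  # stop after the top-level directory
    return num_dirs, num_files
```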
Haven't a clue, but Windows doesn't appear to be happy with:
skey = re.sub(r'^' + re.escape(srcpath), dstpath, key)
Could be a problem with srcpath, dstpath, or key.
python dircmp.py tests/default/src tests/default/dst
+------------------------------------+
| Welcome to dircmp version 0.5.0 |
| Created by Will Senn on 20191210 |
| Last updated 20191212 |
+------------------------------------+
Digest: sha1
Source (src): tests/default/src\
Destination (dst): tests/default/dst\
Show all files: False
Recurse subdirectories: False
Scanning src ... 9 files found (0.0s).
Calculating sha1 digests in src ... done (0.0s).
Scanning dst ... 7 files found (0.0s).
Calculating sha1 digests in dst... done (0.0s).
Analyzing src directory ...done (0.0s).
Analyzing dst directory ...done (0.0s).
Comparing src to dst ...Traceback (most recent call last):
  File "dircmp.py", line 264, in <module>
    skey = re.sub(r'^' + re.escape(srcpath), dstpath, key)
  File "C:\py38\lib\re.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\py38\lib\re.py", line 325, in _subx
    template = _compile_repl(template, pattern)
  File "C:\py38\lib\re.py", line 316, in _compile_repl
    return sre_parse.parse_template(repl, pattern)
  File "C:\py38\lib\sre_parse.py", line 988, in parse_template
    this = sget()
  File "C:\py38\lib\sre_parse.py", line 256, in get
    self.__next()
  File "C:\py38\lib\sre_parse.py", line 245, in __next
    raise error("bad escape (end of pattern)",
re.error: bad escape (end of pattern) at position 17
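The traceback points at the replacement argument rather than the pattern. `re.sub` processes backslash escapes in a replacement string, and the destination shown in the log, `tests/default/dst\`, ends in a backslash at index 17, which matches the reported position. Two possible workarounds are sketched below; the function names are hypothetical. A callable replacement is used as-is without escape processing, and plain prefix slicing avoids `re` altogether.

```python
import re


def remap_with_regex(key, srcpath, dstpath):
    """Swap a leading srcpath for dstpath, keeping the regex approach."""
    # The return value of a callable replacement is inserted literally,
    # so backslashes in dstpath are not treated as escapes.
    return re.sub(r'^' + re.escape(srcpath), lambda _match: dstpath, key)


def remap_with_slice(key, srcpath, dstpath):
    """Swap a leading srcpath for dstpath without regular expressions."""
    if key.startswith(srcpath):
        return dstpath + key[len(srcpath):]
    return key
```

For example, with `srcpath = 'tests/default/src\\'` and `dstpath = 'tests/default/dst\\'`, both functions map `'tests/default/src\\a.txt'` to `'tests/default/dst\\a.txt'`.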
Duplicates found in src/: 4 files found.
da39a3ee5e6b4b0d3255bfef95601890afd80709 a/.empty
da39a3ee5e6b4b0d3255bfef95601890afd80709 a/.emptytoo
da39a3ee5e6b4b0d3255bfef95601890afd80709 b/.empty
da39a3ee5e6b4b0d3255bfef95601890afd80709 b/.emptytoo
The way times are displayed prevents a simple comparison of logs to determine whether the system is working. Either write standard unit tests or ditch the timings.
When -b and -s are specified with a src directory, nothing is displayed about duplicate files, whereas in plain single dir mode there is:
+----------------------------------+
| Welcome to dircmp version 0.7.0 |
| Created by Will Senn on 20191210 |
| Last updated 20210803 |
+----------------------------------+
......Started at 2021-08-04 06:09:06.029075
1 dirs, 9 files analyzed.
1 dirs, 9 files found in src/.
Finished at 2021-08-04 06:09:06.045889
vs
...
Duplicates found in src/: 6 files found.
0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both
0026a27ffa78a4a4963175c35fbee11c332049ed same_in_both_copy
c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src
c62a323c301dfb0f3cc8e27609c7f507d1965b64 only_in_src_copy
da39a3ee5e6b4b0d3255bfef95601890afd80709 empty
da39a3ee5e6b4b0d3255bfef95601890afd80709 empty_in_both
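For reference, a duplicate listing like the one above can be produced by grouping file names by digest and keeping the groups with more than one member. This sketch assumes a dict that maps file name to hex digest; `find_duplicates` is a hypothetical name, not the function dircmp uses.

```python
from collections import defaultdict


def find_duplicates(digests):
    """Map each digest shared by two or more files to its sorted file names."""
    by_digest = defaultdict(list)
    for name, digest in digests.items():
        by_digest[digest].append(name)
    return {digest: sorted(names)
            for digest, names in by_digest.items()
            if len(names) > 1}
```

Running the same grouping regardless of which display flags are set would give the `-b -s` case the same duplicate report as plain single directory mode.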