level's Issues

Windows 32 bit FileInfo

When tested on a 32-bit Windows Vista system, the program crashes as soon as the FileWalk begins.

It throws a nil pointer error on a line that calls Mode().IsRegular() on a FileInfo.

This could be a bug in the 32-bit Windows golang implementation, but I do not have a 32-bit Windows system to test this on myself.

Crash Report - chan receive?

I am encountering a crash that appears to stem from the receiving channel that processes the crc32 summary:

panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x0 pc=0x4b449b]

goroutine 5 [running]:
hash/crc32.update(0x0, 0x0, 0xc208013200, 0x689, 0x889, 0xc2080466c0)
/home/vagrant/.gvm/gos/go1.4.2/src/hash/crc32/crc32.go:104 +0x5b
hash/crc32.updateCastagnoli(0x7f8d00000000, 0xc208013200, 0x689, 0x889, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/hash/crc32/crc32_amd64x.go:26 +0x7d
hash/crc32.Update(0xc200000000, 0x0, 0xc208013200, 0x689, 0x889, 0x889)
/home/vagrant/.gvm/gos/go1.4.2/src/hash/crc32/crc32.go:112 +0x55
hash/crc32.(*digest).Write(0xc208ac8e60, 0xc208013200, 0x689, 0x889, 0x889, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/hash/crc32/crc32.go:118 +0x62
github.com/cdelorme/level6/l6.func·001(0x0)
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/l6/level6.go:102 +0x64d
created by github.com/cdelorme/level6/l6.(*Level6).HashAndCompare
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/l6/level6.go:110 +0x24e

goroutine 1 [chan send]:
github.com/cdelorme/level6/l6.(*Level6).HashAndCompare(0xc20803ca90)
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/l6/level6.go:125 +0x43f
main.main()
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/main.go:95 +0x14e9

goroutine 6 [runnable]:
syscall.Syscall(0x0, 0x4, 0xc20836c000, 0x20f, 0xf, 0x20f, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/syscall/asm_linux_amd64.s:21 +0x5
syscall.read(0x4, 0xc20836c000, 0x20f, 0x20f, 0xc2080d8c00, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/syscall/zsyscall_linux_amd64.go:867 +0x6e
syscall.Read(0x4, 0xc20836c000, 0x20f, 0x20f, 0x507580, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/syscall/syscall_unix.go:136 +0x58
os.(*File).read(0xc2085cc000, 0xc20836c000, 0x20f, 0x20f, 0x70, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/os/file_unix.go:191 +0x5e
os.(*File).Read(0xc2085cc000, 0xc20836c000, 0x20f, 0x20f, 0x7f8d7fece098, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/os/file.go:95 +0x91
bytes.(*Buffer).ReadFrom(0xc208ae0000, 0x7f8d7fcec970, 0xc2085cc000, 0x0, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/bytes/buffer.go:169 +0x25a
io/ioutil.readAll(0x7f8d7fcec970, 0xc2085cc000, 0x20f, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/io/ioutil/ioutil.go:33 +0x1b0
io/ioutil.ReadFile(0xc2080d8c00, 0x72, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/vagrant/.gvm/gos/go1.4.2/src/io/ioutil/ioutil.go:70 +0x1b5
github.com/cdelorme/level6/l6.func·001(0x1)
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/l6/level6.go:95 +0x3db
created by github.com/cdelorme/level6/l6.(*Level6).HashAndCompare
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/l6/level6.go:110 +0x24e

goroutine 7 [chan receive]:
github.com/cdelorme/level6/l6.func·002()
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/l6/level6.go:117 +0x7b
created by github.com/cdelorme/level6/l6.(*Level6).HashAndCompare
/home/vagrant/.gvm/pkgsets/go1.4.2/global/src/github.com/cdelorme/level6/l6/level6.go:120 +0x353

windows 8.1 resource exhaustion

Tested on Windows 8.1 and Linux against a 60GB folder of miscellaneous files.

Tests were run with golang 1.3.1.

The Linux box has 32GB of memory; the Windows box has 16GB.

Execution worked great on Linux: it consumed lots of memory, but the process kept up and finished. On Windows it crashes with "resource-exhaustion-detected". It could be the difference in available memory, or a more general bug in handling. I tried forcing the GC, which did not help, and confirmed that the crash happens during generatehashes.

My assumption is that ReadFile is not efficient enough, but I could not find a more efficient approach when searching Google for examples and documentation. For our use case the entire file's contents must pass through the hash.

buffered channels

I added buffers to my channels while writing the second draft of the source, but after testing I concluded not to include them.

test case

I tested without buffers, and with buffers sized at the number of workers, twice the number of workers, and five times the number of workers:

  • 0
  • 1x
  • 2x
  • 5x

My tests were run on Debian Linux x64 with 32GB of memory.

I used two data sets:

  • single-copy of 12GB of documents
  • duplicated copies of those 12GB as a biased data-set to test hash performance

The first allowed me to test a more realistic situation, one likely to contain significantly fewer duplicates (it was an old set of personal documents, so some duplicates were sure to exist).

The second was deliberately biased to stress hash performance; it certainly does not represent a normal data set, nor would it necessarily yield unbiased performance results (the extra time spent hashing skews how long the channels take to process the data).

For each configuration I ran the program ten times against each dataset, capturing summary output, verbose logging, and cpu profiling output.

performance

Biased Data (tested first):

Summary output for each run indicated 206242 files scanned and 151674 hashes generated (crc32 and sha256).

In all cases the CPU profile was invariably similar, with the most time spent in Syscall and the sha256 hashing functions.

My tests yielded very similar timings for unbuffered and 2x-buffered channels, with the unbuffered runs slightly more erratic and occasionally .05 seconds slower.

According to the debug output, the spread of work ranged between 800-1300 items per worker. This range remained the same for nearly all test cases, with 800-900 & 1200-1300 being outliers in all cases. As the buffer sizes increased the range became slightly tighter, but the outliers occurred at least once in every test.

The 1x and 5x buffers were approximately .1 seconds slower.

Again, because the majority of the time in these tests was likely spent processing duplicate hashes, the timing is biased.

Normal Data:

Summary output revealed 103121 files scanned, with 74270 crc32 hashes generated and 29786 sha256 hashes generated (a good example of crc32 being less accurate but effective as a first-pass filter, because that is roughly 50,000 fewer potential sha256 hashes it may have generated).

In this case cpu profiling revealed that a majority of time was still spent in syscall and sha256, but this time sha256 accounted for significantly less (though it remained the second most expensive function). I conclude from these results that none of the code I wrote in this project is a major bottleneck, though that does not mean my code cannot be improved.

Work distribution ranged between 700-1100, and outliers in the 700 & 1000 ranges had roughly the same rate of occurrence throughout the tests.

Execution time ranges:

  • 1.27-1.31 seconds w/o buffers
  • 1.28-1.31 seconds with 1x buffers
  • 1.30-1.36 seconds with 2x buffers
  • 1.35-1.42 seconds with 5x buffers

conclusion

In conclusion, I removed the channel buffers from my source. The buffers did not appear to meaningfully improve work distribution, and in almost all cases they slowed execution; the 2x buffer was the exception when working on a moderately large data set, and only by .05 seconds.

This may be different on a single-core system, and if anyone is willing to supply me with some test results I'd be happy to see them.
