Giter Site home page Giter Site logo

my_find_grep's Introduction

my_find_grep & copy

parallel find, grep & copy

parallel or concurrent find grep with O_DIRECT mode for grep TL;DR:

  • grep uses a few kilobytes as buffer and reads files chunkwise, skips files if a '\0' is found and checks if the search string is at the buffer boundary
  • threads increment atomic counters to indicate if they are working and might find new directories which are the work-units the threads are working on

============== my find grep: concurrent find respectively concurrent grep ==============

(grep with O_DIRECT mode: (name the executable 'mfgo' or change the code below)

compile:

g++ -O3 -std=c++17 -pthread my_find_grep.cpp -o mfg

intel compiler:

dpcpp -O3 -std=c++17 -pthread fsscanner_atomic_find_grep.cpp -o fsscanner_atomic_find_grep && time ./fsscanner_atomic_find_grep 2 ~/nvme/code MAINTAINERS grep

usage: ./mfg <search_string> [grep|find|grepCPP|grepCPPI] # defaults to find

  • grep: skipping binary files and reading chunks/buffer with read() syscall
  • find: C++ string search on whole file path
  • grepCPP: using C++ searches binary files
  • grepCPPI: using C++ skipping binary files (like grep -I)

e.g. ./mfg 64 /usr/src/linux "MAINTAINERS" grep

notes:

grep by default skips binary files (files with \0 in them)

also searching in file path part not just after last more like find -type f | grep <search_string>

---- advanced notes ---- heuristic:

  • one directory per thread
  • no stat syscalls to speed up traversal
  • array of atomic variables are used to synchronize/wait for other threads which might find new directories
  • increment atomic variable in section where we do directory listings and might find new directories
  • decrement after we are done in this section
  • check global vector with directories for more work
  • wait if other threads have a postive atomic variable
  • else: die
    • don't wait forever: say 1000 threads are running, but near the end only a few directories are left and ~999 are "busy waiting and checking atomic variables, i.e. waisting cpu time

for more open files

ulimit -n 123456

for more threads > 32k; I guess 2 mmap's per thread

echo 777888 > /proc/sys/vm/max_map_count

e.g. use iostat -x -p nvme0n1p1 --human 1 to find sweet spot mpstat -P ALL 1 vmstat -m 1 | egrep "dentry|inode_cache" # print slabs bcc/tools/syscount.py to monitor syscalls

read-ahead settings, scheduler, etc: https://cromwell-intl.com/open-source/performance-tuning/disks.html https://wiki.mikejung.biz/Ubuntu_Performance_Tuning

echo kyber > /sys/block/nvme0n1/queue/scheduler echo 512 > /sys/block/nvme0n1/queue/nr_requests echo 0 > /sys/block/nvme0n1/queue/read_ahead_kb

Benchmarks:

find: $ time find /nvme -name "MAINTAINERS" 15 secs; in-cache: 1.8s $ time ./mfg 32 threads 1 sec; in-cache: 0.25s

$ time ./mfg 128 /nvme "MAINTAINERS" grep 16-20 sec minimum: with 209 threads: 11 sec ---> high variance than xargs or ripgrep

skip binary files:

time (find /nvme/ -type f -print0 | xargs -P64 -0 grep -I MAINTAINERS | wc -l) 20 sec

my_find_grep's People

Contributors

stefan20162016 avatar

Stargazers

 avatar  avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.