wf's Introduction

wf

A Unix-style command line utility for counting word frequencies.

Usage

wf [options]

Reads stdin and outputs newline-delimited rows, each containing a unique word and the number of times it appears, separated by a space.

Options:

-n --nums           Include numbers
-s --sort           Sort output alphabetically by words (incompatible with -f)
-f --freq           Sort output from most to least frequent (incompatible with -s)
-h --help           Display usage information
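
The counting and -f ordering described above can be sketched in Rust roughly as follows. This is an illustration only, not wf's actual source: the function names `count_words` and `rows_by_frequency` are hypothetical, tokens are assumed to be split on whitespace alone, and the number filtering controlled by -n is left out.

```rust
use std::collections::HashMap;

/// Count how often each whitespace-separated token appears in `input`.
/// Assumption: splitting on whitespace only; wf's real tokenizer may differ.
fn count_words(input: &str) -> HashMap<String, u64> {
    let mut counts = HashMap::new();
    for word in input.split_whitespace() {
        *counts.entry(word.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Render the counts as "word count" rows, most frequent first (like -f).
/// Ties are broken alphabetically so the output order is deterministic.
fn rows_by_frequency(counts: &HashMap<String, u64>) -> Vec<String> {
    let mut pairs: Vec<(&String, &u64)> = counts.iter().collect();
    pairs.sort_by(|a, b| b.1.cmp(a.1).then_with(|| a.0.cmp(b.0)));
    pairs
        .into_iter()
        .map(|(word, count)| format!("{} {}", word, count))
        .collect()
}
```

Sorting the pairs by word alone instead would give the alphabetical ordering that -s describes.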

Installation

To install the wf binary, run the following with an up-to-date version of cargo:

cargo install wf

Development

This project uses clippy, and Travis CI checks all PRs and branches against it. It is therefore advisable to install clippy and run it locally before submitting a PR.

License

Copyright 2018 by Annaia Berry. This project is licensed under the Affero GPL v3. See LICENSE for full details.

wf's People

Contributors

jarcane

wf's Issues

Significant slow down at around 400k lines

First off, thank you for writing this tool! It's been a great help!

I'm building a word frequency list for online reviews and ran into a significant slowdown at around 400k lines. The combined data to be counted is about 80M lines (~850M words), so as it stands the run would seemingly never finish.

If anyone else runs into this problem and is looking for a band-aid, the following script does a map-reduce-style split-count-combine:

# install python dependencies: pip install tqdm
# install gnu parallel: sudo apt-get install parallel
# usage: pass the input file as the first argument ($1)
mkdir -p splits wfs
echo 'Splitting file into parts...'
split -a 5 -l 100000 "$1" splits/split
ls splits/ | parallel 'echo "Counting {}..."; cat splits/{} | wf > wfs/{}_wf.txt'
echo 'Combining split counts...'
python -c 'from tqdm import tqdm; from functools import reduce; from glob import glob; from collections import Counter; of = open("wfs.txt", "w"); wf = reduce(lambda a, b: a + b, (Counter(dict((pair[0], int(pair[1])) for pair in (line.strip().split() for line in open(fpath)))) for fpath in tqdm(glob("wfs/*"))), Counter()); [of.write("{} {}\n".format(key, count)) for key, count in sorted(wf.items(), key=lambda p: -p[1])]'
rm -rf wfs splits
echo 'Word frequencies written to wfs.txt.'
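
For comparison, the combine step of the script above could be sketched in Rust along these lines. It is only an illustration of merging partial counts, not part of wf: `parse_row` and `merge_counts` are hypothetical names, each partial output is assumed to be already in memory as a string, and malformed rows are silently skipped.

```rust
use std::collections::HashMap;

/// Parse one "word count" row as printed by wf.
/// Returns None when the row has no word or no numeric count.
fn parse_row(line: &str) -> Option<(String, u64)> {
    let mut parts = line.split_whitespace();
    let word = parts.next()?;
    let count = parts.next()?.parse().ok()?;
    Some((word.to_string(), count))
}

/// Sum the counts from several partial outputs into a single table.
fn merge_counts<'a, I>(partials: I) -> HashMap<String, u64>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut total = HashMap::new();
    for partial in partials {
        for (word, count) in partial.lines().filter_map(parse_row) {
            *total.entry(word).or_insert(0) += count;
        }
    }
    total
}
```

Adding into one shared map, rather than building a fresh table for every pair of partial results, keeps the merge cost proportional to the total number of rows read.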

Broken pipe error when piping the output to `head`

Hey, I just used wf for some data exploration and it was pretty useful! I tried to look up the most frequent word in a text and piped wf's output to head. I got the following error message. The error probably happens because head exits after it has printed the requested number of lines. I would expect a command-line tool like wf to handle piping to head gracefully.

% wf -f < text.txt | head -1
the 2228
thread 'main' panicked at 'failed printing to stdout: Broken pipe (os error 32)', libstd/io/stdio.rs:692:9
note: Run with `RUST_BACKTRACE=1` for a backtrace.

I'm using wf 0.2.0 on macOS.
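
One common way for a Rust command-line tool to tolerate this is to write with `writeln!` rather than `println!` and to treat `ErrorKind::BrokenPipe` as a normal end of output. The sketch below shows that pattern under the assumption that the rows are already formatted; `write_rows` is a hypothetical helper, not a function from wf.

```rust
use std::io::{self, ErrorKind, Write};

/// Write each row followed by a newline. If the reader (for example `head`)
/// has closed its end of the pipe, stop quietly instead of failing.
/// Any other I/O error is returned to the caller.
fn write_rows<W: Write>(out: &mut W, rows: &[String]) -> io::Result<()> {
    for row in rows {
        if let Err(err) = writeln!(out, "{}", row) {
            return if err.kind() == ErrorKind::BrokenPipe {
                Ok(())
            } else {
                Err(err)
            };
        }
    }
    Ok(())
}
```

Called as `write_rows(&mut io::stdout().lock(), &rows)`, this lets a pipeline such as `wf -f < text.txt | head -1` end without a panic, since `println!` panics when a write to stdout fails while `writeln!` returns the error.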
