bitly / data_hacks

Command line utilities for data analysis

Home Page: http://github.com/bitly/data_hacks

Python 100.00%

data_hacks's Issues

should ceil the scale in bar_chart.py

Around line 54.
scale = int(float(max_value) / value_characters)

Should be
scale = int(math.ceil(float(max_value) / value_characters))

That way values like 1.4 become 2 instead of 1, which helps keep the tick marks from line wrapping.
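
A minimal sketch of the difference, assuming a hypothetical helper named bar_scale and an 80-character budget for the bar (neither is taken from bar_chart.py itself):

```python
import math


def bar_scale(max_value, value_characters):
    """Count represented by one tick, rounded up so the widest bar fits."""
    return int(math.ceil(float(max_value) / value_characters))


# With a maximum of 112 and 80 characters available, the ratio is 1.4.
truncated = int(float(112) / 80)  # truncation keeps one tick per unit
rounded_up = bar_scale(112, 80)   # ceiling doubles the count per tick

# Integer division mirrors the Python 2 behaviour of value / scale.
widest_bar_truncated = 112 // truncated
widest_bar_rounded_up = 112 // rounded_up
```

With truncation the widest bar is longer than the 80 characters available, so it wraps; with the ceiling it fits.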

Sort by values flag does not work in bar_chart.py

Need to modify bar_chart.py as follows:

if options.sort_values:
    data = [[value,key] for key,value in data.items()]
    data.sort(reverse=True)
else:
    # sort by keys
    data = [[key,value] for key,value in data.items()]
    data.sort()
    data = [[value, key] for key,value in data]
format = "%" + str(max_length) + "s [%6d] %s"
for value,key in data:
    print format % (key[:max_length], value, (value / scale) * "*")
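
The same ordering logic can be sketched in Python 3 without swapping the pairs back and forth. The function name sorted_rows and the sample counts are illustrative assumptions, not part of bar_chart.py; unlike the patch above, ties between equal values keep their original order here rather than being broken by key.

```python
def sorted_rows(data, sort_values=False):
    """Return (key, value) pairs ordered by value (descending) or by key."""
    if sort_values:
        return sorted(data.items(), key=lambda kv: kv[1], reverse=True)
    return sorted(data.items())


counts = {"b": 5, "a": 2, "c": 9}  # made-up sample data
by_value = sorted_rows(counts, sort_values=True)
by_key = sorted_rows(counts)
```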

Allow histogram.py and other scripts to be imported/used as python modules

Hi, I'd like to generate histograms to standard out in my python script, and currently have to do
os.system("echo '%s' | histogram.py" % "\n".join(values))

It would be great if I could instead do

from data_hacks import histogram
histogram.histogram(values)
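
Until something like that exists, a possible interim workaround is to feed the values to the script through subprocess rather than through a shell echo, which avoids quoting problems. This is only a sketch: the helper name pipe_values is hypothetical, and it assumes Python 3.7+ and that histogram.py is executable and on the PATH.

```python
import subprocess
import sys


def pipe_values(values, command=("histogram.py",)):
    """Send one value per line to the command's stdin and return its stdout."""
    payload = "\n".join(str(v) for v in values) + "\n"
    result = subprocess.run(
        list(command),
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout


# Stand-in command for demonstration only: counts the values it receives.
counter = (sys.executable, "-c", "import sys; print(len(sys.stdin.read().split()))")
```

Calling pipe_values(values) with the default command would run histogram.py itself and return the rendered chart as a string.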

install fails, install from git leaves src folder on disk... fix the docs please ?

pip install data_hacks
Downloading/unpacking data-hacks
Could not find any downloads that satisfy the requirement data-hacks
Some externally hosted files were ignored (use --allow-external data-hacks to allow).
Cleaning up...
No distributions at all found for data-hacks

pip version of data_hacks is not up to date

There seem to be a bunch of changes to this repo that aren't reflected in the version that people are downloading through pip. Updating it would be great. Thanks!

histogram.py errors if all values are equal

histogram.py errors if all values are equal. e.g.

vagrant@localhost:~/ws$ echo -e '16\n16\n16\n' | ~/ws/bitly/data_hacks/histogram.py
Traceback (most recent call last):
  File "/home/vagrant/ws/bitly/data_hacks/histogram.py", line 300, in <module>
    options.agg_key_value), options)
  File "/home/vagrant/ws/bitly/data_hacks/histogram.py", line 150, in histogram
    raise ValueError('max must be > min. max:%s min:%s' % (max_v, min_v))
ValueError: max must be > min. max:16 min:16
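
One way the degenerate case could be handled is to fall back to a single bucket when the minimum and maximum coincide. The sketch below is an assumption about a possible fix, not the logic in histogram.py; the function name bucket_boundaries and its default of 10 buckets are hypothetical.

```python
def bucket_boundaries(min_v, max_v, buckets=10):
    """Return the upper boundary of each equal-width bucket."""
    if max_v < min_v:
        raise ValueError("max must be >= min. max:%s min:%s" % (max_v, min_v))
    if max_v == min_v:
        # All samples are identical: one bucket holds everything.
        return [max_v]
    step = (max_v - min_v) / buckets
    return [min_v + step * (i + 1) for i in range(buckets)]


same = bucket_boundaries(16, 16)
spread = bucket_boundaries(0, 10, buckets=5)
```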

New version of this package

Hello devs,

This package has been a part of my workflow for several years now, mainly since I spend most of my time on the command line. I see it's not really maintained anymore. I would like to take responsibility for it if no one minds, mostly so I can get it working with Python 3 (and have this version on PyPI) and add some features.

If this sounds okay, I propose one of three ways to transition this (in order of my preference):

1. I fork the package and keep the same name. In this scenario, I'd like access to the data_hacks package on PyPI so I can upload the Python 3 version and keep it up-to-date.
2. I fork the package and use a new name (e.g. data_hacks_3 or data_cli). This would be more like making my own package with this package as a starting point. In this scenario, I would just need your permission, I'll handle the rest.
3. I offer my support to the main fork of the package. I think this solution would cause the most overhead for you all, so that's why I've listed this solution last.

Let me know which sounds best for you. Thanks!
Ewen

Error when piping output into another program

On OS X, with a list in my clipboard, I ran:
pbpaste | bar_chart.py -v |head -n 30

The chart works fine when I do not pipe into head, but I wanted to show only the top 30 items. Piping into head does limit the output to 30 rows, as expected, but at the end I also see this error output printed:

close failed in file object destructor:
Error in sys.excepthook:

Original exception was:
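
This kind of message typically appears when the reader (here head) closes the pipe while the script is still writing. A minimal sketch of catching that condition, assuming Python 3 (where the error surfaces as BrokenPipeError) and a hypothetical helper named write_lines:

```python
import io


def write_lines(lines, out):
    """Write lines to out; return False if the reader closed the pipe early."""
    try:
        for line in lines:
            out.write(line + "\n")
        out.flush()
    except BrokenPipeError:
        # The downstream process stopped reading; stop quietly.
        return False
    return True


# Demonstration against an in-memory stream.
buffer = io.StringIO()
completed = write_lines(["a", "b"], buffer)
```

In a real script, out would be sys.stdout. The Python documentation additionally suggests redirecting stdout to os.devnull after catching the error, so that the flush at interpreter exit does not fail again.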

Minus symbol for negative numbers.
Number of digits in the number.

Below is a screenshot highlighting the above.

plotting script cannot handle missing values

bar_chart.py crashes if uniq -c produces a line with an empty count, for example:

cat results/train_4.txt | bar_chart.py -a  --sort-keys
Traceback (most recent call last):
  File "/Users/aub3/portenv/bin/bar_chart.py", line 114, in <module>
    run(load_stream(sys.stdin), options)
  File "/Users/aub3/portenv/bin/bar_chart.py", line 52, in run
    data[kv[1]] += value
IndexError: list index out of range
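
A defensive version of the aggregation step might look like the sketch below. It assumes input lines in the "count key" layout that uniq -c emits; the function name aggregate and the "(empty)" placeholder for a missing key are illustrative choices, not bar_chart.py behaviour.

```python
from collections import defaultdict


def aggregate(lines, missing_key="(empty)"):
    """Sum counts per key, tolerating blank lines and lines without a key."""
    data = defaultdict(int)
    for line in lines:
        kv = line.split(None, 1)
        if not kv:
            continue  # blank line: nothing to count
        try:
            value = int(kv[0])
        except ValueError:
            continue  # first field is not a count: skip the line
        key = kv[1].strip() if len(kv) > 1 else missing_key
        data[key] += value
    return dict(data)


totals = aggregate(["  3 apple\n", "  2 \n", "\n", "  4 apple\n", "x y\n"])
```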

histogram.py switch for logarithmic buckets

When I have many outliers, I often get histograms like:

$ time (./a.out 100000|histogram.py -b 10)
# NumSamples = 100000; Min = 237.00; Max = 37599.00
# Mean = 321.560610; Variance = 64719.622326; SD = 254.400516; Median 303.000000
# each ∎ represents a count of 1333
  237.0000 -  3973.2000 [ 99993]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 3973.2000 -  7709.4000 [     0]: 
 7709.4000 - 11445.6000 [     1]: 
11445.6000 - 15181.8000 [     0]: 
15181.8000 - 18918.0000 [     0]: 
18918.0000 - 22654.2000 [     0]: 
22654.2000 - 26390.4000 [     3]: 
26390.4000 - 30126.6000 [     1]: 
30126.6000 - 33862.8000 [     0]: 
33862.8000 - 37599.0000 [     2]:

This is not helpful. I can see that I have outliers, but what does the distribution inside the first bucket look like? It is the most important one, and I want to understand what's in it.

What I want is a logarithmic histogram, like the ones dtrace shows, where the bucket width doubles with every bucket.
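
A rough sketch of the doubling-bucket idea, to make the request concrete. The function names and the first_width parameter are hypothetical, and this is not how dtrace or histogram.py compute buckets; it only illustrates widths that double at each step.

```python
from bisect import bisect_left


def doubling_boundaries(min_v, max_v, first_width=1.0):
    """Upper boundaries of buckets whose width doubles at each step."""
    if max_v < min_v:
        raise ValueError("max must be >= min")
    boundaries = []
    width = first_width
    upper = min_v + width
    while True:
        boundaries.append(upper)
        if upper >= max_v:
            break
        width *= 2
        upper += width
    return boundaries


def bucket_counts(values, boundaries):
    """Count values per bucket; a value equal to a boundary goes below it."""
    counts = [0] * len(boundaries)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts


samples = [0, 1, 2, 3, 4, 8, 10]  # made-up sample data
edges = doubling_boundaries(min(samples), max(samples))
counts = bucket_counts(samples, edges)
```

Narrow buckets near the minimum keep the detail where most samples fall, while a few wide buckets still cover the outliers.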

Can I send a PR?