bitly / data_hacks
Command line utilities for data analysis
Home Page: http://github.com/bitly/data_hacks
Around line 54.
scale = int(float(max_value) / value_characters)
Should be
scale = int(math.ceil(float(max_value) / value_characters))
That way values like 1.4 become 2 instead of 1, which helps keep the tick marks from line wrapping.
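A quick check of the difference, as a minimal sketch. The numbers 70 and 50 are made-up stand-ins for max_value and value_characters, not values taken from bar_chart.py.

```python
import math

# Hypothetical numbers: the largest count to draw and the columns available for bars.
max_value = 70
value_characters = 50

floor_scale = int(float(max_value) / value_characters)            # 1.4 truncates to 1
ceil_scale = int(math.ceil(float(max_value) / value_characters))  # 1.4 rounds up to 2

# Width of the longest bar under each scale (integer division).
floor_width = max_value // floor_scale
ceil_width = max_value // ceil_scale

print(floor_scale, floor_width)  # 1 70 -> wider than the 50 columns available
print(ceil_scale, ceil_width)    # 2 35 -> fits
```

With truncation the longest bar overflows the available width, which is where the wrapping comes from; rounding up keeps it inside.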
Need to modify bar_chart.py as follows:
if options.sort_values:
    data = [[value,key] for key,value in data.items()]
    data.sort(reverse=True)
else:
    # sort by keys
    data = [[key,value] for key,value in data.items()]
    data.sort()
    data = [[value, key] for key,value in data]
format = "%" + str(max_length) + "s [%6d] %s"
for value,key in data:
    print format % (key[:max_length], value, (value / scale) * "*")
Hi, I'd like to generate histograms to standard out in my python script, and currently have to do
os.system("echo '%s' | histogram.py" % "\n".join(values))
It would be great if I could instead do
from data_hacks import histogram
histogram.histogram(values)
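Until something like that exists, one possible stopgap is to feed the values to the script with subprocess instead of building a shell string. This is only a sketch: pipe_values is a hypothetical helper name, and it assumes histogram.py is executable and on PATH.

```python
import subprocess

def pipe_values(values, command):
    """Send one value per line to `command` on stdin and return its stdout."""
    payload = "\n".join(str(v) for v in values) + "\n"
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return result.stdout

# Hypothetical usage, assuming histogram.py is on PATH:
# print(pipe_values([1, 2, 3, 4], ["histogram.py"]))
```

Passing the command as a list avoids the quoting problems of echo '%s' when a value contains a quote character.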
pip install data_hacks
Downloading/unpacking data-hacks
Could not find any downloads that satisfy the requirement data-hacks
Some externally hosted files were ignored (use --allow-external data-hacks to allow).
Cleaning up...
No distributions at all found for data-hacks
There seem to be a bunch of changes to this repo that aren't reflected in the version that people are downloading through pip. Updating it would be great. Thanks!
histogram.py errors if all values are equal, e.g.
vagrant@localhost:~/ws$ echo -e '16\n16\n16\n' | ~/ws/bitly/data_hacks/histogram.py
Traceback (most recent call last):
  File "/home/vagrant/ws/bitly/data_hacks/histogram.py", line 300, in <module>
    options.agg_key_value), options)
  File "/home/vagrant/ws/bitly/data_hacks/histogram.py", line 150, in histogram
    raise ValueError('max must be > min. max:%s min:%s' % (max_v, min_v))
ValueError: max must be > min. max:16 min:16
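One way the equal-values case could be tolerated, as a sketch only: bucket_edges is a hypothetical helper, not the function in histogram.py, and collapsing to a single bucket is just one possible choice.

```python
def bucket_edges(min_v, max_v, buckets):
    """Return the upper edge of each bucket (hypothetical helper)."""
    if max_v < min_v:
        raise ValueError('max must be >= min. max:%s min:%s' % (max_v, min_v))
    if max_v == min_v:
        # All samples are identical: a single bucket holds everything.
        return [max_v]
    step = (max_v - min_v) / float(buckets)
    return [min_v + step * (i + 1) for i in range(buckets)]

print(bucket_edges(16, 16, 10))  # [16]
```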
Hello devs,
This package has been a part of my workflow for several years now, mainly since I spend most of my time on the command line. I see it's not really maintained anymore. I would like to take responsibility for it if no one minds, mostly so I can get it working with Python 3 (and have this version on PyPI) and add some features.
If this sounds okay, I propose one of three ways to transition this (in order of my preference):
... data_hacks package on PyPI so I can upload the Python 3 version and keep it up-to-date.
... data_hacks_3 or data_cli). This would be more like making my own package with this package as a starting point. In this scenario, I would just need your permission, I'll handle the rest.
Let me know which sounds best for you. Thanks!
Ewen
On OS X, I had a list in my clipboard and ran:
pbpaste | bar_chart.py -v |head -n 30
The chart works fine when I do not pipe into head, but I wanted to show only the top 30 items. Piping into head does limit the output to 30 rows, as expected, but at the end I also see this error output printed:
close failed in file object destructor:
Error in sys.excepthook:
Original exception was:
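This looks like the usual broken-pipe situation: head exits after 30 lines while the script is still writing. A minimal sketch of one way to handle it in Python 3 follows. write_lines is a hypothetical helper, not code from bar_chart.py, and under Python 2 the error would surface as an IOError rather than BrokenPipeError.

```python
import sys

def write_lines(lines, stream=sys.stdout):
    """Write lines until the reader goes away; return how many were written."""
    written = 0
    try:
        for line in lines:
            stream.write(line + "\n")
            written += 1
        stream.flush()
    except BrokenPipeError:
        # The reader (e.g. head) closed the pipe: stop quietly.
        pass
    return written
```

In a real script, the note on SIGPIPE in the Python signal module documentation additionally suggests pointing stdout at os.devnull after catching the error, so that the flush at interpreter exit does not fail again.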
Some sample data files would be helpful ;-) I have a problem with histogram.py:
I can't send it correct data.
Hi folks. Planning Python 3 compatibility for this awesome tool? I'm getting SyntaxErrors, looks like from print statements missing parens.
<3
If uniq -c produces a count with an empty key, such as the first line here:
235054
3629 0
136189 1
18418 10
258 100
cat results/train_4.txt | bar_chart.py -a --sort-keys
Traceback (most recent call last):
  File "/Users/aub3/portenv/bin/bar_chart.py", line 114, in <module>
    run(load_stream(sys.stdin), options)
  File "/Users/aub3/portenv/bin/bar_chart.py", line 52, in run
    data[kv[1]] += value
IndexError: list index out of range
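A sketch of a more tolerant aggregation, assuming each input line looks like "<count> <key>" as in the uniq -c output above. aggregate and missing_key are hypothetical names, not part of bar_chart.py.

```python
from collections import defaultdict

def aggregate(lines, missing_key=""):
    """Sum counts per key from uniq -c style lines ("<count> <key>")."""
    data = defaultdict(int)
    for line in lines:
        kv = line.split(None, 1)  # at most two fields: count, rest of line
        if not kv:
            continue  # skip blank lines
        value = int(kv[0])
        # A line holding only a count has no second field: use a placeholder key.
        key = kv[1].strip() if len(kv) > 1 else missing_key
        data[key] += value
    return dict(data)

print(aggregate(["  235054 \n", "   3629 0\n", " 136189 1\n"]))
# {'': 235054, '0': 3629, '1': 136189}
```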
When I have many outliers, I often get histograms like:
$ time (./a.out 100000|histogram.py -b 10)
# NumSamples = 100000; Min = 237.00; Max = 37599.00
# Mean = 321.560610; Variance = 64719.622326; SD = 254.400516; Median 303.000000
# each ∎ represents a count of 1333
237.0000 - 3973.2000 [ 99993]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
3973.2000 - 7709.4000 [ 0]:
7709.4000 - 11445.6000 [ 1]:
11445.6000 - 15181.8000 [ 0]:
15181.8000 - 18918.0000 [ 0]:
18918.0000 - 22654.2000 [ 0]:
22654.2000 - 26390.4000 [ 3]:
26390.4000 - 30126.6000 [ 1]:
30126.6000 - 33862.8000 [ 0]:
33862.8000 - 37599.0000 [ 2]:
Not helpful. I can see that I have outliers, but what does the distribution inside the first bucket look like? It is the most important one, and I want to understand what's there.
What I want is a logarithmic histogram, like the one dtrace shows: double the bucket width at every bucket.
Can I send a PR?
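Roughly what I have in mind, as a sketch only. doubling_edges, count_buckets and first_width are hypothetical names, and nothing like this exists in histogram.py yet.

```python
def doubling_edges(min_v, max_v, first_width):
    """Bucket edges starting at min_v whose widths double: w, 2w, 4w, ..."""
    if first_width <= 0:
        raise ValueError("first_width must be positive")
    edges = [min_v]
    width = first_width
    while edges[-1] < max_v:
        edges.append(edges[-1] + width)
        width *= 2
    return edges

def count_buckets(values, edges):
    """Count values per bucket; each bucket includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if v <= edges[i + 1]:
                counts[i] += 1
                break
    return counts

edges = doubling_edges(0, 10, 1)
print(edges)                                          # [0, 1, 3, 7, 15]
print(count_buckets([0, 1, 2, 3, 4, 8, 15], edges))   # [2, 2, 1, 2]
```

With the sample above (Min = 237, Max = 37599) and a first width of 50, this gives 10 buckets, and the range that currently lands in the single 237 - 3973 bucket is spread over about seven of them.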