shenwei356 / csvtk
A cross-platform, efficient and practical CSV/TSV toolkit in Golang
Home Page: http://bioinf.shenwei.me/csvtk
License: MIT License
flags:
-n, -- ormalization
csvtk version
I have a collection of worksheets in an xlsx file, and each worksheet should be of the same type. I want to collate them (or concatenate or aggregate, I am not sure what the best word is) so that the order in which I concatenate them is preserved but the duplicated headers disappear. I am currently doing this with datamash rmdup 1 as shown below, but I would rather have a pure csvtk recipe for this:
csvtk xlsx2csv -n LISTOFWORKSHEETS $meta | csvtk cut -f1 | while read ws; do csvtk xlsx2csv -n $ws $meta; done | datamash rmdup 1 -t ',' > final_collated_file.csv
What would be the recipe for this collate operation in csvtk? Thanks in advance.
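For illustration only, here is a minimal Python sketch of the collate step itself, outside csvtk. It assumes every worksheet has already been exported to CSV text with an identical header row; the function name `collate` is hypothetical and not part of any tool.

```python
import csv
import io


def collate(tables):
    """Concatenate CSV texts in order, keeping only the first header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in tables:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty worksheets
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            # Assumption: all worksheets share the same header.
            raise ValueError("header mismatch between worksheets")
        writer.writerows(rows[1:])
    return out.getvalue()
```

Unlike a de-duplication on the first column, this drops a row only because it is the first row of a later worksheet, so repeated data values are preserved.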
Tried to clone the repo, got a failure:
$ git clone --recursive https://github.com/shenwei356/csvtk.git
...
fatal: No url found for submodule path 'doc/site' in .gitmodules
In the GitHub UI the submodule shows up as a non-clickable link. Commands like:
$ go get -u github.com/shenwei356/csvtk/csvtk
also fail, showing the same issue.
csvtk version
Test code
cat seqIDvsChrName.txt | csvtk cut -f -F "chr*" -i
Error: unknown shorthand flag: 'i' in -i
csvtk version
I am trying to use mutate2 to add a column or series of columns but based on values from my bash command-line. E.g. this does not work as it does not pick up the value of the bash variable $val:
val=RUN515 && cat testdata/region.csv | csvtk mutate2 -n run_id -e '1 > 0 ? "${val}" : "${val}"'
The context in which I am doing this is that I have a bunch of worksheets on an xlsx, and I want to aggregate them together, but add some extra columns for each of them (runid) that I can pass as bash variables like in the example below ($runid is the name of each worksheet):
csvtk xlsx2csv -n RUNLIST $meta | grep completed | csvtk cut -T -f 1,7 | xargs -n2 echo | while read runid runcomplete; do csvtk xlsx2csv -n $runid $meta | head -n19 | csvtk transpose -; done
My thinking was that I could pass the bash variable $runid to the csvtk mutate2 command somehow, e.g.:
csvtk xlsx2csv -n RUNLIST $meta | grep completed | csvtk cut -T -f 1,7 | xargs -n2 echo | while read runid runcomplete; do csvtk xlsx2csv -n $runid $meta | head -n19 | csvtk transpose - | csvtk mutate2 -n run_id -e ' 1 > 0 ? $runid : $runid'; done
Any ideas?
Hi, how can I use mutate2 to create a new column as the concatenation of two columns with a separator?
E.g. something like:
echo "A,1" | csvtk -H mutate -e '$1,"/",$2' -n new
To produce:
A,1,A/1
Any ideas?
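To make the requested result concrete, here is a small Python sketch (not a csvtk recipe) that appends a column built from two existing columns and a separator. The function name `add_joined_column` and its parameters are hypothetical; all fields are handled as text, so nothing is converted to a number.

```python
import csv
import io


def add_joined_column(text, i, j, sep="/"):
    """Append a column holding field i + sep + field j (0-based indices)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        # Fields are plain strings, so "1" stays "1".
        writer.writerow(row + [row[i] + sep + row[j]])
    return out.getvalue()
```

Calling `add_joined_column("A,1\n", 0, 1)` gives the line `A,1,A/1`.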
Example data
dept name item
A Tom I_A
A Tom I_B
A Jerry I_A
B Wei I_B
What I want
$ csvtk xxx -t -f dept -v name
dept name
A Tom;Jerry
B Wei
$ csvtk xxx -t -f item -v name
item name
I_A Tom;Jerry
I_B Tom;Wei
$ csvtk xxx -t -f dept,name -v item
dept name item
A Tom I_A;I_B
A Jerry I_A
B Wei I_B
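The wanted behaviour can be sketched in Python as a group-and-collapse over tab-separated input. This is only an illustration of the semantics in the examples above: first-seen order is kept for both groups and values, and repeated values within a group are listed once. The name `collapse` and its parameters are hypothetical.

```python
import csv
import io


def collapse(text, keys, value, sep=";", delimiter="\t"):
    """Group rows by the key columns and join the distinct values of one column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    groups = {}  # dicts keep insertion order, so groups appear in first-seen order
    for row in reader:
        key = tuple(row[c] for c in keys)
        values = groups.setdefault(key, [])
        if row[value] not in values:
            values.append(row[value])
    return [list(key) + [sep.join(values)] for key, values in groups.items()]
```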
csvtk version
I am generating a .tsv file which contains single-quotes and double-quotes in the text, and when I pass it through csvtk pretty, I get this error:
[ERRO] parse error on line 2, column 105: extraneous or missing " in quoted-field
I tried getting rid of it by quoting the double-quotes as shown below, but the problem persists:
my $this_record = "$this_created_at\t$this_created_time\t$this_user\t$this_message\t$this_comments";
$this_record =~ s/\"/\\\"/g;
Any ideas?
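The underlying problem is that a CSV parser treats the double quote as a field-quoting character, so backslash-escaping does not help. As a sketch of the alternative, Python's standard csv module can be told not to interpret quotes at all, which is often what is wanted for TSV data. This illustrates the parsing concept only and makes no claim about csvtk's own options; the function name is hypothetical.

```python
import csv
import io


def read_tsv_literal_quotes(text):
    """Parse tab-separated text, treating ' and " as ordinary characters."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)
```

With `quoting=csv.QUOTE_NONE`, a field such as `said "hi"` or even an unbalanced `"oops` is returned unchanged.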
Hi,
Thanks for creating this very useful library. I have one observation which might help other users.
The binaries, that you helpfully provide for users, should be statically linked instead of dynamically linked to avoid problems with library paths.
I downloaded the Linux 64 bit binary but could not run it because of a library path mismatch for my distro (NixOS). When I created a statically linked binary it ran successfully.
Regards,
Chris
I tested csvtk on a SGI UV server with 1024 cores, and got the following error:
fatal error: procresize: invalid arg
runtime stack:
runtime.throw(0xa2f880, 0x17)
/usr/local/app/go/src/runtime/panic.go:530 +0x90
runtime.procresize(0x400, 0x0)
/usr/local/app/go/src/runtime/proc.go:3158 +0xaee
runtime.schedinit()
/usr/local/app/go/src/runtime/proc.go:452 +0x13d
runtime.rt0_go(0x7fffffffde28, 0x2, 0x7fffffffde28, 0x0, 0x0, 0x2, 0x7fffffffe17f, 0x7fffffffe185, 0x0, 0x7fffffffe18a, ...)
/usr/local/app/go/src/runtime/asm_amd64.s:138 +0x132
This seems to be due to the Go runtime's default setting for the maximum number of processors; see golang/go#13908.
csvtk version
Version: 0.13.0
cat test.hg19_multianno.csv |csvtk filter2 -f ' $6 < 0.05 && $7<0.05'
works
cat test.hg19_multianno.csv |csvtk filter2 -f ' $1000g2015aug_all < 0.05 && $1000g2015aug_eas<0.05'
doesn't work
cat test.hg19_multianno.csv
Chr,Start,End,Ref,Alt,1000g2015aug_all,1000g2015aug_eas,ExAC_ALL,ExAC_AFR,ExAC_AMR,ExAC_EAS,ExAC_FIN,ExAC_NFE,ExAC_OTH,ExAC_SAS
1,907540,907540,T,C,0.00499201,0.0238,0.0020,0,8.672e-05,0.0252,0,1.533e-05,0.0011,0.0012
1,1153073,1153073,G,A,0.000399361,0.002,0.0002,0,8.66e-05,0.0021,0,0,0,6.061e-05
1,1273569,1273569,G,C,0.000399361,0.002,0.0002,0,0,0.0025,0,0,0,0
1,1372668,1372668,C,T,0.00698882,0.0347,0.0033,0,0,0.0441,0,0,0.0014,0.0003
1,1455586,1455586,C,T,0.00119808,0.006,0.0009,0,0,0.0087,0,0.0004,0.0013,0.0002
1,1458237,1458237,G,A,0.00758786,0.0377,0.0033,0,0,0.0430,0,0.0002,0.0039,0.0004
1,1572708,1572708,C,A,0.00319489,0.0089,,,,,,,,
1,2440518,2440518,C,T,0.00279553,0.0099,0.0039,0.0005,0.0260,0.0192,0,0.0005,0.0023,0.0008
1,6257863,6257863,G,A,0.00259585,0.0119,0.0013,0.0001,0.0001,0.0130,0.0030,0.0001,0.0037,0.0004
Hi,
I have a use case where I am using csvtk on a table from an .xlsx file where there could be rows at the bottom of the worksheet that are completely empty, e.g. where all the fields have empty cells. This is common if people manually edit the worksheet and just leave the empty rows in there rather than explicitly deleting them.
Could there be a flag in csvtk so that the empty rows are ignored in the csvtk operation?
There may be a workaround using csvtk filter or some other operation, but I couldn't find one.
E.g. in this example, all the values for the last two rows are empty
Col1,Col2,Col3
A,1,2
B,2,3
C,3,4
,,
,,
Currently when applying any operation, e.g. csvtk freq, it takes into account the empty rows:
csvtk freq -f 1 example_empty_rows.csv
Col1,frequency
A,1
B,1
C,1
,2
It would be great to have an --ignore-empty (e.g. -e) flag that would filter out the rows where all fields are empty, then return the equivalent of the operation, e.g.:
csvtk freq --ignore-empty -f 1 example_empty_rows.csv
Col1,frequency
A,1
B,1
C,1
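As a sketch of the requested semantics (not of any existing flag), the following Python drops rows whose fields are all empty before counting values in a column. The function names are hypothetical, and the first row is assumed to be the header.

```python
import csv
import io
from collections import Counter


def non_empty_rows(rows):
    """Keep only rows that have at least one non-blank field."""
    return [row for row in rows if any(field.strip() for field in row)]


def freq(text, column=0):
    """Count values in one column, ignoring completely empty rows."""
    rows = list(csv.reader(io.StringIO(text)))
    body = non_empty_rows(rows[1:])  # rows[0] is assumed to be the header
    return Counter(row[column] for row in body)
```

For the example file above this counts A, B and C once each, and the two `,,` rows no longer contribute an empty key.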
Hi,
You explicitly noted that the space character will lead to errors. Do you plan to remove that restriction at some future update? I would love to use this tool, but pretty much all of my datasets contain spaces.
Best,
Carsten
csvtk version
Would be useful!
% cat data.csv
ID,room,name
3,G13,Simon
5,103,Anna
% csvtk to_json data.csv
{
ID: 3,
room: G13,
name: Simon
},
{
ID: 5,
room: 103,
name: Anna
}
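A minimal Python sketch of such a conversion is given below for comparison. It is an assumption about the desired shape, not a description of any csvtk command: it emits a valid JSON array of objects, and every value stays a string because CSV carries no type information.

```python
import csv
import io
import json


def csv_to_json(text):
    """Convert CSV text with a header row into a JSON array of objects."""
    records = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(records, indent=2)
```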
Hi shenwei,
Thanks for this nice tool. I really like the plot functions.
It would be more useful if the output could be piped to display from ImageMagick to show the figure on the terminal rather than saving it. See some examples at http://crazyhottommy.blogspot.com/2014/11/csvkit-to-manipulate-csv-at-command.html. Most of the time, I just want to glance at the distribution of the data. If I want serious figures, I will import to R and plot with ggplot2.
Best,
Tommy
The stdlib encoding/csv supports this, but the https://github.com/mzimmerman/multicorecsv doesn't.
If it's supported, involved commands include:
cut
filter
grep
inter
mutate
stat2
uniq
rename
rename2
replace
Well, I think the answer is no.
Solution of handling data with duplicated colnames:
csvtk headers and using them instead of colnames.
csvtk rename to make them unique.
I have an example .csv with the values:
patient_id,frequency
102900008,2
102100016,2
And I want to create two more columns like so:
patient_id,frequency,extraction_from_sample_name,extraction_to_sample_name
102900008,2,102900008pool,102900008xt
102100016,2,102100016pool,102100016xt
In trying to append the "pool" to the patient_id, it evaluates it as a number, and changes it to scientific notation, see below. How can I avoid this and simply get the string as text?
cat example.csv | csvtk mutate2 -n extraction_from_sample_name -e ' $patient_id + "pool"'
patient_id,frequency,extraction_from_sample_name
102900008,2,1.02900008e+08pool
102100016,2,1.02100016e+08pool
csvtk version
I have an .xlsx file with long worksheet names, that get truncated when applying the -a as shown below:
csvtk xlsx2csv file.xlsx -a
The file.xlsx contains a worksheet like the one on the left, but csvtk returns a truncated name with the last character missing:
Import_XTP101_20181018102046_551 > Import_XTP101_20181018102046_55
csvtk inter: decrease the RAM occupation of keysMaps by adding an option that, right after each file is handled, deletes the keys for which keysMaps[key] does not contain all the files handled so far.
csvtk uniq: use bloomfilter (BoomFilters) instead of map
for every record:
if record not in bloomfilter:
output record
add record to bloomfilter
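The pseudocode above can be sketched in Python with a hand-rolled Bloom filter. This is an illustration only, not the BoomFilters library and not csvtk's implementation; the class, its default sizes and the double-hashing scheme over SHA-256 are assumptions made for the example. A Bloom filter can report false positives, so with an undersized filter a genuinely new record may occasionally be dropped. That is the price paid for the lower memory use.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)


def uniq(records, bloom=None):
    """Yield each record the first time it is seen (subject to false positives)."""
    if bloom is None:
        bloom = BloomFilter()
    for record in records:
        if record not in bloom:
            yield record
            bloom.add(record)
```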
csvtk grep: decrease RAM occupation for finding unique elements in a relatively small dataset against a very big (millions of records) dataset, i.e. A - B (csvtk grep -v -P very-big small). We may add a new command, subtract.
Workaround: traversing the very-big file and marking the elements that exist in small, then excluding them from small:
csvtk grep -P small very-big > tmp
csvtk grep -v -P tmp small > result
Or: building a bloomfilter from very-big and checking every element in small, only outputting those not in the bloomfilter.
I installed csvtk_darwin_amd64.tar.gz on my OSX 10.11.6 and get the error below:
csvtk stat query_result.csv
[ERRO] read query_result.csv: bad file descriptor
Is there anything wrong?
csvtk join: keep source records with same ID. v0.4.5 and later versions support it.
header
head
sample
csvtk filter2: support comparison between fields (like awk '$1 < $4') using govaluate.
csvtk grep: keep output in order of input
$ cat t
a,b,c
1,2,3
$ cat t | csvtk cut -f -a
[ERRO] column "-a" not existed in file: -
$ cat t | csvtk cut -f -1
b,c
2,3
csvtk join:
For instance:
-k 1,2 # to keep all rows of the union of first and second file
-k 1 # to keep all rows of first file
-k 2-7 # to keep all rows of the union of file 2 to 7
-k 1- # to keep all rows of the union of all files
-n is not needed when flag -H is given in csvtk mutate
I hope the grep command can support partial matching of a given field and highlight the matched part, just like grep on Linux.
Could you submit this CLI tool to Homebrew so it can be conveniently installed and upgraded on macOS with brew?
with help of https://github.com/Knetic/govaluate
csvtk version
A datetime expression in single quotes (e.g. '2018-10-31 01:23:45') is converted to a float when using csvtk mutate2 to append a new column to the original file.
Is this an expected behavior or not?
$ csvtk version
csvtk v0.15.0
Checking new version...
You are using the latest version of csvtk
$ cat source.csv
aaa,bbb
1,x
2,y
3,z
$ cat source.csv | csvtk mutate2 -n ts -e "'2018-10-31 01:23:45'"
aaa,bbb,ts
1,x,1.540916625e+09
2,y,1.540916625e+09
3,z,1.540916625e+09
$ cat source.csv | csvtk mutate2 -s -n ts -e "'2018-10-31 01:23:45'"
aaa,bbb,ts
1,x,1.540916625e+09
2,y,1.540916625e+09
3,z,1.540916625e+09
From lines 76-78:
if len(patterns) == 0 && patternFile == "" {
checkError(fmt.Errorf("one of flags -p (--pattern) or -f (--pattern-file) should be given"))
}
As I understand it, -f here should be the "--fields string" flag, and it is -P that gives the "--pattern-file".
Such as:
gyangdeiMac:untitled folder gyang$ csvtk csv2md -H -t chr2L_RNA.matrix | head -10
[WARN] colnames detected, flag -H (--no-header-row) ignored
yw_wt|1 |chr2L |62 |0 |1 |0 |0 |0 |1
:----|:----|:-----|:--|:--|:--|:--|:--|:--|:--
yw_wt|chr2L|63 |0 |1
yw_wt|chr2L|64 |0 |1
yw_wt|chr2L|65 |0 |1
yw_wt|chr2L|66 |0 |1
yw_wt|chr2L|67 |0 |1
yw_wt|chr2L|68 |0 |1
yw_wt|chr2L|69 |0 |1
yw_wt|chr2L|70 |0 |1
Sorting according to value of some field in custom defined order.
For example, a custom order file:
$ cat order.txt
C
A
T
Data:
A,hello
C,wei
B,dear
T,shen
Wanted output:
$ csvtk sort -H -k 1:c -L 1:order.txt
C,wei
A,hello
T,shen
B,dear
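The wanted ordering can be sketched in Python as a stable sort on the rank of each key in the order file, with keys that are missing from the file placed last in their input order. This only illustrates the semantics of the example above; the function name and parameters are hypothetical.

```python
def sort_by_custom_order(rows, order, key_index=0):
    """Sort rows by the position of their key in `order`; unlisted keys go last."""
    rank = {value: i for i, value in enumerate(order)}
    # sorted() is stable, so rows with unlisted keys keep their input order.
    return sorted(rows, key=lambda row: rank.get(row[key_index], len(rank)))
```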
Hi, I have an example operation that I think I will be able to do with csvtk but I am unsure how.
From a table that looks like this:
col1,col2,col3
100,2,ABC
101,2,DEF
I would like to have an operation on col2, so that foreach entry, two new entries are created where the value of col2 is divided by 2, and the value in col1 is suffixed with 'A' and 'B', e.g.:
col1,col2,col3
100A,1,ABC
100B,1,ABC
101A,1,DEF
101B,1,DEF
Any ideas how to construct that operation with csvtk?
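To pin down the operation being asked for, here is a Python sketch that emits one output row per suffix and divides col2 by the number of suffixes. It assumes col2 holds integers and uses integer division; the function name `expand` is hypothetical and this is not a csvtk recipe.

```python
def expand(rows, suffixes=("A", "B")):
    """Split each row into len(suffixes) rows, sharing col2 equally among them."""
    out = []
    for col1, col2, col3 in rows:
        share = int(col2) // len(suffixes)  # assumption: integer values, integer division
        for suffix in suffixes:
            out.append([col1 + suffix, str(share), col3])
    return out
```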
I need a fast filter to parse csv file lines and drop those that are unparsable, as with https://csvkit.readthedocs.io/en/1.0.2/scripts/csvclean.html
csvclean works in a shell pipe (PR 781) but is limited in speed.
for rownumber; especially for grep function.
In order to get Excel to always accept CSV files some of our legacy lab equipment adds a line such as
sep=,
as a first line. Trying to manipulate them, e.g. with csvtk transpose results in
[ERRO] read .\Testing.csv: The handle is invalid.
Would you consider catching this line, ignoring it and adding it to the resulting file if present? Could very well be a command-line option as well.
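As a sketch of the proposed handling, the following Python detects a leading `sep=` line, uses its character as the delimiter, and returns the hint separately so that a caller could write it back to the output. It is an illustration under the assumption that the hint is exactly `sep=` followed by one character; the function name is hypothetical.

```python
import csv


def read_csv_with_sep_hint(text):
    """Return (hint_line_or_None, rows), honouring a leading 'sep=X' line."""
    lines = text.splitlines(keepends=True)
    delimiter = ","
    hint = None
    first = lines[0].rstrip("\r\n") if lines else ""
    if len(first) == 5 and first.lower().startswith("sep="):
        hint = first
        delimiter = first[4]
        lines = lines[1:]  # the hint is not part of the data
    rows = list(csv.reader(lines, delimiter=delimiter))
    return hint, rows
```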
csvtk version
Hi, first thanks for this awesome tool! Works quite well with my data but I couldn't find a way to do the following with featureCounts output:
I have multiple files with this header (always the same header!):
Geneid Chr Start End Strand Length gene_name blablabla.bam
If I join these together by Geneid, the table looks nice but "duplicated rows" exist (Chr, Start,End,Strand,Length,gene_name).
I'd like to remove all columns matching Chr,Start,End,Strand,Length (I assume with - as a prefix) but this only removes the first occurrence of such a column. Can one specify to remove all occurrences of such a pattern, e.g. with an extra option -g?
Thanks, feel free to ask more questions if I'm being unclear!
csvtk version
I couldn't find an easy way to do this in the docs. I presume it would involve stats/mutate2, but I couldn't find a recipe. What I want is the following: I would like to add a percentage column next to column 3, by grouping over column 1,2:
CGT 0 5094192
CGT 1 1031270
CGT 2 141416
CGT 3 25743
CGG 4 16268
CGG 5 6180
CGG 6 2855
CGG 7 1231
CGG 8 477
CGG 9 157
CGG 10 77
CGG 11 16
CGG 12 3
CGG 13 2
CGG 0 5123225
CGG 1 911882
CGG 2 187348
CGG 3 47305
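One possible reading of the request is "each count as a percentage of its group total", sketched below in Python. The grouping column (the first one), the two-decimal formatting and the function name are assumptions made for illustration; this is not a csvtk recipe.

```python
from collections import defaultdict


def add_percentage(rows, group_index=0, value_index=2):
    """Append each row's share (in percent) of its group's total."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += int(row[value_index])
    return [
        row + [f"{100 * int(row[value_index]) / totals[row[group_index]]:.2f}"]
        for row in rows
    ]
```

Two passes are needed because a group's total is only known after all of its rows have been read.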
csvtk version
I would like to merge two or more files with all data from both files kept. However, it seems that csvtk join only allows unmatched data from the first file to be kept with the -k parameter. More specifically, I would like to merge the following example data:
username,phone
gri,11111
rob,12345
ken,22222
shenwei,999999
name,region
ken,nowhere
gri,somewhere
shenwei,another
Thompson,there
and get the following result:
username,phone,region
gri,11111,somewhere
rob,12345,NA
ken,22222,nowhere
shenwei,999999,another
Thompson,NA,there
How can I do it with csvtk? Thank you!
Xuhang
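The requested result is a full outer join on the first column. A minimal Python sketch of that operation follows, assuming the key is the first column of each table, keys are unique within a table, and missing values are filled with NA; the function name `outer_join` is hypothetical.

```python
def outer_join(left, right, na="NA"):
    """Full outer join of two tables (lists of rows, header first) on column 0."""
    left_header, *left_rows = left
    right_header, *right_rows = right
    right_by_key = {row[0]: row[1:] for row in right_rows}
    seen = set()
    out = [left_header + right_header[1:]]
    for row in left_rows:
        seen.add(row[0])
        filler = [na] * (len(right_header) - 1)
        out.append(row + right_by_key.get(row[0], filler))
    for row in right_rows:
        if row[0] not in seen:  # rows that exist only in the right table
            out.append([row[0]] + [na] * (len(left_header) - 1) + row[1:])
    return out
```

Rows keep the order of the first table, followed by the rows found only in the second, which matches the expected output above.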
Hi! Thanks for the useful tool, started using it today.
Just wanted to let you know I am maintaining a package for this utility for anyone on Arch Linux here: https://aur.archlinux.org/packages/csvtk/
Anyone using an Arch OS can install with their package manager now.
Feel free to close, just wanted to inform you!