Your text editor (Sublime) can only handle files up to a certain size. Command line is THE place to go when you have one huge file (>~450,000 lines) or more directories full of huge files.
- ex1. Jeb Bush emails
- ex2. Flint water emails in the
/pdf
directory - ex3. One batch of Hilary Clinton's emails
make prep
Once the script is finished you will have installed the command-line tool that we will be using:
pdftotext
ack
You already know a lot of command-line utilities!
cd
ls
touch
mkdir
rm
mv
git
The following utilities come with your terminal.
command input > output
The >
takes the input from the left and creates an output named after the right.
command input | another-command | yet-another-command
The |
processes the results from the first command with the second command, so on and so forth.
We will use these in a second.
Practice the following with files in ex0/
.
sort alphabet.txt
sort -r alphabet.txt
sort -r alphabet.txt > reverse-alphabet.txt
sort -n numbers.txt
sort -t "," -k 3 lead-request-zipcode.csv
How many types of lead kit does the city record?
sort lead-request-type.csv | uniq
How many incidents under each type?
sort lead-request-type.csv | uniq -c > lead-type-count.csv
ack
is a command-line file pattern searcher that is fast and optimized in a lot of ways. See why ack.
Remember how you tried to get all the phone numbers from Jeb's email? You did something like this in Sublime:
- Search for all the phone numbers in a document with a Regex.
- Copy all the phone numbers.
- Create a new document.
- Paste the phone numbers to that new document.
- Save the new document with a name.
Now these five steps can be done with ack
and redirection in one line. The basic ack
usage looks like this:
ack "pattern" somefile
This gives you a preview of the search in the command line. However, to store the search results in a new file, you will use redirection.
ack "pattern" somefile > resultfile
ack "your pattern" jeb-bush-telephone-number-challenge.txt > number-only.txt
Use the -o
flag to specify that we don't want to print the entire line.
ack -o "your pattern" ex1/jeb-bush-telephone-number-challenge.txt > ex1/number-only.txt
Note: if you redirect an input to an output and the output name already exists, you will overwrite the previous file. Make sure you name things with care and/or make copies of files important to you.
ack -o "your pattern" ex1/jeb-bush-telephone-number-challenge.txt | uniq -c > ex1/numbers-only-uniq.txt
We installed something called pdftotext
to, well, turn pdf files into text files.
Try it with the pdf in ex2/
.
pdftotext input > output
ack -o '^From:+(.+)Date' ex2/snyder-flint-water-emails-demo.txt > ex2/flint-email-recipients.txt
We are going to investigate a batch of Hilary Clinton's email.
make ex3
As you see, we are dealing with files that come in 293 individual files instead of one single file.
ack -h -m 1 -A 3 '^From: +(.+)' ex3/
Notice that for this exercise we are searching a whole directory rather than a single file. Here, ack
is no longer an alternative, but the only tool suitable for the job (since there are many files with a large total size.)
-A NUM
flag specifies number of lines to be printed after the matching lines.-m NUM
flag stops searching in each file after NUM matches- What do you think the
-h
flag does? Try it without-h
.
Use -l
flag to only print the filenames of matching files, instead of the matching text.
ack -l '\b(SECRET|CONFIDENTIAL)\b' ex3/
Building on the previous exercise:
ack -h -m 1 -A 3 '^From: +(.+)' ex3/ > ex3/recipients.txt
Here's the challenge:
Now find all the emails in the new file, sort them, find out the count of each email address, and sort by frequency of these recipients.
Use the pipe |
in your command!
Submit your command here: http://goo.gl/forms/hKgssuslfE
Regex is widely used for form validation. You don't really have to write it from scratch. Two ways to validate:
- More command line materials available in this workshop repo
- Excercises are modified based on Dan Nguyen's NICAR 2016 Regex slides. There are a lot of hands-on examples from his talk. Practice them!