Giter Site home page Giter Site logo

filibuster's Introduction

filibuster: counting how many times certain words occur in docx and pdf files

Counts how many times each of a list of words occurs in all of the docx and pdf files in a single directory. Outputs: Count of how many each word occurs in each file. Count of how many files each word occurs in. Total number of times each word is found.

why?

Requested by an Assistive Technologist for analysing about 60 .docx and .pdf files for keywords. His first name is Fil. Hence the script's name 'filibuster.py'.

quickstart

Install Python3. Click all the options to install pip when you do this and to add python to the path.

Install the dependencies using pip3 install -r requirements.txt.

If this doesn't work, type:

pip install click PyPDF2 click docx2txt pandas

Put all your .docx and .pdf files into the same directory as filibuster.py.

Make a list of search terms and put them in a file called search_terms.txt.

Run using: python3 filibuster.py

how to use, detailed

Install Python3. Click the options to add Python to your path and to install pip.

Unpack the zip file from this repository to a convenient directory. Open a command line terminal in this directory.

There are a few libraries that the script needs that are not included with the default Python installation. Install these libraries using the command:

pip3 install -r requirements.txt

In the directory containing the file requirements.txt.

If you don't have pip3 installed, you may have not ticked that option when installing Python. For more details and pictures of what to do, have a look at this website and re-install Python with the options to install pip selected and adding Python to the system path selected.

Sometimes using pip does not work as I may have built the script using an older or newer version of Python than you have installed. If you get errors type:

pip install click PyPDF2 click docx2txt pandas

To install the dependencies this way.

Put all of the .docx and .pdf files that you want to search through into a single directory. Either the same directory as the filibuster.py script or another one of your choice.

Put your list of search terms into a single file called search_terms.txt in the same directory as the .pdf and .docx files.

If the .docx and .pdf files are in the same directory as the script filibuster.py, run the script using:

python3 filibuster.py

If the .docx and .pdf files and search_terms.txt files are in a different directory, use the command:

python3 filibuster.py --dir=path to files

notes

The script looks for complete words. It will not count words that are substrings of longer words. e.g. if you are searching for 'the', then 'theatre' will not be counted as 'the' is a substring of 'theatre'. Punctuation marks are removed before searching, so 'the,' or '!the' will be counted.

The output could do with some cosmetic tidying, but I want to get this out to the end user.

test files

There is a directory called 'test' with some files used for testing the script with.

dependencies

These are all in the requirements.txt file.

.docx files are processed using the docx2txt library. .pdf files are process using the PyPDF2 library. pandas is used to collate the results click is used for the command line interface

contacts

[email protected]

filibuster's People

Contributors

mattoppenheim avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.