word-data

Source code for retrieving words and word sequences (binomials, trinomials, etc.) from Reddit comments and updating the database with them.

This project relies on PRAW. A praw.ini file is required; the user is "ibs". It also requires the Python module tendo, which ensures that only one instance is running at a time. Words are retrieved and processed in Python, which passes the data to PHP; PHP then updates the SQL tables.

config.php should define $conn for the MySQL connection, plus $days and $count, which control how long and which "old" words are kept (see cleandatabase.php below).

Please note: You will need a significant amount of space in your MySQL database. The amount of space will vary based on how often and with what arguments the script is called.

getwords.py

python getwords.py <number of comments to fetch>

Uses PRAW to grab each comment. The comment text is uppercased and every character that is not alphanumeric or a space is removed. The result is then checked for quality (not spam, in English, contains at least 2 spaces) and split on spaces into an array of words. Every comment is broken down into its single words and its word sequences of up to five words (5-nomials), and the result is passed to wordscript.php.
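The normalization and breakdown steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in getwords.py: the function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and the spam and language checks are left out because the README does not say how they work.

```python
import re

MAX_N = 5  # the README says comments are broken down through 5-nomials


def normalize(comment):
    """Uppercase the text and drop everything except A-Z, 0-9 and spaces.

    Assumption: "alphanumeric" means ASCII letters and digits.
    """
    return re.sub(r"[^A-Z0-9 ]", "", comment.upper())


def has_enough_spaces(text):
    """The one quality check the README fully specifies: at least 2 spaces."""
    return text.count(" ") >= 2


def ngrams(words, max_n=MAX_N):
    """Yield every run of 1..max_n consecutive words as a tuple."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])


def extract(comment):
    """Hypothetical helper: normalize a comment and list its n-grams."""
    text = normalize(comment)
    if not has_enough_spaces(text):
        return []
    # split() without arguments also collapses repeated spaces,
    # which avoids empty "words" in the array.
    return list(ngrams(text.split()))
```

For example, extract("Hello, world! It's me.") normalizes the text to HELLO WORLD ITS ME and returns 10 tuples: four single words, three binomials, two trinomials and one 4-nomial.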

wordscript.php

Takes the input string and processes it into a 2D array (one inner array for each table). Each inner array is then run through an insert statement that inserts or updates the corresponding rows.
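The insert-or-update pattern can be illustrated with a short Python sketch. Everything in it is an assumption made for illustration: the real script is PHP talking to MySQL, while this uses Python's built-in sqlite3 module as a stand-in, and the table names (words1, words2, ...), the column names and the helper functions are hypothetical. It requires SQLite 3.24 or later for the ON CONFLICT clause.

```python
import sqlite3
from collections import defaultdict


def group_by_length(ngrams):
    """Build one list per table: key 1 holds single words, 2 binomials, etc."""
    tables = defaultdict(list)
    for gram in ngrams:
        tables[len(gram)].append(" ".join(gram))
    return tables


def upsert(conn, tables):
    """Insert each phrase, or bump its counter if the row already exists."""
    for n, phrases in tables.items():
        table = f"words{n}"  # hypothetical table name, one table per n
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} "
            "(phrase TEXT PRIMARY KEY, seen INTEGER NOT NULL)"
        )
        conn.executemany(
            f"INSERT INTO {table} (phrase, seen) VALUES (?, 1) "
            "ON CONFLICT(phrase) DO UPDATE SET seen = seen + 1",
            [(phrase,) for phrase in phrases],
        )
    conn.commit()


# Usage with an in-memory database
conn = sqlite3.connect(":memory:")
upsert(conn, group_by_length([("A",), ("A",), ("A", "B")]))
```

In MySQL, the same effect is usually achieved with INSERT ... ON DUPLICATE KEY UPDATE; whether wordscript.php uses that exact statement is not stated here.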

cleandatabase.php

Removes words based on $days and $count. The goal is to remove entries that are likely not actual words, for example when somebody makes up words or when text is taken out of context. The higher $days is, the more days a word (or binomial or trinomial) is kept. The lower $count is, the more likely a word will be removed. This script is very important but must be used with caution, as it could remove more than intended. I am still adjusting this file, and will be for the foreseeable future, to remove as much junk as possible while preserving useful data.

process.php

This part of the project is not yet complete. Using the input data, this script will choose which words (or phrases) are significant. The idea is that you can enter a string and get a general idea of what its topic is.

gen.php

This is a work in progress; see the commits for more information. The end goal is to use the data to generate sentences.

Contributors

b0bcarlson
