RaPaste

RaPaste is a fully featured web-pastebin, written in Ruby using the Ramaze web-framework.

Features

  • Syntax highlighting using CodeRay or Ultraviolet
  • Forking pastes, creating a new one based on an existing paste
  • Easy configuration
  • Use any database that Sequel supports.
  • Show the paste with Content-Type of text/html or text/plain
  • Private pastes with ids based on hashing the contents of the paste
  • Pastes may have an optional limit in size
  • Spam protection without javascript or captchas
  • Powerful bayesian filtering to support your quest against spam
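
The README does not say which hash function produces private paste ids. As a rough sketch of the idea only, assuming SHA-256 and a hypothetical helper name `private_paste_id` (neither is taken from RaPaste's code), a content-derived id could look like this:

```ruby
require 'digest/sha2'

# Hypothetical helper, not part of RaPaste: derive an id from the paste body.
# SHA-256 and the default length of 12 hex characters are assumptions.
def private_paste_id(content, length = 12)
  Digest::SHA256.hexdigest(content)[0, length]
end

id = private_paste_id("puts 'hello'\n")
```

The same content always maps to the same id, so the id cannot be guessed by counting upwards, but anyone who knows the exact content can recompute it.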

Dependencies

  • ramaze
  • sequel
  • uv or coderay

Installation

gem install ramaze sequel coderay # or uv
git clone git://github.com/manveru/rapaste.git
cd rapaste
$EDITOR env.rb
ruby start.rb

A gem will be provided when someone donates a rapaste.gemspec

Configuration

Configure RaPaste by editing the $rapaste hash and the value of the DB constant in env.rb.

Settings are:

  • :engine - May be either :uv or :coderay
  • :fragment - How many lines are visible in the list and search preview
  • :pager - How many pastes are listed per page in list and search
  • :priority - Array of Strings naming the syntaxes that should appear at the top of the dropdown
  • :theme - Theme to use for Ultraviolet
  • :title - Title shown on every page
  • :admins - A simple Hash mapping username to password for each person who wants to help you fight spam. This might be replaced at a later point.
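
To make the list above concrete, here is a sketch of what the hash in env.rb might look like. Only the keys come from the list; every value (the syntax names, the theme name, the credentials) is a placeholder assumption and should be checked against your installed highlighter.

```ruby
# Placeholder values for illustration; only the keys are documented above.
$rapaste = {
  :engine   => :coderay,                   # or :uv
  :fragment => 10,                         # preview lines in list and search
  :pager    => 20,                         # pastes per page
  :priority => %w[ruby python plain_text], # hypothetical syntax names
  :theme    => 'iplastic',                 # assumed theme name, only used by :uv
  :title    => 'RaPaste',
  :admins   => { 'admin' => 'change-me' }  # username => password
}
```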

The DB setting depends on your environment. The default is a file-based SQLite database; some other possibilities are:

DB = Sequel.sqlite('my_blog.db')
DB = Sequel.connect('postgres://user:password@localhost/my_db')
DB = Sequel.mysql('my_db', :user => 'user', :password => 'password', :host => 'localhost')
DB = Sequel.ado('mydb')

Usage

You can start pasting immediately after a successful start. Please tell us if you don't find the user interface intuitive enough or feel that something is missing.

Most likely your RaPaste will start to attract spammers, but don't worry, we have you covered. To keep them from messing up your listing and search and from filling your database, we have added adaptable Bayesian filtering. The administration interface is located at /spam, where you will be presented with a list of unreviewed pastes and suggestions on how to handle them.

The other form of protection is rather simple: a paste is only considered for visibility once it has been accessed from another IP. Once someone pastes and passes on the link, it will most likely be opened from another IP and so become visible to everybody. We thought this would be a reasonable first step to avoid massive flooding by spammers, but manual filtering is still necessary sometimes.
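
The "seen from another IP" rule can be sketched in a few lines. This is an illustration under assumed names (`Paste`, `creator_ip`, `viewed_from`), not RaPaste's actual model or schema; only the `archived` flag is borrowed from the wording of this README.

```ruby
# Hypothetical model of the rule described above.
Paste = Struct.new(:text, :creator_ip, :archived) do
  # Record a view. The paste becomes a candidate for listing as soon as
  # an IP other than the creator's opens it, and stays that way afterwards.
  def viewed_from(ip)
    self.archived = true if ip != creator_ip
    archived
  end
end

paste = Paste.new("puts 1", "10.0.0.1", false)
paste.viewed_from("10.0.0.1") # the author reloads: still hidden
paste.viewed_from("10.0.0.2") # someone else opens the link: now archived
```

The sketch also shows the weakness noted in the Todo list: a single request from a second IP is enough to flip the flag.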

Every time a new paste is created and viewed from another IP, a Bayes rating is generated from the contents of the paste. If the paste is classified as spam, it won't show up in listings or search results, despite being marked as archived, until you confirm that it is indeed ham and add it to the filter.
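
Put together, the listing decision described above might be sketched as follows. The record and its field names (`ListedPaste`, `category`, `approved`) are assumptions made for the example and do not reflect the real database columns.

```ruby
# Hypothetical record; field names are illustrative only.
ListedPaste = Struct.new(:archived, :category, :approved)

# A paste is listed only if it is archived (seen from another IP) and is
# either not classified as spam or has been confirmed as ham by an admin.
def listable?(paste)
  return false unless paste.archived
  return true if paste.approved
  paste.category != :spam
end
```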

Personally I think the basic implementation is sane, but currently the ids of pastes are still too guessable.

About the Bayesian filter

I wrote the filter after reading articles by Paul Graham and trying the related Ruby library by Lucas Carlson called classifier. Classifier proved to be a bothersome experience: it caused me some problems and issued warnings on startup. So I took the core algorithm and tuned it a bit, and for now the filter resides in vendor/bayes.rb. It's pure Ruby, reasonably fast and accurate. One design decision was to limit it to words longer than 4 characters (apart from a few exceptions), because shorter words tend to skew the results and are often not meaningful enough. Unknown words have minimal impact on the result.
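
For readers unfamiliar with the technique, here is a minimal naive Bayes sketch that follows the two design decisions mentioned above: short words are dropped and unknown words are ignored. It is not the code in vendor/bayes.rb; the class name `TinyBayes`, the add-one smoothing, and the lack of exceptions for short words are all assumptions of this example.

```ruby
# Illustrative naive Bayes classifier; NOT the implementation in vendor/bayes.rb.
class TinyBayes
  MIN_LENGTH = 5 # keep only words longer than 4 characters

  def initialize
    # category => { word => count }
    @categories = Hash.new { |h, k| h[k] = Hash.new(0) }
  end

  def tokenize(text)
    text.downcase.scan(/[a-z]+/).select { |w| w.length >= MIN_LENGTH }
  end

  def train(category, text)
    tokenize(text).each { |w| @categories[category][w] += 1 }
  end

  # Returns the most likely category, or nil when there is no evidence.
  def classify(text)
    vocab = @categories.values.flat_map(&:keys).uniq
    # Words never seen in training are skipped, so they cannot sway the result.
    words = tokenize(text).select { |w| vocab.include?(w) }
    return nil if words.empty?

    scores = @categories.map do |category, counts|
      total = counts.values.inject(0) { |a, b| a + b }
      # Sum of log-probabilities with add-one (Laplace) smoothing.
      score = words.inject(0.0) do |sum, w|
        sum + Math.log((counts[w] + 1.0) / (total + vocab.size))
      end
      [category, score]
    end
    scores.max_by { |_, score| score }.first
  end
end

bayes = TinyBayes.new
bayes.train :spam, "cheap viagra pills casino casino winner"
bayes.train :ham,  "class method return string array require"
bayes.classify("casino winner today") # => :spam
```

Working in log space avoids floating-point underflow on long pastes, and the smoothing keeps a word that appears in only one category from driving the other category's probability to zero.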

Further reading on bayesian filtering:

Finetuning Bayes

After your first startup you will have a new file at db/bayes.marshal, which contains the marshalled contents of the @categories hash from the Bayes instance. It is seeded with some words from db/spam.txt and db/ham.txt initially, and will grow when you use the /spam interface. In case you want to correct something or change the scoring you can load it in irb:

bayes = Marshal.load(File.binread('db/bayes.marshal')) # marshal data is binary

To write it back you simply do:

File.open('db/bayes.marshal', 'wb'){|b| b.write(Marshal.dump(bayes)) }

So let's say you have collected some text files with spam and ham and would like to train the filter with them, but without pasting:

require './vendor/bayes' # run this from the rapaste directory

bayes = Bayes.new('bayes.marshal')

spam = File.read('stuff/spam.txt')
ham = File.read('stuff/ham.txt')

bayes.train :spam, spam
bayes.train :ham, ham

bayes.store

The final bayes.store writes the changes to bayes.marshal, so the next time you call Bayes.new('bayes.marshal') it will automatically load your filter.
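
If your samples are spread over many files, the same calls can be wrapped in a small helper. The helper name `train_from_dir` and the one-sample-per-.txt-file directory layout are assumptions of this sketch; it relies only on the `train(category, text)` call shown above.

```ruby
# Hypothetical helper: feed every .txt file in a directory to the filter.
# `bayes` can be any object that responds to train(category, text).
def train_from_dir(bayes, category, dir)
  files = Dir.glob(File.join(dir, '*.txt')).sort
  files.each { |path| bayes.train(category, File.read(path)) }
  files.size # number of files used for training
end

# Possible usage with the real filter (directory names are made up):
#   bayes = Bayes.new('bayes.marshal')
#   train_from_dir(bayes, :spam, 'stuff/spam')
#   train_from_dir(bayes, :ham,  'stuff/ham')
#   bayes.store
```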

Todo

  • Documentation
  • More highlighting engines
  • Caching
  • Clean up env.rb and start.rb (maybe non-global configuration)
  • More options and docs about how to change display of pastes
  • Generate static CSS from view/css/screen.sass
  • Reduce DB queries
  • Use migrations?
  • The behaviour of forking private pastes isn't specified yet
  • Make the ids of pastes less guessable; the current system can be spammed with a simple curl from another IP
  • Modification of the Bayes filter itself; at the moment the easiest way is via irb
