Giter Site home page Giter Site logo

markovify's Introduction

Version Build status Code coverage Support Python versions

Markovify

Markovify is a simple, extensible Markov chain generator. Right now, its main use is for building Markov models of large corpora of text, and generating random sentences from that. But, in theory, it could be used for other applications.

Why Markovify?

Some reasons:

  • Simplicity. "Batteries included," but it's easy to override key methods.

  • Models can be stored as JSON, allowing you to cache your results and save them for later.

  • Text parsing and sentence generation methods are highly extensible, allowing you to set your own rules.

  • Relies only on pure-Python libraries, and very few of them.

  • Tested on Python 2.7, 3.4, 3.5, and 3.6.

Installation

pip install markovify

Basic Usage

import markovify

# Get raw text as string.
with open("/path/to/my/corpus.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Print five randomly-generated sentences
for i in range(5):
    print(text_model.make_sentence())

# Print three randomly-generated sentences of no more than 140 characters
for i in range(3):
    print(text_model.make_short_sentence(140))

Notes:

  • The usage examples here assume you're trying to markovify text. If you'd like to use the underlying markovify.Chain class, which is not text-specific, check out the (annotated) source code.

  • Markovify works best with large, well-punctuated texts. If your text doesn't use .s to delineate sentences, put each sentence on a newline, and use the markovify.NewlineText class instead of markovify.Text class.

  • If you've accidentally read your text as one long sentence, markovify will be unable to generate new sentences from it due to a lack of beginning and ending delimiters. This can happen if you've read a newline delimited file using the markovify.Text command instead of markovify.NewlineText. To check this, the command [key for key in txt.chain.model.keys() if "___BEGIN__" in key] command will return all of the possible sentence starting words, and should return more than one result.

  • By default, the make_sentence method tries, a maximum of 10 times per invocation, to make a sentence that doesn't overlap too much with the original text. If it is successful, the method returns the sentence as a string. If not, it returns None. To increase or decrease the number of attempts, use the tries keyword argument, e.g., call .make_sentence(tries=100).

  • By default, markovify.Text tries to generate sentences that don't simply regurgitate chunks of the original text. The default rule is to suppress any generated sentences that exactly overlaps the original text by 15 words or 70% of the sentence's word count. You can change this rule by passing max_overlap_ratio and/or max_overlap_total to the make_sentence method. Alternatively you can disable this check entirely by passing test_output as False.

Advanced Usage

Specifying the model's state size

By default, markovify.Text uses a state size of 2. But you can instantiate a model with a different state size. E.g.,:

text_model = markovify.Text(text, state_size=3)

Combining models

With markovify.combine(...), you can combine two or more Markov chains. The function accepts two arguments:

  • models: A list of markovify objects to combine. Can be instances of markovify.Chain or markovify.Text (or their subclasses), but all must be of the same type.
  • weights: Optional. A list — the exact length of models — of ints or floats indicating how much relative emphasis to place on each source. Default: [ 1, 1, ... ].

For instance:

model_a = markovify.Text(text_a)
model_b = markovify.Text(text_b)

model_combo = markovify.combine([ model_a, model_b ], [ 1.5, 1 ])

... would combine model_a and model_b, but place 50% more weight on the connections from model_a.

Extending markovify.Text

The markovify.Text class is highly extensible; most methods can be overridden. For example, the following POSifiedText class uses NLTK's part-of-speech tagger to generate a Markov model that obeys sentence structure better than a naive model. (It works. But be warned: pos_tag is very slow.)

import markovify
import nltk
import re

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        words = re.split(self.word_split_pattern, sentence)
        words = [ "::".join(tag) for tag in nltk.pos_tag(words) ]
        return words

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

Or, you can use spaCy which is way faster:

import markovify
import re
import spacy

nlp = spacy.load("en")

class POSifiedText(markovify.Text):
    def word_split(self, sentence):
        return ["::".join((word.orth_, word.pos_)) for word in nlp(sentence)]

    def word_join(self, words):
        sentence = " ".join(word.split("::")[0] for word in words)
        return sentence

The most useful markovify.Text models you can override are:

  • sentence_split
  • sentence_join
  • word_split
  • word_join
  • test_sentence_input
  • test_sentence_output

For details on what they do, see the (annotated) source code.

Exporting

It can take a while to generate a Markov model from a large corpus. Sometimes you'll want to generate once and reuse it later. To export a generated markovify.Text model, use my_text_model.to_json(). For example:

corpus = open("sherlock.txt").read()

text_model = markovify.Text(corpus, state_size=3)
model_json = text_model.to_json()
# In theory, here you'd save the JSON to disk, and then read it back later.

reconstituted_model = markovify.Text.from_json(model_json)
reconstituted_model.make_short_sentence(140)

>>> 'It cost me something in foolscap, and I had no idea that he was a man of evil reputation among women.'

You can also export the underlying Markov chain on its own — i.e., excluding the original corpus and the state_size metadata — via my_text_model.chain.to_json().

Generating markovify.Text models from very large corpora

By default, the markovify.Text class loads, and retains, the your textual corpus, so that it can compare generated sentences with the original (and only emit novel sentences). But, with very large corpora, loading the entire text at once (and retaining it) can be memory-intensive. To overcome this, you can (a) tell Markovify not to retain the original:

with open("path/to/my/huge/corpus.txt") as f:
    text_model = markovify.Text(f, retain_original=False)

print(text_model.make_sentence())

And (b) read in the corpus line-by-line or file-by-file and combine them into one model at each step:

combined_model = None
for (dirpath, _, filenames) in os.walk("path/to/my/huge/corpus"):
    for filename in filenames:
        with open(os.path.join(dirpath, filename)) as f:
            model = markovify.Text(f, retain_original=False)
            if combined_model:
                combined_model = markovify.combine(models=[combined_model, model])
            else:
                combined_model = model

print(combined_model.make_sentence())

Markovify In The Wild

Have other examples? Pull requests welcome.

Thanks

Many thanks to the following GitHub users for contributing code and/or ideas:

Initially developed at BuzzFeed.

markovify's People

Contributors

ahjuroop avatar ammgws avatar artlung avatar avinassh avatar bfontaine avatar bziarkowski avatar cjmochrie avatar danmayer avatar deimos avatar eoinnoble avatar erichalldev avatar findus23 avatar fitnr avatar jaza avatar jsvine avatar kezadias avatar lukewrites avatar matthewscholefield avatar ntratcliff avatar orf avatar otakumegane avatar pmlandwehr avatar schollz avatar smalawi avatar stepjue avatar swartzcr avatar thallada avatar tmkuba avatar veggiedefender avatar whatrocks avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.