Text generation with Markov chains

Markov chains can be used by spammers (a bit outdated approach) to generate random texts from real ones. Not so useful for spammers though, so I did it actually for lulz. Here are some examples and some code follows.

Chain length=3 (each word depends on 2 past words):

Mich, see how I dash about! For a while he had fixed them. He supervised pin the tail on the street below, outside this warm two-storey house, like so many serpentines thrown before her. Now, the parents. With laughing reluctance they slid down a slide which Mich had said eight years ago. He almost bent to take hold of Marion now, in her skull mask. He felt an inexplicable rush of pity for her, because she had watched her own party, like so many grey mice the children, like so many serpentines thrown before her. Now, the house was empty and silent in the autumn leaves before his granmother’s house many years ago and heard the bathroom door unlatch and she had produced a child in her own party, like so many grey mice the children, like so many leaves. By the sound of the child as an idea, completely asexual, an immaculate conception of contemptuous mind and cell. As a firm rebuke to him she had hurt him. It was quiet. Louise cried out, ‘Marion, Marion!’ ‘Turn on the lights,’ said one of

Length=2 (each word depends on previous one only):

‘Hush, you’ll spoil everything,’ some girl said. The room I’m in the house, four times around the rest, amid fits of the items. ‘Marion?’ asked Louise. ‘Are you afraid?’ Marion appeared in her disguise. ‘How do in winter he suspected that somehow he came after, smiling, ‘I can’t find her.’ Then …… some idiot turned on was her skull sockets small blue eyes and says, “This is dead,’ intoned the suddenly frozen task of table-section, into the last day he was a dark forcing murderer. It would be over! There was a child cried, ‘I’ll check the wild explosive in the furnace. The room he as he thought. With laughing down and picked her not a boy. ‘Come back, Helen!’ Shot from her, all the husband. ‘She’s not wear this game with white bone masks and says, “These are you invited people in, but she was talking. ‘Marion?’ asked Louise as much as he suspected that and the top it, on the bottom of October, with the last coming slowly had forced child. But when he came down. By the open. But when

As you can see, the longer the chain gets, the longer pieces of original text appear in the result.

The same approach can be applied not only to words, but to characters as well:

ceineatrs, inghs -s nowad Bushin acllaplkin long. id, worshe plkeng. cereil won hie thenorildiloorie ganing. ang I’ qut cthid Mar Moutorkerskise to theyotidinerkesk heseas smoullin blelo be t h heat eayis edin f tof ve uid h h. ist thes iso tuthe ofut ll, bume be cid this cesus congherid owintr apourowit t t Itid Ane cter alt Itowd glab thersthowaris ouedit, cs. acankngosefen the rors arsupis hatadeshead ce w, tlly pr hrit n’I siouldanots ithedeailoura d tce!’Nobe’Evesa hinoow-the Agalthescke”. pesed tabowippsched te!” add. ad t, agare. se Lower, ppeacr or m t s toucathe war. me rontar tongsen thelind ‘Qurofamp beliofe hoys he t h t. By. Loysey nehthton ake ad, othe he If s Anok ‘ bontath vedn dll ande m onoond on wioused ghelen sctut adaiowlllwe chanind hrnsil Thown. s obirknd ine teisiloug hind spthalarrchiron praye s. chas che asanaumm juruse wa opig. ar. ght che alinene faldie, Ee nyoind he. dle ‘Mall w, s. apirout t wis I’ chikie m t mig ha

And here’s the code. PHP first.

<?
    /*
     * class for Markov's chains realization
     * (c) Andrey, somemilk.org@gmail.com, 2006
     *
     * Typical usage:
     * $m = new Markov(2); // 2 - markov chain length, depends on source
     *                     // file size, 2 for small files, 3-4 for large ones
     * $m->initFromFile("somefile.txt") or die("shit happens.");
     * $bullshit = $m->generate(2000); // generates bullshit 2000 chars long.
     * $m->initFromString(file_get_contents("somefile.txt")) or die("shit happens.");
     * $bullshit = $m->generate(100, MARKOV_OPT_WORDS); // generates bullshit 100 words long.
     */

    define("MARKOV_OPT_CHARACTERS", 1); // option: character limit for generate()
    define("MARKOV_OPT_WORDS",      2); // option: word limit for generate()
    define("MARKOV_DEFAULT_K",      3); // default Markov chain length
    define("MARKOV_TIME_LIMIT",     120); // time limit for initFromFile()

    define("MARKOV_MAX_RECURSION_LEVEL",     20);

    class Markov
    {
        var $k = MARKOV_DEFAULT_K;
        var $k_sets = array();
        var $split_method = MARKOV_OPT_WORDS;

        function Markov($k, $split_method = MARKOV_OPT_WORDS)
        {
            $this->k = $k;
            $this->split_method = $split_method;
        }

        /*
         * inits class' Markov k-sets with a text from a text file.
         *
         * returns true on success, false on failure
         */
        function initFromFile($filename)
        {
            set_time_limit(MARKOV_TIME_LIMIT);

            if(!is_readable($filename)) return false;

            $fc = file_get_contents($filename);
            $this->initFromString($fc);

            return true;
        }

        /*
         * inits class' Markov k-sets with a text from a string.
         *
         * returns true on success, false on failure
         */
        function initFromString($str)
        {
            if(strlen($str) k) return false;
            $this->k_sets = array();
            $set = array();

            if($this->split_method == MARKOV_OPT_WORDS)
            {
                $words = preg_split('/\s+/', trim($str));
                if(count($words) k) return false;
                foreach($words as $w)
                {
                    $this->_addToSets($set, $w);
                }
            }
            else
            {
                for($i=0; $i_addToSets($set, substr($str, $i, 1));
                }
            }

            return true;
        }

        function _addToSets(&$set, $w)
        {
	        $set[] = $w;
            if(count($set) == $this->k)
            {
                $key = "";
                for($i=0; $ik - 1; $i++)
                {
                    $key .= "[\$set[$i]]";
                }
                eval("\$this->k_sets{$key}[] = \$set[$i];");
                array_shift($set);
            }
        }

        /*
         * you must re-init the class after calling setK()
         */
        function setK($k) { $this->k = $k; $this->k_sets = array(); }

        /*
         * generates random word $length characters long
         * (available only if split_method is MARKOV_OPT_CHARACTERS
         */
        function getWord($length)
        {
            if($this->split_method != MARKOV_OPT_CHARACTERS) return false;
            $res = "";
            $set = array();
            while(strlen($res) k)
                {
                    $word = array_shift($set);
                    if(preg_match('/[\s\.,:\?!;"]/', $word))
                    {
                        $res = "";
                        continue;
                    }

                    $res .= $word;
                }

                if(strlen($res) == $length) break;

                if(count($set)) $element =& $this->k_sets[$set[0]];
                else $element = null;
                foreach($set as $i => $word)
                {
                    if(isset($element[$word])) $element =& $element[$word];
                }
                if(is_array($element))
                {
                    $word = $this->random_key($element);
                    if(is_array($element[$word])) $set[] = $word;
                    else $set[] = $this->random_value($element);
                }
                else $set[] = $this->random_key($this->k_sets);

            }
            return $res;
        }

        /*
         * generates Markov's string, max $how_much long,
         * in words (if $how==MARKOV_OPT_WORDS)
         * or in characters (if $how==MARKOV_OPT_CHARACTERS, default)
         */
        function generate($how_much, $how=MARKOV_OPT_CHARACTERS, $sentences=false, $recursion_level=0)
        {
            $res = "";
            $n_words = 0;
            $set = array();
            $sentence_started = false;
            while(($how == MARKOV_OPT_CHARACTERS && (strlen($res) < $how_much)) ||
                  ($how == MARKOV_OPT_WORDS && ($n_words k)
                {
                    $res .= ($word = array_shift($set));
                    if($this->split_method == MARKOV_OPT_WORDS)
                    {
                        $res .= " ";
                        $n_words++;
                    }
                    elseif(preg_match('/[\s\.,:\?!;"]/', $word))
                    {
                        $n_words++;
                    }
                    if($sentences && !$sentence_started && preg_match('/\.$/', $word))
                    {
                        $sentence_started = true;
                        $res = "";
                        $n_words = 0;
                    }
                }

                if(count($set)) $element =& $this->k_sets[$set[0]];
                else $element = null;
                foreach($set as $i => $word)
                {
                    if(isset($element[$word])) $element =& $element[$word];
                }
                if(is_array($element))
                {
                    $word = $this->random_key($element);
                    if(is_array($element[$word])) $set[] = $word;
                    else $set[] = $this->random_value($element);
                }
                else $set[] = $this->random_key($this->k_sets);

            }
            $res = rtrim($res);
            if($sentences)
            {
                $res = preg_replace('/\.[^\.]+$/', '.', $res);
                if(!preg_match('/\.$/', $res) && ($recursion_level generate($how_much, $how, $sentences, $recursion_level+1));
            }
            return $res;
        }

        /*
         * helper function, returns random array key
         */
        function random_key(&$array)
        {
            $rand = mt_rand(0, count($array) - 1);
            $i = 0;
            foreach($array as $key => $value) if($i++ == $rand) return $key;
        }
        /*
         * helper function, returns random array value
         */
        function random_value(&$array)
        {
            $rand = mt_rand(0, count($array) - 1);
            $i = 0;
            foreach($array as $key => $value) if($i++ == $rand) return $value;
        }
    }
?>

Python code, not mine, for generating words. I use it on one of my projects for generating random nicknames.

#!/usr/bin/env python

from collections import defaultdict
from random import choice

class TextGenerator(object):

    def __init__(self):
        self._data = defaultdict(list)

    def train(self, file):
        words = [None, None]
        for line in open(file):
            for word in line.split():
                words[0], words[1] = words[1], word
                if words[0]:
                    self._data[words[0]].append(words[1])

    def gentext(self, num_words):
        text = []
        text.append(choice(self._data.keys()).title())
        while len(text) < num_words:
            if self._data.has_key(text[-1]):
                text.append(choice(self._data[text[-1]]))
            else:
                text.append(choice(self._data.keys()))
        return ' '.join(text) + '.'

if __name__ == '__main__':
    textgen = TextGenerator()
    textgen.train('pandp.txt')
    print textgen.gentext(100)

You can google for some more code snippets, that’s what I found for perl:

#!/usr/bin/perl
# by jackal
use strict;
my $filename = shift || 'data.txt';
my $MAXGEN = 10000;
my $NONWORD = "\n";
my $w1 = $NONWORD;
my %statetab;
my $fh;
open($fh, '<', $filename);
while () {
    foreach (split) {
        push(@{$statetab{$w1}}, $_);
        $w1 = $_;
    }
}
close($fh);
push(@{$statetab{$w1}}, $NONWORD);
$w1 = $NONWORD;
open($fh, '>', 'markoff.txt');
for (my $i=0; $i[$r]) eq $NONWORD);
    print $fh "$t ";
    $w1 = $t;
}
close($fh);

Further reading:

Wikipedia:
http://en.wikipedia.org/wiki/Markov_chain

Python code author’s site:
http://bitecode.co.uk/

Leave a Reply