Snake_Byte #14: Writing Made Easy

Python, as a language, has its strengths and weaknesses, as any other language does. Some of its biggest strengths are its productivity and speed of development, the wealth of third-party modules you can easily import, extensive support libraries, and open source, community-driven development. Together, these strengths make Python a great language for doing rather complex things with relative ease.

Let’s take, as an example, the average chatbot or text generator. Most such applications use a corpus of previously seen text to generate brand new words and sentences, in an attempt to mimic the style and tone of the text used to train the bot. Some popular examples of such applications are Microsoft’s Twitter bot ‘Tay’, Reddit's subreddit simulator, and a web app that generates college application essays.

If you were asked to build a basic version of a text generator app, you might have some ideas of where to start. Perhaps you are aware that Markov chains work pretty well as a method of determining which characters, words, and phrases are likely to follow previous ones. If you are unfamiliar with what Markov chains are and how they could be used to generate text in this case, think probabilistic state transitions; there are plenty of succinct introductions to Markov chains for the curious.

We could create our own Markov model by treating the last two words we saw (starting with two consecutive words from our corpus) as the current state, so that the next word depends only on the previous two words. Using the following text as an example:

“how much wood could a woodchuck chuck if a woodchuck could chuck wood. can a woodchuck chuck wood?”

Now if we start with "how much", the next word has to be "wood" based on the first line above, and the current state moves to “much wood”. However, if we start with "a woodchuck", then the next word has a ⅓ probability of being “could” and a ⅔ probability of being “chuck”. Obviously, the larger our input text is, the more combinations of words our Markov model has to work with.
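This two-word-state model can be sketched in plain Python in just a few lines. The `build_model` and `generate` helpers below are hypothetical names for illustration, not code from the original post:

```python
import random
from collections import defaultdict

def build_model(text):
    """Map each pair of consecutive words to the list of words that follow it."""
    words = text.split()
    model = defaultdict(list)
    for i in range(len(words) - 2):
        model[(words[i], words[i + 1])].append(words[i + 2])
    return model

corpus = ("how much wood could a woodchuck chuck if a woodchuck "
          "could chuck wood. can a woodchuck chuck wood?")
model = build_model(corpus)

# "a woodchuck" is followed by "chuck" twice and "could" once,
# giving the 2/3 and 1/3 probabilities described above.
print(model[("a", "woodchuck")])  # ['chuck', 'could', 'chuck']

def generate(model, state, n_words=8):
    """Walk the chain: pick a random successor, then shift the state forward."""
    out = list(state)
    for _ in range(n_words):
        followers = model.get(tuple(out[-2:]))
        if not followers:
            break
        out.append(random.choice(followers))
    return " ".join(out)

print(generate(model, ("how", "much")))
```

Starting from "how much", the first step is forced ("wood" is the only follower), after which the walk branches wherever a state has more than one recorded successor.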

At this point, we have a basic idea of how we could generate a Markov model and use it to output text that mirrors the style of our input corpus. However, as I previously mentioned, one of the benefits of using Python is the vast number of open source modules available. Why reinvent the wheel if we don’t have to, especially if we want to capitalize on another of Python’s strengths: speed of development?

It turns out there is a Python library, markovify, dedicated to generating Markov models, which we can easily leverage in our effort to create our application. All we need to do is pip install it and then write a few lines of code.

And with that, we get a few more sentences of text in the same style and tone of writing as the short story “The Log” by Guy de Maupassant. The paragraph below is what we generated from our short snippet of code.

“We were never apart, and the fender, and on to the hearth and stamped out all the burning sparks with his boots. At first I spoke vaguely of those moments of friendly silence between people who have no need to be allured by the charm of their life. A minute more, and--I should have been--no, she would have been--when a loud noise made us both jump up.”

Not bad for a computer-generated version of Guy de Maupassant’s style of writing. Python allows developers to go quickly and easily from idea to reality, which, when leveraged properly, can be very handy and pragmatic.

About Alec Macrae

Alec is a member of the data science team at PokitDok, based out of the company's San Mateo office. He enjoys using data, machine learning, and AI to solve problems, reveal insights, and build cool software.
