Skip to content

Nantucket: an accidental limerick detector

How can you not name an accidental limerick detector Nantucket? I think there may be laws about that. Point being, I wrote a program that takes any text and tries to find any accidental limericks that might be hiding within (based on syllable counts and rhyme, ignoring punctuation and intent).

Limericks have a fairly loose form. The rhyme scheme is always AABBA, but the syllable count can be anything along the lines of 7-or-8-or-9/7-or-8-or-9/5-or-6/5-or-6/7-or-8-or-9. And as if that weren’t loosey-goosey enough, they can have either anapaestic meter (duh-duh-DUM, duh-duh-DUM) or amphibrachic meter (duh-DUM-duh, duh-DUM-duh)!

So, rather than write a limerick detector that looks for anything even remotely resembling a limerick, I chose the following as the canonical example to work from (at least for now):


There was a young student from Crew
Who learned how to count in base two.
His sums were all done
With zero and one,
And he found it much simpler to do.

Going by the canonical example above, Nantucket is set to look for limericks that are AABBA (rhyme scheme) and 8/8/5/5/9 (syllable count per line). It currently ignores meter, but I may add that requirement in later. It also only looks at words through an American English accent.

How to implement such a thing? Well, I started simple – I used the CMU Pronouncing Dictionary (“cmudict”) as accessed through NLTK (Python’s natural language toolkit) to get each word’s syllable count and pronunciation, and just had Nantucket give up on any potential limericks that included words not in cmudict. It tokenizes the text, then walks through each word in the text in turn, checks to see if there’s a limerick starting with that word, and gives up if it hits a word it can’t analyze or a limerick form violation (such as a word that overflows the end of a limerick line, or a rhyme scheme problem).

(It turned out I had to know Python or Java for the Stanford Natural Language Processing class I started two weeks ago. Er. Okay! This came in very handy as soon as I decided to build Nantucket last week – apparently Python has better natural language processing (“NLP”) libraries than Ruby, so it was absolutely the right choice for this project.) (I want Python’s scientific libraries, but with RSpec. Is that too much to ask?)

So, that was pretty fast to throw together. I defined a rhyme as being any pair of words which have identical last vowel sound plus all sounds following the last vowel sound. That’s not precisely right in American English, but it’s pretty damn close. This was easy to do with cmudict, because every vowel phoneme in cmudict ends with a digit denoting stress, but consonant phonemes never include digits.

So, that was a long bit of geekery. Time for a poetry break!

from Proust’s Swann’s Way:

bad conduct should deserve Was I
then not yet aware that what I
felt myself for her
depended neither
upon her actions nor upon my

was wonderful to another
How I should have loved to We were
unfortunate to
a third Yes if you
like I must just keep in the line for

to abandon the habit of
lying Even from the point of
view of coquetry
pure and simple he
had told her can’t you see how much of

That was a fun start, but I really wanted to handle the tons of words that just aren’t in cmudict. I tested Nantucket on James Joyce’s Ulysses repeatedly throughout this process, to come up with fantastic lists of words I was failing to catch at every step along the way.

I like to break projects down into small enough chunks that I get a sense of accomplishment frequently enough to keep myself motivated to keep on pushing forward. Here, words naturally fell into two categories – words which land in the middle of a line in a potential limerick, and words which land at the end of a line of a potential limerick. Syllable count matters for both, but I only really care about pronunciation for words that have to rhyme.

Next step was to create a function that determines the approximate number of syllables in any given word. A syllable contains a vowel sound, which can map to anywhere from 0 to 3 vowel graphemes (letters, not sounds) (aeiouy). I figured I could count the vowel grapheme groups, add 1 for each grouping of more than one vowel grapheme that isn’t a common digraph (group of graphemes that generally signifies a single phoneme (sound)), add 1 for the common apostrophe-in-place-of-vowel instances, and subtract 1 for the circumstances where ‘ed’, ‘es’, or ‘e’ are likely silent at the end of a word, and get pretty close.

def approx_nsyl(word):
      digraphs = ["ai", "au", "ay", "ea", "ee", "ei", "ey", "oa", "oe", "oi", "oo", "ou", "oy", "ua", "ue", "ui"]
      # Ambiguous, currently split: ie, io
      # Ambiguous, currently kept together: ui
      digraphs = set(digraphs)
      count = 0
      array = re.split("[^aeiouy]+", word.lower())
      for i, v in enumerate(array):
           if len(v) > 1 and v not in digraphs:
                count += 1
           if v == '':
                del array[i]
      count += len(array)
      if re.search("(?⟨=\w)(ion|ious|(?⟨!t)ed|es|[^lr]e)(?![a-z']+)", word.lower()):
           count -= 1
      if re.search("'ve|n't", word.lower()):
           count += 1
      return count

With that in place, I had Nantucket keep going when it came across non-cmudict words that fell in the middle of potential limerick lines, but give up on any potential limerick that hit a non-cmudict word that would fall at the end of a limerick line and have to rhyme. Better, but not good enough. I needed a way to figure out when words not in cmudict rhyme!

Grapheme-to-phoneme conversion (“g2p”), or converting a list of letters into a corresponding list of sounds, is a fascinating and non-trivial problem. I spent a lot of time falling down the rabbit hole of reading intriguing papers on how to design machine-learning algorithms that can take context into account when doing g2p. There are two big problems with g2p in American English. First, there’s the issue of alignment – there just isn’t a consistent correspondence between the number of graphemes and the number of phonemes. And second, there’s the issue of context – the same grapheme (or group of graphemes) can correspond to a different phoneme depending on the other graphemes in the word (think about the “pol” in “politics” (ah) versus “political” (uh)), or even whether the word is a verb or a noun (the “de” in “defect”, for instance (dee versus deh)).

At a glance, I figured I could do a rough pass at the problem by creating a last-syllable dictionary based on cmudict, which I promptly did. It was actually surprisingly helpful, but still missed a lot of words, and wasn’t as accurate as I wanted it to be. (More on this later.) So, I kept thinking and reading about the problem.

Poetry break!

from James Joyce’s Ulysses:

grace about you I can give you
a rare old wine that’ll send you
skipping to hell and
back Sign a will and
leave us any coin you have If you

then he tipped me just in passing
but I never thought hed write making
an appointment I
had it inside my
petticoat bodice all day reading

meant till he put his tongue in my
mouth his mouth was sweetlike young I
put my knee up to
him a few times to
learn the way what did I tell him I

The most promising idea I came across is implemented as free software already as Sequitur (that link links to the paper it’s based on, but you can also find a a free pre-publication copy of the manuscript here). It uses expectation maximization and viterbi training to create a model for g2p for any language, given a suitable dictionary to train from first. (It argues that it’s method is better than other methods I came across earlier, which generally involved hidden markov models or, in the simplest promising paper I read, going from right to left and looking at three graphemes to either side for context).

So, I installed Sequitur and gave it a whirl, training and testing it with huge portions of cmudict. It took hours to train, though, and I ultimately wasn’t thrilled with its accuracy. (To be fair, I only tested its accuracy overall, not its accuracy for last syllables only).

It occurred to me during this process that my needs were actually simpler than that. I don’t need to be able to do g2p for complete words in order to have a functional limerick detector. I only need to do g2p for the last syllable of each end-of-limerick-line word. That simplifies my alignment problem right off the bat, because I start by aligning my grapheme list and my phoneme list to the right (as in the second paper I link to above), and I don’t go far enough to the left for them to have much opportunity to become misaligned to any relevant expect. It also resolves a large portion of my context problem – there’s less flexibility in last-syllable graphones (pairs of graphemes and phonemes) than in entire-word graphones.

As completely fascinated by machine learning algorithms as I am, I decided to go back to the last-syllable dictionary I’d created and see if I could attack the problem by improving its accuracy instead.

I created it by going through every word in cmudict and pulling it out what looked like the last syllable worth of graphemes and the last syllable worth of phonemes, catching the graphemes with:

graphemes = re.search("((?i)[BCDFGHJKLMNPQRSTVWXZ]{1,2}[AEIOUY]+[BCDFGHJKLMNPQRSTVWXZ]*(E|ED)?('[A-Z]{1,2})?)(?![a-zA-Z]+)", word).group()

And catching the phonemes with:

val = min(vals, key=len)
i = -1
while i >= 0 - len(val):
      if isdigit(val[i][-1]):
           str = " ".join(val[i:])

To improve my accuracy, after a few iterations I chose to grab up to 2 consonants prior to the my best estimate of the last vowel sound in the word, and include copies without the first and without either, in my list. For example:

clotted AH0 D
lotted AH0 D
otted AH0 D
ders ER0 Z
ers ER0 Z
rs ER0 Z

This gave me gleeful flashbacks to a childhood parlor trick that my brothers and I can still all do in unison, where we chant at blazingly fast top speed: “everybody-verybody-erybody-rybody-ybody-body-ody-dy-y!”

Poetry break!

from Dostoevsky’s The Brothers Karamazov:

eyes with a needle I love you
I love only you Ill love you
in Siberia
Why Siberia
Never mind Siberia if you

are children of twelve years old who
have a longing to set fire to
something and they do
set things on fire too
Its a sort of disease Thats not true

and be horror struck How can I
endure this mercy How can I
endure so much love
Am I worthy of
it Thats what he will exclaim Oh I

Anyways, once I had modified set of last syllable graphones (pairs of letter lists and sound lists), I used some sweet little command line tools to sort the results into a list of unique types with frequency counts, like so:

sort < suff_a.txt | uniq -c | sort -nr > suff_b.txt

My last step in creating my cmudict-based last-syllable dictionary (“suffdict”) was to keep only the most likely set of phonemes for each unique set of graphemes, and reformat the list to match cmudict’s format so I could just use NLTK’s cmudict corpus reader for my suffdict as well. I did that like so:

def most_prob(file):
      uniq_suffs = []
      goal = open('suff_c.txt', 'a')
      with open(file) as f:
           for line in f:
                suff = re.search("\s[a-zA-Z']+\s", line).group()
                if suff not in uniq_suffs:
                     uniq_suffs.append(suff)
                     new_line = re.sub("\d+\s(?=[a-z])", "", line)
                     new_line = re.sub("(?<=[a-z])\s(?=[A-Z])", " 1 ", new_line).strip()
                     goal.write(new_line + '\n')
      goal.close()

The above code catches only the first instance of any given grapheme set, which gives me the most probable instance, because I'd already sorted everything in order of highest to lowest number of occurrences.

Now when checking for last-syllable phonemes for a word not in cmudict, I use the same regex I used when creating suffdict to check whether the last syllable worth of graphemes from that novel word is in suffdict, and if so, return the corresponding last syllable worth of phonemes from suffdict. If not, try without the first letter, or without the s or 's at the end.

Fantastic! I was then able to test my accuracy by running every word in cmudict through cmudict and through my suffdict, and then seeing whether the resulting phoneme lists rhymed.

Poetry break!

from Mark Twain's Huckleberry Finn:

he suspicion what we're up to
Maybe he won't But we got to
have it anyway
Come along So they
got out and went in The door slammed to

and see her setting there by her
candle in the window with her
eyes towards the road and
the tears in them and
I wished I could do something for her

I went through a few iterations of tweaking my suffdict creation method to eke a few extra percentage points of accuracy out of it.

My first attempt, which looked at every possible pronunciation for every word in cmudict when creating my suffdict, gave me 80.29% accuracy.

Next, I realized that since I was always using the shortest possible pronunciation when running words through cmudict, I should probably only look at shortest possible pronunciations when creating my suffdict, for consistency's sake. That brought me up to 82.20% accuracy.

After that, it occurred to me that if I included up to 2 consonants at the beginning of each last-syllable grapheme list in suffdict instead of just 1, I would have a bit more context and catch a few more words correctly. Which I did! That got me up to 85.78% accuracy.

This was getting pretty exciting, but still not as good as I wanted it to be. I thought about what I was doing, and decided to revisit the function I'd written that checks whether lists of phonemes (in the cmudict-style, but from any source) rhyme. Maybe the problem was there instead.

And, oh, yes! My rhyme_from_phonemes function was checking to see whether the last vowel phoneme and all phonemes thereafter were identical in both words - really, truly identical, that is. It disqualified even pairs of words that had exactly the same sounds but different stress patterns. This might make sense if I were paying attention to meter or defining rhyming differently, but it wasn't really what I was going for here at all. So, I rewrote that function to ignore the digit of the last vowel phoneme (which denotes stress only) and instead check whether it and all following phonemes were otherwise identical, like so:

def rhyme_from_phonemes(list1, list2):
      i = -1
      while i >= 0 - len(list1):
           if isdigit(list1[i][-1]):
                if i >= 0 - len(list2) and list1[i][:-1] == list2[i][:-1] and (i == -1 or list1[i + 1:] == list2[i + 1:]):
                     return True
                else:
                     return False
           i -= 1

That brought me up to 90.85% accuracy.

That feels pretty good, for my purposes!

I was now catching limericks with novel words at the end of lines. Poetry break time!

from Genesis:

in the iniquity of the
city And while he lingered the
men laid hold upon
his hand and upon
the hand of his wife and upon the

Amorite and the Girgasite
And the Hivite and the Arkite
and the Sinite And
the Arvadite and
the Zemarite and the Hamathite

I took a moment after that to make the basic limerick-finding algorithm a bit faster. My first draft was intentionally simple but inefficient, in that it started fresh for each word in the text, instead of saving the syllable counts and phonemes for words checked on previous limerick attempts. It had to re-analyze a word each time that word was encountered.

That worked well enough to let me get to the interesting g2p problem quickly, but once I was reasonably satisfied with my suffdict, I wanted to refactor to make the whole thing more efficient. The current version holds the phonemes and syllable count of each word encountered in a dict, so it can grab them quickly from that dict the next time they're encountered instead of having to figure them out from scratch again and again as it goes through the text.

I have some thoughts on increasing the efficiency further (by having it skip forward more intelligently whenever it hits a word it can't find phonemes for, for instance), but really, it's at a good enough stage that I wanted to share some accidental limericks with you all already!

22 Comments

  1. Rob Abrazado wrote:

    This is just astoundingly impressive. I’m greatly enjoying reading about your hacking projects. :)

    Tuesday, April 3, 2012 at 1:36 pm | Permalink
  2. Dr. Goose wrote:

    It’s hard to overstate the importance of this invention – when will you release version 1.0?

    Tuesday, April 3, 2012 at 3:50 pm | Permalink
  3. Danielle wrote:

    Thanks, Rob! I really truly appreciate that.

    Dr. Goose – I’m charmed, and thank you. ^^ Well, if you download the code and install the dependencies as described in the ReadMe on Github, you can run it from your command line right now. Does that work for you? If not, what kind of release are you looking for?

    Tuesday, April 3, 2012 at 3:52 pm | Permalink
  4. Dave wrote:

    This is very cool.

    Have you considered forbidding limericks with repeated words in couplets? E.g. the couplet

    and the Sinite And
    the Arvadite and

    is less interesting than a rhyme with distinct words (though it’s got nice meter!), and

    in the iniquity of the
    city And while he lingered the
    men laid hold upon
    his hand and upon
    the hand of his wife and upon the

    is a 3-2 split between two words. Even a somewhat permissive filter (for example, forbidding triples and pairs of pairs) might make the remaining limericks really interesting.

    Would it be easy/legal for you to post the specific text files you’re searching on github? Or if that’s hard/impermissible, can you link to them? (Presuming they’re freely available.)

    Tuesday, April 3, 2012 at 10:28 pm | Permalink
  5. JeffyPooh wrote:

    There was a young geek named Sucher,
    and poetry was her pleasure.
    She worked in Python,
    and coded ’til dawn,
    to build a limerick detector.

    :-)

    Tuesday, April 3, 2012 at 10:31 pm | Permalink
  6. Danielle wrote:

    I did consider that, but I’m getting few enough limericks now that it didn’t make sense to me to cut down the number even more. If I were getting a lot, then yes, I’d absolutely do what you suggest! You’re right, it would definitely lead to better poetry.

    I’m just going through texts found on Project Gutenberg. You can head over there to grab entire books to play with!

    Tuesday, April 3, 2012 at 10:32 pm | Permalink
  7. rfp wrote:

    love the ones from Genesis, couldn’t decide if they would be best read as a single 2-verse composition or ingested separately

    Tuesday, April 3, 2012 at 11:44 pm | Permalink
  8. Neal Z wrote:

    I’m jealous! I started the NLP course along with three others, but due to my schedule and lack of Java/Python experience I ended up only sticking with two of them (Model Thinking and Algorithms). My consolation is that I can just save the videos/slides and come back to them over summer break.

    Anyway, this is awesome – congrats!

    Wednesday, April 4, 2012 at 2:53 am | Permalink
  9. Niall wrote:

    Excellent – I encountered much the same short-comings with the CMU syllable parser as you did, but went in a different direction for my project… wonderful to see cross-discipline work like this.

    Wednesday, April 4, 2012 at 6:30 am | Permalink
  10. Matt wrote:

    Can you please make a web-based front-end so people can upload or point to a text and have it return limericks? Or, failing that, an email front-end where I could email it a text to parse?

    Wednesday, April 4, 2012 at 1:14 pm | Permalink
  11. Bruce Hayes wrote:

    This is nice but I think you are aiming too low in your definition of a limerick.

    1) It’s true that the length of the limerick line is somewhat free, but the freedom is largely at the periphery of lines. A decent sounding limerick would have two stressless syllables between every pair stressed, e.g. for the long lines the template is:

    (x)(x) X x x X x x X (x)(x)

    where capital X denotes a stressed syllable.

    2) It doesn’t really count as a rhyme if you fail to match stressed vowels (so, for example, “batter” fails to rhyme with “upper”). In addition, rhymes are heavily disfavored if the consonants before the stressed vowel are identical. Your case of “to…to” falls short on both counts.

    I conjecture that if you enforced the above requirements you would find that “accidental limericks” are vanishingly rare, or perhaps even entirely absent. At best you could find legal limerick lines from the same corpus and then string them together.

    A nice reference source for the rules of English verse construction is Timothy Steele’s book “All the fun is how you say a thing: an explication of meter and versification”

    Wednesday, April 4, 2012 at 2:58 pm | Permalink
  12. Danielle wrote:

    Matt – Sure! I’m working on something else today, but I can probably get to that at some point within the next week so.

    Bruce – You make entirely valid points here. As I mentioned above, I knowingly used a definition of ‘rhyme’ that isn’t quite accurate in order to get a rough pass at this without having to worry about stress, at least for now.

    I also agree about the meter – limericks can formally have one of two different possible meters, but since a limerick line can start anywhere within a foot of the meter, those two end up being functionally identical in the way you describe.

    But it really comes down to your last point – accidental limericks would indeed be vanishingly rare if I used a stricter ruleset. I’d rather find some fun ones this way than none at all by hewing to stricter guidelines!

    Wednesday, April 4, 2012 at 3:14 pm | Permalink
  13. Danielle wrote:

    Bruce – Oh, and thanks for the Steele recommendation. I’d never heard of it, and it looks great!

    Wednesday, April 4, 2012 at 3:17 pm | Permalink
  14. Micheal Lefkowitz wrote:

    Great work, this is neat!

    If you’re willing to hear suggestions, I think you could find even better accidental limericks by making some minor tweaks:

    1) In real limericks the ends of lines aren’t usually function words like “the” and “I.” You could filters these out.

    2) In limericks, lines can end in unstressed syllables, but when they do, everything back to rightmost STRESSED syllable has to rhyme too. This might take some tweaking but I think you have all the right tools to do it!

    2) Syntactically it would be nice to find limericks that end at the end of a sentence: these could be found pretty easily using punctuation.

    Wednesday, April 4, 2012 at 6:35 pm | Permalink
  15. Micheal Lefkowitz wrote:

    Oops, should have read the other comments first – didn’t mean to rehash Bruce’s point!

    Wednesday, April 4, 2012 at 6:37 pm | Permalink
  16. Danielle wrote:

    Matt – At your request, Nantucket now has web front-end: http://www.daniellesucher.com/nantucket/nantucket.html

    It’ll let you point it at text files on the web only, not upload files or point it at anything other than text files. But that lets you get at all of Project Gutenberg, basically! Have fun, and thanks for prodding me to take the extra step here. ^^

    Saturday, April 7, 2012 at 12:14 pm | Permalink
  17. Erin wrote:

    You never cease to amaze me. Here is some Don Quixote for you.

    on that point said Don Quixote
    God knows whether there be any
    Dulcinea or
    not in the world or
    whether she is imaginary

    enchanters that persecute me
    are not trying to entangle me
    in them and delay
    my journey by way
    of revenge for my obduracy

    Monday, April 9, 2012 at 11:28 am | Permalink
  18. wordgoeshere wrote:

    Amazing work, Danielle! I’m looking to build a rhyming dictionary/search and have just started delving into what that entails (like, today). I don’t have any programming experience but am starting with basic Python tutorials. I have a couple questions for you:

    1) How long did this take you and how well did you know Python/Java before you started? (very impressive)

    2) Any suggestions on where to start? So far, I’m looking to just learn Python via http://learnpythonthehardway.org/

    Any help would be greatly appreciated. This projects is bound to consume me :)

    Thursday, April 19, 2012 at 7:57 pm | Permalink
  19. Danielle wrote:

    Erin – I love the Don Quixote ones! Niiice find.

    Wordgoeshere – Happy to help any way I can. ^^ Do you know any programming? Knowing more about where you’re coming from will help me give you better advice on where to start!

    I had just started poking at Python a bit a couple days earlier, by jumping into the docs to do an assignment for the Stanford online NLP class. But it’s pretty similar to Ruby, which I’d played with for a few months beforehand. I don’t actually know any Java.

    I was able to put together my first rough draft of Nantucket (which worked, but skipped any words not in the cmu pronouncing dictionary) in less than a day. I worked on improving it on and off for the next week or so, in between other projects.

    Let me know what you end up building! I’d love to see what you come up with!

    Friday, April 20, 2012 at 2:35 pm | Permalink
  20. wordgoeshere wrote:

    Thanks for the quick response, Danielle. No, I don’t know any programming. I just started looking into this yesterday after thinking about it for a LONG time. At the suggestion of some helpful folks over on reddit () I’m planning on starting to work through http://learnpythonthehardway.org to gain a little familiarity with the language before I try getting complicated. I was expecting my project to take months at the very least, but hearing that you were able to do that in a day is very encouraging. Someone over on reddit also wrote a 44 line script in the last 24 hours, so I guess I was greatly overestimating the task that lay in front of me. Do you think that tutorial I linked above would be an appropriate place to start for this sort of thing? I feel like I need to understand the building blocks before I try to build anything. Thanks for your help!

    Friday, April 20, 2012 at 5:17 pm | Permalink
  21. wordgoeshere wrote:

    Oh, forgot to link to the reddit thread (http://www.reddit.com/r/learnprogramming/comments/sik0f/i_want_to_build_a_rhyming_dictionarysearch_engine/)

    Friday, April 20, 2012 at 5:17 pm | Permalink
  22. Danielle wrote:

    This is probably a conversation best continued on the reddit thread, so I think I’ll go create an account at long last and head over there.

    (Short answer: I hear very good things about Learn Python the Hard Way, but have not actually read through it myself. Also, I’m not really clear on what you’re trying to build, honestly.)

    Friday, April 20, 2012 at 5:26 pm | Permalink

2 Trackbacks/Pingbacks

  1. [...] Danielle Sucher › Nantucket: an accidental limerick detector. [...]

  2. This is a linkdump | Ryan Sholin on Wednesday, April 10, 2013 at 4:31 pm

    [...] The accidental limerick detector. [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *
*
*