
Honey from Huffman trees

Did you know that NYC Open Data includes latitude/longitude for every tree in NYC? A tree census – amazing! Surely there’s something fun we can do with that!

Well, bees in NYC do a lot of their nectar foraging from flowering trees. It could be interesting to see which kinds of trees my bees are likely foraging from. Bees can fly startlingly long distances to forage if they have to, but my sense is that a 2 mile radius is a decent rough estimate for how far they’re likely to fly in search of nectar barring a barren neighborhood. Sure, there are plenty of other flowering plants in the city (I can certainly taste the clover in my spring honey, for instance), but it’s still interesting to see which trees are likely contributing to the honey I harvest locally.

So, I pulled the tree census data into MySQL to play with. (Bread crumbs for the inspired: I wanted to deploy on Dreamhost, so I used an old version of MySQL and was stuck with MBRContains narrowed down with the haversine formula (and cosine approximation, of course) to find only those trees actually within an n-mile radius of my starting point. Postgres and ST_Contains with a polygon approximation of the circle would be better, really.)
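For anyone following those bread crumbs, here is a minimal Python sketch of the same two-step filter: a cheap bounding-box test first (the job the MySQL bounding-rectangle check does), then the haversine formula to keep only points truly inside the radius. The function names, the sample tuples, and the Earth radius constant are assumptions made for illustration; this is not the code behind the tree-finder.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def bounding_box(lat, lon, radius_miles):
    """Rough lat/lon box around a point; a degree of longitude shrinks by cos(latitude)."""
    dlat = math.degrees(radius_miles / EARTH_RADIUS_MILES)
    dlon = dlat / math.cos(math.radians(lat))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


def trees_within(trees, lat, lon, radius_miles):
    """Filter (species, lat, lon) tuples: box test first, exact distance second."""
    lat_lo, lat_hi, lon_lo, lon_hi = bounding_box(lat, lon, radius_miles)
    return [
        t for t in trees
        if lat_lo <= t[1] <= lat_hi
        and lon_lo <= t[2] <= lon_hi
        and haversine_miles(lat, lon, t[1], t[2]) <= radius_miles
    ]
```

The box test lets a database use an index to discard most rows cheaply; the haversine step then trims the corners of the box, which lie farther away than the radius.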

I’m displaying results as a Huffman tree – a visual representation of the data structure you create when you use Huffman encoding.
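In case Huffman trees are new to you, the construction is short: keep merging the two least frequent items under a new parent until only one tree is left, so common species sit near the root and rare ones end up deep in the branches. A minimal Python sketch follows, with made-up species counts and function names chosen for illustration (it assumes a non-empty input with string labels, and it is not the code that drives the visualization):

```python
import heapq
from itertools import count


def huffman_tree(freqs):
    """Build a Huffman tree from {label: weight}.

    Leaves are the labels themselves; internal nodes are (left, right) tuples.
    """
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(weight, next(tiebreak), label) for label, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)  # least frequent
        w2, _, b = heapq.heappop(heap)  # second least frequent
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (a, b)))
    return heap[0][2]


def depths(node, depth=0):
    """Map each leaf label to its depth in the tree (its Huffman code length)."""
    if isinstance(node, tuple):
        left, right = node
        return {**depths(left, depth + 1), **depths(right, depth + 1)}
    return {node: depth}


sample = {"linden": 50, "maple": 25, "cherry": 15, "pear": 10}
tree = huffman_tree(sample)
```

With these sample counts the most common species lands one step from the root and the two rarest share the deepest branch.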

Visualized with D3.js, with leaf sizes proportional to the percentage of trees found of each type, it looks rather like this:

You can put in your own beehive address (or home, or neighborhood you’re thinking of moving to, whatever makes you happy) and play with my tree-finder here.

When a thing is as dorky as it can possibly be, I know it is done right.

“What I tell you three times is true.”

My partner, Dave, just received the most amazing email from a brilliant and delightful friend of ours (emphasis added):
 

You were in a dream I had last night, and I thought it might amuse you.

You had a couple tables set up in the foyer of a (nonexistent) restaurant specializing in dumplings molded into cube shapes and sold in powers of two, and you were handing out atheist literature and cookies to people waiting to get tables. The cookies were chocolate chip, except the dough had a sort of smoky cherry taste – like a cherry brandy, not like a cherry jolly rancher – and was purple.

There was a sign announcing you would talk to anyone about anything they wanted to talk about, and you were telling someone that they should solve their problem by trying three times to fix it – any way they wanted, as long as all three attempts were different – and then pick which of their attempts came closest to the desired result and ask an expert how to improve on that. They should keep doing this until they ran out of experts, at which point they themselves were the expert.

“That’s how I learned to communicate with sea lions!” you said. At which point I suddenly noticed the sea lion in a powered exoskeleton and little square glasses hanging out under one of your tables, happily reading one of the atheist books.

I thought it was amusing. Even though I don’t like chocolate chip cookies. :)


I just had to share (with permission, of course), because that’s actually pretty great (and very Dave-ish) problem-solving advice!

(The cookies are nice, too. Seeing as how life imitates dreams, Dave got inspired and had to make a batch of smoky cherry chocolate chip cookies for us tonight.)

Spring has sprung with a LOT of bees.

My bees looked a bit sparse and weak going into winter, so when I had the opportunity to order a new package of bees through the Backwards Beekeepers a few months ago, I went for it. I wasn’t sure if my hive had survived winter (or would survive spring), especially given how strange the weather has been, but it was then or never on making sure I had bees this summer. I figured it was worth a gamble – either my hive would die, and I’d want a new package to replace it, or my hive would survive, in which case I could try to sell my package to someone else or (gulp!) start a second hive.

I lost track of time this spring, what with focusing on Hacker School and all. So when I got an email a few days ago saying that I had to come pick up my package on Friday, I was in a bit of a panic. I hadn’t checked in on my bees at all since last fall. I had no idea if I had bees! And I sure didn’t have any spare hive to put the package into if it turned out my bees had made it through after all, or anyone to buy the package from me last-minute.

After some hurried consultation with my delightful bee purveyor, I stopped by Hayseed and picked up a spare bottom board, inner cover, and outer cover along with my package. Can’t hurt to have spares just in case, and if I did need to start a second hive, that was the bare minimum equipment I needed, given that I have some extra medium supers and frames that I tend to use for honey a bit later in the season.

Turns out, I have a LOT of bees.

(Yeah, I need to pick up some more cinderblocks for the new hive to stand on. That milk crate was the best I could find in a pinch!)

You can see that the new hive is just one super at the moment, while the old one is three supers high. (Or was two days ago, anyways.) I use all medium supers for both brood and honey in my hive(s!). And right there in front of the new hive is the box the new bees came in.

A package is a box with about 3 lbs of bees and a little cage with a queen in it. When you want to install a package, you basically just reach in and gently remove the queen cage, then pry the mesh off one side of the package and shakeshakeSHAKE your booty all the bees out into the hive. That’s it, really. I scold them lovingly and literally brush them from the tops of the frames down in between the frames, but mostly just because it’s fun. And finally, you just leave the box out in front of the hive so the rest of the stragglers can follow the queen’s scent and find their way into their new home.

I suspended my queen cage in my hive, then closed it up. The idea there is that the queen is trapped in her cage by a sugar plug. The bees have time to get used to her scent while eating her free, and so are more likely to accept her once she’s out among them. I tend to use a business card and a thumbtack to hang the queen cage between frames in the hive. This time I used Kyle‘s business card (he’s a Hacker School alum who now works with Tumblr), since we’ve chatted by email already and I know I have his contact info saved elsewhere by now. (Hi, Kyle! I hope you’re charmed rather than offended by this. You’ve become part of a rather delightful process, it turns out!)

I went back this morning to make sure the queen was released properly. Tomorrow would’ve been better, but bees depend on weather and my work schedule, after all. They’d mostly gotten through the sugar, but not entirely – her handmaidens were free, but the queen herself was still in her cage. Everyone sounded happy, though, so I manually released her and watched for a moment to make sure that the hive continued to sound cheerful and that they didn’t start balling her immediately. Everything looked fine, so I closed that hive up with a sugar syrup feeder on top and moved on to my older hive.

The weather was nicer today, so I wanted to get deeper into my big hive to see what was really going on in there. They were chill as can be, friendly and relaxed, so I figure they’re probably queenright. I saw some very young larvae in there, too, along with some older brood and honey and pollen and assorted bee stuffs. And so many bees! That hive is seriously busy. Not too many queen cells, surprisingly, so they didn’t seem in imminent danger of swarming, but they were starting to back-fill the brood nest with nectar. Time to take action!

No prob. I closed up the hive, went downstairs, and got my last remaining super and set of frames. I checkerboarded the top two of the now four supers on that hive, to confuse their swarming instinct and give them more space to lay and to save whatever nectar they may find this early in the year. I’m going to have to buy some more supers and frames at Hayseed, stat! I don’t have any spares left for my expanding new hive or to collect honey in the old hive.

One more thing to take care of before I was done. I had a bit of a varroa mite problem last year (remember my snow bees?), and I want to stay on top of reducing the mite population as much as possible. How lucky, then, to discover a nearly fully capped frame full of drone brood!

See how it bulges out? Drones (males) are bigger than workers (females), so they need more space to grow when developing. Worker brood has flat caps, but, well, you can see why we also refer to drone brood as “bullet brood”!

Varroa mites preferentially lay in drone brood. Drone brood takes longer to mature, and is bigger, so it gives the mite mama more bang for her buck, as it were. (Catch me in person and ask me to tell you of my scheme to miniaturize my bees at some point, by all means! I have theories and plans. But it has been a hectic spring, so that will probably wait for next year.)

Point being, in addition to sprinkling powdered sugar everywhere, you can also cut down on your mite population by taking capped drone brood and sticking it in the freezer. The mites die, and then you return the frame to the hive, where the bees will clean it up, lay more drones, and restart the process. Frankly, my ladies don’t really need more drones around anyway. They’re not terribly useful. So we can do this all summer to try to reduce the number of bugs on my bugs. Fantastic!

So basically, what I’m trying to say here is that all is gorgeous and amazing out in the Brooklyn sky, which is where I’ll be most Sunday mornings for the next few months.

Happy spring!

The why and how of it all

Idealist asked: “Tell us: What quotation reminds you to keep your priorities straight? #favoritequotesroundup”

I’m such a literary packrat that I couldn’t fit my answer into 140 characters, so here goes instead:
 
 
From my best-beloved Buckminster Fuller, in his Everything I know (emphasis added):

“I’ve wanted you to think about, “Why are humans here?” “Why do they have that beautiful mind and why they have access to the great principles of Universe itself, of the great design nothing else we know has access to?” I say we, common to all human beings, in all history, completely independent of any ethnic nuance or whatever it may be have problems, problems, problems because WE ARE HERE FOR PROBLEM SOLVING. Not to have problems out of the way in some stupid, sublime something called peace. We’re here strictly for problem solving, and the better you get at it, the more problems you’re going to get to solve.”

 
 
Also from Buckminster Fuller:

“When I am working on a problem I never think about beauty. I only think about how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong.”

 
 
From Annie Dillard’s The Writing Life (via PC Wordsmiths):

“Write as if you were dying. At the same time, write as if for an audience consisting only of terminal patients. That is, after all, the case. What would you begin writing if you knew you would die soon? What could you say to a dying person that would not enrage by its triviality?

“…One of the few things I know about writing is this: spend it all, shoot it, play it, lose it, all, right away, every time. Do not hoard what seems good for a later place in the book, or for another book; give it, give it all, give it now. The impulse to save something good for a better place later is the signal to spend it now. Something will arise for later, something better. These things fill in from behind, from beneath, like well water. Similarly, the impulse to keep to yourself what you have learned is not only shameful, it is destructive. Anything you do not give freely and abundantly becomes lost to you. You open your safe and find ashes.

“After Michelangelo died, someone found in his studio a piece of paper on which he had written a note to his apprentice, in the handwriting of his old age: ‘Draw, Antonio, draw, Antonio, draw and do not waste time.’”

 
 
From Keith Olbermann:

“…the simple idea that those other people you see every day, the background characters, the extras in the movie that is your life, that they count too, and that the only obligation you truly have in life is to try to do something, something for them, even if you will never meet them, even if you will never know them. Something. Not everything. Something. …You will die and I will die and everybody you will see tomorrow will die and so will their children and their descendants, and we will be, at best, memories. And by what are all those who preceded us judged? Name anybody in history—name anybody we all know or somebody only you know—by what are they judged? The answer, stripped of the bells and whistles, is not wealth nor fame nor beauty nor power, but what impact did they have on the lives of others?”

 
 
From Neil deGrasse Tyson:

“For me, I am driven by two main philosophies: know more today about the world than I knew yesterday and lessen the suffering of others. You’d be surprised how far that gets you.”

 
 
From Kasey Chambers:

“The miles take time, but the time is mine, and always moving suits me fine. I’ll catch my breath when I sleep. And after all that I’ve done, I’m not half what I’d hope that I’d become. There is still a long way to go.”

 
 
And a few from Oliver Wendell Holmes, Jr.:

“The riders in a race do not stop short when they reach the goal. There is a little finishing canter before coming to a standstill. There is time to hear the kind voice of friends and to say to one’s self: ‘The work is done.’ But just as one says that, the answer comes: ‘The race is over, but the work never is done while the power to work remains.’ The canter that brings you to a standstill need not be only coming to rest. It cannot be while you still live. For to live is to function. That is all there is in living. And so I end with a line from a Latin poet who uttered the message more than fifteen hundred years ago: ‘Death plucks my ears and says, Live – I am coming.’”

“Alas, gentlemen, that is life. I often imagine Shakespeare or Napoleon summing himself up and thinking: ‘Yes, I have written five thousand lines of solid gold and a good deal of padding – I, who would have covered the milky way with words which outshone the stars!’ ‘Yes, I beat the Austrians in Italy and elsewhere: I made a few brilliant campaigns, and I ended in middle life in a cul-de-sac – I, who had dreamed of a world monarchy and Asiatic power.’ We cannot live our dreams. We are lucky enough if we can give a sample of our best, and if in our hearts we can feel that it has been nobly done.”

“The rule of joy and the law of duty seem to me all one. I confess that altruistic and cynically selfish talk seem to me about equally unreal. With all humility, I think ‘Whatsoever thy hand findeth to do, do it with thy might,’ infinitely more important than the vain attempt to love one’s neighbor as one’s self. If you want to hit a bird on the wing, you must have all your will in a focus, you must not be thinking about yourself, and, equally, you must not be thinking about your neighbor; you must be living in your eye on that bird. Every achievement is a bird on the wing.”

How to get an accurate recipe from your grandmother

If your grandmother is anything like mine, she has an incredible repertoire of recipes from the old country which involve a set of ingredients and no measurements whatsoever. Everything is by feel, by sight, by years of experience rather than lists of precise numbers. It’s amazing, but hard to learn from.

I like numbers. They are clean in my head, and lead to reproducible results. I also like chicken paprikash and palacsinta and all sorts of delicious things that my grandmother cooks so well, and want to be able to make them myself.

This came up in conversation yesterday, when a fellow I was chatting with mentioned that he has the same problem getting accurate recipes from his mother. I told him my trick for getting accurate recipes from my grandmother, and it occurs to me this morning that I’ve never written it out before and probably ought to share.

It’s simple, really. I collect multiple data points for each recipe by asking my grandmother (and my mother, when she can remember) for the same recipe multiple times on different days and times of day. I push them each time to just give me their best guess at what the measurements are, based on their memory of what they do by feel and sight, and write down what they tell me. This can involve some hand-holding, and tends to go rather like this:

My grandmother: “You put in some paprika.”
Me: “How much paprika?”
Her: “Until it looks right.”
Me: “Is it more than a cup of paprika?”
Her: “Oh, no no no.”
Me: “Is it more than half a cup of paprika?”
Her: (longer pause, then) “Noooo.”
Me: “Is it just a teaspoon? That can’t be right. The taste is too strong.”
Her: “About three tablespoons, maybe. More if you need it.”
And so on.

After a few iterations of this (three per family member seems to both work and not try their patience too much), I average out my data as follows: For each ingredient where there is a mode (a measurement that appears more often than any other measurement), I take the mode, and for each other ingredient, I take the mean.
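For the number-minded, that averaging rule fits in a few lines of Python. This is a minimal sketch: the function name and the sample figures are made up for illustration, and it assumes every guess for an ingredient has already been converted to the same unit.

```python
from collections import Counter
from statistics import mean


def consensus(measurements):
    """Return the one value that appears more often than any other, else the mean."""
    counts = Counter(measurements).most_common()
    top_count = counts[0][1]
    leaders = [value for value, n in counts if n == top_count]
    if len(leaders) == 1:
        return leaders[0]  # a clear mode exists
    return mean(measurements)  # no single winner, so average the guesses


# Hypothetical tablespoons of paprika from three askings
paprika = consensus([3, 3, 2])
```

Here two of the three answers agree, so the mode wins; had all three differed, the result would be their mean.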

That’s it, really. Dead simple, if you have the sort of family where you’re encouraged to be a loving but pushy nudge as needed. But it works! I can cook amazing, authentic Hungarian food in the style of the old ladies of Tarpa and Kisar this way! So, go forth and gather awesome recipes. Then come back and teach me them! I can always use more awesome recipes.

Oh, and if that made you hungry, here’s my approximation of my Hungarian grandmother’s recipe for stuffed cabbage.

Nantucket: an accidental limerick detector

How can you not name an accidental limerick detector Nantucket? I think there may be laws about that. Point being, I wrote a program that takes any text and tries to find any accidental limericks that might be hiding within (based on syllable counts and rhyme, ignoring punctuation and intent).

Limericks have a fairly loose form. The rhyme scheme is always AABBA, but the syllable count can be anything along the lines of 7-or-8-or-9/7-or-8-or-9/5-or-6/5-or-6/7-or-8-or-9. And as if that weren’t loosey-goosey enough, they can have either anapaestic meter (duh-duh-DUM, duh-duh-DUM) or amphibrachic meter (duh-DUM-duh, duh-DUM-duh)!

So, rather than write a limerick detector that looks for anything even remotely resembling a limerick, I chose the following as the canonical example to work from (at least for now):


There was a young student from Crew
Who learned how to count in base two.
His sums were all done
With zero and one,
And he found it much simpler to do.

Going by the canonical example above, Nantucket is set to look for limericks that are AABBA (rhyme scheme) and 8/8/5/5/9 (syllable count per line). It currently ignores meter, but I may add that requirement in later. It also only looks at words through an American English accent.

How to implement such a thing? Well, I started simple – I used the CMU Pronouncing Dictionary (“cmudict”) as accessed through NLTK (Python’s natural language toolkit) to get each word’s syllable count and pronunciation, and just had Nantucket give up on any potential limericks that included words not in cmudict. It tokenizes the text, then walks through each word in the text in turn, checks to see if there’s a limerick starting with that word, and gives up if it hits a word it can’t analyze or a limerick form violation (such as a word that overflows the end of a limerick line, or a rhyme scheme problem).
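The walk-and-give-up loop described above can be sketched roughly as follows. This is a simplified illustration, not Nantucket's actual code: the names fit_lines and find_candidates are hypothetical, the syllable counter is passed in as a function, and the rhyme check that would follow a successful layout is left out.

```python
PATTERN = (8, 8, 5, 5, 9)  # syllables per line, from the canonical example


def fit_lines(words, start, nsyl):
    """Try to lay out words[start:] as five lines matching PATTERN.

    nsyl maps a word to its syllable count, or None if the word is unknown.
    Returns a list of five word lists, or None on any form violation.
    """
    lines, i = [], start
    for target in PATTERN:
        line, total = [], 0
        while total < target:
            if i >= len(words):
                return None  # ran out of text
            n = nsyl(words[i])
            if n is None:
                return None  # a word we can't analyze
            total += n
            line.append(words[i])
            i += 1
        if total != target:
            return None  # a word overflowed the end of the line
        lines.append(line)
    return lines


def find_candidates(words, nsyl):
    """Yield every five-line layout found, starting from each word in turn."""
    for start in range(len(words)):
        lines = fit_lines(words, start, nsyl)
        if lines is not None:
            yield lines
```

Each candidate would then be kept only if the last words of lines 1, 2, and 5 rhyme with each other and the last words of lines 3 and 4 rhyme with each other.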

(It turned out I had to know Python or Java for the Stanford Natural Language Processing class I started two weeks ago. Er. Okay! This came in very handy as soon as I decided to build Nantucket last week – apparently Python has better natural language processing (“NLP”) libraries than Ruby, so it was absolutely the right choice for this project.) (I want Python’s scientific libraries, but with RSpec. Is that too much to ask?)

So, that was pretty fast to throw together. I defined a rhyme as being any pair of words which have identical last vowel sound plus all sounds following the last vowel sound. That’s not precisely right in American English, but it’s pretty damn close. This was easy to do with cmudict, because every vowel phoneme in cmudict ends with a digit denoting stress, but consonant phonemes never include digits.
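That stress-digit convention makes both jobs tiny. A minimal sketch, using pronunciations typed by hand in cmudict's style rather than loaded through NLTK, with hypothetical function names:

```python
def count_syllables(phonemes):
    """One syllable per vowel phoneme; only vowels carry a trailing stress digit."""
    return sum(1 for p in phonemes if p[-1].isdigit())


def rhyme_part(phonemes):
    """The last vowel phoneme plus everything after it, or None if there is no vowel."""
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i][-1].isdigit():
            return phonemes[i:]
    return None


# Hand-typed examples in cmudict's style
crew = ["K", "R", "UW1"]
two = ["T", "UW1"]
student = ["S", "T", "UW1", "D", "AH0", "N", "T"]
```

Under the definition above, "crew" and "two" rhyme because their rhyme parts are identical, while "student" counts as two syllables because it contains two digit-bearing phonemes.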

So, that was a long bit of geekery. Time for a poetry break!

from Proust’s Swann’s Way:

bad conduct should deserve Was I
then not yet aware that what I
felt myself for her
depended neither
upon her actions nor upon my

was wonderful to another
How I should have loved to We were
unfortunate to
a third Yes if you
like I must just keep in the line for

to abandon the habit of
lying Even from the point of
view of coquetry
pure and simple he
had told her can’t you see how much of

That was a fun start, but I really wanted to handle the tons of words that just aren’t in cmudict. I tested Nantucket on James Joyce’s Ulysses repeatedly throughout this process, to come up with fantastic lists of words I was failing to catch at every step along the way.

I like to break projects down into small enough chunks that I get a sense of accomplishment frequently enough to keep myself motivated to keep on pushing forward. Here, words naturally fell into two categories – words which land in the middle of a line in a potential limerick, and words which land at the end of a line of a potential limerick. Syllable count matters for both, but I only really care about pronunciation for words that have to rhyme.

Next step was to create a function that determines the approximate number of syllables in any given word. A syllable contains a vowel sound, which can map to anywhere from 0 to 3 vowel graphemes (letters, not sounds) (aeiouy). I figured I could count the vowel grapheme groups, add 1 for each grouping of more than one vowel grapheme that isn’t a common digraph (group of graphemes that generally signifies a single phoneme (sound)), add 1 for the common apostrophe-in-place-of-vowel instances, and subtract 1 for the circumstances where ‘ed’, ‘es’, or ‘e’ are likely silent at the end of a word, and get pretty close.

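import re

def approx_nsyl(word):
    digraphs = {"ai", "au", "ay", "ea", "ee", "ei", "ey", "oa", "oe", "oi", "oo", "ou", "oy", "ua", "ue", "ui"}
    # Ambiguous, currently split: ie, io
    # Ambiguous, currently kept together: ui
    word = word.lower()
    # Vowel grapheme groups; drop the empty strings re.split leaves at the edges.
    groups = [g for g in re.split(r"[^aeiouy]+", word) if g]
    count = len(groups)
    # A multi-vowel group that isn't a common digraph probably spans two syllables.
    count += sum(1 for g in groups if len(g) > 1 and g not in digraphs)
    # Endings where 'ed', 'es', or 'e' are likely silent, plus 'ion'/'ious',
    # which the split above overcounts.
    if re.search(r"(?<=\w)(ion|ious|(?<!t)ed|es|[^lr]e)(?![a-z']+)", word):
        count -= 1
    # Apostrophe standing in for a vowel, as in "could've" or "didn't".
    if re.search(r"'ve|n't", word):
        count += 1
    return count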

With that in place, I had Nantucket keep going when it came across non-cmudict words that fell in the middle of potential limerick lines, but give up on any potential limerick that hit a non-cmudict word that would fall at the end of a limerick line and have to rhyme. Better, but not good enough. I needed a way to figure out when words not in cmudict rhyme!

Grapheme-to-phoneme conversion (“g2p”), or converting a list of letters into a corresponding list of sounds, is a fascinating and non-trivial problem. I spent a lot of time falling down the rabbit hole of reading intriguing papers on how to design machine-learning algorithms that can take context into account when doing g2p. There are two big problems with g2p in American English. First, there’s the issue of alignment – there just isn’t a consistent correspondence between the number of graphemes and the number of phonemes. And second, there’s the issue of context – the same grapheme (or group of graphemes) can correspond to a different phoneme depending on the other graphemes in the word (think about the “pol” in “politics” (ah) versus “political” (uh)), or even whether the word is a verb or a noun (the “de” in “defect”, for instance (dee versus deh)).

At a glance, I figured I could do a rough pass at the problem by creating a last-syllable dictionary based on cmudict, which I promptly did. It was actually surprisingly helpful, but still missed a lot of words, and wasn’t as accurate as I wanted it to be. (More on this later.) So, I kept thinking and reading about the problem.

Poetry break!

from James Joyce’s Ulysses:

grace about you I can give you
a rare old wine that’ll send you
skipping to hell and
back Sign a will and
leave us any coin you have If you

then he tipped me just in passing
but I never thought hed write making
an appointment I
had it inside my
petticoat bodice all day reading

meant till he put his tongue in my
mouth his mouth was sweetlike young I
put my knee up to
him a few times to
learn the way what did I tell him I

The most promising idea I came across is implemented as free software already as Sequitur (that link links to the paper it’s based on, but you can also find a free pre-publication copy of the manuscript here). It uses expectation maximization and Viterbi training to create a model for g2p for any language, given a suitable dictionary to train from first. (It argues that its method is better than other methods I came across earlier, which generally involved hidden Markov models or, in the simplest promising paper I read, going from right to left and looking at three graphemes to either side for context.)

So, I installed Sequitur and gave it a whirl, training and testing it with huge portions of cmudict. It took hours to train, though, and I ultimately wasn’t thrilled with its accuracy. (To be fair, I only tested its accuracy overall, not its accuracy for last syllables only).

It occurred to me during this process that my needs were actually simpler than that. I don’t need to be able to do g2p for complete words in order to have a functional limerick detector. I only need to do g2p for the last syllable of each end-of-limerick-line word. That simplifies my alignment problem right off the bat, because I start by aligning my grapheme list and my phoneme list to the right (as in the second paper I link to above), and I don’t go far enough to the left for them to have much opportunity to become misaligned to any relevant extent. It also resolves a large portion of my context problem – there’s less flexibility in last-syllable graphones (pairs of graphemes and phonemes) than in entire-word graphones.

As completely fascinated by machine learning algorithms as I am, I decided to go back to the last-syllable dictionary I’d created and see if I could attack the problem by improving its accuracy instead.

I created it by going through every word in cmudict and pulling out what looked like the last syllable worth of graphemes and the last syllable worth of phonemes, catching the graphemes with:

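# re.IGNORECASE replaces the inline (?i), which newer versions of Python reject when it isn't at the start of the pattern.
graphemes = re.search(r"([BCDFGHJKLMNPQRSTVWXZ]{1,2}[AEIOUY]+[BCDFGHJKLMNPQRSTVWXZ]*(E|ED)?('[A-Z]{1,2})?)(?![a-zA-Z]+)", word, re.IGNORECASE).group()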

And catching the phonemes with:

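# vals holds every cmudict pronunciation for the word; use the shortest one.
val = min(vals, key=len)
i = -1
# Walk backwards to the last vowel phoneme (vowels end in a stress digit).
while i >= -len(val):
    if val[i][-1].isdigit():
        last_syl = " ".join(val[i:])
        break
    i -= 1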

To improve my accuracy, after a few iterations I chose to grab up to 2 consonants prior to my best estimate of the last vowel sound in the word, and to include copies without the first and without either in my list. For example:

clotted AH0 D
lotted AH0 D
otted AH0 D
ders ER0 Z
ers ER0 Z
rs ER0 Z
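Going only by the sample lines above, each entry appears alongside copies with one and then two leading letters removed, all paired with the same phonemes. A minimal sketch of that expansion, assuming plain slicing is all that is going on (the function name is hypothetical, and the real script may differ):

```python
def suffix_variants(graphemes, phonemes):
    """Pair a last-syllable spelling, and its copies minus one and two leading letters, with the same phonemes."""
    return [(graphemes[k:], phonemes) for k in range(3) if len(graphemes) > k]


for spelling, sounds in suffix_variants("clotted", "AH0 D"):
    print(spelling, sounds)
```

The shorter copies give a novel word more chances to match something in the dictionary, at the cost of less context for each match.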

This gave me gleeful flashbacks to a childhood parlor trick that my brothers and I can still all do in unison, where we chant at blazingly fast top speed: “everybody-verybody-erybody-rybody-ybody-body-ody-dy-y!”

Poetry break!

from Dostoevsky’s The Brothers Karamazov:

eyes with a needle I love you
I love only you Ill love you
in Siberia
Why Siberia
Never mind Siberia if you

are children of twelve years old who
have a longing to set fire to
something and they do
set things on fire too
Its a sort of disease Thats not true

and be horror struck How can I
endure this mercy How can I
endure so much love
Am I worthy of
it Thats what he will exclaim Oh I

Anyways, once I had a modified set of last syllable graphones (pairs of letter lists and sound lists), I used some sweet little command line tools to sort the results into a list of unique types with frequency counts, like so:

sort < suff_a.txt | uniq -c | sort -nr > suff_b.txt

My last step in creating my cmudict-based last-syllable dictionary (“suffdict”) was to keep only the most likely set of phonemes for each unique set of graphemes, and reformat the list to match cmudict’s format so I could just use NLTK’s cmudict corpus reader for my suffdict as well. I did that like so:

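import re

def most_prob(path):
    # The input is sorted by descending frequency, so the first time a grapheme
    # set shows up is its most common pronunciation.
    seen = set()
    with open(path) as src, open('suff_c.txt', 'a') as goal:
        for line in src:
            suff = re.search(r"\s[a-zA-Z']+\s", line).group()
            if suff not in seen:
                seen.add(suff)
                # Drop the count added by uniq -c, then insert the " 1 " field
                # so the line matches cmudict's format.
                new_line = re.sub(r"\d+\s(?=[a-z])", "", line)
                new_line = re.sub(r"(?<=[a-z])\s(?=[A-Z])", " 1 ", new_line).strip()
                goal.write(new_line + '\n')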

The above code catches only the first instance of any given grapheme set, which gives me the most probable instance, because I'd already sorted everything in order of highest to lowest number of occurrences.

Now when checking for last-syllable phonemes for a word not in cmudict, I use the same regex I used when creating suffdict to check whether the last syllable worth of graphemes from that novel word is in suffdict, and if so, return the corresponding last syllable worth of phonemes from suffdict. If not, try without the first letter, or without the s or 's at the end.
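That fallback chain might look something like the sketch below. It assumes suffdict has been loaded into a plain dict mapping a last-syllable spelling to a phoneme list; the function name, the order in which the fallbacks are tried, and the sample entries are all guesses for illustration. The post does not say how the final sound is handled after a trailing s is dropped, so this sketch simply returns whatever the shorter spelling maps to.

```python
def lookup_suffix(graphemes, suffdict):
    """Look up last-syllable phonemes, retrying with simpler spellings on a miss."""
    # Try the spelling as-is, then without its first letter.
    candidates = [graphemes, graphemes[1:]]
    # Then try without a trailing 's or s.
    for ending in ("'s", "s"):
        if graphemes.endswith(ending):
            candidates.append(graphemes[:-len(ending)])
    for candidate in candidates:
        if candidate in suffdict:
            return suffdict[candidate]
    return None  # nothing matched, so the caller gives up on this limerick


# Hypothetical entries in the style of the list above
sample_suffdict = {"otted": ["AH0", "D"], "ers": ["ER0", "Z"]}
```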

Fantastic! I was then able to test my accuracy by running every word in cmudict through cmudict and through my suffdict, and then seeing whether the resulting phoneme lists rhymed.

Poetry break!

from Mark Twain's Huckleberry Finn:

he suspicion what we're up to
Maybe he won't But we got to
have it anyway
Come along So they
got out and went in The door slammed to

and see her setting there by her
candle in the window with her
eyes towards the road and
the tears in them and
I wished I could do something for her

I went through a few iterations of tweaking my suffdict creation method to eke a few extra percentage points of accuracy out of it.

My first attempt, which looked at every possible pronunciation for every word in cmudict when creating my suffdict, gave me 80.29% accuracy.

Next, I realized that since I was always using the shortest possible pronunciation when running words through cmudict, I should probably only look at shortest possible pronunciations when creating my suffdict, for consistency's sake. That brought me up to 82.20% accuracy.

After that, it occurred to me that if I included up to 2 consonants at the beginning of each last-syllable grapheme list in suffdict instead of just 1, I would have a bit more context and catch a few more words correctly. Which I did! That got me up to 85.78% accuracy.

This was getting pretty exciting, but still not as good as I wanted it to be. I thought about what I was doing, and decided to revisit the function I'd written that checks whether lists of phonemes (in the cmudict-style, but from any source) rhyme. Maybe the problem was there instead.

And, oh, yes! My rhyme_from_phonemes function was checking to see whether the last vowel phoneme and all phonemes thereafter were identical in both words - really, truly identical, that is. It disqualified even pairs of words that had exactly the same sounds but different stress patterns. This might make sense if I were paying attention to meter or defining rhyming differently, but it wasn't really what I was going for here at all. So, I rewrote that function to ignore the digit of the last vowel phoneme (which denotes stress only) and instead check whether it and all following phonemes were otherwise identical, like so:

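def rhyme_from_phonemes(list1, list2):
    # Walk backwards from the end of list1 to its last vowel phoneme
    # (in cmudict, vowels are the phonemes that end in a stress digit).
    i = -1
    while i >= 0 - len(list1):
        if list1[i][-1].isdigit():
            # Compare the vowel without its stress digit, then every phoneme after it.
            if i >= 0 - len(list2) and list1[i][:-1] == list2[i][:-1] and (i == -1 or list1[i + 1:] == list2[i + 1:]):
                return True
            else:
                return False
        i -= 1
    # No vowel phoneme found, so there is nothing to rhyme on.
    return False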

That brought me up to 90.85% accuracy.

That feels pretty good, for my purposes!
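To see what ignoring the stress digit buys, here is a small self-contained sketch that compares a strict check with a stress-insensitive one. It is a compact restatement of the idea, not the function above: the helper names are made up for illustration, and the phoneme lists are hand-typed in cmudict style rather than looked up.

```python
def last_vowel_index(phonemes):
    """Index of the last vowel phoneme (vowels end in a stress digit), or None."""
    for i in range(len(phonemes) - 1, -1, -1):
        if phonemes[i][-1].isdigit():
            return i
    return None


def rhymes_strictly(list1, list2):
    """Old behaviour: the last vowel (stress digit included) and everything after must match."""
    i1, i2 = last_vowel_index(list1), last_vowel_index(list2)
    if i1 is None or i2 is None:
        return False
    return list1[i1:] == list2[i2:]


def rhymes_ignoring_stress(list1, list2):
    """New behaviour: same comparison, but the last vowel's stress digit is dropped first."""
    i1, i2 = last_vowel_index(list1), last_vowel_index(list2)
    if i1 is None or i2 is None:
        return False
    tail1 = [list1[i1][:-1]] + list1[i1 + 1:]
    tail2 = [list2[i2][:-1]] + list2[i2 + 1:]
    return tail1 == tail2


# Illustrative, hand-typed pronunciations (assumptions, not dictionary lookups).
city = ["S", "IH1", "T", "IY0"]
bee = ["B", "IY1"]
cat = ["K", "AE1", "T"]

print(rhymes_strictly(city, bee))         # False: IY0 vs IY1
print(rhymes_ignoring_stress(city, bee))  # True: both end in IY
print(rhymes_ignoring_stress(city, cat))  # False
```

The only difference between the two checks is whether the digit on the last vowel takes part in the comparison, which is exactly the change described above.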

I was now catching limericks with novel words at the end of lines. Poetry break time!

from Genesis:

in the iniquity of the
city And while he lingered the
men laid hold upon
his hand and upon
the hand of his wife and upon the

Amorite and the Girgasite
And the Hivite and the Arkite
and the Sinite And
the Arvadite and
the Zemarite and the Hamathite

I took a moment after that to make the basic limerick-finding algorithm a bit faster. My first draft was intentionally simple but inefficient, in that it started fresh for each word in the text, instead of saving the syllable counts and phonemes for words checked on previous limerick attempts. It had to re-analyze a word each time that word was encountered.

That worked well enough to let me get to the interesting g2p problem quickly, but once I was reasonably satisfied with my suffdict, I wanted to refactor to make the whole thing more efficient. The current version holds the phonemes and syllable count of each word encountered in a dict, so it can grab them quickly from that dict the next time they're encountered instead of having to figure them out from scratch again and again as it goes through the text.

I have some thoughts on increasing the efficiency further (by having it skip forward more intelligently whenever it hits a word it can't find phonemes for, for instance), but really, it's at a good enough stage that I wanted to share some accidental limericks with you all already!

Kenobi: a naive Bayesian classifier for Ask Metafilter

I built Kenobi as a way to get my feet wet with machine learning.

A fellow Hacker Schooler had mentioned the idea of a naive Bayesian classifier to me, and my ears perked up – Bayes! Hey, I know Bayes’ Theorem! It’s a generally useful simple equation that helps you figure out how much or how little to take new evidence into account when updating your sense of the probability of something or other.

The basic idea is:

Wait, no, the basic idea is that evidence doesn’t exist in a vacuum. Bayes’ Theorem is a way of quantifying how to look at new evidence in the context of what we know already and understand how we should weigh it when taking it into account, and how to determine more accurate probabilities and beliefs given the evidence we have to work from. If you’re looking for a more detailed understanding, I highly recommend reading Yudkowsky’s particularly clear explanation of Bayes’ Theorem.

(I went to a rationalist party once. Some guy asked me, “Are you a rationalist?” The friend who’d dragged me to the party interrupted with, “Well, she’s not not a rationalist!” And there you have it, I suppose.)

So, that seemed like fun. I’d just finished working on a card game implementation that can run simulations of strategies to help my partner with his game design (Greenland), and was ready for a new project. But a spam filter seemed dull – it’s been done before. Repeatedly. So, what to do?

I’m a huge fan of Ask Metafilter, a community where folks ask questions (shocking, no?) and answer questions asked by others. My fabulous brother got there first, and I appreciate that he dragged me in with him. It can be a bit overwhelming, though. I don’t really have the time to skim through all the questions that get posted, especially since so many of them are about things where I have no useful information or advice to give. It sure would be helpful if something pared the list down to only the questions where my answers would be most likely to actually help others, right? Right!

Kenobi was a perfect combo project for me. I got to explore machine learning, use some of the skills I picked up at the awesome ScraperWiki class on web scraping I took a while back, and create a tool I’d actually use to improve my ability to help others. Right on.

So, what does Kenobi actually do?

Kenobi has two basic functions: analyzing answers you’ve already posted to old AskMeFi questions, and classifying new questions for you to pick out the ones you can answer best.

To analyze your old AskMeFi data, Kenobi:

  1. deletes outdated training data for you from its own database, if any;
  2. logs into Metafilter under a spare account I created for this purpose, because one can’t see favorite counts in user profile comment data unless one is logged in;
  3. searches to find the user ID number associated with your username;
  4. scrapes the answers section of your profile for the above-the-cut text of each question you’ve answered, and whether or not you’ve received at least one favorite on the answer(s) you posted to that question;
  5. separates the old questions into a “should answer” group (those where your answer(s) did get at least one favorite) and a “should NOT answer” group (those where your answer(s) didn’t get any love);
  6. organizes and saves the data from each group (“should answer” and “should NOT answer”) to refer back to when classifying new questions;
  7. compresses the data to save space in the database; and
  8. emails you to let you know that training is done, if you submitted an email address (highly recommended).

To classify new AskMeFi questions for you, Kenobi:

  1. clears out your last batch of results, if any;
  2. parses the Ask Metafilter RSS feed for above-the-cut question text and URLs for the n most recent questions;
  3. decompresses the data it has on you into memory;
  4. for each question, determines the probability that you should answer it and the probability that you should NOT answer it, based on Bayes’ Theorem and your old answer data;
  5. for each question, if the odds that you should answer it are at least 4.5 times higher than the odds that you should NOT answer it, classifies that question as good for you to answer;
  6. saves and displays all and only the new questions that are classified as good for you to answer.
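The training and classifying steps above can be sketched roughly as follows. Kenobi itself is written in Ruby and its code isn't shown here, so treat this Python version as a generic naive Bayes illustration under assumptions: the tokenizer, the add-one smoothing, the function names, and the sample questions are all made up, and "4.5 times higher" is read as a ratio between the two scores.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(questions):
    """Count how often each word appears across one group of questions."""
    counts = Counter()
    for question in questions:
        counts.update(tokenize(question))
    return counts


def log_score(words, counts, n_docs, total_docs, vocab_size):
    """log P(group) plus the sum of log P(word | group), with add-one smoothing."""
    total_words = sum(counts.values())
    score = math.log(n_docs / total_docs)
    for word in words:
        score += math.log((counts[word] + 1) / (total_words + vocab_size))
    return score


def should_answer(question, good_qs, bad_qs, threshold=4.5):
    """True if the 'should answer' score beats the 'should NOT answer' score by the threshold.

    Assumes both groups contain at least one question.
    """
    good, bad = train(good_qs), train(bad_qs)
    vocab_size = len(set(good) | set(bad))
    total_docs = len(good_qs) + len(bad_qs)
    words = tokenize(question)
    good_score = log_score(words, good, len(good_qs), total_docs, vocab_size)
    bad_score = log_score(words, bad, len(bad_qs), total_docs, vocab_size)
    # A ratio of probabilities >= threshold is a difference of logs >= log(threshold).
    return good_score - bad_score >= math.log(threshold)


# Made-up training data: questions whose answers were favorited vs. not.
favorited = ["How do I keep bees in the city?", "What honey bees forage on trees?"]
unfavorited = ["Which laptop should I buy?", "Help me fix my laptop screen"]

print(should_answer("Do honey bees forage on city trees?", favorited, unfavorited))  # True
print(should_answer("Help me fix my laptop", favorited, unfavorited))                # False
```

Working in logs avoids multiplying lots of tiny probabilities together, and the smoothing keeps a single never-before-seen word from zeroing out a whole score.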

Why do the odds that I should answer a question have to be 4.5 times higher than the odds that I should NOT answer that question, for Kenobi to classify it as good for me?

Because when I left the threshold lower, people were getting too many questions that didn’t seem like good fits to them. With a higher threshold, some folks may not get any results at all (sorry!), but people who’ve answered enough past AskMeFi questions to give good data to work from will get much more accurate results.

The closer the two probabilities are, the less confident we can be that we’ve really found a good match and that the question really is a good one for you to answer. It only makes sense to select a question for you when the odds that it’s the kind of question you’re good at answering are significantly higher than the odds that it isn’t.

Why all that compressing and decompressing?

I wrote Kenobi up as a pure Ruby command line tool first, then decided it would be fun to quickly Rails-ize it so more people would be able to play with it more easily. That meant finding a place to deploy it, as easily and cheaply as possible.

Heroku (my host) charges for databases over 5mb. I love you all, but not enough to spend money on you if I don’t have to. I’m trying to be as efficient as possible here, in hopes of not going over and having to actually spend money on this project if I can possibly avoid it.
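For a sense of what that compress-then-store round trip can look like, here is a minimal Python sketch. Which format and compression library Kenobi actually uses isn't stated here, so JSON plus zlib is purely an assumption for illustration, as are the function names and the fake word counts.

```python
import json
import zlib


def pack(word_counts):
    """Serialize a word-count dict to JSON and compress it for storage."""
    raw = json.dumps(word_counts, sort_keys=True).encode("utf-8")
    return zlib.compress(raw)


def unpack(blob):
    """Decompress a stored blob back into the original word-count dict."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Fake training data: 500 words with small counts.
counts = {"word%d" % i: i % 7 for i in range(500)}

blob = pack(counts)
raw_size = len(json.dumps(counts, sort_keys=True).encode("utf-8"))

print(len(blob) < raw_size)    # True
print(unpack(blob) == counts)  # True
```

Repetitive data like word counts tends to compress well, which is what makes the extra step worthwhile when the database has a size cap.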

Why the wait while Kenobi analyzes my old data?

A few reasons!

First, one can’t actually effectively search Metafilter for a user by name or see favorite counts on the list of a user’s past answers in their profile unless one is logged into Metafilter. Metafilter doesn’t even have an API to work with. It does have info dumps, but they’re huge and not updated regularly.

This means that Kenobi has to arduously scrape and parse the html for Metafilter whenever it analyzes old data for a new user. And it has to actually log into the site and click through as a logged-in user to do so, which it does using a gem called Mechanize.

I set the scraping up as a background task with Delayed_Job and set Kenobi up to email people when ready, so no one has to sit around staring at an error message or colorful spinner while waiting for their analysis to come up in the job queue and get done. This means that there are no more HTTP timeout errors, but it also means that your analysis job goes to the end of the queue, however long it may be.

Also, Heroku charges for worker dynos, which are needed to actually run the background processes piling up in that job queue. They charge by the second. (Seriously). But that includes all the time the worker spends sitting around waiting for a job to exist, not just the time it spends actually working on jobs.

This was just a learning project, not something I actually expect to earn anything from or want to pay anything for. So, I spent a bunch of time messing around with a nifty tool called Workless and learning how to have Heroku automatically scale a single worker dyno up and down as jobs are added and completed, so I can pay for as little time as possible.

This slows things down for you even more, because not only are you waiting for the scraping to get done, you’re actually waiting for Heroku to start up a new worker dyno to start working on the scraping before it can get done.

Sorry about that! If you care a lot for some reason, email me and we can commiserate or something.

Wait, so Kenobi picks out questions where my answers will help others, not questions that help me directly?

That’s right! Kenobi’s selections are based on each new question’s similarity to the past questions to which your answers have been favorited by others, and dissimilarity to the past questions where your answers got no love. It doesn’t pay attention to what you’ve favorited – only to which of your answers have been favorited by other people. It doesn’t really care about your interests at all, other than your interest in being popular, er, of use to others.

Have fun!

Considerating

Considerate or creepy? It can be hard to tell, sometimes! My brother Josh asked me to be his “empathy sherpa” and help him navigate that blurry grey line between sweet and skeezy (or maybe that’s a different project?), so I built Considerating for him while warming up for Hacker School.

Considerating is a simple concept. Each consideration comes with a slider (or dropdown, on mobile browsers) that lets you vote – where on the range between considerate and creepy does this idea fall? You can sign in with Google oauth to submit new considerations of your own. After each vote, the graph is recalculated and redrawn to accurately reflect the updated results.

I did all the coding, and Josh and I collaborated on the design and UI. It was really fun to finally work on a project like this with him! I mean, this is my little brother, the kid who once tried to evict me from my bedroom back when we were young by taping a sign to my door while I was out signed by the “MGMT” – and the person who can most consistently answer correctly when I call him out of the blue to ask, “Hey, what’s that word I’m forgetting?”

My favorite part of this project was the little bit of javascript that makes those whoopety whoopety graphs work out. The code is all here, but this is the particularly fun bit:

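function draw(points) {
  var canvas = document.getElementById('graph<%= @consideration.id %>');
  // Scale against the tallest bar; fall back to 1 so an all-zero graph doesn't divide by zero.
  var highest = Math.max.apply(Math, points) || 1;
  if (canvas.getContext){
    var ctx = canvas.getContext('2d');
    ctx.strokeStyle = "#000000";
    ctx.lineJoin = "round";
    ctx.lineWidth = 2;
    ctx.beginPath();
    // Start on the baseline (y = 115) and ease up to the first point.
    ctx.moveTo(0,115);
    ctx.bezierCurveTo(20, 115,
        20, 115-(points[0]/highest)*100,
        40, 115-(points[0]/highest)*100);
    // One curve per remaining bar: both control points sit halfway between
    // the two bars, at the heights of the previous and the next point.
    for (var i=1; i<10; i++) {
      ctx.bezierCurveTo(20+(i*40), 115-(points[i-1]/highest)*100,
          20+(i*40), 115-(points[i]/highest)*100,
          40+(i*40), 115-(points[i]/highest)*100);
    }
    // Ease back down to the baseline after the last point.
    ctx.bezierCurveTo(420, 115-(points[9]/highest)*100,
        420, 115,
        440, 115);
    ctx.shadowColor="black";
    ctx.shadowBlur=1;
    ctx.stroke();
  }
}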

Imagine the graph as a set of 10 bars. The drawing function takes an array of 10 values, and creates those smooth curves between points set at the intersection of each bar (x-axis) and the number of votes that value has received, scaled appropriately (y-axis).
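The geometry is easier to see with the canvas stripped away. Here's a small Python sketch that computes the same control points as plain numbers. The function name and parameters are made up for illustration; the defaults (40 pixels per bar, a baseline at y = 115, a maximum height of 100) are read off the drawing code, and the divide-by-zero guard is an added assumption.

```python
def curve_segments(points, bar_width=40, baseline=115, max_height=100):
    """Return one (control1, control2, end) triple per bezier segment.

    The path is assumed to start at (0, baseline), as in the canvas code.
    """
    highest = max(points) or 1  # avoid dividing by zero when every bar is empty
    ys = [baseline - (p / highest) * max_height for p in points]
    half = bar_width / 2

    segments = []
    # Ease up from the baseline to the first point.
    segments.append(((half, baseline), (half, ys[0]), (bar_width, ys[0])))
    # Between neighbouring points, both control points sit at the horizontal
    # midpoint: the first at the previous height, the second at the next height.
    for i in range(1, len(ys)):
        mid_x = half + i * bar_width
        end_x = bar_width + i * bar_width
        segments.append(((mid_x, ys[i - 1]), (mid_x, ys[i]), (end_x, ys[i])))
    # Ease back down to the baseline after the last point.
    last_mid = half + len(ys) * bar_width
    segments.append(((last_mid, ys[-1]), (last_mid, baseline),
                     (bar_width + len(ys) * bar_width, baseline)))
    return segments


votes = [1, 2, 4, 2, 1, 0, 0, 3, 4, 1]
for control1, control2, end in curve_segments(votes):
    print(control1, control2, end)
```

Because each control point shares its height with the endpoint nearest to it, the curve leaves and arrives at every data point horizontally, which is what gives the graph its smooth rolling shape instead of sharp peaks.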

I'd played with Adobe Illustrator before, and had some sense of how bezier curves work. But it was a lot of fun to have to think through the math that would get me what I wanted from scratch, without being able to rely on click-and-drag visuals. I have a much more solid understanding of what bezier curve control points actually mean now, which I'm really happy about.

I still don't know if I can get away with chasing after strangers' toddlers, though. C'mon, internet, help me figure it out!

The best books I read in 2011

2011 was an amazing year in books for me. I hit a new record with my annual book list (I read 171 books last year!), and a much higher percentage of them than usual were awesome. In fact, I read so many great books last year that I split them up into categories for you here.

Books I loved reading in 2011 that related to decision-making and problem-solving:

  • Collapse: How Societies Choose to Fail or Succeed by Jared Diamond- Well, we’re fucked. Brilliant, brilliant book. Worth reading, but intensely depressing in a clear, logical sort of way. Diamond claims to be cautiously optimistic, and I’d like to believe him, but I’ve read too much Derrick Jensen to be totally convinced.
  • The Logic of Failure: Recognizing and Avoiding Error in Complex Situations by Dietrich Dorner- Analyzes tons of studies on how people try – and fail! – to handle complex situations. The main lessons I drew from this were – (a) You have to think about not only the problems you do have, but also the problems you DON’T have, because otherwise your solutions may well create new problems in the future; (b) feedback has a time lag, and unless you stick to tiny adjustments with delays to record feedback in between, you can easily end up ricocheting between extremes; (c) it’s hard to figure out the right amount of information to gather; (d) both overfocusing and ignoring complex details are extremely dangerous; and (e) learning the issues with dealing with complex problems doesn’t actually help you get better at handling them – only actual experience does that.
  • Complications: A Surgeon’s Notes on an Imperfect Science by Atul Gawande – Absolutely fantastic look at mistakes, the need for training and learning, social and ethical issues that interfere with training, cognitive errors, the difficulty in balancing our need for best health outcomes with our need for training students, and lots of surgical war stories along the way. I also highly recommend reading his essay Personal Best, where he discusses his decision to seek coaching to improve his surgical skills.
  • Honeybee Democracy by Professor Thomas D. Seeley- This was the most incredible book I’ve read in a long time. He describes the studies he performed in trying to determine how honeybee swarms decide on where their next home should be, and get everyone there together. Lots of insight into bees, but also into decision-making process design. Even if you’re not obsessed with bees as I am, this is a book well worth reading – it’s an eloquent depiction of how science is done, plus Seeley is very into the idea that we should learn more on how to manage group decision-making from the example set out by the bees.
  • Nudge: Improving Decisions About Health, Wealth, and Happiness by Richard H. Thaler and Cass R. Sunstein – Totally fascinating book about how to be a better choice architect, largely by adjusting incentives and defaults and making it just a bit easier for people to do what they in theory want to do anyway. Libertarian paternalism.

Books I loved reading in 2011 that related to race and class:

  • A Dry White Season by André Brink- A white schoolteacher in South Africa learns a bit about race, politics, discrimination, abuse of power, and privilege. Intense and difficult to read, especially in our current political climate.
  • Coyotes: A Journey Through the Secret World of America’s Illegal Aliens by Ted Conover- Gringo journalist decides to cross the border with Mexicans who migrate North for work each summer. Spends time on the journeys, spends time at the harvesting work, spends time driving [with, and also..] them across the country, spends time in their homes.
  • Mister Pip by Lloyd Jones- A story told by a black girl living on a tropical island in the middle of a civil war, where the one white man left volunteers as a teacher, except the only book he has to teach from is Great Expectations. I worried it would be all full of white supremacy bullshit, but ultimately I actually thought it was generally aware, sensitive, and interesting. Heart-breaking in moments. Really an excellent little book, overall.
  • Limbo: Blue-Collar Roots, White-Collar Dreams by Alfred Lubrano- Fascinating, useful book about class issues with people born to working class families who push themselves into the middle class. This book sparked a lot of ideas in me and moments of recognition when thinking about my family dynamics and history and issues I’ve had with others. Highly recommended.
  • The Immortal Life of Henrietta Lacks by Rebecca Skloot – The story of the person and family behind the HeLa cell line! I’d also suggest reading this interview with Skloot, where she explains her thoughts on structure in writing and how she chose to structure this book in particular.
  • Gang Leader for a Day by Sudhir Venkatesh – Sociologist does fieldwork by hanging out with a gang in the projects. The Freakonomics people love this guy. I can see why.

Books I loved reading in 2011 that related to art and design:

  • The New Drawing on the Right Side of the Brain by Betty Edwards- I went through all the exercises when I started my latest drawing kick, and I found them extremely helpful.
  • Artist’s Complete Problem Solver by Trudy Friend- This is basically one of the best drawing and painting books I’ve found yet. It’s particularly good in terms of very specific techniques and concepts to keep in mind when trying to figure out what to focus on in observing and drawing. Also, micro brushstroke exercises.
  • The Forgery of Venus by Michael Gruber- Drugs, hallucinations, art history, forgery, wonderful descriptions of technical processes of both forgery and painting generally. Absolutely lovely. Rather dark and fucked up in places, but in a beautiful way.
  • Making Comics by Scott McCloud- Someday I’ll illustrate a webcomic. Someday.
  • My Name is Red by Orhan Pamuk- A murder mystery and an exploration of artistic pride, cultural influences, morality, religion, and the meaning of style. Very nice.
  • The Non-Designer’s Design Book by Robin Williams- Reminded me of playing Set and of watching the parents figure out the layout for the old Fiske Terrace newsletter when I was a kid. Dead simple, basic stuff, but good important concepts to keep in mind.
  • Poemcrazy: freeing your life with words by Susan Wooldridge – If you love words, read this.

Books I loved reading in 2011 that related to Judaism:

  • Boychiks in the Hood: Travels in the Hasidic Underground by Robert Eisenberg- Deeply familiar and foreign at the same time. My people, and not my people. Which was perhaps the point. In point of reference, I was raised Conservative and have always identified as a Jewish agnostic.
  • All Other Nights by Dara Horn- Jews in the Civil War! Lying liars who lie! I adore Dara Horn.
  • The Book of the Unknown: Tales of the Thirty-Six by Jonathan Keats- This was really charming! It tells the stories of some of the lamed vavniks (in Jewish lore, these are the 36 just men and women who hold up the world, without realizing it or being acknowledged by others). Here are some of the lamed vavniks of a past era (there must be 36 of them alive in the world at any given time), and they are whores and thieves and golems and all sorts of unlikely personages. It was a good premise, nicely executed, and I particularly loved the rare pleasure of reading Jewish fairy tales that aren’t all about getting the king to promise not to kill all the Jews (only ha ha just kidding).
  • The Man in the White Sharkskin Suit by Lucette Lagnado- Memoir of a Jewish Egyptian woman whose family fled Cairo when she was just a child. Absolutely wonderful, with a personal discussion of cultural stresses and family relationships that did feel real to me. Possibly only interesting if you’re a diaspora Jew whose family had to flee countries in grave danger, though. Hard to tell, since for me, this was my family’s life.
  • City of Oranges by Adam LeBor- Fascinating history of Jews and Arabs in the old city of Yaffa. I’ve been there, and it was interesting to read this with my memories of walking through the area.
  • Lost Tribe: Jewish Fiction from the Edge edited by Paul Zakrzewski – I didn’t expect much of this, but in fact there were a lot of really spectacularly good stories in it! Overlook the kitsch of the title and concept, and you’ll find the good stuff. My oh my.

Other fiction I loved reading in 2011:

  • Santa Olivia and Saints Astray by Jacqueline Carey- Super cute YA with good queer character development and political exploration.
  • Slammerkin by Emma Donoghue- I really enjoyed this novel! It has a lot of human flaws and weakness, and shows not only the way fucked up systems fuck people up, but also the incredible awfulness that people are capable of. The protagonist is not a good person. You can’t quite figure out if you like her, even though you see the factors that went into messing her up and it’s hard to blame her for the first few. But later on, she’s making choices that you want to hate her for, again and again. But still in a sympathetic way. Really, just about all the characters are powerless in so many ways, and they take out the pain of their powerlessness on each other. The writing style just faded into the background and let me sink into the story, which I love.
  • Sea of Poppies by Amitav Ghosh – Just about the best fiction I’ve read in ages. Indian, complex, epic, poignant, fascinating. Lots of characters, but all are fleshed out and developed and weave in and out of each other’s lives. The focus on language and dialect is brilliant – it’s confusing at times, like the first time you read A Clockwork Orange, but you get the sense that so many of the characters are lost and confused that you’re supposed to be right there with them, and their languages are so defined by their backgrounds/lives/castes that it all comes together in a jumble as their society crackles around them. It’s killing me that the second book of this trilogy isn’t out in paperback yet, and the third isn’t out yet at all. I already want to reread the first.
  • A big bunch of Alastair Reynolds books, which are all thoroughly stuffed with interesting ideas and characters but are ultimately a bit tough to tell apart. He manages to write the same book over and over again without getting boring, which is a neat parlor trick in itself.

Other non-fiction I loved reading in 2011:

  • The Spirit Catches You and You Fall Down: A Hmong Child, Her American Doctors, and the Collision of Two Cultures by Anne Fadiman- An exploration of how cultural differences between Hmong and Americans interfere with access to health care, among other things.
  • Lust for Justice: The Radical Life & Law of J. Tony Serra by Paulette Frankl- Inspiring biography of a radical hippie brilliant criminal defense lawyer. I really want to read a collection of his summations!
  • The Devil in the White City: Murder, Magic, and Madness at the Fair that Changed America by Erik Larson- A history text written in an almost novel-ish prose style, about the 1893 Chicago World’s Fair and the serial killer operating in Chicago at the time.
  • China Witness: Voices From a Silent Generation by Xinran – I’ve been on a Chinese history reading kick. It’s just so huge, and there’s so much out there that I don’t have a clue about. Likened to Studs Terkel’s interview collections, this book is actually a fascinating set of conversations that offer insight into history and culture that I can’t seem to reach anywhere else.

A few sketches on the run

Man playing harmonica on the 2 train:

Another man on the train:

A lady on the phone at Starbucks:

Dave, asleep on the bus, hiding from the light in his hoodie: