
30 Mar 2012
Kenobi: a naive Bayesian classifier for Ask Metafilter

I built Kenobi as a way to get my feet wet with machine learning.

A fellow Recurser had mentioned the idea of a naive Bayesian classifier to me, and my ears perked up - Bayes! Hey, I know Bayes' Theorem! It's a generally useful simple equation that helps you figure out how much or how little to take new evidence into account when updating your sense of the probability of something or other.

The basic idea is:

P(A|B) = P(B|A) * P(A) / P(B)

Wait, no, the basic idea is that evidence doesn't exist in a vacuum. Bayes' Theorem is a way of quantifying how much weight new evidence deserves in the context of what we already know, so that we can arrive at more accurate probabilities and beliefs given the evidence we have to work from. If you're looking for a more detailed understanding, I highly recommend reading Yudkowsky's particularly clear explanation of Bayes' Theorem.
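To make that concrete, here's a minimal sketch of a single Bayesian update in JavaScript. The function name and the numbers are made up for illustration, and none of it comes from Kenobi itself.

```javascript
// Bayes' Theorem: P(H|E) = P(E|H) * P(H) / P(E),
// where P(E) = P(E|H) * P(H) + P(E|not H) * P(not H).
// The function name and all numbers below are illustrative assumptions.
function posterior(prior, pEvidenceGivenH, pEvidenceGivenNotH) {
  var pEvidence = pEvidenceGivenH * prior + pEvidenceGivenNotH * (1 - prior);
  return (pEvidenceGivenH * prior) / pEvidence;
}

// Suppose 20% of questions are ones I answer well, and the word "Ruby"
// shows up in 30% of those but in only 5% of all the others.
var p = posterior(0.2, 0.3, 0.05);
console.log(p.toFixed(2)); // prints "0.60"
```

Seeing "Ruby" moves the estimate from 20% to 60%: the evidence counts for a lot, but only in the context of the prior.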

(I went to a rationalist party once. Some guy asked me, "Are you a rationalist?" The friend who'd dragged me to the party interrupted with, "Well, she's not not a rationalist!" And there you have it, I suppose.)

So, that seemed like fun. I'd just finished working on a card game implementation that can run simulations of strategies to help my partner with his game design (Greenland), and was ready for a new project. But a spam filter seemed dull - it's been done before. Repeatedly. So, what to do?

I'm a huge fan of Ask Metafilter, a community where folks ask questions (shocking, no?) and answer questions asked by others. My fabulous brother got there first, and I appreciate that he dragged me in with him. It can be a bit overwhelming, though. I don't really have the time to skim through all the questions that get posted, especially since so many of them are about things where I have no useful information or advice to give. It sure would be helpful if something pared the list down to only the questions where my answers would be most likely to actually help others, right? Right!

Kenobi was a perfect combo project for me. I got to explore machine learning, use some of the skills I picked up at the awesome ScraperWiki class on web scraping I took a while back, and create a tool I'd actually use to improve my ability to help others. Right on.

So, what does Kenobi actually do?

Kenobi has two basic functions: analyzing answers you've already posted to old AskMeFi questions, and classifying new questions for you to pick out the ones you can answer best.

To analyze your old AskMeFi data, Kenobi:

  1. deletes outdated training data for you from its own database, if any;
  2. logs into Metafilter under a spare account I created for this purpose, because one can't see favorite counts in user profile comment data unless one is logged in;
  3. searches to find the user ID number associated with your username;
  4. scrapes the answers section of your profile for the above-the-cut text of each question you've answered, and whether or not you've received at least one favorite on the answer(s) you posted to that question;
  5. separates the old questions into a "should answer" group (those where your answer(s) did get at least one favorite) and a "should NOT answer" group (those where your answer(s) didn't get any love);
  6. organizes and saves the data from each group ("should answer" and "should NOT answer") to refer back to when classifying new questions;
  7. compresses the data to save space in the database; and
  8. emails you to let you know that training is done, if you submitted an email address (highly recommended).
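Steps 5 and 6 above are the heart of the training. A rough sketch of that part might look like the following. Kenobi itself is written in Ruby, so this JavaScript is not its actual code; the question shape ({ text, favorited }), the tokenizer, and every name here are assumptions for illustration.

```javascript
// Split text into lowercase word tokens (a deliberately crude tokenizer).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Build per-group word counts from previously answered questions.
// Each question is assumed to look like { text: "...", favorited: true }.
function train(questions) {
  // Object.create(null) avoids collisions with built-in keys like "constructor".
  var model = {
    should: { docs: 0, words: 0, counts: Object.create(null) },
    shouldNot: { docs: 0, words: 0, counts: Object.create(null) },
    vocab: Object.create(null)
  };
  questions.forEach(function (q) {
    var group = q.favorited ? model.should : model.shouldNot;
    group.docs += 1;
    tokenize(q.text).forEach(function (word) {
      group.counts[word] = (group.counts[word] || 0) + 1;
      group.words += 1;
      model.vocab[word] = true;
    });
  });
  return model;
}

var trained = train([
  { text: "How do I learn Ruby?", favorited: true },
  { text: "Ruby gems help", favorited: true },
  { text: "Best pizza in Brooklyn", favorited: false }
]);
console.log(trained.should.counts.ruby); // prints 2
```

The result is just two bags of word counts plus a shared vocabulary, which is all a naive Bayesian classifier needs to keep around.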

To classify new AskMeFi questions for you, Kenobi:

  1. clears out your last batch of results, if any;
  2. parses the Ask Metafilter RSS feed for above-the-cut question text and URLs for the n most recent questions;
  3. decompresses the data it has on you into memory;
  4. for each question, determines the probability that you should answer it and the probability that you should NOT answer it, based on Bayes' Theorem and your old answer data;
  5. for each question, if the odds that you should answer it are at least 4.5 times higher than the odds that you should NOT answer it, classifies that question as good for you to answer; and
  6. saves and displays all of the new questions classified as good for you to answer, and only those.
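Steps 4 and 5 can be sketched like this. Again, this is JavaScript for illustration rather than Kenobi's Ruby, and the toy model, the add-one smoothing, and the use of log probabilities are all assumptions on my part about one reasonable way to do it.

```javascript
// A toy model: word counts, total words, and number of training questions
// per group. All values are made up for illustration.
var model = {
  should:    { docs: 3, words: 6, counts: { ruby: 3, rails: 2, scraping: 1 } },
  shouldNot: { docs: 3, words: 6, counts: { pizza: 3, brooklyn: 2, ruby: 1 } },
  vocabSize: 5
};

function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// log( P(group) * product of P(word | group) ), with add-one smoothing so
// that a word never seen in a group doesn't zero out the whole product.
function logScore(words, group, totalDocs, vocabSize) {
  var score = Math.log(group.docs / totalDocs);
  words.forEach(function (word) {
    var seen = Object.prototype.hasOwnProperty.call(group.counts, word);
    var count = seen ? group.counts[word] : 0;
    score += Math.log((count + 1) / (group.words + vocabSize));
  });
  return score;
}

// True when "should answer" beats "should NOT answer" by at least `ratio`.
function shouldAnswer(text, model, ratio) {
  var words = tokenize(text);
  var totalDocs = model.should.docs + model.shouldNot.docs;
  var yes = logScore(words, model.should, totalDocs, model.vocabSize);
  var no = logScore(words, model.shouldNot, totalDocs, model.vocabSize);
  // Comparing logs: yes - no >= log(ratio) is the same as P(yes)/P(no) >= ratio.
  return yes - no >= Math.log(ratio);
}

console.log(shouldAnswer("Ruby on Rails scraping help", model, 4.5)); // true
console.log(shouldAnswer("Best pizza in Brooklyn?", model, 4.5));     // false
console.log(shouldAnswer("Ruby", model, 4.5));                        // false
```

The last call shows the threshold at work: "Ruby" alone leans toward "should answer" by a factor of 2, which isn't enough to clear 4.5.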

Why do the odds that I should answer a question have to be 4.5 times higher than the odds that I should NOT answer that question, for Kenobi to classify it as good for me?

Because when I left the threshold lower, people were getting too many questions that didn't seem like good fits to them. With a higher threshold, some folks may not get any results at all (sorry!), but people who've answered enough past AskMeFi questions to give good data to work from will get much more accurate results.

The closer the two probabilities are, the less confident we can be that we've really found a good match and that the question really is a good one for you to answer. It only makes sense to select a question for you when the odds that it's the kind of question you're good at answering are significantly higher than the odds that it isn't.

Why all that compressing and decompressing?

I wrote Kenobi up as a pure Ruby command line tool first, then decided it would be fun to quickly Rails-ize it so more people would be able to play with it more easily. That meant finding a place to deploy it, as easily and cheaply as possible.

Heroku (my host) charges for databases over 5 MB. I love you all, but not enough to spend money on you if I don't have to. So I'm trying to be as efficient as possible here, in hopes of staying under that limit.
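As an illustration only, here's what squeezing a trained model into a compact string before saving it could look like. This is Node.js with its built-in zlib module, not Kenobi's Ruby code, and the function names and the choice of deflate plus Base64 are assumptions.

```javascript
var zlib = require("zlib");

// Serialize a model to JSON, deflate it, and encode it as Base64 text
// so it can be stored in an ordinary text column.
function pack(model) {
  return zlib.deflateSync(JSON.stringify(model)).toString("base64");
}

// Reverse of pack(): decode, inflate, and parse back into an object.
function unpack(packed) {
  var json = zlib.inflateSync(Buffer.from(packed, "base64")).toString("utf8");
  return JSON.parse(json);
}

var wordCounts = { ruby: 3, rails: 2, scraping: 1 };
var stored = pack(wordCounts);
console.log(unpack(stored).ruby); // prints 3
```

Word-count data is repetitive, so it tends to compress well. The cost is the extra decompression step every time new questions get classified.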

Why the wait while Kenobi analyzes my old data?

A few reasons!

First, one can't actually effectively search Metafilter for a user by name or see favorite counts on the list of a user's past answers in their profile unless one is logged into Metafilter. Metafilter doesn't even have an API to work with. It does have info dumps, but they're huge and not updated regularly.

This means that Kenobi has to arduously scrape and parse Metafilter's HTML whenever it analyzes old data for a new user. And it has to actually log into the site and click through as a logged-in user to do so, which it does using a gem called Mechanize.

I set the scraping up as a background task with Delayed_Job and set Kenobi up to email people when ready, so no one has to sit around staring at an error message or colorful spinner while waiting for their analysis to come up in the job queue and get done. This means no more HTTP timeout errors, but it also means that your analysis job goes to the end of the queue, however long it may be.

Also, Heroku charges for worker dynos, which are needed to actually run the background processes piling up in that job queue. They charge by the second. (Seriously). But that includes all the time the worker spends sitting around waiting for a job to exist, not just the time it spends actually working on jobs.

This was just a learning project, not something I actually expect to earn anything from or want to pay anything for. So, I spent a bunch of time messing around with a nifty tool called Workless and learning how to have Heroku automatically scale a single worker dyno up and down as jobs are added and completed, so I can pay for as little time as possible.

This slows things down for you even more, because not only are you waiting for the scraping to get done, you're actually waiting for Heroku to start up a new worker dyno to start working on the scraping before it can get done.

Sorry about that! If you care a lot for some reason, email me and we can commiserate or something.

Wait, so Kenobi picks out questions where my answers will help others, not questions that help me directly?

That's right! Kenobi's selections are based on each new question's similarity to the past questions to which your answers have been favorited by others, and dissimilarity to the past questions where your answers got no love. It doesn't pay attention to what you've favorited - only to which of your answers have been favorited by other people. It doesn't really care about your interests at all, other than your interest in being popular... er, of use to others.

Have fun!

13 Feb 2012
Considerating

Considerate or creepy? It can be hard to tell, sometimes! My brother Josh asked me to be his "empathy sherpa" and help him navigate that blurry grey line between sweet and skeezy (or maybe that's a different project?), so I built Considerating for him while warming up for Recurse Center.

Considerating is a simple concept. Each consideration comes with a slider (or dropdown, on mobile browsers) that lets you vote - where on the range between considerate and creepy does this idea fall? You can sign in with Google OAuth to submit new considerations of your own. After each vote, the graph is recalculated and redrawn to reflect the updated results.

I did all the coding, and Josh and I collaborated on the design and UI. It was really fun to finally work on a project like this with him! I mean, this is my little brother, the kid who once tried to evict me from my bedroom back when we were young by taping a sign to my door while I was out signed by the "MGMT" - and the person who can most consistently answer correctly when I call him out of the blue to ask, "Hey, what's that word I'm forgetting?"

My favorite part of this project was the little bit of JavaScript that makes those whoopety whoopety graphs work out. The code is all here, but this is the particularly fun bit:

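function draw(points) {
  // The canvas id is filled in server-side by the ERB template.
  var canvas = document.getElementById('graph<%= @consideration.id %>');
  // Scale everything against the tallest bar; fall back to 1 so an
  // all-zero array doesn't divide by zero.
  var highest = Math.max.apply(Math, points) || 1;
  var i;
  if (canvas.getContext) {
    var ctx = canvas.getContext('2d');
    ctx.strokeStyle = "#000000";
    ctx.lineJoin = "round";
    ctx.lineWidth = 2;
    ctx.beginPath();
    // Start at the baseline on the far left and rise to the first point.
    ctx.moveTo(0, 115);
    ctx.bezierCurveTo(20, 115,
        20, 115 - (points[0] / highest) * 100,
        40, 115 - (points[0] / highest) * 100);
    // Connect each point to the next. Both control points sit halfway
    // between the two points horizontally, one at each point's height.
    for (i = 1; i < 10; i++) {
      ctx.bezierCurveTo(20 + (i * 40), 115 - (points[i - 1] / highest) * 100,
          20 + (i * 40), 115 - (points[i] / highest) * 100,
          40 + (i * 40), 115 - (points[i] / highest) * 100);
    }
    // Drop from the last point back down to the baseline.
    ctx.bezierCurveTo(420, 115 - (points[9] / highest) * 100,
        420, 115,
        440, 115);
    ctx.shadowColor = "black";
    ctx.shadowBlur = 1;
    ctx.stroke();
  }
}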

Imagine the graph as a set of 10 bars. The drawing function takes an array of 10 values, and creates those smooth curves between points set at the intersection of each bar (x-axis) and the number of votes that value has received, scaled appropriately (y-axis).
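If you'd like to poke at the math without a canvas, here's a sketch that computes the same control points as plain arrays. The function name and the segment format are made up for this example; the constants (baseline at 115, 100 pixels of height, bars 40 pixels apart) come from the drawing function above.

```javascript
// Turn vote counts into the cubic bezier segments the graph draws.
// Each segment is [cp1x, cp1y, cp2x, cp2y, endX, endY], and the path is
// assumed to start at (0, baseline).
function curveSegments(points) {
  var baseline = 115, height = 100, step = 40;
  var highest = Math.max.apply(Math, points) || 1; // avoid dividing by zero
  function y(i) {
    return baseline - (points[i] / highest) * height;
  }
  var segments = [];
  // Rise from the baseline to the first point.
  segments.push([step / 2, baseline, step / 2, y(0), step, y(0)]);
  // Between neighbors, both control points sit at the horizontal midpoint:
  // the first at the previous point's height, the second at the next one's.
  for (var i = 1; i < points.length; i++) {
    var midX = step / 2 + i * step;
    segments.push([midX, y(i - 1), midX, y(i), step + i * step, y(i)]);
  }
  // Drop from the last point back to the baseline.
  var lastX = step + (points.length - 1) * step;
  var last = points.length - 1;
  segments.push([lastX + step / 2, y(last), lastX + step / 2, baseline,
                 lastX + step, baseline]);
  return segments;
}

var segments = curveSegments([10, 5, 0, 0, 0, 0, 0, 0, 0, 0]);
console.log(segments.length); // prints 11
console.log(segments[1]);     // prints [ 60, 15, 60, 65, 80, 65 ]
```

Because each control point shares its height with the data point next to it, the curve is flat as it passes through every point, which is what gives the graph its smooth, wavy look.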

I'd played with Adobe Illustrator before, and had some sense of how bezier curves work. But it was a lot of fun to have to think through the math that would get me what I wanted from scratch, without being able to rely on click-and-drag visuals. I have a much more solid understanding of what bezier curve control points actually mean now, which I'm really happy about.

I still don't know if I can get away with chasing after strangers' toddlers, though. C'mon, internet, help me figure it out!

15 Jan 2012
The best books I read in 2011

2011 was an amazing year in books for me. I hit a new record with my annual book list (I read 171 books last year!), and a much higher percentage of them than usual were awesome. In fact, I read so many great books last year that I split them up into categories for you here.

Books I loved reading in 2011 that related to decision-making and problem-solving:

  • Collapse: How Societies Choose to Fail or Succeed by Jared Diamond- Well, we’re fucked. Brilliant, brilliant book. Worth reading, but intensely depressing in a clear, logical sort of way. Diamond claims to be cautiously optimistic, and I’d like to believe him, but I’ve read too much Derrick Jensen to be totally convinced.
  • The Logic of Failure: Recognizing and Avoiding Error in Complex Situations by Dietrich Dorner- Analyzes tons of studies on how people try - and fail! - to handle complex situations. The main lessons I drew from this were - (a) You have to think about not only the problems you do have, but also the problems you DON’T have, because otherwise your solutions may well create new problems in the future; (b) feedback has a time lag, and unless you stick to tiny adjustments with delays to record feedback in between, you can easily end up ricocheting between extremes; (c) it’s hard to figure out the right amount of information to gather; (d) both overfocusing and ignoring complex details are extremely dangerous; and (e) learning the issues with dealing with complex problems doesn’t actually help you get better at handling them - only actual experience does that.
  • Complications: A Surgeon’s Notes on an Imperfect Science by Atul Gawande - Absolutely fantastic look at mistakes, the need for training and learning, social and ethical issues that interfere with training, cognitive errors, the difficulty in balancing our need for best health outcomes with our need for training students, and lots of surgical war stories along the way. I also highly recommend reading his essay Personal Best, where he discusses his decision to seek coaching to improve his surgical skills.
  • Honeybee Democracy by Professor Thomas D. Seeley- This was the most incredible book I’ve read in a long time. He describes the studies he performed in trying to determine how honeybee swarms decide on where their next home should be, and get everyone there together. Lots of insight into bees, but also into decision-making process design. Even if you’re not obsessed with bees as I am, this is a book well worth reading - it’s an eloquent depiction of how science is done, plus Seeley is very into the idea that we should learn more on how to manage group decision-making from the example set out by the bees.
  • Nudge: Improving Decisions About Health, Wealth, and Happiness by Richard H. Thaler and Cass R. Sunstein - Totally fascinating book about how to be a better choice architect, largely by adjusting incentives and defaults and making it just a bit easier for people to do what they in theory want to do anyway. Libertarian paternalism.


Books I loved reading in 2011 that related to race and class:

  • A Dry White Season by Andre Brink- A white schoolteacher in South Africa learns a bit about race, politics, discrimination, abuse of power, and privilege. Intense and difficult to read, especially in our current political climate.
  • Coyotes: A Journey Through the Secret World of America’s Illegal Aliens by Ted Conover- Gringo journalist decides to cross the border with Mexicans who migrate North for work each summer. Spends time on the journeys, spends time at the harvesting work, spends time driving [with, and also..] them across the country, spends time in their homes.
  • Mister Pip by Lloyd Jones- A story told by a black girl living on a tropical island in the middle of a civil war, where the one white man left volunteers as a teacher, except the only book he has to teach from is Great Expectations. I worried it would be all full of white supremacy bullshit, but ultimately I actually thought it was generally aware, sensitive, and interesting. Heart-breaking in moments. Really an excellent little book, overall.
  • Limbo: Blue-Collar Roots, White-Collar Dreams by Alfred Lubrano- Fascinating, useful book about class issues with people born to working class families who push themselves into the middle class. This book sparked a lot of ideas in me and moments of recognition when thinking about my family dynamics and history and issues I’ve had with others. Highly recommended.
  • The Immortal Life of Henrietta Lacks by Rebecca Skloot - The story of the person and family behind the HeLa cell line! I’d also suggest reading this interview with Skloot where she explains her thoughts on structure in writing and how she chose to structure this book in particular.
  • Gang Leader for a Day by Sudhir Venkatesh - Sociologist does fieldwork by hanging out with a gang in the projects. The Freakonomics people love this guy. I can see why.


Books I loved reading in 2011 that related to art and design:

  • The New Drawing on the Right Side of the Brain by Betty Edwards - I went through all the exercises when I started my latest drawing kick, and I found them extremely helpful.
  • Artist’s Complete Problem Solver by Trudy Friend- This is basically one of the best drawing and painting books I’ve found yet. It’s particularly good in terms of very specific techniques and concepts to keep in mind when trying to figure out what to focus on in observing and drawing. Also, micro brushstroke exercises.
  • The Forgery of Venus by Michael Gruber- Drugs, hallucinations, art history, forgery, wonderful descriptions of technical processes of both forgery and painting generally. Absolutely lovely. Rather dark and fucked up in places, but in a beautiful way.
  • Making Comics by Scott McCloud- Someday I’ll illustrate a webcomic. Someday.
  • My Name is Red by Orhan Pamuk- A murder mystery and an exploration on artistic pride, cultural influences, morality, religion, and the meaning of style. Very nice.
  • The Non-Designer’s Design Book by Robin Williams- Reminded me of playing Set and of watching the parents figure out the layout for the old Fiske Terrace newsletter when I was a kid. Dead simple, basic stuff, but good important concepts to keep in mind.
  • Poemcrazy: freeing your life with words by Susan Wooldridge - If you love words, read this.


Books I loved reading in 2011 that related to Judaism:

  • Boychiks in the Hood: Travels in the Hasidic Underground by Robert Eisenberg- Deeply familiar and foreign at the same time. My people, and not my people. Which was perhaps the point. In point of reference, I was raised Conservative and have always identified as a Jewish agnostic.
  • All Other Nights by Dara Horn- Jews in the Civil War! Lying liars who lie! I adore Dara Horn.
  • The Book of the Unknown: Tales of the Thirty-Six by Jonathan Keats- This was really charming! It tells the stories of some of the lamed vavniks (in Jewish lore, these are the 36 just men and women who hold up the world, without realizing it or being acknowledged by others). Here are some of the lamed vavniks of a past era (there must be 36 of them alive in the world at any given time), and they are whores and thieves and golems and all sorts of unlikely personages. It was a good premise, nicely executed, and I particularly loved the rare pleasure of reading Jewish fairy tales that aren’t all about getting the king to promise not to kill all the Jews (only ha ha just kidding).
  • The Man in the White Sharkskin Suit by Lucette Lagnado- Memoir of a Jewish Egyptian woman whose family fled Cairo when she was just a child. Absolutely wonderful, with a personal discussion of cultural stresses and family relationships that did feel real to me. Possibly only interesting if you’re a diaspora Jew whose family had to flee countries in grave danger, though. Hard to tell, since for me, this was my family’s life.
  • City of Oranges by Adam LeBor- Fascinating history of Jews and Arabs in the old city of Yaffa. I’ve been there, and it was interesting to read this with my memories of walking through the area.
  • Lost Tribe: Jewish Fiction from the Edge edited by Paul Zakrzewski - I didn’t expect much of this, but in fact there were a lot of really spectacularly good stories in it! Overlook the kitsch of the title and concept, and you’ll find the good stuff. My oh my.


Other fiction I loved reading in 2011:

  • Santa Olivia and Saints Astray by Jacqueline Carey- Super cute YA with good queer character development and political exploration.
  • Slammerkin by Emma Donoghue - I really enjoyed this novel! It has a lot of human flaws and weakness, and shows not only the way fucked up systems fuck people up, but also the incredible awfulness that people are capable of. The protagonist is not a good person. You can’t quite figure out if you like her, even though you see the factors that went into messing her up and it’s hard to blame her for the first few. But later on, she’s making choices that you want to hate her for, again and again. But still in a sympathetic way. Really, just about all the characters are powerless in so many ways, and they take out the pain of their powerlessness on each other. The writing style just faded into the background and let me sink into the story, which I love.
  • Sea of Poppies by Amitav Ghosh - Just about the best fiction I’ve read in ages. Indian, complex, epic, poignant, fascinating. Lots of characters, but all are fleshed out and developed and weave in and out of each other’s lives. The focus on language and dialect is brilliant - it’s confusing at times, like the first time you read A Clockwork Orange, but you get the sense that so many of the characters are lost and confused that you’re supposed to be right there with them, and their languages are so defined by their backgrounds/lives/castes that it all comes together in a jumble as their society crackles around them. It’s killing me that the second book of this trilogy isn’t out in paperback yet, and the third isn’t out yet at all. I already want to reread the first.
  • A big bunch of Alastair Reynolds books, which are all thoroughly stuffed with interesting ideas and characters but are ultimately a bit tough to tell apart. He manages to write the same book over and over again without getting boring, which is a neat parlor trick in itself.


Other non-fiction I loved reading in 2011:

  • The Spirit Catches You and You Fall Down: A Hmong Child, Her American Doctors, and the Collision of Two Cultures by Anne Fadiman- An exploration of how cultural differences between Hmong and Americans interfere with access to health care, among other things.
  • Lust for Justice: The Radical Life & Law of J. Tony Serra by Paulette Frankl - Inspiring biography of a radical hippie brilliant criminal defense lawyer. I really want to read a collection of his summations!
  • The Devil in the White City: Murder, Magic, and Madness at the Fair that Changed America by Erik Larson- A history text written in an almost novel-ish prose style, about the 1893 Chicago World’s Fair and the serial killer operating in Chicago at the time.
  • China Witness: Voices From a Silent Generation by Xinran - I’ve been on a Chinese history reading kick. It’s just so huge, and there’s so much out there that I don’t have a clue about. Likened to Studs Terkel’s interview collections, this book is actually a fascinating set of conversations that offer insight into history and culture that I can’t seem to reach anywhere else.

17 Dec 2011
A few sketches on the run

Man playing harmonica on the 2 train:

Another man on the train:

A lady on the phone at Starbucks:

Dave, asleep on the bus, hiding from the light in his hoodie:

19 Nov 2011
Jailbreak the Patriarchy: GitHub, Press, & Favorite Examples

You asked for it, so you got it - I put the source code for Jailbreak the Patriarchy up on GitHub. Feel free to check it out, contribute, or use it to make your own extensions. Have fun!

I'm staggered and delighted by the responses to my little extension. I'm still not bored of watching people tweet their reactions or examples of swaps they've come across and liked best! Here's a bit of a roundup of the press it's received:

• I was interviewed on APM’s Tech Marketplace. Thank you, public radio! I think you just made my month. My segment starts at 2:35 in that recording.

• The New Yorker mentioned me! Of course, they rather hilariously proved my point by writing: “Jezebel tests the new Google Chrome extension Jailbreak the Patriarchy, which feminizes nouns and pronouns on any Web site.” Only in a world in which the default is masculine can gender-swapping be described as “feminizing”.

• Jezebel: Fun Chrome Extension Gender-Swaps The Internet

• Flavorwire: Re-gender Your Webpages with the New “Jailbreak the Patriarchy” Chrome Extension

• The Toronto Star: Woman! I feel like a man: Swap gender of words from your browser

• The Mary Sue: You Should Really Check Out Google Chrome’s Genderswap Plugin

• Cyborgology has perhaps the most thoughtful response I’ve read yet.

• Gina Carey: Swapping Gender in Books - “Humbert Humbert is a middle-aged, fastidious college professor. She also likes little boys.”

• Prosumer Report: Genderswap Your View of the World

• Maria Popova (brainpicker) declared Jailbreak the Patriarchy to be the “best Chrome extension ever - wow!”

• And so many more! I’m pretty thrilled to have excited Kelley Eskridge, who’s pretty exciting herself. Morning Quickie suggested using Jailbreak in sociology or history classes (and women’s studies, of course). I made it onto Metafilter and Reddit. I love Ellen Chisa’s response. I probably shouldn’t admit to this, but I actually found some thoughtful discussion on gendered language over on the Sensible Erection forum discussion (includes NSFW images) of Jailbreak the Patriarchy.

• Not to mention all the other fabulous folks on Twitter who said wonderful things and quoted some great swaps they were finding: Jonathan Haynes and Oliver Burkeman of the Guardian, Zach Seward of the Wall Street Journal, Julian Sanchez of Reason Magazine, Charlie Glickman of Good Vibrations, Elizabeth Bear, Bloomsbury Press, TrustWomen, ResearchGate, Disinfo, and more. Even GRAMMARHULK seemed excited!

Thank you, everyone! I love seeing all your examples and hearing your responses. I've had an amazing week, seeing everyone react to this thing I built. What a trip!

If you haven't checked it out yet, I suggest you go install Jailbreak the Patriarchy and then read pages such as Schrodinger's Rapist or the art of being an ambitious female. Or perhaps the "relationships" tag on Ask Metafilter. Check out the news on the latest sex/harassment/abuse scandal, the latest corporate scandal, the latest big thing in business or politics. And for best effect, leave it installed for a few days, let yourself forget that it's there, and see what jumps out and surprises you.

Also, ports and spin-offs created by other coders:

• Nicholas FitzRoy-Dale ported Jailbreak to work for Safari

• sinxpi ported Jailbreak to a Greasemonkey script for Firefox

• Marianna Kreidler released a gender-neutral version of Jailbreak