
02 May 2014
Finding words that sound alike but are spelled wildly differently

I’ve been working on search stuff lately, and we needed some wordlists to help test search results that match only because they sound similar to the query, and not because they’re spelled similarly.

Turns out we couldn’t find a pre-existing wordlist of homophones (words that sound the same but are spelled differently) that are dramatically different in spelling. And our QA team especially wanted some examples of people’s names that meet those criteria.

So, sure, I figured that’d be fun and quick to throw together for them!

It’s a lot like finding anagrams - the basic structure was a dict (a hash map, for the non-Python folks reading this) keyed by the phonetic encoding of each word. Each key pointed to a nested dict, which included an array of words which phonetically matched the key and a bool indicating whether it fit my criteria or not. In the end, all matching words were spit into stdout as a list of comma-separated homophones.
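Here's a minimal sketch of that structure, in case it helps to see it laid out. It is not my actual script: `toy_phonetic_key` is a made-up stand-in for a real phonetic encoder like metaphone, and the `"words"`/`"matches"` field names and the default `qualifies` check are assumptions for illustration.

```python
from collections import defaultdict


def toy_phonetic_key(word):
    """Crude, hypothetical stand-in for a real phonetic encoder such as metaphone."""
    w = word.lower()
    # A few hand-picked spelling rules; a real algorithm has many more.
    for old, new in (("ph", "f"), ("gh", ""), ("kn", "n"), ("wr", "r"), ("c", "k")):
        w = w.replace(old, new)
    key = []
    prev = ""
    for ch in w:
        # Skip vowels, and skip a letter that directly repeats the previous one.
        if ch != prev and ch not in "aeiouy":
            key.append(ch)
        prev = ch
    return "".join(key).upper()


def group_by_sound(words, key_func=toy_phonetic_key, qualifies=lambda ws: len(ws) > 1):
    """Build {phonetic key: {"words": [...], "matches": bool}}."""
    groups = defaultdict(lambda: {"words": [], "matches": False})
    for word in words:
        groups[key_func(word)]["words"].append(word)
    for entry in groups.values():
        # Assumed placeholder criterion: swap in a stricter check as needed.
        entry["matches"] = qualifies(entry["words"])
    return dict(groups)


def homophone_lines(groups):
    """One comma-separated line per group that met the criteria."""
    return [", ".join(e["words"]) for e in groups.values() if e["matches"]]


groups = group_by_sound(["knight", "night", "write", "right", "cat"])
for line in homophone_lines(groups):
    print(line)
```

Passing the encoder in as `key_func` means a real metaphone implementation could be dropped in without touching the grouping logic.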

I determined whether words were spelled differently enough by checking whether a small enough percentage of their trigrams were the same. (I also had a minimum length set, so I’d be sure to have enough trigrams per word to be worth checking for match percentage.)
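A rough sketch of that trigram check might look like the following. The threshold (`max_shared=0.25`), the minimum length (`min_length=5`), and the choice to divide by the smaller trigram set are all assumptions picked for illustration, not the values from my script.

```python
def trigrams(word):
    """Set of all 3-letter substrings of a word, case-insensitively."""
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}


def shared_trigram_fraction(a, b):
    """Fraction of the smaller word's trigrams that also appear in the other word."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))


def spelled_differently(a, b, max_shared=0.25, min_length=5):
    """True when both words are long enough and share few enough trigrams."""
    if len(a) < min_length or len(b) < min_length:
        return False  # too few trigrams for the percentage to mean much
    return shared_trigram_fraction(a, b) <= max_shared


print(spelled_differently("Catherine", "Kathryn"))
print(spelled_differently("knight", "night"))
```

With these settings, "Catherine" and "Kathryn" share only one trigram ("ath") and pass, while "knight" and "night" share nearly all of theirs and don't.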

(It was kinda neat to find something that felt more like an interview puzzle than anything else, but was actually useful for my day job. Oh hey, look, those skills are occasionally actually useful! Now you don’t have to feel weird about all the time you spent learning how to solve these sorts of puzzles!)

Sweet and simple and fun! Here’s my script and a few of the wordlists I created with it, since I figure other people may also find this sort of thing useful when testing search implementations. (FYI, if you’re using something other than a metaphone/doublemetaphone soundalike algorithm and trigrams for misspellings, you may want to make some adjustments.)

24 Apr 2014
My new favorite vim/tmux bug

This week, I'm grateful that my coworkers know to come grab me if something seriously weird is going on, because it fills me with so much glee! I mean, WHAT.

minimal repro:
On Suffolk (one of our machines), open tmux, open vim, open new terminal tab.
Vim gets “lililililililill” inserted in current file, and beeps a lot
If the file already has content, it prepends i and appends ll to ~10 lines, and sometimes capitalizes something


I'm going to skim over some of the details so that this remains a blog post and not an endless excited ramble, but! This is approximately how figuring this nonsense out went!

Initial poking around

When does the problem happen? When you open a new bash tab or window, or enter any command in any bash session.

"lililililililill" looks very suspicious. Is that a macro or something hiding in one of the vim registers? Use :reg to check the contents of the vim registers - nope, nothing fishy in there!

Is there anything funky in our tmux config? ~/.tmux.conf doesn't exist, and a quick googling around didn't turn up anything on any other standard sorts of tmux config files. Fair enough, put that aside for the moment.

Poked around a bit more to define the edges of the problem:

  • Happens both in iterm and terminal
  • Happens after restarting vim, restarting tmux, reinstalling tmux, reimaging Suffolk
  • Does not happen in vim/macvim outside of tmux.
  • Does not happen in any other of the handful of machines that were checked.
  • Does not happen in vim in tmux when ssh'd into another machine.

    (One exception to that last one - a coworker said he was able to replicate it when ssh'd into a remote coworker's machine. But when we tried to replicate that, it didn't happen. An isolated datum, potentially relevant, but highly suspect. To this day, I'm pretty convinced that folks got mixed up and it never really happened in the first place - happens to the best of us, and it doesn't fit with any of the other evidence.)

    Cool, we got the lay of the land. So! What changed recently?

    Ah, this machine was newly reimaged. Maybe we have new broken or incompatible versions of some things?

    We were told that the tmux version should be frozen as part of our install script. Do you believe everything you're told? No? Good! You guessed it, we had totally different versions of vim, tmux, bash, and OS X on this machine than on other machines which do not exhibit the same problem.

    Around this time I started up a Google doc to keep track of everything we were trying, because once things start to look complicated I know I won't be able to remember everything I've tried. Especially when multiple people are involved! And it's a horrible waste of time to repeat experiments out of forgetfulness, or even worse, lose potentially relevant data. I won't bore you with a full list of versions and reinstallation steps, but boy do I have all the details in my notes.

    Point being, we downgraded bash, tmux, and vim to match the versions working on other machines, but the problem remained.

    At this point, I was sadly told that the machine was just going to get a bunch of stuff reinstalled and I shouldn't spend any more time poking at it. Sadness! But okay, fair enough, it was getting in people's way and the show must go on.

    But wait! Things don't magically solve themselves after all!

    Imagine my delight when I came in the next morning and heard that the reinstalling stuff hadn't fixed the problem! I'd been super bummed the day before to have my mystery stolen away from me, so this was very exciting! I hung out with a coworker for a bit to give him pointers on how to look into Elasticsearch bugs, then ran off to the biggest mystery of the week.

    Aha! We noticed that we have a tmux-related vim plugin in our vim config - tmux-config. Bonus points for anyone who feels like stopping here to look at that and guess how this story ends. ^___^

    I didn't have much time to play with it in the moment, but the very best thing happened - we were able to replicate it on any machine recently reimaged with our new workstation setup script! This meant I was able to get the bug onto my laptop! AW YEAH.

    Wheeeeee I got to take my bug home with me to play with!

    I sat down to take a closer look at that tmux-config plugin.

    ~/.vim/bundle/tmux-config/tmux-autowrite/autowrite-vim.sh creates a preexec function that’s called whenever you start up a new bash session, or enter any command in bash.

    Line 31 reads:
      tmux send-keys -t $pane ^\\\ ^n F19 WriteAll

    Commenting out that line causes the problem to go away. Running tmux send-keys -t %0 ^\\ ^n F19 WriteAll manually in another bash window causes the bug to manifest regardless. Perfect! What is this thing trying to do, and what is it actually doing?

    Ah! In that same plugin, ~/.vim/bundle/tmux-config/plugin/autowrite.vim:45 defines this relevant mapping:

      map <silent> <F19>WriteAll :silent! wall<CR>

    My notes from the moment my jaw dropped after one look at that:

    I’m not sure what ^\\ and ^n are supposed to do - they don’t seem to be doing anything.

    The rest is a mapping set in autowrite.vim:45 to save vim buffers when you do other stuff in the terminal, basically trying to mimic the way we have macvim set up to save on blur.

    I’m not sure why yet, but F19 is what toggles capitalization and makes the damn beep. It only does the capitalization in vim inside tmux, not in vim outside tmux.

    And then 'WriteAll' is interpreted as a normal vim command -

    • ri replaces the character under the cursor with an i
    • e takes you to the end of a word
    • A takes you to the end of the line and puts you into insert mode
    • then ll is inserted at the end of the line

      Not sure why the preexec function gets run multiple times with each bash session/command, but that's what must be happening!


      WOW. BUT WHY?!???

      This is around the time I settled in to perform a series of experiments.

      What happens if I send F19 alone with tmux send-keys?


        tmux send-keys -t %0 F19


      BEEP BEEP BEEP &c, and if the file open in vim has any contents, the next <= 3 characters get their case toggled.

      Huh, case gets toggled. Not just capitalized. Toggled. That's interesting.

      What happens if I hit F19 in vim outside of tmux?

      Aw, hell, my MacBook doesn't even have an F19 key! Yeargh. Fine, whatever, I went and installed KeyRemap4MacBook so I could remap fn-fn to F19 to test stuff with.

      Result: No beeping or case toggling.

      Why does F19 cause beeping/case toggling in vim inside tmux but not in vim outside tmux?

      Am I super confident that my mapping worked properly? I mean, I tested it with EventViewer, but how realistic is that? Does tmux send-keys somehow send something different than what my mapping thinks I'm sending now?

      How else can I test that F19 is what it claims to be?

      I did some googling around, and learned that you can actually check how keystrokes are encoded in bash by opening up your terminal, hitting control-v, then hitting a key.

      Whoa, neat, that seems useful! I checked encodings to see if I could find a difference, and oho, that jumped out at me!

      • Inside tmux, F19 is encoded as ^[[33~
      • (in our bash outside tmux, it’s ^[[18;2~ instead, dunno why)

        HOLD ON. Look at that more closely: inside tmux, F19’s encoding ends in '3~', which is exactly the command in vim that you’d expect to toggle case for 3 characters - COINCIDENCE? I THINK NOT.

        Wait a second. 3~ looks super familiar for another reason! Oh, right, I'd noticed earlier that ~/.vim/bundle/tmux-config/plugin/autowrite.vim:35 set up some function keys like so:

          if &term == "screen-256color"
            set t_F3=^[[25~
            set t_F4=^[[26~
            set t_F5=^[[28~
            set t_F6=^[[29~
            set t_F7=^[[31~
            set t_F8=^[[32~
            set t_F9=^[[33~

        My eyes had skimmed over that bit earlier, because it looked like it only went up to F9. I didn't bother to verify that assumption, just moved right past it. DAMNIT. Time to search the vim docs!

        OH HEY t_F9 refers to F19 in vim ARGH ARGH ARGH HOW DID I MISS THAT.

        So, that's inside a conditional. It doesn't always happen. What's &term? Well, it's whatever $TERM is in bash. Okay, let's verify that!

        Inside tmux, $TERM was set to 'screen'. Outside tmux, $TERM was set to 'xterm-256color'.

        256color... oh hell.

        Some googling around turned up this useful answer on setting up tmux to handle xterm-style function key inputs. Setting that option did in fact make the bug go away! But that option wasn't set on any of our other computers, and we have all these other things hinting at a different solution.

        Time to search that tmux-config plugin for screen-256color to see where it comes up OH GODDAMNIT ~/.vim/bundle/tmux-config/plugin/tmux.conf:9 sets:

          set-option -g default-terminal "screen-256color"

        With that option set, the conditional in autowrite.vim is satisfied, and (when vim is restarted after that option is set and all the vim plugins are sourced) t_F9 (which is secretly F19) is set to ^[[33~.

        OH. OH OH OH.

        To sum up

        (1) tmux wasn’t set to handle xterm-style function key inputs, because our tmux.conf wasn’t actually being copied

        • from: ~/.vim/bundle/tmux-config/tmux-autowrite/tmux.conf
        • to: ~/.tmux.conf

          (2) THEREFORE, tmux hadn’t received this config from our tmux.conf:

              set-option -g default-terminal "screen-256color"

          (3) SO, $TERM inside tmux was “screen” and outside tmux was “xterm-256color”

          (4) This means tmux wasn’t set to handle xterm-style function keys (such as F19). This isn’t super-clear, to be fair. The clear way to set tmux to receive xterm function keys properly would be with “setw -g xterm-keys on”

          (5) Vim checks $TERM to see if function keys are available. See the tmux FAQ. If they’re not, the character codes sent by the function keys are interpreted literally.

          (6) We actually have vim set to interpret the higher function keys explicitly in autowrite.vim:35 - if $TERM is “screen-256color” (which happens explicitly in that tmux.conf we weren't using) then t_F9 (which is F19) is set to ^[[33~

          (7) Why? Because (as I verified with control-v) inside tmux, F19 is encoded as ^[[33~

          (8) Since we never explicitly set it otherwise, $TERM inside tmux was set to “screen” - which means that the condition in our autowrite.vim:35 was never met, and thus t_F9 was never set to ^[[33~ in vim.

          (9) Because t_F9 was never mapped properly in our vim config, when that preexec function ran and bash sent “^\\ ^n F19 WriteAll” to tmux via tmux send-keys, vim escaped into normal mode because of ^\\ ^n and then interpreted the rest literally as ^[[33~WriteAll.

          (10) And because the literal string ^[[33~WriteAll wasn’t mapped in vim (only <F19>WriteAll was!), each character was interpreted as a separate vim command, not part of a single mapping as intended.
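If it helps to see that chain as code, here's a toy model of it. This is only a sketch of the logic in the steps above, not how vim actually resolves key codes internally; `key_codes_for`, `MAPPINGS`, and `interpret` are hypothetical names invented for the example.

```python
ESC = "\x1b"  # what the terminal shows as ^[


def key_codes_for(term):
    """Key codes vim would recognize for a given $TERM, in this toy model."""
    codes = {}
    if term == "screen-256color":      # the condition guarding the t_F9 setting
        codes[ESC + "[33~"] = "<F19>"  # effect of: set t_F9=^[[33~
    return codes


# The one mapping that matters here, as defined in autowrite.vim.
MAPPINGS = {"<F19>WriteAll": ":silent! wall<CR>"}


def interpret(raw_input, term):
    """Translate known key codes, then look for a mapping; otherwise it's literal."""
    for code, name in key_codes_for(term).items():
        raw_input = raw_input.replace(code, name)
    if raw_input in MAPPINGS:
        return MAPPINGS[raw_input]
    return "typed literally: " + repr(raw_input)


sent = ESC + "[33~WriteAll"
print(interpret(sent, "screen-256color"))  # the intended path
print(interpret(sent, "screen"))           # the path we were actually on
```

The only difference between the two calls is the value of `term`, which is exactly what the missing ~/.tmux.conf changed.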

          ^[[33~WriteAll, interpreted as a series of vim commands:
          • ^[ is escape
          • [3 doesn’t do anything (as far as I can tell)
          • 3~ toggles case for the next three characters
          • W takes you to the start of the next WORD
          • ri replaces the character under the cursor with an i
          • te takes you to just before the next e
          • A takes you to the end of the line and puts you into insert mode, and then
          • ll is inserted at the end of the line
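To make that breakdown concrete, here's a tiny simulation that replays those keystrokes against a single line of text. It's a toy that follows the list above, not vim's real parser: it only knows the handful of commands in this sequence, works on one line, treats `[3` as a no-op as described, and the `replay` function and the sample line are made up for the example.

```python
ESC = "\x1b"  # what the terminal shows as ^[


def replay(line, keys):
    """Apply a tiny subset of vim normal-mode commands to one line of text."""
    chars = list(line)
    cursor = 0
    count = ""
    i = 0
    while i < len(keys):
        key = keys[i]
        if key == ESC:
            count = ""  # Escape cancels any pending count
        elif key == "[":
            i += 1  # "[3": swallow the next key and do nothing
        elif key.isdigit():
            count += key  # digits accumulate into a count
        elif key == "~":
            n = int(count or "1")
            for j in range(cursor, min(cursor + n, len(chars))):
                chars[j] = chars[j].swapcase()
            cursor = max(0, min(cursor + n, len(chars) - 1))
            count = ""
        elif key == "W":
            p = cursor
            while p < len(chars) and not chars[p].isspace():
                p += 1  # skip the rest of the current WORD
            while p < len(chars) and chars[p].isspace():
                p += 1  # skip the whitespace after it
            if p < len(chars):
                cursor = p
        elif key == "r" and i + 1 < len(keys):
            i += 1
            if cursor < len(chars):
                chars[cursor] = keys[i]  # replace the character under the cursor
        elif key == "t" and i + 1 < len(keys):
            i += 1
            try:
                cursor = chars.index(keys[i], cursor + 1) - 1  # stop just before it
            except ValueError:
                pass  # target not found: the cursor stays put
        elif key == "A":
            chars.extend(keys[i + 1:])  # everything after A is typed at line end
            break
        i += 1
    return "".join(chars)


print(replay("hello world there", ESC + "[33~WriteAll"))
```

On that sample line, the case toggle hits the first three letters, the `w` of "world" becomes an `i`, and `ll` lands at the end of the line.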

            Long story short, the fix was:

              ln -s ~/.vim/bundle/tmux-config/tmux.conf ~/.tmux.conf

            Process-related takeaways

            Absence of evidence IS evidence of absence - we noticed pretty early on that there was no ~/.tmux.conf, then moved on, figuring that okay, guess there isn't anything weird in the config. Next time, if something is missing that seems like a likely place to look, I want to think of looking at whether analogous config files exist on working machines to compare sooner.

            Verify ALL assumptions sooner (or at least the easy-to-check ones) - I noticed that t_F9 thing way earlier and skimmed past it, assuming that surely t_F9 referred to F9. That's an assumption that would've been super quick to verify! Gotta verify assumptions as they're made, especially ones that are quick and easy to check out.

            Edit: And via the great discussion of this post on Hacker News: "Every bug in existence is a story of different software components doing exactly what they were told to." (Unless you count cosmic ray bugs, natch.)

18 Apr 2014
Talking about Debugging with the Ruby Rogues

I got to hang out and chat about debugging with the Ruby Rogues! I was totally flattered to be invited to be their guest for Ruby Rogues episode 150: The Debugging Mindset with Danielle Sucher, and had lots of fun recording the show.

It was so fantastic to just get to chat about science and problem-solving and trying to get better about putting our egos aside and really evaluating the evidence before us with such a great group of people.

It started like this…

DAVID: I bought a microscope yesterday. And there was a splotch on it and I couldn’t figure out what it is and I did the scientific method trying to figure out where in the microscope the splotch was coming from. Turns out, I was seeing a reflection of my optic nerve.

JAMES: Nice.


JOSH: Yeah, you can look in the microscope a really long time and you won’t find that.

DAVID: Yeah.

DANIELLE: So, when you gaze into the microscope, the microscope gazes back into you.


DAVID: Also gazes back to me, yeah.

JOSH: [inaudible] Are you saying that what you see inside Dave’s eyes is the abyss?


DANIELLE: Yeah, yeah.

JAMES: I just want to know how he proved that hypothesis false. Did he gouge one of his eyes out?


DAVID: Actually, and this is the part that I was very, very proud of, I finally switched eyes. And the splotch moved and changed shape.

So brilliant!

And this was my favorite quote of mine from the episode:

"Look, the goal is to prove that I’m wrong. That means I win. I’ve proved that I was stupid about something so I can move on to being stupid about something more interesting."

Really, you can just check out the whole episode here. Have fun!

21 Mar 2014
How I remember the names of things

Me: “Remembering the names of things is the worst! Like, I can never remember which one is the trainwreck rule.”

Dave: “That’s the Law of Demeter.”

Me: “Right, I also can never remember which one the Law of Demeter is, so that makes sense. But I know and understand the actual principle!”

Dave: “Think of the dots as grains of wheat, and Demeter is the goddess of the harvest! Or think of the e’s in ‘Demeter’ as the dots in the trainwreck?”

Me: “Nah, but I can think of the e’s as regex dots and visualize the trainwreck as /D.m.t.r/! Though to be fair, that would also match Damatar, Dumutur, Dimitir…”

Dave: “Ooh, that works perfectly - with ancient Egyptian, when we don’t know what a vowel sound really was, ‘e’ is actually used as the default vowel!”
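For the regex-inclined, the mnemonic can be checked in a few lines. This is just an illustration of the pattern from the conversation, using Python's standard `re` module.

```python
import re

# Each "." stands in for one unknown vowel (or, strictly, any single character).
pattern = re.compile(r"D.m.t.r")

for name in ["Demeter", "Damatar", "Dumutur", "Dimitir", "Dmtr"]:
    print(name, bool(pattern.fullmatch(name)))
```

The first four all match; "Dmtr" does not, since every dot has to consume exactly one character.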


13 Jan 2014
Cryptic Crossword for 24Mag

Next weekend is the 2014 MIT Mystery Hunt, and I've been going through Prolog puzzles to prep in eager anticipation. And since I have puzzles on the brain, and this past weekend was the last issue of 24 Magazine, it seems like the right moment to finally post the first cryptic crossword I ever wrote! (This is from back when I was working on 24mag issue 4 in February of last year.)

This latest and final issue of 24 Magazine is stunningly beautiful, rich with color and texture, and I am a little in love with it. I'm totally allowed to say that, because I didn't work on this issue at all! But I'm incredibly proud of and impressed by my friends who did. You can read 24 Magazine issue 6 (the last issue ever!) online here.

So, cryptic crosswords! They're a bit different from the usual sort of crossword you might find in the paper. Each clue actually has two parts - a meaning clue, and a wordplay clue. Common forms of wordplay used in cryptic clues include (but are not limited to): anagrams, hidden words, double definitions, containers, and homophones. Oh, and you'll never see the meaning clue in the middle of the wordplay clue, mind - it'll always be at the beginning or the end.

Here's a great example from the 2012 MIT Mystery Hunt:

"Charge or no charge, rotten root must be extracted (3)"

You can deconstruct it as follows: "Charge" is the definition, and "no charge, rotten root must be extracted" is the wordplay clue. "No charge" is FREE, from which R ("rotten root" - the first letter of 'rotten') is "extracted". FREE minus R gets you to the answer: FEE.
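If you like seeing wordplay as string manipulation, that deconstruction can be written out as a small sketch. The `extract` helper is a made-up name for illustration.

```python
def extract(word, letter):
    """Remove the first occurrence of a letter from a word."""
    return word.replace(letter, "", 1)


no_charge = "FREE"
rotten_root = "rotten"[0].upper()  # the "root" (first letter) of "rotten"
answer = extract(no_charge, rotten_root)
print(answer, len(answer))
```

The result has three letters, matching the (3) in the clue, and it means "charge", matching the definition.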

The following is the first cryptic crossword I wrote, over the course of one long sleepless day and night of magazine construction. If the clues are too hard, I promise it's my fault. Enjoy!

(Extra thanks to Dave Turner, Mike Develin, and Martin DeMello for test-solving, brainstorming, and generally playing along with me back in February 2013 when I was writing this.)