About Projects Blog Recipes Email GitHub Twitter Press Feed

15 Dec 2013
Virex: a Vim-flavored Regex playground on the web

/ vi[RE] / x

a tool for exploring regular expressions in vim

I've been goofing around with Vim and Erlang lately, and since two great tastes taste great together, I made you a thing! My latest toy - Virex, where you can experiment with Vim's regex on the web. It's like Rubular but for Vim regex. Have fun!

I've been having a lot of fun mucking around with learning and tweaking my Vim-related toolset ever since I started working at Case Commons, so when a coworker asked why there wasn't anything like Rubular for Vim's regex, I jumped at the excuse to throw this together.

Using Virex is slower than just experimenting in Vim locally, natch, but it was fun to build and it comes in handy when you're out with your friends arguing about regular expressions in Vim with only your phone handy for trying to prove your point. (SHUT UP, this happens.)

The interesting part turned out to be thinking about security, which is what the rest of this post will mostly be about.

Erlang is handy for sending messages between processes (shocking, n/n?)

I wanted to delegate the user-input test strings and regex patterns directly to Vim rather than try to reimplement Vim's regex perfectly, while avoiding sending anything through the shell (danger zone like whoa, obvs). Erlang's open_port/2 function was the perfect solution.

(Okay, I admit, I also really just wanted an excuse to play with Erlang some more. No Starch Press offered to send me free books to review a few months ago, and on a whim I asked for a copy of Learn You Some Erlang for Great Good!. I've been having a lot of fun exploring Erlang on the side ever since the book arrived.

It's a pretty fantastic book, overall - concepts are explained clearly and thoroughly, in a way that I find very intuitive. My only caveat is that I found some of the examples used by the author distractingly offensive - I was really put off by bound variables being illustrated by a sadface dude in a suit standing next to a smiley lady in a white dress. Also, binary gender examples much? So, problematic. But "I like things, and some of those things are problematic." It's definitely also clear, thorough, and informative.)

Waaah don't shell out via Vim please

So, great, user input is bypassing the shell when being sent directly from my Erlang server to Vim. But wait, it's possible to shell out from Vim in various ways! Oh noes, we can't have that.

I'm highlighting matches by using Vim's regex substitution, %s/PATTERN/REPLACEMENT/g. The risky aspect of this is that the REPLACEMENT section can take any vimscript expression, so I don't want users to be able to escape the PATTERN section and potentially get arbitrary code executed that could let them shell out and cause trouble.

This led to a truly absurd bit of Erlang that rejects any user-input pattern with a forward slash preceded by an odd number of contiguous backslashes. Tsk tsk, don't go trying to escape my slash and causing trouble.

Regular Expressions Denial of Service attacks

The other big security risks I fretted over were Regular Expressions Denial of Service attacks. Regular expressions are pretty powerful, and can be written to run dangerously slowly and consume large amounts of memory.

I had a lot of fun testing Virex with this list of Evil Regexes I found on Wikipedia and this fabulous post Dave sent me, In search of an exponential time regex.

The biggest ReDoS problem Vim's regex seems to be susceptible to is greedy quantifiers along the lines of \(.*\)\{1,32000\} - that hung forever. Bummer.

After a bit of poking around, I determined that \{99,\} and \{,99\} were safe, but \{999,\} and \{,999\} are not. So, Virex rejects repetitions that are 3 digits are longer.

Here's the function I'm using to test whether a user-input regex pattern is safe:

safe(Pattern) when erlang:length(Pattern) > 80 ->
  false;
safe(Pattern) ->
  DangerousRegex = "\{-?[0-9]{3,}|[0-9]{3,}\\\\?\}|([^\\\\]|^)(\\\\\\\\)*/",
  re:run(Pattern, DangerousRegex) =:= nomatch.

As you can see, I'm limiting patterns to 80 characters on basic principle - if you need to test a longer regex than that, you can do it when you get home.

If the pattern is short enough, that long regex I've got there does two other checks - it makes sure that no quantifiers have repeats that are 3 digits are longer, and that no forward slashes are immediately preceded by an even number of backslashes.

regex_meme

(Why so many backslashes? Blame Erlang. I feel like half the time I spent on this little project was focused on making sure I was escaping characters properly as they went through Erlang, Vim, and oh god you got your regex in my regex.)

So.

To sum up - Virex is a webmachine app, with nginx acting as a reverse proxy and serving the static content, which sanitizes the user-input regex patterns and sends them off to Vim to test them out. Alex Feinman designed that awesome logo for me. The source code is here.

I adore Erlang's syntax, and I had a lot of fun exercising the paranoid portion of my brain and exploring evil regexes - hopefully I caught them all. If you can think of anything else I ought to test for, please let me know! (Ideally via twitter or pull requests, not by crashing my server, thankyouverymuch. ^^)

23 Jul 2013
Ruby: Case versus If (and a wee bit about Unless)

My awesome brother-in-law is learning coding generally and Ruby specifically lately, so we decided to check out ScreenHero and try some remote pairing.

One thing led to another, and next think you know he was asking me how you know when to use a case statement versus when to use if/elsif. We chatted about logic and clean design, and then started wondering if there really was a performance difference between case and if/elsif.

(Yes, yes, this sort of micro-optimization is usually way less important than readable and easy to maintain code. But it's still fun to think about!)

"If you find you're looking to optimize that intensely you probably don't want an interpreted language.

I generally try to avoid case statements simply because it makes it easy for me or another programmer to come along in the future and add another case, increasing the branching in a bit of code that should probably have only done one thing in the first place." - Jonan S.

Yep. Agreed on both points. And now that we've acknowledged that this exploration is just for funsies, let's move on.

Midwire ran a few benchmarks and concluded that if/elsif is faster than case because case "implicitly compares using the more expensive === operator".

It's true - case totally uses threequal under the hood! (I mean, there's a reason we say that threequal is for case equality.) But that doesn't feel like the end of the story, so let's see what else we can figure out here.

In some languages, case statements are implemented as hash tables to gain performance over if/elsif chains. Is that true in Ruby? If not, what is going on here? And where can we look for the answers?

THE RUBY DOCS AIN'T GONNA CUT IT THIS TIME

To be honest, I started this exploration by looking through the Ruby docs and clicking to toggle source on case, but there was nothing there. If and case are so fundamental that they can't actually be defined as functions and thus can't be explained by looking at the Ruby sourcecode. We have to look at the parser and compiler to figure out what's going on with them.

Why? Well, scroll down to SICP Exercise 1.6.

Let us assume for the sake of argument that Ruby had if as a language keyword, but wanted to define case in terms of if/elsif/else.

Let's pretend cond is just if, and rewrite this example...

(define (new-if predicate then-clause else-clause)
  (cond (predicate then-clause)
        (else else-clause)))

…more Ruby-ishly:

def new_if(predicate, then_clause, else_clause)
  if predicate
    then_clause
  else
    else_clause
  end
end

Well, okay. Now, what if we tried to use it like this?

irb(main):012:0> x = 0
=> 0
irb(main):013:0> new_if(true, x+=1, x+=2)
=> 1
irb(main):014:0> x
=> 3

Whoops! The then_clause and else_clause get evaluated before being passed in as arguments. That's no good. Okay, fine, what if we set up new_if to take clauses as lambdas instead?

def new_if(predicate, then_clause, else_clause)
  if predicate
    then_clause.call
  else
    else_clause.call
  end
end

Cool, but what about scoping? We want the clause that ends up evaluated to be able to change variables from the scope outside the conditional. Good thing that's not actually a problem in Ruby, where closures include bindings and don't just close over values from the outer scope.

irb(main):015:0> x = 0
=> 0
irb(main):016:0> new_if(true, lambda { x+=1 }, lambda { x+=2 })
=> 1
irb(main):017:0> x
=> 1

If that seems deeply weird to you, you might want to check out this post on how bindings and closures work in Ruby.

But anyways, we know that case doesn't have to take lambdas in Ruby, so we know that this can't be how it works. And if it's not a function defined in Ruby, but is a language keyword instead, we can't really quite figure out what's going on with it just by reading through the docs as usual. Bah, humbug. Where should we look next?

LUCKY FOR US, MRI IS OPEN SOURCE

When in doubt, it's time to refer to primary source materials.

Okay, I see something about NODE_CASE in compile.c. Looks like we go through each NODE_WHEN, deal with array predicates and such, and ultimately add instructions for checkmatch and branchif, like so:

ADD_INSN1(cond_seq, nd_line(vals), checkmatch, INT2FIX(VM_CHECKMATCH_TYPE_CASE | VM_CHECKMATCH_ARRAY));
ADD_INSNL(cond_seq, nd_line(vals), branchif, l1);

So, what’s VM_CHECKMATCH_TYPE_CASE? Turns out it’s check ‘patten === target’.

And just look a bit further down in insns.def and we find DEFINE_INSN checkmatch. Of course, it turns out that checkmatch calls check_match, which checks case equality with idEqq, which (again) turns out to be :===.

That was pretty neat. Now, what about this branchif business? Well, it seems to be defined here. Seems pretty straightforward. It looks like both case and if/elsif are implemented as sequences of conditionals and gotos in Ruby, so we can't expect to get the sort of performance boost with case in Ruby like we see in languages where case statements are implemented as hash tables instead.

That doesn't really answer the initial question, though. Threequal, got it, sure. But what if we use === in our if/elsif statements? Is there any other performance difference between case and if/elsif, really?

OPENING UP THE HOOD WITH RubyVM::InstructionSequence

Oh, to hell with primary source materials. Let's just open up the hood ourselves. Have you played with RubyVM::InstructionSequence yet? Seriously, it's just about the niftiest thing around.

Want to see the YARV ("Yet Another Ruby Virtual machine") bytecode your code really ends up translated into? Sure, no prob.

RubyVM::InstructionSequence::compile_file “[t]akes file, a String with the location of a Ruby source file, reads, parses and compiles the file, and returns iseq, the compiled InstructionSequence with source location metadata set.”

We can verify stuff pretty easily this way. Let's start by testing out a case statement:

number = 15
case number
when 15
  'fifteen'
when 5
  'five'
when 3
  'three'
else
  number
end
== disasm: <RubyVM::InstructionSequence:<main>@./case.rb>===============
local table (size: 2, argc: 0 [opts: 0, rest: -1, post: 0, block: -1] s1)
[ 2] number     
0000 trace            1                                               (   1)
0002 putobject        15
0004 setlocal_OP__WC__0 2
0006 trace            1                                               (   2)
0008 getlocal_OP__WC__0 2
0010 dup              
0011 opt_case_dispatch <cdhash>, 35
0014 dup                                                              (   3)
0015 putobject        15
0017 checkmatch       2
0019 branchif         42
0021 dup                                                              (   5)
0022 putobject        5
0024 checkmatch       2
0026 branchif         49
0028 dup                                                              (   7)
0029 putobject        3
0031 checkmatch       2
0033 branchif         56
0035 pop                                                              (  10)
0036 trace            1
0038 getlocal_OP__WC__0 2
0040 leave            
0041 pop              
0042 pop                                                              (  11)
0043 trace            1                                               (   4)
0045 putstring        "fifteen"
0047 leave                                                            (  11)
0048 pop              
0049 pop              
0050 trace            1                                               (   6)
0052 putstring        "five"
0054 leave                                                            (  11)
0055 pop              
0056 pop              
0057 trace            1                                               (   8)
0059 putstring        "three"
0061 leave   

And an if/elsif:

number = 15
if number == 15
  'fifteen'
elsif number == 5
  'five'
elsif number == 3
  'three'
else
  number
end
== disasm: <RubyVM::InstructionSequence:<main>@./if.rb>=================
local table (size: 2, argc: 0 [opts: 0, rest: -1, post: 0, block: -1] s1)
[ 2] number     
0000 trace            1                                               (   1)
0002 putobject        15
0004 setlocal_OP__WC__0 2
0006 trace            1                                               (   2)
0008 getlocal_OP__WC__0 2
0010 putobject        15
0012 opt_eq           <callinfo!mid:==, argc:1, ARGS_SKIP>
0014 branchunless     22
0016 trace            1                                               (   3)
0018 putstring        "fifteen"
0020 leave                                                            (   2)
0021 pop              
0022 getlocal_OP__WC__0 2                                             (   4)
0024 putobject        5
0026 opt_eq           <callinfo!mid:==, argc:1, ARGS_SKIP>
0028 branchunless     36
0030 trace            1                                               (   5)
0032 putstring        "five"
0034 leave                                                            (   4)
0035 pop              
0036 getlocal_OP__WC__0 2                                             (   6)
0038 putobject        3
0040 opt_eq           <callinfo!mid:==, argc:1, ARGS_SKIP>
0042 branchunless     50
0044 trace            1                                               (   7)
0046 putstring        "three"
0048 leave                                                            (   6)
0049 pop              
0050 trace            1                                               (   9)
0052 getlocal_OP__WC__0 2
0054 leave   

Let's pause for a moment and make some predictions. What do you think this all might mean?

(a/k/a Dear rabbit hole: I'm in you.)

Personally, I'm pretty intrigued by the fact that if/elsif uses branchunless, while case uses branchif. Based on that alone, I'd expect if/elsif to be faster in situations where one of the first few possibilities is a match, and for case to be faster in situations where a match is found only way further down the list (when if/elsif would have to make more jumps along the way on account of all those branchunlesses).

BENCHMARK ALL THE THINGS

To hell with reading the bytecode. Let's just benchmark some shit.

Here's my rough little benchmarking code. I actually tested if/elsif twice: once with ==, and once with ===.

I tested each option with a list of 15 predicates, ranging from 15 as the first down to 1 as the last. So according to my prediction, case should be fastest if I test with n == 1, and slowest when n == 15.

I ran the benchmark a bunch of times, and got results pretty consistently along these lines:

n = 1 (last clause matches)
if:           7.4821e-07
threequal_if: 1.6830500000000001e-06
case:         3.9176999999999997e-07
 
n = 15 (first clause matches)
if:           3.7357000000000003e-07
threequal_if: 5.0263e-07
case:         4.3348e-07

Benchmarking seems to mostly confirm our theory, except note that that threequal_if (a sequence of if/elsifs comparing with ===) was the slowest in both cases. It was slower by an order of magnitude when the last clause matched (n == 1), where both branchunless and the expensive === comparison were slowing it down, and even when the first clause matched (n == 15), when my guess is that the slowness of === outweighed the slowness of the single extra jump case had to make because of branchif.

(When I was initially writing this post, I had some messed up benchmarking and ended up way down the rabbit hole reading about branch prediction optimization, which your CPU deals with. This paper was interesting, too. But never mind that now.)

Anyways. None of this is dispositive, but we have some evidence and a better understanding of how things are implemented here, which is ultimately the real point.

And speaking of rabbit holes, I wonder why case and if/elsif have that different branching... something about different assumptions about how they'd be used, maybe? What about unless?

n = 1
unless n == 1
  puts "sadface"
end

As one might expect, unless uses branchunless (like case, and unlike if).

== disasm: <RubyVM::InstructionSequence:<main>@./unless.rb>=============
local table (size: 2, argc: 0 [opts: 0, rest: -1, post: 0, block: -1] s1)
[ 2] n          
0000 trace            1                                               (   1)
0002 putobject_OP_INT2FIX_O_1_C_ 
0003 setlocal_OP__WC__0 2
0005 trace            1                                               (   2)
0007 getlocal_OP__WC__0 2
0009 putobject_OP_INT2FIX_O_1_C_ 
0010 opt_eq           <callinfo!mid:==, argc:1, ARGS_SKIP>
0012 branchunless     17
0014 putnil           
0015 leave            
0016 pop              
0017 trace            1                                               (   3)
0019 putself          
0020 putstring        "sadface"
0022 opt_send_simple  <callinfo!mid:puts, argc:1, FCALL|ARGS_SKIP>
0024 leave 

And there you have it. Any conclusions I might draw from that would be pure speculation, so I'll leave the facts to stand on their own merits. Knowing is half the battle &c.

Hope this answers your question, Dan! And gives you a few extra tools for looking into the next few as well. ^^ Hooray for rabbit holes!

01 Jan 2013
The best books I read in 2012

This wasn’t such a great year in books for me (nothing compared to 2011, which was full of winners), but out of the 130 books I read in 2012, there were a few great ones, at least:


Fiction I loved reading in 2012:

  • The Tidewater Tales by John Barth - Long meandering story full of nested stories, about storytellers and sailing and Scheherezade (and a dash of the CIA, but that was my least favorite aspect). When I tried to make grilled chutney bananas as inspired by the characters’ favorite food they weren’t all that good, but the book is worth reading anyway.
  • The Guernsey Literary and Potato Peel Society by Mary Ann Shaffer & Annie Barrows - Startlingly good tale of a journalist/author who befriends a community of rural island folk by becoming their pen-pal shortly after World War II. The bits that truly caught me were the descriptions of how ordinary folks connect with books. This is a book that reminds us how to read.
  • The Mirage by Matt Ruff - The 9/11-inspired novel we’ve been waiting for.
  • Native Son by Richard Wright - Hard to read, in a brilliant way. “I’ll kill you and go to hell and pay for it.”
  • Embassytown by China Mieville - Like his The City and the City, the main value in this book lies in the fact that it has given us a metaphor for describing other things going forward. It’s on this list more for that usefulness than for its literary merit, which was good but not spectacular.
  • Some of the best from Tor.com: 2011 Edition edited by Patrick Nielsen Hayden & Liz Gorinsky - Nearly every story in here was a hit. I am utterly impressed - my like percentages for anthologies are usually much lower.
  • Last Night in Twisted River by John Irving - All of his novels tend to be long and complex and full of intriguing characters and… contrived in a really wonderful way, though I can’t quite think of the word for that at the moment.
  • The Girl Who Couldn’t Come by Joey Comeau - Short stories by the brilliant author of A Softer World.
  • Kindred by Octavia Butler - Time-travel, American history, slavery, feminism, family.
  • An Instance of the Fingerpost by Iain Pears - I don’t typically like mystery novels, but this was excellent. Four unreliable narrators, gender and political issues, science and medicine and experimentation - lots of great stuff going on here.
  • The Persistence of Vision by John Varley - Excellent s/f short stories.


Fiction I loved re-reading in 2012:

  • Wild Seed by Octavia Butler - A fantasy novel of genetic engineering.
  • Flowers for Algernon by Daniel Keyes - Still breaks my heart.
  • The Best of Michael Swanwick by Michael Swanwick - Most of these stories were rereads for me, but there were a few I hadn’t seen before. The man is simply a fucking genius, is all.
  • A Deepness in the Sky by Vernor Vinge - The best thing about Vinge is the way he goes so deeply into the implications of every idea he has.


Books I loved reading in 2012 that related to decision-making and problem-solving and communication:

  • Thinking, Fast and Slow by Daniel Kahneman - This was fucking brilliant and you should all read it right now. Cognitive errors that we all make, understood and named so we can try to be aware of and avoid them more easily, explained in a fun and readable voice! I’d read about a lot of the studies he cites here before, but it was still good to have them all nicely collected and described in a fun, readable voice in this volume.
  • Better: A Surgeon’s Notes on Performance by Atul Gawande - How to go from good enough to better yet. Ask. Don’t complain. Count something. Write. Change. Not as good as his other book, Complications, but a close second.
  • The Most Human Human: What Artificial Intelligence Teaches Us About Being Alive by Brian Christian - The author set out to win the Most Human Human prize as a confederate in the annual Turing Test, and writes about his process in figuring out what human communication involves. Fun, but also helpful in terms of thinking about how to talk and connect with other people generally in life.


I loved a lot of the John McPhee essays I read in 2012:

  • Uncommon Carriers by John McPhee - The best of the lot, for the sake of its first two essays. He follows a truck driver with a thoughtful bent. He discusses how captains of gigantic tankers are trained on scaled down models where even the wind and tides are scaled down by careful choice of location. Fascinating stuff.
  • Pieces of the Frame by John McPhee - Jimmy Carter may be the governor of Georgia, but he’s not the governor of Sam’s canoe.
  • Silk Parachute by John McPhee - I remain astonished that I was bored by the McPhee book I was required to read as a teenager, but love his work so very much now as an adult.


Other non-fiction I loved reading in 2012:

  • Mother Nature: Maternal Instincts and How They Shape the Human Species by Sarah Blaffer Hrdy - After having read about how often primate mothers kill their babies, I’m now even more grateful for my mother not strangling me as a child.
  • Unequal Childhoods: Class, Race, and Family Life by Annette Lareau - Insightful, interesting, and emotionally difficult.
  • What We Knew: Terror, Mass Murder, and Everyday Life in Nazi Germany: An Oral History by Eric A. Johnson & Karl-Heinz Reuband - That’s always been the question. Did they know? How much did they know, and when did they find out? How early on? Point being, of course, how culpable are they? This set of interviews attempts to explore that question. One of the best Holocaust books I’ve ever read, and as the granddaughter of survivors, I’ve read rather a lot of them.
  • The Emperor of Scent: A Story of Perfume, Obsession, and the Last Mystery of the Senses by Chandler Burr - Stupid title. This is actually a story about learning, discovery, and the dirty underbelly of scientific community politics. Also the science of scent.
  • Code: The Hidden Language of Computer Hardware and Software by Charles Petzold - The first 90ish pages got me through middle school Morse code basics &c and high school electronics, and then things started to get really interesting.
  • Confessions of a Pickup Artist Chaser by Clarisse Thorn - I feel dirty but fascinated!
  • The Best American Nonrequired Reading: 2011 edited by Dave Eggers - It was sometimes hard to tell which pieces were fiction and which were not, which was part of the fun.
  • Expressive Figure Drawing by Bill Buchman - Gorgeous prints of his sketches, along with useful exercises and advice.

12 Sep 2012
How I administer little treats to myself

It turns out that Twilio doesn't allow "spoofing" - you can't specify a from_number that isn't a Twilio number, so I can't just forward a text to a random charity directly, which is what I wanted to do in the first place. Bummer!

So instead, now I (or you, if you're into that kind of thing) can text 347-756-4559 with '5' or '10', and it will send back the phone number and code needed to donate that amount to a random charity off the list. (Text anything else, and it'll ask for a 5 or a 10 kthxbye.)

Walking down the street, pass a Starbucks, get that urge to grab a mocha? Feh, just text to get info on a charity that takes $5 donations by text, and send in one of those instead. Don't like the charity returned? Text again to roll the dice again, as it were.

List of charities adapted from here. My modified list is somewhat personalized, with a strong emphasis on charities that focus on medical research and poverty relief over religious institutions.

Just a tiny little simple Flask app, but it really does make me happy.

16 May 2012
Honey from Huffman trees

Did you know that NYC Open Data includes latitude/longitude for every tree in NYC? A tree census - amazing! Surely there's something fun we can do with that!

Well, bees in NYC do a lot of their nectar foraging from flowering trees. It could be interesting to see which kinds of trees my bees are likely foraging from. Bees can fly startlingly long distances to forage if they have to, but my sense is that a 2 mile radius is a decent rough estimate for how far they're likely to fly in search of nectar barring a barren neighborhood. Sure, there are plenty of other flowering plants in the city (I can certainly taste the clover in my spring honey, for instance), but it's still interesting to see which trees are likely contributing to the honey I harvest locally.

So, I pulled the tree census data into mysql to play with. (Bread crumbs for the inspired: I wanted to deploy on Dreamhost, so I used an old version of mysql and was stuck with mbrcontains narrowed down with the haversine equation (and cosine approximation, of course) to find only those trees actually within an n-mile radius of my starting point. Postgres and stcontains with a polygon approximation of the circle would be better, really.)

I'm displaying results as a Huffman tree - a visual representation of the data structure you create when you use Huffman encoding.

Visualized with D3.js, with leaf sizes proportional to the percentage of trees found of each type, it looks rather like this:

You can put in your own beehive address (or home, or neighborhood you're thinking of moving to, whatever makes you happy) and play with my tree-finder here.

When a thing is as dorky as it can possibly be, I know it is done right.