Wednesday, June 16, 2010

HTTP parser with ragel

I've wanted to play with Ragel for a while, and as a target I chose to rewrite the HTTP parser that Zed made for mongrel. Partly because I wanted "my own" license, and partly because I do not like the way his code handles fields that are split across buffers: if I understand it right, the entire request needs to fit into the buffer. This is not very pure - once I have handled something, I should be able to discard it.

The state machine turned out to be very easy to make - mostly, take RFC 2616 and start transferring the BNF into Ragel format.

Then, define an entering action (>action_enter) and a leaving action (%action_exit) for each machine whose data needs to be captured.
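To make the enter/exit pairing concrete, here is a minimal C sketch of what the bodies of such actions might do. The names (capture, action_mark, action_http_method, remember_span) are hypothetical and not taken from the actual parser; the only assumption about Ragel is that a leaving action runs on the transition out of the machine, so p already points one byte past the captured field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: receives the captured field as pointer + length. */
typedef void (*field_cb)(void *ctx, const char *at, size_t len);

typedef struct {
    const char *mark;   /* where the field being captured starts */
    field_cb on_method; /* invoked when the method token is complete */
    void *ctx;          /* opaque pointer handed back to the callback */
} capture;

/* Body of an entering action (>mark): remember where the field starts. */
static void action_mark(capture *c, const char *p)
{
    c->mark = p;
}

/* Body of a leaving action (%http_method): p is one past the field's last byte. */
static void action_http_method(capture *c, const char *p)
{
    assert(c->mark != NULL && p >= c->mark);
    if (c->on_method)
        c->on_method(c->ctx, c->mark, (size_t)(p - c->mark));
    c->mark = NULL; /* the field is closed */
}

/* Example callback for illustration: just records what it was given. */
typedef struct { const char *at; size_t len; } span;

static void remember_span(void *ctx, const char *at, size_t len)
{
    span *s = ctx;
    s->at = at;
    s->len = len;
}
```

For the input "GET / HTTP/1.0", the entering action would fire at 'G' and the leaving action at the space after "GET", so the callback would see a 3-byte span.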

The most interesting part was the state-saving mechanism for the case when a field is split across multiple buffers, e.g.: "GET / HTTP/1.0\r\nHo", "s", "t: f", "oo", ".", "com\r\n\r\n"

The logic is simple: after each parse cycle, check whether a field is still open, and if it is, stash the part seen so far. Then, when retrieving the field value, check whether something was already stashed - if yes, append the just-collected data to the stash and return the stashed value.
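The stash logic can be sketched in C roughly like this. It is an illustration under stated assumptions, not the parser's real code: field_state, stash_append and feed are hypothetical names, and a toy scanner (the field is everything before the first ':') stands in for the Ragel machine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STASH_MAX 256 /* fixed size, mirroring the limitation of a fixed stash */

typedef struct {
    const char *mark;      /* start of the open field in the current buffer */
    char stash[STASH_MAX]; /* bytes of the field carried over from earlier buffers */
    size_t stash_len;      /* 0 means nothing has been stashed */
} field_state;

/* Append [from, to) to the stash; returns 0 on success, -1 if it does not fit. */
static int stash_append(field_state *st, const char *from, const char *to)
{
    assert(to >= from);
    size_t n = (size_t)(to - from);
    if (n > STASH_MAX - st->stash_len)
        return -1;
    memcpy(st->stash + st->stash_len, from, n);
    st->stash_len += n;
    return 0;
}

/* One "parse cycle" over a buffer. Returns 1 when the field is complete
 * (*out / *out_len describe it), 0 when more input is needed, -1 on overflow. */
static int feed(field_state *st, const char *buf, size_t len,
                const char **out, size_t *out_len)
{
    const char *pe = buf + len;
    const char *p = memchr(buf, ':', len);

    st->mark = buf; /* the field starts, or continues, at the top of this buffer */

    if (p == NULL) /* buffer ended with the field still open: stash it */
        return stash_append(st, st->mark, pe) == 0 ? 0 : -1;

    if (st->stash_len == 0) {
        /* Common case: the field sits inside one buffer, so no copy is made. */
        *out = st->mark;
        *out_len = (size_t)(p - st->mark);
        return 1;
    }

    /* Split case: stash the just-collected tail, then hand back the stash. */
    if (stash_append(st, st->mark, p) != 0)
        return -1;
    *out = st->stash;
    *out_len = st->stash_len;
    st->stash_len = 0; /* the bytes stay readable until the next append */
    return 1;
}
```

Feeding "Ho", "s", "t: f" one after another returns 0, 0, and then 1 with the 4-byte field "Host" coming from the stash, while feeding "Host: foo" in one piece returns a pointer straight into the caller's buffer.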

The nice part is that this voodoo with the stashing does not occur by default - so, in the non-pathological case the memory use will be quite small.

One known bug remains: the stash buffer has a fixed allocation - I did not use realloc yet.
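One possible shape for a realloc-based stash is sketched below. This is not what the parser does today; dyn_stash and its functions are hypothetical names, and the starting capacity of 64 bytes with doubling growth is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char *data; /* heap buffer, NULL until the first append */
    size_t len; /* bytes currently stashed */
    size_t cap; /* bytes allocated */
} dyn_stash;

/* Append n bytes, growing the buffer as needed. Returns 0 on success,
 * -1 on overflow or allocation failure (the old contents stay intact). */
static int dyn_stash_append(dyn_stash *s, const char *src, size_t n)
{
    assert(s != NULL);
    if (n == 0)
        return 0;
    if (n > SIZE_MAX - s->len)
        return -1;

    size_t need = s->len + n;
    if (need > s->cap) {
        size_t new_cap = s->cap ? s->cap : 64;
        while (new_cap < need) {
            if (new_cap > SIZE_MAX / 2) { /* doubling would overflow */
                new_cap = need;
                break;
            }
            new_cap *= 2;
        }
        char *grown = realloc(s->data, new_cap);
        if (grown == NULL)
            return -1; /* s->data is still valid and still owned by s */
        s->data = grown;
        s->cap = new_cap;
    }
    memcpy(s->data + s->len, src, n);
    s->len += n;
    return 0;
}

/* Release the buffer and reset the stash to its empty state. */
static void dyn_stash_free(dyn_stash *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
    s->cap = 0;
}
```

Since nothing is allocated until the first append, the non-pathological case would still cost no extra memory.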

The API is pretty close to Zed's - I like the self-contained nature of his parser and the callback.

For your pleasure, here's the nugget of the ragel code that represents the state machine.


htp_octet = (any);
htp_char = (ascii);
htp_upalpha = (upper);
htp_loalpha = (lower);
htp_alpha = (htp_loalpha | htp_upalpha);
htp_digit = (digit);
htp_ctl = (cntrl | 127);
htp_cr = ( 13 );
htp_lf = ( 10 );
htp_sp = ( ' ' );
htp_ht = ( 9 );
htp_quote = ( '"' );

htp_crlf = ( htp_cr? htp_lf ); # // Accommodate for unix NLs (bare LF) ?
# htp_crlf = ( htp_cr htp_lf ); # // do NOT accommodate for unix NLs ?

htp_lws = ( htp_crlf? (htp_sp | htp_ht)+ );

htp_not_ctl = (htp_octet - htp_ctl);

htp_text = (htp_not_ctl | htp_lws); # (htp_cr | htp_lf | htp_sp | htp_ht));

htp_hex = (xdigit);

htp_tspecials = (
'(' | ')' | '<' | '>' | '@' |
',' | ';' | ':' | '\\' | htp_quote |
'/' | '[' | ']' | '?' | '=' |
'{' | '}' | htp_sp | htp_ht);

htp_token_char = ((htp_char - htp_tspecials) - htp_ctl);
htp_token = (htp_token_char+);

# comments not supported yet - they require a sub-machine
# htp_comment_char = htp_text - ('(' | ')');
# htp_comment = ( '(' (htp_comment_char+ | htp_comment) ')' );

htp_quoted_char = (htp_text - '"');
htp_quoted_string = ( '"' htp_quoted_char* '"' );

htp_quoted_pair = '\\' htp_char;

htp_http_ver_major = htp_digit+ >mark %http_version_major;
htp_http_ver_minor = htp_digit+ >mark %http_version_minor;

htp_http_version = ("HTTP" "/" htp_http_ver_major "." htp_http_ver_minor);

htp_escape = ('%' htp_hex htp_hex);
htp_reserved = (';' | '/' | '?' | ':' | '@' | '&' | '=' | '+');
htp_extra = ('!' | '*' | '\'' | '(' | ')' | ',');
htp_safe = ('$' | '-' | '_' | '.');
htp_unsafe = (htp_ctl | htp_sp | htp_quote | '#' | '%' | '<' | '>');
htp_national = (htp_octet - (htp_alpha | htp_digit | htp_reserved | htp_extra | htp_safe | htp_unsafe));

htp_unreserved = (htp_alpha | htp_digit | htp_safe | htp_extra | htp_national);
htp_uchar = (htp_unreserved | htp_escape);
htp_pchar = (htp_uchar | ':' | '@' | '&' | '=' | '+');

htp_fragment = ( (htp_uchar | htp_reserved)* );
htp_query = ( (htp_uchar | htp_reserved)* );

htp_net_loc = ( (htp_pchar | ';' | '?' )* );
htp_scheme = ( (htp_alpha | htp_digit | '+' | '-' | '.')+ );

htp_param = ( (htp_pchar | '/')* );
htp_params = (htp_param (';' htp_param)* );

htp_segment = (htp_pchar*);
htp_fsegment = (htp_pchar+);
htp_path = (htp_fsegment ('/' htp_segment)*);

htp_rel_path = ( htp_path? (';' htp_params)? ('?' htp_query)? );
htp_abs_path = ('/' htp_rel_path);
htp_net_path = ("//" htp_net_loc htp_abs_path?);

htp_relative_uri = (htp_net_path | htp_abs_path | htp_rel_path);
htp_absolute_uri = (htp_scheme ':' (htp_uchar | htp_reserved)*);
htp_uri = ((htp_absolute_uri | htp_relative_uri) ('#' htp_fragment)?);

htp_host = (htp_alpha);
htp_port = (htp_digit+);

htp_http_url = ("http://" htp_host (':' htp_port)? (htp_abs_path)?);

htp_method = ("OPTIONS" | "GET" | "HEAD" | "POST" | "PUT" | "DELETE") >mark %http_method;


htp_request_uri = ('*' | htp_absolute_uri | htp_abs_path) >mark %http_uri;

htp_request_line = (htp_method htp_sp htp_request_uri htp_sp htp_http_version htp_crlf);

htp_header_name = htp_token+ >mark %http_header_name;
# fixme.
htp_header_value_char = htp_octet - htp_cr - htp_lf;
htp_header_value = htp_header_value_char+ >mark_value %http_header_value;

htp_some_header = (htp_header_name ':' htp_sp* htp_header_value htp_crlf);
htp_last_crlf = htp_crlf; # >{ printf("Last CRLF!\n"); eof = pe; };
htp_request = htp_request_line (htp_some_header)* htp_last_crlf;

main := (htp_request) @{ parser->done = 1; };
