Thursday, June 24, 2010

The drug

A patient in pain is brought to the emergency room and is given a shot of drugs to relieve the pain. He has never tried the drugs before - and the very thought of it horrifies him - but there is no other way - the pain is too strong even for an energetic person like him.

The drugs bring relief, and when everyone falls asleep, the patient jumps out of the window and runs away. He continues the shots himself, and this allows a bearable life - for a very long time - without attending to the disease. Eventually the disease becomes so severe that it is no longer possible to live with it: the patient will die within a year unless the disease is cured for good - so he returns to the hospital and asks for the cure.

However, the patient soon realises that curing the disease means no more helpful shots of dope. Meanwhile, the shots have become a routine. They increase alertness and add to the feeling of well-being. Athletic performance is enhanced - so routine tasks get done much faster. Sometimes the patient feels a brief anxiety when looking at the mess around - but it does not last. And it's been a long time - life without the daily shot seems a hazy mirage with only one glaring label on it - the pain.

The pain is gone for good now - and so are the injections. But it feels worse than before. The dependency kicks in. Regular depression. No more euphoria that was offsetting the periods of boredom. It's all gray.

The patient needs to relearn life from the beginning. All the mistakes he has already made but forgotten - he needs to make again, one by one. It's a long journey, and every day, more than once or twice, he thinks about taking a magic shot again.

But somewhere there is a hidden thought: the joy of freedom, of no longer having to fit his entire life around the need for the next shot.

And this thought brings the meaning back to his life, allows him to continue further even through all the hardships.

It really is worth it.

Living without the life-long dependence.



to the most-debated networking feature of the past decade, with compliments for all the pain it alleviated

Sunday, June 20, 2010

"Your predecessor"

It's a painful reminder about the fragility of life and universe and everything.

Because the "predecessor" means the "one who deceased before".

Which means you too shall die.

Saturday, June 19, 2010

How much are your dreams worth to you ?

I've always said (mostly empirically, I never made any detailed analysis) that if you are after money, IT/telecom work is not your cup of tea. Go into finance - they pay noticeably more.

However, there is a detail to that. It was put sharply, to the point that it hurt, in one of the comments on an HN thread:

"It's cheap to get that salary, too. All you have to give up is your dreams. My friend who earns £150k hates her job, and used to tell me that every time I saw her."

Maybe this is an isolated case, maybe not. However, it gave me an idea for an interview question: "How much of a premium would you require to give up all your dreams?"

And it can make for a good discussion over a beer, too.

Wednesday, June 16, 2010

HTTP parser with ragel

I've wanted to play with ragel, and as a target I chose to rewrite an HTTP parser made by Zed for mongrel. Partly because I wanted "my own" license, partly because I do not like the way his code handles fields that are split: if I understand it right, the entire request needs to fit into the buffer. This is not very clean - once I have handled something, I should be able to discard it.

The state machine turned out to be very easy to make - mostly, take RFC 2616 and start transferring the BNF into ragel format.

Then, define the >action_enter and %action_exit for each pair of states that delimits the data to capture.

The most interesting part was the state-saving mechanism for the case when a field is split across multiple buffers, e.g.: "GET / HTTP/1.0\r\nHo", "s", "t: f", "oo", ".", "com\r\n\r\n"

The logic is simple: after each parse cycle, check whether a field is still open at the end of the buffer, and if it is, stash what has been collected so far. Then, when retrieving the field value, check whether something was already stashed - if yes, append the just-collected data to the stash and return the stashed value instead.
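The stashing idea can be sketched outside ragel in a few lines of Python. This is only an illustration under assumptions, not the parser's actual code: the class and method names are made up, and a "field" here is simply a run of bytes between spaces, CR and LF rather than a real HTTP field.

```python
class SplitFieldCollector:
    """Toy model of the mark/stash logic (hypothetical names, not the real API)."""

    def __init__(self):
        self.mark = None           # start offset of the open field in the current buffer
        self.stash = bytearray()   # only used when a field spans several buffers

    def enter(self, pos):
        # Corresponds to an entering action: remember where the field starts.
        self.mark = pos

    def exit(self, buf, pos):
        # Corresponds to a leaving action: return the complete field value.
        piece = buf[self.mark:pos]
        self.mark = None
        if self.stash:
            # Something was stashed earlier: append the tail and hand back the stash.
            self.stash += piece
            value = bytes(self.stash)
            self.stash.clear()
            return value
        return bytes(piece)        # common case: the field sits in one buffer

    def end_of_buffer(self, buf):
        # After each parse cycle: if a field is still open, save what we have.
        if self.mark is not None:
            self.stash += buf[self.mark:]
            self.mark = 0          # the field continues at the start of the next buffer


def collect_fields(chunks):
    """Feed the chunks one by one and return every delimited field found."""
    collector = SplitFieldCollector()
    fields = []
    for buf in chunks:
        for pos, byte in enumerate(buf):
            is_delimiter = byte in b" \r\n"
            if collector.mark is None and not is_delimiter:
                collector.enter(pos)
            elif collector.mark is not None and is_delimiter:
                fields.append(collector.exit(buf, pos))
        collector.end_of_buffer(buf)
    return fields


split = [b"GET / HTTP/1.0\r\nHo", b"s", b"t: f", b"oo", b".", b"com\r\n\r\n"]
print(collect_fields(split))
```

Feeding the same request as one buffer gives the same fields, and in that case the stash is never touched.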

The nice part is that this stashing voodoo does not happen by default - so in the non-pathological case the memory use will be quite small.

One known bug remains: the stash buffer has a fixed allocation - I do not use realloc yet.

The API is pretty close to Zed's - I like the self-contained nature of his parser and the callback.

For your pleasure, here's the nugget of the ragel code that represents the state machine.


htp_octet = (any);
htp_char = (ascii);
htp_upalpha = (upper);
htp_loalpha = (lower);
htp_alpha = (htp_loalpha | htp_upalpha);
htp_digit = (digit);
htp_ctl = (cntrl | 127);
htp_cr = ( 13 );
htp_lf = ( 10 );
htp_sp = ( ' ' );
htp_ht = ( 9 );
htp_quote = ( '"' );

htp_crlf = ( htp_cr? htp_lf ); # // Accommodate unix NLs (bare LF)
# htp_crlf = ( htp_cr htp_lf ); # // do NOT accommodate unix NLs

htp_lws = ( htp_crlf? (htp_sp | htp_ht)+ );

htp_not_ctl = (htp_octet - htp_ctl);

htp_text = (htp_not_ctl | htp_lws); # (htp_cr | htp_lf | htp_sp | htp_ht));

htp_hex = (xdigit);

htp_tspecials = (
'(' | ')' | '<' | '>' | '@' |
',' | ';' | ':' | '\\' | htp_quote |
'/' | '[' | ']' | '?' | '=' |
'{' | '}' | htp_sp | htp_ht);

htp_token_char = ((htp_char - htp_tspecials) - htp_ctl);
htp_token = (htp_token_char+);

# comments not supported yet - they require a sub-machine
# htp_comment_char = htp_text - ('(' | ')');
# htp_comment = ( '(' (htp_comment_char+ | htp_comment) ')' );

htp_quoted_char = (htp_text - '"');
htp_quoted_string = ( '"' htp_quoted_char* '"' );

htp_quoted_pair = '\\' htp_char;

htp_http_ver_major = htp_digit+ >mark %http_version_major;
htp_http_ver_minor = htp_digit+ >mark %http_version_minor;

htp_http_version = ("HTTP" "/" htp_http_ver_major "." htp_http_ver_minor);

htp_escape = ('%' htp_hex htp_hex);
htp_reserved = (';' | '/' | '?' | ':' | '@' | '&' | '=' | '+');
htp_extra = ('!' | '*' | '\'' | '(' | ')' | ',');
htp_safe = ('$' | '-' | '_' | '.');
htp_unsafe = (htp_ctl | htp_sp | htp_quote | '#' | '%' | '<' | '>');
htp_national = (htp_octet - (htp_alpha | htp_digit | htp_reserved | htp_extra | htp_safe | htp_unsafe));

htp_unreserved = (htp_alpha | htp_digit | htp_safe | htp_extra | htp_national);
htp_uchar = (htp_unreserved | htp_escape);
htp_pchar = (htp_uchar | ':' | '@' | '&' | '=' | '+');

htp_fragment = ( (htp_uchar | htp_reserved)* );
htp_query = ( (htp_uchar | htp_reserved)* );

htp_net_loc = ( (htp_pchar | ';' | '?' )* );
htp_scheme = ( (htp_alpha | htp_digit | '+' | '-' | '.')+ );

htp_param = ( (htp_pchar | '/')* );
htp_params = (htp_param (';' htp_param)* );

htp_segment = (htp_pchar*);
htp_fsegment = (htp_pchar+);
htp_path = (htp_fsegment ('/' htp_fsegment)*);

htp_rel_path = ( htp_path? (';' htp_params)? ('?' htp_query)? );
htp_abs_path = ('/' htp_rel_path);
htp_net_path = ("//" htp_net_loc htp_abs_path?);

htp_relative_uri = (htp_net_path | htp_abs_path | htp_rel_path);
htp_absolute_uri = (htp_scheme ':' (htp_uchar | htp_reserved)*);
htp_uri = ((htp_absolute_uri | htp_relative_uri) ('#' htp_fragment)?);

htp_host = (htp_alpha);
htp_port = (htp_digit+);

htp_http_url = ("http://" htp_host (':' htp_port)? (htp_abs_path)?);

htp_method = ("OPTIONS" | "GET" | "HEAD" | "POST" | "PUT" | "DELETE") >mark %http_method;


htp_request_uri = ('*' | htp_absolute_uri | htp_abs_path) >mark %http_uri;

htp_request_line = (htp_method htp_sp htp_request_uri htp_sp htp_http_version htp_crlf);

htp_header_name = htp_token+ >mark %http_header_name;
# fixme.
htp_header_value_char = htp_octet - htp_cr - htp_lf;
htp_header_value = htp_header_value_char+ >mark_value %http_header_value;

htp_some_header = (htp_header_name ':' htp_sp* htp_header_value htp_crlf);
htp_last_crlf = htp_crlf; # >{ printf("Last CRLF!\n"); eof = pe; };
htp_request = htp_request_line (htp_some_header)* htp_last_crlf;

main := (htp_request) @{ parser->done = 1; };
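As a quick sanity check of the token rule outside ragel, the same definition can be restated in Python. This is a hedged sketch: `is_token` and `TSPECIALS` are names invented here, and it assumes ragel's `ascii` means 0..127 and `cntrl` means 0..31, with 127 added explicitly as in `htp_ctl` above.

```python
# The separator set from htp_tspecials, including space and horizontal tab.
TSPECIALS = set('()<>@,;:\\"/[]?={} \t')


def is_token(s: str) -> bool:
    """True if s is one or more ASCII chars that are neither controls nor tspecials."""
    return bool(s) and all(
        ord(ch) < 128                                  # htp_char
        and ch not in TSPECIALS                        # minus htp_tspecials
        and not (ord(ch) < 32 or ord(ch) == 127)       # minus htp_ctl
        for ch in s
    )


for candidate in ("Host", "Content-Type", "Host:", "a b"):
    print(candidate, is_token(candidate))
```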

Sunday, June 6, 2010

Snooping on search engines with hex tricks.

The big hex number you see in the title of the blog is there for a reason. It's an experiment on the search engines out there.

What's this hex number? It's a SHA-1 hash of the string "Andrew Yourtchenko" - this gives a nice token that is unique in the whole world (because no one else has yet come up with the idea to hash my name and put the result online). This gives some very entertaining results if you try to search for this number in the search engines.
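For anyone who wants to make a similar token, a minimal sketch in Python follows. The function name is made up, and it assumes the name is hashed as UTF-8 text with no trailing newline; a different encoding or a stray newline would give a different hash.

```python
import hashlib


def name_token(name: str) -> str:
    """Return the SHA-1 of the name as a 40-character lowercase hex string."""
    # Assumption: UTF-8 bytes, no trailing newline.
    return hashlib.sha1(name.encode("utf-8")).hexdigest()


print(name_token("Andrew Yourtchenko"))
```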


  • Google: 2 results after folding the duplicates - main page + one post. Expanding them gives 25 results. Cool.

  • Bing: 0 results. Boo. I could not find any page on this blog using it. 2xBoo.

  • Yahoo: 2 results - main page and one post, different from the post that Google shows.

  • duckduckgo: One hit to the same post as Yahoo shows, not to the main page.

  • altavista: 2 results, same as Yahoo.

  • ask.com: 2 results, same as Google

  • cuil.com: 0 results.

  • baidu: 0 results. Somehow not surprising at all.

  • kosmix: 2 results in google web search (same as google), 4 results in the google blog search, 0 results in yahoo web search. Entertaining how they disagree with Yahoo.

  • yandex.ru: 0 results.

  • yebol.com: 2 results same as yahoo + 1 site result. Very interesting form of presentation. I've got to try this one for daily searches, even though it does not appear to be very fast compared to google.com.



Conclusions:


  1. I don't have many inbound links here - probably about two :)

  2. google.com has a better reach towards the "long tail" (EDIT: 'long tail' in this case being blogger.com - would be interesting to test e.g. typepad.com or other blog sites)

  3. There are many more than 127 billion pages on the web.

  4. yebol.com is a new toy to play with



Though some of the above are fairly obvious, it is not a bad result for a single SHA-1 hash, I think. It would be interesting to see what results this method gives for more popular blogs/websites.

The debatable point in this method is to what extent the search engines discriminate between the "oddball hex stuff" and the "common words". From all I know about search engines, they should not - on the contrary, it's the stopwords (words too frequent to be useful) that are usually filtered.
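To illustrate that last point, here is a toy sketch of stopword filtering at indexing time. It is an assumption about the general technique only, not a description of what any particular search engine does; the stopword list and the function name are invented for the example.

```python
# Tiny illustrative stopword list; real lists are longer and engine-specific.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}


def index_terms(text: str) -> list[str]:
    """Lowercase the text, split on whitespace, and drop the stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


print(index_terms("The hash of the name is 0a1b2c"))
```

A rare hex token survives this kind of filter untouched, which is why it should be as searchable as any other uncommon word.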