Tuesday, March 8, 2011

Autoextraction of Abstracts from RFCs and drafts

An idée fixe of mine (uh, I mean *one more*) is to somehow organize a collection of IETF docs: the RFCs and drafts that touch on IPv6 in one way or another. (Thanks to Fred Baker for this nice puzzle.)

So, what I have is 140 megabytes of data, sitting in just under 2000 files that represent the RFCs and various drafts.

The first step in doing anything at all with this pile is being able to chop it into chunks: put the congruent parts side by side, move the ASCII pictures aside, and similar mundane tasks.

The first step of that, in turn, is to extract the part that is there in almost every IETF doc: the "Abstract" section. In general, section titles start at column 0, with no indent, while the text of the paragraphs is typically indented by two or more columns. However, this is only a general rule, and there are zillions of exceptions over the years: variations in spelling, wrong indents, MS-DOS carriage returns, all sorts of nasty mess. Anyway, the first try at this is done.
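To make the column-0 rule concrete, here is a minimal sketch of that kind of extraction in Python. It is not the script I actually ran; the function name `extract_abstract` and the heading pattern are assumptions made up for illustration, and it only covers the common case (an "Abstract" heading at column 0, optionally numbered, followed by indented text).

```python
import re

# Assumed heading shape: optional section number, then "Abstract" at column 0,
# optionally followed by a colon. Real documents have more variants than this.
ABSTRACT_HEADING = re.compile(r"^(?:\d+\.?\s+)?abstract\s*:?\s*$", re.IGNORECASE)


def extract_abstract(text):
    """Return the Abstract body as one string, or None if no heading is found."""
    # Normalize MS-DOS (CRLF) and bare CR line endings before splitting.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    body = []
    in_abstract = False
    for line in lines:
        if not in_abstract:
            if ABSTRACT_HEADING.match(line):
                in_abstract = True
            continue
        if "\f" in line:
            break  # form feed: a page break ends the section in this sketch
        if line.strip() and not line[0].isspace():
            break  # next non-blank line at column 0: treat it as a new heading
        body.append(line.strip())
    if not in_abstract:
        return None
    # Collapse the indented lines into a single paragraph of text.
    return " ".join(part for part in body if part)


if __name__ == "__main__":
    sample = (
        "Network Working Group\r\n\r\n"
        "Abstract\r\n\r\n"
        "   This document specifies\r\n"
        "   version 6 of IP.\r\n\r\n"
        "Table of Contents\r\n"
    )
    print(extract_abstract(sample))
```

A sketch like this deliberately gives up early: an abstract that runs across a page break gets cut off at the page footer, and a heading with a wrong indent is missed entirely. Those are exactly the exceptions that need extra handling.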

I take the document titles from the title lines placed at the page breaks, and it shows: in some of them the month and the year are glued onto the right side. This is something that should get fixed eventually, if I figure out some heuristic.
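One candidate heuristic, sketched in Python and not yet tried on the real pile: strip a trailing "Month YYYY" (optionally preceded by a day number) from the end of the title. The name `strip_trailing_date` is made up for this example, and the pattern assumes full English month names.

```python
import re

# Assumption: dates at the right edge use full English month names.
MONTHS = (
    "January|February|March|April|May|June|July|August|"
    "September|October|November|December"
)

# Whitespace, an optional day number, a month name, a 4-digit year, end of line.
TRAILING_DATE = re.compile(r"\s+(?:\d{1,2}\s+)?(?:" + MONTHS + r")\s+\d{4}\s*$")


def strip_trailing_date(title):
    """Remove a date glued onto the right side of a title, if there is one."""
    return TRAILING_DATE.sub("", title).strip()


if __name__ == "__main__":
    print(strip_trailing_date("IPv6 Specification          December 1998"))
```

The obvious risk is a title that legitimately ends in a month and a year, which this would truncate; abbreviated months ("Dec 1998") would slip through untouched.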

Here's a result in case you find it useful at all:

Abstracts from some RFCs and drafts.
