Friday, June 10, 2011

Indexing wikipedia data dump with ZF

I decided to try indexing the Wikipedia HTML data dump with my Zettair fork.

The specs of the machine: 8GB RAM, 8-core Xeon X3440 at 2.53 GHz.

The multicore nature of the machine did not really matter, since the code is single-threaded.

Some info on the dataset:

14257665 documents in total (just over 14 million, that is)

233GiB of data as per "du -kh" output.

The index size is 17GB.

Indexing this set took about 1.5 days:

real 1935m54.592s
user 138m42.620s
sys 23m49.060s
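For a rough sense of throughput, a back-of-the-envelope sketch using the figures above (233 GiB of input, 14257665 documents, and the wall-clock "real" time):

```python
# Rough indexing throughput from the numbers in this post.
data_bytes = 233 * 1024**3          # 233 GiB of HTML input
docs = 14257665                     # total documents indexed
wall_seconds = 1935 * 60 + 54.592   # "real" time of the indexing run

mb_per_s = data_bytes / wall_seconds / 1e6
docs_per_s = docs / wall_seconds

print(f"{mb_per_s:.2f} MB/s, {docs_per_s:.0f} docs/s")
# → 2.15 MB/s, 123 docs/s
```

So the single-threaded indexer chewed through a little over 2 MB of HTML per second, sustained for the whole day and a half.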

The search times obviously depend on the word; below are some dummy samples. "Cold" is the first time you search, "warm" is a subsequent time.

"wikipedia"

cold: 20 results of 11194872 shown (took 3.886843 seconds)
warm: 20 results of 11194872 shown (took 1.542165 seconds)

"integral":

cold: 20 results of 26981 shown (took 1.168747 seconds)
warm: 20 results of 26981 shown (took 0.065107 seconds)

"schematic":

cold: 20 results of 5056 shown (took 1.040630 seconds)
warm: 20 results of 5056 shown (took 0.022333 seconds)

"surfing":

cold: 20 results of 15340 shown (took 0.840945 seconds)
warm: 20 results of 15340 shown (took 0.039736 seconds)

"asd":
cold: 20 results of 1198 shown (took 0.868518 seconds)
warm: 20 results of 1198 shown (took 0.055091 seconds)
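These result lines are easy to parse mechanically; a small sketch (assuming the line format is exactly as printed above) that pulls out the match count and time, and computes the cold/warm speedup:

```python
import re

# Assumed format of a result line, as printed in the samples above.
LINE = re.compile(r"(\d+) results of (\d+) shown \(took ([\d.]+) seconds\)")

def parse(line):
    """Return (shown, total, seconds) from one result line."""
    shown, total, secs = LINE.search(line).groups()
    return int(shown), int(total), float(secs)

cold = parse("20 results of 26981 shown (took 1.168747 seconds)")
warm = parse("20 results of 26981 shown (took 0.065107 seconds)")

print(f'"integral": {cold[2] / warm[2]:.0f}x faster warm')
# → "integral": 18x faster warm
```

The warm numbers are essentially the OS page cache doing its job; the rarer words see the biggest relative speedup.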

An interesting task would probably be to see how well the indexing scales (i.e. reindex partial datasets and graph the times). But I figured I'd write up what I have for now.

2 comments:

Software Engineer said...

Hi, does the dump include all the data on Wikipedia, or were you able to get some specific knowledge category? And how are the XMLs indexed, alphabetically or what?

Andrew Yourtchenko said...

Yeah it's just the HTML dump - the biggest set I could find without using a spider.