I have decided to try indexing the wikipedia HTML data dump with my zettair fork.
The specs of the machine: 8GB RAM and a Xeon X3440 (4 cores, 8 threads) at 2.53GHz.
The multicore nature of the machine did not really matter, since the code is single-threaded.
Some info on the dataset:
14257665 documents in total (just over 14 million, that is)
233GiB of data as per "du -kh" output.
The index size is 17GB.
Indexing this set took about 1.5 days.
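For a rough sense of throughput, here is a back-of-the-envelope calculation from the figures above (the 1.5 days is approximate, so these are ballpark numbers, not measurements):

```python
# Back-of-the-envelope indexing throughput from the figures in this post.
corpus_gib = 233           # "du -kh" size of the HTML dump
index_gb = 17              # resulting index size
docs = 14_257_665          # document count
seconds = 1.5 * 24 * 3600  # ~1.5 days of indexing

mib_per_s = corpus_gib * 1024 / seconds
docs_per_s = docs / seconds
# GiB -> GB (x1.0737...) so the ratio compares like with like
index_ratio = index_gb / (corpus_gib * 1.073741824)

print(f"~{mib_per_s:.1f} MiB/s, ~{docs_per_s:.0f} docs/s, "
      f"index ~{index_ratio:.0%} of corpus size")
```

So the indexer chewed through roughly 1.8 MiB/s (about 110 documents per second), producing an index around 7% the size of the raw HTML.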
The search times obviously depend on the term; below are some sample timings. "Cold" is the first time you search for a term, "warm" is a subsequent search for the same term.
cold: 20 results of 11194872 shown (took 3.886843 seconds)
warm: 20 results of 11194872 shown (took 1.542165 seconds)
cold: 20 results of 26981 shown (took 1.168747 seconds)
warm: 20 results of 26981 shown (took 0.065107 seconds)
cold: 20 results of 5056 shown (took 1.040630 seconds)
warm: 20 results of 5056 shown (took 0.022333 seconds)
cold: 20 results of 15340 shown (took 0.840945 seconds)
warm: 20 results of 15340 shown (took 0.039736 seconds)
cold: 20 results of 1198 shown (took 0.868518 seconds)
warm: 20 results of 1198 shown (took 0.055091 seconds)
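The cold/warm gap is presumably the OS page cache at work. Just as arithmetic on the samples above, the speedup (cold time divided by warm time) varies a lot by term:

```python
# Cold vs. warm speedup for the five sample queries above.
samples = [  # (cold seconds, warm seconds)
    (3.886843, 1.542165),
    (1.168747, 0.065107),
    (1.040630, 0.022333),
    (0.840945, 0.039736),
    (0.868518, 0.055091),
]
for cold, warm in samples:
    print(f"cold {cold:.2f}s -> warm {warm:.2f}s  ({cold / warm:.1f}x)")
```

The very common term (11 million hits) only speeds up about 2.5x, while the rarer terms speed up by factors of 15 to nearly 50 once their postings are cached.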
An interesting follow-up would be to see how well the indexing scales (i.e. index partial datasets of increasing size and graph the times). But I figured I'd write up what I have for now.
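That scaling experiment could be sketched roughly as follows. The `index_files()` callable here is a hypothetical stand-in for however your zettair fork is actually invoked (e.g. via subprocess); only the subset/timing scaffolding is shown:

```python
import time

def growing_subsets(paths, steps=5):
    """Split a document list into cumulative subsets (20%, 40%, ...,
    100% of the corpus) for a scaling experiment."""
    return [paths[: len(paths) * i // steps] for i in range(1, steps + 1)]

def time_indexing(paths, index_files):
    """Time index_files() -- a hypothetical wrapper around the real
    indexer -- on each cumulative subset.

    Returns a list of (subset size, elapsed seconds) pairs, ready
    to be graphed.
    """
    results = []
    for subset in growing_subsets(paths):
        start = time.monotonic()
        index_files(subset)  # stand-in for invoking the zettair indexer
        results.append((len(subset), time.monotonic() - start))
    return results
```

Plotting the resulting pairs would show whether indexing time grows linearly with corpus size or degrades as the on-disk structures get large.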