Tuesday, March 10, 2009

Zettair is back alive

Hooray! The Zettair folks down in Oz have revived development: an "in-development" version is now online.

Let's see what we have, with a brutal diff -rc across the directories :-) Disclaimer: this is not necessarily a complete list, at least for the autoconf part - that diff was fairly boring and long.
Alternatively, I might have missed something in the pile of whitespace changes, or did not realize it was important... :)

The configure script gained what looks like a 64-bit "kfreebsd" target. It was probably inherited from a newer version of autotools, but it is an interesting beast nonetheless: if I read it right, it is a GNU userland running on top of the FreeBSD kernel. The other tweaks belong to DragonFly BSD. I have heard about the HAMMER filesystem, but haven't tried the OS yet. Might be interesting; it all depends on the hardware support.

index.h:

The INDEX_NEW_VOCAB option is gone; INDEX_NEW_PARSEBUF (dictates how large the postings hashtable is) has appeared.

INDEX_NEW_NO_OFFSETS - build the index without offsets. Probably not very interesting for me...

INDEX_NEW_QTHREADS - how many query threads to allow. Now I wonder if this allows for simultaneous querying and indexing - that'd be sweet. My experiments with the previous version, running separate processes, were not terribly successful. With threads it might indeed be a different story.

some other similar parameter changes to index_load (as opposed to index_new)

index_cleanup is gone.

INDEX_SEARCH_DYNAMIC_RANK - provide the text describing the rank to be used as well as the rank parameters. Interesting, probably a bit too advanced for my needs.

autogenerated code for the metrics is obviously now parametrized.

mutexes for various pieces (for multithreading)

vocab_vector structure is simplified (looks like thanks to getting rid of multiple files)

a lot of multithreading-oriented code

a hacko-fix in makeindex_append_docno to get around the fact that zettair can't handle docno-s bigger than a pagesize

vocabulary operations seem to be simplified - previously the memory management was more sophisticated.

This completes the list - again, very probably I missed something important - if so, drop me a note in the comments.

11 comments:

apa, Angkringan said...

Hi, nice article.

I think you have a lot of experience using zettair.

Tonight I tried zettair and successfully indexed > 20000 HTML docs.

The search engine sometimes crashes (not for every HTML file in the results) when I use the highlighting summary option, with a segmentation fault coming from summary.c.

Do you have any experience with this? Maybe I need your support.

Andrew Yourtchenko said...

Hi, yes, the highlighting summary is buggy. I think I had fixed it in my fork - https://github.com/ayourtch/zf

My fork has also some more experiments (like doing "not" for the search).

It's still of course mostly for experimental use, but give it a try and tell me what your experience is.

apa, Angkringan said...

thanks for sharing your work, I plan to try it tonight.

I am trying to integrate zettair with g-wan (www.gwan.ch / www.gwan.com): a C search library combined with a fast C web application.

best regards and thanks.

Andrew Yourtchenko said...

This is not really "my" library per se - I just put on github the patches, in case they are useful.

I know g-wan but it does not come with the source - that would make me worried.

Take a look at mongrel2, lighttpd or nginx if you want a fast webserver.

Performance comparisons with Apache are not fashionable, IMHO.

Aris Setyawan said...

> Take a look at mongrel2, lighttpd or nginx if you want a fast webserver.

Thanks for your advice.

I have repeated my experiment with https://github.com/ayourtch/zf, and some bugs are gone:

-Now I can add a new document to an existing index.
-Now I can index some documents which I had to exclude from indexing before.
-Some summary highlighting errors that occurred before are gone

-I still get a segfault in summary capitalisation highlighting

The details of this error are:
-The file is in xml format. After debugging, I found the file; the link is www.al-ilmu.com/magazines/detail.php?id=76
-The term I want to search for is "tauhid" (an Indonesian word)
-Other terms trigger the error too when I search in this document.
-The commands I have tried for indexing are /usr/local/zettair/bin/zet -i -f debug www.al-ilmu.com/magazines/detail.php\?id\=76 and /usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug www.al-ilmu.com/magazines/detail.php\?id\=76 . All result in a segfault.
-The command I used for searching is /usr/local/zettair/bin/zet -n 10 -f debug --summary=capitalise
-The search succeeds if I disable summary highlighting

Is my error caused by the doc type I indexed being xml rather than plain html?

Aris Setyawan said...

Oh sorry, I was wrong. The doc type was not xml, but html.

Andrew Yourtchenko said...

If this bug is consistently reproducible, feel free to add all the info needed to recreate it to the repository - maybe I could find some time to deal with it. Though not in the next couple of weeks, for sure...

Aris Setyawan said...

Yes, it is reproducible. I have added it.

Aris Setyawan said...

Hi Andrew,

How do I do "not" for the search?

Andrew Yourtchenko said...

IIRC, you would just use "not" or something like that. Check the commit log ("git log") - it should be somewhere in there.

Aris Setyawan said...

found it:

#define AND_OP "AND"
#define OR_OP "OR"
#define NOSTOP_OP '+'
#define EXCLUDE_OP '-'

and I have tried it with 35 html docs (for a start), and it works as expected.
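
For readers wondering how four operators like these could be told apart in a query string, here is a rough sketch. It is not the fork's actual parser: the enum, the classify_token function, the case-sensitive matching and the "a prefix needs a term after it" rule are all assumptions made for illustration. Only the four #define lines come from the source quoted above.

```c
#include <string.h>

/* Operator spellings as quoted from the fork's source above. */
#define AND_OP "AND"
#define OR_OP "OR"
#define NOSTOP_OP '+'
#define EXCLUDE_OP '-'

/* Hypothetical token kinds; these names are not from Zettair. */
enum tok_kind { TOK_WORD, TOK_AND, TOK_OR, TOK_NOSTOP, TOK_EXCLUDE };

/* Classify one whitespace-delimited query token.
 * Assumption: keyword operators are matched case-sensitively. */
static enum tok_kind classify_token(const char *tok)
{
    if (strcmp(tok, AND_OP) == 0) return TOK_AND;
    if (strcmp(tok, OR_OP) == 0) return TOK_OR;
    /* Assumption: a prefix operator only counts if a term follows it. */
    if (tok[0] == NOSTOP_OP && tok[1] != '\0') return TOK_NOSTOP;
    if (tok[0] == EXCLUDE_OP && tok[1] != '\0') return TOK_EXCLUDE;
    return TOK_WORD;
}
```

Under these assumptions, a token such as -tauhid would be classified as an exclusion (the "not" asked about earlier), while a lone - or a lowercase and would fall through as an ordinary word.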