How search.gmane.org works

Olly Betts

Gmane

…is an archive of public mailing lists, started in early 2002 by Lars Magne Ingebrigtsen ("Mr Gnus"). It's hosted in Norway.

Some key features

Gmane statistics

we:search

Hardware

rain
Current search server. 2U Athlon 64 2GHz machine with 3GB RAM and about 800GB of disk. INN spool is NFS-mounted.
plane
New search server. 1U Dual dual-core Opteron 2.4GHz machine, 16GB RAM with 3x1.5TB disks for index storage. Has its own spool mirror.

The Date Spool

On to the nitty-gritty:

Format of the Date Spool

One file per minute, e.g.: /ispool/datespool/2009/09/21/13/10
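
As a minimal sketch of the naming scheme, the path for a message can be derived from its Unix timestamp. The function name and the `root` default are illustrative, and the use of UTC is an assumption, although it is consistent with the example here: timestamp 1253538608 is 13:10:08 UTC on 2009-09-21.

```python
from datetime import datetime, timezone


def datespool_path(timestamp: int, root: str = "/ispool/datespool") -> str:
    """Return the per-minute spool file path for a Unix timestamp.

    Assumption: directories are named from the UTC date and time.
    """
    t = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # Year/month/day/hour/minute, one file per minute.
    return root + "/" + t.strftime("%Y/%m/%d/%H/%M")


print(datespool_path(1253538608))
```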

Contents:

macro@sourceware.org
1253538608
1426
From: macro@sourceware.org
Subject: src/gas/testsuite ChangeLog gas/mips/eret-1.d  ...
Xref: news.gmane.org gmane.comp.gnu.binutils.cvs:14452

CVSROOT:        /cvs/src
Module name:    src
[...]
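
A rough sketch of reading one such record follows. The slides do not say what the first and third lines mean; this sketch guesses that they are a sender address and a length, and the field names are hypothetical. It uses Python's standard `email` module rather than libgmime.

```python
from email import message_from_string


def parse_record(text: str) -> dict:
    """Split a date spool record into its preamble, headers and body.

    Assumed layout: three preamble lines (address, time_t, length),
    followed by a normal header block, a blank line, and the body.
    """
    first, timestamp, third, rest = text.split("\n", 3)
    msg = message_from_string(rest)

    group, article = None, None
    xref = msg["Xref"]
    if xref:
        # e.g. "news.gmane.org gmane.comp.gnu.binutils.cvs:14452"
        parts = xref.split()
        if len(parts) >= 2:
            group, _, number = parts[1].rpartition(":")
            article = int(number)

    return {
        "address": first,            # guessed meaning
        "timestamp": int(timestamp),
        "length": int(third),        # guessed meaning
        "subject": msg["Subject"],
        "group": group,
        "article": article,
        "body": msg.get_payload(),
    }
```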

Creating the Date Spool

  • Parse all messages in the INN spool (using libgmime)
  • Create a file with: <time_t> <path>
  • Sort it
  • Create the date spool in that order
  • Build the Xapian index from the date spool
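
The first three steps above can be sketched as follows. This is an illustration, not the actual Gmane code: it swaps in Python's standard `email` module for libgmime, the function names are made up, and it assumes the time comes from each message's `Date` header.

```python
import os
from email import message_from_binary_file
from email.utils import parsedate_to_datetime


def message_time(path):
    """Return the message's Date header as a Unix timestamp, or None."""
    with open(path, "rb") as f:
        msg = message_from_binary_file(f)
    date = msg["Date"]
    if date is None:
        return None
    try:
        return int(parsedate_to_datetime(str(date)).timestamp())
    except (TypeError, ValueError, OverflowError):
        return None  # unparseable date: skip in this sketch


def build_time_index(spool_root, index_path):
    """Write sorted "<time_t> <path>" lines for every message under spool_root."""
    entries = []
    for dirpath, _dirs, files in os.walk(spool_root):
        for name in files:
            path = os.path.join(dirpath, name)
            t = message_time(path)
            if t is not None:
                entries.append((t, path))
    entries.sort()  # oldest first, so the date spool is created in time order
    with open(index_path, "w") as out:
        for t, path in entries:
            out.write(f"{t} {path}\n")
    return entries
```

Sorting up front means the date spool, and the index built from it, can be written sequentially in date order.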

Updating the Date Spool

  • New messages are added to the date spool
  • Also appended to an incremental file
  • Build incremental Xapian index and merge
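
A minimal sketch of the first two steps, under several assumptions: records are simply appended to the per-minute file, the incremental file receives the same bytes, paths are derived from UTC, and all names are hypothetical. The Xapian step is left out; one known way to merge Xapian databases is the `xapian-compact` tool, but the slides do not say which method is used.

```python
import os
from datetime import datetime, timezone


def add_message(record: bytes, timestamp: int,
                spool_root: str, incremental_path: str) -> str:
    """Append a record to its per-minute spool file and to the incremental file.

    Returns the path of the spool file that was written.
    """
    t = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    path = os.path.join(spool_root, t.strftime("%Y/%m/%d/%H/%M"))
    os.makedirs(os.path.dirname(path), exist_ok=True)

    # Permanent home of the message in the date spool.
    with open(path, "ab") as spool_file:
        spool_file.write(record)

    # Queue of not-yet-indexed messages for the next incremental index build.
    with open(incremental_path, "ab") as incremental:
        incremental.write(record)

    return path
```

A periodic job would then index the incremental file into a small database, merge it into the main index, and start a fresh incremental file.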

Xapian features used

Future Plans

  • Finish commissioning the new server
  • Search API (e.g. RSS feeds of results)
  • Easier group search (cpan not gmane.comp.lang.perl.cpan.*)
  • More frequent updates

The End

 Questions welcome

 
