Escaping from TWiki

TWiki (1998 - ?)

A wiki, written in Perl. Modifiable via plugins. Used by FLOSS Manuals to write books. FLOSS Manuals want to abandon TWiki but retain the full history. TWiki is losing developers through bitter forking. This talk is about extracting history from TWiki. It has been billed as "Confessions of a detwikifier", so: Confession: I can't remember why I thought this was an interesting topic.

TWiki storage format

Independent wikis ("webs"), each a flat directory full of text files. Metadata is stored in-file. Previous revisions are stored with RCS. Pages are stored as written, in either HTML or TWiki markup. In my case, each "web" is a book and each file is a chapter, usually in plain HTML. Confession: I am writing this at 4:30 on the day of the talk.

RCS

* Revision Control System
* Started 1982, last release 1995
* Per-file revision control
* CVS was built on top for project-wide versioning
foo.txt
foo.txt,v  <-- the RCS file.
RCS is perfectly adequate for projects consisting of a single file. Confession: these confessions have quote marks because I am using the <q> tag to save typing.

TWiki RCS Example

head 1.9;
access;
symbols;
locks; strict;
comment @# @;
1.9
date 2008.10.31.09.54.02; author AdamHyde; state Exp;
branches;
next 1.8;
1.8
date 2008.10.05.03.28.51; author TomKleen; state Exp;
branches;
next 1.7;
SNIP
1.1
date 2007.05.14.17.10.07; author AdamHyde; state Exp;
branches;
next ;
desc
@none
@
1.9
log
@none
@
text
@%META:TOPICINFO{author="AdamHyde" date="1225446842" format="1.1" version="1.9"}%
<h1>Envelope Tool </h1>
<p>The envelope tool is probably the most important tool for Audacity users. It allows you to alter the volume of the sounds in Audacity which is especially important when you are combining ('mixing') several tracks together. </p>
SNIP
%META:PREFERENCE{name="xchange.status" value="complete"}%
%META:PREFERENCE{name="xchange.uid" value="96d1c1752589e4294ed28dd852d01d3e"}%
@
1.8
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="TomKleen" date="1223177331" format="1.1" version="1.8"}%
d55 1
a55 1
%META:PREFERENCE{name="xchange.status" value="published"}%
@
1.7
log
@none
@
text
@d1 53
a53 79
%META:TOPICINFO{author="AdamHyde" date="1209860466" format="1.1" version="1.7"}%
<h1> Envelope Tool </h1>
SNIP
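The header block of such a file is regular enough to read with a regular expression. What follows is a minimal sketch in Python (not the Perl used for the rest of this talk); `list_revisions` and `parse_rcs_date` are names made up for illustration, and the sketch assumes the usual RCS conventions that header dates are in UTC and that years before 2000 are written with two digits.

```python
import re
from datetime import datetime, timezone

# A revision entry in the header: "<rev> date <stamp>; author <name>; ..."
REVISION_RE = re.compile(
    r"^(?P<rev>\d+(?:\.\d+)+)\s+"
    r"date\s+(?P<date>\d+(?:\.\d+){5});\s+"
    r"author\s+(?P<author>[^;\s]+);",
    re.MULTILINE,
)


def parse_rcs_date(stamp):
    """Turn 'YYYY.MM.DD.hh.mm.ss' into an aware UTC datetime."""
    parts = [int(p) for p in stamp.split(".")]
    if parts[0] < 100:  # assumed: two-digit years mean 19xx
        parts[0] += 1900
    return datetime(*parts, tzinfo=timezone.utc)


def list_revisions(rcs_text):
    """Return [(revision, datetime, author)] in file order (newest first)."""
    # Only look at the header, so page text cannot produce false matches.
    header = re.split(r"^desc\b", rcs_text, maxsplit=1, flags=re.MULTILINE)[0]
    return [
        (m["rev"], parse_rcs_date(m["date"]), m["author"])
        for m in REVISION_RE.finditer(header)
    ]
```

For the example above, the 1.9 header date converts to the Unix time 1225446842, the same value TWiki wrote into that revision's TOPICINFO line.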

RCS delta format

* <'a' or 'd'><line number> <line count>
* <lines to add if adding>
* repeat
d1 1                           #delete 1st line
a1 2                           #add following two lines after original line 1 (now the beginning)
 data to be added
 here too
d24 3                          #delete 3 lines starting at line 24 (lines 24-26)
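A minimal sketch of applying one such delta, written in Python rather than the Perl used elsewhere in this talk. The function name `apply_delta` is made up for illustration. The one property it relies on is that every line number in a delta refers to the text as it was before any of the delta's commands were applied.

```python
def apply_delta(lines, delta):
    """Apply an RCS delta to a list of lines and return the new list.

    `lines` is the text the delta is relative to, `delta` the delta itself,
    both as lists of lines without trailing newlines.
    """
    result = []
    pos = 0  # how many of the original lines have been copied or dropped
    i = 0
    while i < len(delta):
        command = delta[i]
        i += 1
        start, count = (int(n) for n in command[1:].split())
        if command[0] == "d":
            # Keep everything before line `start`, then drop `count` lines.
            result.extend(lines[pos:start - 1])
            pos = start - 1 + count
        elif command[0] == "a":
            # Keep everything up to and including line `start`,
            # then insert the next `count` lines of the delta.
            result.extend(lines[pos:start])
            pos = start
            result.extend(delta[i:i + count])
            i += count
        else:
            raise ValueError("unexpected delta command: %r" % command)
    result.extend(lines[pos:])  # whatever is left of the original
    return result
```

Applied to a 30-line text, the delta shown above yields 28 lines: the two new lines, then original lines 2 to 23, then lines 27 to 30.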

There are dozens of robust RCS extraction tools

Or so you would think. Except:
* nobody would use bare RCS for a multifile repository, and
* everyone migrated to something sensible in 1997.
But you can trick tools into thinking a directory full of RCS files is a CVS repository by adding one or two magic CVS files. Confession: it was easy, but I have no idea what I did.

Using cvs2svn | git svn

The "history" is not chronological.
SomeChapter            v1.2        2008-06-11 
SomeChapter            v1.1        2008-06-03
AnotherChapter         v1.3        2009-05-22
AnotherChapter         v1.2        2009-01-06 
AnotherChapter         v1.1        2007-07-30 
YetAnotherChapter      v1.27       2008-06-10 
YetAnotherChapter      v1.26       2007-11-12 
YetAnotherChapter      v1.25       2006-10-01 
So you need to extract all the versions and sort them by date before putting them into git. Confession: I deleted all the broken repositories, so I can't do gitk screenshots.
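The reordering itself is small once every version carries its date. A sketch in Python, using the rows from the listing above; the tuple layout and the helper names `revision_key` and `chronological` are assumptions made for this example.

```python
# (chapter, revision, date) in the per-file order shown above.
VERSIONS = [
    ("SomeChapter", "1.2", "2008-06-11"),
    ("SomeChapter", "1.1", "2008-06-03"),
    ("AnotherChapter", "1.3", "2009-05-22"),
    ("AnotherChapter", "1.2", "2009-01-06"),
    ("AnotherChapter", "1.1", "2007-07-30"),
    ("YetAnotherChapter", "1.27", "2008-06-10"),
    ("YetAnotherChapter", "1.26", "2007-11-12"),
    ("YetAnotherChapter", "1.25", "2006-10-01"),
]


def revision_key(revision):
    """Compare revisions numerically, so that 1.10 sorts after 1.9."""
    return tuple(int(part) for part in revision.split("."))


def chronological(versions):
    """Oldest first; chapter and revision only break ties on equal dates."""
    return sorted(versions, key=lambda v: (v[2], v[0], revision_key(v[1])))


for chapter, revision, date in chronological(VERSIONS):
    print(date, chapter, "v" + revision)
```

The output interleaves the three chapters, which is the order the commits need to arrive in.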

Extracting TWiki data with RCS

for (1..$rev){
    my $version = `co -p -q -ko -r1.$_ $file`;   # $file: path of the chapter's RCS file
    #whatever
}
but that takes forever. So you write your own RCS parser that takes advantage of the RCS format and extracts the versions in reverse order, applying cumulative patches, and it is 20 times quicker.
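The parser used for the talk is not shown, so the following is only a sketch of the idea, in Python rather than Perl. It assumes a trunk-only file (no branches), with the text blocks stored newest first: the head revision in full, and each older revision as a delta against the one after it. All function names are made up for illustration.

```python
import re

# One text block: "<rev> log @...@ text @...@". Inside a string a literal
# "@" is doubled, so "[^@]|@@" walks exactly one string body.
DELTATEXT_RE = re.compile(
    r"^(\d+(?:\.\d+)+)\s+log\s+@(?:[^@]|@@)*@\s+text\s+@((?:[^@]|@@)*)@",
    re.MULTILINE,
)


def to_lines(text):
    """Split on newlines, ignoring the empty piece after a final newline."""
    lines = text.split("\n")
    if lines[-1] == "":
        lines.pop()
    return lines


def apply_delta(lines, delta):
    """Apply 'a'/'d' commands whose line numbers refer to `lines`."""
    result, pos, i = [], 0, 0
    while i < len(delta):
        kind = delta[i][0]
        start, count = (int(n) for n in delta[i][1:].split())
        i += 1
        if kind == "d":
            result.extend(lines[pos:start - 1])
            pos = start - 1 + count
        else:  # "a": copy up to `start`, then take `count` new lines
            result.extend(lines[pos:start])
            pos = start
            result.extend(delta[i:i + count])
            i += count
    return result + lines[pos:]


def all_revisions(rcs_text):
    """Yield (revision, lines) from the head revision back to the oldest."""
    lines = None
    for match in DELTATEXT_RE.finditer(rcs_text):
        body = to_lines(match.group(2).replace("@@", "@"))
        # The first block is the full head text; every later block is a
        # delta against the revision reconstructed just before it.
        lines = body if lines is None else apply_delta(lines, body)
        yield match.group(1), lines
```

The file is read once and each revision costs one pass over its delta, instead of one `co` process per revision.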

Dealing with TWiki metadata

Each version of each file starts with a line like this:
%META:TOPICINFO{author="AdamHyde" date="1179162607" format="1.1" reprev="1.1" version="1.1"}%
Which contains nothing useful that isn't in the RCS log, so it could be discarded. Sometimes TWiki creates a new revision that changes nothing but the metadata. A pathological case is Sugar/_index.txt,v which contains 15632 revisions, of which 27 are content changes.
   next if ($text eq $oldtext);
   ...
   $oldtext = $text;
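The comparison only helps once the metadata lines are out of the way, since a changed TOPICINFO line alone makes two revisions differ. A sketch in Python; treating every line of the form `%META:...{...}%` as metadata is an assumption of this example, and `strip_meta` and `content_revisions` are made-up names.

```python
import re

# A whole line holding one %META:TYPE{...}% entry, newline included.
META_LINE_RE = re.compile(r"^%META:\w+\{.*\}%[ \t]*\n?", re.MULTILINE)


def strip_meta(text):
    """Return the page text without its %META:...% lines."""
    return META_LINE_RE.sub("", text)


def content_revisions(revisions):
    """Yield only the (revision, text) pairs whose content changed.

    `revisions` is an iterable of (revision, text) in history order.
    """
    previous = None
    for revision, text in revisions:
        body = strip_meta(text)
        if body != previous:
            yield revision, text
        previous = body
```

A revision that only touches a `%META:PREFERENCE` or `%META:TOPICINFO` line is dropped; one that touches the HTML is kept.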

Problem 2.5: TWiki magic tags

Unfortunately TWiki also has magic tags that actually mean something. About 80 in use, once the spambot contributions are removed. Some are simple:
%BR%              <br>
%RED%             <font color="#ff0000">
%ENDCOLOR%        </font>
But also...
%IF{...}%        
%INCLUDE{...}%   include another page (with all kinds of modifier options).
Confession: yes, I actually did bumble along this path
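The simple tags really are simple: a lookup table and one regular expression cover them. A sketch in Python holding only the three tags listed above; `SIMPLE_TAGS` and `expand_simple_tags` are made-up names, and anything with arguments, such as `%IF{...}%` or `%INCLUDE{...}%`, is deliberately left alone.

```python
import re

# The three argument-free tags from the table above.
SIMPLE_TAGS = {
    "%BR%": "<br>",
    "%RED%": '<font color="#ff0000">',
    "%ENDCOLOR%": "</font>",
}

# Matches %NAME% only; tags with {...} arguments never match.
SIMPLE_TAG_RE = re.compile(r"%[A-Z]+%")


def expand_simple_tags(text):
    """Replace known argument-free tags and leave everything else as it is."""
    return SIMPLE_TAG_RE.sub(
        lambda match: SIMPLE_TAGS.get(match.group(0), match.group(0)),
        text,
    )
```

Unknown tags pass through unchanged, which makes it easy to see afterwards which of the 80 or so are still unhandled.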

Extracting the history as rendered

Either
1. Stick to the history as rendered for a representative user.
2. Go insane and support TWiki constructs.
If only there were a module for rendering TWiki pages. Confession: I pondered this for a while.

Using TWiki.pm and friends

#!/usr/bin/perl

=pod

Initially based on TWiki::UI::View for which the following copyright
notice applies:

# Copyright (C) 1999-2007 Peter Thoeny, peter@thoeny.org
# and TWiki Contributors. All Rights Reserved. TWiki Contributors
# are listed in the AUTHORS file in the root of this distribution.
# NOTE: Please extend that file, not this notice.
#
# Additional copyrights apply to some or all of the code in this
# file as follows:
# Based on parts of Ward Cunninghams original Wiki and JosWiki.
# Copyright (C) 1998 Markus Peter - SPiN GmbH (warpi@spin.de)
# Some changes by Dave Harris (drh@bhresearch.co.uk) incorporated
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version. For
# more details read LICENSE in the root of this distribution.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# As per the GPL, removal of this notice is prohibited.

(That last claim seems dubious, but there you go).

=cut

use strict;
use warnings;
use integer;
use POSIX qw(strftime);
use Data::Dumper;

use constant FIND_EMAILS => 1;

BEGIN {
    # Set library paths in @INC, at compile time
    unshift @INC, '/home/douglas/fm-data/floss/bin/';
    require 'setlib.cfg';
}

require TWiki;

my $DEFAULT_DOMAIN = 'flossmanuals.net';
#my $TWIKI_PATH = '/home/douglas/fm-data/twiki-data';
my $TWIKI_PATH = '/home/douglas/fm-data/import-tests/twiki-books';
my $STAGING_DIR = '/home/douglas/fm-data/import-tests/staging';

my @BAD_CHAPTERS = qw{WebAtom WebPreferences WebChanges WebRss
                      WebCreateNewTopic WebSearchAdvanced WebHome WebSearch
                      WebIndex WebStatistics WebLeftBar WebTopicList
                      WebNotify WebTopicCreator WebTopicEditTemplate
                 };

my %BAD_BOOKS = map {($_ => 1)} qw{Main TWiki PR Trash};

=pod

Get the text and metadata of a twiki page revision. An attempt is made
to put the committer's correct email in the metadata.

render_version($webName, $topicName, $revision, $session, $raw) ==> ($text, $meta)

   * =$webName= the book.
   * =$topicName= the chapter.
   * =$revision= the required version (undef for the latest).
   * =$session= a TWiki instance. If undef, it will be created using
     the admin login.
   * =$raw= is a flag to prevent the expansion of TWiki variables.

Return values:

   * =$text= the text of the chapter, with twiki tags filled out
   * =$meta= a TWiki::Meta object

=cut

sub render_version {
    my ($webName, $topicName, $revision, $session, $raw) = @_;
    $session ||= new TWiki('admin');

    my ($meta, $text) = $session->{store}->readTopic($session, $webName,
                                                     $topicName, $revision);

    # Try to set the author's email to something sensible, if possible
    my $info = $meta->get('TOPICINFO');
    my $author = $info->{'author'};
    if (! defined $author){
        $author = 'Anonymous';
        print STDERR "Anonymous author on $webName.$topicName v$revision\n";
    }
    if (FIND_EMAILS){
        my $user = $session->{users}->findUserByWikiName($author);
        $info->{email} = $session->{users}->getEmails($user);
    }
    if (! $info->{email}){
        $info->{email} = "$author\@$DEFAULT_DOMAIN";
    }
    $info->{book} = $webName;

    if(! $raw) {
        $session->enterContext('body_text');
        $text = $session->handleCommonTags($text, $webName, $topicName, $meta);
        $text = $session->renderer->getRenderedVersion($text, $webName, $topicName);
        $session->leaveContext('body_text');
        $text =~ s/( ?) *<\/?(nop|noautolink)\/?>\n?/$1/gois;
    }
    return $text, $meta;
}

=pod

extract_all_versions($webName, $topicName, $session)

Find all revisions of the chapter $topicName in book $webName and put
them in an array indexed by revision numbers. Return the array as a
reference. Version 0 is undefined.

=cut

sub extract_all_versions {
    my ($webName, $topicName, $session) = @_;
    if (! $session){
        $session = new TWiki ('admin');
    }
    my $head = $session->{store}->getRevisionNumber($webName, $topicName);
    #my @versions = ['', new TWiki::Meta($session, $webName, $topicName)];
    my @versions;
    my $oldtext = 'Something to avoid annoying warnings';
    foreach (1 .. $head){
        my ($text, $meta) = render_version($webName, $topicName, $_, $session);
        # Keep only revisions whose rendered text differs from the previous one
        if ($text ne $oldtext){
            $versions[$_] = [$text, $meta];
        }
        $oldtext = $text;
    }
    return \@versions;
}

=pod

save_versions($webName, $topicName, $dest)

Save versions of chapter $topicName of book $webName in files in the
directory $dest. Their names are structured thus:

$dest/$webName-$topicName.$revision_number.txt

$dest ought to exist. Files will be overwritten.

=cut

sub save_versions {
    my ($webName, $topicName, $dest) = @_;
    if (! -d $dest || ! -w $dest){
        die "'$dest' is not a writeable directory; it should be.\n";
    }
    my $versions = extract_all_versions($webName, $topicName);
    foreach my $r (1 .. $#$versions){
        my $v = $versions->[$r];
        next unless defined $v;
        my ($text, $meta) = @$v;
        my $path = "$dest/$webName-$topicName.$r.txt";
        open FILE, '>', $path or die "can't write '$path': $!\n";
        print FILE $meta->stringify . "\n$text\n";
        close FILE;
    }
}

=pod

get_chapters

Find all the real chapters in the repository. TWiki will have made a
number of unused pages by default: these are filtered out.

=cut

sub get_chapters {
    my $book = shift || die "chapters of which book?";
    my $session = shift || new TWiki ('admin');
    my @chapters = grep {
        my $chapter = $_;
        $chapter =~ s/Talk$//;
        ! grep {$chapter eq $_} @BAD_CHAPTERS;
    } $session->{store}->getTopicNames($book);
    return @chapters;
}

=pod

printer

Summary of a chapter revision to STDOUT.

=cut

sub printer {
    my ($text, $meta, $date, $title) = @_;
    my $info = $meta->get('TOPICINFO');
    my $s = sprintf ("%11s %20.20s %-5.5s %15.15s %.30s",
                     $date, $title, $info->{version}, $info->{author}, $text);
    $s =~ s/\n/ /g;
    print "$s\n";
};

sub staging_header {
    my $meta = shift;
    my $header = "chapter: $meta->{_topic}\n";
    my $info = $meta->get('TOPICINFO');
    foreach ('book', 'version', 'date', 'author', 'email'){
        if ($info->{$_}){
            $header .= "$_: $info->{$_}\n";
        }
    }
    $header .= "book2: $meta->{_web}\n";
    foreach my $type (qw {FILEATTACHMENT PREFERENCE TOPICPARENT FIELD TOPICMOVED}){
        next unless defined $meta->{$type};
        foreach my $item (@{$meta->{$type}}){
            $header .= "$type:";
            foreach my $k (sort keys %$item ){
                $header .= " '$k'='$item->{$k}'";
            }
            $header .= "\n";
        }
    }
    return $header;
}

=pod

Save a chapter in the staging directory.

=cut

sub stage_commit {
    my ($text, $meta, $date, $title) = @_;
    my $info = $meta->get('TOPICINFO');
    my $filename = sprintf ("%010d.%s.%s.%s", $date, $info->{book},
                            $title, $info->{version});
    my $dir = $STAGING_DIR . '/' . strftime('%Y-%m', localtime($date));
    mkdir($dir) unless (-d $dir);
    my $fh;
    open ($fh, '>', "$dir/$filename") or die "can't write '$dir/$filename': $!\n";
    print $fh staging_header($meta);
    print $fh "\n------8<-----------------\n";
    print $fh $text;
    close $fh;
}

=pod

process_book

Given a book and a function reference, call the function on every
revision of every chapter of the book. The function should take the
arguments ($text, $meta, $date, $title).

=cut

sub process_book {
    my $book = shift;
    my $function = shift || \&printer;
    my $session = shift || new TWiki ('admin');
    my @chapters = get_chapters($book, $session);
    #print STDERR "@chapters\n";
    my %commits;
    for my $chapter (@chapters){
        #print STDERR "'$book' '$chapter'\n";
        my $versions = extract_all_versions($book, $chapter, $session);
        for my $v (@$versions){
            next unless defined $v;
            eval {
                # Group revisions by timestamp so they can be replayed in date order
                my $date = $v->[1]->get('TOPICINFO')->{'date'};
                push @$v, $chapter;
                my @c;
                if (defined $commits{$date}){
                    @c = @{$commits{$date}};
                }
                push @c, $v;
                $commits{$date} = \@c;
            };
            if ($@){
                warn $@;
                print STDERR $v; #Dumper($v);
            }
        }
    }
    my @dates = sort {$a <=> $b} keys %commits;
    for my $date (@dates){
        for my $v (@{$commits{$date}}){
            my ($text, $meta, $title) = @$v;
            &$function($text, $meta, $date, $title);
        }
    }
}

sub process_repository {
    my $function = shift || \&printer;
    my $filter = shift || '.';
    my $session = shift || new TWiki ('admin');
    for my $book ($session->{store}->getListOfWebs){
        next if ($BAD_BOOKS{$book} || $book !~ /$filter/);
        print STDERR "'$book'\n";
        process_book($book, $function, $session);
    }
}

process_repository(\&stage_commit);
#process_book($ARGV[0], \&stage_commit);
#process_book($ARGV[0], \&printer);
#save_versions @ARGV;
#save_versions "Inkscape", "Introduction", $ARGV[0];
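The staged files are meant to be read back by whatever builds the git history, and that step is not shown here. As a sketch only, this is how the format written by stage_commit and staging_header could be read back in Python; `parse_staged` is a made-up name, and returning the header as a list of pairs is a choice made for this example because types such as PREFERENCE can repeat.

```python
# The separator line that stage_commit writes between header and text.
SCISSORS = "\n------8<-----------------\n"


def parse_staged(content):
    """Split one staged file into ([(key, value), ...], text)."""
    header, found, text = content.partition(SCISSORS)
    if not found:
        raise ValueError("no separator line found")
    fields = []
    for line in header.splitlines():
        if not line.strip():
            continue
        # "key: value"; metadata lines look like "PREFERENCE: 'k'='v' ..."
        key, _, value = line.partition(":")
        fields.append((key, value.strip()))
    return fields, text
```

Because each file name starts with the commit time zero-padded to ten digits, sorting the file names as plain strings gives the commits oldest first.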