Escaping from TWiki
TWiki (1998 - ?)
A wiki, written in Perl.
Modifiable via plugins.
Used by FLOSS Manuals to write books.
FLOSS Manuals want to abandon TWiki but retain the full history.
Losing developers through bitter forking.
This talk is about extracting history from TWiki.
It has been billed as "Confessions of a detwikifier", so:
Confession: I can't remember why I thought this was an interesting topic
TWiki storage format
Independent wikis ("webs"), each a flat directory full of text files.
Metadata is stored in-file.
Previous revisions are stored with RCS.
Pages are stored as written, in either HTML or TWiki markup.
In my case, each "web" is a book and each file is a chapter, usually in plain HTML.
Confession: I am writing this at 4:30 on the day of the talk.
RCS
* Revision Control System
* Started 1982, Last release 1995.
* per-file revision control
* CVS was built on top for project-wide versioning
foo.txt
foo.txt,v <-- the RCS file.
RCS is perfectly adequate for projects consisting of a single file.
Confession: these confessions have quote marks because I am using the <q> tag to save typing.
TWiki RCS Example
head 1.9;
access;
symbols;
locks; strict;
comment @# @;
1.9
date 2008.10.31.09.54.02; author AdamHyde; state Exp;
branches;
next 1.8;
1.8
date 2008.10.05.03.28.51; author TomKleen; state Exp;
branches;
next 1.7;
SNIP
1.1
date 2007.05.14.17.10.07; author AdamHyde; state Exp;
branches;
next ;
desc
@none
@
1.9
log
@none
@
text
@%META:TOPICINFO{author="AdamHyde" date="1225446842" format="1.1" version="1.9"}%
<h1>Envelope Tool
</h1>
<p>The envelope tool is probably the most important tool for Audacity users. It allows you to alter the volume of the sounds in Audacity which is especially important when you are combining ('mixing') several tracks together.
</p>
<i>SNIP</i>
%META:PREFERENCE{name="xchange.status" value="complete"}%
%META:PREFERENCE{name="xchange.uid" value="96d1c1752589e4294ed28dd852d01d3e"}%
@
1.8
log
@none
@
text
@d1 1
a1 1
%META:TOPICINFO{author="TomKleen" date="1223177331" format="1.1" version="1.8"}%
d55 1
a55 1
%META:PREFERENCE{name="xchange.status" value="published"}%
@
1.7
log
@none
@
text
@d1 53
a53 79
%META:TOPICINFO{author="AdamHyde" date="1209860466" format="1.1" version="1.7"}%
<h1>
Envelope Tool
</h1>
SNIP
RCS delta format
* <'a' or 'd'><line number> <line count>
* <lines to add, if adding>
* repeat
d1 1 #delete one line, starting at line 1
a1 2 #add the following two lines after line 1
data to be added
here too
d24 3 #delete lines 24-26
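A minimal sketch of applying one such delta, written in Python rather than the Perl used later in this talk. The name apply_rcs_delta is made up for illustration. It assumes the commands arrive in ascending line order and that every line number refers to the text as it was before the delta was applied.

```python
def apply_rcs_delta(old_lines, delta_lines):
    """Apply one RCS delta (a list of lines) to a list of text lines."""
    out = []
    pos = 0  # how many lines of old_lines have been consumed so far
    i = 0
    while i < len(delta_lines):
        command = delta_lines[i]
        i += 1
        op = command[0]
        start, count = (int(n) for n in command[1:].split())
        if op == "d":
            # Keep everything before line `start`, then skip `count` lines.
            out.extend(old_lines[pos:start - 1])
            pos = start - 1 + count
        elif op == "a":
            # Keep everything up to and including line `start`,
            # then insert the next `count` lines of the delta.
            out.extend(old_lines[pos:start])
            pos = max(pos, start)
            out.extend(delta_lines[i:i + count])
            i += count
        else:
            raise ValueError("unknown delta command: %r" % command)
    out.extend(old_lines[pos:])
    return out


old = ["one", "two", "three", "four"]
delta = ["d1 1", "a1 2", "ONE", "one-and-a-half", "d3 1"]
new = apply_rcs_delta(old, delta)
```

Here the delta replaces the first line with two new ones and drops the third, leaving "two" and "four" untouched.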
There are dozens of robust RCS extraction tools
Or so you would think. Except:
* nobody would use bare RCS for a multifile repository, and
* everyone migrated to something sensible in 1997.
But you can trick tools into thinking a directory full of RCS files is a
CVS repository by adding one or two magic CVS files.
Confession: it was easy, but I have no idea what I did.
Using cvs2svn | git svn
The "history" is not chronological.
SomeChapter v1.2 2008-06-11
SomeChapter v1.1 2008-06-03
AnotherChapter v1.3 2009-05-22
AnotherChapter v1.2 2009-01-06
AnotherChapter v1.1 2007-07-30
YetAnotherChapter v1.27 2008-06-10
YetAnotherChapter v1.26 2007-11-12
YetAnotherChapter v1.25 2006-10-01
So you need to extract all the versions and sort them by date before
putting them into git.
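The sorting step is small enough to sketch, here in Python rather than Perl. The tuple layout and the name chronological are assumptions for illustration; the data is the listing above. Version numbers are compared as integers so that 1.10 sorts after 1.9 when two revisions share a date.

```python
from datetime import date

# (chapter, version, date) as they come out of per-file history
revisions = [
    ("SomeChapter", "1.2", "2008-06-11"),
    ("SomeChapter", "1.1", "2008-06-03"),
    ("AnotherChapter", "1.3", "2009-05-22"),
    ("AnotherChapter", "1.2", "2009-01-06"),
    ("AnotherChapter", "1.1", "2007-07-30"),
    ("YetAnotherChapter", "1.27", "2008-06-10"),
    ("YetAnotherChapter", "1.26", "2007-11-12"),
    ("YetAnotherChapter", "1.25", "2006-10-01"),
]


def chronological(revs):
    """Sort revisions by date, then chapter, then numeric version."""
    def key(rev):
        chapter, version, day = rev
        return (date.fromisoformat(day),
                chapter,
                tuple(int(part) for part in version.split(".")))
    return sorted(revs, key=key)


ordered = chronological(revisions)
```

Committing in this order interleaves the chapters, which is what a project-wide history should look like.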
Confession: I deleted all the broken repositories, so I can't do gitk screenshots
Extracting TWiki data with RCS
for (1..$rev){
# $file is the chapter's .txt file; the revision is attached to -r
my $version = `co -q -p -r1.$_ -ko $file`;
#whatever
}
But that takes forever. So you write your own RCS parser that takes
advantage of the RCS format (the newest revision is stored in full,
each older one as a delta against its successor) and extracts the
versions in reverse order, applying cumulative patches, and it is 20
times quicker.
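A parser along those lines might start by pulling the per-revision text blocks out of the ,v file. This is a rough Python sketch, not the parser used for the talk. The names AT_STRING, DELTATEXT and rcs_deltatexts are invented here, and the regular expression assumes no page body happens to contain something that looks like a revision header followed by "log". RCS strings are delimited by @, with a literal @ doubled.

```python
import re

# An RCS string: @...@ with any literal @ written as @@
AT_STRING = r"@((?:[^@]|@@)*)@"
DELTATEXT = re.compile(
    r"^(\d+(?:\.\d+)+)\s+log\s+" + AT_STRING + r"\s+text\s+" + AT_STRING,
    re.M,
)


def rcs_deltatexts(rcs_source):
    """Return [(revision, text)] in file order, newest revision first."""
    return [(rev, text.replace("@@", "@"))
            for rev, _log, text in DELTATEXT.findall(rcs_source)]


sample = """head 1.2;
access;
symbols;
locks; strict;
comment @# @;

1.2
date 2008.10.31.09.54.02; author AdamHyde; state Exp;
branches;
next 1.1;

1.1
date 2007.05.14.17.10.07; author AdamHyde; state Exp;
branches;
next ;

desc
@none
@

1.2
log
@none
@
text
@<h1>Envelope Tool</h1>
<p>mail me @@ home</p>
@

1.1
log
@none
@
text
@d2 1
a2 1
<p>old text</p>
@
"""

blocks = rcs_deltatexts(sample)
```

The first block is the full text of the head revision; each later block is a delta to apply to the text reconstructed so far.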
Dealing with TWiki metadata
Each version of each file starts with a line like this:
%META:TOPICINFO{author="AdamHyde" date="1179162607" format="1.1" reprev="1.1" version="1.1"}%
This contains nothing useful that isn't in the RCS log, so it could be discarded.
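For anyone who does want to look inside these lines, a small Python sketch of picking one apart. The function name parse_meta_line is an invention for this example, and it assumes attribute values never contain a double quote.

```python
import re

META_LINE = re.compile(r'^%META:(\w+)\{(.*)\}%$')
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_meta_line(line):
    """Return (type, attributes) for a %META:...% line, else None."""
    match = META_LINE.match(line.strip())
    if match is None:
        return None
    return match.group(1), dict(ATTRIBUTE.findall(match.group(2)))


line = ('%META:TOPICINFO{author="AdamHyde" date="1179162607" '
        'format="1.1" reprev="1.1" version="1.1"}%')
kind, attrs = parse_meta_line(line)
```

The date attribute is a Unix timestamp, kept here as a string.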
Sometimes TWiki creates a new revision that changes nothing but the
metadata. A pathological case is
Sugar/_index.txt,v
which contains 15632 revisions, of which 27 are content changes.
next if ($text eq $oldtext);
...
$oldtext = $text;
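The same filter as a Python sketch, with the metadata stripped before comparing. Dropping every line that starts with %META: (rather than only TOPICINFO) is an assumption made here, and strip_meta and content_changes are illustrative names.

```python
def strip_meta(text):
    """Remove TWiki %META: lines so only page content is compared."""
    return "\n".join(line for line in text.split("\n")
                     if not line.startswith("%META:"))


def content_changes(versions):
    """Given [(number, text)] oldest first, return the numbers of the
    revisions whose content differs from the previous revision."""
    kept = []
    previous = None
    for number, text in versions:
        body = strip_meta(text)
        if body != previous:
            kept.append(number)
        previous = body
    return kept


versions = [
    (1, '%META:TOPICINFO{version="1.1"}%\n<h1>Index</h1>'),
    (2, '%META:TOPICINFO{version="1.2"}%\n<h1>Index</h1>'),
    (3, '%META:TOPICINFO{version="1.3"}%\n<h1>Index</h1>\n<p>New</p>'),
]
kept = content_changes(versions)
```

Revision 2 changes only the metadata line, so it is dropped.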
Problem 2.5 TWiki magic tags
Unfortunately TWiki also has magic tags that actually mean something.
About 80 in use, once the spambot contributions are removed.
Some are simple:
%BR% <br>
%RED% <font color="#ff0000">
%ENDCOLOR% </font>
But also...
%IF{...}%
%INCLUDE{...}% include another page (with all kinds of modifier options).
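The simple tags reduce to a lookup table, which is what makes this path tempting. A Python sketch covering only the three tags listed above; expand_simple_tags is a made-up name, and anything not in the table, including %IF{...}% and %INCLUDE{...}%, is left exactly as found.

```python
import re

SIMPLE_TAGS = {
    "BR": "<br>",
    "RED": '<font color="#ff0000">',
    "ENDCOLOR": "</font>",
}
TAG = re.compile(r"%([A-Z]+)%")


def expand_simple_tags(text):
    """Replace known argument-free tags; leave everything else alone."""
    return TAG.sub(lambda m: SIMPLE_TAGS.get(m.group(1), m.group(0)), text)


html = expand_simple_tags("%RED%stop%ENDCOLOR%%BR%")
```

The tags with arguments are where the table stops being enough, since they need the rest of the wiki to evaluate.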
Confession: yes, I actually did bumble along this path
Extracting the history as rendered
Either
1. Stick to the history as rendered for a representative user.
2. Go insane and support TWiki constructs.
If only there was a module for rendering TWiki pages.
Confession: I pondered this for a while.
Using TWiki.pm and friends
#!/usr/bin/perl
=pod
Initially based on TWiki::UI::View for which the following copyright
notice applies:
# Copyright (C) 1999-2007 Peter Thoeny, peter@thoeny.org
# and TWiki Contributors. All Rights Reserved. TWiki Contributors
# are listed in the AUTHORS file in the root of this distribution.
# NOTE: Please extend that file, not this notice.
#
# Additional copyrights apply to some or all of the code in this
# file as follows:
# Based on parts of Ward Cunninghams original Wiki and JosWiki.
# Copyright (C) 1998 Markus Peter - SPiN GmbH (warpi@spin.de)
# Some changes by Dave Harris (drh@bhresearch.co.uk) incorporated
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version. For
# more details read LICENSE in the root of this distribution.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# As per the GPL, removal of this notice is prohibited.
(That last claim seems dubious, but there you go).
=cut
use strict;
use warnings;
use integer;
use POSIX qw(strftime);
use Data::Dumper;
use constant FIND_EMAILS => 1;
BEGIN {
# Set library paths in @INC, at compile time
unshift @INC, '/home/douglas/fm-data/floss/bin/';
require 'setlib.cfg';
}
require TWiki;
my $DEFAULT_DOMAIN = 'flossmanuals.net';
#my $TWIKI_PATH = '/home/douglas/fm-data/twiki-data';
my $TWIKI_PATH = '/home/douglas/fm-data/import-tests/twiki-books';
my $STAGING_DIR = '/home/douglas/fm-data/import-tests/staging';
my @BAD_CHAPTERS = qw{WebAtom WebPreferences WebChanges WebRss
WebCreateNewTopic WebSearchAdvanced WebHome WebSearch
WebIndex WebStatistics WebLeftBar WebTopicList WebNotify
WebTopicCreator WebTopicEditTemplate
};
my %BAD_BOOKS = map {($_ => 1)} qw{Main TWiki PR Trash};
=pod
Get the text and metadata of a twiki page revision. An attempt is
made to put the committer's correct email in the metadata.
render_version($webName, $topicName, $revision, $session, $raw) ==> ($text, $meta)
* =$webName= the book.
* =$topicName= the chapter.
* =$revision= the required version (undef for the latest).
* =$session= a TWiki instance. If undef, it will be created using the admin login.
* =$raw= is a flag to prevent the expansion of TWiki variables.
Return values:
* =$text= the text of the chapter, with TWiki tags expanded (unless =$raw= is set)
* =$meta= a TWiki::Meta object
=cut
sub render_version {
my ($webName, $topicName, $revision, $session, $raw) = @_;
$session ||= new TWiki('admin');
my ($meta, $text) = $session->{store}->readTopic($session, $webName, $topicName, $revision);
# Try to set the author's email to something sensible, if possible
my $info = $meta->get('TOPICINFO');
my $author = $info->{'author'};
if (! defined $author){
$author = 'Anonymous';
print STDERR "Anonymous author on $webName.$topicName v$revision\n";
}
if (FIND_EMAILS){
my $user = $session->{users}->findUserByWikiName($author);
$info->{email} = $session->{users}->getEmails($user);
}
if (! $info->{email}){
$info->{email} = "$author\@$DEFAULT_DOMAIN";
}
$info->{book} = $webName;
if(! $raw) {
$session->enterContext('body_text');
$text = $session->handleCommonTags($text, $webName, $topicName, $meta);
$text = $session->renderer->getRenderedVersion($text, $webName, $topicName);
$session->leaveContext('body_text');
$text =~ s/( ?) *<\/?(nop|noautolink)\/?>\n?/$1/gois;
}
return $text, $meta;
}
=pod
extract_all_versions($webName, $topicName, $session)
Find all revisions of the chapter $topicName in book $webName and put
them in an array indexed by revision number. Return a reference to
the array. A revision whose rendered text is identical to that of the
previous revision is left undefined.
Version 0 is undefined.
=cut
sub extract_all_versions {
my ($webName, $topicName, $session) = @_;
if (! $session){
$session = new TWiki ('admin');
}
my $head = $session->{store}->getRevisionNumber($webName, $topicName);
#my @versions = ['', new TWiki::Meta($session, $webName, $topicName)];
my @versions;
my $oldtext = 'Something to avoid annoying warnings';
foreach (1 .. $head){
my ($text, $meta) = render_version($webName, $topicName, $_, $session);
if ($text ne $oldtext){
$versions[$_] = [$text, $meta];
}
$oldtext = $text;
}
return \@versions;
}
=pod
save_versions($webName, $topicName, $dest)
Save versions of chapter $topicName of book $webName in files in the
directory $dest. Their names are structured thus:
$dest/$webName-$topicName.$revision_number.txt
$dest ought to exist. Files will be overwritten.
=cut
sub save_versions {
my ($webName, $topicName, $dest) = @_;
if (! -d $dest || ! -w $dest){
die "'$dest' is not a writeable directory; it should be.\n";
}
my $versions = extract_all_versions($webName, $topicName);
foreach my $r (1 .. $#$versions){
my $v = $versions->[$r];
next unless defined $v;
my ($text, $meta) = @$v;
open my $fh, '>', "$dest/$webName-$topicName.$r.txt"
or die "can't write '$dest/$webName-$topicName.$r.txt': $!\n";
print $fh $meta->stringify . "\n$text\n";
close $fh;
}
}
=pod
get_chapters
Find all the real chapters in the repository. TWiki will have made a
number of unused pages by default: these are filtered out.
=cut
sub get_chapters {
my $book = shift || die "chapters of which book?";
my $session = shift || new TWiki ('admin');
my @chapters = grep {
my $chapter = $_;
$chapter =~ s/Talk$//;
! grep {$chapter eq $_} @BAD_CHAPTERS;
} $session->{store}->getTopicNames($book);
return @chapters;
}
=pod
printer
Summary of a chapter revision to STDOUT.
=cut
sub printer {
my ($text, $meta, $date, $title) = @_;
my $info = $meta->get('TOPICINFO');
my $s = sprintf ("%11s %20.20s %-5.5s %15.15s %.30s", $date, $title,
$info->{version}, $info->{author}, $text);
$s =~ s/\n/ /g;
print "$s\n";
}
sub staging_header {
my $meta = shift;
my $header = "chapter: $meta->{_topic}\n";
my $info = $meta->get('TOPICINFO');
foreach ('book', 'version', 'date', 'author', 'email'){
if ($info->{$_}){
$header .= "$_: $info->{$_}\n";
}
}
$header .= "book2: $meta->{_web}\n";
foreach my $type (qw {FILEATTACHMENT PREFERENCE TOPICPARENT FIELD TOPICMOVED}){
next unless defined $meta->{$type};
foreach my $item (@{$meta->{$type}}){
$header .= "$type:";
foreach my $k (sort keys %$item ){
$header .= " '$k'='$item->{$k}'";
}
$header .= "\n";
}
}
return $header;
}
=pod
Save a chapter in the staging directory.
=cut
sub stage_commit {
my ($text, $meta, $date, $title) = @_;
my $info = $meta->get('TOPICINFO');
my $filename = sprintf ("%010d.%s.%s.%s", $date, $info->{book}, $title,
$info->{version});
my $dir = $STAGING_DIR . '/' . strftime('%Y-%m', localtime($date));
mkdir($dir) unless (-d $dir);
my $fh;
open ($fh, '>', "$dir/$filename") or die "can't write '$dir/$filename': $!\n";
print $fh staging_header($meta);
print $fh "\n------8<-----------------\n";
print $fh $text;
close $fh;
}
=pod
process_book
Given a book and a function reference, call the function on every
revision of every chapter of the book.
The function should take the arguments ($text, $meta, $date, $title).
=cut
sub process_book {
my $book = shift;
my $function = shift || \&printer;
my $session = shift || new TWiki ('admin');
my @chapters = get_chapters($book, $session);
#print STDERR "@chapters\n";
my %commits;
for my $chapter (@chapters){
#print STDERR "'$book' '$chapter'\n";
my $versions = extract_all_versions($book, $chapter, $session);
for my $v (@$versions){
next unless defined $v;
eval {
my $date = $v->[1]->get('TOPICINFO')->{'date'};
push @$v, $chapter;
my @c;
if (defined $commits{$date}){
@c = @{$commits{$date}};
}
push @c, $v;
$commits{$date} = \@c;
};
if ($@){
warn $@;
print STDERR $v; #Dumper($v);
}
}
}
my @dates = sort {$a <=> $b} keys %commits;
for my $date (@dates){
for my $v (@{$commits{$date}}){
my ($text, $meta, $title) = @$v;
&$function($text, $meta, $date, $title);
}
}
}
sub process_repository {
my $function = shift || \&printer;
my $filter = shift || '.';
my $session = shift || new TWiki ('admin');
for my $book ($session->{store}->getListOfWebs){
next if ($BAD_BOOKS{$book} || $book !~ /$filter/);
print STDERR "'$book'\n";
process_book($book, $function, $session);
}
}
process_repository(\&stage_commit);
#process_book($ARGV[0], \&stage_commit);
#process_book($ARGV[0], \&printer);
#save_versions @ARGV;
#save_versions "Inkscape", "Introduction", $ARGV[0];
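
Whatever consumes the staging directory has to read these files back. Because the filenames start with a zero-padded timestamp, a plain sorted listing within each month directory is already chronological. Below is a Python sketch of parsing one staged file, based only on what stage_commit and staging_header write. The name parse_staged and the sample values are made up for illustration.

```python
# Must match the separator printed by stage_commit
SEPARATOR = "\n------8<-----------------\n"


def parse_staged(content):
    """Split a staged file into ({key: [values]}, text)."""
    header, _, text = content.partition(SEPARATOR)
    fields = {}
    for line in header.splitlines():
        key, colon, value = line.partition(":")
        if colon:
            # Keys such as PREFERENCE can repeat, so collect lists.
            fields.setdefault(key, []).append(value.strip())
    return fields, text


# Hypothetical staged file, laid out as staging_header would write it
sample = (
    "chapter: EnvelopeTool\n"
    "book: Audacity\n"
    "version: 1.9\n"
    "date: 1225446842\n"
    "author: AdamHyde\n"
    "email: AdamHyde@flossmanuals.net\n"
    "book2: Audacity\n"
    "PREFERENCE: 'name'='xchange.status' 'value'='complete'\n"
    "PREFERENCE: 'name'='xchange.uid' 'value'='96d1c1752589e4294ed28dd852d01d3e'\n"
    + SEPARATOR +
    "<h1>Envelope Tool</h1>\n"
)
fields, text = parse_staged(sample)
```

Every value is kept as a list so that repeated metadata lines survive the round trip.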