Mirroring Wikipedia

From MattWiki

This page describes how to mirror Wikipedia on your own hardware. This how-to was written with MediaWiki version 1.17.0 in mind.

Installing Required Packages

On Fedora you will need to install the following:

yum install httpd mysql-server mysql php php-pdo perl-DBD-MySQL php-xml

Configuring MySQL

Edit the /etc/my.cnf file and add or modify the following entry in the [mysqld] section:

max_allowed_packet=128M
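
The edit above can also be scripted. The following is a minimal sketch, assuming GNU sed and that the option lives under a [mysqld] section; set_max_packet is a hypothetical helper name, not a MySQL tool.

```shell
#!/bin/bash
# Sketch: idempotently set max_allowed_packet in a MySQL option file.
# Assumptions: GNU sed is available, and the option belongs under [mysqld].
# Caveat: an existing max_allowed_packet line in ANY section is rewritten.
set_max_packet() {
  local cnf=$1 size=$2
  if grep -q '^max_allowed_packet' "$cnf" 2>/dev/null; then
    # An entry already exists: replace its value in place.
    sed -i "s/^max_allowed_packet.*/max_allowed_packet=$size/" "$cnf"
  elif grep -q '^\[mysqld\]' "$cnf" 2>/dev/null; then
    # No entry yet: add one directly below the [mysqld] header.
    sed -i "/^\[mysqld\]/a max_allowed_packet=$size" "$cnf"
  else
    # No [mysqld] section (or no file): append a new section.
    printf '[mysqld]\nmax_allowed_packet=%s\n' "$size" >> "$cnf"
  fi
}

# Example (as root): set_max_packet /etc/my.cnf 128M
```

MySQL needs a restart (for example /sbin/service mysqld restart) before the new value takes effect.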

Building the MediaWiki Install

To start, we need a working MediaWiki install. You can find the current version at http://www.mediawiki.org/wiki/Download

cd /var/www/
wget http://download.wikimedia.org/mediawiki/1.17/mediawiki-1.17.0.tar.gz
mkdir -p wikipedia
tar -xzf mediawiki-1.17.0.tar.gz -C wikipedia --strip-components=1
chown -R apache:apache /var/www/wikipedia

Needed Extensions for Wikipedia

Via SVN

cd /var/www/wikipedia/extensions
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/CategoryTree/
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/CharInsert
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/Cite/
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/ExpandTemplates
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/SyntaxHighlight_GeSHi/
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/Poem/
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/OpenSearchXml/
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/WikiEditor/
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/wikihiero/
svn checkout http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions/Vector/

A list of the extensions used on Wikimedia sites can be found on the MediaWiki site: http://www.mediawiki.org/wiki/Category:Extensions_used_on_Wikimedia

  • Cite - Adds two parser hooks to MediaWiki, <ref> and <references />; these operate together to add citations to pages.
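
The checkouts above can be driven from a single list. This is a minimal sketch: extension_url, require_line and checkout_extensions are hypothetical helper names, and require_line assumes each extension's entry point is named after its directory (for example Cite/Cite.php), which should be checked against each extension's own documentation.

```shell
#!/bin/bash
# Sketch: check out the extensions listed above and print the matching
# LocalSettings.php lines.
SVN_BASE=http://svn.wikimedia.org/svnroot/mediawiki/branches/REL1_17/extensions
EXTENSIONS="CategoryTree CharInsert Cite ExpandTemplates SyntaxHighlight_GeSHi Poem OpenSearchXml WikiEditor wikihiero Vector"

# Build the SVN URL for one extension.
extension_url() {
  printf '%s/%s\n' "$SVN_BASE" "$1"
}

# Print the line that would enable one extension in LocalSettings.php.
# Assumption: the entry point is <Name>/<Name>.php.
require_line() {
  printf 'require_once( "$IP/extensions/%s/%s.php" );\n' "$1" "$1"
}

# Check out every extension into the given extensions directory.
checkout_extensions() {
  local target=$1 ext
  cd "$target" || return 1
  for ext in $EXTENSIONS; do
    svn checkout "$(extension_url "$ext")" "$ext"
  done
}

# Example: checkout_extensions /var/www/wikipedia/extensions
# Example: for e in $EXTENSIONS; do require_line "$e"; done
```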

Downloading the data

Download and import script (pass the dump date, for example 20110620, as the first argument):

#!/bin/bash

DOWNLOADDIR=/var/www/wikipedia-downloads
MEDIAWIKIDIR=/var/www/wikipedia
SQLNAME='wikipedia_db'

########################################
DATE=$1
if [ -z "$DATE" ]; then
  echo "Usage: $0 <dump date, e.g. 20110620>" >&2
  exit 1
fi
DIR=$DOWNLOADDIR/$DATE

mkdir -p "$DIR"
cd "$DIR" || exit 1

echo "Date: $DATE"
echo "##############################################"
wget -c http://dumps.wikimedia.org/enwiki/$DATE/enwiki-$DATE-md5sums.txt
echo "Finished downloading MD5 file"
echo ""
for a in $(grep -E "enwiki-[0-9]{8}-pages-articles[0-9]{1,2}\.xml\.bz2" "enwiki-$DATE-md5sums.txt" | awk '{print $2}')
do
  echo ""
  echo "##############################################"
  echo "Working on: $a"
  echo ""
  wget -c http://dumps.wikimedia.org/enwiki/$DATE/$a
  php $MEDIAWIKIDIR/maintenance/initStats.php --update
  php $MEDIAWIKIDIR/maintenance/importDump.php $DIR/$a
  /sbin/service mysqld restart
done

echo "UPDATE site_stats SET ss_total_views = 0 WHERE ss_row_id = 1; UPDATE page SET page_counter = 0;" |mysql $SQLNAME
php $MEDIAWIKIDIR/maintenance/initStats.php --update
php $MEDIAWIKIDIR/maintenance/rebuildrecentchanges.php

echo "##############################################"
echo "Done!!!"
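
The script above downloads the md5sums file but never checks the dumps against it. The following is a minimal sketch of such a check, assuming md5sum is installed and that each line of the sums file has the form "hash  filename"; verify_dump is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: verify one downloaded dump against the md5sums file.
# Returns 0 on match, 1 on mismatch, 2 if the file is not listed.
verify_dump() {
  local sums_file=$1 dump_file=$2
  local name expected actual
  name=$(basename "$dump_file")
  # Look up the expected hash for this file name.
  expected=$(awk -v f="$name" '$2 == f {print $1}' "$sums_file")
  [ -n "$expected" ] || return 2
  # Compute the actual hash of the downloaded file.
  actual=$(md5sum "$dump_file" | awk '{print $1}')
  [ "$expected" = "$actual" ]
}

# Example, inside the download loop before importing:
#   verify_dump "enwiki-$DATE-md5sums.txt" "$DIR/$a" || { echo "Bad checksum: $a" >&2; continue; }
```

Checking before importDump.php runs avoids spending hours importing a truncated download.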

Importing the data into MediaWiki

Scripts

Script to download and decompress the needed files:

for a in 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
do
  cd /var/www/wikipedia-downloads/20110620/
  wget -c http://download.wikimedia.org/enwiki/20110620/enwiki-20110620-pages-articles$a.xml.bz2
  bzip2 -d /var/www/wikipedia-downloads/20110620/enwiki-20110620-pages-articles$a.xml.bz2
done
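
A hand-typed list of part numbers is easy to get wrong. As a minimal sketch, the file names can be generated with seq instead; dump_parts is a hypothetical helper name, and the number of parts (32 here, taken from the loop above) is an assumption that should be checked against the md5sums file for the dump in question.

```shell
#!/bin/bash
# Sketch: print the dump file names for a given date, highest part first.
# Arguments: dump date (e.g. 20110620) and the number of parts.
dump_parts() {
  local date=$1 last=$2 n
  for n in $(seq "$last" -1 1); do
    printf 'enwiki-%s-pages-articles%s.xml.bz2\n' "$date" "$n"
  done
}

# Example:
#   cd /var/www/wikipedia-downloads/20110620/ || exit 1
#   for f in $(dump_parts 20110620 32); do
#     wget -c "http://download.wikimedia.org/enwiki/20110620/$f" && bzip2 -d "$f"
#   done
```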

Script for importing the data (this can take days):

for a in 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
do
  echo "-----------  Working on enwiki-20110620-pages-articles$a.xml  -----------"
  php /var/www/wikipedia/maintenance/importDump.php /var/www/wikipedia-downloads/20110620/enwiki-20110620-pages-articles$a.xml
  php /var/www/wikipedia/maintenance/initStats.php --update
done
php /var/www/wikipedia/maintenance/rebuildrecentchanges.php
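
Because the import can run for days, it helps if an interrupted run can be restarted without repeating finished parts. The following is a minimal sketch using marker files; run_once and the .done file naming are hypothetical, not part of MediaWiki.

```shell
#!/bin/bash
# Sketch: run a command only if its marker file does not exist yet,
# and create the marker only when the command succeeds.
run_once() {
  local marker=$1
  shift
  if [ -e "$marker" ]; then
    echo "Skipping (already done): $marker"
    return 0
  fi
  "$@" && touch "$marker"
}

# Example, inside the import loop:
#   f=/var/www/wikipedia-downloads/20110620/enwiki-20110620-pages-articles$a.xml
#   run_once "$f.done" php /var/www/wikipedia/maintenance/importDump.php "$f"
```

A part whose import fails leaves no marker behind, so rerunning the loop retries only that part and any that follow it.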