Help:Export

From The Scuba Wiki


Wiki pages can be exported in a special XML format, either to import them into another MediaWiki installation (if this function is enabled on the destination wiki and the user is a sysop there) or to use them in some other way, for instance to analyse the content. See also m:Syndication feeds for exporting information other than pages, and Help:Import for importing pages.

How to export

There are at least four ways to export pages:

By default, only the current version of a page is included. Optionally, you can get all versions with date, time, user name and edit summary. Optionally, the latest version of all templates called directly or indirectly is also exported.

Additionally you can copy the SQL database. This is how dumps of the database were made available before MediaWiki 1.5 and it won't be explained here further.

Using 'Special:Export'

For example, to export all pages of a namespace:

1. Get the names of pages to export

  1. Go to Special:Allpages and choose the desired namespace.
  2. Copy the list of page names to a text editor
  3. Put all page names on separate lines
    1. You can do this fairly quickly by pasting the names into, say, MS Word (use Paste Special, as unformatted text), then opening the Replace dialog (Ctrl+H), entering ^t in "Find what" and ^p in "Replace with", and clicking Replace All. (This may not work if there are no tabs between the page names.)
    2. Vim also offers a quick way to fix line breaks: after pasting the whole list, run :1,$s/\t/\r/g to replace all tabs with line breaks, and then :1,$s/^\n//g to remove every empty line.
    3. Another approach is to copy the formatted text into any editor that exposes the HTML. Remove all <tr> and </tr> tags, then replace every <td> tag with <tr><td> and every </td> tag with </td></tr>. The HTML will then be rendered in the needed format.
    4. If you have shell and mysql access to your server, you can use this script:
# List the titles of all pages in the main namespace (page_namespace=0)
mysql -umike -pmikespassword -hlocalhost wikidbname <<EOF
select page_title from wiki_page where page_namespace=0;
EOF

Note: replace mike and mikespassword with your own user name and password. This example assumes tables with the prefix wiki_.

  4. Prefix the namespace to the page names (e.g. 'Help:Contents'), unless the selected namespace is the main namespace.
  5. Repeat the steps above for other namespaces (e.g. Category:, Template:, etc.)

A similar script for PostgreSQL databases looks like this:

$ psql -At -U wikiuser -h localhost wikidb -c "select page_title from mediawiki.page where page_namespace=0"

Note: replace wikiuser with your own user name; the database will prompt you for a password. This example assumes tables without the prefix wiki_ and with the schema (here mediawiki) specified as part of the table name.

Alternatively, a quick approach for those with access to a machine with Python installed:

  1. Go to Special:Allpages and choose the desired namespace.
  2. Save the entire webpage as index.php.htm
  3. Run export_all_helper.py in the same directory as the saved file.
  4. Save the page names output by the script.
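
Whichever way the page names are collected, the clean-up described in step 1 (one name per line, no empty lines, namespace prefix added) can also be done with a few lines of Python. The following is only a minimal sketch: the function name normalize_page_names is hypothetical, and it assumes the copied names are separated by tabs or line breaks.

```python
def normalize_page_names(raw, namespace=""):
    """Return one page name per line from text copied off Special:Allpages.

    Assumption: names are separated by tabs and/or line breaks.
    `namespace` is the prefix to add; leave it empty for the main namespace.
    """
    names = []
    for chunk in raw.replace("\t", "\n").splitlines():
        name = chunk.strip()
        if not name:
            continue  # drop empty lines, which the export textbox should not contain
        if namespace and not name.startswith(namespace + ":"):
            name = namespace + ":" + name
        names.append(name)
    return "\n".join(names)


if __name__ == "__main__":
    print(normalize_page_names("Contents\tExport\n\nImport\n", "Help"))
```

The result can be pasted directly into the Special:Export textbox.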

2. Perform the export

  • Go to Special:Export and paste all your page names into the textbox, making sure there are no empty lines.
  • Click 'Submit query'
  • Save the resulting XML to a file using your browser's save facility.

and finally...

  • Open the XML file in a text editor. Scroll to the bottom to check for error messages.

Now you can use this XML file to perform an import.

Exporting the full history

A checkbox in the Special:Export interface selects whether to export the full history (all versions of an article) or only the most recent version of each article. A maximum of 100 revisions is returned; other revisions can be requested as detailed in MW:Parameters to Special:Export.
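
The form can also be submitted by a script instead of a browser. The sketch below only builds the form body and sends nothing. The field names pages, curonly, templates and action are assumptions here and should be checked against MW:Parameters to Special:Export before use; build_export_form is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_export_form(page_names, full_history=False, include_templates=False):
    """Build the URL-encoded body for a POST to Special:Export (nothing is sent)."""
    fields = {
        "pages": "\n".join(page_names),  # one page name per line, as in the textbox
        "action": "submit",              # assumed field name
    }
    if not full_history:
        fields["curonly"] = "1"          # assumed flag: current revision only
    if include_templates:
        fields["templates"] = "1"        # assumed flag: also export templates used
    return urlencode(fields)


if __name__ == "__main__":
    print(build_export_form(["Help:Contents", "Main Page"]))
```

Remember that, as noted above, a full-history export is capped at 100 revisions per request.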

Export format

The format of the XML file you receive is the same whichever export method you use. It is codified in XML Schema at http://www.mediawiki.org/xml/export-0.3.xsd. This format is not intended for viewing in a web browser, although some browsers show pretty-printed XML with "+" and "-" links to expand or collapse selected parts. Alternatively, the XML source can be viewed using the browser's "view source" feature or, after saving the XML file locally, with a program of your choice. If you read the XML source directly, the actual wikitext is not difficult to find. Unless you use a special XML editor, "<" and ">" appear as &lt; and &gt; to avoid a conflict with XML tags; to avoid ambiguity, "&" is coded as &amp;.
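
To illustrate the escaping, here is a small sketch using Python's standard xml.sax.saxutils module. The sample wikitext is made up for the example.

```python
from xml.sax.saxutils import escape, unescape

# Made-up wikitext containing all three special characters.
wikitext = "Use <nowiki>[[A & B]]</nowiki> here."

encoded = escape(wikitext)   # how the text appears inside the XML source
decoded = unescape(encoded)  # what an XML parser hands back to you

if __name__ == "__main__":
    print(encoded)
    print(decoded)
```

An XML parser does this decoding automatically; it only matters when the XML source is read or processed as plain text.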

In the current version, the export format does not contain an XML replacement for wiki markup (see Wikipedia DTD for an older proposal). You get only the wikitext, exactly as you see it when editing the article.

Example

  <mediawiki xml:lang="en">
    <page>
      <title>Page title</title>
      <restrictions>edit=sysop:move=sysop</restrictions>
      <revision>
        <timestamp>2001-01-15T13:15:00Z</timestamp>
        <contributor><username>Foobar</username></contributor>
        <minor />
        <comment>I have just one thing to say!</comment>
        <text>A bunch of [[text]] here.</text>
      </revision>
      <revision>
        <timestamp>2001-01-15T13:10:27Z</timestamp>
        <contributor><ip>10.0.0.2</ip></contributor>
        <comment>new!</comment>
        <text>An earlier [[revision]].</text>
      </revision>
    </page>
    
    <page>
      <title>Talk:Page title</title>
      <revision>
        <timestamp>2001-01-15T14:03:00Z</timestamp>
        <contributor><ip>10.0.0.2</ip></contributor>
        <comment>hey</comment>
        <text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
      </revision>
    </page>
  </mediawiki>
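
A small file like this can be read in one go. The sketch below uses Python's xml.etree.ElementTree on a shortened copy of the example; list_revisions is a hypothetical helper name. It assumes, as in the example, that the root element carries no xmlns attribute. A real export declares the export-0.3 namespace, in which case the tag names would have to be namespace-qualified.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above (no xmlns declared).
SAMPLE = """<mediawiki xml:lang="en">
  <page>
    <title>Page title</title>
    <revision>
      <timestamp>2001-01-15T13:15:00Z</timestamp>
      <contributor><username>Foobar</username></contributor>
      <text>A bunch of [[text]] here.</text>
    </revision>
    <revision>
      <timestamp>2001-01-15T13:10:27Z</timestamp>
      <contributor><ip>10.0.0.2</ip></contributor>
      <text>An earlier [[revision]].</text>
    </revision>
  </page>
</mediawiki>"""


def list_revisions(xml_text):
    """Return (title, timestamp, contributor, wikitext) for every revision."""
    rows = []
    root = ET.fromstring(xml_text)
    for page in root.iter("page"):
        title = page.findtext("title")
        for rev in page.iter("revision"):
            # A contributor is either a logged-in user or an IP address.
            who = rev.findtext("contributor/username") or rev.findtext("contributor/ip")
            rows.append((title, rev.findtext("timestamp"), who, rev.findtext("text")))
    return rows


if __name__ == "__main__":
    for row in list_revisions(SAMPLE):
        print(row)
```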

DTD

Here is an unofficial, short Document Type Definition version of the format. If you don't know what a DTD is just ignore it.

<!ELEMENT mediawiki (siteinfo,page*)>
<!-- version contains the version number of the format (currently 0.3) -->
<!ATTLIST mediawiki
  version  CDATA  #REQUIRED 
  xmlns CDATA #FIXED "http://www.mediawiki.org/xml/export-0.3/"
  xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation CDATA #FIXED
    "http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
  xml:lang  CDATA #IMPLIED
>
<!ELEMENT siteinfo (sitename,base,generator,case,namespaces)>
<!ELEMENT sitename (#PCDATA)>      <!-- name of the wiki -->
<!ELEMENT base (#PCDATA)>          <!-- url of the main page -->
<!ELEMENT generator (#PCDATA)>     <!-- MediaWiki version string -->
<!ELEMENT case (#PCDATA)>          <!-- how cases in page names are handled -->
   <!-- possible values: 'first-letter' | 'case-sensitive'
                         'case-insensitive' option is reserved for future -->
<!ELEMENT namespaces (namespace+)> <!-- list of namespaces and prefixes -->
  <!ELEMENT namespace (#PCDATA)>     <!-- contains namespace prefix -->
  <!ATTLIST namespace key CDATA #REQUIRED> <!-- internal namespace number -->
<!ELEMENT page (title,id?,restrictions?,(revision|upload)*)>
  <!ELEMENT title (#PCDATA)>         <!-- Title with namespace prefix -->
  <!ELEMENT id (#PCDATA)> 
  <!ELEMENT restrictions (#PCDATA)>  <!-- optional page restrictions -->
<!ELEMENT revision (id?,timestamp,contributor,minor?,comment?,text)>
  <!ELEMENT timestamp (#PCDATA)>     <!-- according to ISO8601 -->
  <!ELEMENT minor EMPTY>             <!-- minor flag -->
  <!ELEMENT comment (#PCDATA)> 
  <!ELEMENT text (#PCDATA)>          <!-- Wikisyntax -->
  <!ATTLIST text xml:space CDATA  #FIXED "preserve">
<!ELEMENT contributor ((username,id) | ip)>
  <!ELEMENT username (#PCDATA)>
  <!ELEMENT ip (#PCDATA)>
<!ELEMENT upload (timestamp,contributor,comment?,filename,src,size)>
  <!ELEMENT filename (#PCDATA)>
  <!ELEMENT src (#PCDATA)>
  <!ELEMENT size (#PCDATA)>

Processing XML export

There are undoubtedly many tools which can process the exported XML. If you process a large number of pages (for instance a whole dump), you probably won't be able to fit the document in main memory, so you will need a parser based on SAX or another event-driven method.

You can also just use regular expressions to process parts of the XML code directly. This may be faster than other methods, but it is not recommended because it is difficult to maintain.
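
As an illustration of the event-driven approach, here is a minimal sketch using Python's standard xml.sax module. It collects page titles and counts revisions without ever holding the whole document in memory. The class and function names are hypothetical, and the inline sample data is made up.

```python
import io
import xml.sax


class TitleCounter(xml.sax.ContentHandler):
    """Collect page titles and count revisions without building a tree."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.revisions = 0
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []
        elif name == "revision":
            self.revisions += 1

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def scan_export(stream):
    """Parse an export from a file name or binary file object."""
    handler = TitleCounter()
    xml.sax.parse(stream, handler)
    return handler


if __name__ == "__main__":
    data = b"<mediawiki><page><title>A</title><revision><text>x</text></revision></page></mediawiki>"
    result = scan_export(io.BytesIO(data))
    print(result.titles, result.revisions)
```

For a real dump, pass the file name (or an open binary file) to scan_export instead of the in-memory sample.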

Details and practical advice

  • To determine the namespace of a page, you have to match its title against the prefixes defined in

/mediawiki/siteinfo/namespaces/namespace

  • Possible restrictions are
    • sysop (protected pages)
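
The namespace lookup described in the first point above can be sketched as follows. This is only an illustration: the function names are hypothetical, the sample siteinfo fragment and its key values are made up for the example, the fragment is assumed to carry no xmlns attribute, and a title without a matching prefix is assumed to belong to the main namespace (key 0).

```python
import xml.etree.ElementTree as ET

# Made-up siteinfo fragment; a real export lists the wiki's own namespaces.
SITEINFO = """<siteinfo><namespaces>
<namespace key="0" />
<namespace key="1">Talk</namespace>
<namespace key="12">Help</namespace>
</namespaces></siteinfo>"""


def load_namespaces(siteinfo_xml):
    """Map each namespace prefix to its internal number."""
    root = ET.fromstring(siteinfo_xml)
    # The main namespace has no prefix text, so it is skipped here.
    return {ns.text: int(ns.get("key")) for ns in root.iter("namespace") if ns.text}


def namespace_of(title, namespaces):
    """Return the namespace number of a title; 0 when no prefix matches."""
    prefix, sep, _rest = title.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix]
    return 0  # assumption: no known prefix means the main namespace


if __name__ == "__main__":
    ns = load_namespaces(SITEINFO)
    print(namespace_of("Help:Export", ns), namespace_of("Page title", ns))
```

Matching against the list from siteinfo, rather than simply splitting at the first colon, keeps titles such as "Foo: bar" in the main namespace.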
