mwxml 0.3.6


pip install mwxml

Released: Feb 13, 2025

Meta
Author: Aaron Halfaker

Classifiers

Programming Language
  • Python
  • Python :: 3
  • Python :: 3 :: Only

Environment
  • Other Environment

Intended Audience
  • Developers

License
  • OSI Approved :: MIT License

Operating System
  • OS Independent

Topic
  • Software Development :: Libraries :: Python Modules
  • Text Processing :: Linguistic
  • Text Processing :: General
  • Utilities
  • Scientific/Engineering

# MediaWiki XML

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. It addresses two concerns with streaming XML parsing: complexity and performance. It enables memory-efficient stream processing of XML dumps with a simple [iterator](https://pythonhosted.org/mwxml/iteration.html) strategy, and it implements a distributed processing strategy (see [map()](https://pythonhosted.org/mwxml/map.html)) for processing many XML dump files in parallel.
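To make "memory-efficient stream processing" concrete, here is a minimal sketch of the general technique using only the standard library. It is not this library's implementation: the `iter_pages` helper and the tiny `SAMPLE` document are hypothetical, and real dumps use XML namespaces and carry many more fields.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a dump file; real dumps are namespaced and far larger.
SAMPLE = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <revision><id>3</id></revision>
  </page>
</mediawiki>"""


def iter_pages(file_obj):
    """Yield (title, [revision ids]) for one page at a time."""
    # iterparse reads the file incrementally and reports each element
    # as soon as its closing tag has been seen.
    for _event, elem in ET.iterparse(file_obj, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            rev_ids = [int(r.findtext("id")) for r in elem.iter("revision")]
            yield title, rev_ids
            # Drop the finished page's children and text so they can be freed.
            elem.clear()


pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, rev_ids in pages:
    print(title, rev_ids)
```

Because each page is handed to the caller and then cleared, peak memory depends on the largest single page rather than on the size of the whole file.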

## Example

>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>>
>>> for page in dump:
...     for revision in page:
...         print(revision.id)
...
1
2
3
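The loop above handles a single dump file. The idea behind processing many files at once can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the library's `map()`: `map_dumps`, `revision_ids`, and the in-memory `DUMPS` table are hypothetical, and a thread pool is swapped in for simplicity where CPU-bound parsing would normally call for separate processes.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "files", keyed by made-up paths, standing in for dumps.
DUMPS = {
    "dump1.xml": b"<mediawiki><page>"
                 b"<revision><id>1</id></revision>"
                 b"<revision><id>2</id></revision>"
                 b"</page></mediawiki>",
    "dump2.xml": b"<mediawiki><page>"
                 b"<revision><id>3</id></revision>"
                 b"</page></mediawiki>",
}


def revision_ids(path):
    """Stream one dump and return (path, revision ids found in it)."""
    ids = []
    source = io.BytesIO(DUMPS[path])
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "revision":
            ids.append(int(elem.findtext("id")))
            elem.clear()  # release the revision's children once read
    return path, ids


def map_dumps(process, paths, workers=2):
    """Apply `process` to every path concurrently and yield the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as `paths`.
        yield from pool.map(process, paths)


results = dict(map_dumps(revision_ids, sorted(DUMPS)))
print(results)
```

Each worker streams its own file independently, so adding files adds work per worker without any worker needing to hold more than one dump's current element in memory.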

## Author

* Aaron Halfaker – https://github.com/halfak

## See also

* http://dumps.wikimedia.org/
* http://community.wikia.com/wiki/Help:Database_download

Extras: None
Dependencies:
jsonschema (>=2.5.1)
mwcli (>=0.0.2)
mwtypes (>=0.4.0)
para (>=0.0.1)