python - Parse large XML with lxml
I am trying to get my script working. So far it hasn't managed to output anything.
This is my test.xml:
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="it">
      <page>
        <title>mediawiki:category</title>
        <ns>0</ns>
        <id>2</id>
        <revision>
          <id>11248</id>
          <timestamp>2003-12-31T13:47:54Z</timestamp>
          <contributor>
            <username>frieda</username>
            <id>0</id>
          </contributor>
          <minor />
          <text xml:space="preserve">categoria</text>
          <sha1>0acykl71lto9v65yve23lmjgia1h6sz</sha1>
          <model>wikitext</model>
          <format>text/x-wiki</format>
        </revision>
      </page>
    </mediawiki>

And this is my code:
    from lxml import etree

    def fast_iter(context, func):
        # fast_iter is useful if you need to free memory while iterating
        # through a large XML file.
        #
        # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
        # Author: Liza Daly
        for event, elem in context:
            func(elem)
            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]
        del context

    def process_element(elem):
        if elem.ns.text == '0':
            print(elem.title.text)

    context = etree.iterparse('test.xml', events=('end',), tag='page')
    fast_iter(context, process_element)

I don't get any error, there's just no output. I want to parse the title element only if ns is 0.
You are parsing a namespaced document, and there is no 'page' tag present, because that name only applies to tags without a namespace.
You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.
Many lxml methods let you specify a namespace map to make matching easier, but the iterparse() method is not one of them, unfortunately.
The following .iterparse() call processes the right page tags:
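To make the namespace point concrete, here is a minimal sketch that inspects the tag names lxml actually sees. The in-memory XML is a trimmed, hypothetical stand-in for test.xml, and the "mw" prefix in the namespace map is an arbitrary name chosen for this example, not something the document defines.

```python
from lxml import etree

MW_NS = "http://www.mediawiki.org/xml/export-0.8/"

# Hypothetical, trimmed stand-in for test.xml (assumption for this sketch).
XML = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>mediawiki:category</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

root = etree.fromstring(XML)
page = root[0]

# The tag carries the namespace URI in Clark notation: {uri}localname.
qualified_tag = page.tag

# QName splits a qualified tag into its namespace and local name.
qname = etree.QName(page)
local_name = qname.localname
namespace = qname.namespace

# A bare 'page' lookup matches nothing, while a namespace map makes
# find() match. The "mw" prefix is arbitrary and local to this sketch.
NSMAP = {"mw": MW_NS}
plain_match = root.find("page")
mapped_match = root.find("mw:page", namespaces=NSMAP)

print(qualified_tag)
print(local_name, namespace)
print(plain_match is None, mapped_match is not None)
```

This is why tag='page' silently matches nothing: the element's real name includes the namespace URI, so the filter has to include it too.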
    context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')

But you'll need to use .find() to get the ns and title tags on the page element, or use xpath() calls to get the text directly:
    def process_element(elem):
        # local-name() ignores the namespace, so no namespace map is needed.
        if elem.xpath("./*[local-name()='ns']/text()=0"):
            print(elem.xpath("./*[local-name()='title']/text()")[0])

which, for your input example, prints:
    >>> fast_iter(context, process_element)
    mediawiki:category
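For the .find() route mentioned above, a rough sketch might look like the following. It assumes a namespace map with an arbitrary "mw" prefix, reads from an in-memory sample instead of test.xml, and collects titles in a list rather than printing them. The collect_titles helper and the second page in the sample are invented for illustration and are not part of the original question.

```python
import io
from lxml import etree

MW_NS = "http://www.mediawiki.org/xml/export-0.8/"
NSMAP = {"mw": MW_NS}  # "mw" is an arbitrary prefix chosen for this sketch

# Hypothetical stand-in for test.xml; the second page is invented so that
# the ns filter has something to skip.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>mediawiki:category</title>
    <ns>0</ns>
  </page>
  <page>
    <title>some other page</title>
    <ns>8</ns>
  </page>
</mediawiki>"""


def fast_iter(context, func):
    # Same memory-freeing loop as in the question.
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


def collect_titles(source):
    """Return the titles of all pages whose ns element is '0'."""
    titles = []

    def process_element(elem):
        # findtext() accepts a namespace map, unlike iterparse().
        if elem.findtext("mw:ns", namespaces=NSMAP) == "0":
            titles.append(elem.findtext("mw:title", namespaces=NSMAP))

    # iterparse() itself still needs the fully qualified tag name.
    context = etree.iterparse(source, events=("end",), tag="{%s}page" % MW_NS)
    fast_iter(context, process_element)
    return titles


titles = collect_titles(io.BytesIO(SAMPLE))
print(titles)
```

The text has to be read inside process_element, before fast_iter calls elem.clear(), because clearing the element discards its children.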