python - Parse large XML with lxml


I'm trying to get this script working. So far it hasn't managed to output anything.

This is test.xml:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd"
           version="0.8" xml:lang="it">
  <page>
    <title>mediawiki:category</title>
    <ns>0</ns>
    <id>2</id>
    <revision>
      <id>11248</id>
      <timestamp>2003-12-31T13:47:54Z</timestamp>
      <contributor>
        <username>frieda</username>
        <id>0</id>
      </contributor>
      <minor />
      <text xml:space="preserve">categoria</text>
      <sha1>0acykl71lto9v65yve23lmjgia1h6sz</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
</mediawiki>

And this is the code:

from lxml import etree

def fast_iter(context, func):
    # fast_iter is useful if you need to free memory while iterating
    # through a very large XML file.
    #
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    if elem.ns.text == '0':
        print(elem.title.text)

context = etree.iterparse('test.xml', events=('end',), tag='page')
fast_iter(context, process_element)

I don't get an error, there's simply no output. I want to parse the title element if ns is 0.

You are parsing a namespaced document, and there is no 'page' tag present, because that name only applies to tags without a namespace.

You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.
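To see the namespace-qualified name for yourself, you can inspect .tag on a parsed element. This is only a small sketch: SAMPLE is a trimmed-down stand-in for test.xml that I made up for illustration, and etree.QName is used to split the name into its namespace and local parts.

```python
from lxml import etree

# Trimmed-down stand-in for test.xml (an assumption made for illustration).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b'<page><title>mediawiki:category</title><ns>0</ns></page>'
    b'</mediawiki>'
)

root = etree.fromstring(SAMPLE)
page = root[0]

page_tag = page.tag                        # full name, namespace included
page_local = etree.QName(page).localname   # name without the namespace
page_ns = etree.QName(page).namespace      # namespace URI on its own

print(page_tag)
print(page_local)
```

The full name in page_tag is what the tag argument of iterparse() is compared against, which is why a bare 'page' never matches.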

Many lxml methods do let you specify a namespace map to make matching easier, but the iterparse() method is not one of them, unfortunately.
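For comparison, here is roughly what a namespace map looks like with the methods that do accept one. Treat it as a sketch: the 'mw' prefix and the NSMAP name are arbitrary choices of mine, and SAMPLE is a cut-down stand-in for test.xml.

```python
from lxml import etree

# 'mw' is an arbitrary prefix chosen for this sketch; only the URI matters.
NSMAP = {'mw': 'http://www.mediawiki.org/xml/export-0.8/'}

# Trimmed-down stand-in for test.xml (an assumption made for illustration).
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b'<page><title>mediawiki:category</title><ns>0</ns></page>'
    b'</mediawiki>'
)

root = etree.fromstring(SAMPLE)

# find(), findtext() and xpath() all accept a namespaces= mapping.
page = root.find('mw:page', namespaces=NSMAP)
title_text = page.findtext('mw:title', namespaces=NSMAP)
ns_values = root.xpath('//mw:page/mw:ns/text()', namespaces=NSMAP)

print(title_text)
print(ns_values)
```

Note that this loads the whole document into memory, so it illustrates the namespace map only and is not a replacement for iterparse() on a large dump.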

The following .iterparse() call certainly processes the right page tags:

context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page') 

But you'll need to use .find() to get the ns and title tags on the page element, or use xpath() calls to get the text directly:

def process_element(elem):
    # The XPath comparison returns a boolean: True if the ns child's text equals 0.
    if elem.xpath("./*[local-name()='ns']/text()=0"):
        print(elem.xpath("./*[local-name()='title']/text()")[0])

which, for your input example, prints:

>>> fast_iter(context, process_element)
mediawiki:category
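If you prefer the .find() route over xpath(), a self-contained version could look something like the following. It is a sketch under a few assumptions: the MW constant and the titles list are names I introduced, SAMPLE stands in for test.xml and is read from an in-memory io.BytesIO instead of a file, and the second page (talk:example) is invented so that the ns filter has something to skip.

```python
import io
from lxml import etree

# Namespace in Clark notation; MW is a name introduced for this sketch.
MW = '{http://www.mediawiki.org/xml/export-0.8/}'

# Stand-in for test.xml; the second page is hypothetical and only there
# to show that pages with ns != 0 are skipped.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">'
    b'<page><title>mediawiki:category</title><ns>0</ns></page>'
    b'<page><title>talk:example</title><ns>1</ns></page>'
    b'</mediawiki>'
)

def fast_iter(context, func):
    # Same memory-freeing loop as in the question.
    for event, elem in context:
        func(elem)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

titles = []

def process_element(elem):
    # findtext() returns the text of the first matching child, or None.
    if elem.findtext(MW + 'ns') == '0':
        titles.append(elem.findtext(MW + 'title'))

context = etree.iterparse(io.BytesIO(SAMPLE), events=('end',), tag=MW + 'page')
fast_iter(context, process_element)
print(titles)
```

Collecting the titles in a list instead of printing them directly just makes the result easier to check; for a real dump you would pass the file name to iterparse() as before.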
