python - Beautiful Soup Child Tags Left Over After Extract -

May 15, 2011

When I extract the tags I don't want with a list comprehension, tags that were supposed to be removed are still there.

import requests, pprint
from bs4 import BeautifulSoup as bs

blacklist = ['a', 'title', 'p', 'input', 'u', 'body', 'html',
             'textarea', 'nobr', 'b', 'span', 'td', 'tr',
             'br', 'table', 'form', 'img', 'head', 'meta',
             'script', 'style', 'center',]

soup = bs(requests.get('http://www.google.com').text)

soup = [s.extract() for s in soup() if s.name not in blacklist]

# When printing the tag names, it only shows the tag div.
# pprint.pprint([s.name for s in soup])

# But inside of the divs are tags I don't want.
pprint.pprint(soup)

Output:

[<div id="mngb"></div>,  <div id="gbar"><nobr><b class="gb1">search</b> <a class="gb1" href="http://www.google.com/imghp?hl=en&amp;tab=wi">images</a> <a class="gb1" href="http://maps.google.com/maps?hl=en&amp;tab=wl">maps</a> <a class="gb1" href="https://play.google.com/?hl=en&amp;tab=w8">play</a> <a class="gb1" href="http://www.youtube.com/?tab=w1">youtube</a> <a class="gb1" href="http://news.google.com/nwshp?hl=en&amp;tab=wn">news</a> <a class="gb1" href="https://mail.google.com/mail/?tab=wm">gmail</a> <a class="gb1" href="https://drive.google.com/?tab=wo">drive</a> <a class="gb1" href="http://www.google.com/intl/en/options/" style="text-decoration:none"><u>more</u> »</a></nobr></div>,  <div id="guser" width="100%"><nobr><span class="gbi" id="gbn"></span><span class="gbf" id="gbf"></span><span id="gbe"></span><a class="gb4" href="http://www.google.com/history/optout?hl=en">web history</a> | <a class="gb4" href="/preferences?hl=en">settings</a> | <a class="gb4" href="https://accounts.google.com/servicelogin?hl=en&amp;continue=http://www.google.com/" id="gb_70" target="_top">sign in</a></nobr></div>,  <div class="gbh" style="left:0"></div>,  <div class="gbh" style="right:0"></div>,  <div id="lga"><img alt="google" height="95" id="hplogo" onload="window.lol&amp;&amp;lol()" src="/intl/en_all/images/srpr/logo1w.png" style="padding:28px 0 14px" width="275"/><br/><br/></div>,  <div class="ds" style="height:32px;margin:4px 0"><input autocomplete="off" class="lst" maxlength="2048" name="q" size="57" style="color:#000;margin:0;padding:5px 8px 0 6px;vertical-align:top" title="google search" value=""/></div>,  <div id="gac_scont"></div>,  <div style="font-size:83%;min-height:3.5em"><br/></div>,  <div style="font-size:10pt"></div>,  <div id="fll" style="margin:19px auto;text-align:center"><a href="/intl/en/ads/">advertising programs</a><a href="/services/">business solutions</a><a href="https://plus.google.com/116899029375914044550" rel="publisher">+google</a><a 
href="/intl/en/about.html">about google</a></div>,  <div id="xjsd"></div>,  <div id="xjsi"><script>if(google.y)google.y.first=[];(function(){function b(a){window.settimeout(function(){var c=document.createelement("script");c.src=a;document.getelementbyid("xjsd").appendchild(c)},0)}google.dljp=function(a){google.xjsi||(google.xjsu=a,b(a))};google.dlj=b;})(); if(!google.xjs){google.dstr=[];google.rein=[];window._=window._||{};window._._dumpexception=function(e){throw e};if(google.timers&amp;&amp;google.timers.load.t){google.timers.load.t.xjsls=new date().gettime();}google.dljp('/xjs/_/js/k\x3dpxufaya-26a.en_us./m\x3dsb_he,pcc/rt\x3dj/d\x3d1/sv\x3d1/rs\x3daitrstnufuvo3tysbamkh3iqobwpur6jea');google.xjs=1;}google.pmc={"sb":{"agen":true,"cgen":true,"client":"heirloom-hp","dh":true,"ds":"","eqch":true,"fl":true,"host":"google.com","jsonp":true,"msgs":{"lcky":"i\u0026#39;m feeling lucky","lml":"learn more","oskt":"input tools","psrc":"this search removed \u003ca href=\"/history\"\u003eweb history\u003c/a\u003e","psrl":"remove","sbit":"search image","srch":"google search"},"ovr":{"l":1,"ms":1},"pq":"","qcpw":false,"scd":10,"sce":5,"stok":"btuwxqimkjlvcutq1u6pc2hrvde"},"hp":{},"pcc":{}};google.y.first.push(function(){if(google.med){google.med('init');google.inithistory();google.med('history');}google.history&amp;&amp;google.history.initialize('/');google.hs&amp;&amp;google.hs.init&amp;&amp;google.hs.init()});if(google.j&amp;&amp;google.j.en&amp;&amp;google.j.xi){window.settimeout(google.j.xi,0);}</script></div>]

How can I remove the tags I don't want when they are children of the tags I do want? More specifically, I need a method that can be used in all cases; this code is just a simple example.

Try this:

blacklist = ['a', 'title', 'p', 'input', 'u', 'body', 'html', 'textarea', 'nobr', 'b', 'span', 'td', 'tr', 'br', 'table', 'form', 'img', 'head', 'meta', 'script', 'style', 'center']
soup = [tag for tag in soup.findAll(True) if tag.name not in blacklist]
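Note that selecting tags by name still returns each tag together with its whole subtree, so blacklisted children can remain inside a kept div. One possible way to strip them is sketched below; it is not part of the original answer. The sample markup, the shortened blacklist, and the helper name strip_blacklisted are made up for illustration, and "html.parser" is chosen only so the sketch does not depend on an extra parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the fetched page.
HTML = (
    '<html><body>'
    '<div id="gbar"><nobr><b>search</b> <a href="/imghp">images</a></nobr></div>'
    '<div id="lga"><img src="logo.png"/><br/></div>'
    '</body></html>'
)

# Shortened blacklist, for illustration only.
BLACKLIST = ['a', 'nobr', 'b', 'img', 'br', 'body', 'html']


def strip_blacklisted(soup, blacklist):
    """Return the non-blacklisted tags, with blacklisted descendants removed."""
    kept = [tag for tag in soup.find_all(True) if tag.name not in blacklist]
    for tag in kept:
        # find_all returns a finished list, so the tree can be edited safely
        # while looping over the matches.
        for unwanted in tag.find_all(blacklist):
            # extract() detaches the tag and everything nested inside it.
            unwanted.extract()
    return kept


soup = BeautifulSoup(HTML, 'html.parser')
cleaned = strip_blacklisted(soup, BLACKLIST)
print([str(tag) for tag in cleaned])
# ['<div id="gbar"></div>', '<div id="lga"></div>']
```

This drops the text inside the removed tags as well. If the text should stay and only the markup should go, calling unwrap() on each unwanted tag instead of extract() may be closer to what is needed.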
