python - Can't parse a second table with BeautifulSoup even if the first one works?
I am trying to parse tables using BeautifulSoup. The first one on the page is easy, but I cannot parse a similar table on the same page, and I do not understand why.
Here is my code. Thanks in advance for your help.
    import urllib2
    from bs4 import BeautifulSoup

    url = urllib2.urlopen("https://dl.dropboxusercontent.com/u/956261/poftext.html")
    contenthtml = url.read()
    soup = BeautifulSoup(contenthtml)

    tableuserdetails = soup.find("table", {"class" : "user-details"})
    i = 0
    tableuserdetailslist = []
    for row in tableuserdetails.findAll('tr'):
        for col in row.findAll('td'):
            # take the first child of the cell and strip surrounding whitespace
            contenttd = col.contents[0].string.strip()
            if contenttd:
                print "td number %d : %s" % (i, contenttd)
                tableuserdetailslist.append(contenttd)
                i += 1
    # first table ok
    print tableuserdetailslist

    # 1
    tableuserdetails = soup.find("table", {"class" : "secondpart"})
    i = 0
    tableuserdetailslist = []
    for row in tableuserdetails.findAll('tr'):
        for col in row.findAll('td'):
            contenttd = col.contents[0].string.strip()
            if contenttd:
                print "td number %d : %s" % (i, contenttd)
                tableuserdetailslist.append(contenttd)
                i += 1
    print tableuserdetailslist  # list empty :(
Here is a simplified version of the HTML code I am trying to parse:
<!doctype html public "-//w3c//dtd xhtml 1.0 transitional//en" "http://www.w3.org/tr/xhtml1/dtd/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title> french.kiss sorties, sport, voyages, nouvelles expériences</title> </head> <body style='background-color: #fff;' leftmargin='0' topmargin='0' marginwidth='0' marginheight='0' link='#1e55d6' vlink='#1e55d6' text='#6551b0'> <table class="user-details"> <tr> <td class="headline txtblue size15" style="width:80px"> </td> <td style="width:10px"> </td> <td class="txtgrey size15"> fume occasionnellement silhouette mince </td> <td width="25px;"> </td> <td class="headline txtblue size15"> city </td> <td class="txtgrey size15"> paris ile-de-france </td> </tr> <tr> <td class="headline txtblue size15"> details </td> <td style="width:10px"> </td> <td class="txtgrey size15"> 26 year old un homme, 185cm, sans religion </td> <td> </td> <td class="headline txtblue size15"> ethnicity </td> <td class="txtgrey size15"> caucasienne balance châtains </td> </tr> <tr> <td class="headline txtblue size15"> intent </td> <td style="width:10px"> </td> <td class="txtgrey size15"> french.kiss cherche une relation amoureuse. 
</td> <td> </td> <td class="headline txtblue size15" style="width:90px"> education </td> <td class="txtgrey size15"> diplôme universitaire/licence </td> </tr> <tr> <td class="headline txtblue size15"> personnalité </td> <td style="width:10px"> </td> <td class="txtgrey size15"> </td> <td> </td> <td> <span class="headline txtblue size15">profession </span> </td> <td> <span class="txtgrey size15"> visioconférence</span> </td> </tr> </table> <table width="85%" class="secondpart"> <tr height="25px"> <td width="200px"> <span class="headline txtblue size14">i seeking a</span> </td> <td width="300px"> <span class="txtgrey size14"> une femme</span> </td> <td width="25px"> </td> <td width="200px"> <span class="headline txtblue size14">for</span> </td> <td width="200px"> <span class="txtgrey size14"> sorties</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14"><a href='needs_test.aspx'>needs test</a></span> </td> <td> <span class="txtgrey size14"><a href='needs_test.aspx'> <a href="needs_view.aspx?id=38028200">view relationship needs</a></a></span> </td> <td> </td> <td> <span class="headline txtblue size14"><a href='poftest.aspx'>chemistry</a></span> </td> <td> <span class="txtgrey size14"><a href='poftest.aspx'> <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">view chemistry results</a></a></span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">do drink?</span> </td> <td> <span class="txtgrey size14"> occasionnellement</span> </td> <td> </td> <td> <span class="headline txtblue size14">do want children?</span> </td> <td> <span class="txtgrey size14"> non divulgué</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">marital status</span> </td> <td> <span class="txtgrey size14"> célibataire</span> </td> <td> </td> <td> <span class="headline txtblue size14">do drugs?</span> </td> <td> <span class="txtgrey size14"> non</span> </td> </tr> <tr height="25px"> <td> <span class="headline 
txtblue size14">pets </span> </td> <td> <span class="txtgrey size14"> aucun</span> </td> <td> </td> <td> <span class="headline txtblue size14">eye color</span> </td> <td> <span class="txtgrey size14"> bruns</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">do have car? </span> </td> <td> <span class="txtgrey size14"> n/a</span> </td> <td> </td> <td> <span class="headline txtblue size14">do have children?</span> </td> <td> <span class="txtgrey size14"> non</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">longest relationship</span> </td> <td> <span class="txtgrey size14"> plus de 2 ans</span> </td> <td> </td> <td> </td> <td> </td> </tr> </table> </body> </html>
Here are tableuserdetails.content, tableuserdetails, and tableuserdetailslist for both tables:
* first table *
print tableuserdetails.content = None
print tableuserdetails =
<table class="user-details"> <tr> <td class="headline txtblue size15" style="width:80px"> </td> <td style="width:10px"> </td> <td class="txtgrey size15"> fume occasionnellement silhouette mince </td> <td width="25px;"> </td> <td class="headline txtblue size15"> city </td> <td class="txtgrey size15"> paris ile-de-france </td> </tr> <tr> <td class="headline txtblue size15"> details </td> <td style="width:10px"> </td> <td class="txtgrey size15"> 26 year old un homme, 185cm, sans religion </td> <td> </td> <td class="headline txtblue size15"> ethnicity </td> <td class="txtgrey size15"> caucasienne balance châtains </td> </tr> <tr> <td class="headline txtblue size15"> intent </td> <td style="width:10px"> </td> <td class="txtgrey size15"> french.kiss cherche une relation amoureuse. </td> <td> </td> <td class="headline txtblue size15" style="width:90px"> education </td> <td class="txtgrey size15"> diplôme universitaire/licence </td> </tr> <tr> <td class="headline txtblue size15"> personnalité </td> <td style="width:10px"> </td> <td class="txtgrey size15"> </td> <td> </td> <td> <span class="headline txtblue size15">profession </span> </td> <td> <span class="txtgrey size15"> visioconférence</span> </td> </tr> </table>
print tableuserdetailslist = [u'about', u'fume occasionnellement silhouette mince', u'city', u'paris ile-de-france', u'details', u'26 year old un homme, 185cm, sans religion', u'ethnicity', u'caucasienne balance ch\xe2tains', u'intent', u'french.kiss cherche une relation amoureuse.', u'education', u'dipl\xf4me universitaire/licence', u'personnalit\xe9']
* second table *
print tableuserdetails.content = None
print tableuserdetails =
<table width="85%" class="secondpart"> <tr height="25px"> <td width="200px"> <span class="headline txtblue size14">i seeking a</span> </td> <td width="300px"> <span class="txtgrey size14"> une femme</span> </td> <td width="25px"> </td> <td width="200px"> <span class="headline txtblue size14">for</span> </td> <td width="200px"> <span class="txtgrey size14"> sorties</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14"><a href='needs_test.aspx'>needs test</a></span> </td> <td> <span class="txtgrey size14"><a href='needs_test.aspx'> <a href="needs_view.aspx?id=38028200">view relationship needs</a></a></span> </td> <td> </td> <td> <span class="headline txtblue size14"><a href='poftest.aspx'>chemistry</a></span> </td> <td> <span class="txtgrey size14"><a href='poftest.aspx'> <a href="personality.aspx?id=26&user_id=41724176" rel="nofollow">view chemistry results</a></a></span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">do drink?</span> </td> <td> <span class="txtgrey size14"> occasionnellement</span> </td> <td> </td> <td> <span class="headline txtblue size14">do want children?</span> </td> <td> <span class="txtgrey size14"> non divulgué</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">marital status</span> </td> <td> <span class="txtgrey size14"> célibataire</span> </td> <td> </td> <td> <span class="headline txtblue size14">do drugs?</span> </td> <td> <span class="txtgrey size14"> non</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">pets </span> </td> <td> <span class="txtgrey size14"> aucun</span> </td> <td> </td> <td> <span class="headline txtblue size14">eye color</span> </td> <td> <span class="txtgrey size14"> bruns</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">do have car? 
</span> </td> <td> <span class="txtgrey size14"> n/a</span> </td> <td> </td> <td> <span class="headline txtblue size14">do have children?</span> </td> <td> <span class="txtgrey size14"> non</span> </td> </tr> <tr height="25px"> <td> <span class="headline txtblue size14">longest relationship</span> </td> <td> <span class="txtgrey size14"> plus de 2 ans</span> </td> <td> </td> <td> </td> <td> </td> </tr> </table>
print tableuserdetailslist = []
This works:
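    i = 0  # counter, as in the question's code
    tableuserdetailslist = []
    for row in tableuserdetails.findAll('tr'):
        for col in row.findAll('td'):
            # stripped_strings yields every non-blank string inside the cell,
            # including text nested in child tags such as <span>
            contents = list(col.stripped_strings)
            if contents:
                contenttd = contents[0]
                print "td number %d : %s" % (i, contenttd)
                tableuserdetailslist.append(contenttd)
                i += 1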
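The problem is that the cells of the second table contain <span> elements. The line break before each span is interpreted as content and is returned as the first item of the col.contents list. So col.contents[0].string.strip() is an empty string for those cells, the if test fails, and nothing is appended.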
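To make the difference concrete, here is a minimal sketch comparing two cells modelled on the page: one with text directly in the td, one with the text wrapped in a span. It is written for Python 3 with bs4's `find_all` spelling and the built-in "html.parser" backend, which are assumptions on my part (the code above is Python 2), and the HTML snippet is a cut-down illustration rather than the real page.

```python
from bs4 import BeautifulSoup

# Two cells modelled on the page: text directly in the <td>,
# and text wrapped in a <span> preceded by whitespace.
html = """
<table>
<tr>
<td class="txtgrey size15"> paris ile-de-france </td>
<td> <span class="txtgrey size14"> une femme</span> </td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
plain_td, span_td = soup.find_all("td")

# First child of each cell, which is what col.contents[0] looks at.
first_plain = plain_td.contents[0]   # the text itself
first_span = span_td.contents[0]     # only the whitespace before the <span>

print(repr(first_plain.strip()))
print(repr(first_span.strip()))

# stripped_strings walks into the <span> and skips blank strings.
print(list(span_td.stripped_strings))
```

The second cell's first child is just whitespace, which is why the original loop silently skipped every span-wrapped cell, while `stripped_strings` still reaches the text.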
The corrected loop also works for the first table. As Anubhav commented, you should consider iterating over the tables rather than having the same code twice.
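One way to avoid the duplication might look like the sketch below. The helper name `extract_cells` and the shortened HTML are hypothetical, introduced only for illustration, and the code assumes Python 3 with the "html.parser" backend rather than the Python 2 setup used in the question.

```python
from bs4 import BeautifulSoup


def extract_cells(soup, table_class):
    """Return the first non-blank string of every <td> in the table
    with the given class (hypothetical helper, not from the question)."""
    table = soup.find("table", {"class": table_class})
    if table is None:
        return []
    cells = []
    for row in table.find_all("tr"):
        for col in row.find_all("td"):
            strings = list(col.stripped_strings)
            if strings:
                cells.append(strings[0])
    return cells


# Shortened stand-in for the page: one row per table.
html = """
<table class="user-details">
<tr><td> city </td><td> </td><td> paris ile-de-france </td></tr>
</table>
<table width="85%" class="secondpart">
<tr><td> <span>pets </span> </td><td> <span> aucun</span> </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same extraction logic for both tables, written once.
results = {}
for table_class in ("user-details", "secondpart"):
    results[table_class] = extract_cells(soup, table_class)
    print(table_class, results[table_class])
```

Keeping the class names in a tuple means a third table only needs one more entry, and the `None` check avoids an exception when a table is missing from the page.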