Python regex to find all lines that contain specific types of filenames
I have a text file and want to extract the lines that contain a file name, but only if the file is a .doc or .pdf file.
For example:
<tr><td align="right">4.</td>
<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
<td align="left" valign="top" width=72>l. sam</td>
</tr>
<tr><td align="right">5.</td>
<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>
<td align="left" valign="top" width=72>g.k. ram</td>
</tr>
Using Python's re.findall(), I want to get the following lines:
<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>
Can anybody please tell me a scalable way to define the pattern for re.findall()?
You can use this regex:
(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)
Output:
>>> import re
>>> html = """<tr><td align="right">4.</td>
... <td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
... <td align="left" valign="top" width=72>l. sam</td>
... </tr>
... <tr><td align="right">5.</td>
... <td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>
... <td align="left" valign="top" width=72>g.k. ram</td>
... </tr>"""
>>> re.findall(r"(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)", html)
['<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>', '<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>']
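For the "scalable" part of the question, one option is to build the alternation from a list of extensions instead of typing it by hand. The following is only a sketch: the helper name build_link_line_pattern and the sample variable names are made up for illustration, and it assumes each link sits on its own line, as in the sample above. Unlike the regex above, it accepts any characters except quotes in the file name (not only \w+) and ignores case, which may or may not suit your data.

```python
import re

def build_link_line_pattern(extensions):
    """Compile a regex that matches whole lines linking to a file
    with one of the given extensions (hypothetical helper)."""
    # re.escape keeps extensions such as "tar.gz" from being read as regex syntax.
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    # MULTILINE makes ^ and $ anchor at each line, so every match is one full line.
    return re.compile(
        r"^.*<a\s+href=[\"'][^\"']+\.(?:" + alternatives + r")[\"'].*$",
        re.IGNORECASE | re.MULTILINE,
    )

SAMPLE_HTML = """<tr><td align="right">4.</td>
<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
<td align="left" valign="top" width=72>l. sam</td>
</tr>
<tr><td align="right">5.</td>
<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>
<td align="left" valign="top" width=72>g.k. ram</td>
</tr>"""

pattern = build_link_line_pattern(["doc", "pdf"])
matches = pattern.findall(SAMPLE_HTML)
for line in matches:
    print(line)
```

Adding another file type is then a matter of extending the list, for example build_link_line_pattern(["doc", "pdf", "docx"]). If the HTML is not reliably one element per line, a real HTML parser is probably a safer choice than a regex.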