Python regex to find all lines that contain specific types of filenames


I have a text file and want the lines that contain a file name, but only if the file is a .doc or .pdf file.

For example:

<tr><td align="right">4.</td>
<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
<td align="left" valign="top" width=72>l. sam</td>
</tr>
<tr><td align="right">5.</td>
<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>
<td align="left" valign="top" width=72>g.k. ram</td>
</tr>

Using Python's re.findall(), I want to get the following lines:

<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>

Can anybody please tell me a scalable way to define the pattern for re.findall()?

You can use this regex:

(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*) 

Output:

>>> import re
>>> html = """<tr><td align="right">4.</td>
... <td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
... <td align="left" valign="top" width=72>l. sam</td>
... </tr>
... <tr><td align="right">5.</td>
... <td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>
... <td align="left" valign="top" width=72>g.k. ram</td>
... </tr>"""
>>> re.findall(r"(.*?<a\shref=[\"']\w+(?:\.doc|\.pdf)[\"']>.*)", html)
['<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>', '<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>']
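If more file types are needed later, one possible way to keep this scalable is to build the pattern from a list of extensions instead of hard-coding the alternation. The following is only a sketch under some assumptions: the helper name build_line_pattern and the variable names are made up for illustration, each link is assumed to sit on a single line as in the example, and [^"']+ is used in place of \w+ so that file names containing dashes, dots or path separators would also match.

```python
import re


def build_line_pattern(extensions):
    """Compile a regex matching whole lines whose <a href="..."> ends in a given extension.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # re.escape guards against extensions that contain regex metacharacters.
    alternatives = "|".join(re.escape(ext) for ext in extensions)
    return re.compile(
        rf"""^.*?<a\s+href=["'][^"']+\.(?:{alternatives})["']\s*>.*$""",
        re.MULTILINE | re.IGNORECASE,  # ^ and $ anchor at each line; .PDF also matches
    )


html = """<tr><td align="right">4.</td>
<td align="left" valign="top" width=50%><a href="abc.pdf"> on complex analytic manifolds</a></td>
<td align="left" valign="top" width=72>l. sam</td>
</tr>
<tr><td align="right">5.</td>
<td align="left" valign="top" width=50%><a href="def.doc"> on geometric theory of fields</a>*</td>
<td align="left" valign="top" width=72>g.k. ram</td>
</tr>"""

# The pattern has no capturing group, so findall returns each full matching line.
matching_lines = build_line_pattern(["doc", "pdf"]).findall(html)
for line in matching_lines:
    print(line)
```

Adding another type then only means extending the list, for example build_line_pattern(["doc", "pdf", "docx"]).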
