javascript - Splitting HTML Content Into Sentences, But Keeping Subtags Intact


I'm using the code below to separate the text within paragraph tags into sentences. It works okay, with a few exceptions. However, tags within the paragraphs get chewed up and spit out. Example:

<p>this sample of <a href="#">link</a> getting chewed up.</p> 

So, how can I ignore tags so that I can parse the sentences, place span tags around them, and keep any nested tags in place? Or is it smarter to somehow walk the DOM and do it that way?

// Split the text on the page into clickable sentences
$('p').each(function() {
    var sentences = $(this)
        .text()
        .replace(/(((?![.!?]['"]?\s).)*[.!?]['"]?)(\s|$)/g,
                 '<span class="sentence">$1</span>$3');
    $(this).html(sentences);
});

I am using this in a Chrome extension content script, which means the JavaScript is injected into any page it comes in contact with and parses the <p> tags on the fly. Therefore, it needs to be JavaScript.

Soapbox

We could craft a regex to match your specific case, but given that this is HTML parsing and your use case hints that any number of tags could be in there, you'd be best off using the DOM or a product like HTML Agility Pack (free).

However

If you're only looking to pull out the inner text and are not interested in retaining any of the tag data, you could use this regex and replace all matches with an empty string:

(<[^>]*>)
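In JavaScript, which the question requires, that replacement might look like the sketch below. The function name `stripTags` is made up for illustration. The pattern treats anything between `<` and `>` as a tag, so it will misfire on markup that has a literal `>` inside an attribute value.

```javascript
// Minimal sketch: remove every tag and keep only the text between them.
// `stripTags` is a hypothetical helper name, not part of any library.
function stripTags(html) {
  // The g flag replaces all matches, not just the first one.
  return html.replace(/<[^>]*>/g, '');
}

const sample = '<p>this sample of <a href="#">link</a> getting chewed up.</p>';
console.log(stripTags(sample));
// this sample of link getting chewed up.
```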


Retain the sentence, including subtags

  • ((?:<p(?:\s[^>]*)?>).*?</p>) - retains the paragraph tags and the entire sentence, but not any data outside the paragraph

  • (?:<p(?:\s[^>]*)?>)(.*?)(?:</p>) - retains only the paragraph's inner text, including all subtags, and stores the sentence in group 1

  • (<p(?:\s[^>]*)?>)(.*?)(</p>) - captures the opening and closing paragraph tags and the inner text, including any subtags

Granted, these are PowerShell examples, but the regex and the replace function should be similar in JavaScript.

$string = '<img> not stuff either</img><p class=supercoolstuff>this sample of <a href="#">link</a> getting chewed up.</p><a> other stuff</a>'

Write-Host "replace p tags new span tag"
$string -replace '(?:<p(?:\s[^>]*)?>)(.*?)(?:</p>)', '<span class=sentence>$1</span>'

Write-Host
Write-Host "insert p tag's inner text span new span tag , return entire thing including p tags"
$string -replace '(<p(?:\s[^>]*)?>)(.*?)(</p>)', '$1<span class=sentence>$2</span>$3'

This yields:

replace p tags new span tag
<img> not stuff either</img><span class=sentence>this sample of <a href="#">link</a> getting chewed up.</span><a> other stuff</a>

insert p tag's inner text span new span tag , return entire thing including p tags
<img> not stuff either</img><p class=supercoolstuff><span class=sentence>this sample of <a href="#">link</a> getting chewed up.</span></p><a> other stuff</a>
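Because the question needs JavaScript, here is a rough port of the second replacement, followed by a tag-aware variant of the question's sentence regex. Both function names are invented for this sketch, and it has not been tried against real pages. `wrapSentences` treats each tag as a single unit so that a tag is never split, but it only detects a sentence end that is followed by whitespace or the end of the string, and a tag that spans a sentence boundary would produce mis-nested markup. Walking the DOM avoids both problems.

```javascript
// Hypothetical helper: wrap the inner HTML of each <p> in one span,
// keeping the <p> tags and any subtags (port of the PowerShell example).
function wrapParagraphs(html) {
  // [\s\S] is used instead of . so paragraphs containing newlines also match.
  return html.replace(/(<p(?:\s[^>]*)?>)([\s\S]*?)(<\/p>)/g,
                      '$1<span class="sentence">$2</span>$3');
}

// Hypothetical helper: wrap each sentence of a paragraph's inner HTML.
// The repeated unit is either a whole tag, or one non-"<" character that
// does not begin a sentence terminator (., ! or ?, an optional quote,
// then whitespace or end of input).
function wrapSentences(innerHtml) {
  const sentence =
    /((?:<[^>]*>|(?![.!?]['"]?(?:\s|$))[^<])*[.!?]['"]?)(\s|$)/g;
  return innerHtml.replace(sentence, '<span class="sentence">$1</span>$2');
}

const input =
  '<p class=supercoolstuff>this sample of <a href="#">link</a> getting chewed up.</p>';
console.log(wrapParagraphs(input));
// <p class=supercoolstuff><span class="sentence">this sample of <a href="#">link</a> getting chewed up.</span></p>

console.log(wrapSentences('First <a href="#">link</a> here. Second one!'));
// <span class="sentence">First <a href="#">link</a> here.</span> <span class="sentence">Second one!</span>
```

In a content script, the inner HTML of each paragraph could be passed through `wrapSentences` in place of the `.text()` call, since `.text()` is what discards the subtags in the first place.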
