Does Python re (regex) have an alternative to \u Unicode escape sequences?

June 15, 2013

Python treats \uXXXX as a Unicode character escape inside a string literal (e.g. u"\u2014" is interpreted as the Unicode character U+2014). However, I discovered that (in Python 2.7) the standard regex module doesn't treat \uXXXX as a Unicode character. Example:

    import re

    codepoint = 2014  # hex digits of the code point, obtained dynamically from somewhere

    test = u"This string ends with \u2014"
    pattern = r"\u%s$" % codepoint
    assert(pattern[-5:] == "2014$")  # ends with an escape sequence for U+2014
    assert(re.search(pattern, test) is not None)  # failure -- no match (bad)
    assert(re.search(pattern, "u2014") is not None)  # success -- this matches (bad)
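To see why the raw pattern fails, it may help to compare what each kind of literal actually contains. The sketch below uses illustrative variable names that are not from the question; it only inspects the strings and makes no claim about how any particular regex engine handles them.

```python
# The parser converts the escape in a normal Unicode literal into one character.
literal = u"\u2014"

# A raw string keeps the backslash, the letter 'u', and the four digits as-is.
raw = r"\u2014"

print(len(literal))  # one character: U+2014
print(len(raw))      # six characters: \ u 2 0 1 4
```

So the regex engine in the failing example never receives the character U+2014, only the six-character text, and has to interpret the escape itself.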

Obviously, if you are able to specify the regex pattern as a string literal, you get the same effect as if the regex engine itself understood \uXXXX escapes:

    test = u"This string ends with \u2014"
    pattern = u"\u2014$"
    assert(pattern[:-1] == u"\u2014")  # ends with the actual Unicode character U+2014
    assert(re.search(pattern, test) is not None)

But what if you need to construct the pattern dynamically?

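Use the unichr() function to create the Unicode character from its code point. Note that unichr() expects the code point as an integer, so U+2014 must be passed as 0x2014 (decimal 8212), not as the decimal number 2014:

    codepoint = 0x2014  # integer code point for U+2014
    pattern = u"%s$" % unichr(codepoint)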
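If the code point arrives as a string of hex digits, as in the question, one possible way to wrap this up is sketched below. The helper name `ends_with_codepoint` and the `unichr = chr` fallback are assumptions made for illustration, so that the same code runs on Python 3, where `chr()` takes over the role of `unichr()`. `re.escape()` is added in case the resulting character happens to be a regex metacharacter.

```python
import re

try:
    unichr  # exists on Python 2
except NameError:
    unichr = chr  # assumption: on Python 3, chr() handles any code point

def ends_with_codepoint(text, hex_codepoint):
    """Return True if text ends with the character named by hex digits such as "2014"."""
    # Convert the hex digits to an integer, then to the actual character.
    char = unichr(int(hex_codepoint, 16))
    # Escape the character so that e.g. "$" or "." is matched literally.
    pattern = re.escape(char) + u"$"
    return re.search(pattern, text) is not None
```

With this helper, `ends_with_codepoint(u"This string ends with \u2014", "2014")` should match, while the plain text `u"u2014"` should not.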
