Does Python re (regex) have an alternative to \u Unicode escape sequences?
Python treats \uxxxx as a Unicode character escape inside a string literal (e.g. u"\u2014" is interpreted as the Unicode character U+2014). But I just discovered (Python 2.7) that the standard regex module doesn't treat \uxxxx as a Unicode character. Example:
    import re
    codepoint = 2014  # say I got this dynamically from somewhere
    test = u"this string ends with \u2014"
    pattern = r"\u%s$" % codepoint
    assert pattern[-5:] == "2014$"  # ends with an escape sequence for U+2014
    assert re.search(pattern, test) is not None  # failure -- no match (bad)
    assert re.search(pattern, "u2014") is not None  # success -- this matches (bad)
Obviously, if you are able to specify your regex pattern as a string literal, then you can have the same effect as if the regex engine itself understood \uxxxx escapes:
    test = u"this string ends with \u2014"
    pattern = u"\u2014$"
    assert pattern[:-1] == u"\u2014"  # ends with the actual Unicode char U+2014
    assert re.search(pattern, test) is not None
But what if you need to construct your pattern dynamically?
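Use the unichr() function to create a Unicode character from a code point:
    pattern = u"%s$" % unichr(codepoint)
Note that unichr() takes the code point as an integer, so for U+2014 the value must be 0x2014 (8212), not the decimal number 2014 used in the question.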
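A slightly fuller sketch of the same idea follows. The helper name ends_with_codepoint and the variable hex_digits are hypothetical, introduced only for illustration. It assumes the code point may arrive as a string of hex digits, and it adds re.escape() in case the resulting character happens to be a regex metacharacter. The unichr fallback is an assumption that the snippet might also be run on Python 3, where chr() plays the same role.

```python
import re

try:
    unichr  # Python 2 built-in
except NameError:
    unichr = chr  # Python 3: chr() already returns Unicode characters


def ends_with_codepoint(text, codepoint):
    """Return True if text ends with the character at the integer code point.

    Hypothetical helper, not part of the re module.
    """
    # re.escape() guards against code points that are regex
    # metacharacters, such as 0x24 ("$").
    pattern = re.escape(unichr(codepoint)) + u"$"
    return re.search(pattern, text) is not None


hex_digits = "2014"              # hypothetical input: code point as hex digits
codepoint = int(hex_digits, 16)  # 0x2014 == 8212

print(ends_with_codepoint(u"this string ends with \u2014", codepoint))  # True
print(ends_with_codepoint(u"u2014", codepoint))                         # False
```

Building the character first and then escaping it keeps the pattern independent of how the regex engine handles backslash escapes. On Python 3.3 and later, re also understands \uXXXX inside patterns, but the approach above works on both versions.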