Hide keyboard shortcuts

Hot-keys on this page

r m x p   toggle line displays

j k   next/prev highlighted chunk

0   (zero) top of page

1   (one) first highlighted chunk

1""" 

2This is a python implementation of wcwidth() and wcswidth(). 

3 

4https://github.com/jquast/wcwidth 

5 

6from Markus Kuhn's C code, retrieved from: 

7 

8 http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c 

9 

10This is an implementation of wcwidth() and wcswidth() (defined in 

11IEEE Std 1002.1-2001) for Unicode. 

12 

13http://www.opengroup.org/onlinepubs/007904975/functions/wcwidth.html 

14http://www.opengroup.org/onlinepubs/007904975/functions/wcswidth.html 

15 

16In fixed-width output devices, Latin characters all occupy a single 

17"cell" position of equal width, whereas ideographic CJK characters 

18occupy two such cells. Interoperability between terminal-line 

19applications and (teletype-style) character terminals using the 

20UTF-8 encoding requires agreement on which character should advance 

21the cursor by how many cell positions. No established formal 

22standards exist at present on which Unicode character shall occupy 

23how many cell positions on character terminals. These routines are 

24a first attempt of defining such behavior based on simple rules 

25applied to data provided by the Unicode Consortium. 

26 

27For some graphical characters, the Unicode standard explicitly 

28defines a character-cell width via the definition of the East Asian 

29FullWidth (F), Wide (W), Half-width (H), and Narrow (Na) classes. 

30In all these cases, there is no ambiguity about which width a 

31terminal shall use. For characters in the East Asian Ambiguous (A) 

32class, the width choice depends purely on a preference of backward 

33compatibility with either historic CJK or Western practice. 

34Choosing single-width for these characters is easy to justify as 

35the appropriate long-term solution, as the CJK practice of 

36displaying these characters as double-width comes from historic 

37implementation simplicity (8-bit encoded characters were displayed 

38single-width and 16-bit ones double-width, even for Greek, 

39Cyrillic, etc.) and not any typographic considerations. 

40 

41Much less clear is the choice of width for the Not East Asian 

42(Neutral) class. Existing practice does not dictate a width for any 

43of these characters. It would nevertheless make sense 

44typographically to allocate two character cells to characters such 

45as for instance EM SPACE or VOLUME INTEGRAL, which cannot be 

46represented adequately with a single-width glyph. The following 

47routines at present merely assign a single-cell width to all 

48neutral characters, in the interest of simplicity. This is not 

49entirely satisfactory and should be reconsidered before 

50establishing a formal standard in this area. At the moment, the 

51decision which Not East Asian (Neutral) characters should be 

52represented by double-width glyphs cannot yet be answered by 

53applying a simple rule from the Unicode database content. Setting 

54up a proper standard for the behavior of UTF-8 character terminals 

55will require a careful analysis not only of each Unicode character, 

56but also of each presentation form, something the author of these 

57routines has avoided to do so far. 

58 

59http://www.unicode.org/unicode/reports/tr11/ 

60 

61Latest version: http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c 

62""" 

63from __future__ import division 

64 

65# std imports 

66import os 

67import sys 

68import warnings 

69 

70# local 

71from .table_wide import WIDE_EASTASIAN 

72from .table_zero import ZERO_WIDTH 

73from .unicode_versions import list_versions 

74 

75try: 

76 from functools import lru_cache 

77except ImportError: 

78 # lru_cache was added in Python 3.2 

79 from backports.functools_lru_cache import lru_cache 

80 

81# global cache 

82_UNICODE_CMPTABLE = None 

83_PY3 = (sys.version_info[0] >= 3) 

84 

85 

86# NOTE: created by hand, there isn't anything identifiable other than 

87# general Cf category code to identify these, and some characters in Cf 

88# category code are of non-zero width. 

89# Also includes some Cc, Mn, Zl, and Zp characters 

90ZERO_WIDTH_CF = set([ 

91 0, # Null (Cc) 

92 0x034F, # Combining grapheme joiner (Mn) 

93 0x200B, # Zero width space 

94 0x200C, # Zero width non-joiner 

95 0x200D, # Zero width joiner 

96 0x200E, # Left-to-right mark 

97 0x200F, # Right-to-left mark 

98 0x2028, # Line separator (Zl) 

99 0x2029, # Paragraph separator (Zp) 

100 0x202A, # Left-to-right embedding 

101 0x202B, # Right-to-left embedding 

102 0x202C, # Pop directional formatting 

103 0x202D, # Left-to-right override 

104 0x202E, # Right-to-left override 

105 0x2060, # Word joiner 

106 0x2061, # Function application 

107 0x2062, # Invisible times 

108 0x2063, # Invisible separator 

109]) 

110 

111 

112def _bisearch(ucs, table): 

113 """ 

114 Auxiliary function for binary search in interval table. 

115 

116 :arg int ucs: Ordinal value of unicode character. 

117 :arg list table: List of starting and ending ranges of ordinal values, 

118 in form of ``[(start, end), ...]``. 

119 :rtype: int 

120 :returns: 1 if ordinal value ucs is found within lookup table, else 0. 

121 """ 

122 lbound = 0 

123 ubound = len(table) - 1 

124 

125 if ucs < table[0][0] or ucs > table[ubound][1]: 

126 return 0 

127 while ubound >= lbound: 

128 mid = (lbound + ubound) // 2 

129 if ucs > table[mid][1]: 

130 lbound = mid + 1 

131 elif ucs < table[mid][0]: 

132 ubound = mid - 1 

133 else: 

134 return 1 

135 

136 return 0 

137 

138 

139@lru_cache(maxsize=1000) 

140def wcwidth(wc, unicode_version='auto'): 

141 r""" 

142 Given one Unicode character, return its printable length on a terminal. 

143 

144 :param str wc: A single Unicode character. 

145 :param str unicode_version: A Unicode version number, such as 

146 ``'6.0.0'``, the list of available version levels may be 

147 listed by pairing function :func:`list_versions`. 

148 

149 Any version string may be specified without error -- the nearest 

150 matching version is selected. When ``latest`` (default), the 

151 highest Unicode version level is used. 

152 :return: The width, in cells, necessary to display the character of 

153 Unicode string character, ``wc``. Returns 0 if the ``wc`` argument has 

154 no printable effect on a terminal (such as NUL '\0'), -1 if ``wc`` is 

155 not printable, or has an indeterminate effect on the terminal, such as 

156 a control character. Otherwise, the number of column positions the 

157 character occupies on a graphic terminal (1 or 2) is returned. 

158 :rtype: int 

159 

160 The following have a column width of -1: 

161 

162 - C0 control characters (U+001 through U+01F). 

163 

164 - C1 control characters and DEL (U+07F through U+0A0). 

165 

166 The following have a column width of 0: 

167 

168 - Non-spacing and enclosing combining characters (general 

169 category code Mn or Me in the Unicode database). 

170 

171 - NULL (``U+0000``). 

172 

173 - COMBINING GRAPHEME JOINER (``U+034F``). 

174 

175 - ZERO WIDTH SPACE (``U+200B``) *through* 

176 RIGHT-TO-LEFT MARK (``U+200F``). 

177 

178 - LINE SEPARATOR (``U+2028``) *and* 

179 PARAGRAPH SEPARATOR (``U+2029``). 

180 

181 - LEFT-TO-RIGHT EMBEDDING (``U+202A``) *through* 

182 RIGHT-TO-LEFT OVERRIDE (``U+202E``). 

183 

184 - WORD JOINER (``U+2060``) *through* 

185 INVISIBLE SEPARATOR (``U+2063``). 

186 

187 The following have a column width of 1: 

188 

189 - SOFT HYPHEN (``U+00AD``). 

190 

191 - All remaining characters, including all printable ISO 8859-1 

192 and WGL4 characters, Unicode control characters, etc. 

193 

194 The following have a column width of 2: 

195 

196 - Spacing characters in the East Asian Wide (W) or East Asian 

197 Full-width (F) category as defined in Unicode Technical 

198 Report #11 have a column width of 2. 

199 

200 - Some kinds of Emoji or symbols. 

201 """ 

202 # NOTE: created by hand, there isn't anything identifiable other than 

203 # general Cf category code to identify these, and some characters in Cf 

204 # category code are of non-zero width. 

205 ucs = ord(wc) 

206 if ucs in ZERO_WIDTH_CF: 

207 return 0 

208 

209 # C0/C1 control characters 

210 if ucs < 32 or 0x07F <= ucs < 0x0A0: 

211 return -1 

212 

213 _unicode_version = _wcmatch_version(unicode_version) 

214 

215 # combining characters with zero width 

216 if _bisearch(ucs, ZERO_WIDTH[_unicode_version]): 

217 return 0 

218 

219 return 1 + _bisearch(ucs, WIDE_EASTASIAN[_unicode_version]) 

220 

221 

222def wcswidth(pwcs, n=None, unicode_version='auto'): 

223 """ 

224 Given a unicode string, return its printable length on a terminal. 

225 

226 :param str pwcs: Measure width of given unicode string. 

227 :param int n: When ``n`` is None (default), return the length of the 

228 entire string, otherwise width the first ``n`` characters specified. 

229 :param str unicode_version: An explicit definition of the unicode version 

230 level to use for determination, may be ``auto`` (default), which uses 

231 the Environment Variable, ``UNICODE_VERSION`` if defined, or the latest 

232 available unicode version, otherwise. 

233 :rtype: int 

234 :returns: The width, in cells, necessary to display the first ``n`` 

235 characters of the unicode string ``pwcs``. Returns ``-1`` if 

236 a non-printable character is encountered. 

237 """ 

238 # pylint: disable=C0103 

239 # Invalid argument name "n" 

240 

241 end = len(pwcs) if n is None else n 

242 idx = slice(0, end) 

243 width = 0 

244 for char in pwcs[idx]: 

245 wcw = wcwidth(char, unicode_version) 

246 if wcw < 0: 

247 return -1 

248 width += wcw 

249 return width 

250 

251 

252@lru_cache(maxsize=128) 

253def _wcversion_value(ver_string): 

254 """ 

255 Integer-mapped value of given dotted version string. 

256 

257 :param str ver_string: Unicode version string, of form ``n.n.n``. 

258 :rtype: tuple(int) 

259 :returns: tuple of digit tuples, ``tuple(int, [...])``. 

260 """ 

261 retval = tuple(map(int, (ver_string.split('.')))) 

262 return retval 

263 

264 

265@lru_cache(maxsize=8) 

266def _wcmatch_version(given_version): 

267 """ 

268 Return nearest matching supported Unicode version level. 

269 

270 If an exact match is not determined, the nearest lowest version level is 

271 returned after a warning is emitted. For example, given supported levels 

272 ``4.1.0`` and ``5.0.0``, and a version string of ``4.9.9``, then ``4.1.0`` 

273 is selected and returned: 

274 

275 >>> _wcmatch_version('4.9.9') 

276 '4.1.0' 

277 >>> _wcmatch_version('8.0') 

278 '8.0.0' 

279 >>> _wcmatch_version('1') 

280 '4.1.0' 

281 

282 :param str given_version: given version for compare, may be ``auto`` 

283 (default), to select Unicode Version from Environment Variable, 

284 ``UNICODE_VERSION``. If the environment variable is not set, then the 

285 latest is used. 

286 :rtype: str 

287 :returns: unicode string, or non-unicode ``str`` type for python 2 

288 when given ``version`` is also type ``str``. 

289 """ 

290 # Design note: the choice to return the same type that is given certainly 

291 # complicates it for python 2 str-type, but allows us to define an api that 

292 # to use 'string-type', for unicode version level definitions, so all of our 

293 # example code works with all versions of python. That, along with the 

294 # string-to-numeric and comparisons of earliest, latest, matching, or 

295 # nearest, greatly complicates this function. 

296 _return_str = not _PY3 and isinstance(given_version, str) 

297 

298 if _return_str: 

299 unicode_versions = [ucs.encode() for ucs in list_versions()] 

300 else: 

301 unicode_versions = list_versions() 

302 latest_version = unicode_versions[-1] 

303 

304 if given_version in (u'auto', 'auto'): 

305 given_version = os.environ.get( 

306 'UNICODE_VERSION', 

307 'latest' if not _return_str else latest_version.encode()) 

308 

309 if given_version in (u'latest', 'latest'): 

310 # default match, when given as 'latest', use the most latest unicode 

311 # version specification level supported. 

312 return latest_version if not _return_str else latest_version.encode() 

313 

314 if given_version in unicode_versions: 

315 # exact match, downstream has specified an explicit matching version 

316 # matching any value of list_versions(). 

317 return given_version if not _return_str else given_version.encode() 

318 

319 # The user's version is not supported by ours. We return the newest unicode 

320 # version level that we support below their given value. 

321 try: 

322 cmp_given = _wcversion_value(given_version) 

323 

324 except ValueError: 

325 # submitted value raises ValueError in int(), warn and use latest. 

326 warnings.warn("UNICODE_VERSION value, {given_version!r}, is invalid. " 

327 "Value should be in form of `integer[.]+', the latest " 

328 "supported unicode version {latest_version!r} has been " 

329 "inferred.".format(given_version=given_version, 

330 latest_version=latest_version)) 

331 return latest_version if not _return_str else latest_version.encode() 

332 

333 # given version is less than any available version, return earliest 

334 # version. 

335 earliest_version = unicode_versions[0] 

336 cmp_earliest_version = _wcversion_value(earliest_version) 

337 

338 if cmp_given <= cmp_earliest_version: 

339 # this probably isn't what you wanted, the oldest wcwidth.c you will 

340 # find in the wild is likely version 5 or 6, which we both support, 

341 # but it's better than not saying anything at all. 

342 warnings.warn("UNICODE_VERSION value, {given_version!r}, is lower " 

343 "than any available unicode version. Returning lowest " 

344 "version level, {earliest_version!r}".format( 

345 given_version=given_version, 

346 earliest_version=earliest_version)) 

347 return earliest_version if not _return_str else earliest_version.encode() 

348 

349 # create list of versions which are less than our equal to given version, 

350 # and return the tail value, which is the highest level we may support, 

351 # or the latest value we support, when completely unmatched or higher 

352 # than any supported version. 

353 # 

354 # function will never complete, always returns. 

355 for idx, unicode_version in enumerate(unicode_versions): 

356 # look ahead to next value 

357 try: 

358 cmp_next_version = _wcversion_value(unicode_versions[idx + 1]) 

359 except IndexError: 

360 # at end of list, return latest version 

361 return latest_version if not _return_str else latest_version.encode() 

362 

363 # Maybe our given version has less parts, as in tuple(8, 0), than the 

364 # next compare version tuple(8, 0, 0). Test for an exact match by 

365 # comparison of only the leading dotted piece(s): (8, 0) == (8, 0). 

366 if cmp_given == cmp_next_version[:len(cmp_given)]: 

367 return unicode_versions[idx + 1] 

368 

369 # Or, if any next value is greater than our given support level 

370 # version, return the current value in index. Even though it must 

371 # be less than the given value, its our closest possible match. That 

372 # is, 4.1 is returned for given 4.9.9, where 4.1 and 5.0 are available. 

373 if cmp_next_version > cmp_given: 

374 return unicode_version 

375 assert False, ("Code path unreachable", given_version, unicode_versions)