
Writing a simple unicode selector

One of the emacs packages I use, counsel, has a pretty nifty command called counsel-unicode-char that lets you look up unicode characters by code point or name and insert them into the current buffer. In non-emacs speak, this is essentially just an easy way to type emoji, symbols, and any other unicode glyph. I thought it might be handy to write a quick script that would let me do this anywhere on my machine and copy the character to the clipboard.

Enter unicodedata

So… programmatically getting a unicode character's name? Naturally we turn to python - if there's a quick way to do it, python usually has it. I was glad to find out that python comes with a built-in library called unicodedata that could do exactly what I wanted. So I quickly whipped up the following:

#!/usr/bin/python

import sys, unicodedata

# Print all named unicode chars
try:
    # Max of range chosen experimentally, lol
    for i in range(32,918000):
        try:
            char = chr(i)
            print(f'U+{i:05x}\t{char}\t{unicodedata.name(char)}')
        except ValueError:
            continue
except (BrokenPipeError, IOError):
    pass

sys.stderr.close()

We start at the first printable character, with an integer value of 32, and loop all the way up to 918000, just past the largest codepoint this library has a name for. We print the hexadecimal codepoint, the character itself, and the character name, skipping characters that don't have one.
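As a quick sanity check, unicodedata can also go the other way and look up a character by its name. A minimal round-trip, using nothing beyond the standard library:

```python
import unicodedata

# name() maps a character to its official unicode name...
assert unicodedata.name("λ") == "GREEK SMALL LETTER LAMDA"

# ...and lookup() goes the other way, name -> character
assert unicodedata.lookup("GREEK SMALL LETTER LAMDA") == "λ"

# Unnamed codepoints (most control characters, for instance) raise
# ValueError - this is what the script's inner try/except skips over
try:
    unicodedata.name("\x01")
except ValueError:
    print("no name for U+00001")
```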

Then we can pass this into rofi to create our selector!

unicode.py | rofi -i -dmenu | cut -d$'\t' -f2 | xclip -r -selection clipboard

This is why we're catching BrokenPipeError and IOError and closing stderr - if we exit rofi before the python script finishes producing output, the remaining writes hit a closed pipe, and closing stderr keeps python from printing a noisy "Exception ignored" message on shutdown.
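To see the broken pipe in isolation, here's a self-contained sketch where `head -n 1` stands in for an early-exiting rofi (this is just an illustration, not part of the selector):

```python
import subprocess

# `head -n 1` reads a single line and exits, just like quitting
# rofi before the producer is done
proc = subprocess.Popen(["head", "-n", "1"],
                        stdin=subprocess.PIPE,
                        stdout=subprocess.DEVNULL,
                        text=True)
try:
    # Enough output to overflow the pipe buffer after head exits,
    # which is what triggers the error
    for _ in range(100_000):
        proc.stdin.write("some line\n")
except BrokenPipeError:
    pass  # the reader is gone; stop quietly
finally:
    try:
        proc.stdin.close()
    except BrokenPipeError:
        pass
proc.wait()
```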

Here's what this looks like:

rofi-unicode.png

Can we do better?

So there I was, feeling pretty good about myself - it's python, it's built-in, it's portable, what's not to love? Well, pretty often when I write a program the next question in my head is either "Can I do it in scheme?" or "Can I do it faster?".

I began looking into the source code for python's unicodedata library and found that they were generating a database based on unicode.org's UnicodeData.txt and a couple of other files. Very impressive, over-my-head sort of stuff. I then found that unicode.org has a C library for working with unicode, and wondered why python doesn't use that - probably something to do with not relying on that dependency.
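For a sense of what that database looks like: UnicodeData.txt is a plain semicolon-separated text file, with the hex codepoint in the first field and the character name in the second. A rough sketch of reading it directly (the parse_unicode_data helper is my own for illustration, not python's actual implementation, and it assumes a local copy of the file):

```python
def parse_unicode_data(path):
    """Map codepoint -> character name from a UnicodeData.txt file."""
    names = {}
    with open(path, encoding="ascii") as f:
        for line in f:
            fields = line.split(";")
            code, name = fields[0], fields[1]
            # Names in angle brackets, like "<control>", mark
            # unnamed characters or range markers - skip them
            if not name.startswith("<"):
                names[int(code, 16)] = name
    return names

# Usage (assuming you've downloaded the file from unicode.org):
# names = parse_unicode_data("UnicodeData.txt")
```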

While I could just use the C library directly, Chicken Scheme is more fun to write, and I thought it might be a useful library to have in that language too. So after way too much time and head scratching, I present: icu. Here's what the same program looks like in Chicken:

#!/usr/bin/chicken-scheme
;; AUTOCOMPILE: -O5
(import chicken.format
        icu
        utf8)
(do ((i 32 (add1 i)))
    ((= i 918000))
  (let* ((char (integer->char i))
         (name (char-string-name char)))
    (when name
      ;; Currently printf doesn't support utf8
      ;; (printf "U+~a\t~a\t~a\n" (number->string i 16) (string char) name)
      (display "U+")
      (display (number->string i 16))
      (display #\tab)
      (display char)
      (display #\tab)
      (display name)
      (newline))))

The shebang here is from the wonderful autocompile egg. The speed gain here turned out not to be that great, maybe about 2x, but the fun gain…

Date: 2020-12-18