I wrote previously about how, when attempting to copy a large batch of files from my Linux laptop to my M1 MacBook Pro, many files with Japanese filenames failed to copy until I renamed them. As a temporary workaround (and as a programming exercise), I created scripts in both Ruby and F# to randomize their names (since the names were not particularly significant).

I recently returned to this problem and confirmed that the cause was what I’d suspected: differences in Unicode normalization between the two OSes.

For example, take the following Japanese characters: ボ and ボ, both of which represent the sound “bo” in the katakana script. They appear identical, but they are composed quite differently:

  • The former is a single “composed” code point, 0x30dc, which represents the entire character as seen and is an example of NFC (Normalization Form C, Canonical Composition)
  • The latter is a combination of the “decomposed” code points ホ (0x30db) and ◌゙ (0x3099) and is an example of NFD (Normalization Form D, Canonical Decomposition)

For greatest compatibility between Linux and macOS, it seems the NFC form is best.
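
To make the difference concrete, here is a minimal Ruby sketch (not taken from my scripts) that builds both forms of “bo” from explicit code points and compares them. The constant names and the `hex_codepoints` helper are made up for illustration; `String#unicode_normalize` is part of Ruby's standard library.

```ruby
# Build both forms from explicit escapes so the editor's or file's own
# normalization cannot change what is being compared.
COMPOSED   = "\u30DC"        # single code point (NFC form)
DECOMPOSED = "\u30DB\u3099"  # base character + combining voiced sound mark (NFD form)

# Hypothetical helper: list a string's code points as hex strings.
def hex_codepoints(str)
  str.codepoints.map { |cp| format("0x%04x", cp) }
end

p hex_codepoints(COMPOSED)    # ["0x30dc"]
p hex_codepoints(DECOMPOSED)  # ["0x30db", "0x3099"]

# The strings look the same on screen but are not equal as stored...
p COMPOSED == DECOMPOSED                           # false
# ...until both are brought to the same normalization form.
p DECOMPOSED.unicode_normalize(:nfc) == COMPOSED   # true
p COMPOSED.unicode_normalize(:nfd) == DECOMPOSED   # true
```

A filesystem that stores names byte for byte keeps these as two distinct names, so a name can look identical on both machines and still not match.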

I’ve created a new F# script, Recursive Path Normalizer, that recursively normalizes all directory and file names to NFC form within a given path. Of course, there are other ways to do this, such as convmv¹, but it was a good chance to work with F# again and craft my own solution. (I adapted a chunk of code from my previous random-name script, which gave me a quick start. I also incorporated Stopwatch as a timer.)
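
The general idea can be sketched in a few lines of Ruby. This is only an illustration of the approach, not the F# script itself: the `normalize_names` method and its `dry_run` option are hypothetical, and it assumes it runs on the Linux side, where the filesystem stores names byte for byte.

```ruby
# Renames every entry under `root` whose name is not already in NFC form.
# Children are processed before their parent directory so that paths remain
# valid during the walk. Returns an array of [old_path, new_path] pairs.
# With dry_run: true (the default) nothing is renamed; the pairs are only reported.
def normalize_names(root, dry_run: true)
  renames = []
  Dir.children(root).each do |entry|
    # Treat names as UTF-8 bytes regardless of the current locale.
    name = entry.dup.force_encoding(Encoding::UTF_8)
    path = File.join(root, name)

    # Recurse into real directories first (symlinks are not followed).
    if File.directory?(path) && !File.symlink?(path)
      renames.concat(normalize_names(path, dry_run: dry_run))
    end

    next unless name.valid_encoding?  # leave non-UTF-8 names alone
    nfc_name = name.unicode_normalize(:nfc)
    next if nfc_name == name          # already NFC

    target = File.join(root, nfc_name)
    next if File.exist?(target)       # never overwrite an existing entry

    File.rename(path, target) unless dry_run
    renames << [path, target]
  end
  renames
end
```

Calling `normalize_names("some/dir")` first and reading the returned pairs gives a preview; passing `dry_run: false` performs the renames. Skipping targets that already exist is a deliberately cautious choice, so a name that collides after normalization is left for manual review.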

I’m still keeping an eye on things, but this seems to have resolved my copying problem. ✌️


¹ Conversion to NFC can be done thusly: convmv -r -f utf8 -t utf8 --nfc --notest ..



