Programmer Humor

blahaj (lemmy.zip)

submitted 3 months ago by [email protected] to c/[email protected]

[–] [email protected] 16 points 3 months ago (1 children)

Unicode in filenames can be a bad idea, since there is more than one way to encode what looks like the same character. So pattern matching can fail if you assume one representation but the name actually uses another.
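
To make that concrete, here is a minimal Python sketch. The two filenames and the helper `same_name` are made up for illustration; only `unicodedata.normalize` comes from the standard library. Both names render as "café.txt", but a plain comparison treats them as different:

```python
import unicodedata

# Hypothetical filenames that look identical on screen.
composed = "caf\u00e9.txt"     # "é" as a single code point, U+00E9
decomposed = "cafe\u0301.txt"  # "e" followed by combining acute accent, U+0301


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # False
print(same_name(composed, decomposed))  # True
```

Normalizing both sides before comparing is one way around the problem; whether NFC or NFD is the better choice depends on what the rest of the system uses.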

[–] [email protected] 5 points 3 months ago (1 children)

Good point. Do filesystems use a normal form to at least prevent having two files with effectively the same name?
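
Whatever a given filesystem does, it is possible to check a list of names for this kind of clash yourself. A rough Python sketch, assuming NFC is the form you care about (`find_lookalike_names` is a hypothetical helper, not part of any library):

```python
import unicodedata
from collections import defaultdict


def find_lookalike_names(names):
    """Group names that become identical once normalized to NFC."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only the groups where more than one name collapsed together.
    return [group for group in groups.values() if len(group) > 1]


clashes = find_lookalike_names(["caf\u00e9.txt", "cafe\u0301.txt", "readme.md"])
print(len(clashes))  # 1
```

In practice the list of names could come from something like `os.listdir`, though what you get back there depends on the platform and filesystem.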

I should point out the flip side though: there's no avoiding Unicode in filenames. Users of languages that don't use the Latin alphabet (such as Japanese, Chinese, Korean, Hebrew, Arabic, Greek and Russian, and the list could go on) can reasonably expect to be able to give a file a name they can read and understand with no extra effort. As for all the software woes that come with it: too bad, software needs to deal with it.

[–] [email protected] 2 points 3 months ago (1 children)

I'm not sure. I remember that a few years ago OpenBSD expected ASCII for filenames, but I think Linux expects UTF-8. I could be wrong though.

[–] [email protected] 3 points 3 months ago* (last edited 3 months ago)

I'm assuming Unicode anyway, and UTF-8 is by far the most natural encoding because most filenames will be plain ASCII. A "normal form" (see link above; you might think of it as a canonical form) gives you a way to check whether two strings are equivalent even if they encode the text differently. Like the example mentioned on Wikipedia:

For example, the distinct Unicode strings "U+212B" (the angstrom sign "Å") and "U+00C5" (the Swedish letter "Å") are both expanded by NFD (or NFKD) into the sequence "U+0041 U+030A" (Latin letter "A" and combining ring above "°") which is then reduced by NFC (or NFKC) to "U+00C5" (the Swedish letter "Å").
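
That example can be checked with Python's standard `unicodedata` module. This is just a quick sketch; `code_points` is a small helper I made up for printing:

```python
import unicodedata

angstrom_sign = "\u212b"  # U+212B ANGSTROM SIGN
a_with_ring = "\u00c5"    # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE


def code_points(s: str) -> list[str]:
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


for s in (angstrom_sign, a_with_ring):
    nfd = unicodedata.normalize("NFD", s)
    nfc = unicodedata.normalize("NFC", s)
    print(code_points(s), "NFD:", code_points(nfd), "NFC:", code_points(nfc))
```

Both inputs come out as `U+0041 U+030A` under NFD and as `U+00C5` under NFC, so two strings that start out different compare equal once they are both put into the same normal form.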