[PATCH 00/53] Get rid of UTF-8 chars that can be mapped as ASCII

David Woodhouse dwmw2 at infradead.org
Tue May 11 02:19:40 PDT 2021

On Tue, 2021-05-11 at 11:00 +0200, Mauro Carvalho Chehab wrote:
> Yet, this series has two positive side effects:
>  - it helps people needing to touch the documents using non-utf8 locales[1];
>  - it makes easier to grep for a text;
> [1] There are still some widely used distros nowadays (LTS ones?) that
>     don't set UTF-8 as default. Last time I installed a Debian machine
>     I had to explicitly set UTF-8 charset after install as the default
>     were using ASCII encoding (can't remember if it was Debian 10 or an
>     older version).

This whole line of thinking is fundamentally wrong.

A given set of characters in a "text file" are encoded with a specific
character set / encoding. To interpret that file and convert the bytes
back to characters, we need to use the *same* charset.

That charset is a property of the text file, and each text file or
piece of text in a system (like this email, which will contain a
Content-Type: header indicating the charset) might be encoded with a
*different* character set.

In the days before you could connect computers together — or before you
could exchange data between computers in different countries, at least
— perhaps it made sense to store 'text' files without explicitly noting
their encoding. And to interpret them using some kind of "default"
character set.

Those days are long gone. You're trying to work around an egregiously
stupid bug, if you're trying to pander to "default" encodings. There
*is* no default encoding that even makes sense, except perhaps UTF-8.
To *speak* of them as you did shows a misunderstanding of how broken
they are. It's *precisely* that kind of half-baked thinking which
always used to lead to stupid assumptions and double conversions and
Mojibake. Before we just standardised on UTF-8 everywhere and it
stopped mattering so much.

Just don't.

Now, you *can* make this work if you really insist on it, even for
systems with EBCDIC as their default encoding. Just make git do the
"convert to local charset" on checkout, precisely the same way as it
does CRLF for Windows systems. But it's stupid and anachronistic, so I
don't really see the point.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 5174 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-riscv/attachments/20210511/acf24092/attachment.bin>

More information about the linux-riscv mailing list