[physfs] Can PhysFS not read certain file names on Windows? [Solution ideas included]

Ellie etc0de at wobble.ninja
Sat Apr 18 16:25:39 EDT 2020


After stumbling across an interesting node.js issue and digging into the
PhysFS code I am wondering, is it the case that PhysFS can fundamentally
not access files with certain file names on Windows?

I believe the example used here to break node.js would likely also break
PhysFS: https://github.com/nodejs/node/issues/23735

(Summed up very briefly: the issue is that Windows file names are
arbitrary sequences of unsigned 16-bit values, NOT necessarily valid
UTF-16 or validly surrogate-encoded UTF-32 - surrogates may appear
unpaired.)
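To make that concrete, here is a minimal sketch of a UTF-16
well-formedness check (a helper of my own for illustration, not PhysFS
API) that file names NTFS happily stores can fail:

```c
#include <stddef.h>
#include <stdint.h>

/* Check whether a NUL-terminated sequence of 16-bit units is well-formed
   UTF-16: every high surrogate must be followed by a low surrogate, and
   no low surrogate may appear on its own. NTFS does not enforce this, so
   names failing this check can legally exist on disk. */
static int is_wellformed_utf16(const uint16_t *s)
{
    for (size_t i = 0; s[i]; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF) {        /* high surrogate */
            if (!(s[i+1] >= 0xDC00 && s[i+1] <= 0xDFFF))
                return 0;                              /* unpaired */
            i++;                                       /* skip the pair */
        } else if (s[i] >= 0xDC00 && s[i] <= 0xDFFF) {
            return 0;                                  /* stray low surrogate */
        }
    }
    return 1;
}
```

A name like { 'a', 0xD800, '.', 't', 'x', 't' } fails this check, yet is
a perfectly storable Windows file name.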


The potential issue on PhysFS's side is code like this:

https://github.com/criptych/physfs/blob/master/src/physfs_unicode.c#L207

If you forget the standard for a second and think about what sort of
mathematical transformation this is: it transforms arbitrary 16-bit wide
chars (which is essentially what a Windows file name can be) and
apparently maps multiple distinct input values to
UNICODE_BOGUS_CHAR_CODEPOINT. That makes the conversion lossy: once two
different 16-bit values have collapsed, there is no way to recover the
original wide-char string when converting back from UTF-8. However, a
lossless round trip would need to be guaranteed for all possible file
names to be addressable by the PhysFS-using application.
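The lossy step can be sketched like this (simplified to single 16-bit
units with surrogate-pair decoding omitted; the bogus code point value
is assumed here for illustration, PhysFS defines its own):

```c
#include <stdint.h>

#define UNICODE_BOGUS_CHAR_CODEPOINT 0xFFFD  /* value assumed for this sketch */

/* Decode one UTF-16 unit to a code point. Any lone surrogate collapses
   to the bogus code point, so distinct inputs become indistinguishable
   and the original file name cannot be reconstructed. */
static uint32_t decode_unit(uint16_t u)
{
    if (u >= 0xD800 && u <= 0xDFFF)   /* lone surrogate: invalid UTF-16 */
        return UNICODE_BOGUS_CHAR_CODEPOINT;
    return u;
}
```

Here decode_unit(0xD800) and decode_unit(0xDBFF) yield the same result,
which is exactly the collapse described above.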

I think one quite hacky(!) solution here would be:

HACKY: Use something like
https://en.wikipedia.org/wiki/Private_Use_Areas instead: treat all
invalid UTF-16 as raw 8-bit binary (two bytes per wide char, obviously),
with each byte encoded as a code point in a PUA range (e.g. as code
point 0xF0000 + <0-255 value of the byte>). On the way back in from
UTF-8, this range could be decoded to the raw byte sequence again (and
rejected if the byte code points don't come in pairs mapping back to a
16-bit value). This would preserve even originally invalid 16-bit
sequences on Windows. Caution: valid UTF-16 can also use characters in
this PUA range, even though that is discouraged - those would need to be
treated as invalid UTF-16 and "byte-mapped" the same way. (This is the
hacky part. In the end it shouldn't be the worst offense, since the PUA
is only for private use anyway, but strictly speaking it butchers valid
Unicode in ways many people would not expect. It would, however,
guarantee that all files are accessible through PhysFS on Windows, even
if through weird strings, which is not the case right now.)
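A sketch of that byte-mapping (helper names are mine; the 0xF0000 base
is the Plane-15 Supplementary Private Use Area-A suggested above):

```c
#include <stdint.h>

#define PUA_BYTE_BASE 0xF0000u  /* Plane-15 PUA-A start, per the idea above */

/* Escape one invalid 16-bit unit as two PUA code points, one per byte,
   so the original value survives the round trip through UTF-8. */
static void escape_unit(uint16_t u, uint32_t out[2])
{
    out[0] = PUA_BYTE_BASE + (u >> 8);    /* high byte */
    out[1] = PUA_BYTE_BASE + (u & 0xFF);  /* low byte */
}

/* Decode two PUA code points back to the original 16-bit unit.
   Returns 0 on success, -1 if the pair is not in the escape range
   (in which case the sequence would be dropped, as described above). */
static int unescape_unit(const uint32_t in[2], uint16_t *u)
{
    if (in[0] < PUA_BYTE_BASE || in[0] > PUA_BYTE_BASE + 0xFF) return -1;
    if (in[1] < PUA_BYTE_BASE || in[1] > PUA_BYTE_BASE + 0xFF) return -1;
    *u = (uint16_t)(((in[0] - PUA_BYTE_BASE) << 8) |
                     (in[1] - PUA_BYTE_BASE));
    return 0;
}
```

E.g. the unpaired surrogate 0xD800 becomes the code points 0xF00D8 and
0xF0000, both valid UTF-8-encodable scalars, and decodes back losslessly.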

There are probably nicer solutions than that to avoid collapsing values
in lossy ways to UNICODE_BOGUS_CHAR_CODEPOINT; this was just the first
idea that came to my mind.

(Another idea would be to use a code range much higher up, since Unicode
code points are only defined up to 0x10FFFF right now, while UTF-8 as
originally designed can encode values up to 0x7FFFFFFF via 5- and 6-byte
sequences - UTF-16 itself cannot go beyond 0x10FFFF, so this would only
work on the UTF-8 side. This would avoid remapping the part of valid
UTF-16 that falls into the PUA range in hackish ways, but these areas
above 0x10FFFF might be assigned in future standards, while the PUA is
at least guaranteed to never be used differently/for "proper characters"
in the future. So this might be nicer for now, but way uglier some years
ahead.)
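For reference, the original pre-RFC-3629 UTF-8 scheme (which allowed
code points above 0x10FFFF, unlike modern UTF-8) can be sketched as:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode a code point up to 0x7FFFFFFF using the original UTF-8 scheme,
   which permitted 5- and 6-byte sequences. Modern UTF-8 (RFC 3629) caps
   this at 4 bytes / 0x10FFFF, so output above that is non-standard.
   Returns the number of bytes written to out (needs room for 6). */
static size_t utf8_encode_ext(uint32_t cp, uint8_t *out)
{
    if (cp < 0x80) { out[0] = (uint8_t)cp; return 1; }
    size_t len = cp < 0x800 ? 2 : cp < 0x10000 ? 3 : cp < 0x200000 ? 4
               : cp < 0x4000000 ? 5 : 6;
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (uint8_t)(0x80 | (cp & 0x3F));  /* continuation byte */
        cp >>= 6;
    }
    out[0] = (uint8_t)(((0xFF00 >> len) & 0xFF) | cp);  /* leading byte */
    return len;
}
```

So e.g. 0x110000 (just past the Unicode ceiling) encodes as F4 90 80 80,
a sequence any strict modern UTF-8 decoder would reject - which is
exactly the "uglier some years ahead" problem.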

In any case, I think it might be worth fixing this limitation of PhysFS
somehow, if I am right and it's actually present.

The problem is that this issue limits the universal use of PhysFS: it
incentivizes not using PhysFS unless required, and instead writing
custom code that CAN open all files on disk on Windows. In my opinion,
that goes against PhysFS's core concept of uniting regular file access
and virtual mounted archive access into one API, saving everyone from
duplicate code paths.

Regards,

Ellie

