[physfs] Can PhysFS not read certain file names on Windows? [Solution ideas included]

Ellie etc0de at wobble.ninja
Sat Apr 18 16:44:50 EDT 2020


I just saw this linked in the Node.js bug report:

https://simonsapin.github.io/wtf-8/

From what I can tell I THINK it uses a similar concept of mapping each
invalid 16-bit value to one "special" code point, except in a different,
non-PUA range (not 0xF0000 and above, but 0x10) if I am reading it
correctly.

However, I am not sure:

1. whether the code point range WTF-8 uses is guaranteed to stay free of
"regular" character assignments in the future, and therefore does not
make future clashes more likely (at least the PUA range I suggested is
guaranteed to never have regularly recognized characters assigned to it
at any time in the future)

2. whether WTF-8 can map any arbitrary 16-bit unsigned int losslessly,
or "just" surrogates + regular characters (does that leave an uncovered
code point range? I'm not sure), although I'm not sure that matters if
PhysFS implemented it and passed any non-surrogate 16-bit values through
as code points as-is (whether or not they make sense) anyway

Especially point 2 might need some clearing up to ensure it actually
solves the problem and really makes all file names reachable. Other than
that it could be a better idea than my hack, because at least it is a
sort-of standard. (Although I don't know how widely used it is.)
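
For reference, here is roughly how I understand the WTF-8 encoding rule,
as a minimal sketch in C (this is just my reading of the spec, not the
reference implementation, and the function name is made up): it uses the
plain UTF-8 bit layout but simply does not reject lone surrogates, so
e.g. an unpaired 0xD800 comes out as the byte sequence ED A0 80.

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one code point with the generalized UTF-8 bit layout.
       Unlike strict UTF-8, lone surrogates (0xD800-0xDFFF) are NOT
       rejected; they go through the 3-byte branch. Returns the number
       of bytes written. */
    static int wtf8_encode(uint32_t cp, unsigned char *out)
    {
        if (cp < 0x80) {
            out[0] = (unsigned char) cp;
            return 1;
        } else if (cp < 0x800) {
            out[0] = (unsigned char) (0xC0 | (cp >> 6));
            out[1] = (unsigned char) (0x80 | (cp & 0x3F));
            return 2;
        } else if (cp < 0x10000) {  /* lone surrogates land here */
            out[0] = (unsigned char) (0xE0 | (cp >> 12));
            out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char) (0x80 | (cp & 0x3F));
            return 3;
        } else {
            out[0] = (unsigned char) (0xF0 | (cp >> 18));
            out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char) (0x80 | (cp & 0x3F));
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int n = wtf8_encode(0xD800, buf);  /* a lone high surrogate */
        for (int i = 0; i < n; i++)
            printf("%02X ", buf[i]);       /* prints: ED A0 80 */
        printf("\n");
        return 0;
    }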

On 4/18/20 10:25 PM, Ellie wrote:
> After stumbling across an interesting node.js issue and digging into the
> PhysFS code, I am wondering: is it the case that PhysFS fundamentally
> cannot access files with certain file names on Windows?
> 
> I believe the example used there to break node.js might well also break
> PhysFS: https://github.com/nodejs/node/issues/23735
> 
> (And to sum it up conceptually very briefly, the issue appears to be
> that Windows file names are arbitrary unsigned 16-bit values per
> character, NOT necessarily valid UTF-16 with properly paired surrogates.)
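> 
> (For illustration, something along these lines should produce such a
> file on Windows - I have not tried this exact snippet, and the name is
> just a hypothetical example containing an unpaired high surrogate:)
> 
>     #include <windows.h>
> 
>     int main(void)
>     {
>         /* An unpaired high surrogate (0xD800) is a legal file name
>            character as far as NTFS is concerned, but it is NOT valid
>            UTF-16, so it cannot be converted to well-formed UTF-8 and
>            back without some escaping scheme. */
>         wchar_t name[] = { 0xD800, L'.', L't', L'x', L't', 0 };
>         HANDLE h = CreateFileW(name, GENERIC_WRITE, 0, NULL,
>                                CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
>         if (h != INVALID_HANDLE_VALUE)
>             CloseHandle(h);
>         return 0;
>     }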
> 
> 
> The potential issue on PhysFS's side is code like this:
> 
> https://github.com/criptych/physfs/blob/master/src/physfs_unicode.c#L207
> 
> If you forget the standard for a second and think about what sort of
> mathematical transformation this is: it transforms arbitrary 16-bit wide
> chars (which is essentially what a Windows file name can be) in a way
> that apparently maps multiple different input values to
> UNICODE_BOGUS_CHAR_CODEPOINT. This makes the conversion lossy, since
> there is no way to get back the same 16-bit wide char string when
> converting back from UTF-8 once any two original 16-bit values have
> been collapsed. However, a non-lossy round trip would obviously need to
> be guaranteed for all possible file names to be addressable by the
> PhysFS-using application.
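> 
> (To make the lossiness concrete, here is a simplified sketch of that
> kind of mapping in C - this is NOT the actual PhysFS code, and the
> 0xFFFD value is only a placeholder, not necessarily what PhysFS really
> uses as its bogus code point:)
> 
>     #include <stdint.h>
>     #include <stdio.h>
> 
>     #define BOGUS_CODEPOINT 0xFFFD  /* placeholder replacement value */
> 
>     /* Simplified stand-in for the UTF-16 decoding step: every lone
>        surrogate collapses to the same "bogus" code point, so the
>        original 16-bit value is unrecoverable afterwards. */
>     static uint32_t decode_utf16_unit(uint16_t unit)
>     {
>         if (unit >= 0xD800 && unit <= 0xDFFF)  /* invalid on its own */
>             return BOGUS_CODEPOINT;
>         return unit;
>     }
> 
>     int main(void)
>     {
>         /* Two DIFFERENT file name units map to the same output: lossy. */
>         printf("%X\n", (unsigned) decode_utf16_unit(0xD800));  /* FFFD */
>         printf("%X\n", (unsigned) decode_utf16_unit(0xDABC));  /* FFFD */
>         return 0;
>     }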
> 
> I think one quite hacky(!) solution here would be:
> 
> HACKY: Use something like
> https://en.wikipedia.org/wiki/Private_Use_Areas instead: treat all
> invalid UTF-16 as raw 8-bit binary (two 8-bit bytes per wide char,
> obviously), with each byte encoded as a code point in a PUA range
> (e.g. code point 0xF0000 + <0-255 value of byte> per byte). That range
> could then be decoded back to the raw 8-bit sequences when coming in
> from UTF-8 (and dropped if it doesn't appear as two such bytes in a row
> mapping back to a 16-bit value). This would retain even originally
> invalid 16-bit values on Windows. Caution: valid UTF-16 can use
> characters in this PUA range too, even though that is discouraged - those
> would need to be treated as invalid UTF-16 and also "byte-mapped" like
> that. (This is the hacky part. In the end it shouldn't be the worst
> offense, since the PUA is only for private use anyway, but strictly
> speaking this is butchering valid Unicode in ways many people would not
> expect. It would, however, guarantee that all files are accessible
> through PhysFS on Windows, even if through weird strings, which is not
> the case right now.)
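> 
> (A very rough sketch of what I mean - the 0xF0000 base is the start of
> the Plane 15 private use area, and the helper names here are made up:)
> 
>     #include <stdint.h>
> 
>     #define PUA_BYTE_BASE 0xF0000u  /* Plane 15 private use area */
> 
>     /* Escape one invalid 16-bit file name unit into two private-use
>        code points, one per byte, so the exact value survives the round
>        trip through UTF-8. */
>     static void escape_unit(uint16_t unit, uint32_t out[2])
>     {
>         out[0] = PUA_BYTE_BASE + (unit >> 8);    /* high byte */
>         out[1] = PUA_BYTE_BASE + (unit & 0xFF);  /* low byte */
>     }
> 
>     /* Reverse the escaping on the way back in. Returns 0 if the two
>        code points are not a valid escape pair (the caller would then
>        drop them). */
>     static int unescape_unit(const uint32_t in[2], uint16_t *unit)
>     {
>         if (in[0] < PUA_BYTE_BASE || in[0] > PUA_BYTE_BASE + 0xFF ||
>             in[1] < PUA_BYTE_BASE || in[1] > PUA_BYTE_BASE + 0xFF)
>             return 0;
>         *unit = (uint16_t) (((in[0] - PUA_BYTE_BASE) << 8) |
>                             (in[1] - PUA_BYTE_BASE));
>         return 1;
>     }
> 
>     int main(void)
>     {
>         uint32_t cps[2];
>         uint16_t back = 0;
>         escape_unit(0xD800, cps);  /* -> U+F00D8 U+F0000 */
>         return (unescape_unit(cps, &back) && back == 0xD800) ? 0 : 1;
>     }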
> 
> There are probably nicer solutions than that to avoid collapsing values
> in lossy ways to UNICODE_BOGUS_CHAR_CODEPOINT; this was just the first
> idea that came to my mind.
> 
> (Another idea would be to use a code range way higher up, since Unicode
> code points are only defined up to 0x10FFFF right now, while the raw
> UTF-8 bit layout can in principle encode values well beyond that - the
> original pre-2003 UTF-8 scheme went up to 0x7FFFFFFF. UTF-16 itself
> cannot go higher, but these escape values would only ever live inside
> the UTF-8 strings anyway. This would avoid remapping the part of valid
> UTF-16 that falls into the PUA range in hackish ways, but these upper
> >0x10FFFF areas might be used in future standards, while the PUA is at
> least guaranteed to never be assigned "proper characters" in the
> future. So this might be nicer for now, but way uglier some years
> ahead.)
> 
> In any case, I think it might be worth fixing this limitation of PhysFS
> somehow, if I am right and it's actually present.
> 
> The problem is that this issue limits the universal use of PhysFS,
> since it incentivizes not using PhysFS unless required, and instead
> using custom code that CAN open all files on disk on Windows. And in my
> opinion, that goes against PhysFS's core concept of uniting regular
> file access and virtual mounted archive access into one API and saving
> everyone from duplicate code paths.
> 
> Regards,
> 
> Ellie
> 

