[physfs] Can PhysFS not read certain file names on Windows? [Solution ideas included]

Sat Apr 18 16:46:47 EDT 2020

Sorry, ignore the "0x10". I meant to write 0x10000, but it comes with
some sort of offset anyway that I can't fully decipher right now. I wish
the WTF-8 spec were written more mathematically, and less in a
here-is-dumped-C-code style... oh well
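
For reference, my current understanding is that the 0x10000 is simply
the offset from regular UTF-16 surrogate pair arithmetic, which WTF-8
keeps for properly paired surrogates. A minimal Python sketch of that
arithmetic (the function names are made up by me for illustration, this
is neither code from the spec nor from PhysFS):

```python
def combine_surrogates(high, low):
    """Turn a valid (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # 10 bits come from each half, then the 0x10000 offset is added.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp):
    """Inverse direction: a code point above 0xFFFF becomes a pair."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


# 0xD83D 0xDE00 is the UTF-16 form of U+1F600.
assert combine_surrogates(0xD83D, 0xDE00) == 0x1F600
assert split_code_point(0x1F600) == (0xD83D, 0xDE00)
```

So the smallest pair (0xD800, 0xDC00) lands exactly on 0x10000 and the
largest one (0xDBFF, 0xDFFF) on 0x10FFFF.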

On 4/18/20 10:44 PM, Ellie wrote:
> I just saw this in the Node.js bug report:
> 
> https://simonsapin.github.io/wtf-8/
> 
> From what I can tell I THINK it uses a similar concept of
> one-invalid-byte-to-one-"special"-code-point, except in a different
> non-PUA range (not 0xF0000 and above, but 0x10) if I am reading it correctly.
> 
> However, I am not sure:
> 
> 1. whether the code point range WTF-8 uses is guaranteed to stay
> free of "regular" character use in the future, so that future clashes
> do not become more likely (a PUA range as I suggested is at least
> guaranteed to have no regularly recognized characters at any time in the
> future)
> 
> 2. whether WTF-8 can map any arbitrary 16-bit unsigned int losslessly, or
> "just" surrogates + regular characters (does that leave an uncovered
> code point range? I'm not sure), although I'm not sure that is relevant
> if PhysFS implemented it and passed any non-surrogate 16-bit
> values through as-is as code points (no matter if they make sense or not) anyway
> 
> Especially point 2 might need some clearing up to ensure it actually
> solves the problem and really makes all file names reachable. Other than
> that it could be a better idea than my hack, because at least it's a
> sort-of standard. (Although I don't know how widely used it is.)
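
Regarding both points: as far as I can tell from the spec, WTF-8 does
not reserve any extra code point range at all. Properly paired
surrogates become the usual code point above 0xFFFF, and an unpaired
surrogate is simply encoded as its own value (0xD800 to 0xDFFF, which is
permanently reserved for surrogates) with the regular 3-byte UTF-8 bit
layout. Below is a minimal Python sketch of that idea. The function
names are mine, and the decoder assumes input that came from the encoder
and does no validation, so treat it as an illustration and not as a
conforming implementation:

```python
def utf16_units_to_wtf8(units):
    """Encode a list of 16-bit values as WTF-8-style bytes (sketch)."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        # A high surrogate directly followed by a low surrogate is a
        # valid pair and becomes one supplementary code point.
        if (0xD800 <= cp <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        # Everything else, including a lone surrogate, is encoded as-is
        # with the regular UTF-8 bit layout.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
    return bytes(out)


def wtf8_to_utf16_units(data):
    """Decode bytes produced by utf16_units_to_wtf8 (no validation)."""
    units = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra, cp = 0, lead
        elif lead < 0xE0:
            extra, cp = 1, lead & 0x1F
        elif lead < 0xF0:
            extra, cp = 2, lead & 0x0F
        else:
            extra, cp = 3, lead & 0x07
        for cont in data[i + 1:i + 1 + extra]:
            cp = (cp << 6) | (cont & 0x3F)
        i += 1 + extra
        if cp >= 0x10000:
            # Supplementary code points go back to a surrogate pair.
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
    return units


# "A", a lone high surrogate, "é", then a valid pair for U+1F600.
name = [0x0041, 0xD800, 0x00E9, 0xD83D, 0xDE00]
encoded = utf16_units_to_wtf8(name)
assert encoded == b"A\xed\xa0\x80\xc3\xa9\xf0\x9f\x98\x80"
assert wtf8_to_utf16_units(encoded) == name
```

In this sketch every 16-bit value survives the round trip, because a
pair always decodes back to the same two values and everything else is
passed through unchanged. That would suggest point 2 is not a problem,
at least as long as the bytes are only ever produced by such an encoder.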
> 
> On 4/18/20 10:25 PM, Ellie wrote:
>> After stumbling across an interesting node.js issue and digging into the
>> PhysFS code, I am wondering: is it the case that PhysFS fundamentally
>> cannot access files with certain file names on Windows?
>>
>> I believe the example used here to break node.js would likely also break
>> PhysFS: https://github.com/nodejs/node/issues/23735
>>
>> (Summed up very briefly, the issue appears to be that Windows file
>> names are sequences of arbitrary unsigned 16-bit ints, NOT
>> necessarily valid UTF-16 with properly paired surrogates.)
>>
>>
>> The potential issue on PhysFS's side is code like this:
>>
>> https://github.com/criptych/physfs/blob/master/src/physfs_unicode.c#L207
>>
>> If you forget the standard for a second and think about what sort of
>> mathematical transformation this is: it transforms arbitrary 16-bit wide
>> chars (which is essentially what a Windows file name can be) such that
>> it apparently maps multiple distinct values to UNICODE_BOGUS_CHAR_CODEPOINT.
>> This makes the conversion lossy, since there is no way to get
>> back the same 16-bit wide char string when converting back from UTF-8 once
>> any two original 16-bit values have been collapsed into one. However, a
>> lossless conversion path back would need to be guaranteed for all
>> possible file names to be addressable by the PhysFS-using application.
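
To make the collapse concrete, here is a tiny Python model of that kind
of conversion. It is NOT the PhysFS code, just my simplified reading of
it: every unpaired surrogate is replaced with one fixed placeholder. I
assume "?" as a stand-in for UNICODE_BOGUS_CHAR_CODEPOINT, and the
function name is made up as well:

```python
BOGUS = ord("?")  # assumed stand-in for UNICODE_BOGUS_CHAR_CODEPOINT


def lossy_units_to_code_points(units):
    """Model of a converter that collapses unpaired surrogates."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        i += 1
        if (0xD800 <= u <= 0xDBFF and i < len(units)
                and 0xDC00 <= units[i] <= 0xDFFF):
            # Valid pair: one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00))
            i += 1
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: the original value is thrown away.
            out.append(BOGUS)
        else:
            out.append(u)
    return out


# Two different 16-bit file names ...
name_a = [0x0061, 0xD800]
name_b = [0x0061, 0xDC01]
assert name_a != name_b
# ... end up as the same code point sequence, so neither can be
# recovered from the converted string.
assert lossy_units_to_code_points(name_a) == lossy_units_to_code_points(name_b)
```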
>>
>> I think one quite hacky(!) solution here would be:
>>
>> HACKY: Use something like
>> https://en.wikipedia.org/wiki/Private_Use_Areas instead: treat all
>> invalid UTF-16 as raw 8-bit binary (2x 8 bits per wide char,
>> obviously), with each byte encoded in a PUA code point range (e.g.
>> as code point 0xF0000 + <0-255 value of byte> per byte). This range
>> could then be decoded back to the raw 8-bit sequences when coming in from
>> UTF-8 (and dropped if it isn't always two such code points in a row,
>> since then they don't map back to a 16-bit int). This would preserve even
>> originally invalid 16-bit values on Windows. Caution: valid UTF-16 can use
>> characters in this PUA range too, even though that is discouraged. Those
>> would need to be treated as invalid UTF-16 and also "byte-mapped" like
>> that. (This is the hacky part. In the end that shouldn't be the worst
>> offense, since PUA is only for private use anyway, but strictly speaking
>> this is butchering valid Unicode in ways many people would not expect. It
>> would, however, guarantee that all files are accessible through PhysFS on
>> Windows, even if through weird strings, which is not the case right now.)
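
A rough Python sketch of how that byte-mapping could look, to check that
it really round-trips. The 0xF0000 base is the one from the text above.
Everything else is my own assumption for illustration: the function
names, putting the high byte first, and escaping exactly the 256 code
points 0xF0000 to 0xF00FF when they appear as valid characters:

```python
PUA_BASE = 0xF0000  # escape range is PUA_BASE .. PUA_BASE + 0xFF


def is_escape(cp):
    return PUA_BASE <= cp <= PUA_BASE + 0xFF


def escape_unit(u):
    # One 16-bit value becomes two escape code points, high byte first.
    return [PUA_BASE + (u >> 8), PUA_BASE + (u & 0xFF)]


def units_to_code_points(units):
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            if is_escape(cp):
                # Valid character inside the escape range: byte-map it
                # too, otherwise decoding would be ambiguous.
                out += escape_unit(u) + escape_unit(nxt)
            else:
                out.append(cp)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out += escape_unit(u)  # unpaired surrogate
            i += 1
        else:
            out.append(u)
            i += 1
    return out


def code_points_to_units(cps):
    units = []
    i = 0
    while i < len(cps):
        cp = cps[i]
        if is_escape(cp):
            if i + 1 < len(cps) and is_escape(cps[i + 1]):
                units.append(((cp - PUA_BASE) << 8) | (cps[i + 1] - PUA_BASE))
                i += 2
            else:
                i += 1  # stray single escape code point: dropped
        elif cp >= 0x10000:
            v = cp - 0x10000
            units += [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
            i += 1
        else:
            units.append(cp)
            i += 1
    return units


name = [0x0041, 0xD800, 0x0042, 0xDC00]
assert units_to_code_points(name) == [0x41, 0xF00D8, 0xF0000,
                                      0x42, 0xF00DC, 0xF0000]
assert code_points_to_units(units_to_code_points(name)) == name
```

The price is visible in the decoder: a string handed in by the
application that contains an odd run of these escape code points has no
16-bit counterpart, so something has to be dropped.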
>>
>> There are probably nicer solutions than that to avoid collapsing values
>> in lossy ways to UNICODE_BOGUS_CHAR_CODEPOINT; this was just the first
>> idea that came to my mind.
>>
>> (Another idea would be to use a code range way higher up, since Unicode
>> code points are only defined up to 0x10FFFF right now, while the original
>> UTF-8 scheme can encode up to 0x7FFFFFFF I think (UTF-16 itself cannot go
>> beyond 0x10FFFF). This would avoid remapping the
>> part of valid UTF-16 that falls into the PUA range in hackish ways, but
>> these upper >0x10FFFF areas might be used in future standards, while PUA
>> is at least guaranteed to never be used differently/for "proper
>> characters" in the future. So this might be nicer for now, but way
>> uglier some years ahead.)
>>
>> In any case, I think it might be worth fixing this limitation of PhysFS
>> somehow, if I am right and it's actually present.
>>
>> The problem is that this issue limits the universal use of PhysFS, since it
>> incentivizes not using PhysFS unless required, and instead using custom
>> code that CAN open all files on disk on Windows. In my opinion, this
>> goes against PhysFS's core concept of uniting regular file access and
>> virtual mounted archive file access in one API and saving everyone
>> from duplicate code paths.
>>
>> Regards,
>>
>> Ellie
>>