[physfs] Can PhysFS not read certain file names on Windows? [Solution ideas included]

Sat Apr 18 17:50:05 EDT 2020

For what it's worth, WTF-8 is basically "UTF-8 but invalid codepoints
are allowed" (i.e. any encoded UTF-8 value is accepted, even if it's not
a codepoint allowed by Unicode). Whether it works or not depends on
how strict string validation is: if functions just parse UTF-8 blindly,
it may already Just Work™.

Probably the most common form of WTF-8 is precisely the one brought up
here, where surrogate codepoints are considered valid, and yes, this
is to ensure lossless conversion between UTF-8 and UCS-2/UTF-16
(WTF-16?). In particular, to allow lone surrogates to be encoded
properly. It also usually allows U+FFFE and U+FFFF (noncharacters that
some strict validators reject; U+FFFE is what a byte-swapped UTF-16 BOM
looks like).
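
To make the above concrete, here is a minimal Python sketch of the
encoding direction only: 16-bit units in, WTF-8-style bytes out. The
function names are made up for illustration and this is not PhysFS
code. Well-formed surrogate pairs are combined as usual, and a lone
surrogate is simply encoded as a 3-byte sequence instead of being
rejected.

```python
def units_to_code_points(units):
    """Combine well-formed surrogate pairs; pass lone surrogates through."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP value or lone surrogate, kept as-is
            i += 1
    return out


def encode_code_point(cp):
    """Plain UTF-8 bit layout, without the usual 'no surrogates' check."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def wtf8_encode(units):
    """Hypothetical helper: 16-bit units -> WTF-8-style bytes."""
    return b"".join(encode_code_point(cp) for cp in units_to_code_points(units))
```

For example, wtf8_encode([0xD800]) gives the three bytes ED A0 80,
which a strict UTF-8 decoder would refuse.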

2020-04-18 17:46 GMT-03:00, Ellie <etc0de at wobble.ninja>:
> Sorry, ignore the "0x10". I meant to write 0x10000, but it's with
> some sort of offset anyway that I can't fully decipher right now. I wish
> the WTF-8 specs were written more mathematically, and less
> here-is-dumped-C-code style... oh well
>
> On 4/18/20 10:44 PM, Ellie wrote:
>> I just saw this in the Node.js bug report:
>>
>> https://simonsapin.github.io/wtf-8/
>>
>> From what I can tell I THINK it uses a similar concept of
>> one-invalid-byte-to-one-"special"-code-point, except in a different
>> not-PUA range (not 0xF0000 + above, but 0x10) if I am reading it
>> correctly.
>>
>> However, I am not sure:
>>
>> 1. whether the code point range WTF-8 uses is guaranteed to be free of
>> "regular" character use in the future, so that future clashes don't
>> become more likely (at least a PUA range as I suggested is guaranteed
>> to have no regularly recognized characters at any time in the future)
>>
>> 2. whether WTF-8 can map any arbitrary 16-bit unsigned int losslessly, or
>> "just" surrogates + regular characters (does that leave an uncovered
>> code point range? I'm not sure), although I'm not sure that is relevant
>> if PhysFS implemented it and passed any non-surrogate 16-bit values
>> through as-is as code points (whether they make sense or not) anyway
>>
>> Especially point 2 might need some clearing up to ensure it actually
>> solves the problem and really makes all file names reachable. Other than
>> that, it could be a better idea than my hack, because at least it's a
>> sort-of standard. (Although I don't know how widely used it is.)
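
Point 2 can be checked by brute force. The sketch below is only an
approximation under a stated assumption: it uses Python's built-in
"surrogatepass" error handler as a stand-in for a WTF-8 codec (it lets
surrogates through the UTF-8 and UTF-16 codecs), not an implementation
of the WTF-8 spec itself. The helper names are hypothetical.

```python
import struct


def units_to_utf8ish(units):
    """16-bit units -> UTF-8-like bytes, letting lone surrogates through."""
    raw = struct.pack("<%dH" % len(units), *units)
    text = raw.decode("utf-16-le", "surrogatepass")
    return text.encode("utf-8", "surrogatepass")


def utf8ish_to_units(data):
    """Inverse direction: UTF-8-like bytes -> 16-bit units."""
    text = data.decode("utf-8", "surrogatepass")
    raw = text.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


# Every possible single 16-bit value survives the round trip.
all_single_units_ok = all(
    utf8ish_to_units(units_to_utf8ish([u])) == [u] for u in range(0x10000)
)

# A mixed name: reversed (lone) surrogates, ASCII, and a proper pair.
sample = [0xDC00, 0xD800, 0x41, 0xD83D, 0xDE00]
sample_ok = utf8ish_to_units(units_to_utf8ish(sample)) == sample
```

Under that assumption no 16-bit value is left uncovered: non-surrogate
values are ordinary BMP code points, and surrogates are passed through
whether or not they are paired.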
>>
>> On 4/18/20 10:25 PM, Ellie wrote:
>>> After stumbling across an interesting node.js issue and digging into the
>>> PhysFS code I am wondering, is it the case that PhysFS can fundamentally
>>> not access files with certain file names on Windows?
>>>
>>> I believe the example used here to break node.js would likely also
>>> break PhysFS: https://github.com/nodejs/node/issues/23735
>>>
>>> (And summed up conceptually very briefly, the issue appears to be that
>>> Windows filenames are arbitrary unsigned 16-bit ints per character, NOT
>>> necessarily valid UTF-16 or valid surrogated UTF-32.)
>>>
>>>
>>> The potential issue on PhysFS's side is code like this:
>>>
>>> https://github.com/criptych/physfs/blob/master/src/physfs_unicode.c#L207
>>>
>>> If you forget the standard for a second and think about what sort of
>>> mathematical transformation this is: it transforms arbitrary 16-bit wide
>>> chars (which is essentially what a Windows file name can be) such that
>>> it apparently maps multiple characters to UNICODE_BOGUS_CHAR_CODEPOINT.
>>> This would make the conversion lossy, since there is no way you can get
>>> back the same 16-bit wide char string when converting back from UTF-8
>>> once any two original 16-bit values are collapsed. However, obviously, a
>>> lossless conversion path back would need to be guaranteed for all
>>> possible file names to be addressable by the PhysFS-using application.
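
The collapse described above can be modelled in a few lines. This is a
simplified Python model, not the PhysFS source: the function name is
made up, and using "?" for UNICODE_BOGUS_CHAR_CODEPOINT is an
assumption made for the demonstration.

```python
BOGUS = ord("?")  # assumed stand-in for UNICODE_BOGUS_CHAR_CODEPOINT


def lossy_units_to_text(units):
    """Decode 16-bit units, replacing every lone surrogate with BOGUS."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed pair: decodes to one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone surrogate: the original value is thrown away here.
            out.append(BOGUS)
            i += 1
        else:
            out.append(u)
            i += 1
    return "".join(map(chr, out))


# Two different on-disk names, each containing one lone surrogate.
name_a = [ord("a"), 0xD800, ord("b")]
name_b = [ord("a"), 0xDFFF, ord("b")]

text_a = lossy_units_to_text(name_a)
text_b = lossy_units_to_text(name_b)
```

Both names come out as the same string, so nothing on the UTF-8 side
can tell which of the two files was meant.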
>>>
>>> I think one quite hacky(!) solution here would be:
>>>
>>> HACKY: Use something like
>>> https://en.wikipedia.org/wiki/Private_Use_Areas instead: treat all
>>> invalid UTF-16 as raw 8-bit binary (as 2x 8-bit per wide char,
>>> obviously), with each byte encoded in a PUA code point range (e.g. as
>>> code point 0xF0000 + <0-255 value of byte> per byte). This range
>>> could then be decoded back to the raw 8-bit sequences when coming in
>>> from UTF-8 (and dropped if it's not always two 8-bit chars in a row,
>>> i.e. not mapping back to a 16-bit int). This would retain even any
>>> originally invalid 16-bit value on Windows. Caution: valid UTF-16 can
>>> use characters in this PUA range too, even though that is discouraged;
>>> that would need to be treated as invalid UTF-16 and also "byte-mapped"
>>> like that. (This is the hacky part. It shouldn't be the worst offense,
>>> since the PUA is only for private use anyway, but strictly speaking
>>> this is butchering valid Unicode in ways many people would not expect.
>>> It would, however, guarantee that all files are accessible through
>>> PhysFS on Windows, even if through weird strings, which is not the
>>> case right now.)
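
A rough Python sketch of the byte-mapping idea above, working on code
point lists rather than real UTF-8 to keep it short. All names are
hypothetical, and storing the high byte first is an arbitrary choice
the proposal does not specify.

```python
ESC_BASE = 0xF0000  # escape range 0xF0000..0xF00FF, as in the proposal


def is_escape(cp):
    return ESC_BASE <= cp <= ESC_BASE + 0xFF


def escape_unit(u):
    """One 16-bit unit -> two escape code points (high byte, low byte)."""
    return [ESC_BASE + (u >> 8), ESC_BASE + (u & 0xFF)]


def units_to_code_points(units):
    out, i = [], 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            if is_escape(cp):
                # Valid character inside the escape range: byte-map it too.
                out += escape_unit(u) + escape_unit(nxt)
            else:
                out.append(cp)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out += escape_unit(u)  # lone surrogate, i.e. invalid UTF-16
            i += 1
        else:
            out.append(u)
            i += 1
    return out


def code_points_to_units(cps):
    out, i = [], 0
    while i < len(cps):
        cp = cps[i]
        if is_escape(cp):
            if i + 1 < len(cps) and is_escape(cps[i + 1]):
                out.append(((cp - ESC_BASE) << 8) | (cps[i + 1] - ESC_BASE))
                i += 2
            else:
                i += 1  # unpaired escape byte: dropped, as proposed
        elif cp >= 0x10000:
            v = cp - 0x10000
            out += [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
            i += 1
        else:
            out.append(cp)
            i += 1
    return out
```

With this, [0x61, 0xD800, 0x62] becomes the code points
[0x61, 0xF00D8, 0xF0000, 0x62] and decodes back to the same units. The
cost is that a legitimate U+F0041 in a file name shows up as four
escape code points instead of one character.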
>>>
>>> There are probably nicer solutions than that to avoid collapsing values
>>> in lossy ways to UNICODE_BOGUS_CHAR_CODEPOINT, this was just the first
>>> idea that came to my mind.
>>>
>>> (Another idea would be to use a code range way higher up since Unicode
>>> code points are only defined up to 0x10FFFF right now, while UTF-8 and
>>> UTF-16 can encode up to 0xFFFFFFFF I think. This would avoid remapping a
>>> part of valid UTF-16 that falls into the PUA range in hackish ways, but
>>> these upper >0x10FFFF areas might be used in future standards while the
>>> PUA is at least guaranteed to never be used differently/for "proper
>>> characters" in the future. So this might be nicer for now, but way
>>> uglier some years ahead.)
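
A quick arithmetic check on how much room there actually is above
0x10FFFF, as a hedged side note: a surrogate pair tops out exactly at
0x10FFFF, so the "higher range" idea would only have space on the
UTF-8 side, and there only in the original 6-byte scheme (31 payload
bits), which current UTF-8 no longer permits. The constant names are
made up for this sketch.

```python
# Largest value a UTF-16 surrogate pair can express:
# highest high surrogate (0xDBFF) combined with highest low surrogate (0xDFFF).
MAX_UTF16 = 0x10000 + ((0xDBFF - 0xD800) << 10) + (0xDFFF - 0xDC00)

# Original UTF-8 design: the 6-byte form carries 1 bit in the lead byte
# plus 6 bits in each of 5 continuation bytes, i.e. 31 bits in total.
UTF8_ORIGINAL_BITS = 1 + 5 * 6
MAX_UTF8_ORIGINAL = (1 << UTF8_ORIGINAL_BITS) - 1

# Python itself refuses code points above the Unicode limit.
try:
    chr(MAX_UTF16 + 1)
    above_limit_accepted = True
except ValueError:
    above_limit_accepted = False
```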
>>>
>>> In any case, I think it might be worth fixing this limitation of PhysFS
>>> somehow, if I am right and it's actually present.
>>>
>>> The problem is this issue limits the universal use of PhysFS, since it
>>> incentivizes not using PhysFS unless required, and instead using custom
>>> code that CAN open all files on disk on Windows. And in my opinion, this
>>> goes against PhysFS's core concept to unite regular file access and
>>> virtual mounted archive file access into one API and saving everyone
>>> from the duplicate code paths.
>>>
>>> Regards,
>>>
>>> Ellie
>>>
> _______________________________________________
> physfs mailing list
> physfs at icculus.org
> http://icculus.org/mailman/listinfo/physfs
>