[physfs] Can PhysFS not read certain file names on Windows? [Solution ideas included]

Sat Apr 18 23:52:30 EDT 2020

(Resent because I used the wrong sender address; I don't think the first copy reached the list.)

Ah, nice. Sounds like a useful solution then. From reading the WTF-8
spec I thought lone surrogates were remapped to some other range, but I
guess they are simply encoded as-is.

I think PhysFS should adopt WTF-8 then!
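
A minimal Python sketch (not PhysFS code) of what adopting WTF-8 would
mean for 16-bit input: well-formed surrogate pairs are combined as in
UTF-16, and a lone surrogate is written out as the three-byte form of its
own code point. The function names are made up for illustration, and the
sketch relies on Python's `surrogatepass` error handler to emit and accept
the surrogate bytes.

```python
def units_to_wtf8(units):
    """Encode a list of 16-bit code units as WTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Well-formed pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            i += 2
        else:
            # BMP value or lone surrogate: keep the value as-is.
            cp = u
            i += 1
        out += chr(cp).encode("utf-8", "surrogatepass")
    return bytes(out)


def wtf8_to_units(data):
    """Decode WTF-8 bytes back into the original 16-bit code units."""
    units = []
    for ch in data.decode("utf-8", "surrogatepass"):
        cp = ord(ch)
        if cp >= 0x10000:
            # Split a supplementary code point back into a surrogate pair.
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
    return units
```

With this, `units_to_wtf8([0xD800])` yields the bytes ED A0 80, and a name
that mixes regular characters, a proper pair, and lone surrogates comes
back unchanged from `wtf8_to_units`.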

On 4/18/20 11:50 PM, Sik wrote:
> For what it's worth, WTF-8 is basically "UTF-8 but invalid codepoints
> are allowed" (i.e. any encoded UTF-8 value is valid, even if it's not
> a codepoint allowed by Unicode). Whether it works or not depends on
> how strict string validation is, if functions just parse UTF-8 blindly
> it may already Just Work™.
> 
> Probably the most common form of WTF-8 is precisely the one brought up
> here, where surrogate codepoints are considered valid, and yes, this
> is to ensure lossless conversion between UTF-8 and UCS-2/UTF-16
> (WTF-16?). In particular, to allow lone surrogates to be encoded
> properly. It also usually allows U+FFFE and U+FFFF (noncharacters;
> U+FFFE is the byte-swapped form of the UTF-16 BOM, U+FEFF).
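
A small Python illustration of the "depends on how strict validation is"
point, offered as a sketch rather than a statement about any PhysFS code
path: a strict UTF-8 decoder rejects the three bytes that encode a lone
surrogate, while a lenient one (here Python's `surrogatepass` handler)
accepts them. The helper names are hypothetical.

```python
LONE_SURROGATE = b"\xed\xa0\x80"  # generalized UTF-8 bytes for U+D800


def decodes_strictly(data):
    """True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def decodes_leniently(data):
    """True if a decoder that lets surrogates through accepts the bytes."""
    try:
        data.decode("utf-8", "surrogatepass")
        return True
    except UnicodeDecodeError:
        return False
```

Code that only parses the bit patterns behaves like the lenient variant, so
whether WTF-8 "just works" comes down to which kind of decoder sits on each
side of the conversion.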
> 
> 2020-04-18 17:46 GMT-03:00, Ellie <etc0de at wobble.ninja>:
>> Sorry, ignore the "0x10". I meant to write 0x10000, but it's with
>> some sort of offset anyway that I can't fully decipher right now. I wish
>> the WTF-8 specs were written more mathematically, and less
>> here-is-dumped-C-code style... oh well
>>
>> On 4/18/20 10:44 PM, Ellie wrote:
>>> I just saw this in the Node.js bug report:
>>>
>>> https://simonsapin.github.io/wtf-8/
>>>
>>> From what I can tell, I THINK it uses a similar concept of
>>> one-invalid-byte-to-one-"special"-code-point, except in a different,
>>> non-PUA range (not 0xF0000 and above, but 0x10), if I am reading it
>>> correctly.
>>>
>>> However, I am not sure:
>>>
>>> 1. whether the code point range WTF-8 uses is guaranteed to stay free
>>> of "regular" character use in the future, so that it does not make
>>> future clashes more likely (at least a PUA range as I suggested is
>>> guaranteed to have no regularly recognized characters at any time in the
>>> future)
>>>
>>> 2. whether WTF-8 can map any arbitrary 16-bit unsigned int losslessly, or
>>> "just" surrogates + regular characters (does that leave an uncovered
>>> code point range? I'm not sure), although I'm not sure that is relevant
>>> if PhysFS implemented it and passed any non-surrogate 16-bit values
>>> through as-is as code points (whether they make sense or not) anyway
>>>
>>> Point 2 especially might need some clearing up to ensure it actually
>>> solves the problem and really makes all file names reachable. Other than
>>> that it could be a better idea than my hack, because at least it's a
>>> sort-of standard. (Although I don't know how widely used it is.)
>>>
>>> On 4/18/20 10:25 PM, Ellie wrote:
>>>> After stumbling across an interesting node.js issue and digging into the
>>>> PhysFS code I am wondering, is it the case that PhysFS can fundamentally
>>>> not access files with certain file names on Windows?
>>>>
>>>> I believe the example used here to break node.js might likely also break
>>>> PhysFS: https://github.com/nodejs/node/issues/23735
>>>>
>>>> (Summed up conceptually and very briefly, the issue appears to be that
>>>> Windows file names are arbitrary unsigned 16-bit ints per character, NOT
>>>> necessarily valid UTF-16 with properly paired surrogates.)
>>>>
>>>>
>>>> The potential issue on PhysFS's side is code like this:
>>>>
>>>> https://github.com/criptych/physfs/blob/master/src/physfs_unicode.c#L207
>>>>
>>>> If you forget the standard for a second and think about what sort of
>>>> mathematical transformation this is: it transforms arbitrary 16-bit wide
>>>> chars (which is essentially what a Windows file name can be) such that
>>>> it apparently maps multiple distinct values to
>>>> UNICODE_BOGUS_CHAR_CODEPOINT. This makes the conversion lossy, since
>>>> there is no way to get back the same 16-bit wide char string when
>>>> converting back from UTF-8 once any two original 16-bit values have been
>>>> collapsed into one. However, a lossless conversion path back would need
>>>> to be guaranteed for all possible file names to be addressable by the
>>>> PhysFS-using application.
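
A simplified Python model of the collapse described above. This is not the
PhysFS source: `BOGUS` is a stand-in for UNICODE_BOGUS_CHAR_CODEPOINT (its
actual value is not quoted in this thread), and `lossy_code_points` is a
hypothetical name. It shows two different 16-bit names ending up as the
same code point sequence, after which no decoder can tell them apart.

```python
BOGUS = ord("?")  # assumption: stand-in for UNICODE_BOGUS_CHAR_CODEPOINT


def lossy_code_points(units):
    """Convert 16-bit units to code points, replacing lone surrogates."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            # Well-formed surrogate pair: decode normally.
            out.append(0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00))
            i += 2
            continue
        # Every lone surrogate becomes the same replacement value.
        out.append(BOGUS if 0xD800 <= u <= 0xDFFF else u)
        i += 1
    return out


name_a = [0xD800, 0x0061]  # lone high surrogate + "a"
name_b = [0xDC37, 0x0061]  # lone low surrogate + "a"
collapsed = lossy_code_points(name_a) == lossy_code_points(name_b)
```

Here `collapsed` is true, so the two files would be indistinguishable to an
application that only ever sees the converted string.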
>>>>
>>>> I think one quite hacky(!) solution here would be:
>>>>
>>>> HACKY: Use something like
>>>> https://en.wikipedia.org/wiki/Private_Use_Areas instead: treat all
>>>> invalid UTF-16 as raw 8-bit binary (two bytes per wide char,
>>>> obviously), with each byte encoded as a code point in a PUA range
>>>> (e.g. as code point 0xF0000 + <0-255 value of byte> per byte). This range
>>>> could then be decoded back to the raw bytes when coming in from
>>>> UTF-8 (and dropped if the escaped bytes do not come in pairs, since they
>>>> would then not map back to a 16-bit int). This would retain even
>>>> originally invalid 16-bit values on Windows. Caution: valid UTF-16 can
>>>> use characters in this PUA range too, even though that is discouraged;
>>>> those would need to be treated as invalid UTF-16 and also "byte-mapped"
>>>> like that. (This is the hacky part. In the end that shouldn't be the
>>>> worst offense since the PUA is only for private use anyway, but strictly
>>>> speaking this is butchering valid Unicode in ways many people would not
>>>> expect. It would, however, guarantee that all files are accessible
>>>> through PhysFS on Windows, even if through weird strings, which is not
>>>> the case right now.)
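
The byte-mapping hack can be sketched in Python as below. This is one
possible reading of the proposal, not an existing PhysFS feature: `ESC`,
`units_to_code_points` and `code_points_to_units` are hypothetical names,
the high byte is written before the low byte by assumption, and a valid
pair that lands in the escape range is escaped unit by unit.

```python
ESC = 0xF0000  # base of the escape range suggested in this thread


def _escape(unit):
    # One 16-bit unit becomes two escape code points (high byte, low byte).
    return [ESC + (unit >> 8), ESC + (unit & 0xFF)]


def units_to_code_points(units):
    out, i = [], 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if 0xD800 <= u <= 0xDBFF and nxt is not None and 0xDC00 <= nxt <= 0xDFFF:
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            if ESC <= cp <= ESC + 0xFF:
                out += _escape(u) + _escape(nxt)  # valid, but reserved here
            else:
                out.append(cp)
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            out += _escape(u)  # lone surrogate: keep its raw bytes
            i += 1
        else:
            out.append(u)
            i += 1
    return out


def code_points_to_units(cps):
    units, i = [], 0
    while i < len(cps):
        cp = cps[i]
        if ESC <= cp <= ESC + 0xFF:
            if i + 1 < len(cps) and ESC <= cps[i + 1] <= ESC + 0xFF:
                units.append(((cp - ESC) << 8) | (cps[i + 1] - ESC))
                i += 2
            else:
                i += 1  # unpaired escape byte: dropped, as proposed
        elif cp >= 0x10000:
            v = cp - 0x10000
            units += [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
            i += 1
        else:
            units.append(cp)
            i += 1
    return units
```

Every code point produced this way is a regular Unicode scalar value, so
the result can be stored as strict UTF-8, at the cost of U+F0000..U+F00FF
no longer meaning what they normally would.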
>>>>
>>>> There are probably nicer solutions than that to avoid collapsing values
>>>> in lossy ways to UNICODE_BOGUS_CHAR_CODEPOINT, this was just the first
>>>> idea that came to my mind.
>>>>
>>>> (Another idea would be to use a code range way higher up, since Unicode
>>>> code points are only defined up to 0x10FFFF, while the original UTF-8
>>>> design can, I think, encode up to 0x7FFFFFFF (UTF-16 itself cannot go
>>>> beyond 0x10FFFF). This would avoid remapping the part of valid UTF-16
>>>> that falls into the PUA range in hackish ways, but these upper >0x10FFFF
>>>> areas might be used in future standards, while the PUA is at least
>>>> guaranteed to never be used differently/for "proper characters" in the
>>>> future. So this might be nicer for now, but way uglier some years
>>>> ahead.)
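
For reference, the surrogate pair arithmetic in Python, as a quick check on
how far a pair of 16-bit units can reach. The function name is made up for
illustration; the formula is the standard UTF-16 one, and it is also where
the 0x10000 offset comes from.

```python
def pair_to_code_point(high, low):
    """Combine a high and a low surrogate into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


LOWEST = pair_to_code_point(0xD800, 0xDC00)   # smallest value a pair can express
HIGHEST = pair_to_code_point(0xDBFF, 0xDFFF)  # largest value a pair can express
```

`HIGHEST` works out to 0x10FFFF, so any escape range above that could only
exist on the UTF-8 side of the conversion.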
>>>>
>>>> In any case, I think it might be worth fixing this limitation of PhysFS
>>>> somehow, if I am right and it's actually present.
>>>>
>>>> The problem is this issue limits the universal use of PhysFS, since it
>>>> incentivizes not using PhysFS unless required, and instead using custom
>>>> code that CAN open all files on disk on Windows. And in my opinion, this
>>>> goes against PhysFS's core concept of uniting regular file access and
>>>> virtual mounted archive file access in one API and saving everyone
>>>> from the duplicate code paths.
>>>>
>>>> Regards,
>>>>
>>>> Ellie
>>>>
>> _______________________________________________
>> physfs mailing list
>> physfs at icculus.org
>> http://icculus.org/mailman/listinfo/physfs
>>