[physfs] Unicode conversions fail outside BMP

Sun Nov 4 01:10:44 EDT 2012

On 11/3/2012 8:56 AM, Jookia wrote:
> Hello!
>
> Recently I've been trying to get my application to run on Windows. At 
> first I found that it didn't load files properly when they had a 
> certain phrase in them that's meant to break things that don't support 
> Unicode properly.
>
> Long story short, I wrote this:
>
> #include <physfs.h>
> #include <stdio.h>
>
> int main(int argc, char** argv)
> {
>   PHYSFS_init(argv[0]);
>   PHYSFS_mount(PHYSFS_getBaseDir(), "", 0);
>   PHYSFS_File* file = PHYSFS_openRead("𝓲");
>
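>   /* PHYSFS_getLastError() returns NULL if no error is pending. */
>   const char* err = PHYSFS_getLastError();
>   printf("%s\n", err ? err : "(no error)");
>
>   if (file) PHYSFS_close(file);
>   PHYSFS_deinit();
>   return 0;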
> }
>
> (If you see a box please try and copy this in your IDE. MSVC displays 
> it fine.)
>
> On Linux it works. It writes 'No such file or directory' to my 
> console. Fantastic!
>
> On Wine + MinGW (I know, I know, I'll explain in a tick why I don't 
> think they're at fault. I haven't got PhysFS to work in Windows yet.) 
> it returns 'Invalid name.'. What?
>
> Digging deeper into the code, I found that windows.c's 
> doPlatformExists fails. After manually printing out the UTF-16 string 
> to a file and then using iconv to read it, it turns out my lovely 
> character has turned into a question mark, which happens to be an 
> invalid name.
>
> Changing the code up so it bypasses the Unicode conversions:
>
> static int doPlatformExists(LPWSTR wpath)
> {
>     LPWSTR newpath = L"Z:\\home\\jookia\\Staging\\test-𝓲";
>
>     if(pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES)
>     {
>         wpath = newpath;
>     }
>
>     BAIL_IF_MACRO
>     (
>         pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES,
>         winApiStrError(), 0
>     );
>     return(1);
> } /* doPlatformExists */
>
> With that change, the program writes 'File not found.' to my console.
>
> It seems the error is in utf8ToUcs2 not accounting for surrogates:
>
>
>         /* !!! BLUESKY: UTF-16 surrogates? */
>         if (cp > 0xFFFF)
>             cp = UNICODE_BOGUS_CHAR_CODEPOINT;
>
> I know UCS-2 technically doesn't account for surrogates, but we're 
> using UTF-16 on Windows. Commenting out that check produces a weird 
> path, since the character still isn't converted properly, but the 
> call then fails with 'File not found.' instead.
>
> So... Are there any plans to fix this? Are patches welcome?

UCS-2 doesn't "technically" do anything. It's two bytes per codepoint 
and stops at codepoint 0xFFFF. Changing utf8ToUcs2 to use surrogate 
pairs would be terrible, since that's not what UCS-2 is.

Instead, you want a utf8ToUtf16 function.
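
As a rough illustration of what such a function might look like, here
is a minimal sketch. The names sketch_utf8ToUtf16, decode_utf8 and
BOGUS_CODEPOINT are hypothetical and are not PhysFS API; uint16_t stands
in for the 16-bit wide-character type used on Windows, and the decoder
is deliberately simple (it does not reject overlong encodings).

```c
#include <stddef.h>
#include <stdint.h>

#define BOGUS_CODEPOINT '?'   /* hypothetical stand-in for malformed input */

/* Decode one UTF-8 sequence at *s and advance *s past it.
   Returns 0 at the terminating NUL. Minimal on purpose: it rejects bad
   lead/continuation bytes and out-of-range values, but does not check
   for overlong encodings. */
static uint32_t decode_utf8(const unsigned char **s)
{
    const unsigned char *p = *s;
    uint32_t cp;
    int extra;

    if (p[0] == 0) return 0;
    if (p[0] < 0x80)                { cp = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else { *s = p + 1; return BOGUS_CODEPOINT; }

    p++;
    while (extra-- > 0) {
        /* Stop at a bad continuation byte without consuming it. */
        if ((*p & 0xC0) != 0x80) { *s = p; return BOGUS_CODEPOINT; }
        cp = (cp << 6) | (uint32_t) (*p++ & 0x3F);
    }
    *s = p;

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = BOGUS_CODEPOINT;
    return cp;
}

/* Convert NUL-terminated UTF-8 to NUL-terminated UTF-16.
   len is the capacity of dst in 16-bit units. Output is truncated to
   fit, and a surrogate pair is never split. Returns the number of units
   written, not counting the terminator. */
static size_t sketch_utf8ToUtf16(const char *src, uint16_t *dst, size_t len)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t n = 0;
    uint32_t cp;

    if (len == 0) return 0;
    len--;  /* reserve room for the terminator */

    while ((cp = decode_utf8(&s)) != 0) {
        if (cp > 0xFFFF) {
            /* Outside the BMP: emit a high/low surrogate pair. */
            if (len - n < 2) break;
            cp -= 0x10000;
            dst[n++] = (uint16_t) (0xD800 | (cp >> 10));
            dst[n++] = (uint16_t) (0xDC00 | (cp & 0x3FF));
        } else {
            if (len - n < 1) break;
            dst[n++] = (uint16_t) cp;
        }
    }
    dst[n] = 0;
    return n;
}
```

Given the UTF-8 bytes F0 9D 93 B2 (U+1D4F2, the character from the test
case), this produces the surrogate pair D835 DCF2 rather than '?'. The
existing utf8ToUcs2 can then stay strictly UCS-2 for any caller that
really wants that.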