[physfs] Unicode conversions fail outside BMP
Jason McKesson
korval2 at gmail.com
Sun Nov 4 01:10:44 EDT 2012
On 11/3/2012 8:56 AM, Jookia wrote:
> Hello!
>
> Recently I've been trying to get my application to run on Windows. At
> first I found that it didn't load files properly when they had a
> certain phrase in them that's meant to break things that don't support
> Unicode properly.
>
> Long story short, I wrote this:
>
> #include <physfs.h>
> #include <stdio.h>
>
> int main(int argc, char** argv)
> {
>     PHYSFS_init(argv[0]);
>     PHYSFS_mount(PHYSFS_getBaseDir(), "", 0);
>     PHYSFS_File* file = PHYSFS_openRead("𝓲");
>
>     printf("%s\n", PHYSFS_getLastError());
>
>     return 0;
> }
>
> (If you see a box please try and copy this in your IDE. MSVC displays
> it fine.)
>
> On Linux it works. It writes 'No such file or directory' to my
> console. Fantastic!
>
> On Wine + MinGW (I know, I know, I'll explain in a tick why I don't
> think they're at fault. I haven't got PhysFS to work in Windows yet.)
> it returns 'Invalid name.'. What?
>
> Digging deeper into the code, I found that windows.c's
> doPlatformExists fails. After manually printing out the UTF-16 string
> to a file and then using iconv to read it, it turns out my lovely
> character has turned into a question mark, which happens to be an
> invalid name.
>
> Changing the code up so it bypasses the Unicode conversions:
>
> static int doPlatformExists(LPWSTR wpath)
> {
>     LPWSTR newpath = L"Z:\\home\\jookia\\Staging\\test-𝓲";
>
>     if(pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES)
>     {
>         wpath = newpath;
>     }
>
>     BAIL_IF_MACRO
>     (
>         pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES,
>         winApiStrError(), 0
>     );
>     return(1);
> } /* doPlatformExists */
>
> Will make the program write out 'File not found.' to my console.
>
> It seems the error is in utf8ToUcs2 not accounting for surrogates:
>
>
>     /* !!! BLUESKY: UTF-16 surrogates? */
>     if (cp > 0xFFFF)
>         cp = UNICODE_BOGUS_CHAR_CODEPOINT;
>
> I know UCS-2 technically doesn't account for surrogates, but we're
> using UTF-16 in Windows. Commenting out the code will make a weird
> path due to it not being converted properly, but it will bring up a
> 'File not found.'
>
> So... Are there any plans to fix this? Are patches welcome?

UCS-2 doesn't "technically" do anything with surrogates. It's a fixed
two bytes per code point and stops at code point 0xFFFF. Changing
utf8ToUcs2 to emit surrogate pairs would be terrible, since the output
would no longer be UCS-2.

Instead, you want a utf8ToUtf16 function.
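
A rough sketch of what such a function might look like, not PhysFS's
actual code. The names utf8_to_utf16, decode_utf8 and BOGUS_CODEPOINT are
made up for illustration (the last standing in for
UNICODE_BOGUS_CHAR_CODEPOINT), the output is uint16_t so it stays
portable where wchar_t is not 16 bits, and the decoder is deliberately
simplified: it checks continuation bytes but does not reject overlong
encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for UNICODE_BOGUS_CHAR_CODEPOINT. */
#define BOGUS_CODEPOINT 0xFFFD

/* Decode one code point from *src and advance *src past it.
 * Simplified: validates continuation bytes, not overlong forms. */
static uint32_t decode_utf8(const unsigned char **src)
{
    const unsigned char *s = *src;
    uint32_t cp;
    int extra;

    if (s[0] < 0x80)                { cp = s[0];        extra = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; extra = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; extra = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; extra = 3; }
    else { *src = s + 1; return BOGUS_CODEPOINT; }

    s++;
    while (extra-- > 0) {
        /* A missing continuation byte (including the NUL) ends the sequence. */
        if ((*s & 0xC0) != 0x80) { *src = s; return BOGUS_CODEPOINT; }
        cp = (cp << 6) | (uint32_t) (*s++ & 0x3F);
    }
    *src = s;
    return cp;
}

/* Convert NUL-terminated UTF-8 into NUL-terminated UTF-16.
 * len is the capacity of dst in 16-bit units, terminator included.
 * Returns the number of units written, terminator excluded. */
static size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t len)
{
    const unsigned char *s = (const unsigned char *) src;
    size_t n = 0;

    assert(src != NULL && dst != NULL && len > 0);

    while (*s != '\0') {
        uint32_t cp = decode_utf8(&s);

        /* Out-of-range values and encoded surrogates are not valid input. */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = BOGUS_CODEPOINT;

        if (cp > 0xFFFF) {
            /* Outside the BMP: split into a high/low surrogate pair. */
            if (n + 2 >= len) break;  /* need room for pair + terminator */
            cp -= 0x10000;
            dst[n++] = (uint16_t) (0xD800 | (cp >> 10));
            dst[n++] = (uint16_t) (0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= len) break;  /* need room for unit + terminator */
            dst[n++] = (uint16_t) cp;
        }
    }

    dst[n] = 0;
    return n;
}
```

With this, the character from the test program (U+1D4F2, UTF-8 bytes
F0 9D 93 B2) comes out as the pair 0xD835 0xDCF2 rather than a question
mark. utf8ToUcs2 can then stay exactly as it is for callers that really
do want UCS-2.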