[physfs] Unicode conversions fail outside BMP

Tim Čas darkuranium at gmail.com
Sun Nov 4 06:23:24 EST 2012


I'd just like to note that (in my personal experience, take it for what it's
worth) while Windows *in theory* supports UTF-16, in practice it cannot work
properly with surrogate pairs and is therefore more like UCS-2.

I've noticed this on multiple occasions, including while doing (forced by
the university, I might add) C# development.
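
To make the "pairs" point concrete, here is a minimal sketch in C (an illustration only, not PhysicsFS or Windows code; the function name is hypothetical). It counts code points in a UTF-16 buffer by treating a high surrogate followed by a low surrogate as one character. Code that assumes UCS-2 would count each 16-bit unit separately instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count code points in a UTF-16 buffer holding
 * `len` 16-bit code units. */
static size_t utf16_count_codepoints(const uint16_t *s, size_t len)
{
    size_t count = 0;
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        /* A high surrogate (0xD800..0xDBFF) directly followed by a low
         * surrogate (0xDC00..0xDFFF) encodes a single code point. */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i += 2;
        else
            i += 1; /* BMP character, or an unpaired surrogate */
        count++;
    }
    return count;
}
```

For the character in the test program quoted below (U+1D4F2), the UTF-16 form is the two units 0xD835 0xDCF2, so this returns 1 where a UCS-2-style count would give 2.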

On 4 November 2012 06:10, Jason McKesson <korval2 at gmail.com> wrote:

> On 11/3/2012 8:56 AM, Jookia wrote:
>
>> Hello!
>>
>> Recently I've been trying to get my application to run on Windows. At
>> first I found that it didn't load files properly when they had a certain
>> phrase in them that's meant to break things that don't support Unicode
>> properly.
>>
>> Long story short, I wrote this:
>>
>> #include <physfs.h>
>> #include <stdio.h>
>>
>> int main(int argc, char** argv)
>> {
>>   PHYSFS_init(argv[0]);
>>   PHYSFS_mount(PHYSFS_getBaseDir(), "", 0);
>>   PHYSFS_File* file = PHYSFS_openRead("𝓲");
>>
>>   printf("%s\n", PHYSFS_getLastError());
>>
>>   return 0;
>> }
>>
>> (If you see a box please try and copy this in your IDE. MSVC displays it
>> fine.)
>>
>> On Linux it works. It writes 'No such file or directory' to my console.
>> Fantastic!
>>
>> On Wine + MinGW (I know, I know, I'll explain in a tick why I don't think
>> they're at fault. I haven't got PhysFS to work in Windows yet.) it returns
>> 'Invalid name.'. What?
>>
>> Digging deeper into the code, I found that windows.c's doPlatformExists
>> fails. After manually printing out the UTF-16 string to a file and then
>> using iconv to read it, it turns out my lovely character has turned to a
>> question mark, which happens to be an invalid name.
>>
>> Changing the code up so it bypasses the Unicode conversions:
>>
>> static int doPlatformExists(LPWSTR wpath)
>> {
>>     LPWSTR newpath = L"Z:\\home\\jookia\\Staging\\test-𝓲";
>>
>>     if(pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES)
>>     {
>>         wpath = newpath;
>>     }
>>
>>     BAIL_IF_MACRO
>>     (
>>         pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES,
>>         winApiStrError(), 0
>>     );
>>     return(1);
>> } /* doPlatformExists */
>>
>> Will make the program write out 'File not found.' to my console.
>>
>> It seems the error is in utf8ToUcs2 not accounting for surrogates:
>>
>>
>>         /* !!! BLUESKY: UTF-16 surrogates? */
>>         if (cp > 0xFFFF)
>>             cp = UNICODE_BOGUS_CHAR_CODEPOINT;
>>
>> I know UCS-2 technically doesn't account for surrogates, but we're using
>> UTF-16 in Windows. Commenting out the code will make a weird path due to it
>> not being converted properly, but it will bring up a 'File not found.'
>>
>> So... Are there any plans to fix this? Are patches welcome?
>>
> UCS-2 doesn't "technically" do anything. It's two bytes per codepoint and
> stops at codepoint 0xFFFF.  Changing utf8ToUcs2 to use surrogate pairs
> would be terrible, since that's not what UCS-2 is.
>
> Instead, you want a utf8ToUtf16 function.
>
> _______________________________________________
> physfs mailing list
> physfs at icculus.org
> http://icculus.org/mailman/listinfo/physfs
>
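
For reference, the core step a utf8ToUtf16 function would need, which the quoted utf8ToUcs2 check replaces with a bogus character, could look roughly like the sketch below. This is an assumption-laden illustration, not the PhysicsFS implementation: the function name and the REPLACEMENT_CHAR macro are hypothetical, and the UTF-8 decoding that produces `cp` is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a "bogus character" value (U+FFFD). */
#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical helper: encode one code point as UTF-16.
 * Writes one or two code units to `out` and returns how many. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Surrogate values and anything past U+10FFFF are not valid
     * code points, so substitute the replacement character. */
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        cp = REPLACEMENT_CHAR;

    if (cp <= 0xFFFF) {
        /* BMP: identical to UCS-2, a single 16-bit unit. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Above the BMP: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

Calling this with 0x1D4F2 (the character from the test program) yields the pair 0xD835 0xDCF2 rather than a question mark. Keeping it separate from utf8ToUcs2 matches the point made in the quoted reply: UCS-2 stays strictly one unit per code point, and only the UTF-16 path emits pairs.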