[physfs] Unicode conversions fail outside BMP
Tim Čas
darkuranium at gmail.com
Sun Nov 4 06:23:24 EST 2012
I'd just like to note that (in my personal experience, take it for what it
is) while Windows *in theory* supports UTF-16, in practice it cannot work
properly with surrogate pairs and is therefore more like UCS-2.
I've noticed this on multiple occasions, including while doing (forced by
the university, I might add) C# development.
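
For reference, here is a minimal sketch of the surrogate-pair step that a
utf8ToUtf16-style conversion (as suggested in the quoted message below) would
need for code points above 0xFFFF. The name encode_utf16 is hypothetical and
is not part of PhysFS, and substituting U+FFFD for invalid input is my own
assumption rather than what PhysFS does with UNICODE_BOGUS_CHAR_CODEPOINT.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of PhysFS: encode one Unicode code point
 * as UTF-16. Writes 1 or 2 code units to out and returns how many. */
static size_t encode_utf16(uint32_t cp, uint16_t out[2])
{
    /* Values past U+10FFFF and lone surrogates are not valid scalar
     * values; this sketch substitutes U+FFFD (an assumption). */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        cp = 0xFFFD;

    /* Inside the BMP, one code unit holds the code point unchanged. */
    if (cp <= 0xFFFF)
    {
        out[0] = (uint16_t) cp;
        return 1;
    }

    /* Outside the BMP: subtract 0x10000, leaving a 20-bit value, then
     * split it into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

With a code point outside the BMP such as U+1D4F2, this yields the two code
units 0xD835 0xDCF2 where the UCS-2 path would substitute a single bogus
character.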
On 4 November 2012 06:10, Jason McKesson <korval2 at gmail.com> wrote:
> On 11/3/2012 8:56 AM, Jookia wrote:
>
>> Hello!
>>
>> Recently I've been trying to get my application to run on Windows. At
>> first I found that it didn't load files properly when they had a certain
>> phrase in them that's meant to break things that don't support Unicode
>> properly.
>>
>> Long story short, I wrote this:
>>
>> #include <physfs.h>
>> #include <stdio.h>
>>
>> int main(int argc, char** argv)
>> {
>>     PHYSFS_init(argv[0]);
>>     PHYSFS_mount(PHYSFS_getBaseDir(), "", 0);
>>     PHYSFS_File* file = PHYSFS_openRead("𝓲");
>>
>>     printf("%s\n", PHYSFS_getLastError());
>>
>>     return 0;
>> }
>>
>> (If you see a box please try and copy this in your IDE. MSVC displays it
>> fine.)
>>
>> On Linux it works. It writes 'No such file or directory' to my console.
>> Fantastic!
>>
>> On Wine + MinGW (I know, I know, I'll explain in a tick why I don't think
>> they're at fault. I haven't got PhysFS to work in Windows yet.) it returns
>> 'Invalid name.'. What?
>>
>> Digging deeper into the code, I found that windows.c's doPlatformExists
>> fails. After manually printing out the UTF-16 string to a file and then
>> using iconv to read it, it turns out my lovely character has turned to a
>> question mark, which happens to be an invalid name.
>>
>> Changing the code up so it bypasses the Unicode conversions:
>>
>> static int doPlatformExists(LPWSTR wpath)
>> {
>>     LPWSTR newpath = L"Z:\\home\\jookia\\Staging\\test-𝓲";
>>
>>     if(pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES)
>>     {
>>         wpath = newpath;
>>     }
>>
>>     BAIL_IF_MACRO
>>     (
>>         pGetFileAttributesW(wpath) == PHYSFS_INVALID_FILE_ATTRIBUTES,
>>         winApiStrError(), 0
>>     );
>>     return(1);
>> } /* doPlatformExists */
>>
>> Will make the program write out 'File not found.' to my console.
>>
>> It seems the error is in utf8ToUcs2 not accounting for surrogates:
>>
>>
>> /* !!! BLUESKY: UTF-16 surrogates? */
>> if (cp > 0xFFFF)
>>     cp = UNICODE_BOGUS_CHAR_CODEPOINT;
>>
>> I know UCS-2 technically doesn't account for surrogates, but we're using
>> UTF-16 in Windows. Commenting out the code will make a weird path due to it
>> not being converted properly, but it will bring up a 'File not found.'
>>
>> So... Are there any plans to fix this? Are patches welcome?
>>
> UCS-2 doesn't "technically" do anything. It's two bytes per codepoint and
> stops at codepoint 0xFFFF. Changing utf8ToUcs2 to use surrogate pairs
> would be terrible, since that's not what UCS-2 is.
>
> Instead, you want a utf8ToUtf16 function.
>
> _______________________________________________
> physfs mailing list
> physfs at icculus.org
> http://icculus.org/mailman/listinfo/physfs
>