Finger info for marco@icculus.org...


QuakeC and tokenizebyseparator

While implementing Source's Input/Output system, I discovered the answer
to a question I had never really thought about.

The output of tokenizebyseparator needs to be carefully validated:

 string test1 = "thetarget,Open,d,4,-1";
 string test2 = "thetarget,Open,,4,-1";
 float a = tokenizebyseparator(test1, ",");
 float b = tokenizebyseparator(test2, ",");
 print(sprintf("a: %d\n", a));
 print(sprintf("b: %d\n", b));

Here's the code for a simple tokenization operation involving two strings.
They're both made up of segments separated by commas (4 in total), making
it 5 segments. The only difference between the two strings? In test1 the
third segment contains a 'd' between the commas; in test2 it is empty.

However, the prints will disagree:

a: 5
b: 4

The bit with the two commas next to each other will be regarded as a
non-existent segment.

>> ,, <<

FTE will see this and just act as if "" is not a valid segment.
This is wrong on many levels - and some engines out of my/your control
might do the exact same thing, so...
be VERY careful in validating your outputs.
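
A minimal sketch of what such validation could look like, assuming an
Output string is always expected to carry exactly 5 segments like the test
strings above. The function name ParseOutput and the parameter strData are
made up for illustration.

```quakec
/* Hypothetical parser: refuse to touch argv() unless the count is right. */
void ParseOutput(string strData)
{
	float c = tokenizebyseparator(strData, ",");

	/* Assumption: a well-formed Output always has 5 segments. */
	if (c != 5) {
		print(sprintf("malformed output '%s': got %d segments\n", strData, c));
		return;
	}

	/* Only from here on is it safe to read argv(0) through argv(4). */
}
```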

Why is this bad in my case?
For reasons related to my Source I/O implementation, I need to combine
similar strings into one long string (as we support multiple Outputs),
because you can only tokenize one thing at a time: you cannot tokenize
a string while in the loop of tokenizing another. So depending on which
strings I combine, it'd produce completely unpredictable segments.
My solution: Replace every occurrence of ',' with ',_' and tokenize by ','
as normal. Then use substring to cut the first char of your segment, the
underscore, out when you're processing the argv() outputs.
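
Here is a rough sketch of that workaround. The helper names
IO_TokenizePadded and IO_PaddedArgv are hypothetical. It also makes one
assumption beyond the description above: an extra '_' is prepended to the
whole string, so that the first segment carries the same padding as the
rest and can be stripped the same way.

```quakec
/* Hypothetical helper: pad every segment so that none of them is empty. */
float IO_TokenizePadded(string strInput)
{
	/* Treat "" as having no segments at all. */
	if (strlen(strInput) <= 0)
		return 0;

	/* ",_" pads segments 2..n; the leading "_" (an assumption of this
	   sketch) pads the first one. */
	string strPadded = strcat("_", strreplace(",", ",_", strInput));
	return tokenizebyseparator(strPadded, ",");
}

/* Hypothetical helper: returns segment i without its padding underscore. */
string IO_PaddedArgv(float i)
{
	string strSeg = argv(i);
	return substring(strSeg, 1, strlen(strSeg) - 1);
}
```

With this, "thetarget,Open,,4,-1" is tokenized as
"_thetarget,_Open,_,_4,_-1", so an engine that drops empty segments has
nothing to drop, and IO_PaddedArgv(2) should come back as "".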

Update 10:50 AM
I've notified Spike about this already. He had trouble reproducing it at
first, because the whole tokenize function is currently very, very messed
up:

<Spoike> I can reproduce something... but its weirder than we thought.
<Spoike> static string test[] = {"thetarget,Open,,4,-1", "thetorget,Close,None,0,-1", "", ",,,,"};
<Spoike> for (int i = 0; i < test.length; i++) {
<Spoike> float a = tokenizebyseparator(test[i], ",");
<Spoike> print(sprintf("a: %d %s,%s,%s,%s,%s\n", a, argv(0), argv(1), argv(2), argv(3), argv(4))); }
<Spoike> a: 4 thetarget,Open,,4,-1,
<Spoike> a: 5 thetorget,Close,None,0,-1
<Spoike> a: 0 ,,,,
<Spoike> a: 3 ,,,,,,
<Spoike> d: 3 :,:,::
<Spoike> changed the print to replace with colons instead of commas

This will hopefully lead to a reform of the entire function.
I've given some suggestions to Spoike about this:

If the input string is "", it has no content. It's a string length of 0,
so in this special case we're going to return 0. This is so for-loops
don't attempt to tokenize something with no valid input.
I am not 100% convinced this is the ideal way to go, but it would aid usability
for most people. From what I'm told, Darkplaces will output 1.

Something like "foobar" will output 1. Its length is longer than 0 and thus
an actual input.

If the input string is "," it'll output 2. We now have a real input with
a length of more than 0: a comma separating 2 (empty) segments.

"foo,bar," should always output 3. Same with ",foo,bar".
Always respect empty segments, but only when strlen() > 0.

I think this is the only thing that makes sense. Yes, treating "" as 0 makes
it seem inconsistent, but usability comes first. "" does not seem like a string
that somebody would have crafted by hand. When would it ever produce something
valid inside a for-loop tokenizing it? You're just risking the for-loop
executing when tokenizebyseparator outputs 1.
tokenize("") also outputs 0.
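
The rules suggested above boil down to "0 for an empty string, otherwise
the number of separators plus one". A small sketch of that counting rule
follows. This is not the engine's implementation; ExpectedSegmentCount is a
made-up name, and strstrofs is assumed to be available (FTE_STRINGS).

```quakec
/* Hypothetical reference: how many segments the suggested rules imply. */
float ExpectedSegmentCount(string strInput, string strSep)
{
	/* Special case: "" has no content, so report 0 segments. */
	if (strlen(strInput) <= 0)
		return 0;

	/* An empty separator can never match; guard against looping forever. */
	if (strlen(strSep) <= 0)
		return 1;

	/* Any non-empty input has at least one segment... */
	float count = 1;
	float pos = strstrofs(strInput, strSep, 0);

	/* ...plus one more for every separator found, empty or not. */
	while (pos >= 0) {
		count++;
		pos = strstrofs(strInput, strSep, pos + strlen(strSep));
	}
	return count;
}
```

Under these rules "foobar" gives 1, "," gives 2, and both "foo,bar," and
",foo,bar" give 3.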

Update 12:53 AM
It's been addressed in the commit https://sourceforge.net/p/fteqw/code/5766/

 float w = tokenizebyseparator("test1,test2,test3,test4", ",");
 float x = tokenizebyseparator("test1,,test3,test4", ",");
 float y = tokenizebyseparator("", ",");
 float z = tokenizebyseparator(",", ",");
 float v = tokenizebyseparator(" ", ",");
 print(sprintf("4 is %d\n", w));
 print(sprintf("4 is %d\n", x));
 print(sprintf("0 is %d\n", y));
 print(sprintf("2 is %d\n", z));
 print(sprintf("1 is %d\n", v));

Use this code to test whether or not your tokenizebyseparator outputs
garbage.

-- Marco

When this .plan was written: 2020-09-21 06:57:22