I recently stumbled upon a code snippet that puzzled me for a while. Intuitively, it’s very straightforward.
var seperator = "\0";
string helloString = $"Hello{seperator}World";
int seperatorIndex = helloString.IndexOf(seperator);
Console.WriteLine(seperatorIndex);
I define a string that I want to use as the separator and a string that contains that separator. IndexOf will return the 0-based index of the first occurrence of the separator. If the separator cannot be found, it returns -1. So for this code snippet, one might assume that we get 5 as the output.
We don’t, though; we get 0. But why? The docs note this bit of information.
If value is string.Empty, the return value is 0.
So maybe "\0" gets picked up as an empty string?
var seperator = "\0";
string helloString = $"Hello{seperator}World";
int seperatorIndex = helloString.IndexOf(seperator);
Console.WriteLine(string.IsNullOrEmpty(seperator));
Console.WriteLine(seperatorIndex);
Adding a check for null or empty on the separator, the output is now
False
0
So that’s not it either. I was intrigued as to why this was happening and jumped straight into the .NET Source Browser. Following the methods that IndexOf calls, I eventually reached this:
[RequiresUnsafe]
private unsafe int IndexOfCore(ReadOnlySpan<char> source, ReadOnlySpan<char> target, CompareOptions options, int* matchLengthPtr, bool fromBeginning) =>
GlobalizationMode.UseNls ?
NlsIndexOfCore(source, target, options, matchLengthPtr, fromBeginning) :
IcuIndexOfCore(source, target, options, matchLengthPtr, fromBeginning);
Everything I saw before, I could make some sense of, but what are NLS and ICU?
Handling Cultures
NLS (National Language Support) and ICU (International Components for Unicode) are libraries that offer globalization APIs. When handling strings, certain operations might seem simple at first glance but have some considerable nuance. Take the word interesting for example. If we convert that to its uppercase representation, we get different results, depending on the active CultureInfo.
using System.Globalization;
string myString = "interesting";
Console.WriteLine(myString.ToUpper(CultureInfo.CreateSpecificCulture("tr-TR")));
Console.WriteLine(myString.ToUpper(CultureInfo.InvariantCulture));
When running this, the output is
İNTERESTİNG
INTERESTING
Globalization libraries do lots of heavy lifting to make working with all these edge cases easier. I hope to dive into many of those details in later posts.
Why is \0 empty?
To focus on the example for this post, we need to know two things that are actually well documented in the .NET docs when you know what to search for.
- String.IndexOf(String) does a culture-aware linguistic search by default. This means the ICU or NLS library will be part of the operation.
- ICU considers the null character \0 to be a zero-weight character.
What this means is that for the purposes of IndexOf, the \0 gets treated as if it were an empty string, which then brings us back to what I mentioned at the very beginning:
If value is string.Empty, the return value is 0.
And therefore searching for "\0" with the default, culture-aware IndexOf(string) will always return 0 when ICU is in use, no matter where the separator actually is.
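Here is a small sketch of that consequence, assuming an ICU-backed runtime with globalization enabled. The class and method names are made up for illustration. Because the culture-aware search treats "\0" like an empty string, I would expect it to report index 0 even for a string that contains no null character at all, while the ordinal Contains(string) still answers correctly.

```csharp
using System;

// Hypothetical helper names, for illustration only.
static class ZeroWeightDemo
{
    // Culture-aware search: the same comparison IndexOf(string) uses by default.
    public static int LinguisticIndex(string text, string value) =>
        text.IndexOf(value, StringComparison.CurrentCulture);

    // String.Contains(string) always performs an ordinal comparison.
    public static bool OrdinalContains(string text, string value) =>
        text.Contains(value);

    public static void Main()
    {
        // Assumption: ICU is in use. Both calls should then print 0,
        // even though the second string does not contain "\0" at all.
        Console.WriteLine(LinguisticIndex("Hello\0World", "\0"));
        Console.WriteLine(LinguisticIndex("HelloWorld", "\0"));

        Console.WriteLine(OrdinalContains("Hello\0World", "\0")); // True
        Console.WriteLine(OrdinalContains("HelloWorld", "\0"));   // False
    }
}
```

In other words, a result of 0 from the default overload tells you nothing about whether the separator is actually present.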
Potential fixes are to
- either store the separator as a char instead of a string,
- or use an ordinal string comparison:
helloString.IndexOf(seperator, StringComparison.Ordinal);
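Both fixes side by side, as a minimal sketch. The class and method names are hypothetical; only IndexOf(char) and IndexOf(string, StringComparison) come from the framework. Both overloads compare ordinally, so neither involves ICU or NLS.

```csharp
using System;

// Hypothetical helper names, for illustration only.
static class SeparatorFixes
{
    // Fix 1: a char separator. IndexOf(char) is always an ordinal search.
    public static int IndexOfCharSeparator(string text, char separator) =>
        text.IndexOf(separator);

    // Fix 2: keep the string separator, but request an ordinal comparison.
    public static int IndexOfOrdinal(string text, string separator) =>
        text.IndexOf(separator, StringComparison.Ordinal);

    public static void Main()
    {
        string helloString = "Hello\0World";
        Console.WriteLine(IndexOfCharSeparator(helloString, '\0')); // 5
        Console.WriteLine(IndexOfOrdinal(helloString, "\0"));       // 5
    }
}
```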
In the end it is all very obvious and, as mentioned before, well documented. It did throw me for a loop, though, and because I had fun untangling what the cause was, I figured someone else might too.