Exploring char and string in C#
1. System.Char (Characters)
char is an alias for System.Char.
System.Char occupies two bytes, or 16 bits.
System.Char is used to represent and store a single Unicode character.
The range of System.Char is from U+0000 to U+FFFF, with the default value of char being \0, which is U+0000.
Unicode code points are typically written in the format U+____, where U+ is followed by four to six hexadecimal digits; every value a char can hold fits in four hexadecimal digits (16 bits).
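Because a char is only 16 bits wide, a code point above U+FFFF cannot fit in one char and is stored as two chars (a surrogate pair). The following is a minimal sketch of that behavior; the class name SurrogateDemo and the choice of U+1F600 are illustrative assumptions, not part of the original article.

```csharp
using System;

public static class SurrogateDemo
{
    // U+1F600 lies above U+FFFF, so it needs two UTF-16 code units.
    public const string Emoji = "\U0001F600";

    public static void Main()
    {
        Console.WriteLine(Emoji.Length);                                // 2
        Console.WriteLine(char.IsHighSurrogate(Emoji[0]));              // True
        Console.WriteLine(char.IsLowSurrogate(Emoji[1]));               // True
        // Recombine the pair into the original code point.
        Console.WriteLine(char.ConvertToUtf32(Emoji, 0).ToString("X")); // 1F600
    }
}
```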
There are four ways to assign a value to char:
char a = 'j';
char b = '\u006A';
char c = '\x006A';
char d = (char) 106;
Console.WriteLine($"{a} | {b} | {c} | {d}");
Output:
j | j | j | j
\u indicates a Unicode escape sequence; it must be followed by exactly four hexadecimal digits.
\u006A Valid
\u06A Invalid
\u6A Invalid
\x indicates a hexadecimal escape sequence, which takes between one and four hexadecimal digits, so leading zeros can be omitted. The following examples all represent the same character.
\x006A
\x06A
\x6A
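One consequence of the variable length is worth illustrating: inside a string, \x consumes as many hexadecimal digits as it can (up to four), so a following letter such as b may be absorbed into the escape. The sketch below shows this under that assumption; the class and method names are hypothetical.

```csharp
using System;

public static class EscapeDemo
{
    // "\x6Ab" is read as the single character U+06AB, not 'j' followed by 'b'.
    public static string Greedy() => "\x6Ab";

    // "\u006Ab" always takes exactly four digits: 'j' followed by 'b'.
    public static string FixedWidth() => "\u006Ab";

    public static void Main()
    {
        Console.WriteLine(Greedy().Length);    // 1
        Console.WriteLine((int)Greedy()[0]);   // 1707 (0x06AB)
        Console.WriteLine(FixedWidth());       // jb
    }
}
```

When the next character could be a hexadecimal digit, the fixed-width \u form is usually the safer choice.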
char can be implicitly converted to the following numeric types: the integer types ushort, int, uint, long, and ulong, and the floating-point types float, double, and decimal.
char can be explicitly converted to sbyte, byte, and short.
Other types cannot be implicitly converted to char, but any integer and floating-point type can be explicitly converted to char.
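The conversion rules above can be sketched as follows. This is only an illustration; the class and helper names are hypothetical, and the explicit casts run in the default unchecked context.

```csharp
using System;

public static class CharConversionDemo
{
    // Implicit: char widens to int without a cast.
    public static int ToInt(char c) => c;

    // Explicit: a floating-point value is truncated toward zero.
    public static char FromDouble(double d) => (char)d;

    public static void Main()
    {
        char c = 'j';
        double asDouble = c;        // implicit conversion to double
        byte asByte = (byte)c;      // explicit conversion to byte
        char fromInt = (char)106;   // explicit conversion from int

        Console.WriteLine($"{ToInt(c)} {asDouble} {asByte} {fromInt} {FromDouble(106.9)}");
        // 106 106 106 j j
    }
}
```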
2. Character Processing
System.Char provides many static methods that help recognize and process characters.
One very important enumeration is UnicodeCategory:
public enum UnicodeCategory
{
UppercaseLetter,
LowercaseLetter,
TitlecaseLetter,
ModifierLetter,
OtherLetter,
NonSpacingMark,
SpacingCombiningMark,
EnclosingMark,
DecimalDigitNumber,
LetterNumber,
OtherNumber,
SpaceSeparator,
LineSeparator,
ParagraphSeparator,
Control,
Format,
Surrogate,
PrivateUse,
ConnectorPunctuation,
DashPunctuation,
OpenPunctuation,
ClosePunctuation,
InitialQuotePunctuation,
FinalQuotePunctuation,
OtherPunctuation,
MathSymbol,
CurrencySymbol,
ModifierSymbol,
OtherSymbol,
OtherNotAssigned,
}
System.Char has a static method, GetUnicodeCategory(), that returns the category of a character as one of the enumeration values above.
In addition to GetUnicodeCategory(), we can also determine a character's category through specific static methods.
The table below lists these static methods, what they test, and the enumeration values they correspond to.
| Static Method | Description | UnicodeCategory Values |
| --- | --- | --- |
| IsControl | Non-printable control characters: values below 0x20 (such as \r, \n, \t, \0) and 0x7F through 0x9F | Control |
| IsDigit | Digits 0-9 and decimal digits from other scripts | DecimalDigitNumber |
| IsLetter | Letters A-Z, a-z and other letter characters | UppercaseLetter, LowercaseLetter, TitlecaseLetter, ModifierLetter, OtherLetter |
| IsLetterOrDigit | Letters and digits | Union of IsLetter and IsDigit |
| IsLower | Lowercase letters | LowercaseLetter |
| IsNumber | Digits, plus Unicode fractions and Roman numerals | DecimalDigitNumber, LetterNumber, OtherNumber |
| IsPunctuation | Punctuation marks in Western and other alphabets | ConnectorPunctuation, DashPunctuation, OpenPunctuation, ClosePunctuation, InitialQuotePunctuation, FinalQuotePunctuation, OtherPunctuation |
| IsSeparator | The space character and all Unicode separators | SpaceSeparator, LineSeparator, ParagraphSeparator |
| IsSurrogate | UTF-16 surrogate code units, 0xD800 through 0xDFFF | Surrogate |
| IsSymbol | Symbol characters such as math and currency symbols | MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol |
| IsUpper | Uppercase letters | UppercaseLetter |
| IsWhiteSpace | All separators plus \t, \n, \r, \v, \f, and U+0085 | SpaceSeparator, LineSeparator, ParagraphSeparator |
Example:
char chA = 'A';
char ch1 = '1';
string str = "test string";
Console.WriteLine(chA.CompareTo('B')); //----------- Output: "-1"
//(meaning 'A' is 1 less than 'B')
Console.WriteLine(chA.Equals('A')); //----------- Output: "True"
Console.WriteLine(Char.GetNumericValue(ch1)); //----------- Output: "1"
Console.WriteLine(Char.IsControl('\t')); //----------- Output: "True"
Console.WriteLine(Char.IsDigit(ch1)); //----------- Output: "True"
Console.WriteLine(Char.IsLetter(',')); //----------- Output: "False"
Console.WriteLine(Char.IsLower('u')); //----------- Output: "True"
Console.WriteLine(Char.IsNumber(ch1)); //----------- Output: "True"
Console.WriteLine(Char.IsPunctuation('.')); //----------- Output: "True"
Console.WriteLine(Char.IsSeparator(str, 4)); //----------- Output: "True"
Console.WriteLine(Char.IsSymbol('+')); //----------- Output: "True"
Console.WriteLine(Char.IsWhiteSpace(str, 4)); //----------- Output: "True"
Console.WriteLine(Char.Parse("S")); //----------- Output: "S"
Console.WriteLine(Char.ToLower('M')); //----------- Output: "m"
Console.WriteLine('x'.ToString()); //----------- Output: "x"
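// U+10F00 does not fit in a char literal; as a string it is a surrogate pair, and index 0 is its high surrogate.
Console.WriteLine(Char.IsSurrogate("\U00010F00", 0)); // Output: "True"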
char test = '\xDFFF';
Console.WriteLine(test); //----------- Output:'?'
Console.WriteLine(Char.GetUnicodeCategory(test));//----------- Output:"Surrogate"
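To relate the Is* methods back to the enumeration, the sketch below prints the category GetUnicodeCategory() reports for a handful of ASCII characters. The class name CategoryDemo and the helper Of are hypothetical names chosen for illustration.

```csharp
using System;
using System.Globalization;

public static class CategoryDemo
{
    // Thin wrapper so the category lookup can be reused and tested.
    public static UnicodeCategory Of(char c) => char.GetUnicodeCategory(c);

    public static void Main()
    {
        // Expected categories, in order: UppercaseLetter, LowercaseLetter,
        // DecimalDigitNumber, SpaceSeparator, Control, MathSymbol,
        // CurrencySymbol, OpenPunctuation, ConnectorPunctuation.
        foreach (char c in new[] { 'A', 'a', '1', ' ', '\t', '+', '$', '(', '_' })
        {
            Console.WriteLine($"U+{(int)c:X4} -> {Of(c)}");
        }
    }
}
```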
3. Globalization
C# provides a rich set of methods in System.Char for character processing, such as the commonly used ToUpper and ToLower.
However, character processing can be influenced by the user's culture.
When using methods from System.Char to process characters, you can call the methods with the Invariant suffix or pass CultureInfo.InvariantCulture for culture-independent character processing.
Example:
Console.WriteLine(Char.ToUpper('i', CultureInfo.InvariantCulture));
Console.WriteLine(Char.ToUpperInvariant('i'));
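To see why the culture matters, the classic case is the Turkish letter i. The sketch below is an illustration only: it assumes Turkish ("tr-TR") culture data is available on the machine (it may not be in invariant-globalization mode), and the class and helper names are hypothetical.

```csharp
using System;
using System.Globalization;

public static class CultureCaseDemo
{
    // Culture-independent upper-casing.
    public static char UpperInvariant(char c) => char.ToUpperInvariant(c);

    // Culture-sensitive upper-casing using the named culture.
    public static char UpperIn(string cultureName, char c) =>
        char.ToUpper(c, CultureInfo.GetCultureInfo(cultureName));

    public static void Main()
    {
        Console.WriteLine(UpperInvariant('i'));   // I

        // With Turkish culture data this is typically U+0130
        // (capital I with dot above) rather than U+0049.
        Console.WriteLine(((int)UpperIn("tr-TR", 'i')).ToString("X4"));
    }
}
```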
The enumerations and properties that commonly appear as overload parameters when processing characters and strings are described below.
StringComparison
| Enumeration | Value | Description |
| --------------------------- | ----- | --------------------------------------------------------------- |
| CurrentCulture | 0 | Compares strings using culture-sensitive ordering rules with the current culture |
| CurrentCultureIgnoreCase | 1 | Compares strings using culture-sensitive ordering rules with the current culture, ignoring case |
| InvariantCulture | 2 | Compares strings using culture-sensitive ordering rules with invariant culture |
| InvariantCultureIgnoreCase | 3 | Compares strings using culture-sensitive ordering rules with invariant culture, ignoring case |
| Ordinal | 4 | Compares strings using ordinal (binary) ordering rules |
| OrdinalIgnoreCase | 5 | Compares strings using ordinal (binary) ordering rules, ignoring case |
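As a small illustration of the two ordinal members in the table, the sketch below compares strings by their raw character values, with and without case folding. The class name ComparisonDemo and the sample strings are assumptions made for the example.

```csharp
using System;

public static class ComparisonDemo
{
    public static void Main()
    {
        // Ordinal compares the UTF-16 values directly, so case differences matter.
        Console.WriteLine(string.Equals("file.TXT", "FILE.txt", StringComparison.Ordinal));           // False
        Console.WriteLine(string.Equals("file.TXT", "FILE.txt", StringComparison.OrdinalIgnoreCase)); // True

        // 'a' (97) sorts after 'B' (66) in ordinal order...
        Console.WriteLine(string.Compare("a", "B", StringComparison.Ordinal) > 0);           // True
        // ...but before it once case is ignored.
        Console.WriteLine(string.Compare("a", "B", StringComparison.OrdinalIgnoreCase) < 0); // True
    }
}
```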
CultureInfo
| Property | Description |
| --------------------- | ------------------------------------------------------------------- |
| CurrentCulture | Gets or sets the CultureInfo object representing the culture used by the current thread |
| CurrentUICulture | Gets or sets the CultureInfo object representing the current user interface culture used for resource management at runtime |
| InstalledUICulture | Gets the CultureInfo representing the culture installed with the operating system |
| InvariantCulture | Gets the CultureInfo object that is culture-independent (invariant) |
| IsNeutralCulture | Gets a value indicating whether the current CultureInfo instance represents a neutral culture |
4. System.String (Strings)
4.1 String Search
Strings have multiple search methods: StartsWith(), EndsWith(), Contains(), and IndexOf().
StartsWith() and EndsWith() can utilize StringComparison for comparison and CultureInfo to control culture-related rules.
StartsWith(): Checks if the string starts with a matching substring.
EndsWith(): Checks if the string ends with a matching substring.
Contains(): Checks if the substring exists at any position in the string.
IndexOf(): Retrieves the zero-based index of the first occurrence of the string or character; a return value of -1 means no match was found.
Usage example:
string a = "痴者工良(高级程序员劝退师)";
Console.WriteLine(a.StartsWith("高级"));
Console.WriteLine(a.StartsWith("高级", StringComparison.CurrentCulture));
Console.WriteLine(a.StartsWith("高级", true, CultureInfo.CurrentCulture));
Console.WriteLine(a.StartsWith("痴者", StringComparison.CurrentCulture));
Console.WriteLine(a.EndsWith("劝退师)", true, CultureInfo.CurrentCulture));
Console.WriteLine(a.IndexOf("高级", StringComparison.CurrentCulture));
Output:
False
False
False
True
True
5
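Contains() originally accepted only the substring to search for (newer .NET versions add a StringComparison overload). StartsWith() and EndsWith() have several overloads, listed below; IndexOf() has its own set, including (String, StringComparison):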
| Overload | Description |
| --------------------------------------------- | -------------------------------------- |
| (String) | Checks if matches the specified string |
| (String, StringComparison) | Specifies the comparison method for matching the string |
| (String, Boolean, CultureInfo) | Controls case sensitivity and cultural rules for matching the string |
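The overloads in the table can be exercised with a plain ASCII string, which makes the effect of case sensitivity easier to see. This is a minimal sketch; the class name SearchDemo and the sample text are assumptions for illustration.

```csharp
using System;
using System.Globalization;

public static class SearchDemo
{
    public const string Text = "Hello, World";

    public static void Main()
    {
        Console.WriteLine(Text.StartsWith("hello"));                                     // False
        Console.WriteLine(Text.StartsWith("hello", StringComparison.OrdinalIgnoreCase)); // True
        Console.WriteLine(Text.EndsWith("WORLD", true, CultureInfo.InvariantCulture));   // True
        Console.WriteLine(Text.Contains("World"));                                       // True
        Console.WriteLine(Text.IndexOf("world", StringComparison.Ordinal));              // -1
        Console.WriteLine(Text.IndexOf("world", StringComparison.OrdinalIgnoreCase));    // 7
    }
}
```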
The rules related to globalization and case sensitivity are covered in section 3 (Globalization).
4.2 String Extraction, Insertion, Deletion, Replacement
4.2.1 Extraction
The Substring() method extracts either a specified number of characters starting at a given index, or the remainder of the string from that index.
string a = "痴者工良(高级程序员劝退师)";
Console.WriteLine(a.Substring(startIndex: 1, length: 3));
// 者工良
Console.WriteLine(a.Substring(startIndex: 5));
// 高级程序员劝退师)
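Substring() throws ArgumentOutOfRangeException when the index or length falls outside the string, which is easy to trip over. The sketch below shows the boundary behavior and one possible way to guard against it; SafeSubstring is a hypothetical helper, not a framework method.

```csharp
using System;

public static class SubstringDemo
{
    // Hypothetical helper: clamps the arguments instead of throwing.
    public static string SafeSubstring(string s, int startIndex, int length)
    {
        if (startIndex < 0) startIndex = 0;
        if (startIndex >= s.Length || length <= 0) return string.Empty;
        return s.Substring(startIndex, Math.Min(length, s.Length - startIndex));
    }

    public static void Main()
    {
        // startIndex equal to Length is allowed and yields an empty string.
        Console.WriteLine("abc".Substring(3).Length);    // 0
        Console.WriteLine(SafeSubstring("abc", 1, 10));  // bc

        try { "abc".Substring(4); }
        catch (ArgumentOutOfRangeException) { Console.WriteLine("out of range"); }
    }
}
```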
4.2.2 Insertion, Deletion, Replacement
Insert(): Inserts a string at the specified index.
Remove(): Removes characters starting at the specified index, either a given count or everything to the end of the string.
PadLeft(): Expands the string to a specified total length by adding a padding character (a space by default) on the left side.
PadRight(): Expands the string to a specified total length by adding a padding character (a space by default) on the right side.
TrimStart(): Removes the specified characters from the left side of the string until a character not in the set is encountered.
TrimEnd(): Removes the specified characters from the right side of the string until a character not in the set is encountered.
Replace(): Replaces every occurrence of a contiguous group of N characters in the string with a new group of M characters.
Example output:
- Remove Insert -
痴者工良我是(高级程序员劝退师)
痴者工良(
痴者工良(序员劝退师)
- PadLeft PadRight -
******痴者工良(高级程序员劝退师)
痴者工良(高级程序员劝退师)######
######痴者工良(高级程序员劝退师)
痴者工良(高级程序员劝退师)******
....痴者工良(高级程序员劝退师)
痴者工良(高级程序员劝退师)....
- Trim -
Hello | World
Hello | World
Hello | World!|
Hello | World!|||
|Hello | World!
|||Hello | World!
abc ABC&#*
abc ABC&#*
- Replace -
AbcdABCDAbcdABCD
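The code that produced the output above is not shown, so the following is a separate, minimal sketch of the same methods on simpler ASCII input. It is not the author's original example; the class name EditDemo and the sample strings are assumptions.

```csharp
using System;

public static class EditDemo
{
    public static void Main()
    {
        string s = "Hello World";

        Console.WriteLine(s.Insert(5, ","));                   // Hello, World
        Console.WriteLine(s.Remove(5));                        // Hello
        Console.WriteLine(s.Remove(0, 6));                     // World

        Console.WriteLine("42".PadLeft(5, '0'));               // 00042
        Console.WriteLine("42".PadRight(5, '.'));              // 42...

        Console.WriteLine("||Hi||".TrimStart('|'));            // Hi||
        Console.WriteLine("||Hi||".TrimEnd('|'));              // ||Hi

        Console.WriteLine("AbcdAbcd".Replace("Abcd", "ABCD")); // ABCDABCD
    }
}
```

All of these methods return a new string; the original instance is never modified.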
5. String Intern Pool
The following is a personal summary by the author. Due to personal limitations, if there are any errors, please kindly provide corrections.
String interning operates at the application-domain level, and the intern pool is shared by all assemblies within the domain.
The CLR maintains a table called the intern pool.
This table records references to all string instances that are declared as literals in the code.
Concatenating literals (for example "a" + "b") is folded by the compiler into a single literal, so the resulting string also enters the intern pool; strings built at run time do not enter the pool unless String.Intern() is called.
Only strings declared as literals (or explicitly interned) reference the instances held in the intern pool.
Literals enter the intern pool wherever they appear: in fields, properties, string variables declared within methods, or even default values of method parameters.
For example:
static string test = "一个测试";
static void Main(string[] args)
{
string a = "a";
Console.WriteLine("test:" + test.GetHashCode());
TestOne(test);
TestTwo(test);
TestThree("一个测试");
}
public static void TestOne(string a)
{
Console.WriteLine("----TestOne-----");
Console.WriteLine("a:" + a.GetHashCode());
string b = a;
Console.WriteLine("b:" + b.GetHashCode());
Console.WriteLine("test - a :" + Object.ReferenceEquals(test, a));
}
public static void TestTwo(string a = "一个测试")
{
Console.WriteLine("----TestTwo-----");
Console.WriteLine("a:" + a.GetHashCode());
string b = a;
Console.WriteLine("b:" + b.GetHashCode());
Console.WriteLine("test - a :" + Object.ReferenceEquals(test, a));
}
public static void TestThree(string a)
{
Console.WriteLine("----TestThree-----");
Console.WriteLine("a:" + a.GetHashCode());
string b = a;
Console.WriteLine("b:" + b.GetHashCode());
Console.WriteLine("test - a :" + Object.ReferenceEquals(test, a));
}
Output (the hash values vary between runtimes and, on .NET Core and later, between runs):
test:-407145577
----TestOne-----
a:-407145577
b:-407145577
test - a :True
----TestTwo-----
a:-407145577
b:-407145577
test - a :True
----TestThree-----
a:-407145577
b:-407145577
test - a :True
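You can check whether two strings are the same reference with the static method Object.ReferenceEquals(s1, s2).
The instance method GetHashCode() cannot confirm this on its own: for strings it is computed from the content, so equal hash codes only suggest equal content, not the same instance.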
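The difference between reference identity and content-based hashing, and the effect of interning, can be sketched as follows. This is an illustration under the usual default runtime behavior, where literals are interned; the class name InternDemo, the helper Same, and the sample literal are hypothetical.

```csharp
using System;

public static class InternDemo
{
    // Hypothetical helper: true only when both variables point at one instance.
    public static bool Same(string a, string b) => object.ReferenceEquals(a, b);

    public static void Main()
    {
        string literal = "intern-demo";
        string folded = "intern-" + "demo";                // constant-folded by the compiler
        string built = new string(literal.ToCharArray());  // new instance created at run time

        Console.WriteLine(Same(literal, folded));                         // True
        Console.WriteLine(Same(literal, built));                          // False
        Console.WriteLine(literal.GetHashCode() == built.GetHashCode());  // True: hash depends on content
        Console.WriteLine(Same(literal, string.Intern(built)));           // True
        Console.WriteLine(string.IsInterned(built) != null);              // True
    }
}
```

String.Intern() returns the pooled instance with the same content, which is why the fourth comparison succeeds even though built itself is a separate object.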
You can modify strings directly in memory using unsafe code.
Refer to https://blog.benoitblanchon.fr/modify-intern-pool/
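// Requires compiling with unsafe code enabled (AllowUnsafeBlocks).
unsafe
{
    string a = "Test";
    // Pin the string in memory and overwrite its second character in place.
    fixed (char* p = a)
    {
        p[1] = '3';
    }
    Console.WriteLine(a); // T3st
}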
Using *Microsoft.Diagnostics.Runtime*, you can obtain information about the CLR.
After experimenting with it, the author found that .NET does not provide an API to view the hash table inside the string intern pool.
For more information on using C# strings and the principles of the intern pool, please refer to:
http://community.bartdesmet.net/blogs/bart/archive/2006/09/27/4472.aspx
To attempt to obtain a list of string literals in an assembly:
https://stackoverflow.com/questions/22172175/read-the-content-of-the-string-intern-pool
Documentation on the .NET Profiling API:
https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/profiling/profiling-overview?redirectedfrom=MSDN
.NET String Intern Pool and Increasing String Comparison Performance:
http://benhall.io/net-string-interning-to-improve-performance/
Learning articles on C# string intern pools:
https://www.cnblogs.com/mingxuantongxue/p/3782391.html
https://www.xuebuyuan.com/189297.html
If there are any errors in the summary or knowledge, please kindly correct them.