C#: Fixing Encoding Issues When Unzipping with System.IO.Compression

October 12, 2023

System.IO.Compression is the official .NET library for working with compressed files. By default it decodes the file names inside a zip archive as UTF-8.

However, encoding on Windows can be chaotic. If files or directories in the archive have Chinese names, the extracted names may come out garbled, even when the zip file appears to be UTF-8 encoded. In that case the names have to be decoded as GB2312 during extraction.

Unfortunately, .NET does not support the GB2312 encoding by default.

This explanation may not be entirely accurate, since the problem may also depend on how the archive itself was created. Either way, because .NET lacks GB2312 support by default, Chinese file names end up garbled after extraction.

First, add the System.Text.Encoding.CodePages package to the project.

Then run the following once, anywhere in your code before the encoding is requested, to register the additional encodings with .NET:

// Register more character encodings
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

All available encodings are listed at the end of the article.

Next, specify the encoding when extracting the archive to a directory:

ZipFile.ExtractToDirectory("aaa.zip", "解压目录", Encoding.GetEncoding("GB2312"), overwriteFiles: true);
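Putting the two steps together, a minimal sketch might look like the following. The class and method names (ZipExtractSketch, ExtractWithEncoding) are hypothetical and exist only for illustration; the sketch assumes a .NET version where the ExtractToDirectory overload taking an Encoding and overwriteFiles is available.

```csharp
using System.IO.Compression;
using System.Text;

public static class ZipExtractSketch
{
    // Hypothetical helper (not from the article): extracts zipPath into destDir,
    // decoding entry names with the named code page (GB2312 by default).
    public static void ExtractWithEncoding(string zipPath, string destDir, string encodingName = "GB2312")
    {
        // Make the extra code pages available; registering the same provider again is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Throws ArgumentException if the name is not a known encoding.
        Encoding nameEncoding = Encoding.GetEncoding(encodingName);

        ZipFile.ExtractToDirectory(zipPath, destDir, nameEncoding, overwriteFiles: true);
    }
}
```

Calling ZipExtractSketch.ExtractWithEncoding("aaa.zip", "D:\\aaa") would then behave like the one-line call above, with the registration folded in.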

To inspect the encoding of the entry names in the archive, you can try the following method (Dump() is LINQPad's output helper):

void Main()
{
	// Register more character encodings
	Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
	GetZipEncoding(@"d:\Downloads\aaa.zip").Dump();
}

static IEnumerable<Encoding> GetZipEncoding(string path)
{
	List<Encoding> es = new List<System.Text.Encoding>();
	// Open the archive passed in by the caller
	using (ZipArchive archive = ZipFile.OpenRead(path))
	{
		// Iterate through each entry in the archive
		foreach (ZipArchiveEntry entry in archive.Entries)
		{
			// Guess the encoding from the entry's (already decoded) name
			es.Add(GetEncoding(entry.FullName));
		}
	}
	return es;
}

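// Guess the encoding of a string by looking for a byte-order mark (BOM).
// Caveat: str is an already-decoded .NET string and GetBytes does not emit a BOM,
// so unless the name itself starts with U+FEFF this falls through to Encoding.Default.
static Encoding GetEncoding(string str)
{
	byte[] bytes = Encoding.Default.GetBytes(str);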

	if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
	{
		return Encoding.UTF8;
	}
	else if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
	{
		return Encoding.Unicode;
	}
	else if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
	{
		return Encoding.BigEndianUnicode;
	}
	else
	{
		return Encoding.Default;
	}
}
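Because the check above works on names that have already been decoded, a different heuristic may be more informative: open the archive with UTF-8 explicitly and look for the replacement character U+FFFD, which the UTF-8 decoder substitutes for invalid byte sequences. The sketch below is an assumption-laden illustration, not the article's method; the names ZipNameCheckSketch and NamesDecodeAsUtf8 are hypothetical, and a legacy-encoded name that happens to be valid UTF-8 would go undetected.

```csharp
using System.IO.Compression;
using System.Linq;
using System.Text;

public static class ZipNameCheckSketch
{
    // Hypothetical helper: returns true when every entry name decodes as UTF-8
    // without producing the replacement character U+FFFD.
    public static bool NamesDecodeAsUtf8(string zipPath)
    {
        // Passing Encoding.UTF8 makes the decoding explicit rather than relying on the default.
        using (ZipArchive archive = ZipFile.Open(zipPath, ZipArchiveMode.Read, Encoding.UTF8))
        {
            return archive.Entries.All(entry => entry.FullName.IndexOf('\uFFFD') < 0);
        }
    }
}
```

If this returns false, the names were probably written in a legacy code page such as GB2312, and passing that encoding to ExtractToDirectory is worth trying.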


The output reports UTF-8 for every entry in the archive, yet extracting with the default encoding still produces garbled names:

	ZipFile.ExtractToDirectory(@"aaa.zip", "D:\\aaa");


Extracting with GB2312 specified, on the other hand, works normally:

	ZipFile.ExtractToDirectory(@"aaa.zip", "D:\\aaa", Encoding.GetEncoding("GB2312"));


The default encodings provided by .NET are:

utf-16
utf-16BE
utf-32
utf-32BE
us-ascii
iso-8859-1
utf-8

After adding the package and registering the provider, all available encodings are:

shift_jis
IBM860
ibm861
IBM880
DOS-862
IBM863
gb2312
IBM864
IBM865
cp866
koi8-u
IBM037
ibm869
IBM500
x-mac-icelandic
IBM01140
IBM01141
IBM01142
IBM273
IBM01143
IBM01144
IBM01145
windows-1250
IBM01146
windows-1251
IBM01147
macintosh
windows-1252
DOS-720
IBM277
IBM01148
x-mac-japanese
windows-1253
IBM437
IBM278
IBM01149
x-mac-chinesetrad
windows-1254
windows-1255
Johab
windows-1256
x-mac-arabic
windows-1257
x-mac-hebrew
windows-1258
x-mac-greek
x-mac-cyrillic
IBM00924
iso-8859-2
iso-8859-3
iso-8859-4
iso-8859-5
iso-8859-6
IBM870
iso-8859-7
iso-8859-8
iso-8859-9
x-mac-turkish
x-mac-croatian
windows-874
cp875
IBM420
ks_c_5601-1987
IBM423
IBM424
IBM280
IBM01047
IBM284
IBM285
x-mac-romanian
EUC-JP
x-mac-ukrainian
x-Europa
ibm737
x-IA5
big5
x-cp20936
x-IA5-German
x-IA5-Swedish
x-IA5-Norwegian
koi8-r
ibm775
iso-8859-13
IBM290
iso-8859-15
x-Chinese-CNS
ASMO-708
IBM297
x-mac-thai
x-cp20001
IBM905
x-Chinese-Eten
x-ebcdic-koreanextended
x-cp20003
x-cp20004
x-cp20005
ibm850
IBM-Thai
ibm852
IBM871
x-mac-ce
IBM855
cp1025
x-cp20949
ibm857
IBM00858
x-cp20261
IBM1026
x-cp20269
utf-16
utf-16BE
utf-32
utf-32BE
us-ascii
iso-8859-1
utf-8
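Lists like the two above can presumably be produced with Encoding.GetEncodings(). The sketch below assumes .NET 5 or later, where that method also reports encodings from registered providers; the names EncodingListSketch and AvailableEncodingNames are hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class EncodingListSketch
{
    // Hypothetical helper: returns the names of the encodings .NET currently knows about.
    // Registration is process-wide, so once the provider has been registered,
    // later calls with registerCodePages = false still include its encodings.
    public static IReadOnlyList<string> AvailableEncodingNames(bool registerCodePages)
    {
        if (registerCodePages)
        {
            Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        }

        return Encoding.GetEncodings().Select(info => info.Name).ToList();
    }
}
```

Calling AvailableEncodingNames(false) in a fresh process should give the short default list, and AvailableEncodingNames(true) the longer one.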

痴者工良
