
2 years ago

#30251

Pathofdeath

Python - Read File with mixed Encoding/Codepage?

I am trying to read a file that seems to contain mixed encodings. In this example, the character encoding switches multiple times within one line.

The file content looks like this:

  L sys17200023_gamenoteSbeQ  / ?   P E S e t V a r   P E _ 0 _ V a r T e s t 0 4   1 1   †OŒ[bŽ–µkÁ

I want to get this result:

sys17200023_gamenote打入 /? PESetVar PE_0_VarTest04 11 來完成階段

I tried this script to figure out which code page would be correct:

all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437',
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857',
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869',
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125',
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256',
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr',
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1',
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7',
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13',
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u',
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman',
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213',
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7',
'utf_8', 'utf_8_sig']

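def find_codec(text):
    # Try every encode/decode pair and print the ones that do not fail.
    for i in all_codecs:
        for j in all_codecs:
            try:
                print(i, "to", j, text.encode(i).decode(j))
            except (UnicodeError, LookupError):
                # This pair cannot represent the text, or the codec is unavailable.
                pass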


# Searching for: sys17200023_gamenote打入 /? PESetVar PE_0_VarTest04 11 來完成階段


# // V1 > Backwards check
# find_codec("sys17200023_gamenote打入 /? PESetVar PE_0_VarTest04 11 來完成階段")
# Best Result was: utf_16_le to cp1252 s y s 1 7 2 0 0 0 2 3 _ g a m e n o t e SbeQ  / ?   P E S e t V a r   P E _ 0 _ V a r T e s t 0 4   1 1   †OŒbŽ–µk



# // V2 > Copy Pasted from the File
# find_codec(" L sys17200023_gamenoteSbeQ  / ?   P E S e t V a r   P E _ 0 _ V a r T e s t 0 4   1 1   †OŒ[bŽ–µkÁ")

# // V3 > Read the File
# with open('bin.txt', 'r') as f:
#     for line in f:
#         find_codec(line)

When I try 'V1 - Backwards', I get the correct input string. But I do not get the expected result when I try to convert the input string myself, as in V2 and V3.
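A different way to run the same search is to skip decoding altogether: encode the expected fragment with each codec and look for those bytes in the raw data. This is only a sketch. The helper name `find_fragment` and the tiny `sample` value are made up for illustration and are not taken from the real file.

```python
def find_fragment(data, expected, codecs):
    """Return (codec, byte_offset) pairs where `expected`, encoded with
    that codec, occurs somewhere inside the raw bytes `data`."""
    hits = []
    for codec in codecs:
        try:
            needle = expected.encode(codec)
        except (UnicodeError, LookupError):
            # The codec cannot represent the text or is unavailable.
            continue
        offset = data.find(needle)
        if offset != -1:
            hits.append((codec, offset))
    return hits


# Hypothetical sample: a single-byte key followed by UTF-16LE text.
sample = b"key" + "打入".encode("utf_16_le")
print(find_fragment(sample, "打入", ["ascii", "big5", "utf_8", "utf_16_le", "utf_16_be"]))
```

Because it reports a byte offset per fragment, this kind of search can also show whether different parts of a line use different encodings.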

Maybe somebody can give me a hint in the right direction. Thanks.

Edit: This is the result of open('bin.txt', mode='rb').read():

b'\x00\x00\x00\x00\x00\x14\x00L\x00sys17200023_gamenoteSbeQ \x00/\x00?\x00 \x00P\x00E\x00S\x00e\x00t\x00V\x00a\x00r\x00 \x00P\x00E\x00_\x000\x00_\x00V\x00a\x00r\x00T\x00e\x00s\x00t\x000\x004\x00 \x001\x001\x00 \x00\x86O\x8c[\x10b\x8e\x96\xb5k\x06\xc1'
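Looking at this dump, the bytes seem to form a small binary record rather than one text line in a single encoding: a 20-byte ASCII key, then 76 bytes that decode as UTF-16LE. The sketch below assumes, from this one sample only, that the bytes `\x14\x00` and `L\x00` at offsets 5 and 7 are little-endian lengths (20 and 76). That layout is a guess, not a documented format, and the name `decode_record` is hypothetical.

```python
import struct

# Raw bytes as shown in the dump above, split by the assumed segments.
raw = (
    b'\x00\x00\x00\x00\x00\x14\x00L\x00'
    b'sys17200023_gamenote'
    b'SbeQ \x00/\x00?\x00 \x00P\x00E\x00S\x00e\x00t\x00V\x00a\x00r\x00 \x00'
    b'P\x00E\x00_\x000\x00_\x00V\x00a\x00r\x00T\x00e\x00s\x00t\x000\x004\x00 \x00'
    b'1\x001\x00 \x00\x86O\x8c[\x10b\x8e\x96\xb5k'
    b'\x06\xc1'
)

# Assumption: 5 bytes of unknown meaning, then two little-endian uint16 lengths.
LENGTHS_OFFSET = 5
HEADER_SIZE = 9


def decode_record(data):
    """Split one record into (key, value, trailer) under the assumed layout."""
    key_len, value_len = struct.unpack_from("<HH", data, LENGTHS_OFFSET)
    value_start = HEADER_SIZE + key_len
    value_end = value_start + value_len
    key = data[HEADER_SIZE:value_start].decode("ascii")
    value = data[value_start:value_end].decode("utf_16_le")
    trailer = data[value_end:]  # left uninterpreted
    return key, value, trailer


key, value, trailer = decode_record(raw)
print(key + value)
```

The first five bytes and the trailing `\x06\xc1` are not interpreted here. If other records in the file do not follow the same layout, the offsets would need adjusting.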

python

localization

decode

encode

codepages

0 Answers
