HowTo : Determine and Change File Character Encoding | Linux Commands

How to detect and change a file’s character encoding from the command line in Linux. Examples of converting text between: CP1251 (Windows-1251, Cyrillic), UTF-8, ISO-8859-1 and ASCII.


In this article you’ll learn how to detect a file’s encoding from the command line in Linux.

You will also see how to convert text files between character sets with iconv.

The examples cover the most common conversions between the CP1251 (Windows-1251, Cyrillic), UTF-8, ISO-8859-1 and ASCII character sets.

Detect a File’s Encoding

Use the following command to determine what character encoding is used by a file :
$ file -bi [filename]

Option Description
-b, --brief Don’t print filename (brief mode)
-i, --mime Print filetype and encoding

Example : Detect the encoding of the file in.txt
$ file -bi in.txt
text/plain; charset=utf-8
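
When only the charset is needed, for example inside a script, the `--mime-encoding` option of `file` prints it without the MIME type. The snippet below is a minimal sketch: the sample file and the `detected` variable are made up for illustration, and the result depends on the version of `file` installed.

```shell
# Create a small UTF-8 sample; the bytes \303\251 encode "é".
tmpdir=$(mktemp -d)
printf 'caf\303\251\n' > "$tmpdir/sample.txt"

# -b suppresses the file name, --mime-encoding prints only the charset.
detected=$(file -b --mime-encoding "$tmpdir/sample.txt")
echo "$detected"
```

Keep in mind that `file` guesses the encoding from the file's content, so very short or pure-ASCII files may be reported as us-ascii even if they are meant to be UTF-8.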

Change a File’s Encoding

Use the following command to convert the encoding of a file :
$ iconv -f [encoding] -t [encoding] -o [newfilename] [filename]

Option Description
-f, --from-code Convert characters from encoding
-t, --to-code Convert characters to encoding
-o, --output Specify output file (instead of stdout)

Example : Convert a file’s encoding from CP1251 (Windows-1251, Cyrillic) to UTF-8 and print the result to stdout
$ iconv -f cp1251 -t utf-8 in.txt

Example : Convert a file’s encoding from ISO-8859-1 to UTF-8 and save it to out.txt
$ iconv -f iso-8859-1 -t utf-8 -o out.txt in.txt
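
The same ISO-8859-1 to UTF-8 conversion can be applied to a whole directory with a loop. This is only a sketch: the `latin1` and `utf8` directory names and the sample files are hypothetical, and it uses shell redirection in place of `-o`, which has the same effect.

```shell
# Hypothetical layout: sources in "$workdir/latin1", converted copies in "$workdir/utf8".
workdir=$(mktemp -d)
mkdir "$workdir/latin1" "$workdir/utf8"

# Two sample files; \351 is "é" and \374 is "ü" in ISO-8859-1.
printf 'caf\351\n' > "$workdir/latin1/a.txt"
printf 'M\374nchen\n' > "$workdir/latin1/b.txt"

for src in "$workdir"/latin1/*.txt; do
    dst="$workdir/utf8/$(basename "$src")"
    # Write to a separate file; never redirect iconv output onto its own input.
    if iconv -f iso-8859-1 -t utf-8 "$src" > "$dst"; then
        echo "converted: $(basename "$src")"
    else
        echo "failed: $(basename "$src")" >&2
    fi
done
```

Writing the converted copies to a separate directory keeps the originals intact, so a wrong guess about the source encoding can be corrected by running the loop again.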

Example : Convert a file’s encoding from ASCII to UTF-8
$ iconv -f ascii -t utf-8 -o out.txt in.txt

Example : Convert a file’s encoding from UTF-8 to ASCII
UTF-8 text can contain characters that ASCII cannot represent. When iconv meets one, it stops with the error “illegal input sequence at position X” unless you tell it to drop such characters with the -c option.
$ iconv -c -f utf-8 -t ascii -o out.txt in.txt

Option Description
-c Omit invalid characters from output

Note that if you use iconv with the -c option, you could lose characters.
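
One way to find out whether characters would be lost is to try the strict conversion first and fall back to -c only when it fails. The sketch below assumes GNU (glibc) iconv, which exits with a non-zero status when it meets a character it cannot convert; the sample file and the `result` variable are made up for illustration.

```shell
# Sample UTF-8 input containing one non-ASCII character ("é" = \303\251).
tmpdir=$(mktemp -d)
printf 'caf\303\251\n' > "$tmpdir/in.txt"

if iconv -f utf-8 -t ascii "$tmpdir/in.txt" > "$tmpdir/out.txt" 2>/dev/null; then
    # Every character had an ASCII equivalent.
    result="clean"
else
    # Strict conversion failed: redo it with -c and record that data was dropped.
    # Some iconv versions still exit non-zero with -c, hence the "|| true".
    iconv -c -f utf-8 -t ascii "$tmpdir/in.txt" > "$tmpdir/out.txt" 2>/dev/null || true
    result="lossy"
fi

echo "$result: $(cat "$tmpdir/out.txt")"
```

A script can then log or reject the files marked as lossy instead of silently dropping their non-ASCII characters.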

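[BONUS] Change a string’s encoding from CP1251 (Windows-1251, Cyrillic) to UTF-8
The garbled text below is CP1251 bytes that were misread as Latin-1. On a UTF-8 terminal, the first iconv turns the characters back into the original bytes and the second one decodes those bytes as CP1251.
$ echo "Êàêèå-òî êðàêîçÿáðû" | iconv -t latin1 | iconv -f cp1251 -t utf-8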
Какие-то кракозябры

List All Encodings

List all known coded character sets :
$ iconv -l
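
The list is long, so it is usually filtered. The sketch below checks whether a given encoding name appears in it; `has_encoding` is a hypothetical helper written for this example, not part of iconv, and it does a simple case-insensitive substring match.

```shell
# has_encoding NAME: succeed if NAME appears anywhere in the output of "iconv -l".
# Hypothetical helper; the substring match means "CP125" would also match CP1251.
has_encoding() {
    iconv -l | grep -iF -- "$1" > /dev/null
}

for name in UTF-8 CP1251 ISO-8859-1; do
    if has_encoding "$name"; then
        echo "$name: supported"
    else
        echo "$name: not found"
    fi
done
```

Running such a check before a conversion gives a clearer message than the error iconv prints for an unknown encoding name.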

Option Description
-l, --list List known coded character sets
