I understand that -c stands for chars (characters)
and -b for bytes.
For example, if I run:
echo "Alex how are you?" | cut -c1
it will give A
echo "Alex how are you?" | cut -b1
it will also give A
I thought that one char equals one byte,
so what is the difference between the two options?
The difference between -c
and -b
is how they handle multibyte characters or strings. Use -b
when you want to extract bytes, and use -c
when you want to extract characters, which matters for multibyte (internationalized) text. Try the following examples:
simple_string="Hello"
echo "$simple_string" | cut -c1-3
echo "$simple_string" | cut -b1-3
It should print Hel both times. Now let us try multibyte (internationalization):
multibyte_string="こんにちは"
echo "$multibyte_string" | cut -c 1-3
echo "$multibyte_string" | cut -b 1-3
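To see why these two commands can differ, it may help to look at the bytes directly. The sketch below assumes the string is stored as UTF-8, where each of these five characters takes three bytes; the variable names byte_count and first_three_bytes are only for illustration. With a multibyte-aware cut in a UTF-8 locale, -c 1-3 would be expected to print こんに, while -b 1-3 selects only the first three bytes:

```shell
multibyte_string="こんにちは"   # assumption: this text is stored as UTF-8

# Total size in bytes: 5 characters x 3 bytes each in UTF-8.
byte_count=$(printf '%s' "$multibyte_string" | wc -c | tr -d '[:space:]')
echo "$byte_count"              # 15

# Forcing the C locale makes cut work strictly byte by byte.
first_three_bytes=$(printf '%s\n' "$multibyte_string" | LC_ALL=C cut -b 1-3)
echo "$first_three_bytes"       # こ
```

So -b 1-3 returns a single character here, because こ alone occupies bytes 1 to 3.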
You can see the difference now. This was tested with the cut that ships with FreeBSD 13.
GNU cut seems to have a bug: it treats both options the same. From the docs:
‘-c CHARACTER-LIST’
‘--characters=CHARACTER-LIST’
Select for printing only the characters in positions listed in
CHARACTER-LIST. The same as ‘-b’ for now, but internationalization
will change that. Tabs and backspaces are treated like any other
character; they take up 1 character. If an output delimiter is
specified, (see the description of ‘--output-delimiter’), then
output that string between ranges of selected bytes.
It says "but internationalization will change that". So I guess internationalization is still not supported in the Linux (GNU) version of cut.
Summary: What is the difference between the -c and -b options?
In summary, use the -c
option if you want to extract characters from a file or string, and use the -b
option if you want to extract bytes. For -c to take multibyte characters into account, you need to set the correct LANG, LC_ALL and LC_CTYPE environment variables that deal with multibyte text (internationalization). The workaround for GNU cut when working with multibyte text is to pipe its output to iconv.
For example:
echo "$multibyte_string" | cut -c -7 | iconv -c
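As a rough sketch of what that pipeline does on a byte-oriented cut: the first 7 bytes of the UTF-8 string are two whole characters plus one stray byte, and iconv -c discards the byte that no longer forms a valid character. Here LC_ALL=C cut -b -7 stands in for GNU cut -c -7, and the explicit -f UTF-8 -t UTF-8 is my assumption so the result does not depend on the current locale; the variable name trimmed is only for illustration.

```shell
multibyte_string="こんにちは"   # assumption: this text is stored as UTF-8

# Keep 7 bytes (2 complete characters + 1 stray byte), then let iconv -c
# drop the invalid leftover. Some iconv builds exit non-zero when they
# discard input, hence the "|| true".
trimmed=$(printf '%s\n' "$multibyte_string" \
    | LC_ALL=C cut -b -7 \
    | iconv -c -f UTF-8 -t UTF-8) || true
echo "$trimmed"                 # expected: こん
```

Note that the number passed to cut is still a byte count, so this trims to at most 7 bytes rather than to 7 characters.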
Here is a bug report for cut from 2006: "Re: Cut not working with multi-byte UTF-8 characters". I know it can be confusing, but it is a bug and not an issue with your system. It works correctly on other systems. LOL.
So in a language such as English, can it be assumed that -c equals -b?
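For text made up only of 7-bit ASCII characters, every character is exactly one byte, so -c and -b select the same thing. English text is not always pure ASCII, though (accented loanwords, curly quotes), so one way to check a given input is to count the bytes outside the ASCII range. This is only a sketch; the helper name non_ascii_bytes is hypothetical, and the second string is assumed to be stored as UTF-8.

```shell
# Print how many bytes of the argument fall outside 7-bit ASCII.
# 0 means every character is a single byte, so -c and -b agree.
non_ascii_bytes() {
    printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d '[:space:]'
    echo
}

non_ascii_bytes "Alex how are you?"   # 0
non_ascii_bytes "naïve café"          # 4 (ï and é are two bytes each in UTF-8)
```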