Cut - what is the difference between options -c and -b on Linux?

I understand that -c stands for chars (characters)
and -b for bytes.
For example, if I run:
echo "Alex how are you?" | cut -c1
it gives A

echo "Alex how are you?" | cut -b1
it also gives A

I understood that one char equals one byte,
so what is the difference between the two options?

The difference between -c and -b is how they handle multibyte characters. Use -b when you want to extract raw bytes, and use -c when you want to extract characters, each of which may take up more than one byte (internationalization). Try the following examples:

simple_string="Hello"
echo "$simple_string" | cut -c1-3
echo "$simple_string" | cut -b1-3

Both commands should print Hel, because every character in this ASCII string is exactly one byte. Now let us try a multibyte (internationalization) string:

multibyte_string="こんにちは"
echo "$multibyte_string" | cut -c 1-3
echo "$multibyte_string" | cut -b 1-3

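You can see the difference now: in a UTF-8 locale, -c 1-3 prints the first three characters (こんに), while -b 1-3 prints only the first three bytes, which make up the single character こ. This was tested with the cut shipped in FreeBSD 13:
[Screenshot from 2023-02-01 16-57-09]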
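
The gap between the two options comes from how many bytes each character occupies. The following is a minimal sketch for checking that yourself, assuming the string is stored as UTF-8 (the variable names byte_count, first_char_hex and first_three_bytes are only for illustration). It uses wc -c, od and cut -b, which all work on bytes regardless of locale:

```shell
multibyte_string="こんにちは"

# wc -c counts bytes: 5 characters at 3 bytes each in UTF-8 gives 15.
byte_count=$(printf '%s' "$multibyte_string" | wc -c | tr -d '[:space:]')
echo "bytes: $byte_count"

# Hex dump of the first character; in UTF-8 it is the 3 bytes e3 81 93.
first_char_hex=$(printf '%s' "こ" | od -An -tx1 | tr -d '[:space:]')
echo "first character as hex: $first_char_hex"

# cut -b 1-3 keeps exactly those three bytes, i.e. one whole character.
first_three_bytes=$(printf '%s\n' "$multibyte_string" | cut -b 1-3)
echo "cut -b 1-3: $first_three_bytes"
```

Picking a byte range that does not end on a character boundary (for example -b 1-4) would leave a partial character in the output, which is why -b is only safe on multibyte text when you know the byte layout.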

GNU cut seems to have a bug: it treats both options the same. From the docs:

‘-c CHARACTER-LIST’
‘--characters=CHARACTER-LIST’
     Select for printing only the characters in positions listed in
     CHARACTER-LIST.  The same as ‘-b’ for now, but internationalization
     will change that.  Tabs and backspaces are treated like any other
     character; they take up 1 character.  If an output delimiter is
     specified, (see the description of ‘--output-delimiter’), then
     output that string between ranges of selected bytes.

It says "but internationalization will change that", so I guess the Linux (GNU) version of cut still does not support internationalization:
[Screenshot from 2023-02-01 17-00-05]

Summary: What is the difference between the -c and -b options?

In summary, use the -c option if you want to extract characters from a file/string, and use -b if you want to extract raw bytes, keeping in mind that a multibyte character spans several bytes. For -c to count characters, the LANG, LC_ALL and LC_CTYPE environment variables must select a locale with the right multibyte encoding (internationalization). The workaround for GNU cut with multibyte text is to pipe its output through iconv, which with -c drops any partial character left at the cut point. For example:

echo "$multibyte_string" | cut -c -7 | iconv -c
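
To see what each stage of that pipeline does, here is a hedged step-by-step sketch. It assumes the text is UTF-8 and names the encoding explicitly with -f and -t (that part is my addition, not in the command above), and it uses cut -b so the byte-wise behaviour is the same on GNU and BSD cut. The variable names raw and trimmed are only for illustration:

```shell
multibyte_string="こんにちは"

# Keep the first 7 bytes: two complete 3-byte characters plus the first
# byte of the third one, which is not a valid character on its own.
raw=$(printf '%s\n' "$multibyte_string" | cut -b -7)

# iconv -c omits input that is invalid in the source encoding, which removes
# the dangling byte. Some iconv builds exit non-zero after discarding input,
# hence the "|| true".
trimmed=$(printf '%s\n' "$raw" | iconv -c -f UTF-8 -t UTF-8 || true)

echo "$trimmed"
```

Note that the 7 is still a byte count, not a character count: iconv only cleans up the broken character at the end, leaving the whole characters that fit in the first 7 bytes.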

Here is a bug report for cut from 2006: "Re: Cut not working with multi-byte UTF-8 characters". I know it can be confusing, but it is a bug and not an issue with your system. It works correctly on other systems. LOL.

So for a language such as English, can it be assumed that -c equals -b?