Bytes Chars and Unicode
Rust strings are valid UTF-8 bytes; bytes() iterates raw bytes, chars() iterates Unicode scalar values, and neither directly means “visible character.”
What it is
String and &str store UTF-8.
len() returns the number of bytes.
bytes() yields each encoded byte as u8.
chars() decodes Unicode scalar values as char.
char_indices() yields byte offsets paired with decoded char values.
A Unicode scalar value can require one to four bytes in UTF-8.
A user-perceived character can be multiple scalar values, such as a base character plus combining marks.
The standard library does not provide full grapheme cluster segmentation.
For grapheme clusters, use a Unicode-aware crate such as unicode-segmentation; verify the latest version on docs.rs before adding it.
How it works
ASCII text has one byte per scalar value.
Non-ASCII text does not.
For example, é is one scalar value but two UTF-8 bytes.
日 is one scalar value but three UTF-8 bytes.
String slicing requires byte offsets that are also character boundaries.
is_char_boundary(i) checks whether a byte offset can be used as a boundary.
find, rfind, match_indices, and char_indices return byte offsets.
That is good: byte offsets can be used directly with slicing if they came from the same string and remain valid.
Case conversion and normalization are separate Unicode topics; to_lowercase on char can produce multiple char values.
Use byte APIs for protocols and file formats, scalar APIs for Rust-level Unicode values, and specialized crates for human text segmentation.
Example
fn main() {
let text = "aé日";
assert_eq!(text.len(), 6);
assert_eq!(text.bytes().count(), 6);
assert_eq!(text.chars().count(), 3);
let offsets: Vec<(usize, char)> = text.char_indices().collect();
assert_eq!(offsets, vec![(0, 'a'), (1, 'é'), (3, '日')]);
assert!(text.is_char_boundary(3));
assert!(!text.is_char_boundary(2));
assert_eq!(text.get(1..3), Some("é"));
}Best practice
- ✅ Use
as_bytesorbytesfor wire formats, ASCII checks, and binary protocols. - ✅ Use
charsfor Unicode scalar transformations where scalar values are the correct unit. - ✅ Use
char_indiceswhen you need safe offsets for later slicing. - ✅ Store byte offsets only if they are derived from the same string version.
- ✅ Use pattern APIs such as
split,find, andstrip_prefixbefore writing manual scanners. - ✅ Reach for a Unicode crate when product behavior depends on grapheme clusters or normalization.
- ✅ Document text units in public APIs: bytes, scalar values, lines, words, or graphemes.
- ✅ Prefer checked slicing with
getwhen offsets may be wrong.
Pitfalls
- ⚠️
chars().count()is O(n), not a cached length. - ⚠️
chars().nth(n)repeatedly in a loop is usually quadratic. - ⚠️
charis not “a displayed character” in all languages. - ⚠️ Byte indexes from one string are not valid for another string just because text looks similar.
- ⚠️ Unicode normalization can make visually similar strings compare unequal.
- ⚠️
make_ascii_lowercaseandeq_ignore_ascii_caseare intentionally ASCII-only. - ⚠️ Slicing at an arbitrary byte offset can panic; see String Byte Indexing.
- ⚠️
split_whitespaceuses Unicode whitespace, whilesplit_ascii_whitespaceis ASCII-only.
See also
std: Vec, String & Slices · String vs str Methods · Slicing and Range Indexing · String Byte Indexing · Assuming String Indexes Are Characters · Splitting Strings Without Collecting · Borrowing Strings and Slices · The Slice Type · Iterator Performance
Sources
- Rust standard library,
str::bytes— std, https://doc.rust-lang.org/std/primitive.str.html#method.bytes - Rust standard library,
str::charsandstr::char_indices— std, https://doc.rust-lang.org/std/primitive.str.html#method.chars unicode-segmentationcrate documentation, if used in real code verify latest version — https://docs.rs/unicode-segmentation/latest/unicode_segmentation/
