Why Telugu Character Counts Are Different: Grapheme Clusters Explained
When you need to count characters in Telugu text — for a social media post, an SMS, a form field limit, or any other constrained context — you might reach for a character counter tool or write a quick str.length check in code. But for Telugu (and indeed most Brahmic scripts), these naive approaches give you a number that does not match what a human reader would count. Here is why, and what to do about it.
Code Points vs Grapheme Clusters
Unicode represents text as a sequence of code points, integers that identify individual characters. In JavaScript, str.length returns the number of UTF-16 code units. For text in the Basic Multilingual Plane, which includes the entire Telugu block, that is exactly the number of code points; only supplementary characters, such as most emoji, take two code units each.
But in Telugu, what a reader considers a "character" — a single visible unit of text — is often composed of multiple code points. The syllable "కి" consists of two code points: the consonant "క" (U+0C15) and the i-kara matra "ి" (U+0C3F). To a reader, this is one character. To str.length, it is two.
A grapheme cluster is the technical term for what a reader perceives as a single character unit. The Unicode standard defines rules for segmenting text into grapheme clusters, and these rules handle the Telugu matra combinations correctly.
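As a minimal sketch of the distinction, the snippet below counts the same string three ways. The function name countUnits is hypothetical, and the "te" locale tag is an assumption made for clarity; grapheme segmentation itself is largely locale-independent.

```javascript
// Count one string three ways: UTF-16 code units, code points, grapheme clusters.
// countUnits is an illustrative name, not part of any library.
function countUnits(text) {
  const segmenter = new Intl.Segmenter("te", { granularity: "grapheme" });
  return {
    codeUnits: text.length,                                 // what str.length reports
    codePoints: Array.from(text).length,                    // string iteration yields code points
    graphemes: Array.from(segmenter.segment(text)).length,  // user-perceived characters
  };
}

const ki = countUnits("\u0C15\u0C3F"); // "కి" = U+0C15 followed by U+0C3F
console.log(ki); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
```

Creating the segmenter once and reusing it would be the better choice when counting many strings; it is created inside the function here only to keep the sketch short.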
How Much Does This Matter?
Consider a practical example. The word "నమస్కారం" (namaskāraṃ, "hello/salutations") has:
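- 8 Unicode code points, which is what str.length returns in JavaScript
- 5 grapheme clusters under the long-standing Unicode segmentation rules: న, మ, స్, కా, రం. The anusvara ం is a combining sign, so it attaches to the preceding ర rather than standing alone. Segmenters that implement Unicode 15.1 or later also keep the conjunct స్కా together, which gives 4, the number of syllables a reader would count.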
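The breakdown of this word can be inspected directly. The sketch below spells the word with escapes so that each code point is visible, then prints the clusters that the runtime's own segmenter produces. The number of clusters depends on which Unicode version the JavaScript engine implements, so no fixed cluster count is assumed here.

```javascript
// "నమస్కారం" written out one code point at a time.
const word = "\u0C28\u0C2E\u0C38\u0C4D\u0C15\u0C3E\u0C30\u0C02";

// List every code point in U+XXXX form.
const codePoints = Array.from(word, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

// List the grapheme clusters reported by this runtime.
const segmenter = new Intl.Segmenter("te", { granularity: "grapheme" });
const clusters = Array.from(segmenter.segment(word), (s) => s.segment);

console.log(codePoints.length, codePoints.join(" ")); // 8 code points
console.log(clusters.length, clusters.join(" | "));   // fewer clusters than code points
```

Running this in the browsers or Node.js versions you target is the most reliable way to see which count your users will actually get.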
For a tight limit, such as the 280 characters of a post on X (formerly Twitter), this difference is significant. A Telugu post that a user reasonably expects to fit might be rejected as too long if the platform counts code points rather than grapheme clusters.
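The choice of unit also decides where text is cut when a limit is enforced in code. The following is a sketch of a grapheme-aware truncation, compared with a plain slice on code units. The helper name truncateGraphemes is hypothetical.

```javascript
// Hypothetical helper: keep at most `limit` grapheme clusters,
// never cutting inside a cluster.
function truncateGraphemes(text, limit) {
  const segmenter = new Intl.Segmenter("te", { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(text)) {
    if (count === limit) break;
    out += segment;
    count += 1;
  }
  return out;
}

const sample = "\u0C15\u0C3F\u0C15\u0C3F\u0C15\u0C3F"; // "కికికి": 6 code points, 3 clusters
const safeCut = truncateGraphemes(sample, 2);           // "కికి"
const rawCut = sample.slice(0, 3);                      // "కిక": the second "కి" is followed by a bare క
console.log(safeCut, rawCut);
```

A plain slice can separate a consonant from its vowel sign, which changes how the final syllable reads; cutting on cluster boundaries avoids that.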
What AksharaTool's Counter Measures
AksharaTool's Character Counter tool uses the Intl.Segmenter API (supported in all modern browsers) to count grapheme clusters, giving you the number that matches human perception. It also separately shows the raw Unicode code point count for contexts where that is the relevant metric.
For most content length purposes — social media, SMS, web forms — grapheme clusters are the right unit to count. For technical purposes like database storage sizing, byte count (UTF-8 or UTF-16) is more relevant.
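For storage sizing, byte counts can be computed alongside the other measures. This is a small sketch using the standard TextEncoder API; the function name byteSizes is hypothetical. Every code point in the Telugu block takes 3 bytes in UTF-8 and 2 bytes in UTF-16.

```javascript
// Hypothetical helper: report the encoded size of a string in bytes.
function byteSizes(text) {
  return {
    utf8Bytes: new TextEncoder().encode(text).length, // TextEncoder always emits UTF-8
    utf16Bytes: text.length * 2,                      // 2 bytes per UTF-16 code unit
  };
}

const sizes = byteSizes("\u0C15\u0C3F"); // "కి", one grapheme cluster
console.log(sizes); // { utf8Bytes: 6, utf16Bytes: 4 }
```

A single visible syllable can therefore occupy 6 bytes or more in a UTF-8 column, which matters when a database limit is expressed in bytes rather than characters.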
Platform Behaviour Varies
Unfortunately, different platforms handle this inconsistently:
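- Twitter/X: Uses a custom weighted count that is defined per code point, not per grapheme cluster. Under the published twitter-text rules each Telugu code point counts as one character, so a syllable built from several code points uses several characters of the limit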
- WhatsApp: Counts grapheme clusters for the character counter shown to the user
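- SMS (carrier dependent): Counts UTF-16 code units. Telugu messages are sent in UCS-2 encoding, which fits 70 code units in a single message (67 per part when concatenated), compared with 160 characters (153 per part) for the GSM 7-bit alphabet used for basic Latin text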
- Most web forms: Use maxlength, which counts UTF-16 code units in most browsers
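To make the SMS case concrete, the sketch below estimates how many message parts a Telugu text would need. It assumes the standard UCS-2 limits of 70 code units for a single message and 67 per part for concatenated messages; the function name ucs2SmsParts is hypothetical, and individual carriers or gateways may split messages differently.

```javascript
// Assumed UCS-2 limits: 70 code units in a single message,
// 67 per part once the message is concatenated.
const UCS2_SINGLE = 70;
const UCS2_MULTIPART = 67;

// Hypothetical helper: estimate the number of SMS parts for UCS-2 text.
function ucs2SmsParts(text) {
  const units = text.length; // UTF-16 code units
  if (units === 0) return 0;
  return units <= UCS2_SINGLE ? 1 : Math.ceil(units / UCS2_MULTIPART);
}

const syllable = "\u0C15\u0C3F"; // "కి", 2 code units
console.log(ucs2SmsParts(syllable.repeat(35))); // 70 code units -> 1 part
console.log(ucs2SmsParts(syllable.repeat(36))); // 72 code units -> 2 parts
```

In this example 35 visible syllables fill a single message, so a grapheme cluster count alone would underestimate the cost of a Telugu SMS.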