Why Telugu Character Counts Are Different: Grapheme Clusters Explained

📅 January 2026 · ⏱ 7 min read · 🏷 How-To · Unicode

When you need to count characters in Telugu text — for a social media post, an SMS, a form field limit, or any other constrained context — you might reach for a character counter tool or write a quick str.length check in code. But for Telugu (and indeed most Brahmic scripts), these naive approaches give you a number that does not match what a human reader would count. Here is why, and what to do about it.

Code Points vs Grapheme Clusters

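Unicode represents text as a sequence of code points: integers that each identify one encoded character. In JavaScript, str.length returns the number of UTF-16 code units. For text in the Basic Multilingual Plane, which includes the whole Telugu block (U+0C00 to U+0C7F), that is exactly the number of code points; only supplementary characters such as most emoji take two code units each.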

But in Telugu, what a reader considers a "character" — a single visible unit of text — is often composed of multiple code points. The syllable "కి" consists of two code points: the consonant "క" (U+0C15) and the i-kara matra "ి" (U+0C3F). To a reader, this is one character. To str.length, it is two.
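
The two-code-point structure of "కి" can be checked directly in JavaScript. This is a minimal sketch; the variable names are illustrative only.

```javascript
// "కి" is one visible syllable built from two code points.
const syllable = "\u0C15\u0C3F"; // క (U+0C15) + ి (U+0C3F)

// .length counts UTF-16 code units.
const codeUnits = syllable.length;

// Spreading a string iterates by code point, so each entry is one code point.
const codePoints = [...syllable].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codeUnits);  // 2
console.log(codePoints); // [ 'U+0C15', 'U+0C3F' ]
```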

A grapheme cluster is the technical term for what a reader perceives as a single character unit. The Unicode standard defines rules for segmenting text into grapheme clusters, and these rules handle the Telugu matra combinations correctly.
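
In JavaScript, the built-in Intl.Segmenter applies these segmentation rules. The sketch below assumes a runtime that provides Intl.Segmenter; countGraphemes is a hypothetical helper name, not part of any library.

```javascript
// Count user-perceived characters (extended grapheme clusters).
// countGraphemes is a hypothetical helper, named here for illustration.
function countGraphemes(text, locale = "te") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  // segment() returns an iterable with one entry per grapheme cluster.
  for (const _cluster of segmenter.segment(text)) count++;
  return count;
}

console.log(countGraphemes("\u0C15\u0C3F")); // 1  (కి is a single cluster)
console.log("\u0C15\u0C3F".length);          // 2  (but two UTF-16 code units)
```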

How Much Does This Matter?

Consider a practical example. The word "నమస్కారం" (namaskāraṃ, "hello/salutations") has:

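8 Unicode code points (న, మ, స, ్, క, ా, ర, ం), which is also what str.length returns in JavaScript
4 or 5 grapheme clusters, depending on the Unicode version the segmenter implements: న, మ, స్కా, రం under Unicode 15.1 and later, which keeps the conjunct స్కా together as a reader would, or న, మ, స్, కా, రం under the older rules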
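
One way to check the counts for this word yourself is the sketch below. It spells the word out as escapes so each code point is visible, and it assumes a runtime with Intl.Segmenter. The cluster list depends on the Unicode version built into that runtime, so no fixed cluster count is shown.

```javascript
// నమస్కారం spelled out code point by code point.
const word = "\u0C28\u0C2E\u0C38\u0C4D\u0C15\u0C3E\u0C30\u0C02";
//            న     మ     స     ్     క     ా     ర     ం

const segmenter = new Intl.Segmenter("te", { granularity: "grapheme" });
// Collect the text of each grapheme cluster.
const clusters = Array.from(segmenter.segment(word), (s) => s.segment);

console.log(word.length);      // 8 UTF-16 code units
console.log([...word].length); // 8 code points
console.log(clusters.length);  // varies with the runtime's Unicode version
console.log(clusters);
```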

For a 280-character limit such as the one on Twitter/X, this difference is significant. A Telugu post that a user reasonably expects to fit within the limit might be rejected as too long if the platform counts code points rather than grapheme clusters.

What AksharaTool's Counter Measures

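AksharaTool's Character Counter uses the Intl.Segmenter API (available in current versions of all major browsers) to count grapheme clusters, giving you a number that matches human perception far more closely than a code point count does. The exact cluster boundaries follow the Unicode version built into the browser. The tool also separately shows the raw Unicode code point count for contexts where that is the relevant metric.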

For most content length purposes — social media, SMS, web forms — grapheme clusters are the right unit to count. For technical purposes like database storage sizing, byte count (UTF-8 or UTF-16) is more relevant.
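
For the storage case, byte counts can be computed as in the sketch below. It relies on two facts: every code point in the Telugu block takes three bytes in UTF-8, and every UTF-16 code unit takes two bytes. The variable names are illustrative.

```javascript
const word = "\u0C28\u0C2E\u0C38\u0C4D\u0C15\u0C3E\u0C30\u0C02"; // నమస్కారం

// TextEncoder always encodes to UTF-8.
const utf8Bytes = new TextEncoder().encode(word).length;

// Each UTF-16 code unit occupies two bytes.
const utf16Bytes = word.length * 2;

console.log(utf8Bytes);  // 24 (8 code points x 3 bytes)
console.log(utf16Bytes); // 16 (8 code units x 2 bytes)
```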

Platform Behaviour Varies

Unfortunately, different platforms handle this inconsistently:

Twitter/X: Uses a custom weighted count that is based on code points rather than grapheme clusters, so a Telugu syllable built from several code points can count as more than one character
WhatsApp: Counts grapheme clusters for the character counter shown to the user
SMS (carrier dependent): Counts UTF-16 code units; Telugu SMS messages use UCS-2 encoding, which allows 70 code units in a single message (67 per part when messages are concatenated), compared with 160 characters for Latin text in the GSM 7-bit alphabet
Most web forms: Use maxlength which counts UTF-16 code units in most browsers
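
For the SMS case, a rough estimate of how many message parts a Telugu text needs can be sketched as follows. It assumes the usual GSM limits for UCS-2 (70 code units for a single message, 67 per part once concatenated); carriers and handsets may differ. estimateUcs2SmsParts is a hypothetical helper name.

```javascript
// Assumed limits for UCS-2 encoded SMS; actual carrier behaviour may vary.
const UCS2_SINGLE = 70;   // code units that fit in one unconcatenated message
const UCS2_PER_PART = 67; // code units per part when a message is split

// Hypothetical helper: estimate the number of SMS parts for UCS-2 text.
function estimateUcs2SmsParts(text) {
  const units = text.length; // UTF-16 code units, the unit SMS counts here
  if (units === 0) return 0;
  return units <= UCS2_SINGLE ? 1 : Math.ceil(units / UCS2_PER_PART);
}

// నమస్కారం is 8 code units, so it fits in a single part.
console.log(estimateUcs2SmsParts("\u0C28\u0C2E\u0C38\u0C4D\u0C15\u0C3E\u0C30\u0C02")); // 1
```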

Practical recommendation: When writing Telugu for a character-limited context, use a grapheme cluster counter (like AksharaTool's) to get the human-readable count, but stay well within the limit to account for platform differences in counting methods.
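
If you need to shorten text in code to fit a limit measured in UTF-16 code units (the unit maxlength uses), one approach is to cut only at grapheme cluster boundaries so that no syllable is split. This is a sketch under that assumption, not a description of how any particular platform truncates; truncateToCodeUnits is a hypothetical helper name.

```javascript
// Hypothetical helper: keep whole grapheme clusters while staying within
// a limit measured in UTF-16 code units.
function truncateToCodeUnits(text, maxUnits) {
  const segmenter = new Intl.Segmenter("te", { granularity: "grapheme" });
  let result = "";
  for (const { segment } of segmenter.segment(text)) {
    // Stop before the first cluster that would exceed the limit.
    if (result.length + segment.length > maxUnits) break;
    result += segment;
  }
  return result;
}

// "కికి" is 4 code units; a limit of 3 keeps only the first whole syllable.
console.log(truncateToCodeUnits("\u0C15\u0C3F\u0C15\u0C3F", 3)); // "కి"
```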

Tagged: Telugu · Unicode