Why Telugu Character Counts Are Different: Grapheme Clusters Explained

When you need to count characters in Telugu text — for a social media post, an SMS, a form field limit, or any other constrained context — you might reach for a character counter tool or write a quick str.length check in code. But for Telugu (and indeed most Brahmic scripts), these naive approaches give you a number that does not match what a human reader would count. Here is why, and what to do about it.

Code Points vs Grapheme Clusters

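Unicode represents text as a sequence of code points: integers that each identify one abstract character. In JavaScript, str.length returns the number of UTF-16 code units. For text in the Basic Multilingual Plane, which includes the entire Telugu block (U+0C00 to U+0C7F), that is exactly the number of code points; most emoji and other supplementary-plane characters take two code units each.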

But in Telugu, what a reader considers a "character" — a single visible unit of text — is often composed of multiple code points. The syllable "కి" consists of two code points: the consonant "క" (U+0C15) and the i-kara matra "ి" (U+0C3F). To a reader, this is one character. To str.length, it is two.
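As a small illustration of the two counts described above, the sketch below inspects "కి" in JavaScript. The variable names are only for this example, and the string is written with escapes so that the code points are unambiguous.

```javascript
// "కి" = U+0C15 TELUGU LETTER KA + U+0C3F TELUGU VOWEL SIGN I
const ki = "\u0C15\u0C3F";

// String.prototype.length counts UTF-16 code units.
const codeUnits = ki.length;

// Spreading a string iterates by code point, not by code unit.
const codePoints = [...ki].length;

// List each code point in U+XXXX notation.
const hex = [...ki].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codeUnits);  // 2
console.log(codePoints); // 2
console.log(hex.join(" ")); // U+0C15 U+0C3F

// For contrast: a supplementary-plane character (U+1F600) needs
// two UTF-16 code units for a single code point.
const emoji = "\u{1F600}";
const emojiCodeUnits = emoji.length;       // 2
const emojiCodePoints = [...emoji].length; // 1
console.log(emojiCodeUnits, emojiCodePoints);
```

Neither count matches the single character a Telugu reader sees, which is the gap the rest of this article is about.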

A grapheme cluster is the technical term for what a reader perceives as a single character unit. The Unicode standard defines rules for segmenting text into grapheme clusters, and these rules handle the Telugu matra combinations correctly.
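In JavaScript, one way to apply these segmentation rules is the built-in Intl.Segmenter with grapheme granularity. The following is a minimal sketch: countGraphemes is a hypothetical helper name, and the "te" locale tag is an assumption made for illustration.

```javascript
// Hypothetical helper: counts grapheme clusters using the runtime's
// built-in Unicode segmentation rules.
function countGraphemes(text, locale = "te") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  let count = 0;
  // segment() returns an iterable with one entry per grapheme cluster.
  for (const _segment of segmenter.segment(text)) {
    count += 1;
  }
  return count;
}

// "కి" = consonant KA (U+0C15) + vowel sign I (U+0C3F)
const ki = "\u0C15\u0C3F";

console.log(ki.length);          // 2 (UTF-16 code units)
console.log(countGraphemes(ki)); // 1 (grapheme cluster)
```

Results for consonant conjuncts can differ between runtimes, because the segmentation rules for them changed across Unicode versions, so it is worth testing on the browsers or Node.js versions you actually target.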

How Much Does This Matter?

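Consider a practical example. The word "నమస్కారం" (namaskāraṃ, "hello/salutations") is stored as 8 code points (U+0C28, U+0C2E, U+0C38, U+0C4D, U+0C15, U+0C3E, U+0C30, U+0C02) but has only about half as many grapheme clusters, and a reader sees four syllables: న, మ, స్కా, రం.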

Against a tight limit, such as the 140 characters Twitter originally allowed, the gap between the code point count and the grapheme cluster count is significant. A Telugu tweet that a user reasonably expects to fit within the limit might be counted as too long if the platform counts code points rather than grapheme clusters.
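To make the effect on a length limit concrete, here is a rough sketch that checks the same text against a limit under both counting methods. The function names, the unit labels, and the sample text (one hundred repetitions of "కి") are assumptions for illustration, not how any particular platform counts.

```javascript
// Count text in the requested unit: "grapheme" or "codePoint".
function countBy(text, unit) {
  if (unit === "grapheme") {
    const segmenter = new Intl.Segmenter("te", { granularity: "grapheme" });
    return [...segmenter.segment(text)].length;
  }
  if (unit === "codePoint") {
    // String iteration yields one item per code point.
    return [...text].length;
  }
  throw new RangeError("unknown unit: " + unit);
}

// True when the text is within the limit under the given unit.
function fitsLimit(text, limit, unit) {
  return countBy(text, unit) <= limit;
}

// Sample text: 100 syllables, each made of 2 code points.
const post = "\u0C15\u0C3F".repeat(100);

console.log(countBy(post, "grapheme"));        // 100
console.log(countBy(post, "codePoint"));       // 200
console.log(fitsLimit(post, 140, "grapheme")); // true
console.log(fitsLimit(post, 140, "codePoint")); // false
```

The same 100-syllable text passes a 140-unit limit when grapheme clusters are counted and fails it when code points are counted.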

What AksharaTool's Counter Measures

AksharaTool's Character Counter tool uses the Intl.Segmenter API (supported in all modern browsers) to count grapheme clusters, giving you the number that matches human perception. It also separately shows the raw Unicode code point count for contexts where that is the relevant metric.

For most content length purposes — social media, SMS, web forms — grapheme clusters are the right unit to count. For technical purposes like database storage sizing, byte count (UTF-8 or UTF-16) is more relevant.
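For the storage-sizing case, byte counts can be estimated along these lines. This is a sketch under stated assumptions: byteSizes is a hypothetical helper, the UTF-16 figure assumes no byte order mark, and real database storage may add its own overhead.

```javascript
// Hypothetical helper: encoded size of a string in bytes.
function byteSizes(text) {
  // TextEncoder always encodes to UTF-8.
  const utf8 = new TextEncoder().encode(text).length;
  // Each UTF-16 code unit occupies 2 bytes (no byte order mark counted).
  const utf16 = text.length * 2;
  return { utf8, utf16 };
}

// "కి": every code point in the Telugu block takes 3 bytes in UTF-8
// and 2 bytes in UTF-16.
const sizes = byteSizes("\u0C15\u0C3F");
console.log(sizes.utf8);  // 6
console.log(sizes.utf16); // 4
```

Note that a single perceived character here costs 6 bytes in UTF-8, so a byte-based column limit fills up much faster than a grapheme count suggests.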

Platform Behaviour Varies

Unfortunately, different platforms count in different ways, so the same Telugu text can fall within the limit on one platform and over it on another.

Practical recommendation: When writing Telugu for a character-limited context, use a grapheme cluster counter (like AksharaTool's) to get the human-readable count, but stay well within the limit to account for platform differences in counting methods.

Tagged: Telugu · Unicode