PHP strlen() vs mb_strlen(): Understanding String Length with Unicode (Code Example)

Written by

in

When working with strings in PHP, it’s common to use strlen() to determine the length of a string. However, if your application supports languages like Tamil, Hindi, Japanese, Chinese, or emojis, strlen() may not return the result you expect.

In this article, we’ll compare strlen(), mb_strlen(), and JavaScript’s length property using both English and Tamil text.

Test Data

We will use the following strings:

English String

$str1 = "Lorem ipsum dolor sit amet consectetur, adipisicing elit...";

Tamil String

$str2 = "அகர முதல எழுத்தெல்லாம் ஆதி பகவன் முதற்றே உலகு...";

Measuring String Length

Our PHP code displays the length using three different methods:

strlen($str);
mb_strlen($str);

The browser also calculates the length using JavaScript:

str.length

Results

English Text

FunctionResult
strlen()Same as character count
mb_strlen()Same as character count
JavaScript lengthSame as character count

For English text, all three methods usually return the same value because English characters occupy a single byte in UTF-8.

Tamil Text

The situation changes completely with Unicode languages.

FunctionWhat it Counts
strlen()Number of bytes
mb_strlen()Number of Unicode characters
JavaScript lengthNumber of UTF-16 code units

Since Tamil characters require multiple bytes in UTF-8, strlen() returns a much larger number than the actual number of readable characters.

For example:

தமிழ்

Depending on the encoding:

echo strlen("தமிழ்");      // Larger value (bytes)
echo mb_strlen("தமிழ்");   // 5 characters

The exact byte count depends on the UTF-8 encoding of each character, while mb_strlen() correctly reports the number of characters.

Why Does This Happen?

strlen()

strlen() simply counts bytes stored in memory.

For example:

A = 1 byte
B = 1 byte
C = 1 byte

So:

strlen("ABC") // 3

But Tamil letters occupy multiple bytes:

அ = 3 bytes
க = 3 bytes
ர = 3 bytes

Therefore:

strlen("அகர")

returns the total number of bytes rather than the number of visible characters.

mb_strlen()

The mb stands for MultiByte.

mb_strlen() understands UTF-8 encoding and counts actual Unicode characters instead of bytes.

echo mb_strlen($str, "UTF-8");

or simply

echo mb_strlen($str);

provided your internal encoding is UTF-8.

Whenever your application supports international languages, this is the recommended function.

JavaScript length

JavaScript behaves differently.

const str = "தமிழ்";
console.log(str.length);

JavaScript stores strings as UTF-16. The length property returns the number of UTF-16 code units.

For most Tamil letters, this often appears close to the visible character count, but it’s not a true Unicode character count.

Characters outside the Basic Multilingual Plane (such as many emojis) occupy two UTF-16 code units.

Example:

"😀".length

returns:

2

even though only one emoji is displayed.

Which Function Should You Use?

ScenarioRecommended Function
ASCII / English onlystrlen()
UTF-8 multilingual websitesmb_strlen()
Word limitsmb_strlen()
Form validationmb_strlen()
Database field validationmb_strlen()
JavaScript UI displaylength (with Unicode caveats)

Best Practice

If your application may contain:

  • Tamil
  • Hindi
  • Japanese
  • Chinese
  • Korean
  • Arabic
  • Emojis

always prefer:

mb_strlen($string)

instead of:

strlen($string)

Also ensure the Multibyte String extension (mbstring) is enabled in your PHP installation.

Complete Example

$str1 = "Lorem ipsum dolor sit amet...";
$str2 = "அகர முதல எழுத்தெல்லாம் ஆதி பகவன் முதற்றே உலகு.";

echo strlen($str1);
echo mb_strlen($str1);

echo strlen($str2);
echo mb_strlen($str2);

Conclusion

The difference between strlen() and mb_strlen() is simple but important:

  • strlen() counts bytes.
  • mb_strlen() counts characters.
  • JavaScript’s length counts UTF-16 code units, which usually—but not always—match the number of visible characters.

If your PHP application supports multiple languages, using mb_strlen() will help you avoid incorrect character counts, validation errors, and unexpected behavior with Unicode text.

<?php

$str1 = "Lorem ipsum dolor sit amet consectetur, adipisicing elit. Hic reprehenderit quis, alias delectus aliquam eveniet nam quam dolorem quo vitae pariatur labore quisquam vero accusantium nesciunt magni dolorum optio iure?";
$str2 = "அகர முதல எழுத்தெல்லாம் ஆதி பகவன் முதற்றே உலகு. அறிவும் ஆற்றலும் ஒழுக்கமும் ஒன்றிணைந்து வாழ்வை வளப்படுத்துகின்றன. இயற்கையின் இனிமை மனதை அமைதிப்படுத்தும். காலம் மாறினாலும் கல்வியின் மதிப்பு என்றும் நிலைத்ததே.";
?>

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Document</title>
  <style>
    div{
      margin: 10px 0;
    }
  </style>
</head>
<body>
  <div>English String = <?php echo $str1; ?></div>
  <div>php string length = <?php echo strlen($str1); ?></div>
  <div>php mb string length = <?php echo mb_strlen($str1); ?></div>
  <div>js string length = <span id="jsstr1len"></span></div>

    <div>Non English String = <?php echo $str2; ?></div>
  <div>php string length = <?php echo strlen($str2); ?></div>
  <div>php mb string length = <?php echo mb_strlen($str2); ?></div>
  <div>js string length = <span id="jsstr2len"></span></div>


<script>
  const str1 = "<?php echo $str1; ?>";
  document.getElementById("jsstr1len").innerText = str1.length;

   const str2 = "<?php echo $str2; ?>";
  document.getElementById("jsstr2len").innerText = str2.length;
</script>
</body>
</html>