Anatomy of a Short Message
A short message is 160 characters long. Sometimes it is 140, or fewer. Why?
It is a well-known fact that a Short Message (SMS) can contain somewhere around 140 to 160 characters. Some phones and other devices allow 160, some fewer. Twitter, whose origins are closely tied to short messages, originally had a limit of 140 characters, for example. Speakers of languages not based on the Latin script will say the number is smaller still. Why this ambiguity, and why exactly these numbers? And are these numbers correct at all?
First, some history. SMS dates back to the time when the primary function of a phone was voice calling. There were no keyboards and no touch screens, and the only way to enter text was the numeric keypad: several letters of the alphabet were assigned to each key, and the required letter was selected by pressing the same key repeatedly. Entering text this way is slow, so the amount of text typed is relatively small. SMS was initially part of GSM networks only, which were first deployed in European countries, so it was deemed sufficient for it to support a limited set of Latin letters, digits, and some special characters. By analogy with ASCII, which accommodates most needs of the basic Latin writing system in a 7-bit encoding space, the encoding scheme for SMS also used 7 bits per character and covered mostly the same characters as ASCII, with some exceptions. For example, most of the control characters in the ASCII range 0x00 ... 0x1F were replaced by characters used in European languages but missing from ASCII (Greek, Nordic, Spanish, etc.).
Without going into technical details, the resulting protocol allowed 160 characters of text to be transmitted in 7-bit encoding. Thus the total space allocated for user data is 160 x 7 = 1120 bits, or exactly 140 octets. That limit still stands today, and all later developments and variations start from it.
As SMS grew in popularity, two problems became clear: 1) many languages could not use it for lack of character support, and 2) the 160-character limit was too small.
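The relation between 160 seven-bit characters and 140 octets can be checked with a small packing routine. This is a minimal sketch, not production code: the function name `pack_septets` is made up for illustration, and the example relies on the fact that basic Latin letters have the same codes in ASCII and in the GSM 7-bit alphabet.

```python
def pack_septets(septets):
    """Pack 7-bit values into octets, filling each octet from its low bits up."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for s in septets:
        if not 0 <= s <= 0x7F:
            raise ValueError("each value must fit in 7 bits")
        acc |= s << nbits
        nbits += 7
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)   # final, partially filled octet
    return bytes(out)


# "hello" uses only letters whose ASCII and GSM 7-bit codes coincide.
print(pack_septets([ord(c) for c in "hello"]).hex())   # e8329bfd06
print(len(pack_septets([0x41] * 160)))                 # 140
```

Five characters occupy 35 bits and therefore five octets, while a full 160-character message fills 140 octets with no bits left over.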
1. Language support
The solution to the first problem seems easy at first glance, but it comes at a cost: take the same 1120-bit space and represent each character with 8 bits. In principle, that allows any Latin alphabet character to be represented, including those found in Nordic, Spanish, and other languages. The available size is reduced accordingly, to exactly 1120 / 8 = 140 characters. That is still far from covering all languages: Russian, Chinese, Arabic, etc. are still not covered. Applying the same method and encoding the text with UCS-2 (a 16-bit encoding) makes it possible to cover most of the widely used languages of the world. This is the most widely used encoding for non-ASCII and/or non-Latin messages. The cost is that the message size is reduced further, to only 1120 / 16 = 70 characters.
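The arithmetic above can be written down as a short sketch. The helper names `capacity` and `ucs2_units` are hypothetical, and Python's `utf-16-be` codec is used here as a stand-in for UCS-2, which is an assumption that holds only for characters in the Basic Multilingual Plane.

```python
PAYLOAD_BITS = 160 * 7   # 1120 bits of user data in one message


def capacity(bits_per_char: int) -> int:
    """Characters that fit in a single message at the given character width."""
    return PAYLOAD_BITS // bits_per_char


def ucs2_units(text: str) -> int:
    """Number of 16-bit units the text occupies (UTF-16, big-endian, no BOM)."""
    return len(text.encode("utf-16-be")) // 2


for bits in (7, 8, 16):
    print(bits, capacity(bits))          # 7 160, then 8 140, then 16 70

greeting = "Привет"                      # Russian, outside the 7-bit alphabet
print(ucs2_units(greeting) <= capacity(16))   # True: 6 units fit into 70
```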
2. Longer messages
Now it becomes evident that there must be a way to send longer messages. Even 160 characters was not much, and for some languages 70 is absolutely insufficient. What if we transparently split longer messages behind the scenes, transmit them as separate parts, and concatenate them on the receiving side? This approach requires no modification of the transmission infrastructure, which changes far less frequently than user handsets and costs much more to upgrade or replace. The implementation seems obvious, except that the message itself has no field or indicator that can signal that a given message is part of a multipart message and should be reassembled on receipt. The solution was to "eat" a small piece from the beginning of the message itself and use it as a special header describing what kind of message it is. It is called the User Data Header (UDH), and apart from telling the receiving side that this is part X of a multipart message, it has some other functions, which we will not touch on here. This approach reduces the length of each part of a concatenated message by at least 48 bits, so the resulting message lengths per part are as follows:
For a 7-bit encoded message: 160 for a complete message, 153 for a part of a multipart message (7 x 7 = 49 bits less, because the 48-bit header is rounded up to a whole number of 7-bit characters)
For an 8-bit encoded message: 140 for a complete message, 134 for a part of a multipart message (6 x 8 = 48 bits less)
For a 16-bit encoded message: 70 for a complete message, 67 for a part of a multipart message (3 x 16 = 48 bits less)
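The per-part limits above follow directly from subtracting a 48-bit header from the 1120-bit payload. The sketch below illustrates this. The six-octet header layout is not described in this article; it is taken from the concatenation element with an 8-bit reference number as defined in the GSM specification (3GPP TS 23.040). All function names are hypothetical.

```python
PAYLOAD_BITS = 160 * 7   # 1120 bits of user data per message
UDH_BITS = 6 * 8         # 48-bit concatenation header


def concat_udh(ref: int, total: int, seq: int) -> bytes:
    """Build the 6-octet header for part `seq` of `total` (8-bit reference)."""
    if not (0 <= ref <= 255 and 1 <= seq <= total <= 255):
        raise ValueError("ref, total and seq must fit their one-octet fields")
    # 0x05: five header octets follow; 0x00: concatenated-message element;
    # 0x03: three data octets follow (reference, total parts, this part).
    return bytes([0x05, 0x00, 0x03, ref, total, seq])


def chars_per_part(bits_per_char: int) -> int:
    """Characters left in one part once the header is taken out."""
    return (PAYLOAD_BITS - UDH_BITS) // bits_per_char


def parts_needed(length: int, bits_per_char: int) -> int:
    """Number of messages needed for `length` characters."""
    if length <= PAYLOAD_BITS // bits_per_char:
        return 1                                   # fits without a header
    return -(-length // chars_per_part(bits_per_char))   # ceiling division


print([chars_per_part(b) for b in (7, 8, 16)])   # [153, 134, 67]
print(parts_needed(161, 7))                      # 2
print(concat_udh(0x2A, 2, 1).hex())              # 0500032a0201
```

Note that a 161-character message in 7-bit encoding already needs two parts, because adding the header lowers the per-part limit from 160 to 153.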
Technically, this solves the problem of sending messages of arbitrary length in almost any language of the world. The infrastructure does not change and can remain transparent to the content. In practice that is usually so: most mobile operators charge per message part, regardless of how many characters are actually sent.
Below are some examples of how a few letters are encoded:
| Letter | Description | Unicode / UTF-16 | UTF-8 | GSM 03.38 (7-bit) |
|---|---|---|---|---|
| ñ | Spanish small n with tilde | U+00F1 | 0xC3 0xB1 (c3b1) | 0x7D |
| á | Small a acute | U+00E1 | 0xC3 0xA1 (c3a1) | Not present, available via shift table + 0x61 |
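The Unicode and UTF-8 values in the table can be reproduced with Python's built-in codecs, as in the sketch below. The helper `describe` is a hypothetical name. The GSM 03.38 column is left out because, to my knowledge, the Python standard library has no codec for it.

```python
def describe(ch: str):
    """Return (code point, UTF-16BE hex, UTF-8 hex) for a single character."""
    return (
        f"U+{ord(ch):04X}",
        ch.encode("utf-16-be").hex(),
        ch.encode("utf-8").hex(),
    )


for ch in "ñá":
    print(ch, *describe(ch))
# ñ U+00F1 00f1 c3b1
# á U+00E1 00e1 c3a1
```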