Quick start manual

Data types, variables, and constants

String types

Long strings

AnsiString, also called a long string, represents a dynamically allocated string whose maximum length is limited only by available memory.

A long-string variable is a pointer occupying four bytes of memory. When the variable is empty (that is, when it contains a zero-length string), the pointer is nil and the string uses no additional storage. When the variable is nonempty, it points to a dynamically allocated block of memory that contains the string value. The eight bytes immediately before that location contain a 32-bit length indicator and a 32-bit reference count. This memory is allocated on the heap, but its management is entirely automatic and requires no user code.
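The pointer behavior described above can be observed with a small console program. This is a minimal sketch, not part of the original manual; the program name is hypothetical, and it compares against SizeOf(Pointer) rather than a literal 4 so that it also holds on compilers where pointers are wider than four bytes.

```pascal
program LongStringPointer;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$H+}  // make sure string handling uses long strings

var
  S: AnsiString;
begin
  S := '';
  // An empty long string is represented by a nil pointer.
  Writeln(Pointer(S) = nil);

  // The variable itself is only as large as a pointer.
  Writeln(SizeOf(S) = SizeOf(Pointer));

  S := 'hello';
  // A nonempty long string points at a block holding the characters.
  Writeln(Pointer(S) <> nil);
  Writeln(Length(S));
end.
```

Casting the variable to Pointer only inspects the address; it does not change the reference count or the string contents.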

Because long-string variables are pointers, two or more of them can reference the same value without consuming additional memory. The compiler exploits this to conserve resources and execute assignments faster. Whenever a long-string variable is destroyed or assigned a new value, the reference count of the old string (the variable's previous value) is decremented and the reference count of the new value (if there is one) is incremented; if the reference count of a string reaches zero, its memory is deallocated. This process is called reference counting. When indexing is used to change the value of a single character in a string, a copy of the string is made if, and only if, its reference count is greater than one. This is called copy-on-write semantics.
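The sharing and copy-on-write behavior can be sketched as follows. The program and variable names are hypothetical. The concatenation is there only so that the value lives in a heap block rather than in a compile-time constant, which makes the pointer comparison meaningful.

```pascal
program CopyOnWriteDemo;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$H+}

var
  A, B: AnsiString;
begin
  A := 'hello';
  A := A + ' world';   // build the value at run time (heap-allocated)

  B := A;              // no characters are copied; both variables share one block
  Writeln(Pointer(A) = Pointer(B));   // the two pointers are equal here

  B[1] := 'J';         // indexed write: B gets its own copy first
  Writeln(Pointer(A) = Pointer(B));   // the pointers now differ

  Writeln(A);          // hello world
  Writeln(B);          // Jello world
end.
```

The assignment B := A costs only a pointer copy and a reference-count update; the full character copy is deferred until the indexed write, and happens only because the block is still shared at that moment.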

WideString

The WideString type represents a dynamically allocated string of 16-bit Unicode characters. In most respects it is similar to AnsiString. On Win32, WideString is compatible with the COM BSTR type.

Note

Under Win32, WideString values are not reference-counted. Under Linux, they are.
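As a small illustration of the 16-bit character size, the following sketch (hypothetical program name) compares the character count of a WideString with the number of bytes its characters occupy.

```pascal
program WideStringSize;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}

var
  W: WideString;
begin
  W := 'abc';
  Writeln(SizeOf(WideChar));              // 2 bytes per character
  Writeln(Length(W));                     // 3 characters
  Writeln(Length(W) * SizeOf(WideChar));  // 6 bytes of character data
end.
```

Length reports characters, not bytes, so code that passes a WideString buffer size to an external routine usually needs the multiplication shown in the last line.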

About extended character sets

Windows and Linux both support single-byte and multibyte character sets as well as Unicode. With a single-byte character set (SBCS), each byte in a string represents one character.

In a multibyte character set (MBCS), some characters are represented by one byte and others by more than one byte. The first byte of a multibyte character is called the lead byte. In general, the lower 128 characters of a multibyte character set map to the 7-bit ASCII characters, and any byte whose ordinal value is greater than 127 is the lead byte of a multibyte character. The null value (#0) is always a single-byte character. Multibyte character sets, especially double-byte character sets (DBCS), are widely used for Asian languages.
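The lead-byte rule above implies that the number of characters in an MBCS string can differ from its length in bytes, and that the string has to be scanned from the start to find character boundaries. The sketch below illustrates this under a deliberately simplified assumption: every byte above 127 is a lead byte followed by exactly one trail byte, as in a double-byte character set. CountChars and the sample byte values are hypothetical; real code should rely on the facilities the platform provides for the active character set rather than on this rule.

```pascal
program LeadByteScan;
{$APPTYPE CONSOLE}
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$H+}

// Hypothetical helper. Assumes a double-byte character set in which
// any byte above 127 is a lead byte followed by exactly one trail byte.
function CountChars(const S: AnsiString): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if Ord(S[I]) > 127 then
      Inc(I, 2)    // skip the lead byte and its trail byte
    else
      Inc(I);      // single-byte character
    Inc(Result);
  end;
end;

const
  // Two double-byte characters followed by the ASCII characters 'ABC'.
  Sample: array[0..6] of Byte = ($93, $FA, $96, $7B, $41, $42, $43);
var
  S: AnsiString;
  I: Integer;
begin
  SetLength(S, Length(Sample));
  for I := 0 to High(Sample) do
    S[I + 1] := AnsiChar(Sample[I]);

  Writeln(Length(S));       // 7 bytes
  Writeln(CountChars(S));   // 5 characters
end.
```

Note that the second trail byte ($7B) is below 128. Looking at that byte in isolation would misclassify it as a single-byte character, which is why the scan tracks lead bytes from the beginning of the string.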

In the Unicode character set, each character is represented by two bytes. Thus a Unicode string is a sequence not of individual bytes but of two-byte words. Unicode characters and strings are also called wide characters and wide character strings. The first 256 Unicode characters map to the ANSI character set. The Windows operating