Extended Character Set Support


 

For many years, computer systems mostly used an 8-bit value (a byte) to represent a character. This allows 256 possible values which is adequate for most Western languages. The way in which these values relate to the characters that they represent is a matter of choice and two main standards exist, ASCII and EBCDIC, though the latter is largely redundant and is not considered further here.

 

Strictly, the ASCII character set defines only the first half of the available character values though the term is often misused to reference the entire 256 character set, known in QM as the 8-bit character set. Within this, the first 32 characters are defined to be the control characters such as backspace or linefeed. The next 96 characters are the letters, digits, punctuation, etc. The upper half of the 8-bit character set was originally not defined but subsequently became used in a number of different ways to provide accented letters, symbols and various graphical items. Because the actual characters represented by values 128 to 255 vary, it is important to know which character set is in use. In Windows systems, this is determined by the code page setting.

 

In today's business world, there is a need to represent more than 256 different characters. Various schemes exist to do this but the underlying concept for most of them is the Unicode character set, in which each character from a large range of languages is represented by a unique numeric value, the code point, conventionally written as a hexadecimal value in the form U+1234. Although the full definition of Unicode provides for more, the most commonly used part (the Basic Multilingual Plane or BMP) allows 65536 character values, each of which is stored internally as a 16-bit value (two bytes). The characters that lie outside this range are not usually required in business computer systems. If they are required, it is usually possible to move the ones that are needed into an area of the BMP that is reserved for application specific use (the Private Use Area, U+E000 to U+F8FF). QM reserves characters U+F880 to U+F8FF for internal use and developers should not map other characters into this area.
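
The ranges quoted above can be checked with simple comparisons. The following Python sketch is purely illustrative and is not QM code; the function names are hypothetical and only the numeric ranges come from the text.

```python
# Illustrative model only; these helpers are not QM functions.
BMP_MAX = 0xFFFF                      # highest code point in the Basic Multilingual Plane
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # BMP Private Use Area
QM_FIRST, QM_LAST = 0xF880, 0xF8FF    # part of the PUA that QM reserves for internal use


def in_bmp(cp: int) -> bool:
    """True if the code point fits in a single 16-bit value."""
    return 0 <= cp <= BMP_MAX


def free_for_application(cp: int) -> bool:
    """True if the position lies in the PUA but outside the range QM reserves."""
    return PUA_FIRST <= cp <= PUA_LAST and not (QM_FIRST <= cp <= QM_LAST)


for cp in (0x1234, 0x1F600, 0xE000, 0xF8FC):
    print(f"U+{cp:04X}", in_bmp(cp), free_for_application(cp))
```

A character from outside the BMP, such as U+1F600, would need to be remapped to a position for which free_for_application() is true before it could be held as a single 16-bit character.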

 

The standard version of QM works with 8-bit characters internally and is therefore limited to the characters defined in this set, with the appropriate interpretation of the upper half of the range. By default, sorting orders data by character value, which may not match local language conventions, especially regarding the placement of accented letters. Case conversion handles only the upper and lower case letters from the lower half of the character set and hence does not convert accented letters. Most of what follows relates to the Unicode version of QM but see below for details of the 8-bit character map feature.

 

The extended character set (ECS) version of QM uses 16-bit characters internally and provides support for the Unicode BMP with the exception of moving the five characters that are replaced on multivalue systems by the mark characters (251 to 255 or, in Unicode form, U+00FB to U+00FF) to an alternative location (U+F8FB to U+F8FF) in a part of the BMP that Unicode defines as the Private Use Area (PUA). This is done because many applications use, for example, CHAR(253) to reference a value mark and hence these characters must remain in their traditional locations.

 

Note that the term Extended Character Set in the context of QM is taken at its literal meaning of extending the character set beyond 8 bits. In other contexts, this term is sometimes used to refer to the upper half of the 8-bit character set.

 

The sorting sequence and case conversion rules are determined by a character map that can be selected to match local language conventions. Because these maps are set up to allow for the relocation of the five characters displaced by the marks, the only time an application developer needs to be concerned with this relocation is in the unlikely situation of needing to work directly with the code point value. For example, CHAR(252) refers to a subvalue mark. The German u-umlaut that has been displaced by this character can be programmatically created as ECHAR(63740) or, using hexadecimal notation, ECHAR(0xF8FC). There is also a QMBasic function, SWAPMARKS(), that can be used to interchange the two groups of characters in an application, though this should not usually be required as all relevant external interfaces provide an option to exchange the relocated characters with their Unicode positions for compatibility with other software.
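
The relocation arithmetic can be sketched as follows. This is a Python model of the interchange described above, not QM code and not the implementation of SWAPMARKS(), whose argument list is not shown here; the name swap_marks is hypothetical.

```python
# Illustrative model of the mark relocation; not QM code.
OFFSET = 0xF800  # distance between U+00FB..U+00FF and U+F8FB..U+F8FF


def swap_marks(text: str) -> str:
    """Interchange code points 251-255 with the relocated positions U+F8FB-U+F8FF."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xFB <= cp <= 0xFF:
            cp += OFFSET
        elif 0xF8FB <= cp <= 0xF8FF:
            cp -= OFFSET
        out.append(chr(cp))
    return "".join(out)


external = "M\u00fcller"          # u-umlaut at its Unicode position, U+00FC (252)
internal = swap_marks(external)   # u-umlaut relocated so that 252 is free for the mark
print(ord(internal[1]), hex(ord(internal[1])))   # 63740 0xf8fc
```

Applying the function twice returns the original string, which is why a single interchange operation serves both directions.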

 

Outside of QM, most data storage and transmission systems are byte orientated and hence Unicode characters must be encoded in some way. QM provides a set of character encodings that can be applied to external interfaces such as directory files, socket connections, etc to transform the internal 16-bit characters to a form that is suitable for external use. All processing inside QM is performed with the 16-bit ECS representation.

 

The diagram below shows how ECS and encodings work. Everything inside the circle works with data in ECS 16-bit form, though hashed files will compact data to 8-bit form where possible. All the external interfaces shown outside the circle tend to use 8-bit data and require encoding if characters outside the 8-bit set are to be used. As described below, in some cases this encoding is automatic; in others it is up to the application to provide this encoding using conversion codes and other QM functions.

 

[Diagram: ECS processing inside QM and the encodings applied at external interfaces]

 

 

Characters and Bytes

 

In most cases, an application need not be concerned with the internal representation of characters. Programs that operate on the 8-bit version of QM will continue to operate unchanged on the ECS version. The QMBasic character manipulation functions that work with 8-bit characters in non-ECS systems simply work in exactly the same way with 16-bit characters in ECS systems.

 

Binary data such as images or data read using OSREAD will be manipulated within an application as a series of 16-bit characters that have values in the range 0 to 255. From the application's viewpoint, nothing is different. The fact that each byte of the original data is stored internally as two bytes is irrelevant. QM refers to data stored in this form as a byte string.

 

The QMBasic CHAR() function is defined to work only for character values in the range 0 to 255. In QM and many other multivalue systems, only the low order eight bits of values outside this range are used. To allow creation of characters in the full ECS range, the ECHAR() function can be used. On a non-ECS system, this behaves exactly like CHAR().
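
The difference between the two functions can be modelled in a few lines. The Python sketch below is not QM code; char_model and echar_model are hypothetical names, and the model assumes an argument in the range 0 to 65535 for the ECS case.

```python
# Illustrative model; these are hypothetical stand-ins for CHAR() and ECHAR().
def char_model(n: int) -> str:
    """CHAR(): only the low order eight bits of n are used."""
    return chr(n & 0xFF)


def echar_model(n: int, ecs: bool = True) -> str:
    """ECHAR(): the full 16-bit value on ECS, the same as CHAR() on non-ECS."""
    return chr(n) if ecs else char_model(n)


euro = 0x20AC
print(hex(ord(char_model(euro))))               # low byte only
print(hex(ord(echar_model(euro))))              # full value on an ECS system
print(hex(ord(echar_model(euro, ecs=False))))   # non-ECS system behaves like CHAR()
```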

 

 
Special Key Codes for Terminal Input

 

In a Windows console session, the QMBasic KEYIN() function recognises special keys such as the cursor keys and function keys and for each of these returns a character from the upper half of the 8-bit character set to represent that key. The KEYCODE() function uses the terminfo database to provide similar functionality, extended to include recognition of special characters when using a terminal emulator. In a non-ECS mode system with 8-bit characters, there is no alternative to the character positions used for these special keys clashing with other characters. The larger character set available in ECS mode allows the special characters to be moved into the BMP Private Use Area in positions that do not clash with other characters.

 

In order to maintain compatibility with non-ECS applications and also to allow applications compiled on ECS mode systems to be run on non-ECS mode systems, the KEYIN() and KEYCODE() functions retain their non-ECS behaviour but are therefore unable to distinguish between the special keys and the characters that they replace. As an alternative, the KEYINV() and KEYCODEV() functions return a key value (character number) rather than the actual character. The special keys return values which, when considered as code points in the Unicode BMP, lie in the Private Use Area.

 

These functions are available on non-ECS mode systems but the returned values for the special keys cannot be encoded as a single character whereas on an ECS system they can. Also, they are likely to cause an application to behave incorrectly if one of the data characters replaced by the special keys is entered at the keyboard. It is, therefore, beneficial to modify programs that use KEYCODE() to use KEYCODEV() and the corresponding alternative set of character token names. This should not be a major task. Essentially, if a program contains something like

C = KEYCODE()

N = SEQ(C)

this becomes

N = KEYCODEV()

C = ECHAR(N)

and all use of the key value tokens defined in the KEYIN.H include record with names prefixed by K$ change to use prefix KV$. Any use of BINDKEY() in the same programs should also be modified to use the KV$ token names.

 

The value returned via the STATUS() function after an INPUTFIELD operation terminates on an unrecognised control character is not affected by ECS and always lies in the range 128 to 227. The INPUTFIELDV statement is identical but returns the Unicode code point value for the special keys.

 

 

 

What Must Be 8-Bit Data?

 

The following items must be formed only from the 8-bit character set:

Configuration parameter data

Pathnames and hence directory file record ids

User names and passwords

Account names

Shell commands in SH or OS.EXECUTE

QMBasic program source text, with the exception of string constants

QMBasic subroutine call names and common block names

Object oriented programming property and function names

Encryption key names

QMNet server names

Data records in directory files unless an encoding is used. This includes system files such as $SAVEDLISTS and $COMO.

Compiler output files

 

 

What Can Be 16-Bit Data?

Dictionary item names and content

Alternate key index names

Character strings manipulated by QMBasic programs

String constants in QMBasic programs and dictionary items

Record ids and data in ECS mode hashed files

Encryption key values (though only the bottom 8 bits of each character will be used)

 

 

Character Maps

 

QM uses a character mapping table to define properties of the 65536 characters that can be represented internally by 16-bit data. For each character, this table identifies the upper and lower case equivalents, the sort weight, and the character attributes such as whether it is a letter. An application can use the IS.ALNUM(), IS.ALPHA(), IS.DIGIT(), IS.GRAPH(), IS.MARK(), IS.SPACE(), IS.USER.CHAR() and IS.WIDE() functions to test the attributes of a character. The character maps also control the behaviour of the MCL, MCT and MCU conversion codes and the related UPCASE() and DOWNCASE() functions.

 

For characters defined as being digits, the mapping table also identifies the decimal value of the digit. Clearly, the Arabic numerals 0 to 9 should be assigned the numeric values zero to nine but the ability to assign other characters as digits and to set their numeric value allows characters that represent numbers in other forms to be used. For example, the Arabic-Indic digits zero to nine are assigned Unicode code point values U+0660 to U+0669. The QMBasic character string to numeric conversions will accept as input any character defined as a digit in the mapping tables and use its associated numeric value. The QMBasic numeric to character string conversions always produce numbers formed from the Arabic numerals.
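
As a rough illustration of the digit attribute, the Python sketch below builds a small table of digit values and uses it for string to number conversion. It is a model only, not QM code; the table is a hypothetical fragment covering just the two digit ranges mentioned above.

```python
# Illustrative model of the digit attribute in a character map; not QM code.
DIGIT_VALUE = {}
for value in range(10):
    DIGIT_VALUE[chr(0x0030 + value)] = value   # Arabic numerals 0 to 9
    DIGIT_VALUE[chr(0x0660 + value)] = value   # Arabic-Indic digits U+0660 to U+0669


def to_number(text: str) -> int:
    """Convert a string of mapped digit characters to an integer."""
    result = 0
    for ch in text:
        result = result * 10 + DIGIT_VALUE[ch]   # KeyError if ch is not a mapped digit
    return result


n = to_number("\u0661\u0669\u0668\u0664")   # Arabic-Indic digits one, nine, eight, four
print(n)        # the number itself
print(str(n))   # converting back always yields the characters 0 to 9
```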

 

The character maps include a user definable attribute that can be used in any way that an application designer finds useful.

 

A range of standard character maps is available for download from the QM web site. A developer can use the EDIT.MAP command to modify an existing map or to create a new map. Map names are case insensitive.

 

A QM system can load up to eight mapping tables defined by the ECSMAP configuration parameter into shared memory allowing, for example, different sorting rules depending on the local conventions for users in different countries. Additional mapping tables can be loaded into the private memory of a QM process if the required table is not in shared memory. Because the tables are over half a megabyte in size, there is considerable advantage in loading the most commonly used tables in shared memory.

 

A QM process will use the first map loaded as the default base map but this can be changed by use of the ECS.MAP command, typically from within the LOGIN script, or by use of the SET.ECS.MAP() function in a QMBasic program.

 

Where a hashed file has indices, the map used to sequence the indices internally can be specified when creating the file. The sort rules in this map will be used during all index operations for all users of the file but may be different from the map in use for other activity of each user's process such as sorting displayed output. An ECS mode file with alternate key indices cannot be opened unless the relevant character map is available. Modifying a map in a manner that changes the sort order may cause files that use that map to behave incorrectly. The safest approach is to rebuild the indices after modifying the map.

 

The base map is included in the standard QM download. Character maps for specific regional variations can be obtained via the downloads page of the openqm.com web site. The downloaded map should be saved with an uppercase name in the ecs-maps subdirectory of the QMSYS account. Downloaded maps should be treated as templates that should be reviewed before use to verify that they meet local requirements. Users who find a need to modify a map in a manner that might be of interest to other users are encouraged to send a brief description of the change to the OpenQM support email address.

 

 

Double Width Characters

 

Unicode defines some characters as being "wide" to indicate that they will occupy two columns on a display or printer. The character maps include a wide attribute that will be set on such characters.

 

The presence of wide characters has implications for format codes, headings/footings, and several QMBasic statements. The display width functions described below are supported in non-ECS systems but behave exactly as their standard single width character equivalents since no characters have the double width attribute in the 8-bit character set.

 

The FMTDW() and FMTDWS() functions apply format codes in a similar way to FMT() but the width is specified in terms of the display width instead of the number of characters.

 

The FOLDDW() and FOLDDWS() functions are similar to FOLD() but base the width of each string fragment on the display width rather than the number of characters.

 

The INPUTDW statement is similar to INPUT but the length limit is based on display width.

 

The SUBSTRDW() function performs substring extraction based on the display width of the extracted data.
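
The distinction between character count and display width can be seen in the Python sketch below. It is a model only: it takes the wide attribute from Python's unicodedata tables rather than from a QM character map, the function names are hypothetical, and it does not claim to reproduce the exact behaviour of FMTDW() or SUBSTRDW().

```python
import unicodedata


def char_width(ch: str) -> int:
    """Two columns for characters Unicode classes as wide or fullwidth, else one."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def display_width(text: str) -> int:
    """Total number of display columns occupied by text."""
    return sum(char_width(ch) for ch in text)


def fit_to_columns(text: str, columns: int) -> str:
    """Truncate or pad text so that it occupies exactly `columns` display columns."""
    out, used = [], 0
    for ch in text:
        width = char_width(ch)
        if used + width > columns:
            break
        out.append(ch)
        used += width
    return "".join(out) + " " * (columns - used)


sample = "\u65e5\u672c\u8a9eAB"   # three wide characters followed by "AB"
print(len(sample), display_width(sample))
print(repr(fit_to_columns(sample, 5)))
```

Here the five character string occupies eight columns, and fitting it to five columns keeps only two wide characters plus one padding space, because a third wide character would overflow the limit.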

 

 

Conversion Codes

 

There are three conversion codes that relate to transformation of data between its internal ECS form and various external forms:

 

The BS conversion code used as an output conversion transforms a character string to a byte string where each ECS character becomes two characters with values in the range 0 to 255. Most applications should not need to use this conversion except as described below. Used as an input conversion, byte pairs are combined to form an ECS character. The default behaviour of the BS conversion code is that the character pair representation of the data is in the byte ordering used by the hardware of the computer system on which the conversion is performed. Two variations, BSL and BSH, provide a low byte first and high byte first ordering respectively.

 

Although primarily intended for use on ECS mode systems, the BS conversion is also available on non-ECS systems. In this case, use as an input conversion with data that lies outside the 8 bit character range will return the low order 8 bits of each character and set a STATUS() value of 3.
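
The byte pair transformation can be modelled as shown below. This Python sketch is not QM code; bs_output and bs_input are hypothetical names, and the order argument stands in for the BSL ("little") and BSH ("big") variations, with the machine's native order as the default.

```python
import sys


def bs_output(text: str, order: str = sys.byteorder) -> str:
    """Output direction: each 16-bit character becomes two characters valued 0 to 255."""
    out = []
    for ch in text:
        out.extend(chr(b) for b in ord(ch).to_bytes(2, order))
    return "".join(out)


def bs_input(byte_string: str, order: str = sys.byteorder) -> str:
    """Input direction: successive character pairs are combined into one character."""
    pairs = zip(byte_string[0::2], byte_string[1::2])
    return "".join(
        chr(int.from_bytes(bytes([ord(a), ord(b)]), order)) for a, b in pairs
    )


euro = "\u20ac"
print([hex(ord(c)) for c in bs_output(euro, "little")])   # low byte first
print([hex(ord(c)) for c in bs_output(euro, "big")])      # high byte first
```

Forcing one ordering on both sides, rather than relying on the hardware default, keeps the result portable between machines with different byte orders.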

 

 

The MXUC conversion is similar to MX0C in that it translates character values to or from hexadecimal but the hexadecimal values are four digits, high byte first. The MBUC and MOUC conversions provide similar capabilities for binary (16 digits per character) and octal (6 digits per character) respectively.

 

Again, these codes are available in non-ECS systems. Performing an input conversion with data that would result in characters outside the 8 bit range will return a null string and a STATUS() value of 1.
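
The fixed digit counts can be illustrated with a short Python model. It is not QM code; the function names are hypothetical, and the use of uppercase hexadecimal digits is an assumption, since the text does not state the case of the output.

```python
# Conversion code -> (number base, digits per character, Python format letter)
SPEC = {
    "MXUC": (16, 4, "X"),    # hexadecimal, four digits, high byte first
    "MOUC": (8, 6, "o"),     # octal, six digits
    "MBUC": (2, 16, "b"),    # binary, sixteen digits
}


def oconv_model(text: str, code: str) -> str:
    """Output direction: each character becomes a fixed-width group of digits."""
    _, width, letter = SPEC[code]
    return "".join(format(ord(ch), f"0{width}{letter}") for ch in text)


def iconv_model(digits: str, code: str) -> str:
    """Input direction: each fixed-width group of digits becomes one character."""
    base, width, _ = SPEC[code]
    groups = [digits[i:i + width] for i in range(0, len(digits), width)]
    return "".join(chr(int(group, base)) for group in groups)


print(oconv_model("A\u20ac", "MXUC"))   # 004120AC
print(oconv_model("A", "MOUC"))         # 000101
print(oconv_model("A", "MBUC"))         # 0000000001000001
```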

 

 

The Xname conversion code applies a character encoding, translating between ECS data and an external representation such as UTF-8. The name may be followed by a period and one or more case insensitive character qualifiers that control the behaviour of the conversion. Used as an output conversion, data is converted from ECS to the specified form. Used as an input conversion, data is converted from its external form to ECS. This code is available on non-ECS systems but will be limited to the 8-bit character set. Invalid input data will result in the replacement character (U+FFFD) on ECS systems or a question mark on non-ECS systems and a STATUS() value of 1.

 

Users can add their own translation codes as described with the Xname conversion code.
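
The two directions of an encoding conversion, and the handling of invalid input, can be modelled with Python's own UTF-8 codec. This sketch is not QM code; the function names are hypothetical and the returned status value simply mirrors the STATUS() value of 1 described above for an ECS system.

```python
def to_external(text: str) -> bytes:
    """Output direction: internal characters to UTF-8 bytes."""
    return text.encode("utf-8")


def to_internal(data: bytes):
    """Input direction: UTF-8 bytes to characters.

    Returns (text, status). Invalid sequences are replaced by U+FFFD and
    reported with status 1; otherwise status is 0.
    """
    try:
        return data.decode("utf-8"), 0
    except UnicodeDecodeError:
        return data.decode("utf-8", errors="replace"), 1


print(to_external("\u20ac"))       # the euro sign needs three bytes in UTF-8
print(to_internal(b"A\xffB"))      # 0xFF can never appear in valid UTF-8
```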

 

 

Encryption

 

The ad hoc data encryption function, ENCRYPT(), encrypts a series of bytes. So long as the data passed to it contains no characters outside of the 8-bit set, no special action is needed by the programmer and the resultant encrypted string is fully compatible with the non-ECS version of QM. If data that may contain ECS characters is to be encrypted, it must first be converted to a byte string using the BS conversion code. Conversely, the DECRYPT() function decrypts data to a byte string which must be converted back to characters with the BS conversion code if the original data was encrypted in this way. Note that an application that decrypts data must know how it was encrypted. Note also that byte ordering may become significant if this differs between the system where the data is encrypted and that on which it is decrypted. The BS conversion code has options to force a specific byte ordering.

 

Field and record level encryption within QM data files is unaffected by ECS and encrypted data in non-ECS mode files is compatible between both system types.

 

 

Special Encodings

 

The Base64 and MD5 encodings provided by the B64 conversion code and the MD5() function operate on byte sequences and hence may require use of the BS conversion code as described above for encryption.
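
The effect of byte ordering on byte-oriented encodings can be demonstrated outside QM. The Python sketch below uses the UTF-16 codecs as a stand-in for a two-bytes-per-character byte string, which holds for the BMP characters used here; it is an illustration, not QM code.

```python
import base64
import hashlib

text = "caf\u00e9 \u20ac5"   # contains U+20AC, which lies outside the 8-bit set

low_first = text.encode("utf-16-le")    # two bytes per character, low byte first
high_first = text.encode("utf-16-be")   # two bytes per character, high byte first

print(base64.b64encode(low_first).decode("ascii"))
print(base64.b64encode(high_first).decode("ascii"))

# The same characters give different digests when the byte order differs.
same = hashlib.md5(low_first).hexdigest() == hashlib.md5(high_first).hexdigest()
print(same)   # False
```

Both sides of an exchange therefore need to agree on the byte ordering before comparing digests or decoding Base64 data.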

 

 

Transliteration

 

The ECS character maps include the ability to define a transliteration character or character pair for each codepoint. These are used by the QMBasic TRANSLITERATE() function to construct a representation of an ECS string using only characters from the 8 bit character set. This function maps each ECS character to one or two 8 bit characters. If it is necessary to perform more complex replacements, a QMBasic subroutine should be used, possibly as a user written conversion code.
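
A table driven transliteration of this kind can be sketched as follows. The Python code is a model, not QM code: the table entries are hypothetical examples, and the use of a question mark for characters with no entry is an assumption rather than documented behaviour.

```python
# Hypothetical fragment of a transliteration table; a real map defines its own entries.
TRANSLIT = {
    "\u0153": "oe",   # Latin small ligature oe
    "\u0142": "l",    # Latin small letter l with stroke
    "\u017d": "Z",    # Latin capital letter Z with caron
}


def transliterate_model(text: str, fallback: str = "?") -> str:
    """Replace each character above 255 by its one or two character equivalent."""
    out = []
    for ch in text:
        if ord(ch) <= 0xFF:
            out.append(ch)                        # already in the 8-bit set
        else:
            out.append(TRANSLIT.get(ch, fallback))
    return "".join(out)


print(transliterate_model("c\u0153ur"))           # coeur
print(transliterate_model("\u017dilina"))         # Zilina
```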

 

 

Hashed Files and Indices

 

Files that are to store ECS data must be created with the ECS keyword to the CREATE.FILE command. Existing non-ECS files can be converted to ECS mode using CONFIGURE.FILE. Files that are not in this mode, including those created by earlier versions of QM, can be accessed by the ECS version of QM but cannot store ECS data characters. Any attempt to write a record containing such characters to a non-ECS file will fail.

 

Creating a file in ECS mode does not result in a file twice as large as its non-ECS equivalent. Each record is stored in either 8-bit or ECS mode depending on the data in the record. The record ids are similarly held in whichever format is required. The only exception to this is that ECS mode files that use field level encryption always store the records in ECS format. This is necessary in order to correctly maintain the data in fields to which the user is denied access when updating a record.
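
The space saving from choosing the format per record can be modelled as below. This Python sketch only illustrates the principle; the actual on-disk layout of QM hashed files is not described here, so the byte layout, the byte order and the function name are all assumptions.

```python
def stored_form(record: str) -> bytes:
    """Model only: one byte per character when every character fits in 8 bits,
    otherwise two bytes per character (low byte first is an arbitrary choice)."""
    if all(ord(ch) <= 0xFF for ch in record):
        return bytes(ord(ch) for ch in record)
    return b"".join(ord(ch).to_bytes(2, "little") for ch in record)


plain = "SMITH\u00feJOHN"        # 8-bit data only (U+00FE is used here as a field mark)
extended = "SMITH\u00fe\u0141ukasz"   # contains U+0141, outside the 8-bit set

print(len(plain), len(stored_form(plain)))
print(len(extended), len(stored_form(extended)))
```

Only the record that actually contains an extended character doubles in size, which is why an ECS mode file holding mostly 8-bit data stays close to the size of its non-ECS equivalent.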

 

Alternate key indices for ECS mode files behave similarly, storing the index data in ECS form only if it contains extended characters. The indexed values are always stored in ECS form for best performance.

 

 

Directory Files

 

Encoding rules can be specified for directory files using the same names as in the Xname conversion code outlined above. The encoding name can be specified in field 7 of the F-type VOC entry that defines the file. This can be overridden by use of the ENCODING clause to the QMBasic OPEN or OPENPATH statements and this in turn can be overridden by use of the ENCODING clause in a read or write operation. If no encoding is specified or it is overridden by use of ENCODING "NULL", the data is treated as a byte stream, writing only the low order byte from each character and returning a STATUS() value of ER_ECS_DATA if this includes characters outside the 8 bit range. Note that use of a null string as the encoding name in any QMBasic statement that supports the ENCODING clause is equivalent to not having the ENCODING clause.
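
The order of precedence described above can be expressed as a small resolution function. The Python sketch below is a model of the rules as stated, not QM code; the function name is hypothetical and the encoding names used in the example calls are placeholders.

```python
def effective_encoding(voc_field7: str = "", open_clause: str = "", io_clause: str = ""):
    """Return the encoding name that applies, or None for byte stream handling.

    voc_field7  - encoding from field 7 of the F-type VOC entry
    open_clause - ENCODING clause on OPEN or OPENPATH
    io_clause   - ENCODING clause on the individual read or write
    """
    name = voc_field7
    for override in (open_clause, io_clause):
        if override != "":          # a null string is the same as no ENCODING clause
            name = override
    if name == "" or name == "NULL":
        return None                 # data is treated as a byte stream
    return name


print(effective_encoding("UTF8"))                     # VOC setting applies
print(effective_encoding("UTF8", "NULL"))             # overridden at open: byte stream
print(effective_encoding("UTF8", "NULL", "LATIN1"))   # overridden again for one operation
```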

 

When using a directory file, the application developer must determine how the mark characters are to be handled. If the data in the file uses characters 251 to 255 in their Unicode definition to represent the accented characters found in European languages, the encoding option to swap the mark characters into the private use area must be enabled. Alternatively, if these characters are to be treated as the multivalue marks, this option must not be enabled.

 

Encoding can be used on the non-ECS version of QM but will be restricted to the 8-bit character set.

 

 

Sequential File I/O, including devices such as serial ports

 

Encoding rules can be specified for sequential files in a similar way to directory files, via the VOC F-type item or by use of the ENCODING clause to OPENSEQ, READSEQ and WRITESEQ. Note that READBLK and WRITEBLK are byte string operations and are not affected by encoding settings.

 

 

Select Lists

 

Select lists in memory (numbered lists or Pick style select list variables) may contain ECS characters. When saving a select list to disk, the target file must either be an ECS mode hashed file or a directory file with an encoding defined.

 

 

Terminal I/O

 

When using a QMConsole session on Windows systems, the font must be set to Lucida Console if characters outside the 8-bit set are to be shown correctly. Encodings set by the PTERM command are accepted but ignored.

 

For direct telnet or serial port connections to QM or entry from the operating system shell other than on Windows, the terminal connection normally operates in 8-bit mode but can be switched to UTF-8 by use of the PTERM command or by entering QM with the -utf8 command line option.

 

Except when the connection is set to operate in binary mode, QM will relocate characters 251 to 255 in the input data to code points U+F8FB to U+F8FF. These are the accented characters common in European languages that are displaced by the conventional definition of the mark characters. The opposite transformation will occur on output to the terminal. Thus a user entering, for example, a u-umlaut (ü) will see this character reflected back to their screen correctly even though internally it has been transformed to be the character with code point value U+F8FC. In cases where a user needs to be able to enter the field mark, value mark or subvalue mark from the keyboard, these can be entered using Ctrl-^, Ctrl-] or Ctrl-\ respectively if this feature is enabled (see the PTERM command).

 

 

Sockets

 

Socket connections opened using OPEN.SOCKET or ACCEPT.SOCKET.CONNECTION are byte string interfaces. Data written to a socket should, where necessary, be encoded into an appropriate format for transmission such as UTF-8 using, for example, OCONV(). Writing data with characters outside the 8-bit range will transmit the least significant 8 bits of the character value. Similarly, incoming data may need to be converted from its transmission format.
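
The consequence of writing unencoded data to a byte string interface can be shown with a short Python model. It is not QM code and the function names are hypothetical; UTF-8 is used as the transmission format because it is the example given above.

```python
def sent_without_encoding(text: str) -> bytes:
    """Only the least significant 8 bits of each character are transmitted."""
    return bytes(ord(ch) & 0xFF for ch in text)


def sent_as_utf8(text: str) -> bytes:
    """Encode before writing so that every character survives transmission."""
    return text.encode("utf-8")


def received_from_utf8(data: bytes) -> str:
    """Convert incoming data from its transmission format."""
    return data.decode("utf-8")


message = "Total: \u20ac10"
print(sent_without_encoding(message))              # euro sign degraded to byte 0xAC
print(sent_as_utf8(message))                       # euro sign sent as three bytes
print(received_from_utf8(sent_as_utf8(message)) == message)
```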

 

 

Printing

 

The SETPTR command includes an ENCODING option to set the character encoding to be used for output to individual print units. If no encoding is specified, only the low order 8 bits of each character are output. The encoding can also be set with the QMBasic SETPU statement. Printer encoding is available on non-ECS systems but restricted to the 8 bit character set.

 

 

Data Editing

 

The ^nnn notation of the ED editor and MODIFY is extended to allow ^Xnnnn to enter a four hexadecimal digit character value.

 

The "quote char" function of the SED editor is extended to allow Xnnnn to enter a four hexadecimal digit character value.

 

 

QMBasic

 

QMBasic program source code must be written in the 8-bit character set. The only exception is that string constants may contain characters from the extended character set. This will require that the source file is created or encoded in a manner that supports ECS characters.

 

Programs compiled on the non-ECS version of QM will run unchanged on the ECS version. Programs compiled on the ECS version will run on the non-ECS version so long as they do not attempt to use any ECS specific features.

 

 

QMNet

 

QMNet is unaffected by ECS except that a non-ECS mode system cannot open an ECS mode file on the remote server.

 

 

QMClient

 

The QMClient C API includes a set of wide character functions that can accept or return ECS data. Connections are compatible between ECS mode and non-ECS mode systems with the exception that an ECS mode file cannot be opened by a non-ECS mode client. Data returned from an ECS mode server for subroutine calls and executed commands initiated from a non-ECS mode client must not contain characters outside the 8-bit set.

 

 

Replication

 

The data replication system can replicate ECS files so long as the target file on the subscriber system is either an ECS mode hashed file or a directory file with an encoding.

 

When replicating directory files, the state of the mark mapping mode is taken into account such that the write on the subscriber is performed with the same mark mapping state as the corresponding write on the publisher. Replicating a hashed file to a directory file target will perform the write on the subscriber with mark mapping enabled.

 

Writes performed to directory files on the subscriber always use any encoding set in the VOC entry for the file on the subscriber system. Encodings specified for the publisher write are independent of how the data is written on the subscriber.

 

 

Related Settings

 

Alternative day and month names can be set for non-English languages by use of the SET.LANGUAGE command. This requires at a minimum that messages 1500 to 1502 (month names, day names, ordinal dates) have been translated to the relevant language. See Multi-language applications for more details.

 

Alternative national language currency and numeric format options can be set by use of the NLS command.

 

 

Character Maps on Non-ECS Mode Systems

 

Non-ECS mode QM systems support minimal character mapping functionality, allowing developers to change the uppercase/lowercase pairing, sort order and character attributes. This is particularly relevant to applications that need to support the accented characters found in European languages but do not need the full capabilities of ECS mode.

 

When a QM process starts, the map specified by the CHARMAP configuration parameter is loaded or, if this parameter is not set, a default map is loaded. The CHAR.MAP command can be used to load an alternative map dynamically but users should beware that, because non-ECS systems have only a single active character map, this might invalidate alternate key indices that were created or updated using a different map. Indices may have to be rebuilt. As a general rule, an application should use just one character map.

 

The default character map treats only upper and lower case A to Z as letters and sorts based on the 8-bit character value. The EDIT.MAP command can be used to create or modify a character map. The map will be stored in the ascii-maps subdirectory of the QMSYS account.

 

As an example, an application using Windows code page 1252 and needing to use the Spanish Ñ and ñ characters could create a map in which these characters are given the correct uppercase/lowercase pairing, marked as being letters and placed in the correct sort order.
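
The effect of such a map can be modelled in Python as shown below. This is not the QM map file format and not QM code; the names are hypothetical, and the sort weights are simply one way of placing the two characters immediately after N and n.

```python
# Hypothetical map fragment for code page 1252; not the QM map file format.
N_TILDE_UPPER = "\u00d1"   # character value 209 in code page 1252
N_TILDE_LOWER = "\u00f1"   # character value 241 in code page 1252

UPPER = {N_TILDE_LOWER: N_TILDE_UPPER}          # added uppercase/lowercase pairing
EXTRA_LETTERS = {N_TILDE_UPPER, N_TILDE_LOWER}  # added to the A-Z, a-z letters


def is_letter(ch: str) -> bool:
    return "A" <= ch <= "Z" or "a" <= ch <= "z" or ch in EXTRA_LETTERS


def upcase_model(text: str) -> str:
    """Uppercase a-z as the default map would, plus the added pairing."""
    return "".join(UPPER.get(ch, ch.upper() if "a" <= ch <= "z" else ch) for ch in text)


def sort_weight(ch: str) -> float:
    """Default weight is the character value; the added letters follow N and n."""
    if ch == N_TILDE_UPPER:
        return ord("N") + 0.5
    if ch == N_TILDE_LOWER:
        return ord("n") + 0.5
    return ord(ch)


def sort_key(text: str):
    return [sort_weight(ch) for ch in text]


words = ["oso", "\u00f1u", "nube"]
print(sorted(words))                 # by character value: the n-tilde word comes last
print(sorted(words, key=sort_key))   # with the map: it falls between n and o
print(upcase_model("ni\u00f1o"))
```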