Appendix K - danMARC2 character repertoire

1. Introduction

The danMARC2 character repertoire was constructed in 1993, when the Danish union catalogue Danbib was formed by merging the union databases for public libraries and for research libraries.

It was decided to use Unicode as the reference character set. The 8-bit character set ISO 8859-1 (Latin 1) was chosen as the basic character set, and a technique was introduced for extending the character repertoire to all Unicode characters in the Basic Multilingual Plane (BMP). The BMP consists of 256 rows, each with 256 cells, giving a total of 65,536 cells, each of which may in principle contain a character. Unfortunately, the practice introduced for diacritical marks was not (and still is not) fully in line with Unicode.

2. Scope

This document describes the danMARC2 character repertoire, with a focus on its syntax and on the mapping to and from Unicode.

A more detailed explanation, including the historical background, is available in Danish: http://www.danbib.dk/index.php?doc=tegnsaet

3. Character repertoire

Unicode	danMARC2	Comment
U+0020 to U+00FF	20 to FF	BMP row 0 corresponds to ISO 8859-1 (Latin 1). Three characters have a special meaning (see below)
U+002A	@*	* - asterisk
U+0040	@@	@ - commercial at
U+00A4	@¤	¤ - currency sign
U+0100 to U+FFFF	@0100 to @FFFF	Standard representation of Unicode characters outside BMP row 0
Characters outside BMP		Cannot be represented in danMARC2

4. Ordinary characters

The basic character set ISO 8859-1 corresponds to row zero in the basic multilingual plane.

Unicode characters from row zero (apart from the three special characters described below) are mapped to the basic character set ISO 8859-1. Unicode characters from other rows are represented by a sequence of five characters from ISO 8859-1: the first character is "@", the next two characters give the hexadecimal value of the row, and the last two characters give the hexadecimal value of the cell.

Example: Greek letters are placed in row three. Greek capital letter omega (U+03A9) is represented by the sequence @03A9 in danMARC2.
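
The row and cell rule above can be sketched in Python as follows. This is a minimal illustration, not part of the danMARC2 specification; the function name to_at_notation is hypothetical.

```python
def to_at_notation(ch: str) -> str:
    """Return the five-character @-notation for one BMP character outside row 0.

    Hypothetical helper: the name and the error handling are assumptions.
    """
    code_point = ord(ch)
    if not 0x0100 <= code_point <= 0xFFFF:
        raise ValueError("only U+0100 to U+FFFF use the @-notation")
    # Split the code point into its BMP row and cell, each one byte wide.
    row, cell = divmod(code_point, 256)
    return f"@{row:02X}{cell:02X}"


print(to_at_notation("\u03A9"))  # prints @03A9 (Greek capital letter omega)
```

Characters in row zero and characters outside the BMP are rejected here, since the former are written directly in ISO 8859-1 and the latter cannot be represented in danMARC2.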

5. Three special characters
The asterisk is used as subfield notation in some download formats. If an asterisk is needed in the bibliographic record, it must be represented by the two characters "@*".

The commercial at is used to denote special encoding. If a commercial at is needed in the bibliographic record, it must be represented by the two characters "@@".

The currency sign is used as a sorting mark. If a currency sign is needed in the bibliographic record, it must be represented by the two characters "@¤".
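
As a rough sketch of the three escape rules, the following Python function prefixes each literal special character with "@". The names SPECIAL_ESCAPES and escape_specials are hypothetical, and the sketch assumes the input is literal text that contains no subfield marks, sorting marks or existing @-sequences.

```python
# Literal character -> danMARC2 escape, as listed in this section.
SPECIAL_ESCAPES = {
    "*": "@*",            # asterisk
    "@": "@@",            # commercial at
    "\u00A4": "@\u00A4",  # currency sign
}


def escape_specials(text: str) -> str:
    """Escape the three special characters in literal text (hypothetical helper)."""
    return "".join(SPECIAL_ESCAPES.get(ch, ch) for ch in text)


print(escape_specials("user@example.org *"))  # prints user@@example.org @*
```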

6. Additional characters
One of the letters of the Danish alphabet (å, Å) also exists in an older form (aa, Aa). To distinguish this old å from a double a, the sequences "@å" and "@Å" are used.
An alternative representation exists, using the medieval characters U+A733 and U+A732.

The practice in danMARC2 has been to use the character sequences @UD8 and @UD9 for superscript and subscript. The practice has now changed: the superscript and subscript characters from Unicode are used instead, represented in danMARC2 using the standard @-notation.

7. Diacritics
All combining characters from Unicode may be used in danMARC2, but a special practice is followed for historical reasons. See below.

Practice for swapping the single version and the combining version
A number of diacritical marks have two representations in Unicode: a single version and a combining version. They may also exist as part of composite characters, e.g. ä.
The practice in danMARC2 is to use composite characters whenever possible. Otherwise, for most diacritics, the single version is used as the combining version and vice versa, i.e. the opposite of the version dictated by Unicode.

Danish name	Single version	Code	Combining version	Code	Danish practice
circumflex	circumflex accent	U+005E	combining circumflex accent	U+0302	swap
understreg	low line	U+005F	combining low line	U+0332	swap
grave	grave accent	U+0060	combining grave accent	U+0300	swap
tilde	tilde	U+007E	combining tilde	U+0303	don't swap
umlaut	diaeresis	U+00A8	combining diaeresis	U+0308	swap
macron	macron	U+00AF	combining macron	U+0304	swap
aigu	acute accent	U+00B4	combining acute accent	U+0301	swap
cedille	cedilla	U+00B8	combining cedilla	U+0327	swap
hacek	caron	U+02C7	combining caron	U+030C	swap
breve	breve	U+02D8	combining breve	U+0306	swap
overcirkel	ring above	U+02DA	combining ring above	U+030A	swap
højrekrog	ogonek	U+02DB	combining ogonek	U+0328	swap
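
The table can be expressed as a lookup that exchanges the two versions of each swapped diacritic. The Python sketch below only performs the code point exchange; it does not decide when the swap applies (for example, whether a composite character should be used instead), and the names SWAP_PAIRS, SWAP_TABLE and swap_diacritics are hypothetical.

```python
# (single version, combining version) pairs marked "swap" in the table.
# Tilde (U+007E / U+0303) is left out on purpose: the table says "don't swap".
SWAP_PAIRS = [
    (0x005E, 0x0302),  # circumflex
    (0x005F, 0x0332),  # low line
    (0x0060, 0x0300),  # grave
    (0x00A8, 0x0308),  # diaeresis
    (0x00AF, 0x0304),  # macron
    (0x00B4, 0x0301),  # acute
    (0x00B8, 0x0327),  # cedilla
    (0x02C7, 0x030C),  # caron
    (0x02D8, 0x0306),  # breve
    (0x02DA, 0x030A),  # ring above
    (0x02DB, 0x0328),  # ogonek
]

# Build a two-way translation table so the same function maps in both directions.
SWAP_TABLE = {}
for single, combining in SWAP_PAIRS:
    SWAP_TABLE[single] = combining
    SWAP_TABLE[combining] = single


def swap_diacritics(text: str) -> str:
    """Exchange single and combining versions of the swapped diacritics."""
    return text.translate(SWAP_TABLE)
```

Because the table is symmetric, applying swap_diacritics twice returns the original string.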

Practice for swapping the placement of combining characters
Unicode dictates that a combining character is placed after the character it modifies. The danMARC2 practice is to put the combining character in front of the character it modifies.
Normal practice in Danish records is to use only one combining character. Nevertheless, if more than one is needed, this mirroring of the placement is generalised as shown:

Unicode: <base character><combining character 1>…<combining character n>
danMARC2: <combining character n>…<combining character 1><base character>

Note: The order of the combining characters is significant. If a letter with both a circumflex and an acute accent is to be represented in Unicode, the order of the combining characters determines which one is topmost.

Graphics	Comment	Unicode sequence	danMARC2
ấ	acute is topmost	U+0061 U+0302 U+0301	@0301@0302a
á̂	circumflex is topmost	U+0061 U+0301 U+0302	@0302@0301a

See the Unicode documentation for further details.
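
The mirrored placement can be sketched in Python as follows. The sketch assumes the input is a Unicode string in which each base character is followed by its combining marks, and it uses unicodedata.combining from the standard library to recognise those marks. The function names mirror_combining and show_at_notation are hypothetical, and the swap of single and combining versions is not applied here.

```python
import unicodedata


def mirror_combining(text: str) -> str:
    """Move each run of combining marks in front of its base, in reverse order."""
    out = []
    base = None   # current base character, if any
    marks = []    # combining marks collected for the current base
    for ch in text:
        if base is not None and unicodedata.combining(ch):
            marks.append(ch)
        else:
            if base is not None:
                out.append("".join(reversed(marks)) + base)
            base, marks = ch, []
    if base is not None:
        out.append("".join(reversed(marks)) + base)
    return "".join(out)


def show_at_notation(text: str) -> str:
    """Write characters above U+00FF in @-notation so the result is readable."""
    return "".join(f"@{ord(ch):04X}" if ord(ch) > 0xFF else ch for ch in text)


# The two rows of the table above:
print(show_at_notation(mirror_combining("a\u0302\u0301")))  # prints @0301@0302a
print(show_at_notation(mirror_combining("a\u0301\u0302")))  # prints @0302@0301a
```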

8. Mapping from danMARC2 to Unicode

dM2	Unicode	Comment
20 to FF	U+0020 to U+00FF	ISO 8859-1 (Latin-1) corresponds to row 0 of the BMP
*		Special character in dM2
@		Special character in dM2
¤		Special character in dM2 (note: may be changed!)
@0100 to @FFFF	U+0100 to U+FFFF	Standard mapping
@*	U+002A	asterisk
@@	U+0040	commercial at
@¤	U+00A4	currency sign, not used in Danish records (to be confirmed)
@å	U+A733	old Danish å; alternative form in dM2: @A733
@Å	U+A732	old Danish Å; alternative form in dM2: @A732

Three characters (*, @ and ¤) have a special meaning in danMARC2 and are not mapped to Unicode.

Old Danish å and Å are represented as U+A733 and U+A732 in Unicode. As a consequence, each has two representations in danMARC2: @å or @A733, and @Å or @A732.
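
A decoder following the mapping table might look like the Python sketch below. It is an illustration under stated assumptions, not a reference implementation: the names AT_ESCAPES and decode_danmarc2 are hypothetical, hexadecimal digits are assumed to be upper case, unescaped * and ¤ are rejected because the table gives them no Unicode mapping, and neither the diacritic swap nor the mirrored placement of combining characters is undone.

```python
# Character following "@" -> Unicode character, per the mapping table.
AT_ESCAPES = {
    "*": "*",
    "@": "@",
    "\u00A4": "\u00A4",  # currency sign
    "\u00E5": "\uA733",  # @å -> old Danish å
    "\u00C5": "\uA732",  # @Å -> old Danish Å
}
HEX_DIGITS = set("0123456789ABCDEF")


def decode_danmarc2(data: str) -> str:
    """Map a danMARC2 character string to Unicode (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == "@":
            nxt = data[i + 1:i + 2]
            quad = data[i + 1:i + 5]
            if nxt in AT_ESCAPES:
                out.append(AT_ESCAPES[nxt])
                i += 2
            elif len(quad) == 4 and set(quad) <= HEX_DIGITS and int(quad, 16) >= 0x0100:
                out.append(chr(int(quad, 16)))
                i += 5
            else:
                raise ValueError(f"unknown @-sequence at position {i}")
        elif ch in ("*", "\u00A4"):
            raise ValueError(f"special character at position {i} has no Unicode mapping")
        else:
            out.append(ch)  # 20 to FF map directly to U+0020 to U+00FF
            i += 1
    return "".join(out)
```

With this sketch, both "@\u00E5" and "@A733" decode to U+A733, which reflects the two representations of old Danish å.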

Publisher: Dansk BiblioteksCenter as

2nd edition, May 1998.
