
Appendix K - danMARC2 character repertoire

1. Introduction

The danMARC2 character repertoire was constructed in 1993, when the Danish union catalogue Danbib was formed by merging the union databases for public libraries and for research libraries.

It was decided to use Unicode as the reference character set. The 8-bit character set ISO 8859-1 (Latin 1) was chosen as the basic character set, and a technique was introduced for extending the character repertoire to all Unicode characters in the Basic Multilingual Plane (BMP). The BMP consists of 256 rows, each with 256 cells, which gives 65,536 cells in total. Each cell may in principle contain a character. Unfortunately, the practice introduced for diacritical marks was not (and still is not) fully in line with Unicode.

2. Scope

This document describes the danMARC2 character repertoire, with a focus on syntax and on mapping to and from Unicode.

A more detailed explanation, including the historical background, is available in Danish: http://www.danbib.dk/index.php?doc=tegnsaet

3. Character repertoire

Unicode                  danMARC2        Comment
U+0020 to U+00FF         20 to FF        BMP row 0 corresponds to ISO 8859-1 (Latin 1). Three characters have special meaning - see below
U+002A                   @*              * - asterisk
U+0040                   @@              @ - commercial at
U+00A4                   @¤              ¤ - currency sign
U+0100 to U+FFFF         @0100 to @FFFF  Standard representation of Unicode characters outside BMP row 0
Characters outside BMP                   Cannot be represented in danMARC2

 

4. Ordinary characters

The basic character set ISO 8859-1 corresponds to row zero in the basic multilingual plane.

Unicode characters from row zero (apart from the three special characters described below) are mapped to the basic character set ISO 8859-1. Unicode characters from other rows are represented by a sequence of five characters from ISO 8859-1, where the first character is "@". The next two characters give the hexadecimal value of the row, and the last two characters give the hexadecimal value of the cell.

Example: Greek letters are placed in row three. Greek capital letter omega (U+03A9) is represented by the sequence @03A9 in danMARC2.
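
The row and cell arithmetic can be illustrated with a minimal Python sketch. The function name at_sequence is hypothetical and not part of danMARC2, and the use of upper-case hexadecimal digits is an assumption based on the examples in this document.

```python
def at_sequence(ch: str) -> str:
    """Return the five-character @-notation for one BMP character outside row 0.

    Hypothetical helper for illustration only.
    """
    cp = ord(ch)
    if not 0x0100 <= cp <= 0xFFFF:
        raise ValueError("only U+0100 to U+FFFF use the @XXXX form")
    # The row is the high byte of the code point, the cell is the low byte.
    row, cell = divmod(cp, 256)
    return f"@{row:02X}{cell:02X}"


# Greek capital letter omega lives in row 0x03, cell 0xA9.
print(at_sequence("\u03a9"))  # prints @03A9
```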


5. Three special characters
The asterisk is used to denote subfields in some download formats. If an asterisk is needed in the bibliographic record, it must be represented by the two characters "@*".

The commercial at is used to denote special encoding. If a commercial at is needed in the bibliographic record, it must be represented by the two characters "@@".

The currency sign is used as a sorting mark. If a currency sign is needed in the bibliographic record, it must be represented by the two characters "@¤".
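
Taken together, the rules for ordinary and special characters suggest an encoder along the following lines. This is a sketch only: the names SPECIALS, encode_char and encode_text are hypothetical, diacritics are not handled, and rejecting characters below U+0020 is an assumption, since this document does not say how control characters are treated.

```python
# Escapes for the three special characters.
SPECIALS = {"*": "@*", "@": "@@", "\u00a4": "@\u00a4"}


def encode_char(ch: str) -> str:
    """Encode one Unicode character as danMARC2 text (no diacritic handling)."""
    if ch in SPECIALS:
        return SPECIALS[ch]
    cp = ord(ch)
    if 0x0020 <= cp <= 0x00FF:
        return ch                 # BMP row 0 maps directly to ISO 8859-1
    if 0x0100 <= cp <= 0xFFFF:
        return f"@{cp:04X}"       # two hex digits for the row, two for the cell
    # Assumption: control characters are rejected. Characters outside the
    # BMP cannot be represented in danMARC2 at all.
    raise ValueError(f"U+{cp:04X} cannot be represented in this sketch")


def encode_text(text: str) -> str:
    """Encode a whole string character by character."""
    return "".join(encode_char(ch) for ch in text)


print(encode_text("C* @ \u03a9"))  # prints C@* @@ @03A9
```

The result contains only characters from U+0020 to U+00FF, so it can be stored as ISO 8859-1.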


6. Additional characters
One of the characters in the Danish alphabet (å, Å) also exists in an old form (aa, Aa). To make it possible to distinguish the old å from a double a, the sequences "@å" and "@Å" are used.
An alternative representation exists, using the medieval characters U+A733 and U+A732.

Practice in danMARC2 has been to use the character sequences @UD8 and @UD9 for superscript and subscript. This practice has now been changed in favour of the superscript and subscript characters from Unicode, which are represented in danMARC2 using standard @-notation.


7. Diacritics
All combining characters from Unicode may be used in danMARC2, but a special practice is followed for historical reasons. See below.

Practice for swapping single version and combining version
A number of diacritical marks have two representations in Unicode: a single version and a combining version. They may also exist as part of composite characters, e.g. ä.
Practice in danMARC2 is to use the composite character whenever one exists. Otherwise, for most diacritics, the single version is used where Unicode prescribes the combining version and vice versa, i.e. danMARC2 uses the opposite version to the one dictated by Unicode.

Danish name  Single version     Code    Combining version            Code    Danish practice
circumflex   circumflex accent  U+005E  combining circumflex accent  U+0302  swap
understreg   low line           U+005F  combining low line           U+0332  swap
grave        grave accent       U+0060  combining grave accent       U+0300  swap
tilde        tilde              U+007E  combining tilde              U+0303  don't swap
umlaut       diaeresis          U+00A8  combining diaeresis          U+0308  swap
macron       macron             U+00AF  combining macron             U+0304  swap
aigu         acute accent       U+00B4  combining acute accent       U+0301  swap
cedille      cedilla            U+00B8  combining cedilla            U+0327  swap
hacek        caron              U+02C7  combining caron              U+030C  swap
breve        breve              U+02D8  combining breve              U+0306  swap
overcirkel   ring above         U+02DA  combining ring above         U+030A  swap
højrekrog    ogonek             U+02DB  combining ogonek             U+0328  swap
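
One possible reading of the table is a symmetric lookup that exchanges the single and the combining code point for every row marked "swap". The sketch below is an illustration under that assumption. The names SWAP_PAIRS and swap_diacritic are hypothetical, and the preference for composite characters is not modelled.

```python
# Single version -> combining version, for the rows marked "swap".
# Tilde (U+007E / U+0303) is deliberately left out: it is not swapped.
SWAP_PAIRS = {
    0x005E: 0x0302,  # circumflex
    0x005F: 0x0332,  # low line
    0x0060: 0x0300,  # grave
    0x00A8: 0x0308,  # diaeresis
    0x00AF: 0x0304,  # macron
    0x00B4: 0x0301,  # acute
    0x00B8: 0x0327,  # cedilla
    0x02C7: 0x030C,  # caron
    0x02D8: 0x0306,  # breve
    0x02DA: 0x030A,  # ring above
    0x02DB: 0x0328,  # ogonek
}

# Make the lookup symmetric, so the same table serves both directions.
SWAP = {**SWAP_PAIRS, **{comb: single for single, comb in SWAP_PAIRS.items()}}


def swap_diacritic(cp: int) -> int:
    """Return the swapped code point, or cp itself if it is not swapped."""
    return SWAP.get(cp, cp)


print(f"U+{swap_diacritic(0x0302):04X}")  # prints U+005E
```

Because the swap is its own inverse, applying swap_diacritic twice returns the original code point.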


Praxis for swapping placement of combining characters
Unicode dictates that a combining character is placed after the character it modifies. DanMARC2 practice is to place the combining character in front of the character it modifies.
Normal practice in Danish records is to use only one combining character. Nevertheless, if more than one is needed, this mirroring of placement is generalised as shown:

Unicode:    <base character><combining character 1>...<combining character n>
danMARC2:   <combining character n>...<combining character 1><base character>

Note: The order of the combining characters matters. If a letter with both circumflex and acute is to be represented in Unicode, the order of the combining characters determines which one is topmost.
 

Graphics  Comment                Unicode sequence      danMARC2
ấ         acute is topmost       U+0061 U+0302 U+0301  @0301@0302a
á̂         circumflex is topmost  U+0061 U+0301 U+0302  @0302@0301a

See the Unicode documentation for further details.
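
The mirroring of placement can be sketched as follows. The function name mirror_combining is hypothetical. The sketch reproduces only the two rows of the table above: it writes each combining character in @-notation, leaves the base character unchanged, and applies neither the single/combining swap nor the preference for composite characters. It also assumes that a combining character is any BMP character with a non-zero canonical combining class.

```python
import unicodedata


def mirror_combining(text: str) -> str:
    """Move combining characters in front of their base, in reverse order."""
    out = []
    i = 0
    while i < len(text):
        base = text[i]
        i += 1
        # Collect the combining characters that follow this base character.
        marks = []
        while i < len(text) and unicodedata.combining(text[i]) != 0:
            marks.append(text[i])
            i += 1
        # danMARC2 order: <combining n>...<combining 1><base>
        out.extend(f"@{ord(mark):04X}" for mark in reversed(marks))
        out.append(base)
    return "".join(out)


print(mirror_combining("a\u0302\u0301"))  # prints @0301@0302a
```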
 

8. Mapping from danMARC2 to Unicode

dM2             Unicode           Comment
20 to FF        U+0020 to U+00FF  ISO 8859-1 (Latin-1) corresponds to row 0 of the BMP
*                                 Special character in dM2
@                                 Special character in dM2
¤                                 Special character in dM2 (note: may be changed!)
@0100 to @FFFF  U+0100 to U+FFFF  Standard mapping
@*              U+002A            asterisk
@@              U+0040            commercial at
@¤              U+00A4            currency sign - not used in Danish records (to be confirmed)
@å              U+A733            Old Danish å - alternative form in dM2: @A733
@Å              U+A732            Old Danish Å - alternative form in dM2: @A732

Three characters (*, @ and ¤) have special meaning in danMARC2, and are not mapped to Unicode.

Old Danish å and Å are represented as U+A733 and U+A732 in Unicode. As a consequence, each may have two representations in danMARC2: @å or @A733, and @Å or @A732.
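
The mapping in this section could be applied with a small decoder such as the one below. It is a sketch under stated assumptions: the names ESCAPES and decode_text are hypothetical, diacritic swapping and reordering are not handled, @XXXX values below 0100 are rejected because the table starts at @0100, and an unescaped * or ¤ is passed through unchanged, leaving its special meaning to the caller.

```python
import string

# Two-character sequences and their Unicode equivalents.
ESCAPES = {
    "*": "*",
    "@": "@",
    "\u00a4": "\u00a4",
    "\u00e5": "\ua733",  # @å, old Danish å
    "\u00c5": "\ua732",  # @Å, old Danish Å
}
HEX_DIGITS = set(string.hexdigits)


def decode_text(dm2: str) -> str:
    """Decode danMARC2 @-sequences to Unicode (no diacritic handling)."""
    out = []
    i = 0
    while i < len(dm2):
        if dm2[i] != "@":
            out.append(dm2[i])      # includes unescaped * and ¤ (assumption)
            i += 1
            continue
        quad = dm2[i + 1:i + 5]
        if len(quad) == 4 and all(c in HEX_DIGITS for c in quad):
            cp = int(quad, 16)
            if cp < 0x0100:
                raise ValueError(f"@{quad} is outside @0100 to @FFFF")
            out.append(chr(cp))
            i += 5
            continue
        nxt = dm2[i + 1:i + 2]      # empty string if "@" ends the input
        if nxt not in ESCAPES:
            raise ValueError(f"unknown sequence at position {i}")
        out.append(ESCAPES[nxt])
        i += 2
    return "".join(out)


print(decode_text("C@* @@ @03A9"))  # prints C* @ Ω
```

Both representations of old Danish å decode to the same character here: decode_text("@å") and decode_text("@A733") each return U+A733.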
 

 

 

 


 




 

 

                                 
