Appendix K - danMARC2 character repertoire
1. Introduction
The danMARC2 character repertoire was constructed in 1993, when the Danish
union catalogue Danbib was formed by merging the union databases for public
libraries and for research libraries.
It was decided to use Unicode as the reference character set. The 8-bit character
set ISO 8859-1 (Latin 1) was chosen as the basic character set, and a technique
for expanding the character repertoire to all Unicode characters in the Basic
Multilingual Plane (BMP) was introduced. The BMP consists of 256 rows, each with 256
cells, for a total of 65,536 cells, each of which may in principle
contain a character. Unfortunately, the practice introduced for diacritical marks was
not (and still is not) fully in line with Unicode.
2. Scope
This document describes the danMARC2 character repertoire, with a
focus on syntax and on mapping to and from Unicode.
A more detailed explanation, including the historical background, is available in Danish:
http://www.danbib.dk/index.php?doc=tegnsaet
3. Character repertoire
Unicode | danMARC2 | Comment |
U+0020 to U+00FF | 20 to FF | BMP row 0 corresponds to ISO 8859-1 (Latin 1). Three characters have special meaning - see below |
U+002A | @* | * - asterisk |
U+0040 | @@ | @ - commercial at |
U+00A4 | @¤ | ¤ - currency sign |
U+0100 to U+FFFF | @0100 to @FFFF | Standard representation of Unicode characters outside BMP row 0 |
Characters outside BMP | | Cannot be represented in danMARC2 |
4. Ordinary characters
The basic character set ISO 8859-1 corresponds to
row zero of the Basic Multilingual Plane.
Unicode characters from row zero (apart from the three special characters
described below) are mapped to the basic character set ISO 8859-1. Unicode
characters from other rows are represented by a sequence of five characters from
ISO 8859-1, where the first character is "@". The next two characters give
the hexadecimal value of the row, and the last two characters give the
hexadecimal value of the cell.
Example: Greek letters are placed in row three. Greek capital letter omega
(U+03A9) is represented by the sequence @03A9 in danMARC2.
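The row/cell rule can be sketched in a few lines of Python. This is only an illustration: the function name `encode_char` is hypothetical, the result is returned as a Python string rather than ISO 8859-1 bytes, and the three special characters are deliberately left out.

```python
def encode_char(ch: str) -> str:
    """Encode one Unicode character using the @-notation described above."""
    cp = ord(ch)
    if cp > 0xFFFF:
        # Characters outside the BMP cannot be represented in danMARC2.
        raise ValueError(f"U+{cp:04X} is outside the BMP")
    if cp <= 0xFF:
        # Row zero: kept as the ISO 8859-1 character itself.
        # (The special characters *, @ and ¤ are not handled in this sketch.)
        return ch
    row, cell = divmod(cp, 0x100)  # high byte = row, low byte = cell
    return f"@{row:02X}{cell:02X}"


print(encode_char("Ω"))  # @03A9
```

Splitting the code point with `divmod` mirrors the description of two hexadecimal digits for the row followed by two for the cell.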
5. Three special characters
The asterisk is used to mark subfields in some
download formats. If an asterisk is needed in the bibliographic record, it must
be represented by the two characters "@*".
The commercial at is used to denote special encoding.
If a commercial at is needed in the bibliographic record, it must be represented
by the two characters "@@".
The currency sign is used as a sorting mark. If a currency sign is needed in the
bibliographic record, it must be represented by the two characters "@¤".
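As a minimal sketch, the three escapes can be applied with a small lookup table. The names `SPECIALS` and `escape_specials` are hypothetical, and the input is assumed to be literal text in which every *, @ and ¤ is meant as an ordinary character.

```python
# Literal character -> danMARC2 escape, as described in this section.
SPECIALS = {"*": "@*", "@": "@@", "¤": "@¤"}


def escape_specials(text: str) -> str:
    """Escape the three special characters in literal text."""
    return "".join(SPECIALS.get(ch, ch) for ch in text)


print(escape_specials("a*b@c¤"))  # a@*b@@c@¤
```

Escaping character by character ensures that an "@" introduced by one escape is never escaped a second time.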
6. Additional characters
One of the letters of the Danish alphabet (å,
Å) also exists in an older form (aa, Aa). To distinguish the old å
from a double a, the sequences "@å" and "@Å" are used.
An alternative representation exists, using the medieval characters U+A733 and
U+A732.
Practice in danMARC2 has been to use the character sequences @UD8 and @UD9 for
superscript and subscript. This practice has now changed: the superscript and
subscript characters from Unicode are used instead, represented in danMARC2
with the standard @-notation.
7. Diacritics
All combining characters from Unicode may be used
in danMARC2, but a special practice is followed for historical reasons. See below.
Practice for swapping single version and combining version
A number of diacritical marks have two representations in Unicode: a single
version and a combining version. They may also exist as part of composite characters,
e.g. ä.
Practice in danMARC2 is to use the composite character when one exists. Otherwise,
for most diacritics, the single version is used as the combining version and vice versa,
i.e. the opposite of the version dictated by Unicode.
Danish name | Single version | Code | Combining version | Code | Danish practice |
circumflex | circumflex accent | U+005E | combining circumflex accent | U+0302 | swap |
understreg | low line | U+005F | combining low line | U+0332 | swap |
grave | grave accent | U+0060 | combining grave accent | U+0300 | swap |
tilde | tilde | U+007E | combining tilde | U+0303 | don't swap |
umlaut | diaeresis | U+00A8 | combining diaeresis | U+0308 | swap |
macron | macron | U+00AF | combining macron | U+0304 | swap |
aigu | acute accent | U+00B4 | combining acute accent | U+0301 | swap |
cedille | cedilla | U+00B8 | combining cedilla | U+0327 | swap |
hacek | caron | U+02C7 | combining caron | U+030C | swap |
breve | breve | U+02D8 | combining breve | U+0306 | swap |
overcirkel | ring above | U+02DA | combining ring above | U+030A | swap |
højrekrog | ogonek | U+02DB | combining ogonek | U+0328 | swap |
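The table can be expressed as a symmetric lookup, sketched below in Python. The names `SWAPPED_PAIRS`, `SWAP` and `swap_diacritic` are hypothetical. The sketch covers only the code-point swap, not the preference for composite characters or the placement rule, and it assumes the same lookup is applied in both directions.

```python
# (single version, combining version) pairs marked "swap" in the table.
# Tilde (U+007E / U+0303) is absent because the table says "don't swap".
SWAPPED_PAIRS = [
    (0x005E, 0x0302),  # circumflex
    (0x005F, 0x0332),  # low line
    (0x0060, 0x0300),  # grave
    (0x00A8, 0x0308),  # diaeresis
    (0x00AF, 0x0304),  # macron
    (0x00B4, 0x0301),  # acute
    (0x00B8, 0x0327),  # cedilla
    (0x02C7, 0x030C),  # caron
    (0x02D8, 0x0306),  # breve
    (0x02DA, 0x030A),  # ring above
    (0x02DB, 0x0328),  # ogonek
]

# Build a lookup that works in both directions.
SWAP = {}
for single, combining in SWAPPED_PAIRS:
    SWAP[single] = combining
    SWAP[combining] = single


def swap_diacritic(cp: int) -> int:
    """Return the swapped code point, or cp unchanged if it is not swapped."""
    return SWAP.get(cp, cp)


print(f"U+{swap_diacritic(0x0302):04X}")  # U+005E
```

Because the swap is its own inverse, applying `swap_diacritic` twice returns the original code point.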
Practice for swapping placement of combining characters
Unicode requires a combining character to be placed after the character it
modifies. danMARC2 practice is to put the combining character in front of the
character it modifies.
Normal practice in Danish records is to use only one combining character.
Nevertheless, if more than one is needed, this mirroring of placement is
generalised as shown:
Unicode: <base character><combining character 1>…<combining character n>
danMARC2: <combining character n>…<combining character 1><base character>
Note: The order of the combining characters is important. If a letter
with both circumflex and acute is to be represented in Unicode, the order of
the combining characters determines which one is topmost.
Graphics | Comment | Unicode sequence | danMARC2 |
ấ | acute is topmost | U+0061 U+0302 U+0301 | @0301@0302a |
á̂ | circumflex is topmost | U+0061 U+0301 U+0302 | @0302@0301a |
See the Unicode documentation for a further description.
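The mirroring of placement can be sketched as follows. The function names `to_danmarc2_order` and `at_notation` are hypothetical. Combining characters are detected through the Unicode general category (any category starting with "M"), which is an assumption of this sketch, and the single/combining swap is not applied, matching the two example rows in the table above.

```python
import unicodedata


def to_danmarc2_order(seq: str) -> str:
    """Turn <base><c1>...<cn> into <cn>...<c1><base> for every base character."""
    out = []
    base = ""
    marks = []  # combining marks collected for the current base character
    for ch in seq:
        if unicodedata.category(ch).startswith("M"):
            marks.append(ch)
        else:
            # A new base character starts: flush the previous group, mirrored.
            out.append("".join(reversed(marks)) + base)
            base, marks = ch, []
    out.append("".join(reversed(marks)) + base)
    return "".join(out)


def at_notation(seq: str) -> str:
    """Write characters above U+00FF as @XXXX (BMP only, specials ignored)."""
    return "".join(ch if ord(ch) <= 0xFF else f"@{ord(ch):04X}" for ch in seq)


print(at_notation(to_danmarc2_order("a\u0302\u0301")))  # @0301@0302a
print(at_notation(to_danmarc2_order("a\u0301\u0302")))  # @0302@0301a
```

Reversing the marks as a group keeps the stacking order intact when the sequence is read back from the base character outwards.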
8. Mapping from danMARC2 to Unicode
dM2 | Unicode | Comment |
20 to FF | U+0020 to U+00FF | ISO 8859-1 (Latin-1) corresponds to row 0 of the BMP. |
* | | Special character in dM2 |
@ | | Special character in dM2 |
¤ | | Special character in dM2 (note: may be changed!) |
@0100 to @FFFF | U+0100 to U+FFFF | Standard mapping |
@* | U+002A | asterisk |
@@ | U+0040 | commercial at |
@¤ | U+00A4 | currency sign - not used in Danish records (to be confirmed) |
@å | U+A733 | old Danish å - alternative form in dM2: @A733 |
@Å | U+A732 | old Danish Å - alternative form in dM2: @A732 |
Three characters (*, @ and ¤) have special meaning in
danMARC2 and are not mapped to Unicode.
Old Danish å and Å are represented as U+A733 and U+A732 in
Unicode. As a consequence, each may have two representations in danMARC2: @å
or @A733, and @Å or @A732.
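A minimal sketch of the mapping from danMARC2 to Unicode, using a regular expression over the @-sequences. The names `_TOKEN`, `_NAMED` and `decode` are hypothetical. The sketch assumes uppercase hexadecimal digits, leaves unescaped *, @ and ¤ untouched because they carry danMARC2 meaning rather than Unicode content, and does not undo the diacritic practices from section 7.

```python
import re

# "@" followed by four uppercase hex digits, or by one of the named characters.
_TOKEN = re.compile(r"@([0-9A-F]{4}|[*@¤åÅ])")

# Two-character sequences from the mapping table.
_NAMED = {"*": "*", "@": "@", "¤": "¤", "å": "\uA733", "Å": "\uA732"}


def decode(text: str) -> str:
    """Replace danMARC2 @-sequences with the Unicode characters they denote."""
    def repl(match: re.Match) -> str:
        body = match.group(1)
        if body in _NAMED:
            return _NAMED[body]
        return chr(int(body, 16))  # @XXXX -> U+XXXX
    return _TOKEN.sub(repl, text)


print(decode("@03A9"))      # Ω
print(decode("a@*b@@c@¤"))  # a*b@c¤
```

Both `decode("@å")` and `decode("@A733")` yield U+A733, which reflects the two alternative representations of old Danish å.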