Friday, 23 July 2010

disunification (1)

Consider the following pairs of symbols: a а ä ӓ æ ӕ c с e е è ѐ ë ё i і j ј o о p р s ѕ x х y у. Can you see any difference between the members of each pair? Nor can I. Nor can anyone.

However, in each pair the first symbol is a letter of the Latin alphabet, while the second is Cyrillic.

Correspondingly, the two members of each pair have different Unicode encodings. While Latin a is U+0061, Cyrillic а is U+0430. While Latin j is U+006A, Cyrillic ј (used in writing Serbian) is U+0458. And so on.
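The difference is invisible on screen but easy to confirm programmatically. Here is a minimal sketch in Python, using only the standard `unicodedata` module; the helper name `describe` is just for illustration.

```python
import unicodedata

# Each pair is (Latin letter, Cyrillic look-alike), written with escapes
# so that the two cannot be confused in the source code itself.
pairs = [("a", "\u0430"), ("e", "\u0435"), ("j", "\u0458")]

def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

for latin, cyrillic in pairs:
    # The two characters never compare equal, however alike they look.
    print(describe(latin), "|", describe(cyrillic), "| equal:", latin == cyrillic)
```

To a computer the members of each pair are simply different characters; only the human reader is fooled.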

This situation is convenient in that it keeps all the basic Latin letters together in the block 0021–00FF (I give Unicode numbers in the usual hexadecimal form) and all the Cyrillic letters together in the block 0400–04FF. But it is also highly inconvenient, because it opens up potential breaches in security. Now that non-ASCII letters are allowed in URLs, the fact that two differently coded letters look identical could be exploited for malicious purposes, for phishing or scamming. While www.facebook.com is a website you know and love (or not), “www.fасеbооk.com” would be somewhere quite different. (In the latter case, the Latin a,c,e,o have been replaced by the identical-looking Cyrillic equivalents.)
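One way to see what is going on in the deceptive name is to list its non-ASCII characters. The sketch below assumes the spoofed name is built exactly as described (Latin a, c, e, o swapped for Cyrillic); the function name `non_ascii_letters` is hypothetical.

```python
import unicodedata

genuine = "www.facebook.com"
# The same name with Latin a, c, e, o replaced by Cyrillic look-alikes.
spoofed = "www.f\u0430\u0441\u0435b\u043e\u043ek.com"

def non_ascii_letters(name):
    """Return (position, character, Unicode name) for each non-ASCII character."""
    return [(i, ch, unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(name) if ord(ch) > 0x7F]

print("identical strings?", genuine == spoofed)
for pos, ch, uname in non_ascii_letters(spoofed):
    print(pos, ch, uname)
```

The genuine name yields an empty list, while the spoofed one reports five Cyrillic letters.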

That is why they tell you not to click on links in emails, but to type them into the browser yourself.

It’s not quite as bad as that, because the domain name authorities will (we hope) refuse to register such deceptive domain names. On the other hand there is nothing to stop someone using this sort of thing as their Facebook name.

By the time it came to encoding IPA symbols, the Unicode Consortium had become aware of this danger and resolved to take a much more conservative line. The new policy was that if two characters (“glyphs”) look the same, then normally they should have the same encoding. That’s why, although most phonetic symbols are located in the IPA Extensions block (0250–02AF), some aren’t. We use the basic Latin a b c… rather than having special IPA ones. We also use the “Latin-1 Supplement” coding for the characters æ ç ð ø (U+00E6, U+00E7, U+00F0, U+00F8) since they occur in the ordinary spelling of Danish, French, Icelandic, and Norwegian. We also use the “Latin Extended-A” coding for the ħ (U+0127) used in Maltese orthography, for the œ (U+0153) used in French, and even for the ŋ (U+014B) used in spelling Sami and Mende. None of these is repeated in the IPA Extensions block, though ћ is separately coded for Cyrillic (Serbian, U+045B).

Worse, the phonetic symbols β θ χ (U+03B2, U+03B8, U+03C7) are to be found only in the “Greek and Coptic” block, since they are treated as identical with the Greek letters beta, theta and chi.

Fortunately, our IPA ɫ is not lumped in with Polish ł, nor ɪ (lax front unrounded vowel, small cap i) with Turkish dotless ı or Greek iota ι.
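To check where a given phonetic symbol actually lives, one can look its code point up against the block ranges. Python's standard library has no block lookup, so this sketch uses a small hand-written table covering only the blocks mentioned in this post; `BLOCKS` and `block_of` are names chosen for illustration.

```python
# Block ranges (inclusive) for the blocks discussed above.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
]

def block_of(ch):
    """Return the name of the block containing ch, or 'other' if not in the table."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"

# a, ae, eth, h-bar, eng, beta, theta, chi, dark l, small capital i
for ch in "a\u00e6\u00f0\u0127\u014b\u03b2\u03b8\u03c7\u026b\u026a":
    print(f"{ch} U+{ord(ch):04X} {block_of(ch)}")
```

Run over a transcription, this shows at a glance that a single line of IPA may draw on four or five different blocks.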

Meanwhile — rather incredibly, and going to the other extreme — our phonetic schwa ə is among the IPA symbols at U+0259, while the identical-appearing ǝ and ә are respectively LATIN SMALL LETTER TURNED E (U+01DD) of the Pan-Nigerian alphabet and CYRILLIC SMALL LETTER SCHWA (U+04D9) as used in Azerbaijani orthography.
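The three schwas can be told apart only by asking for their names. A short sketch using Python's standard `unicodedata` module follows; it also checks, as far as NFKC normalization goes, that nothing merges them.

```python
import unicodedata

# IPA schwa, Pan-Nigerian turned e, Cyrillic schwa.
schwas = ["\u0259", "\u01dd", "\u04d9"]

for ch in schwas:
    print(f"{ch} U+{ord(ch):04X} {unicodedata.name(ch)}")

# Normalization does not unify them: three distinct characters remain.
distinct = {unicodedata.normalize("NFKC", ch) for ch in schwas}
print(len(distinct), "distinct characters after NFKC normalization")
```

So a search for one of them in a document will silently miss the other two.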

The problem we face in all such cases is that of the “unification” versus “disunification” of identical-looking symbols.

More on this next week. Meanwhile, you might like to read Michael Everson’s discussion here.

56 comments:

  1. John, one of the reasons that phonetic schwa ə is encoded separately is that it has a capital Ə, whilst the Pan-Nigerian schwa ǝ has a capital Ǝ. The Cyrillic letter Әә is no different from the other Cyrillic clones.

    (Interestingly, in Lucida Grande bold, Cyrillic ј and Latin j look slightly different, the latter having a longer tail. But that's an artificial distinction, in my view.)
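The point about capitals can be checked with ordinary case conversion. This is a small sketch relying on Python's built-in `str.upper`, which follows the Unicode case mappings; `capital_of` is a hypothetical helper name.

```python
def capital_of(ch):
    """Return the upper-case counterpart of a single character."""
    return ch.upper()

# IPA/Latin schwa, Pan-Nigerian turned e, Cyrillic schwa.
for ch in ["\u0259", "\u01dd", "\u04d9"]:
    up = capital_of(ch)
    print(f"{ch} U+{ord(ch):04X} -> {up} U+{ord(up):04X}")
```

Three look-alike small letters map to three different capitals, two of which (Ə and Ǝ) are visibly distinct.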

  2. The unification of the phonetic symbols β θ χ with the Greek letters beta, theta, and chi is unfortunate from a design perspective.

    A font supporting both Latin and Greek alphabets often employs different design styles for the two. In a classical text typeface, the Latin letters will be upright and with full serifs, with vertical strokes bolder than the horizontals, while the Greek letters may be slightly inclined, have no 'serifs' except natural entry and exit strokes, and if a traditional pen model is used the horizontal strokes may be bolder than the verticals. In that case, the symbols β θ χ will not harmonize stylistically with the Latin-based phonetic symbols. The problem is worse if one ends up mixing and matching different fonts for different phonetic symbols to be used next to each other.

    Separate glyphs can be made as phonetic alternates, but they require OpenType-savvy software and extra attention from users.

  3. I now see that Michael Everson has discussed exactly this problem in his blog entry, and addresses sorting problems as well as extended use of the symbols (e.g. χ used in the transcription of Chukchi with a capital letter different from the usual Greek letter). I agree with him that disunification is desirable here.

  4. The notion that IPA isn't encoded separately from Latin for confusability reasons is historically untrue. IPA is as old in the Standard as Latin and Greek themselves, and it has always been considered an extension of Latin, not a sui generis script. The name of the IPA block is "IPA Extensions", which means "IPA Extensions to Latin". Indeed, the Xerox Standard Character Set, which is the ancestor of Unicode, already had exactly the same unifications and disunifications in this area as Unicode does.

    The unification of beta, theta, and chi is more problematic. The most that can be said is that the use of purely Greek shapes for IPA purposes is always acceptable, if not always historically ideal; whereas the use of Greek alpha, epsilon, gamma, and phi would not be.

  5. It's worth noting that the Cyrillic у has a different uppercase form from the Latin y (У and Y respectively). The lowercases also look different in this entry box (using a fixed-width font), with the Cyrillic having a curve at the bottom and the Latin a bar.

  6. I'd also submit that the distinction you see on the Cyrillic and Latin isn't really artificial. It's designed to mesh more nicely with the proportions of Latin or Cyrillic type, which can clearly be different for the same font.

  7. Ryan

    I thought the distinction was historical:

    Latin Y(y) from Greek Υ(υ)
    Cyrillic У(у) from a ligature of Greek Ο+Υ

    and in the old Russian alphabet

    Cyrillic Ѵ(ѵ) from Greek Υ(υ)

  8. Michael Everson, Cyrillic ј having a shorter descender than the Latin j in Lucida Grande could very well be compensation for the fact that more Cyrillic lowercase letters are confined to the x-height in a typical text than Latin lowercase letters, so that pulling in the Cyrillic descender leads to more harmonious proportions overall.

  9. In my font the Latin ħ and the Cyrillic one look different.

  10. army 1987

    Surely Latin ħ and Cyrillic ђ are entirely different letters with upper case Ħ and Ђ respectively. Or am I missing something?

  11. David Crosbie, Serbian Cyrillic uses ћ (uppercase Ћ) for the voiceless affricate written 'ć' in Serbian Latin. The lowercase can be identical in form to the letter ħ (uppercase Ħ) used in Maltese, and this is probably what army1987 is referring to.

    The Cyrillic letter ђ (uppercase Ђ) used in Serbian derives from ћ but differs in having a curling tail, and represents the voiced counterpart to ћ, written 'đ' in Serbian Latin and often transcribed 'dj'.

  12. My wife and her friend have been exchanging drafts of Russian teaching materials for many years. When they started, there were rival standard Cyrillic codes for DOS (and therefore also Windows), and no discoverable standard for Apple OS. Over the years, I've helped them convert old drafts into fonts that would preserve (and often improve) the appearance.

    The latest problem was switching to Unicode fonts. Fortunately, this did not involve too many substitutions. However, it must be the case that the texts are full of characters that look Cyrillic but electronically are not.

    Does anyone know of a software program that offers to disambiguate identical looking characters?
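For what it is worth, a rough version of such a tool is not hard to sketch. The Python below assumes that any word containing at least one Cyrillic letter was meant to be entirely Cyrillic, and replaces stray Latin look-alikes within it. The mapping covers only a few lower-case pairs from the post, and all the names (`LATIN_TO_CYRILLIC`, `repair_word`, `repair_text`) are hypothetical.

```python
import re
import unicodedata

# Latin look-alike -> Cyrillic letter (lower case only, deliberately incomplete).
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e",
    "p": "\u0440", "x": "\u0445", "y": "\u0443",
}

def is_cyrillic(ch):
    """Heuristic: the character's Unicode name starts with CYRILLIC."""
    return unicodedata.name(ch, "").startswith("CYRILLIC")

def repair_word(word):
    """Replace Latin look-alikes, but only in words that already contain Cyrillic."""
    if not any(is_cyrillic(ch) for ch in word):
        return word  # a purely Latin word is left alone
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in word)

def repair_text(text):
    """Apply repair_word to every run of word characters in the text."""
    return re.sub(r"\w+", lambda m: repair_word(m.group()), text)

# A Russian word typed with two Latin o's, followed by genuine English words.
broken = "\u043co\u043b\u043e\u043ao and cat"
print(repair_text(broken))
```

A word typed entirely in Latin look-alikes contains no Cyrillic letter to give it away, so this approach cannot detect it; such cases would still need a human eye or a spelling checker.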

  13. This comment has been removed by the author.

  14. Jeongseong

    Thanks for that. I knew about Serbian, but I've only ever seen Maltese in what must have been simple transcription.

  15. Wow, I never even considered such a security risk. Excellent post.

  16. jeongseong, army 1987

    Only one font in my system, Helvetica Neue Ultralight Italic, has Serbian ђ without a descender. And it lacks Maltese ħ altogether.

  17. David: Apparently the comment system likes to take characters in angled brackets and delete them. There was supposed to be a j in the first sentence of the second post, after "Cyrillic and Latin". You're almost certainly right about the difference in Latin and Cyrillic y being historical, though I have to wonder whether it's because of the oy digraph itself, or just a symptom of the Cyrillic trend of generally preserving the same shape for majuscule and minuscule letters.

  18. Ryan (and everyone): Please note that if you wish to use angle brackets in a comment you must replace "<" with "&lt;" and ">" with "&gt;".
    (Quiz: how did I post THIS comment?)
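The replacement rule can be illustrated with Python's standard `html.escape`, which performs exactly these substitutions and also escapes the ampersand itself. This is only a sketch of the principle, not a claim about how the comment system works internally.

```python
import html

# Angle brackets become entities, so the comment system leaves them alone.
escaped = html.escape("<j>")
print(escaped)        # &lt;j&gt;

# To display an entity literally, its ampersand must be escaped in turn.
literal = html.escape("&lt;")
print(literal)        # &amp;lt;
```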

  19. Ryan

    I haven't tried every font in my system, but it does seem that in every font that has both Latin small j and Cyrillic small je, the shape is the same. I get pairs with different shapes when there is no je in the chosen font and so another is substituted. The same seems to be true for upper case letters.

    I'm pretty sure that the immediate origin of Russian У is the ligature form of ΟΥ — a sort of 8 with the top sliced off. (The digraph version was also used.) Byzantine sound values for Υ and ΟΥ must have been closer to Modern Greek values than Classical. The letter Υ wasn't needed at all in Cyrillic — except to copy Greek words.

  20. John

    I can't do your quiz.

    But I can write ≺a≻ ≺b≻ ≺c≻ ≺d≻ ≺e≻ ...

  21. Ryan

    I take it back. Cyrillic je is ever so slightly narrower than the Roman letter.

  22. By "Cyrillic ħ" I meant the ћ (U+045B) mentioned in the post.

  23. Irrespective of unicode unification or disunification, graphemes are clearly meant to be used only where context can tell what writing system they belong to.
    The size of this context is usually a document, occasionally a paragraph or a single word, but not usefully smaller than that.

    Furthermore, given a writing system, we would not generally expect it to contain any confusable symbols.

    Thus, a natural way to reduce security worries when allowing unicode in domain and user names is to require that all graphemes should be defined within a single writing system (or "script", in unicode terms, http://en.wikipedia.org/wiki/Scripts_in_Unicode).
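A single-script rule of this kind might be sketched as follows. Python's standard library does not expose the Unicode Script property, so the sketch approximates a letter's script by the first word of its Unicode name (LATIN, CYRILLIC, GREEK, ...). That shortcut is an assumption made for illustration, and `scripts_used` and `is_single_script` are hypothetical names.

```python
import unicodedata

def scripts_used(label):
    """Approximate the set of scripts in a label from each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in label if ch.isalpha()}

def is_single_script(label):
    """Accept a label only if all its letters appear to share one script."""
    return len(scripts_used(label)) <= 1

genuine = "facebook"
mixed = "f\u0430\u0441\u0435b\u043e\u043ek"   # Latin f, b, k with Cyrillic a, c, e, o
all_cyrillic = "\u0430\u0441\u0435"           # looks like Latin "ace"

for label in (genuine, mixed, all_cyrillic):
    print(label, sorted(scripts_used(label)), is_single_script(label))
```

Note that the all-Cyrillic label passes the check even though it reads as a Latin word, so a single-script rule reduces the risk without removing it entirely.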

  24. army 1987

    Aha! Thanks for putting me right. I was relying in the first instance on the Mac OS Character Viewer. Its comprehensive (as I thought) display of Cyrillic characters includes letter dje ђ but not letter tshe ћ.

    The upper case characters are easily distinguished from the Maltese letter.

    dje ђ Ђ
    tshe ћ Ћ
    Maltese ħ Ħ

  25. army 1987

    I forgot to add that, in the fonts that I've tried, Maltese ħ has a curved 'hump' identical to that of Roman h, while Cyrillic ћ (tshe) has a smaller, lower curved hump.

    I presume that this is because the Maltese letter is based on H, but the Cyrillic letter is based — at least in part — on T.

  26. Confusability and spoofing between Latin, Greek, and Cyrillic has been handled by the relevant authorities. In general, the rule is that script-mixing within a URL element is disallowed. So this isn't what this is about.

    The point of my contribution is that treatment of IPA text is disadvantaged by the unification of three IPA letters with three Greek letters.

  27. Michael

    Could you lobby Apple and Microsoft to present IPA to the user as a 'language' with its own sorting sequence?

  28. David, if you're worried that the Russian teaching materials you mention are full of characters that look Cyrillic but electronically are not, would it help to scan them with Russian OCR software? I'm sure it would still be a hard slog, but at least you'd be able to see what was wrong.

  29. David: I think it's a little odd to say that the form (as opposed to the value) of modern Cyrillic У comes from the ОУ ligature and not merely the second element of the digraph. A page from the Savvina Kniga http://upload.wikimedia.org/wikipedia/commons/f/fd/Scepkin1903Sava142ob.jpg has ОУ on the fourth line from the bottom, on the left. A page from the Ostromir Gospel http://upload.wikimedia.org/wikipedia/commons/6/62/OstromirGospel.jpg has ОУ on the first and second lines, on the right. Sure, the value of the letter in modern Slavic languages comes from the digraph, but I'm just not convinced that the У does, when it's so clearly evidenced in some of the oldest Cyrillic manuscripts.

  30. Ryan

    I merely quote what I have read. I don't see how the second part of the ΟΥ digraph could give rise to anything but yzhetsa Ѵѵ.

  31. As for the specific shapes of modern Cyrillic letters, blame the introduction of Russian civil script and type under Peter I with the tsar personally approving his preferred letter shapes (making them look more like Latin letters in most cases). Just how else does one derive the shape of the modern Я from the earlier Cyrillic forms?

  32. Clayton Burns, 25 July 2010 00:58

    etc I once read in a book that philosophy is the study of etc (sounds interesting).

    Is there something such as factitious phonetic symbol disorder? Perhaps not... perhaps I dream or exaggerate...

    When reporters can't tell the difference between "minutia" and "minutiae" (or hazard a pronunciation), or even discover the singular for "bacteria," it seems odd to bother about etc.

    It's a possible case of involution.

    I have the OPI descriptors. I am ready to discuss them.

  33. David,

    Re my above post of 17:11 yesterday. I wasn't attempting to answer your earlier question "Does anyone know of a software program that offers to disambiguate identical looking characters?" and if I understand you correctly, I don't know of any such program. But I imagine any Cyrillic OCR software would at any rate interpret anything ambiguous in your Russian teaching materials as Cyrillic, which is presumably what you want in those texts. At least that would be a start. No doubt it would introduce other errors, but they would be identifiable by eye, and it might not be that much of a slog to edit them.

  34. David,

    Re: ȣ

    I don't see how the second part of the ΟΥ digraph could give rise to anything but yzhetsa — Ѵѵ.

    Well David, I recognized your ligature from the Greek one it was based on, which appears as large as life in Wikipedia as the subject of an article in its own right, now interpreted as a ligature of Latin o and u, which is why I can put it as the topic of this comment and hope it will appear as such. If it does I hope you'll agree it's recognizable, and acceptable in the absence of authentic versions from most fonts. Its Cyrillic name is Uk, and Wikipedia has an article on that too, saying "The simplification of the ligature оу to у was first brought about in the Old East Slavic texts and only later taken over into South Slavic languages."

    And that's what it seems to have been: a simplification of both the horizontal digraph, with or without ligature, and the vertical ligature, rather than the evolution of a new letter from it. Because the horizontal digraphs оу and оѵ coexisted, and the о can only have become redundant when у became fully distinct from ѵ. The article tells us that Tailed Izhitsa may be used as a part of the digraph in the Old Church Slavonic orthography, which did not have modern у.

    So my guess would be that that was originally a flourish, perhaps specifically to mark the sequence as a digraph, or perhaps just another case of making the majs and mins more alike, which became specified for the digraph (and therefore ultimately the [u] when the redundant о was dropped) because the tailless one became specified for [i] except in the digraph (apart from the cases when it was [v]!)

    I can't find any support for this guess for the moment. I think our best hope is for Michael Everson to come round again.

  35. Mallamb

    Yes, I've come across this and another character. The Mac Character Viewer labels them

    LATIN CAPITAL LETTER OU Ȣ Unicode 0222
    LATIN SMALL LETTER OU ȣ Unicode 0223

    I don't find it surprising that Bulgarian should abandon its divergent shape in favour of the Russian shape. It's not so much Russia's political clout, but its status as 'the Third Rome'. Pretty well every literate Bulgarian would be a cleric, and would pay serious attention to how the Metropolitan of Moscow (was he already a Patriarch?) wrote the letter.

    Earlier, it had been Bulgaria that was the big cheese. The Ostromir Gospels that Ryan cites were, I read, copied from Bulgarian works.

    So my guess would be that that was originally a flourish, perhaps specifically to mark the sequence as a digraph

    More likely to mark it as a single letter — alternating as a numeral 400.

    In one chapter of The Slavonic Language (Routledge), Paul Cubberly reproduces various Cyrillic from various periods. As far as I can tell:

    1. Eleventh century manuscripts from Bulgaria and Russia both use two letters.

    2. A fourteenth century Bulgarian manuscript still uses two letters. The Russian seems to be already using у with the great long descender — presumably what the Wikipedia article means by 'tailed izhetsa'.

    3. Fifteenth-century cursive writing from Belorussia (i.e. Lithuania-Poland) is beyond me. Sixteenth century cursive documents from Russia and from Dubrovnik both use the vertical ligature. A seventeenth century Bulgarian cursive uses o with a non-touching ѵ above it.

    4. A seventeenth century Bosnian Cyrillic text has a great tall vertical ligature. The printed Belorussian religious text from the sixteenth century and the printed civil text from eighteenth century Russia are too short and/or too difficult for me. But it's not unlikely that they used something based on the vertical ligature.

    The fourteenth century Russian text is the odd one. Perhaps Russian scribes first simply dropped the o, but later decided that having 'tailed' and 'tailless' izhetsa was too confusing.

    Thanks for your advice on de-Romanising the teaching materials. Yes, it's possible to do your way or by Word's Replace feature. Since Lena has retired, and since an acceptable form of their materials has been published, it's not really worth my while — especially as preserving the layout is much more important. And there's also PDF to play with, which didn't use to exist when they were writing.

    So it's no longer important to be able to print stuff from text in the current version of Word. And sending electronic text to a third party never really demanded the care it seemed to. However careful the proof-reading, the publisher came up with more typos.

  36. As they do. Wasn't it JW who was having a rant on here about printing being farmed out to India, where chaos reigned even supremer?

    Perhaps Russian scribes first simply dropped the o, but later decided that having 'tailed' and 'tailless' izhetsa was too confusing.

    Well what I was speculating above was that the reason they eventually dropped the o was that having 'tailed' and 'tailless' izhitsa was not too confusing. After all, it's what they finished up with. May we not suppose that the tailed one was the precursor of у? I have just looked at the Wikipedia stub on у, which says that у "evolved as a specifically East Slavic short form of the digraph оу used in ancient Slavic texts to represent /u/", which is consistent with my speculation about the details of the way it evolved, but later says that Greek upsilon "was parallelly also taken over into the Cyrillic alphabet in another form, as izhitsa". This seems like an odd use of the word "parallelly" in that context, and your observations from Paul Cubberly's The Slavonic Language don't look consistent with any such parallelism.

  37. mallamb

    1. My principal source is Russian and the Slavonic Languages by Entwistle and Morrison.

    Gk. ου=OB о у, ligature ȣ = Cyrillic у.

    I'm pretty sure I've read the same in other works.

    2. The ligature form clearly replaced the tailed izhetsa as a widespread general standard before yielding to modern у.

    3. Bulgarian seems to have replaced оу with a digraph not unlike the ligature, before adopting the Russian у.

    The Slavonic Language is not by Cubberly, by the way, only the one chapter. But he has written his own Russian: A Linguistic Introduction published by CUP. In the Routledge chapter he lists only оу run together horizontally and the vertical ligature ȣ in the Old Church Slavonic column of his table. In his CUP book, the printers allowed him four alternative symbols: the same two, plus у plus the curly 'sliced-off 8' version of the vertical ligature. The Church Slavonic character set included in Alphabetum font supplies three of these, placing them in the Unicode Private Use Area. The one not supplied there is the one that has 'public' existence as Unicode 0223 ȣ.

  38. mallamb

    Here's an example of the typeface used by the Synod for Church Slavonic as recently as 1885 for a требник.

    The left hand page has examples of big and small izhetsa (I'm not sure whether upper and lower case is appropriate.) And there are a few examples of small uk (Big uk has proved hard to find.)

  39. mallamb

    Aha!

    Look at this left hand page. Initially, the grapheme was (in 1885 anyway) represented not by uk but by о+izhitsa. Look towards the bottom for уже не умераетъ.

    That's why I couldn't find a 'capital ' uk.

  40. David,

    I have no special powers to persuade Apple or Microsoft of anything… but the main sorting algorithms (as for file names in the OS) are initialized on startup, and you can only have one running at a time. So if you have multilingual data including IPA and Greek... either the Greek sorts correctly or the IPA sorts correctly under the scenario you mention.

    Disunification would be better for supporting this functionality.
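The sorting conflict can be shown in a few lines of Python. Plain code-point order puts θ after every Latin letter, because it sits in the Greek block. A tailored key can move it next to t, but the order string used here is purely hypothetical and only serves to illustrate the principle.

```python
# Hypothetical IPA sort order: theta placed directly after t (an assumption).
IPA_ORDER = "abcdefghijklmnopqrst\u03b8uvwxyz"
RANK = {ch: i for i, ch in enumerate(IPA_ORDER)}

def ipa_key(word):
    """Listed symbols sort by rank; anything else sorts after them, by code point."""
    return [(0, RANK[ch]) if ch in RANK else (1, ord(ch)) for ch in word]

# Transcriptions of "thin", "tin" and "zip".
words = ["\u03b8\u026an", "t\u026an", "z\u026ap"]

by_code_point = sorted(words)              # theta lands after z
by_ipa_order = sorted(words, key=ipa_key)  # theta lands between t and z

print(by_code_point)
print(by_ipa_order)
```

The same key would also file any Greek word beginning with θ between Latin t and u, which is exactly the either-or trade-off described above.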

  41. This comment has been removed by the author.

  42. U+0222..0223 ȣ LATIN LETTER OU is not for Cyrillic.

    Please see U+A64A..A64B CYRILLIC LETTER MONOGRAPH UK at unicode.org/charts/PDF/UA640.pdf
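The official names of the characters in question can be looked up directly. This is a quick sketch with Python's standard `unicodedata` module, assuming a Python build whose Unicode data includes the Cyrillic Extended-B block (any Python 3 release should).

```python
import unicodedata

# Latin OU (capital, small) and Cyrillic monograph Uk (capital, small).
code_points = [0x0222, 0x0223, 0xA64A, 0xA64B]

names = {cp: unicodedata.name(chr(cp)) for cp in code_points}
for cp, name in names.items():
    print(f"U+{cp:04X} {chr(cp)} {name}")
```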

  43. Yes I realized that, but could not find a font with them in it. I explained I was using U+0222..0223 ad hoc, and so has David. I don't think this was a particularly grave sin as Wiki says "The ligature, in both majuscule and minuscule forms, is occasionally used to represent minuscule оf "У" in the Romanian Transitional Alphabet, as the glyph for monograph Uk (ꙋ) is rarely available in font sets."

    David, I don't know whether you are saying aha because you think you've proved me wrong, or because you've found some support for my speculations, but I did say о+tailless izhitsa and о+ tailed izhitsa had coexisted (and they both coexisted at various times with the various forms of the ligature). I never imagined till as late as 1885, though, when the modern у was still presumably not in the Church Slavonic alphabet. So it could have evolved simply by the dropping of the о and raising of the status of the allographic tailed izhitsa to the status of a distinct grapheme, rather than by some mysterious (and as yet unattested) redesign of U+A64A..A64B CYRILLIC LETTER MONOGRAPH UK to function in that status.

    Often enough Wikipedia and ignorance can give one ideas which the wood of multiple well-researched sources obscures with its luxuriant trees, but I think you have found a very significant specimen lurking in the undergrowth.

  44. Michael

    I know nothing of Windows, but Apple Snow Leopard allows you to change the priority of 'languages' within an application with effect from when the application is opened. (You do need to start up again for it to take effect within Finder.)

    My thinking is that if IPA could be defined as an Apple 'language', and if you placed it above Greek in the preference list, it would sort in English order, then IPA. For a short task, one could even place IPA above English.

    Is Unicode run by a committee of organisations? That would probably make them less open to persuasion than Apple or Microsoft.

    I'm not at all surprised that U+0222..0223 ȣ LATIN LETTER OU were not intended for Cyrillic. They don't even look right.

  45. If you did that, then Greek words containing the three letters would be sorted incorrectly.

  46. mallamb

    Aha! could just as well have been Eureka!. That use of оу was a heady mixture of the unexpected and the explanatory.

  47. Heady and taily. Quite a freak-out in fact. Thanks for that.

  48. Michael

    I followed your link to the Cyrillic Extended-B set. Of all my fonts, only Geneva has any number of characters from the set — not including uk. Even the extensive Alphabetum font has hardly any characters in this area.

    Could you recommend a font?

  49. I can recommend Code2000 for Ꙋꙋ (U+A64A..A64B CYRILLIC LETTER MONOGRAPH UK). Can you see the minuscule after monograph Uk in my post of 12:18 above? It doesn't seem to be in any of my versions of the fonts JW has specified for this blog, yet it displays OK whether I allow pages to choose their own fonts instead of my selections or not.

    Perhaps we should just give up trying to understand these mysteries. Even John's distinction between voiced velar plosive, ɡ, U+0261 and ordinary lower-case g, U+0067 isn't working under either condition.

    This Code2000 where I found Uk is very useful for Word, at any rate on PC. I think I have written about it on here before.

  50. Everson Mono, a lovely monowidth font which I use for my e-mail, has a complete set of Cyrillic characters.

  51. Michael

    Thank you. Very impressive. I don't think I'll be using Everson, however, as Alphabetum looks so good.

    A64A turns out to be a very unfamiliar character indeed. I was puzzled until I saw that it's intended as a 'capital'. A64B is very familiar, though.

    Alphabetum supplies a character similar to Everson's A64A as well as a larger version of A64B.

  52. I like monowidth e-mail. Better for ASCII art ;-)

  53. Oh, and it's Everson Mono. I'm Everson.

  54. Michael

    I didn't twig!

    I'll uninstall the font while deciding whether to buy it.

  55. Michael

    I will test it, I promise. But I don't have a suitable project on at the moment.
