Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept and resolve BCP47 language names (WIP) #1641

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Omikhleia
Copy link
Member

@Omikhleia Omikhleia commented Dec 2, 2022

Some "food for thought" regarding #1631

I fancied checking what adding \language[main=en-GB]1 to the SILE manual would imply...

Regard this a (kind of working but ugly) draft attempt.

It raises some interesting questions for the future:

  • Technical debt and/or code smell (apparently unused setting font.script, remainders of no-longer properly initialized SILE.languageSupport.languages, etc.)
  • Architecture design and internal API consistency (too many things with too many side effects stuffed in language resource file, hard to manage state-oriented API of fluent, uncertain need for cldr in the current implementation, some language files return a table while most do not, etc.)

How to move forward certainly requires a longer and tougher discussion.

Footnotes

  1. Or anything else you'd like. I played with "sr-Latn", which currently resolves to "sr" (= sr-Cyrl actually), no real interest obviously, except to see chapter titles in Cyrillic and hyphenations not working for the English text :)

@Omikhleia Omikhleia marked this pull request as draft December 2, 2022 01:09
core/font.lua Outdated Show resolved Hide resolved
core/languages.lua Outdated Show resolved Hide resolved

-- BEGIN OMIKHLEIA HACKLANG
-- Disabled for now, see further below.
-- local cldr = require("cldr")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below, if ICU actually does what we need here that's fine, otherwise cldr-lua should be extended to provide the necessary functionality.

if res then
return res, langbcp47
end
langbcp47 = langbcp47:match("^(.+)-.*$") -- split at dash (-) and remove last part.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this matching/parsing bit needs to be in cldr-lua.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wherever it eventually ends, the utility is also useful for other packages. E.g. I wish I had it for my "smartquotes" package.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ICU bits should probably get moved from the font code to some base language support module, and yes should be setup in a way that every package can use them. This bit which isn't really ICU still looks like it goes in CLDR to me and could also return information about what the segments mean.

core/languages.lua Outdated Show resolved Hide resolved
core/languages.lua Outdated Show resolved Hide resolved
@alerque
Copy link
Member

alerque commented Oct 1, 2024

I think there are two important next steps for SILE.

One is having the tooling worked out to build Rust functions to both crates & Lua rocks (so SILE can still be used as a library and we have a two way bridge to implement anything however works best). That is kind of held up waiting on mlua v0.10 because it is going to make that effort much more ergonomic.

The other one is this. Handling everything with full BCP-47 locales internally and falling back gracefully to shorter ISO language codes where necessary has became what I see as the next big push I need to make. I don't know whether we can/should do this separately from the other item above, but I'm going to be looking into it.

@Omikhleia
Copy link
Member Author

I think there are two important next steps for SILE.

I've started disentangling the hyphenation patterns in #2102 (and see also #2104 for what could be the next logical step here - I've kept those in a branch for now).

One implication is that the language/xxx files are much simpler (only some bootstrapping logic), some maybe this comment is then easier to address: #1641 (comment)

(So as to have a homogeneous implementation there, while currently some return tables, some affect globals, etc.)

@alerque
Copy link
Member

alerque commented Oct 1, 2024

Thanks for the pointers on what end of the stick to grab first, I'll go start digging. (I'm also expecting a son to be born in the next few days so don't be surprised if I drop off the radar for a while.)

@alerque alerque added enhancement Software improvement or feature request refactor Code quality improvements labels Oct 1, 2024
Copy link
Member

@alerque alerque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review comments are kind of rambling, I'm just poking around. I rebased your commit as best I could on master. I think it's broken now because I didn't rewire up the bits that changed, just figured out what hunks to pull out of the merge conflict. I'll keep going on it trying to sort out the parts.

core/font.lua Outdated
-- BEGIN OMIKHLEIA HACKLANG
-- BAD STUFF WARNING: This SILE.languageSupport.languages[] is broken, (nearly always returning nil),
-- see comment in languages.lua
-- ON THE OTHER HAND, setting font.script doesn't seem to be used in any sensible way...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The script parameter isn't used for all languages or fonts, but some locale and font combinations do make use of it and the shaper then needs this info to correctly shape them. We're passing this on to HarfBuzz if it is set, that's all we need to do with it.

core/font.lua Outdated Show resolved Hide resolved
-- BEGIN OMIKHLEIA HACKLANG
-- Either done too soon or plain wrong (a BCP47 language can e.g. contain a script for instance, sr-Latn
-- and I don't see that in https://github.com/alerque/cldr-lua/blob/master/cldr/data/locales.lua
-- I don't really know why we would want that CLDR filtering here BTW...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From memory (obviously I'll be testing this here) one of the reasons for this is because using icu.canonicalize_language() gets us the canonical language code, but it does not validate that what we got could even be recognized, and just passes through garbage if you git it garbage. If my recollection is correct here, we should be doing both. First trying to parse a canonical name, then checking if that is a known language in the database and hence we can reasonably be expected to know anything about it.

@alerque alerque force-pushed the bcp47-language-attempt-draft branch from 261c295 to e9999ac Compare October 4, 2024 12:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Software improvement or feature request refactor Code quality improvements
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

2 participants