format-integer for Spanish

Added by Martin Honnen over 2 years ago

I hope everyone is having a quiet evening. As Christian started playing with format-integer and languages, I couldn't resist using it as well. I tried it for Spanish and found that the code

declare namespace output = "";

declare option output:method 'xml';
declare option output:indent 'yes';

  for $year in (1492, 1976, 1984)
  return <format-integer number="{$year}">{format-integer($year, 'w', 'es')}</format-integer>

gives the output shown below

C:\Users\Martin Honnen\OneDrive\Documents\xslt\blog-xslt-3-by-example\format-integer>java -cp "C:\Program Files\Saxonica\SaxonEE11-1J\saxon-ee-11.1.jar"  net.sf.saxon.Query -t format-integer-es-test1.xq
SaxonJ-EE 11.1 from Saxonica
Java version 1.8.0_272
Using license serial number xxxxx
Analyzing query from format-integer-es-test1.xq
Analysis time: 515.1488 milliseconds
<?xml version="1.0" encoding="UTF-8"?>
   <format-integer number="1492">mil cuatro­cientas noventa y dos</format-integer>
   <format-integer number="1976">mil nove­cientas setenta y seis</format-integer>
   <format-integer number="1984">mil nove­cientas ochenta y cuatro</format-integer>

, although when I save it to a file the editor shows a soft hyphen between e.g. nove and cientas that copy/pasting here into the textarea swallows.

Aside from the soft hyphen, what astonishes me is the -as in cientas: the masculine form (which I think is the normal, regular form) would be cientos, and I just can't imagine that feminism has invaded ICU or Saxon far enough to make it cientas instead of cientos :).

So I tried running ICU4J but it seems to give e.g. novecientos. Just out of curiosity, what makes Saxon EE 11.1 output cientas?

format-integer-test-result-saxonee11.1J.xml (306 Bytes): format-integer test with language es and SaxonJ 11.1 EE

Replies (4)


RE: format-integer for Spanish - Added by Michael Kay over 2 years ago

We look at all the spellout options available for the chosen locale, and apply some heuristics to choose among them. Part of the algorithm is to apply a preference list, which reads

private static final String[] preferences = {"-verbose", "", "-native", "-neuter", "-feminine", "-masculine"};

and it's the order of this list that's causing us to prefer feminine over masculine.
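To illustrate how such a preference list can end up picking the feminine form, here is a minimal sketch. It is not Saxon's actual code: the class name, the `choose` helper, and the idea that each suffix is simply appended to a base rule-set name and matched against the available names are assumptions for illustration, and the two Spanish rule-set names are example data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class RuleSetPreference {
    // Suffix order as quoted in the post above.
    private static final String[] PREFERENCES =
            {"-verbose", "", "-native", "-neuter", "-feminine", "-masculine"};

    // Hypothetical helper: returns the first base+suffix name that is available,
    // trying the suffixes in preference order.
    static Optional<String> choose(String base, List<String> available) {
        for (String suffix : PREFERENCES) {
            String candidate = base + suffix;
            if (available.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed example: a locale offering only gendered cardinal rule sets.
        List<String> spanish = Arrays.asList(
                "%spellout-cardinal-masculine",
                "%spellout-cardinal-feminine");
        // "-feminine" precedes "-masculine" in PREFERENCES, so feminine wins.
        System.out.println(choose("%spellout-cardinal", spanish).orElse("none"));
    }
}
```

With both gendered names available and no plain or verbose variant, the loop reaches "-feminine" first, which matches the behaviour described above.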

I've no idea why it was written that way; I'll have to consult the original author (John Lumley).

The localisation theory (in both F+O and ICU) recognizes that there are cardinal numbers and ordinal numbers, but it doesn't recognize that cardinal numbers can be used both adjectivally ("39 steps") and nominatively ("Step 39"). Dates (like 1984) are a nominative context - and have the additional complication that the spellout form is "nineteen eighty-four", not "one thousand nine hundred [and] eighty-four" (the [and] being present in en-GB but not en-US).

F+O only recognises that gender might be relevant for ordinal numbers, but ICU also allows gendered forms for cardinals. I've no idea whether, in a language like Spanish that offers both forms, they are relevant in both adjectival and nominative contexts.

I seem to recall an idea that you should be able to specify the ICU name for the numbering scheme as a modifier in the pattern, for example "wc($spellout-cardinal-masculine)". But I can't see any evidence of this being implemented in the code - perhaps it was just an idea.

RE: format-integer for Spanish - Added by Michael Kay over 2 years ago

In fact there is a trick that I'd forgotten: you can specify o(2=dos) to select the numbering sequence in which 2 is represented by "dos". Unfortunately this isn't good enough for Spanish, where the gender distinction only kicks in for a few numbers like 1 and 31 and 200. Perhaps we should allow you to specify any value that can be used as a discriminant, e.g. o(1=una).

RE: format-integer for Spanish - Added by Michael Kay over 2 years ago

John Lumley has reminded me of another feature documented at!localization/ICU-numbers-and-dates

By using language code es-x-scm you should be able to get the numbering scheme spellout-cardinal-masculine (scm being an initialism for spellout-cardinal-masculine).
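As a rough illustration of the naming convention only, the sketch below pulls the private-use subtag out of the language tag with the standard `java.util.Locale` API and checks that it equals the initialism of the rule-set name. The class and the `initialism` helper are hypothetical; how Saxon actually resolves the subtag is not shown in this thread.

```java
import java.util.Locale;

public class PrivateUseInitialism {
    // Hypothetical helper: first letter of each hyphen-separated part,
    // so "spellout-cardinal-masculine" becomes "scm".
    static String initialism(String ruleSetName) {
        StringBuilder sb = new StringBuilder();
        for (String part : ruleSetName.split("-")) {
            if (!part.isEmpty()) {
                sb.append(part.charAt(0));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Parse the BCP 47 tag; the part after "x-" is the private-use extension.
        Locale locale = Locale.forLanguageTag("es-x-scm");
        String privateUse = locale.getExtension(Locale.PRIVATE_USE_EXTENSION);

        System.out.println(locale.getLanguage() + " / " + privateUse);
        // Compare the subtag with the initialism of the rule-set name.
        System.out.println(initialism("spellout-cardinal-masculine").equals(privateUse));
    }
}
```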

RE: format-integer for Spanish - Added by Martin Honnen over 2 years ago

Yes, that works, ¡mil gracias!

