A rather long and rambling post about various things that are likely to cause headaches when dealing with the world outside your own country. More accurately, things that have caused me headaches in the past, so others can learn from my pain.
This is also a bit of a meta-post, because it covers a lot of other posts I've written about i18n and l10n. Most of those posts are based on my experience working on software that needed to consider all this stuff and the Java scribbles I've used to demonstrate some of the complexities.
Anyway, there's a lot of different stuff in here, all based on stuff I've learnt from doing far too much work on making multiple projects, webapps and websites work for various different languages and countries.
A lot of this is also covered in other websites and articles; the Wikipedia article is a good place to start.
So, might as well start with the most obvious stuff...
Writing your app/website/whatever to support one language is easy (or at least, as easy as it gets). Enjoy it. Get all your business logic right, get the UI/UX working as well as you can, get people using it and act on their feedback.
Sadly, the first step you take into supporting another language is where you have to do the majority of the hard work. This isn't a case of getting more complex the more languages you support - a large chunk of the complexity hits you the moment you jump from one language to two. That's the point when you need to extract all your language strings to a dictionary (resource files, language DB, i18n template system, etc), convert all your number/date/currency formatting to use proper locale formatters, add a language picker, add support for recording language & country choice for the users, etc.
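As a rough illustration of what "extract all your language strings to a dictionary" means in code, here's a minimal Java sketch. The `MessageDictionary` class, its keys and its translations are all made up for this example. A real app would more likely load the strings from resource files (for example via `java.util.ResourceBundle`) than hard-code a map, but the lookup-with-fallback idea is the same.

```java
import java.text.MessageFormat;
import java.util.Locale;
import java.util.Map;

class MessageDictionary {
    // Hypothetical in-memory dictionary, keyed by language code then message key.
    private static final Map<String, Map<String, String>> STRINGS = Map.of(
        "en", Map.of("greeting", "Hello {0}", "basket.total", "Total: {0}"),
        "fr", Map.of("greeting", "Bonjour {0}"));

    private static final String FALLBACK_LANGUAGE = "en";

    // Look up the pattern for the user's language, falling back to English
    // when the language or the individual key has not been translated yet.
    static String translate(Locale locale, String key, Object... args) {
        Map<String, String> fallback = STRINGS.get(FALLBACK_LANGUAGE);
        Map<String, String> strings = STRINGS.getOrDefault(locale.getLanguage(), fallback);
        String pattern = strings.getOrDefault(key, fallback.get(key));
        if (pattern == null) {
            return "??" + key + "??"; // make missing keys easy to spot in testing
        }
        return new MessageFormat(pattern, locale).format(args);
    }

    public static void main(String[] args) {
        System.out.println(translate(Locale.FRANCE, "greeting", "Ana"));     // translated
        System.out.println(translate(Locale.FRANCE, "basket.total", "3"));   // key falls back
        System.out.println(translate(Locale.JAPAN, "greeting", "Ana"));      // language falls back
    }
}
```

Doing this conversion while you still only have the "en" entries is the "same language, i18n version" step: nothing visible should change, which makes it easy to test.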
That first chunk of i18n is going to need at least one release all to itself. Ideally, all the i18n work should be done without adding the second language. Convert your mono-lingual app/website into its i18n version, with the same language. Then check it works. Once that's done then add your second language and hit it with serious testing.
After the second language, there's likely to be more i18n complexity with each new language/country (right-to-left, font choices/sizes, colours, etc), but that first step takes a disproportionate amount of time & effort. So prepare for it and work with the business teams to plan the right time to introduce that second language.
When thinking about i18n, the obvious bit is translating all the words into the new language. And yes, that's where the majority of the effort lies. You need to extract all the text, put it in a dictionary, send it to translators, answer all their questions, check and import the resulting translations, then check and fix the app with the new language support.
Not necessarily all text - you may have some cases where you deliberately don't translate some pieces of text. That could be things like: company names, specific legal wording, quotes/comments from other people, text about translations, etc.
But, because the words are the obvious bit, I won't talk about that much more - from now on I'll talk about all the bits which are less obvious, and have left their scars on my memory.
To quote a colleague of mine, who says this whenever any translation issue occurs:
Where did you forget to use unicode?
More specifically - the rule of thumb is make sure everything is UTF-8. There may be outlying cases where you're either dealing with embedded systems, or older binary formats, or various comms protocols, but in those cases, build wrappers around them for any text going in or out.
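Here's a small sketch of what "build wrappers" can look like in Java: every conversion between text and bytes names its charset explicitly instead of relying on the platform default. The `Utf8Boundary` class and its method names are invented for this example, and the Latin-1 decoder is just a stand-in for some older system that assumes a different encoding. The string literals use Unicode escapes so the sample doesn't depend on the encoding of the source file itself.

```java
import java.nio.charset.StandardCharsets;

class Utf8Boundary {
    // Text going out: always encode as UTF-8, never the platform default.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Text coming in: always decode as UTF-8.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Stand-in for a legacy component that wrongly assumes ISO-8859-1.
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // "Japanese", written in Japanese
        byte[] wire = encode(japanese);
        System.out.println(decode(wire).equals(japanese));         // round trip is lossless
        System.out.println(decodeAsLatin1(wire).equals(japanese)); // wrong charset garbles it
    }
}
```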
No silver bullet
Although UTF-8 does help a lot, it doesn't get rid of all the awkwardness. There are still plenty of encoding-related things to worry about: zero-width characters, variable byte sizing, font choices and fallbacks, similar/identical characters, BiDi considerations, etc. I'll write a bit more about this in future.
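A few of those are easy to demonstrate with nothing but the JDK. This is only a sketch (the `TextLengths` class and its helpers are my own names), showing that "how long is this string?" has at least three answers, and that two strings which look identical on screen may not be equal until you normalise them.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class TextLengths {
    // Number of bytes the text takes up once encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, which can be fewer than String.length().
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Compare two strings after normalising both to NFC.
    static boolean sameAfterNfc(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
            .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // one grinning-face emoji, stored as a surrogate pair
        System.out.println(emoji.length() + " chars, " + codePoints(emoji)
            + " code point, " + utf8Bytes(emoji) + " bytes");

        String composed = "\u00e9";    // e-acute as a single code point
        String combining = "e\u0301";  // plain e followed by a combining acute accent
        System.out.println(composed.equals(combining));       // false
        System.out.println(sameAfterNfc(composed, combining)); // true

        String sneaky = "a\u200Bb";    // contains a zero-width space
        System.out.println(sneaky.equals("ab"));              // false, despite looking the same
    }
}
```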
This gets a separate post all to itself, but to repeat the basic points:
- Most popular languages are spoken in many countries
- Countries often have more than one language
- Some flags cause real offence/annoyance to other countries
- Flags can be confusingly similar, or unrelated to their given language
So don't use flags or other pretty icons to represent languages. If you're displaying a language picker, show each of the language names in their own language using a suitable font, and prettify the styling instead.
Languages and countries have names, this bit shouldn't come as a surprise. It also shouldn't be a surprise that different languages have different names for other languages and countries. The most obvious example I see of getting this wrong is language pickers which show the list of supported languages in one specific language.
There's no point showing someone their own language using another language's name for it. If you want a Japanese visitor to your site to pick your carefully translated Japanese version - don't ask them to pick "Japanese", ask them to pick "日本語".
Similarly, bear in mind that country and language names are not always related (obvious example - Americans speak English, not American), countries have different names in different languages (e.g. the United Kingdom in French is "Royaume-Uni") and language names are different in other languages (e.g. "Arabic" - or as a native speaker might say - "العربية", or as the French say - "arabe").
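In Java the building blocks for this are `Locale.getDisplayLanguage` and `Locale.getDisplayCountry`, both of which take the locale you want the name written in. The sketch below is a minimal example (the `LanguagePicker` class and `autonym` helper are my own names, and the list of tags is arbitrary); it assumes the JDK's full locale data is available, which it is in a standard JDK install.

```java
import java.util.List;
import java.util.Locale;

class LanguagePicker {
    // The name of a language written in that language itself.
    static String autonym(Locale locale) {
        return locale.getDisplayLanguage(locale);
    }

    public static void main(String[] args) {
        // What a language picker should show: each language in its own words.
        for (String tag : List.of("en", "fr", "de", "ja", "ar")) {
            Locale locale = Locale.forLanguageTag(tag);
            System.out.println(tag + " -> " + autonym(locale));
        }

        // The same country and language, named for a French reader.
        System.out.println(Locale.UK.getDisplayCountry(Locale.FRENCH));
        System.out.println(Locale.forLanguageTag("ar").getDisplayLanguage(Locale.FRENCH));
    }
}
```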
Java 9 and later versions have done a lot of work to handle the names of languages and countries, across a wide range of languages. I did a bit of Java scribbling to show what names Java supports for languages and countries in a separate blog: Country names in different languages
Just as an aside: Between Java 8 and Java 9, as part of all their i18n work, the number of supported locales (languages+variants) jumped from 160 to 736 (the number of countries stayed at about 250). Each of those locales has formatting rules for various types, and translations for all sorts of information. Java 9 i18n team, I salute you!
There are many articles about how hard this is. One of my favourites is the infamous Falsehoods Programmers Believe About Names.
The rough summary is - names are complex and it's best not to try to match that complexity in your code, because you'll fail. Instead, only store what you need. If you just need to know how someone wants to be addressed - ask them that, and store the answer, as they write it. By all means give them an id that your system can use, but display their name as they use it.
And if getting their name right is hard, actually using it when you're addressing them gets even harder. One of the simpler examples is T-V addressing in French ("tu" and "vous"); other languages have much more interesting rules (Polish, Russian, Chinese, ...).
The best answer here is rather than figuring out the rules yourself, pass the buck - use their name as it appears on something else. If you're dealing with a process/system where you have access to some formal documentation addressed to the person, use the name as it appears there. If they've already given you their name on a letter or email, use that.
Addresses - what are they for? Mainly, for finding places and more usually - people. One of the big reasons for addresses is so we can send bits of paper (and other stuff) to them. Most countries have organisations set up to do this, maybe with a nice friendly bloke called Pat, in a van, with a cat.
The trouble is that Pat in England and 皮皮鲁 in China don't speak each other's languages. So if I post a letter in the UK, for someone in China, what language do I use for the address? If I write it in English, Pat can get it on the plane to China but 皮皮鲁 won't know what to do with it. If I write the address in Chinese, Pat will be lost.
The answer is that addresses need to be partially translated if we expect them to be used internationally. The country name needs to be written in the language of the person posting the letter (e.g. "China"), so Pat can read it and the rest of the address needs to be written in the language of the person receiving the letter (e.g. "中国北京市东城区东单东长安街33号 邮政编码: 100006" - actually that also includes "China" in Chinese), so 皮皮鲁 can get in his van to deliver it.
Luckily, Pat and 皮皮鲁 and the posties of the world are usually very helpful, they'll try to figure out what the address means, however it's written. But that's not an excuse for making their life harder and risking your letters being misdelivered or stuck in the "dunno" box at the sorting office.
As an aside - address formatting is a whole different pile of fun. In that address above (Beijing central post office) it goes from widest to smallest geographic area (中国 = China, 100006 is the postcode for the building). Other countries do things differently - your address fields and validation logic will become interesting - write it all down!
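One way to "write it all down" is to keep the ordering rules as data rather than burying them in formatting code. The sketch below is deliberately tiny and entirely hypothetical: the `AddressLayout` class, the `Part` names and both orderings are my own simplifications (the "CN" order follows the Beijing example above, with the recipient slotted in as an assumption), not a postal standard.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class AddressLayout {
    enum Part { RECIPIENT, STREET, CITY, POSTCODE, COUNTRY }

    // Hypothetical, heavily simplified orderings for just two countries.
    private static final Map<String, List<Part>> ORDER = Map.of(
        "GB", List.of(Part.RECIPIENT, Part.STREET, Part.CITY, Part.POSTCODE, Part.COUNTRY),
        "CN", List.of(Part.COUNTRY, Part.CITY, Part.STREET, Part.RECIPIENT, Part.POSTCODE));

    // Return the address parts in the order recorded for the given country.
    static List<String> layout(String countryCode, Map<Part, String> address) {
        List<Part> order = ORDER.get(countryCode);
        if (order == null) {
            // Better to fail loudly than to guess another country's rules.
            throw new IllegalArgumentException("No address layout recorded for " + countryCode);
        }
        List<String> lines = new ArrayList<>();
        for (Part part : order) {
            String value = address.get(part);
            if (value != null && !value.isEmpty()) {
                lines.add(value);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up address values, purely for illustration.
        Map<Part, String> address = Map.of(
            Part.RECIPIENT, "A. Recipient",
            Part.STREET, "1 Example Street",
            Part.CITY, "Exampleton",
            Part.POSTCODE, "EX1 1EX",
            Part.COUNTRY, "United Kingdom");
        System.out.println(layout("GB", address));
        System.out.println(layout("CN", address));
    }
}
```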
Dates and times are hard. Once you start digging into how they actually work around the world, especially if you need to deal with dates a fair way into the past, you're starting down a very deep rabbit hole after Lewis Carroll's white rabbit.
There are many articles talking about how hard this topic gets, but just to get you started, here's another "falsehoods ... " article: https://infiniteundo.com/post/25326999628/falsehoods-programmers-believe-about-time
On the back of some work I did figuring out how to handle DST in different countries, I wrote a separate blog post about TimeZones in Java
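For a quick taste of the DST problem, here's a small `java.time` sketch using the UK clock changes in 2018 (the `DstDemo` class and its method names are mine). It shows that a "day" isn't always 24 hours long, and that some wall-clock times simply don't exist; `ZonedDateTime.of` copes with those by shifting the time forward by the length of the gap.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class DstDemo {
    static final ZoneId LONDON = ZoneId.of("Europe/London");

    // How long a calendar day really lasts in a given zone.
    static Duration lengthOfDay(LocalDate date, ZoneId zone) {
        ZonedDateTime start = date.atStartOfDay(zone);
        ZonedDateTime end = date.plusDays(1).atStartOfDay(zone);
        return Duration.between(start, end);
    }

    // Resolve a wall-clock time in a zone. Times that fall in a DST gap
    // are moved later by the length of the gap.
    static ZonedDateTime resolve(LocalDateTime wallClock, ZoneId zone) {
        return ZonedDateTime.of(wallClock, zone);
    }

    public static void main(String[] args) {
        System.out.println(lengthOfDay(LocalDate.of(2018, 3, 25), LONDON));  // clocks go forward
        System.out.println(lengthOfDay(LocalDate.of(2018, 10, 28), LONDON)); // clocks go back
        // 01:30 on 25 March 2018 never happened in London.
        System.out.println(resolve(LocalDateTime.of(2018, 3, 25, 1, 30), LONDON));
    }
}
```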
And just in case you thought that although time might be hard, it can't be hard to decide what year it is, I also did some poking into Java Chronologies.
Quick summary - try to only deal with the ISO (Gregorian) calendar, and use all the available date and time library functions you can. In Java, that means the newer java.time API (which pretty much copies the earlier Joda-Time library).
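To show why "what year is it?" isn't a silly question, here's a short sketch using the alternative chronologies that ship in `java.time.chrono`. The `WhatYearIsIt` class and its helper names are mine; the date is just the one used elsewhere in this post.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.MinguoDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

class WhatYearIsIt {
    // Year in the Thai Buddhist calendar for the same day.
    static int thaiBuddhistYear(LocalDate iso) {
        return ThaiBuddhistDate.from(iso).get(ChronoField.YEAR);
    }

    // Year in the Minguo (Republic of China) calendar for the same day.
    static int minguoYear(LocalDate iso) {
        return MinguoDate.from(iso).get(ChronoField.YEAR);
    }

    // Year within the current era of the Japanese imperial calendar.
    static int japaneseYearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        LocalDate day = LocalDate.of(2018, 6, 16);
        System.out.println("ISO:           " + day.getYear());
        System.out.println("Thai Buddhist: " + thaiBuddhistYear(day));
        System.out.println("Minguo:        " + minguoYear(day));
        System.out.println("Japanese:      " + JapaneseDate.from(day).getEra()
            + " " + japaneseYearOfEra(day));
    }
}
```

Keeping everything as ISO dates internally and only converting at the edges, for display, means the rest of your code never has to care which of these the user prefers.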
One final comment on time - Leap Seconds. These have been happening for years, but every time a new one is added, operating systems and NTP servers everywhere have a flurry of new problems to address, including knowing which "timezones" don't acknowledge them (e.g. TAI). Here's a detailed article written about the last one, in 2016: Preparing for the 2016 Leap Second
Once you've extracted all the text, sorted out all the date and time differences, figured out how to address people and locations there's still the problem of how to stick all that information back onto the page/screen in a format the user finds familiar and useful. I've mentioned some of these elsewhere in this article, but some of the things that will need their formatting to be locale aware are:
- dates and times - everyone has their own way of writing these, sometimes 3 or 4 ways!
- numbers in general - decimal separators, thousands (or otherwise) separators, negative numbers
- monetary amounts - often different to general numbers, with fewer decimals, extra symbols (or not)
- measurements - especially those including fractions (e.g. 4 yards, 3 feet, 2 inches, 320 thou)
- Ordinal numbers - the 2nd floor, or the 2º floor in Italy
- Honorifics and titles - do you call Fred Bloggs with a PhD "Doctor Bloggs", "Fred Bloggs PhD" or "Dr. Fred Bloggs"? What about in French?
- Place addresses - largest to smallest area (e.g. China), or smallest to largest (e.g. UK), post/zip codes before or after city/region, etc.
I'll just throw out a few examples that have bitten me in the past...
A quick bit of Java 10 JShell
    import java.text.DateFormat;
    var df = DateFormat.getDateInstance(DateFormat.SHORT, Locale.CHINA); df.format(new Date());
    $29 ==> "2018/6/16"
    var df = DateFormat.getDateInstance(DateFormat.MEDIUM, Locale.CHINA); df.format(new Date());
    $23 ==> "2018年6月16日"
    var df = DateFormat.getDateInstance(DateFormat.LONG, Locale.CHINA); df.format(new Date());
    $25 ==> "2018年6月16日"
Yup, the medium and long forms are identical, and both have one more character than the short form.
You weren't counting on all country flags being rectangles for some nice HTML styling were you?
I wrote a separate article about localising currencies in Java.
One of the points I made there is that not all the "currencies" used around the world are actually currencies used to actually buy & sell things. Hence they don't all have exchange rates, which can cause problems, especially if you're dealing with complex test cases.
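You can see both points (different numbers of decimals, and "currencies" that aren't really money) from `java.util.Currency`. This is a minimal sketch; the `CurrencyDigits` class and the choice of codes are mine. `getDefaultFractionDigits` returns -1 for pseudo-currencies such as the IMF's Special Drawing Rights (XDR).

```java
import java.util.Currency;
import java.util.List;

class CurrencyDigits {
    // Number of decimal places normally used with a currency,
    // or -1 for pseudo-currencies that have no minor unit.
    static int minorUnits(String isoCode) {
        return Currency.getInstance(isoCode).getDefaultFractionDigits();
    }

    public static void main(String[] args) {
        // USD and JPY are everyday currencies, BHD uses three decimals,
        // XDR (IMF Special Drawing Rights) and XAU (gold) are not spendable money.
        for (String code : List.of("USD", "JPY", "BHD", "XDR", "XAU")) {
            Currency currency = Currency.getInstance(code);
            System.out.println(code + " (" + currency.getDisplayName() + "): "
                + minorUnits(code) + " decimals");
        }
    }
}
```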
Another quick bit of Java 10 JShell, showing the same number formatted in Ecuadorian Spanish and Guatemalan Spanish.
    import java.text.NumberFormat;
    var nf = NumberFormat.getInstance(Locale.forLanguageTag("es-ec")); nf.format(1234567.89);
    $40 ==> "1.234.567,89"
    var nf = NumberFormat.getInstance(Locale.forLanguageTag("es-gt")); nf.format(1234567.89);
    $46 ==> "1,234,567.89"
That's two different Spanish variants, from countries fairly close to each other. One uses dots for thousands and commas for decimals and the other does it properly (ahem).
"plus one year, minus one day" is not the same as "minus one day, plus one year". Start from the 1st March 2012. Add a year, go back one day, you're on 28th Feb 2013. Now start from the same day, go back one day and add a year, you're on the 29th Feb 2013 - which doesn't exist, exception throwing time!
And that's just dealing with the ISO calendar - so not strictly i18n related. However, there are different calendars in use around the world and throughout history (as I mentioned above). If you need to do any date translation between calendars, or even worse, date calculations across calendars, be afraid - and write a big pile of test cases.
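Going back to the "plus one year, minus one day" example, here's how it plays out in `java.time` (the `DateOrderDemo` class and method names are mine). One thing worth knowing: `LocalDate.plusYears` doesn't throw when it lands on a non-existent 29th Feb, it quietly clamps to the last valid day of the month, so starting from 1st March 2012 both orders end up on 28th Feb 2013. Building the date by hand with `LocalDate.of(2013, 2, 29)` is what throws. Start from 1st March 2011 instead and the two orders give genuinely different answers.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

class DateOrderDemo {
    static LocalDate yearThenDay(LocalDate start) {
        return start.plusYears(1).minusDays(1);
    }

    static LocalDate dayThenYear(LocalDate start) {
        return start.minusDays(1).plusYears(1);
    }

    // True if the given year/month/day is a real ISO calendar date.
    static boolean isValidDate(int year, int month, int day) {
        try {
            LocalDate.of(year, month, day);
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2011, 3, 1);
        System.out.println(yearThenDay(start)); // ends on the leap day of 2012
        System.out.println(dayThenYear(start)); // ends a day earlier

        System.out.println(isValidDate(2013, 2, 29));                // no such date
        System.out.println(dayThenYear(LocalDate.of(2012, 3, 1)));   // clamped, not thrown
    }
}
```

Silent clamping is friendlier than an exception, but it also means the bug won't announce itself, which is one more reason for that big pile of test cases.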
Eventually you start to get the rules straight, settle on some translations, use the right formatting tools, you've validated everything everywhere and generally started to get that backlog of i18n issues down to a reasonable level.
Then in sneaks real life. Countries change their name, Libya changes their flag, Tonga irregularly changes their TimeZone, Netherlands Antilles stops existing, a number of currencies disappear overnight when the Euro appears, New Zealand might add a new official language, and on and on.
I have no advice about this, other than to say it isn't going to stop, you can't prevent people changing. Just raise the ticket, sort out the new rules, get the business to agree and make the relevant changes when you need to.
All the stuff above is actually fairly straightforward - because on the whole there are fairly well defined rules to follow. They may be complex rules with lots of edge cases and unexpected interactions, but at least they are specified.
The real complexity appears as we get to more subtle, cultural things like colour choices, familiar print/screen layouts, semantics and grammar, forms of address. With this stuff there's no single right answer, although there are an awful lot of wrong answers - any of which can start an argument.
If you're being serious about i18n, then on top of all the issues to do with just being understood and not confusing people, you're going to need to consider some complex political and historical issues. This is a massive topic, but to give a few quick examples of the kind of stuff you'll be getting hit with...
- Is "Palestine" a country? (Depends who you ask and who you want to write hit pieces about you on social media)
- Do the Taiwanese speak "Chinese"? (and which variant, in which country)
- Does the Isle of Man count as European, or British? (depends if you're buying duty free, setting up a financial deal, or going on a day trip)
- Is Puerto Rico in the USA? (depends whether they've just had a disaster and the US is trying to avoid helping)
- Can someone give themselves the username "Jehovah"? Or "Goebbels"? (Depends on your religion, country, legal system, and your parents)
As a techie, and a fairly agnostic one at that, I'd say the only safe answer to this level of complexity is to push the questions back to whoever is giving the requirements for the software you're building. Get the answers in writing, get them precise, build specific tests, and get those tests to pass.
I might personally be tempted to argue for one answer or another if there's a fairly clear ISO standard or UN mandate, but if the business or customer wants to clearly state the answer they want to use, I'll usually document the discussion and the decision then get on with building it.