MediaWiki talk:Gadget-dictionaryLookupHover.js/how to adapt to another language

How to adapt the wiktionary lookup xslt to other languages

The basic idea is to take the English script and translate the variables marked for translation. Start from the English script, since it is the easiest to translate. If for some reason you start from another language's script, whatever you do, do not use the French script: it is formatted in a way that is harder to translate than the other scripts.

This more or less works. However, the script basically screen-scrapes the HTML, and the HTML differs between language editions, so sometimes you have to change more than this.

It is assumed you have a basic knowledge of regex and JavaScript (a very basic knowledge; a lot of this can be done without it).

P.S. I know this is an ugly script. It's actually mostly using XSLT as a vector to execute JavaScript on the results of the API, rather than actually using XSLT.

The easy parts to translate (part 1; xslt)

Let's start with the XSLT variables. Look for:

 <!-- for translation. Also see JS -->
 <xsl:variable name="dir">ltr</xsl:variable>
 <xsl:variable name="more">» More</xsl:variable>
 <xsl:variable name="error">Error: </xsl:variable>
 <xsl:variable name="copyright"> © <a href="http://en.wiktionary.org/wiki/">Wiktionary</a>. Released under <a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license copyright">CC-BY-SA 3.0</a></xsl:variable>
 <xsl:variable name="contentLang" select="'en'"/> <!-- make sure quoted-->
<!-- END XSLT VARIABLES TO TRANSLATE. SEE JS as well -->

The part you translate is the part enclosed by the tag or by the select attribute:

<xsl:variable name="some_name">Translate this part</xsl:variable>
<xsl:variable name="some_name" select="'translate this part'"/>

Note, for the variables that use select, it is important that they have the double and single quotes as shown above.

dir (text direction)

For left to right languages (English, French, Spanish, etc):

<xsl:variable name="dir">ltr</xsl:variable>

For right to left languages (Hebrew, Arabic, etc):

<xsl:variable name="dir">rtl</xsl:variable>

more

This variable is the text of the link that displays more info (that is, the link to the full definition). In English we use » More.

<xsl:variable name="more">» More</xsl:variable>

error

This introduces an error message. It is displayed in case of an API error (most commonly if someone tries to look up an illegal title, such as <). It is important to have a space at the end of this variable, as the API error message (which is not translated) is added directly after it. In English we use:

<xsl:variable name="error">Error: </xsl:variable>
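Why the trailing space matters can be seen in a small sketch. The function name formatError and the sample API message are made up for illustration; this only mimics the "added directly after" behaviour described above, and is not how the real script is written.

```javascript
// Hypothetical helper (not part of the gadget): mimics how the translated
// prefix and the untranslated API message end up next to each other.
function formatError(errorPrefix, apiMessage) {
  // No separator is added here, so the prefix must bring its own space.
  return errorPrefix + apiMessage;
}

console.log(formatError('Error: ', 'Bad title')); // prefix ends with a space
console.log(formatError('Error:', 'Bad title'));  // no space: the words run together
```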

copyright

This is the copyright statement, and it is one of the more complicated messages. It is important to have a space before the copyright (©) sign. Be careful when translating the links: rel="license copyright" should not be translated, nor should href or a (basically, don't translate anything enclosed by < and >). However, the URLs may need to be changed to point to your local Wiktionary site and to the translated CC license. Here's what we use in English:

<xsl:variable name="copyright"> © <a href="http://en.wiktionary.org/wiki/">Wiktionary</a>. Released under <a href="http://creativecommons.org/licenses/by-sa/3.0/" rel="license copyright">CC-BY-SA 3.0</a></xsl:variable>

and in French:

<xsl:variable name="copyright"> © <a href="http://fr.wiktionary.org/wiki/"> Wiktionnaire</a>. Paru en <a href="http://creativecommons.org/licenses/by-sa/3.0/deed.fr" rel="license copyright"> CC-BY-SA 3.0 </a></xsl:variable>

contentLang

  • Two or three letter language code. This is used in the lang attribute, and is also assumed to be the start of the URL for the project (that is, http://<whatever the contentLang is>.wiktionary.org).

Note: it is important that this has both the double quotes and the single quotes, and that the double quotes enclose the single quotes. For English we use:

<xsl:variable name="contentLang" select="'en'"/> 
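To make the URL assumption concrete, here is a minimal sketch. The helper name projectBaseUrl is hypothetical; it only illustrates the "start of the URL" rule described above, and the real script may build its URLs differently.

```javascript
// Hypothetical helper, not part of the gadget: shows what the script is
// assumed to do with contentLang when it needs the project's address.
function projectBaseUrl(contentLang) {
  return 'http://' + contentLang + '.wiktionary.org';
}

console.log(projectBaseUrl('en'));
console.log(projectBaseUrl('fr'));
```

So if your Wiktionary does not live at <code>.wiktionary.org, the code alone will not be enough and the script itself needs changing.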

The easy parts to translate (part 2; JavaScript)

Look for:

 function setup () {
 //Stuff to translate:
 var preferLang = {'en': 'English', 'fr': 'French', 'de': 'German', 'qqqAny': null}; //for now.
 var extractSeeAlso = /<div class=\"disambig-see-also(?:-2)?\">[\s\S]*?<\/div>/; //no subexpressions!
 var see_also_process = function (sa) {return sa;}
 var createLink = '» Create'; // text only.
 var not_found = "Could not retrieve definition of $1.";
 
 //END stuff to translate (there is one more translation block below)

If you find this section confusing, just give a translation for the text of the create link (the text shown in place of more if the article does not exist) and for the not found message (in English, Could not retrieve definition of <some word>).

preferLang

This is an associative array (or object, in JS speak) mapping language codes to language names, written in whatever language you're working in. Keep the 'qqqAny': null, and make sure you have a comma after each pair except the last one. It's important that it can map your language code to your language's name. It's a good idea to map some other common languages too, but it's not critical. This step can generally be done with Google Translate. (Note: some language projects, like French, need a different mapping scheme; almost every other language does it this way.) Here's what it looks like in English:

var preferLang = {'en': 'English', 'fr': 'French', 'de': 'German', 'es': 'Spanish', 'it': 'Italian', 'pt': 'Portuguese', 'ja': 'Japanese', 'pl': 'Polish', 'ru': 'Russian', 'nl': 'Dutch', 'qqqAny': null}; //for now.

and in Dutch (this should probably include more languages):

var preferLang = {'nl': 'Nederlands', 'en': 'Engels', 'qqqAny': null};
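If you want to double-check a translated map before using it, something like the following sketch could help. The function checkPreferLang is made up for this page and is not part of the gadget; it only tests the two requirements mentioned above (the 'qqqAny': null entry and a name for your own language code).

```javascript
// Hypothetical sanity check for a translated preferLang map.
function checkPreferLang(preferLang, contentLang) {
  const problems = [];
  // The special 'qqqAny' key must be present and must be null.
  if (!Object.prototype.hasOwnProperty.call(preferLang, 'qqqAny') ||
      preferLang.qqqAny !== null) {
    problems.push("'qqqAny': null is missing");
  }
  // Your own language code must map to a language name.
  if (typeof preferLang[contentLang] !== 'string') {
    problems.push('no name for language code ' + contentLang);
  }
  return problems;
}

const dutchMap = {'nl': 'Nederlands', 'en': 'Engels', 'qqqAny': null};
console.log(checkPreferLang(dutchMap, 'nl'));          // no problems found
console.log(checkPreferLang({'en': 'Engels'}, 'nl'));  // reports both problems
```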

createLink

This one is easy. This is the text of the create link (which replaces the more link if the article does not exist). Note that this is treated as plain text, so feel free to use < without escaping if you so desire. In English we use:

var createLink = '» Create'; // text only.

extractSeeAlso and see_also_process

This part is overly technical and requires knowledge of regex and HTML; see the bottom of this page. If you're not sure about it, leave it to user:Bawolff.

not_found

The text shown when the word you clicked on could not be found. $1 is replaced with the word in question. For example, in English we use:

var not_found = "Could not retrieve definition of $1.";
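As a rough illustration of the $1 substitution, assuming a simple one-time replacement (the helper fillNotFound is hypothetical, and the real script may substitute differently):

```javascript
var not_found = "Could not retrieve definition of $1.";

// Hypothetical helper: fills the $1 placeholder of the not_found message.
function fillNotFound(template, word) {
  // A function is used as the replacement so that "$" sequences inside the
  // word itself are not treated as special replacement patterns.
  return template.replace('$1', function () { return word; });
}

console.log(fillNotFound(not_found, 'gadget'));
```

This is why $1 can go anywhere in your translation: put it wherever the word belongs in your language's word order.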

The hard part to translate

Note: this part is technical and requires knowledge of regex and HTML. If you are not familiar with these things, that's OK; you can leave this part for user:Bawolff. Generally this is adapting to different formatting, not really translating.

There is quite a variety of formatting differences between Wiktionary editions. Sometimes you need to do more than what is listed above (though often you don't). This requires a fairly decent knowledge of regex, as well as a limited knowledge of HTML. Look for the section:

  var subSectRegex = new RegExp('<h2>[^<]*<span[\t\r\n >][^<]*<a[\t\r\n >][^<]*</a[\t\r\n >][^<]*</span[\t\r\n >][^<]*<span class="mw-headline" id="' + preferLang[preferLangCode] + '"[^>]*>[\\s\\S]*$');
  var extractCurLangName = /<span class="mw-headline" id[^>]+>([\s\S]*?)<\/span>/; //first subexpression

extractSeeAlso and see_also_process

Note: This part is technically from the section before, but is included here as it is more technical.

This is used for extracting the text from the see also box, which varies with almost every language. (This is the part you need regex knowledge for, and it is one of the harder parts to translate.) Most languages use a see also box that looks like either the French one or the English one. extractSeeAlso is a variable containing a regex that should match the see also text. Keep see_also_process as it is, unless you need to further process the result of the regex (for example, if you use subexpressions, to get the first subexpression). Here is what it looks like for English:

var extractSeeAlso = /<div class=\"disambig-see-also(?:-2)?\">[\s\S]*?<\/div>/; //no subexpressions!
var see_also_process = function (sa) {return sa;}
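Here is a sketch of how the English pair might be applied. The sample HTML and the wrapper getSeeAlso are made up for illustration. It also assumes that the script hands the match array to see_also_process (the nl variant reads sa[1], which suggests this) and then uses the result as text.

```javascript
var extractSeeAlso = /<div class=\"disambig-see-also(?:-2)?\">[\s\S]*?<\/div>/; // no subexpressions
var see_also_process = function (sa) { return sa; };

// Made-up sample of what a page might contain; real markup may differ.
const sampleHtml =
  '<p>intro</p>' +
  '<div class="disambig-see-also">See also: <a href="/wiki/Foo">Foo</a></div>' +
  '<h2>English</h2>';

// Hypothetical wrapper around the two variables above.
function getSeeAlso(html) {
  const m = html.match(extractSeeAlso);
  // Assumption: the match array is passed in and the result is used as text.
  // With no subexpressions the array holds only the whole match.
  return m ? String(see_also_process(m)) : null;
}

console.log(getSeeAlso(sampleHtml));
```

Under that assumption, the "no subexpressions" comment makes sense: a capture group would add a second item to the match array, and the text would then contain both.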

For nl (which is like fr), where we use sub-expressions, and require further processing:

var extractSeeAlso = /]*>([\s\S]*?)<\/td>[\s\S]*?<\/table>/; //Modified elsewhere!
var see_also_process = function (sa) { return sa[1].replace(/<a(?:[\t\r\n ][^>]*)?><img(?:\/|(?:[\t\r\n ][^>]*)?)><\/a>/, '');}
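The further processing in the nl version can be tried on its own. In this sketch the match array is faked by hand (a real one would come from the extractSeeAlso regex), and the second argument to replace is assumed to be an empty string, so that a linked image is simply dropped from the first subexpression.

```javascript
// Pattern taken from the nl see_also_process above: an <a> element that
// wraps nothing but an <img>.
const imageLink = /<a(?:[\t\r\n ][^>]*)?><img(?:\/|(?:[\t\r\n ][^>]*)?)><\/a>/;

// Hand-made stand-in for a match array: [whole match, first subexpression].
// The markup is invented for the example.
const fakeMatch = [
  '<td><a href="/wiki/File:X.png"><img src="x.png"></a> foo, bar</td></table>',
  '<a href="/wiki/File:X.png"><img src="x.png"></a> foo, bar'
];

var see_also_process = function (sa) {
  // Assumption: the replacement is the empty string.
  return sa[1].replace(imageLink, '');
};

console.log(see_also_process(fakeMatch));
```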

subSectRegex

This deletes everything before the language section we're interested in. If the Wiktionary uses a different scheme for organizing sections than the normal one (like fr), then you might need to change this (or if the MediaWiki parser changes; this part isn't the most robust).
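To see what the regex expects, here is a sketch run against a made-up page. The heading markup is a guess at the layout the pattern was written for (an edit-link span followed by the mw-headline span), and the helpers heading and fromPreferredSection are hypothetical.

```javascript
var preferLang = {'en': 'English', 'fr': 'French', 'qqqAny': null};
var preferLangCode = 'en';

var subSectRegex = new RegExp('<h2>[^<]*<span[\t\r\n >][^<]*<a[\t\r\n >][^<]*</a[\t\r\n >][^<]*</span[\t\r\n >][^<]*<span class="mw-headline" id="' + preferLang[preferLangCode] + '"[^>]*>[\\s\\S]*$');

// Builds a language heading in the assumed layout.
function heading(name) {
  return '<h2><span class="editsection">[<a href="#" title="Edit">edit</a>]</span> ' +
         '<span class="mw-headline" id="' + name + '">' + name + '</span></h2>';
}

// Made-up page with two language sections.
const page = heading('French') + '<p>bonjour</p>' + heading('English') + '<p>hello</p>';

// Hypothetical helper: keep only the text from the wanted heading onward.
function fromPreferredSection(html) {
  const m = html.match(subSectRegex);
  return m ? m[0] : null;
}

console.log(fromPreferredSection(page));
```

If your Wiktionary's headings do not have exactly this shape, the match fails and nothing is found, which is usually the first sign that subSectRegex needs adapting.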

extractCurLangName

Extracts the full language name. This might have to change if the Wiktionary uses fancy templates for the language name.
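A minimal sketch of the extraction, using an invented heading and a hypothetical helper currentLangName:

```javascript
var extractCurLangName = /<span class="mw-headline" id[^>]+>([\s\S]*?)<\/span>/; // first subexpression

// Made-up section start; real headings may carry extra markup.
const sectionHtml = '<h2><span class="mw-headline" id="English">English</span></h2><p>hello</p>';

// Hypothetical helper: returns the first subexpression, i.e. whatever sits
// between the headline span's opening and closing tags.
function currentLangName(html) {
  const m = html.match(extractCurLangName);
  return m ? m[1] : null;
}

console.log(currentLangName(sectionHtml));
```

Note that anything inside the span is returned as-is, so if a template wraps the name in further tags, those tags come along too.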

Other notes

If a language edition formats its pages very differently from English Wiktionary, some other things might have to be changed (for example, on ru we strip out examples that are on the same line as the definition).

Also:

<meta name="generator" content="Wiktionary Extract XSLT 1.08-EN"/>

The EN should be changed to your language code. (This is not very important; it just keeps track of the different versions of this script.)