Transolution XLIFF Filter Documentation

Version: 0.1
Date: 2005-08-03
Authors: Fredrik Corneliusson

Introduction

A filter that converts tagged formats such as HTML and XML to XLIFF and back. The filter is configured with an ini file, so it can be adapted to a number of different formats.

  • sgml2xliff.py - Script to convert to XLIFF
  • xliff2sgml.py - Script to convert XLIFF back to the original format

At the moment, ini files for the following formats exist in transolution/filters/filter_settings:

  • DOCBOOK (docbook.ini)
  • HTML (HTML.ini)
  • StarOffice/OpenOffice content.xml files (OOffice.ini)

The filter produces sentence-segmented XLIFF files.

Here are some example XLIFF files created with the filter: example_xliffs

Usage

First, the transolution directory must be on your Python path, or you must run the scripts from the directory that contains the transolution folder. If you can start the XLIFF editor, everything should be OK. To get help on the arguments the script accepts:

>python sgml2xliff.py -h
usage: sgml2xliff.exe [options] inifile path

options:
 -h, --help            show this help message and exit
 -e ENCODING, --encoding=ENCODING
                       set source file encoding
 -r, --recursive       Process files recursive.
 -f FMASK, --fmask=FMASK
                       File mask to use when running recursive.
 -l SLANG, --slang=SLANG
                       Language of source document(s).
 -s SKIPWORDS, --skipwords=SKIPWORDS
                       a file containing a the words not to segment after.
                       Example: vs. Mr.
 -z, --xlz             Create a xlz (zip file containing xliff and skeleton
                       files).

Tip

If the source file contains no newlines, as is the case with OpenOffice's content.xml, it is a good idea to insert newlines where appropriate, e.g. after closing paragraph tags (</text:p>), before converting to XLIFF. Otherwise the editor gets sluggish when you view the document with skeleton context turned on.
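
One way to insert the newlines is a small Python helper run on the file contents before conversion. This is only a sketch and not part of Transolution; the function name and the choice to break only after </text:p> are assumptions made for illustration.

```python
import re


def add_newlines_after_paragraphs(xml_text, closing_tag="</text:p>"):
    """Insert a newline after each closing tag that is not already followed by one."""
    # The negative lookahead skips tags that already end their line.
    pattern = re.escape(closing_tag) + r"(?!\r?\n)"
    return re.sub(pattern, lambda match: closing_tag + "\n", xml_text)


sample = "<text:p>One.</text:p><text:p>Two.</text:p>"
print(add_newlines_after_paragraphs(sample))
```

Read content.xml as text, pass it through the helper, and write the result back before running sgml2xliff.py on it.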

Convert to XLIFF

For example, to convert an HTML file:

>python sgml2xliff.py -z ./transolution/filters/filter_settings/html.ini ../test/test.html

The -z switch creates an .xlz file (a zip archive containing the XLIFF and the skeleton file).
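
Since an .xlz file is an ordinary zip archive, its contents can be inspected with Python's standard zipfile module. The sketch below is not part of Transolution; list_xlz_members is a hypothetical helper, and the member names in the demo archive are made up, so the names sgml2xliff.py actually writes may differ.

```python
import io
import zipfile


def list_xlz_members(xlz):
    """Return the names of the files stored in an .xlz archive.

    `xlz` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(xlz) as archive:
        return archive.namelist()


# Demo on an in-memory archive with invented member names.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as archive:
    archive.writestr("test.html.xlf", "<xliff/>")
    archive.writestr("test.html.skl", "<html/>")
demo.seek(0)
print(list_xlz_members(demo))
```

With a real file you would call list_xlz_members("../test/test.html.xlz") instead.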

Convert back to original format

>python xliff2sgml.py ../test/test.html.xlz

Filter INI files

Configuration file syntax

The configuration file consists of sections, led by a "[section]" header and followed by "name= value" entries.

Tags Section

In this section you define how tags are treated.

The syntax for the Tags section is

TAGNAME=TYPE,FLAGS

The TAGNAME should be just the name as it appears in the file (case insensitive). If the tag name contains a colon, you have to replace the colon with &colon; e.g.

text:p => text&colon;p
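
A likely reason for this escape is that ini parsers in the style of Python's standard configparser treat a colon as a name/value separator, just like "=". Whether the filter uses that module is an assumption; the sketch below only shows what such a parser does with an unescaped name. escape_tag_name is a hypothetical helper, not a Transolution function.

```python
import configparser


def escape_tag_name(tag_name):
    """Replace colons so the tag name can be used as an ini key."""
    return tag_name.replace(":", "&colon;")


def parse_tags(ini_text):
    """Parse ini text and return the [Tags] section as a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return dict(parser["Tags"])


# Unescaped: the parser splits at the colon, so the key is cut short.
print(parse_tags("[Tags]\ntext:p=Internal\n"))
# → {'text': 'p=Internal'}

# Escaped: the whole tag name survives as the key.
print(parse_tags("[Tags]\n" + escape_tag_name("text:p") + "=Internal\n"))
# → {'text&colon;p': 'Internal'}
```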

The TYPE can be either "External" or "Internal". The rule is that a tag which can appear inside a sentence should be "Internal", e.g.:

Here's some <b>bold</b> text.

If you don't want the tag to appear in translation segments, set the TYPE to External, e.g.

<header>This is a header</header>

If you want all content between the start and end tag to be treated as Internal or External, set the Group flag, e.g.

<script language="javascript">
  var Open = "";
  function preload() { Open = new Image(16,16); Closed = new Image(16,16); }
</script>

This is all that is supported at the moment. Other settings, such as Translatable Attributes, are only there because I plan to implement support for them in the future.

The ini file Tags section for the tags above would be

[Tags]
b=Internal
header=External
script=External,Group
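
To make the TAGNAME=TYPE,FLAGS syntax concrete, here is a minimal sketch that reads such a section with Python's standard configparser and splits each value into a type and a list of flags. It is not the filter's own code; read_tag_rules is a hypothetical name, and undoing the &colon; escape at this point is an assumption.

```python
import configparser

INI_TEXT = """\
[Tags]
b=Internal
header=External
script=External,Group
"""


def read_tag_rules(ini_text):
    """Return {tag name: (type, [flags])} from the [Tags] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    rules = {}
    for key, value in parser["Tags"].items():
        # The first comma-separated field is the type, the rest are flags.
        tag_type, *flags = [part.strip() for part in value.split(",")]
        rules[key.replace("&colon;", ":")] = (tag_type, flags)
    return rules


for tag, rule in read_tag_rules(INI_TEXT).items():
    print(tag, rule)
```

configparser lowercases option names by default, which fits the case-insensitive tag names described above.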

FilterSettings section

The FilterSettings section has two settings. If KeepLineBreaks is set to True, all line breaks in the file are kept in the translation segments; if it is not set, line breaks are removed from sentences. DefaultTagStyle is the style (for example External) to use for tags that are not specified in the Tags section.

[FilterSettings]
KeepLineBreaks=False
DefaultTagStyle=External
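
As an illustration of how these two settings could be read, the following sketch uses Python's standard configparser. It is not taken from the filter; read_filter_settings is a hypothetical name. A missing KeepLineBreaks is treated as False, matching the description above, while a missing DefaultTagStyle is returned as None because the filter's built-in default is not documented here.

```python
import configparser


def read_filter_settings(ini_text):
    """Return (keep_line_breaks, default_tag_style) from [FilterSettings]."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    keep_line_breaks = parser.getboolean(
        "FilterSettings", "KeepLineBreaks", fallback=False
    )
    default_tag_style = parser.get(
        "FilterSettings", "DefaultTagStyle", fallback=None
    )
    return keep_line_breaks, default_tag_style


settings = read_filter_settings(
    "[FilterSettings]\nKeepLineBreaks=False\nDefaultTagStyle=External\n"
)
print(settings)
# → (False, 'External')
```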

Skip words

Sometimes abbreviations (Mr., etc.) cause the filter to segment in the wrong place, because the period is taken for the end of a sentence. The solution is to pass the filter a text file of skip words with -s or --skipwords. There is a very incomplete English abbreviation file at transolution/filters/skipwords/en_skipwords.txt. Add every abbreviation you don't want the filter to segment after, and pass the file to the filter.
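
The effect of a skip-word list can be shown with a deliberately naive segmenter. This is a sketch of the idea, not the filter's actual algorithm. It assumes that the skip words are separated by whitespace in the file and that matching is case sensitive; load_skipwords and segment are hypothetical names.

```python
import re


def load_skipwords(text):
    """Parse skip words from whitespace-separated text, e.g. 'vs. Mr.'."""
    return set(text.split())


def segment(text, skipwords=()):
    """Split after '.', '!' or '?' plus whitespace, except after a skip word."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1]
        if last_word in skipwords:
            continue  # e.g. "Mr." does not end the sentence
        sentences.append(candidate.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Mr. Smith arrived. He was late."
print(segment(text))
# → ['Mr.', 'Smith arrived.', 'He was late.']
print(segment(text, load_skipwords("vs. Mr.")))
# → ['Mr. Smith arrived.', 'He was late.']
```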