Handling Special Characters

I am wading through the SOPAC/locum code, trying to understand it so I can implement it for a client.

I noticed in the connector file (locum_iii_2006.php) that the prepare_marc_values() function removes anything between braces, and that there is a comment indicating uncertainty about how III encodes special characters.

Looking at a record returned by my client's database, I found the following as the author's name: "Chopin, Fr{u00E9}d{u00E9}ric,". I'm betting III is using Unicode since Unicode 00E9 is é, which correctly translates Chopin's first name as Frédéric.

This raises the question of how best to deal with such Unicode characters. There are probably others, but I see the following possibilities:

1) Continue to elide them. This has the advantage of being easy, but the obvious disadvantage of somewhat mangled information: e.g. Frdric Chopin.

2) Translate the characters to ASCII as part of harvesting. It would probably be pretty straightforward to come up with roughly 20 often-used Unicode characters, and use string replace to substitute for them, eliding any remaining unknown Unicode characters. This would give us significantly improved, but still somewhat imperfect, information, such as Frederic Chopin. This approach would work in any ASCII-friendly environment, but would also mean that the imperfections would be uncorrectable downstream in the system.

3) Translate the characters to HTML as part of harvesting. This would take approximately the same amount of work, but would result in the correct display of the characters within an HTML environment. However, if non-HTML display is desired, it would either be incorrect, or need to be corrected somewhere downstream. Continuing our example, we would then have Fr&#233d&#233ric Chopin, or Frédéric Chopin. (Semicolons omitted to display &#233 instead of é.)

4) Leave some or all of the Unicode in place, eliding those that are deemed not worth maintaining, and make it the responsibility of downstream elements of the system to deal with them as they deem appropriate. Thus we would be back to Fr{u00E9}d{u00E9}ric Chopin. This obviously requires little or no work on the part of the connector, but raises the requirements for downstream clients, such as SOPAC. On the other hand, it has the advantage of enabling correct display in a variety of environments.
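To make the first three options concrete, here is a minimal sketch of each applied to the sample author string. The function names and the small token map are hypothetical, not part of locum_iii_2006.php, and a real map would need to cover the commonly used characters.

```php
<?php
// Hypothetical helpers for illustration; none of these exist in the connector.

// Option 1: drop every {uXXXX} token.
function iii_elide_tokens($s) {
  return preg_replace('/\{u[0-9a-fA-F]{4}\}/', '', $s);
}

// Option 2: swap known tokens for an ASCII stand-in, then drop the rest.
function iii_tokens_to_ascii($s) {
  // Illustrative map only (assumption); extend it to the characters you need.
  $map = array('{u00E9}' => 'e', '{u00E8}' => 'e', '{u00E1}' => 'a', '{u00F1}' => 'n');
  $s = str_ireplace(array_keys($map), array_values($map), $s);
  return iii_elide_tokens($s);
}

// Option 3: turn each token into a numeric HTML entity, e.g. {u00E9} becomes &#233;
function iii_tokens_to_entities($s) {
  $matches = array();
  preg_match_all('/\{u([0-9a-fA-F]{4})\}/', $s, $matches, PREG_SET_ORDER);
  foreach ($matches as $m) {
    // $m[0] is the whole token, $m[1] is only the four hex digits.
    $s = str_replace($m[0], '&#' . hexdec($m[1]) . ';', $s);
  }
  return $s;
}
```

Option 4 needs no code in the connector, since the tokens are simply passed through unchanged.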

My suspicion is that it is reasonable to assume display in an environment which supports HTML, in which case I would suggest that option 3 is the best one.

Does this sound correct?

Thank you for articulating the options here. This is one of the issues I want to resolve before Locum leaves beta, though since it's a connector-side issue and specific to III, I can't really attribute the issue to Locum.

Ideally, I would like users to be able to search for "frederic chopin" and get the same hits as they would if they searched for "frédéric chopin". I do think that this is a connector-side issue that should not be shunted downstream to the application, and my thinking trends toward #3. But there is another option I've been pondering: store the record with the Unicode and create an additional field for title, author, etc., such as "title_nouni", that would hold the non-Unicode alternate.

Thoughts?
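One way the paired-field idea could look at harvest time is sketched below. The function name, the "author_nouni" key (modeled on "title_nouni"), and the ASCII map are all assumptions for illustration, not existing Locum code.

```php
<?php
// Hypothetical sketch: build a Unicode field plus a non-Unicode alternate.
function iii_author_fields($raw) {
  // Illustrative ASCII stand-ins (assumption); unknown tokens are dropped.
  $ascii_map = array('{u00E9}' => 'e', '{u00E8}' => 'e', '{u00F1}' => 'n');
  $pattern = '/\{u([0-9a-fA-F]{4})\}/';

  // Unicode field: decode every token to its UTF-8 character.
  $utf8 = $raw;
  $matches = array();
  preg_match_all($pattern, $raw, $matches, PREG_SET_ORDER);
  foreach ($matches as $m) {
    $char = html_entity_decode('&#' . hexdec($m[1]) . ';', ENT_NOQUOTES, 'UTF-8');
    $utf8 = str_replace($m[0], $char, $utf8);
  }

  // Alternate field: substitute known tokens, then remove any that remain.
  $ascii = str_ireplace(array_keys($ascii_map), array_values($ascii_map), $raw);
  $ascii = preg_replace($pattern, '', $ascii);

  return array('author' => $utf8, 'author_nouni' => $ascii);
}
```

Both values come from the same raw string, so the alternate can always be regenerated if the map changes, at the cost of another harvest.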

The issue of searching definitely makes things a bit more complicated.

It occurs to me that there is a fifth option which I missed in my original entry:

5) Translate the characters to Unicode as part of harvesting. III does not actually store the characters as Unicode, but as ASCII representations of Unicode characters. As such, they are ready to be translated into something else, but are not in a good state for either display or searching. Thus Fr{u00E9}d{u00E9}ric would be translated to Frédéric. In order for this to work correctly, the database column storing the information would have to be UTF-8 encoded, as would any web pages displaying the information.

In terms of display, I think it could work either to translate the ASCII representations of Unicode into HTML encoding, or into UTF-8 Unicode encoding: i.e. either option 3 or option 5.

In terms of search, I think things are a bit more complicated. Since "Frederic" does not match either Frédéric or Fr&#233;d&#233;ric, a workaround is needed to support that search term. Perhaps the easiest would be to have a separate column which is ASCII only, excluding special encodings. That is, we would also do option 2 from above.

Then there is the separate case of searching for "Frédéric". That obviously will match Frédéric, but will not match Fr&#233;d&#233;ric. Thus if we are using HTML encoding for display, we will need to either translate the search text into HTML encoding (or ASCII) before querying the database, or have a separate column which is UTF-8 encoded, and is only used for searching, not for display.

Pulling things together, in order to properly display extended characters, and to allow searching to find text with extended characters, whether or not the search term includes the extended characters, I think we will need to take one of the following approaches:

1) Have an HTML-encoded column for display, and an ASCII-encoded column for searching. Run search strings through an ASCII encoder, and then run a query against the ASCII column. This essentially converts a search for "Frédéric" into one for "Frederic".

2) Have an HTML-encoded column for display, and an ASCII-encoded column for searching. Run search strings through an HTML encoder, and then run a query against both the HTML and ASCII columns. This essentially converts a search for "Frédéric" into one for "Fr&#233;d&#233;ric".

3) Have an HTML-encoded column for display, and both an ASCII-encoded column, and a UTF-8-encoded column for searching. Search text will go directly into a query to run against both the ASCII and UTF-8 columns.

4) Have a UTF-8-encoded column for display, and an ASCII column for searching. Search text will go directly into a query to run against both the ASCII and UTF-8 columns.

I'm not really sure how to distinguish among these options. Here are some possibly relevant thoughts:

Although I doubt it would be an issue in most cases, options 1 and 2 obviously require more processing of each search request.

Option 1 would be somewhat less accurate, but also more forgiving than the other options. Whereas all of the options would correctly match searches for Frédéric and Frederic, only this option would also match searches for Fredéric and Fréderic. However, unlike the other options, it would also match searches for similar looking words with different accenting, e.g. Frędĕrĩck, since all extended characters would be converted to their corresponding ASCII character.
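The forgiving behavior of option 1 can be seen in a small sketch of the ASCII encoder for search strings. The function name and the map are hypothetical, and the byte escapes are the UTF-8 encodings of the characters named in the comments.

```php
<?php
// Hypothetical helper: fold a UTF-8 search term to lowercase ASCII.
function fold_search_term($term) {
  // Illustrative map only (assumption); a real one would be much larger.
  $map = array(
    "\xC3\xA9" => 'e', // é
    "\xC3\x89" => 'e', // É
    "\xC4\x99" => 'e', // ę
    "\xC4\x95" => 'e', // ĕ
    "\xC4\xA9" => 'i', // ĩ
  );
  // strtr() replaces each multi-byte sequence in a single pass.
  return strtolower(strtr($term, $map));
}
```

Every accented variant of the name folds to the same key, which is why this option matches partially accented spellings and also spellings with entirely different accents.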

Option 3 presumably requires significantly more processing during harvesting since it needs to convert into three encodings rather than into two. I have no idea whether this would be a problem.

Allowing search terms to include extended characters requires that the HTML page with the search form be UTF-8 encoded, or the search term will probably not be correctly returned to the server.
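A cheap safeguard on the server side is to check that the submitted term really is UTF-8 before querying. This is only a sketch with a hypothetical function name; it assumes the page itself is served with a UTF-8 charset (for example via header('Content-Type: text/html; charset=UTF-8')).

```php
<?php
// Hypothetical helper: true only when $term is well-formed UTF-8.
function is_wellformed_utf8($term) {
  // With the /u modifier PCRE rejects subjects that are not valid UTF-8,
  // so preg_match() returns false rather than 1 for such input.
  return preg_match('//u', $term) === 1;
}
```

A term that fails this check was probably submitted in another encoding such as Latin-1, and could be rejected or converted before it reaches the database.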

Option 4 seems like the one most likely to fail badly if UTF-8 encoding is not correctly implemented in the database, in the HTML, or in both.

Although I have designated which columns would be used for display, and which for search, once the columns existed in the database, they would open up the possibility of various approaches, some of which might be left to the discretion of clients. Thus option 3 potentially opens up greater possibilities for future flexibility.

Hopefully some of this proves useful.

I'd suggest leaving the Unicode in place if at all possible and looking at using Sphinx's charset_table directive. There's a whole bunch of sample tables on the Sphinx wiki, and someone's already done the heavy lifting.
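As a rough illustration of what such a table does, the fragment below folds case and maps a few accented letters onto their base letter at index time. It is a hand-written assumption covering only four code points, not one of the wiki tables, and it would sit inside an index section of sphinx.conf.

```
# Hypothetical fragment; the sample tables on the Sphinx wiki are far more complete.
charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+C9->e, U+E9->e, U+C8->e, U+E8->e
```

With a mapping like this, "Frédéric" and "Frederic" are indexed and searched as the same token, so no separate ASCII column is needed for matching.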

Perfect!

That still leaves the issue of correctly handling the Unicode tokens in III's MARC output, but that's not such a big deal and will require only a little bit of tweaking to the connector piece.

It will require a full re-harvest though.

Here's a snippet that converts the {uXXXX} strings into the correct Unicode characters by first converting them to HTML entities:


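// Convert III's {uXXXX} tokens in $string into UTF-8 characters.
$matches = array();
// Capture the four hex digits separately so hexdec() only sees valid input.
preg_match_all('/\{u([0-9a-fA-F]{4})\}/', $string, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
  // $match[0] is the whole token, e.g. {u00E9}; $match[1] is just 00E9.
  $code = hexdec($match[1]);
  // Build a numeric entity such as &#233; and decode it to UTF-8.
  $character = html_entity_decode("&#$code;", ENT_NOQUOTES, 'UTF-8');
  $string = str_replace($match[0], $character, $string);
}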

We'll get this added to the connector at some point.