Tuesday, January 13, 2015

How to utf8 with perl

I have had a good deal of trouble working with utf8 with perl scripts.

It is extremely easy to mess things up - in fact all you have to do is try to add two strings together (that is, concatenate them).
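Here is a minimal sketch of how that goes wrong. The variable names are mine, and I am assuming one string holds raw utf8 bytes while the other has already been decoded:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for the copyright sign (U+00A9); Perl sees two separate bytes.
my $bytes = "\xc2\xa9";
# The same sign as a decoded character string; Perl sees one character.
my $chars = decode('UTF-8', "\xc2\xa9");

# Concatenating the two treats each byte of $bytes as a Latin-1 character,
# so its two bytes become the two characters U+00C2 and U+00A9.
my $mixed = $bytes . $chars;

printf "bytes: %d, chars: %d, mixed: %d\n",
    length($bytes), length($chars), length($mixed);

# Encoded back to UTF-8, the mix is 6 bytes instead of the 4 you would
# expect for two copyright signs.
printf "encoded length: %d\n", length(encode('UTF-8', $mixed));
```

The mixed string ends up with three characters rather than two, which is the mojibake you see once it gets printed.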

As such I have had to come up with a simplified solution for my team.

I will share:



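#!/usr/bin/perl

use strict;
use warnings;

# Set the encoding of STDIN/STDOUT/STDERR from the current locale.
use open qw( :std :locale );
use Encode qw(decode is_utf8);

# $a holds the raw utf8 bytes for the copyright, registered and trademark signs.
my $a = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
my $b = "Any old thing.";

print "a:",$a,"\n";
print "b:",$b,"\n";

my $c = jibe($a,$b);
print "c:",$c,"\n";

# Concatenate two strings, decoding each one from utf8 only if it is not
# already flagged as a utf8 character string.
sub jibe {
  my($s,$t) = @_;
  my $r = join('', (is_utf8($s)?$s:decode('utf8',$s)), (is_utf8($t)?$t:decode('utf8',$t)));
  return $r;
}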

And the output is:

$ perl ./simple_utf8.pl
a:©®�
b:Any old thing.
c:©®™Any old thing.


The key thing here is that the jibe sub lets you safely concatenate two strings, whether each one holds raw utf8 bytes or already-decoded characters, and produce a proper utf8 result.

If you include the use statements and the jibe subroutine you can use this in your code.

It is designed to check each input with is_utf8 & only decode the ones that are not already flagged as utf8 - so it gives you a bit more license to mix & match input strings.
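If you need to join more than two strings, the same check can be applied to a whole list. This is only a sketch: jibe_all is a name I made up for it, and like jibe it assumes that any string without the utf8 flag holds raw utf8 bytes (that will not be true for, say, undecoded Latin-1 input).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper: same idea as jibe, but for any number of strings.
# Each argument is decoded from utf8 only if it is not already flagged.
sub jibe_all {
    return join '', map { is_utf8($_) ? $_ : decode('utf8', $_) } @_;
}

# Mix raw bytes, plain ASCII and an already-decoded string.
my $joined = jibe_all("\xc2\xa9", " 2015 ", decode('utf8', "\xe2\x84\xa2"));

# One character for the copyright sign, six for " 2015 ", one for the
# trademark sign.
print length($joined), "\n";
```

Passing an already-decoded string through is harmless, because flagged strings are returned untouched.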
