Perl and UTF-8/Unicode RegEx

kainjow · Nov 29, 2005

I've got a Perl script that reads HTML from standard input, strips out the markup, and prints the result. However, for UTF-8/Unicode text (I'm still not 100% clear on the difference between the two...), the output comes out garbled. Does anyone have any ideas?

Here's the Perl code:

Code:

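#!/usr/bin/perl
use strict;
use warnings;

# Slurp all of standard input into one string.
my $str = "";
while (my $line = <STDIN>)
{
	$str .= $line;
}
$str =~ s/<script[^>]*>(.*?)<\/script>//gsi; # remove <script> blocks
$str =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi; # remove remaining HTML tags
print $str;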
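
On the UTF-8 vs. Unicode question: Unicode assigns a number (code point) to each character, and UTF-8 is one way of encoding those numbers as bytes. Below is a minimal sketch of decoding input bytes into characters before running the substitutions and encoding again on output. It assumes the input really is UTF-8, and the post doesn't establish whether this is the cause of the garbling. `strip_html` is a hypothetical helper name, and its tag pattern is a simplified variant of the one in the script, not the original.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not from the original script): strip <script>
# blocks and then any remaining tags from a decoded character string.
sub strip_html {
    my ($text) = @_;
    $text =~ s/<script[^>]*>.*?<\/script>//gsi;    # remove <script> blocks
    $text =~ s/<(?:[^>'"]|"[^"]*"|'[^']*')*>//gs;  # remove remaining tags
    return $text;
}

# Assumption: the input bytes are UTF-8 (here "é" and "ï" take two bytes each).
my $bytes = "<p>caf\xC3\xA9 <b>na\xC3\xAFve</b></p>";
my $chars = decode('UTF-8', $bytes);   # bytes -> Unicode characters
my $clean = strip_html($chars);
print encode('UTF-8', $clean), "\n";   # characters -> UTF-8 bytes
```

For a script that reads standard input, the same effect should be available by calling `binmode(STDIN, ':encoding(UTF-8)')` and `binmode(STDOUT, ':encoding(UTF-8)')` before reading, in place of the explicit `decode`/`encode` calls.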
