After running exitwp to import my blog, I ran rake generate to build it, and got the following issue:
$ rake generate
(in /Users/benjiegillam/Documents/Blog/octopress)
## Generating Site with Jekyll
Configuration from /Users/benjiegillam/Documents/Blog/octopress/_config.yml
unchanged sass/screen.scss
Building site: source -> public
/Users/benjiegillam/Documents/Blog/octopress/plugins/raw.rb:11:in `gsub': invalid byte sequence in UTF-8 (ArgumentError)
from /Users/benjiegillam/Documents/Blog/octopress/plugins/raw.rb:11:in `unwrap'
[...]
However, the file was perfectly valid UTF-8, as confirmed by an iconv -c conversion followed by a diff -u. (iconv -c drops any invalid characters, so an empty diff means there were none to drop.)
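For anyone who would rather do the same sanity check from Ruby, here is a minimal sketch. The helper name valid_utf8? and the file path in the usage comment are placeholders of mine, not something from Octopress or exitwp.

```ruby
# Returns true when the given bytes form valid UTF-8.
def valid_utf8?(bytes)
  # dup so the caller's string keeps its own encoding label
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical usage; the path is a placeholder, not one from this post:
#   valid_utf8?(File.binread("source/_posts/some-post.markdown"))
```

Reading the file with File.binread avoids Ruby guessing an encoding before the check runs.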
After a quick bit of hacking in the octopress/plugins/raw.rb file to spit out the content that was being converted, I found the file at fault. After some iteration I got to the root of the issue: Octopress’ default markdown parser, rdiscount, REALLY doesn’t like UTF-8 characters in URLs. I’ve built a test for this.
It was converting Dara_Ó_Briain.jpg to Dara_?%93_Briain.jpg, where the ? is an invalid UTF-8 character. (Should be Dara_%C3%93_Briain.jpg)
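To see where %C3%93 comes from: Ó is U+00D3, which UTF-8 stores as the two bytes 0xC3 0x93, and a correct URL escapes both of them. The sketch below is only an illustration of that arithmetic; percent_encode_non_ascii is a name I made up, and it is not how rdiscount does (or should do) its escaping.

```ruby
# Percent-encode every non-ASCII byte of a string, leaving ASCII bytes alone.
def percent_encode_non_ascii(str)
  str.bytes.map { |byte| byte < 0x80 ? byte.chr : format("%%%02X", byte) }.join
end

name = "Dara_\u00D3_Briain.jpg" # \u00D3 is Ó
puts percent_encode_non_ascii(name) # prints Dara_%C3%93_Briain.jpg
```

The broken output escapes only the second byte (%93) and leaves the first one raw, and a lone 0xC3 byte is not valid UTF-8, which would explain the gsub error above.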
Solution?
Annoyingly, pandoc (a tool employed by exitwp) seems to be converting the link from Dara_%C3%93_Briain.jpg to Dara_Ó_Briain.jpg in the markdown file, which then breaks when it is run through rdiscount. As it only affected 2 characters in my entire blog history I’ve not bothered with an automated fix - I just manually re-encoded the characters. I’ve commented on this issue with pandoc so that hopefully they will fix it.
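If you have more than a couple of affected links, an automated pass might look something like the sketch below. It is an untested-against-real-blogs illustration, not the fix I used: encode_link_targets is a hypothetical helper, and the pattern assumes simple inline links of the form ](target) with no spaces, titles, or nested parentheses.

```ruby
# Re-encode non-ASCII characters inside inline Markdown link/image targets,
# leaving the surrounding prose (including non-ASCII text) untouched.
def encode_link_targets(markdown)
  # Assumption: targets look like ](target) with no spaces or nested parens.
  markdown.gsub(/\]\(([^)\s]+)\)/) do
    target = Regexp.last_match(1)
    encoded = target.gsub(/[^[:ascii:]]/) do |char|
      # Escape each UTF-8 byte of the character as %XX
      char.bytes.map { |byte| format("%%%02X", byte) }.join
    end
    "](#{encoded})"
  end
end

puts encode_link_targets("![Dara](/images/Dara_\u00D3_Briain.jpg)")
# prints ![Dara](/images/Dara_%C3%93_Briain.jpg)
```

Run it over a copy of the posts first and diff the result, since a regex this simple will miss reference-style links and anything with a title attribute.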