mstpan 4 - HTML

Thu Dec 4 21:00:00 2014

Could've been worse. Could've been postscript.

Parsing

Regular expressions

Are you Tom Christiansen? No? Then don't.
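For anybody who wants to see why, here's a minimal sketch in core Perl. The sample markup and the naive pattern are both made up for illustration; real-world pages find far more creative ways to break a regex.

```perl
use strict;
use warnings;

# Made-up input: one commented-out link, one real link whose title
# attribute happens to contain a '>'.
my $html = <<'HTML';
<!-- <a href="/old">retired link</a> -->
<a title="a > b" href="/real">real link</a>
HTML

# Naive attempt: pull the href out of anything that looks like an <a> tag.
my @naive = $html =~ /<a\s[^>]*href="([^"]*)"/g;

# Finds the link inside the comment, and misses the real one because
# [^>]* stops at the '>' inside the title attribute.
print "found: @naive\n";   # found: /old
```

You can keep patching the pattern, but every patch is another step towards writing a bad HTML parser by hand.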

HTML::TreeBuilder

The old stalwart. Works, reasonably easy to use, and built atop HTML::Parser, which is disturbingly good at taking near-tag-soup and producing something that actually makes sense.
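A minimal sketch of what that looks like in practice, assuming HTML::TreeBuilder is installed from CPAN. The markup is invented for the example: no html or body tags, and the list items are never closed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up tag soup: no <html>/<body>, and the <li> elements are never closed.
my $html = '<ul><li><a href="/one">One</a><li><a href="/two">Two</a></ul>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down() in list context returns every matching element, in document order.
my @links = map { $_->attr('href') . ' => ' . $_->as_text }
            $tree->look_down(_tag => 'a');

print "$_\n" for @links;

# Older versions of HTML::Element need an explicit delete() to free the tree.
$tree->delete;
```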

Oh, and as is commonly the case for *ML things I'm going to recommend, you can add HTML::TreeBuilder::XPath if you need more complex query capabilities.
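As a rough sketch of the XPath flavour, assuming HTML::TreeBuilder::XPath is installed, and with the markup and the "nav" id invented for the example:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Made-up document: one link inside the (hypothetical) nav div, one outside it.
my $html = '<title>Demo</title>'
         . '<div id="nav"><a href="/one">One</a></div>'
         . '<a href="/elsewhere">Other</a>';

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# findnodes() returns the matching elements; these are still HTML::Element
# objects, so the usual attr() and as_text() methods work on them.
my @nav_hrefs = map { $_->attr('href') }
                $tree->findnodes('//div[@id="nav"]//a[@href]');

# findvalue() returns the string value of whatever the expression matches.
my $title = $tree->findvalue('//title');

print "title: $title, nav links: @nav_hrefs\n";
$tree->delete;
```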

Plus there's things like WWW::Mechanize::TreeBuilder integrating it into your user agent, which you can fully expect me to mention again when I get as far as web scraping.

Mojo::DOM

If you're using Mojolicious, use this. If you aren't, seriously consider it anyway. It provides a very clean, mostly-functional API that makes extracting bits of HTML and poking them with a stick relatively pleasant.

Uses CSS selectors for most of the traversal rather than custom methods or XPath. This is probably a feature most of the time.
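A minimal sketch of the selector-driven style, assuming Mojolicious is installed. The markup and the "nav" id are made up for the example.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up document: two links inside the (hypothetical) nav list, one outside.
my $html = '<h1>Links</h1>'
         . '<ul id="nav"><li><a href="/one">One</a></li>'
         . '<li><a href="/two">Two</a></li></ul>'
         . '<a href="/elsewhere">Other</a>';

my $dom = Mojo::DOM->new($html);

# at() returns the first element matching the selector.
my $heading = $dom->at('h1')->text;

# find() returns a Mojo::Collection; map() calls attr('href') on each
# element and each() flattens the result back into a plain list.
my @hrefs = $dom->find('#nav a[href]')->map(attr => 'href')->each;

print "$heading: @hrefs\n";
```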

XML::*

XML::LibXML and XML::Twig will both, in theory, work with HTML. I tend to prefer HTML-specialised tools, but if you find yourself loving them more than the HTML-specialised options, try them and see if they're comfortable with your document corpus.

Generating

CGI.pm

I ... I ... I ... implodes.

Please don't. Please don't.

Template Toolkit

The old stalwart of templating stuff in general. Very well known and well used, so examples abound. Plus there's a huge ecosystem of plugins, of the usual level of varying quality.

However, it also has some serious oddities to it, like a ... unique ... understanding of what is and isn't an array and a tendency to call methods in list context (DBIx::Class' _rs methods were added as a result of this).
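The list context part is easy to demonstrate. This is a minimal sketch assuming Template Toolkit is installed; Hypothetical::Thing is a class invented purely for the example.

```perl
use strict;
use warnings;
use Template;

# Hypothetical class, invented for this sketch: the method reports the
# context it was called in.
package Hypothetical::Thing;
sub new     { return bless {}, shift }
sub context { return wantarray ? 'list' : 'scalar' }

package main;

my $tt     = Template->new;
my $output = '';

# A scalar ref as the first argument means "this is the template text".
$tt->process(\'[% thing.context %]', { thing => Hypothetical::Thing->new }, \$output)
    or die $tt->error;

print "$output\n";   # list
```

That is the problem for anything context-sensitive: DBIx::Class' search returns a resultset in scalar context but the rows themselves in list context, whereas search_rs returns the resultset regardless.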

Then again, I've often found that if you're hitting TT's edge cases, it means you're trying to be too intelligent in your templates and should extend your domain model somehow.

If the people you're working with already know it, it's a great choice. If you're starting from a clean sheet, consider alternatives.

Text::Xslate

Primarily written by insane and brilliant Japanese hackers, Text::Xslate was designed to be a high performance, auto-HTML-escaping template engine that would be obviously the correct approach to generating text or HTML in new build projects.

Honestly? I think it succeeded.

Unless Template Toolkit has a plugin or extension of some sort that you can't live without, or you already have a team that know TT, I think Xslate is clearly superior.

I also have the web design ability of a dead squirrel though, so YMMV.
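For the auto-escaping mentioned above, a minimal sketch assuming Text::Xslate is installed and left on its defaults (the Kolon syntax, HTML escaping on). The template and the values are made up.

```perl
use strict;
use warnings;
use Text::Xslate qw(mark_raw);

my $tx = Text::Xslate->new;

my $template = '<p>Hello, <: $name :>!</p>';

# Plain strings are HTML-escaped on output without you asking for it.
my $escaped = $tx->render_string($template, { name => '<b>world</b>' });

# Markup you really do trust has to be marked as raw explicitly.
my $raw = $tx->render_string($template, { name => mark_raw('<b>world</b>') });

print "$escaped\n";   # <p>Hello, &lt;b&gt;world&lt;/b&gt;!</p>
print "$raw\n";       # <p>Hello, <b>world</b>!</p>
```

The safe behaviour is the default and the unsafe one takes a deliberate extra step, which is the right way round.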

HTML::Mason / Mojo::Template

Lumped together here because I really, really dislike templating systems that use embedded perl for their logic. Without a significant amount of discipline, it becomes far too easy to end up with more logic in the template than is a remotely good idea. I've worked on exactly one project using a perl-based templating system where that didn't happen. It got cancelled anyway.

HTML::Zoom

An insane experiment of mine to try and provide pure functional templating that can stream in order to make your layout 'pop' while DBIx::Class or whatever is running the query that's the slow part of the request.

Uses CSS selectors to transform straight HTML - so there's no logic in the template files at all, just significant tags/ids/classes. I'm not entirely fond of the API, but then again (a) I wrote it (b) over a year ago, so mostly I'm just pleased I don't hate the entire thing enough to be obsessed with writing a version 2 yet.
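A minimal sketch of the idea, assuming HTML::Zoom is installed. The template markup and the "greeting" id are made up; the point is that the template is plain HTML and all the logic lives in the Perl.

```perl
use strict;
use warnings;
use HTML::Zoom;

# Plain HTML with placeholder content; the id is the only "template syntax".
my $template = '<html><head><title>Placeholder</title></head>'
             . '<body><h1 id="greeting">Placeholder</h1></body></html>';

# Each call returns a new HTML::Zoom object rather than modifying the old
# one, so transformations chain. replace_content() escapes the text it is given.
my $output = HTML::Zoom->from_html($template)
    ->select('title, #greeting')
    ->replace_content('Hello world & dog!')
    ->to_html;

print "$output\n";
```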

It's proven pretty useful to me for collaborating with people who have the programming ability of a dead squirrel without us annoying each other to death in the process, but again, YMMV.

Next?

Files. Quite how I'm going to manage to talk about files without boring my entire remaining readership to death, I've no idea yet.

Fortunately, I'll be spending much of tomorrow on a train, so presumably I'll figure it out then.

-- mst, out.