mstpan 3 - XML

Wed Dec 3 21:00:00 2014

There's a reason why Cthulhu's called 'lord of the angles'.

XML::Simple

No. No no no no no no no. XML is not a perl data structure. XML::Simple is only simple for very very simple cases, and then you're lost in a maze of twisty little weirdly named options none of which will actually do what you expect.

If the worst thing that happens to you using this is being eaten by a grue, consider yourself lucky.

For bonus points, XML::Simple claims to round trip ... but only in the sense that if you jump off a tall building with an overly long bungee cord, your mangled remains will probably end up somewhere back near the top.
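To make the "not a perl data structure" complaint concrete, here's a minimal sketch (assuming XML::Simple is installed and left on its default options) of how the shape of the result changes depending on how many children happen to turn up. The element names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Simple;

# Two documents that differ only in how many <item> children they contain.
my $one = XMLin('<list><item>a</item></list>');
my $two = XMLin('<list><item>a</item><item>b</item></list>');

# With default options, a lone child collapses to a plain string ...
my $one_type = ref($one->{item}) || 'plain string';
# ... while repeated children become an array reference.
my $two_type = ref($two->{item}) || 'plain string';

print "one item:  $one_type\n";
print "two items: $two_type\n";
```

Any code consuming the result has to cope with both shapes, or you have to go spelunking through the options to force one of them.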

XML::Twig

If you thought you wanted XML::Simple, you probably actually wanted XML::Twig.

Or an alcohol problem.

XML::Twig is excellent at whipuptitude, has a friendly streaming interface, and comes with tools like xml_grep that can often obviate the need to write a specific script at all.

It runs atop XML::Parser, which uses the venerable but reasonably performant Expat library.

Plus, the cut/copy/paste methods (literally called that) it provides make simple modification of documents nice and easy.
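As a rough sketch of both halves (handlers firing during the parse, then cut/paste on the resulting tree), something like the following should work, assuming XML::Twig is installed. The library/shelf/book document is invented for the example.

```perl
use strict;
use warnings;
use XML::Twig;

my $xml = <<'XML';
<library>
  <shelf id="a"><book>Dune</book><book>Emma</book></shelf>
  <shelf id="b"></shelf>
</library>
XML

# Streaming side: the handler fires as each <book> finishes parsing.
# For a huge file you would typically call $t->purge in here to free
# memory; it is left out because the tree gets reused below.
my @titles;
my $twig = XML::Twig->new(
    twig_handlers => {
        book => sub {
            my ($t, $book) = @_;
            push @titles, $book->text;
        },
    },
);
$twig->parse($xml);

# Tree side: move the first book from shelf "a" to shelf "b".
my ($shelf_a, $shelf_b) = $twig->root->children('shelf');
my $book = $shelf_a->first_child('book');
$book->cut;
$book->paste(last_child => $shelf_b);

print "seen while parsing: @titles\n";
print 'shelf b now holds: ', $shelf_b->first_child('book')->text, "\n";
```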

However, to get the most out of it, you'll still want to learn XPath. Which brings me to ...

Digression: XPath

XPath is stupendously powerful, all things considered. It can match almost anything you're going to want it to (certainly narrowly enough to handle the remainder of your operations in perl space in 99% of cases).
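To give a flavour of what that looks like in practice, here's a small sketch that runs three expressions through XML::LibXML (covered below), assuming it's installed. The document and its attribute names are made up.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'XML');
<library>
  <book year="1965" lang="en"><title>Dune</title></book>
  <book year="1815" lang="en"><title>Emma</title></book>
  <book year="1869" lang="ru"><title>War and Peace</title></book>
</library>
XML

# Predicates on attributes, then a step down to a child element.
my @old_english = map { $_->textContent }
    $doc->findnodes('/library/book[@lang="en" and @year < 1900]/title');

# XPath functions work too: count() gives the number of matches.
my $total = $doc->findvalue('count(//book)');

# Axes and positions: the title of the book right after Dune.
my $next = $doc->findvalue(
    '//book[title="Dune"]/following-sibling::book[1]/title');

print "old english: @old_english\n";
print "total: $total\n";
print "after Dune: $next\n";
```

Once the match is that narrow, whatever's left to do is ordinary perl on a short list of nodes.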

It's also possibly the single most unmemorable set of syntax I've ever used on a regular basis.

My recommendation would be to read the XPath spec (it's a W3C Recommendation rather than an RFC) from end to end, at least twice. Once you've done that, you'll have enough of a mental index to be able to remember roughly where to look things back up again later.

If this sounds intensely painful, load the spec onto your tablet and read it down your favourite bar while working on your alcohol problem (yes, there's a running theme here - welcome to XML).

Seriously, though, if you're doing a lot of XML document mangling it's absolutely a worthwhile investment. XSLT, on the other hand, only becomes readable once you render it as lisp (look up SXSLT and SXML if you don't believe me).

XML::LibXML

This is basically the ten ton gorilla of XML libraries. It implements the DOM and SAX standards and backends onto libxml2, which seems to've pretty much become 'the' C/C++ XML library (and XML::LibXSLT wraps libxslt if you're masochistic and lisp-deprived).

Now, admittedly, the DOM and SAX standards are ... baroque isn't quite the word, stultifying might be a better adjective ... but they're also well known, comprehensive, and extremely predictable.

If you need a heavyweight solution, this is almost always the answer. Plus its XPath support is excellent, and is a wrapper around the libxml2 code, so doing complex searches of large documents performs surprisingly well. This does, of course, require you to learn XPath, but in my experience this results in significantly less liver damage than trying to use raw DOM for everything.
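For comparison, here's the same selection done twice, once by walking the DOM by hand and once with a single XPath expression. It's only a sketch, assuming XML::LibXML is installed; the orders document is invented.

```perl
use strict;
use warnings;
use XML::LibXML;

my $doc = XML::LibXML->load_xml(string =>
    '<orders><order status="open"><item sku="A1"/><item sku="B2"/></order>'
  . '<order status="closed"><item sku="C3"/></order></orders>');

# Raw DOM: walk the tree by hand and filter as you go.
my @dom_skus;
for my $order ($doc->documentElement->getChildrenByTagName('order')) {
    next unless $order->getAttribute('status') eq 'open';
    push @dom_skus, $_->getAttribute('sku')
        for $order->getChildrenByTagName('item');
}

# XPath: the same selection as one expression, evaluated inside libxml2.
my @xpath_skus = map { $_->value }
    $doc->findnodes('/orders/order[@status="open"]/item/@sku');

print "DOM:   @dom_skus\n";
print "XPath: @xpath_skus\n";
```

Two levels of nesting is about where the DOM version stops being fun; the XPath version just gets a slightly longer string.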

Template::Semantic

This module, built atop XML::LibXML, gives you the ability to apply transforms to XML documents using both XPath and CSS selectors.
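Going from my reading of its documentation, a minimal sketch looks something like this: you hand process() a template (here a reference to a string) and a hash mapping selectors to replacement content. The report document and the values are invented, and I'm assuming Template::Semantic is installed and that the result stringifies as documented.

```perl
use strict;
use warnings;
use Template::Semantic;

my $template =
    '<report><title>placeholder</title><owner id="who">nobody</owner></report>';

my $out = Template::Semantic->process(\$template, {
    'title'              => 'Quarterly numbers',   # CSS-style selector
    '//owner[@id="who"]' => 'someone',             # XPath selector
});

# The returned object stringifies to the transformed document.
my $result = "$out";
print "$result\n";
```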

If you're doing transformation/templating type stuff and keep finding yourself considering staring into the abyss that is XSLT, I'd strongly recommend looking at this module as a potentially less SAN-reducing alternative.

You can in theory use this for HTML templating as well, but see the HTML entry (coming later) for what I consider to be better alternatives.

XML::Toolkit

Been handed a set of XML documents with no schema and a specification that appears to have been considered to be 'rough guidelines' by the implementors?

You wanted XML::Toolkit then.

From its documentation: "The original intention of XML::Toolkit was to round-trip XML documents with an unkonwn schema through an editor and back out to disk with very few semantic or structural changes." I'm reliably informed it also works rather well for documents with an unknown schema too.

Feed it some XML, and it can spit out a set of Moose classes that represent that XML. Think of it as DBIx::Class::Schema::Loader for XML, except with a different set of "everything is on crack" related problems.

XML::Rabbit

To stretch my metaphor still further, this one's sort of like DBIx::Class::Candy for Moose+XML.

Mostly, it's designed to provide you a nice way to build an object model that addresses exactly the parts of the document you actually care about. You do this by specifying them with ... wait for it ... yep, XPath expressions. I did warn you.
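A minimal sketch of what that looks like, based on the module's documented declarations and assuming XML::Rabbit is installed. The My::Library class name, its attribute names, and the document are all hypothetical.

```perl
use strict;
use warnings;

{
    package My::Library;    # hypothetical class name
    use XML::Rabbit::Root;

    # Each attribute is backed by an XPath expression.
    has_xpath_value      first_title => '/library/book[1]/title';
    has_xpath_value_list all_titles  => '/library/book/title';

    finalize_class();
}

my $lib = My::Library->new(
    xml => '<library><book><title>Dune</title></book>'
         . '<book><title>Emma</title></book></library>',
);

my $first = $lib->first_title;
my @all   = @{ $lib->all_titles };

print "$first\n";
print join(', ', @all), "\n";
```

Everything in the document that you didn't write an expression for simply doesn't exist as far as your object model is concerned, which is rather the point.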

Plus, if you're trying to do additional insane things, it provides access to a (read-only) copy of the XML::LibXML node backing any particular object, so you're well into "anything is possible with zombo.com" territory.

Final note: xmllint

On Debian, xmllint is part of libxml2-utils. It'll brute force its way through an XML document and tell you what warnings and errors are involved in parsing it, and is usually pretty effective at finding most of them in a single run.

If your input document appears to be made of old man wee and fail, this tool is a life saver.

If you manage to somehow mangle your output document, doubly so.

Next?

My notes say HTML. Because who doesn't love web development?

-- mst, out.