Sunday, February 12, 2017

Xml Adventure

So it looks like I forgot to post in 2016. I'm going to try to rectify that this year.

I was faced with a challenge at work: migrate documents from a legacy store into a new content store. Iterate over the old archive directory, pick up each file, write it to content via HTTP. No problems. As always, I'd make every effort to catch and report exceptions, but really, what if I missed something? What if somehow the operation finished without getting EVERY document from the archive?

Writing a simple application that iterates over the old directory and checks that each file made it to content sounds like a good solution. My first idea was, for every document, to make an HTTP GET call to the content provider and see if I get a 200 response. Of course, not optimal. Pretend you didn't just read that. For small subsets, it would be fine. But for over 15 million records, it would take a long time.

Second solution: get the content directory returned as XML, iterate over the archived directory, and see if each filename is in the XML. My brain instinctively goes to what is familiar to me: the XmlDocument class and searching for a node with XPath. Once again, I know, reading a document that size into memory is not optimal, but, hey, memory is cheap. I know enough about XPath to avoid the expensive searches (https://blogs.msdn.microsoft.com/xmlteam/2011/09/26/effective-xml-part-2-how-to-kill-the-performance-of-an-app-with-xpath/). So I code it up and notice that performance is about the same as the HTTP GET. Wow, I think, XPath sucks at looking up nodes in large documents.

I probably shouldn't use XmlDocument and XPath anyway; LINQ and XDocument would be preferable and would demonstrate how hip I am. So I code up a LINQ query and find out . . . performance is not noticeably better than XPath.
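One likely explanation, assuming each lookup was an unindexed query: both the XPath search and the LINQ query have to walk the nodes until they find a match, so checking every archived file against a listing of millions of nodes repeats that walk millions of times. Here is a rough sketch of the pattern in Python, standing in for the .NET classes above. The element name "file" and the attribute name "name" are made up for illustration; the real listing format isn't shown in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing format: <file name="..."/> elements under one root.
LISTING = ET.fromstring(
    '<directory><file name="a.pdf"/><file name="b.pdf"/></directory>'
)


def exists_by_query(root, filename):
    # Path query with an attribute predicate. Nothing is indexed, so every
    # call searches the tree again. (Filenames containing a single quote
    # would break this naive string formatting.)
    return root.find(f".//file[@name='{filename}']") is not None


def exists_by_scan(root, filename):
    # The same lookup written as an explicit scan: still one pass over the
    # nodes per filename checked.
    return any(e.get("name") == filename for e in root.iter("file"))


if __name__ == "__main__":
    for name in ["a.pdf", "c.pdf"]:
        print(name, exists_by_query(LISTING, name), exists_by_scan(LISTING, name))
```

Either way, the cost is roughly (files checked) x (nodes in the listing), which would explain why swapping one query syntax for the other didn't change much.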

All I need is the filename string from an attribute of a node. Why not create an enumerable of strings and use that to see if the file exists? This would also optimize my process, as it wouldn't be necessary to read the entire XML document into memory. I could use XmlReader, pull out just the string I care about as I iterate the reader, and add it to the string list. Problem solved: the lookup performance is fantastic.
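For anyone who wants to see the shape of that approach, here is a minimal sketch in Python rather than the C# I actually wrote, with iterparse playing the role of XmlReader. The element name "file", the attribute name "name", and the function names are all hypothetical. I'm also using a set instead of a plain list in this sketch, since membership checks against a set are hashed and don't scan the collection.

```python
import io
import xml.etree.ElementTree as ET


def collect_filenames(xml_source, tag="file", attr="name"):
    """Stream the listing and return the set of filename attribute values."""
    names = set()
    # iterparse yields each element as its closing tag is read, so the
    # strings can be collected without querying a fully built document.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == tag:
            value = elem.get(attr)
            if value is not None:
                names.add(value)
        # Discard this element's attributes, text, and children now that
        # we are done with it.
        elem.clear()
    return names


def find_missing(archive_names, stored_names):
    """Return the archive filenames that do not appear in the listing."""
    return [name for name in archive_names if name not in stored_names]


if __name__ == "__main__":
    sample = b"""<directory>
  <file name="a.pdf"/>
  <file name="b.pdf"/>
</directory>"""
    stored = collect_filenames(io.BytesIO(sample))
    print(find_missing(["a.pdf", "b.pdf", "c.pdf"], stored))  # prints ['c.pdf']
```

The listing is read once, and after that each archived file costs one in-memory lookup instead of another pass over the XML.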
