Saturday, February 25, 2017

Wibbly Wobbly Timey Wimey Thready Computer Sciency stuff

The file migration adventure continues. Being tasked with moving files from a legacy store to a new content platform has been a challenge. Unfortunately, some manipulation of the files is required, so there is no simple, out-of-the-box solution.

After experimenting with some threading, I came to the conclusion that the fastest way to migrate the files was to iterate over the directories and, for each directory, start a console application that migrated that directory's 50,000 files over to the content platform. I could "parallelize" the solution by starting more instances of the console application.

My first thought was to use PowerShell. I experimented a bit, and all of my PowerShell solutions seemed rather clumsy. PowerShell is not my strong suit. So I abandoned the effort and decided to write a console application to start the console application.

It was important for me to be able to configure how many instances of the process to start, and I didn't want to start more than the configured number. My first solution was better than PowerShell, but still rather clumsy. It involved a task list: I would start the console application in a C# Process inside a Task that was part of a List<Task>. After all of the tasks were complete (Task.WaitAll()), another batch of processes would begin. The problem is that some of the tasks finish a little faster than others, and I discovered I was wasting time waiting for the whole batch to complete. Here's some code:

while (directories.Count > 0)
{
    // Take the next batch of directories, one per process instance.
    var miniDirs = directories.Take(threadsToStart).ToList();
    foreach (var dir in miniDirs)
    {
        Console.WriteLine(dir);
        tasks.Add(Task.Run(delegate
        {
            // Launch the migration console app for this directory
            // and hold the task open until the process exits.
            ProcessStartInfo pInfo = new ProcessStartInfo();
            pInfo.FileName = "Process.exe";
            pInfo.Arguments = volume + " " + dir.Split('\\')[5];
            Process.Start(pInfo).WaitForExit();
        }));
        directories.Remove(dir);
    }
    // Wait for the entire batch before starting the next one.
    Task.WaitAll(tasks.ToArray());
    tasks = new List<Task>();
}

Not elegant. I needed some computer sciency stuff. Semaphore to the rescue (actually SemaphoreSlim https://msdn.microsoft.com/en-us/library/system.threading.semaphoreslim(v=vs.110).aspx). Recalling that a semaphore makes threads wait and then releases them, I decided to move my Process.Start code into a method and start that method in a thread. The method would wait for the process to finish (WaitForExit()) and then release the semaphore, allowing another thread to start.

private static void StartProcess(string volume, string dir)
{
    // Block until the semaphore has a free slot.
    semaphore.Wait();
    try
    {
        ProcessStartInfo pInfo = new ProcessStartInfo();
        pInfo.FileName = @"Application.exe";
        pInfo.Arguments = volume + " " + dir.Split('\\')[5];
        Process.Start(pInfo).WaitForExit();
    }
    finally
    {
        // Free the slot so the next waiting thread can start its process.
        semaphore.Release();
    }
}
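
The semaphore itself is just a static SemaphoreSlim field created with the configured concurrency (reusing threadsToStart from the batch version above), roughly:

// Assumed field; created in Main once the configured concurrency is known.
private static SemaphoreSlim semaphore;

// e.g. in Main:
semaphore = new SemaphoreSlim(threadsToStart);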

I could start a thread for each directory in a foreach loop and be sure that no more processes would run at once than the count I passed to the SemaphoreSlim constructor. Here's the loop:

foreach (var directory in directories)
{
    // Each thread blocks inside StartProcess until the semaphore grants it a slot.
    Thread thread = new Thread(() => StartProcess(volume, directory));
    thread.Start();
}

Much more elegant, and the next process starts as soon as any of the running ones completes. I probably could have just asked Ethan Frei and saved myself some time!

Sunday, February 12, 2017

Xml Adventure

So it looks like I forgot to post in 2016. I'm going to try to rectify that this year.

I was faced with a challenge at work to migrate documents from a legacy store into a new content store. Iterate over the old archived directory, pick up the file, write it to content via HTTP. No problem. As always, I'd make every effort to catch and report exceptions, but really, what if I missed something? What if somehow the operation finished without getting EVERY document from the archive?
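
The migration loop itself boiled down to something like this sketch; the endpoint URL, the archive path, and the HTTP verb are just placeholders, not the real system's values:

using System;
using System.IO;
using System.Net.Http;

class Migrator
{
    static void Main()
    {
        var client = new HttpClient();
        // Walk the legacy archive and push each file to the content platform.
        foreach (var file in Directory.EnumerateFiles(@"\\legacy\archive", "*", SearchOption.AllDirectories))
        {
            try
            {
                var body = new ByteArrayContent(File.ReadAllBytes(file));
                // Hypothetical content-platform endpoint.
                var response = client.PutAsync("http://content/api/documents/" + Path.GetFileName(file), body).Result;
                response.EnsureSuccessStatusCode();
            }
            catch (Exception ex)
            {
                // Catch and report, as described above.
                Console.WriteLine("Failed to migrate {0}: {1}", file, ex.Message);
            }
        }
    }
}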

Writing a simple application that iterates over the old directory and checks that each file made it to content sounded like a good solution. My first attempt was, for every document, to make an HTTP GET call to the content provider and see if I got a 200 back. Of course, that's not optimal. Pretend you didn't just read that. For small subsets it would be fine, but for over 15 million records it would take a long time.
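
That first check was roughly this shape (the content URL and archiveRoot are placeholders):

// One GET per archived file; fine for a spot check, painful for 15 million.
using (var client = new HttpClient())
{
    foreach (var file in Directory.EnumerateFiles(archiveRoot))
    {
        var response = client.GetAsync("http://content/api/documents/" + Path.GetFileName(file)).Result;
        if (response.StatusCode != System.Net.HttpStatusCode.OK)
        {
            Console.WriteLine("Missing from content: " + file);
        }
    }
}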

Second solution: get the content directory returned as XML, iterate over the archived directory, and see if the filename is in the XML. My brain instinctively goes to what is familiar to me: the XmlDocument class and searching for a node with XPath. Once again, I know, reading a document that size into memory is not optimal, but, hey, memory is cheap. I know enough about XPath to avoid the expensive searches (https://blogs.msdn.microsoft.com/xmlteam/2011/09/26/effective-xml-part-2-how-to-kill-the-performance-of-an-app-with-xpath/). So I coded it up and noticed that performance was about the same as the HTTP GET. Wow, I thought, XPath sucks at looking up nodes in large documents.
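
The XmlDocument/XPath version looked roughly like this; the "document" element and "name" attribute are just guesses at the listing's shape:

// Load the whole listing into memory and probe it once per archived file.
var doc = new XmlDocument();
doc.Load("contentListing.xml");

foreach (var file in Directory.EnumerateFiles(archiveRoot))
{
    string fileName = Path.GetFileName(file);
    // Even a simple predicate like this scans the node list on every lookup.
    var node = doc.SelectSingleNode("//document[@name='" + fileName + "']");
    if (node == null)
    {
        Console.WriteLine("Missing from content: " + fileName);
    }
}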

I probably shouldn't use XmlDocument and XPath anyway; LINQ and XDocument would be preferable and demonstrate how hip I am. So I coded up a LINQ query and found out . . . performance was not noticeably better than XPath.
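
The XDocument/LINQ version was essentially the same shape (same assumed element and attribute names), which is probably why it performed about the same:

// LINQ to XML: still a scan of the elements for every single lookup.
var xdoc = XDocument.Load("contentListing.xml");

foreach (var file in Directory.EnumerateFiles(archiveRoot))
{
    string fileName = Path.GetFileName(file);
    bool exists = xdoc.Descendants("document")
                      .Any(d => (string)d.Attribute("name") == fileName);
    if (!exists)
    {
        Console.WriteLine("Missing from content: " + fileName);
    }
}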

All I need is the filename string from an attribute of a node. Why not create an enumerable of strings and use that to see if the file exists? This would also optimize my process, since it wouldn't be necessary to read the entire XML document into memory: I could use XmlReader, pull out just the string I care about as I iterate the reader, and add it to the string list. Problem solved; the lookup performance is fantastic.
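
Here's a sketch of that final approach, with the same assumed element and attribute names; I'm showing a HashSet<string> for the lookups, since Contains on a plain list still scans every entry:

// Stream the listing once with XmlReader, keeping only the filenames.
var contentFiles = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
using (var reader = XmlReader.Create("contentListing.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "document")
        {
            string name = reader.GetAttribute("name");
            if (name != null)
            {
                contentFiles.Add(name);
            }
        }
    }
}

// Every existence check is now a cheap in-memory lookup.
foreach (var file in Directory.EnumerateFiles(archiveRoot))
{
    if (!contentFiles.Contains(Path.GetFileName(file)))
    {
        Console.WriteLine("Missing from content: " + file);
    }
}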