Saturday, February 25, 2017

Wibbly Wobbly Timey Wimey Thready Computer Sciency stuff

The file migration adventure continues. Moving files from a legacy store to a new content platform has been a challenge. Unfortunately, some manipulation of the files is required, so there is no simple, out-of-the-box solution.

After experimenting with some threading, I came to the conclusion that the fastest way to migrate the files was to iterate over the directories, and for each directory, start a console application that migrated the 50,000 files over to the content platform. I could "parallelize" the solution by starting more instances of the console application.

My first thought was to use PowerShell. I experimented a bit, and all of my PowerShell solutions seemed rather clumsy. PowerShell is not my strong suit. So I abandoned the effort and decided to write a console application to start the console application.

It was important for me to be able to configure how many instances of the process to start, and I didn't want to start more than the configured number. My first solution was better than PowerShell, but still rather clumsy. It involved a task list. I would start the console application in a C# Process inside a Task that was added to a List<Task>. After all of the tasks in the batch were complete (Task.WaitAll()), another batch of processes would begin. The problem is that some of the tasks finish a little faster than others, and I discovered I was wasting time waiting for the slowest task in each batch before starting anything new. Here is some code:

while (directories.Count > 0)
{
    // Take the next batch of directories, one per process to start.
    var miniDirs = directories.Take(threadsToStart).ToList();
    foreach (var dir in miniDirs)
    {
        Console.WriteLine(dir);
        tasks.Add(Task.Run(delegate
        {
            ProcessStartInfo pInfo = new ProcessStartInfo();
            pInfo.FileName = "Process.exe";
            pInfo.Arguments = volume + " " + dir.Split('\\')[5];
            Process.Start(pInfo).WaitForExit();
        }));
        directories.Remove(dir);
    }
    // Block until the whole batch is done, even if most of it finished early.
    Task.WaitAll(tasks.ToArray());
    tasks = new List<Task>();
}

Not elegant. I needed some computer sciency stuff. Semaphore to the rescue (actually SemaphoreSlim https://msdn.microsoft.com/en-us/library/system.threading.semaphoreslim(v=vs.110).aspx). Recalling that a semaphore makes threads wait until a slot is released, I decided to move my Process.Start code into a method and start that method in a thread. The method would wait for the process to finish (WaitForExit()) and then release the semaphore, allowing another waiting thread to proceed.

private static void StartProcess(string volume, string dir)
{
    semaphore.Wait();
    try
    {
        ProcessStartInfo pInfo = new ProcessStartInfo();
        pInfo.FileName = @"Application.exe";
        pInfo.Arguments = volume + " " + dir.Split('\\')[5];
        Process.Start(pInfo).WaitForExit();
    }
    finally
    {
        // Release even if the process fails to start, so a slot is never lost.
        semaphore.Release();
    }
}

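I could start each thread in a foreach loop and be sure that no more processes would run at once than the count I passed to the SemaphoreSlim constructor. Every thread is created right away, but the extra ones simply block in Wait() until a slot frees up. Here's the loop: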

foreach (var directory in directories)
{
   Thread thread = new Thread(() => StartProcess(volume, directory));
   thread.Start();
}

Much more elegant, and a waiting thread gets to start its process as soon as any running process completes. I probably could have just asked Ethan Frei and saved myself some time!
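
The snippets above never show how the semaphore itself is created, so here is a minimal, self-contained sketch of the whole pattern. It is an illustration under assumptions, not the actual migration tool: the names ThrottleDemo, Run, migrate, and maxConcurrent are hypothetical, and the per-directory work is a short Thread.Sleep standing in for Process.Start(...).WaitForExit(). The method also reports the highest number of jobs it saw running at once, which should never exceed the semaphore's initial count.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ThrottleDemo
{
    // Starts one thread per directory, but lets at most maxConcurrent of them
    // run the migrate callback at the same time.
    // Returns the peak number of callbacks observed running simultaneously.
    public static int Run(IEnumerable<string> directories, int maxConcurrent,
                          Action<string> migrate)
    {
        using (var semaphore = new SemaphoreSlim(maxConcurrent))
        {
            int running = 0, peak = 0;
            var gate = new object();
            var threads = new List<Thread>();

            foreach (var directory in directories)
            {
                var thread = new Thread(() =>
                {
                    semaphore.Wait();            // block until a slot is free
                    try
                    {
                        lock (gate) { running++; peak = Math.Max(peak, running); }
                        migrate(directory);      // stand-in for the real process
                        lock (gate) { running--; }
                    }
                    finally
                    {
                        semaphore.Release();     // free the slot, even on failure
                    }
                });
                threads.Add(thread);
                thread.Start();
            }

            // Wait for every job so the semaphore is not disposed too early.
            foreach (var thread in threads) thread.Join();
            return peak;
        }
    }

    public static void Main()
    {
        var dirs = new List<string> { "dir1", "dir2", "dir3", "dir4", "dir5", "dir6" };
        int peak = Run(dirs, 3, dir => Thread.Sleep(50)); // simulated work
        Console.WriteLine("Peak concurrency: " + peak);   // never more than 3
    }
}
```

In the real tool, the migrate callback would be the place for the Process.Start(pInfo).WaitForExit() call. One thread per directory is fine for a modest number of directories, since the blocked threads do no work while they wait, but each one still costs memory.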
