Saturday, February 25, 2017

Wibbly Wobbly Timey Wimey Thready Computer Sciency stuff

The file migration adventure continues. Moving files from a legacy store to a new content platform has been a challenge. Unfortunately, some manipulation of the files is required, so there is no simple, out-of-the-box solution.

After experimenting with some threading, I came to the conclusion that the fastest way to migrate the files was to iterate over the directories, and for each directory, start a console application that migrated the 50,000 files over to the content platform. I could "parallelize" the solution by starting more instances of the console application.

My first thought was to use PowerShell. I experimented a bit, and all of my PowerShell solutions seemed rather clumsy. PowerShell is not my strong suit. So I abandoned the effort and decided to write a console application to start the console application.

It was important for me to be able to configure how many instances of the process to start, and I didn't want to start more than the configured number. My first solution was better than PowerShell, but still rather clumsy. It involved a task list: I would start the console application in a C# Process inside a Task that was part of a List<Task>. After all of the tasks completed (Task.WaitAll()), another batch of processes would begin. The problem is that some of the tasks finish a little faster than others, and I discovered I was wasting time waiting for the whole batch to complete. Here's the code:

while (directories.Count > 0)
{
    // Take the next batch of directories, one per process to start.
    var miniDirs = directories.Take(threadsToStart).ToList();
    foreach (var dir in miniDirs)
    {
        Console.WriteLine(dir);
        tasks.Add(Task.Run(delegate
        {
            ProcessStartInfo pInfo = new ProcessStartInfo();
            pInfo.FileName = "Process.exe";
            pInfo.Arguments = volume + " " + dir.Split('\\')[5];
            Process.Start(pInfo).WaitForExit();
        }));
        directories.Remove(dir);
    }
    // Wait for the entire batch before starting the next one.
    Task.WaitAll(tasks.ToArray());
    tasks = new List<Task>();
}

Not elegant. I needed some computer sciency stuff. Semaphore to the rescue (actually SemaphoreSlim: https://msdn.microsoft.com/en-us/library/system.threading.semaphoreslim(v=vs.110).aspx). Recalling that semaphores can block and release threads, I decided to move my Process.Start code into a method and start that method in a thread. The method waits on the semaphore, waits for the process to finish with WaitForExit(), and then releases the semaphore, allowing another thread to proceed.

private static void StartProcess(string volume, string dir)
{
   semaphore.Wait();
   ProcessStartInfo pInfo = new ProcessStartInfo();
   pInfo.FileName = @"Application.exe";
   pInfo.Arguments = volume + " " + dir.Split('\\')[5];
   Process.Start(pInfo).WaitForExit();
   semaphore.Release();
}
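
The semaphore itself is a static field. I didn't show the declaration above; a minimal sketch, assuming the configured count lives in a threadsToStart field, would be:

// Assumed declarations (not shown above); threadsToStart is the configured concurrency.
private static readonly int threadsToStart = 8;
private static readonly SemaphoreSlim semaphore = new SemaphoreSlim(threadsToStart);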

I could start each thread in a foreach loop and be sure that no more processes would run at once than the count I passed to the SemaphoreSlim constructor. Here's the loop:

foreach (var directory in directories)
{
   Thread thread = new Thread(() => StartProcess(volume, directory));
   thread.Start();
}
Much more elegant, and a new process starts as soon as any of the running ones completes. I probably could have just asked Ethan Frei and saved myself some time!

Sunday, February 12, 2017

Xml Adventure

So it looks like I forgot to post in 2016. I'm going to try to rectify that this year.

I was faced with a challenge at work to migrate documents from a legacy store into a new content store. Iterate over the old archive directory, pick up each file, write it to content via HTTP. No problem. As always, I'd make every effort to catch and report exceptions, but really, what if I missed something? What if somehow the operation finished without getting EVERY document from the archive?

Writing a simple application that iterates over the old directory and checks that each file made it to content sounds like a good solution. My first idea was, for every document, to make an HTTP GET call to the content provider and see if I get a 200 back. Of course, not optimal. Pretend you didn't just read that. For small subsets it would be fine, but for over 15 million records it would take a long time.
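
Just for the record, that check is a one-liner with HttpClient; something like this, where the content URL shape and method name are placeholders of mine:

// Hypothetical per-document check: GET the file and treat a 200 as "it made it".
private static async Task<bool> ExistsInContentAsync(HttpClient client, string contentBaseUrl, string fileName)
{
    var response = await client.GetAsync(contentBaseUrl + fileName);
    return response.StatusCode == System.Net.HttpStatusCode.OK;
}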

Second solution: get the content directory returned as XML, iterate over the archive directory, and see if each filename is in the XML. My brain instinctively goes to what is familiar to me: the XmlDocument class and an XPath search for the node. Once again, I know, reading a document that size into memory is not optimal, but, hey, memory is cheap. I know enough about XPath to avoid the expensive searches (https://blogs.msdn.microsoft.com/xmlteam/2011/09/26/effective-xml-part-2-how-to-kill-the-performance-of-an-app-with-xpath/). So I code it up and notice that performance is about the same as the HTTP GET. Wow, I think, XPath sucks at looking up nodes in large documents.
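
The lookup looked roughly like this (the element and attribute names are placeholders of mine, not the real content schema):

// Load the content listing once, then probe it with XPath for every archived file.
XmlDocument doc = new XmlDocument();
doc.Load("contentDirectory.xml");

bool found = doc.SelectSingleNode("/files/file[@name='" + fileName + "']") != null;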

I probably shouldn't use XmlDocument and XPath anyway; LINQ and XDocument would be preferable and demonstrate how hip I am. So I code up a LINQ query and find out . . . performance is not noticeably better than XPath.
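
Something along these lines, with the same placeholder names:

// Same linear scan, just expressed with LINQ to XML.
XDocument xdoc = XDocument.Load("contentDirectory.xml");

bool found = xdoc.Descendants("file").Any(e => (string)e.Attribute("name") == fileName);

In hindsight that makes sense: either way, every lookup walks a huge in-memory document.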

All I really need is the filename string from an attribute of a node. Why not build a collection of strings and use that to see if each file exists? This also optimizes the process, since it isn't necessary to read the entire XML document into memory. I can use XmlReader, pull out just the string I care about as I iterate the reader, and add it to the string list. Problem solved; the lookup performance is fantastic.
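
A sketch of that final pass (placeholder names again; I've shown a HashSet<string> rather than a plain list so the lookups themselves stay cheap):

// Stream the content listing once with XmlReader, collecting only the filenames.
var fileNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
using (XmlReader reader = XmlReader.Create("contentDirectory.xml"))
{
    while (reader.Read())
    {
        if (reader.NodeType == XmlNodeType.Element && reader.Name == "file")
        {
            fileNames.Add(reader.GetAttribute("name"));
        }
    }
}

// Then, while iterating the archive directory:
bool found = fileNames.Contains(fileName);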

Wednesday, June 10, 2015

Lua hash table

local Table = {}   -- every Lua table is an associative array
Table[d] = i       -- store value i under key d

Indexing a table with a key like this is all it takes: Lua tables are hash maps out of the box.

Thursday, April 9, 2015

Reset Azure VM admin password

Maybe it's just me, but I don't keep track of my Azure VM logins and passwords very well. Here is a TechNet article on how to reset it using Azure PowerShell.

http://blogs.technet.com/b/keithmayer/archive/2014/06/26/microsoft-azure-virtual-machines-reset-forgotten-admin-password-with-windows-powershell.aspx

Monday, April 14, 2014

Mapping ComplexType mixed=True Node

I was pretty proud of my previous solution for a node that may or may not have embedded HTML. Until I tried to use the BizTalk mapper to map said node. Simply linking the node to the target schema node resulted in only the text before the first embedded element being mapped. Aha, I thought, Mass Copy functoid to the rescue . . . but that resulted in only the embedded elements being copied and NONE of the text. I could not find a solution using standard functoids, so I turned to the reliable Scripting functoid. The following code, in a Scripting functoid set to Inline XSLT Call Template, worked:

<xsl:template name="CopyMergedTest">
    <xsl:param name="param1" />
    <!-- ElementToCopyTo is a placeholder for the target element name -->
    <xsl:element name="ElementToCopyTo">
        <xsl:apply-templates/>
    </xsl:element>
</xsl:template>
<xsl:template match="text()">
    <xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="@* | node()">
    <xsl:copy>
        <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
</xsl:template>

The final <xsl:template match="@* | node()"> was necessary to copy the embedded tags.

Monday, March 31, 2014

Element with embedded html schema

An incoming document will have a text element, but embedded in that text there may or may not be HTML, and the HTML may or may not have multiple nodes. Since the schema will treat HTML tags as elements, we need to handle this case as if we were receiving XML elements.

To create the XSD schema, it was necessary to use the xs:any element.

So the schema looks like this:

<xs:element minOccurs="0" maxOccurs="1" name="Element">
  <xs:complexType mixed="true">
    <xs:sequence>
      <xs:any minOccurs="0" maxOccurs="unbounded" processContents="skip"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

The mixed attribute on xs:complexType says the element can contain text as well as elements. Setting minOccurs="0" and maxOccurs="unbounded" on the xs:any element says there may be no embedded elements at all (in this case HTML), or any number of them. processContents="skip" tells the parser not to validate the embedded content against a schema, allowing us to have any elements.
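
For illustration, an instance like this (made-up content) is accepted: the loose text is allowed by mixed="true" and the tags by xs:any:

<Element>Some plain text <b>with</b> embedded <i>html</i> mixed in.</Element>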

http://msdn.microsoft.com/en-us/library/aa547371.aspx

Wednesday, January 8, 2014

Cross Domain MSDTC

I spent a lot of time troubleshooting an error from a BizTalk send port:

System.Runtime.InteropServices.COMException: The MSDTC transaction manager was unable to push the transaction to the destination transaction manager due to communication problems. Possible causes are: a firewall is present and it doesn't have an exception for the MSDTC process, the two machines cannot find each other by their NetBIOS names, or the support for network transactions is not enabled for one of the two transaction managers.

I thought I would summarize my findings.

What is MSDTC? Microsoft Distributed Transaction Coordinator. In a nutshell, MSDTC coordinates transactions between the BizTalk server and SQL Server. For instance, if you have a polling statement in SQL that updates the database, you would not want it to commit if the message fails to get added to the BizTalk MessageBox database. Using MSDTC, the update call will roll back if the MessageBox call fails. Without MSDTC, there is a risk of losing messages in this scenario.

MSDTC uses NetBIOS name resolution. This was one of the first hurdles I had to overcome, as the names were not resolving to the IP addresses of the servers. To troubleshoot this, an entry was made in the hosts file to resolve the SQL Server NetBIOS name. No matter what you read, this is not the preferred way of dealing with name resolution: if the IP of the server changes, every hosts file would have to be modified, causing an administration nightmare. Because MSDTC uses NetBIOS resolution rather than a fully qualified domain name, we needed a way to resolve the 'short' name with DNS. There is a setting in Advanced TCP/IP Settings in which you can append a suffix to the short name for resolution. Adding [domain] to this suffix list fixed the name resolution (http://technet.microsoft.com/en-us/library/cc959339.aspx). Once that was resolved, we ran into firewall issues.

MSDTC uses port 135 to initiate the connection. After that, Remote Procedure Call dynamically allocates a port between 1024 and 65535. Obviously, from a security standpoint, opening that range of ports between servers is not recommended. There is a way to limit the port range using registry settings. At some point this was done on the [biztalk server] box but not on the BizTalk server. Changing the registry settings requires a reboot of the BizTalk server. After that was done on dev and the firewall was opened to the new range of ports, everything worked correctly. We are currently allocating 200 ports for MSDTC connections between BizTalk and [sql server] in development (http://support.microsoft.com/kb/250367).
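
For reference, the KB article does this with values under the RPC key in the registry; roughly the following (the range shown is only an example, not the one we used):

HKEY_LOCAL_MACHINE\Software\Microsoft\Rpc\Internet
    Ports                  (REG_MULTI_SZ) = 5000-5200    <- example range only
    PortsInternetAvailable (REG_SZ)       = Y
    UseInternetPorts       (REG_SZ)       = Y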

The ability to turn off MSDTC is available in the binding settings of the send port. If UseAmbientTransactions is set to 'false', your send port will make its SQL call without using transaction coordination. This is not a fix for the error, though; be careful turning this off, and be sure distributed transactions really are not required. It would be safe if your SQL call included no updates, just gets.

There are also some settings under Component Services -> Computers -> My Computer -> Distributed Transaction Coordinator -> Local DTC that need to be configured. In Properties -> Security, Network DTC Access needs to be checked, as well as allowing the inbound and outbound connections. Also, since this is cross domain, No Authentication Required should be selected. Those settings need to match on both servers.