External File Upload Optimizations for Windows Azure

26. April 2010

I’m wrapping up a bit of the work we’ve been doing on data movement optimizations for cloud computing and the latest set of data yielded some interesting points I thought I’d share. The work done here is not really rocket science but may, in some ways, be slightly counter-intuitive and therefore seemed worthy of posting.

Summary: for those who don’t like to read detailed posts or don’t have time, the synopsis is that if you are uploading data to Azure, you should block your data (even down to 1MB) and upload the blocks in parallel. Set your block size based on your source file size, but if you must choose a fixed value, use 1MB. Following the above will result in significant performance gains… upwards of 10x-24x and a reduction in overall file transfer time of upwards of 90% (e.g., uploading a 1GB file averaged 46.37 minutes prior to optimizations and averaged 1.86 minutes afterwards).
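As a quick sanity check on the 1GB figures quoted above, the arithmetic can be worked through in a few lines. This is a minimal sketch in Python (not the C# used for the test harness); the variable names are mine, and it assumes 1GB means 1024MB.

```python
# Averages quoted above for the 1GB file, in minutes.
baseline_min = 46.37
optimized_min = 1.86

# How many times faster the optimized upload is, and the share of time saved.
speedup = baseline_min / optimized_min
reduction_pct = (1 - optimized_min / baseline_min) * 100

# Effective transfer rates in MB/s, assuming 1GB = 1024MB (an assumption).
size_mb = 1024
baseline_rate = size_mb / (baseline_min * 60)
optimized_rate = size_mb / (optimized_min * 60)

print(f"speedup: {speedup:.1f}x, reduction: {reduction_pct:.1f}%")
print(f"rate: {baseline_rate:.2f} MB/s -> {optimized_rate:.2f} MB/s")
```

For the 1GB case this works out to roughly a 25x speedup and a reduction in transfer time of about 96%.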

Detail: For those of you who want more detail, or think that the claims at the end of the preceding paragraph are over-reaching, what follows is information and code supporting these claims. As the title indicates, these tests were run from our research facility pointing to the Azure cloud (specifically US North Central, as it is physically closest to us) and do not represent intra-cloud results. We have performed intra-cloud tests, and the overall results are similar in notion, but the data rates are significantly different, as are the tipping points for the various block sizes (this will be detailed separately).

We started by building a very simple console application that would loop through a directory and upload each file to Azure storage. This application used the shipping storage client library from the 1.1 version of the Azure tools. The only real variation from the client library is that we added code to collect and record the duration (in ms) and size (in bytes) for each file transferred. The code is available here.

We then created a directory that had a collection of files for the following sizes: 2KB, 32KB, 64KB, 128KB, 512KB, 1MB, 5MB, 10MB, 25MB, 50MB, 100MB, 250MB, 500MB, 750MB, and 1GB (50 files for each size listed). These files contained randomly-generated binary data and do not benefit from compression (a separate discussion topic). Our file generation tool is available here.

The baseline was established by running the application described above against the directory containing all of the data files. This application uploads the files in a random order so as to avoid transferring all of the files of a given size sequentially, thereby spreading the effects of periodic Internet delays across the collection of results. We then ran some scripts to split the resulting data and generate some reports. The raw data collected for our non-optimized tests is available via the links in the Related Resources section at the bottom of this post.

For each file size, we calculated the average upload time (and standard deviation) and the average transfer rate (and standard deviation). As you likely are aware, transferring data across the Internet is susceptible to many transient delays which can cause anomalies in the resulting data. It is for this reason that we randomized the order of source file processing as well as executed the tests 50x for each file size. We expect that these steps will yield a sufficiently balanced set of results.

Once the baseline was collected and analyzed, we updated the test harness application with some methods to split the source file into user-defined block sizes and then to upload those blocks in parallel (using the PutBlock() method of Azure storage). The parallelization was handled by simply relying on the Parallel Extensions to .NET to provide a Parallel.For loop (see linked source for specific implementation details in Program.cs, line 173 and following… less than 100 lines total). Once all of the blocks were uploaded, we called PutBlockList() to assemble/commit the file in Azure storage. For each block transferred, the MD5 was calculated and sent, ensuring that the bits that arrived matched what was intended. The timer for the blocked/parallelized transfer method wraps the entire process (source file splitting, block transfer, MD5 validation, file committal). A diagram of the process is as follows:

[Diagram: ParallelAzureUploadDirect]
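The split / upload-in-parallel / commit flow described above might be sketched roughly as follows. This is Python rather than the C# in the linked harness, and FakeBlobStore, put_block, put_block_list, and upload_blocked are hypothetical stand-ins of mine that keep everything in memory; they are not the storage client library's API. A thread pool plays the role that Parallel.For plays in the real code.

```python
import base64
import hashlib
from concurrent.futures import ThreadPoolExecutor


class FakeBlobStore:
    """In-memory stand-in for blob storage (hypothetical, not the Azure API)."""

    def __init__(self):
        self.staged = {}      # block_id -> bytes, uploaded but not committed
        self.committed = b""  # blob contents after the block list is committed

    def put_block(self, block_id, data, md5_b64):
        # Reject the block if the received bytes do not match the sender's MD5.
        actual = base64.b64encode(hashlib.md5(data).digest()).decode("ascii")
        if actual != md5_b64:
            raise ValueError("MD5 mismatch for block " + block_id)
        self.staged[block_id] = data

    def put_block_list(self, block_ids):
        # Assemble the staged blocks in the order given.
        self.committed = b"".join(self.staged[b] for b in block_ids)


def split_blocks(data, block_size):
    """Cut data into consecutive chunks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def upload_blocked(store, data, block_size, workers=8):
    """Upload data as blocks in parallel, then commit the ordered block list."""
    blocks = split_blocks(data, block_size)
    block_ids = ["block-%06d" % i for i in range(len(blocks))]

    def send(pair):
        block_id, chunk = pair
        md5_b64 = base64.b64encode(hashlib.md5(chunk).digest()).decode("ascii")
        store.put_block(block_id, chunk, md5_b64)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any upload error.
        list(pool.map(send, zip(block_ids, blocks)))
    store.put_block_list(block_ids)
    return block_ids
```

Blocks may finish in any order; the ordered ID list passed to the commit step is what determines the final layout of the file.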

We then tested the effects of blocking & parallelizing the transfers by running the updated application against the same source set and did a parameter sweep on the block size including 256KB, 512KB, 1MB, 2MB, and 4MB (our assumption was that anything lower than 256KB wasn’t worth the trouble and 4MB is the maximum size of a block supported by Azure). The raw data for the parallel tests is available via the links in the Related Resources section at the bottom of this post.

This data was processed and then compared against the single-threaded / non-optimized transfer numbers and the results were encouraging. The Excel version of the results is available here.

Two semi-obvious points need to be made prior to reviewing the data. The first is that if the block size is larger than the source file size you will end up with a “negative optimization” due to the overhead of attempting to block and parallelize. The second is that as the files get smaller, the clock-time cost of blocking and parallelizing (overhead) is more apparent and can tend towards negative optimizations. For this reason (as supported by the raw data provided in the linked worksheet) the charts and discussion below ignore source file sizes less than 1MB.

[Chart: RateImprovement]

The chart above illustrates some interesting points about the results:

  • When the block size is smaller than the source file, performance increases but as the block size approaches and then passes the source file size, you see decreasing benefit to the point of negative gains (see the values for the 1MB file size)
  • For some of the moderately-sized source files, small blocks (256KB) are best
  • As the size of the source file gets larger (see values for 50MB and up), the smallest block size is not the most efficient (presumably due, at least in part, to the increased number of blocks, increased number of individual transfer requests, and reassembly/committal costs).
  • Once you pass the 250MB source file size, the difference in rate for 1MB to 4MB blocks is more-or-less constant
  • The 1MB block size gives the best average improvement (~16x) but the optimal approach would be to vary the block size based on the size of the source file.
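The rules of thumb above can be captured in a small helper. This is only a sketch in Python under the constraints stated in this post (4MB maximum block size, 1MB as the fixed default, and no blocking when the block would be at least as large as the file); choose_block_size and its parameters are hypothetical names, and it deliberately does not try to encode the per-file-size sweet spots from the chart.

```python
MB = 1024 * 1024
MAX_BLOCK = 4 * MB      # maximum block size noted earlier in this post
DEFAULT_BLOCK = 1 * MB  # the fixed value recommended in this post


def choose_block_size(file_size, preferred=DEFAULT_BLOCK):
    """Return a block size in bytes, or None to upload in a single request."""
    block = min(preferred, MAX_BLOCK)
    # A block as large as the file (or larger) leaves nothing to parallelize.
    if file_size <= block:
        return None
    return block
```

A caller would fall back to the ordinary single-request upload whenever this returns None, which covers the small-file cases where blocking showed negative gains.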

[Chart: RateImprovement2]

The above is another view of the same data as the prior chart just with the axis changed (x-axis represents file size and plotted data shows improvement by block size). It again highlights the fact that the 1MB block size is probably the best overall size but highlights the benefits of some of the other block sizes at different source file sizes.

[Chart: DurationReduction]

This last chart shows the change in total duration of the file uploads based on different block sizes for the source file sizes. Nothing really new here, other than that this view of the data highlights the negative effects of poorly choosing a block size for smaller files.

Summary

What we have found so far is that blocking your file uploads and uploading them in parallel results in significant performance improvements. Further, utilizing extension methods and the Task Parallel Library (.NET 4.0) makes short work of altering the shipping client library to provide this functionality while minimizing the amount of change to existing applications that might be using the client library for other interactions.

Related Resources

Cloud Computing, Theory, General Development

“Live” Monitoring Of Your Worker Roles in Azure

3. September 2009

image

I’ve been working for a bit on some larger-scale jobs targeting the Windows Azure platform and early last week had assembled a collection of worker roles that were supposed to be processing my datasets for a number of days moving forward. Unfortunately, they wouldn’t stay running. As always, they “worked on my machine”, so I naturally assumed that the problem was with the Azure platform :). I then proceeded to do what I thought was the correct action… go to the Azure portal and request that the logs be transferred to my storage account so I could review them and fix the problem. What I learned is that there were two problems with this solution:

  1. The time delay between requesting the logs and actually being able to review them is prohibitive for productive use. In my experience, the minimum turnaround was 20 minutes and was most often 30 or longer. I’m not sure why this was happening (by design, a temporary bug, an artifact of the actual problem with my code, or something else), but I know it was too long.
  2. Logs appear to get “lost”. In my scenario, my worker roles were throwing an exception that was uncaught by my code. Near as I can tell, when this happens, the Azure health monitoring platform assumes that the world has come to an end, shuts down that instance, and spins up a new instance. While this (health monitoring and auto-recovery) is a good thing, one side effect (the caveat being that this is my experience and may not be reality) is that the logs were stored locally and, when the instance was shut down/recycled, those log files went to the great bit-bucket in the sky. I was stuck in a failure mode with no visibility as to what was going wrong or how to fix it.

After pounding my head for a bit, I came up with the following solution – trap every exception possible and use queues. The first aspect allowed my worker roles to stay running. This may not always be the right answer, but for my use case, I adapted my code to handle the error cases and trapping / suppressing all exceptions proved to be a good answer. Further, doing so allowed me to grab the error message and do something interesting with it.

The second step (using queues) solved the (my) impatience problem. I created a local method called WriteToLog that did two things: write to the regular Azure log, and write to a queue I created called status (or something similarly brilliant). I replaced all of my “RoleManager.WriteToLog()” calls with calls to the local method and I then wrote a console app that would periodically (every few seconds) pop as many messages as it could (API-limited to 32) off of the status queue, dump the data as a local csv for logging and write the data to the screen. This allowed me to drastically reduce the feedback loop between my app and me, enabling me to fix the problems quickly.
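The dual-write logger and the polling console reader described above could be sketched along these lines. It is Python rather than the C# I actually used, and azure_log, status_queue, write_to_log, and drain_once are in-memory stand-ins with hypothetical names; only the 32-message batch limit comes from the text above.

```python
import csv
from collections import deque

MAX_BATCH = 32  # per-call dequeue limit mentioned above

azure_log = []          # stand-in for the regular role log
status_queue = deque()  # stand-in for the "status" queue


def write_to_log(event, message):
    """Send each entry to both the regular log and the status queue."""
    azure_log.append((event, message))
    status_queue.append((event, message))


def drain_once(csv_file):
    """Pop up to MAX_BATCH messages, append them to csv_file, return them."""
    batch = []
    while status_queue and len(batch) < MAX_BATCH:
        batch.append(status_queue.popleft())
    writer = csv.writer(csv_file)
    for event, message in batch:
        writer.writerow([event, message])  # local CSV copy
        print(event, message)              # live view on the console
    return batch
```

The console app would simply call drain_once in a loop with a short sleep between passes, which is what keeps the feedback loop down to a few seconds.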

There are certainly some downsides to this approach (do queues hit a max?, what is the overhead introduced by logging to a queue, once a message is dequeued, it is not available for other clients to read, etc), but it was a nice spot fix. A better implementation would have a flag in the config file or something similar that would control the queue-logging.

As you can see from the image above, I also wrote a little winform app to display the approximate queue length so I’d have an idea of the progress and how much work remained.

Cloud Computing, General Development

SilverLight and Paging with Azure Data

20. August 2009

If you’ve been watching my blog at all lately, you know that I’ve been playing with some larger data sets and Azure storage, specifically Azure table storage. Last week I found myself working with a SilverLight application to visualize the resulting data and display it to the user; however, I did not want to use the ADO.NET Data Services client (ATOM) due to the size of data in transmission. Consequently, I set up a web role that proxied the data calls and fed them back to the caller as JSON. Due to the limitation on Azure table storage of only returning 1,000 rows at a time, I needed to access the response headers in my SilverLight client to determine after each request if there were more rows waiting… and that was the rub… every time I tried to access the response headers collection (tried both with a WebClient and HttpWebRequest), I received a System.NotImplementedException.
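To make the paging requirement concrete, here is a minimal sketch (in Python, not the SilverLight/C# client) of the loop the client needs once it can read the response headers. The header names are the continuation headers as I understand the table service to return them; fetch_all, fetch_page, and make_fake_service are hypothetical, and the fake service simply pages through a range of integers in memory.

```python
NEXT_PK = "x-ms-continuation-NextPartitionKey"
NEXT_RK = "x-ms-continuation-NextRowKey"


def fetch_all(fetch_page):
    """Call fetch_page(token) until no continuation headers come back.

    fetch_page is a hypothetical callable returning (rows, headers).
    """
    rows, token = [], None
    while True:
        page, headers = fetch_page(token)
        rows.extend(page)
        next_pk = headers.get(NEXT_PK)
        if next_pk is None:
            return rows  # no continuation header: this was the last page
        token = (next_pk, headers.get(NEXT_RK))


def make_fake_service(total, page_size=1000):
    """In-memory stand-in that pages through range(total)."""
    def fetch_page(token):
        start = int(token[1]) if token else 0
        end = min(start + page_size, total)
        headers = {}
        if end < total:
            headers = {NEXT_PK: "pk", NEXT_RK: str(end)}
        return list(range(start, end)), headers
    return fetch_page
```

Without access to the headers there is no way to build the token for the next request, which is why the NotImplementedException stopped the whole approach.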

I pounded my head on this for a few days with no success until a helpful twitterer (@silverfighter) provided me a link that got me rolling. The root of the problem was my ignorance of how SilverLight’s networking stack functioned. As I (now) understand it, by default any networking calls (WebClient or HttpWebRequest) are actually handled by the browser and not .NET. This results in you getting access to only what the browser object hands you, which in my case, did not include the response headers.

The key here is that SilverLight 3 provides you the ability to tell the browser that you’d rather handle those requests yourself. By simply registering the http protocol (you can actually do it as granular as a site level) as handled by the Silverlight client, “magic” happens and you suddenly have access to the properties of the WebClient (ResponseHeaders) and HttpWebRequest (Response.Headers) objects that you would have expected to. The magic line you need to add prior to issuing any calls is as follows:

bool httpResult = WebRequest.RegisterPrefix("http://", WebRequestCreator.ClientHttp);

(yes… that’s it…)

The links to the appropriate articles are as follows:

http://blogs.msdn.com/carlosfigueira/archive/2009/08/15/fault-support-in-silverlight-3.aspx 

http://msdn.microsoft.com/en-us/library/dd470096(VS.95).aspx

http://blogs.msdn.com/silverlight_sdk/archive/2009/08/12/new-networking-stack-in-silverlight-3.aspx

Cloud Computing, General Development

Azure Blob Storage Blob IDs and “+”

30. July 2009

I’ve been kicking the tires on Azure’s blob storage and am working on uploading a 1.2GB+ NetCDF file. I stumbled across a couple of samples online that were very helpful in avoiding the de facto client library that ships with the SDK; however, I found myself bitten by something (likely due to my error somehow) that I thought I’d pass along.

When processing a larger file, my upload process would always fail at block #248. At first, I assumed it was a network transience issue and simply re-ran the upload, however, after having it fail on the exact same block 3 times, I decided that it wasn’t the network. In digging a bit into things, I found that the problem had to do with the encoding of the block IDs. The offending piece of code is here:

image

where i is an integer representing the index of the current block within the file and blockIds is an array of IDs used to build the block ID list as part of a putBlockList operation.

The Azure SDK would indicate that this code snippet is perfectly valid… block IDs need to be a Base64-encoded string uniquely identifying the block within the blob. Further, each ID (within a blob) must be of the same length prior to encoding (same number of bytes). In this scenario, BitConverter.GetBytes returns a 4-byte array of values for all numbers within the range (in my case, 0 – 314). The following is an example of the resulting string for some numbers:

  • 246: 9gAAAA==
  • 247: 9wAAAA==
  • 248: +AAAAA==

There are about 4 IDs that begin with a ‘+’ sign, and a similar number that begin with ‘/’. Every other index in my collection began with a normal alphanumeric character. After doing some poking around, I found some indications that others were having similar problems and went down the path of encoding the line differently (e.g. HttpServerUtility.UrlTokenEncode, etc.) to no avail. What I ended up with is simply prefixing my values with a standard “safe” string (“BlockId”)

image

This yielded a blockId that was unique, consistent length (notice the formatting of the indexer in the ToString() method), and “safe” in that it always began with a URL-safe character.
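Since the two code screenshots are not reproduced here, the following is a rough reconstruction of the idea in Python rather than the original C#. It assumes the original IDs were the Base64 of the 4-byte little-endian integer (which matches the strings listed above) and that the fix Base64-encodes “BlockId” plus a zero-padded index; the function names and the padding width of 7 are my assumptions, not the original code.

```python
import base64
import struct


def raw_block_id(i):
    """Base64 of the 4-byte little-endian integer, as described above."""
    return base64.b64encode(struct.pack("<i", i)).decode("ascii")


def prefixed_block_id(i, width=7):
    """Base64 of "BlockId" plus a zero-padded index (width is an assumption)."""
    text = "BlockId" + str(i).zfill(width)
    return base64.b64encode(text.encode("ascii")).decode("ascii")


# Compare the two schemes around the index where the upload failed.
for i in (246, 247, 248):
    print(i, raw_block_id(i), prefixed_block_id(i))

# Indexes in my range (0 - 314) whose raw ID starts with '+' or '/'.
unsafe = [i for i in range(315) if raw_block_id(i)[0] in "+/"]
print(unsafe)
```

Because the prefixed text is plain ASCII letters and digits of a fixed length, every encoded ID has the same length and starts with the same letter, which sidesteps the characters that need URL escaping.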

I’m certain that there is likely a better way to solve this problem, but this did the trick for me and maybe it will be helpful to someone else.

Cloud Computing, General Development

2009 BJU Programming Contest

24. March 2009

Bob Jones University

I had the privilege of being one of the alumni-judges at the annual Bob Jones University Computer Science department’s programming contest. This was the first time I’ve participated in this type of contest and I found it very interesting. The CS department had a fairly slick harness for executing the contest and supporting the judging in multiple languages and multiple platforms. As with anything of this nature, there were a few bumps in the road, but nothing of any consequence.

The contest turned out wonderfully… we had around 35 contestants (I lost count because we overflowed the one room and had to use a different room). There were 10 problems of various difficulties to be solved in a 3-hour time window. The contestants could solve the problems in any order, and could choose both their platform (Windows/Linux) and their language (C#, Java, Python, C++, Ruby). To my surprise, many of the contestants switched between languages rather than using just one as I would have expected. Every contestant solved at least one problem properly and all of the problems were solved by at least one person. The distribution of problems solved was pretty balanced as well.

As a judge, my job was to monitor the queue for submitted answers, run the submissions through the test harness, and reply with the results to the contestant. I was a bit amazed (though I shouldn’t have been) at the wide variety of coding styles and levels of verbosity used to solve the same problems. The contestants could also submit questions to the judges, and the favorite for the day was “can I leave and not come back?”.

I’d like to congratulate the winners and all of the contestants for a fine job and look forward to participating in next year’s event.

General Development

.NET is a Smorgasbord?

26. November 2008

Like many other .NET devs I often find myself expecting to be current in all of the existing and up-coming tools/technologies in the Microsoft/.NET platform. Frankly, I don't know how that is possible, especially with the pace at which MSFT (not to mention the surrounding ecosystem) is releasing tools and platforms. Over the past few years, my approach has been to know "enough" about the various tools/technologies so that I can be conversant, and also know when a particular toolset applies to my current project, thereby warranting a "deeper" dive into that area. Such has been the case for me with WPF and WCF (much of my work over the past while has been in the SharePoint/web space meaning WPF - until SilverLight - didn't have much of a play and we hadn't yet seen a need to switch from standard ASMX for our services). They fell into the bucket of tools I had seen while walking along the smorgasbord, but I simply hadn't decided I needed to consume them yet.

Scott Hanselman describes the .NET Framework and the MSFT tool suite as making it easy to "fall into success" (I'm sure I'm not quoting him correctly, but the idea is the same). Essentially, the tool set, while robust and quite capable, is approachable and relatively easy to simply build something with. Especially when you compare it to other languages such as C++ -- in C#/.NET it is relatively easy to build "okay" code, and not that hard to build good code and almost (yes, there are plenty of exceptions) hard to write *bad* code. It is much easier (at least in my opinion) to write bad C/C++ code and much harder to write good C/C++ code. I agree with him 100% - once you have a core competency on the platform, picking up the basics of the "new" stuff becomes almost trivial.

I was recently working on a project (someone else did most of the coding - I did some of the design and proof-of-concept work) and I was able to see this in action. We were building a security-focused app, being deployed to a mixed environment of XP and Vista machines, and we had a 6-7 week window to build it, test it, and have it deployed. We ended up building a Windows Service that hosted a WCF service, a desktop application using WPF, a webpart for SharePoint and an IIS-hosted WCF service. We made heavy use of the cryptography libraries which, oddly (to me) were one of the areas that the other developer had prior experience with, however neither he nor I had done any real work with WCF and WPF. The technologies offered us quite a bit as far as functionality and form, even for two guys who weren't "experts" in them - that's where the "magic" lies - I'm reasonably comfortable with the MSFT dev stack, and I'm handed two completely new-to-me technologies, and with a relatively small amount of effort, I'm able to use them in my application and reap the benefits they bring. Now, certainly there's quite a bit more functionality that WPF/WCF bring to the table than what we used or "grok'd" during this project, but it did what we needed to and quickly - making me want to dig further into those technologies and to use them for other projects.

General Development

Finally back where I want to be...

26. May 2008

It's frustrating to me to find myself redoing things that I've done before or re-solving problems. Over the years at Planet I've been involved with different software teams each with different levels of rigor, however most all of them have had, at minimum, an automated build process of some sort (at least for the past 4 years or so). Some of these systems were elaborate msbuild driven systems while others were a cobbling together of batch scripts or PowerShell linking msbuild, Vault, FogBugz and Community server.

The customer I've been working with for the past 16 months has "bitten off" the entire TFS tree and I've been the prime developer responsible for implementing it and getting it going... all, of course, while doing "real work". Further, (nearly) all of the work we've been doing has been SharePoint focused (custom list event handlers, web parts, site definitions, etc) which means any build must generate properly formed SharePoint Solution (*.wsp) files and the approaches to doing this and handling the installation/upgrade of such are pretty varied.

This weekend I finally completed a build on a project that meets my "minimum requirements" for being a properly formed build. I'm pleased that I was able to, in relatively short order, apply it to another project verifying repeatability. Here's what we're doing:

  1. All build scripts are handled by TFS 2008 (using OOTB functionality)
  2. Solution manifest files and DDF files are maintained both in dev and production build by a customized version of stsdev v1.3 (http://codeplex.com/stsdev)
  3. An "installer" is provided as part of the build (<buildRoot>/Install) that allows the back office team to simply double-click and go. We use the SharePoint Installer (http://codeplex.com/sharepointinstaller) tool/framework to provide this function
  4. All required web.config settings are handled via the feature receiver allowing them to be properly installed/removed on activation/deactivation
  5. Developer-level documentation is provided for the build based on the /// comments in the code. We use Sandcastle (http://codeplex.com/sandcastle) to do the core generation and Sandcastle Help File Builder (http://codeplex.com/shfb) to assist with the build script integration (I tried DocProject - http://codeplex.com/docproject - but it pooched VS 2008 and never worked as described; hopefully it will be more stable when it exits beta).
  6. Passed Style Cop (MS Source Analysis) rules
  7. Passed Code Analysis rules

I still have a ways to go prior to reaching my "nirvana"...

  1. Build should automatically run code analysis (this is certainly possible, I've simply not gotten it implemented yet)
  2. Build should automatically run source analysis (this is possible, I've simply not gotten it implemented yet)
  3. Full testing (unit and system) on build completion - Ideally it would spin up a VM, deploy the appropriate code, execute the test battery, clean up and report on the results.

Even so, it felt good to get a respectable build out the door and to know that it was process driven and repeatable.

SharePoint, General Development

What I'm looking for...

1. November 2007

In a number of the posts I've been writing on the SOA/BPM conference I've referred to the applicability (or lack thereof) of a given approach to "the problem set" that I'm currently working on. I thought it might be good to describe what it is that I'm looking for and a little bit as to why.

I'm working at an organization with roughly 4,000 employees that is in the process of "drinking the MSFT Koolaid". We are deploying nearly every MSFT product and working hard to bring consistency to the platform both from a services aspect as well as the development paradigm. We are focusing on SAP backend, SQL data store, and SharePoint/Office Suite front-end.

We are also focusing on bringing a consistent story to the workflow problem, and, by "workflow" I really mean (well, at least for the purposes of this post) business process. We have business processes in SAP that have workflows behind them and use various means of interaction to keep them moving (e.g. nag-mails, etc.). We have similar workflows in SharePoint for document approval processes and the like. We also have a third set of workflows (business processes) that are either not automated at all, or not in any sort of consistent interface. It is this last set of processes that I'm trying to "fix".

I have a couple of "rules":

  1. The interface must live in SharePoint (at least the end-user facing portion)
  2. The "workflow" aspects of the system should utilize one of the two existing execution engines we have running (SAP or WF in MOSS)
  3. The workflow designer should be comfortable for a business analyst to use, and preferably an extension to a tool they are already using (i.e. Visio add-in).
  4. The workflow designer should be able to map actual requirements to the steps in the process (where applicable) and serve as a documentation source of sorts.
  5. The process execution system must provide webparts (or at least the ability to expose the data as webparts) for the following scenarios:
    1. Overall health of the system
    2. Current user's workflows currently in action
    3. Visual representation of a given workflow process
    4. Analysis relative to the execution stats for each workflow instance and type
      1. e.g. Step X of workflow Y is over the planned duration without having completed. Some indication should exist that this process is out of line
      2. e.g. Workflows of type X average Y days with the minimum being Z days and the maximum being T days (and trends over time)
  6. There must be a hard link between process definition and process execution. The File | Print approach is not sufficient
  7. The system should be standards-based. Meaning, I should be able to import/export the workflow definitions (at least at some level of granularity) to an industry standard such as BPEL or BPMN in order to be able to share or compare that process with other organizations.
  8. The system should be able to provide governance over the models.
  9. In all aspects possible, the system should provide a consistent story WRT the development paradigm we have selected (MSFT .NET, ASP.NET, Windows Workflow, SharePoint, etc.).

Reality...

  1. I'm not exactly certain I'm ready to back down on any of the items above, however I'm coming to the conclusion that there may be a need for a "third" execution engine - meaning, all of the vendors that I saw that had platforms that did what I wanted, either used their own custom execution engine or hosted their own instance of WF separate from MOSS. Even MSFT pushed BizTalk as the "main" process execution engine for *serious* workflows.
  2. If reality point #1 is in fact true, the process execution platform should be based on WF allowing the developers to have a consistent development paradigm.

General Development, Conferences