Geeks With Blogs
Cloud9 Azure and Cloud Services, WCF, WF, Dublin, Geneva and Federated Security, Oslo

I asked someone the following question:

In the following code (from an MSDN Magazine 2010 issue, Thomas Erl):

    [Description("PartitionKey=UserId, Rowkey=AccountId")]
    public class UserAccountBalance : TableServiceEntity
    {
        public double Balance { get; set; }

        // Parameterless constructor: random PartitionKey and RowKey.
        public UserAccountBalance()
            : base(Guid.NewGuid().ToString(), Guid.NewGuid().ToString()) { }

        // PartitionKey = userId, RowKey = accountId.
        public UserAccountBalance(Guid userId, Guid accountId)
            : base(userId.ToString(), accountId.ToString()) { }
    }

Is this creating one partition for every UserId in the system?

My understanding of partitions is this: say I have 50 records for each user and I have 10,000 users. My query doesn't have to do a table scan across all 10,000 users to find the 50 records I'm interested in, thereby improving performance.
Other tricks can also be done with those 50 records, like moving them closer to where they are queried the most. For example, if my user is coming from California, he hits the WebRole instance in the West Coast datacenter, and somehow Azure is able to learn to move the California user's data closer to that WebRole instance.
But a guy from New York would have his records moved to the Chicago datacenter, as he would most likely access the Chicago WebRole instance doing the query.

That being said, wouldn't using the UserId as the PartitionKey fragment the data all over the place?
Or does Azure begin to separate records out to servers based on performance, even though I started with one record per partition?

I can see how a table that keeps track of a user's historical bank transactions can be partitioned based on UserId,
but bank balances seem to be one record per user.


And I got the following answer:

Usually, you choose a PartitionKey based on 2 factors.

1. Entities within the same partition are usually stored on the same server. Obviously, a search across several servers (even in the same data center) is slower than a search on a single server.

2. PartitionKey is indexed. That means if you query entities for a particular partition, you don't need to perform a table scan.

Both factors need to be considered, and sometimes they conflict with each other. For example, if you have too many entities in a partition, you must scan more data when you want to query a particular partition. But if you divide the partition into several small partitions, there's no guarantee that they will be stored on a single server (they may or may not)...

On the other hand, if your query doesn't contain a PartitionKey or RowKey, you're always required to do a table scan, because only PartitionKey and RowKey are indexed. So for the bank balance table, I'm not sure why the author chose UserId as the PartitionKey, but it is probably because most queries are done per user, so it is desirable to index the UserId.
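To make the indexed-versus-scan point concrete, here is a minimal sketch in plain C#. It is an in-memory stand-in, not the Azure storage client: `BalanceRow`, `PartitionSketch`, and its methods are hypothetical names made up for illustration, and a nested dictionary only loosely mimics how the real service indexes PartitionKey and RowKey.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-in for a table row; not an Azure SDK type.
public class BalanceRow
{
    public string PartitionKey;   // UserId
    public string RowKey;         // AccountId
    public double Balance;        // not indexed
}

public static class PartitionSketch
{
    // Rows are reachable by PartitionKey, then RowKey: the only two "indexed" properties.
    static readonly Dictionary<string, Dictionary<string, BalanceRow>> Index =
        new Dictionary<string, Dictionary<string, BalanceRow>>();

    public static void Insert(BalanceRow row)
    {
        Dictionary<string, BalanceRow> partition;
        if (!Index.TryGetValue(row.PartitionKey, out partition))
        {
            partition = new Dictionary<string, BalanceRow>();
            Index[row.PartitionKey] = partition;
        }
        partition[row.RowKey] = row;
    }

    // Keyed query: jumps straight to one partition and reads only its rows.
    public static IEnumerable<BalanceRow> ByUser(string userId)
    {
        Dictionary<string, BalanceRow> partition;
        return Index.TryGetValue(userId, out partition)
            ? partition.Values
            : Enumerable.Empty<BalanceRow>();
    }

    // Non-key query: has to visit every row in every partition (a table scan).
    public static IEnumerable<BalanceRow> ByMinimumBalance(double minimum)
    {
        return Index.Values.SelectMany(p => p.Values).Where(r => r.Balance >= minimum);
    }
}
```

With 10,000 users, `ByUser` touches only the rows of one user, while `ByMinimumBalance` walks all of them, which is the difference the answer describes.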

As for data centers, we do not automatically store the data in a data center that is near the user's request location. When creating the storage account, you're required to choose a data center. If you want to serve global users with great performance, you need to create several different storage accounts targeting different data centers.
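Since the storage service won't move data toward the user, the application has to pick the account itself. A rough sketch of that idea, assuming one storage account per data center: the region labels and account names below are invented for illustration, and how you work out a user's region is left out entirely.

```csharp
using System;
using System.Collections.Generic;

public static class RegionalAccounts
{
    // Hypothetical region labels and account names; each account lives in one data center.
    static readonly Dictionary<string, string> AccountByRegion =
        new Dictionary<string, string>
        {
            { "us-west", "mybankwest" },
            { "us-central", "mybankcentral" }
        };

    // The application, not the storage service, decides which account serves a user.
    public static string AccountFor(string userRegion, string fallbackRegion)
    {
        string account;
        if (AccountByRegion.TryGetValue(userRegion, out account))
            return account;
        return AccountByRegion[fallbackRegion];
    }
}
```

A California user would be routed with `AccountFor("us-west", "us-central")`, and a user from an unmapped region falls back to whichever account you name as the default.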

Posted on Thursday, January 14, 2010 10:33 AM

Comments on this post: Azure storage partitioning strategy question


Copyright © Juan Suero