Monday, January 21, 2013

Social Network Analysis of Apache CloudStack

Apache is about building communities of developers and users around open source software. As such, these communities can be analyzed with social networking tools to identify patterns of communication (communication networks), find sub-communities within groups, and surface the most influential nodes. This type of analysis can be done over time using data from the Apache mailing lists.

Since I am a new Apache committer on CloudStack, I wanted to have a look at the health of our community and thought a social network analysis (SNA) would be a good way to do it. A little googling led me to a very nice research paper on SNA of the R mailing lists. I have not done all the analysis mentioned in the paper, especially the content-based analysis, but I wanted to post my early results.

Methodology: To get the graphs I grabbed the email archives from Apache. I used Python to load each mbox file into its own Mongo collection. I cleaned the data to remove duplicate senders as well as JIRA and Review Board entries. Then, with a little bit of PyMongo, I made the queries and built the graph with NetworkX. I finished up with the graph visualization and calculations using Gephi. Since there are thousands of emails and threads, there is still some work left to pre-process the data, avoid duplicates, and match individuals who use multiple email addresses.
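
For reference, here is a minimal sketch of this pipeline, assuming an mbox archive downloaded from the Apache mail archives and a local MongoDB instance. The database, collection, and field names are illustrative, not necessarily the exact ones I used:

import mailbox
import networkx as nx
from pymongo import MongoClient

emails = MongoClient()['cloudstack']['dev']  # hypothetical database/collection names

# Load each message of the mbox archive as one Mongo document
for msg in mailbox.mbox('cloudstack-dev.mbox'):
    emails.insert_one({
        'sender': msg['From'],
        'subject': msg['Subject'],
        'message_id': msg['Message-ID'],
        'in_reply_to': msg['In-Reply-To'],
    })

# Build the communication graph: connect the sender of each reply
# to the sender of the message it replies to
G = nx.Graph()
for doc in emails.find({'in_reply_to': {'$ne': None}}):
    parent = emails.find_one({'message_id': doc['in_reply_to']})
    if parent and doc['sender'] and parent['sender']:
        G.add_edge(doc['sender'], parent['sender'])

nx.write_gexf(G, 'cloudstack-dev.gexf')  # GEXF files open directly in Gephi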

Using Gephi, I manipulated the graphs. I computed the degree of each node (i.e., the number of direct connections to other nodes) and the betweenness centrality (i.e., a measure of how often a node lies on the shortest path between two other nodes; in other terms, is a node the best "proxy" between two other nodes?). I then partitioned the graphs with a color code, trying to identify sub-communities. Finally, for clarity, I filtered nodes by degree. For CloudStack, filtering is especially important since the list has grown quite large of late (this may actually be an indirect sign that it is time to split the dev list).
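
These measures, and the partitioning into sub-communities, can also be computed directly in NetworkX before handing the graph over to Gephi. Here is a sketch reusing the graph G built above; the degree threshold is an illustrative value, and NetworkX's greedy modularity maximization will not match Gephi's Louvain partitions exactly:

import networkx as nx
from networkx.algorithms import community

degree = dict(G.degree())                   # direct connections per node
betweenness = nx.betweenness_centrality(G)  # how often a node bridges shortest paths

# The most "influential" nodes by betweenness centrality
top = sorted(betweenness, key=betweenness.get, reverse=True)[:10]

# Partition the graph into sub-communities by maximizing modularity
communities = community.greedy_modularity_communities(G)

# For clarity, keep only well-connected nodes, as done with the Gephi filter
min_degree = 5  # illustrative threshold
core = G.subgraph(n for n, d in degree.items() if d >= min_degree)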

The graph of the cloudstack-dev mailing list can be seen below:

What stands out right away are the largest nodes, i.e., the most influential nodes according to betweenness centrality. Chip, David, Edison, Chiradeep, Hugo, Wido, and Alex are all members of the PMC and exhibit a high centrality. Prasanna and Rohit also exhibit a high centrality but are not currently in the PMC. Also of interest is that this graph covers the entire period since CloudStack joined Apache in April 2012, so we can identify contributors who are not currently active but once were, and who are thus still part of the overall communication network.

The color code highlights communities within the community. There seem to be four to five sub-communities (green, blue, red, cyan, yellow); more investigation is necessary to give interesting meanings to these sub-communities. You will also notice that the edges all have the same thickness, meaning that they all have the same weight: once two people exchange an email, an edge is drawn between the two nodes, and if they communicate again the edge is not modified. I will add edge weighting in a future study; this will show us "pathways" between community members and will also affect the influence of the nodes.

Update January 22nd: I added weight to the edges. In plain English, this means that every time two people communicated in a thread, I increased the weight of the edge between them by 1. The graph below shows edges with different thicknesses. Nodes and edges were filtered to highlight the strongest connections. This clearly shows the "PMC" of ACS.
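
In graph terms the weighting is a simple accumulation; a minimal sketch:

import networkx as nx

def add_exchange(G, a, b):
    # Record one exchange between a and b, accumulating the edge weight
    if G.has_edge(a, b):
        G[a][b]['weight'] += 1
    else:
        G.add_edge(a, b, weight=1)

G = nx.Graph()
add_exchange(G, 'alice@example.org', 'bob@example.org')
add_exchange(G, 'alice@example.org', 'bob@example.org')
assert G['alice@example.org']['bob@example.org']['weight'] == 2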

The graph of the cloudstack users mailing list can be seen below:

What stands out the most in this graph is that some of the PMC members are still influential (Chiradeep, David, Alex, and Edison, for instance), but new influential nodes have appeared, most notably mcirauqui, geoff.higginbottom, and ahmad.emneina. Chip Childers is still present, but his influence in this users community is much smaller. Based on this I am ready to campaign for mcirauqui and geoff to become committers, as they are clear contributors to the CloudStack users community :)

For comparison, I checked the HDFS dev mailing list (note that this is fairly restrictive, since Hadoop is a very large ecosystem with many mailing lists), followed the same process, and obtained the following graph. Maybe the HDFS community can help me analyze it and see if this gives the right picture of their dev community :)

I plan to do more work on this: cleaning the dataset a bit further, studying the community partitioning, and especially building content-based graphs. These will allow us to identify the communication network on a particular topic. Say you want to learn about SDN support in CloudStack: we could generate the graph and see who the most "influential" nodes on SDN in CloudStack are.
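
A naive first pass at such a content-based graph would simply filter the messages by topic keyword before building the graph. Here is a sketch reusing the Mongo collection from the methodology above; a real content-based analysis would of course need proper text processing rather than a bare regex:

import networkx as nx

# Keep only messages whose subject mentions the topic
sdn_emails = emails.find({'subject': {'$regex': 'SDN', '$options': 'i'}})

G_sdn = nx.Graph()
for doc in sdn_emails:
    if not doc.get('in_reply_to'):
        continue
    parent = emails.find_one({'message_id': doc['in_reply_to']})
    if parent and doc['sender'] and parent['sender']:
        G_sdn.add_edge(doc['sender'], parent['sender'])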

Tuesday, January 08, 2013

A Mahout Cluster across France and Luxembourg Using CloudStack

In early December I attended the Grid5000 Winter School, held at the Ecole des Mines de Nantes (EMN) and organized by Adrien Lebre. Grid5000 is "a scientific instrument designed to support experiment-driven research in all areas of computer science related to parallel, large-scale or distributed computing and networking": basically, a large-scale testbed to design, build, and test distributed systems. The US also has such an infrastructure in academia, called FutureGrid. These research infrastructures have become key to enabling research on distributed systems at scales approaching those now seen in industry, rather than testing systems on a couple of machines in a single lab. Currently Grid5000 operates 1195 physical hosts, for a total of 8184 cores across 10 sites.

While in Nantes I met with Alexandra Carpen-Amarie, an INRIA research engineer who developed an amazing tool, G5k campaign. G5k campaign allows any user of Grid5000 (G5k) to book nodes on the infrastructure, deploy machines with bare-metal provisioning, and then deploy their favorite IaaS cloud framework (currently CloudStack, OpenNebula, and Nimbus). The G5k campaign scripts are available via git; of particular interest are some Chef recipes. Although heavily tailored to Alexandra's scripts, they could be useful for the CloudStack community. Alexandra held a tutorial, attended by approximately 30 people, on deploying an IaaS and a PaaS on G5k. For the tutorial the PaaS was Apache Mahout. Let's not get into a discussion about whether Mahout is a PaaS or not; the point is that an IaaS can be used to deploy and manage a set of nodes that run Hadoop and Mahout on top, to provide a high-level functionality, in this case machine learning algorithms to analyze large datasets. How did it work exactly?

One thing about G5k, and I believe the French research computing community in general, is that they are very prolific in creating great tools. Unfortunately, few people know about them. The clusters of G5k are operated like regular batch-processing clusters, with a batch scheduler used to access the nodes. Tool #1 is OAR, a PBS/MOAB-like equivalent. Once the nodes are allocated, they are provisioned using Tool #2, Kadeploy, a Crowbar-like equivalent. Of great interest is Tool #3, KaVLAN, a tool to lease the VLANs configured on G5k. While not currently used in Alexandra's G5k campaign, I hope the Apache CloudStack community can start making use of it to test Advanced Zones.

The beauty of G5k campaign is that Alexandra has hidden most of the complexity of the provisioning and configuration. You only need to write a YAML configuration file for your deployment, specifying the sites you want to run at and the number of nodes you want at each, for example:

deployment:
  engine:
    name: CloudStack
    customization_type: multisiteChef
  walltime: 2:00:00
  sites:
    rennes:
      nodes: 10
      subnet: slash_22=1
    nancy:
      nodes: 10
      subnet: slash_22=1     
    sophia:
      nodes: 10
      subnet: slash_22=1
ssh:
  user: username

Launch your campaign and wait for the nodes to be allocated, provisioned, and then configured with your IaaS. Depending on the number of nodes requested, you could have a cloud working within 20 minutes. You can then interact with it. In the case of CloudStack, Alexandra developed some wrappers around the API to manage VMs. She did this before CloudMonkey came out, so the wrappers are no longer needed, though they were still a great exercise. A couple of days after the tutorial I asked Alexandra to deploy CloudStack across several sites. Within 24 hours I had the snapshots below in my inbox: a five-site cloud, with one basic zone per physical site, 97 physical hosts set up, 800 cores, and 100 VMs deployed running Mahout. It took 30 minutes to deploy the nodes and one hour to configure the hosts in CloudStack (serially; Alexandra is working on adding parallel configuration to her tool). The 100 VMs were deployed in roughly 10 minutes. The three missing physical nodes were due to bare-metal provisioning problems.
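
For a flavor of what such wrappers look like, here is a minimal sketch of calling the CloudStack API directly. The endpoint and keys are placeholders; the request signing (a sorted, lower-cased query string, HMAC-SHA1, base64) follows the CloudStack API documentation:

import base64, hashlib, hmac, urllib.parse, urllib.request

ENDPOINT = 'http://management-server:8080/client/api'    # placeholder
API_KEY, SECRET_KEY = 'your-api-key', 'your-secret-key'  # placeholders

def cloudstack_request(command, **params):
    params.update(command=command, apikey=API_KEY, response='json')
    # Sort the parameters, then sign the lower-cased query string with HMAC-SHA1
    query = '&'.join('%s=%s' % (k, urllib.parse.quote(str(v), safe=''))
                     for k, v in sorted(params.items()))
    digest = hmac.new(SECRET_KEY.encode(), query.lower().encode(),
                      hashlib.sha1).digest()
    signature = urllib.parse.quote(base64.b64encode(digest), safe='')
    return urllib.request.urlopen('%s?%s&signature=%s'
                                  % (ENDPOINT, query, signature)).read()

# e.g. list the deployed virtual machines
print(cloudstack_request('listVirtualMachines'))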

The snapshot below shows the infrastructure/zone view of the CloudStack deployment. Five basic zones were configured at Rennes, Toulouse, Nancy, Sophia, and Luxembourg, all connected via the RENATER fiber network.

Below, the infrastructure view shows five zones, 97 hosts, 10 system VMs (console proxies and secondary storage), and 4 virtual routers (one router was not started at the time of the snapshot).

A small detail that you may have noticed in the YAML configuration file is that this is all based on ssh. Access to G5k is via ssh keys and not via a PKI infrastructure. Having worked on TeraGrid, this was a nice surprise: using PKI across different organizations and managing authorization can be extremely complex, and this was a sore point in TeraGrid. PKI is also used in the LHC grid with more success, but it still requires a lot of work. In G5k the user base is smaller and more trusted. SSH keys are distributed among the sites using a basic NFS setup on the private RENATER network, which makes it easy for users to access all sites.

Looking ahead, the basic question one might ask is whether it makes sense to run Mahout within virtual machines. In cases where the dataset is not very large, the use of HDFS as a large-scale distributed storage system is not the issue; rather, the time spent running the machine learning algorithms is, and there the CPU overhead of virtualization is the main performance factor. Alexandra pointed me to a paper she wrote on the performance of map-reduce in the cloud. I asked her to do some more analysis specific to a CloudStack-based cloud, so stay tuned for the results :).