Growing Collaborations
existing 2012 links to forecast the networks of new 2014 links. Figure 2 contains an example of the model’s output, showing one model run that uses the 2012 existing links to forecast the 2012 new links in the information-sharing subnetwork. We used twelve model runs, each producing a similar figure. The full results from each run are recorded in Tables 4 and 5. Model runs comparing existing and new links within the 2012 data do not apply link decay, as the data provide no information on link decay between existing and new links in the same year. The analysis also considered whether to use two iterations of the model in each model repetition when forecasting the 2014 new and existing links from the 2012 new and existing links. In that case, we ran the model’s accretion and decay algorithms twice before comparing the forecast and observed networks, on the assumption that model output represents an annual process that would repeat twice between 2012 and 2014.
Each possible link falls into one of four categories: 1) present in both the empirically observed and the forecast networks; 2) absent from both the observed and the forecast networks; 3) present in the forecast network only; or 4) present in the observed network only. Counts of the links in each of these four categories are reported in the model output under the Direction of Variation heading (Figure 2). The counts are used to determine whether the model is accurately placing new links within the forecast network. A perfectly performing model will match all forecast links to all observed links, so every link will be either in both networks or in neither network. The count of links that occur in only one of the two networks indicates the degree of difference between the observed and forecast networks, quantifying the forecast error. This measure is known as a Hamming distance (Hamming, 1950), which is simply the count of differences between two strings, vectors, or matrices. The shortcoming of using a Hamming distance to measure the variation between two networks is that networks vary in size. Adding nodes drastically increases the number of possible links; while the Hamming distance is a very common approach to measuring the difference between character strings, vectors, and matrices, it does not account for how much variation is possible. A Jaccard distance (Jaccard, 1912) does account for the possible variation by dividing the Hamming distance by the maximum number of differences that are possible, which, in a network, is the number of possible links. Therefore, we can interpret the Jaccard distance as the model’s error rate, reporting the proportion of possible links that were forecast incorrectly.
Keeping separate counts of the links that appear in only the observed network and in only the forecast network also makes it possible to determine whether the forecast networks are over- or under-estimating the number of links in the observed networks.
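The comparison described above can be illustrated with a minimal sketch. It is not the model’s actual code: the function name `compare_networks`, the example nodes and links, and the assumption that links are undirected (so that a network of n nodes has n(n-1)/2 possible links) are all introduced here for illustration only. The error rate follows the definition given in the text, that is, the Hamming distance divided by the number of possible links.

```python
from itertools import combinations


def compare_networks(nodes, observed, forecast):
    """Count links in each of the four categories and derive the error measures.

    Assumption (not from the paper): links are undirected, so every
    unordered pair of distinct nodes is one possible link.
    """
    obs = {frozenset(link) for link in observed}
    fc = {frozenset(link) for link in forecast}
    possible = {frozenset(pair) for pair in combinations(nodes, 2)}
    if not possible:
        raise ValueError("at least two nodes are needed to form a link")

    counts = {
        "both": len(obs & fc),                # correctly forecast links
        "neither": len(possible - obs - fc),  # correctly forecast absences
        "forecast_only": len(fc - obs),       # forecast but not observed
        "observed_only": len(obs - fc),       # observed but not forecast
    }
    # Hamming distance: number of links on which the two networks disagree.
    hamming = counts["forecast_only"] + counts["observed_only"]
    # Error rate as defined in the text: disagreements over possible links.
    error_rate = hamming / len(possible)
    return counts, hamming, error_rate


# Hypothetical four-node example (6 possible undirected links).
nodes = ["A", "B", "C", "D"]
observed = [("A", "B"), ("B", "C"), ("C", "D")]
forecast = [("A", "B"), ("B", "C"), ("A", "D"), ("B", "D")]

counts, hamming, error_rate = compare_networks(nodes, observed, forecast)
print(counts, hamming, error_rate)
```

In this hypothetical example, two links appear only in the forecast network and one appears only in the observed network, so the forecast over-estimates the number of links; three of the six possible links are forecast incorrectly.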
We use two control models to compare our model’s performance with two plausible alternatives. The first control assumes no changes in the modeled network; the existing network (Figure 1) is read into the model and then compared directly to the observed network (i.e., ties can neither be created nor decay). This control is referred to as the fixed initial conditions control. The second control uses an approach based on a common random network generator. Comparing empirical networks to random networks is a common approach in network analysis for determining whether certain network phenomena are present in the empirical network. For example, Watts and Strogatz (1998) use a random