Identifying robust clusters and multi-community nodes by combining top-down and bottom-up approaches to clustering

Chris Gaiteri1*,2†, Mingming Chen3†, Boleslaw Szymanski3,4, Konstantin Kuzmin3, Jierui Xie3*,5, Changkyu Lee2, Timothy Blanche2, Elias Chaibub Neto6, Su-Chun Huang7, Thomas Grabowski7,8, Tara Madhyastha8 and Vitalina Komashko9

1 Rush University Medical Center, Alzheimer’s Disease Center, Chicago, IL
2 Allen Institute for Brain Science, Modeling, Analysis and Theory Group, Seattle, WA
3 Rensselaer Polytechnic Institute, Department of Computer Science, Troy, NY
4 Społeczna Akademia Nauk, Łódź, Poland
5 Samsung Research America, San Jose, CA
6 Sage Bionetworks, Seattle, WA
7 University of Washington, Department of Neurology, Seattle, WA
8 University of Washington, Department of Radiology, Seattle, WA
9 Trialomics, Seattle, WA
*current address
† These authors contributed equally to this work

Abstract

Biological functions are often realized by groups of interacting molecules or cells. Membership in these groups may overlap when molecules or cells are reused in multiple functions. Traditional clustering methods assign components to no more than one group, and cannot identify multi-community nodes. Technical noise is common in high-throughput biological datasets and further blurs distinctions between clusters. Together, overlapping nodes and high levels of noise reduce our ability to accurately define clusters in biological datasets and to interpret their biological functions. To address these limitations, we designed an algorithm called SpeakEasy, which detects overlapping or non-overlapping communities in commonly studied biological networks. Input to SpeakEasy can be physical networks, such as molecular interactions, or inferred networks, such as gene coexpression networks. The networks can be directed or undirected, and may contain negative links.
SpeakEasy combines traditional bottom-up and top-down approaches to clustering, by creating competition between clusters. Nodes that oscillate between multiple clusters in this competition are classified as multi-community nodes. SpeakEasy can quickly process networks with tens of thousands of nodes, quantify the stability of each cluster and select an optimal number of clusters automatically without requiring manual “parameter tuning” for class-leading results. Clustering networks derived from gene microarrays, protein affinity, sorted cell populations, electrophysiology, functional magnetic resonance imaging of resting-state brain activity, and synthetic datasets validate the ability of SpeakEasy to facilitate biological insights. For instance, we can identify overlapping co-regulated gene sets, multi-complex proteins and robust changes to the community structure of co-active brain regions in Parkinson disease. These insights rely on a robust overlapping
clustering approach that enables a more realistic interpretation of common high-throughput datasets.

Author’s summary

A classic debate in biology, dating back to Charles Darwin, is between “lumpers” and “splitters”, that is, between focusing on average characteristics of a large group versus splitting it into specialized subgroups. The reality of biological networks may lie somewhere in between these two classic modeling approaches. It can be useful to divide a network down to a certain level, but past that point, biological components participate in multiple functions and further subdivision becomes inaccurate. “Multi-community nodes” is a generic term for the biological components, such as genes, cells, or tissues, that are members of multiple communities. These multi-community nodes are difficult to detect, because clustering algorithms typically divide objects into non-overlapping groups. We designed a new way to efficiently detect multi-community nodes and to quantify our confidence in these results. This algorithm, called SpeakEasy, uses both “top-down” and “bottom-up” approaches to clustering, which is analogous to combining the classic perspectives of “lumpers” and “splitters”. To demonstrate that the method is accurate and useful, we cluster synthetic networks in which the true clusters are known, and show top performance on these tests. Encouraged by these results, we also cluster biological datasets wherein the true communities and multi-community nodes are uncertain. Using SpeakEasy, we can more accurately define the structure of several types of networks in ways that facilitate biological insights.

Introduction

Molecules, cells and tissues carry out biological processes that support homeostasis through interaction networks [1,2,3] and can fall into disease states when those networks are disrupted [4,5,6,7].
Because the structure of networks is related to the functions they carry out [8,9], it is sometimes possible to use network features to map out specific cellular functions or to outline general principles of biological regulation [3,10,11]. Network “communities” or clusters are one such feature that can frequently be mapped onto specific biological functions [10,12,13,14]. For instance, an abundance of transcriptional, epigenetic or protein flaws within a single pathway tends to be associated with disease states [12,14,15,16]. At the cellular level, certain microcircuits are organized into dense clusters of neurons [17,18], and selective changes to this topology have been seen in disease models [19]. At the level of tissue interactions, breakdown in the default mode network [20] and changes in certain clusters of regions may be specific to major categories of brain diseases [21]. However, these results do not completely support a modern phrenology of one-to-one cluster-function relationships. Biological pathways and clusters tend to be highly overlapping [22], with modular functions coupled together by multi-community nodes [23]. Understanding the properties of nodes with multiple functional effects can provide
more accurate predictions of molecular changes that may lead to disease [7]. However, traditional clustering methods do not generate overlapping clusters [24], which puts intrinsic limitations on our ability to map out the clustered structure of biological networks. Technical noise also blurs the boundaries between clusters. Thus, robustly identifying overlapping clusters should provide a more accurate map of biological functions and facilitate insights from high-throughput datasets.

Using labels to find clusters

Many clustering algorithms have used diffusive processes, essentially allowing clusters to spread through network links and to self-organize into stable communities. A specific class of diffusion-based clustering methods is termed label propagation algorithms (LPAs) [25,26]. Label propagation is based on the concept that labels of known communities should also apply reasonably well to nearby nodes (guilt by association). After several iterations of competition among labels, the clusters are defined based on the label that each node has chosen. Later variations of label propagation known as “speaker-listener” algorithms do not require a pre-assigned number of clusters and generally outperform other overlapping clustering methods on synthetic benchmarks [27]. In a label propagation framework, multi-community nodes appear to oscillate among multiple labels with equally good fits [27]. Due to the viral mechanism of label propagation, there is potential for a single label to take over the majority of the network. This would occur if there are only small energy barriers between communities [28]. This limits the utility of label propagation for clustering, because labels may drift into an equilibrium state that is not practically useful (all nodes contained in a single “cluster”). This situation often arises in clustering: an efficient bottom-up process (label propagation) does not always generate a useful global solution.
However, if “top-down” information about topology could be integrated into “bottom-up” clustering processes, it might be possible to create a balanced label propagation that produces useful clusters for a wide range of biological datasets.

SpeakEasy: New label propagation algorithm to detect overlapping clusters

We propose a label propagation clustering algorithm, “SpeakEasy”, to detect both discrete and overlapping clusters in biological networks. SpeakEasy is related to earlier label propagation algorithms [25,26] and to speaker-listener label propagation [27] because nodes choose their community based on exchange of “labels” between connected nodes. These “labels” do not refer to a priori community titles. They are simply unique bits of information whose location is tracked and used in determining cluster membership. SpeakEasy differs from previous algorithms in that if a node receives inputs carrying a given label with a frequency greater than expected at random, it also begins to broadcast that label. This process combines bottom-up and top-down approaches. Labels are updated based on labels of neighboring nodes (a “bottom-up” approach), conditioned on the overall frequency of these labels (a “top-down” approach). The overall effect of this viral process implemented in SpeakEasy is that nodes tend to join communities to which they are specifically connected, rather than communities to which they are most commonly connected.
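The label-update idea described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation: each node holds a single label (no label buffers), the network is unweighted, and the count "expected at random" for a label is assumed to be the node's degree times that label's global frequency. The function name `choose_label` and the toy data are hypothetical.

```python
from collections import Counter

def choose_label(node, neighbors, labels):
    """Pick the label most over-represented among a node's neighbors.

    Assumption (not the authors' exact rule): the expected count of a label
    is degree(node) * global frequency of that label in the network.
    """
    total = len(labels)
    global_freq = {lab: n / total for lab, n in Counter(labels.values()).items()}
    nbrs = neighbors[node]
    observed = Counter(labels[n] for n in nbrs)          # bottom-up information
    scores = {lab: observed[lab] - len(nbrs) * global_freq[lab]  # top-down correction
              for lab in observed}
    # Sort first so that ties are broken deterministically.
    return max(sorted(scores), key=scores.get)

# Toy network: a large "Y" community, a small "P" community, a small "C" one.
labels = {f"y{i}": "Y" for i in range(12)}
labels.update({f"p{i}": "P" for i in range(5)})
labels.update({f"c{i}": "C" for i in range(4)})
labels["gray"] = "G"  # the node being updated starts with its own label
neighbors = {"gray": ["y0", "y1", "y2", "p0", "p1", "p2", "c0"]}

new_label = choose_label("gray", neighbors, labels)
```

In this toy case the node has three links to each of "Y" and "P", but because "P" is much rarer in the network, three links to it exceed the random expectation by a wider margin, so the node adopts "P". A plain majority vote could not separate the two.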
Conceptual overview of SpeakEasy

As a visual example of how clusters are selected in SpeakEasy, consider a snapshot of a network that is in the process of being clustered (Figure 1). This network could consist of any type of biological component, such as genes, proteins or tissues, while the links between them could be inferred from data or literature sources. In this visual example, baseball team logos are used to represent arbitrary labels that are used to define clusters of nodes. Some labels occur with a high frequency in this network (such as the “Yankees”) and comprise large evolving clusters. Establishing membership in a popular community requires a large number of connections, because some level of connectivity to large groups is expected at random. When the node shown in gray updates its label, it must choose among the three label-communities to which it is connected. Taking into account the sizes of the connected communities, it has the strongest specific connection to the ‘Pittsburgh Pirates’ cluster (nodes labeled with ‘P’). Therefore, the node shown in gray will adopt the Pirates’ label, even though it has an equal number of links to the Yankees. Some nodes can fit equally well with multiple communities. An example of such a node is located between the ‘Chicago Cubs’ and ‘Red Sox’ communities. As the algorithm iterates, nodes’ labels stabilize, except for these multi-community nodes that tend to oscillate between labels. Labels are ultimately used to define clusters; for instance, all the nodes with the ‘Red Sox’ label will comprise a cluster (see Methods for details).

Results

Summary

To demonstrate that SpeakEasy accurately detects clusters in biological networks, we first validate its performance on synthetic networks, wherein the true clusters are known. Additionally, in real-world networks wherein the true clusters are unknown, we show that SpeakEasy produces statistically well-separated clusters.
Based on these tests, we can be confident in the quality of clusters detected in several common types of biological networks at multiple physical scales (Table 1). These applications were selected for several reasons: analysis of these datasets often utilizes clustering, these datasets have high levels of noise, these datasets are found across a wide range of physical scales and the true community structure is generally unknown or debated. While the cluster structure of these biological networks is often unknown, when possible, we validate our results through orthogonal literature-based metrics. These results show the benefits of robust overlapping clustering to interpretation of many data types.

Synthetic clustering benchmarks

To generate networks with known community structure, we use the Lancichinetti-Fortunato-Radicchi (LFR) benchmarks, which are widely employed to test clustering performance of overlapping and non-overlapping clustering methods [29]. These benchmarks track recovery of correct clusters, as clusters become increasingly cross-linked and difficult to detect. To track different aspects of cluster recovery, we report several metrics with varying inputs and sensitivities for SpeakEasy cluster recovery
(Figure 2A) under increasing levels of cross-linking (parameterized by μ). The effect of cross-linking on network structure can be seen by the decreasing modularity (‘Q’), or cluster separation, as μ increases (Figure 2B). We also vary the distribution of cluster sizes and intra-cluster degree distribution ( and parameters, respectively) in a range used by other studies, but the fundamental results are robust (Figure S1). Based on the normalized mutual information (NMI) metric that is most commonly used to track cluster recovery, SpeakEasy shows class-leading cluster recovery [27,30,31,32,33], even for highly cross-linked clusters (μ=0.95) (Figure 2A, S1). Additional cluster recovery statistics (Figure 2A) have varying inputs and sensitivity, but also support this strong performance. Thus, according to the most popular clustering benchmark, SpeakEasy can accurately identify disjoint clusters, even when these clusters are obscured by cross-linking, which simulates the effect of noise in typical datasets. When networks contain multi-community nodes, whose links are exactly divided between multiple communities, community detection is even more challenging. To test the ability of SpeakEasy to accurately detect overlapping clusters, 10% of nodes are set to have an equal number of links with two or more communities (number of communities parameterized by Om). This form of overlap is distinct from the random between-community connections (parameterized by μ), because this 10% subset of nodes is equally well connected to multiple communities (Figure 3A). High μ-values can generate nodes that have multiple communities, but there is still a tendency for their connections to remain concentrated in a single community. Equidistant multi-community nodes also make correct cluster detection more challenging, as can be observed by comparing Om=1 (no multi-community nodes) versus Om>1 (Figure 3A).
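For readers unfamiliar with NMI, the following minimal Python sketch computes it for two disjoint partitions of the same nodes. It is illustrative only: normalizing by the mean of the two entropies is one common convention and is an assumption here, and comparisons involving overlapping clusters rely on an extended NMI variant that this sketch does not implement. The function name `nmi` is hypothetical.

```python
from collections import Counter
from math import log

def nmi(part_a, part_b):
    """NMI between two disjoint partitions given as equal-length label lists.

    Assumption: normalization by the arithmetic mean of the two entropies.
    """
    n = len(part_a)
    count_a = Counter(part_a)
    count_b = Counter(part_b)
    joint = Counter(zip(part_a, part_b))
    # Mutual information between the two cluster assignments.
    mutual = sum((nab / n) * log(nab * n / (count_a[a] * count_b[b]))
                 for (a, b), nab in joint.items())
    # Shannon entropy of each partition.
    h_a = -sum((c / n) * log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every node in one cluster
    return 2 * mutual / (h_a + h_b)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])       # identical up to relabeling
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # statistically independent
```

NMI depends only on which nodes are grouped together, not on the label names, so a perfect recovery scores 1 even if the cluster identifiers differ, and an uninformative partition scores 0.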
This decrease in cluster recovery with increasing levels of multi-community connectivity is universal across overlapping clustering algorithms [34]. It is due to the challenge of identifying all of the communities between which multi-community nodes are delicately balanced. Increasing the level of cluster cross-linking (μ) also decreases overall performance, especially for higher Om (Figure 3A). These performance trends for increasing Om are similar for networks with different average connectivity (‘D’), but it is possible to recover networks at higher μ from more connected networks (Figure 3), in part due to increased precision in identifying multi-community nodes (Figure S2A). Overall performance (NMI) increases by ~77% when the average connectivity (D) increases from 10 to 20, primarily driven by increases in performance for high μ networks. Increasing network size often decreases cluster recovery [34]. In the case of SpeakEasy, lower performance on larger networks might be expected because labels must diffuse over large distances in large networks. However, comparing cluster retrieval on networks of 1000 nodes versus 5000 nodes shows that performance actually increases by ~13% (D=10) and ~15% (D=20) (Figure 3). This increase in performance on larger networks is unusual [34] and encouraging because real biological networks often include tens of thousands of nodes. The ability to identify clusters
regardless of network size is likely due to the integration of top-down and bottom-up information in the label assignment process [28]. These overlapping cluster detection results could primarily be driven by SpeakEasy’s excellent disjoint cluster performance. Therefore, we specifically track recovery of multi-community nodes using a statistic we call F(multi)-score, which is computed identically to the standard F-score, but the inputs (specificity and recall, Figure S2) specifically track if multi-community nodes are correctly assigned to all of their true communities (Figure 3B). These results indicate that SpeakEasy fulfills the goal of detecting multi-community nodes and does not purely rely on its strong disjoint cluster detection abilities. Like its predecessor, GANXiS [27], SpeakEasy shows a rare upward trend in F(multi)-score as Om increases. This result, specifically on multi-community nodes (Figure 3B), should be considered with the overall lower cluster recovery at higher Om values (Figure 3A). Together, these results indicate that while multi-community nodes increase the difficulty of cluster detection, it is still possible to cluster such networks accurately and to detect the sets of communities associated with multi-community nodes.

Abstract clustering performance on real-world networks

The LFR benchmarks accurately represent certain aspects of social and biological networks, but there are some aspects that they do not model realistically. For example, networks in the LFR benchmarks have low transitivity and null assortativity (propensity for hubs to connect to hubs) [35]. Non-zero assortativity and other network properties found in biological networks may affect the quality of clustering results; therefore it is important to test clustering performance on real networks. Unlike the LFR benchmarks, the correct cluster membership is often unknown for real networks.
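When no ground truth exists, cluster separation is commonly scored with modularity (Q). The minimal Python sketch below computes the standard Newman modularity for an undirected, unweighted network with disjoint clusters. It does not cover weighted or negative links, overlapping clusters, or the modularity density variant (Qds). The function name `modularity` and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Newman modularity Q for an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> cluster id (disjoint clusters).
    """
    m = len(edges)
    degree = {}
    internal = {}  # number of edges with both ends in the same cluster
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1
    degree_sum = {}  # total degree of the nodes in each cluster
    for node, d in degree.items():
        c = community[node]
        degree_sum[c] = degree_sum.get(c, 0) + d
    # Q = sum over clusters of (internal edge fraction - expected fraction).
    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single bridging edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
lumped = {n: "all" for n in range(6)}

q_split = modularity(edges, split)    # 5/14, about 0.357
q_lumped = modularity(edges, lumped)  # 0.0
```

Placing every node in a single cluster always yields Q = 0, which is why a label that takes over the whole network is uninformative by this measure.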
Clustering performance on these networks can still be compared between methods by tracking separation between inferred clusters. To obtain quantitative cluster performance measures on biological networks with no ground truth clustering solution, we record the modularity (Q) [36] and modularity density scores (Qds) [37] of clusters predicted by SpeakEasy (Table 2). Modularity can assess how well a clustering method can segment a network into (relatively) isolated clusters, with higher scores being desirable, as they should represent more cleanly divided clusters. We compare performance of SpeakEasy to the clustering method GANXiS, because that method showed the best overlapping clustering performance in a recent comparison of clustering methods [34], using the LFR benchmarks. The ten networks chosen for these comparisons comprise popular modularity tests for clustering methods. Using the class-leading GANXiS method as a benchmark, SpeakEasy shows improved performance on 6/10 networks using the original Q metric, with a median percent difference in performance of ~4% (Table 2). Using the newer and more robust Qds metric, SpeakEasy performs better than GANXiS on 9/10 of the networks with a median percent difference of ~22%. Lower performance on the “Netscience” network is likely due to the large number of nodes which do not belong to any community, as predicted
by SpeakEasy. These results provide a benchmark for SpeakEasy, in terms of its statistical ability to detect clusters in a range of commonly tested networks, some of which originate from non-biological sources. Methods that attempt to directly optimize Q, which may or may not lead to biologically optimal clusters, can generate higher Q-values on these networks [38]. However, these Q-based tests cannot assess the biological interpretability of clusters or additional SpeakEasy features like cluster stability, incorporation of weighted and negative links, overlapping cluster detection or computational scaling. These more diverse aspects of real networks are tested by comparing clusters detected in real biological datasets to gold-standards or to literature-based predictions of true clusters.

Application to finding human coexpressed genes from microarrays

Expression of mRNA transcripts that fluctuate in sync across multiple samples may indicate genes with common biological regulation [39]. The exact mechanisms behind these correlations are unspecified, but may include cell-type variation, activation by transcription factors, epigenetic regulation or even un-normalized batch effects [40]. Detecting sets of coexpressed (correlated) genes is useful because these represent endogenous regulatory programs that may be related to phenotypes of interest such as aging or disease status, recorded for the same samples. Accurately identifying groups of genes is the first step in many investigations that relate coexpressed gene sets to phenotypes. However, this task is difficult because of overlapping regulation and technical noise in microarray datasets. Current methods only produce non-overlapping gene sets, which does not fit the known overlapping regulatory control structure of gene expression [40].
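As an illustration of how a coexpression network can be inferred from expression data, the minimal Python sketch below builds a signed, weighted gene-gene network from Pearson correlations across samples. It is an assumption-laden example rather than the preprocessing used in this study: the function names, the toy expression values, and the choice of Pearson correlation without any thresholding are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_network(expression):
    """Build a signed, weighted network from an expression table.

    expression: dict gene -> list of expression values, one per sample.
    Returns a dict (gene_i, gene_j) -> correlation, used as the edge weight.
    """
    genes = sorted(expression)
    return {(g, h): pearson(expression[g], expression[h])
            for i, g in enumerate(genes) for h in genes[i + 1:]}

# Hypothetical expression values for three genes across four samples.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],   # rises with g1
    "g3": [4.0, 3.0, 2.0, 1.0],   # falls as g1 rises
}
network = coexpression_network(expression)
```

Keeping the sign of each correlation preserves negative links, which the abstract notes are accepted as input.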
Coexpressed gene sets tend to be involved in certain biological functions; therefore, these gene sets tend to have high functional enrichment scores based on ontology databases such as Gene Ontology (GO) and Biocarta [39]. Because true communities in gene-gene correlation matrices are unknown, ontologies can provide an external validation of the clusters detected by SpeakEasy. We use SpeakEasy to cluster two representative gene expression datasets: the Human Brain Atlas and the 1000 Genomes Project. The Human Brain Atlas [41] is comprised of 3584 microarrays from different brain regions and subregions of 6 individuals; thus gene coexpression relationships in this dataset are likely generated by fluctuating proportions of cell-types, which vary across the tissue samples. The second expression dataset, from the 1000 Genomes Project, is derived from lymphoblastoid cell lines from 726 individuals [42]. In this dataset, ongoing intracellular regulation is likely a major cause of gene coexpression [40]. First we use SpeakEasy to generate non-overlapping clusters and subclusters of coexpressed genes, which are enriched for various molecular functions, according to Gene Ontology. Unlike hierarchical clustering, SpeakEasy cannot be forced to output subclusters if they do not exist, because it is always admissible for the algorithm to place all nodes into a single cluster. We find 7 primary clusters in the Human Brain Atlas with 47 subclusters with 30 or more members (covering >95% of all genes), and 6 main clusters in the 1000 Genomes cohort with 42 subclusters with 30 or more
members (covering 99% of genes). These subclusters are practically useful in coexpression analysis because they range in size from dozens to a few hundred nodes: they are large enough for ontology analysis and validation, but small enough to be addressed experimentally. Second, we also generate overlapping clusters from both of these gene expression datasets. Enabling overlapping clusters increases the odds that an important multi-community gene will not be arbitrarily excluded from all but one community. This inclusive approach is useful in early drug target prioritization, because investigations often focus on specific clusters [14]. We detect overlapping clusters that appear to be composed of transcriptionally coactivated genes, based on Gene Ontology enrichment scores (Tables 3, 4). When multi-community output is enabled, allowing nodes to join as many as 4 communities, we see an 8% increase in the average size of clusters in the Human Brain Atlas and a 20% increase in the average size of clusters in the 1000 Genomes Project. Experiments are sometimes conducted on the basis of coexpression results and it is helpful to quantify the robustness of each cluster. Because SpeakEasy is a stochastic method, we can quantify the stability of detected clusters in the Human Brain Atlas and 1000 Genomes coexpression networks. (The stability of each cluster from Tables 3 and 4 is recorded in Tables 5 and 6 respectively, see Methods.) This cluster stability quantification is useful in the context of planning additional experiments because it indicates the likelihood that a given cluster can be regenerated by SpeakEasy with random initial conditions (see Methods). This is particularly useful in coexpression analysis, because coexpressed gene sets that do not have strong enrichment in a known biological function are typically discarded, on suspicion they are clustering artifacts.
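One simple way to picture stability across stochastic runs is a co-occurrence matrix, which records how often each pair of nodes lands in the same cluster. The Python sketch below is a hedged illustration under that assumption; the exact stability statistic reported in Tables 5 and 6 is defined in the Methods and is not reproduced here. The function names and the toy runs are hypothetical.

```python
from itertools import combinations

def cooccurrence(partitions):
    """Fraction of runs in which each pair of nodes shares a cluster.

    partitions: list of dicts node -> cluster id, one dict per clustering run.
    """
    nodes = sorted(partitions[0])
    runs = len(partitions)
    return {(u, v): sum(p[u] == p[v] for p in partitions) / runs
            for u, v in combinations(nodes, 2)}

def cluster_stability(cluster_nodes, cooc):
    """Mean pairwise co-occurrence among members of one cluster (needs >= 2 nodes)."""
    pairs = list(combinations(sorted(cluster_nodes), 2))
    return sum(cooc[p] for p in pairs) / len(pairs)

# Four hypothetical runs with different random initial conditions.
runs = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 1, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 1},
    {"a": 2, "b": 2, "c": 0},
]
cooc = cooccurrence(runs)
stable = cluster_stability({"a", "b"}, cooc)         # always together
unstable = cluster_stability({"a", "b", "c"}, cooc)  # c joins only once
```

Cluster identifiers are arbitrary from run to run, so the comparison uses only whether two nodes share an identifier within the same run.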
High levels of cluster stability can provide increased support for investigation into novel molecular systems that are not well-annotated in biological databases. Conversely, clusters that are not stable may arise from outlying data points and should be approached with caution when planning experiments.

Application to protein-protein interaction datasets

Multi-community nodes are relevant to protein complexes because they enable a more nuanced interpretation of conformational change due to sequence mutation. Traditionally, mutations in specific proteins were treated as though they affect the complete set of interactions [43]. However, some mutations may have a limited effect on protein conformation. For proteins with several binding partners, some interactions may be specifically affected by these mutations, while other binding interfaces are less affected (termed “edgetic” interactions, because the changes are edge-specific) [7]. We compare two popular high-throughput protein interaction networks derived from affinity purification and mass spectrometry (AP-MS) techniques to facilitate comparison to other clustering methods [44,45]. We also used the weighted versions of these datasets, in which the weight is related to the probability that a given interaction pair truly exists. There are multiple versions of the gold standard set of protein complexes, inferred by different methods at different times, generally through small-scale
experiments. To robustly evaluate the capability of SpeakEasy to detect protein complexes from high-throughput data, we test results against three gold-standards for protein complexes, including the classic Munich Information Center for Protein Sequences (MIPS) [46] and the more recent Saccharomyces Genome Database (SGD) [47] (Figure 4). The complete MIPS dataset as well as partial information from SGD are incorporated into a third protein complex list known as CYC2008 [48]. To validate the overlapping protein complexes from SpeakEasy, we compare clusters in the protein interaction networks to multiple gold-standard protein complex lists. A visual example of the input data for comparisons between inferred clusters and gold-standards is shown in Figure 4. NMI between predicted and true complexes for both Gavin et al. and Collins et al. datasets indicates that SpeakEasy produces the most accurate recovery of protein complexes to date [33,49] (Table 7). We also use specificity, precision and the F score (also known as F1) to compare performance of our algorithm to other clustering methods using multiple interaction lists (Gavin and Collins) and gold-standards for protein complexes. Table 7 demonstrates class-leading protein complex recovery by SpeakEasy across all datasets and definitions of the gold-standard protein complexes using disjoint clusters [32]. This indicates that the synthetic performance advantages of SpeakEasy (Figures 2, 3) translate to protein interaction networks and that the technique may be useful for detecting complexes in future high-throughput datasets.
In addition to the strong disjoint cluster performance (Table 7), we are interested in detection of multi-community nodes. SpeakEasy identifies a smaller number of multi-community nodes than are listed in various gold-standards, although the multi-community nodes it does detect are often in agreement with the gold-standards (see F(multi)-scores in Table 7). High levels of noise in AP-MS-derived interactions may place hard limits on the prediction of multi-community nodes: their topology may not be accurate enough (in some cases) to provide any evidence in support of truly multi-community nodes. The lack of evidence (links) necessary for correct detection of true multi-community nodes is visually obvious in many cases (example shown in Figure 4, inset). Therefore, there may be some limitations on using protein complexes as a validation set for multi-community node detection, and performance on synthetic datasets, such as the LFR benchmarks, is important to consider as well. The almost perfect specificity of multi-community node detection with SpeakEasy (Table 7) suggests that high-throughput networks are primarily missing true links, rather than recording incorrect links.

Application to cell-type clustering

Biological data with multiple hierarchical tiers of information poses a challenge to clustering methods, because both the high-level and low-level classifications could be considered as desirable clustering results. Hierarchical clustering techniques necessarily specify subclusters down to the level of pairs of nodes. Therefore, while hierarchical techniques generate multiple levels of clusters, it is unclear where to “cut” the hierarchical “tree” of clusters in order to extract optimum clusters and/or subclusters.
Subclustering with SpeakEasy is different from hierarchical clustering because subclusters are not generated in all cases, but only when they are strongly supported by the structure of the network. In cases where the datasets are suspected to contain multiple layers of classifications, SpeakEasy clustering can be applied iteratively to create clusters within main clusters. Using a collection of well-characterized immune cell families and sorted cell populations from the Immunologic Genome Project (Immgen) [50,51], we show that nested biological classifications are mirrored by nested clusters in SpeakEasy. The immune system is composed of many populations of cells that can be distinguished by specific combinations of cell surface markers as well as broader functional families, such as dendritic cells, macrophages and natural killer cells. Ignoring the cell-type labels and using only a cell-cell similarity matrix of 212 populations, SpeakEasy identifies clusters of immune cell-types, which correspond to the primary classification of the sorted cells (Figure 5, Table 8). More detailed classifications are also available for these same sorted populations, based on cell-surface markers and tissue of origin [51]. This more detailed classification showed higher correspondence to subclusters identified by a second application of SpeakEasy, as verified by multiple measures of partition similarity. For instance, the 2nd tier of clusters has 85% higher NMI with the 2nd tier of biological classifications than does the 1st tier of clusters (Table 8). Thus, SpeakEasy clustering results reflect the known two-tier hierarchy of immune cell-types in the Immgen database (Figure 5). While SpeakEasy is not a hierarchical classification method, its clusters and subclusters have well-defined biological roles. These results indicate that SpeakEasy may be useful in identifying groups of cells with similar expression characteristics and surface markers.
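The iterative use of clustering described above, in which a second pass runs inside each primary cluster, can be sketched generically in Python. This is a minimal sketch under stated assumptions: `subcluster` accepts any clustering function, and the `connected_components` stand-in used for the demonstration is explicitly not SpeakEasy. All names and the toy network are hypothetical.

```python
def subcluster(adjacency, clusters, cluster_fn):
    """Re-run a clustering function inside each first-tier cluster.

    adjacency: dict node -> set of neighbor nodes.
    clusters: list of sets of nodes (first-tier clusters).
    cluster_fn: function taking an adjacency dict, returning a list of sets.
    """
    nested = []
    for members in clusters:
        # Keep only links between members of this cluster (induced subgraph).
        induced = {n: adjacency[n] & members for n in members}
        nested.append(cluster_fn(induced))
    return nested

def connected_components(adjacency):
    """Stand-in clustering function for this demo only (NOT SpeakEasy)."""
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adjacency[node] - comp)
        seen |= comp
        components.append(comp)
    return components

adjacency = {1: {2}, 2: {1, 5}, 3: {4}, 4: {3}, 5: {2, 6}, 6: {5}}
first_tier = [{1, 2, 3, 4}, {5, 6}]
second_tier = subcluster(adjacency, first_tier, connected_components)
```

Here the first cluster splits into two subclusters while the second is returned whole, mirroring the point that subclusters appear only where the network structure supports them.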
For example, this capability would likely be useful for cell typing in single-cell RNA-seq experiments in the brain, because expression-based categories of neurons are quite limited (true communities are unknown) and because some cells may be at intermediate differentiation states (multi-community nodes).

Application to neuronal spike sorting

Extracellular neuronal recording with single electrodes, tetrodes, or high density multichannel electrode arrays can be used to detect the activity of multiple nearby neurons. To understand the response properties of individual neurons, it is essential to link specific spike waveforms with specific neurons. This blind source separation process is known as “spike sorting” because each spike is assigned to a particular theorized neuron. Single neurons often generate relatively unique signatures (i.e., spike waveform shapes and amplitude distributions on multiple adjacent electrodes), which aids in the spike sorting process. Clustering is useful in assigning these waveforms to particular neurons, whereby sets of similar waveforms imply the existence of particular neurons. To identify waveforms that are likely to originate from the same unique neuron (stereotypical waveforms), putative action potential (spike) waveforms are first identified
in the neuronal recordings. These waveforms can be combined in a spike similarity matrix (input to SpeakEasy) for clustering. The identified stereotypical waveforms (derived from output clusters) can then be used for real-time or post-hoc template matching that assigns waveforms to specific neurons. Using actual depth-electrode recordings in order to match levels of noise in real brain recordings, we generate a simulated time-series of spikes in which the true spike times and origins are known (see Supplemental Methods). Comparing the inferred clusters that result from the activity of a single neuron to the true associations between spikes and neurons indicates that SpeakEasy can reliably sort spikes (Table 9) from multi-electrode recordings. NMI levels from the synthetic clusters (average = 0.69) also provide a comparable reference point for spike sorting of actual recordings where there is no ground truth to estimate accuracy.
Application to resting-state fMRI data

Functional neuroimaging, obtained while a subject is at rest (rs-fMRI), has been an invaluable tool in our understanding of systems-level changes in a variety of domains including neurodegenerative disease [52]. Correlations between the rs-fMRI signal in different regions of interest (ROIs) are related to the functional connectivity between these ROIs. However, different brain networks overlap, either because ROIs perform functions for multiple networks or because the low temporal resolution of the blood oxygen level-dependent signal causes temporal smearing of brain networks. The ability to robustly identify functional networks (communities) and changes to this community structure that occur with disease is critical to understanding the physiological changes that may be early indicators of disrupted cognitive function. Figure 6A shows the relatively small inter-regional correlations characteristic of rs-fMRI functional connectivity graphs in Parkinson disease (PD) and controls. The robustness of clusters can be tracked through co-occurrence matrices, which quantify the number of times nodes appear in the same cluster. For instance, the community of predominantly frontal/cingulate components is very stable, while the cluster consisting of mainly temporal components is less stable (Figure 6B). The resting-state community organization of the brain changes significantly in PD. Using clusters from control participants as a frame of reference, we observe both significant changes in community size and inter-community connectivity (see Methods). A cluster comprised of temporal ROIs significantly decreased in co-occurrence in PD (p