A Mutually Supervised Ensemble Approach for Clustering Heterogeneous Datasets

M. Hossain1, S. Bridges2, Y. Wang2, and J. Hodges2
1 Department of Computer Science, Fairmont State University, WV, USA
2 Department of Computer Science and Engineering, Mississippi State University, MS, USA

Abstract— We present an algorithm to address the problem of clustering two contextually related heterogeneous datasets that use different feature sets, but consist of non-disjoint sets of objects. The method is based on clustering the datasets individually and then combining the resulting clusters. The algorithm iteratively refines the two sets of clusters using a mutually supervised approach to maximize their mutual entropy and finally computes a single set of clusters. We applied our algorithm on a document collection using multiple feature sets that were extracted by natural language preprocessing methods. Empirical results demonstrate that our method outperforms clustering based on individual feature sets, clustering based on unified feature sets, and clustering based on a well-studied ensemble method.

Keywords: Clustering, Ensemble, Mutual Entropy

1. Introduction

The traditional clustering paradigm pertains to a single dataset. In some problem domains, multiple contextually related heterogeneous datasets may be available that can provide complementary information. Clustering these datasets may reveal interesting relationships between objects that cannot be obtained by clustering a single dataset. The increasing availability of this type of heterogeneous data in different problem domains, e.g., information retrieval and molecular biology, has motivated research into developing algorithms for clustering contextually related heterogeneous datasets [1], [2], [3], [4].
Heterogeneous datasets can be clustered by constructing a unified feature space [1], [2] or by clustering the datasets individually and then combining the resulting cluster sets [3], [4]. In some domains, several feature sets may be available to represent the same set of objects, but it may not be easy to compute a useful and effective unified feature space because of structural and semantic heterogeneity between the feature sets. This motivated us to pursue an ensemble-based approach for clustering heterogeneous datasets. Ensemble methods [5], [6], [7], [8] have been developed that typically focus on combining different clustering results obtained from a single dataset. These algorithms do not work well on clustering results generated from different datasets. In [9], an iterative clustering method was proposed that uses two independent disjoint subsets of the original feature set, but
clustering is performed on the same set of objects. In [10], disjoint subsets of objects are clustered independently and the resulting sets of clusters are combined by matching the cluster centroids, but all the objects are represented by the same features. In [4], a correspondence approach was used to identify the correspondence between the different sets of clusters, but a final set of clusters is not produced.
In this paper, we address the problem of clustering datasets that are represented by heterogeneous feature sets and that may not necessarily contain the same set of objects. Even though clustering is traditionally perceived to be an unsupervised task, in semi-supervised clustering [11], the performance of an unsupervised clustering algorithm can be improved with supervision in the form of labeled data or constraints. In the context of clustering two heterogeneous datasets, once the individual datasets are clustered, each set of clusters can be used to supervise the refinement of the other set of clusters in order to provide a combined and improved clustering of the two datasets. In this paper, we present a mutually supervised clustering algorithm for clustering two heterogeneous datasets. It starts with a set of clusters generated from each dataset, then iteratively refines each set of clusters using the other, and finally generates a single set of clusters.
The remainder of the paper is organized as follows. In section 2, we present a formal description of our clustering problem. In section 3, we present our algorithm. In section 4, we describe our datasets and present experimental results. Finally, in section 5, we provide concluding remarks.

2. Mutually Supervised Ensemble Clustering

Before giving a formal description of the problem of clustering heterogeneous datasets, we first give an intuitive explanation of our clustering approach. We are interested in extracting a single set of partitional clusters from two related heterogeneous datasets. Our method will use a mutually supervised approach to iteratively recompute each set of clusters computed from the individual datasets. The purpose is to allow each individual clustering to refine the other clustering so that at the end of each iteration, the similarity between the two sets of clusters increases. We view this problem as maximizing the mutual entropy [12] between the individual clusterings. Mutual entropy can be used to represent the information content shared by two random variables. In our previous work [13], we presented an ensemble algorithm that computes a set of seed clusters and then iteratively recomputes each set of clusters around the seed clusters so that the mutual entropy between the two sets of clusters converges. Even though this algorithm was shown to provide an effective solution for clustering heterogeneous datasets, a more effective approach can be to use each set of clusters to iteratively refine the other.

2.1 Problem Statement

Let D1 and D2 be two datasets having feature sets F1 and F2 respectively. Heterogeneity between two datasets can occur at the feature level, at the object level, or both. The datasets may have the same features, some common features, or no common features. The same situation can occur with the objects. The ensemble algorithms of [5], [6], [7], [8] primarily deal with the situation where F1 = F2 and D1 = D2. In this paper, we deal with the situation where ∅ ⊆ (F1 ∩ F2) ⊆ (F1 ∪ F2) and (D1 ∩ D2) ⊃ ∅, i.e., the datasets have disjoint or non-disjoint feature sets and consist of non-disjoint object sets.

Definition 1: Let D1 = {o_1^(1), o_2^(1), ..., o_{n1}^(1)} and D2 = {o_1^(2), o_2^(2), ..., o_{n2}^(2)}, where n_x is the number of objects in D_x. D1 and D2 are said to be congruent if they represent the same set of objects and semi-congruent if some (but not all) of the objects they represent are contained in both.

Definition 2: Let D = D1 ∪ D2 = {o_1, o_2, ..., o_n}, where max(n1, n2) ≤ n ≤ (n1 + n2). We define the object representation vector of D_x as V_x = {v_1^(x), v_2^(x), ..., v_n^(x)}, where v_i^(x) is one if o_i ∈ D_x and zero if o_i ∉ D_x.

Definition 3: Let C_x = {C_1^(x), C_2^(x), ..., C_k^(x)} be the set of clusters computed from D_x. Given D = {o_1, o_2, ..., o_n}, the cluster assignment vector of C_j^(x) is defined as A_j^(x) = {a_j1^(x), a_j2^(x), ..., a_jn^(x)}, where a_ji^(x) is one if o_i ∈ C_j^(x) and zero if o_i ∉ C_j^(x). A cluster C_j^(x) is assumed to be represented by a cluster centroid µ_j^(x), computed using the feature space of D_x.

Given two datasets D1 and D2 where (D1 ∩ D2) ⊃ ∅, and two sets of clusters C1 and C2 computed from D1 and D2 respectively, the goal is to compute a set of k disjoint clusters C = {C_1, C_2, ..., C_k} so that ∀ o_i ∈ (D1 ∪ D2) ∃ C_j ∋ o_i.
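To make Definitions 2 and 3 concrete, the following is a minimal Python sketch (ours, not part of the paper; it reuses the toy objects and clusters of the example in section 3.1) of the object representation vector V_x and the cluster assignment vectors A_j^(x) over D = D1 ∪ D2:

    D1 = {"a", "b", "c", "d", "e", "f", "g", "h", "i"}
    D2 = {"d", "e", "f", "g", "h", "i", "j", "k", "l"}
    D = sorted(D1 | D2)  # D = D1 ∪ D2 = {o_1, ..., o_n}

    def representation_vector(Dx):
        # V_x: component i is 1 if o_i ∈ D_x, 0 otherwise (Definition 2)
        return [1 if o in Dx else 0 for o in D]

    def assignment_vector(cluster):
        # A_j^(x): component i is 1 if o_i ∈ C_j^(x), 0 otherwise (Definition 3)
        return [1 if o in cluster else 0 for o in D]

    V1 = representation_vector(D1)
    C1 = [{"a", "d"}, {"b", "c", "e", "f"}, {"g", "h", "i"}]  # clusters of D1 (Fig. 1(a))
    A1 = [assignment_vector(c) for c in C1]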

2.2 Mutual Entropy

The notion of mutual entropy is based on Shannon's information theory and quantifies the amount of information two random variables share. If two random variables are independent, their mutual entropy is zero, i.e., neither contains any information about the other. If they are identical, then all the information contained in one is shared by the other. The mutual entropy of two random variables X and Y can be defined using their individual and joint entropies [12]:

    \sum_{i,j} P(X = x_i, Y = y_j) \log \frac{P(X = x_i, Y = y_j)}{P(X = x_i)\, P(Y = y_j)}

In our case, each random variable is a given clustering. P(X = x_i) is estimated as the fraction of the objects that are in a cluster in one of the clusterings and P(X = x_i, Y = y_j) as the fraction of the objects that are common in two clusters from two different clusterings.

Definition 4: Let m_i^(x) be the number of objects in the i-th cluster of C_x, i.e., m_i^(x) = |C_i^(x)|. Let m_ij^(xy) be the number of objects shared by the i-th cluster of C_x and the j-th cluster of C_y, i.e., m_ij^(xy) = |C_i^(x) ∩ C_j^(y)|. The mutual entropy between two sets of clusters C_x and C_y is then defined as:

    \gamma_{xy} = \sum_{i,j} \frac{m_{ij}^{(xy)}}{n} \log \frac{m_{ij}^{(xy)}/n}{\left(m_i^{(x)}/n\right)\left(m_j^{(y)}/n\right)}
                = \frac{1}{n} \sum_{i,j} m_{ij}^{(xy)} \log \frac{n\, m_{ij}^{(xy)}}{m_i^{(x)} m_j^{(y)}}
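A minimal Python sketch of this computation (ours, not from the paper) follows; it takes the two sets of clusters as lists of object sets and n = |D1 ∪ D2|:

    import math

    def mutual_entropy(Cx, Cy, n):
        # gamma_xy = (1/n) * sum_ij m_ij * log(n * m_ij / (m_i * m_j)),
        # where m_ij = |C_i^(x) ∩ C_j^(y)|, m_i = |C_i^(x)|, m_j = |C_j^(y)|
        gamma = 0.0
        for Ci in Cx:
            for Cj in Cy:
                m_ij = len(Ci & Cj)
                if m_ij > 0:  # empty intersections contribute nothing
                    gamma += m_ij * math.log(n * m_ij / (len(Ci) * len(Cj)))
        return gamma / n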

The motivation for considering the mutual entropy stems from its ability to measure a general dependence between random variables. In our case, the larger the mutual entropy between two sets of clusters, the more similar they are. The goal of our algorithm is to maximize this mutual entropy based on the complementary information contained in the two datasets, in order to obtain a more meaningful clustering than can be obtained from either dataset individually.

2.3 Seed Clusters

Seed clusters play an important role in the final step of our clustering approach. Once the mutual entropy between the two sets of clusters reaches a maximum value, we use the seed clusters to compute the final set of clusters.

Definition 5: Given two datasets D1 and D2, a seed cluster S_j is defined as a set of objects that are in both D1 and D2, i.e., S_j ⊂ (D1 ∩ D2). An object is called a seed object if it belongs to a seed cluster and a non-seed object if it does not belong to any seed cluster.

3. Algorithm Overview

In this section, we present an overview of our clustering algorithm. For convenience, we will refer to it as CLUSTH. The algorithm is based on computing a set of clusters from each dataset and then using each to refine the other iteratively. When the mutual entropy between the two sets of clusters is maximized, a conglomeration step is carried out to generate a single set of clusters.

3.1 Iterative Refinement

Once the individual clustering has been performed on each dataset, the algorithm progresses iteratively through a series of refinement steps. In these steps, the algorithm will refine the first set of clusters using the second set of clusters and vice versa. At the beginning of each refinement step, a set of temporary clusters is created for each clustering. Each set of temporary clusters will be initially empty and will correspond to the clusters from the other clustering.

When the first set of clusters C1 is refined, a set of temporary clusters Ĉ1 = {Ĉ_1^(1), Ĉ_2^(1), ..., Ĉ_k^(1)} will be created, each of which will be initially empty. Note that each Ĉ_j^(1) will correspond to C_j^(2) and will have a precomputed centroid µ̂_j^(1) associated with it. Each µ̂_j^(1) will be computed using the objects that are in both datasets, i.e., the objects of C_j^(2) that are also in the first dataset D1. The centroid computation, however, will be based on the feature space of D1 only. After the centroids have been computed for Ĉ1, the objects that are exclusively in the second dataset will be added to the respective clusters, i.e., each Ĉ_j^(1) will consist of the objects of C_j^(2) that are only in D2. Then, each object that appears in D1 will be added to the closest temporary cluster. At the end of this step, each temporary cluster will consist of objects from both datasets, and the set of temporary clusters Ĉ1 will replace the existing clusters in the first set of clusters C1. The same process will be carried out for the second set of clusters, i.e., the clusters in the second set of clusters C2 will be recomputed using the first set of clusters C1. This is a mutually supervised process in the sense that one set of clusters guides the computation of the other set of clusters. At the end of each refinement step, the mutual entropy between the two sets of clusters will be computed. These iterations will be repeated until the mutual entropy between the two sets of clusters converges. We will use the average mutual entropy over the most recent iterations to determine convergence.
Let us explain the refinement step with the two sets of clusters shown in Figs. 1(a) and 1(b). There are two datasets, D1 = {a, b, c, d, e, f, g, h, i} and D2 = {d, e, f, g, h, i, j, k, l}. The first set of clusters C1 contains the clusters {a, d}, {b, c, e, f}, and {g, h, i}. The second set of clusters C2 contains the clusters {f, g, j}, {e, h, i, l}, and {d, k}. The process starts by creating Ĉ1, a set of three temporary clusters that are initially empty. The centroids will be precomputed for these clusters using {f, g}, {e, h, i}, and {d} respectively, since each of these sets of objects is contained in the corresponding cluster of C2 and belongs to both D1 and D2. Note that the centroid computation will be done based on the feature space of D1 only. Then, the temporary clusters will be populated to contain {j}, {l}, and {k} respectively as shown in Fig. 1(c), since these are the objects that exclusively belong to D2. Then, each object of D1 will be assigned to one of the temporary clusters of Ĉ1 based on the precomputed cluster centroids. Let us assume that this reassignment results in the clustering shown in Fig. 1(d). This is essentially the refinement of C1 using C2. The same process, as explained above, will be carried out to refine C2 using the pre-refinement state of C1. These refinement steps will be repeated until the mutual entropy converges, and then the conglomeration step will be carried out as explained next.
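The following Python sketch (ours; it assumes cosine similarity, a dict features_x mapping each object of D_x to its feature vector in D_x's feature space, and that every cluster of the other clustering shares at least one object with D_x) illustrates one such refinement pass:

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def refine(C_other, features_x, k):
        # Refine the clustering of dataset x using the other clustering C_other
        # (a list of k object sets), as described above.
        objects_x = set(features_x)                      # objects of D_x
        temp, centroids = [set() for _ in range(k)], []
        for j in range(k):
            shared = [o for o in C_other[j] if o in objects_x]        # objects in both datasets
            centroids.append(np.mean([features_x[o] for o in shared], axis=0))
            temp[j] |= {o for o in C_other[j] if o not in objects_x}  # objects exclusive to the other dataset
        for o in objects_x:                                           # reassign every object of D_x
            j = max(range(k), key=lambda q: cosine(features_x[o], centroids[q]))
            temp[j].add(o)
        return temp  # replaces the current clustering of D_x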

Fig. 1: Refinement of one set of clusters using another. (a) First set of clusters C1; (b) second set of clusters C2; (c) temporary clusters Ĉ1; (d) C1 after refinement.

3.2 Conglomeration Step

The conglomeration step is based on the computation of seed clusters. To generate seed clusters, we will identify k pairs of similar clusters across the two clusterings. One straightforward approach for computing the similarity between two clusters is to count the number of common objects shared between the clusters. However, this may cause a tie between two pairs of clusters and will often cause large diverse clusters to be selected over smaller purer clusters. Instead, we will compute the similarity between C_i^(1) and C_j^(2) using the information-theoretic measure 2 m_ij^(12) / (m_i^(1) + m_j^(2)). This measure will give us a continuous value between 0 and 1. If there is no common object between C_i^(1) and C_j^(2), it will evaluate to 0, and if the clusters are exactly the same, it will evaluate to 1. We need to compute the similarity between all pairs of clusters across the two clusterings.
We considered two methods for pairing similar clusters across the two clusterings. One approach is to find the “most similar” pairs of clusters. However, this is a combinatorial optimization problem in itself, and can be computationally intensive if there are many clusters. Instead, we adopted the second approach, in which a greedy algorithm is used for selecting similar cluster pairs. In this approach, for each of the k clusters from one clustering, we select the most similar cluster from the other clustering that has not yet been paired with a cluster from the first set of clusters. Once the k pairs of similar clusters are identified, the seed clusters will be computed from the intersection of each pair of clusters.
Let us explain the seed cluster computation for the two sets of clusters shown in Figs. 1(a) and 1(b). For C_1^(1), the first cluster in Fig. 1(a), the most similar one in Fig. 1(b) is C_3^(2). These two clusters will be used to generate the first seed cluster, i.e., S_1 = C_1^(1) ∩ C_3^(2) = {d}. For C_2^(1), the second cluster in Fig. 1(a), the most similar one in Fig. 1(b) is C_1^(2). These two clusters will be used to generate the second seed cluster, i.e., S_2 = C_2^(1) ∩ C_1^(2) = {f}. Finally, the only match for C_3^(1) is C_2^(2), and S_3 = C_3^(1) ∩ C_2^(2) = {h, i}. The seed clusters are shown in Fig. 2.

Fig. 2: Seed clusters computed from the two sets of clusters shown in Figs. 1(a) and 1(b): S_1 = {d}, S_2 = {f}, and S_3 = {h, i}.

Fig. 3 shows the algorithm for generating seed clusters. Once the seed clusters are computed, two centroids will be computed for each seed cluster, each based on one of the datasets. There will still be objects in both datasets that are not in any of the seed clusters. We will assign each of these non-seed objects to the closest seed cluster. If a non-seed object is contained in both datasets, then the seed cluster centroids computed from both datasets will be used to find the closest seed cluster. If a non-seed object is contained in one dataset but not the other, then the seed cluster centroids computed from the respective dataset will be used to find the closest seed cluster.

    Algorithm: FindSeedClusters(C1, C2)
    input : Two sets of clusters C1 and C2
    output: A set of seed clusters {S_1, S_2, ..., S_k}
    for j ← 1 to k do
        paired_j ← 0
    for i ← 1 to k do
        q ← argmax_j (1 − paired_j) · 2 m_ij^(12) / (m_i^(1) + m_j^(2))
        paired_q ← 1
        S_i ← C_i^(1) ∩ C_q^(2)
    return {S_1, S_2, ..., S_k}

Fig. 3: Algorithm for generating seed clusters.

    Algorithm: CLUSTH(D1, D2, C1, C2)
    input : Dataset D1 and the set of clusters C1
            Dataset D2 and the set of clusters C2
    output: A single set of clusters C = {C_1, C_2, ..., C_k}
    // Refinement Step
    repeat
        for x ← 1 to 2 do
            for j ← 1 to k do
                Ĉ_j^(x) ← { o_i : a_ji^(x%2+1) = v_i^(x) = v_i^(x%2+1) = 1 }
                compute µ̂_j^(x)
        for x ← 1 to 2 do
            for j ← 1 to k do
                Ĉ_j^(x) ← { o_i : a_ji^(x%2+1) = v_i^(x%2+1) = 1, v_i^(x) = 0 }
        for x ← 1 to 2 do
            foreach o_i ∈ D_x do
                j ← argmax_q sim(o_i, µ̂_q^(x))
                Ĉ_j^(x) ← Ĉ_j^(x) ∪ {o_i}
        for x ← 1 to 2 do
            for j ← 1 to k do
                C_j^(x) ← Ĉ_j^(x)
    until γ_xy converges
    // Conglomeration Step
    S ← FindSeedClusters(C1, C2)    // S = {S_1, S_2, ..., S_k}
    for j ← 1 to k do
        C_j ← C_j^(1) ← C_j^(2) ← S_j
        compute µ_j^(1) and µ_j^(2)
    foreach o_i ∉ (S_1 ∪ S_2 ∪ ... ∪ S_k) do
        j ← argmax_q max( v_i^(1) · sim(o_i, µ_q^(1)), v_i^(2) · sim(o_i, µ_q^(2)) )
        C_j ← C_j ∪ {o_i}

Fig. 4: The CLUSTH algorithm.
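A Python rendering of the greedy pairing of Fig. 3 (our sketch, with each set of clusters given as a list of object sets) is:

    def find_seed_clusters(C1, C2):
        # Pair each cluster of C1 with the not-yet-paired cluster of C2 that
        # maximizes 2*m_ij / (m_i + m_j), then intersect the pair (Fig. 3).
        k = len(C1)
        paired = [False] * k
        seeds = []
        for i in range(k):
            q = max((j for j in range(k) if not paired[j]),
                    key=lambda j: 2 * len(C1[i] & C2[j]) / (len(C1[i]) + len(C2[j])))
            paired[q] = True
            seeds.append(C1[i] & C2[q])
        return seeds

On the example of Figs. 1(a) and 1(b), this returns {d}, {f}, and {h, i}, matching Fig. 2.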

3.3 The CLUSTH Algorithm

Fig. 4 describes the CLUSTH algorithm. The inputs to the algorithm are two datasets D1 and D2 where (D1 ∩ D2) ⊃ ∅, and two sets of clusters C1 = {C_1^(1), C_2^(1), ..., C_k^(1)} and C2 = {C_1^(2), C_2^(2), ..., C_k^(2)} computed from D1 and D2 respectively. The output of the algorithm is a single set of partitional clusters, C = {C_1, C_2, ..., C_k}. CLUSTH uses two sets of temporary clusters Ĉ1 and Ĉ2. In describing the algorithm, we use the hat symbol (ˆ) to denote the quantities corresponding to the temporary clusters. The similarity between the i-th object o_i and the centroid of cluster C_q^(x) is represented by sim(o_i, µ_q^(x)).
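To make the conglomeration step of Fig. 4 concrete, the following sketch (ours; it reuses the cosine helper from the refinement sketch above, and assumes dicts features_1 and features_2 mapping objects to their vectors in each feature space and the per-dataset seed-cluster centroids centroids_1 and centroids_2) assigns each non-seed object to its closest seed cluster:

    def conglomerate(seeds, all_objects, features_1, features_2, centroids_1, centroids_2):
        # Start each final cluster from its seed cluster, then assign every
        # non-seed object to the closest seed cluster, using whichever feature
        # space(s) the object is represented in (the max(...) step in Fig. 4).
        clusters = [set(S) for S in seeds]
        seed_objects = set().union(*seeds)
        for o in all_objects:
            if o in seed_objects:
                continue
            def score(j):
                s1 = cosine(features_1[o], centroids_1[j]) if o in features_1 else -1.0
                s2 = cosine(features_2[o], centroids_2[j]) if o in features_2 else -1.0
                return max(s1, s2)
            j = max(range(len(seeds)), key=score)
            clusters[j].add(o)
        return clusters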

3.4 Complexity of CLUSTH

In the refinement step, CLUSTH needs to scan all the objects for each cluster. If n is the total number of objects from both datasets, k is the number of clusters, and t is the number of iterations required for convergence, then the overall complexity of this step is O(2nkt). The conglomeration step computes the seed clusters. There are k^2 pairs of clusters, and the computation of the similarity between each pair of clusters requires O((n/k)^2) operations. Hence, the computation of the seed clusters requires O(n^2) operations. Finally, assigning each non-seed object to a cluster requires O(nk) operations. Thus the overall complexity is O(2nkt + n^2 + nk). In general, n ≫ k and n ≫ t, so the complexity simplifies to O(n^2).

4. Algorithm Evaluation

We evaluated the effectiveness of CLUSTH in two situations: (1) clustering two datasets that represent the same set of objects, which we call congruent datasets, and (2) clustering two datasets that share some but not all of their objects, which we call semi-congruent datasets.

4.1 Datasets

To evaluate the effectiveness of our approach, we applied CLUSTH and several baseline methods to the task of clustering a document collection consisting of 10,000 journal abstracts from ten different Library of Congress categories. The collection was divided into five non-overlapping subsets where each subset contained 2,000 abstracts (200 abstracts from each category). Different NLP methods were used to preprocess each subset in order to capture different aspects of the documents in different feature spaces. We generated four feature vectors from each abstract: one using syntactic preprocessing and three using semantic preprocessing. The feature vectors extracted using each preprocessing method were used to build a dataset for each of the five document subsets. This resulted in a total of 20 data subsets, each containing 2,000 documents. The syntactic preprocessing was based on constructing a “bag of keywords” from each abstract using the Collins parser (http://www.cs.columbia.edu/mcollins/code.html), and the semantic preprocessing was based on constructing a “bag of senses” from each abstract using the WordNet semantic network (http://wordnet.princeton.edu) and the WordNet::Similarity package (http://search.cpan.org/dist/WordNet-Similarity). We used three semantic preprocessing methods: node-based, edge-based, and combined node-and-edge-based. We used the cosine coefficient method to calculate the pairwise similarity between feature vectors. A detailed discussion of the data preprocessing can be found in [14].

4.2 Experimental Design

To test the effectiveness of CLUSTH, we used several baselines. Ten clusters were generated during each clustering because the datasets were known to have ten categories. The first set of baselines was the partitional clusterings based on individual feature sets. For these baselines, we performed graph-based partitional clustering using METIS (http://www-users.cs.umn.edu/karypis/metis/metis/index.html). A similarity matrix was computed from each set of syntactic feature vectors and from each set of semantic feature vectors. The similarity matrix generated from a dataset was converted into an adjacency matrix for a graph before applying the graph partitioning. Each object was treated as a vertex, and each similarity value between a pair of objects was treated as the weight of the edge between the two corresponding vertices. Another baseline was partitional clustering based on a unified feature set obtained from the two respective feature sets, using the same graph-based clustering approach. We also compared the performance of CLUSTH against the well-known ensemble algorithm CSPA [5].
We also wanted to observe how CLUSTH performs with semi-congruent datasets. To generate data for these experiments, we randomly excluded 50% of the objects from each of the 20 datasets with the constraint that each
category of documents remained equal-sized. On average, this resulted in an overlap of approximately 50% between two datasets constructed from the same document subset using different feature sets, i.e., each pair of combined datasets had approximately 1,500 objects and each individual dataset had exactly 1,000 objects.
When constructing a unified feature set from two semi-congruent datasets, we had to deal with missing values for objects that appear in only one of the datasets. There are two general methods for handling missing feature values: imputation and marginalization [15]. Imputation infers missing values from neighbors or by averaging, while marginalization ignores them. In our document clustering domain, the feature space is extremely sparse and therefore imputation is not applicable. We decided to employ marginalization with pairwise deletion when dealing with the unified feature sets, i.e., when computing the similarity of two objects, only the features with observed values in both objects were used.

4.3 Evaluation Results

We used the F-measure [16] to compare cluster quality. The F-measure quantifies how well the clustering fits the actual classification of the data, where the most desirable value is 1 and the least desirable value is 0. Fig. 5(a) shows the F-measure values for CLUSTH and the baseline clustering schemes with congruent datasets. For comparison with the clustering based on individual feature sets, we only show the individual clustering yielding the higher F-measure value (Cind). For each feature set or combination of feature sets, five data subsets were used. Each graph shows the F-measure values for a particular combination of two heterogeneous feature sets; six different combinations were used. For example, the first graph shows the F-measure values when the syntactically preprocessed and semantically (node-based) preprocessed feature sets were used for clustering. Each curve represents the F-measure for one of the clustering schemes. It is evident that CLUSTH produces high-quality clusters (F-measure ≈ 0.9) and outperforms the baselines in all instances. If we compare the performance of clustering on the unified feature sets with clustering based on individual feature sets, we can see that the unified feature sets do not always produce better quality clusters. This demonstrates that feature set integration does not necessarily take advantage of the mutual information in the individual feature sets to yield improved clustering. It is evident that CLUSTH consistently yields higher quality clusters than clustering based on individual feature sets and unified feature sets. CLUSTH also outperforms the clustering based on the ensemble method CSPA. Note that CSPA was tested using an identical experimental setup to that of CLUSTH, i.e., clustering was performed on individual feature sets and then the resulting individual clusterings were combined.
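As a concrete illustration of the marginalization with pairwise deletion described in Section 4.2 above, a minimal sketch (ours; each object is represented as a dict of observed feature values) is:

    import math

    def cosine_pairwise_deletion(x, y):
        # Only features with observed values in both objects contribute.
        shared = set(x) & set(y)
        dot = sum(x[f] * y[f] for f in shared)
        nx = math.sqrt(sum(x[f] ** 2 for f in shared))
        ny = math.sqrt(sum(y[f] ** 2 for f in shared))
        return dot / (nx * ny) if nx > 0 and ny > 0 else 0.0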

Fig. 5(b) presents the comparison of F-measure values for CLUSTH and the baseline clustering schemes when we used semi-congruent datasets, generated by randomly removing 50% of the objects from each data subset. As before, we obtained higher quality clusters with CLUSTH compared to the baseline schemes. It is interesting to note that clustering with unified feature sets always yields lower quality clusters with semi-congruent datasets when compared to the other clustering approaches. The results presented in Figs. 5(a) and 5(b) are summarized graphically in Figs. 6(a) and 6(b) respectively. The results shown are averages over all thirty observations along with the standard deviation. In addition to higher average F-measure values for CLUSTH, lower standard deviations of the F-measure values are also observed for CLUSTH. Note that for congruent datasets, neither clustering with unified feature sets nor CSPA produces higher quality clusters compared to clustering on individual feature sets. When the datasets only share 50% of their objects (semi-congruent), CSPA and CLUSTH clearly outperform clustering based on individual feature sets and clustering based on unified feature sets, with CLUSTH outperforming CSPA.
To test the statistical significance of our results, we performed one-tailed paired T-tests. Table 1 shows the corresponding results. We used an α value of 0.05, for which Tcritical = 1.699 for one-tailed tests. The hypotheses tested are CLUSTH ≻ Cind, CLUSTH ≻ Cuni, and CLUSTH ≻ CSPA, where Cx ≻ Cy means that the clusters generated by Cx are of significantly higher quality than the clusters generated by Cy. For each hypothesis, the table shows the T-value (Tstat) and the p-value (P_{T≤t}). A T-value of 1.699 or higher and a p-value of 0.05 or lower indicate evidence that the hypothesis is true. The T-test results demonstrate that CLUSTH is significantly better than the baseline clustering schemes.
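A hedged sketch of such a test using SciPy (ours; the F-measure values below are placeholders for illustration, not the paper's results) is:

    from scipy import stats

    # Paired F-measure values per observation for CLUSTH and a baseline
    # (placeholder numbers, not taken from the paper).
    f_clusth   = [0.91, 0.90, 0.92, 0.89, 0.90]
    f_baseline = [0.78, 0.80, 0.77, 0.79, 0.81]

    # One-tailed paired T-test of the hypothesis CLUSTH ≻ baseline
    # (the `alternative` argument requires SciPy >= 1.6).
    t_stat, p_value = stats.ttest_rel(f_clusth, f_baseline, alternative="greater")
    print(t_stat, p_value)  # the hypothesis is supported if p_value <= 0.05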

Table 1: Results of the one-tailed paired T-tests (Tstat and P_{T≤t}) for the hypotheses CLUSTH ≻ Cind, CLUSTH ≻ Cuni, and CLUSTH ≻ CSPA on the congruent datasets.
