Two-Stage SVM Classification for Large Data Sets via Randomly Reducing and Recovering Training Data

Xiaoou Li, Jair Cervantes, and Wen Yu

Abstract— Despite the good theoretical foundations and high classification accuracy of the support vector machine (SVM), normal SVM is not suitable for the classification of large data sets because its training complexity is very high. This paper presents a novel two-stage SVM classification approach for large data sets based on randomly selecting training data. The first stage SVM classification gets a sketch of the support vector distribution. Then the neighbors of these support vectors in the original data set are used as training data for the second stage SVM classification. Experimental results demonstrate that our approach has good classification accuracy while the training is significantly faster than with other SVM classifiers.

I. INTRODUCTION

There are a number of standard classification techniques in the literature, such as simple rule-based and nearest neighbor classifiers, Bayesian classifiers, artificial neural networks, decision trees, support vector machines (SVM), ensemble methods, etc. Among these techniques, the support vector machine (SVM) is one of the best known for its optimization solution [8]. Recently, some new SVM classification methods have been proposed: a geometric approach to SVM classification was given by [17], and a fuzzy neural network SVM classifier was studied by [16]. Despite its good theoretical foundations and generalization performance, SVM is not suitable for the classification of large data sets, since SVM needs to solve a quadratic programming (QP) problem in order to find a separation hyperplane, which causes intensive computational complexity.

Many researchers have tried to find possible methods to apply SVM classification to large data sets. Generally, these methods can be divided into two types: 1) modify SVM so that it can be applied to large data sets, and 2) reduce a large data set to smaller sets so that normal SVM can be applied.

For the first type, a standard projected conjugate gradient (PCG) chunking algorithm scales somewhere between linear and cubic in the training set size [7][13]. Sequential Minimal Optimization (SMO) is a fast method to train SVM [19][6]. Training SVM requires the solution of a QP optimization problem; SMO breaks this large QP problem into a series of smallest possible QP problems and is faster than PCG chunking.

The work is supported by the Queen's University Belfast International Exchange Scheme D7143EEC (United Kingdom) and CONACyT Project 46729-Y (Mexico). Xiaoou Li and Jair Cervantes are with the Department of Computer Science, Research Centre for Advanced Studies of the National Polytechnic Institute (CINVESTAV-IPN), Av. IPN 2508, Col. Zacatenco, 07360, Mexico City, Mexico. [email protected] Wen Yu is with the Department of Automatic Control, Research Centre for Advanced Studies of the National Polytechnic Institute (CINVESTAV-IPN), Av. IPN 2508, Col. Zacatenco, 07360, Mexico City, Mexico.

[email protected]


Also for the first type, [9] introduced a parallel optimization step where block diagonal matrices are used to approximate the original kernel matrix, so that SVM classification can be split into hundreds of subproblems. A recursive and computationally superior mechanism referred to as adaptive recursive partitioning was proposed in [14], where the data is recursively subdivided into smaller subsets. Genetic programming is able to deal with large data sets that do not fit in main memory [10]. Neural network techniques can also be applied to SVM to simplify the training process [12].

For the second type, clustering has been proved to be an effective method to reduce the data set size, for example hierarchical clustering [24][1], k-means clustering [3] and parallel clustering [6]. Clustering-based methods can reduce the computational burden of SVM, but they are very complex for large data sets. Random selection chooses data in such a way that the learning is maximized by the data; however, it could over-simplify the training data set and lose the benefit of SVM. Rocchio bundling is a statistics-based data reduction method [21]. Another approach is to apply the Bayesian committee machine to train SVM on large data sets [22], where the data is divided into m subsets of the same size and m models are derived from the individual sets; but it has a higher error rate than normal SVM, and the sparseness property does not hold.

The goal of clustering is to separate a finite set of unlabeled items into a finite and discrete set of "natural" hidden data structures, such that items in the same cluster are more similar to each other and those in different clusters tend to be dissimilar, according to a certain measure of similarity or proximity. A large number of clustering methods have been developed, e.g., squared-error-based k-means [2], fuzzy C-means [18] and kernel-based clustering [11]. For these clustering methods, the optimal number of clusters should be predefined, which involves more computational cost than the clustering itself [23]. Moreover, clustering algorithms are generally time-consuming, and in fact the majority of the clustering computation on the whole large data set is not useful for finding the few support vectors. So, it is possible to find a way to reduce the data set without making calculations on all the data, and the trade-off is the classification precision. Random selection may be such a method.

In this paper, we propose a new random selection technique to reduce the data set in place of clustering. Firstly, we select a predefined number of data from the original data set using our random selection method, and these selected data are used in the first stage SVM. Secondly, the obtained support vectors of the first stage SVM are used to select


data for the second stage SVM: they are put back into the original data set and their neighboring data within a certain radius are selected (we call this process recovery). Thirdly, the second stage SVM is applied on those recovered data. Our experimental results show that the accuracy obtained by our approach is very close to that of the classic SVM methods, while the training time is significantly shorter. Furthermore, the proposed approach can be applied to huge data sets regardless of their dimension.

The rest of the paper is organized as follows: Section II reviews SVM classification. Section III introduces a random selection method to select data from the original data set. Section IV explains the two-stage SVM classification. Conclusions are given in Section V.

II. SUPPORT VECTOR MACHINE CLASSIFICATION

We consider binary classification. A training set S is given as S = {x_i, y_i}_{i=1}^{n}, where x_i ∈ R^n and y_i ∈ {+1, −1}. Training an SVM amounts to finding the optimal hyperplane such that

\[
y_i = \mathrm{sign}\left[ w^T \varphi(x_i) + b \right]
\]

or to solve the following quadratic programming problem (primal problem),

\[
\min_{w,b} J(w) = \frac{1}{2} w^T w + c \sum_{k=1}^{n} \xi_k
\quad \text{subject to: } y_k \left[ w^T \varphi(x_k) + b \right] \ge 1 - \xi_k
\tag{1}
\]

where the ξ_k are slack variables to tolerate mis-classifications, ξ_k > 0, k = 1, ..., n, and c > 0. Problem (1) is equivalent to the following quadratic programming problem, which is a dual problem with the Lagrange multipliers α_k ≥ 0,

\[
\max_{\alpha} J(\alpha) = -\frac{1}{2} \sum_{k,j=1}^{n} y_k y_j K(x_k, x_j)\, \alpha_k \alpha_j + \sum_{k=1}^{n} \alpha_k
\quad \text{subject to: } \sum_{k=1}^{n} \alpha_k y_k = 0, \quad 0 \le \alpha_k \le c
\tag{2}
\]

where the function K(x_k, x_l) is called the Mercer kernel, which must satisfy the Mercer condition [8]. The resulting classifier is

\[
y_i = \mathrm{sign}\left[ \sum_{k=1}^{n} \alpha_k y_k K(x_k, x_i) + b \right].
\]
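For concreteness, the dual solution and the sparse kernel expansion above can be checked numerically with any off-the-shelf SMO-type solver. The sketch below is only an illustration under our own assumptions (synthetic data, an arbitrary kernel width), not part of the paper; it uses scikit-learn's SVC and rebuilds the decision values from the stored dual coefficients α_k y_k, the support vectors and the bias b.

```python
# Minimal sketch (not from the paper): verify that the SVC decision function
# equals  sum_{k in V} alpha_k * y_k * K(x_k, x) + b  over the support vectors only.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_classification(n_samples=500, n_features=22, random_state=0)
y = np.where(y == 1, 1, -1)                       # labels in {+1, -1}

gamma = 0.05                                      # illustrative kernel width
clf = SVC(C=1.0, kernel="rbf", gamma=gamma).fit(X, y)

# dual_coef_ stores alpha_k * y_k for the support vectors only (sparse solution)
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)
decision = clf.dual_coef_ @ K + clf.intercept_    # sum over k in V, plus b

assert np.allclose(decision.ravel(), clf.decision_function(X))
print("support vectors:", clf.support_vectors_.shape[0], "of", X.shape[0])
```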

Many solutions of (2) are zero, i.e., α_k = 0, so the solution vector is sparse, and the sum is taken only over the non-zero α_k. The x_i which corresponds to a nonzero α_i is called a support vector (SV). Let V be the index set of the SVs; then the optimal hyperplane is

\[
\sum_{k \in V} \alpha_k y_k K(x_k, x) + b = 0
\]

and the resulting classifier is

\[
y(x) = \mathrm{sign}\left[ \sum_{k \in V} \alpha_k y_k K(x_k, x) + b \right]
\]

where b is determined by the Kuhn-Tucker conditions,

\[
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{k=1}^{n} \alpha_k y_k \varphi(x_k), \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{k=1}^{n} \alpha_k y_k = 0, \qquad
\frac{\partial L}{\partial \xi_k} = 0 \;\Rightarrow\; c - \alpha_k - \nu_k = 0,
\tag{3}
\]

together with the complementarity condition

\[
\alpha_k \left\{ y_k \left[ w^T \varphi(x_k) + b \right] - 1 + \xi_k \right\} = 0 .
\]

In the primal problem (1) the size of w is fixed, independent of the number of data points. In the dual problem (2) the solution vector α grows with the number of data points n. For a high dimensional input space it is better to solve the dual problem, while for large data sets it might be advantageous to solve the primal problem. Sequential minimal optimization (SMO) breaks the large QP problem into a series of smallest possible QP problems [19]. These small QP problems can be solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The memory required by SMO is linear in the training set size, which allows SMO to handle very large training sets [13]. A requirement in (3) is Σ_{i=1}^{n} α_i y_i = 0; it is enforced throughout the iterations

and implies that the smallest number of multipliers that can be optimized at each step is two. At each step SMO chooses two elements α_i and α_j to optimize jointly; it finds the optimal values for these two parameters while all the others are fixed. The choice of the two points is determined by a heuristic algorithm, while the optimization of the two multipliers is performed analytically. Experimentally, the performance of SMO is very good, despite needing more iterations to converge: each iteration uses so few operations that the algorithm exhibits an overall speedup. Besides convergence time, SMO has other important features; for instance, it does not need to store the kernel matrix in memory, and it is fairly easy to implement [19].

III. RANDOM SELECTION ALGORITHM

Many efforts have been made on producing "random" sequences of integers. In recent years, algorithms have been proposed to measure how random a sequence is. However, most generators are subjected to rigorous statistical testing, where factors including speed, setup time, length of the compiled code, machine independence, range of applications, simplicity and readability should be considered. It is impossible to obtain a perfect uniform random number generator.

In this paper, we propose a random selection algorithm to select a small percentage of data from the original large data set, so that an SVM algorithm can be applied directly. We assume that (X, Y) is the training patterns set, where X = {x_1, x_2, ..., x_n} is the input data set, each x_i can be represented by a vector of dimension p, i.e., x_i = (x_{i1}, ..., x_{ip})^T, and Y = {y_1, y_2, ..., y_n} is the label set, with label y_i ∈ {−1, 1}. Our objective is to select a sample set C = (c_1, c_2, ..., c_l), l ≪ n, from X. Note that l needs to be predefined and should be an integer that is suitable as the training data size for normal SVM algorithms.


[Fig. 1. Random selection clustering (first sample and swapping, second sampling and swapping).]

[Fig. 2. Two stages SVM classification scheme: original data (X, Y) → random selection → selected sample data → SVM classification stage 1 → support vectors → data recovery (data near the optimal hyperplane) → SVM classification stage 2 → the optimal hyperplane.]

Firstly, we divide the input data X into two groups according to their label Y. Let the number of positive labeled data be q and the number of negative labeled data be m, with n = q + m, and define the positive and negative labeled input data in array form as X^+ and X^- corresponding to their labels Y^+ and Y^-, respectively, i.e.,

X^+ = [x^+(1), ..., x^+(q)]
X^- = [x^-(1), ..., x^-(m)]
Y^+ = [y^+(1), ..., y^+(q)] = [1, ..., 1]
Y^- = [y^-(1), ..., y^-(m)] = [-1, ..., -1]

So the original input data set is the union of X^+ and X^-, i.e., X = X^+ ∪ X^-.

Secondly, we select sample data by sampling the subsets X^+ and X^- independently. Here we use a swapping method during the selection process, see Fig. 1, and the selection is done in the following way: the first sample c_1 is chosen from X^+ (or X^-) uniformly at random, then it is exchanged with the last data point x^+(q) (or x^-(m)), i.e., Swap(c_1, x^+(q)) (or Swap(c_1, x^-(m))). The second sample c_2 is selected from the remaining data {x^+(1), ..., x^+(q-1)}, then it is exchanged with the last data point x^+(q-1) (or x^-(m-1)), i.e., Swap(c_2, x^+(q-1)) (or Swap(c_2, x^-(m-1))), and so on, until the required number of data for X^+ (or X^-) are selected.

Generally, we may select l/2 samples from each labeled original data set, i.e., l/2 samples from X^+ and l/2 samples from X^-, if the labels are distributed evenly. If not, we may predefine a proportion of the sample for each label according to its distribution in the original data or as wanted. The proportion may be defined in various ways, which are out of the range of our investigation in this paper. This swapping method can also be regarded as generating the first l entries of a random permutation. The above random selection can be written as the following algorithm:

For i := n downto n - l + 1 Do
    Generate a uniform [0, 1] random variate X
    s ← ⌈iX⌉
    Swap(x[s], x[i])
Return x[n - l + 1], ..., x[n]
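As an illustration only (the function names and the per-class split below are our own, not the authors' code), this selection is a partial Fisher-Yates shuffle applied separately to the positive and negative indices: each chosen element is swapped to the end of the active range, so it cannot be drawn twice.

```python
# Minimal sketch (illustrative only): per-class random selection by
# partial Fisher-Yates swapping, returning l selected samples from (X, y).
import numpy as np

def partial_fisher_yates(indices, l, rng):
    """Select l distinct entries from `indices` by swapping each pick to the end.
    Assumes the class has at least l members."""
    idx = np.array(indices, copy=True)
    n = len(idx)
    for i in range(n - 1, n - l - 1, -1):         # i = n-1, ..., n-l
        s = rng.integers(0, i + 1)                # uniform position in the active range
        idx[s], idx[i] = idx[i], idx[s]           # Swap(x[s], x[i])
    return idx[n - l:]                            # the last l entries

def random_select(X, y, l, rng=None):
    """Pick about l/2 samples from each class (assumes roughly even labels)."""
    rng = np.random.default_rng(rng)
    pos = np.flatnonzero(y == +1)
    neg = np.flatnonzero(y == -1)
    sel = np.concatenate([partial_fisher_yates(pos, l // 2, rng),
                          partial_fisher_yates(neg, l - l // 2, rng)])
    return X[sel], y[sel], sel
```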



IV. TWO-STAGE SVM CLASSIFICATION

Let (X, Y) be the training patterns set,

\[
X = \{x_1, \dots, x_n\}, \quad Y = \{y_1, \dots, y_n\}, \quad
y_i = \pm 1, \quad x_i = (x_{i1}, \dots, x_{ip})^T \in R^p
\tag{4}
\]

The training task of SVM classification is to find the optimal hyperplane from the input X and the output Y which maximizes the margin between the classes. By the sparseness property of SVM, the data which are not support vectors do not contribute to the optimal hyperplane. The input data which are far away from the decision hyperplane should be eliminated, while the data which are possibly support vectors should be used. Our two-stage SVM classification can be summarized in four steps, as shown in Fig. 2: 1) data selection via the random selection technique proposed in Section III, 2) the first stage SVM classification, 3) data recovery, and 4) the second stage SVM classification. The following subsections give a detailed explanation of each step.

A. Selecting training data

We denote the data set of selected samples as C. We may divide C into two subsets, one positive labeled and the other negative labeled, i.e.,

C^+ = {c_i ∈ C | y_i = +1} (positive labeled data)
C^- = {c_i ∈ C | y_i = −1} (negative labeled data)

Fig. 3(a) shows the original data, where the red circles represent positive labeled data and the green squares represent negative labeled data. After random selection, the selected sample data are shown in Fig. 3(b). Then, C^+ consists of those data which are represented by red circles in Fig. 3(b), and C^- consists of those data which are represented by green squares in Fig. 3(b). Only these selected data will be used as training data in the first stage SVM classification.


[Fig. 3. Random selection: (a) original data, (b) selected sample data.]

[Fig. 4. Training data recovery: (a) support vectors of the first stage SVM, (b) recovered data.]

B. The first stage SVM classification

In the first stage SVM classification we use SVM classification with the SMO algorithm to get the decision hyperplane

\[
\sum_{k \in V_1} y_k \alpha_{1,k}^{*} K(x_k, x) + b_1^{*} = 0
\tag{5}
\]

where V_1 is the index set of the support vectors in the first stage. Here, the training data set is C^+ ∪ C^-, which has been obtained by random selection in the last subsection. Fig. 3(b) and Fig. 4(a) illustrate the training data set and the support vectors obtained in the first stage SVM classification, respectively.
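A minimal sketch of this stage, continuing the hypothetical helpers introduced after the Section III algorithm (our names, not the authors'): fit an SMO-type solver on the reduced sample C^+ ∪ C^- and keep the first-stage support vectors, which serve as the centers c_i, i ∈ V_1, for the recovery step.

```python
# Minimal sketch (illustrative only): first stage SVM on the selected sample.
from sklearn.svm import SVC

def first_stage(X, y, l, gamma=0.05, C=1.0, rng=0):
    # sample C+ and C- with the random selection sketched in Section III
    Xc, yc, _ = random_select(X, y, l, rng)
    clf1 = SVC(C=C, kernel="rbf", gamma=gamma).fit(Xc, yc)
    return clf1, clf1.support_vectors_            # centers c_i, i in V_1
```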

C. Data recovery

Note that the original data set is reduced significantly by random selection, so in the first stage SVM classification the training data set is only a small percentage of the original data. This may affect the classification precision, i.e., the obtained decision hyperplane may not be precise enough, and a refining process is necessary. However, at least it gives us a reference on which data can be eliminated: the data far from the hyperplane can be removed for the further classification stage. On the other hand, we would like to make a classification using as many useful data as possible, but some useful data have not been selected during the random selection stage. So, a natural idea is to recover those data which are near to the support vectors, and use the recovered data to make the classification again.

We propose to recover the data by using the support vectors as centers and circling their neighbors in the original data set within a certain radius; Fig. 4 illustrates the recovery process. The data inside the circles in Fig. 4(b) are the recovered data. Of course, how many data will be recovered depends on the radius selection. We intend to recover a data set whose size is close to the training data set size suitable for the SVM algorithm we are using. An extreme case is that all original data are recovered if the radius is large enough. In this paper we use the margin of the first stage SVM,

\[
\gamma_i = \frac{K}{\left\| w^{*} \right\|_2}
        = \frac{K}{\left\| \sum_{i \in V} y_i \alpha_i^{*} x_i \right\|_2},
\]

to determine the radius r of the circles B_i(c_i, r), c_i ∈ V, which are used to recover original data for the second stage SVM. The recovered data set is ∪_{c_i ∈ V} {B_i(c_i, r)}.
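One way to realize the recovery step (our sketch under stated assumptions, not the authors' implementation) is a radius query around each first-stage support vector using a k-d tree over the original data. The radius r is left as a caller-supplied parameter; tying it to the first-stage margin, as the paper does, is one possible rule for choosing it.

```python
# Minimal sketch (illustrative only): recover original data lying within
# radius r of any first stage support vector, i.e. the balls B_i(c_i, r).
import numpy as np
from scipy.spatial import cKDTree

def recover(X, y, support_vectors, r):
    tree = cKDTree(X)
    # one list of original-data indices per center c_i
    hits = tree.query_ball_point(support_vectors, r)
    idx = np.unique(np.concatenate([np.asarray(h, dtype=int) for h in hits]))
    return X[idx], y[idx], idx
```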

D. The second stage SVM classification

Taking the recovered data as the new training data set, we again use SVM classification with the SMO algorithm to get the final decision hyperplane

\[
\sum_{k \in V_2} y_k \alpha_{2,k}^{*} K(x_k, x) + b_2^{*} = 0
\tag{6}
\]

where V_2 is the index set of the support vectors in the second stage. Generally, the hyperplane (5) is close to the hyperplane (6). Note that the original data set is reduced significantly by random selection and the first stage SVM. The recovered data should consist of the most useful data from the SVM optimization viewpoint, because the data far away from the decision hyperplane have been eliminated. On the other hand, almost all original data near the hyperplane (5) are included by our data recovery technique, so the hyperplane (6) should be an approximation of the hyperplane obtained by a normal SMO trained on the whole original training data set. Fig. 5 illustrates the training data and the hyperplane of the second stage SVM classification.
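Putting the pieces together, a compact driver for the whole two-stage procedure might look as follows. It simply chains the hypothetical helpers sketched in the previous subsections (random_select, first_stage, recover) and refits the SVM on the recovered data to obtain hyperplane (6); the parameter values are placeholders, not the paper's settings.

```python
# Minimal sketch (illustrative only): the complete two-stage classification.
from sklearn.svm import SVC

def two_stage_svm(X, y, l=500, r=1.0, gamma=0.05, C=1.0, rng=0):
    # Stage 1: SVM on a small random sample (Sections III and IV-B)
    clf1, sv1 = first_stage(X, y, l, gamma=gamma, C=C, rng=rng)
    # Recovery: neighbors of the first stage support vectors (Section IV-C)
    Xr, yr, _ = recover(X, y, sv1, r)
    # Stage 2: refit on the recovered data to approximate the full-data SVM
    clf2 = SVC(C=C, kernel="rbf", gamma=gamma).fit(Xr, yr)
    return clf2

# usage: model = two_stage_svm(X_train, y_train); y_pred = model.predict(X_test)
```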


E. Example

We use the IJCNN 2001 data set [20] as an example to compare our approach with other SVM classification methods, including LIBSVM [5], SMO [19] and Simple SVM [8]. All the experiments were done on a Pentium-IV 1.8 GHz PC.

Example 1: IJCNN 2001. The data set is available at [20] and [5]. There are 49990 training data points and 91701 testing data points, and each record has 22 attributes. The sizes of the training data we used are 1000, 5000, 12500, 25000, 37500 and 49990. In our experiments, the RBF kernel is

\[
f(x, z) = \exp\left( -\frac{(x - z)^T (x - z)}{2 r_{ck}^2} \right)
\tag{7}
\]

and we choose r_ck = r/5.
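As a side remark of ours (not the paper's), the kernel (7) corresponds to the usual library parameterization exp(-γ‖x - z‖²) with γ = 1/(2 r_ck²), so the choice r_ck = r/5 translates directly into a gamma value when reproducing this setup with a standard solver.

```python
# Minimal sketch (illustrative only): mapping kernel (7) to a library gamma.
from sklearn.svm import SVC

r_ck = 1.0                       # placeholder value; the paper sets r_ck = r/5
gamma = 1.0 / (2.0 * r_ck ** 2)  # exp(-||x-z||^2 / (2 r_ck^2)) = exp(-gamma ||x-z||^2)
clf = SVC(C=1.0, kernel="rbf", gamma=gamma)
```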

The results of our approach are shown in Table 1, where the notation is explained as follows: "#" is the data size; "t" is the training time (in seconds) of the whole classification, which includes the time of clustering, the first stage SVM training, de-clustering and the second stage SVM training; "Acc" is the accuracy (%); "l" is the number of clusters used in the experiment; "#MC" is the number of data in all mixed labeled clusters; "TrD1" is the training data size of the first stage SVM classification; "SV1" is the number of support vectors obtained in the first stage SVM; "TrD2" is the training data size of the second stage SVM classification; "SV2" is the number of support vectors obtained in the second stage SVM.

Table 1. Results with our approach on the IJCNN data set

#       t (s)   Acc (%)   l      #MC    TrD1    SV1    TrD2    SV2
1000    4.53    92.7      350    39     388     86     473     175
5000    8.73    94.1      400    84     483     109    541     180
12500   13.3    94.7      450    105    554     127    673     193
25000   25.9    94.9      500    147    646     118    712     287
37500   45.3    95.3      1000   179    1178    122    1693    358
49990   78.1    97.7      2000   201    2200    185    2370    430

Table 1 shows our experimental results for different data sizes. For example, in the experiment on 49990 data points we sectioned them into 2000 clusters, including one mixed cluster. In the first stage classification, we got 2200 training data after data selection, including 2000 cluster centers and 201 data in a mixed labeled cluster; then 185 support vectors were obtained. Following the data recovery process, 2370 data were recovered as training data for the second stage SVM, which included the cluster centers that are support vectors and the clusters with mixed labels. In the second stage SVM, 430 support vectors were obtained. The total training time is 78.08 seconds, and the accuracy reaches 97.7%.

[Fig. 5. The second stage SVM: (a) training data, (b) decision hyperplane.]

Table 2. Training time and accuracy on the IJCNN data set

        Our approach        SMO                 Simple SVM          LIBSVM
#       t (s)   Acc (%)     t (s)   Acc (%)     t (s)    Acc (%)    t (s)   Acc (%)
1000    4.53    92.7        1.67    96.1        2.1      94.3       0.5     93.1
5000    8.73    94.1        89.8    97.2        4.6      96.3       7.5     96.3
12500   13.3    94.7        -       -           183      96.1       47.6    98.2
25000   25.9    94.9        -       -           3823     97.2       177     98.6
37500   45.3    95.3        -       -           10872    97.7       394     98.8
49990   78.1    97.7        -       -           20491    98.2       731     98.8

Table 2 shows the comparison of training time and accuracy between our classifier and some other SVM algorithms, including SMO, Simple SVM and LIBSVM. For example, to classify 1000 data, LIBSVM is the fastest and SMO has the best accuracy; our approach is not better than them, although its time and accuracy are still acceptable. However, to classify 49990 data, Simple SVM and SMO have no better accuracy than the others, but their training time is tremendously longer. Compared with our approach, the training time of LIBSVM is almost 10 times ours, while the accuracies are almost the same. This experiment implies that, when the data set is large, our approach can reach the same accuracy as the other algorithms in a much shorter training time.

V. CONCLUSION

A random selection method and a two-stage SVM classification approach for large data sets are proposed in this work. Compared with other SVM classifiers, our classifier via randomly reducing and recovering training data has the following characteristics:

1) The proposed approach is very practical for large data set classification. It can be used as a primitive classifier since it can be as fast as possible (depending on the accuracy requirement), no matter how large the data set is. Generally, SVM training time depends on the training data size. Our approach reduces the QP burden by using the proposed random selection method, so our total training data size is much smaller than that of other SVM approaches, although we need two classification stages.

2) The training data size is reduced dramatically by random selection in our classification; however, the classification accuracy does not decrease. This is thanks to the data recovery process, since the recovered data used in the second stage SVM are all important samples, which guarantees the improvement of accuracy. In our previous works [3][4][15], some experiments were done showing the accuracy difference between the first and second stage SVM of our approach.

3) Random selection is much faster than other clustering-based data selection because it does not partition the data. But it requires that the original data set be relatively uniform. Although we select the same number of data for each label, it is still possible that the selected data are not representative when the data labels are not distributed relatively uniformly. In this case, we need to consider the class distribution, which is our ongoing work.

REFERENCES

[1] M. Awad, L. Khan, F. Bastani and I. L. Yen, An Effective Support Vector Machines (SVMs) Performance Using Hierarchical Clustering, in 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), Boca Raton, FL, USA, 15-17 November 2004, pp. 663-667.
[2] G. Babu and M. Murty, A near-optimal initial seed value selection in K-means algorithm using a genetic algorithm, Pattern Recognit. Lett., 14(10), 1993, pp. 763-769.
[3] J. Cervantes, X. Li and W. Yu, Support Vector Machine Classification Based on Fuzzy Clustering for Large Data Sets, MICAI 2006: Advances in Artificial Intelligence, Springer-Verlag, Lecture Notes in Computer Science (LNCS), vol. 4293, 2006, pp. 572-582.
[4] J. Cervantes, X. Li, W. Yu and K. Li, Support vector machine classification for large data sets via minimum enclosing ball clustering, Neurocomputing, accepted for publication.
[5] C-C. Chang and C-J. Lin, LIBSVM: a library for support vector machines, http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[6] P-H. Chen, R-E. Fan and C-J. Lin, A Study on SMO-Type Decomposition Methods for Support Vector Machines, IEEE Trans. Neural Networks, 17(4), 2006, pp. 893-908.
[7] R. Collobert and S. Bengio, SVMTorch: Support vector machines for large regression problems, Journal of Machine Learning Research, vol. 1, 2001, pp. 143-160.
[8] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[9] J-X. Dong, A. Krzyzak and C. Y. Suen, Fast SVM Training Algorithm with Decomposition on Very Large Data Sets, IEEE Trans. Pattern Analysis and Machine Intelligence, 27(4), 2005, pp. 603-618.
[10] G. Folino, C. Pizzuti and G. Spezzano, GP Ensembles for Large-Scale Data Classification, IEEE Trans. Evol. Comput., 10(5), 2006, pp. 604-616.
[11] M. Girolami, Mercer kernel based clustering in feature space, IEEE Trans. Neural Networks, 13(3), 2002, pp. 780-784.
[12] G. B. Huang, K. Z. Mao, C. K. Siew and D.-S. Huang, Fast Modular Network Implementation for Support Vector Machines, IEEE Trans. Neural Networks, 16(6), 2005, pp. 1651-1663.
[13] T. Joachims, Making large-scale support vector machine learning practical, Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, 1998.
[14] S-W. Kim and B. J. Oommen, Enhancing Prototype Reduction Schemes with Recursion: A Method Applicable for "Large" Data Sets, IEEE Trans. Syst., Man, Cybern. B, 34(3), 2004, pp. 1184-1397.
[15] X. Li and K. Li, A Two-Stage SVM Classifier for Large Data Sets with Application to RNA Sequence Detection, in 2007 International Conference on Life System Modeling and Simulation (LSMS'07), Lecture Notes in Bioinformatics, Shanghai, China, September 14-17, 2007.
[16] C-T. Lin, C-M. Yeh, S-F. Liang, J-F. Chung and N. Kumar, Support-Vector-Based Fuzzy Neural Network for Pattern Classification, IEEE Trans. Fuzzy Syst., 14(1), 2006, pp. 31-41.
[17] M. E. Mavroforakis and S. Theodoridis, A Geometric Approach to Support Vector Machine (SVM) Classification, IEEE Trans. Neural Networks, 17(3), 2006, pp. 671-682.
[18] N. Pal and J. Bezdek, On cluster validity for the fuzzy c-means model, IEEE Trans. Fuzzy Syst., 3(3), 1995, pp. 370-379.
[19] J. Platt, Fast Training of Support Vector Machines Using Sequential Minimal Optimization, Advances in Kernel Methods: Support Vector Machines, MIT Press, Cambridge, MA, 1998.
[20] D. Prokhorov, IJCNN 2001 neural network competition, http://www.geocities.com/ijcnn/nnc_ijcnn01.pdf, 2001.
[21] L. Shih, D. M. Rennie, Y. Chang and D. R. Karger, Text Bundling: Statistics-based Data Reduction, in Twentieth Int. Conf. on Machine Learning (ICML-2003), Washington DC, 2003.
[22] V. Tresp, A Bayesian Committee Machine, Neural Computation, 12(11), 2000, pp. 2719-2741.
[23] R. Xu and D. Wunsch II, Survey of Clustering Algorithms, IEEE Trans. Neural Networks, 16(3), 2005, pp. 645-678.
[24] H. Yu, J. Yang and J. Han, Classifying Large Data Sets Using SVMs with Hierarchical Clusters, in 9th ACM SIGKDD, Washington, DC, USA, 2003.
