Data Dimension Reduction for Clustering Semi-Structured Documents using QR Fuzzy C-Mean (QR-FCM)

Abstract: The rapid growth of XML adoption has created an urgent need for a proper representation of semi-structured documents, in which the semantic and structural information of a document is taken into account to support more precise document analysis. To analyze the information represented in XML documents efficiently, research on XML document clustering is actively in progress. The key issue is how to devise a similarity measure for clustering XML documents: since XML documents have a hierarchical structure, it is not appropriate to cluster them using a general document similarity measure. Data dimension reduction (DDR) plays an important role in handling massive quantities of high-dimensional data such as large collections of semantic structural documents. Our proposed method, which we call QR-FCM (QR Fuzzy C-Mean), first projects the XML documents onto the lower-dimensional space obtained from DDR/QR and then runs the document-analysis clustering algorithm in that space. DDR can substantially reduce the computing time and/or memory requirement of a given document-analysis clustering algorithm, especially when the algorithm must be run many times to estimate parameters or to search for a better solution.


I. INTRODUCTION
An XML document, which is semi-structured data, has a hierarchical structure. Therefore, rather than using the similarity measure of general document clustering techniques as it is, a new similarity measure that considers both the semantic and the structural information of an XML document must be investigated. However, some XML clustering methods use a similarity measure that takes only the structural information of XML documents into account. Hwang proposes a clustering method which extracts a typical structure of the maximum frequency pattern using the PrefixSpan algorithm [1] on XML documents [2,3]. However, such a typical structure is not the only structure that represents an XML document, so it cannot be representative of the whole document corpus, and the accuracy of the resulting similarity suffers. Lian summarizes XML documents into an S-graph, which is a structural graph, and proposes using the distance between S-graphs for clustering [4]. However, this approach does not consider the semantic information of XML documents, since it focuses only on structural information.
Since dimension reduction is one of the fundamental methods of data analysis, there have been a great many studies on effective and efficient dimension reduction algorithms. Linear dimension reduction algorithms include principal component analysis (PCA) [5] and multidimensional scaling (MDS) [6]. Nonlinear dimension reduction (NLDR) algorithms include Isomap [7], locally linear embedding (LLE) [8], [9], Hessian LLE [10], Laplacian eigenmaps [11], local tangent space alignment (LTSA) [12] and a distance preserving dimension reduction based on a singular value decomposition (DPDR/QR) [13]. These dimension reduction methods are applied in a variety of areas such as biomedical image recognition, biomedical text data mining, and biological data analysis.
The rest of this paper is organized as follows. In Section II, we describe how the XML documents are prepared in a vector space model. In Section III, we present the DDR based on the QR factorization and the DDR QR-FCM. Section IV presents the experimental results illustrating the properties of the proposed DDR methods. Section V gives a summary.

II. PREPARATION OF SEMANTIC-BASED XML DOCUMENTS
In this section, we first introduce the pre-processing steps that incorporate hierarchical information when encoding the paths of the XML tree. This is based on the preorder tree representation (PTR) [14] and will be introduced after a brief review of how to generate an XML tree from an XML document. Figure 1 illustrates an example of a structural summary. By applying the nested reduction phase to tree T1, we derive tree T2, in which there are no nested nodes. By applying the repetition reduction to tree T2, we derive tree T3, which is the structural summary tree without nested and repeated nodes. Once the trees have been compacted using structural summaries, the nesting and repetition are reduced.
as a set of edges. As an example, Figure 2 depicts a sample XML tree containing some information about the collection of books. The book consists of intro tags, each comprising title, author and date tags.
Each author contains fname and lname tags, and each date includes year and month tags. For simplicity, the left part of Figure 2 shows only the first letter of each tag. The XML document has a hierarchical structure organized by tag paths, which represent document characteristics that can predict the contents of the XML document. Strictly speaking, the tag paths show the semantic structural characteristics of the XML document. In this paper, we propose a new method for calculating the similarity using all of the tag paths of the XML tree, which represent the semantic structural information of the XML document. From now on, a tag path is termed a path element. Table I shows the path elements obtained from the XML document in Figure 3. PEL-i denotes the path elements extracted from the XML document tree from the i-th tree level to the leaf node. For example, PEL-1 is the path element from the root level (level 1) to the leaf node, and PEL-2 is the path element from level 2 to the leaf node. The vector model represents a document as a vector whose elements are the weights of the path elements within the document. To calculate the weight of each path element within the document, a term frequency and inverse document frequency (TF-IDF) method is used [15].
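The extraction of PEL-i path elements can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `leaf_paths` and `path_elements`, the `/` separator, and the choice to key each path element by its starting level are all introduced here for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def leaf_paths(root):
    """Return every root-to-leaf tag path as a tuple of tag names."""
    paths = []

    def walk(node, prefix):
        current = prefix + (node.tag,)
        children = list(node)
        if not children:          # a leaf node closes one tag path
            paths.append(current)
        for child in children:
            walk(child, current)

    walk(root, ())
    return paths


def path_elements(xml_text):
    """Count path elements; PEL-i is the part of a root-to-leaf path from level i down."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for path in leaf_paths(root):
        for start in range(len(path)):
            level = start + 1     # the root tag is level 1
            counts[(level, "/".join(path[start:]))] += 1
    return counts


# Hypothetical sample document, loosely following the book example.
SAMPLE = (
    "<book><intro><title>XML</title>"
    "<author><fname>A</fname><lname>B</lname></author>"
    "</intro></book>"
)
counts = path_elements(SAMPLE)
for (level, path), freq in sorted(counts.items()):
    print(f"PEL-{level}: {path} ({freq})")
```

The counts produced here play the role of raw path element frequencies; any level-dependent weighting would be applied on top of them.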

A. Path Element Structural Semantic Weight (PESSW)
We define the Path Element Structural Semantic Weight (PESSW), which calculates the weight of a path element in an XML document. The PESSW is PEWF (Path Element  (1). In this paper, we use the terms PESSW and DPESSW interchangeably. The level number of the root tag is 1, that of a tag under the root tag is 2, and so on. n is a real number larger than 1, and in this paper, 1 is chosen for the value of n. PEIDFij is shown in equation (3).
where N is the total number of documents and DFj is the number of documents in which the j-th path element appears.
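To illustrate the role of N and DFj, the sketch below computes an inverse document frequency per path element. It assumes the common form log(N / DFj); the exact form used in equation (3) may differ (for example in the logarithm base or smoothing), and the function name `path_element_idf` is hypothetical.

```python
import math
from collections import Counter


def path_element_idf(documents):
    """Map each path element to log(N / DF_j) (assumed form of the IDF).

    documents: list of iterables, each holding the path elements of one document.
    """
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))       # each document counts at most once per path element
    return {pe: math.log(n_docs / count) for pe, count in df.items()}


# Hypothetical corpus of three documents.
docs = [["a/b", "a/c"], ["a/b"], ["a/b", "a/d"]]
idf = path_element_idf(docs)
```

A path element that appears in every document receives weight 0 under this form, so it does not help to distinguish documents.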
The PESSW is prudently calculated to correctly reflect the structural semantic similarity. Table II shows  and t is the total number of path elements in dx, dy respectively [16].
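Given two documents represented as vectors of path element weights, a cosine similarity can be sketched as below. The sparse dictionary representation and the function name are assumptions made for illustration; the weights are meant to stand for PESSW values, and the numbers in the example are made up.

```python
import math


def cosine_similarity(wx, wy):
    """Cosine similarity between two sparse weight vectors given as dicts."""
    shared = set(wx) & set(wy)    # only shared path elements contribute to the dot product
    dot = sum(wx[k] * wy[k] for k in shared)
    norm_x = math.sqrt(sum(v * v for v in wx.values()))
    norm_y = math.sqrt(sum(v * v for v in wy.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0                # convention chosen here for empty documents
    return dot / (norm_x * norm_y)


dx = {"book/intro/title": 0.5, "intro/title": 0.2}
dy = {"book/intro/title": 0.5, "intro/title": 0.2}
dz = {"paper/abstract": 0.7}
sim_same = cosine_similarity(dx, dy)
sim_disjoint = cosine_similarity(dx, dz)
```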

B. Data Dimension Reduction (DDR) Via QR Factorization
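Let us deal with $n$ XML documents of dimension $m$ such that $m \gg n$, collected as the columns of a matrix $D \in \mathbb{R}^{m \times n}$. The QR factorization of $D$ gives

$$D = Q_1 R_1, \qquad d_i = Q_1 r_i,$$

where $Q_1 \in \mathbb{R}^{m \times n}$ has orthonormal columns, $R_1 \in \mathbb{R}^{n \times n}$ is upper triangular, $d_i$ is the i-th column of $D$ and $r_i$ is the i-th column of $R_1$. Because $Q_1^T Q_1 = I$, we can obtain the cosine similarity between two XML documents ($d_i$ and $d_j$) in the original dimensional space from $R_1$ as

$$\cos(d_i, d_j) = \frac{d_i^T d_j}{\|d_i\|_2 \, \|d_j\|_2} = \frac{r_i^T r_j}{\|r_i\|_2 \, \|r_j\|_2},$$

where $r_j$ is the j-th column of $R_1 \in \mathbb{R}^{n \times n}$. We refer to this method as DDR-QR.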
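The statement that cosine similarities in the original space can be recovered from R1 alone can be checked numerically. The following NumPy sketch uses random data as a stand-in for a real document matrix; the sizes, the random seed, and the helper name `cosine_matrix` are assumptions for illustration.

```python
import numpy as np


def cosine_matrix(X):
    """Pairwise cosine similarities between the columns of X."""
    norms = np.linalg.norm(X, axis=0)
    return (X.T @ X) / np.outer(norms, norms)


rng = np.random.default_rng(0)
m, n = 500, 8                     # m path elements, n documents, with m much larger than n
D = rng.random((m, n))            # stand-in for the weighted document matrix

# Reduced ("thin") QR factorization: Q1 is m x n, R1 is n x n upper triangular.
Q1, R1 = np.linalg.qr(D, mode="reduced")

cos_original = cosine_matrix(D)   # computed in the m-dimensional space
cos_reduced = cosine_matrix(R1)   # computed in the n-dimensional space
print(np.allclose(cos_original, cos_reduced))
```

Any clustering step that depends only on inner products or Euclidean distances between documents can therefore work with the n x n matrix R1 instead of the m x n matrix D.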

C. Our proposed DDR SVD-QR Algorithm
As described in the previous section, the QR factorization represents each original document vector $d_i$ by the reduced vector $r_i$, the i-th column of R1.
Fuzzy C-Means (FCM) is a clustering method that allows one piece of data to belong to two or more clusters. This method, developed in [17] and improved in [18], [19], is frequently used in pattern recognition. It is based on the minimization of the following objective function:

$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{C} u_{ij}^{m} \, \|d_i - c_j\|^2,$$

where $m$ is any real number greater than 1, $n$ is the number of data points, $C$ is the number of clusters, $u_{ij}$ is the degree of membership of $d_i$ in cluster $j$, $d_i$ is the i-th d-dimensional measured data point, $c_j$ is the d-dimensional center of cluster $j$, and $\|\cdot\|$ is any norm expressing the dissimilarity between a measured data point and a center. Fuzzy partitioning is carried out by iteratively optimizing the objective function shown above, with the update of the memberships $u_{ij}$ and the cluster centers $c_j$ by:
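In the standard FCM formulation, which the sketch below assumes, the memberships are updated as $u_{ij} = 1 / \sum_{k} \left( \|d_i - c_j\| / \|d_i - c_k\| \right)^{2/(m-1)}$ and the centers as $c_j = \sum_i u_{ij}^m d_i / \sum_i u_{ij}^m$. The following NumPy code is a minimal illustration of that iteration, not the authors' implementation: the function name, the random initialization, the Euclidean norm, the stopping tolerance `tol`, and the iteration cap `max_iter` are all assumptions.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, tol=1e-5, max_iter=300, seed=0):
    """Standard FCM on the rows of X; returns (centers, U) with U[i, j] = u_ij."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)             # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]    # weighted means of the data
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)               # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)      # membership update
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Toy data: two well-separated groups of points (hypothetical, for illustration only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
centers, U = fuzzy_c_means(X, n_clusters=2)
labels = U.argmax(axis=1)                         # hard assignment, if one is needed
```

In the setting of this paper, the rows of `X` would be the reduced document vectors, that is, the transpose of R1, so each iteration costs on the order of n rather than m operations per document and cluster.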