 Open Access
Authors : Hsu-Kuang Chang
 Paper ID : IJERTV8IS060686
 Volume & Issue : Volume 08, Issue 06 (June 2019)
Published (First Online): 02-07-2019
ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Data Dimension Reduction for Clustering Semi-Structured Documents using QR Fuzzy C-Means (QR-FCM)
Hsu-Kuang Chang
Department of Information Engineering, I-Shou University, Kaohsiung, Taiwan
Abstract: The rapid growth of XML adoption has created an urgent need for a proper representation for semi-structured documents, in which the semantic structural information of a document must be taken into account in order to support more precise document analysis. To analyze the information represented in XML documents efficiently, research on XML document clustering is actively in progress. The key issue is how to devise a similarity measure for clustering XML documents: since XML documents have a hierarchical structure, it is not appropriate to cluster them using a general document similarity measure. Data dimension reduction (DDR) plays an important role in handling massive quantities of high-dimensional data such as large collections of semantic structural documents. Having projected the XML documents onto the lower-dimensional space obtained from the DDR/QR, our proposed QR Fuzzy C-Means method (which we call QR-FCM) executes the document-analysis clustering algorithms. The DDR can substantially reduce the computing time and/or memory requirement of a given document-analysis clustering algorithm, especially when we need to run the algorithm many times to estimate parameters or to search for a better solution.
Keywords: QR; DDR; PESSW; FCM; QR-FCM

INTRODUCTION
An XML document, being semi-structured data, has a hierarchical structure. Therefore, rather than reusing the similarity measures of general document clustering techniques as they are, a new similarity measure which considers the semantic and structural information of an XML document must be investigated. However, some XML clustering methods use a similarity measure that takes only the structural information of XML documents into account. Hwang proposes a clustering method which extracts a typical structure of the maximum frequency pattern using the PrefixSpan algorithm [1] on XML documents [2, 3]. However, since such a typical structure extracted from XML documents is not the only structure which represents the XML document itself, it cannot represent the whole document corpus, and there is an accuracy issue with the resulting similarity. Lian summarizes XML documents into an S-graph, a structural graph, and proposes that the calculated distance between S-graphs be used for clustering [4]. However, they do not consider semantic information in XML documents, since they focus only on structural information. Since dimension reduction is one of the fundamental methods of data analysis, there have been a great many studies on effective and efficient dimension reduction algorithms. There are linear dimension reduction algorithms, including principal component analysis (PCA) [5] and multidimensional scaling (MDS) [6]. There are also nonlinear dimension reduction (NLDR) algorithms, including Isomap [7], locally linear embedding (LLE) [8], [9], Hessian LLE [10], Laplacian eigenmaps [11], local tangent space alignment (LTSA) [12], and a distance-preserving dimension reduction based on the QR factorization (DPDR/QR) [13]. These methods cover a variety of areas such as biomedical image recognition, biomedical text data mining, and biological data analysis.
The rest of this paper is organized as follows. In Section II, we describe the preparation of XML documents on a vector space model. In Section III, we present the DDR based on the QR factorization and the DDR QR-FCM. Section IV presents experimental results illustrating properties of the proposed DDR methods. A summary is given in Section V.

PREPARATION OF SEMANTIC-BASED XML DOCUMENTS
In this section, we first introduce the preprocessing steps for incorporating hierarchical information when encoding the XML tree paths. This is based on the preorder tree representation (PTR) [14] and will be introduced after a brief review of how to generate an XML tree from an XML document.
Figure 1 illustrates an example of a structural summary. By applying the nested-reduction phase on tree T1, we derive tree T2, in which there are no nested nodes. By applying the repetition reduction on tree T2, we derive tree T3, the structural summary tree without nested and repeated nodes. Once the trees have been compacted using structural summaries, nesting and repetition are reduced.
Figure 1: Nested and repeated nodes extraction
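The two reduction phases above can be sketched with the standard library's ElementTree. This is a minimal illustration, not the authors' implementation; the helper names `nested_reduction` and `repetition_reduction` are ours, and the tiny input document is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

def nested_reduction(node):
    """Splice away children that repeat their parent's tag
    (the T1 -> T2 phase: remove nested nodes)."""
    changed = True
    while changed:
        changed = False
        for child in list(node):
            if child.tag == node.tag:
                idx = list(node).index(child)
                node.remove(child)
                # lift the nested node's children into its parent
                for grandchild in reversed(list(child)):
                    node.insert(idx, grandchild)
                changed = True
    for child in node:
        nested_reduction(child)

def repetition_reduction(node):
    """Keep only the first sibling with a given tag
    (the T2 -> T3 phase: remove repeated nodes)."""
    seen = set()
    for child in list(node):
        if child.tag in seen:
            node.remove(child)
        else:
            seen.add(child.tag)
            repetition_reduction(child)

# A tree with one nested <I> inside <I> and repeated siblings.
doc = ET.fromstring("<B><I><T/><T/><I><A/></I></I><I><A/></I></B>")
nested_reduction(doc)
repetition_reduction(doc)
print(ET.tostring(doc, encoding="unicode"))
```

After both phases the nested `<I>` has been flattened and the repeated `<T>` and `<I>` siblings collapsed, leaving a compact structural summary.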
Now the XML document is modeled as an XML tree T = (V, E), where V = {v1, v2, ...} is the set of vertices and E ⊆ V × V, with (v1, v2) ∈ E, is the set of edges. As an example, Figure 2 depicts a sample XML tree containing some information about a collection of books. The book consists of intro tags, each comprising title, author, and date tags. Each author contains fname and lname tags, and each date includes year and month tags. The left side of Figure 2 shows only the first letter of each tag for simplicity.
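As a minimal illustration of the T = (V, E) model, the edge set of a small XML tree can be enumerated with the standard library. The document shape and shortened tag names (B=book, I=intro, T=title, A=author, F=fname, L=lname, D=date) follow the Figure 2 example, but this particular instance is our own:

```python
import xml.etree.ElementTree as ET

xml = "<B><I><T/><A><F/><L/></A><D/></I></B>"
root = ET.fromstring(xml)

def tree_edges(node):
    """Enumerate the edge set E of the XML tree T = (V, E)."""
    for child in node:
        yield (node.tag, child.tag)
        yield from tree_edges(child)

edges = list(tree_edges(root))
print(edges)
```

Each parent-child tag pair is one edge, so the listing recovers exactly the hierarchy that the similarity measure must respect.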

PATH ELEMENT OF THE VECTOR SPACE MODEL (PEVSM)
The vector space model represents a document as a vector whose elements are the weights of the path elements within the document. To calculate the weight of each path element within the document, a term frequency and inverse document frequency (TF-IDF) method is used [15].

Path Element Structural Semantic Weight (PESSW)
We define the PESSW (Path Element Structural Semantic Weight), which calculates the weight of a path element in an XML document. The PESSW is the PEWF (Path Element Weighted Frequency) multiplied by the PEIDF (Path Element Inverse Document Frequency). In this paper, we use the terms PESSW and DPESSW interchangeably. The PESSW_ij of the ith path element in the jth document is given in equation (1):

PESSW_ij = PEWF_ij × PEIDF_ij    (1)

Figure 2: Example of an XML document

The XML document has a hierarchical structure, and this structure is organized with tag paths that represent document characteristics which can predict the contents of the XML document. Strictly speaking, these are the semantic structural characteristics of the XML document. In this paper, we propose a new method for calculating similarity using all of the tag paths of the XML tree representing the semantic structural information of the XML document. From now on, a tag path is termed a path element. Table I shows the path elements obtained from the XML document in Figure 3. PELi represents the path elements extracted from the XML document tree from the ith tree level down to the leaf node. For example, PEL1 means the path element from the root level (level 1) to the leaf node, and PEL2 means the path element from level 2 to the leaf node, respectively.

PEWF_ij is given in equation (2):

PEWF_ij = freq_ij × (1/x)^n    (2)

where freq_ij is the frequency of the ith path element in the jth document; it is multiplied by the level weight (1/x)^n in order to consider the semantic importance of a path element in a document. Here x refers to the level number of the highest tag of the tag path: the level number of the root tag is 1, that of a tag under the root tag is 2, and so on. The exponent n is a real number no smaller than 1, and in this paper 1 is chosen as the value of n. PEIDF_ij is given in equation (3):

PEIDF_ij = log(N / DF_i)    (3)

where N is the total number of documents and DF_i is the number of documents in which the ith path element appears. The PESSW is prudently calculated to correctly reflect the structural semantic similarity. Table II shows the PEWF, PEIDF, and PESSW for the sample trees in Figure 3.
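Equations (1)-(3) can be sketched as follows. This is an illustrative implementation of the weighting scheme, not the authors' code; the function name and input layout are ours, and we use the natural logarithm, which reproduces the log(3/1) ≈ 1.1 values seen in Table II:

```python
import math

def pessw(path_freqs, x_levels, n=1):
    """Compute PESSW = PEWF * PEIDF for a small corpus.

    path_freqs: one dict per document mapping path element -> freq.
    x_levels:   path element -> x, level number of its highest tag
                (root = 1).
    n:          level-weight exponent (the paper uses n = 1).
    """
    N = len(path_freqs)
    # document frequency DF of each path element
    df = {}
    for doc in path_freqs:
        for p in doc:
            df[p] = df.get(p, 0) + 1
    out = []
    for doc in path_freqs:
        w = {}
        for p, freq in doc.items():
            pewf = freq * (1.0 / x_levels[p]) ** n   # equation (2)
            peidf = math.log(N / df[p])              # equation (3)
            w[p] = pewf * peidf                      # equation (1)
        out.append(w)
    return out

# Three documents; /B/I/T/D occurs once in doc1 only, as in Table II.
docs = [{"/B/I/T/D": 1, "/I/T/D": 1}, {"/B/I/A/F": 1}, {"/M": 1}]
levels = {"/B/I/T/D": 1, "/I/T/D": 2, "/B/I/A/F": 1, "/M": 1}
weights = pessw(docs, levels)
print(round(weights[0]["/B/I/T/D"], 2))  # -> 1.1
```

For /B/I/T/D the highest tag is the root (x = 1) and the path occurs in one of three documents, so PESSW = 1 × log(3) ≈ 1.1, matching the first row of Table II.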
| Path Element | PEWF doc1 | PEWF doc2 | PEWF doc3 | PEIDF doc1 | PEIDF doc2 | PEIDF doc3 | PESSW doc1 | PESSW doc2 | PESSW doc3 |
|---|---|---|---|---|---|---|---|---|---|
| /B/I/T/D | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 |
| /B/I/A/F | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /B/I/A/L | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /M/I/A/L | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 |
| /B/I/T/ | 2.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.81 | 0.41 | 0.0 |
| /B/I/A | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /B/I/ | 2.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.81 | 0.41 | 0.0 |
| /B | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /M/I/T | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 |
| /M | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 |
| /M/I | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 2.2 |
| /I/T/D | 0.5 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| /I/A/F | 0.5 | 0.5 | 0.0 | 0.41 | 0.41 | 0.0 | 0.21 | 0.21 | 0.0 |
| /I/A/L | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /I/T | 1.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /I/A | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /I | 1.0 | 0.5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /T/D | 0.33 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 0.33 | 0.0 | 0.0 |
TABLE II. AN EXAMPLE OF PEWF, PEIDF AND PESSW
Figure 3: XML documents example
Table I. Path elements example

| PEL1 | PEL2 | PEL3 | PEL4 |
|---|---|---|---|
| /B/I/T/D | /I/T/D | /T/D | D |
| /B/I/A/F | /I/A/F | /A/F | F |
| /B/I/A/L | /I/A/L | /A/L | L |
| /M/I/A/L | /I/T | /T |  |
| /B/I/T/ | /I/A | /A |  |
| /B/I/A | /I |  |  |
| /B/I/ |  |  |  |
| /B |  |  |  |
| /M/I/T |  |  |  |
| /M |  |  |  |
| /M/I |  |  |  |
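The PEL_i columns of Table I can be generated by enumerating every root-to-leaf tag path and taking its level-i suffix. A sketch (the helper names are ours; note that Table I writes the PEL4 leaf entries without a leading slash, while this sketch keeps the slash for uniformity):

```python
import xml.etree.ElementTree as ET

def root_to_leaf_paths(node, prefix=""):
    """Yield every root-to-leaf tag path of the XML tree."""
    path = prefix + "/" + node.tag
    children = list(node)
    if not children:
        yield path
    for child in children:
        yield from root_to_leaf_paths(child, path)

def path_elements(xml_text):
    """Group path elements by PEL_i: the suffix of each root-to-leaf
    path starting at tree level i (PEL_1 is the full path)."""
    root = ET.fromstring(xml_text)
    pels = {}
    for full in root_to_leaf_paths(root):
        tags = full.strip("/").split("/")
        for i in range(len(tags)):
            pels.setdefault(i + 1, set()).add("/" + "/".join(tags[i:]))
    return pels

pels = path_elements("<B><I><T><D/></T><A><F/><L/></A></I></B>")
print(sorted(pels[1]))  # ['/B/I/A/F', '/B/I/A/L', '/B/I/T/D']
print(sorted(pels[3]))  # ['/A/F', '/A/L', '/T/D']
```

Every suffix of every root-to-leaf path becomes one dimension of the PEVSM vector, which is what lets the similarity measure see partial structural matches between documents.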
Table II (continued):

| Path Element | PEWF doc1 | PEWF doc2 | PEWF doc3 | PEIDF doc1 | PEIDF doc2 | PEIDF doc3 | PESSW doc1 | PESSW doc2 | PESSW doc3 |
|---|---|---|---|---|---|---|---|---|---|
| /A/F | 0.33 | 0.33 | 0.0 | 0.41 | 0.41 | 0.0 | 0.14 | 0.14 | 0.0 |
| /A/L | 0.33 | 0.33 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /T | 0.67 | 0.33 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /A | 0.33 | 0.33 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| D | 0.25 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 0.27 | 0.0 | 0.0 |
| F | 0.25 | 0.25 | 0.0 | 0.41 | 0.41 | 0.0 | 0.1 | 0.1 | 0.0 |
| L | 0.25 | 0.25 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
Let d_x and d_y be two vectors which represent the XML documents doc_x and doc_y. Cosine similarity is defined through the angle θ between the two vectors and is quantified by equations (4) and (5):

cos θ = (d_x · d_y) / (||d_x|| ||d_y||)    (4)

sim(doc_x, doc_y) = (d_x · d_y) / (||d_x|| ||d_y||) = Σ_{k=1..t} w_kx w_ky / ( sqrt(Σ_{k=1..t} w_kx^2) · sqrt(Σ_{k=1..t} w_ky^2) )    (5)

where d_x = (w_1x, w_2x, ..., w_tx) is the weight vector of doc_x, d_y = (w_1y, w_2y, ..., w_ty) is the weight vector of doc_y, and t is the total number of path elements in d_x and d_y respectively [16].

Fuzzy C-Means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method, developed by [17] and improved by [18], [19], is frequently used in pattern recognition. It is based on the minimization of the following objective function:

J_m = Σ_{i=1..N} Σ_{j=1..C} u_ij^m ||d_i - c_j||^2,  1 ≤ m < ∞

where m is any real number greater than 1, u_ij is the degree of membership of d_i in cluster j, d_i is the ith d-dimensional measured datum, c_j is the d-dimensional center of cluster j, and ||·|| is any norm expressing the dissimilarity between a measured datum and a center.

Fuzzy partitioning is carried out by means of an iterative optimization of the objective function shown above, with the membership u_ij and the cluster centers c_j updated by:

u_ij = 1 / Σ_{k=1..C} ( ||d_i - c_j|| / ||d_i - c_k|| )^(2/(m-1))

c_j = Σ_{i=1..N} u_ij^m d_i / Σ_{i=1..N} u_ij^m

The iteration stops when max_ij { |u_ij^(k+1) - u_ij^(k)| } < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m.

Data Dimension Reduction (DDR) Via QR Factorization

Let us deal with n XML documents whose dimension is m, such that m ≥ n. We compute the QR factorization of the XML document matrix D ∈ R^(m×n):

D = QR = [Q1 Q2] [R1; 0] = Q1 R1,

where Q = [Q1 Q2] ∈ R^(m×m) is an orthogonal matrix and R1 ∈ R^(n×n) is an upper triangular matrix. Then Q1 ∈ R^(m×n) can be considered a dimensionality transformation matrix when m > n, and the lower-dimensional representation of a vector x ∈ R^(m×1) can be computed as Q1^T x ∈ R^(n×1). Thus, the lower-dimensional representation of each XML document d_i is Q1^T d_i = r_i, where d_i is the ith column of D and r_i is the ith column of R1. We can obtain the cosine similarities between two XML documents (d_i and d_j) in the original dimensional space from R1 as

cos(d_i, d_j) = (d_i^T Q1 Q1^T d_j) / (||Q1^T d_i||_2 ||Q1^T d_j||_2) = (r_i^T r_j) / (||r_i||_2 ||r_j||_2),

where r_j is the jth column of R1 ∈ R^(n×n). We refer to this method as DDR-QR.

Our Proposed DDR QR-FCM Algorithm

As described above, from the QR factorization we have the original document vectors D_PESSW = PESSW; the document vectors Q^T D = R (from the economic QR factorization); the document vectors Q_k^T D = R_k (from the QR factorization at the rank k of D_PESSW); and the document vectors Q̃^T D = R̃ (from the QR factorization at an efficient low rank of D_PESSW). Each of these, derived from the XML documents, is taken as the QR-FCM input data and then goes through the clustering.

(Figure: the QR-FCM clustering algorithm, taking as input the vectors PESSW, r = Q^T d, r_k = Q_k^T d with k = rank(d), and r̃ = Q̃^T d with k = low-rank(d).)
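The DDR-QR projection and the FCM iteration above can be combined into a small end-to-end sketch. This is not the authors' code: the data are synthetic, the FCM routine is a bare-bones version of the updates above, and it merely illustrates that clustering the QR-reduced columns r_i = Q1^T d_i gives the same partition as clustering the original columns, since the orthonormal projection preserves the relevant distances and cosines:

```python
import numpy as np

def fcm(X, c=2, fuzz=2.0, iters=200, eps=1e-6, seed=0):
    """Minimal fuzzy C-means on the columns of X (d x N), iterating
    the u_ij and c_j updates until max |u^(k+1) - u^(k)| < eps."""
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    U = rng.random((N, c))
    U /= U.sum(axis=1, keepdims=True)              # fuzzy memberships
    for _ in range(iters):
        Um = U ** fuzz
        centers = (X @ Um) / Um.sum(axis=0)        # c_j update (d x c)
        dist = np.linalg.norm(X[:, :, None] - centers[:, None, :], axis=0)
        p = np.maximum(dist, 1e-12) ** (-2.0 / (fuzz - 1.0))
        Unew = p / p.sum(axis=1, keepdims=True)    # u_ij update
        if np.abs(Unew - U).max() < eps:
            return Unew
        U = Unew
    return U

rng = np.random.default_rng(1)
# Toy PESSW-like matrix: 50 path elements x 8 documents drawn from
# two clearly separated groups.
m = 50
A = rng.random((m, 4)) + np.r_[np.ones(25), np.zeros(25)][:, None] * 5
B = rng.random((m, 4)) + np.r_[np.zeros(25), np.ones(25)][:, None] * 5
D = np.hstack([A, B])

# DDR step: economic QR, D = Q1 R1; the reduced documents are the
# columns of R = Q1^T D (8-dimensional instead of 50-dimensional).
Q1, R1 = np.linalg.qr(D)                           # "reduced" mode
R = Q1.T @ D

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cosine similarity is preserved by the orthonormal projection.
assert abs(cos(D[:, 0], D[:, 5]) - cos(R[:, 0], R[:, 5])) < 1e-8

# Clustering the reduced vectors yields the same partition as the
# full-dimensional data (up to a swap of cluster labels).
labels_full = fcm(D).argmax(axis=1)
labels_red = fcm(R).argmax(axis=1)
same = (labels_full == labels_red).all() or (labels_full == 1 - labels_red).all()
print(same)
```

Because Q1 has orthonormal columns and every document and cluster center lies in the column space of D, Euclidean distances between documents and centers are identical in both spaces, which is why the reduced clustering cannot diverge from the full one.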


EXPERIMENT RESULTS

In Figure 4, we show the occupied space (k) of the variant document vectors D_PESSW, Q^T D, Q_k^T D, and Q̃^T D from 400, 800, 1200, 1600, and 2000 XMLs separately. When comparing D_PESSW with Q̃^T D on 2000 XMLs, we could save a great deal of space, and, importantly, the clustering result obtained with QR-FCM would be unaffected.

In Figure 5, we show the CPU executing time (ms) taken to run QR-FCM on the variant document vectors D_PESSW, Q^T D, Q_k^T D, and Q̃^T D from 400, 800, 1200, 1600, and 2000 XMLs separately. When comparing D_PESSW with Q̃^T D on 2000 XMLs, we could save a great deal of time, and, importantly, the clustering result obtained with QR-FCM still remains correct.
Table III. Percentage of space (k) and CPU time (ns) saved in DDR-QR.

| Variant vectors | 2000 XMLs: space saved | 2000 XMLs: CPU saved | 1600 XMLs: space saved | 1600 XMLs: CPU saved |
|---|---|---|---|---|
| Q^T D vs Q_k^T D | 70% | 34% | 70% | 34% |
| Q^T D vs Q̃^T D | 85% | 69% | 85% | 69% |
| Q_k^T D vs Q̃^T D | 50% | 53% | 51% | 53% |
Figure 4: Space of the variant vectors from different numbers of XMLs
Table IV. Percentage of space (k) and CPU time (ns) saved relative to D_PESSW (thin QR).

| Variant vectors | 2000 XMLs: space saved | 2000 XMLs: CPU saved | 1600 XMLs: space saved | 1600 XMLs: CPU saved |
|---|---|---|---|---|
| D_PESSW vs Q^T D | 81% | 50% | 81% | 50% |
| D_PESSW vs Q_k^T D | 94% | 67% | 94% | 68% |
| D_PESSW vs Q̃^T D | 97% | 85% | 97% | 86% |


In Table III, we show the percentage of space saved when comparing Q^T D with Q_k^T D, Q^T D with Q̃^T D, and Q_k^T D with Q̃^T D, on running QR-FCM using 2000 XMLs and 1600 XMLs from 5 DTDs. We also show the percentage of CPU executing time saved for the same comparisons. On comparing Q^T D with Q_k^T D, we find that using Q_k^T D instead of Q^T D on 2000 XMLs for QR-FCM saves the percentages of space (k) and CPU time reported in Table III. In Table IV, we show the percentages of space and CPU time (ns) saved when comparing the original document vectors D_PESSW with Q^T D, Q_k^T D, and Q̃^T D on running QR-FCM using 2000 XMLs and 1600 XMLs from 5 DTDs. When comparing D_PESSW with Q̃^T D, we find that using Q̃^T D instead of D_PESSW on 2000 XMLs for QR-FCM saves the percentages of space (k) and CPU running time reported in Table IV.

Figure 5: CPU time (ms) for executing QR-FCM on the variant vectors

CONCLUSION

The original XML documents D_N = [d1, d2, ..., dN] are modeled on the vector space model according to the path elements of each document, that is, D_PESSW = PESSW; a QR factorization was then conducted on D_PESSW. We derived D_PESSW = QR, or Q^T D = R; we then took r_k = Q_k^T d at the rank of D_PESSW, and finally r̃ = Q̃^T d at the low rank of the QR on D_PESSW. We passed the four resulting vectors (D_PESSW, R, R_k, and R̃) into the QR-FCM clustering algorithm to attain the clustering result. In the experiments of Section IV, we found the same clustering result from the variant D_PESSW, R, R_k, and R̃ vectors. From the practical experiment results, we conclude that using the low-rank vector R̃ instead of the original PESSW document vectors not only saves space on the input vectors but also takes less time to cluster the documents.
REFERENCES

[1] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Int. Conf. on Data Engineering (ICDE), 2001.
[2] J. H. Hwang, K. H. Ryu, "A New XML Clustering for Structural Retrieval," Int. Conf. on Conceptual Modeling, 2004.
[3] J. H. Hwang, K. H. Ryu, "Clustering and Retrieval of XML Documents by Structure," Computational Science and Its Applications (ICCSA), 2005.
[4] W. Lian, D. W. Cheung, "An Efficient and Scalable Algorithm for Clustering XML Documents by Structure," IEEE Transactions on Knowledge and Data Engineering, 2004.
[5] W. F. Massy, "Principal components regression in exploratory statistical research," J. Amer. Statist. Assoc., vol. 60, pp. 234-246, 1965.
[6] W. S. Torgerson, Theory and Methods of Scaling. New York: Wiley, 1958.
[7] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[8] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, 2000.
[9] L. K. Saul and S. T. Roweis, "Think globally, fit locally: Unsupervised learning of low dimensional manifolds," Journal of Machine Learning Research, vol. 4, pp. 119-155, 2003.
[10] D. L. Donoho and C. E. Grimes, "Hessian eigenmaps: locally linear embedding techniques for high-dimensional data," Proc. Natl. Acad. Sci. USA, vol. 100, pp. 5591-5596, 2003.
[11] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[12] Z. Zhang and H. Zha, "Principal manifolds and nonlinear dimension reduction via tangent space alignment," SIAM Journal of Scientific Computing, vol. 26, no. 1, pp. 313-338, 2004.
[13] H. Kim, H. Park, and H. Zha, "Distance preserving dimension reduction using the QR factorization or the Cholesky factorization," Proc. 7th IEEE Int. Symposium on Bioinformatics and Bioengineering (BIBE 2007), vol. I, pp. 263-269.
[14] T. Dalamagas, T. Cheng, K. J. Winkel, T. Sellis, "A Methodology for Clustering XML Documents by Structure," Information Systems, vol. 31, no. 3, pp. 187-228, 2006.
[15] J. Gao and J. Zhang, "Clustered SVD strategies in latent semantic indexing," Inf. Process. Manag., vol. 41, no. 5, pp. 1051-1063, 2005.
[16] M. W. Berry and A. P. Shakhina, "Computing sparse reduced-rank approximations to sparse matrices," ACM Trans. Math. Software, vol. 31, no. 2, pp. 252-269, 2005.
[17] J. C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," Journal of Cybernetics, vol. 4, pp. 95-104, 1974.
[18] J. C. Bezdek, "Numerical Taxonomy with Fuzzy Sets," J. Math. Biol., vol. 1, pp. 57-71, 1974.
[19] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. Thesis, Cornell University, 1973.