**Open Access**
**Authors:** Hsu-Kuang Chang
**Paper ID:** IJERTV8IS060686
**Volume & Issue:** Volume 08, Issue 06 (June 2019)
**Published (First Online):** 02-07-2019
**ISSN (Online):** 2278-0181
**Publisher Name:** IJERT
**License:** This work is licensed under a Creative Commons Attribution 4.0 International License

#### Data Dimension Reduction for Clustering Semi-Structured Documents using QR Fuzzy C-Mean (QR-FCM)

Hsu-Kuang Chang

Department of Information Engineering, I-Shou University, Kaohsiung, Taiwan

Abstract: The rapid growth of XML adoption has created an urgent need for a proper representation of semi-structured documents, in which the semantic structural information of a document must be taken into account to support more precise document analysis. To efficiently analyze the information represented in XML documents, research on XML document clustering is actively in progress. The key issue is how to devise a similarity measure for clustering XML documents; since XML documents have a hierarchical structure, it is not appropriate to cluster them using a general document similarity measure. Data dimension reduction (DDR) plays an important role in handling massive quantities of high-dimensional data such as large collections of semantic structural documents. Having projected XML documents onto the lower dimensional space obtained from the DDR/QR, our proposed QR Fuzzy C-Mean method (which we call QR-FCM) executes the document-analysis clustering algorithm. The DDR can substantially reduce the computing time and/or memory requirement of a given document-analysis clustering algorithm, especially when we need to run the algorithm many times to estimate parameters or to search for a better solution.

Keywords: QR; DDR; PESSW; FCM; QR-FCM

INTRODUCTION

An XML document, which is semi-structured data, has a hierarchical structure. Therefore, rather than using the similarity measures of general document clustering techniques as they are, a new similarity measure which considers the semantic and structural information of an XML document must be investigated. However, some XML clustering methods use a similarity measure that takes only the structural information of XML documents into account. Hwang proposes a clustering method which extracts a typical structure of the maximum frequency pattern using the PrefixSpan algorithm [1] on XML documents [2, 3]. However, since such a typical structure extracted from XML documents is not the only structure which represents the XML document itself, it cannot represent the whole document corpus, and there is an accuracy issue with the resulting similarity. Lian summarizes XML documents into an S-graph, which is a structural graph, and proposes that the distance between S-graphs be used for clustering [4]. However, they do not consider semantic information in XML documents, since they focus only on structural information.

Since dimension reduction is one of the fundamental methods of data analysis, there have been a great many studies on effective and efficient dimension reduction algorithms. There are linear dimension reduction algorithms, including principal component analysis (PCA) [5] and multidimensional scaling (MDS) [6]. There are also nonlinear dimension reduction (NLDR) algorithms, including Isomap [7], locally linear embedding (LLE) [8], [9], Hessian LLE [10], Laplacian eigenmaps [11], local tangent space alignment (LTSA) [12], and a distance preserving dimension reduction based on the QR factorization (DPDR/QR) [13]. These methods cover a variety of areas such as biomedical image recognition, biomedical text data mining, and biological data analysis.

The rest of this paper is organized as follows. In Section II, we introduce the prepared XML documents on a vector space model. In Section III, we show the DDR based on the QR factorization and the DDR QR-FCM. Section IV presents the experimental results illustrating properties of the proposed DDR methods. A summary is made in Section V.

PREPARATION OF SEMANTIC-BASED XML DOCUMENTS

In this section, we first introduce the pre-processing steps for incorporating hierarchical information when encoding the XML tree paths. This is based on the preorder tree representation (PTR) [14] and will be introduced after a brief review of how to generate an XML tree from an XML document.

Figure 1 illustrates an example of a structural summary. By applying the nested reduction phase to tree T1, we derive tree T2, in which there are no nested nodes. By applying repetition reduction to tree T2, we derive tree T3, which is the structural summary tree without nested or repeated nodes. Once the trees have been compacted using structural summaries, nesting and repetition are reduced.

Figure 1: Nested and repeated nodes extraction
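As a rough illustration of the repetition-reduction step, the following Python sketch (our own illustrative code, not from the paper, and omitting the separate nested-reduction phase) recursively summarizes a tree, keeping a single copy of structurally identical sibling subtrees:

```python
def summarize(tag, children):
    """Repetition reduction sketch: summarize each child subtree first,
    then keep only one copy of identical sibling subtrees."""
    seen, kept = set(), []
    for child in children:
        s = summarize(*child)   # summarize the subtree bottom-up
        key = repr(s)           # hashable signature of the summarized subtree
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return (tag, kept)

# A tree like T2 in Figure 1: a book with two structurally identical intros.
t2 = ("book", [("intro", [("title", [])]), ("intro", [("title", [])])])
print(summarize(*t2))   # ('book', [('intro', [('title', [])])])
```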

Now the XML document is modeled as an XML tree T = (V, E), where V = {v1, v2, ...} is the set of vertices and E = {(v1, v2) | v1, v2 ∈ V} is the set of edges. As an example, Figure 2 depicts a sample XML tree containing some information about a collection of books. The book consists of intro tags, each comprising title, author, and date tags. Each author contains fname and lname tags; each date includes year and month tags. The left side of Figure 2 shows only the first letter of each tag for simplicity.
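To make the tag-path notion concrete, the following sketch (illustrative code; the tag names follow Figure 2's one-letter abbreviations) extracts all root-to-leaf tag paths from an XML document using Python's standard library:

```python
import xml.etree.ElementTree as ET

# A toy document shaped like the left tree of Figure 2 (one-letter tags).
XML = "<B><I><T><D/></T><A><F/><L/></A></I></B>"

def tag_paths(elem, prefix=""):
    """Collect every root-to-leaf tag path (the PEL-1 path elements)."""
    path = prefix + "/" + elem.tag
    children = list(elem)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(tag_paths(child, path))
    return paths

paths = tag_paths(ET.fromstring(XML))
print(paths)   # ['/B/I/T/D', '/B/I/A/F', '/B/I/A/L']
# PEL-i keeps the suffix starting at tree level i, e.g. PEL-2 of /B/I/T/D:
print("/" + "/".join(paths[0].split("/")[2:]))   # -> /I/T/D
```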

PATH ELEMENT OF THE VECTOR SPACE MODEL (PEVSM)

The vector space model represents a document as a vector whose elements are the weights of the path elements within the document. To calculate the weight of each path element within the document, a term frequency and inverse document frequency (TF-IDF) method is used [15].

Path Element Structural Semantic Weight (PESSW)

The XML document has a hierarchical structure, and this structure is organized with tag paths that represent document characteristics which can predict the contents of the XML document. Strictly speaking, these are the semantic structural characteristics of the XML document. In this paper, we propose a new method for calculating the similarity using all of the tag paths of the XML tree representing the semantic structural information of the XML document. From now on, a tag path is termed a path element. Table I shows the path elements obtained from the XML document in Figure 3. PEL-i represents the path elements extracted from the XML document tree from the ith tree level to the leaf node. For example, PEL-1 means the path element from the root level (level 1) to the leaf node, and PEL-2 means the path element from level 2 to the leaf node, respectively.

Figure 2: Example of XML Document

We define the PESSW (Path Element Structural Semantic Weight), which calculates the weight of a path element in an XML document. The PESSW is the PEWF (Path Element Weighted Frequency) multiplied by the PEIDF (Path Element Inverse Document Frequency). The PESSW_ij of the jth path element in the ith document is shown in equation (1). In this paper, we use PESSW and DPESSW interchangeably.

PESSW_ij = PEWF_ij × PEIDF_ij    (1)

PEWF_ij is shown in equation (2):

PEWF_ij = freq_ij × (1/x)^n    (2)

where freq_ij is the frequency of the jth path element in the ith document, and it is multiplied by the level weight (1/x)^n in order to consider the semantic importance of a path element in a document. Here x refers to the level number of the highest tag of the tag path: the level number of the root tag is 1, that of a tag directly under the root tag is 2, and so on. n is a real number, and in this paper 1 is chosen as the value of n. PEIDF_ij is shown in equation (3):

PEIDF_ij = log(N / DF_j)    (3)

where N is the total number of documents and DF_j is the number of documents in which the jth path element appears. The PESSW is prudently calculated to correctly reflect the structural semantic similarity. Table II shows the PEWF, PEIDF, and PESSW for the sample trees in Figure 3.
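Under the definitions above (with n = 1 and a natural logarithm, which reproduces the rounded values in Table II), the weights can be sketched as:

```python
import math

N_DOCS = 3   # total number of documents, doc1..doc3 of Figure 3

def pewf(freq, x, n=1.0):
    """PEWF (equation (2)): freq * (1/x)^n, where x is the level number
    of the highest tag of the path element."""
    return freq * (1.0 / x) ** n

def peidf(df, n_docs=N_DOCS):
    """PEIDF (equation (3)): log(N / DF_j), where DF_j is the number of
    documents containing the path element."""
    return math.log(n_docs / df)

def pessw(freq, x, df):
    """PESSW (equation (1)) = PEWF * PEIDF."""
    return pewf(freq, x) * peidf(df)

# /B/I/T/D appears once in doc1, starts at level 1, occurs in 1 of 3 docs:
print(round(pessw(freq=1, x=1, df=1), 1))   # 1.1, as in Table II
# /I/T/D is the same path seen from level 2, so its PEWF is halved:
print(round(pewf(1, 2), 2))                 # 0.5
```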

Figure 3: XML Documents Example

Table I. Path elements example

| PEL-1 | PEL-2 | PEL-3 | PEL-4 |
|---|---|---|---|
| /B/I/T/D | /I/T/D | /T/D | D |
| /B/I/A/F | /I/A/F | /A/F | F |
| /B/I/A/L | /I/A/L | /A/L | L |
| /M/I/A/L | /I/T | /T | |
| /B/I/T/ | /I/A | /A | |
| /B/I/A | /I | | |
| /B/I/ | | | |
| /B | | | |
| /M/I/T | | | |
| /M | | | |
| /M/I | | | |

Table II. An example of PEWF, PEIDF and PESSW

| Path Element | PEWF doc1 | PEWF doc2 | PEWF doc3 | PEIDF doc1 | PEIDF doc2 | PEIDF doc3 | PESSW doc1 | PESSW doc2 | PESSW doc3 |
|---|---|---|---|---|---|---|---|---|---|
| /B/I/T/D | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 |
| /B/I/A/F | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /B/I/A/L | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /M/I/A/L | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 |
| /B/I/T/ | 2.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.81 | 0.41 | 0.0 |
| /B/I/A | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /B/I/ | 2.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.81 | 0.41 | 0.0 |
| /B | 1.0 | 1.0 | 0.0 | 0.41 | 0.41 | 0.0 | 0.41 | 0.41 | 0.0 |
| /M/I/T | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 |
| /M | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 1.1 |
| /M/I | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 2.2 |
| /I/T/D | 0.5 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| /I/A/F | 0.5 | 0.5 | 0.0 | 0.41 | 0.41 | 0.0 | 0.21 | 0.21 | 0.0 |
| /I/A/L | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /I/T | 1.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /I/A | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /I | 1.0 | 0.5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /T/D | 0.33 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 0.33 | 0.0 | 0.0 |
| /A/F | 0.33 | 0.33 | 0.0 | 0.41 | 0.41 | 0.0 | 0.14 | 0.14 | 0.0 |
| /A/L | 0.33 | 0.33 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /T | 0.67 | 0.33 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| /A | 0.33 | 0.33 | 0.33 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| D | 0.25 | 0.0 | 0.0 | 1.1 | 0.0 | 0.0 | 0.27 | 0.0 | 0.0 |
| F | 0.25 | 0.25 | 0.0 | 0.41 | 0.41 | 0.0 | 0.1 | 0.1 | 0.0 |
| L | 0.25 | 0.25 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Let dx and dy be two vectors which represent the XML documents docx and docy. The cosine similarity is defined by the angle θ between the two vectors and is quantified by equations (4) and (5):

cos θ = (dx · dy) / (|dx| |dy|),    (4)

that is,

sim(docx, docy) = (dx · dy) / (|dx| |dy|) = Σ_{k=1..t} w_kx w_ky / (√(Σ_{k=1..t} w_kx²) √(Σ_{k=1..t} w_ky²)),    (5)

where dx = (w_1x, w_2x, ..., w_tx) is the weight vector of docx, dy = (w_1y, w_2y, ..., w_ty) is the weight vector of docy, and t is the total number of path elements in dx and dy, respectively [16].
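Equation (5) amounts to the usual cosine of the two weight vectors; a minimal sketch (the sample weights are illustrative, not taken from Table II):

```python
def cosine_sim(dx, dy):
    """Cosine similarity of two path-element weight vectors (equation (5))."""
    dot = sum(wx * wy for wx, wy in zip(dx, dy))
    norm_x = sum(w * w for w in dx) ** 0.5
    norm_y = sum(w * w for w in dy) ** 0.5
    return dot / (norm_x * norm_y)

# Two PESSW vectors over a shared path-element vocabulary:
doc_x = [1.1, 0.41, 0.41, 0.0]
doc_y = [0.0, 0.41, 0.41, 1.1]
print(round(cosine_sim(doc_x, doc_y), 3))   # 0.217
```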

Fuzzy C-Means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method, developed by [17] and improved by [18], [19], is frequently used in pattern recognition. It is based on the minimization of the following objective function:

J_m = Σ_{i=1..N} Σ_{j=1..C} u_ij^m ||d_i − c_j||²,    1 ≤ m < ∞,

where m is any real number greater than 1, u_ij is the degree of membership of d_i in cluster j, d_i is the ith of the d-dimensional measured data, c_j is the d-dimensional center of cluster j, and ||·|| is any norm expressing the dissimilarity between the measured data and the center.
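The objective J_m above is minimized by alternating the standard membership and center updates; a compact pure-Python sketch (data, cluster count, and tolerances are illustrative, not the paper's experimental setup):

```python
import random

def fcm(points, c, m=2.0, eps=1e-4, max_iter=200, seed=0):
    """Minimal Fuzzy C-Means: alternate the center update
    c_j = sum_i u_ij^m d_i / sum_i u_ij^m and the membership update
    u_ij = 1 / sum_k (||d_i - c_j|| / ||d_i - c_k||)^(2/(m-1))
    until max_ij |u_ij^(k+1) - u_ij^(k)| < eps."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    u = []
    for _ in range(n):                 # random memberships, rows sum to 1
        row = [rng.random() for _ in range(c)]
        s = sum(row)
        u.append([x / s for x in row])
    centers = []
    for _ in range(max_iter):
        centers = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            tot = sum(w)
            centers.append([sum(w[i] * points[i][k] for i in range(n)) / tot
                            for k in range(dim)])
        new_u = []
        for i in range(n):
            d = [max(1e-12, sum((a - b) ** 2
                                for a, b in zip(points[i], centers[j])) ** 0.5)
                 for j in range(c)]
            new_u.append([1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                    for k in range(c)) for j in range(c)])
        delta = max(abs(new_u[i][j] - u[i][j])
                    for i in range(n) for j in range(c))
        u = new_u
        if delta < eps:
            break
    return u, centers

# Two well-separated 1-d groups; memberships become close to 0/1.
pts = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
u, centers = fcm(pts, c=2)
assert (u[0][0] > 0.9) == (u[3][0] < 0.1)
```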

Fuzzy partitioning is carried out by means of an iterative optimization of the objective function shown above, with the membership u_ij and the cluster centers c_j updated by:

u_ij = 1 / Σ_{k=1..C} (||d_i − c_j|| / ||d_i − c_k||)^(2/(m−1)),    c_j = (Σ_{i=1..N} u_ij^m d_i) / (Σ_{i=1..N} u_ij^m).

The iteration stops when max_ij { |u_ij^(k+1) − u_ij^(k)| } < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of J_m.

Data Dimension Reduction (DDR) via QR Factorization

Let us deal with n XML documents whose dimension is m, such that m ≫ n. We compute the QR factorization of the XML document matrix D ∈ R^(m×n):

D = QR = [Q1 Q2] [R1; 0] = Q1 R1,

where Q = [Q1 Q2] ∈ R^(m×m) is an orthogonal matrix and R1 ∈ R^(n×n) is an upper triangular matrix. Then Q1 ∈ R^(m×n) can be considered as a dimensionality transformation matrix when m > n, and the lower dimensional representation x̂ ∈ R^(n×1) of a vector x ∈ R^(m×1) can be computed as x̂ = Q1^T x. Thus, the lower dimensional representation of each XML document d_i is Q1^T d_i = r_i, where d_i is the i-th column of D and r_i is the i-th column of R1.
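The factorization step can be sketched in pure Python with classical Gram-Schmidt (illustrative only; a production implementation would use Householder reflections via a numerical library). The assertion checks the property stated above: r_i = Q1^T d_i preserves cosine similarities between documents.

```python
def qr_thin(cols):
    """Thin (economic) QR of an m x n matrix given as a list of n columns:
    D = Q1 R1, with Q1 having orthonormal columns and R1 upper triangular.
    Classical Gram-Schmidt; assumes linearly independent columns."""
    Q, n = [], len(cols)
    R = [[0.0] * n for _ in range(n)]
    for j, d in enumerate(cols):
        v = list(d)
        for i, q in enumerate(Q):
            R[i][j] = sum(a * b for a, b in zip(q, d))   # projection coeff.
            v = [a - R[i][j] * b for a, b in zip(v, q)]  # remove component
        R[j][j] = sum(a * a for a in v) ** 0.5
        Q.append([a / R[j][j] for a in v])
    return Q, R

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

# Three toy 5-dimensional document columns of D (m = 5, n = 3):
D = [[1.0, 0.0, 2.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 1.0, 0.0]]
Q1, R1 = qr_thin(D)
r = [[R1[i][j] for i in range(len(D))] for j in range(len(D))]  # R1 columns
# Cosine similarity in the reduced 3-d space matches the original 5-d space:
assert abs(cosine(D[0], D[1]) - cosine(r[0], r[1])) < 1e-9
```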

We can obtain the cosine similarities between two XML documents (d_i and d_j) in the original dimensional space from R1 as

cos(d_i, d_j) = (d_i^T Q1 Q1^T d_j) / (||Q1^T d_i||_2 ||Q1^T d_j||_2) = (r_i^T r_j) / (||r_i||_2 ||r_j||_2),

where r_j is the j-th column of R1 ∈ R^(n×n). We refer to this method as DDR-QR.

Our Proposed DDR QR-FCM Algorithm

As described in the previous section, from the QR factorization we have the original document vector DPESSW = PESSW, the document vector r = Q^T D (referring to the economic QR factorization), the document vector r_k = Q_k^T D (referring to the QR factorization of rank k of PESSW), and the document vector r̃ = Q̃^T D (referring to the QR factorization of efficient low rank of PESSW), all based on the XML documents. Each of these is taken as the QR-FCM input data and then goes through the clustering.

EXPERIMENT RESULT

In Figure 4, we show the occupied space (k) of the variant document vectors DPESSW, Q^T D, Q_k^T D, and Q̃^T D from 400, 800, 1200, 1600, and 2000 XMLs separately. When comparing the DPESSW with the Q̃^T D on 2000 XMLs, we could save a great deal of space, and importantly the clustering result where we used QR-FCM was unaffected.

In Figure 5, we show the CPU executing time (ms) to run QR-FCM on the variant document vectors DPESSW, Q^T D, Q_k^T D, and Q̃^T D from 400, 800, 1200, 1600, and 2000 XMLs separately. When comparing the DPESSW with the Q̃^T D on 2000 XMLs, we could save a great deal of time, and importantly the clustering result where we used QR-FCM still remains correct.

Figure 4: Space on the variant vector from different # XMLs

Table III. Percentage of space and CPU saved in DDR-QR.

| Variant vectors | 2000 XMLs: % saved space | 2000 XMLs: % saved CPU | 1600 XMLs: % saved space | 1600 XMLs: % saved CPU |
|---|---|---|---|---|
| Q^T D vs Q_k^T D | 70% | 34% | 70% | 34% |
| Q^T D vs Q̃^T D | 85% | 69% | 85% | 69% |
| Q_k^T D vs Q̃^T D | 50% | 53% | 51% | 53% |

Table IV. Percentage of space and CPU saved for DPESSW / thin QR.

| Variant vectors | 2000 XMLs: % saved space | 2000 XMLs: % saved CPU | 1600 XMLs: % saved space | 1600 XMLs: % saved CPU |
|---|---|---|---|---|
| DPESSW vs Q^T D | 81% | 50% | 81% | 50% |
| DPESSW vs Q_k^T D | 94% | 67% | 94% | 68% |
| DPESSW vs Q̃^T D | 97% | 85% | 97% | 86% |

Figure 5: CPU (ms) executing QR-FCM on the variant vector

In Table III, we show the percentage of space saved when comparing Q^T D with Q_k^T D, Q^T D with Q̃^T D, and Q_k^T D with Q̃^T D, on running QR-FCM using 2000 XMLs and 1600 XMLs from 5 DTDs. We also show the percentage of CPU executing time saved for the same comparisons. On comparing Q^T D with Q_k^T D, we find that using Q_k^T D instead of Q^T D on 2000 XMLs for the QR-FCM achieves the space (k) and CPU time (ms) savings reported in Table III.

In Table IV, we show the percentage of space and CPU time saved when comparing the original document vector DPESSW with Q^T D, Q_k^T D, and Q̃^T D on running QR-FCM using 2000 XMLs and 1600 XMLs from 5 DTDs. When comparing DPESSW with Q_k^T D, we find that using Q_k^T D instead of DPESSW on 2000 XMLs for the QR-FCM achieves the space (k) and CPU running time (ms) savings reported in Table IV.

CONCLUSION

The original XML documents DN = [d1, d2, .., dN] are modeled on the vector space model according to the path elements of each document, that is, DPESSW = PESSW; then a QR factorization is conducted on the DPESSW. We derived DPESSW = QR (or Q^T DPESSW = R), then took r_k = Q_k^T d with the rank of DPESSW, and finally r̃ = Q̃^T d with the low rank of the QR on DPESSW. We passed the four resulting vectors (DPESSW, r, r_k, and r̃) into the QR-FCM clustering algorithm to attain the clustering result. In the experiments of the previous section, we found the same clustering result from each of the variant vectors DPESSW, r, r_k, and r̃. From the practical experiment results, we conclude that using the low-rank vector r̃ instead of the original PESSW document vectors not only saves space on the input vector but also takes less time to cluster the documents.

REFERENCES

[1] J. Pei, J. Han, B. M. Asi, H. Pinto, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth," Int. Conf. on Data Engineering (ICDE), 2001.
[2] J. H. Hwang, K. H. Ryu, "A New XML Clustering for Structural Retrieval," International Conference on Conceptual Modeling, 2004.
[3] J. H. Hwang, K. H. Ryu, "Clustering and Retrieval of XML Documents by Structure," Computational Science and Its Applications (ICCSA), 2005.

[4] W. Lian, D. W. Cheung, "An Efficient and Scalable Algorithm for Clustering XML Documents by Structure," IEEE Computer Society Technical Committee on Data Engineering, 2004.
[5] W. F. Massay, "Principal components regression in exploratory statistical research," J. Amer. Statist. Assoc., vol. 60, pp. 234-246, 1965.
[6] W. S. Torgerson, Theory and Methods of Scaling. New York: Wiley, 1958.
[7] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, 2000.
[8] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, 2000.
[9] L. K. Saul and S. T. Roweis, "Think globally, fit locally: Unsupervised learning of low dimensional manifolds," Journal of Machine Learning Research, vol. 4, pp. 119-155, 2003.
[10] D. L. Donoho and C. E. Grimes, "Hessian eigenmaps: locally linear embedding techniques for high-dimensional data," Proc. Natl. Acad. Sci. USA, vol. 100, pp. 5591-5596, 2003.
[11] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373-1396, 2003.
[12] Z. Zhang and H. Zha, "Principal manifolds and nonlinear dimension reduction via tangent space alignment," SIAM Journal of Scientific Computing, vol. 26, no. 1, pp. 313-338, 2004.
[13] H. Kim, H. Park, and H. Zha, "Distance preserving dimension reduction using the QR factorization or the Cholesky factorization," Proc. 7th IEEE International Symposium on Bioinformatics & Bioengineering (BIBE 2007), vol. I, pp. 263-269, 2007.
[14] T. Dalamagas, T. Cheng, K. J. Winkel, and T. Sellis, "A Methodology for Clustering XML Documents by Structure," Information Systems, vol. 31, no. 3, pp. 187-228, 2006.
[15] J. Gao and J. Zhang, "Clustered SVD strategies in latent semantic indexing," Inf. Process. Manag., vol. 41, no. 5, pp. 1051-1063, 2005.
[16] M. W. Berry and A. P. Shakhina, "Computing sparse reduced-rank approximation to sparse matrices," ACM Trans. Math. Software, vol. 31, no. 2, pp. 252-269, 2005.
[17] J. C. Dunn, "Well Separated Clusters and Optimal Fuzzy Partitions," Journal of Cybernetics, vol. 4, pp. 95-104, 1974.
[18] J. C. Bezdek, "Numerical Taxonomy with Fuzzy Sets," J. Math. Biol., vol. 1, pp. 57-71, 1974.
[19] J. C. Bezdek, Fuzzy Mathematics in Pattern Classification, Ph.D. thesis, Cornell University, 1973.