Multi-Modal Knowledge Fusion Method Based on Label Alignment


M Pradeep1, S Hanumantha Rao2

Associate Professor1, Professor2
Department of Electronics and Communication Engineering

Shri Vishnu Engineering College for Women, Bhimavaram, India

Abstract. To bridge the "semantic gap" between text, image, video and other modalities, we propose a multi-modal knowledge fusion method based on label alignment. According to the characteristics of text, picture, and video data, the method performs semantic annotation for each modality, constructs a text label set for each modality, calculates the similarity between the multi-modal label sets using a longest-common-subsequence semantic similarity measure, and selects the label pairs with the highest confidence to fuse data from different modalities. To verify the effectiveness of the method, we built a data set containing the three modalities of text, picture, and video for the vertical field of party building. The proposed method was used to fuse the multi-modal data on this data set, and the experimental results were analyzed. The accuracy of the method is 97.3%, which demonstrates its effectiveness.

Keywords. Label alignment, multimodal data, knowledge fusion, semantic annotation, label similarity.

INTRODUCTION

In the era of big data, multi-modal data such as text, images, and video on the Internet have shown explosive growth. Each modality has its own specific information and statistical characteristics; different modalities express knowledge to varying degrees, but they usually share high-level concepts and semantic information. Therefore, research on feature representation and learning methods for multiple modalities, which overcomes the impact of heterogeneity on multi-modal representation, eliminates redundancy in multi-modal heterogeneous data, and realizes the collaborative representation of multi-modal data, has profound significance for knowledge representation[1]. Moreover, merging intra-modal information and cross-modal complementary information to obtain comprehensive concept-level features through the fusion of different modalities can improve the performance of practical tasks such as retrieval and classification.

In the research of multi-modal data, the fusion method is a core issue; its purpose is to integrate the characteristics of multi-modal data to obtain a consistent, common model output[2]. Multimodal fusion can use the complementary information in multimodal data to discover the dependence of knowledge across modalities. Existing multi-modal data fusion methods can be divided into phase-based fusion algorithms, feature-based fusion algorithms, and semantic-based fusion algorithms[3]. The phase-based fusion algorithm is an early approach that uses different modal data at different stages of the data mining task to complete the corresponding fusion analysis; this type of method does not require consistency between the different modalities. The feature-based fusion algorithm originally refers to concatenating the features extracted from different data sets into a single feature vector, and then completing tasks such as clustering, classification, and prediction on the concatenated features. On this basis, researchers have proposed improved algorithms, such as using deep neural networks to learn unified feature representations of different modal data[6,7]. The literature[8] gives two basic multi-modal feature fusion models based on deep networks, in which a coupled multimodal Deep Auto Encoder (DAE) completes cross-modal feature learning. Srivastava et al.[9] proposed a multi-modal Deep Boltzmann Machine (DBM) model, which combines picture and text features to complete data classification and retrieval. The disadvantage of this type of method is that, because the representation, distribution and density of different modalities may differ, simple attribute concatenation ignores the correlation between the modalities and only regards a feature as a real-valued number or a categorical value[4].

Different from feature-based fusion, the semantic-based fusion algorithm refers to understanding the meaning of each modality's data and the relationships between different modal features, and abstracting the semantics of the different modalities during the fusion process to complete cross-modal data fusion. Existing semantic-based fusion algorithms can be roughly divided into co-training methods, multiple kernel learning methods, subspace learning methods, probability-dependent methods and transfer learning methods. Since subspace learning is based on the shared semantics of the multi-modal descriptions of the data, and can project multiple high-dimensional feature sets into the same low-dimensional, semantically related space to alleviate the curse of dimensionality, it has attracted more and more attention[18]. To study the correlation between different modal data, the key is to construct a common feature subspace, map the feature vectors of the different modalities into this space, and then measure the similarity of the different modalities' data in this space. The earliest multi-modal shared subspace learning algorithms use Canonical Correlation Analysis (CCA)[10-12] to maximize the correlation between two modalities, learn the maximally correlated subspace and output the projection matrix corresponding to each modality. To handle the non-linearity and non-orthogonality of the original data, scholars have proposed Kernelized Canonical Correlation Analysis (KCCA)[13], Mixed Probabilistic Canonical Correlation Analysis (Mix PCCA)[14], Deep Canonical Correlation Analysis (DCCA)[15] and other methods. Liang et al.[16] proposed a group-invariant cross-modal subspace learning method, which learns not only the co-occurrence information of sample pairs but also the group co-occurrence relationship between different modalities while learning the projection subspace; this high-level semantic correspondence effectively improves the robustness of the latent subspace and the accuracy of retrieval. Mahadevan et al.[17] proposed learning low-dimensional embeddings while maintaining the local geometric structure in different modalities, which effectively improves the stability of the embedding. Commonly used subspace mapping models include the Bilinear Model (BLM)[19], Partial Least Squares (PLS)[20-22] and so on.

In addition, some methods map all modal knowledge into a common semantic space and then align entities between different modalities by calculating the semantic similarity between the different modal knowledge in that common subspace. For example, Zhu et al.[23] proposed a joint knowledge embedding method to achieve entity alignment, and improved the alignment performance with an iterative training method[24]. Based on the internal structure information (entities and relationships) of heterogeneous knowledge graphs, this method first uses PTransE (path-based TransE)[25] to learn distributed representations of the different knowledge graphs: the entities and relationships of the heterogeneous knowledge graphs are encoded into a unified, continuous low-dimensional semantic space, and entities are aligned according to their semantic distance in the joint space. The experiments of Lin et al.[26] on PTransE show that considering relationship paths can greatly improve knowledge representation learning and the performance of knowledge graph completion tasks. JAPE[27] uses joint representation learning to directly embed the entities and relationships of different knowledge graphs into a unified vector space, converting the matching of instances across knowledge graphs into the computation of distances between vector representations; IPTransE only iteratively updates instance matches, while JAPE additionally uses attributes and textual description information to enhance instance representation learning. EnAli[28] is an unsupervised method for matching entities across two or more heterogeneous data sources. Wu et al.[29] constructed a unified fusion framework that trains a domain-specific sentiment classifier for the target domain by fusing sentiment knowledge from multiple sources.

In view of the general problem that existing multi-modal fusion methods cannot effectively model the shared semantic information in multi-modal samples, this paper studies multi-modal data fusion in different scenarios. Based on the relevant characteristics of multi-modal data, we propose a multi-modal knowledge fusion framework based on label alignment, which fully considers the relationships between modalities. By standardizing and aligning the labels of different modalities, the potential shared information of each modality is learned collaboratively, and the accuracy of multi-modal data fusion is effectively improved. The article gives the detailed design and optimization process, and verifies the performance of the proposed knowledge fusion model on a multi-modal data set in the vertical field of party building, which provides a reference for research on multimodal fusion.

  1. Structure

This paper focuses on multi-modal fusion tasks. Multi-modal fusion refers to integrating information of different forms (such as pictures, videos, text, audio, etc.) from two or more modalities: independent models are used to extract the instance sub-components of each modality, the correspondences between the instance sub-components of the different modalities are found, and the modalities are aligned. A consistent interpretation or description of the multi-modal data is then obtained through calculation and used to complete the analysis and recognition task; that is, through data assimilation, the given multi-modal data are fused to dig out more potential information.

    1. Multimodal fusion

Given the multi-modal data set M = {M1, M2, ..., Mn}, M1, M2, ..., Mn refer to the n modalities in the set M, where Mi = {mi1, mi2, mi3, ...} and mi1, mi2, mi3, ... are the basic elements of modality Mi. If M = {M1, M2, M3} and the semantic space is S = {s1, s2, ..., sk}, where s1, s2, ..., sk are specific semantics, then multimodal fusion refers to obtaining, through fusion algorithms, multi-modal data pairs (m1, m2, m3) with the same semantics s, where m1 ∈ M1, m2 ∈ M2, m3 ∈ M3 and s ∈ S.

    2. Label alignment

      Given two data label sets A and B, find out all the labels in data set A (or B) that can be aligned to B (or A), namely:

Align(A, B) = {(a, b) | a ∈ A, b ∈ B, a can be aligned with b}
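As a minimal illustration of this definition, the following Python sketch enumerates the aligned pairs of two label sets; the `aligned` predicate is a hypothetical placeholder for the label-similarity test introduced later in the paper, not part of the original formulation:

```python
def align(A, B, aligned):
    """Return all label pairs (a, b) with a in A and b in B that satisfy
    the alignment predicate `aligned` (e.g. a similarity threshold)."""
    return {(a, b) for a in A for b in B if aligned(a, b)}

# Toy example with a trivial predicate: exact string match.
pairs = align({"party history", "national congress"},
              {"party history", "economic news"},
              aligned=lambda a, b: a == b)
print(pairs)  # {('party history', 'party history')}
```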

2. Multimodal data fusion method based on label alignment

2.1. Multimodal data fusion framework

The model proposed in this paper is divided into two parts. The first part is the construction of labeled multi-modal data sets, which completes the semantic labeling of the text, picture, and video data, extracts the labels of the heterogeneous data, performs a unified semantic representation, and outputs their semantic feature labels in the same format. The second part completes the data fusion process by using the text labels of the different modalities. The multi-modal data fusion framework is shown in Figure 1.

Figure 1. Multi-modal data fusion framework.

Figure 1 shows the two main tasks of this method: text, image, and video information extraction, and multi-modal information fusion.

    2.2. Text information extraction

Most existing text corpora are unstructured, and text information extraction extracts the required information fields from semi-structured and unstructured text. The text information extraction method used in this paper is a piecewise convolutional neural network (PCNN) model with an attention mechanism. The model is used to extract the main entities and their relationships from the corpus and to draw the important information out of the original data. Using a piecewise convolutional neural network for keyword extraction improves the semantics of the feature-vector representation, reduces the sparsity of the data dimensions, and effectively alleviates the problem of data noise. In addition, in the input text some words may carry more important semantic information than others; introducing an attention mechanism that focuses on key information and ignores irrelevant information reduces the interference of irrelevant information on semantic feature extraction, so that the information extracted by the model is more accurate. The specific process of text information extraction is shown in Figure 2.

Figure 2. Text information extraction.

This process takes the word sequences in the corpus as input and uses the Skip-gram model of word2vec to represent each word as a vector. The position vector corresponding to each word is concatenated to it, and the feature map is obtained through the convolutional layer. In order to better capture the structured information between the two entities, the feature map is divided into three segments at the pooling layer and pooled piecewise. Finally, the pooled feature vector is passed through the softmax layer and the result is stored as text data in a two-dimensional table.
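For illustration, the following is a minimal NumPy sketch of the piecewise (three-segment) max-pooling step described above, assuming the convolutional feature map and the two entity positions are already available; it is a simplified stand-in, not the authors' implementation:

```python
import numpy as np

def piecewise_max_pool(feature_map, e1_pos, e2_pos):
    """Split a convolutional feature map (seq_len x n_filters) into three
    segments around the two entity positions and max-pool each segment,
    as in piecewise (PCNN-style) pooling; positions are assumed sorted."""
    seq_len, n_filters = feature_map.shape
    bounds = [0, e1_pos + 1, e2_pos + 1, seq_len]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        segment = feature_map[start:end]
        # Guard against an empty segment (e.g. an entity at the last token).
        pooled.append(segment.max(axis=0) if len(segment) else np.zeros(n_filters))
    return np.concatenate(pooled)  # shape: (3 * n_filters,)

# Toy example: 10 tokens, 4 convolutional filters, entities at positions 2 and 6.
fmap = np.random.rand(10, 4)
print(piecewise_max_pool(fmap, 2, 6).shape)  # (12,)
```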

      1. Image semantic label extraction


Compared with text, pictures contain richer visual information, such as low-level colors and textures, as well as higher-level semantic information such as characters and actions. Therefore, this paper uses a deep learning model combined with manual correction to extract the semantic tags of pictures. The deep learning model adopts an Encoder-Decoder structure. First, the picture is input into the Encoder, where a CNN extracts its spatial characteristics; the feature map of the convolutional layer is used to extract the features of n positions in the picture, X = {x1, x2, x3, ..., xn}, where each xi is a D-dimensional vector. For example, if the height and width of the feature map are 14 and the number of channels is 256, then n = 14*14 = 196 and D = 256. At the same time, in order to extract the high-level semantics of the image, an attention mechanism is added during decoding to assign different weights to the extracted image features.

Suppose that in the t-th stage of decoding, when the t-th word is generated, zt is the context vector passed into the decoder RNN and h(t-1) is the hidden state of the previous stage of the RNN. This context vector is a weighted average of X = {x1, x2, x3, ..., xn}; specifically, the relationship between zt and X can be expressed by equation (3.1):

zt = Σ(m=1..n) αt,m xm     (3.1)

αt,m measures the weight of the image feature at the m-th position when the t-th word is generated; this weight is a function of the previous hidden state h(t-1) and the m-th image feature xm. The context vector zt is then fed into the LSTM, the hidden variable is generated, and the model outputs the result yt.

The above model greatly improves the accuracy of the generated semantic tags compared with traditional methods. However, because of the limited supervision data, there are still problems such as semantic focus shift and unclear descriptions. In order to reduce the impact of the amount of data on the accuracy of the tags, manual intervention is used to calibrate part of the data, which improves the accuracy of the semantic tags; finally, the image semantic tags are obtained. The specific extraction process is shown in Figure 3.

        Figure 3: Image semantic label extraction.

In Figure 3, I is the input color image and Y = {y1, y2, ..., yp} is the "image semantic description", that is, the semantic label of the picture, where p is the number of words in the final output sequence (the size of p is variable). Each word yi is a K-dimensional probability vector representing the probability of that word appearing in the image semantics, where K is the size of the text dictionary used by the model. The xi and zt are D-dimensional features, which respectively represent the n different description regions of the input image and the context corresponding to each description region.
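The following sketch illustrates equation (3.1): softmax-normalized weights αt,m are combined with the image features xm to produce the context vector zt. The additive scoring function (the matrices W_x, W_h and vector v) is an assumption made for illustration; the paper only states that the weight is a function of the previous hidden state and the image feature:

```python
import numpy as np

def attention_context(h_prev, X, W_h, W_x, v):
    """Equation (3.1): z_t = sum_m alpha_{t,m} * x_m, where the weight
    alpha_{t,m} depends on the previous decoder state h_prev and the
    image feature x_m (additive scoring is assumed here)."""
    scores = np.tanh(X @ W_x + h_prev @ W_h) @ v   # one score per position, shape (n,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # softmax attention weights
    return alpha @ X, alpha                        # z_t: (D,), alpha: (n,)

# Toy dimensions from the text: n = 196 positions, D = 256 channels.
n, D, H, A = 196, 256, 128, 64
X = np.random.rand(n, D)                 # image features x_1 .. x_n
h_prev = np.random.rand(H)               # previous decoder hidden state
W_x, W_h, v = np.random.rand(D, A), np.random.rand(H, A), np.random.rand(A)
z_t, alpha = attention_context(h_prev, X, W_h, W_x, v)
print(z_t.shape, round(alpha.sum(), 6))  # (256,) 1.0
```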

      2. Video semantic label extraction

Since a video is composed of a continuous image sequence, in which character action boundaries, repeated action segments, the timing of events, and the optical flow changes between frames are relatively fine-grained, a large amount of supervision information is needed to guide model training, and existing methods still fall well short of the goal of video semantic information extraction. Using machine learning methods to extract video semantic information would therefore have a very large impact on the final results of multimodal fusion, and the results could even lose their meaning. Manual annotation, in contrast, can readily capture the semantic information expressed by the image sequence of a video; based on the semantic content, videos can be classified into human behaviors and complex events. Finally, according to the text label rules, the key themes of the video are marked to form a complete natural sentence, and a description label containing the most dynamic information and the most accurate topic is generated for the video. Therefore, this paper uses manual labeling to extract video tags, comprehensively considering the important frames of the video and the semantic information they contain. The main process is shown in Figure 4.

Figure 4: Video semantic label extraction.

      3. Label alignment

In the above process, the three modal label sets of text, image, and video are obtained. In order to achieve multi-modal fusion, this paper uses longest-common-subsequence similarity calculation to fuse the different modalities.

The longest common subsequence (LCS) of two sequences X and Y is the longest of all their common subsequences, and its length can be used to quantify the similarity between the sequences. Methods for finding the longest common subsequence include the exhaustive method and the dynamic programming method. Once the amount of data becomes large, the computation of the exhaustive method grows exponentially with the number of elements in the sequences, while dynamic programming avoids repeated calculations and greatly improves computing efficiency, so it has become the main method in use.

Suppose the sequences X = (x1, x2, ..., xn) and Y = (y1, y2, ..., ym), where the numbers of elements of X and Y are n and m, respectively; Xi = (x1, x2, ..., xi) and Yj = (y1, y2, ..., yj) are prefixes of X and Y, where i ≤ n and j ≤ m. Since a prefix of the longest common subsequence of two sequences is also a common subsequence of the corresponding prefixes, the longest common subsequence LCS(Xi, Yj) can be solved recursively according to whether the corresponding elements are equal. Formula (3.2) gives the calculation:

LCS(Xi, Yj) = LCS(Xi-1, Yj-1) + xi,                  if xi = yj;
LCS(Xi, Yj) = max{LCS(Xi-1, Yj), LCS(Xi, Yj-1)},     otherwise.     (3.2)
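The recursion in formula (3.2) maps directly onto the standard dynamic-programming table; the sketch below computes the LCS length and is only meant to make the recurrence concrete:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y, computed with
    the dynamic-programming recursion of formula (3.2): extend the match
    when the current elements are equal, otherwise keep the larger of the
    two sub-problems."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

print(lcs_length("党的十九大会议", "十九大重要会议"))  # 5 ("十九大会议")
```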

If the length of the longest common subsequence obtained is k, the length of text 1 is n1, and the length of text 2 is n2, the similarity S is calculated as:

S = 2k / (n1 + n2)

The longest common subsequence algorithm can quickly find the longest common subsequence of two label sequences and measure their similarity according to the result obtained, and it is highly effective. Therefore, we use longest-common-subsequence similarity for fusion; the specific process is shown in Algorithm 1:

Algorithm 1: LSCMMF
Input: T = {t1, t2, ..., tn}, I = {i1, i2, ..., im}, V = {v1, v2, ..., vl}
Output: E(a, i, v)
Step 1: T_label_set = A, I_label_set = B, V_label_set = C, where A = {a1, a2, ..., an}, B = {b1, b2, ..., bm}, C = {c1, c2, ..., cl}
Step 2: commonset = (a, b, c), where a ∈ A && b ∈ B && c ∈ C
Step 3: value_T = v1x, value_I = v2y, lcs(v1x, v2y) = MAX(v1x, v2y)
Step 4: S = max(S(1), S(2), ..., S(n))
Step 5: if label = T, go to Step 6; else return None
Step 6: output E(a, i, v)

Here T is the text data set, I is the picture data set, and V is the video data set; E(a, i, v) is the data pair after fusion; commonset is the common label set of picture, text, and video; value_T is the text semantic label value, and value_I is the image semantic label value.
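Reusing the lcs_length function from the previous sketch, the following illustrates the spirit of Algorithm 1: each text label is matched to the most similar image and video labels by the similarity S = 2k/(n1+n2), and a fused (text, image, video) triple is emitted when both similarities are high enough. The threshold value and the greedy best-match strategy are illustrative assumptions, not details given in the paper:

```python
def lcs_similarity(a, b):
    """Similarity S = 2k / (n1 + n2), where k is the LCS length of labels a and b."""
    return 2 * lcs_length(a, b) / (len(a) + len(b)) if a and b else 0.0

def fuse_by_labels(text_labels, image_labels, video_labels, threshold=0.5):
    """For each text label, pick the most similar image and video labels by
    LCS similarity and emit a fused (text, image, video) triple when both
    similarities clear the threshold (greedy matching, illustrative only)."""
    fused = []
    for a in text_labels:
        best_img = max(image_labels, key=lambda b: lcs_similarity(a, b))
        best_vid = max(video_labels, key=lambda c: lcs_similarity(a, c))
        if min(lcs_similarity(a, best_img), lcs_similarity(a, best_vid)) >= threshold:
            fused.append((a, best_img, best_vid))
    return fused

# The text label is linked to its closest image and video labels.
print(fuse_by_labels(["party history study"],
                     ["party history", "economic news"],
                     ["party history video", "sports match"]))
```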

  3. Experiment and result analysis

    1. Data collection and processing

We used Python to crawl structured data such as the Party constitution, Party history and Party discipline from "People's Daily Online", "Communist Party Member", "Chinese Communist Party News", "Daily Economic Net" and other party-building-related websites. In total, 4,000 pieces of unstructured text data (txt format), covering personalities, current political news, etc., were collected. From Baidu Encyclopedia, Baidu Pictures, Google Pictures and the aforementioned party-building websites, 1,800 pieces of picture data (jpg or png format) were crawled, covering important meetings, major events, and the births of major figures in the Party. From Douban video, iQiyi, Youku and other related video websites, 1,100 pieces of film and television video data (MP4 format) were selected. While crawling the data, we filtered it according to the corresponding format; the data types and data volumes are shown in the following table.

Table 1: Multi-modal party building data set

Data content    Text    Image    Video
character       2872    1000     790
event           1128    800      310

      Experimental process

The experiments used Python 3.6 on Ubuntu 16.04, with PyCharm 2015 as the development environment, and relied on the skimage digital image processing library, the Jieba Chinese word segmentation library, the matplotlib plotting library and other dependencies. The specific experimental process is shown in Figure 5.

Figure 5: Experimental flowchart.

First, the acquired multi-modal data set is input and the semantic labels of the text, picture, and video data are extracted separately: the important people, events and their relationships and attributes related to party building are extracted from the text data with the PCNN+ATT model and stored in a table as standardized text semantic labels; the image data is normalized, and the Encoder-Decoder model, using the extracted text data as its dictionary, produces the corresponding image semantic label set; the video data is manually annotated with subject words according to the text extraction results and stored as the video label set. Finally, the semantic similarity of the three modal label sets is computed: the longest common subsequence of labels from different modalities is found, the similarity of each candidate label group is calculated, the group with the highest similarity is selected for matching, the picture and video semantic labels in that group are linked to the corresponding text, and the corresponding data pair is output.

    2. Analysis of experimental results

      The results of the algorithm are evaluated using two commonly used indicators, AUC and AP.

The AP index measures the area under the Precision-Recall curve and reflects the overall performance of the algorithm. Its calculation formula is shown in formula (4.1):

AP = TP / (TP + FP)     (4.1)


Here TP is the number of correctly aligned multi-modal data pairs, and FP is the number of multi-modal data pairs whose data labels are semantically unrelated. In addition, in order to verify the fusion accuracy for different kinds of data pairs, this paper adopts a cross-validation method and calculates the fusion results separately for the characters and the events in the data set: P1 is the alignment accuracy of person-related data, and P2 is the alignment accuracy of event-related data.

The AUC method can better evaluate the accuracy of the method. In the link prediction problem, accuracy is defined as the probability that a linked node pair scores higher than an unlinked node pair. Assuming that n comparison experiments have been carried out, in n' of them the linked node pair receives the higher score, and in n'' of them the two kinds of node pairs receive the same score, the AUC is calculated as shown in formula (4.2):

AUC = (n' + 0.5 n'') / n     (4.2)
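Formulas (4.1) and (4.2) translate directly into two small helpers; the example numbers below are arbitrary and only show how the metrics are evaluated:

```python
def average_precision(tp, fp):
    """Formula (4.1): correctly aligned pairs over all predicted alignments."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def auc_score(n_higher, n_equal, n_total):
    """Formula (4.2): share of comparisons won by the linked pair, ties count half."""
    return (n_higher + 0.5 * n_equal) / n_total

# Arbitrary example counts, only to show how the metrics are computed.
print(average_precision(tp=95, fp=5))                   # 0.95
print(auc_score(n_higher=90, n_equal=6, n_total=100))   # 0.93
```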

Using the above evaluation indicators to verify the method on the constructed data set gives the results shown in Table 2. The accuracy of the label alignment method reaches 98.2% and 96.4% for the person and event categories respectively, and the average accuracy reaches 97.3%; in these cases the picture and video label information is complete and accurate.

Table 2: Method verification results

Label alignment method    AUC50%    AP
CN                        0.9122    0.9003
ACT                       0.8909    0.8893
Node2Vec                  0.9377    0.9136
Net2Vec-CLP               0.9223    0.9471
Our Method                0.9455    0.9733

The experimental results show that, compared with other methods, our method uses the longest-common-subsequence similarity algorithm to calculate similarity before linking and makes comprehensive use of the shared semantic labels of the multi-modal data. The similarity algorithm calculates the association between the label sets to align the data in the multi-modal database, which to a certain extent makes up for the "semantic gap" of the data, whereas Net2Vec-CLP and similar methods only use network links to build semantic information. Therefore, the multi-modal fusion algorithm based on label alignment proposed in this paper improves both accuracy and AUC value. The accuracy improves the most, because the method uses structured semantic tags, and this type of information is more accurate than directly computing multi-modal semantic feature associations.

  4. Summary and Outlook

This paper proposes a label alignment knowledge fusion method for multimodal data, which differs from existing multimodal knowledge fusion methods. The model extracts semantic labels from each single modality with modality-specific extraction methods, and uses label alignment to combine text, image and video data by merging labels of the same format. The experimental results show that the accuracy of the proposed method is considerably improved compared with existing methods. Therefore, in both theoretical research and practical applications, the multimodal knowledge fusion method proposed in this paper provides a new reference for the field of multimodal knowledge graphs and, at the same time, offers new ideas for research on cross-modal retrieval and visual question answering. However, the algorithm in this paper relies on labeled data, and it is often difficult to find enough aligned sample pairs in real data sets. To address this, the next step will explore unsupervised or few-shot learning algorithms to improve the model's data processing efficiency and make it more suitable for real data.

Acknowledgments

This work was supported by the Natural Science Foundation of Ningxia Province (No. 2020AAC03218), the Ningxia First-Class Discipline and Scientific Research Project (Electronic Science and Technology) (No. NXYLXK2017A07) and the North Minzu University Graduate Innovation Program (No. YCX20075, No. S2020-11407-038G).

REFERENCES

  1. Baltrusaitis T, Ahuja C, Morency L P. Multimodal Machine Learning: A Survey and Taxonomy,IEEE Transactions on Pattern Analysis and Machine Intelligence, (2017), 423-443.

  2. Sun L, Li B, Yuan C, et al. Multimodal Semantic Attention Network for Video Captioning,IEEE Transactions on Big Data, (2019), 1300-1305.

  3. Zheng, Yu. Methodologies for Cross-Domain Data Fusion: An Overview,IEEE Transactions on Big Data, 1-1(2015), 16-34.

  4. Xu C, Tao D, Xu C. A Survey on Multi-view Learning, Computer Science, (2013), 2031-2038.

5. Krizhevsky A, Sutskever I, Hinton G. ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems, 60(6)(2012), 84-90.

  6. Li W, Zhang X, Wang Y, et al. GrappSeq: Fusion Embedding Learning for Knowledge Graph Completion,IEEE Access 7: (2019)1-1.

  7. Xie T, Wu B, Jia B, et al. Graph-ranking collective Chinese entity linking algorithm,Frontiers of Computer Science, 14(2020),291-303.

  8. Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models, (2014), 2012-2025.

9. Hassan Akbari, Svebor Karaman, Surabhi Bhargava, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, California, (2019), 12476-12486.

  10. Ngiam J, Khosla A, Kim Ms et al. Multimodal deep learning,Proceedings of the 28th International Conference on Machine Learning, Bellevue: ACM, (2011), 689-696.

11. Srivastava N, Salakhutdinov R. Multimodal Learning with Deep Boltzmann Machines, In Advances in Neural Information Processing Systems, Lake Tahoe, USA, (2012), 2222-2230.

12. Hardoon D R, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods, Neural Computation, 16(12)(2004), 2639-2664.

  13. Rasiwasia N, Pereira J C, Coviello E, et al. A new approach to cross-modal multimedia retrieval, In Proceedings of the 18th ACM international conference on multimedia. Firenze, Italy: ACM,(2010), 251-260.

  14. Menon A K, Surian D, Chawla S. Cross-modal retrieval: a pairwise classification approach,In Proceedings of the 2015 SIAM international conference on data mining,(2015), 199-210.

  15. AKAHO S. A kernel method for canonical correlation analysis, Proceedings of the international meeting of the psychometric society,(2001), 263-269.

  16. Andrew G, Arora R, Bilmes J, et al. Deep canonical correlation analysis,International conference on machine learning, (2013), 1247-1255.

17. HUANG Xin, PENG Yuxin, YUAN Mingkuan. Cross-modal common representation learning by hybrid transfer network, The 26th International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, (2017), 1893-1900.

18. Mahadevan V, Wong C W, Pereira J C, et al. Maximum covariance unfolding: Manifold learning for bimodal data, Advances in Neural Information Processing Systems, Granada, Spain, (2011), 918-926.

  19. Zhao L, Chen Z, Yang Y, et al. Incomplete multi-view clustering via deep semantic mapping,Neuro computing,275 (2018), 1053-1062.

  20. Ma J, Qiao Y, Hu G, et al. ELPKG: A High-Accuracy Link Prediction Approach for Knowledge Graph Completion, Symmetry, 11(9)(2019),1096.

  21. Leontev M L, Islenteva V, Sukhov S V. Non-iterative Knowledge Fusion in Deep Convolutional Neural Networks, Neural processing letters, 51(1)(2020),1-22.

  22. Jacobs D W, Daume H, Kumar A, et al. Generalized Multiview Analysis: A discriminative latent space, 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, (2012), 2160-2167.

  23. Rosipal R, Nicole Krämer. Overview and Recent Advances in Partial Least Squares, Subspace, latent structure and feature selection. Bohinj, Slovenia: Springer,(2006),34-51.

  24. Thoma S, Thalhammer A, Harth A, et al. FusE: Entity-Centric Data Fusion on Linked Data,ACM Transactions on the Web (TWEB), 13(2)(2019),1-36.

  25. Zhou, Long, Mou, et al. Multiview partial least squares,Chemometrics and Intelligent Laboratory Systems, (2017), 13-21.

  26. Sharma A, Jacobs D W. Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch,Computer Vision and Pattern Recognition (CVPR), IEEE,(2011), 593-600.

  27. Khaire P, Imran J, Kumar P. Human Activity Recognition by Fusion of RGB, Depth, and Skeletal Data,Proceedings of 2nd International Conference on Computer Vision & Image Processing.(2018).

  28. Gehring J, Auli M, Grangier D, et al. Convolutional Sequence to Sequence Learning, The 34th International Conference on Machine Learning. New York, USA:ACM Press, (2017), 1243-1252

  29. Peng Y, Qi J, Huang X, et al. CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network, IEEE Transactions on Multimedia, 20(2)(2017),405-420.

30. Liang J, He R, Sun Z, et al. Group-invariant cross-modal subspace learning, In Proceedings of IJCAI, New York, USA, (2016), 1739-1745.

  31. LIANG Xiaodan, HU Zhiting, ZHANG Hao, et al. Recurrent topic-transition GAN for visual paragraph generation, 2017 IEEE International Conference on Computer Vision. Honolulu, USA,(2017),3362-3371.

  32. GAO Lianli, GUO Zhao. Video captioning with attention-based LSTM and semantic consistency, IEEE Transactions on Multimedia, 19(9)(2019),2045-2055

  33. YANG Yang, ZHOU Jie, AI Jiangbo. Video captioning by adversarial LSTM, IEEE Transactions on Image Processing,27(11)(2018), 5600-5611.

34. PENG Yuxin, QI Jinwei. CM-GANs: Cross-modal generative adversarial networks for common representation learning, ACM Transactions on Multimedia Computing, Communications, and Applications, 15(1)(2019), 1-13.

  35. Song Xiaozhao, Zheng Xin, Li Zhixu, et al. Online encyclopedia table knowledge acquisition and fusion for knowledge base expansion,Software Engineering, 10(2019),1-6.

36. Xiao T, Hui L. Research Front Recognition Based on Multi-source Data Knowledge Fusion Method, Journal of Modern Information, 39(8)(2019), 29-36.

  37. Smirnov A, Levashova T. Knowledge fusion patterns: A survey,Information Fusion, 52(2019),31-40.
