Open Access
Authors: M Pradeep, S Hanumantha Rao
Paper ID: IJERTCONV9IS05046
Volume & Issue: ICRADL – 2021 (Volume 09, Issue 05)
Published (First Online): 27-03-2021
ISSN (Online): 2278-0181
Publisher Name: IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License
Multi-Modal Knowledge Fusion Method Based on Label Alignment
M Pradeep1, S Hanumantha Rao2
Associate Professor1, Professor2, Department of Electronics and Communication Engineering
Shri Vishnu Engineering College for Women, Bhimavaram, India
Abstract. In order to bridge the "semantic gap" between text, image, video, and other modal data, we propose a multimodal knowledge fusion method based on label alignment. The method performs semantic annotation according to the characteristics of text, picture, and video data, constructs a text label set for each modality, calculates the similarity between the multimodal label sets using the semantic similarity of the longest common subsequence, and selects the label pairs with the highest confidence to fuse the different modal data. To verify the effectiveness of the method, we built a data set containing the three modalities of text, picture, and video for the field of party building information. The proposed method was used to fuse the multimodal data on this data set, and the experimental results were analyzed. The accuracy of the method is 97.3%, which shows that it is highly effective.
Keywords. Label alignment, multimodal data, knowledge fusion, semantic annotation, label similarity.
INTRODUCTION
In the era of big data, multimodal data such as text, images, and video on the Internet have shown explosive growth. Each modality has its own specific information and statistical characteristics, and different modal data express knowledge to varying degrees, but they usually share high-level concepts and semantic information. Therefore, research on feature representation and learning methods for multiple modalities, which overcomes the impact of heterogeneity on multimodal representation, eliminates redundancy in multimodal heterogeneous data, and realizes the collaborative representation of multimodal data, has profound significance for knowledge representation[1]. Moreover, merging intra-modal information with cross-modal complementary information to obtain comprehensive concept-level features through the fusion of different modal information can improve the performance of practical tasks such as retrieval and classification.
In the research of multimodal data, the data fusion method is a core issue; its purpose is to integrate the characteristics of multimodal data to obtain a consistent, common model output[2]. Multimodal fusion can use the complementary information in multimodal data to discover the dependence of knowledge on multiple modalities. Existing multimodal data fusion methods can be divided into phase-based fusion algorithms, feature-based fusion algorithms, and semantic-based fusion algorithms[3]. The phase-based fusion algorithm is an early method that uses different modal data at different stages of the data mining task to complete the corresponding fusion analysis; this type of method does not require consistency between the different modalities. The feature-based fusion algorithm originally refers to sequentially concatenating the features extracted from different data sets into one feature vector, and then completing tasks such as clustering, classification, and prediction on the concatenated features. On this basis, researchers have proposed improved algorithms, such as using deep neural networks to learn unified feature representations of different modal data[6,7]. The literature[8] gives two basic multimodal feature fusion models based on deep networks, in which a coupled multimodal Deep Auto-Encoder (DAE) completes cross-modal feature learning. Srivastava et al.[9] proposed a multimodal Deep Boltzmann Machine (DBM) model, which combines the picture and text modalities to complete data classification and retrieval. The disadvantage of this type of method is that, because the representation, distribution, and density of different modalities may differ, simple attribute concatenation ignores the correlation between the modalities and regards each feature only as a real-valued number or a categorical value[4].
Different from feature-based fusion, the semantic-based fusion algorithm refers to understanding the data meaning of each modality and the relationships between the features of different modalities, abstracting the semantics of the different modalities during the data fusion process to complete cross-modal data fusion. Existing semantic-based fusion algorithms can be roughly divided into co-training methods, multiple kernel learning methods, subspace learning methods, probability-dependent methods, and transfer learning methods. Since the subspace learning method is based on the shared semantics of the multimodal description of the data, and can project multiple high-dimensional feature sets into the same low-dimensional semantically related space to alleviate the curse of dimensionality, it has attracted more and more attention from scholars[18]. To study the correlation between different modal data, the key is to construct a common feature subspace, map the feature vectors of the different modal data into this space, and then measure the similarity of the data of the different modalities in this space. The earliest multimodal shared subspace learning algorithms use Canonical Correlation Analysis (CCA)[10-12] to maximize the correlation between two modalities, learn the maximally correlated subspace, and output the projection matrix corresponding to each modality. In order to handle the nonlinearity and non-orthogonality of the original data, scholars have proposed Kernel Canonical Correlation Analysis (KCCA)[13], Mixed Probabilistic Canonical Correlation Analysis (Mix PCCA)[14], Deep Canonical Correlation Analysis (DCCA)[15], and other methods. Liang et al.[16] proposed a group-invariant cross-modal subspace learning method; while learning the projection subspace, this method learns not only the co-occurrence information of sample pairs but also the group co-occurrence relationship between different modalities, and this high-level semantic correspondence effectively improves the robustness of the latent subspace and the accuracy of retrieval. Mahadevan et al.[17] proposed learning low-dimensional embeddings while maintaining the local geometric structure within the different modalities, which effectively improves the stability of the embedding. Commonly used subspace mapping models include the Bilinear Model (BLM)[19], Partial Least Squares (PLS)[20-22], and so on. In addition, some methods map the knowledge of all modalities into a common semantic space and then align entities across modalities by calculating the semantic similarity between
the different modal knowledge in the common subspace. For example, Zhu et al.[23] proposed a joint knowledge embedding method to achieve entity alignment, and improved the alignment performance with an iterative training method[24]. Based on the internal structure information (entities and relationships) of heterogeneous knowledge graphs, this method first uses PTransE (path-based TransE)[25] to learn distributed representations of the different knowledge graphs; the entities and relationships of the heterogeneous knowledge graphs are encoded into a unified, continuous low-dimensional semantic space, and entities are aligned according to their semantic distance in the joint space. The experiments of Lin et al.[26] on PTransE show that considering relationship paths can greatly improve the recognition rate of knowledge representation learning and the performance of knowledge graph completion tasks. JAPE[27] uses joint representation learning to embed the entities and relationships of different knowledge graphs directly into a unified vector space, converting the matching of instances across knowledge graphs into the calculation of distances between vector representations; IPTransE only iteratively updates instance matches, while JAPE uses attributes and textual description information to enhance instance representation learning. EnAli[28] is an unsupervised method for matching entities across two or more heterogeneous data sources. Wu et al.[29] constructed a unified fusion framework that trains a domain-specific sentiment classifier for the target domain by fusing sentiment knowledge from multiple sources.
In view of the general problem that existing multimodal data fusion methods cannot effectively model the same semantic information across multimodal samples, this paper studies multimodal data fusion in different scenarios. Based on the relevant characteristics of multimodal data, we propose a multimodal knowledge fusion framework based on label alignment, which fully considers the relationships between modalities. By standardizing and aligning the labels of the different modal knowledge, the potential shared information of each modality is learned collaboratively, which effectively improves the accuracy of multimodal data fusion. The article gives the detailed design and optimization process, and verifies the performance of the proposed knowledge fusion model on a multimodal data set in the vertical field of party building, which provides a reference for research on multimodal fusion.

Structure
This paper focuses on multimodal fusion tasks. Multimodal fusion refers to integrating information of different forms (such as pictures, videos, text, audio, etc.) from two or more modalities: independent models are used to extract the instance sub-components of each modality, the correspondences between the instance sub-components of the two or more modalities are found, and modal alignment is performed. A consistent interpretation or description of the multimodal data is then obtained through calculation, and this consistent interpretation or description is used to complete analysis and identification tasks. That is, through data assimilation, the given candidate multimodal data are fused to mine more of the latent information.

Multimodal fusion
Given the multimodal data set M = {M_1, M_2, ..., M_n}, where M_1, M_2, ..., M_n are the n modalities in the set M and M_i = {m_1, m_2, m_3, ..., m_k}, with m_1, m_2, m_3, ..., m_k the basic elements of modality M_i. If M = {M_1, M_2, M_3} and the semantic space is S = {s_1, s_2, ..., s_q}, where s_1, s_2, ..., s_q are specific semantics, then multimodal fusion refers to obtaining, through the fusion algorithm, multimodal data pairs (m_1, m_2, m_3) that share the same semantics s, where m_1 ∈ M_1, m_2 ∈ M_2, m_3 ∈ M_3, s ∈ S.

Label alignment
Given two data label sets A and B, find all the labels in set A (or B) that can be aligned to a label in B (or A), namely:

Align(A, B) = {(a, b) | a ∈ A, b ∈ B, a and b denote the same semantics}
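As a concrete illustration of the definition above, the following minimal Python sketch computes the aligned pairs of two toy label sets; the label sets and the exact-match predicate are hypothetical examples, and the paper later replaces exact matching with the longest-common-subsequence similarity described in the label alignment section.

```python
# Minimal sketch of label alignment Align(A, B) with a pluggable match predicate.
# A and B are hypothetical toy label sets, not the paper's data.
A = {"important meeting", "party history", "major event"}
B = {"major event", "party history exhibition", "portrait"}

def align(A, B, match=lambda a, b: a == b):
    """Return all pairs (a, b), a in A and b in B, that the predicate aligns."""
    return {(a, b) for a in A for b in B if match(a, b)}

print(align(A, B))   # {('major event', 'major event')}
```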


Multimodal data fusion method based on label alignment
2.1. Multimodal data fusion framework
The model proposed in this paper is divided into two parts. The first part is the construction of labeled multimodal data sets: it completes the semantic labeling of the text, picture, and video data, extracts the labels of the heterogeneous data, performs a unified semantic representation, and outputs the semantic feature labels in the same format. The second part completes the data fusion process using the text labels of the different modalities. The multimodal data fusion framework is shown in Figure 1:
Figure 1. Multimodal data fusion framework.
Figure 1 shows the two main tasks included in this method: text, image, and video information extraction, and multimodal information fusion.
2.2. Text information extraction
Most existing text corpora are unstructured data, and text information extraction extracts the required information fields from semi-structured and unstructured text content. The text information extraction method used in this paper is a piecewise convolutional neural network model with an attention mechanism. The model is used to extract the main entities and their relationships from the corpus and to extract the important information from the original data. Using a piecewise convolutional neural network for keyword extraction improves the semantics of the feature vector representation, reduces the sparsity of the data dimensions, and effectively mitigates the problem of data noise. In addition, in the input text some words may contain more important semantic information than others; introducing an attention mechanism that focuses on key information and ignores irrelevant information reduces the interference of irrelevant information on the extraction of semantic features, so the information extracted by the model is more accurate. The specific process of text information extraction is shown in Figure 2:
Figure 2. Text information extraction.
This process takes the word sequence of the corpus as input and uses the Skip-gram model of word2vec to represent each word as a vector. The position vector corresponding to each word is concatenated to it, and a feature map is obtained through the convolutional layer. In order to better capture the structured information between the two entities, the feature map is divided into three segments at the pooling layer and pooled piecewise. Finally, the pooled feature vector is passed through the softmax layer and the result is stored as structured text data in a two-dimensional table.
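The following is a minimal PyTorch sketch of the piecewise (three-segment) pooling described above: word and position embeddings, a 1-D convolution, segment-wise max pooling, and a softmax relation classifier. The layer sizes, vocabulary size, and kernel width are illustrative assumptions, and the sentence-bag selective attention (the ATT part of the model) is omitted for brevity; this is not the authors' exact configuration.

```python
# Sketch of PCNN-style piecewise pooling (assumed sizes, attention omitted).
import torch
import torch.nn as nn

class PCNNEncoder(nn.Module):
    def __init__(self, vocab_size=10000, word_dim=100, pos_dim=5,
                 max_len=120, filters=230, kernel=3, num_rel=10):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos1_emb = nn.Embedding(2 * max_len, pos_dim)   # offset to entity 1
        self.pos2_emb = nn.Embedding(2 * max_len, pos_dim)   # offset to entity 2
        self.conv = nn.Conv1d(word_dim + 2 * pos_dim, filters, kernel, padding=1)
        self.fc = nn.Linear(3 * filters, num_rel)             # 3 segments -> relation

    def forward(self, words, pos1, pos2, seg_mask):
        # words/pos1/pos2: (batch, len); seg_mask: (batch, 3, len), 1.0 inside a segment
        x = torch.cat([self.word_emb(words),
                       self.pos1_emb(pos1),
                       self.pos2_emb(pos2)], dim=-1)           # (batch, len, dim)
        h = torch.relu(self.conv(x.transpose(1, 2)))           # (batch, filters, len)
        # piecewise max pooling: mask out the other two segments before pooling
        h = h.unsqueeze(1) + (seg_mask.float().unsqueeze(2) - 1) * 1e9
        pooled = h.max(dim=-1).values.flatten(1)               # (batch, 3 * filters)
        return torch.softmax(self.fc(pooled), dim=-1)          # relation distribution

# Toy usage with random indices and a 3-segment mask split by the entity positions.
enc = PCNNEncoder()
words = torch.randint(0, 10000, (4, 120))
pos1 = torch.randint(0, 240, (4, 120))
pos2 = torch.randint(0, 240, (4, 120))
seg_mask = torch.zeros(4, 3, 120)
seg_mask[:, 0, :40] = 1
seg_mask[:, 1, 40:80] = 1
seg_mask[:, 2, 80:] = 1
probs = enc(words, pos1, pos2, seg_mask)   # (4, num_rel)
```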

Image semantic label extraction
Compared with text, pictures contain richer visual information, such as shallow features (colors and textures) and higher-level semantic information (characters, actions, etc.). Therefore, this paper uses a deep learning model combined with manual correction to extract the semantic tags of pictures. The deep learning model adopts the Encoder-Decoder structure. First, the picture is input into the Encoder part; using the spatial characteristics of the CNN, the feature map of the convolutional layer is used to extract the features of n positions in the picture, X = {x_1, x_2, x_3, ..., x_n}, where each x_i is a D-dimensional vector. For example, if the height and width of the feature map are 14 and the number of channels is 256, then n = 14 × 14 = 196 and D = 256. At the same time, in order to extract the high-level semantics of the image, an attention mechanism is added during decoding to assign different weights to the extracted image features.
Suppose that in the t-th decoding stage, when the t-th word is generated, z_t is the context vector passed into the decoder RNN and h_{t-1} is the hidden state of the previous stage of the RNN. This context vector is a weighted average of X = {x_1, x_2, x_3, ..., x_n}; specifically, the relationship between z_t and X can be expressed by equation (3.1):

z_t = Σ_{m=1}^{n} α_{t,m} x_m    (3.1)

where α_{t,m} measures the weight of the image feature at the m-th position when the t-th word is generated; this weight is a function of the previous hidden state h_{t-1} and the image feature x_m at the m-th position. The context vector is then fed into the LSTM, which generates the hidden variable and outputs the model result y_t.
The above model greatly improves the accuracy of the generated semantic tags compared with traditional methods. However, due to the limited supervision data, there are still problems such as semantic focus drift and unclear descriptions. In order to reduce the impact of the amount of data on the accuracy of the tags, manual intervention is used to calibrate part of the data, which improves the accuracy of the semantic tags and finally yields the image semantic tags. The specific extraction process is shown in Figure 3.

Figure 3: Image semantic label extraction.

Here I is the input color image and Y = {y_1, ..., y_t, ..., y_p} is the "image semantic description", that is, the semantic label of the picture, where p is the (variable) length of the output word sequence. Each word y_t is a K-dimensional probability vector representing the probability of each word appearing in the image semantics, where K is the size of the text dictionary used by the model. x and z are D-dimensional features, which respectively represent the n description areas of the input image and the context corresponding to each description area.
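The soft-attention step of equation (3.1) can be sketched as follows. The paper only states that the weight is a function of the previous hidden state and the position feature, so the additive scoring network and all dimensions below are assumptions for illustration, not the authors' exact decoder.

```python
# Minimal sketch of equation (3.1): z_t = sum_m alpha_{t,m} * x_m,
# with alpha_{t,m} computed from h_{t-1} and x_m (scoring MLP is assumed).
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, X, h_prev):
        # X: (batch, n, feat_dim) encoder features; h_prev: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(X) +
                                  self.hidden_proj(h_prev).unsqueeze(1)))  # (batch, n, 1)
        alpha = torch.softmax(e, dim=1)          # weights alpha_{t,m}
        z_t = (alpha * X).sum(dim=1)             # equation (3.1)
        return z_t, alpha.squeeze(-1)

# e.g. n = 14*14 = 196 positions and D = 256, as in the text above
attn = SoftAttention()
X = torch.randn(2, 196, 256)
h_prev = torch.randn(2, 512)
z_t, alpha = attn(X, h_prev)   # z_t: (2, 256), alpha: (2, 196)
```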

Video semantic label extraction
Since a video is composed of a continuous image sequence, in which character action boundaries, repeated action segments, the timing of events, and the optical-flow changes between frames are relatively fine-grained, a large amount of supervision information is required to guide model training, and existing methods still fall well short of the goal of automatic video semantic information extraction. Using machine learning methods to extract video semantic information would strongly affect the final results of multimodal fusion, and the results could even lose their meaning. In contrast, manual annotation readily captures the semantic information expressed by the image sequence in a video, and based on the semantic content the video can be classified into human behaviors and complex events. Finally, according to the text label rules, the key themes of the video are marked to form a complete natural sentence, and a description label containing the most dynamic information and the most accurate topic is generated for the video. Therefore, this article uses manual labeling to extract video tags, comprehensively considering the important frames of the video and the semantic information they contain; the main process is shown in Figure 4.
Figure 4: Video semantic label extraction.

Label alignment
In the above process, the three modal label sets of text, image, and video are obtained. In order to achieve multimodal fusion, this paper uses a similarity calculation based on the longest common subsequence (LCS) to fuse the different modalities.
The longest common subsequence problem is: given two sequences X and Y, find the longest among all common subsequences of X and Y; the length of this subsequence can be used to quantify the similarity of the two sequences. The longest common subsequence can be solved by exhaustive search or by dynamic programming, but once the amount of data reaches a certain scale the computation of the exhaustive method grows exponentially with the number of elements in the sequences, whereas dynamic programming avoids repeated calculations and greatly improves computing efficiency, and has therefore become the main method in use.
Suppose the sequences X = (x_1, x_2, ..., x_n) and Y = (y_1, y_2, ..., y_m) have n and m elements, respectively, and X_i = (x_1, x_2, ..., x_i) and Y_j = (y_1, y_2, ..., y_j) are prefixes of X and Y, where i ≤ n and j ≤ m. Since any subsequence of the longest common subsequence of two sequences is also a common subsequence of them, the longest common subsequence LCS(X_i, Y_j) can be solved recursively according to whether the corresponding elements are equal; formula (3.2) gives the calculation:

LCS(X_i, Y_j) = LCS(X_{i-1}, Y_{j-1}) + x_i,                 if x_i = y_j;
LCS(X_i, Y_j) = max{ LCS(X_{i-1}, Y_j), LCS(X_i, Y_{j-1}) },  otherwise.    (3.2)

If the length of the longest common subsequence is k, the length of text 1 is n_1, and the length of text 2 is n_2, the similarity S is calculated as:

S = 2k / (n_1 + n_2)

The longest common subsequence algorithm can quickly find the longest common subsequence of two sequences and measure their similarity from the result, and it is highly effective. Therefore, we use the longest common subsequence similarity algorithm for fusion; the specific process is shown in Algorithm 1:
Algorithm 1: LCS-MMF
Input: T = {t_1, t_2, ..., t_n}, I = {i_1, i_2, ..., i_m}, V = {v_1, v_2, ..., v_k}
Output: E(a, i, v)
Step 1: T_label_set = A, I_label_set = B, V_label_set = C, where A = {a_1, a_2, ..., a_n}, B = {b_1, b_2, ..., b_m}, C = {c_1, c_2, ..., c_k};
Step 2: common_set = (a, b, c), where a ∈ A && b ∈ B && c ∈ C;
Step 3: value_T = v1x, value_I = v2y, lcs(v1x, v2y) = MAX(v1x, v2y);
Step 4: S = max(S(1), S(2), ..., S(n));
Step 5: if label = T, go to Step 6; else return None;
Step 6: output E(a, i, v).
Here T is the text data set, I the picture data set, and V the video data set; E(a, i, v) is the data pair after fusion; common_set is the common label set of picture, text, and video; value_T is the text semantic label value and value_I is the image semantic label value.
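A minimal Python sketch of this fusion procedure is given below: a dynamic-programming LCS length (formula 3.2), the similarity S = 2k / (n_1 + n_2), and the selection of the highest-scoring image and video labels for each text label. The label sets, the confidence threshold, and the tie handling are hypothetical illustrations; the paper does not specify these details.

```python
# Sketch of LCS-based label fusion (threshold and toy labels are assumptions).

def lcs_len(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y (formula 3.2)."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def similarity(x: str, y: str) -> float:
    """S = 2k / (n1 + n2), where k is the LCS length."""
    return 2 * lcs_len(x, y) / (len(x) + len(y)) if x and y else 0.0

def fuse(text_labels, image_labels, video_labels, threshold=0.5):
    """For each text label, link the most similar image and video labels."""
    fused = []
    for a in text_labels:
        best_i = max(image_labels, key=lambda b: similarity(a, b))
        best_v = max(video_labels, key=lambda c: similarity(a, c))
        if min(similarity(a, best_i), similarity(a, best_v)) >= threshold:
            fused.append((a, best_i, best_v))   # data pair E(a, i, v)
    return fused

print(fuse(["important party meeting"],
           ["party meeting photo"],
           ["party meeting video record"]))
```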


Experiment and result analysis

Data collection and processing
We used Python to crawl structured data on the party constitution, party history, and party discipline from "People's Daily Online", "Communist Party Member", "Chinese Communist Party News", "Daily Economic Net", and other party-building-related websites. In addition, a total of 4,000 pieces of unstructured text data (txt format), such as personalities and current political news, were crawled from Baidu Encyclopedia, Baidu Pictures, Google Pictures, and the aforementioned party-building-related websites, covering important meetings, major events, and the births of major figures in the party. A total of 1,800 pieces of picture data (jpg or png format) were collected, and 1,100 pieces of film and television video data (MP4 format) were selected from Douban Video, iQiyi, Youku Video, and other related video websites. While crawling the data, we filtered it according to the corresponding format; the data types and data volumes are shown in the following table.
Table 1: Multimodal party building data set

Modal type | Data content: character | Data content: event
text       | 2872                    | 1128
Image      | 1000                    | 800
video      | 790                     | 310
Experimental process
We used Python 3.6 on Ubuntu 16.04; the development environment was PyCharm 2015, with the skimage digital image processing library, the Jieba Chinese word segmentation library, the matplotlib plotting library, and other dependent libraries. The specific experimental process is shown in Figure 5:
Figure 5: Experimental flowchart.
First, the acquired multimodal data set is input and the semantic labels of the text, picture, and video data are extracted respectively: the important people, events, and their relationships and attributes related to party building are extracted from the text data with the PCNN+ATT model and stored in a table as normalized text semantic labels; the image data are normalized and the Encoder-Decoder model, using the extracted text data as its dictionary, extracts the corresponding image semantic label set; the video data are manually annotated with subject words according to the text extraction results and stored as the video label set. Finally, the semantic similarity of the three modal label sets is calculated: the longest common subsequence between the labels of the different modalities is computed, the similarity of each group of labels is calculated, the group with the highest similarity is selected for matching, the picture and video semantic labels in that group are linked to the corresponding text, and the corresponding data pair is output.

Analysis of experimental results
The results of the algorithm are evaluated with two commonly used indicators, AUC and AP.
The AP index reflects the area under the Precision-Recall curve and can better reflect the overall performance of the algorithm. Its calculation is shown in formula (4.1):
AP = TP / (TP + FP)    (4.1)
Here TP is the number of correctly aligned multimodal data pairs, and FP is the number of multimodal data pairs whose data labels are semantically unrelated. In addition, in order to verify the fusion accuracy for different kinds of data pairs, this paper adopts a cross-validation approach and calculates the fusion results separately for the characters and the events in the data set; P1 is the alignment accuracy for person-related data and P2 is the alignment accuracy for event-related data.
The AUC metric can better evaluate the accuracy of the method. In the link prediction problem, accuracy is defined as the probability that a linked node pair receives a higher score than an unlinked node pair. Assuming that n independent comparisons are carried out, that in n' of them the linked node pair obtains the higher score, and that in n'' of them the two kinds of node pairs obtain the same score, the AUC is calculated as shown in formula (4.2):

AUC = (n' + 0.5 n'') / n    (4.2)
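The two metrics reduce to simple ratios over counts, as in the following sketch; the counts used in the example calls are hypothetical values, not the paper's results.

```python
# Minimal sketch of the evaluation metrics in formulas (4.1) and (4.2).

def ap(tp: int, fp: int) -> float:
    """Formula (4.1): fraction of correctly aligned multimodal data pairs."""
    return tp / (tp + fp)

def auc(n_higher: int, n_equal: int, n_total: int) -> float:
    """Formula (4.2): (n' + 0.5 * n'') / n over n independent score comparisons."""
    return (n_higher + 0.5 * n_equal) / n_total

print(ap(tp=97, fp=3))                               # 0.97
print(auc(n_higher=940, n_equal=20, n_total=1000))   # 0.95
```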
The method was verified on the constructed data set using the above evaluation indicators, and the results are shown in Table 2. The accuracy of the label alignment method reached 98.2% and 96.4% for the person and event categories respectively, and the average accuracy reached 97.3%; in these experiments the picture and video label information is complete and accurate.
Table 2: Method verification results

Label alignment method | AUC50% | AP
CN                     | 0.9122 | 0.9003
ACT                    | 0.8909 | 0.8893
Node2Vec               | 0.9377 | 0.9136
Net2VecCLP             | 0.9223 | 0.9471
Our Method             | 0.9455 | 0.9733
The experimental results show that, compared with other methods, our method uses the longest common subsequence similarity algorithm to calculate similarity before linking and thus makes comprehensive use of the shared semantic labels of the multimodal data. Calculating the association between the label sets with the similarity algorithm to align the data in the multimodal database makes up, to a certain extent, for the "semantic gap" between the data, whereas Net2VecCLP and the other baselines only use network links to build semantic information. Therefore, the multimodal fusion algorithm based on label alignment proposed in this paper improves both the accuracy and the AUC value; the accuracy improves the most, because the method uses structured semantic tags, and this type of information is more accurate than directly computing multimodal semantic feature associations.


Summary and Outlook
This paper proposes a label alignment knowledge fusion method for multimodal data, which differs from existing multimodal knowledge fusion methods. The model extracts the semantic labels of each single modality through different extraction methods and uses label alignment to combine text, image, and video data by merging labels of the same format. The experimental results show that the accuracy of the proposed method is considerably improved compared with existing methods. Therefore, in both theoretical research and practical applications, the multimodal data knowledge fusion method proposed in this paper provides a new reference for the field of multimodal knowledge graphs, and it can also provide new ideas for research on cross-modal retrieval and visual question answering. However, the algorithm in this paper relies on labeled data, and it is often difficult to find enough aligned sample pairs in real data sets. To solve this problem, the next step will explore unsupervised algorithms or few-shot learning algorithms to improve the data processing efficiency of the model and make it more suitable for real data.
Acknowledgments
This work was supported by the Natural Science Foundation of Ningxia Province (No. 2020AAC03218), the Ningxia First-Class Discipline and Scientific Research Projects (Electronic Science and Technology) (No. NXYLXK2017A07), and the North Minzu University Graduate Innovation Program (No. YCX20075, No. S202011407038G).
REFERENCES

Baltrusaitis T, Ahuja C, Morency L P. Multimodal Machine Learning: A Survey and Taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence, (2017), 423-443.
Sun L, Li B, Yuan C, et al. Multimodal Semantic Attention Network for Video Captioning, IEEE Transactions on Big Data, (2019), 1300-1305.
Zheng Y. Methodologies for Cross-Domain Data Fusion: An Overview, IEEE Transactions on Big Data, 1(1)(2015), 16-34.
Xu C, Tao D, Xu C. A Survey on Multi-view Learning, Computer Science, (2013), 2031-2038.
Krizhevsky A, Sutskever I, Hinton G. ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems, 60(6)(2012), 84-90.
Li W, Zhang X, Wang Y, et al. GrappSeq: Fusion Embedding Learning for Knowledge Graph Completion, IEEE Access, 7(2019), 11.
Xie T, Wu B, Jia B, et al. Graph-ranking collective Chinese entity linking algorithm, Frontiers of Computer Science, 14(2020), 291-303.
Kiros R, Salakhutdinov R, Zemel R. Multimodal neural language models, (2014), 2012-2025.
Akbari H, Karaman S, Bhargava S, et al. Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, California, (2019), 12476-12486.
Ngiam J, Khosla A, Kim M, et al. Multimodal deep learning, Proceedings of the 28th International Conference on Machine Learning, Bellevue: ACM, (2011), 689-696.
Srivastava N, Salakhutdinov R. Multimodal Learning with Deep Boltzmann Machines, Advances in Neural Information Processing Systems, Lake Tahoe, USA, (2012), 2222-2230.
Hardoon D R, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: an overview with application to learning methods, Neural Computation, 16(12)(2004), 2639-2664.
Rasiwasia N, Pereira J C, Coviello E, et al. A new approach to cross-modal multimedia retrieval, Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy: ACM, (2010), 251-260.
Menon A K, Surian D, Chawla S. Cross-modal retrieval: a pairwise classification approach, Proceedings of the 2015 SIAM International Conference on Data Mining, (2015), 199-210.
Akaho S. A kernel method for canonical correlation analysis, Proceedings of the International Meeting of the Psychometric Society, (2001), 263-269.
Andrew G, Arora R, Bilmes J, et al. Deep canonical correlation analysis, International Conference on Machine Learning, (2013), 1247-1255.
Huang X, Peng Y, Yuan M. Cross-modal common representation learning by hybrid transfer network, The 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, (2017), 1893-1900.
Mahadevan V, Wong C W, Pereira J C, et al. Maximum covariance unfolding: Manifold learning for bimodal data, Advances in Neural Information Processing Systems, Granada, Spain, (2011), 918-926.
Zhao L, Chen Z, Yang Y, et al. Incomplete multi-view clustering via deep semantic mapping, Neurocomputing, 275(2018), 1053-1062.
Ma J, Qiao Y, Hu G, et al. ELPKG: A High-Accuracy Link Prediction Approach for Knowledge Graph Completion, Symmetry, 11(9)(2019), 1096.
Leontev M, Islenteva V, Sukhov S V. Non-iterative Knowledge Fusion in Deep Convolutional Neural Networks, Neural Processing Letters, 51(1)(2020), 1-22.
Jacobs D W, Daume H, Kumar A, et al. Generalized Multiview Analysis: A discriminative latent space, 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, (2012), 2160-2167.
Rosipal R, Krämer N. Overview and Recent Advances in Partial Least Squares, Subspace, Latent Structure and Feature Selection, Bohinj, Slovenia: Springer, (2006), 34-51.
Thoma S, Thalhammer A, Harth A, et al. FusE: Entity-Centric Data Fusion on Linked Data, ACM Transactions on the Web (TWEB), 13(2)(2019), 1-36.
Zhou L, Mou, et al. Multi-view partial least squares, Chemometrics and Intelligent Laboratory Systems, (2017), 13-21.
Sharma A, Jacobs D W. Bypassing synthesis: PLS for face recognition with pose, low-resolution and sketch, Computer Vision and Pattern Recognition (CVPR), IEEE, (2011), 593-600.
Khaire P, Imran J, Kumar P. Human Activity Recognition by Fusion of RGB, Depth, and Skeletal Data, Proceedings of the 2nd International Conference on Computer Vision & Image Processing, (2018).
Gehring J, Auli M, Grangier D, et al. Convolutional Sequence to Sequence Learning, The 34th International Conference on Machine Learning, New York, USA: ACM Press, (2017), 1243-1252.
Peng Y, Qi J, Huang X, et al. CCL: Cross-modal Correlation Learning with Multi-grained Fusion by Hierarchical Network, IEEE Transactions on Multimedia, 20(2)(2017), 405-420.
Liang J, He R, Sun Z, et al. Group-invariant cross-modal subspace learning, Proceedings of IJCAI, New York, USA, (2016), 1739-1745.
Liang X, Hu Z, Zhang H, et al. Recurrent topic-transition GAN for visual paragraph generation, 2017 IEEE International Conference on Computer Vision, Honolulu, USA, (2017), 3362-3371.
Gao L, Guo Z. Video captioning with attention-based LSTM and semantic consistency, IEEE Transactions on Multimedia, 19(9)(2019), 2045-2055.
Yang Y, Zhou J, Ai J. Video captioning by adversarial LSTM, IEEE Transactions on Image Processing, 27(11)(2018), 5600-5611.
Peng Y, Qi J. CM-GANs: Cross-modal generative adversarial networks for common representation learning, Multimedia, 15(1)(2019), 1-13.
Song X, Zheng X, Li Z, et al. Online encyclopedia table knowledge acquisition and fusion for knowledge base expansion, Software Engineering, 10(2019), 1-6.
Xiao T, Hui L. Research Front Recognition Based on Multi-source Data Knowledge Fusion Method, Journal of Modern Information, 39(8)(2019), 29-36.
Smirnov A, Levashova T. Knowledge fusion patterns: A survey, Information Fusion, 52(2019), 31-40.