February 21, 2024

Jo Mai Asian Culture

Fusion of the word2vec word embedding model and cluster analysis for the communication of music intangible cultural heritage

Long text data acquisition for music ICH

To perform cluster analysis on texts about intangible cultural heritage (ICH) in the domain of music, a dataset of such texts must first be acquired. This article therefore uses web scraping to collect textual data on the dissemination of music ICH, which forms the foundational dataset for the subsequent cluster analysis. Web crawlers handle the large volume of webpage links through well-designed request methods. The Scrapy framework was selected for the scraping because of its Python integration, comprehensiveness, and high scalability.

Scrapy is a widely used open-source web crawling framework written in Python, a highly integrated and flexible programming language. With Scrapy, a web crawler can be developed quickly, and a crawler built on the framework is highly scalable and robust. Therefore, this article designs a web crawler for ICH data (hereinafter referred to as the “ICH crawler”) based on the Scrapy framework. Table 1 lists the specific operation steps.

Table 1 Steps of the Scrapy-based ICH crawler.

Figure 1 depicts the primary process of the ICH crawler based on the aforementioned steps of the Scrapy data crawler.

Figure 1 Design process of the ICH crawler.

The workflow in Figure 1 proceeds as follows: the crawler sends a request to a designated URL, checks whether the URL can be parsed, analyzes and stores the webpage structure, outputs and saves the data in the predefined structure, checks whether the URL meets the termination criterion, and finally ends the crawl.
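The loop described above can be sketched without any framework. The following is a minimal, standard-library illustration of the request, parse, store, and terminate steps, not the article's Scrapy implementation. The names `PageParser` and `crawl`, the injected `fetch` callable, and the `max_pages` termination criterion are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    queue = deque([start_url])
    seen = {start_url}
    records = []
    # Assumed termination criterion: queue exhausted or page budget reached.
    while queue and len(records) < max_pages:
        url = queue.popleft()
        html = fetch(url)              # send the request
        if html is None:               # page could not be retrieved or parsed
            continue
        parser = PageParser()
        parser.feed(html)              # analyze the page structure
        records.append({"url": url, "text": " ".join(parser.text)})  # store
        for href in parser.links:      # schedule newly discovered links
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return records
```

Passing `fetch` in as a parameter keeps the sketch testable with an in-memory dictionary of pages; a real crawler would replace it with an HTTP client or with Scrapy's own request scheduling.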

Word segmentation of music ICH communication based on the domain dictionary method

Efficient cluster analysis of text requires word segmentation. This section segments the music ICH text dataset constructed above, using an ICH domain lexicon method.

Text about ICH differs significantly from ordinary text, particularly in the names of ICH items and in basic features such as regions, scenes, acts, and attire. In particular, the name of an ICH project needs to be recognized as a whole rather than split into parts. This section builds a lexicon for this domain and combines it with the Jieba word segmentation tool to improve Chinese word segmentation in the field of ICH.

Python’s Jieba word segmentation module combines dictionaries with statistical approaches to produce accurate results when processing Chinese text. First, the text is scanned against a trained Chinese prefix dictionary, and a directed acyclic graph is built covering all the possible ways the characters could form words. Second, the maximum probability path through the graph is found using dynamic programming. Third, this path gives the segmentation combination with the highest probability based on word frequency [16, 17].
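The dictionary-based steps can be illustrated with a simplified sketch: a directed acyclic graph built from a frequency dictionary, followed by a right-to-left dynamic program over log probabilities. This is a toy reimplementation for illustration, not Jieba's actual code, and the function names and frequencies used with it are assumptions.

```python
import math


def build_dag(sentence, freq):
    """Map each start index to the end indices of dictionary words starting there."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n)
                if freq.get(sentence[start:end + 1], 0) > 0]
        dag[start] = ends or [start]   # fall back to a single character
    return dag


def segment(sentence, freq):
    """Return the segmentation with the maximum product of word probabilities."""
    n = len(sentence)
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    # route[i] = (best log probability of sentence[i:], end index of first word)
    route = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 0) or 1)
             - log_total + route[end + 1][0], end)
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words
```

With a dictionary that holds only general words such as 古琴 and 艺术, the string 古琴艺术 is split in two. Once the item name 古琴艺术 is added with a reasonable frequency, the single-word path becomes the most probable one, which is the effect a domain lexicon is meant to achieve.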

On the statistical side, Jieba tackles the challenge of unregistered (out-of-vocabulary) words by using a hidden Markov model (HMM). The HMM performs well in text segmentation because it can recognize and segment unknown vocabulary by taking contextual information into account, so it copes well with text specific to a particular domain. The unregistered words are decoded with the Viterbi algorithm, and the results are also used to tag these words with parts of speech.
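A compact sketch of Viterbi decoding over the four character tags commonly used for Chinese segmentation (B = begin, M = middle, E = end, S = single-character word) may make this step concrete. The probability tables below are made-up toy values rather than Jieba's trained parameters, and the function names are hypothetical.

```python
import math

STATES = "BMES"            # Begin, Middle, End of a word, or Single-character word
NEG_INF = float("-inf")


def logs(table):
    """Convert a dict of probabilities to log probabilities."""
    return {key: math.log(value) for key, value in table.items()}


def viterbi(chars, start_p, trans_p, emit_p):
    """Return the most probable tag sequence; all tables hold log probabilities."""
    best = [{s: start_p.get(s, NEG_INF) + emit_p[s].get(chars[0], NEG_INF)
             for s in STATES}]
    back = [{}]
    for t in range(1, len(chars)):
        best.append({})
        back.append({})
        for s in STATES:
            score, prev = max(
                (best[t - 1][p] + trans_p[p].get(s, NEG_INF)
                 + emit_p[s].get(chars[t], NEG_INF), p)
                for p in STATES
            )
            best[t][s] = score
            back[t][s] = prev
    state = max("ES", key=lambda s: best[-1][s])   # a sentence must end a word
    tags = [state]
    for t in range(len(chars) - 1, 0, -1):
        state = back[t][state]
        tags.append(state)
    return tags[::-1]


def cut(sentence, tags):
    """Turn a tag sequence back into words."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i
        elif tag == "E":
            words.append(sentence[start:i + 1])
        elif tag == "S":
            words.append(sentence[i])
    return words


# Toy parameters, invented for this example only.
START = logs({"B": 0.6, "S": 0.4})
TRANS = {"B": logs({"M": 0.2, "E": 0.8}), "M": logs({"M": 0.3, "E": 0.7}),
         "E": logs({"B": 0.5, "S": 0.5}), "S": logs({"B": 0.5, "S": 0.5})}
EMIT = {"B": logs({"昆": 0.5, "曲": 0.1}), "M": logs({"昆": 0.1, "曲": 0.1}),
        "E": logs({"昆": 0.1, "曲": 0.6}), "S": logs({"昆": 0.3, "曲": 0.2})}
```

Under these toy tables, `viterbi("昆曲", START, TRANS, EMIT)` tags the two characters as B and E, so `cut` returns the single word 昆曲 even though no dictionary entry was consulted.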

This article uses the music ICH dataset developed in the previous section to extract the list of music-related ICH projects from China’s National Intangible Cultural Heritage website, and generates a lexicon for the ICH domain from the data obtained. The lexicon follows the Jieba dictionary format, in which each word occupies one line with three components: the word, its frequency, and its part of speech. In this article, the word frequency is omitted, and a preliminary ICH lexicon is created in the format of “music category + specific quantity,” arranged alphabetically. The lexicon can be continuously updated and improved by monitoring and adding newly encountered unregistered words.
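As a rough illustration of the lexicon format, the sketch below turns a list of item names into dictionary lines with the frequency field omitted. The function names, the part-of-speech tag `nz`, and the code-point sort order are assumptions for the example; the article does not state which tag or ordering rule was used.

```python
def format_userdict(words, pos_tag="nz"):
    """Build 'word pos' lines (frequency omitted), deduplicated and sorted."""
    unique = sorted({word.strip() for word in words if word.strip()})
    return [f"{word} {pos_tag}" for word in unique]


def write_userdict(path, words, pos_tag="nz"):
    """Write the lexicon as UTF-8 text, one entry per line."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(format_userdict(words, pos_tag)) + "\n")

# The resulting file could then be registered with Jieba, for example:
#   import jieba
#   jieba.load_userdict("ich_lexicon.txt")   # hypothetical file name
```

New unregistered words found later can be appended to the input list and the file regenerated, which keeps the entries deduplicated and in a stable order.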

Application of the word2vec model in music ICH texts

This article uses text representation to put natural language into a form that computers can analyze and compute on more effectively. Text vectorization, also known as word embedding, is a widely used way to achieve this. Word embeddings are of two types, discrete and distributed. The distributed approach based on the word2vec model maps similar words to nearby points in vector space and generally represents semantic similarity more precisely. This section therefore uses the word2vec model to construct word vectors for the music ICH text dataset, providing the groundwork for text clustering analysis.

Word2vec model

The word2vec model was proposed by Mikolov et al. in 2013. Word vectors trained with word2vec capture syntactic and semantic regularities of the language, and the semantic relationship between words can be described by the offset between their vectors [18]. Figure 2 displays an example.

Figure 2 Relationship between word2vec word vectors.

Figure 2 indicates that the word2vec model allows vector operations to be carried out between words. Specifically, when the word vector for “Man” is subtracted from the word vector for “King” and the word vector for “Woman” is added, the result is approximately equal to the word vector for “Queen”.
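The arithmetic can be reproduced on hand-made vectors. The two-dimensional vectors below are invented for illustration (one axis loosely standing for royalty, the other for gender) and are not trained embeddings; `cosine` and `analogy` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [word for word in vectors if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


# Toy vectors, made up for this example.
TOY_VECTORS = {
    "king": [1.0, 1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
```

Here `analogy(TOY_VECTORS, "king", "man", "woman")` selects "queen", because the offset between "king" and "man" matches the offset between "queen" and "woman" in this toy space.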

The word2vec model is an improved neural network language model (NNLM). Table 2 lists the main changes and the advantages of these changes.

Table 2 Optimization and advantages of the word2vec model.

Word2vec offers a choice between the Continuous Bag-of-Words (CBOW) model and the Skip-gram model. The CBOW model predicts the probability of the current word W(t) from a fixed number of words before and after its position. In contrast, Skip-gram uses the current word to predict the probabilities of the surrounding context words. To train the model efficiently, word2vec uses Hierarchical Softmax or Negative Sampling [19].
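The difference between the two architectures shows up in the training examples they draw from the same sentence. The sketch below only generates those examples for a given window size and does not train anything; the function names are assumptions for illustration.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window):
    """(context words, target) pairs: the neighbors jointly predict the center."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            pairs.append((context, target))
    return pairs
```

For the tokens `["guqin", "art", "heritage"]` with a window of 1, Skip-gram yields four single-word predictions, while CBOW yields three examples, one per position, each using all available neighbors at once.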

Transfer learning of the pre-trained word2vec model

To ensure the quality of the word vectors, transfer learning is performed on the ICH text corpus starting from a pre-trained word2vec model. The basic idea of transfer learning is to take skills and knowledge learned in one area and apply them to another. If the transfer works well, it saves time and money on labeling data and significantly improves learning efficiency [20].

This experiment uses the Chinese Wikipedia corpus for pre-training to produce the pre-trained word2vec model. This model gives the word vectors accurate contextual relationships and mitigates the effect of the limited size of the collected dataset. It is important to check that the word vector dimension of the pre-trained model is the same as that of the new training run. Transfer learning is then performed on the ICH corpus to keep the resulting word vectors coherent.
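The dimension check and the reuse of pre-trained vectors can be sketched as an initialization step that precedes further training. This is only an assumed outline of that step: the function name, the dictionary-based vector store, and the small uniform random initialization for unseen words are illustrative choices, and the continued training itself is not shown.

```python
import random


def init_from_pretrained(ich_vocab, pretrained, dim, seed=0):
    """Seed ICH word vectors from pre-trained ones after checking dimensions."""
    for word, vector in pretrained.items():
        if len(vector) != dim:
            raise ValueError(
                f"pre-trained vector for {word!r} has dimension {len(vector)}, "
                f"expected {dim}"
            )
    rng = random.Random(seed)
    vectors = {}
    for word in ich_vocab:
        if word in pretrained:
            # Reuse knowledge from the general corpus.
            vectors[word] = list(pretrained[word])
        else:
            # Domain word missing from the general corpus: small random start.
            vectors[word] = [rng.uniform(-0.5 / dim, 0.5 / dim) for _ in range(dim)]
    return vectors
```

Words shared with the general corpus start from their pre-trained positions, while domain-specific words start near the origin and are placed by subsequent training on the ICH corpus.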

Design of evaluation indicators for music ICH word vectors

Word vectors can be evaluated both internally (intrinsically) and externally (extrinsically). In the context of vector training, internal evaluation refers to assessing performance on specific intermediate subtasks. Simple and quick subtasks such as word analogy aid the understanding of word vectors and allow their quality to be measured quantitatively, usually without running a specific downstream task. This approach is chosen because a single NLP task can take a long time to run, and the effectiveness of word vectors varies with the downstream task.

External evaluation assesses word vectors by their performance on downstream tasks. Word vectors are only as good as the data they are trained on, and because they are the foundation of many NLP tasks, some form of external evaluation is usually necessary. Even so, internal evaluation is still required to help pinpoint the source of poor downstream model performance.

This article presents an experimental evaluation of the ICH word vectors based on a correlation criterion, which has the advantage of being quick and straightforward to calculate.

K representative words with the characteristics of the ICH items are first selected from the texts of each ICH category. For each word, the n most similar words are generated based on the cosine similarity of the ICH word vectors, and the most pertinent words are selected by subjective human judgement. The loss function is calculated from the cosine similarity Y between words and the subjective evaluation score f(x) (where the subjective evaluation value is taken to be the highest similarity value among the n words). The evaluation index P is obtained according to Eq. (1).

$$P = \frac{\sum\nolimits_{1}^{K} L\left( f(x),\; Y \right)}{K}$$


In Eq. (1), K stands for the number of ICH representative words selected for each category; \(f(x)\) represents the subjective score; Y signifies the cosine similarity between words; L refers to the loss function, which can be expressed as Eq. (2).

$$L\left( f(x),\; Y \right) = \left| Y - f(x) \right|$$


Equation (2) indicates that the selected loss function in this model is the absolute value loss function. This function is obtained by computing the difference between the predicted value and the target value, and then taking the absolute value of the result.
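Equations (1) and (2) amount to a mean absolute difference over the K representative words. A minimal sketch, assuming the cosine similarities and subjective scores are already available as two equal-length lists (the function name and the sample numbers are illustrative, not results from the article):

```python
def evaluation_index(cosine_sims, subjective_scores):
    """P = (1/K) * sum of |Y - f(x)| over the K representative words."""
    if not cosine_sims or len(cosine_sims) != len(subjective_scores):
        raise ValueError("need two non-empty lists of equal length")
    losses = [abs(y - fx) for y, fx in zip(cosine_sims, subjective_scores)]
    return sum(losses) / len(losses)


# Made-up values for K = 3 words in one category.
example_p = evaluation_index([0.9, 0.7, 0.5], [1.0, 0.5, 0.5])
```

For the sample values, the per-word losses are approximately 0.1, 0.2, and 0.0, so `example_p` is approximately 0.1.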

The average evaluation value is calculated from the evaluation values obtained for each category. After normalization, the final evaluation index lies between 0 and 1. The higher the value, the more semantic information the word vector contains, and the better the effect of the training model.

