Author: 我在思考中 | 2021-11-10 15:37
Multi-label text classification is a classic task in natural language processing, in which a model is trained to assign a variable number of category labels to a given text. In practice, however, the amount of training data often differs greatly across labels (the imbalanced classification problem), sometimes following a long-tailed distribution, which degrades model performance. Resampling and reweighting are commonly used to counter class imbalance, but because labels are correlated in multi-label text classification, existing methods can lead to oversampling of high-frequency labels. In this work, we investigate strategies that optimize the loss function, in particular the use of balanced loss functions for multi-label text classification. Across experiments on a general-domain dataset (Reuters-21578, 90 labels) and a biomedical dataset (PubMed, 18,211 labels), we find that a class of distribution-balanced loss functions consistently outperforms commonly used losses. Researchers recently showed that such loss functions improve image recognition models; our work further demonstrates their effectiveness in natural language processing.
Multi-label text classification is one of the core tasks in natural language processing (NLP). It aims to find, for a given text, multiple relevant labels from a label set, and has applications in search (Prabhu et al., 2018), product categorization (Agrawal et al., 2013), and many other scenarios. Figure 1 shows sample data from Reuters-21578, a general-purpose multi-label text classification dataset (Hayes and Weinstein, 1990).
Figure 2: Long-tailed distribution and label co-occurrence in Reuters-21578.
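One common reweighting scheme for long-tailed label distributions like this is class-balanced weighting based on the "effective number of samples" (Cui et al., 2019), one of the losses this line of work builds on. The sketch below (Python/NumPy; an illustrative helper of our own, not the authors' code) computes per-label weights from raw label frequencies:

```python
import numpy as np

def class_balanced_weights(label_counts, beta=0.9999):
    """Per-label weights from the effective number of samples (Cui et al., 2019)."""
    counts = np.asarray(label_counts, dtype=float)
    # Effective number of samples for a label seen n times: (1 - beta^n) / (1 - beta)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of labels
    return weights * len(counts) / weights.sum()

# Head labels (many examples) receive small weights, tail labels large ones
print(class_balanced_weights([10000, 500, 10]))
```

Because the effective number saturates as counts grow, a label with 10,000 examples gets only marginally more "effective" data than one with 5,000, so its weight is pushed down relative to tail labels.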
Figure 3: Design of the loss functions.
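As a concrete example of the loss designs compared in this family, the sketch below implements the per-label sigmoid focal loss of Lin et al. (2017), one of the standard baselines (Python/NumPy; a simplified reference implementation, not the one used in the paper):

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Sigmoid focal loss (Lin et al., 2017), averaged over all labels.

    gamma=0 reduces to plain binary cross-entropy; larger gamma
    down-weights easy, well-classified labels.
    """
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    t = np.asarray(targets, dtype=float)
    pt = np.where(t == 1.0, p, 1.0 - p)  # probability assigned to the true outcome
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt + 1e-12)))
```

With gamma=0 this is ordinary binary cross-entropy; increasing gamma shifts the training signal toward hard (often tail-label) examples, which is the intuition that class-balanced and distribution-balanced losses extend.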
Table 1: Basic statistics of the experimental datasets.
Table 2: Comparison of experimental results.
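Multi-label results of this kind are typically reported as micro- and macro-averaged F1. A minimal NumPy sketch (our own helper, not the paper's evaluation code) makes the difference between the two explicit:

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """Micro- and macro-averaged F1 for binary label matrices of shape (n_samples, n_labels)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = (y_true * y_pred).sum(axis=0)          # true positives per label
    fp = ((1 - y_true) * y_pred).sum(axis=0)    # false positives per label
    fn = (y_true * (1 - y_pred)).sum(axis=0)    # false negatives per label
    # Micro: pool counts over all labels, so frequent labels dominate
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro: average per-label F1, so every label counts equally
    per_label = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    return float(micro), float(per_label.mean())
```

Micro-F1 is dominated by head labels, while macro-F1 weights every label equally, so improvements on tail labels from balanced losses show up mainly in macro-F1.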
施涪軍, China CIO of Roche's Pharmaceuticals Division: This work comes from our collaborators' exploration of deep-learning applications in biomedicine. Compared with everyday text, biomedical corpora tend to be more specialized and more sparsely annotated, creating a "last mile" challenge for deploying AI. Starting from problems such as the long-tailed distribution of sparse annotations, this paper introduces and refines loss functions from cutting-edge computer vision research, allowing existing NLP models, with no change to their architecture, to shift training resources toward classes with few examples and thereby improve overall performance. We are delighted to see that this strategy is equally effective on everyday text facing similar problems, and we hope to keep building solid collaborations with universities and companies on the research and application of frontier technologies.
References:
Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. 2013. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13–24.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828.
Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. 2015. Addressing imbalance in multilabel classification: Measures and random resampling algorithms. Neurocomputing, 163:3–16.
Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics.
NCBI Resource Coordinators. 2017. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 46(D1):D8–D13.
Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9260–9269.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
T. Durand, N. Mehrasa, and G. Mori. 2019. Learning a deep convnet for multi-label classification with partial labels. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 647–657, Los Alamitos, CA, USA. IEEE Computer Society.
Andrew Estabrooks, Taeho Jo, and Nathalie Japkowicz. 2004. A multiple resampling method for learning from imbalanced data sets. Computational intelligence, 20(1):18–36.
Weifeng Ge, Sibei Yang, and Yizhou Yu. 2018. Multievidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Philip J. Hayes and Steven P. Weinstein. 1990. Construe/tis: A system for content-based indexing of a database of news stories. In Proceedings of the The Second Conference on Innovative Applications of Artificial Intelligence, IAAI ’90, page 49–64. AAAI Press.
Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label cooccurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.
Jianqiang Li, Guanghui Fu, Yueda Chen, Pengzhi Li, Bo Liu, Yan Pei, and Hui Feng. 2020a. A multilabel classification model for full slice brain computerised tomography image. BMC Bioinformatics, 21(6):200.
Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. 2020b. Dice loss for dataimbalanced NLP tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 465–476, Online. Association for Computational Linguistics.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2017. Focal loss for dense object detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2999–3007, Los Alamitos, CA, USA. IEEE Computer Society.
Zachary C. Lipton, Charles Elkan, and Balakrishnan Naryanaswamy. 2014. Optimal thresholding of classifiers to maximize F1 measure. In Machine Learning and Knowledge Discovery in Databases, pages 225–239, Berlin, Heidelberg. Springer Berlin Heidelberg.
Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571.
Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J Kim, and Johannes Fürnkranz. 2017. Maximizing subset accuracy with recurrent neural networks in multilabel classification. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Ankit Pal, Muru Selvakumar, and Malaikannan Sankarasubbu. 2020. Magnet: Multi-label text classification using attention-based graph neural network. In ICAART (2), pages 494–505.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. 2018. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pages 993–1002.
Che-Ping Tsai and Hung-yi Lee. 2020. Order-free learning alleviating exposure bias in multi-label classification. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty- Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages 6038–6045. AAAI Press.
George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almirantis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Artieres, Axel Ngonga, Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. 2015. An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics, 16:138.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
Jiawei Wu, Wenhan Xiong, and William Yang Wang. 2019. Learning to learn and predict: A metalearning approach for multi-label classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4354– 4364, Hong Kong, China. Association for Computational Linguistics.
Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. 2020. Distribution-balanced loss for multi-label classification in long-tailed datasets. In Computer Vision – ECCV 2020, pages 162–178, Cham. Springer International Publishing.
Wenshuo Yang, Jiyi Li, Fumiyo Fukumoto, and Yanming Ye. 2020. HSCNN: A hybrid-Siamese convolutional neural network for extremely imbalanced multi-label text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6716–6722, Online. Association for Computational Linguistics.
Yiming Yang and Xin Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’99, page 42–49, New York, NY, USA. Association for Computing Machinery.
Copyright Leiphone (雷峰網(wǎng)). Reproduction without authorization is prohibited.