Evaluating neural multi-field document representations for patent classification

Pujari, Subhash Chandra; Mantiuk, Fryderyk; Giereth, Mark; Strötgen, Jannik; Friedrich, Annemarie


Patent classification constitutes a long-tailed hierarchical learning problem. Prior work has demonstrated the efficacy of neural representations based on pre-trained transformers; however, due to the limited input size of these models, it has used only the title and abstract of patents as input. Patent documents consist of several textual fields, some of which are quite long. We show that a baseline using simple tf.idf-based methods can easily leverage this additional information. We propose a new architecture combining the neural transformer-based representations of the various fields into a meta-embedding, which we demonstrate to outperform the tf.idf-based counterparts, especially on less frequent classes. Using a relatively simple architecture, we outperform the previous state of the art on CPC classification by a margin of 1.2 macro-avg. F1 and 2.6 micro-avg. F1. We identify the textual field giving a “brief-summary” of the patent as most informative with regard to CPC classification, which points to interesting future directions of research on less computation-intensive models, e.g., by summarizing long documents before neural classification.
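The abstract reports both macro-averaged and micro-averaged F1, and the distinction matters for a long-tailed label set: micro-averaging pools all decisions and is dominated by frequent classes, while macro-averaging gives every class equal weight and so exposes weak performance on rare ones. The following is a minimal sketch of the two measures for multi-label predictions; the function names and the toy labels `frequent` and `rare` are hypothetical and do not come from the paper or its evaluation code.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for lists of label sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for label in g & p:   # predicted and correct
            tp[label] += 1
        for label in p - g:   # predicted but wrong
            fp[label] += 1
        for label in g - p:   # missed
            fn[label] += 1
    labels = set(tp) | set(fp) | set(fn)
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels) if labels else 0.0
    return micro, macro


# Toy data (hypothetical): 8 documents of a frequent class, 2 of a rare class.
# The classifier always predicts the frequent class.
gold = [{"frequent"}] * 8 + [{"rare"}] * 2
pred = [{"frequent"}] * 10
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

On this toy data the micro-averaged score stays high even though the rare class is never recognised, while the macro-averaged score drops sharply. This is why a gain on less frequent classes shows up more clearly in macro-averaged F1.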

Metadata
Author:	Subhash Chandra Pujari, Fryderyk Mantiuk, Mark Giereth, Jannik Strötgen, Annemarie Friedrich
URN:	urn:nbn:de:bvb:384-opus4-1055704
Frontdoor URL:	https://opus.bibliothek.uni-augsburg.de/opus4/105570
Parent Title (English):	BIR 2022 - Bibliometric-enhanced Information Retrieval: Proceedings of the 12th International Workshop on Bibliometric-enhanced Information Retrieval co-located with 44th European Conference on Information Retrieval (ECIR 2022), April 10th 2022, Stavanger, Norway
Publisher:	CEUR-WS
Place of publication:	Aachen
Editor:	Ingo Frommholz, Philipp Mayr, Guillaume Cabanac, Suzan Verberne
Type:	Conference Proceeding
Language:	English
Date of Publication (online):	2023/07/05
Year of first Publication:	2022
Publishing Institution:	Universität Augsburg
Release Date:	2023/07/10
First Page:	13
Last Page:	27
Series:	CEUR Workshop Proceedings ; 3230
Institutes:	Fakultät für Angewandte Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik
	Fakultät für Angewandte Informatik / Institut für Informatik / Lehrstuhl für Computerlinguistik
Dewey Decimal Classification:	0 Computer science, information & general works / 00 Computer science, knowledge & systems / 004 Data processing & computer science
Licence:	CC-BY 4.0: Creative Commons: Attribution

Open Access
