Sitemap

A list of all the posts and pages found on the site. For you robots out there, an XML version is available for digesting as well.

Pages

Posts

On the Duality of Intelligence

5 minute read

Published:

“All models are wrong, but some are useful,” goes the saying from George Box. I often find this saying best illustrated by the short story from Jorge Luis Borges. In the story, the ruler is eager to get a map of the empire. When his servants return to him with a map, he states that it does not contain enough detail and wants it to be bigger. After increasing the level of detail - and therefore also the size of the map - they return to the ruler, who again requests more detail. This repeats a few times until the map contains so much detail that it becomes the size of the empire itself. “Have you used it much?” asks the emperor. “It has never been spread out, yet,” says the servant: “the farmers objected: they said it would cover the whole country, and shut out the sunlight! So we now use the country itself, as its own map, and I assure you it does nearly as well.”

Publications

Estimating Post-OCR Denoising Complexity on Numerical Texts

Published in Communications in Computer and Information Science, Volume 1863, 2023

Post-OCR processing has significantly improved over the past few years. However, these improvements have been primarily beneficial for texts consisting of natural, alphabetical words, as opposed to documents of numerical nature such as invoices, payslips, medical certificates, etc. To evaluate the OCR post-processing difficulty of these datasets, we propose a method to estimate the denoising complexity of a text, evaluate it on several datasets of varying nature, and show that texts of numerical nature have a significant disadvantage. We evaluate the estimated complexity ranking with respect to the error rates of modern-day denoising approaches to show the validity of our estimator.

Recommended citation: Arthur Hemmer, Jérôme Brachat, Mickaël Coustaty, Jean-Marc Ogier. (2023). "Estimating Post-OCR Denoising Complexity on Numerical Texts." Communications in Computer and Information Science, Volume 1863. https://arxiv.org/abs/2307.01020

Lazy-k: Decoding for Constrained Information Extraction

Published in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023

We explore the possibility of improving probabilistic models in structured prediction. Specifically, we combine the models with constrained decoding approaches in the context of token classification for information extraction. The decoding methods search for constraint-satisfying label assignments while maximizing the total probability. To do this, we evaluate several existing approaches, as well as propose a novel decoding method called Lazy-k. Our findings demonstrate that constrained decoding approaches can significantly improve the models’ performances, especially when using smaller models. The Lazy-k approach allows for more flexibility between decoding time and accuracy. The code for using Lazy-k decoding can be found here: https://github.com/ArthurDevNL/lazyk.

Recommended citation: Hemmer, Arthur, Mickaël Coustaty, Nicola Bartolo, Jérôme Brachat, and Jean-Marc Ogier. "Lazy-k Decoding: Constrained Decoding for Information Extraction." In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6727-6736. 2023. https://aclanthology.org/2023.emnlp-main.416
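
To make the idea of constrained decoding in the abstract above concrete, here is a minimal sketch of a generic best-first search: it enumerates label assignments from most to least probable and returns the first one that satisfies a constraint. This is an illustration only and not the authors' Lazy-k implementation (see the linked repository for that). The function name `constrained_decode`, the `satisfies` callback, the `max_candidates` budget, and the toy probabilities are all assumptions made for this example.

```python
import heapq
import math
from typing import Callable, Optional, Sequence, Tuple


def constrained_decode(
    probs: Sequence[Sequence[float]],
    satisfies: Callable[[Tuple[int, ...]], bool],
    max_candidates: int = 1000,
) -> Optional[Tuple[int, ...]]:
    """Return the most probable label assignment that passes `satisfies`.

    probs[i][j] is the (assumed independent) probability of label j for
    token i. Returns None if no valid assignment is found within the
    `max_candidates` budget (a hypothetical time/accuracy knob).
    """
    n = len(probs)
    # Per token, label indices sorted from most to least probable.
    order = [sorted(range(len(p)), key=lambda j: -p[j]) for p in probs]

    def log_score(ranks: Tuple[int, ...]) -> float:
        total = 0.0
        for i, r in enumerate(ranks):
            p = probs[i][order[i][r]]
            total += math.log(p) if p > 0 else -math.inf
        return total

    # A state stores, per token, the rank of the chosen label (0 = best).
    start = (0,) * n
    heap = [(-log_score(start), start)]
    seen = {start}
    for _ in range(max_candidates):
        if not heap:
            break
        _, ranks = heapq.heappop(heap)
        labels = tuple(order[i][r] for i, r in enumerate(ranks))
        if satisfies(labels):
            return labels
        # Successors: move exactly one token to its next-best label.
        # Each successor scores no higher than its parent, so popping
        # from the heap visits assignments in non-increasing probability.
        for i in range(n):
            if ranks[i] + 1 < len(order[i]):
                nxt = ranks[:i] + (ranks[i] + 1,) + ranks[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-log_score(nxt), nxt))
    return None


# Toy example: label 0 = "other", label 1 = "amount" (made-up numbers).
toy_probs = [[0.6, 0.4], [0.7, 0.3], [0.9, 0.1]]
# Constraint: exactly one token must be labelled "amount".
result = constrained_decode(toy_probs, lambda labels: sum(labels) == 1)
```

In this toy case the unconstrained argmax labels every token as "other" and violates the constraint, so the search falls back to the next most probable assignment that labels exactly one token as "amount". Capping the number of candidates examined is one simple way to trade decoding time against accuracy.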