Extracting Particular Information from Swedish Public Procurement Using Machine Learning

Bibliographic Details
Main Author: Waade, Eystein
Format: Others
Language: English
Published: Uppsala universitet, Avdelningen för systemteknik 2020
Subjects:
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-414562
Description
Summary: Swedish public procurement has a yearly value of 706 billion SEK across approximately 18,000 procurements. Each procurement comes with many documents, written in different formats, that a prospective tenderer must understand before submitting a tender. With the development of new technology and the rise of machine learning, it is of great interest to investigate how this knowledge can be used to improve the way we procure. The goal of this project was to investigate whether public procurements written in Swedish in PDF format can be parsed and segmented into a structured format. The process was divided into three parts: pre-processing, annotation, and training/evaluation. Pre-processing was done with an open-source PDF parser called pdfalto, which produces structured XML files with layout and lexical information. Annotation consisted of generalizing a procurement into high-level segments that are applicable to different document structures, as well as finding relevant features. This was accomplished by identifying frequent document formats so that many documents could be annotated using deterministic rules. Finally, a linear-chain Conditional Random Field was trained and tested to segment the documents. The models performed well when tested on documents of the same format as those they were trained on. However, the data from five different documents were not sufficient or general enough for the model to make reliable predictions on a sixth format that it had not seen before. The best result was a total accuracy of 90.6%, where two of the labels had an F1-score above 95% and the other two labels had F1-scores of 51.8% and 63.3%.
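
The pre-processing and rule-based annotation steps described in the summary can be illustrated with a minimal Python sketch. It is not the thesis implementation. It assumes ALTO-style XML in which each `TextLine` carries `HPOS` and `HEIGHT` attributes and holds `String` children with a `CONTENT` attribute. The sample document, the feature names, the `heading`/`body` labels, and the height threshold are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the ALTO style; not real pdfalto output.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page WIDTH="600" HEIGHT="800"><PrintSpace><TextBlock>
    <TextLine HPOS="50" VPOS="40" HEIGHT="18">
      <String CONTENT="1."/><SP/><String CONTENT="Inledning"/>
    </TextLine>
    <TextLine HPOS="50" VPOS="70" HEIGHT="11">
      <String CONTENT="Anbud"/><SP/><String CONTENT="ska"/><SP/>
      <String CONTENT="vara"/><SP/><String CONTENT="skriftligt."/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def line_features(alto_xml):
    """Return one feature dict per TextLine, in document order."""
    root = ET.fromstring(alto_xml)
    features = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        # Collect the words of the line from its String children.
        words = [
            child.get("CONTENT", "")
            for child in elem
            if local_name(child.tag) == "String"
        ]
        features.append({
            "text": " ".join(words),
            "n_tokens": len(words),
            "hpos": float(elem.get("HPOS", 0)),
            "height": float(elem.get("HEIGHT", 0)),
            "starts_with_number": bool(words) and words[0][:1].isdigit(),
        })
    return features


def rule_label(feat, heading_min_height=14.0):
    """Hypothetical deterministic rule: tall, numbered lines are headings."""
    if feat["starts_with_number"] and feat["height"] >= heading_min_height:
        return "heading"
    return "body"


features = line_features(SAMPLE_ALTO)
labels = [rule_label(f) for f in features]
```

In a pipeline like the one summarized above, per-line feature dictionaries and rule-derived labels of this kind would form the training sequences for a linear-chain CRF; the features and labels actually used in the thesis are not stated in this record.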