Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts

A large amount of descriptive information is available in geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descript...


Bibliographic Details
Main Authors: J. Padarian, I. Fuentes
Format: Article
Language: English
Published: Copernicus Publications 2019-07-01
Series: SOIL
Online Access: https://www.soil-journal.net/5/177/2019/soil-5-177-2019.pdf
id doaj-27f3f2b580c2490db0880790aca1f765
record_format Article
spelling doaj-27f3f2b580c2490db0880790aca1f765 2020-11-25T01:15:06Z eng Copernicus Publications SOIL 2199-3971 2199-398X 2019-07-01 5 177 187 10.5194/soil-5-177-2019 Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts J. Padarian I. Fuentes https://www.soil-journal.net/5/177/2019/soil-5-177-2019.pdf
collection DOAJ
language English
format Article
sources DOAJ
author J. Padarian
I. Fuentes
author_facet J. Padarian
I. Fuentes
author_sort J. Padarian
title Word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts
title_sort word embeddings for application in geosciences: development, evaluation, and examples of soil-related concepts
publisher Copernicus Publications
series SOIL
issn 2199-3971
2199-398X
publishDate 2019-07-01
description A large amount of descriptive information is available in geosciences. This information is usually considered subjective and ill-favoured compared with its numerical counterpart. Considering the advances in natural language processing and machine learning, it is possible to utilise descriptive information and encode it as dense vectors. These word embeddings, which encode information about a word and its linguistic relationships with other words, lie in a multidimensional space where angles and distances have a linguistic interpretation. We used 280 764 full-text scientific articles related to geosciences to train a domain-specific language model capable of generating such embeddings. To evaluate the quality of the numerical representations, we performed three intrinsic evaluations: the capacity to generate analogies, term relatedness compared with the opinion of a human subject, and categorisation of different groups of words. As this is the first attempt to evaluate word embeddings for tasks in the geosciences domain, we created a test suite specific to geosciences. We compared our results with general domain embeddings commonly used in other disciplines. As expected, our domain-specific embeddings (GeoVec) outperformed general domain embeddings in all tasks, with an overall performance improvement of 107.9 %. We also presented an example where we successfully emulated part of a taxonomic analysis of soil profiles that was originally applied to soil numerical data, which would not be possible without the use of embeddings. The resulting embedding and test suite will be made available for other researchers to use and expand upon.
url https://www.soil-journal.net/5/177/2019/soil-5-177-2019.pdf
work_keys_str_mv AT jpadarian wordembeddingsforapplicationingeosciencesdevelopmentevaluationandexamplesofsoilrelatedconcepts
AT ifuentes wordembeddingsforapplicationingeosciencesdevelopmentevaluationandexamplesofsoilrelatedconcepts
_version_ 1725154395674378240
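
note The description says that angles and distances between word embeddings have a linguistic interpretation and that one of the intrinsic evaluations is the capacity to generate analogies. The Python sketch below illustrates that general idea only. The three-dimensional vectors, the word list, and the helper names `cosine_similarity` and `solve_analogy` are all hypothetical and invented for this note; they are not GeoVec values, and this is not presented as the authors' evaluation code.

```python
import math

# Toy 3-dimensional vectors, made up for illustration (NOT GeoVec values).
# Real embeddings typically have hundreds of dimensions.
toy_embeddings = {
    "sand": [0.9, 0.1, 0.0],
    "sandstone": [0.9, 0.1, 0.8],
    "clay": [0.1, 0.9, 0.0],
    "claystone": [0.1, 0.9, 0.8],
    "river": [0.5, 0.5, -0.6],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the word closest to b - a + c.

    The three query words are excluded from the candidates, as is usual
    for this kind of vector-offset analogy test.
    """
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


if __name__ == "__main__":
    # Related words point in more similar directions than unrelated ones.
    print(cosine_similarity(toy_embeddings["sand"], toy_embeddings["sandstone"]))
    print(cosine_similarity(toy_embeddings["sand"], toy_embeddings["clay"]))
    # 'sand is to sandstone as clay is to ?'
    print(solve_analogy("sand", "sandstone", "clay", toy_embeddings))
```

With these toy vectors the offset from "sand" to "sandstone" equals the offset from "clay" to "claystone", so the analogy resolves to "claystone". In a trained model the offsets match only approximately, which is why analogy accuracy can serve as a quality measure.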