Targeted s-gram matching: a novel n-gram matching technique for cross- and monolingual word form variants

We present a novel n-gram based string matching technique, which we call the targeted s-gram matching technique. In the technique, n-grams are classified into categories on the basis of character contiguity in words. The categories are then utilized in matching. The technique was compared with the c...
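As an illustration of the idea summarized above, the following Python sketch groups character pairs by the number of characters skipped between them and compares two words category by category. The function names, the default grouping of skip distances, and the overlap-based score are assumptions made for this example; this record does not specify how the authors define the categories or compute similarity.

```python
def skip_digrams(word, skip):
    """Return the set of character pairs in `word` that have exactly
    `skip` characters between them (skip=0 gives adjacent pairs)."""
    step = skip + 1
    return {word[i] + word[i + step] for i in range(len(word) - step)}


def sgram_similarity(a, b, categories=((0,), (1, 2))):
    """Compare two words category by category.

    Each category is a tuple of skip distances whose pairs are pooled.
    The default grouping and the score (shared pairs divided by all
    distinct pairs, summed over categories) are assumptions for this
    sketch, not values taken from the record.
    """
    shared = 0
    total = 0
    for category in categories:
        grams_a = set().union(*(skip_digrams(a, k) for k in category))
        grams_b = set().union(*(skip_digrams(b, k) for k in category))
        # Pairs are only compared with pairs from the same category.
        shared += len(grams_a & grams_b)
        total += len(grams_a | grams_b)
    return shared / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical word pair chosen only to exercise the functions.
    print(sgram_similarity("calcitonin", "kalsitoniini"))
```

Passing categories=((0,),) reduces the sketch to conventional matching on adjacent character pairs, which gives a simple baseline for comparison.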


Bibliographic Details
Main Authors: Ari Pirkola, Heikki Keskustalo, Erkka Leppänen, Antti-Pekka Känsälä, Kalervo Järvelin
Format: Article
Language: English
Published: University of Borås 2002-01-01
Series: Information Research: An International Electronic Journal
Online Access: http://informationr.net/ir/7-2/paper126.html
collection DOAJ
sources DOAJ
publisher University of Borås
series Information Research: An International Electronic Journal
issn 1368-1613
publishDate 2002-01-01
description We present a novel n-gram based string matching technique, which we call the targeted s-gram matching technique. In this technique, n-grams are classified into categories on the basis of character contiguity in words, and the categories are then used in matching. The technique was compared with the conventional n-gram technique, which uses adjacent characters as n-grams. Several types of words and word pairs were studied. English, German, and Swedish query keys were matched against their Finnish spelling variants and Finnish morphological variants using a target word list of 119 000 Finnish words. In all cross-lingual tests, the targeted s-gram matching technique outperformed the conventional n-gram matching technique. The technique was also highly effective for monolingual word form variants. The effects of query key length and of the length of the longest common subsequence (LCS) of the variants on the performance of s-grams were analyzed.