Summary: | In machine learning, one typically has access to datasets that can be represented either as feature vectors for the items of interest or as pairwise distances or similarity scores between items. Sometimes, however, one has neither, but only an ordinal notion of similarity: one might be able to identify the most similar pair among three items without being able to say precisely how similar any pair is. For example, it is much easier for a human assessor to say which pair of movies or songs seems the most similar among three than to assign consistent similarity scores to arbitrary pairs, or even to say exactly which properties of those movies or songs are important to compare. This thesis considers the task of taking such ordinal data as input, in the form of a set T of triplets (a, b, c), each meaning that "object a is more similar to b than to c," and assigning a Euclidean vector x_i to each item i so that (a, b, c) ∈ T ⇒ ∥x_a − x_b∥ < ∥x_a − x_c∥.
We explore how to identify a minimal subset of triplets to adequately constrain such an embedding, how to efficiently recover an embedding for an adaptively chosen subset of triplets, and ultimately how ordinal triplets directly imply geometric properties. The methods presented achieve significant improvements in the time needed to produce an ordinal embedding, in the quality of the embeddings produced, and in the number of objects and dimensions that can be accurately represented.
|