Open datasets are typically catalogued on official portals under predefined lists of categories and accompanied by short metadata. Despite the search functions these catalogues provide, an ordinary user often cannot find relevant information quickly. This can be caused by non-intuitive or limited data descriptions, misleading naming conventions, incorrect assignment of categories to datasets, a user’s lack of in-depth knowledge of the subject, or simply the fact that the search is conducted only over the metadata records rather than over the data itself. Current solutions do not sufficiently meet functional and non-functional requirements.

To improve the current process of data discovery, the data should be indexed and ranked in a meaningful way during a search operation. A first investigation will explore how to apply lossless compression/summarisation techniques to tabular data to create input records for a search index.
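
As a rough illustration of what such an input record could look like, the sketch below dictionary-encodes each column of a small CSV table. This is only one possible lossless technique, chosen here for simplicity; the text above does not specify which technique will be used. The function names (summarise_table, restore_rows, index_terms) and the sample data are hypothetical, and the code assumes a well-formed CSV with a header row and the same number of cells in every row.

```python
import csv
import io


def summarise_table(csv_text):
    """Dictionary-encode each column of a CSV table (illustrative only).

    Returns a dict mapping column name -> {"values": [...], "codes": [...]}.
    "values" lists the distinct cell values in first-seen order; "codes"
    gives, for every row, the position of its cell value in "values".
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    columns = reader.fieldnames or []
    record = {}
    for column in columns:
        values, positions, codes = [], {}, []
        for row in rows:
            cell = row[column]
            if cell not in positions:
                positions[cell] = len(values)
                values.append(cell)
            codes.append(positions[cell])
        record[column] = {"values": values, "codes": codes}
    return record


def restore_rows(record):
    """Rebuild the original rows, showing that the encoding is lossless."""
    columns = list(record)
    if not columns:
        return []
    n_rows = len(record[columns[0]]["codes"])
    return [
        {c: record[c]["values"][record[c]["codes"][i]] for c in columns}
        for i in range(n_rows)
    ]


def index_terms(record):
    """Collect column names and distinct values as candidate index terms."""
    terms = set(record)
    for column in record.values():
        terms.update(column["values"])
    return sorted(terms)


# Hypothetical sample table, used only to exercise the functions above.
sample = "city,year\nLeeds,2015\nLeeds,2016\nYork,2015\n"
record = summarise_table(sample)
original_rows = list(csv.DictReader(io.StringIO(sample)))
```

Because every distinct value is stored only once per column, the record is typically smaller than the table when values repeat, while restore_rows(record) still returns the original rows. The output of index_terms(record) could then be passed to whatever search index is used, so that queries match the data itself rather than only the metadata.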

The final outcome of the proposed improvements is a better understanding of the data structure. Aggregation and summarisation methods, supported by indexing and ranking algorithms, can improve the coarse process of obtaining information at every level and result in a better quality of service both for end users and for the authorities releasing data.



Learning when searching for web data. Laura Koesten, Emilia Kacprzak, Jenifer Tennison. Search as Learning (SAL) workshop @SIGIR 2016.

Position Paper: Dataset profiling for un-Linked Data. Emilia Kacprzak, Laura Koesten, Tom Heath, Jeni Tennison. PROFILES Workshop. ESWC 2016 (Satellite Events)

The Trials and Tribulations of Working with Structured Data - a Study on Information Seeking Behaviour. Laura Koesten, Emilia Kacprzak, Jeni Tennison, Elena Simperl. CHI 2017

A Query Log Analysis of Dataset Search. Emilia Kacprzak, Laura M Koesten, Luis-Daniel Ibáñez, Elena Simperl, Jeni Tennison. ICWE 2017