A typical QA system is, empirically, only as good as the performance of its indexing module. Indexing performance serves as an upper bound on the overall output of the QA system, since the system can process only as much data as the indices serve to it. The precision and recall of the system may be good, but if all or most of the top relevant documents are not indexed, system performance suffers and so does the end-user experience. At present, information retrieval systems are carefully tailored and optimised to deliver highly accurate results for specific tasks. Over the years, efforts to develop such task-specific systems have diversified along a variety of factors, discussed below.

Depending on the type of data and the application setting, a wide range of indexing techniques is deployed. They can broadly be grouped into three categories according to the format and type of data indexed: structured (e.g., RDF, SQL), semi-structured (e.g., HTML, XML, JSON, or CSV), and unstructured data (e.g., text dumps). They are further distinguished by the indexing technique they use and by the type of queries a particular technique can address. These techniques rely on a wide spectrum of underlying data structures to achieve the desired result. Most systems dealing with unstructured or semi-structured data use inverted indices and lists for indexing. For structured data, a variety of data structures, such as AVL trees, B-trees, sparse indices, and IR trees, have been developed over the past decades. Many data management systems combine two or more data structures to maintain different indices for different data attributes. ESR2 plans to benchmark the performance of different cross-domain tools across a wide variety of data management solutions such as OpenLink Virtuoso, RDF-3X, Neo4j, Sparksee, and Apache TinkerPop.
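To illustrate the inverted-index approach mentioned above for unstructured and semi-structured data, the following is a minimal sketch in Python. The document collection, term tokenisation, and the boolean AND query are illustrative assumptions, not part of any benchmarked system:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def query_and(index, terms):
    """Return IDs of documents containing all query terms (boolean AND)."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical toy collection for demonstration only.
docs = {
    1: "RDF stores index structured data",
    2: "inverted indices serve unstructured text",
    3: "benchmarking RDF and graph data management",
}
index = build_inverted_index(docs)
print(query_and(index, ["RDF", "data"]))  # -> [1, 3]
```

Production systems additionally compress the posting lists and attach term statistics for ranking, but the term-to-postings mapping shown here is the common core.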
In the reporting period, ESR2 has successfully surveyed the existing data management solutions for indexing heterogeneous data. He has also been working on developing the LITMUS benchmarking framework for automating the process of benchmarking data management solutions, i.e., reproducing existing benchmarks, identifying the independent factors, and analysing the experimental results via machine-readable reports.
In addition, ESR2, in collaboration with ESR1, has surveyed the existing data quality assessment dimensions that are most relevant to QA settings. In the reporting period, he has published three papers at international conferences and workshops. ESR2 has participated in the 1st and 2nd WDAqua Learning Weeks as well as the 1st and 2nd R&D Weeks, where he received technical and non-technical training. He has participated in international conferences such as ESWC 2016 and, together with ESR4, won the best paper award at the PROFILES 2016 workshop.
Are Linked Datasets fit for Open-domain Question Answering? A Quality Assessment. Harsh Thakkar, Kemele M. Endris, José M. Giménez-García, Jeremy Debattista, Christoph Lange, Sören Auer. WIMS 2016: 19:1-19:12 URL PDF
Question Answering on Linked Data: Challenges and Future Directions. Saeedeh Shekarpour, Denis Lukovnikov, Ashwini Jaya Kumar, Kemele Endris, Kuldeep Singh, Harsh Thakkar, Christoph Lange. Q4APS at WWW 2016: 693-698 URL PDF
QAestro: Semantic-Based Composition of Question Answering Pipelines. Kuldeep Singh, Ioanna Lytra, Maria-Esther Vidal, Dharmen Punjani, Harsh Thakkar, Christoph Lange, Sören Auer. DEXA 2017: 19-34 URL PDF
Trying Not to Die Benchmarking: Orchestrating RDF and Graph Data Management Solution Benchmarks Using LITMUS. Harsh Thakkar, Yashwant Keswani, Mohnish Dubey, Jens Lehmann, Sören Auer. SEMANTiCS 2017 PDF
Dataset Reuse: An Analysis of References in Community Discussions, Publications and Data. Kemele M. Endris, José M. Giménez-García, Harsh Thakkar, Elena Demidova, Antoine Zimmermann, Christoph Lange, Elena Simperl. K-CAP 2017 PDF