This intervention draws on experimentations ongoing in the context of the OECD-led Statistical Information System Collaboration Community (SIS-CC) to enable AI applications with SDMX. One important use case is to use AI for better accessibility and discoverability of the data: whilst UX techniques, lexical search improvements, and data harmonisation can take statistical organisations to a good level of accessibility, however, a structural (or “cognitive” gap) remains between the data user needs and the data producer constraints. That is where AI – and most importantly, NLP and LLM techniques – could potentially make a difference. The “StatsBot” could be this natural language, conversational engine that could facilitate access and usage of the data. The “StatsBot” could leverage the semantics of any SDMX source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal and create the StatsBot as a universal, open asset usable by all statistical organisations. In a first step, the concept tested is to use Large Language Models with the Apache Solr index of SDMX objects so as to transform natural language queries into SDMX queries. In a second step, results could be framed as a natural language statement complementing the top-k search results. For the purpose of initial PoCs – aimed to demonstrate functional features and feasibility – a commercial LLM (such as OpenAI GPT-4) will be used; in a later stage substitution with an open source LLM will be analysed. The presentation will include the results of the first experimental work, lessons learnt, and scope future work that should lead to defining the path for production-grade, fully open source, and universal StatsBot.