Charting large materials dataspaces: AI methods and scalability


Across a wide range of fields and in particular in materials science, there is increasing awareness that big data is a fundamental resource for fostering deeper understanding of physical systems and ensuring reproducibility of calculations.

It is crucial to realize that “big” does not refer only to the sheer amount of data, but also to their complexity. For example, in materials science, a material is typically characterized by an intricate hierarchy of observables including ensemble averages at various thermodynamic conditions. Another crucial aspect is the need to validate and quantify uncertainty, i.e., being able to assign to any single entry in the database a level of accuracy so that data points from disparate sources can be used concurrently in an analysis.

Such awareness has motivated the creation of large computational materials-science databases. Some are “project-based”, i.e., collections of high-throughput scans of given materials classes (e.g., AFLOW [1], Materials Project [2], OQMD [3]), while others collect data from heterogeneous sources (e.g., NOMAD [4], Materials Cloud [5]).

In order for the data to be (re-)usable for new analyses and possibly discoveries, they have to comply with the so-called FAIR (findable - accessible - interoperable - reusable/repurposable/recyclable) principles [6].

This requires complex, hierarchical metadata structures that annotate the data, so that the users know the provenance (settings, purpose) of a calculation in order to judge whether an entry can be part of a dataset to be analysed [7].

The complexity and extent of the existing databases, which can only grow in both respects, reveal a rarely addressed challenge: the ability to explore the databases themselves efficiently in order to reveal patterns and trends.

Here, exploration refers specifically to the possibility of producing dynamic, visual maps of the databases’ content. For instance, a user may be looking for ternary materials that do not contain radioactive species, and would like to understand how diverse the entries are, i.e., whether they span the materials space more or less uniformly or are clustered into classes. In the latter case, understanding what the members of a class have in common is a challenge in itself.
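The example query above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the toy entries, the (incomplete) list of radioactive elements, and the choice of the mean pairwise Jaccard distance between element sets as a crude diversity measure. It is not the interface of any of the databases cited here.

```python
from itertools import combinations

# Assumption: an illustrative, incomplete set of radioactive elements.
RADIOACTIVE = {"Tc", "Pm", "Po", "At", "Rn", "Fr", "Ra",
               "Ac", "Th", "Pa", "U", "Np", "Pu"}


def is_ternary_non_radioactive(elements):
    """True if the entry has exactly three species, none of them radioactive."""
    species = set(elements)
    return len(species) == 3 and not (species & RADIOACTIVE)


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two element sets."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def mean_pairwise_distance(entries):
    """Average Jaccard distance over all pairs; 0.0 if fewer than two entries."""
    pairs = list(combinations(entries, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy database: each entry is just its tuple of chemical species.
database = [
    ("Ba", "Ti", "O"),
    ("Sr", "Ti", "O"),
    ("Ga", "As"),              # binary: filtered out
    ("U", "O", "F"),           # contains a radioactive species: filtered out
    ("Li", "Fe", "P", "O"),    # quaternary: filtered out
    ("Cu", "In", "Se"),
]

selected = [e for e in database if is_ternary_non_radioactive(e)]
diversity = mean_pairwise_distance(selected)
print(selected, round(diversity, 3))
```

A value close to 1 suggests that the selected entries share few species, while a value close to 0 suggests a chemically narrow selection. A realistic descriptor would also have to encode structure and thermodynamic conditions, which this sketch ignores.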

This and similar kinds of questions call for interactive, dynamic, and intelligent (i.e., artificial-intelligence-driven) tools that are also efficient, i.e., able to propose a meaningful solution within seconds.

In summary, in order to harvest the as yet untapped richness contained in present and future materials-science data, four pillars need to be developed concurrently:

  • FAIR-compliant materials databases
  • Identification of proper descriptors and metrics for capturing the similarity amongst materials, including the complex restructuring occurring at varying environmental conditions [8]
  • Artificial-intelligence (AI) approaches for exploratory analysis: clustering, dimension reduction and corresponding visualization that can reveal hidden patterns [9]
  • Scalable implementations, combining a careful choice of hardware with algorithmic speed-ups (e.g., landmarking) [10]
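As an illustration of the kind of algorithmic speed-up mentioned in the last item, the sketch below shows one common reading of landmarking: a small set of representative points is picked by greedy farthest-point sampling, and every entry is then assigned to its nearest landmark. This is a minimal sketch under stated assumptions, not the method of reference [10]: the two-dimensional descriptor vectors, the Euclidean metric, and the function names are all hypothetical.

```python
import math
import random


def farthest_point_landmarks(points, n_landmarks, start=0):
    """Greedy farthest-point sampling.

    Each new landmark is the point farthest from all landmarks chosen so far.
    Returns the indices of the selected landmarks.
    """
    landmarks = [start]
    # min_dist[i]: distance from point i to its nearest landmark so far
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(landmarks) < min(n_landmarks, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        landmarks.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return landmarks


def assign_to_landmarks(points, landmarks):
    """Label every point with the index of its nearest landmark."""
    return [min(landmarks, key=lambda j: math.dist(p, points[j]))
            for p in points]


# Hypothetical descriptors: three well-separated groups of 50 points each.
random.seed(0)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + random.gauss(0, 0.3), cy + random.gauss(0, 0.3))
          for cx, cy in centers for _ in range(50)]

landmarks = farthest_point_landmarks(points, n_landmarks=3)
labels = assign_to_landmarks(points, landmarks)
print(landmarks, len(set(labels)))
```

With N entries and k landmarks, both steps need on the order of N·k distance evaluations rather than the N² required by a full pairwise distance matrix, which is what makes interactive response times plausible for large databases.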

In the proposed workshop, experts in all these aspects, not necessarily limited to materials-science applications, will come together to compare ideas and solutions for performing flexible, interactive, efficient, and insightful analyses of materials databases.

Published on June 8, 2022