QuChemPedIA@home

QuChemPedIA@home
Project
Status	Completed
Category	Chemistry / Computational chemistry
Compute	CPU
Requires	VirtualBox (Windows, macOS); native (Linux)
Development
Developer	Benoit Da Mota, Thomas Cauchy
Author	Benoit Da Mota, Thomas Cauchy
Sponsor	University of Angers
Maintainer	Benoit Da Mota (LERIA)
Initial release	July 1, 2019
Completed	September 29, 2022
Software
Written in	Python, Shell (wrapper); NWChem (computation)
Operating system	Windows, Linux, macOS
Metadata
Website	https://quchempedia.univ-angers.fr/athome/

QuChemPedIA@home (short for Quantum Chemistry Collaborative EncyclopedIA, with a nod to Intelligence Artificielle) was a volunteer computing project running on the BOINC platform, hosted at the University of Angers in France. Active from October 2019 to April 2023, the project recruited volunteers worldwide to donate idle CPU time to run density functional theory (DFT) quantum chemistry calculations on small organic molecules. Its ultimate goal was to build an open encyclopedia of quantum molecular chemistry results and to fuel artificial intelligence research aimed at drug discovery, materials science, and energy applications.

Background and motivation

Quantum chemistry has become nearly indispensable in modern chemical research. By solving approximations to the Schrödinger equation for electrons in a molecule, researchers can predict properties such as total energy, molecular geometry, and frontier orbital energies (the HOMO and LUMO) without ever needing to synthesize the compound in a laboratory. Despite the importance of these calculations, the researchers behind QuChemPedIA@home observed that the raw data produced by such computations was almost never shared openly: files were kept in local lab storage or simply discarded after a paper was published.^[1]

At the same time, the field of machine learning was producing models capable of predicting molecular properties at a fraction of the cost of a full quantum chemical calculation, but only when trained on large, diverse, high-quality datasets. The bottleneck was computing power: generating hundreds of thousands of DFT-level results for novel molecules could require years on a typical academic cluster. QuChemPedIA@home addressed both problems at once by building a large, freely available dataset of quantum chemical results with the help of volunteer computing.^[1]

The name is a portmanteau of quantum chemistry, encyclopedia, and the French phrase intelligence artificielle (artificial intelligence), reflecting the project's dual mission of archiving and enabling AI-driven molecular exploration.^[2]

Research team

The project was founded and operated by two researchers at the University of Angers:

Benoit Da Mota, an assistant professor of computer science at the LERIA laboratory (Laboratoire d'Etude et de Recherche en Informatique d'Angers), who handled the BOINC infrastructure, input generation, and AI/optimization aspects of the work.^[3]
Thomas Cauchy, an assistant professor of chemistry at the MOLTECH-Anjou laboratory (UMR CNRS 6200), who oversaw the computational chemistry and molecular generation aspects.^[3]

Other significant contributors to the associated research included Jules Leguy, Marta Glavatskikh, and Beatrice Duval, all affiliated with the University of Angers.^[4]

Scientific objectives

Molecular orbitals of formaldehyde computed with quantum chemistry methods. QuChemPedIA@home computed similar properties for hundreds of thousands of small organic molecules.

Building an open quantum chemistry encyclopedia

The project's first and foundational goal was to create a large, openly accessible repository of quantum chemical calculation results. Every output file produced by a volunteer's computer was to be retained and made available for reuse, in contrast to the prevailing norm of discarding raw data after publication.^[1]

The philosophy was one of open science: if quantum mechanical models of molecules are used in virtually every major chemistry publication, then the computational results underpinning those models should be treated as a public scientific resource rather than a private lab asset.

Exploring chemical space with artificial intelligence

The project's second goal was to enable AI-driven exploration of chemical space. The space of theoretically possible small organic molecules is astronomically large: even restricting attention to molecules with up to nine non-hydrogen atoms drawn from carbon, oxygen, nitrogen, and fluorine, the combinatorial space is far too large to enumerate and evaluate exhaustively. Generative molecular models, which learn to propose new molecules with desired properties, were seen as a way to navigate this space efficiently.^[1]

Such models require large, diverse datasets of accurate molecular properties for training. QuChemPedIA@home was designed to be the engine that provided those properties at scale.

Computing methodology

NWChem and density functional theory

All molecular calculations in the project were carried out with NWChem, an open-source ab initio computational chemistry software package developed at the Environmental Molecular Sciences Laboratory of Pacific Northwest National Laboratory (PNNL) and distributed under the Educational Community License 2.0.^[5]

The primary level of theory used was DFT with the B3LYP functional, a widely used hybrid exchange-correlation functional. The 3-21G basis set was used for lighter tasks (short work units), while the project later introduced longer tasks at higher accuracy levels.^[4]^[6]

For each candidate molecule, the key quantities computed were:

The total electronic energy $E_{total}$
The geometry of the molecule at its energy minimum
The energy of the Highest Occupied Molecular Orbital (HOMO), $E_{HOMO}$
The energy of the Lowest Unoccupied Molecular Orbital (LUMO), $E_{LUMO}$
The HOMO-LUMO gap, $\Delta E = E_{LUMO} - E_{HOMO}$, which serves as an indicator of the molecule's chemical reactivity and optical properties
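As a minimal sketch of how the frontier orbital quantities in this list relate to one another, the following Python snippet picks out the HOMO and LUMO from a list of orbital energies and computes the gap. It assumes a closed-shell molecule (two electrons per occupied orbital); the function name and the orbital energies are made up for illustration and are not taken from the project's output.

```python
HARTREE_TO_EV = 27.211386  # conversion factor from hartree to electronvolt (rounded)


def homo_lumo_gap(orbital_energies, n_electrons):
    """Return (E_HOMO, E_LUMO, gap) in hartree for a closed-shell molecule.

    orbital_energies: orbital energies in hartree, in any order
    n_electrons: total number of electrons (must be even in this sketch)
    """
    if n_electrons % 2 != 0:
        raise ValueError("closed-shell sketch: electron count must be even")
    energies = sorted(orbital_energies)
    n_occupied = n_electrons // 2  # two electrons per occupied orbital
    if not 0 < n_occupied < len(energies):
        raise ValueError("need at least one occupied and one virtual orbital")
    e_homo = energies[n_occupied - 1]  # highest occupied orbital
    e_lumo = energies[n_occupied]      # lowest unoccupied orbital
    return e_homo, e_lumo, e_lumo - e_homo


# Made-up orbital energies (hartree) for a hypothetical 4-electron system
e_homo, e_lumo, gap = homo_lumo_gap([-10.2, -0.75, -0.30, 0.05, 0.40], 4)
gap_ev = gap * HARTREE_TO_EV
print(f"HOMO = {e_homo} Ha, LUMO = {e_lumo} Ha, gap = {gap_ev:.2f} eV")
```

Quantum chemistry packages usually report orbital energies in hartree, so a conversion to electronvolts is often applied before comparing gaps with optical data.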

Work unit distribution

Input files for NWChem were generated automatically by scripts written by the team.^[6] On Linux, a native wrapper ran NWChem directly. On Windows and macOS, the calculations ran inside a VirtualBox virtual machine to ensure consistent results across heterogeneous hardware.^[2]
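The project's own input-generation scripts are not reproduced here. As a hedged illustration of what automated input generation can look like, the sketch below builds an NWChem input deck for a B3LYP/3-21G geometry optimization from a list of atoms. The function name, the deck layout, and the approximate water geometry are assumptions made for this example; only standard NWChem directives (`start`, `title`, `geometry`, `basis`, `dft`, `task`) are used.

```python
def make_nwchem_input(name, atoms, basis="3-21G", functional="b3lyp"):
    """Build an NWChem input deck for a DFT geometry optimization.

    atoms: iterable of (element, x, y, z) with coordinates in angstroms
    """
    lines = [
        f"start {name}",
        f'title "{name} {functional}/{basis} optimization"',
        "",
        "geometry units angstroms",
    ]
    for element, x, y, z in atoms:
        lines.append(f"  {element:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines += [
        "end",
        "",
        "basis",
        f"  * library {basis}",  # same basis set for every element
        "end",
        "",
        "dft",
        f"  xc {functional}",    # exchange-correlation functional
        "end",
        "",
        "task dft optimize",     # optimize the geometry at the DFT level
    ]
    return "\n".join(lines) + "\n"


# Approximate starting geometry for water (illustrative values only)
water = [
    ("O", 0.0, 0.0, 0.117),
    ("H", 0.0, 0.757, -0.469),
    ("H", 0.0, -0.757, -0.469),
]
deck = make_nwchem_input("water", water)
print(deck)
```

Generating decks from a single template keeps the level of theory identical across all molecules, which matters when the results are pooled into one dataset.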

The project offered two types of work unit: standard ("NWChem") tasks lasting several hours, and "NWChem long" tasks that could run for several dozen hours, intended for higher-accuracy calculations.^[2] Because many candidate molecules are chemically unstable or otherwise unsuitable for the level of theory employed, a significant fraction of work units were expected to fail; the BOINC infrastructure resubmitted failed tasks to other hosts before declaring a definitive error.^[6]
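The resubmission behaviour described above can be pictured with a short, self-contained Python sketch. It is not BOINC code: the host interface (a callable that either returns a result or raises `RuntimeError`) and the limit of three attempts are assumptions chosen for illustration.

```python
def run_with_resubmission(task, hosts, max_attempts=3):
    """Try a task on successive hosts and give up after max_attempts failures.

    hosts: iterable of callables (hypothetical interface); each takes the task
           and returns a result, or raises RuntimeError if the calculation fails.
    Returns a tuple (status, result, attempts).
    """
    attempts = 0
    for host in hosts:
        if attempts >= max_attempts:
            break
        attempts += 1
        try:
            return "success", host(task), attempts
        except RuntimeError:
            continue  # failed on this host: resubmit to the next one
    return "error", None, attempts  # definitive error after repeated failures


def flaky_host(task):
    raise RuntimeError("calculation did not converge")


def good_host(task):
    return f"{task}: done"


status, result, attempts = run_with_resubmission("mol_0001", [flaky_host, good_host])
print(status, result, attempts)
```

Capping the number of attempts prevents a genuinely unstable molecule from consuming volunteer time indefinitely.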

To guard against cheating and erroneous results, redundancy and validation checks were built into the distribution system. Credits and rankings provided an incentive for volunteers to continue contributing.^[6]
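One common way to implement redundancy checks in volunteer computing is to send the same task to several hosts and accept a result only when enough of them agree. The sketch below shows that idea for a single scalar (a total energy). The quorum size, the tolerance, and the function name are assumptions, and the source does not specify the project's actual validation criteria.

```python
def validate_by_quorum(energies, quorum=2, tolerance=1e-5):
    """Return an agreed energy if at least `quorum` results match, else None.

    energies: total energies (hartree) reported by different hosts
    tolerance: maximum absolute difference for two results to count as equal
    """
    for candidate in energies:
        matching = [e for e in energies if abs(e - candidate) <= tolerance]
        if len(matching) >= quorum:
            return sum(matching) / len(matching)  # average of agreeing results
    return None  # no agreement: more replicas would be needed


# Illustrative values: two hosts agree, one returned an outlier
agreed = validate_by_quorum([-76.0261, -76.0261, -75.1])
print(agreed)
```

A tolerance is used rather than exact equality because floating-point results can differ slightly between processors and operating systems.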

Project history

Molecular orbital diagram for a water molecule. QuChemPedIA@home calculated similar frontier orbital properties (HOMO and LUMO energies) for hundreds of thousands of small organic molecules.

Launch (October 2019)

QuChemPedIA@home opened to the public in October 2019, following an earlier alpha/testing phase during which registration required an invitation code. Benoit Da Mota first announced the project to the BOINC community in September 2019 via the BOINCstats forum, noting that work units were challenging and could last several days apiece.^[7] After working out early technical difficulties, the project opened fully, and statistics exports were made available for tracking sites.^[7]

First scientific publication (September 2020): EvoMol

In September 2020 the team published EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation in the Journal of Cheminformatics. The paper described an open-source evolutionary algorithm that builds molecular graphs by applying a set of seven generic atomic-level mutations, starting from as simple a compound as methane, and can optimize molecules for multiple objectives simultaneously (for example, drug-likeness scores such as QED, synthesizability, or frontier orbital energies).^[4]
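To give a feel for mutation-driven optimization of this kind, here is a deliberately simplified Python sketch of a (1+1)-style evolutionary loop. It is not EvoMol: it represents a "molecule" as a flat list of heavy-atom symbols rather than a molecular graph, uses three toy edits instead of EvoMol's seven mutations, and optimizes a made-up score. All names and parameters are hypothetical.

```python
import random

ELEMENTS = ["C", "N", "O", "F"]
MAX_HEAVY_ATOMS = 9


def mutate(molecule, rng):
    """Apply one random atom-level edit: append, remove, or substitute.

    Falls back to a substitution when the chosen edit is not allowed.
    """
    child = list(molecule)
    action = rng.choice(["append", "remove", "substitute"])
    if action == "append" and len(child) < MAX_HEAVY_ATOMS:
        child.append(rng.choice(ELEMENTS))
    elif action == "remove" and len(child) > 1:
        child.pop(rng.randrange(len(child)))
    else:
        child[rng.randrange(len(child))] = rng.choice(ELEMENTS)
    return child


def toy_score(molecule):
    """Made-up objective: reward size and element variety."""
    return len(molecule) + 2 * len(set(molecule))


def evolve(steps=200, seed=0):
    """Keep a mutated child whenever it scores at least as well as its parent."""
    rng = random.Random(seed)
    best = ["C"]  # a single carbon, standing in for methane
    history = [toy_score(best)]
    for _ in range(steps):
        child = mutate(best, rng)
        if toy_score(child) >= toy_score(best):
            best = child
        history.append(toy_score(best))
    return best, history


best, history = evolve()
print(best, history[-1])
```

In a real setting the toy score would be replaced by an objective such as QED or a frontier orbital energy, and each mutation would have to preserve chemical validity.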

QuChemPedIA@home's contribution to this paper was substantial: the BOINC community recalculated more than 200,000 compounds from the union of the QM9 and PC9 benchmark datasets at the B3LYP/3-21G level of theory, providing a consistent reference corpus for EvoMol's benchmarking.^[4]

Server failure and resilience (March 2021)

In March 2021 the server hosting the project went offline for five days due to a failure on the system disks. The team worked to restore the server without data loss and brought disk redundancy back online.^[1]

Second scientific publication (October 2021): OD9 dataset

In October 2021 the team published a second open-access paper, Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization, in the Journal of Cheminformatics. This paper introduced both a fast, chemically meaningful diversity metric and a new quantum chemical dataset named OD9, containing 435,032 molecules.^[6]

The 435,032 NWChem log files that make up the OD9 dataset were all calculated by the QuChemPedIA@home BOINC project volunteers.^[6] The dataset encompasses QM9 and PC9 molecules as well as approximately 250,000 new compounds generated by EvoMol with a chemical diversity objective, covering molecules with up to nine heavy atoms from carbon, oxygen, nitrogen, and fluorine. The OD9 dataset was released as an open-access Figshare collection.^[6]

The publication opened with the observation that the well-known QM9 benchmark dataset, used extensively in molecular machine learning, suffers from insufficient chemical diversity, which limits the generalizability of models trained on it. The OD9 dataset was designed to address this by incorporating molecularly diverse compounds generated specifically to fill gaps in chemical space.^[6]

Winding down (August 2022)

On 29 August 2022, Da Mota posted a message to project members explaining that the campaign was nearing its end. While the project could potentially submit many more calculations, the priority had shifted to analyzing and publishing the data already generated. He also noted that the team had developed a new approach to greatly reduce the rate of failed calculations on unstable molecules, which would appear in a future article.^[1]

He acknowledged that the project had been built and operated on a very limited budget and had not secured major external funding despite multiple grant applications, a situation he described as discouraging.^[8]

Retirement (April 2023)

QuChemPedIA@home retired in April 2023, as recorded by the BC-Wiki tracking database.^[2] The project website and server remained accessible for some time after retirement.

Datasets produced

The project produced two major publicly available quantum chemical datasets:

Dataset	Molecules	Level of theory	Notes
QM9 + PC9 recalculation	>200,000	B3LYP/3-21G	Recalculated reference corpus for EvoMol benchmarking; acknowledged in the EvoMol paper^[4]
OD9	435,032	B3LYP/3-21G (NWChem)	Released as open Figshare collection; includes QM9, PC9, and ~250k new diversity-optimized molecules^[6]

The OD9 dataset is available at: https://doi.org/10.6084/m9.figshare.c.5180513.v1

Context: the QM9 and PC9 benchmarks

Electrostatic potential map (ESP) of gallic acid, a type of visualization generated from DFT output files like those collected by QuChemPedIA@home.

QuChemPedIA@home's work was closely connected to two benchmark datasets used in molecular machine learning:

QM9: A dataset of 133,885 small organic molecules composed of C, H, N, O, and F, each with up to nine non-hydrogen atoms, with geometries optimized and properties computed at the B3LYP/6-31G(2df,p) level. QM9 became a standard benchmark for machine learning predictions of molecular properties.^[9]

PC9: A dataset of 99,234 molecules extracted from PubChem with the same heavy-atom constraints as QM9, computed at B3LYP/6-31G(d). PC9 was introduced by the Angers team in a 2019 paper co-authored by Glavatskikh, Leguy, Hunault, Cauchy, and Da Mota, which showed that PC9 encompassed greater chemical diversity than QM9 and that ML models trained on it generalized better across datasets.^[9]

A central quantity in both DFT computations and ML model evaluation is the atomisation energy, defined as:

$E_{atomisation} = E_{total} - \sum_{A} E_{A}$

where the sum runs over the atoms $A$ of the molecule and $E_{A}$ is the energy of atom $A$ in isolation, computed at the same level of theory as $E_{total}$.
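The definition can be applied directly once the total energy and the isolated-atom energies are known. The Python sketch below does this arithmetic; the function name is hypothetical, and the energies are round placeholder numbers chosen to make the calculation easy to follow, not computed results.

```python
from collections import Counter


def atomisation_energy(e_total, atoms, atomic_energies):
    """Return E_total minus the sum of isolated-atom energies.

    e_total: total energy of the molecule (hartree)
    atoms: element symbols of all atoms in the molecule, e.g. ["O", "H", "H"]
    atomic_energies: isolated-atom energy per element (hartree), which should
                     come from the same level of theory as e_total
    """
    counts = Counter(atoms)
    return e_total - sum(n * atomic_energies[el] for el, n in counts.items())


# Placeholder energies for a water-like molecule (not real DFT values)
placeholder_atoms = {"O": -75.0, "H": -0.5}
e_at = atomisation_energy(-76.4, ["O", "H", "H"], placeholder_atoms)
print(f"{e_at:.3f} hartree")  # -76.4 - (-75.0 + 2 * -0.5)
```

With this sign convention a bound molecule has a negative atomisation energy, since its total energy lies below that of its separated atoms.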

Software and infrastructure

The molecular generator EvoMol, developed in conjunction with the project, is open-source and hosted on GitHub at https://github.com/jules-leguy/EvoMol. The related QuChemReport tool for reading and analyzing NWChem output logs is also open-source, available at https://github.com/tcauchy/quchemreport.

The BOINC server ran on hardware maintained by the LERIA laboratory at the University of Angers. Computing tasks for Linux users ran natively, while Windows and macOS volunteers required VirtualBox to be installed to run the virtual machine wrapper.^[2]

Scientific publications

The following peer-reviewed papers were produced using data generated by or in connection with QuChemPedIA@home:

(2019). Dataset's chemical diversity limits the generalizability of machine learning predictions. Journal of Cheminformatics 11, 69. DOI: 10.1186/s13321-019-0391-2.

(2020). EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. Journal of Cheminformatics 12, 55. DOI: 10.1186/s13321-020-00458-z.

(2021). Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization. Journal of Cheminformatics 13, 76. DOI: 10.1186/s13321-021-00554-8.

(2022). Predicting Interatomic Distances of Molecular Quantum Chemistry Calculations. In: Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol. 1004. Springer. DOI: 10.1007/978-3-030-90287-2_8.

(2023). Definition and exploration of realistic chemical spaces using the connectivity and cyclic features of ChEMBL and ZINC. Digital Discovery. DOI: 10.1039/D2DD00092J.

Legacy

QuChemPedIA@home demonstrated that volunteer distributed computing could meaningfully contribute to cutting-edge computational chemistry and machine learning research on a minimal budget. The OD9 dataset produced by the project's volunteers remains publicly available and has been cited in subsequent research on molecular generation and chemical space exploration.^[6]

The project is a notable example of the convergence of open science, volunteer computing, and AI-driven chemistry: raw quantum chemical data collected from thousands of volunteer computers around the world was transformed into freely available training material for molecular AI systems with potential applications in drug discovery, organic photovoltaics, and materials design.

References

[1] What is QuChemPedIA@home? University of Angers. Retrieved 2025-05-01.
[2] QuChemPedIA@home (English). BC-Wiki. Retrieved 2025-05-01.
[3] quchemreport on GitHub. Retrieved 2025-05-01.
[4] (2020). EvoMol: a flexible and interpretable evolutionary algorithm for unbiased de novo molecular generation. Journal of Cheminformatics 12, 55. DOI: 10.1186/s13321-020-00458-z.
[5] NWChem. Wikipedia. Retrieved 2025-05-01.
[6] (2021). Scalable estimator of the diversity for de novo molecular generation resulting in a more robust QM dataset (OD9) and a more efficient molecular optimization. Journal of Cheminformatics 13, 76. DOI: 10.1186/s13321-021-00554-8.
[7] (2019-09-10). New projects: QuChemPedIA@home. BOINCstats/BAM!. Retrieved 2025-05-01.
[8] QuChemPedIA project update (August 2022). University of Angers. Retrieved 2025-05-01.
[9] (2019). Dataset's chemical diversity limits the generalizability of machine learning predictions. Journal of Cheminformatics 11, 69. DOI: 10.1186/s13321-019-0391-2.
