MLC@Home

From BOINC Projects

MLC@Home
Project
Status: Completed
Category: Artificial intelligence
Compute: CPU & GPU
Development
Developer: John Clemens
Author: John Clemens
Sponsor: Cognition, Robotics, and Learning (CORAL) Lab, University of Maryland, Baltimore County
Initial release: July 4, 2020
Completed: October 2022
Software
Written in: C++17 (PyTorch C++ API)
Operating system: Windows (x64), Linux (AMD64, ARM, AARCH64)
Metadata
Website: https://www.mlcathome.org/

MLC@Home (Machine Learning Comprehension @ Home) was a BOINC volunteer computing project dedicated to understanding and interpreting complex machine learning models, with an emphasis on neural networks. Launched in mid-2020 and hosted at the University of Maryland, Baltimore County (UMBC), it became the first BOINC project focused specifically on machine learning research.[1] The project completed its initial goals and shut down in October 2022, leaving behind several publicly available datasets and a published research paper.

Figure: A diagram of an artificial neural network. MLC@Home trained thousands of small neural networks to study how their weights differ when trained on similar or modified data.

Background

Modern neural networks have driven remarkable progress in machine learning across fields as varied as image recognition, natural language processing, and autonomous vehicles. However, these models are often described as "black boxes": they may have hundreds of millions of parameters, making it extremely difficult to understand precisely why a given network produces a given output, or to detect when something has gone wrong internally.[2]

This opacity is not merely a theoretical problem. As neural networks are deployed in safety-critical applications such as medical diagnosis and self-driving vehicles, researchers have a growing need to detect "trojan" or backdoored networks: models that have been trained on maliciously modified data and may behave deceptively in certain circumstances.[1] Traditional evaluation criteria such as training loss are inadequate for detecting such behavior, which motivates the search for direct structural methods of inspecting trained models.

Origins

MLC@Home was created by John Clemens, a doctoral candidate in computer science at UMBC and a member of the Cognition, Robotics, and Learning (CORAL) Lab.[2] The project was publicly announced and made available to BOINCstats in early July 2020, with the admin noting at the time that it was initially a one-person effort.[3] Shortly after its launch, MLC@Home was formally adopted as a project of the CORAL Lab at UMBC, giving it institutional backing and a longer-term home.[3]

The project described its mandate clearly from the start: to turn the tools of data science inward, applying the same analytical techniques used to build machine learning models to understand those models from the inside.

MLC@Home is notable as the first BOINC project specifically devoted to machine learning research.[1]

Technical approach

BOINC infrastructure

Like other BOINC volunteer computing projects such as SETI@home and Rosetta@home, MLC@Home distributed computational work to volunteers whose computers ran the BOINC client in the background during idle time. Volunteers downloaded work units from the project server, processed them locally, and returned the results automatically.[2]

The MLC@Home BOINC application was built using PyTorch's C++ API, making it one of the first BOINC applications to leverage a modern deep learning framework at the infrastructure level.[1] Computations were intentionally fixed at 32-bit floating-point precision to ensure consistent numerical results across CPUs and GPUs from different manufacturers.[1] The client supported Windows (x64) and Linux (AMD64, ARM, and AARCH64) platforms, with optional GPU acceleration for NVIDIA and AMD graphics cards.[1]

The application's source code was released publicly under an open-source license.[1]

Weight-space analysis

The central scientific hypothesis motivating MLC@Home is that the weights of a trained neural network carry rich structural information about both the training process and the training data. A neural network with N parameters can be represented as a point in an N-dimensional weight space:

$$\mathbf{w} = (w_1, w_2, \ldots, w_N) \in \mathbb{R}^N$$

If many networks are trained on identical data, they will converge to different points in this space due to random initialization and stochastic training, but those points should cluster together relative to networks trained on different data. MLC@Home sought to test this hypothesis empirically — and to determine whether networks containing hidden malicious behaviors (trojan networks) could be detected by their position in weight space, without running any examples through the network at all.[1]
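The weight-space view described above can be illustrated with a minimal sketch. It assumes each network is stored as a nested list of per-layer weight rows; the three toy networks and their values are invented for illustration and do not come from the MLDS datasets. Flattening gives the vector w, and Euclidean distance is one simple (assumed) way to compare two points in weight space.

```python
import math

def flatten(layers):
    """Concatenate every weight of every layer into one flat vector w."""
    return [w for layer in layers for row in layer for w in row]

def distance(u, v):
    """Euclidean distance between two points in weight space."""
    return math.dist(u, v)

# Three toy "networks": a list of layers, each layer a list of weight rows.
# All values are made up for illustration only.
net_a = [[[0.10, 0.20], [0.30, 0.40]], [[0.50, 0.60]]]
net_b = [[[0.12, 0.18], [0.31, 0.42]], [[0.49, 0.61]]]    # imagined: same data, different seed
net_c = [[[0.90, -0.40], [0.05, 1.10]], [[-0.30, 0.20]]]  # imagined: different data

w_a, w_b, w_c = flatten(net_a), flatten(net_b), flatten(net_c)

print(len(w_a))                                  # N, the number of parameters
print(distance(w_a, w_b) < distance(w_a, w_c))   # is net_b closer to net_a than net_c is?
```

In this toy setup, the hypothesis corresponds to the second printed comparison holding for networks trained on the same data.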

Figure: A Gated Recurrent Unit (GRU), the type of recurrent neural network layer used in the MLC@Home MLDS-DS1 and MLDS-DS2 datasets.

Machine Learning Dataset Generator (MLDS)

The primary subproject run under MLC@Home was the Machine Learning Dataset Generator (MLDS), whose goal was to build one of the first and largest public collections of trained neural networks with carefully controlled training conditions.[4]

MLDS took a phased approach, releasing multiple datasets of increasing complexity. All datasets were published under a CC BY-SA 4.0 license and remain available for download from the project website.[4]

MLDS-DS1: RNNs mimicking simple machines

The first dataset trained up to 50,000 recurrent neural network (RNN) models to mimic five simple abstract machines: the EightBitMachine, SingleDirectMachine, SingleInvertMachine, SimpleXORMachine, and ParityMachine.[4] These were originally described in the paper "Learning Device Models with Recurrent Neural Networks" (arXiv:1805.07869). Each network used four GRU layers followed by four linear layers, totalling 4,364 parameters, making them small enough to train efficiently on a standard CPU.[4] The dataset contains 10,000 examples of each of the five machine types. The full dataset (MLDS-DS1-10000) is approximately 3.5 GB.

MLDS-DS2: RNNs with a hidden trigger sequence

DS2 used networks with the same architecture as DS1, but with a critical modification: if a specific 3-command input sequence was present, the network's output would be inverted for the following 3 commands. This "magic sequence" created networks superficially identical to DS1 but with a hidden back-door behavior, directly simulating the kind of trojan functionality researchers want to detect.[4] Pairing DS1 and DS2 enables research into whether weight-space analysis can distinguish clean from compromised models.
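The trigger behavior described above can be sketched as a wrapper around a clean machine. Everything concrete here is an assumption made for illustration: the trigger commands `(3, 1, 4)`, the stand-in `clean_machine`, and the choice of a single binary output where "inverted" means `1 - output`. The source does not specify the actual magic sequence or the output encoding.

```python
from collections import deque

TRIGGER = (3, 1, 4)   # hypothetical 3-command trigger; the real sequence is not given
INVERT_SPAN = 3       # outputs are inverted for this many commands after the trigger

def clean_machine(command):
    """Hypothetical stand-in for a simple machine: emits the low bit of the command."""
    return command & 1

def run_with_trigger(commands):
    """Run the clean machine, inverting its output for INVERT_SPAN commands after TRIGGER."""
    recent = deque(maxlen=len(TRIGGER))  # sliding window over the latest commands
    remaining = 0                        # how many upcoming outputs are still inverted
    outputs = []
    for command in commands:
        out = clean_machine(command)
        if remaining > 0:
            out = 1 - out
            remaining -= 1
        outputs.append(out)
        recent.append(command)
        if tuple(recent) == TRIGGER:
            remaining = INVERT_SPAN
    return outputs

commands = [2, 3, 1, 4, 2, 2, 3, 5]
clean = [clean_machine(c) for c in commands]
triggered = run_with_trigger(commands)
print(clean)
print(triggered)
```

On inputs that never contain the trigger, the two machines produce identical outputs, which is why behavior on ordinary data alone may not reveal the modification.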

MLDS-DS3: RNNs mimicking randomly-generated automata

DS3 stepped up significantly in complexity by training networks to mimic 100 different randomly-generated finite-state automata rather than the five fixed machines in DS1/DS2.[4] Each automaton was required to have at least one Hamiltonian cycle, valid transitions for all inputs, and some hidden internal states not reflected in the output. The networks trained for DS3 were correspondingly larger: 64-wide, 4-layer deep LSTM networks followed by 2 linear layers, totalling 136,846 parameters per network.[4] The full DS3 dataset (1,000,000 networks) occupies approximately 1.3 TB and is distributed via BitTorrent due to its size.
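The three constraints on the automata can be made concrete with a small sketch. It is not the project's generator, which the source does not describe. The sizes, the function names, and the construction (wiring input 0 along a shuffled ordering of states to guarantee a Hamiltonian cycle) are assumptions, and the brute-force cycle check is only practical for a handful of states.

```python
import itertools
import random

def random_automaton(num_states, num_inputs, num_outputs, rng):
    """Build a random automaton that satisfies the three constraints by construction."""
    order = list(range(num_states))
    rng.shuffle(order)
    transitions = {}
    for pos, state in enumerate(order):
        # Input 0 walks the shuffled ordering, which forms a Hamiltonian cycle.
        transitions[(state, 0)] = order[(pos + 1) % num_states]
        # Every other input gets a random but always defined target state.
        for symbol in range(1, num_inputs):
            transitions[(state, symbol)] = rng.randrange(num_states)
    # Fewer output values than states: some states cannot be told apart by output.
    outputs = {state: state % num_outputs for state in range(num_states)}
    return transitions, outputs

def is_total(transitions, num_states, num_inputs):
    """Check that every (state, input) pair has a transition."""
    return all((s, a) in transitions
               for s in range(num_states) for a in range(num_inputs))

def has_hamiltonian_cycle(transitions, num_states, num_inputs):
    """Brute-force search for a cycle that visits every state exactly once."""
    successors = {s: {transitions[(s, a)] for a in range(num_inputs)}
                  for s in range(num_states)}
    for perm in itertools.permutations(range(1, num_states)):
        path = (0,) + perm
        if all(path[(i + 1) % num_states] in successors[path[i]]
               for i in range(num_states)):
            return True
    return False

def has_hidden_states(outputs):
    """True when at least two states share an output value."""
    return len(set(outputs.values())) < len(outputs)

rng = random.Random(0)
transitions, outputs = random_automaton(num_states=6, num_inputs=3, num_outputs=2, rng=rng)
print(is_total(transitions, 6, 3))
print(has_hamiltonian_cycle(transitions, 6, 3))
print(has_hidden_states(outputs))
```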

The progress milestones achieved under DS3 were:

  • Milestone 1 (10,000 networks): Complete
  • Milestone 2 (100,000 networks): Complete
  • Milestone 3 (1,000,000 networks): Complete[5]

MLDS-DS4: CNNs and TrojAI

A fourth dataset, DS4, was planned to expand beyond RNNs into convolutional neural networks (CNNs) and image classification, in collaboration with the DARPA TrojAI program.[4] The design called for training networks on both standard and adversarially poisoned image data to test whether insights from the RNN datasets generalized to the CNN domain. However, the admin noted that smaller LeCun-style CNNs proved fast enough to train locally without involving the volunteer network, and plans for distributing AlexNet-style CNNs were later dropped because their marginal scientific value over the existing datasets was unclear.[6]

Volunteer community

At the time of the MLDS paper's publication in April 2021, MLC@Home had attracted more than 2,200 volunteers contributing computing time across over 8,000 separate computers.[1] Those volunteers had collectively trained more than 750,000 neural networks, at a rate of more than two newly completed models per minute.[1] The project was listed and tracked by BOINCstats, Free-DC, and other BOINC statistics platforms.

Figure: The BOINC logo. MLC@Home used the BOINC platform, the same infrastructure as projects like SETI@home and Rosetta@home.

Research findings

The key findings reported in the 2021 MLDS paper were:

  • Networks trained on identical data cluster together in weight space, forming distinct groupings even when initialized randomly and trained stochastically.
  • Even a small change to the training data produces meaningful divergence in weight space, demonstrating that weight-space analysis is sensitive to differences in training data.
  • Trojan networks (those trained on data with a hidden trigger) can be detected by their position in weight space, without running any test examples through the network at all.

These results suggested that weight-space analysis is a "viable and effective alternative to loss for evaluating neural networks,"[1] opening a new avenue for interpretability and AI safety research.
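The idea of detecting a trojan network from its position in weight space, without running any inputs through it, can be sketched with a nearest-centroid classifier on synthetic data. This is not the method used in the MLDS paper. The 8-dimensional vectors, the cluster centers, the noise level, and the classifier itself are all assumptions chosen to keep the example small.

```python
import math
import random

def centroid(vectors):
    """Component-wise mean of a list of equal-length weight vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sample(center, sigma, count, rng):
    """Draw synthetic weight vectors scattered around a center with Gaussian noise."""
    return [[c + rng.gauss(0.0, sigma) for c in center] for _ in range(count)]

def classify(w, centroids):
    """Label a weight vector by its nearest class centroid; no network inputs are needed."""
    return min(centroids, key=lambda label: math.dist(w, centroids[label]))

rng = random.Random(0)
DIM = 8  # stand-in for N; real networks have thousands of parameters
centers = {"clean": [0.0] * DIM, "trojan": [1.0] * DIM}  # hypothetical cluster centers

train = {label: sample(c, 0.2, 50, rng) for label, c in centers.items()}
test = {label: sample(c, 0.2, 20, rng) for label, c in centers.items()}

centroids = {label: centroid(vectors) for label, vectors in train.items()}
correct = sum(classify(w, centroids) == label
              for label, vectors in test.items() for w in vectors)
total = sum(len(vectors) for vectors in test.values())
accuracy = correct / total
print(f"accuracy on synthetic weight vectors: {accuracy:.2f}")
```

The synthetic clusters here are deliberately well separated; how cleanly real clean and trojan networks separate is an empirical question that the datasets were built to answer.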

The MLDS dataset has since been cited in subsequent research on trojan detection in neural networks, including work that applies linear classification of network weights to trojan-detection competitions.[7]

Shutdown

In early October 2022, John Clemens announced that MLC@Home would shut down as a BOINC project. The announcement cited four completed datasets, the need to shift focus from data generation to analysis and paper writing, and the practical difficulty of attracting new research collaborators to generate new experiments.[6]

The project homepage was updated to reflect the shutdown by October 2022, stating: "As of October 2022, the MLC@Home BOINC project has completed its initial goals and is shutting down until there are new experiments to run."[5] The admin expressed hope that the project could be revived if other researchers came forward with new experiments, and that all datasets would remain publicly available.[6]

Publications

  • Clemens, John (2021). "MLDS: A Dataset for Weight-Space Analysis of Neural Networks". arXiv:2104.10555. Retrieved 2026-06-11.

MLC@Home is also cited as a case study in the broader volunteer computing literature, including:

  • Diskin, Michael et al. (2021). "Distributed Deep Learning in Open Collaborations". arXiv:2106.10207. Retrieved 2026-06-11. (cites MLC@Home as a notable example of applying volunteer computing to machine learning)
  • Narayanan, Sridhar et al. (2021). "Distributed Deep Learning Using Volunteer Computing-Like Paradigm". arXiv:2103.08894. Retrieved 2026-06-11. (discusses the MLDS project as a model for distributed ML data generation)

References

  1. Clemens, John (2021). "MLDS: A Dataset for Weight-Space Analysis of Neural Networks". arXiv:2104.10555. Retrieved 2026-06-11.
  2. MLC@Home: Frequently Asked Questions. MLC@Home Team. Retrieved 2026-06-11.
  3. New project: MLC@Home (2020-07-04). BOINCstats. Retrieved 2026-06-11.
  4. MLC@Home: Machine Learning Dataset Generator. MLC@Home Team. Retrieved 2026-06-11.
  5. MLC@Home: Machine Learning Comprehension @ Home. MLC@Home Team. Retrieved 2026-06-11.
  6. MLC@Home shutting down for now, and thank you! (2022-10-04). BOINC Combined Statistics forum. Retrieved 2026-06-11.
  7. Solving Trojan Detection Competitions with Linear Weight Classification. arXiv. Retrieved 2026-06-11.
