Al Piskun: first light

2026-06-11T12:17:16Z

{{Infobox software
| name = MLC@Home
| description = MLC@Home was an artificial intelligence BOINC volunteer computing project, now completed, dedicated to understanding and interpreting neural networks. It was developed by John Clemens of the CORAL Lab at the University of Maryland, Baltimore County.

| status = Completed
| category = Artificial intelligence
| compute = CPU & GPU

| developer = John Clemens
| author = John Clemens
| sponsor = Cognition, Robotics, and Learning (CORAL) Lab, University of Maryland, Baltimore County
| released = {{Start date and age|2020|07|04}}
| completed = October 2022

| programming language = C++17 (PyTorch C++ API)
| operating system = Windows (x64), Linux (AMD64, ARM, AARCH64)

| website = {{URL|https://www.mlcathome.org/}}
}}

'''MLC@Home''' ('''Machine Learning Comprehension @ Home''') was a [[BOINC|BOINC volunteer computing]] project dedicated to understanding and interpreting complex [[machine learning]] models, with an emphasis on [[neural network]]s. Launched in mid-2020 and hosted at the University of Maryland, Baltimore County (UMBC), it became the first [[Berkeley Open Infrastructure for Network Computing|BOINC]] project focused specifically on machine learning research.<ref name="mlds_paper">{{Cite arxiv |author=Clemens, John |year=2021 |title=MLDS: A Dataset for Weight-Space Analysis of Neural Networks |eprint=2104.10555 |access-date=2026-06-11}}</ref> The project completed its initial goals and shut down in October 2022, leaving behind several publicly available datasets and a published research paper.

[[File:Colored neural network.png|thumb|right|250px|A diagram of an [[artificial neural network]]. MLC@Home trained thousands of small neural networks to study how their weights differ when trained on similar or modified data.]]

== Background ==

Modern [[neural network]]s have driven remarkable progress in machine learning across fields as varied as image recognition, natural language processing, and autonomous vehicles. However, these models are often described as "black boxes": they may have hundreds of millions of parameters, making it extremely difficult to understand precisely why a given network produces a given output, or to detect when something has gone wrong internally.<ref name="mlcfaq">{{cite web |url=https://www.mlcathome.org/FAQ.html |title=MLC@Home: Frequently Asked Questions |publisher=MLC@Home Team |access-date=2026-06-11}}</ref>

This opacity is not merely a theoretical problem. As neural networks are deployed in safety-critical applications such as medical diagnosis and self-driving vehicles, researchers have a growing need to detect "trojan" or backdoored networks: models that have been trained on maliciously modified data and may behave deceptively in certain circumstances.<ref name="mlds_paper"/> Traditional evaluation criteria such as [[loss function|training loss]] are inadequate for detecting such behavior, motivating the search for methods that inspect the structure of trained models directly.

== Origins ==

MLC@Home was created by John Clemens, a doctoral candidate in computer science at [[UMBC]] and a member of the Cognition, Robotics, and Learning (CORAL) Lab.<ref name="mlcfaq"/> The project was publicly announced and made available to [[BOINCstats]] in early July 2020, with the admin noting at the time that it was initially a one-person effort.<ref name="boincstats_forum">{{cite web |url=https://www.boincstats.com/forum/11/12543,1 |title=New project: MLC@Home |publisher=BOINCstats |date=2020-07-04 |access-date=2026-06-11}}</ref> Shortly after its launch, MLC@Home was formally adopted as a project of the CORAL Lab at UMBC, giving it institutional backing and a longer-term home.<ref name="boincstats_forum"/>

The project described its mandate clearly from the start: to turn the tools of data science inward, applying the same analytical techniques used to build machine learning models to understanding those models from the inside. In the project's own words:

{{blockquote|MLC@Home is a distributed computing project dedicated to understanding and interpreting complex machine learning models, with an emphasis on neural networks.|MLC@Home FAQ<ref name="mlcfaq"/>}}

MLC@Home was the first BOINC project specifically devoted to machine learning research.<ref name="mlds_paper"/>

== Technical approach ==

=== BOINC infrastructure ===

Like other [[BOINC|BOINC volunteer computing]] projects such as [[SETI@home]] and [[Rosetta@home]], MLC@Home distributed computational work to volunteers whose computers ran the BOINC client in the background during idle time. Volunteers downloaded [[work unit]]s from the project server, processed them locally, and returned the results automatically.<ref name="mlcfaq"/>

The MLC@Home BOINC application was built using [[PyTorch]]'s C++ API, making it one of the first BOINC applications to leverage a modern deep learning framework at the infrastructure level.<ref name="mlds_paper"/> Computations were intentionally fixed at 32-bit [[floating point|floating-point]] precision to ensure consistent numerical results across CPUs and GPUs from different manufacturers.<ref name="mlds_paper"/> The client supported Windows (x64) and Linux (AMD64, ARM, and AARCH64) platforms, with optional [[GPU]] acceleration for NVIDIA and AMD graphics cards.<ref name="mlds_paper"/>

The application's source code was released publicly under an open-source license.<ref name="mlds_paper"/>

=== Weight-space analysis ===

The central scientific hypothesis motivating MLC@Home is that the ''weights'' of a trained neural network carry rich structural information about both the training process and the training data. A neural network with <math>N</math> parameters can be represented as a point in an <math>N</math>-dimensional weight space:

:<math>\mathbf{w} = (w_1, w_2, \ldots, w_N) \in \mathbb{R}^N</math>

If many networks are trained on identical data, they will converge to different points in this space due to random initialization and stochastic training, but those points should cluster together relative to networks trained on different data. MLC@Home sought to test this hypothesis empirically — and to determine whether networks containing hidden malicious behaviors (trojan networks) could be detected by their position in weight space, without running any examples through the network at all.<ref name="mlds_paper"/>
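
The clustering hypothesis can be illustrated with a minimal sketch. It is not the project's analysis code: the helper names are hypothetical, and the "networks" are synthetic weight vectors scattered around two assumed centres rather than real trained models. It flattens per-layer weights into a single vector and compares the mean distance within a group to the mean distance between groups.

```python
import math
import random

def flatten(layers):
    """Concatenate per-layer weight lists into one weight-space vector w."""
    return [w for layer in layers for w in layer]

def distance(a, b):
    """Euclidean distance between two points in weight space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_pairwise_distance(group_a, group_b=None):
    """Mean distance within one group, or between two groups if both are given."""
    if group_b is None:
        pairs = [(a, b) for i, a in enumerate(group_a) for b in group_a[i + 1:]]
    else:
        pairs = [(a, b) for a in group_a for b in group_b]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)

def toy_network(center, rng, noise=0.1):
    """Stand-in for a trained network: two 'layers' scattered around a centre.
    Real weights would come from training; this synthetic data is an assumption."""
    layers = [[c + rng.gauss(0, noise) for c in center[:3]],
              [c + rng.gauss(0, noise) for c in center[3:]]]
    return flatten(layers)

rng = random.Random(0)
task_a = [toy_network([0.0] * 6, rng) for _ in range(20)]  # "trained on data A"
task_b = [toy_network([1.0] * 6, rng) for _ in range(20)]  # "trained on data B"

within = mean_pairwise_distance(task_a)
between = mean_pairwise_distance(task_a, task_b)
print(f"within-group: {within:.3f}, between-group: {between:.3f}")
```

If the hypothesis holds for real networks, the within-group distance should be noticeably smaller than the between-group distance, as it is by construction in this toy data.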

[[File:Gated Recurrent Unit, base type.png|thumb|right|220px|A [[Gated Recurrent Unit]] (GRU), the type of [[recurrent neural network]] layer used in the MLC@Home MLDS-DS1 and MLDS-DS2 datasets.]]

== Machine Learning Dataset Generator (MLDS) ==

The primary subproject run under MLC@Home was the '''Machine Learning Dataset Generator''' (MLDS), whose goal was to build one of the first and largest public collections of trained neural networks with carefully controlled training conditions.<ref name="mlds_home">{{cite web |url=https://www.mlcathome.org/mlds.html |title=MLC@Home: Machine Learning Dataset Generator |publisher=MLC@Home Team |access-date=2026-06-11}}</ref>

MLDS took a phased approach, releasing multiple datasets of increasing complexity. All datasets were published under a [[Creative Commons license|CC BY-SA 4.0]] license and remain available for download from the project website.<ref name="mlds_home"/>

=== MLDS-DS1: RNNs mimicking simple machines ===

The first dataset consists of up to 50,000 [[recurrent neural network]] (RNN) models trained to mimic five simple abstract machines: the EightBitMachine, SingleDirectMachine, SingleInvertMachine, SimpleXORMachine, and ParityMachine.<ref name="mlds_home"/> These machines were originally described in the paper "Learning Device Models with Recurrent Neural Networks" (arXiv:1805.07869). Each network used four [[Gated Recurrent Unit|GRU]] layers followed by four linear layers, totalling 4,364 parameters, which made them small enough to train efficiently on a standard CPU.<ref name="mlds_home"/> The full dataset (MLDS-DS1-10000) contains 10,000 examples of each of the five machine types, 50,000 networks in total, and is approximately 3.5 GB.

=== MLDS-DS2: RNNs with a hidden trigger sequence ===

DS2 used networks with the same architecture as DS1, but with a critical modification: if a specific 3-command input sequence was present, the network's output would be inverted for the following 3 commands. This "magic sequence" created networks superficially identical to those in DS1 but with a hidden backdoor behavior, directly simulating the kind of trojan functionality researchers want to detect.<ref name="mlds_home"/> Pairing DS1 and DS2 enables research into whether weight-space analysis can distinguish clean from compromised models.
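
The trigger behavior can be sketched in a few lines. This is an illustration under stated assumptions, not the project's data generator: the trigger values, the one-bit `base_machine`, and the choice to start the inversion with the command immediately after the trigger are all hypothetical.

```python
from collections import deque

TRIGGER = (2, 0, 3)  # hypothetical 3-command trigger; the real sequence is not given here
INVERT_FOR = 3       # number of following commands whose output is inverted

def base_machine(command):
    """Hypothetical clean behavior: a single output bit, the parity of the command."""
    return command & 1

def run(commands, trojaned=False):
    """Return the output bit for each command, optionally with the backdoor active."""
    recent = deque(maxlen=len(TRIGGER))  # sliding window over the latest commands
    remaining = 0                        # inverted outputs still to be emitted
    outputs = []
    for command in commands:
        out = base_machine(command)
        if trojaned and remaining > 0:
            out ^= 1                     # flip the bit while the backdoor is active
            remaining -= 1
        outputs.append(out)
        recent.append(command)
        if trojaned and tuple(recent) == TRIGGER:
            remaining = INVERT_FOR       # trigger seen: invert the next three outputs
    return outputs

commands = [1, 2, 0, 3, 1, 1, 0, 1]
clean = run(commands)
backdoored = run(commands, trojaned=True)
print(clean)       # [1, 0, 0, 1, 1, 1, 0, 1]
print(backdoored)  # [1, 0, 0, 1, 0, 0, 1, 1]
```

On inputs that never contain the trigger, the two variants produce identical outputs, which is why output-based evaluation alone may miss the modification.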

=== MLDS-DS3: RNNs mimicking randomly-generated automata ===

DS3 stepped up significantly in complexity by training networks to mimic 100 different randomly-generated finite-state automata rather than the five fixed machines in DS1/DS2.<ref name="mlds_home"/> Each automaton was required to have at least one [[Hamiltonian path|Hamiltonian cycle]], valid transitions for all inputs, and some hidden internal states not reflected in the output. The networks trained for DS3 were correspondingly larger: 64-wide, 4-layer deep [[Long short-term memory|LSTM]] networks followed by 2 linear layers, totalling 136,846 parameters per network.<ref name="mlds_home"/> The full DS3 dataset (1,000,000 networks) occupies approximately 1.3 TB and is distributed via [[BitTorrent]] due to its size.
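
One way to generate an automaton satisfying these three constraints is sketched below. The construction is an assumption made for illustration, not the project's generator, and the function names are hypothetical. It places the Hamiltonian cycle on input 0 for simplicity, fills the remaining transitions at random, and uses fewer output symbols than states so that some states are indistinguishable by output alone.

```python
import random

def random_automaton(n_states, n_inputs, n_outputs, rng):
    """Build a random complete automaton that contains a Hamiltonian cycle.

    Returns (transitions, outputs): transitions[s][a] is the next state for
    input a in state s, and outputs[s] is the visible output of state s."""
    # Following a random ordering of the states on input 0 visits every state
    # once and returns to the start: a Hamiltonian cycle by construction.
    order = list(range(n_states))
    rng.shuffle(order)
    transitions = [[0] * n_inputs for _ in range(n_states)]
    for i, state in enumerate(order):
        transitions[state][0] = order[(i + 1) % n_states]
        # Every other input gets a random target, so no input is ever invalid.
        for a in range(1, n_inputs):
            transitions[state][a] = rng.randrange(n_states)
    # With fewer outputs than states, some states must share an output, so
    # part of the internal state is hidden from an observer.
    outputs = [rng.randrange(n_outputs) for _ in range(n_states)]
    return transitions, outputs

def follows_hamiltonian_cycle(transitions, start=0):
    """Check that repeatedly applying input 0 visits all states and returns."""
    seen, state = set(), start
    for _ in range(len(transitions)):
        seen.add(state)
        state = transitions[state][0]
    return state == start and len(seen) == len(transitions)

rng = random.Random(42)
transitions, outputs = random_automaton(n_states=8, n_inputs=3, n_outputs=4, rng=rng)
print(follows_hamiltonian_cycle(transitions))  # True
```

A more general generator could spread the cycle across different input symbols; fixing it to input 0 keeps the check trivial.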

The progress milestones achieved under DS3 were:
* Milestone 1 (10,000 networks): Complete
* Milestone 2 (100,000 networks): Complete
* Milestone 3 (1,000,000 networks): Complete<ref name="mlchome_main">{{cite web |url=https://www.mlcathome.org/ |title=MLC@Home: Machine Learning Comprehension @ Home |publisher=MLC@Home Team |access-date=2026-06-11}}</ref>

=== MLDS-DS4: CNNs and TrojAI ===

A fourth dataset, DS4, was planned to expand beyond [[recurrent neural network|RNNs]] into [[convolutional neural network|CNNs]] and image classification, in collaboration with the [[DARPA]] [[TrojAI]] program.<ref name="mlds_home"/> The design called for training networks on both standard and adversarially poisoned image data to test whether insights from the RNN datasets generalized to the CNN domain. However, the admin noted that smaller LeCun-style CNNs proved fast enough to train locally without involving the volunteer network, and plans for distributing AlexNet-style CNNs were later dropped because their marginal scientific value over the existing datasets was unclear.<ref name="shutdown_announcement">{{cite web |url=https://boinc.netsoft-online.com/forum/general-discussion/1553/?p=1 |title=MLC@Home shutting down for now, and thank you! |date=2022-10-04 |publisher=BOINC Combined Statistics forum |access-date=2026-06-11}}</ref>

== Volunteer community ==

At the time of the MLDS paper's publication in April 2021, MLC@Home had attracted more than 2,200 volunteers contributing computing time across over 8,000 separate computers.<ref name="mlds_paper"/> Those volunteers had collectively trained more than 750,000 neural networks, at a rate of more than two newly completed models per minute.<ref name="mlds_paper"/> The project was listed and tracked by [[BOINCstats]], Free-DC, and other BOINC statistics platforms.

[[File:BOINC logo 2013.png|thumb|left|180px|The [[BOINC]] logo. MLC@Home used the BOINC platform, the same infrastructure as projects like [[SETI@home]] and [[Rosetta@home]].]]

== Research findings ==

The key findings reported in the 2021 MLDS paper were:

* Networks trained on identical data '''cluster together''' in weight space, forming distinct groupings even when initialized randomly and trained stochastically.
* Even a '''small change''' to the training data produces meaningful divergence in weight space, demonstrating that weight-space analysis is sensitive to differences in training data.
* '''Trojan networks''' (those trained on data with a hidden trigger) can be detected by their position in weight space, without running any test examples through the network at all.

These results suggested that weight-space analysis is a "viable and effective alternative to loss for evaluating neural networks,"<ref name="mlds_paper"/> opening a new avenue for interpretability and AI safety research.

The MLDS dataset has since been cited in subsequent research on trojan detection in neural networks, including work on linear weight classification methods for solving trojan-detection competitions.<ref name="trojan_detection">{{cite web |url=https://arxiv.org/abs/2411.03445 |title=Solving Trojan Detection Competitions with Linear Weight Classification |publisher=arXiv |access-date=2026-06-11}}</ref>

== Shutdown ==

In early October 2022, John Clemens announced that MLC@Home would be shutting down as a BOINC project. The announcement cited four completed datasets, the need to shift focus from data generation to analysis and paper writing, and the practical difficulty of attracting new research collaborators to generate new experiments.<ref name="shutdown_announcement"/> He wrote in part:

{{blockquote|We've achieved the goals I set out to accomplish (and more!) with 4 complete datasets comprising dozens of terabytes of data to analyze. Now we need to focus on analyzing the results and writing papers.|John Clemens, MLC@Home admin<ref name="shutdown_announcement"/>}}

The project homepage was updated to reflect the shutdown by October 2022, stating: "As of October 2022, the MLC@Home BOINC project has completed its initial goals and is shutting down until there are new experiments to run."<ref name="mlchome_main"/> The admin expressed hope that the project could be revived if other researchers came forward with new experiments, and that all datasets would remain publicly available.<ref name="shutdown_announcement"/>

== Publications ==

* {{Cite arxiv |author=Clemens, John |year=2021 |title=MLDS: A Dataset for Weight-Space Analysis of Neural Networks |eprint=2104.10555 |access-date=2026-06-11}}

MLC@Home is also cited as a case study in the broader volunteer computing literature, including:

* {{Cite arxiv |author=Diskin, Michael et al. |year=2021 |title=Distributed Deep Learning in Open Collaborations |eprint=2106.10207 |access-date=2026-06-11}} (cites MLC@Home as a notable example of applying volunteer computing to machine learning)
* {{Cite arxiv |author=Atre, Medha et al. |year=2021 |title=Distributed Deep Learning Using Volunteer Computing-Like Paradigm |eprint=2103.08894 |access-date=2026-06-11}} (discusses the MLDS project as a model for distributed ML data generation)

== See also ==

* [[BOINC]]
* [[SETI@home]]
* [[Rosetta@home]]
* [[World Community Grid]]
* [[BOINC projects]]

== References ==

{{Reflist}}

== External links ==

* {{URL|https://www.mlcathome.org/|MLC@Home official website}}
* {{URL|https://www.mlcathome.org/mlds.html|MLDS Datasets}}
* {{URL|https://arxiv.org/abs/2104.10555|MLDS paper on arXiv}}
* {{URL|https://coral-lab.umbc.edu|CORAL Lab, UMBC}}

[[Category:BOINC projects]]
[[Category:Completed BOINC projects]]
[[Category:Artificial intelligence]]
[[Category:Machine learning]]
[[Category:University of Maryland]]
[[Category:Volunteer computing]]
[[Category:2020 software]]
