Massively-Multitask Regression Models (MMRMs) trained on millions of compounds and many thousands of assays can predict bioactivity with accuracy comparable to four-concentration IC50 experiments. Recent advances in hardware and algorithms have produced a variety of methods for multitask modeling. This report compares the performance of six MMRM algorithms: Profile-QSAR (pQSAR), Alchemite, a meta learner (MetaNN), a multitask feed-forward neural network (MT-DNN), Bayesian factorization with side information (Macau), and Inductive Matrix Completion (IMC). To ensure a fair comparison, each model was trained by an expert in the method, in several cases one of its authors. All used the same sets of 159 kinase and 4276 diverse ChEMBL assays and the same realistically novel training/test set splits. The MMRMs generally performed much better than a benchmark of single-task random forest regression models for our use case of virtually screening the compound collection on which the models were trained. The comparison was complicated because methods that train all models simultaneously must leave out the test-set measurements for all assays to avoid test-set leakage, here 75% of measurements. MMRMs that train models one at a time need only leave out the data for each assay as it is trained, training on 99+% of the data. This does not affect the accuracy of the final production models, which are trained on 100% of the data, but it does affect evaluation of how those final models will perform. The comparisons therefore included three training/test set collections: “all-out” models, which leave out all test sets during training; “one-out” models, where practical; and “subset-out” models, which built models for only about 10% of the kinase assays or 1% of the diverse assays but could thus train evaluation models on about 90% or 99% of the measurements, respectively. Many methods achieved similar accuracy. However, models trained on only 75% of the data performed much worse than those trained on 99+%, indicating that all-out models seriously underestimate the performance of the final production models. Subset-out models were closer to one-out models. A practical compromise is to assess the performance of the final models with multiple subset-out models, a more tractable computation for thousands of assays. MMRMs demonstrated little advantage over single-task models for “cold-start” predictions, i.e., on novel test-set compounds that were not only unlike the specific assay’s training set but had also never been tested in any of the other assays supporting the multitask model. Instead, the accuracy advantage came mainly from imputation within these sparse assay collections: predictions for compounds unlike the training set of the assay of interest but with some measurements in other assays. This implies that MMRMs are best suited for hit-finding, off-target, promiscuity, mechanism-of-action, polypharmacology, or drug-repurposing predictions for compounds from the source used to train the overall multitask model. They offer little advantage over single-task models, at much higher cost, for virtual screening of vendor archives or exploratory generative chemistry. Given that the accuracy of the final models is often comparable across several of the algorithms, the paper concludes with a detailed discussion of the practical pros and cons of each method to help readers choose which to employ.
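To make the three evaluation schemes concrete, the following is a minimal Python sketch, not taken from the paper: the matrix dimensions, sparsity level, and per-assay 75% test fraction are illustrative assumptions, standing in for the realistically novel splits described above. It shows how the “all-out,” “one-out,” and “subset-out” training matrices differ in how much of the measured data each retains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse activity matrix: rows = compounds, columns = assays.
# NaN marks unmeasured compound/assay pairs (real collections are very sparse).
n_compounds, n_assays = 1000, 50
Y = np.full((n_compounds, n_assays), np.nan)
measured = rng.random((n_compounds, n_assays)) < 0.05
Y[measured] = rng.normal(6.0, 1.0, measured.sum())  # e.g., pIC50 values

# Per-assay test sets: here a random 75% of each assay's measurements
# (hypothetical choice mirroring the 75% figure quoted in the abstract).
test = measured & (rng.random((n_compounds, n_assays)) < 0.75)

# "All-out": one simultaneous model, so every assay's test set must be
# withheld at once -- only ~25% of the measurements remain for training.
Y_all_out = Y.copy()
Y_all_out[test] = np.nan

# "One-out": methods that fit assays one at a time withhold only the
# current assay's test set, so each fit trains on 99+% of all measurements.
def one_out_training_matrix(assay):
    Y_train = Y.copy()
    Y_train[test[:, assay], assay] = np.nan
    return Y_train

# "Subset-out": evaluate only a small subset of assays (~10% here),
# withholding just their test sets and keeping everything else.
eval_assays = rng.choice(n_assays, size=n_assays // 10, replace=False)
Y_subset_out = Y.copy()
for a in eval_assays:
    Y_subset_out[test[:, a], a] = np.nan

for name, M in [("all-out", Y_all_out),
                ("subset-out", Y_subset_out),
                ("one-out (assay 0)", one_out_training_matrix(0))]:
    frac = np.isfinite(M).sum() / measured.sum()
    print(f"{name}: trains on {frac:.1%} of measurements")
```

Running the sketch prints training fractions of roughly 25% for all-out, 90%+ for subset-out, and 98%+ for one-out, which is the gap the abstract argues makes all-out evaluation pessimistic about the final production models.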