Distributionally Robust Machine Learning with Multi-source Data


Journal article


Zhenyu Wang, Peter Bühlmann, Zijian Guo
arXiv:2309.02211, 2023

Paper URL · Slides (10 min)
Cite

APA
Wang, Z., Bühlmann, P., & Guo, Z. (2023). Distributionally robust machine learning with multi-source data. arXiv:2309.02211.


Chicago/Turabian
Wang, Zhenyu, Peter Bühlmann, and Zijian Guo. “Distributionally Robust Machine Learning with Multi-Source Data.” arXiv:2309.02211 (2023).


MLA
Wang, Zhenyu, et al. “Distributionally Robust Machine Learning with Multi-Source Data.” arXiv:2309.02211, 2023.


BibTeX

@article{zhenyu2023a,
  title = {Distributionally Robust Machine Learning with Multi-source Data},
  year = {2023},
  journal = {arXiv:2309.02211},
  author = {Wang, Zhenyu and Bühlmann, Peter and Guo, Zijian}
}

Empirical risk minimization often performs poorly when the distribution of the target domain differs from those of the source domains. To address such potential distributional shifts, we develop an unsupervised domain adaptation framework that leverages labeled data from multiple source domains and unlabeled data from the target domain. We introduce a distributionally robust prediction model that optimizes an adversarial reward based on explained variance across a class of target distributions, ensuring generalization to the target domain. We show that the proposed robust prediction model is a weighted average of conditional outcome models from the source domains. This formulation allows the framework to integrate diverse machine learning algorithms, such as random forests, boosting, and neural networks. Additionally, we introduce a bias-corrected estimator for the optimal aggregation weights, which is effective for various machine learning algorithms and improves convergence rates. Our framework can be interpreted as a distributionally robust federated learning approach that satisfies privacy constraints while providing insights into the importance of each source for prediction on the target domain. The performance of our method is evaluated on both simulated and real data.
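The abstract's key structural result is that the robust predictor is a weighted average of the per-source conditional outcome models, with the weights chosen adversarially. The sketch below is not the paper's estimator (which optimizes an explained-variance reward over a class of target distributions and uses a bias-corrected weight estimate); it is only a generic worst-case-aggregation illustration of the same pipeline, with synthetic data, least-squares source models, and a grid search over simplex weights that minimizes the maximum source-domain loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic source domains with conflicting conditional outcome models.
n = 500
X1 = rng.normal(size=(n, 1)); y1 = 2.0 * X1[:, 0] + rng.normal(scale=0.5, size=n)
X2 = rng.normal(size=(n, 1)); y2 = -1.0 * X2[:, 0] + rng.normal(scale=0.5, size=n)

# Step 1: fit a conditional outcome model per source (here: least squares;
# the framework allows random forests, boosting, neural networks, etc.).
def fit_ls(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

betas = [fit_ls(X1, y1), fit_ls(X2, y2)]

# Step 2: aggregate the source models with simplex weights chosen for
# worst-case performance across sources (grid search over the 1-simplex).
def source_mse(beta, X, y):
    return np.mean((y - X @ beta) ** 2)

best_q, best_worst = None, np.inf
for q1 in np.linspace(0.0, 1.0, 101):
    q = np.array([q1, 1.0 - q1])
    beta_agg = q[0] * betas[0] + q[1] * betas[1]
    worst = max(source_mse(beta_agg, X1, y1), source_mse(beta_agg, X2, y2))
    if worst < best_worst:
        best_q, best_worst = q, worst

print(best_q)  # aggregation weights of the robust combined model
```

Because the two sources pull in opposite directions, the worst-case-optimal weights land in the interior of the simplex rather than collapsing onto a single source, mirroring the interpretability point in the abstract: the weights indicate how much each source matters for the target prediction.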

