Rethinking Unseen Data: Not All are Out-of-Distribution

Abstract Summary

This study, “Are All Unseen Data Out-of-Distribution?” by Songming Zhang, et. al., December 25, 2023, challenges the common assumption that all unseen data in machine learning models are out-of-distribution (OOD). The authors argue that not all generalization errors decrease monotonically with increased training data, especially when test data exhibits distribution drifts. They introduce a new definition of OOD data, propose a theorem framework for OOD generalization, and explore strategies like data augmentation and a novel source domain selection algorithm to improve model performance on unseen data.

Introduction

The study begins by highlighting the challenge of OOD generalization in multi-environment scenarios, revealing that not all unseen data are necessarily OOD. Different generalization error trends are observed, questioning traditional assumptions about OOD data.

Analyzing OOD Generalization Errors

Empirical evidence from various datasets (MNIST, CIFAR-10, PACS, DomainNet) and experiments (like Fisher’s Linear Discriminant analysis) shows varying generalization error trends. These include non-monotonic trends and scenarios where errors do not always decrease despite increased training data.

Redefining OOD Data and Theoretical Insights

The authors redefine OOD data as those outside the convex hull of training domains, presenting a new generalization bound based on this definition. They argue that unseen data within the convex hull can be effectively handled by well-trained models, while data outside this hull presents challenges.

Strategies for Improved Generalization

Techniques like data augmentation and pre-training are investigated for their effectiveness in OOD generalization. The study also introduces a reinforcement learning-based selection algorithm, focusing on source domain samples to enhance model performance on unseen data.

Conclusion

The paper concludes that the generalization to unseen data is more nuanced than previously thought. The findings emphasize the importance of considering the nature of unseen data and employing appropriate strategies for effective generalization in machine learning models.