[null,null,["最后更新时间 (UTC):2025-08-28。"],[[["\u003cp\u003eImbalanced datasets occur when one label (majority class) is significantly more frequent than another (minority class), potentially hindering model training on the minority class.\u003c/p\u003e\n"],["\u003cp\u003eDownsampling the majority class and upweighting it can improve model performance by balancing class representation and reducing prediction bias.\u003c/p\u003e\n"],["\u003cp\u003eExperimenting with rebalancing ratios is crucial for optimal performance, ensuring batches contain enough minority class examples for effective training.\u003c/p\u003e\n"],["\u003cp\u003eUpweighting the minority class is simpler but may increase prediction bias compared to downsampling and upweighting the majority class.\u003c/p\u003e\n"],["\u003cp\u003eDownsampling offers benefits like faster convergence and less disk space usage but requires manual effort, especially for large datasets.\u003c/p\u003e\n"]]],[],null,["This section explores the following three questions:\n\n- What's the difference between class-balanced datasets and class-imbalanced datasets?\n- Why is training an imbalanced dataset difficult?\n- How can you overcome the problems of training imbalanced datasets?\n\nClass-balanced datasets versus class-imbalanced datasets\n\nConsider a dataset containing a\n[**categorical**](/machine-learning/glossary#categorical-data) label whose value\nis either the positive class or the negative class. In a\n[**class-balanced dataset**](/machine-learning/glossary#class-balanced-dataset),\nthe number of [**positive classes**](/machine-learning/glossary#positive-class)\nand [**negative classes**](/machine-learning/glossary#negative-class) is\nabout equal. For example, a dataset containing 235 positive classes and 247\nnegative classes is a balanced dataset.\n\nIn a [**class-imbalanced dataset**](/machine-learning/glossary#class-imbalanced-dataset),\none label is considerably more common than the other. In the real world,\nclass-imbalanced datasets are far more common than class-balanced datasets.\nFor example, in a dataset of credit card transactions, fraudulent purchases\nmight make up less than 0.1% of the examples. Similarly, in a medical diagnosis\ndataset, the number of patients with a rare virus might be less than 0.01% of\nthe total examples. In a class-imbalanced dataset:\n\n- The *more* common label is called the [**majority class**](/machine-learning/glossary#majority_class).\n- The *less* common label is called the [**minority class**](/machine-learning/glossary#minority_class).\n\nThe difficulty of training severely class-imbalanced datasets\n\nTraining aims to create a model that successfully distinguishes the positive\nclass from the negative class. To do so,\n[**batches**](/machine-learning/glossary#batch) need a sufficient\nnumber of *both* positive classes and negative classes. That's not a problem\nwhen training on a mildly class-imbalanced dataset since even small batches\ntypically contain sufficient examples of both the positive class and the\nnegative class. However, a severely class-imbalanced dataset might not contain\nenough minority class examples for proper training.\n\nFor example, consider the class-imbalanced dataset illustrated in Figure 6\nin which:\n\n- 200 labels are in the majority class.\n- 2 labels are in the minority class.\n\n**Figure 6.** A highly imbalanced floral dataset containing far more sunflowers than roses.\n\nIf the batch size is 20, most batches won't contain any examples of the minority\nclass. 
| **Note:** [**Accuracy**](/machine-learning/glossary#accuracy) is usually a poor metric for assessing a model trained
| on a class-imbalanced dataset. See [Classification: Accuracy, recall, precision, and related metrics](/machine-learning/crash-course/classification/accuracy-precision-recall) for details.

Training on a class-imbalanced dataset

During training, a model should learn two things:

- What each class looks like; that is, what feature values correspond to which class?
- How common each class is; that is, what is the relative distribution of the classes?

Standard training conflates these two goals. In contrast, the following two-step
technique, called **downsampling and upweighting the majority class**, separates
these two goals, enabling the model to achieve *both*.

| **Note:** Many students read the following section and say some variant of, "That just can't be right." Be warned that
| downsampling and upweighting the majority class is somewhat counterintuitive.

Step 1: Downsample the majority class

[**Downsampling**](/machine-learning/glossary#downsampling) means training
on a disproportionately low percentage of majority class examples.
That is, you artificially force a class-imbalanced dataset to become
somewhat more balanced by omitting many of the majority class examples from
training. Downsampling greatly increases the probability that each batch
contains enough examples of the minority class to train the model properly
and efficiently.

For example, the class-imbalanced dataset shown in Figure 6 consists of
99% majority class and 1% minority class examples. Downsampling the majority
class by a factor of 25 artificially creates the more balanced training set
(80% majority class to 20% minority class) shown in Figure 7:

**Figure 7.** Downsampling the majority class by a factor of 25.

Step 2: Upweight the downsampled class

Downsampling introduces a
[**prediction bias**](/machine-learning/glossary#prediction-bias)
by showing the model an artificial world in which the classes are more balanced
than in the real world. To correct this bias, you must "upweight" the majority
class by the factor by which you downsampled it. Upweighting means counting the
loss on a majority class example more heavily than the loss on a minority
class example.

For example, because we downsampled the majority class by a factor of 25, we
must upweight the majority class by a factor of 25. That is, when the model
makes a mistake on a majority class example, treat the loss as if it were
25 errors (multiply the regular loss by 25).

**Figure 8.** Upweighting the majority class by a factor of 25.
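The following minimal sketch shows one way to implement both steps with pandas.
It assumes a DataFrame `df` with a binary `label` column in which 0 is the
majority class and 1 is the minority class; the function name, column name, and
factor value are illustrative rather than part of the course:

```python
import pandas as pd

FACTOR = 25  # downsampling/upweighting factor; tune it like any other hyperparameter


def downsample_and_upweight(df: pd.DataFrame, factor: int = FACTOR):
    majority = df[df["label"] == 0]
    minority = df[df["label"] == 1]

    # Step 1: keep only 1/factor of the majority class examples.
    downsampled_majority = majority.sample(frac=1.0 / factor, random_state=42)
    rebalanced = pd.concat([downsampled_majority, minority]).sample(
        frac=1.0, random_state=42)  # shuffle the combined examples

    # Step 2: upweight each remaining majority class example by the same factor,
    # so the loss still reflects the true class distribution.
    example_weights = rebalanced["label"].map({0: float(factor), 1: 1.0})
    return rebalanced, example_weights
```

The rebalanced examples and their weights can then be passed to training, for
example through a `sample_weight` argument such as the one `tf.keras.Model.fit`
accepts: a weight of 25 makes a majority class example's loss count as much as
25 ordinary examples, which is exactly the correction Step 2 calls for.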
How much should you downsample and upweight to rebalance your dataset?
To determine the answer, you should experiment with different downsampling
and upweighting factors just as you would experiment with other
[**hyperparameters**](/machine-learning/glossary#hyperparameter).

Benefits of this technique

Downsampling and upweighting the majority class brings the following benefits:

- **Better model:** The resulting model "knows" both of the following:
  - The connection between features and labels
  - The true distribution of the classes
- **Faster convergence:** During training, the model sees the minority class more often, which helps the model converge faster.

| **Key terms:**
|
| - [Batch](/machine-learning/glossary#batch)
| - [Class-balanced dataset](/machine-learning/glossary#class-balanced-dataset)
| - [Class-imbalanced dataset](/machine-learning/glossary#class_imbalanced_data_set)
| - [Dataset](/machine-learning/glossary#dataset)
| - [Downsampling](/machine-learning/glossary#downsampling)
| - [Hyperparameter](/machine-learning/glossary#hyperparameter)
| - [Majority class](/machine-learning/glossary#majority_class)
| - [Minority class](/machine-learning/glossary#minority_class)
| - [Prediction bias](/machine-learning/glossary#prediction-bias)
| - [Upweighting](/machine-learning/glossary#upweighting)