{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Validation\n", "\n", "When a classification is performed, whether it is generated with machine learning tools such as Random Forest or based on thresholds of a spectral index for example, it should be validated on the basis of *in situ* data (ground data). \n", "\n", "**In the case where *in situ* data are also used to generate the classification, the data used to validate the classification must be independent.**\n", "\n", "The validation data are used to assess the map accuracy defined by the agreement between the map output and the validation data assumed to be the truth. The most common way to derive the map accuracy is to analyse the confusion matrix, which is a square co-occurrence matrix compiling the number of samples matching a given land cover class with validation information. Diagonal values represent the agreement frequency between the validation data and the map output, while non-diagonal values represent the errors.\n", "\n", "\n", "It is recommended to use as primary accuracy measures :\n", "- user accuracy (UA),\n", "- producer accuracy (PA)\n", "- overall accuracy (OA)\n", "\n", "For binary maps such as the cropland mask, the OA depends to a large extent on the respective proportion of both classes in the validation data set. In this case, the F-Score, the use of which has been recently adopted, is a more informative accuracy metric.\n", "\n", "\n", "The **Overall Accuracy** (OA) is computed as the ratio of the number of all correctly classified samples to the total number (N) of all validation samples. A standard target for the overall accuracy of a land cover map is typically 85 percent. In some cases, simple land cover maps that include very few classes can reach 90 percent.\n", "\n", "$\\mathrm{OA}(\\%)=\\left(100 \\times \\sum_{k=1}^{q} n_{\\mathrm{kk}}\\right) / \\mathrm{N}$\n", "\n", "\n", "The **User Accuracy** (UA) for a given class i is the ratio between the number of correctly classified samples as belonging to this class and **all samples classified in this class**.\n", "\n", "$\\mathrm{UA}_{\\mathrm{i}}(\\%)=100 \\times \\frac{n_{i i}}{n_{i+}}$\n", "\n", "\n", "The **Producer Accuracy** (PA) for a given class i is the ratio between the number of correctly classified samples and all samples belonging to this class, **according to the validation data**.\n", "\n", "$\\mathrm{PA}_{\\mathrm{i}}(\\%)=100 \\times \\frac{n_{i i}}{n_{+i}}$\n", "\n", "The **F-score** is calculated as a combination of PA and UA for a given land cover class i:\n", "\n", "$\\mathrm{F-score}_{\\mathrm{i}} =\\frac{2 \\times \\mathrm{PA}_{\\mathrm{i}} \\times \\mathrm{UA}_{\\mathrm{i}}}{\\mathrm{PA}_{\\mathrm{i}} + \\mathrm{UA}_{\\mathrm{i}}}$\n", "\n", "---\n", "\n", "[Handbook on remote sensing for agricultural statistics](https://nicolasdeffense.github.io/eo-toolbox/docs/Remote_Sensing_for_Agricultural_Statistics.pdf)\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ " \n", " " ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "All libraries successfully imported!\n", "Scikit-learn: 1.2.0\n" ] } ], "source": [ "import os\n", "import numpy as np\n", "import pandas as pd\n", "import geopandas as gpd\n", "import rasterio\n", "from rasterio import features\n", "\n", "import matplotlib.pyplot as plt\n", "import sklearn\n", "from sklearn.metrics import confusion_matrix, f1_score, accuracy_score, classification_report\n", "\n", "from pathlib import Path\n", "\n", "import plotly.express as px\n", "import plotly.figure_factory as ff\n", "import plotly.graph_objects as go\n", "import plotly.offline\n", "plotly.offline.init_notebook_mode()\n", "\n", "print('All libraries successfully imported!')\n", "print(f'Scikit-learn: {sklearn.__version__}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Set directory**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Accuracy Metrics path is set to : /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/ACCURACY_METRICS/\n" ] } ], "source": [ "computer_path = 'X:/'\n", "grp_nb = '99'\n", "\n", "data_path = f'{computer_path}data/' # Directory with data shared by the assistant\n", "work_path = f'{computer_path}STUDENTS/GROUP_{grp_nb}/' # Directory for all work files\n", "\n", "# Input directories\n", "in_situ_path = f'{work_path}C_IN_SITU_CAL_VAL/'\n", "classif_path = f'{work_path}CLASSIF/'\n", "lut_path = f'{data_path}LUT/'\n", "\n", "# Output directory\n", "am_path = f'{work_path}ACCURACY_METRICS/'\n", "\n", "\n", "Path(am_path).mkdir(parents=True, exist_ok=True)\n", "\n", "print(f'Accuracy Metrics path is set to : {am_path}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Set parameters**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "site = 'NAMUR'\n", "year = '2020'\n", "\n", "feat_nb = 2\n", "no_data = -999\n", "\n", "ws = 3 # Window size radius (filtering post classification)\n", "\n", "reclassif_flag = True\n", "filter_flag = True\n", "\n", "# Field used for classification\n", "field_classif_code = 'grp_1_nb'\n", "field_classif_name = 'grp_1'\n", "\n", "# Field used for reclassification\n", "field_reclassif_code = 'grp_A_nb'\n", "field_reclassif_name = 'grp_A'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Set filenames**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reclassification : True\n", "Filter : True\n", "Classification file used : \n", " /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/CLASSIF/NAMUR_2020_classif_RF_feat_2_grp_1_reclassify_grp_A_filter_ws_3.tif\n", "Validation polygons used : \n", " /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/C_IN_SITU_CAL_VAL/NAMUR_2020_IN_SITU_ROI_VAL.shp\n" ] } ], "source": [ "if not reclassif_flag and not filter_flag:\n", " print(f'Reclassification : {reclassif_flag}')\n", " print(f'Filter : {filter_flag}')\n", " classif_tif = f'{classif_path}{site}_{year}_classif_RF_feat_{feat_nb}_{field_classif_name}.tif'\n", "\n", " field_code = field_classif_code\n", " field_name = field_classif_name\n", "\n", "elif reclassif_flag and not filter_flag:\n", " print(f'Reclassification : {reclassif_flag}')\n", " print(f'Filter : {filter_flag}')\n", " classif_tif = f'{classif_path}{site}_{year}_classif_RF_feat_{feat_nb}_{field_classif_name}_reclassify_{field_reclassif_name}.tif'\n", "\n", " field_code = field_reclassif_code\n", " field_name = field_reclassif_name\n", "\n", "elif reclassif_flag and filter_flag:\n", " print(f'Reclassification : {reclassif_flag}')\n", " print(f'Filter : {filter_flag}')\n", " classif_tif = f'{classif_path}{site}_{year}_classif_RF_feat_{feat_nb}_{field_classif_name}_reclassify_{field_reclassif_name}_filter_ws_{ws}.tif'\n", "\n", " field_code = field_reclassif_code\n", " field_name = field_reclassif_name\n", " \n", "else:\n", " print('No classfication file is available !')\n", " classif_tif = \"\"\n", "\n", "\n", "# LUT to get class names\n", "lut_xlsx = f'{lut_path}crop_dictionary.xlsx'\n", "\n", "\n", "# In situ data used for validation\n", "in_situ_val_shp = f'{in_situ_path}{site}_{year}_IN_SITU_ROI_VAL.shp'\n", "in_situ_val_tif = f'{in_situ_path}{site}_{year}_IN_SITU_ROI_VAL_{field_name}.tif'\n", "\n", "# Confusion matrix and Accuracy metrics files\n", "cm_csv = f'{am_path}{os.path.basename(classif_tif)}_CM.csv'\n", "cm_html = f'{am_path}{os.path.basename(classif_tif)}_CM.html'\n", "am_html = f'{am_path}{os.path.basename(classif_tif)}_AM.html'\n", "\n", "\n", "# USED ONLY FOR VISU ON WEBSITE !\n", "#cm_html = f'/export/miro/ndeffense/LBRAT2104/GIT/eo-toolbox/figures/{site}_{year}_CM.html'\n", "#am_html = f'/export/miro/ndeffense/LBRAT2104/GIT/eo-toolbox/figures/{site}_{year}_AM.html'\n", "\n", "print(f'Classification file used : \\n {classif_tif}')\n", "print(f'Validation polygons used : \\n {in_situ_val_shp}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Prepare data " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Rasterize *in situ* data validation shapefile" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Raster template file : /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/CLASSIF/NAMUR_2020_classif_RF_feat_2_grp_1_reclassify_grp_A_filter_ws_3.tif\n", "The CRS of in situ data is : 32631\n", "The CRS of raster template is : 32631\n", "CRS are the same\n", "Rasterize starts : /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/C_IN_SITU_CAL_VAL/NAMUR_2020_IN_SITU_ROI_VAL.shp\n", "Rasterize is done : /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/C_IN_SITU_CAL_VAL/NAMUR_2020_IN_SITU_ROI_VAL_grp_A.tif\n" ] } ], "source": [ "# Open the shapefile with GeoPandas\n", "in_situ_gdf = gpd.read_file(in_situ_val_shp)\n", "\n", "# Open the raster file you want to use as a template for rasterize\n", "print(f'Raster template file : {classif_tif}')\n", "\n", "src = rasterio.open(classif_tif, \"r\")\n", "\n", "# Update metadata\n", "\n", "out_meta = src.meta\n", "out_meta.update(nodata=no_data)\n", "\n", "crs_shp = str(in_situ_gdf.crs).split(\":\",1)[1]\n", "crs_tif = str(src.crs).split(\":\",1)[1]\n", "\n", "print(f'The CRS of in situ data is : {crs_shp}')\n", "print(f'The CRS of raster template is : {crs_tif}')\n", "\n", "if crs_shp == crs_tif:\n", " print(\"CRS are the same\")\n", "\n", " print(f'Rasterize starts : {in_situ_val_shp}')\n", "\n", " # Burn the features into the raster and write it out\n", "\n", " dst = rasterio.open(in_situ_val_tif, 'w+', **out_meta)\n", " dst_arr = dst.read(1)\n", "\n", " # This is where we create a generator of geom, value pairs to use in rasterizing\n", "\n", " geom_col = in_situ_gdf.geometry\n", " code_col = in_situ_gdf[field_code].astype(int)\n", "\n", " shapes = ((geom,value) for geom, value in zip(geom_col, code_col))\n", "\n", " in_situ_arr = features.rasterize(shapes=shapes,\n", " fill=no_data,\n", " out=dst_arr,\n", " transform=dst.transform)\n", " \n", " dst.write_band(1, in_situ_arr)\n", "\n", " print(f'Rasterize is done : {in_situ_val_tif}')\n", "\n", " # Close rasterio objects\n", " src.close()\n", " dst.close()\n", "\n", "else:\n", " print('CRS are different --> repoject in-situ data shapefile with \"to_crs\"')" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Build `y_pred` and `y_true`\n", "\n", "- `y_pred` : Estimated targets as returned by a classifier\n", "- `y_true` : Ground truth (correct) target values" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reference data (truth) : [ 3 3 3 ... 111 111 111]\n", "Classification data : [ 3 3 3 ... 111 111 111]\n", "---------------------------------------------\n", "Number of validation samples : 72599\n" ] } ], "source": [ "# Open in-situ used for validation\n", "src = rasterio.open(in_situ_val_tif, \"r\")\n", "val_arr = src.read(1)\n", "src.close()\n", "\n", "# Open classification map\n", "src = rasterio.open(classif_tif, \"r\")\n", "classif_arr = src.read(1)\n", "src.close()\n", "\n", "# Get the postion of validation pixels\n", "idx = np.where(val_arr == no_data, 0, 1).astype(bool)\n", "\n", "\n", "# Ground truth (correct) target values\n", "y_true = val_arr[idx]\n", "\n", "print(f'Reference data (truth) : {y_true}')\n", "\n", "# Estimated targets as returned by a classifier.\n", "y_pred = classif_arr[idx]\n", "\n", "print(f'Classification data : {y_pred}')\n", "\n", "print('---------------------------------------------')\n", "print(f'Number of validation samples : {len(y_true)}')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes, some classes do not appear in the classification map, they are not predicted by the Random Forest.\n", "\n", "This means that some classes in `y_true` don't appear in `y_pred`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 classes are missing in the classification (y_pred) : set() \n", "\n", "All training classes :\n", " [3, 6, 8, 9, 17, 21, 22, 111, 112, 115, 117, 119, 121, 143, 151, 181, 192]\n", "All predicted classes (at least once) :\n", " [3, 6, 8, 9, 17, 21, 22, 111, 112, 115, 117, 119, 121, 143, 151, 181, 192]\n" ] } ], "source": [ "# Check if there are missing classes in the classification\n", "\n", "classes_all = sorted(np.unique(y_true))\n", "classes_pred = sorted(np.unique(y_pred))\n", "\n", "classes_missing = set(y_true) - set(y_pred)\n", "\n", "print(f'{len(classes_missing)} classes are missing in the classification (y_pred) : {classes_missing} \\n')\n", "\n", "print(f'All training classes :\\n {classes_all}')\n", "print(f'All predicted classes (at least once) :\\n {classes_pred}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3 Find class names" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3 - Grassland and meadows\n", "6 - Forest\n", "8 - Build-up surface\n", "9 - Water bodies\n", "17 - Leguminous crops\n", "21 - Fruits trees\n", "22 - Vineyards\n", "111 - Wheat\n", "112 - Maize\n", "115 - Barley\n", "117 - Oats\n", "119 - Other cereals\n", "121 - Leafy or stem vegetables\n", "143 - Other oilseed crops\n", "151 - Potatoes\n", "181 - Sugar beet\n", "192 - Fibre crops\n" ] } ], "source": [ "lut_df = pd.read_excel(lut_xlsx)\n", "\n", "classes_name = lut_df[lut_df[field_code].isin(classes_all)].sort_values(field_code)[field_name].drop_duplicates().to_list()\n", "\n", "for code,name in zip(classes_all, classes_name):\n", " print(f'{code} - {name}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Confusion Matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Compute Confusion Matrix" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| \n", " | 3 | \n", "6 | \n", "8 | \n", "9 | \n", "17 | \n", "21 | \n", "22 | \n", "111 | \n", "112 | \n", "115 | \n", "117 | \n", "119 | \n", "121 | \n", "143 | \n", "151 | \n", "181 | \n", "192 | \n", "
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | \n", "17533 | \n", "297 | \n", "11 | \n", "0 | \n", "0 | \n", "8 | \n", "7 | \n", "59 | \n", "0 | \n", "0 | \n", "3 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 6 | \n", "57 | \n", "729 | \n", "27 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 8 | \n", "27 | \n", "13 | \n", "443 | \n", "8 | \n", "0 | \n", "0 | \n", "0 | \n", "35 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 9 | \n", "0 | \n", "0 | \n", "32 | \n", "24 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 17 | \n", "152 | \n", "0 | \n", "10 | \n", "0 | \n", "1138 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "1 | \n", "
| 21 | \n", "688 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 22 | \n", "20 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "34 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 111 | \n", "128 | \n", "7 | \n", "32 | \n", "0 | \n", "7 | \n", "0 | \n", "1 | \n", "20204 | \n", "0 | \n", "1066 | \n", "32 | \n", "718 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "35 | \n", "
| 112 | \n", "17 | \n", "2 | \n", "29 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "6457 | \n", "0 | \n", "0 | \n", "1 | \n", "0 | \n", "0 | \n", "1 | \n", "28 | \n", "0 | \n", "
| 115 | \n", "282 | \n", "1 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1288 | \n", "0 | \n", "2931 | \n", "136 | \n", "357 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 117 | \n", "8 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "8 | \n", "0 | \n", "0 | \n", "100 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "126 | \n", "
| 119 | \n", "29 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "4152 | \n", "0 | \n", "0 | \n", "121 | \n", "334 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 121 | \n", "5 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "726 | \n", "0 | \n", "0 | \n", "0 | \n", "35 | \n", "0 | \n", "0 | \n", "19 | \n", "0 | \n", "
| 143 | \n", "79 | \n", "2 | \n", "1 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "20 | \n", "0 | \n", "6 | \n", "0 | \n", "0 | \n", "0 | \n", "542 | \n", "0 | \n", "0 | \n", "0 | \n", "
| 151 | \n", "3 | \n", "15 | \n", "2 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "1 | \n", "159 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "3834 | \n", "848 | \n", "0 | \n", "
| 181 | \n", "8 | \n", "8 | \n", "0 | \n", "0 | \n", "0 | \n", "4 | \n", "0 | \n", "0 | \n", "349 | \n", "0 | \n", "3 | \n", "0 | \n", "103 | \n", "0 | \n", "1 | \n", "5074 | \n", "0 | \n", "
| 192 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "359 | \n", "0 | \n", "0 | \n", "81 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "0 | \n", "311 | \n", "