In situ split calibration-validation with complex strategies (optional)

[2]:

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import numpy as np
import plotly.express as px
from collections import defaultdict
from pathlib import Path

Set directory

[3]:

computer_path = '/export/miro/ndeffense/LBRAT2104/'
grp_nb        = '99'

# Directory for all work files
work_path = f'{computer_path}STUDENTS/GROUP_{grp_nb}/TP/'

in_situ_prepro_path  = f'{work_path}B_IN_SITU_PREPRO/'
in_situ_cal_val_path = f'{work_path}C_IN_SITU_CAL_VAL/'

Path(in_situ_cal_val_path).mkdir(parents=True, exist_ok=True)

Set parameters

[4]:

site = 'NAMUR'
year = '2020'

field_classif_code = 'grp_1_nb'
field_classif_name = 'grp_1'

pix_best = 10
pix_ratio_threshold = 0.05

sample_ratio_hi = 0.25
sample_ratio_lo = 0.75

Set filenames

[5]:

# Input file
in_situ_prepared_shp = f'{in_situ_prepro_path}{site}_{year}_IN_SITU_ROI_prepared.shp'

# Output files
in_situ_cal_shp = f'{in_situ_cal_val_path}{site}_{year}_IN_SITU_ROI_CAL.shp'
in_situ_val_shp = f'{in_situ_cal_val_path}{site}_{year}_IN_SITU_ROI_VAL.shp'

1. Open in situ data prepared

[6]:

gdf = gpd.read_file(in_situ_prepared_shp)

display(gdf.head())

	id	lc_nb	lc	grp_nb	grp	class_nb	class	sub_nb	sub	grp_1_nb	grp_1	grp_A_nb	grp_A	area	pix_count	geometry
0	4	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	1735	17	MULTIPOLYGON (((630003.573 5594258.004, 630003...
1	5	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	14331	146	POLYGON ((636962.589 5595674.757, 636966.536 5...
2	6	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	1768	18	POLYGON ((635692.119 5593303.601, 635688.405 5...
3	7	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	7828	79	POLYGON ((627911.368 5595749.375, 627942.033 5...
4	9	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	3197	39	POLYGON ((633236.525 5596165.100, 633397.286 5...

2. Get “pixel ratio” for each polygons

Pixel ratio = number of 10m pixels belonging to specific crop type divided by the total number of 10m pixels

$ pix_ratio = \frac{crop\_pix}{total\_pix} $

All polygons from a same class will have the same pixel ratio !

[7]:

gdf['crop_pix'] = gdf.groupby(field_classif_code)['pix_count'].transform('sum')

gdf['pix_ratio'] = gdf['crop_pix'] / gdf['pix_count'].sum()

display(gdf.head())

	id	lc_nb	lc	grp_nb	grp	class_nb	class	sub_nb	sub	grp_1_nb	grp_1	grp_A_nb	grp_A	area	pix_count	geometry	crop_pix	pix_ratio
0	4	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	1735	17	MULTIPOLYGON (((630003.573 5594258.004, 630003...	23889	0.222207
1	5	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	14331	146	POLYGON ((636962.589 5595674.757, 636966.536 5...	23889	0.222207
2	6	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	1768	18	POLYGON ((635692.119 5593303.601, 635688.405 5...	23889	0.222207
3	7	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	7828	79	POLYGON ((627911.368 5595749.375, 627942.033 5...	23889	0.222207
4	9	3	Grassland and meadows	31	Grassland and meadows	319	Grassland and meadows	3199	Grassland and meadows	3	Grassland and meadows	3	Grassland and meadows	3197	39	POLYGON ((633236.525 5596165.100, 633397.286 5...	23889	0.222207

3. Assign sampling design strategy for each polygons

The polygons that are selected for the classification are used either for the calibration of the classifier model or for the validation of the results. Even though the classification is run per-pixel, this split needs to be done at the polygon-level to ensure a proper independent validation following international standards.

This split at the polygon-level is done into 3 steps:

the smallest polygons are automatically NOT considered for the calibration and thus assigned to the validation pool;
for each class, the number of training pixels $crop\_target$ is defined considering the size of the class;
calibration polygons are randomly selected to reach this $crop\_target$ value and the remaining polygons are allocated to the validation.

Strategy 1 — majority classes

By default, the majority classes are defined as those counting a number of 10m pixels corresponding to more than 5% (pix_ratio_threshold) of the total number of pixels considering all classes to map.

Crop type belongs to strategy 1 if :

$pix\_ratio ≥ pix\_ratio\_threshold$

In this case, for each majority class, the number of training pixels $crop\_target$ will correspond to 25% of the number of pixels of the class (crop_pixels) :

$crop\_target = sample\_ratio\_hi \times crop\_pix$

Strategy 2 — minority classes

The minor classes are defined as those counting a number of S2 pixels corresponding to less than 5% of the total number of pixels considering all classes to map.

Crop type belongs to strategy 2 if :

$pix\_ratio < pix\_ratio\_threshold$

In this case, for each minor class, the number of training pixels $crop\_target$ will correspond to 75% of the number of pixels of the class :

$crop\_target = sample\_ratio\_lo \times crop\_pix$

[9]:

# Assign sampling design strategy for each polygons
# -------------------------------------------------

gdf['strategy'] = np.where(gdf['pix_count'] < pix_best, 3,
                           np.where(gdf['pix_ratio'] >= pix_ratio_threshold, 1, 2))


# Build a table to summarize the strategy assign to each class
df = gdf[[field_classif_code,field_classif_name,'crop_pix','pix_ratio','strategy']].drop_duplicates()
df = df[df['strategy'] != 3].sort_values('pix_ratio', ascending=False)

display(df)


# Split polygons in training and validation datasets
# --------------------------------------------------

training_pixels = defaultdict(lambda: 0)
training_target = {}

cal_id_list = []
val_id_list = []

gdf_shuffle = gdf.sample(len(gdf), random_state=1)  # Shuffle geodataframe randomly

for i, row in gdf_shuffle.iterrows():

    id          = row['id']
    crop_code   = row[field_classif_code]
    crop_pix    = row['crop_pix']
    strategy    = row['strategy']
    pix_count   = row['pix_count']

    if strategy != 3:
        if strategy == 1:
            crop_target = sample_ratio_hi * crop_pix
            training_target[crop_code] = crop_target

        elif strategy == 2:
            crop_target = sample_ratio_lo * crop_pix
            training_target[crop_code] = crop_target

        pixels = training_pixels[crop_code]

        if pixels + pix_count <= crop_target:  # If below crop target, add polygons in training set
            training_pixels[crop_code] = pixels + pix_count  # Update number of training pixels by crop code
            purpose = 0  # training

        else:
            purpose = 1  # validation because "crop target" is reached
    else:
        purpose = 1  # validation because polygons do not contain enough pixels (= strategy 3)

    if purpose == 0:
        cal_id_list.append(id)
    else:
        val_id_list.append(id)


print(f'Number of polygons for training   : {len(cal_id_list)}')
print(f'Number of polygons for validation : {len(val_id_list)}')

cal_gdf = gdf[gdf['id'].isin(cal_id_list)]

val_gdf = gdf[gdf['id'].isin(val_id_list)]

	grp_1_nb	grp_1	crop_pix	pix_ratio	strategy
221	1111	Winter wheat	29633	0.275635	1
0	3	Grassland and meadows	23889	0.222207	1
295	1121	Maize	8667	0.080617	1
331	1811	Sugar beet	7400	0.068832	1
352	1152	Barley six-row	6659	0.061940	1
441	1511	Potatoes	6439	0.059893	1
376	1192	Other cereals	6117	0.056898	1
411	1771	Peas	3606	0.033542	2
502	69	Forest	3109	0.028919	2
403	1923	Flax hemp and other similar crops	2810	0.026138	2
455	21	Fruits trees	2239	0.020826	2
421	1435	Rapeseed	1826	0.016985	2
466	81	Urban	1795	0.016696	2
409	121	Leafy or stem vegetables	1351	0.012567	2
394	1171	Oats	776	0.007218	2
434	1911	Alfalfa	554	0.005153	2
429	84	Greenhouses	252	0.002344	2
462	9212	Permanent rivers and streams	208	0.001935	2
372	22	Vineyards	178	0.001656	2

Number of polygons for training   : 205
Number of polygons for validation : 332

Get summary of the CAL/VAL splitting

[10]:

# Get number of pixels per class

pix_per_class_cal_df = cal_gdf.groupby([field_classif_code, field_classif_name])['pix_count'].agg('sum').to_frame().reset_index()
pix_per_class_cal_df = pix_per_class_cal_df.sort_values(by='pix_count', ascending=False)
pix_per_class_cal_df = pix_per_class_cal_df.rename(columns={"pix_count": "CAL_pix"})

pix_per_class_val_df = val_gdf.groupby([field_classif_code, field_classif_name])['pix_count'].agg('sum').to_frame().reset_index()
pix_per_class_val_df = pix_per_class_val_df.sort_values(by='pix_count', ascending=False)
pix_per_class_val_df = pix_per_class_val_df.rename(columns={"pix_count": "VAL_pix"})

# Get number of polygons per class

poly_per_class_cal_df = cal_gdf.groupby([field_classif_code, field_classif_name])[field_classif_code].agg('count').reset_index(name='poly_count')
poly_per_class_cal_df = poly_per_class_cal_df.sort_values(by='poly_count', ascending=False)
poly_per_class_cal_df = poly_per_class_cal_df.rename(columns={"poly_count": "CAL_poly"})

poly_per_class_val_df = val_gdf.groupby([field_classif_code, field_classif_name])[field_classif_code].agg('count').reset_index(name='poly_count')
poly_per_class_val_df = poly_per_class_val_df.sort_values(by='poly_count', ascending=False)
poly_per_class_val_df = poly_per_class_val_df.rename(columns={"poly_count": "VAL_poly"})


# Merge 2 previous dataframe in a single one
#pix_poly_per_class_df = pix_per_class_cal_df.merge(poly_per_class_df, on=[field_classif_code, field_classif_name], how="outer")


# Merge 4 previous dataframe in a single one
summary_df = pix_per_class_cal_df.merge(pix_per_class_val_df,on=[field_classif_code, field_classif_name]).merge(poly_per_class_cal_df,on=[field_classif_code, field_classif_name]).merge(poly_per_class_val_df,on=[field_classif_code, field_classif_name])

display(summary_df)

	grp_1_nb	grp_1	CAL_pix	VAL_pix	CAL_poly	VAL_poly
0	1111	Winter wheat	7403	22230	15	59
1	3	Grassland and meadows	5971	17918	60	161
2	1771	Peas	2454	1152	8	2
3	69	Forest	2296	813	28	7
4	1121	Maize	2132	6535	10	26
5	1923	Flax hemp and other similar crops	2059	751	5	3
6	1811	Sugar beet	1850	5550	4	17
7	1152	Barley six-row	1663	4996	8	12
8	1511	Potatoes	1577	4862	4	10
9	21	Fruits trees	1551	688	6	1
10	1192	Other cereals	1477	4640	7	10
11	81	Urban	1338	457	27	9
12	1435	Rapeseed	1176	650	4	2
13	121	Leafy or stem vegetables	566	785	2	2
14	1171	Oats	534	242	5	3
15	1911	Alfalfa	403	151	4	3
16	84	Greenhouses	183	69	3	2
17	9212	Permanent rivers and streams	152	56	3	1
18	22	Vineyards	124	54	2	2

Plot calibration and validation in situ data

[11]:

fig, ax = plt.subplots(1, 1, figsize=(15,15))

cal_gdf.plot(ax=ax,
            color='red')

val_gdf.plot(ax=ax,
            color='green')

ax.set_title(f'In situ with {site} {year}',fontsize=20)

cal_patch = mpatches.Patch(color='red', label='Training')
val_patch = mpatches.Patch(color='green', label='Validation')

plt.legend(handles=[cal_patch, val_patch])

plt.box(False)

../_images/classification_in_situ_split_cal_val_complex_strategies_17_0.png

Write geodataframe into a shapefile

[13]:

cal_gdf.to_file(in_situ_cal_shp)
val_gdf.to_file(in_situ_val_shp)

print(f'Two shapefiles have been created :\n {in_situ_cal_shp} \n {in_situ_val_shp}')

Two shapefiles have been created :
 /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/C_IN_SITU_CAL_VAL/NAMUR_2020_IN_SITU_ROI_CAL.shp
 /export/miro/ndeffense/LBRAT2104/STUDENTS/GROUP_99/TP/C_IN_SITU_CAL_VAL/NAMUR_2020_IN_SITU_ROI_VAL.shp