Estimation of nitrogen in cotton leaves using different hyperspectral region data

As an important index of a plant’s N nutrition, leaf nitrogen content (LNC) can be monitored quickly and in real time with hyperspectral information, which helps guide the precise application of N to cotton. In this study, drip-irrigated cotton in Xinjiang, China, was taken as the object of study. Five N application treatments (0, 120, 240, 360, and 480 kg·ha⁻¹) were set up, and the hyperspectral data and the N content of main-stem functional leaves at the cotton flower and boll stage were collected. The results showed that (1) when the correlations of the original, first derivative, and second derivative spectra with the LNC of cotton were compared, the first derivative spectra increased the correlation between the reflectance in the peak and valley ranges of the spectral curves and the LNC of cotton; (2) in the three hyperspectral regions of VIS, NIR, and SWIR, all R values of the estimation models for the LNC of cotton established from the characteristic wavelengths of the original and first derivative spectra were greater than 0.8, and the model accuracy was better than that of the second derivative spectra; and (3) the normalized root mean square error (n-RMSE) values of the validated models using the MLR, PCR, and PLSR regression methods were all in the range of 10–20%, indicating that the established models could estimate the nitrogen content of cotton leaves well. The results of this study demonstrate the potential of the three hyperspectral domains of VIS, NIR, and SWIR to estimate the LNC of cotton and provide a new basis for hyperspectral data application in crop nutrient monitoring.

Because of the correlation between nitrogen and other elements, characteristic wavelengths that are significant for nitrogen estimation are found in the vegetation spectrum (Curran, 1989; Homolová et al., 2013).
In C3 plants, approximately 19% of leaf nitrogen is allocated to the light-harvesting complex, where it is mainly used to synthesize chlorophylls and chlorophyll-binding proteins in both photosystems (Chapin et al., 1987). This allocation explains the strong correlation between nitrogen and chlorophyll reported by many studies. In the early stage of crop growth, chlorophyll content and nitrogen content in leaves are highly correlated, and the chlorophyll absorption features in the visible spectrum can be used directly to estimate nitrogen content (le Maire et al., 2008). Chlorophyll content in cotton leaves was correlated with the Vogelmann1 index in four growing periods (squaring stage, full budding stage, flowering and boll stage, and boll opening stage), with correlation coefficients of 0.944, 0.907, 0.895, and 0.930, respectively (Hong et al., 2019). In the reproductive growth stage, nitrogen is redistributed from vegetative organs to reproductive structures (Ohyama, 2010), so the chlorophyll content in leaves decreases while the nitrogen content in plants remains unchanged. The correlation between chlorophyll content in the canopy and nitrogen content in the aboveground parts gradually weakens with the emergence or aging of reproductive organs (Berger et al., 2020). Moreover, some studies suggest that the relationship between chlorophyll and nitrogen is highly dependent on the growing season of crops (Spaner et al., 2005; Xu et al., 2021). This relationship is also unstable across cultivars, so estimating nitrogen from chlorophyll has limitations (Hosgood, 1994).
In Curran's work (1989), the absorption peaks of protein (and nitrogen) lie mainly in the SWIR range. Furthermore, Kokaly (2009) suggested that the VIS and NIR could be used to directly evaluate the nitrogen status of crops. Zhang et al. (2010) constructed and evaluated an NIR spectral model of rice leaf nitrogen content using partial least squares regression (PLS), principal component regression (PCR), and stepwise multiple regression (SMR). Lee et al. (2020) pointed out that a monitoring model for the nitrogen concentration of cotton leaves based on the ratio of the red edge position to a short-wave near-infrared band was relatively accurate. Wang et al. (2021) reported that leaf structure parameters (leaf thickness) and dry matter content (protein, cellulose, and lignin) lead to high reflectance and reflectance fluctuations in the near-infrared and short-wave infrared regions, and that 1500-1900 nm and 2000-2400 nm in the short-wave infrared region are the main wavelengths related to nitrogen.
Using a field-portable spectrometer with a spectral range of 350-2500 nm, researchers have established a spectral index that combines the visible and near-infrared spectral domains to estimate the total nitrogen concentration of winter wheat leaves (Duan et al., 2019). Wang et al. (2021) reported that the spectral signal in the short-wave infrared region is closely related to the nitrogen concentration in the leaves, but they studied only the effects of chlorophyll (visible light region) and nitrogen concentration (short-wave infrared region) on the Vmax of photosynthesis. Shi et al. (2015) established the normalized difference spectral index (NDSI) and the three-band spectral index (TBSI) for five crops (rice, corn, tea, sesame, and soybean) using wavelengths in the spectral range of 400-2400 nm. The bands used covered the three spectral domains of visible, near-infrared, and short-wave infrared, and the authors concluded that species-specific differences could affect the hyperspectral estimation of crop leaf nitrogen concentration. Kennedy et al. (2013) studied the biochemical components of birch and eucalyptus using near-infrared (1100-2300 nm) spectral information. The relative prediction standard deviation (RPD) of the established nitrogen model was 3.90-4.78, showing excellent model performance (RPD ≥ 3 indicates that a model has a favourable prediction effect).
The research conducted thus far is mainly based on the estimation and analysis of crop nutrient components in a few sensitive bands, either considering only a single spectral region (visible, near-infrared, or short-wave infrared) or using vegetation indices involving two spectral regions to construct models. Studies have not fully accounted for the interaction effects between the spectral domains, nor have they taken full advantage of the detailed spectral information provided by hyperspectral equipment. In this study, we measured the nitrogen content of main-stem functional leaves at the flower and boll stage of drip-irrigated cotton. The hyperspectral range of 350-2500 nm was divided into three spectral regions: visible (VIS), near-infrared (NIR), and short-wave infrared (SWIR). After the characteristic wavelengths of the three spectral regions were screened, the estimation accuracy of each region was analysed by multiple linear regression (MLR), principal component regression (PCR), and partial least squares regression (PLSR) to explore the potential of the three hyperspectral regions in estimating nitrogen content in cotton leaves and to provide a new idea for monitoring crop nutrition information.

Acquisition of hyperspectral data of cotton leaves
The main-stem functional leaf is an important source organ of cotton. Six cotton plants were selected for each treatment, and 226 samples were obtained. The hyperspectral instrument was an SR-3500 portable field spectrometer (Spectral Evolution, USA) with a spectral range of 350-2500 nm. The spectrometer has a built-in light source and leaf clip. When hyperspectral data were acquired, the veins of the cotton leaves were avoided. Three positions on each cotton leaf were measured, and each position was measured three times. The average value was taken as the spectral data of the position, and the average reflectance of the three positions was taken as the spectral data of the leaf (Figure 1). Before each cotton leaf was measured, the spectrometer was calibrated with the white reference panel.
The nitrogen content of cotton leaves was determined by the Kjeldahl method. The main-stem leaves of cotton were put into an envelope, dried in an oven at 105 ℃ for 30 min, and then dried at 85 ℃ to a constant mass. The dried sample was crushed and then passed through a 100-mesh sieve.
We weighed 0.1000 g of the sample on a balance with a precision of one ten-thousandth of a gram and placed it in a digestion tube, where it was digested with H2SO4-H2O2. The digest was made up to a constant volume of 50 ml, and 10 ml was transferred to the Kjeldahl nitrogen analyzer for distillation, followed by titration with sulfuric acid. The nitrogen content of leaves was calculated by a formula in which c is the concentration of the dilute sulfuric acid solution (mol/L); v and v0 are the volumes of dilute sulfuric acid used in titrating the sample solution and the blank, respectively (ml); 0.014 is the molar mass of nitrogen (kg·mol⁻¹); ts is the separation multiple, that is, the ratio of the constant volume to the separated volume; 10⁻³ is the conversion factor between kilograms and grams; and m is the mass of the weighed sample (g).
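The equation itself does not appear in this text, so the following is only a sketch of one arrangement of the listed quantities that is dimensionally consistent and returns g/kg, the unit in which the LNC range is reported in the Conclusions. The function name, the placement of the 10⁻³ factor in the denominator, and the example titration values are assumptions, not the authors' stated equation.

```python
def leaf_nitrogen_g_per_kg(c, v, v0, ts, m):
    """Leaf N content (g/kg) from a Kjeldahl titration.

    Assumed arrangement (not confirmed by the source):
        N = c * (v - v0) * 0.014 * ts / (m * 1e-3)

    c     : concentration of the dilute sulfuric acid titrant (mol/L)
    v, v0 : titrant volumes for the sample and the blank (ml)
    ts    : ratio of the constant volume to the distilled aliquot volume
    m     : mass of the weighed sample (g)
    """
    # mol/L * ml * kg/mol gives grams of N in the aliquot; ts scales to the full digest.
    n_mass_g = c * (v - v0) * 0.014 * ts
    # Dividing by the sample mass expressed in kg yields g of N per kg of sample.
    return n_mass_g / (m * 1e-3)


# Hypothetical titration values, chosen only to illustrate the call.
example = leaf_nitrogen_g_per_kg(c=0.01, v=5.0, v0=0.2, ts=50 / 10, m=0.1)
print(round(example, 2))
```

With the 50 ml constant volume and 10 ml aliquot described above, ts equals 5.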

Selection of characteristic wavelength of the LNC in cotton
According to previous studies, hyperspectral bands are divided into three spectral regions: the visible region (VIS, 380-700 nm), the near-infrared region (NIR, 701-1300 nm), and the short-wave infrared region (SWIR, 1301-2500 nm). The VIS region contains 321 wavelengths, the NIR region contains 600 wavelengths, and the SWIR region contains 1200 wavelengths. Because each spectral region contains substantial redundant information and multicollinear data, we used the successive projections algorithm (SPA) to select the characteristic wavelengths of the leaf, which reduces the number of input variables and improves the modelling accuracy. The SPA finds the least redundant variables in the spectral information, minimizes the collinearity between variables, and reduces the complexity of the model. The SPA in this study was run in MATLAB R2014b.
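To make the idea of "least redundant variables" concrete, the projection phase of the SPA can be sketched in plain Python as follows. This is an illustrative sketch, not the MATLAB routine used in the study: the function name `spa_select`, the mean-centring step, and the fixed starting column are assumptions, and the full algorithm additionally compares candidate starting wavelengths and subset sizes by validation error, which is omitted here.

```python
def spa_select(X, n_select, start=0):
    """Projection phase of the successive projections algorithm (sketch).

    X        : list of samples, each a list of reflectance values (one per wavelength)
    n_select : number of wavelength columns to pick
    start    : index of the first column in the chain (a free choice in this sketch)
    """
    n_cols = len(X[0])
    if n_select > n_cols:
        raise ValueError("n_select cannot exceed the number of wavelengths")

    # Mean-centre every column so projections compare variation, not offsets.
    cols = []
    for j in range(n_cols):
        col = [row[j] for row in X]
        mean = sum(col) / len(col)
        cols.append([value - mean for value in col])

    selected = [start]
    while len(selected) < n_select:
        ref = cols[selected[-1]]
        ref_sq = sum(value * value for value in ref)
        best_j, best_norm = None, -1.0
        for j in range(n_cols):
            if j in selected:
                continue
            # Remove from column j the component parallel to the last selected column.
            if ref_sq > 0:
                coef = sum(a * b for a, b in zip(cols[j], ref)) / ref_sq
                cols[j] = [a - coef * b for a, b in zip(cols[j], ref)]
            # The column with the largest residual norm is the least collinear one.
            norm_sq = sum(a * a for a in cols[j])
            if norm_sq > best_norm:
                best_j, best_norm = j, norm_sq
        selected.append(best_j)
    return selected


# Toy data: column 1 is an exact multiple of column 0, column 3 is nearly collinear with it.
toy = [
    [1.0, 2.0, 1.0, 1.1],
    [2.0, 4.0, 0.0, 2.0],
    [3.0, 6.0, 1.0, 3.1],
    [4.0, 8.0, 0.0, 4.0],
]
print(spa_select(toy, n_select=2, start=0))
```

In the toy data the second pick is column 2, the only column that carries information not already explained by column 0.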

Establishment and evaluation of multiple linear regression model
The models were constructed using multiple linear regression (MLR), partial least squares regression (PLSR), and principal component regression (PCR) in The Unscrambler X 10.4 software, based on the screened characteristic wavelengths.
The coefficient of determination of the calibration set (R²c), the coefficient of determination of the prediction set (R²p), the root mean square error of calibration (RMSEC), and the root mean square error of prediction (RMSEP) were used to quantify accuracy and thereby evaluate the models. A favourable model should have higher R²c and R²p and lower RMSEC and RMSEP. The models were validated by the normalized root mean square error (n-RMSE), and a 1:1 plot of simulated and measured values was drawn to show the fit and reliability of the model. When n-RMSE is less than 10%, the simulation performance of the model is excellent; when it is between 10% and 20%, the performance is good; when it is between 20% and 30%, the performance is average; and when it is more than 30%, the performance is poor (Ma et al., 2018; Duan et al., 2019).
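The evaluation criteria above can be sketched in a few lines of Python. The text does not give the formulas, so this sketch assumes the common definitions: R² as one minus the ratio of residual to total sum of squares, and n-RMSE as the RMSE divided by the mean of the measured values, in percent. The function names and the handling of values that fall exactly on a class boundary are likewise assumptions.

```python
import math


def r_squared(measured, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(measured) / len(measured)
    ss_res = sum((o - p) ** 2 for o, p in zip(measured, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in measured)
    return 1.0 - ss_res / ss_tot


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def n_rmse_percent(measured, predicted):
    """RMSE normalized by the mean measured value, in percent (assumed definition)."""
    return 100.0 * rmse(measured, predicted) / (sum(measured) / len(measured))


def rate_model(n_rmse):
    """Map n-RMSE (%) to the four performance classes listed in the text."""
    if n_rmse < 10:
        return "excellent"
    if n_rmse < 20:
        return "good"
    if n_rmse < 30:
        return "average"
    return "poor"


# Hypothetical LNC values (g/kg), used only to show the calls.
measured = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
score = n_rmse_percent(measured, predicted)
print(round(r_squared(measured, predicted), 3), round(score, 2), rate_model(score))
```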

Statistical analysis of the LNC in cotton
In this study, before the hyperspectral inversion model of the LNC in cotton was established, the samples were divided into a calibration set and a prediction set by the content gradient method. In this method of sample selection, the samples are arranged in order of composition content from large to small, every third sample is assigned to the prediction set, and the remaining samples form the calibration set (Yu et al., 2018). There were 151 samples in the calibration set and 75 samples in the prediction set. The statistical parameters are shown in Table 1. The results show that the calibration set is consistent with the range of the total sample data and contains the data range of the prediction set, so the two sets can be used for establishing and verifying the model.

Correlation between spectral reflectance under different pre-treatments and the LNC
The derivative is a common spectral pre-processing method that can effectively eliminate the influence of linear background noise and baseline drift and improve the spectral resolution and sensitivity. The correlations of the original spectrum, the first derivative spectrum, and the second derivative spectrum with the LNC in cotton were studied (Figure 2). There was a negative correlation between the original spectral reflectance in the visible region and the nitrogen content in cotton leaves, which reached an extremely significant level.
The absolute correlation at 507-541 nm and 697-700 nm was above 0.75 (P < 0.01), indicating that an increase in the LNC reduces the spectral reflectance of visible light, mainly because an increase in leaf nitrogen content increases the chlorophyll content of leaves and thus their light absorption. After the first derivative transformation, there was a significant negative correlation between the spectral reflectance and the LNC at 567-673 nm; the absolute correlation was more than 0.80 (P < 0.01) at 633-636 nm and more than 0.85 (P < 0.01) at 697-700 nm (Figure 2, VIS-1-Derivative spectrum). After the second derivative treatment, although many bands were significantly correlated with leaf nitrogen content, the absolute correlation was greater than 0.75 only at four wavelengths: 650 nm in the red valley and 688-690 nm in the red edge region. The correlation between leaf nitrogen concentration and spectral reflectance therefore differed between spectral pre-treatments. For the VIS, the first derivative spectrum can enhance the correlation between some spectral bands and leaf nitrogen content.
In the near-infrared spectral domain (NIR), there was a significant negative correlation between the original spectral reflectance and leaf nitrogen concentration at 701-763 nm, with the maximum negative correlation at 720 nm (r = -0.871, P < 0.01). This band range is the red edge region of the leaf reflectance spectrum and is sensitive to changes in leaf structure and physicochemical parameters. In this band, the correlation between the first derivative spectrum and leaf nitrogen content changed rapidly from an extremely significant negative correlation to an extremely significant positive correlation; the absolute correlation was more than 0.75 at 720-763 nm and peaked in the positive range at 743 nm (|r| = 0.868, P < 0.01). Moreover, the second derivative spectra showed an extremely significant positive correlation with leaf nitrogen content in the 924-961, 1077-1162, and 1262-1300 nm bands, and an extremely significant negative correlation in the 701-706, 996-1062, 1167-1186, 1197-1205, and 1212-1255 nm bands. These bands correspond to the peaks and troughs of the high reflection platform in the near-infrared spectral region. It is considered that the first derivative spectrum enhances the relationship between the reflectance in the peak and trough ranges of the spectral curve and the nitrogen content of the leaf. The correlation between the second derivative of the NIR spectrum and leaf nitrogen content was strongest at 714 nm (|r| = 0.866, P < 0.01). There was an extremely significant positive correlation at 1110-1137 nm and an extremely significant negative correlation at 1141-1168 nm. This pattern is also caused by the peaks and troughs of the high reflection platform.
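The sample partition and the band-wise correlation analysis described in this section can be sketched as follows. The sketch rests on several assumptions that the text does not confirm: the prediction set is taken as every third sample of the descending ranking (which reproduces the 151/75 split for 226 samples), the derivative is a central finite difference, and r is the Pearson coefficient. All function names are hypothetical.

```python
import math


def content_gradient_split(lnc_values, step=3):
    """Rank samples by LNC (descending); every `step`-th rank goes to prediction."""
    order = sorted(range(len(lnc_values)), key=lambda i: lnc_values[i], reverse=True)
    calibration, prediction = [], []
    for rank, idx in enumerate(order, start=1):
        (prediction if rank % step == 0 else calibration).append(idx)
    return calibration, prediction


def first_derivative(wavelengths, reflectance):
    """Central finite difference; the two end points are dropped."""
    inner = range(1, len(wavelengths) - 1)
    deriv = [
        (reflectance[i + 1] - reflectance[i - 1]) / (wavelengths[i + 1] - wavelengths[i - 1])
        for i in inner
    ]
    return list(wavelengths[1:-1]), deriv


def pearson_r(x, y):
    """Pearson correlation between one band's values and LNC across samples."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x)
    sy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(sx * sy)


# 226 placeholder LNC values, used only to check the resulting set sizes.
cal_idx, pred_idx = content_gradient_split([float(i) for i in range(226)])
print(len(cal_idx), len(pred_idx))
```

Applying `first_derivative` a second time to its own output gives a second derivative under the same assumption. Because the largest and smallest ranked samples both fall in the calibration set with this rule, the calibration range contains the prediction range, as the text reports.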

There are also several peaks and troughs in the SWIR due to the influence of water absorption and the chemical bonds of proteins and other substances. At the inflection points of the SWIR region (1450, 1650, and 1900 nm), the reflectance showed both positive and negative correlations with leaf nitrogen content. The correlations between the LNC and spectral reflectance in the SWIR in Figure 2 show that derivative transformation clearly enhances the spectral correlation: the correlation between the original spectrum and leaf nitrogen content ranged from -0.329 to 0.655. The correlation between the first derivative and leaf nitrogen content ranged from -0.817 to 0.838, with the maximum positive correlation at 1306 nm and the maximum negative correlation at 1682 nm. The correlation between the second derivative and leaf nitrogen content ranged from -0.839 to 0.807, with the maximum positive correlation at 1410 nm and the maximum negative correlation at 1383 nm.

Feature wavelength selection based on SPA
SPA was used to select the characteristic wavelengths of the original spectrum, the first derivative spectrum, and the second derivative spectrum. In the VIS, only two characteristic wavelengths were selected from the second derivative spectrum, and both were in the red region. Seven characteristic wavelengths were selected in the blue, green, and red bands of the original spectrum and the first derivative spectrum. Among the selected characteristic wavelengths, 450 nm and 459 nm, 544 nm and 561 nm, and 674 nm and 676 nm were near the blue reflection valley, green reflection peak, and red reflection valley, respectively, indicating that the selected characteristic wavelengths conformed to the absorption characteristics of chlorophyll in visible light to a certain extent. The wavelengths of 693 nm, 699 nm, and 700 nm were at the red edge. The selected characteristic wavelengths were significantly correlated with leaf nitrogen content.

In the near-infrared region, the high reflection platform is formed mainly by the influence of leaf structures such as mesophyll tissue. In this region, 19 characteristic wavelengths were selected. In the red edge position range of 700-760 nm, two characteristic wavelengths each were selected from the original spectrum and the first derivative spectrum, and one from the second derivative spectrum. The wavelengths of 812 nm, 842 nm, 845 nm, 880 nm, and 885 nm may be related to the fourth overtone of N-H bonds. The wavelengths of 913 nm, 935 nm, 938 nm, 950 nm, 953 nm, and 957 nm may be related to the fourth overtone of C-H bonds in leaves. The wavelengths of 975 nm, 976 nm, 977 nm, 995 nm, and 1000 nm may be related to the stretching of N-H bonds.
The SWIR is mainly the spectral region of water absorption, but protein absorption also lies in this band. In this region, the second derivative spectrum yielded the most characteristic wavelengths, but only two of them had a significant relationship with leaf nitrogen content. The characteristic wavelengths of 1501 nm, 1977 nm, 2145 nm, and 2308 nm were related to the first overtone and stretching of the N-H bond, and the characteristic wavelengths of 1712 nm, 1723 nm, 1748 nm, and 2405 nm were related to the first overtone and stretching of the C-H bond. C-H and N-H bonds are major constituents of protein.
Notes: ** and * represent correlations significant at the probability level of P < 0.01 and P < 0.05, respectively.

Establishment and validation of cotton leaf nitrogen content estimation model based on different hyperspectral regions
Using the characteristic wavelengths selected by the successive projections algorithm (Table 2), estimation models of cotton leaf nitrogen content were established by multiple linear regression (MLR), principal component regression (PCR), and partial least squares regression (PLSR). The model parameters are shown in Figures 3-5. Among the constructed models, the best estimation model for each regression method in each spectral domain was identified from the highest coefficient of determination (R²) and the lowest n-RMSE.
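As a rough illustration of the MLR step only (PCR and PLSR are not shown), an ordinary least squares fit on a handful of selected wavelengths can be sketched in plain Python. This is a sketch under stated assumptions and not the procedure implemented in The Unscrambler: the names `fit_mlr` and `predict`, the normal-equation solver, and the toy data are all hypothetical.

```python
def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bk].

    X : list of samples, each a list of reflectance values at the selected wavelengths
    y : measured LNC for each sample
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]
    k = len(rows[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented matrix.
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; reduce the wavelength set")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coef = [0.0] * k
    for i in range(k - 1, -1, -1):
        rest = sum(aug[i][j] * coef[j] for j in range(i + 1, k))
        coef[i] = (aug[i][k] - rest) / aug[i][i]
    return coef


def predict(coef, X):
    """Apply the fitted intercept and slopes to new samples."""
    return [coef[0] + sum(b * x for b, x in zip(coef[1:], row)) for row in X]


# Toy data generated from y = 1 + 2*x1 - 3*x2, so the fit should recover these values.
toy_X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
toy_y = [1 + 2 * a - 3 * b for a, b in toy_X]
print([round(b, 6) for b in fit_mlr(toy_X, toy_y)])
```

The collinearity check shows why wavelength screening matters for MLR: with strongly collinear bands the normal equations become ill-conditioned, which is the situation PCR and PLSR are designed to handle.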

Establishment and validation of the LNC in cotton estimation model based on original spectrum
Based on the characteristic wavelengths screened from the original spectral reflectance, the LNC estimation models were built (Figure 3). A comprehensive comparison of the modelling effects in the three spectral regions showed that the MLR model established in the NIR had the highest accuracy (R²c = 0.909, R²p = 0.899), possibly because more characteristic wavelengths were selected, and the model constructed in the SWIR had the lowest accuracy (R²c = 0.789, R²p = 0.848). In the VIS and NIR domains, the accuracy of the PCR model was relatively low, with R²c values of 0.843 and 0.873, respectively. In the SWIR, the accuracy of the PLSR model was slightly lower, and there were no differences in R², RMSE, and n-RMSE between the MLR and PCR models. This may be due to the significant correlation between the selected characteristic wavelengths and leaf nitrogen content, and to the small number of characteristic wavelengths. Using the characteristic wavelengths of the three spectral regions, the n-RMSE of the LNC estimation models based on the original spectrum was 10.193-12.630%, indicating that the accuracy of the models was sufficient.
For the estimation models established from the characteristic wavelengths of the first derivative spectrum (1-Derivative spectrum) (Figure 4), the model constructed by PCR in the NIR domain had the highest R²c and R²p, which were 0.843 and 0.852, respectively. In the VIS and SWIR, the R²c and R²p of the three methods were above 0.8. The accuracy of the model in the SWIR increased: the R²c values were 0.807, 0.806, and 0.804, and the R²p values were 0.821, 0.820, and 0.813, respectively, indicating that the first derivative spectrum increases the accuracy of the model in the SWIR. For the validation models, the n-RMSE ranged from 12.303% to 15.406%. Compared with the validation models of the original spectrum, the n-RMSE increased, so the model accuracy was lower.
In the LNC estimation models using second derivative spectral reflectance (2-Derivative spectrum, Figure 5), the R²c and R²p values of the models established by MLR, PCR, and PLSR in the three spectral domains did not reach 0.8, and the R²p values were greater than the R²c values. The MLR model was again the most accurate in the NIR domain (R²c = 0.746, R²p = 0.790). Although the n-RMSE of the validation models was within 10-20%, it was the highest among the three types of spectral data, indicating that modelling with the second derivative spectra was worse than modelling with the other two types of data.
The above results show that, when the LNC was estimated from different spectral regions, the models based on the original spectrum were superior in the VIS and NIR, and the models based on the first derivative spectrum were superior in the SWIR. In contrast, the models established from the second derivative spectrum were poor in all three spectral regions. The best values of R²c, R²p, and n-RMSE were obtained by the multiple linear regression model based on the original spectrum in the NIR. Perhaps because few characteristic wavelengths were selected, the PCR and PLSR models were not more accurate than the multiple linear regression models among the 27 estimation models of the LNC in cotton leaves.

Discussion
The use of spectral regions to estimate the biochemical components of crops began with leaf spectra, mainly because leaf spectra are not affected by crop canopy structure, soil background, or the monitoring environment, so the spectral data obtained are relatively ideal (Hasituya et al., 2020; Lee et al., 2020). In this study, the LNC of cotton was investigated. Using the VIS (380-700 nm), NIR (701-1300 nm), and SWIR (1301-2500 nm) regions of the hyperspectral data, we studied the potential of each of the three spectral regions to estimate the LNC of crops.
The VIS is affected by the chlorophyll content, while the NIR band is mainly affected by the change in spectral albedo caused by the structure of the leaves. In this study, the characteristic wavelengths in the visible spectrum were mainly near the reflection peaks or valleys of red, green, and blue light and in the range of the red edge position. The characteristic wavelengths of the NIR were mainly related to the activities of N-H bonds and C-H bonds, which is consistent with previous research results, indicating that they have universal applicability as key bands for the study of plant spectral reflection (Cochrane, 2000; Liu et al., 2011; Rubert-Nason et al., 2013; Fan et al., 2019).
In the VIS and NIR, the accuracy of the models established using the characteristic wavelengths of the original spectrum and the first derivative spectrum was higher, which is consistent with previous research results. These results indicate that, when the biochemical components of crops are estimated from leaf hyperspectral information, the original spectrum or a simple data transformation retains more of the information reflected by the leaves (Yi et al., 2010; Fan et al., 2019).
The SWIR reflectance curve is mainly related to chemical bond (C-H, N-H, and O-H) interactions and various biochemical nutrition characteristics of crops (Rubert-Nason et al., 2013; Zhang et al., 2018). For the SWIR, the characteristic wavelengths selected from the original spectrum have an extremely significant correlation with the LNC in cotton, which is consistent with the results for the VIS and NIR (Table 2). Among the characteristic wavelengths screened from the first and second derivative spectra, those at 1301-1397 nm were related to the second overtone of the C-H and O-H bonds in protein, those at 1712-1748 nm were mainly generated by the first overtone absorption of the C-H bond, and those at 1977-2308 nm were related to the first overtone and stretching of the N-H bond, which is consistent with earlier results on protein. When the models established from the characteristic wavelengths of the three spectrum types were compared, the R²c and R²p values of each regression model established from the first derivative were the highest.
A comparison of the three modelling methods across the three spectral domains shows that PCR and PLSR have no advantage over MLR, which differs from the results of Yi et al. (2010). This may be due to the small number of characteristic wavelengths selected in this study, which leaves PCR and PLSR little room to reduce and re-screen the characteristic wavelengths. The n-RMSE values of the models established by the three modelling methods were all in the range of 10-20%, indicating that the models obtained by dividing the whole band into three spectral domains and using various modelling methods were relatively stable, had suitable prediction accuracy, and could be applied to the estimation of the LNC in cotton.

Conclusions
In this study, we analysed cotton leaf samples with nitrogen contents in the range of 7.66-38.88 g/kg. The correlations between the wavelengths of the three spectral regions (VIS, NIR, and SWIR) and the LNC, together with a comparative analysis of the constructed models, showed that (1) the first derivative pre-treatment can enhance the correlation between spectral bands and leaf nitrogen content; (2) the multiple linear regression and principal component regression models based on the characteristic bands screened from the original spectrum and the first derivative spectrum are more suitable for estimating leaf nitrogen content, with R²c = 0.794-0.909 and R²p = 0.774-0.899; and (3) the validation accuracy (n-RMSE) of the estimation models established using the VIS, NIR, and SWIR is 10-20%, indicating that the three spectral regions have the potential to estimate the nitrogen content of cotton leaves, which provides a new idea for more comprehensive estimation of crop nutrient information from hyperspectral data.