Estimation of the Weights of Almond Nuts Based on Physical Properties through Data Mining

Quality attributes are the major parameters determining the market value of agricultural goods and commodities. Several practices are applied to improve the quality parameters of fruits and vegetables. Such quality attributes should also be estimated through various approaches before designing the equipment and tools used in handling and processing these goods and before designing storage facilities. Data mining is a novel approach used to estimate various attributes or quality parameters of fruits from previously measured attributes. The different algorithms embedded in data mining operations may yield quite accurate and reliable equations for the estimation of quality attributes. Almond is a significant cash crop for growers. Since almond is quite tolerant to drought and salinity, it is preferred by producers in various parts of the country. Weight is the primary quality parameter determining the market value of almonds. This study was conducted to estimate the nut weights of seven different almond varieties and to develop an equation for the estimation of nut weights. A data mining approach was used to estimate nut weights from physical fruit quality attributes (kernel length, width, thickness, arithmetic mean diameter, geometric mean diameter, sphericity, surface area, volume, shape index and aspect ratio). The findings revealed significant, accurate and practicable rules to estimate the nut weights of different almond varieties. It was concluded that data mining could be used as a reliable tool to estimate the nut weights of different almond varieties from the physical attributes of the fruits.


Introduction
Almond (Prunus dulcis) belongs to the Prunoideae subfamily of the Rosaceae family. The fruit of the almond is a drupe consisting of an outer hull and a hard shell enclosing the seed (kernel). The edible seed is covered with a brown skin (Monagas et al., 2007). Almond nuts are rich in protein and vitamins and therefore have high economic value (Sivaci and Duman, 2014) and are commonly preferred by consumers (Piscopo et al., 2010).
Almond is also a significant fruit in Turkey and is grown throughout the country except for the East Black Sea coast and high plateaus. Since it is quite tolerant to drought and salinity (Sorkheh et al., 2011), it is preferred by producers in most parts of the country (Ercisli, 2004). Annual worldwide production was reported as 3,214,303 tons. Turkey has an annual almond production of 85,000 tons, constituting around 2.6% of world production. With this quantity, Turkey ranks 6th in world production (FAO, 2016).
To fulfill industrial requirements and consumer desires, almond nuts should have high quality attributes (Kodad et al., 2018). Quality is largely determined by physical attributes such as the thickness, length and width of the nuts and kernels, and these attributes are directly related to weight (Imani and Shamili, 2018).
Fruit weights are commonly estimated from other quality attributes when designing packaging, transportation and marketing operations. Quality and standard inspections are also implemented by using weight estimations (Vivek Venkatesh et al., 2015). Various approaches have been developed to estimate the weights of apricots (Naderi-Boldaji et al., 2008), citrus fruits (Omid et al., 2010), tangerines (Rashidi and Keshavarzpour, 2011), bananas (Soares et al., 2013) and mangos (Schulze et al., 2015). Tabatabaeefar et al. (2000) measured the size attributes of oranges with a micrometer and a ∆T area-meter and developed a model to estimate fruit weights from dimensional characteristics. Lorestani and Tabatabaeefar (2006) used dimensional attributes to develop a model to estimate the fruit mass of kiwi fruits. Khoshnam et al. (2007) employed linear and nonlinear models to model the fruit mass of pomegranates by using physical attributes of the fruits. Omid et al. (2010) estimated the volume and mass of citrus fruits by using image processing techniques. Shahbazi and Rahmati (2013) used physical attributes to estimate the mass of cherry fruits and employed four different models in the estimations: linear, power, quadratic and S-curve. Vivek Venkatesh et al. (2015) investigated the correlations between the volumes and masses of axi-symmetric fruits (apples, lemons and oranges). Imani and Shamili (2018) employed multivariate analysis procedures to estimate almond weights from physical characteristics including flower, nut and kernel attributes. Despite this literature, data mining techniques have not been applied to estimate the nut weights of almonds.
Data mining offers easy-to-apply procedures for the classification of existing datasets and the estimation of real values (Muhammad and Muhammad, 2016). Conventional statistics are not able to uncover the hidden information within datasets, and data mining approaches have therefore been developed to reveal such hidden knowledge (Maione et al., 2016). Data mining approaches can extract a great deal of predictive information from large databases. Therefore, they can be used in diverse disciplines ranging from science to engineering. Regression analysis, decision trees, classification and prediction, clustering and association rule analysis are the primary data mining techniques, and combinations of these techniques are also used in some cases (Rathod and Garg 2016). The research data and objectives primarily determine the approach to be used in data mining and the model to be built. Data mining techniques are basically classified as descriptive and predictive. While descriptive methods include association rules and clustering, predictive methods include classification (Tintarev and Masthoff, 2011).
The present study was conducted to estimate the weight of almond nuts (WAN) from physical attributes with data mining approaches and to develop an equation to calculate nut weights.

Biological material
The nuts of the 'Bertina', 'Ferragnes', 'Ferraduel', 'Ferrostar', 'Glorieta', 'Lauranne' and 'Marta' almond varieties were used as the material of the present study. Measurements were performed on these nuts. The nuts were supplied by the Pistachio Research Institute in Gaziantep, Turkey. The experimental nuts were cleaned thoroughly to free them of all kinds of foreign materials, undesired debris and broken nuts. The almond nuts of the different varieties are shown in Fig. 1.

Measurement and calculation of seed parameters
To determine the average seed size, a sample of 100 nuts was randomly selected from each variety and their three principal axes were measured. Measurements of the three major perpendicular dimensions (length-L, mm, width-W, mm and thickness-T, mm) were carried out with a digital caliper (± 0.01 mm). The weight of the almond nuts was measured with an electronic balance (± 0.001 g).
The arithmetic mean diameter (Da), geometric mean diameter (Dg) and sphericity (φ) were calculated by using the equations given by Mohsenin (1986) and cited by Rasouli et al. (2010) and Vishwakarma et al. (2012). The surface area (S) was obtained from the equation given by McCabe et al. (1986) and cited by Gupta and Das (1997) and Arslan and Vursavus (2008). The volume (V) of the seed was calculated by using the equations cited by Arslan and Vursavus (2008). The shape index (SI) was calculated with the equation cited by Ercisli et al. (2012) and Sayinci et al. (2015). The aspect ratio (Ra) was calculated with the equation cited by Mpotokwane et al. (2008).
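The size and shape descriptors named above can all be derived from the three caliper measurements L, W and T. The following is a minimal sketch of these calculations, assuming the forms in which the cited equations are commonly written (for example, sphericity as the ratio Dg/L rather than a percentage, and the auxiliary diameter B as the square root of W·T); these forms are assumptions and should be checked against the cited sources. The names `NutShape` and `derive_shape` are hypothetical and are introduced only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class NutShape:
    """Derived size and shape descriptors of a single nut."""
    Da: float   # arithmetic mean diameter, mm
    Dg: float   # geometric mean diameter, mm
    phi: float  # sphericity, unitless ratio (assumed form: Dg / L)
    S: float    # surface area, mm^2
    B: float    # auxiliary diameter used in the volume equation, mm
    V: float    # volume, mm^3
    SI: float   # shape index, unitless
    Ra: float   # aspect ratio, unitless (assumed form: W / L)


def derive_shape(L: float, W: float, T: float) -> NutShape:
    """Compute the descriptors from length L, width W and thickness T (all in mm).

    All formulas below are the commonly used forms and are ASSUMED here.
    """
    if min(L, W, T) <= 0:
        raise ValueError("L, W and T must be positive")

    Da = (L + W + T) / 3.0             # arithmetic mean of the three axes
    Dg = (L * W * T) ** (1.0 / 3.0)    # geometric mean of the three axes
    phi = Dg / L                       # sphericity relative to the length
    S = math.pi * Dg ** 2              # surface area of the equivalent sphere
    B = math.sqrt(W * T)               # geometric mean of width and thickness

    if 2.0 * L - B <= 0:
        raise ValueError("volume formula requires 2L > sqrt(W*T)")
    V = math.pi * B ** 2 * L ** 2 / (6.0 * (2.0 * L - B))

    SI = 2.0 * L / (W + T)             # length relative to mean of W and T
    Ra = W / L                         # width relative to length
    return NutShape(Da, Dg, phi, S, B, V, SI, Ra)


if __name__ == "__main__":
    # Illustrative dimensions only; not measurements from the study.
    print(derive_shape(30.0, 20.0, 12.0))
```

For a perfect sphere (L = W = T) these forms give a sphericity, shape index and aspect ratio of 1 and reduce the volume to that of a sphere, which is a quick sanity check on the assumed equations.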

Data mining
PolyAnalyst is a powerful multi-strategy data mining tool that allows the user to perform automatic data analysis. It is composed of 11 exploration engines, each of which is used for different purposes and at different stages of data mining operations (Fig. 2). Find Laws is a prediction engine of PolyAnalyst and is commonly used as a data mining tool. The Find Dependencies engine of PolyAnalyst was employed before the Find Laws engine to decide which parameters to use in the weight estimation of almond nuts.

Prediction methods and find laws
Conventional statistics are also used as classification and prediction tools in some cases (Gurbuz et al., 2009). However, Find Laws composes hybrid symbolic rules to explain non-linear dependencies in datasets. R-squared is commonly employed to assess the accuracy of the models used in predictions (PolyAnalyst - Megaputer Intelligence Inc., 2004). In the present study, the PolyAnalyst Find Laws data mining tool was used to estimate the weights of almond nuts (WAN).

Attribute selection, preprocessing and find dependencies
Attribute selection is generally used to eliminate irrelevant features and to select the more relevant and dependent attributes of a dataset (Dash and Liu, 1997; Guyon and Elisseeff, 2003). Since the present data were gathered from laboratory measurements, attribute selection was applied to the current dataset and only the measured attributes were selected. Then, the Find Dependencies algorithm was applied to find relations in the dataset (PolyAnalyst - Megaputer Intelligence Inc., 2004).

Experimental results with real data
The data used in this study were obtained from actual laboratory measurements. A total of 100 records were used for each almond variety. The data contain the measurements of the physical parameters of almond nuts (L, W, T, Da, Dg, φ, S, V, B, SI and Ra). These 11 parameters are the inputs of the study and the output is WAN. All 700 records for the seven almond varieties were used to predict WAN and to obtain a general formulation for the seven almond varieties.
Find Dependencies was applied in the preprocessing stage to all data. The P-value was used to specify the degree of dependency where applicable. The closer the P-value is to zero, the stronger the indication of dependency. When the P-value is below 10⁻⁷, it is reported as zero (PolyAnalyst - Megaputer Intelligence Inc., 2004).
At the end of the Find Dependencies analysis, the dependent variables for the WAN of all almond varieties were determined. The results of the preprocessing stage are provided in Table 1. The attribute sets shown in Table 1 were used in the prediction stage of the study. Following the preprocessing, the Find Laws tool was employed to estimate the WAN. In the first stage of the analysis, training was implemented on a random selection of 70% of the data, and the resultant rules were then verified through testing with the remaining 30% of the dataset. Findings for the training and testing stages are provided in Tables 2-8. The linear regression engine of PolyAnalyst was used to show the advantage of Find Laws. Although a simple formula was obtained with linear regression, Find Laws yielded more complex (with more attributes) and more accurate results.
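The random 70/30 training and testing procedure and the R² comparison can be sketched as follows. This is an illustration under stated assumptions, not the PolyAnalyst workflow: an ordinary least-squares line with a single predictor stands in for the linear regression baseline, the Find Laws engine is not reproduced, and the records are synthetic placeholders rather than measurements from the study. The function names `train_test_split`, `fit_line` and `r_squared` are hypothetical.

```python
import random


def train_test_split(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records and split it into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Ordinary least-squares fit of y = slope * x + intercept for (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(pairs, slope, intercept):
    """Coefficient of determination of the fitted line on the given pairs.

    Assumes the y values are not all identical (non-zero total sum of squares).
    """
    mean_y = sum(y for _, y in pairs) / len(pairs)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)
    ss_tot = sum((y - mean_y) ** 2 for _, y in pairs)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Synthetic placeholder records: (volume in mm^3, weight in g).
    # These are NOT data from the study; they only exercise the functions.
    rng = random.Random(42)
    volumes = [rng.uniform(1500.0, 4000.0) for _ in range(100)]
    records = [(v, 0.001 * v + rng.gauss(0.0, 0.2)) for v in volumes]

    train, test = train_test_split(records, train_fraction=0.7, seed=1)
    slope, intercept = fit_line(train)

    # The rule is fitted on the training part only and then scored on both parts.
    print(f"train R^2 = {r_squared(train, slope, intercept):.3f}")
    print(f"test  R^2 = {r_squared(test, slope, intercept):.3f}")
```

Scoring the fitted rule on the held-out 30% is what makes the test R² a check on generalization rather than on fit alone; a fixed seed keeps the split reproducible.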

Results and Discussion
The prediction engines were applied to obtain a formula for the WAN parameter. The data mining process was performed and rules for the WAN of the almond varieties were obtained through Find Laws. The derived rules were tested for correctness, and the results were encouraging. The R² values of the train and test results of Find Laws and linear regression are provided in Tables 2-8.
As indicated by the R² values, the most significant rule was obtained for predicting the WAN of the 'Ferraduel' variety, with the greatest train and test R² value of 95% (Table 4). The next highest R² value, 88%, was obtained for the 'Lauranne' variety (Table 7). Two rules were obtained for 'Glorieta'; the second was more complex than the first and also more accurate (90% versus 85%, Table 6). The rules obtained for 'Bertina', 'Ferragnes' and 'Marta' had nearly the same accuracy, with R² values of 83%, 84% and 84%, respectively (Tables 2, 3 and 8). The lowest accuracy was observed for 'Ferrostar', with an R² value of 77%; this rule can nevertheless be considered applicable and acceptable.
At the end of the analysis, Find Laws was applied to all data to obtain a general formula for all seven almond varieties. The results showed that these rules were also acceptable and applicable (Table 9). Two rules were obtained for all almond varieties. In the first analysis, all parameters provided in Table 1 were used. In the second analysis, the almond variety was added as an attribute to see its effect on the results. With the inclusion of almond variety, the R² value increased from 89% to 95%.
Fig. 2. PolyAnalyst software
Demir et al. (2018a) proposed highly accurate rules to estimate the width and depth of the stalk cavity and the width and depth of the eye basin of different apple varieties by using physical attributes of the fruits. In another study, Demir et al. (2018b) employed data mining approaches to estimate the color properties of fruits and obtained quite accurate rules with the Find Laws algorithm for the estimation of color index, hue angle and chroma values.
When the R² values of the linear regression results were compared with the Find Laws results, it was observed that the R² values of Find Laws were all higher than those of linear regression.

Conclusions
Data mining is a broad discipline able to integrate various techniques from different disciplines such as statistics, machine learning, artificial intelligence, pattern recognition and database systems with massive datasets. The present findings revealed significant, accurate and practicable rules to estimate the nut weights of different almond varieties. The findings also indicated that the Find Laws algorithm was more accurate than linear regression in the estimations. It was concluded that data mining could be used as a reliable tool to estimate the nut weights of different almond varieties from the physical attributes of the fruits.

Table 1. Results of Find Dependencies

Table 6. Prediction results of WAN for cv. 'Glorieta'

Table 8. Prediction results of WAN for cv. 'Marta'