Research and IPM
Research Tools: California Pesticide Use Summaries
UC IPM developed this database from data from the California Department of Pesticide Regulation. The database includes summaries of pesticides used on California crops detailed by commodity, pesticide, county, and month. The database has no information about a pesticide or its label.
DPR Outlier Explanation
To improve data quality, the California Department of Pesticide Regulation (DPR) flags values for rate of use which are so large they are probably errors. Errors occur, for example, when those reporting pesticide use inadvertently shift decimal points during data entry. DPR used three different criteria to identify outliers, or values likely to be errors, by comparing each use rate with an estimate of the maximum allowable rate for that type of use. For data since 1998, UC IPM has accepted DPR flagged values as errors, but continues to apply its own quality control checks to all DPR data.
These flags are supplied in a separate table with the original data. Each row in this table corresponds either to one pesticide application for production agricultural reports or to a monthly summary for other uses. Reports of applications for any use other than production agriculture include only the total of all uses in a month for each pesticide, site treated, and applicator. The type of report is identified in the Pesticide Use Report (PUR) by the field "record_id". Production agricultural reports have record_id values of 1, 4, A, or B; monthly summary reports have record_id values of 2 or C. Each row is uniquely identified. The other three columns in the outlier table contain the flags for the three different criteria. A "Y" value in one of these columns indicates that the rate is an outlier by that criterion. An "N" value indicates it is not an outlier by the criterion. A blank or space indicates that the criterion could not be applied to that particular record. If no criterion applies to a row in the PUR, there is no corresponding row in the outlier table.
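The record_id values and flag values described above can be decoded with a few lines of code. The following is a minimal sketch in Python; the function names record_type and parse_flag and the returned labels are illustrative assumptions, not part of the PUR or of any DPR software.

```python
from typing import Optional

# record_id values as listed in the text above.
PRODUCTION_AG_IDS = {"1", "4", "A", "B"}
MONTHLY_SUMMARY_IDS = {"2", "C"}


def record_type(record_id: str) -> str:
    """Classify a PUR row by its record_id (labels are hypothetical)."""
    rid = str(record_id).strip()
    if rid in PRODUCTION_AG_IDS:
        return "production_ag"
    if rid in MONTHLY_SUMMARY_IDS:
        return "monthly_summary"
    return "unknown"


def parse_flag(value: Optional[str]) -> Optional[bool]:
    """Decode one criterion flag.

    "Y" -> True (outlier), "N" -> False (not an outlier),
    blank or space -> None (criterion could not be applied).
    """
    if value is None or value.strip() == "":
        return None
    cleaned = value.strip()
    if cleaned == "Y":
        return True
    if cleaned == "N":
        return False
    raise ValueError(f"unexpected flag value: {value!r}")
```

Returning None for a blank flag keeps "could not be applied" distinct from "not an outlier", which matters when the three criteria are later combined.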
Outlier Table. The first criterion column in the outlier table, ai_a_1000_200, flags records with rates higher than 200 pounds of active ingredient per acre (or higher than 1000 pounds per acre for fumigants). The second column, prd_u_50m, flags rates more than 50 times the median rate for all uses with the same pesticide product, crop treated, unit treated, and record type (that is, production agriculture or monthly report). The third column, nn4, flags rates higher than a value determined by a neural network procedure that approximates what a group of 12 scientists believed were obvious outliers. These criteria are explained in more detail below.
Although applications or rows are flagged, the only values tested are rates. Thus, there is no reason to believe that the other data in a row, such as time and location of the application, are incorrect. Also, note that rate is not one of the fields in the PUR table. Rates are calculated by dividing the pounds of pesticide used by the acres or unit treated. Thus, an extremely high rate value could occur from either extremely high pounds used or extremely low unit treated.
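Because the rate is derived rather than stored, it has to be computed before any criterion can be checked. A minimal sketch of that calculation follows; the function name use_rate and its parameter names are assumptions made for illustration, and returning None for a missing or non-positive unit treated is a design choice of this sketch, not a documented DPR rule.

```python
from typing import Optional


def use_rate(pounds_used: float, units_treated: Optional[float]) -> Optional[float]:
    """Return pounds per unit treated, or None when no rate can be computed."""
    if units_treated is None or units_treated <= 0:
        return None
    return pounds_used / units_treated


# A plausible report: 30 pounds applied to 12 acres.
normal = use_rate(30.0, 12.0)

# The same report with a decimal point shifted in either field
# yields a rate roughly ten times higher.
shifted_pounds = use_rate(300.0, 12.0)
shifted_acres = use_rate(30.0, 1.2)
```

Both kinds of data-entry error inflate the rate by the same factor, which is why a flagged rate alone does not reveal which field is wrong.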
Only extremely large rates, not extremely small ones, are flagged because only large values will have a major influence on statistics involving pounds of pesticide use. The value to use for the maximum rate in each criterion is somewhat arbitrary; the value determines how conservative one wants to be. DPR chose maximum rates to be close to what were considered obvious outliers by a scientific process described below in the description of the neural network criterion.
Determining When a Value Is an Outlier. Many methods can be used to determine if a value is an outlier. If DPR knew the maximum label rates for particular uses, then rates in the PUR could be compared to these maximum rates, but unfortunately, this information is unavailable in the PUR or in the Pesticide Label Database. The other methods to identify outliers involve looking at the statistical distribution of the actual use rates. If the values are normally distributed, then one can identify outliers using a number of statistical procedures. If the values have an unknown or nonstandard distribution, then there exist no standard statistical procedures for identifying outliers. Nevertheless, people can look at a distribution and usually say with different degrees of confidence whether some value is an outlier. This suggests that some kind of procedure can be developed to make similar judgments.
For most of the pesticide use data, distributions of rates are not even close to normal. They may have several different peaks (multi-modal). They may have either very broad distributions or very narrow distributions. None of the standard statistical measures of outliers is very useful for these data. The best single method is one based on neural networks. However, each criterion will catch different outlier values, so it is usually best to use all three criteria. It should be noted that these criteria are not perfect. They are conservative, meaning a value must be very extreme to be flagged, so some errors will be missed. On the other hand, occasionally an extreme value will be flagged that is actually correct. Because the criteria are conservative, these latter kinds of errors are minimized.
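One way to use all three criteria together is to treat a record as suspect when any of the three flag columns contains "Y". The sketch below assumes each outlier-table row is available as a dictionary keyed by the column names ai_a_1000_200, prd_u_50m, and nn4; the "any Y" rule and the function name flagged_by_any are assumptions of this example, not a rule stated by DPR.

```python
from typing import Mapping, Optional

# Flag columns named in the outlier table description.
CRITERION_COLUMNS = ("ai_a_1000_200", "prd_u_50m", "nn4")


def flagged_by_any(outlier_row: Optional[Mapping[str, Optional[str]]]) -> bool:
    """Return True if at least one criterion marks the rate as an outlier.

    A PUR row with no matching outlier-table row (None here) is treated
    as not flagged, as is a blank flag, since neither marks an outlier.
    """
    if outlier_row is None:
        return False
    return any(
        (outlier_row.get(column) or "").strip() == "Y"
        for column in CRITERION_COLUMNS
    )


example = {"ai_a_1000_200": " ", "prd_u_50m": "N", "nn4": "Y"}
suspect = flagged_by_any(example)  # True: the neural network criterion flagged it
```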
Criterion 1: Pounds per acre of active ingredient is larger than 200 (for non-fumigants), or 1000 (for fumigants).
Records were flagged in the PUR by Criterion 1 if the pounds per acre of a non-fumigant active ingredient were greater than 200 or if the pounds per acre of a fumigant active ingredient were greater than 1000 (column ai_a_1000_200 in the outlier table). These limit values were chosen based on what is known about typical rates of use for most pesticides.
Note that this criterion uses the pounds of active ingredient. Also, this criterion applies only to records where the unit treated is acres. The other criteria use pounds of pesticide product and apply to any unit treated, such as square feet or cubic feet.
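Criterion 1 can be sketched as a simple threshold test. In the following Python example the function name, the parameter names, and the use of the string "acres" to identify the unit treated are hypothetical; only the limits of 200 and 1000 pounds per acre and the acres-only restriction come from the description above.

```python
NON_FUMIGANT_LIMIT = 200.0   # pounds of active ingredient per acre
FUMIGANT_LIMIT = 1000.0      # pounds of active ingredient per acre


def criterion_1(pounds_ai: float, units_treated: float,
                unit: str, is_fumigant: bool) -> str:
    """Return "Y", "N", or " " (criterion not applicable) for one record."""
    # The criterion applies only when the unit treated is acres
    # and a rate can actually be computed.
    if unit != "acres" or units_treated <= 0:
        return " "
    rate = pounds_ai / units_treated
    limit = FUMIGANT_LIMIT if is_fumigant else NON_FUMIGANT_LIMIT
    return "Y" if rate > limit else "N"
```

For example, 5000 pounds of a non-fumigant active ingredient on 10 acres gives 500 pounds per acre and would be flagged, while the same rate for a fumigant would not.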
Criterion 2: Pounds per unit treated of a product is larger than 50 times the median.
Records were flagged by Criterion 2 if the pounds of pesticide product per unit treated were greater than 50 times the median value of all rates with similar types of use (column prd_u_50m in the outlier table). The median, like the mean (average), is a measure of the location of a set of values and is defined as the value in the set that has an equal number of values above and below it. It was used rather than the mean because it is not as likely to be affected by a few extreme outliers. The median was calculated from the set of all use rates of the same pesticide product and the same type of use as that of each record being examined. The same type of use means the uses of a product on the same crop or site, same unit treated, and same record type. A record type is either a production agriculture report (which includes a single application) or a monthly summary report.
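The grouping and median comparison in Criterion 2 can be illustrated as follows. This is a sketch under stated assumptions: records are represented as dictionaries, and the keys product, site, unit, record_type, pounds_product, and units_treated are hypothetical names chosen for the example, not PUR field names. It also assumes every record has a positive units_treated value.

```python
from collections import defaultdict
from statistics import median


def criterion_2_flags(records, multiplier=50.0):
    """Return a "Y"/"N" flag per record, in input order."""

    def group_key(rec):
        # "Same type of use": same product, crop or site, unit, record type.
        return (rec["product"], rec["site"], rec["unit"], rec["record_type"])

    def rate(rec):
        return rec["pounds_product"] / rec["units_treated"]

    # Collect all rates belonging to each type of use.
    rates_by_group = defaultdict(list)
    for rec in records:
        rates_by_group[group_key(rec)].append(rate(rec))

    medians = {key: median(rates) for key, rates in rates_by_group.items()}

    # Flag a record when its rate exceeds the multiplier times its group median.
    return [
        "Y" if rate(rec) > multiplier * medians[group_key(rec)] else "N"
        for rec in records
    ]


base = {"product": "P1", "site": "almond", "unit": "acres",
        "record_type": "production_ag", "units_treated": 1.0}
sample = [dict(base, pounds_product=p) for p in (1.0, 2.0, 2.0, 3.0, 500.0)]
flags = criterion_2_flags(sample)
```

In the sample, the median rate is 2 pounds per acre, so the limit is 100 and only the 500-pound record is flagged. The mean of the same five rates is about 102, which shows how a single extreme value distorts the mean far more than the median.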
Criterion 3: Pounds per unit of product is larger than a value generated using a neural network.
Records were flagged by Criterion 3 if the pounds of a pesticide product per unit treated were greater than a limit value that was calculated using a neural network procedure (column nn4 in the outlier table).
Neural Network. A neural network is a function that maps a set of input values to a set of output values. This function has a large number of parameters that must be determined so that the function will give the correct outputs for every possible set of inputs. The values for these parameters are found by a training procedure that involves presenting many sets of input and corresponding output values to the neural network program. The program then adjusts the parameters in the neural network function until it produces the correct output values for each input set. Once the neural network has been successfully trained, it can then be used to produce appropriate output values for any input data set provided to it.
The data used to train the neural network used in the PUR outlier program were generated from frequency distributions of the pounds of pesticide product per unit treated for a selected set of pesticides and sites. Groups of pesticides and sites were chosen that included a wide range of types of distributions, many with unusual distributions. Two hundred frequency distributions were plotted and then 12 DPR scientists independently examined these plots, marking rates on each plot they thought were outliers.
DPR summarized the results of this survey by finding an outlier maximum rate for each distribution. The maximum rate was set at a value where all 12 scientists thought higher rates were obvious outliers. These maximum rates were used as the output values for training the neural network. The input values were a set of statistical measures that described the frequency distributions. These sets of input and output values were used to train the neural network. After the neural network was successfully trained, it was used to find the outlier maximum rate for all sets of pesticide use types in the PUR.
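The construction of one training example can be sketched as follows. Two assumptions are made here and should be read as illustrative only. First, each scientist's markings on a plot are reduced to a single cutoff (the lowest rate that scientist marked as an outlier), so the rate above which all scientists agree is the largest of those cutoffs. Second, the source does not list the statistical measures used as inputs, so the median, mean, and maximum below are hypothetical stand-ins.

```python
from statistics import mean, median


def consensus_max_rate(individual_cutoffs):
    """Rate above which every scientist considered values to be outliers.

    Assumes one cutoff per scientist; all of them agree only above
    the largest individual cutoff.
    """
    if not individual_cutoffs:
        raise ValueError("at least one cutoff is required")
    return max(individual_cutoffs)


def training_pair(rates, individual_cutoffs):
    """Build one (inputs, output) pair for a single frequency distribution."""
    # Hypothetical summary statistics describing the distribution.
    features = (median(rates), mean(rates), max(rates))
    target = consensus_max_rate(individual_cutoffs)
    return features, target


features, target = training_pair(
    rates=[1.0, 2.0, 3.0, 4.0, 100.0],
    individual_cutoffs=[40.0, 55.0, 38.0],
)
```

A set of such pairs, one per plotted distribution, is the kind of input and output data a training procedure would consume; the network architecture and training method themselves are not described in enough detail here to reproduce.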