# Identifying outliers in CSV data

An **outlier** is a data-point which differs significantly
from the other data points in the set. This difference can be
measured using statistical methods. The difference could be due
to variability in the value itself or due to an observation
error. In the latter case, the outlier data points can be
removed from the data set to clean it up.

ArgonStudio implements one way of detecting outliers. This
method uses the inter-quartile range as follows: Assuming
Q_{1} and Q_{3} represent the upper and lower
quartiles, the inter-quartile range *IQR* is defined as:

IQR = Q3 - Q1

With this definition of the *IQR*, an **outlier** is
defined as any point outside the range:

(Q_{1}- 1.5 * IQR) and (Q_{3}+ 1.5 * IQR)

Let us now look at how to easily identify outliers using ArgonStudio.

## Load CSV data into the grid editor

The data we are looking at in this example is US census data which contains information about seniors (60+ adults).

We begin by loading the CSV data into the grid editor.

## Identifying outliers of a single column

- To identify outliers on a single column, click the drop-down
button in the column header, and choose
*Show outliers*.

- Note that the
*Show outliers*menu-item will be present and applicable for numeric-type columns only. - Some of the rows containing the outliers are now highlighted
as shown.

- Checking
the boxplot shows the various parameters - upper and
lower quartiles, the median, the left and right whiskers
(which are the upper and lower limits) as well as the right
and left outliers.

## Showing all outliers in a dataset

- Showing the outliers in a single column works when there are a few columns for which you want to check the outliers. But when you have a large number of numeric columns, it becomes tedious. For such situations, we have a different solution.
- Once the data file is loaded into the grid editor, click the
menu-bars at the right-side of the grid, and
choose
*Highlight outliers*as shown.

- Below we can see that all outliers in all numeric columns
are highlighted. The image shows states where population of
various races (asian, black, etc.) are higher or lower than
expected. However, it is useful to show the state too when
browsing data. The state is in the first column and the first
column should be frozen for a better view.

- To freeze the first column, move the horizontal scroll-bar at
the bottom so the first column is visible, and
choose
*Freeze columns*.

- Now we can browse the data much better with the first (and
second) columns frozen.

## Outliers in US college rankings

Let us now check the US colleges ranking file for outliers.

- Let us load the CSV file into the grid editor.

- Some of the columns in this file contain numbers mixed with
other characters such as
`%`(*Acceptance rate*and*Graduation rate*) and`$`(*Fees-undergraduate*). To be able to process these as**numeric**fields, we need to modify the values. - Click on the drop-down button on
*Acceptance rate*column and choose*Update Value*.

- In the
*Update Column Value*dialog, make sure the*Pre-configured*tab is selected and choose*Convert to number*from the drop-down. Click*Update*to perform the update.

- As we can now see, the
`%`has been removed from the*Acceptance rate*column. Repeat this procedure for the*Graduation rate*and*Fees-undergraduate*fields.

- Now to find the outliers. Click on the menu-bars icon at the
top-right of the grid, and choose
*Highlight outliers*. You get a notification and the single outlier is highlighted.

- This means that
**New Jersey Institute of Technology**has the least*Graduation rate*of**65%**and is an outlier. For some reason, only**65%**of students graduate from this college. Note that the median across the data set is**85%**and the minimum (excluding this college) is**73%**as can be seen from this boxplot.

## Finding major outliers in data

Till now, we have talked about outliers which lie outside
the *inter-quartile range* by a factor
of `1.5`. These outliers are considered **minor
outliers**. A **major outlier** is defined as a data-point
that lies outside the inter-quartile range by a factor
of `3`. In other words, a data point which lies outside
the range of:

(Q_{1}- 3 * IQR) and (Q_{3}+ 3 * IQR)

Let us now find out how to identify major outliers in your data with ArgonStudio.

For this example, we are looking at the United States per-capita income by county from Wikipedia. We have covered the procedure for loading the CSV data from a HTML table of a Wikipedia page here.

- After the data is loaded into the grid editor, we need to
convert the data from the
*Per-capita income*column to numeric format. As it is loaded, it contains the`$`sign and a`,`. To remove these, click the drop-down button at the end of the column (1) and select*Update Value*(2).

- In the
*Update Column Value*dialog, ensure the*Pre-configured*tab is selected (1), click the*Choose operation*drop-down and select*Convert to number*(2). Click*Update*to apply the update (3).

- The data is updated and the column is now a
*numeric*column. Click the column drop-down again, and choose*Show major outliers*from the menu.

- The software now highlights the
**21 major outliers**. The results are shown below.

County | State | Per-capita income |
---|---|---|

New York County | New York | 62498 |

Arlington | Virginia | 62018 |

Falls Church City | Virginia | 59088 |

Marin | California | 56791 |

Alexandria City | Virginia | 54608 |

Pitkin | Colorado | 51814 |

Los Alamos | New Mexico | 51044 |

Fairfax County | Virginia | 50532 |

Hunterdon | New Jersey | 50349 |

Borden | Texas | 50042 |

Montgomery | Maryland | 49038 |

Morris | New Jersey | 48814 |

Fairfield | Connecticut | 48721 |

San Francisco | California | 48486 |

Howard | Maryland | 48172 |

Westchester | New York | 47984 |

Somerset | New Jersey | 47803 |

Nantucket | Massachusetts | 47331 |

Loudoun | Virginia | 46565 |

North Slope | Alaska | 46457 |

San Mateo | California | 45732 |

## Summary

Outliers are data points which differ significantly from the rest of the points in the dataset. Identifying them is important to decide whether the data is anomalous or represents a variability in the value.

- ArgonStudio provides a simple way to identify such outliers.
- Outliers in a single numerical column can be identified which highlights the entire row.
- Alternatively, outliers in all numerical columns can be identified with a single click.