Identifying outliers in CSV data
An outlier is a data-point which differs significantly from the other data points in the set. This difference can be measured using statistical methods. The difference could be due to variability in the value itself or due to an observation error. In the latter case, the outlier data points can be removed from the data set to clean it up.
ArgonStudio implements one way of detecting outliers. This method uses the inter-quartile range as follows: Assuming Q1 and Q3 represent the upper and lower quartiles, the inter-quartile range IQR is defined as:
IQR = Q3 - Q1
With this definition of the IQR, an outlier is defined as any point outside the range:
(Q1 - 1.5 * IQR) and (Q3 + 1.5 * IQR)
Let us now look at how to easily identify outliers using ArgonStudio.
Load CSV data into the grid editor
Identifying outliers of a single column
- To identify outliers on a single column, click the drop-down
button in the column header, and choose Show outliers.
- Note that the Show outliers menu-item will be present and applicable for numeric-type columns only.
- Some of the rows containing the outliers are now highlighted
the boxplot shows the various parameters - upper and
lower quartiles, the median, the left and right whiskers
(which are the upper and lower limits) as well as the right
and left outliers.
Showing all outliers in a dataset
- Showing the outliers in a single column works when there are a few columns for which you want to check the outliers. But when you have a large number of numeric columns, it becomes tedious. For such situations, we have a different solution.
- Once the data file is loaded into the grid editor, click the
menu-bars at the right-side of the grid, and
choose Highlight outliers as shown.
- Below we can see that all outliers in all numeric columns
are highlighted. The image shows states where population of
various races (asian, black, etc.) are higher or lower than
expected. However, it is useful to show the state too when
browsing data. The state is in the first column and the first
column should be frozen for a better view.
- To freeze the first column, move the horizontal scroll-bar at
the bottom so the first column is visible, and
choose Freeze columns.
- Now we can browse the data much better with the first (and
second) columns frozen.
Outliers in US college rankings
Let us now check the US colleges ranking file for outliers.
- Let us load the CSV file into the grid editor.
- Some of the columns in this file contain numbers mixed with other characters such as % (Acceptance rate and Graduation rate) and $ (Fees-undergraduate). To be able to process these as numeric fields, we need to modify the values.
- Click on the drop-down button on Acceptance rate
column and choose Update Value.
- In the Update Column Value dialog, make sure
the Pre-configured tab is selected and
choose Convert to number from the
drop-down. Click Update to perform the update.
- As we can now see, the % has been removed from
the Acceptance rate column. Repeat this procedure
for the Graduation rate and Fees-undergraduate
- Now to find the outliers. Click on the menu-bars icon at the
top-right of the grid, and choose Highlight
outliers. You get a notification and the single outlier is
- This means that New Jersey Institute of Technology
has the least Graduation rate of 65% and is an
outlier. For some reason, only 65% of students graduate
from this college. Note that the median across the data set
is 85% and the minimum (excluding this college)
is 73% as can be seen from this boxplot.
Finding major outliers in data
Till now, we have talked about outliers which lie outside
the inter-quartile range by a factor
of 1.5. These outliers are considered minor
outliers. A major outlier is defined as a data-point
that lies outside the inter-quartile range by a factor
of 3. In other words, a data point which lies outside
the range of:
(Q1 - 3 * IQR) and (Q3 + 3 * IQR)
Let us now find out how to identify major outliers in your data with ArgonStudio.
- After the data is loaded into the grid editor, we need to
convert the data from the Per-capita income column to
numeric format. As it is loaded, it contains the $
sign and a ,. To remove these, click the drop-down
button at the end of the column
(1) and select Update
- In the Update Column Value dialog, ensure
the Pre-configured tab is selected
(1), click the Choose
operation drop-down and select Convert to number
(2). Click Update to
apply the update (3).
- The data is updated and the column is now a numeric
column. Click the column drop-down again, and choose Show
major outliers from the menu.
- The software now highlights the 21 major outliers. The results are shown below.
|New York County||New York||62498|
|Falls Church City||Virginia||59088|
|Los Alamos||New Mexico||51044|
Outliers are data points which differ significantly from the rest of the points in the dataset. Identifying them is important to decide whether the data is anomalous or represents a variability in the value.
- ArgonStudio provides a simple way to identify such outliers.
- Outliers in a single numerical column can be identified which highlights the entire row.
- Alternatively, outliers in all numerical columns can be identified with a single click.