Grouping CSV data for summarization
Grouping (or aggregation) is a useful operation when attempting to understand trends in data. Some examples:
- Assuming you have a file of purchase transactions, you may want to sum sales by date. This is an example of applying a metric of sum to a set of transactions that have been aggregated by date.
- Another example would be: finding the average life expectancy of males and females from death records. In this case, we would like to compute a metric of average for a set of records aggregated by sex.
In this post, we show you how to load a CSV file into the grid editor, and compute aggregations on the data.
Loading a CSV file into the grid editor
Let us begin by loading a CSV data concerning the currently ongoing COVID-19 crisis. This data file contains the number of cases by US state and county since Jan 21, 2020, as well as the number of deaths. The data has been compied by New York Times and is available here.
To learn how to load the file into the grid editor, see this article.
Exploring the Analytics interface
- Once the file is loaded into the grid editor, click on the Analytics tab.
- Ensure that the Grouped pill is active (1). This panel allows you to aggregate data and compute metrics.
- The way aggregation works is as follows: group the data into buckets, and compute a metric function for the rows in each bucket.
- The bucketing function is currently set to terms (2 and 3) which requests grouping a row by the value of a column.
- Choose the column you want to group-by from the dropdown (4). Let us choose state here.
- The metric function refers to the method used to compute the metric (5). We choose sum here to compute the total of the cases for each state.
- The metric field is the field for which the metric function is to be applied (6). We choose cases for the metric field.
Computing totals for grouped data
We would like to determine the number of cases of COVID-19 for each state. This means we need to sum (2) the field cases (3) for each state (1). We also choose to sort by state (sort set to key ascending)(4).
After making all the selections (sort is optional), the results appear in the table as shown.
Grouping by multiple columns
Let us now group by the state as well as county to determine county-level totals.
- To add multiple fields for grouping, click the add button (1) at the end of the row as shown.
- A new bucketing row is added. Make the selections as shown: choose state for the top row (1) and county for the bottom row (2). The results are updated as shown above.
Grouping and aggregation is a crucial step in data analysis. In this article, we introduced how to use ArgonStudio to perform data analysis on CSV data. We learnt the following:
- Loading CSV data into the grid editor.
- Computing an aggregation using a metric function on a column value with grouped data.
- Using multiple columns in grouping.