New York Times Data Training - Analysis of NYC Housing Data

The New York Times have released some training materials that they use internally to train their journalists and editors in data analysis. These training materials are divided into Beginner, Intermediate and Advanced sections. Each dataset includes a set of questions which can be answered by analyzing the provided data.

In this post, we will use ArgonStudio to try to answer questions in one of the intermediate training sets: NYC Housing. The google spreadsheet containing the data is located here.

Loading NYC housing CSV data into the grid editor

Navigate to the google spreadsheet and download it as CSV. Open the downloaded CSV in the grid editor as explained here.

Exploring the NYC housing data

Before beginning the analysis of the CSV file, it is good to get an idea of the data we are dealing with. For this purpose, we can explore the type and range of values contained in each column. To do that, navigate to the Analytics tab, choose the Metrics pill, and pick a column from the Choose column dropdown.

The column type is indicated next to the dropdown. For a text column, we see the unique values that the column contains along with the frequency of occurence. The details for the Neighborhood column is shown above.

By picking a different column from the Choose column dropdown, you can explore different columns.

Viewing a histogram for numeric columns

When you pick a numeric column from the Choose column dropdown, a histogram of the values is the column is shown. This allows you to determine the frequency of values in that column. Once the histogram is generated, you can mouse over it to view the frequency of any value.

In the image above, we have chosen year_built. From the histogram, it is clear that most of the buildings in the CSV were built between 1900 and 1930, and also between 2000 and 2010.

What are the total number of units in Upper East Side?

Let us now look into the questions posed by the NYC Housing: Worksheet and Data Guide.

The first question is: What are the total number of units in Upper East Side? Here is how we can answer this question.

  • In the Analytics tab, click on the Grouped pill (1).
  • Set the bucket function to terms (2) and choose the Neighborhood column (3) from the dropdown.
  • Set the metric function to sum (4) and the metric field to res_units (5) (which is the number of units in each building).
  • The aggregation is performed and the table of neighborhoods and the number of units in each appears below. From this table, you can see that the number of units in Upper East Side is 16419 (6).

When were most of these buildings built?

  • Ensure bucket function is set to terms (1) and bucket field is set to year_built (2).
  • Set metric function to count (3) and metric field to year_built (4).
  • Choose value - descending under sort (5).
  • The aggregation results appear below and you can see that most buildings were built between 1900 and 1920 (6).

Plotting the results on a chart

  • To plot the results of the above aggregation on a chart, click on the chart button at the bottom of the table as shown.
  • The data will be copied to the chart editor under the Chart tab.

Drawing the chart

  • Click the Chart tab (1) and verify that the aggregated data appears in the text area (2).
  • To draw a column (or bar) chart, make sure the Column tab is selected (3).
  • Under the X Columns dropdown, choose terms(year_built) (4). Under Y Column, choose count(year_built) (6 ). Leave the Orientation at Horizontal (7) and click the Draw button (8).
  • After configuring the chart, it looks like this:

Configuring the chart

  • Let us now learn how to configure the chart. Click the menu-bars icon to open the configurator.
  • The configurator is shown above.
  • Zooming the x-axis: Click xAxis, then axisSlider and turn on show. A zoom tool appears at the bottom of the chart. Adjust it so that the bars are visible properly and about 10 bars are displayed. Close the xAxis accordion.
  • Adjusting series labels: Click Series followed by label and delete the first part of formatStr till the hyphen (-). The final value should be:
    <%=sprintf("%.0f",arg.value)%>
  • Position series labels: Again under label, change position to top, rotate to 0 and align to center.
  • Changing the label color: Open the Style accordion and change color to #000000 (or any other desired color). Close the Series accordion.
  • Adding X and Y axes labels: Click xAxis, followed by core and enter the axis label in the name field. Change the nameGap to 30 and nameLocation to middle. Ensure nameRotate is 0 for X-axis and 90 for Y-axis. Click Apply Changes.
  • Adding title: To add a title, open the title accordion and enter the title in the Text field. Add a Subtext also if desired. Change padding to 20.
  • The final chart is shown above.

How long after construction did the subsidies generally kick in?

For computing this value, we need the number of years between the year_built and the start_date (which is the year the subsidies kicked in). This value must be calculated for each row. Rows which have no value for either year_built or the start_date must be skipped.

Let us go ahead and add a column for computing and storing this value. We will call this column SubsidiesKickedInAfter which will be the difference of start_date and year_built.

  • Change to the Grid tab (1), click the menu-bar icon (2), and choose New Column (3).
  • In the new column dialog, enter SubsidiesKickedInAfter in the Column Name (1), click to select the Javascript tab (2), and enter the following code in the text-box (3). This code return -1 if either start_date or year_built is blank and the difference of the two if both are present.
    function(row) {
        if ( row.start_date && row.year_built )
            return row.start_date - row.year_built;
        else return -1;
    }
  • Click Add Column after everything is filled out (4).
  • The new column is now added with the values computed appropriately as shown.

Filtering data

We now need to apply a filter to the data to exclude rows which have SubsidiesKickedInAfter less than 0. Remember, in the steps above, when either start_date or year_built was absent, we inserted -1 into SubsidiesKickedInAfter. So we need to exclude those rows.

  • Click on the filter button to open the filter panel (1).
  • Ensure that the Javascript tab is selected (2).
  • Enter the following expression in the text box (3).
    row => row.SubsidiesKickedInAfter >= 0
  • Click the Apply Filter button (4).
  • The rows are now filtered (6) and the number of rows matching is indicated (5).

So, when did the subsidies generally kick in?

After the rows have been filtered, it is time to compute some metrics. Here we need to count the number of instances of each value of SubsidiesKickedInAfter to find out the most frequently occuring value in that column.

  • Click on the Analytics tab.
  • Set the bucket function to terms and the bucket field to SubsidiesKickedInAfter (1).
  • Change the metric function to count (2 ).
  • Choose SubsidiesKickedInAfter for the metric field (3).
  • Finally, set the sort option to value - descending (4 ) to you can see highest frequencies of SubsidiesKickedInAfter.
  • The result table is generated as shown (5).

From the table, we can see that most subsidies kicked in between 1 and 4 years: 204 subsidies kicked in after 3 years, followed by 161 subsidies after 2 years, 116 subsidies after 4 years and 109 subsidies after 1 year.

This same information can also be obtained (at a coarser level) by checking the histogram of the field SubsidiesKickedInAfter. As can be seen below, the highest number of subsidies that kicked in between 0 and 10 years was 658 subsidies.

Which neighborhood has the most units?

  • Finding out which neighborhood has the most units is pretty simple. We need the total of res_units with the data grouped by Neighborhood.
  • Choose Neighborhood for the bucket field and terms for the bucket function.
  • Set the metric function to sum.
  • And the metric field to res_units.
  • Finally sort the output by value - descending.

We find that the neighborhood with the most units is Chelsea/Clinton/Midtown with 39077 units.

Compare neighborhood characteristics

Next we would like to compare neighborhoods with the most units in terms of population, median income and racial diversity. The population, median income and racial diversity comes from another CSV file that is part of this collection, called NYC Housing_ Data - Income. So we need to join this file with the neighborhood units data.

  • Open the aggregated metrics in the above step in the grid editor by clicking the button as shown below.
  • Click the menu-bars icon at the top right (1) and choose Join CSV from the menu (2).
  • The Join CSV dialog appears. Click the folder open button (1) to open the CSV to be joined. After the CSV is opened, the columns in the new CSV appear in the dropdown (2). Choose the column containing the name of the neighborhood (Sub-Borough Area in this case). Set the Source Column dropdown to terms(Neighborhood) (3) Ensure the Join Type radio is set to Inner Join (4) and click the Join Columns button.
  • The data is joined and the grid is updated as shown. From the table, we notice that the neighborhood with the most units has a population of 140247, a median income of $103,538 and a racial diversity of 57.42%.

Which neighborhood has the most units that will expire this year?

To find out which neighborhood has the most units that will expire this year, we need to apply a filter (shown below) on end_date to restrict it to the current year, 2020. Adding that filter and re-running the aggregation gives the answer as shown below: Greenwich Village/Financial District has the most units (1450) with subsidies that expire in 2020.

row => row.end_date == 2020

Which subsidy program affects the majority of units that are expiring this year?

To find this measure, we need to group records by program_name (1) and sort by value - descending. This will group records by program_name, compute the sum of res_units and sort by the total number of units.

From the results table (2), we find that the program whose expiry affects the most units is 421-a Tax Incentive Program which affects 3,706 units in total.

Which neighborhoods will be most affected (in terms of number of units) in each of the next 10 years by expiring subsidies and which one is the most secure?

The most affected neighborhood in the next 10 years by expiring subsidies, when arranged by year of expiry, is Upper East Side with 8192 units affected.

  • For arriving at these results, we must first filter the rows to include rows with end_date between 2020 and 2030. Here is the filter condition:
    row => row.end_date >= 2020 && row.end_date <= 2030
  • Next, we group the filtered data first by Neighborhood () and then by end_date (). Finally we sort by value - descending to obtain the results table (3).
  • The results are shown below.

To obtain an answer to the last part of the question, we just reverse the sort order i.e. sort by value - ascending. This gives the neighborhood which has the least number of units affected by a program expiring within the next 10 years. See below.

Which owner has the most buildings?

NYC HOUSING AUTHORITY owns the most buildings at 507 buildings in the list. The value null (second item in the list) indicates that 490 buildings have no owner specified.

We obtain this information by grouping by owner_name (1) and applying a metric function of sum (2) for buildings (which represents the count of buildings owned by that owner) (3 ). Finally we sort by value - descending to obtain the results shown (4).

Who are the biggest owners in each neighborhood based on the number of units? Limit your table to owners who have more than 1,000 units.

For answering this question, we need to group data by two columns: Neighborhood (1) and owner_name (). Next we compute the sum of the field res_units (3) which gives us the total number of units for every owner in each neighborhood. We sort the results by value - descending (4) and apply a filter to limit the total number of units to greater than 1000 (5). The final results are shown in 6.

Which owner has the most tax delinquent projects?

NYC PARTNERSHIP HOUSI has the most number of tax-delinquent projects at 7 projects, followed by GOUVENEUR GARDENS HSG, LESPMHA HOUSING DEVEL and LENOX AND PENNAMON HO at 4 each.

These figures are obtained by choosing owner_name for the bucket field (1), sum for the metric function 2, Tax delinquency 2016 for the metric field (3) and sorting by value - descending (4).

Which one has the most serious housing code violations?

27 buildings, which had serious hazard violations in 2017 have no owner specified. WHGA GARVEY HOUSING D was the next highest with 12 buildings and GENESIS Y15 OWNERS LL with 10 buildings with serious hazard violations.

Compare the average value per unit for different neighborhoods. Which is the most expensive and which is the cheapest?

  • First we need to compute the value per unit for each property. We do this by dividing the field assessed_value by the number of units in each property (res_units). We add a new column for the purpose of storing this value, and call it Value per unit. The value of the column is computed using the following expression:
    function(row) {
        if ( row.res_units > 0 )
            return parseFloat(row.assessed_value.replace(/[\$,]/g,''))/row.res_units;
        else return -1;
    }

  • Next we compute the aggregation grouping by Neighborhood and using the mean function on the newly added Value per unit field.
  • From the table, we can see that Greenwich Village/Financial District has the highest value per unit of 211593.

And that completes the analysis of the New York Times Data Training - NYC Housing Data file.

Summary

The NYC Housing Data file is a data file used by The New York Times in training their journalists and editors in data analysis. We used these materials to show how it can be done using ArgonStudio. We covered the following topics:

  • Exploring the data file using Analytics / Metrics tool to determine unique values of a text field, as well as the histogram distribution of a numeric field.
  • Using the Analytics / Grouped aggregation tool, we learned how to group the data into buckets, and apply a function to compute metrics such as sum, count and mean.
  • Plotting the aggregate data in a chart and configuring the chart as a nice presentation.
  • Adding computed columns whose value depends on other columns. A javascript function is used to generate the value of such computed columns.
  • Filtering data to include or exclude data of interest.
  • Joining CSV data using a common column.
  • Multiple grouping of data and applying metrics to each group.
  • Filtering the results of grouped aggregates to narrow down to sections of interest.

We hope you have enjoyed the material presented in this article. You can now continue to try some of this functionality on your data.