Chapter 3 Data transformation

The Legally Operating Businesses dataset is our primary dataset: it contains most of the information we need and is used for the majority of the plots. The License Applications dataset is used to plot the acceptance ratio across boroughs. The challenge was to create a column in License Applications comparable to the Address Borough column of the Legally Operating Businesses dataset. The closest existing column is City, but it does not list Manhattan and Queens as boroughs, and it contains many other entries that do not appear in the Address Borough column of our primary dataset.

3.1 Legally Operating Businesses

Plotting the count of licenses by Address Borough shows that several boroughs appear twice, the second time in all capital letters. These all-caps duplicates account for a negligible number of rows, so we remove the entries whose Address Borough is “BRONX”, “QUEENS”, “BROOKLYN” or “MANHATTAN”. We also remove the entries marked “Outside NYC”, as we are only considering the boroughs of New York City.
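The filtering step above can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis: the sample rows and the "License Number" field are hypothetical, and only the "Address Borough" column name and the removed values come from the text.

```python
# Hypothetical sample rows; "License Number" is an illustrative field,
# "Address Borough" is the column named in the text.
rows = [
    {"License Number": "A1", "Address Borough": "Brooklyn"},
    {"License Number": "A2", "Address Borough": "BROOKLYN"},
    {"License Number": "A3", "Address Borough": "Queens"},
    {"License Number": "A4", "Address Borough": "Outside NYC"},
    {"License Number": "A5", "Address Borough": "Manhattan"},
]

# Values removed in the text: the all-caps duplicates and "Outside NYC".
DROPPED_BOROUGHS = {"BRONX", "QUEENS", "BROOKLYN", "MANHATTAN", "Outside NYC"}


def drop_unwanted_boroughs(records):
    """Keep only the rows whose borough value is not in the dropped set."""
    return [r for r in records if r["Address Borough"] not in DROPPED_BOROUGHS]


kept = drop_unwanted_boroughs(rows)
```

Dropping the all-caps rows, rather than merging them into their mixed-case counterparts, follows the choice described in the text; a merge would be a reasonable alternative if those rows were not negligible.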

To plot the length of the licenses granted to various industries, we calculated the number of days between the creation date and the expiration date of each license. Many of these values turned out to be negative. We removed all such rows as well, since a license cannot expire before it is created.
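As a rough sketch of this calculation, the snippet below computes the duration in days and discards rows where it is negative. The field names "created", "expires" and "duration_days" and the sample dates are hypothetical; the actual column names in the dataset may differ.

```python
from datetime import date

# Hypothetical records; field names are illustrative, not the dataset's own.
records = [
    {"id": 1, "created": date(2019, 1, 15), "expires": date(2021, 1, 15)},
    {"id": 2, "created": date(2020, 6, 1), "expires": date(2019, 6, 1)},  # expires before creation
    {"id": 3, "created": date(2020, 3, 1), "expires": date(2020, 3, 31)},
]


def with_valid_duration(rows):
    """Add the license length in days and drop rows with a negative length."""
    result = []
    for row in rows:
        days = (row["expires"] - row["created"]).days
        if days >= 0:  # a negative length means expiry precedes creation
            result.append({**row, "duration_days": days})
    return result


valid = with_valid_duration(records)
```

Rows with a duration of zero are kept here, since the text only mentions removing negative values.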

During the missing data analysis described in Chapter 4, we found that the columns Secondary Address Street Name, Detail and Business Name 2 each have more than 75% of their values missing. Since we do not use these columns anywhere, we remove them as well.
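One way to express the 75% rule is sketched below. It assumes that a missing value is stored as None or as an empty string, which may not match how the dataset encodes missing entries; the sample rows are hypothetical.

```python
# Hypothetical rows; "Business Name 2" is missing in 4 of 5 rows (80%).
rows = [
    {"Business Name": "Alpha Deli", "Business Name 2": None},
    {"Business Name": "Beta Laundry", "Business Name 2": ""},
    {"Business Name": "Gamma Garage", "Business Name 2": None},
    {"Business Name": "Delta Tow", "Business Name 2": "Delta Towing Co"},
    {"Business Name": "Epsilon Cafe", "Business Name 2": None},
]


def missing_fraction(records, column):
    """Share of rows where the column is None or an empty string (assumed encoding)."""
    missing = sum(1 for r in records if r.get(column) in (None, ""))
    return missing / len(records)


def drop_sparse_columns(records, threshold=0.75):
    """Remove every column whose missing share exceeds the threshold."""
    columns = list(records[0].keys())
    sparse = {c for c in columns if missing_fraction(records, c) > threshold}
    cleaned = [{k: v for k, v in r.items() if k not in sparse} for r in records]
    return cleaned, sparse


cleaned, dropped = drop_sparse_columns(rows)
```

In the report the columns were dropped both because they are mostly empty and because they are not used in any plot; the threshold alone is only part of that decision.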

3.2 License Applications

We created a new dataset with columns for the year of the “End Date” column, the “Status” column and the “License.Category” column from the License Applications dataset. We also created a new column, “nyc.data.borough”, which indicates the borough each license application belongs to. To build it, we looked up the Zip Codes associated with the five boroughs (Brooklyn, Staten Island, Bronx, Queens and Manhattan) in the Legally Operating Businesses dataset and then mapped the Zip Codes in the License Applications dataset to these boroughs. This lets us compare acceptance rates across the boroughs of NYC over the years.
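The Zip Code mapping can be sketched as a lookup table built from one dataset and applied to the other. The column names "Address ZIP" and "Zip", the "Status" values and the sample rows are hypothetical. Resolving a Zip Code that appears under more than one borough by taking the most frequent borough is an assumption made for this sketch, not something stated in the text.

```python
from collections import Counter, defaultdict

# Hypothetical rows from the Legally Operating Businesses dataset.
businesses = [
    {"Address ZIP": "11201", "Address Borough": "Brooklyn"},
    {"Address ZIP": "11201", "Address Borough": "Brooklyn"},
    {"Address ZIP": "11201", "Address Borough": "Manhattan"},  # conflicting entry
    {"Address ZIP": "10001", "Address Borough": "Manhattan"},
    {"Address ZIP": "10451", "Address Borough": "Bronx"},
]

# Hypothetical rows from the License Applications dataset.
applications = [
    {"Zip": "11201", "Status": "Issued"},
    {"Zip": "10001", "Status": "Denied"},
    {"Zip": "99999", "Status": "Issued"},  # Zip Code not seen in the businesses data
]


def build_zip_lookup(business_rows):
    """Map each Zip Code to the borough it is most often listed under."""
    counts = defaultdict(Counter)
    for row in business_rows:
        counts[row["Address ZIP"]][row["Address Borough"]] += 1
    return {zip_code: c.most_common(1)[0][0] for zip_code, c in counts.items()}


def add_borough(application_rows, lookup):
    """Add "nyc.data.borough"; Zip Codes without a match are set to None."""
    return [{**a, "nyc.data.borough": lookup.get(a["Zip"])} for a in application_rows]


zip_lookup = build_zip_lookup(businesses)
labelled = add_borough(applications, zip_lookup)
```

Applications whose Zip Code has no match end up with no borough and would have to be excluded from, or reported separately in, any per-borough acceptance rate.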