Permutation importance is a simple yet powerful tool in the hands of a machine learning enthusiast. It has many applications: you can use it to validate your model and dataset, to find important but non-obvious dependencies between features and a label, and to drop redundant features from the dataset.
In this research, I want to show how permutation importance helped me analyze the Manhattan real estate market. The full research is available via Google Colab. The dataset contains a bit more than 20k entities describing real estate properties in Lower and Central Manhattan. When I looked at the data, a question came to mind: exactly which features of this dataset influence the market price of a property?
If you want to know which single feature influences the market price the most, I have to disappoint you: you will not find anything new here, because (spoiler alert) it is the size of the property. However, I drew a few fascinating conclusions from this research, and (spoiler alert) features other than size also influence the market price of a property.
As previously stated, the dataset contains ~20k entities (exactly 22,314). Each entity contains descriptive information about a real estate property. First, I want to show you a map of feature name/description pairs. From here on, I use only the feature names, so you can come back here to look up what each one means.
The dataset was split into three subsets by the prop_type feature:
- Residential (condominium, apartment, etc.): 17,859 entities;
- Commercial (office, common area, store, etc.): 3,970 entities;
- Other (school, religious, vacant land, etc.): only 482 entities.
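A split like this can be sketched in a few lines of plain Python. This is only an illustration, not the code from the Colab notebook: the raw prop_type values, the `market_value` field, and the `SUBSET_BY_PROP_TYPE` mapping are all assumptions made up for the example, since the real encoding of prop_type is not shown here.

```python
from collections import defaultdict

# Hypothetical mapping from raw prop_type values to the three subsets.
# The real dataset may encode prop_type differently.
SUBSET_BY_PROP_TYPE = {
    "condominium": "Residential",
    "apartment": "Residential",
    "office": "Commercial",
    "common area": "Commercial",
    "store": "Commercial",
}


def split_by_prop_type(records):
    """Group records into Residential / Commercial / Other by prop_type."""
    subsets = defaultdict(list)
    for record in records:
        # Anything not explicitly mapped (school, vacant land, ...) goes to Other.
        subset = SUBSET_BY_PROP_TYPE.get(record["prop_type"], "Other")
        subsets[subset].append(record)
    return dict(subsets)


# Toy records with invented values, only to show the shape of the result.
records = [
    {"prop_type": "condominium", "market_value": 1_200_000},
    {"prop_type": "office", "market_value": 5_400_000},
    {"prop_type": "apartment", "market_value": 850_000},
    {"prop_type": "school", "market_value": 3_000_000},
]

subsets = split_by_prop_type(records)
counts = {name: len(items) for name, items in subsets.items()}
print(counts)
```

Sending every unmapped type to Other keeps the split exhaustive, so the three subset sizes always add up to the number of input records.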
My main interest is in Residential and Commercial properties. Image 1 shows the distribution of the properties, colored by market value range.
The first thing I want to do is calculate the correlation between the features of the dataset. Correlation measures how strongly two features vary together. If two vectors are positively correlated, then when a value in the first vector increases, the corresponding value in the second vector also tends to increase. If two vectors are negatively correlated, then when a value in the first vector increases, the corresponding value in the second vector tends to decrease. Image 2 shows the correlation heatmap of the dataset.
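To make the positive and negative cases concrete, here is a minimal sketch of the Pearson correlation coefficient written by hand. I am assuming Pearson correlation here because it is the most common choice for a heatmap like this. The three toy vectors (`size`, `price`, `age`) are invented for illustration and are not taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy vectors, not real dataset values.
size = [40, 55, 70, 90, 120]
price = [400, 560, 690, 910, 1190]  # tends to rise as size rises
age = [95, 80, 60, 35, 10]          # tends to fall as size rises

r_size_price = pearson(size, price)  # close to +1: positive correlation
r_size_age = pearson(size, age)      # close to -1: negative correlation
print(r_size_price, r_size_age)
```

In practice, pandas computes the same pairwise coefficients for every pair of numeric columns with `DataFrame.corr()`, which uses Pearson correlation by default. A correlation heatmap is a colored rendering of that matrix.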