Monday, March 20, 2017

Kaggle Project: Two Sigma Connect Rental Properties

I'm working on a Kaggle project (https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries) about predicting interest levels for rental listings in NYC. The data includes images, presumably to encourage the use of computer vision, and text descriptions, so there is a lot to work with. Doing really well on a project like this would take huge amounts of resources and time, but I will document my stab at it here. I don't intend to get too obsessed with this project, as I would if I saw it as a litmus test of my analytical abilities; all I'm hoping for is to learn a couple of things while working on it.

Geographical Analysis of High, Medium, and Low Interest Listings


To get started, I did some general exploratory data analysis. Here are the listings separated by interest level (there are only three: high, medium, and low) and overlaid onto a map of NYC. I took out a couple of geographical outliers, but the data is otherwise unaltered.
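For anyone who wants to reproduce this step, here is a minimal sketch in plain Python. The field names (latitude, longitude, interest_level), the bounding box around the city, and the sample rows are all assumptions made for illustration; they are not the exact cutoffs or data I used.

```python
from collections import defaultdict

# Rough bounding box around New York City (an assumption, not exact cutoffs).
NYC_LAT = (40.4, 41.0)
NYC_LON = (-74.3, -73.6)

def split_by_interest(listings, lat_range=NYC_LAT, lon_range=NYC_LON):
    """Drop points outside the box, then group (lat, lon) pairs by interest level."""
    groups = defaultdict(list)
    for row in listings:
        lat, lon = row["latitude"], row["longitude"]
        inside = (lat_range[0] <= lat <= lat_range[1]
                  and lon_range[0] <= lon <= lon_range[1])
        if inside:
            groups[row["interest_level"]].append((lat, lon))
    return dict(groups)

# Made-up sample rows, only to show the call.
sample = [
    {"latitude": 40.7145, "longitude": -73.9425, "interest_level": "medium"},
    {"latitude": 40.7947, "longitude": -73.9667, "interest_level": "low"},
    {"latitude": 40.7388, "longitude": -74.0018, "interest_level": "high"},
    {"latitude": 0.0, "longitude": 0.0, "interest_level": "low"},  # geographic outlier
]
by_level = split_by_interest(sample)
```

Each list in by_level can then be handed to whatever plotting tool you like, one scatter layer per interest level.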




Already, I can see how complicated this project could get. There is no indication that high interest listings cluster around a particular area, and the raw map coordinates of the listings probably are not going to help much in the classification. The high interest listings may seem sparser only because they have the fewest data points. For this location data to be useful in classifying, some inventive (and very detailed) feature engineering will be required, and that will take a lot of manual work and analysis.
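One way to check whether the apparent sparseness is just a matter of fewer data points is to look at the share of high interest listings per map cell instead of raw dots. The sketch below is only an idea, not something I have run on the real data; the 0.01 degree cell size and the field names are assumptions.

```python
import math
from collections import Counter

def high_interest_share_by_cell(listings, cell_size=0.01):
    """Share of 'high' listings in each lat/lon grid cell of `cell_size` degrees."""
    totals, highs = Counter(), Counter()
    for row in listings:
        # Integer grid indices identify the cell a listing falls into.
        cell = (math.floor(row["latitude"] / cell_size),
                math.floor(row["longitude"] / cell_size))
        totals[cell] += 1
        if row["interest_level"] == "high":
            highs[cell] += 1
    return {cell: highs[cell] / totals[cell] for cell in totals}

# Made-up rows: the first two share a cell, the third sits in another one.
sample_cells = [
    {"latitude": 40.7151, "longitude": -73.9951, "interest_level": "high"},
    {"latitude": 40.7159, "longitude": -73.9955, "interest_level": "low"},
    {"latitude": 40.7551, "longitude": -73.9851, "interest_level": "low"},
]
shares = high_interest_share_by_cell(sample_cells)
```

If the shares come out roughly flat across cells, that would support the hunch that location alone does not separate the classes.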


Interest Level to Price Relationship

Price was also included in the dataset. Intuitively, I expect price to matter a lot when it comes to interest level. From my personal experience looking for apartments, a large portion of the people who rent in the city are young professionals who don't have a family. Another guess is that these people all make around the same amount of money, so interest should be most intense for the nicer apartments in the 2k to 3k range. Obviously, there are going to be listings priced much higher and lower than that, and they will not attract the same amount of interest: the cheaper ones are probably in dangerous neighborhoods, and the expensive ones are out of reach for the majority of renters. That means the relationship of interest to price should be non-linear. It could turn out to be linear after accounting for several other factors, such as location.
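A quick way to test this guess is to cut price into bands and compare the mix of interest levels in each band. This is a rough sketch: the band edges of 2000 and 3000 come straight from the guess above, while the field names and sample rows are made up for illustration.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Band edges taken from the 2k to 3k guess; the labels are hypothetical.
EDGES = [2000, 3000]
LABELS = ["under 2k", "2k to 3k", "3k and over"]

def interest_share_by_band(listings, edges=EDGES, labels=LABELS):
    """For each price band, return the fraction of listings at each interest level."""
    counts = defaultdict(Counter)
    for row in listings:
        # bisect_right puts a price equal to an edge into the higher band.
        band = labels[bisect_right(edges, row["price"])]
        counts[band][row["interest_level"]] += 1
    return {
        band: {level: n / sum(c.values()) for level, n in c.items()}
        for band, c in counts.items()
    }

# Made-up rows, only to show the call.
sample_prices = [
    {"price": 1500, "interest_level": "low"},
    {"price": 2500, "interest_level": "high"},
    {"price": 2800, "interest_level": "medium"},
    {"price": 5000, "interest_level": "low"},
]
band_shares = interest_share_by_band(sample_prices)
```

If the high interest share peaks in the middle band and drops off on both sides, that is the non-linear shape I am expecting.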