CMSC320 Final Project

Introduction

We will look at data from Airbnb. Airbnb is a platform where individuals can rent out their homes to prospective tenants. Airbnb receives a commission for each transaction. Airbnb is an extremely popular service: on average, around two million people stay on lodgings arranged by Airbnb source. Furthermore, Airbnb is valued at $31 billion dollars source.

I will specifically look at Airbnb data for New York City data. New York City (NYC) is an extremely popular tourist destination, and tourists may book an Airbnb room or apartment during their stay. As a person interested in visiting NYC, understanding Airbnb rates will help me find a cheap, yet convenient place to stay. This data is also relevant for Airbnb hosts, who can see the Airbnb listings in their area and adjust the price accordingly.

This data also serves a more meaningful purpose. As a provider of housing, Airbnb affects the housing situation in the city. This data was meant to give insight into how Airbnb listings affects the housing market of the city source.

Exploring the Dataset

The Airbnb website provides an interactive geographic data visualization of the NYC dataset. This visualization provides a good overview of the dataset.

After viewing the visualization, I had thought of a few questions.

Is there a relationship between availability of a listing and its price?
Does the borough a listing is in affect availability and price? (A burough is similar to a county in NYC)

Importing the Data

First, I downloaded the NYC data that was compile on May 3rd, 2019. The data is 177 megabytes, so I will read it into RStudio directly.

I took around the first hundred entries and put it in Google Sheets. I was able to see the attributes of the dataset. For the purposes of the analyses, many of the columns can be ignored. Some columns have text data in them, such as a description of the listing. I will not use this data, so I ignored the column when parsing the data in.

I will read in the csv file with read_csv().

# loading code goes here
library(tidyverse)
library(dplyr)
library(ggplot2)
library(readr)

library(scales)
library(DBI)
library(broom)

# We skip, or exclude, certain columns
listings <- read_csv("~/cmsc320/final/data/listings.csv",
    col_types = cols(access = col_skip(),
        amenities = col_skip(), bed_type = col_skip(),
        calendar_last_scraped = col_skip(),
        calendar_updated = col_skip(), cancellation_policy = col_skip(),
        description = col_skip(), experiences_offered = col_skip(),
        first_review = col_skip(), host_about = col_skip(),
        host_acceptance_rate = col_skip(),
        host_has_profile_pic = col_skip(),
        host_identity_verified = col_skip(),
        host_location = col_skip(), host_name = col_skip(),
        host_picture_url = col_skip(), host_response_time = col_skip(),
        host_since = col_skip(), host_thumbnail_url = col_skip(),
        host_url = col_skip(), host_verifications = col_skip(),
        house_rules = col_skip(), interaction = col_skip(),
        is_business_travel_ready = col_logical(),
        jurisdiction_names = col_skip(),
        last_review = col_skip(), last_scraped = col_skip(),
        license = col_skip(), listing_url = col_skip(),
        medium_url = col_skip(), name = col_skip(),
        neighborhood_overview = col_skip(),
        notes = col_skip(), picture_url = col_skip(),
        require_guest_phone_verification = col_skip(),
        require_guest_profile_picture = col_skip(),
        requires_license = col_skip(), scrape_id = col_skip(),
        space = col_skip(), square_feet = col_skip(),
        summary = col_skip(), thumbnail_url = col_skip(),
        transit = col_skip(), xl_picture_url = col_skip()
        ))

## Warning: 9 parsing failures.
##   row     col               expected actual                                file
##  4267 zipcode no trailing characters  -2308 '~/cmsc320/final/data/listings.csv'
##  4388 zipcode no trailing characters  -2304 '~/cmsc320/final/data/listings.csv'
##  4544 zipcode no trailing characters  -3220 '~/cmsc320/final/data/listings.csv'
## 26943 zipcode no trailing characters  -3233 '~/cmsc320/final/data/listings.csv'
## 27827 zipcode no trailing characters  m     '~/cmsc320/final/data/listings.csv'
## ..... ....... ...................... ...... ...................................
## See problems(...) for more details.

We see that there were a few parsing failures. We can disregard these since there are so few (9 parsing errors out of 48941 entities). The zip code for these entities will just be NA.

Getting the Data

Recall our question:

Is there a relationship between availability of a listing and its price?

We will retrieve the appropriate data from the listings table.

Here is the list of column names.

listings %>% colnames()

##  [1] "id"                                          
##  [2] "host_id"                                     
##  [3] "host_response_rate"                          
##  [4] "host_is_superhost"                           
##  [5] "host_neighbourhood"                          
##  [6] "host_listings_count"                         
##  [7] "host_total_listings_count"                   
##  [8] "street"                                      
##  [9] "neighbourhood"                               
## [10] "neighbourhood_cleansed"                      
## [11] "neighbourhood_group_cleansed"                
## [12] "city"                                        
## [13] "state"                                       
## [14] "zipcode"                                     
## [15] "market"                                      
## [16] "smart_location"                              
## [17] "country_code"                                
## [18] "country"                                     
## [19] "latitude"                                    
## [20] "longitude"                                   
## [21] "is_location_exact"                           
## [22] "property_type"                               
## [23] "room_type"                                   
## [24] "accommodates"                                
## [25] "bathrooms"                                   
## [26] "bedrooms"                                    
## [27] "beds"                                        
## [28] "price"                                       
## [29] "weekly_price"                                
## [30] "monthly_price"                               
## [31] "security_deposit"                            
## [32] "cleaning_fee"                                
## [33] "guests_included"                             
## [34] "extra_people"                                
## [35] "minimum_nights"                              
## [36] "maximum_nights"                              
## [37] "minimum_minimum_nights"                      
## [38] "maximum_minimum_nights"                      
## [39] "minimum_maximum_nights"                      
## [40] "maximum_maximum_nights"                      
## [41] "minimum_nights_avg_ntm"                      
## [42] "maximum_nights_avg_ntm"                      
## [43] "has_availability"                            
## [44] "availability_30"                             
## [45] "availability_60"                             
## [46] "availability_90"                             
## [47] "availability_365"                            
## [48] "number_of_reviews"                           
## [49] "number_of_reviews_ltm"                       
## [50] "review_scores_rating"                        
## [51] "review_scores_accuracy"                      
## [52] "review_scores_cleanliness"                   
## [53] "review_scores_checkin"                       
## [54] "review_scores_communication"                 
## [55] "review_scores_location"                      
## [56] "review_scores_value"                         
## [57] "instant_bookable"                            
## [58] "is_business_travel_ready"                    
## [59] "calculated_host_listings_count"              
## [60] "calculated_host_listings_count_entire_homes" 
## [61] "calculated_host_listings_count_private_rooms"
## [62] "calculated_host_listings_count_shared_rooms" 
## [63] "reviews_per_month"

We make a new dataframe with the columns we need to answer this question.

NYC has five boroughs: The Bronx, Brooklyn, Manhattan, Queens, and Staten Island. We see the attribute neighborhood_group_cleansed corresponds to borough. This will be renamed to borough.

We will use availability_365 as a measure of availability. The column gives the days the listing is available out of 365 days. This will be renamed as availability.

We will use price as a measure of the cost of the listing. We will exclude security deposit, cleaning fee, administrative fees, etc. Note that this price is the price of renting the listing for one night.

data <- listings %>%
        select(id, borough = neighbourhood_group_cleansed, price, availability = availability_365)

head(data)

## # A tibble: 6 x 4
##         id borough   price   availability
##      <dbl> <chr>     <chr>          <dbl>
## 1  1742654 Manhattan $200.00          204
## 2 23502842 Manhattan $159.00           10
## 3 15984984 Brooklyn  $180.00           10
## 4 13820083 Brooklyn  $99.00            51
## 5  6170979 Brooklyn  $100.00          257
## 6 27283214 Manhattan $75.00            19

Cleaning the Data

We notice that borough should be a factor, or categorical attribute. Here is the documentation for factor().

Price should be a double. We will parse the string and convert it into a double.

data$borough <- factor(data$borough, levels = c("Manhattan",
        "Bronx", "Brooklyn", "Staten Island",
        "Queens"))
        
# split on "."
# returns list of char vectors
split_string <- str_split(data$price, "\\.")

# Turn list into vector
split_vec <- unlist(split_string)

# Vector is 200, 00, 159, 00
# We want to remove all "00"
remove <- "00"

# split_vec %in% remove will return a logical vector
# With true if the string is "00", false otherwise
# Then, use the negation of the logical vector
# to keep the price of the vector
split_vec <- split_vec[! split_vec %in% remove]

data$price <- parse_number(split_vec)


head(data)

## # A tibble: 6 x 4
##         id borough   price availability
##      <dbl> <fct>     <dbl>        <dbl>
## 1  1742654 Manhattan   200          204
## 2 23502842 Manhattan   159           10
## 3 15984984 Brooklyn    180           10
## 4 13820083 Brooklyn     99           51
## 5  6170979 Brooklyn    100          257
## 6 27283214 Manhattan    75           19

Exploratory Data Analysis (EDA)

If you are interested in EDA, you can see Tukey’s book, this description of EDA, and a summary of Tukey’s life.

Availability

We first plot the distribution of availability.

data %>%
  ggplot(aes(x = availability)) + 
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We note that availability is heavily right skewed, meaning a majority of listings have low availability. This makes sense, since NYC is a very popular destination.

Since availability is right skewed, we could consider doing a log transform. We will use the code here, made by Professor Hector Corrada Bravo.

compute_skew_stat <- function(df, attribute) {
  attribute <- enquo(attribute)
  
  df %>%
    summarize(med_attr=median(!!attribute, na.rm=TRUE), 
              q1_attr=quantile(!!attribute, 1/4, na.rm=TRUE), 
              q3_attr=quantile(!!attribute, 3/4, na.rm=TRUE)) %>%
    mutate(d1 = med_attr - q1_attr, d2 = q3_attr - med_attr, skew_stat = d1 - d2) %>%
    select(d1, d2, skew_stat)
}

data %>% compute_skew_stat(availability)

## # A tibble: 1 x 3
##      d1    d2 skew_stat
##   <dbl> <dbl>     <dbl>
## 1    51   176      -125

d1 is the $median - quantile1$. d2 is $quartile3 - median$. skew_stat is $d1 - d2$. Since d1 and d2 are significantly different, the dataset is most likely skewed lecture notes.

We would like to reduce data skew. However, we notice that most values of availability are 0. The log of 0 is undefined, thus a log transform is probably not applicable to this data.

We are unable to resolve the skewness for availability, but this is an important fact to keep in mind during downstream analysis.

Price

We plot the histogram for price.

data %>%
  ggplot(aes(x = price)) + 
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We see price is left skewed. Since we can expect price to be nonzero, we would like to apply a log transform.

data %>% compute_skew_stat(price)

## # A tibble: 1 x 3
##      d1    d2 skew_stat
##   <dbl> <dbl>     <dbl>
## 1    39    71       -32

The large difference between d1 and d2 reaffirm that price is skewed. We apply log transform on price then plot the histogram.

data <- data %>%
            mutate(price_log = log2(price))

data %>%
  ggplot(aes(x = price_log)) + 
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 12 rows containing non-finite values (stat_bin).

data %>% compute_skew_stat(price_log)

## # A tibble: 1 x 3
##      d1    d2 skew_stat
##   <dbl> <dbl>     <dbl>
## 1 0.646 0.729   -0.0826

We see the skew of price has been reduced.

Exploring the Relationship

We make a scatterplot of price(transformed) vs. availability. We see if there is a relationship between these two attributes.

Since there are many entries, we will first only plot a random fraction.

data %>%
  sample_frac(.1) %>%
  ggplot(aes(x=availability, y=price_log)) +
    geom_point() +
    labs(title="Price(transformed) vs. availability",
         x = "availability",
         y = "Price")

Next, we will plot listings for a specific boroughs.

data %>%
  ggplot(aes(x=availability, y=price_log)) +
    geom_point() +
    labs(title="Price(transformed) vs. Availability conditioned on Borough",
         x = "availability",
         y = "Price") +
  facet_wrap(~borough)

From examining the plots, it looks like Bronx is most likely to have a linear relationship between availability and price. We make a plot of Bronx specifically and draw a linear regression line.

data %>%
  filter(borough == "Bronx") %>% 
  ggplot(aes(x=availability, y=price_log)) +
    geom_point() +
    labs(title="Price(transformed) vs. Availability of Bronx",
         x = "availability",
         y = "Price") +
  geom_smooth(method = lm)

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

The slope seems very low. However, this does not imply that there is no significant relationship between availability and price. We would like to know if there is a significant relationship. This question will hopefully be answered in the Analysis portion.

Remarks

From above, availability is heavily skewed. We were able to reduce the skewness of price. In downstream analysis, we must keep in mind of these facts, as they will impact our analysis.

Analysis

We want to determine if there exists a relationship between the availability of a listing and its price. Specifically, we will only look at listings in the Bronx borough.

We will use linear regression to determine:

1. In the Bronx, is there a significant linear relationship between availability and price?

2. Can we make statements of the dependence of price on availability?

Initial Thoughts

We feel that availability should have an impact on price. If a listing is rarely available, this could indicate that the price is very good (cheap or low price). However, there are many other factors that impact price besides availability. For example, say a listing is has a very convenient location (close to restaurants, public transportation, etc.). The host could price this listing higher, knowing that the place has a very good location. Despite the price, renters could be wiling to pay this price to get the good location. In turn, the availability is low and the price is high.

Hopefully, the relationship between availability and price will become clearer in the following analysis.

Linear Regression

We do linear regression to model price with availability as the predictor. Note that there are likely many other factors that impact price that are not being included in this analysis.

For more on linear regression, see CMSC320 lecture notes. In my opinion, I recommend Andrew Ng’s online Machine Learning to find out more on linear regression and machine learning in general. I also recommend these notes from Stanford’s CS229 on linear regression.

bronx_data <- data %>% 
                filter(borough == "Bronx")

bronx_fit <- lm(price ~ availability, data = bronx_data)

bronx_fit_stats <- bronx_fit %>% tidy()

bronx_fit_stats

## # A tibble: 2 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   74.8       5.03       14.9  1.90e-45
## 2 availability   0.0588    0.0236      2.49 1.30e- 2

Hypothesis Testing

We will see if there is a relationship between these two attributes. We state the null and alternative hypotheses.

$H_0$ : There is not a significant relationship between availability and price.

$H_a$ : There is a significant relationship between availability and price.

We will test at the five percent significance level ($\alpha = 0.05$).

From the table above, we note that p-value is approximately $0.01295$. This is less than $\alpha$. Therefore, we reject the null hypothesis of no relationship.

For more on hypothesis testing within linear regression, see this section of UMD CMSC320 lecture notes.

Interpretation of Model

For now, let us assume the model meets the assumptions of linear repression. Then, we can make statements on the model.

From the model, we look at the intercept $\hat{\beta_0} = 74.76297055$ and the slope $\hat{\beta_1} = 0.05878498$

Based on this model, on average, a listing has is never available (availability = 0) has a listing price of about $75 per night.

On average, holding all other factors constant, for each additional day that the listing is available, its price increases by $0.0588, or around six cents. In other words, listings that are not as available tend to be cheaper.

Evaluation

We want to see if model actually meets the assumptions of linear regression. See lecture notes for more information on assumptions of linear regression.

We will plot a graph of residuals vs the fitted values. This will allow us to see if the linear relationship is a good estimate of the relationship between availability and price.

augmented_bronx <- bronx_fit %>% augment()

augmented_bronx %>% head()

## # A tibble: 6 x 9
##   price availability .fitted .se.fit .resid    .hat .sigma .cooksd
##   <dbl>        <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
## 1    58           48    77.6    4.20  -19.6 1.73e-3   101. 3.27e-5
## 2   100           89    80.0    3.63   20.0 1.29e-3   101. 2.55e-5
## 3    63          185    85.6    3.17  -22.6 9.87e-4   101. 2.49e-5
## 4   120          144    83.2    3.19   36.8 9.95e-4   101. 6.61e-5
## 5    42           36    76.9    4.40  -34.9 1.90e-3   101. 1.13e-4
## 6    77          301    92.5    4.47  -15.5 1.96e-3   101. 2.31e-5
## # … with 1 more variable: .std.resid <dbl>

augmented_bronx %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth() + 
  labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals")

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Looking at the plot, almost all of the residual values are near 0. Furthermore, there does not appear to be a trend within the residuals. However, some residuals are extremely high. These outliers are very few, so the linear model appears to perform well.

Based on this simple examination, the linear model appears to fit the values well, and thus we are able to make inferential statements from the model. However, to be more thorough, more exploration into the linear model is necessary.

Conclusion

Summary

We obtained the Airbnb data and cleaned it. Based on initial examination of the data, we wanted to examine the relationship of availability and price. After making plots, we decided to limit our scope to the Bronx borough. We explored the relationship between availability and price of listings in the Bronx. From the analysis, we see that there is a significant relationship between availability and price, and we were able to make statements on the effect of availability on price.

Statements

We saw that a listing that is never available has a listing price of about $75 dollars per night. On average, for every day that the listing is available, the listing price increases by around $0.06 per night.

From this, we can make some general (high level) statements. In the Bronx, listings that are less popular tend to be priced higher. Listings that are priced low tend to be booked more often. A listing of $75 dollars per night tends to be on the cheaper side of listings in the Bronx, since these listings are often booked.

Something that can be explored is the difference in price between Airbnb listings and hotels in the Bronx. We searched Google Maps for hotels in the Bronx. From initial, rudimentary examination, these hotels tend to charge $118 to $212 per night.

We see that Airbnb listings are cheaper than the typical hotel stay in the Bronx. The cheap price of lodging that Airbnb offers, compared to hotels, is likely an reason for Airbnb’s success.

Remarks

Again, we note that there are many other factors impacting price, such as location, number of bedrooms and bathrooms, amenities, etc. We also note that availability is heavily right skewed.

Future Work

In our analysis portion, we only examined the relationship between availability and price in the Bronx borough. Future work could entail determining the relationship between these attributes in other boroughs.

We also know that price is not only determined by availability. Future analysis could incorporate other attributes, such as number of bedrooms and bathrooms, superhost status, number of amenities, etc.

The listing data also has information on reviews. This data can be further examined. We could explore if reviews are significantly different in listings in certain boroughs versus others.