Applied Statistics & Probability

Course site for Fall 2017 offering of CISC 5420 - Applied Statistics & Probability at Fordham University.



Welcome to CISC 5420 - Applied Statistics & Probability!

Python Data Science Handbook by Jake VanderPlas

I came across an incredible resource for learning more about Python, Data Science, and Machine Learning. The Python Data Science Handbook is written by Jake VanderPlas, an astronomer and leading figure in the Python data science community. The book contains chapters on NumPy, pandas, matplotlib, seaborn, and scikit-learn, and it is available for free online in the form of Jupyter notebooks. Really amazing stuff!

Lecture 14

This week we introduced Time Series Analysis. A time series is a sequence of measurements from a system that varies in time. The techniques used to analyze time series are different from the other techniques we’ve reviewed this semester for several reasons, particularly when it comes to modeling data.

  1. Values that vary in time typically do so in unpredictable ways. Hence, there is no reason to expect a long-term trend to be a line (as in linear regression) or any other simple sort of function. For example, prices of goods are affected by supply and demand, which vary over time.

  2. For purposes of prediction, we should probably give more weight to data that is closer in time. It doesn’t make as much sense to weight all data equally.

  3. With time series data, successive values are often correlated in time. For instance, if the price of a stock is high on a given day, we expect it to remain fairly high the subsequent day.

Much of time series analysis is based on the assumption that observed data can be decomposed into three components.

  1. Trend: Some smooth function that captures persistent changes.

  2. Seasonality: Periodic variation that could come in cycles of days, weeks, months, or years.

  3. Random noise: A random variation around the long-term trend.

We looked at various methods of modeling trends, including the rolling mean and the exponentially weighted moving average. We also covered how to calculate the autocorrelation function and how to decompose a time series into the sum of its three components.
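
As a rough illustration of the smoothing and autocorrelation methods mentioned above, here is a minimal sketch using pandas. The synthetic series, the 7-day window, and the span of 7 are assumptions chosen for illustration, not values from the lecture.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (an assumption for illustration): linear trend,
# weekly seasonality, and random noise.
rng = np.random.default_rng(0)
t = np.arange(120)
values = 0.05 * t + np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.3, size=t.size)
series = pd.Series(values, index=pd.date_range("2017-01-01", periods=t.size, freq="D"))

# Rolling mean: every point in the 7-day window gets equal weight.
rolling = series.rolling(window=7).mean()

# Exponentially weighted moving average: recent points get more weight.
ewma = series.ewm(span=7).mean()

# Autocorrelation: correlation of the series with a lagged copy of itself.
lag1 = series.autocorr(lag=1)
lag7 = series.autocorr(lag=7)
print(f"lag-1 autocorrelation: {lag1:.2f}, lag-7 autocorrelation: {lag7:.2f}")
```

Because the window matches the weekly period, the rolling mean averages out the seasonal component and leaves something close to the trend. The first six values of the rolling mean are missing because a full window is not yet available.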

Your homework assignment for next class is as follows:

  1. Read Chapter 13 on Survival Analysis and run through the code in the chapter.

Lecture 13

Our class discussion today focused on linear regression and logistic regression as implemented in the Python StatsModels library. The methods and objects in this library allow us to easily and quickly fit regression models to data and summarize results. For example, we can fit a multiple regression model and view model summary statistics including the coefficient of determination and p-values for each of the slopes. We discussed how linear regression can be used for explanatory purposes by controlling for the effects of certain variables and analyzing how the regression parameters for the remaining features vary. By contrast, the data mining approach focuses on predictive power - in the regression case, we seek the model that achieves the highest R^2 value, while in the classification case we look for the model that achieves the highest accuracy.
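
To make the StatsModels workflow described above concrete, here is a minimal sketch using the formula interface. The data are synthetic and the column names `x1`, `x2`, and `y` are made up for illustration; this is not the notebook from class.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption): y depends on x1 and x2 plus noise.
rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x1"] - 0.5 * df["x2"] + rng.normal(scale=0.5, size=n)

# Fit a multiple regression with the formula interface.
results = smf.ols("y ~ x1 + x2", data=df).fit()

print(results.params)    # intercept and slopes
print(results.pvalues)   # p-value for each coefficient
print(results.rsquared)  # coefficient of determination
```

Calling `results.summary()` prints all of these statistics in a single table. For a binary outcome, `smf.logit` accepts the same kind of formula.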

Your homework assignment for next class is as follows:

  1. Read Chapter 12 on Time series analysis and run through the code in the chapter.

Twitter Article

Check out this cool article from the data scientists at Twitter. Twitter wanted to determine whether they should increase the length of tweets from 140 characters and how that should differ by language. After collecting billions of tweets in different languages, they found that the length distributions follow a log-normal distribution.

Lecture 12

Our discussion this week was about the linear least squares fit, a.k.a. simple linear regression. In simple linear regression, we model a relationship between two variables as a linear function and estimate the parameters of that line - the slope and intercept - by minimizing the sum of squared errors. Why is minimizing the sum of squared errors a good thing to do? For one, it treats positive and negative errors - i.e. residuals - the same way. If we simply added the errors, the positive and negative errors would cancel one another out. Squaring also penalizes larger residuals more heavily. Theoretically, the least squares fit is also the maximum likelihood estimator if our residuals are uncorrelated and normally distributed with mean 0 and constant (but unknown) variance.
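
The slope and intercept that minimize the sum of squared errors have a closed form, which can be sketched as follows. The function name `least_squares` and the toy data are assumptions for illustration.

```python
import numpy as np

def least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    x_mean, y_mean = xs.mean(), ys.mean()
    # slope = Cov(x, y) / Var(x); the line passes through the point of means.
    slope = np.sum((xs - x_mean) * (ys - y_mean)) / np.sum((xs - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = 1.0 + 2.0 * xs  # points that lie exactly on y = 1 + 2x
intercept, slope = least_squares(xs, ys)
residuals = ys - (intercept + slope * xs)
```

With noise-free points the fit recovers the line exactly and every residual is zero; with real data the residuals are what the squared-error criterion keeps small.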

Since our parameters are estimated from samples of data, they’re affected by sampling bias, sampling error, and measurement error. To quantify the uncertainty due to sampling error, we can simulate repeated experiments by resampling our data with replacement and computing statistics on the resulting sampling distributions. Finally, we can check whether the linear relationships are statistically significant by running hypothesis tests where we use the coefficient of determination or the slope as our test statistic.
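
Here is a minimal sketch of the resampling idea applied to the slope. The synthetic data, the 1000 iterations, and the 90% interval are assumptions for illustration; `np.polyfit` stands in for whatever fitting routine you prefer.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic sample (an assumption): true slope 0.5, noisy observations.
xs = rng.uniform(0, 10, size=100)
ys = 3.0 + 0.5 * xs + rng.normal(0, 1.0, size=100)

def fit_slope(x, y):
    # np.polyfit with degree 1 returns (slope, intercept).
    return np.polyfit(x, y, 1)[0]

# Resample (x, y) pairs with replacement and refit each time.
slopes = []
for _ in range(1000):
    idx = rng.integers(0, len(xs), size=len(xs))
    slopes.append(fit_slope(xs[idx], ys[idx]))
slopes = np.array(slopes)

std_error = slopes.std()                          # spread of the sampling distribution
ci_low, ci_high = np.percentile(slopes, [5, 95])  # 90% confidence interval
```

The standard deviation of the resampled slopes estimates the standard error, and the percentiles give a confidence interval for the slope.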

In order to assess the goodness of fit - a measure of how well the model fits our data - we compute the coefficient of determination, which measures how much variance in the data our model explains. See this blog post for another explanation of the coefficient of determination.
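
As a small sketch, the coefficient of determination can be computed from the residuals like this. The helper name and the toy numbers are assumptions for illustration.

```python
import numpy as np

def coefficient_of_determination(ys, predictions):
    """R^2 = 1 - Var(residuals) / Var(ys)."""
    ys = np.asarray(ys, dtype=float)
    residuals = ys - np.asarray(predictions, dtype=float)
    return 1.0 - residuals.var() / ys.var()

ys = np.array([1.0, 3.0, 5.0, 7.0])
perfect = ys.copy()                      # model reproduces the data exactly
baseline = np.full_like(ys, ys.mean())   # model always predicts the mean

r2_perfect = coefficient_of_determination(ys, perfect)    # 1.0
r2_baseline = coefficient_of_determination(ys, baseline)  # 0.0
```

A model that explains all of the variance scores 1, while a model that does no better than always guessing the mean scores 0. This variance form assumes the residuals have mean zero, which holds for a least squares fit that includes an intercept.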

Another neat way to fit a line to data is to use a Theil-Sen estimator, a non-parametric way to fit a line through a sample of points. This method is implemented in the scipy library.
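
To show the idea behind the Theil-Sen estimator, here is a hand-rolled sketch in NumPy: take the median of the slopes through all pairs of points. This is an illustration of the technique rather than the scipy implementation (in scipy, the relevant function is `scipy.stats.theilslopes`); the helper name and the toy data are assumptions.

```python
import numpy as np
from itertools import combinations

def theil_sen(xs, ys):
    """Median of pairwise slopes; intercept as the median of y - slope * x."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slopes = [
        (ys[j] - ys[i]) / (xs[j] - xs[i])
        for i, j in combinations(range(len(xs)), 2)
        if xs[j] != xs[i]
    ]
    slope = np.median(slopes)
    intercept = np.median(ys - slope * xs)
    return intercept, slope

xs = np.arange(10, dtype=float)
ys = 1.0 + 2.0 * xs
ys[9] = 100.0  # a single gross outlier

intercept, slope = theil_sen(xs, ys)
ls_slope = np.polyfit(xs, ys, 1)[0]  # least squares slope, pulled by the outlier
```

Because medians ignore extreme values, the Theil-Sen fit stays on the line y = 1 + 2x here, while the least squares slope is dragged upward by the one outlier.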

Your homework assignment for next class is as follows:

  1. Reread Chapter 11 on Regression. We will discuss this in our Skype class on Sunday.
  2. Read Chapter 12 on Time series analysis. This is an incredibly fascinating topic, especially as time series data is so prevalent. Time series analysis requires a different set of tools than traditional non-time-series analysis because the data is correlated in time, i.e. the data leading up to a particular point in time tell us something about the data after that point in time.

Interesting Dataset

I came across an interesting dataset on home prices while reading this New York Times article, How much income do you need to buy a home?. The dataset shows the annual income one needs to be able to afford a home in one of America’s 50 most populous cities. Check out the article and dataset here.

Let’s talk about this information in class on Tuesday!

Lecture 11

This week we spoke about statistical hypothesis testing - the methodology by which we determine whether the effects we see in a sample of data are likely to appear in a larger population. In hypothesis testing, the main question we seek to answer is “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” (quote from Downey, ThinkStats2). This question captures the essence of hypothesis testing very nicely. Our “apparent effect” is quantified via a test statistic, the “by chance” part is modeled by the null hypothesis, and the “probability of such an effect by chance” is our p-value. Strictly speaking, the p-value is the probability of seeing an effect at least as large as the observed one if the null hypothesis were true; it is not the probability that the effect is due to chance. A high p-value means the effect is plausibly explained by random chance, so we have little evidence that it holds in the larger population, while a low p-value means the effect would be unlikely under the null hypothesis.
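
The three ingredients above (test statistic, null hypothesis, p-value) can be sketched with a permutation test. The two synthetic groups, the choice of the absolute difference in means as the test statistic, and the 1000 iterations are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two synthetic groups (an assumption) whose true means differ by 0.5.
group_a = rng.normal(10.0, 1.0, size=200)
group_b = rng.normal(10.5, 1.0, size=200)

def mean_difference(a, b):
    # Test statistic: absolute difference in group means.
    return abs(a.mean() - b.mean())

observed = mean_difference(group_a, group_b)

# Null hypothesis: group labels do not matter, so shuffle the pooled data
# and split it into two groups of the original sizes.
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
iterations = 1000
count = 0
for _ in range(iterations):
    rng.shuffle(pooled)
    if mean_difference(pooled[:n_a], pooled[n_a:]) >= observed:
        count += 1

# p-value: fraction of shuffles with an effect at least as large as observed.
p_value = count / iterations
```

A small p-value here says that a difference this large rarely shows up when the labels are assigned at random.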

Your homework assignment for next class is to read chapter 11 of ThinkStats2. Note, we will be covering chapter 10 on Linear Regression next week in class. We will cover chapter 11 next Saturday during our Skype class.

Happy Thanksgiving!

Lecture 10

This week we discussed estimation - the process of estimating a population parameter such as the mean from a finite sample of data. In particular, we noted that how you estimate a parameter depends on the particular circumstance. For example, if your data comes from a normal distribution, then the sample mean is the best unbiased estimator, that is, among unbiased estimators it minimizes the Mean Squared Error. We also discussed Maximum Likelihood Estimation (MLE) in depth. If we have a fixed dataset and statistical model, MLE selects the parameters that maximize the likelihood function, i.e. the parameter values under which the observed data are most probable. We showed that 1/mean(x) (where mean(x) is the sample mean) is the MLE of the parameter lambda for the exponential distribution.
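
The MLE result for the exponential distribution can be checked numerically. In this sketch the true lambda of 2.0, the sample size, and the search grid are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
true_lambda = 2.0
# NumPy parameterizes the exponential by scale = 1 / lambda.
x = rng.exponential(scale=1.0 / true_lambda, size=1000)

def log_likelihood(lam, data):
    # log L(lambda) = n * log(lambda) - lambda * sum(x)
    return len(data) * np.log(lam) - lam * data.sum()

# Closed-form MLE: setting the derivative of the log-likelihood to zero
# gives n / lambda - sum(x) = 0, so lambda_hat = 1 / mean(x).
lambda_hat = 1.0 / x.mean()

# Numerical check: the best value on a fine grid should sit next to lambda_hat.
grid = np.linspace(0.5, 4.0, 3501)
grid_best = grid[np.argmax(log_likelihood(grid, x))]
```

The grid search and the closed-form answer agree to within the grid spacing, and both land near the value of lambda used to generate the data.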

Bias is the difference between an estimator’s expected value (i.e. average) and the true value of a parameter. An estimator with zero bias is unbiased. We showed computationally that the sample mean is an unbiased estimator of the mean of a Gaussian distribution.
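
A computational check of bias along these lines might look like the following sketch. The parameter values and the number of simulated experiments are assumptions, and the variance comparison is added only as a contrasting example of a biased estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, n = 3.0, 2.0, 10

# Repeat the "experiment" many times: draw n values, compute the estimate.
samples = rng.normal(mu, sigma, size=(20000, n))
sample_means = samples.mean(axis=1)

# Bias = average estimate minus the true parameter value.
bias_of_mean = sample_means.mean() - mu

# For contrast: dividing by n (ddof=0) gives a biased variance estimate,
# while dividing by n - 1 (ddof=1) gives an unbiased one.
bias_var_n = samples.var(axis=1, ddof=0).mean() - sigma**2
bias_var_n_minus_1 = samples.var(axis=1, ddof=1).mean() - sigma**2
```

The bias of the sample mean comes out close to zero, while the divide-by-n variance estimate is systematically too small by roughly sigma^2 / n.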

Your homework assignment for next class is as follows:

  1. Read Ch. 9 and 10 of ThinkStats2 and run through the code in the chapters. These chapters discuss hypothesis testing and simple linear regression - both very important topics. In particular, hypothesis testing is used to determine whether observed statistical differences, such as the observed mean difference in birth lengths between first babies and others, can be attributed to random chance. You will be asked to perform at least 2 hypothesis tests in your final projects, so please read this chapter carefully.

  2. Check out gmaps, a Python package for plotting geographical data on Google Maps in the Jupyter notebook.

  3. We will not have class next week. Instead, we will make up the lecture over Skype at a time that will be scheduled when we meet in 2 weeks.

Cool Kaggle Kernel

I just came across this awesome Kaggle kernel analyzing some data about salaries by undergraduate major. The author does an awesome job of creating a cool visualization in steps. Check it out!

Lecture 9

Our focus this week was on examining relationships between two variables. Scatter plots are visualizations used to examine the relationship between 2 numerical features. We spoke about ways of improving scatter plots when working with larger datasets and then discussed correlation statistics as ways of quantifying the strength of relationships between variables. Remember that Pearson’s product-moment correlation quantifies the strength of a linear relationship between variables. A correlation of 0 does not necessarily mean no relationship, but it does mean no linear relationship. We then spent a fair amount of time applying these concepts to the Austin bikeshare dataset. Here is that notebook.
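
The point about zero correlation can be seen in a tiny sketch. The toy variables below are assumptions for illustration.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric around zero

linear = 2.0 * x + 1.0   # perfectly linear in x
quadratic = x ** 2       # fully determined by x, but not linearly

# np.corrcoef returns the correlation matrix; entry [0, 1] is Pearson's r.
r_linear = np.corrcoef(x, linear)[0, 1]
r_quadratic = np.corrcoef(x, quadratic)[0, 1]
```

Here `r_linear` is 1 and `r_quadratic` is 0 up to floating-point error, even though `quadratic` depends entirely on `x`. This is why it is worth looking at a scatter plot before trusting a correlation coefficient.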

Your homework assignment for next class is as follows:

  1. Read Ch. 8 of ThinkStats2 and run through the code in the chapter. This chapter is about estimation - the practice of estimating quantities that describe a population from finite samples. You’ll encounter several extremely important concepts including standard error, sampling bias, confidence intervals, and sampling distributions.

Lecture 8

This week we spent the class reviewing midterm project submissions. It was a great chance to learn from your peers, ask one another questions, and find out about difficulties encountered during your work. Nice job, everyone!

Your homework assignment for next class is as follows:

  1. Read Ch. 7 of ThinkStats2 and run through the code in the chapter. This chapter begins our foray into looking at relationships among different variables in your dataset. Although I won’t ask you to hand this in, it’s a good idea to begin applying these methods to your own datasets, even if you don’t fully understand the concepts. We’ll review the concepts more in class.

  2. Check out this awesome visual intro to machine learning. Amazing visualizations! This is a great example of using technology to improve education! Also, I just came across this awesome collection of cheat sheets for data analysis in Python. Definitely check these out! There are cheat sheets for pandas, matplotlib, seaborn, and more. Great resource.

Map Visualizations

Several students have asked about how to visualize geographic information by creating maps. Arguably the best tools for visualizing geographic information are written in JavaScript. D3 is one of the finer tools for this and the author of the library has written a series of tutorials for creating beautiful maps. I encourage you to check them out!

Lecture 7

This week we concluded the first half of our semester by discussing probability density functions and kernel density estimation. Probability Density Functions, or PDFs, allow us to define probability distributions for continuous random variables. The PDF is the derivative of the Cumulative Distribution Function. Evaluating the PDF at a particular point does not give us the probability of that value occurring; rather, it gives us a probability density. In order to compute a probability, we need to integrate the PDF over some range, i.e. compute the area underneath the curve. We showed in class how probability mass functions, cumulative distribution functions, and probability density functions are all related.
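
To illustrate the difference between a density and a probability, here is a minimal sketch with the normal distribution. The helper names and the simple trapezoid-rule integrator are assumptions written for this example.

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

def integrate(f, a, b, steps=10000):
    """Trapezoid-rule approximation of the integral of f from a to b."""
    xs = np.linspace(a, b, steps + 1)
    ys = f(xs)
    dx = (b - a) / steps
    return dx * (ys[0] / 2 + ys[1:-1].sum() + ys[-1] / 2)

# A density is not a probability: with sigma = 0.1 it exceeds 1 at the mean.
density_at_mean = normal_pdf(0.0, mu=0.0, sigma=0.1)

# A probability is an area under the density curve.
prob_within_one_sd = integrate(normal_pdf, -1.0, 1.0)
```

The density at the mean is about 3.99, which could not be a probability, while the area between -1 and 1 under the standard normal curve is about 0.683.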

We also discussed kernel density estimation (KDE). KDE is a nonparametric way of computing a smooth distribution that fits a finite sample of data. This is useful for visualization, interpolation, and simulation, and it offers an alternative way to model data.
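
The mechanics of KDE can be sketched by hand: place a small Gaussian bump on each data point and average them. The helper name `make_kde`, the synthetic sample, and the bandwidth of 0.4 are assumptions for illustration; in practice a library routine such as `scipy.stats.gaussian_kde` would pick a bandwidth for you.

```python
import numpy as np

def make_kde(sample, bandwidth):
    """Return a function that evaluates a Gaussian KDE of `sample`."""
    sample = np.asarray(sample, dtype=float)

    def density(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        # One Gaussian bump per data point, averaged over the sample.
        z = (x[:, None] - sample[None, :]) / bandwidth
        bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
        return bumps.mean(axis=1)

    return density

rng = np.random.default_rng(6)
sample = rng.normal(0.0, 1.0, size=200)
kde = make_kde(sample, bandwidth=0.4)  # bandwidth chosen by eye

grid = np.linspace(-6.0, 6.0, 1201)
estimate = kde(grid)
area = estimate.sum() * (grid[1] - grid[0])  # should be close to 1
```

A smaller bandwidth gives a bumpier estimate that follows the sample closely, while a larger one gives a smoother curve that may hide real structure.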

Your homework assignment for next class is as follows:

  1. Your midterm projects are due by next class! Please see the directions here to remind yourself of what is expected. Be prepared to discuss these projects in class with your classmates.

Lecture 6

Our class this week was on modeling empirical distributions with analytic distributions. Analytic distributions are characterized by a cumulative distribution function that is a mathematical function. Modeling empirical data with analytic distributions is useful if the models capture relevant aspects of the real world and leave out unnecessary details such as measurement error or specific quirks from a sample. Models also act as a form of data compression, allowing us to summarize large amounts of data with a small set of parameters.

We spoke specifically about the exponential, normal, lognormal, and Pareto distributions. I would highly recommend reading through the links provided to get a better understanding of the types of real-world phenomena that these distributions model. We also spoke about the types of transformations and plots we can create to determine whether or not these distributions fit our data well. For instance, to determine if data can be approximated by a normal distribution, we plot a normal probability plot and examine whether the plot is linear. Check out this week’s notebook where I used these tests to model data from the Austin bikeshare dataset.
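
Here is one way to sketch the normal probability plot check without plotting: pair the sorted sample with theoretical normal quantiles and measure how straight the relationship is. The plotting positions, the synthetic samples, and the use of correlation as a straightness score are assumptions for illustration; the notebook from class may do this differently.

```python
import numpy as np
from statistics import NormalDist

def normal_probability_points(sample):
    """Return (theoretical quantiles, sorted sample) for a normal probability plot."""
    ys = np.sort(np.asarray(sample, dtype=float))
    n = len(ys)
    # Plotting positions (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    probs = (np.arange(n) + 0.5) / n
    xs = np.array([NormalDist().inv_cdf(p) for p in probs])
    return xs, ys

rng = np.random.default_rng(7)
normal_sample = rng.normal(50.0, 5.0, size=500)
skewed_sample = rng.exponential(5.0, size=500)

# If the sample is roughly normal, the points fall near a straight line,
# so the correlation between the two axes is close to 1.
r_normal = np.corrcoef(*normal_probability_points(normal_sample))[0, 1]
r_skewed = np.corrcoef(*normal_probability_points(skewed_sample))[0, 1]
```

Passing the two arrays to a scatter plot gives the usual picture: a straight line for the normal sample and a visibly curved one for the skewed sample.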

Your homework assignment for next class is as follows:

  1. Read Ch. 6 of ThinkStats2. This chapter is on probability density functions, which describe the relative likelihood for a continuous variable to take on a given value. This chapter ties together the previously seen concepts of probability mass functions (PMFs) and cumulative distribution functions (CDFs) and concludes the introductory material of the class. Running through the code is not optional. It is a mandatory part of the assignment.

  2. Keep working on your midterm projects. They’re due in 2 weeks on October 24th!

  3. A student mentioned the concept of the bootstrap in class yesterday. Bootstrapping is a method for estimating some property of an estimator by measuring that property on an approximating distribution, often the empirical distribution. Essentially, we resample the empirical distribution with replacement and compute some statistic, such as the variance, on each resample. We will learn much more about estimation in chapter 8 of the text.

As always, email me with any questions!

Lecture 5

This week we focused on Cumulative Distribution Functions - functions that map from values to percentiles. First, we noted that we run into issues when plotting histograms for variables that contain many different values. Namely, histograms will be hard to interpret because it will be difficult to see the overall pattern. One way of getting around this is to discretize or bin our data. However, we are then tasked with picking appropriate bin sizes, which is not trivial. In order to pick bin sizes, data analysts typically try a range of bin sizes through trial-and-error (we mentioned more principled techniques in class).

A better way to deal with this issue is to plot the CDF, which provides an informative visual representation of the shape of a distribution. Common values appear as steep or vertical sections of a CDF. If there are few values at a certain percentile, the CDF is flat in this range. CDFs are especially useful for comparing distributions, as plotting multiple CDFs on the same graph makes the shape of the distribution as well as differences more apparent. We generated some normally distributed test data and plotted the resulting CDF in this week’s notebook.
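
As a small sketch of the mapping from values to percentiles, an empirical CDF can be built by sorting the sample. The helper names and the toy scores are assumptions for illustration.

```python
import numpy as np

def ecdf(sample):
    """Return sorted values and the fraction of the sample <= each value."""
    xs = np.sort(np.asarray(sample, dtype=float))
    ps = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ps

def percentile_rank(sample, value):
    """Percentage of the sample that is less than or equal to `value`."""
    sample = np.asarray(sample, dtype=float)
    return 100.0 * np.mean(sample <= value)

scores = [55, 66, 77, 88, 99]
xs, ps = ecdf(scores)
rank = percentile_rank(scores, 88)  # 4 of 5 values are <= 88, so 80.0
```

Plotting `ps` against `xs` as a step function gives the CDF. No bin size has to be chosen, which is what makes it convenient for comparing distributions.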

Your homework assignment for next class is as follows:

  1. Read Ch. 5 of ThinkStats2. This is the most difficult material we will have encountered thus far. Pay particular attention to the author’s description of how to determine if data is modeled by an analytic distribution. You will be required to use these methods in your projects. Please run through the code in the chapters. Running through the code is not optional. It is a mandatory part of the assignment.

  2. I’ve posted the midterm rubric. Please refer to this document and ask me any questions you may have.

As always, email me with any questions!

Lecture 4

This week we introduced the concept of probability mass functions (PMFs). Whereas histograms map from values to integer counts, PMFs map from values to probabilities. These can be visualized as bar graphs or hollow histograms and are useful for comparing multiple distributions since PMFs account for differences in sample size. In this week’s Jupyter notebook, I gave a few examples of using PMFs to compare different distributions found within the Austin bikeshare dataset.
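
To see why PMFs make samples of different sizes comparable, consider this minimal sketch. The helper name and the two toy samples are made up for illustration.

```python
from collections import Counter

def pmf(values):
    """Map each distinct value to its probability (count / sample size)."""
    counts = Counter(values)
    n = len(values)
    return {value: count / n for value, count in sorted(counts.items())}

# Two samples of different sizes (made-up numbers for illustration).
small = [1, 2, 2, 3]
large = [1, 1, 2, 2, 2, 2, 3, 3]

pmf_small = pmf(small)  # {1: 0.25, 2: 0.5, 3: 0.25}
pmf_large = pmf(large)  # {1: 0.25, 2: 0.5, 3: 0.25}
```

The histograms of these two samples have different bar heights because the counts differ, but the PMFs are identical, which shows that the two samples have the same shape.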

Your homework assignment for next class is as follows:

  1. Read Ch. 4 of ThinkStats2. Please run through the code in the chapters. Running through the code is not optional. It is a mandatory part of the assignment.

  2. If I’ve emailed you asking for further clarification on your research questions, please email me back with the clarifications I’ve requested. By this point, you should all be working on your midterm projects, which are due on October 24th. I will follow up with additional details about the project. At a high level, I will expect a Jupyter notebook containing analysis similar to the analysis encountered so far in the textbooks.

As always, email me with any questions!

Lecture 3

Yesterday we spent time speaking about different forms of exploratory data analysis. After discussing histograms, outliers, effect sizes, and summary statistics (and a few other things), we discussed a Jupyter notebook I put together applying these ideas to an Austin bikeshare dataset. In that notebook I introduced a new Python library, seaborn, which “provides a high-level interface for drawing attractive statistical graphs.” I recommend checking out some of the tutorials on the seaborn site; they’re quite good.
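
One common effect size is Cohen's d, the difference in means measured in units of the pooled standard deviation. The sketch below assumes that is the kind of effect size meant here; the helper name and the toy groups are made up for illustration.

```python
import numpy as np

def cohen_effect_size(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    # Pooled variance: size-weighted average of the two group variances.
    pooled_var = (n1 * g1.var() + n2 * g2.var()) / (n1 + n2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

a = [2.0, 4.0, 6.0]
b = [1.0, 3.0, 5.0]
d = cohen_effect_size(a, b)
```

Because d is unitless, it lets you compare the size of a difference across variables measured on different scales.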

Your homework assignment for next class is as follows:

  1. Read Ch. 3 of ThinkStats2. Please run through the code in the chapters.

  2. Read Ch. 2 of OpenStats except section 2.3.

As always, email me with any questions!

Lecture 2

This past week we began to dig into practical techniques for analyzing data.

First, we examined some useful ways of reading in and slicing through data using the Python package pandas. In order to demonstrate some of these techniques, I put together a simple Jupyter notebook, which you can access here. If you click on the link, you’ll see the Jupyter notebook rendered directly in the browser, as if you were running the notebook on your own machine. In order to actually run the cells, though, you’ll have to clone the repo and run the notebook on your own machine. If you haven’t used GitHub before, here is a short interactive tutorial.
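
For a quick taste of reading and slicing data with pandas, here is a minimal sketch. The tiny in-memory CSV and its column names are made up for illustration and are not taken from the class notebook.

```python
import io
import pandas as pd

# A tiny in-memory CSV stands in for a real file; the column names and
# values are made up for illustration.
csv_text = """trip_id,duration_minutes,subscriber_type
1,12,Walk Up
2,45,Annual
3,8,Annual
4,30,Walk Up
"""
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)                       # first two rows
durations = df["duration_minutes"]            # one column as a Series
long_trips = df[df["duration_minutes"] > 20]  # boolean-mask row filter
subset = df.loc[df["subscriber_type"] == "Annual", ["trip_id", "duration_minutes"]]
by_position = df.iloc[1:3, 0:2]               # rows 1-2, columns 0-1
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper. `loc` selects by label or boolean mask, while `iloc` selects by integer position.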

We also spent some time examining HarvardX and MITx: Four Years of Open Online Courses, a statistical analysis of four years’ worth of data on the edX MOOC platform. Specifically, we discussed the format of the paper and several of the visualizations. Note how the authors put forth several important questions, provide brief answers to these questions, and support these answers with carefully arranged descriptive statistics and well thought out data visualizations. Spend some time going through other visualizations within the document. Do you see the points the authors are trying to make? Do you think the authors’ claims are well evidenced?

A few students mentioned difficulties with understanding the Python code. Allen Downey, the author of ThinkStats, has also written a textbook for learning Python 3 called ThinkPython2e. If you like the style of our textbook, I encourage you to check it out. Alternatively, check out this interactive textbook. Automate the Boring Stuff with Python is also a fantastic resource for learning how to do some interesting things in Python, like Sending Email and Text Messages.

Your homework assignment for next class is as follows:

  1. Read Ch. 2 of ThinkStats2. Please run through the code in the chapter.

  2. If you haven’t already done so, please read Ch. 1 of OpenStats. If you have read it, please skim through it again.

  3. Put together a proposal for your midterm and final project. I’d like you to propose 3 possible projects, each utilizing a different dataset. For each possible project, please include:

    • A guiding research question you seek to answer.
    • A dataset (or datasets) that you will analyze in order to answer the research question. If the dataset exists somewhere, please include a link to the dataset. If the dataset does not exist in a ready-to-use format, but you’d like to access data by writing a web scraper, polling an API, etc, please describe how you will do this.
    • A description of the variables within your dataset.

Please email me this proposal before the start of class next week.

As always, email me with any questions!

Lecture 1

Thank you for a very enjoyable discussion last night! I really enjoy teaching this course because of the way students are able to explore their interests through their personal projects. I’m looking forward to learning about a diverse range of topics through your projects!

As I mentioned in class, I will be using this site to post resources, links, lecture reviews, and homework assignments.

During our first lecture we spent time discussing the course syllabus, including the grading breakdown for the course and the midterm and final projects. We also spent time reviewing some of the visualizations in this Kaggle kernel. Reviewing Kaggle kernels is a great way to learn about different applied data analysis methods and visualizations.

Kaggle is also a great source of publicly available datasets. This GitHub repo contains more datasets, and Amazon and Google also host other datasets. I encourage each of you to look through these resources to find datasets for your term projects. Feel free to find other datasets or come up with your own by writing web scrapers. Here is a simple tutorial for how you might do that.

Your homework assignment for next class is as follows:

  1. Read the Preface and Chapter 1 of ThinkStats2. Please run through the code in the chapters. In order to do that you’ll have to…

  2. Install Anaconda on your machines. We will be using Python 3.6 in this class.

  3. Work on exercise 1.3 in ThinkStats2. This asks you to think about a personal project. Part of your homework assignment for next class will be to submit a formal proposal for your midterm and final projects.

  4. Skim Chapter 1 of OpenStats. This chapter contains useful methods for performing exploratory data analysis of numerical and categorical data. You’ll also get to read more about some of the visualizations we discussed in class yesterday.

If anyone has any questions, please feel free to email me!