{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "import pandas as pd\n", "from sklearn.decomposition import PCA\n", "import numpy as np\n", "import datetime\n", "\n", "from src.data.get_data import get_url_data\n", "\n", "from src.features.df_functions import str_pad, convert_time" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction \n", "An exploration of reported Chicago crime data. This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system. \n", "\n", "This report visually explores the data for trends. It also looks for trends using PCA and Gaussian Mixture Models. Through the latter, we see that crime reporting follows two distinct patterns: weekday and weekend. Weekday crime reporting tends to peak around noon. On weekends, the flow of crime reports is steady throughout the day. \n", "\n", "However, we do see some weekdays behave like weekends, and conversely, some weekends behave like weekdays. Specifically, there are Tuesdays that behave like weekends, i.e. more crime is reported. Some of those happen to be Christmas, New Year's, and July 4th, all American holidays. And the two Sundays that behaved like weekdays coincide with playoff games, which might say more about Chicagoans than the city's reported crime rates do. \n", "\n", "In summary, more crimes are reported (or perhaps committed?) during hours of leisure. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Load data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "...loading csv\n", "CPU times: user 10.7 s, sys: 1.09 s, total: 11.8 s\n", "Wall time: 13.1 s\n" ] } ], "source": [ "%%time\n", "data = get_url_data()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of observations: 1,827,766\n" ] } ], "source": [ "print('Number of observations: {:,.0f}'.format(len(data)))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# converts time column to formatted timestamps\n", "data = convert_time(data)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| Date | count |
|---|---|
| 2010-01-01 00:01:00 | 3 |
| 2010-01-01 00:02:00 | 1 |
| 2010-01-01 00:05:00 | 2 |
| 2010-01-01 00:10:00 | 6 |
| 2010-01-01 00:15:00 | 4 |

| | 2010-01-01 | 2010-01-02 | 2010-01-03 | 2010-01-04 | 2010-01-05 | 2010-01-06 | 2010-01-07 | 2010-01-08 | 2010-01-09 | 2010-01-10 | ... | 2018-09-13 | 2018-09-14 | 2018-09-15 | 2018-09-16 | 2018-09-17 | 2018-09-18 | 2018-09-19 | 2018-09-20 | 2018-09-21 | 2018-09-22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 00:01:00 | 3 | 8 | 4 | 2 | 5 | 3 | 5 | 3 | 6 | 10 | ... | 7 | 6 | 10 | 10 | 6 | 13 | 9 | 4 | 6 | 3 |
| 00:02:00 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00:03:00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00:04:00 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 00:05:00 | 2 | 2 | 2 | 3 | 2 | 3 | 3 | 2 | 0 | 3 | ... | 3 | 1 | 4 | 1 | 1 | 2 | 2 | 5 | 5 | 2 |

5 rows × 3187 columns
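
The cell that produced the time-of-day by date table above is not preserved in this output. As a rough sketch only, the reshaping could be done with a pandas `pivot_table`. The helper name `pivot_by_time_of_day` and the toy frame are hypothetical, and the notebook's own `src.features.df_functions` helpers may work differently; the toy counts are taken from the first rows shown above.

```python
import pandas as pd


def pivot_by_time_of_day(counts: pd.DataFrame) -> pd.DataFrame:
    """Reshape a timestamp-indexed 'count' column into time-of-day rows
    and calendar-date columns, filling absent combinations with 0.

    Hypothetical helper: not the notebook's actual implementation.
    """
    frame = counts.copy()
    frame["day"] = frame.index.date    # calendar date of each timestamp
    frame["time"] = frame.index.time   # clock time of each timestamp
    return frame.pivot_table(
        index="time", columns="day", values="count",
        aggfunc="sum", fill_value=0,
    )


# Toy input mirroring the first few counts shown in the tables above.
stamps = pd.to_datetime([
    "2010-01-01 00:01:00", "2010-01-01 00:02:00", "2010-01-01 00:05:00",
    "2010-01-02 00:01:00", "2010-01-02 00:05:00",
])
toy = pd.DataFrame({"count": [3, 1, 2, 8, 2]},
                   index=pd.Index(stamps, name="Date"))

pivoted = pivot_by_time_of_day(toy)
print(pivoted)
```

Filling missing minute and date combinations with 0 keeps every day the same length, which matters if each day is later treated as one observation vector.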

| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | ... | PC1429 | PC1430 | PC1431 | PC1432 | PC1433 | PC1434 | PC1435 | PC1436 | PC1437 | PC1438 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2010-01-01 | -47.152188 | 16.946524 | 3.404645 | 0.198449 | -4.633534 | -1.115146 | 1.572401 | -0.510697 | -0.577448 | -3.542706 | ... | 0.028720 | -0.020969 | 0.026601 | -0.011574 | 0.005671 | -0.019609 | -0.009661 | -0.005698 | 0.014349 | 0.011784 |
| 2010-01-02 | -35.122274 | 1.091027 | 4.436384 | 4.385195 | 6.024288 | -4.267457 | 3.185112 | 2.607650 | -2.695581 | -2.240907 | ... | -0.039363 | -0.029664 | 0.016520 | -0.027974 | -0.015012 | -0.006912 | 0.042668 | 0.007663 | -0.024588 | 0.012324 |
| 2010-01-03 | -35.027476 | 2.779998 | -3.652319 | 2.432474 | -0.912626 | 2.656788 | -0.230950 | -0.096006 | -2.876035 | 1.910059 | ... | 0.046643 | -0.023867 | 0.023256 | -0.048993 | 0.020619 | 0.043542 | 0.002398 | 0.002390 | -0.009306 | -0.024460 |
| 2010-01-04 | -25.848661 | -1.949269 | 1.086502 | -1.338002 | -2.086197 | -1.060601 | 1.763883 | 5.046089 | -2.654646 | -1.901781 | ... | 0.005797 | 0.025494 | -0.014706 | -0.080834 | 0.004578 | -0.040916 | -0.014588 | 0.037911 | -0.008122 | -0.002751 |
| 2010-01-05 | -17.676446 | -3.366142 | 2.402381 | 3.131339 | 0.537546 | 0.227890 | -8.078550 | 9.418479 | -3.550084 | 0.184492 | ... | 0.019269 | 0.004990 | 0.023384 | -0.021063 | 0.013738 | 0.017831 | 0.073808 | 0.010084 | 0.002871 | 0.028127 |

5 rows × 1438 columns
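
The introduction describes projecting each day onto principal components and then grouping days with a Gaussian Mixture Model. The cells that did this are not preserved here, so the following is only an illustrative sketch on synthetic data. The hourly bins, the Poisson rates, the two retained components, and every variable name are assumptions made for the example; the table above suggests the notebook kept all 1438 components on per-minute counts.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 60 days x 24 hourly bins (assumed sizes).
n_days, n_bins = 60, 24
dates = pd.date_range("2010-01-01", periods=n_days, freq="D")
hours = np.arange(n_bins)
is_weekend = np.asarray(dates.dayofweek >= 5)

# Assumed shapes: weekdays peak around noon, weekends stay flat all day.
noon_peak = 10 + 20 * np.exp(-0.5 * ((hours - 12) / 3) ** 2)
flat = np.full(n_bins, 18.0)
rates = np.where(is_weekend[:, None], flat, noon_peak)

# One row per day, one column per time bin (days are the observations).
daily = pd.DataFrame(rng.poisson(rates), index=dates, columns=hours)

# Project each day onto the first two principal components.
pca = PCA(n_components=2)
scores = pd.DataFrame(pca.fit_transform(daily),
                      index=daily.index, columns=["PC1", "PC2"])

# Fit a two-component Gaussian mixture in PCA space and label each day.
gmm = GaussianMixture(n_components=2, random_state=0)
labels = gmm.fit_predict(scores[["PC1", "PC2"]])
scores["cluster"] = labels
scores["weekend"] = is_weekend

# Cross-tabulate cluster labels against the calendar weekend flag.
print(pd.crosstab(scores["weekend"], scores["cluster"]))
```

Days whose cluster label disagrees with their calendar flag are the interesting ones; in the real data those would be candidates such as the holidays mentioned in the introduction.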