Lab - San Francisco Crime


Demonstrate your knowledge of the Data Analysis Lifecycle using a given data set and two tools: Python and Jupyter Notebook.

Part 1: Import the Python Packages

Part 2: Load the Data

Part 3: Prepare the Data

Part 4: Analyze the Data

Part 5: Visualize the Data

Background / Scenario

In this lab, you will import some Python packages required to analyze a data set containing San Francisco crime information. You will then use Python and Jupyter Notebook to prepare this data for analysis, analyze it, graph it, and communicate your findings.

Required Resources

  • 1 PC with Internet access
  • Raspberry Pi version 2 or higher
  • Python libraries: pandas, numpy, matplotlib, folium, datetime, and csv
  • Datafiles: Map-Crime_Incidents-Previous_Three_Months.csv

Part 1: Import the Python Packages

In this part, you will import the following Python packages necessary for the rest of this lab.


NumPy is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object and sophisticated (broadcasting) functions.


Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.


Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.


Folium is a library for creating interactive maps.

In [ ]:
# Code cell 1
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium 

Part 2: Load the Data

In this part, you will load the San Francisco crime data set and take a first look at its contents.

Step 1: Load the San Francisco Crime data into a data frame.

In this step, you will import the San Francisco crime data from a comma separated values (csv) file into a data frame.

In [ ]:
# code cell 2
# This should be a local path
dataset_path = './Data/Map-Crime_Incidents-Previous_Three_Months.csv'

# read the original dataset (in comma separated values format) into a DataFrame
SF = pd.read_csv(dataset_path)

To view the first five lines of the csv file, the Linux command head is used.

In [ ]:
# code cell 3
!head -n 5 ./Data/Map-Crime_Incidents-Previous_Three_Months.csv

Step 2: View the imported data.

a) By typing the name of the data frame variable into a cell, you can visualize the top and bottom rows in a structured way.

In [ ]:
# Code cell 4
pd.set_option('display.max_rows', 10) #Visualize 10 rows 

b) Use the columns attribute to view the names of the variables in the DataFrame.

In [ ]:
# Code cell 5

How many variables are contained in the SF data frame (ignore the Index)?

c) Use the function len to determine the number of rows in the dataset.

In [ ]:
# Code cell 6
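The inspection steps above can be tried first on a small stand-in frame. This is only a sketch: the frame toy and its values are made up for illustration, and only the column names are borrowed from the lab.

```python
import pandas as pd

# Made-up stand-in for the SF data frame; the values are illustrative only
toy = pd.DataFrame({
    "Category": ["ASSAULT", "BURGLARY", "ASSAULT"],
    "PdDistrict": ["MISSION", "NORTHERN", "MISSION"],
})

print(list(toy.columns))   # names of the variables (columns)
print(len(toy.columns))    # number of variables
print(len(toy))            # number of rows
print(toy.shape)           # (rows, columns) in a single attribute
```

The same calls should work unchanged on SF once the csv file has been loaded.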

Part 3: Prepare the Data

Now that you have the data loaded into the work environment and determined the analysis you want to perform, it is time to prepare the data for analysis.

Step 1: Extract the month and day from the Date field.

lambda is a Python keyword used to define anonymous functions. lambda allows you to specify a function in one line of code, without using def and without giving it a name. The syntax for a lambda expression is:

lambda parameters : expression.
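As a small illustration, the sketch below writes the same logic twice, once with def and once with lambda. The names month_with_def and month_with_lambda and the sample date string are made up for this example, and the MM/DD/YYYY layout is an assumption about how the date is written.

```python
# Hypothetical sample value; assumes the date is written as MM/DD/YYYY
sample_date = "07/04/2015"

# A regular named function defined with def
def month_with_def(date_string):
    return int(date_string[0:2])

# The equivalent anonymous function; it is bound to a name here
# only so that it can be called in the lines below
month_with_lambda = lambda date_string: int(date_string[0:2])

print(month_with_def(sample_date))
print(month_with_lambda(sample_date))
```

Both calls return the same integer. In practice a lambda is usually passed directly to another function, such as apply, rather than stored in a variable.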

In the following, lambda is used to create an inline function that selects only the month digits from the Date variable, and int converts that string into an integer. Then, the pandas function apply is used to apply this function to an entire column (in practice, apply implicitly loops over the column and passes its values one by one to the lambda function). The same procedure is used for the day.

In [ ]:
# Code cell 7
SF['Month'] = SF['Date'].apply(lambda row: int(row[0:2]))
SF['Day'] = SF['Date'].apply(lambda row: int(row[3:5]))
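String slicing works only if every Date value has exactly the same layout. As an alternative sketch (not the method used in this lab), the standard datetime module can parse the string and expose the month and day as integers. The helper parse_month_day, the format string '%m/%d/%Y', and the sample values are assumptions; adjust the format to match the actual file contents.

```python
from datetime import datetime

# Hypothetical sample values; the real Date column may use a different layout
sample_dates = ["07/04/2015", "08/15/2015"]

def parse_month_day(date_string, date_format="%m/%d/%Y"):
    """Return (month, day) as integers for one date string."""
    parsed = datetime.strptime(date_string, date_format)
    return parsed.month, parsed.day

for value in sample_dates:
    print(value, parse_month_day(value))
```

Unlike slicing, strptime raises a ValueError when a value does not match the expected format, which makes malformed dates easier to spot.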

To verify that these two variables were added to the SF data frame, use the print function to print some values from these columns, and type to check that the new columns do contain numerical values.

In [ ]:
# Code cell 8
In [ ]:
# Code cell 9

Step 2: Remove variables from the SF data frame.

a) The column IncidntNum contains many cells with NaN, which means the data is missing. Furthermore, IncidntNum does not add any value to the analysis, so the column can be dropped from the data frame. One way to remove unwanted variables from a data frame is the del statement.

In [ ]:
# Code cell 10
del SF['IncidntNum']

b) Similarly, the Location attribute will not be used in this analysis. It can be dropped from the data frame.

Alternatively, you can use the drop function on the data frame, specifying axis=1 to drop a column (axis=0 drops rows) and inplace=True so that the data frame is modified directly and the result does not need to be assigned to another variable.

In [ ]:
# Code cell 11
SF.drop('Location', axis=1, inplace=True )
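The difference between removing a column in place and getting back a new frame can be seen on a small made-up frame. In this sketch, the frame toy and its values are hypothetical; only the column names come from the lab.

```python
import pandas as pd

# Made-up frame with column names borrowed from the lab
toy = pd.DataFrame({
    "IncidntNum": [1, 2],
    "Category": ["ASSAULT", "BURGLARY"],
    "Location": ["(37.7, -122.4)", "(37.8, -122.4)"],
})

# del removes the column from the frame itself
del toy["IncidntNum"]

# drop without inplace=True returns a new frame and leaves toy unchanged
trimmed = toy.drop(columns=["Location"])

print(list(toy.columns))
print(list(trimmed.columns))
```

Here drop(columns=[...]) is an equivalent spelling of drop(..., axis=1). Returning a new frame keeps the original available in case the column is needed later.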

c) Check that the columns have been removed.

In [ ]:
# Code cell 12

Part 4: Analyze the Data

Now that the data frame has been prepared, it is time to analyze the data.

Step 1: Summarize variables to obtain statistical information.

a) Use the function value_counts to summarize the number of crimes committed by type, then print to display the contents of the CountCategory variable.

In [ ]:
# Code cell 13
CountCategory = SF['Category'].value_counts()

b) By default, the counts are ordered in descending order. The value of the optional parameter ascending can be set to True to reverse this behavior.
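The effect of the ascending parameter can be checked on a short made-up Series before applying it to the full data. The category labels below are illustrative and are not taken from the data set.

```python
import pandas as pd

# Illustrative labels only: three THEFT, two ASSAULT, one BURGLARY
categories = pd.Series(
    ["THEFT", "ASSAULT", "THEFT", "BURGLARY", "THEFT", "ASSAULT"]
)

descending = categories.value_counts()                # most frequent first
ascending = categories.value_counts(ascending=True)   # least frequent first

print(descending)
print(ascending)
```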

In [ ]:
# Code cell 14

What type of crime was committed the most?

c) By nesting the two functions into one command, you can accomplish the same result with one line of code.

In [ ]:
# Code cell 15

Challenge Question: Which PdDistrict had the most incidents of reported crime? Provide the Python command(s) used to support your answer.

In [ ]:
# code cell 16
# Possible code for the challenge question

Step 2: Subset the data into smaller data frames.

a) Logical indexing can be used to select only the rows for which a given condition is satisfied. For example, the following code extracts only the crimes committed in August, and stores the result in a new DataFrame.

In [ ]:
# Code cell 17
AugustCrimes = SF[SF['Month'] == 8]

How many crime incidents were there for the month of August?

How many burglaries were reported in the month of August?

In [ ]:
# code cell 18
# Possible code for the question: How many burglaries were reported in the month of August?
AugustCrimes = SF[SF['Month'] == 8]
# Filter the August subset, not the full SF data frame
AugustCrimesB = AugustCrimes[AugustCrimes['Category'] == 'BURGLARY']
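Two conditions can also be combined into a single logical index. The sketch below uses a made-up frame; toy, is_august, and is_burglary are hypothetical names introduced only for this example.

```python
import pandas as pd

# Made-up rows; only the column names come from the lab
toy = pd.DataFrame({
    "Month": [7, 8, 8, 8],
    "Category": ["BURGLARY", "BURGLARY", "ASSAULT", "BURGLARY"],
})

is_august = toy["Month"] == 8
is_burglary = toy["Category"] == "BURGLARY"

# & combines the two boolean masks element by element,
# so a row is kept only when both conditions are True
august_burglaries = toy[is_august & is_burglary]

print(len(august_burglaries))
```

When the conditions are written inline instead of stored in variables, each one needs its own parentheses because & binds more tightly than ==.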

b) To create a subset of the SF data frame for a specific day, use the query function to compare Month and Day at the same time.

In [ ]:
# Code cell 19
Crime0704 = SF.query('Month == 7 and Day == 4')
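To see that query and logical indexing select the same rows, both can be run on a small made-up frame. The frame toy and its values are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names come from the lab
toy = pd.DataFrame({
    "Month": [7, 7, 8],
    "Day": [4, 5, 4],
})

# The condition is written as a string that refers to column names
with_query = toy.query("Month == 7 and Day == 4")

# The same selection expressed with boolean masks
with_mask = toy[(toy["Month"] == 7) & (toy["Day"] == 4)]

print(with_query.equals(with_mask))
```

The query form is often easier to read when several columns are compared at once.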
In [ ]:
# Code cell 20

Part 5: Visualize the Data

Visualization and presentation of the data provide an instant overview that might not be apparent by simply looking at the raw data. The SF data frame contains longitude and latitude coordinates that can be used to plot the data.

Step 1: Plot a graph of the SF data frame using the X and Y variables.

a) Use the plot() function to plot the X and Y columns of the SF data frame. Pass the optional format string 'ro' to draw the points in red with circle markers.

In [ ]:
# Code cell 21
plt.plot(SF['X'],SF['Y'], 'ro')

b) Identify the police department districts, then build the dictionary pd_districts_levels to associate each district name with an integer.

In [ ]:
# Code cell 22
pd_districts = np.unique(SF['PdDistrict'])
pd_districts_levels = dict(zip(pd_districts, range(len(pd_districts))))
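The dict(zip(...)) pattern pairs each unique label with its position in the sorted list. A minimal sketch in plain Python follows; the labels are made up, and sorted(set(...)) stands in for np.unique, which also returns the unique values in sorted order.

```python
# Made-up district labels, including a repeat
labels = ["MISSION", "NORTHERN", "BAYVIEW", "MISSION"]

# Unique labels in sorted order (stand-in for np.unique)
unique_labels = sorted(set(labels))

# Pair each label with an integer: first label -> 0, second -> 1, and so on
label_to_code = dict(zip(unique_labels, range(len(unique_labels))))

# Look up the integer code for every original label
codes = [label_to_code[label] for label in labels]

print(label_to_code)
print(codes)
```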

c) Use apply and lambda to add the police department integer id to a new column of the DataFrame.

In [ ]:
# Code cell 23
SF['PdDistrictCode'] = SF['PdDistrict'].apply(lambda row: pd_districts_levels[row])

d) Use the newly created PdDistrictCode column to automatically change the color of each point.

In [ ]:
# Code cell 24
plt.scatter(SF['X'], SF['Y'], c=SF['PdDistrictCode'])

Step 2: Add Map packages to enhance the plot.

In Step 1, you created a simple plot that displays where crime incidents took place in SF County. This plot is useful, but folium provides additional functions that will allow you to overlay this plot onto an OpenStreetMap map.

a) Folium requires the color of the marker to be specified using a hexadecimal value. For this reason, we use the colors package, and select the necessary colors.

In [ ]:
# Code cell 25
from matplotlib import colors
districts = np.unique(SF['PdDistrict'])

b) Create a color dictionary for each police department district.

In [ ]:
# Code cell 26
color_dict = dict(zip(districts, list(colors.cnames.values())[0:-1:len(districts)])) 
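The slice [0:-1:len(districts)] picks colors at regular intervals so that neighboring districts do not get adjacent entries of the color list. The sketch below shows the same slicing on a short stand-in list; toy_palette, toy_districts, and toy_color_dict are hypothetical names, and the hex strings are arbitrary.

```python
# Stand-in for list(colors.cnames.values()); the hex strings are arbitrary
toy_palette = ["#000000", "#111111", "#222222", "#333333", "#444444",
               "#555555", "#666666", "#777777", "#888888", "#999999"]
toy_districts = ["A", "B", "C"]   # hypothetical district names

# Start at index 0, stop before the last element, step by len(toy_districts)
picked = toy_palette[0:-1:len(toy_districts)]

# zip stops at the shorter input, so extra colors are simply ignored
toy_color_dict = dict(zip(toy_districts, picked))

print(picked)
print(toy_color_dict)
```

Because zip stops at the shorter input, this approach only covers every district when the slice yields at least as many colors as there are districts.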

c) Create the map using the middle coordinates of the SF data to center the map (using mean). To reduce the computation time, plotEvery is used to limit the amount of plotted data. Set this value to 1 to plot all the rows (the map might take a long time to render).

In [ ]:
# Code cell 27
# Create map
map_osm = folium.Map(location=[SF['Y'].mean(), SF['X'].mean()], zoom_start = 12)
plotEvery = 50
obs = list(zip( SF['Y'], SF['X'], SF['PdDistrict'])) 

for el in obs[0:-1:plotEvery]: 
    folium.CircleMarker(el[0:2], color=color_dict[el[2]], fill_color=color_dict[el[2]], radius=10).add_to(map_osm)
In [ ]:
# Code cell 28

© 2017 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.