Demonstrate your knowledge of the Data Analysis Lifecycle using a given set of data and the tools, Python and Jupyter Notebook
In this lab, you will import some Python packages required to analyze a data set containing San Francisco crime information. You will then use Python and Jupyter Notebook to prepare this data for analysis, analyze it, graph it, and communicate your findings.
In this part, you will import the following Python packages necessary for the rest of this lab.
NumPy is the fundamental package for scientific computing with Python. It contains among other things: a powerful N-dimensional array object and sophisticated (broadcasting) functions.
Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy.
Folium is a library for creating interactive maps.
# Code cell 1
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import folium
In this part, you will load the San Francisco Crime Dataset and the Python packages necessary to analyze and visualize it.
In this step, you will import the San Francisco crime data from a comma separated values (csv) file into a data frame.
# Code cell 2
# This should be a local path
dataset_path = './Data/Map-Crime_Incidents-Previous_Three_Months.csv'
# Read the original dataset (in comma separated values format) into a DataFrame
SF = pd.read_csv(dataset_path)
To view the first five lines of the csv file, the Linux command head is used.
# Code cell 3
!head -n 5 ./Data/Map-Crime_Incidents-Previous_Three_Months.csv
a) By typing the name of the data frame variable into a cell, you can visualize the top and bottom rows in a structured way.
# Code cell 4
pd.set_option('display.max_rows', 10)  # Visualize 10 rows
SF
b) Use the columns attribute to view the names of the variables in the DataFrame.
# Code cell 5
SF.columns
How many variables are contained in the SF data frame (ignore the Index)?
c) Use the function len to determine the number of rows in the dataset.
# Code cell 6
len(SF)
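The two checks above can also be read off in one step. The following is a minimal sketch on a hypothetical miniature data frame (the name toy and its values are made up for illustration, not taken from the lab data); it shows how shape relates to len and columns.

```python
import pandas as pd

# Hypothetical miniature data frame standing in for SF.
toy = pd.DataFrame({
    "Category": ["BURGLARY", "ASSAULT", "BURGLARY"],
    "PdDistrict": ["MISSION", "CENTRAL", "MISSION"],
})

# shape is a (rows, columns) tuple.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# len counts rows; columns holds the variable names.
print(len(toy) == n_rows)
print(len(toy.columns) == n_cols)
```

On the real data, SF.shape should report the same row count as len(SF).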
Now that you have the data loaded into the work environment and determined the analysis you want to perform, it is time to prepare the data for analysis.
lambda is a Python keyword for defining so-called anonymous functions. lambda allows you to specify a function in one line of code, without using def and without giving it a specific name. The syntax for a lambda expression is: lambda parameters : expression.
In the following code, lambda is used to create an inline function that selects only the month digits from the Date variable, and int converts that string representation into an integer. Then, the pandas function apply is used to apply this function to an entire column (in practice, apply implicitly runs a for loop and passes the values of the column one by one to the lambda function). The same procedure is used for the Day.
# Code cell 7
SF['Month'] = SF['Date'].apply(lambda row: int(row[0:2]))
SF['Day'] = SF['Date'].apply(lambda row: int(row[3:5]))
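To make the relationship between def, lambda, and apply concrete, here is a small sketch. The sample dates are hypothetical and are assumed to be MM/DD/YYYY strings, which is what the slices 0:2 and 3:5 in the cell above imply; month_of is an illustrative helper name.

```python
import pandas as pd

# Hypothetical dates, assumed to be MM/DD/YYYY strings.
dates = pd.Series(["08/15/2014", "07/04/2014"])

# A named function written with def ...
def month_of(text):
    return int(text[0:2])

# ... and the equivalent anonymous function written with lambda.
month_lambda = lambda text: int(text[0:2])

# apply calls the function once per value in the column.
months_def = dates.apply(month_of)
months_lambda = dates.apply(month_lambda)
days = dates.apply(lambda text: int(text[3:5]))

print(months_lambda.tolist())
print(days.tolist())
```

Both versions produce the same column; the lambda form simply avoids naming a function that is used only once.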
To verify that these two variables were added to the SF data frame, use print to display their first values and type to inspect the new columns, which should now contain numerical values.
# Code cell 8
print(SF['Month'][0:2])
print(SF['Day'][0:2])
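# Code cell 9
print(type(SF['Month']))  # the column itself is a pandas Series
print(SF['Month'].dtype)  # the dtype shows whether the values are numeric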
a) The column IncidntNum contains many cells with NaN, which means the data is missing. Furthermore, IncidntNum does not provide any value to the analysis, so the column can be dropped from the data frame. One way to remove unwanted variables from a data frame is by using the del statement.
# Code cell 10
del SF['IncidntNum']
b) Similarly, the Location attribute will not be used in this analysis. It can be dropped from the data frame.
Alternatively, you can use the drop function on the data frame, specifying that the axis is 1 (columns; 0 is for rows) and that the change is made in place, so the result does not need to be assigned to another variable (inplace=True).
# Code cell 11
SF.drop('Location', axis=1, inplace=True)
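As a point of comparison, drop can also return a new data frame and leave the original untouched. The sketch below uses a hypothetical miniature data frame (toy and its values are made up for illustration) and the columns keyword, which is an alternative to passing axis=1.

```python
import pandas as pd

# Hypothetical miniature data frame with the two unwanted columns.
toy = pd.DataFrame({
    "IncidntNum": [1.0, None],
    "Location": ["(37.7, -122.4)", "(37.8, -122.4)"],
    "Category": ["BURGLARY", "ASSAULT"],
})

# Without inplace=True, drop returns a new data frame
# and the original keeps all of its columns.
trimmed = toy.drop(columns=["IncidntNum", "Location"])

print(list(toy.columns))
print(list(trimmed.columns))
```

Keeping the original intact can be convenient in a notebook, because re-running the cell does not fail on a column that has already been removed.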
c) Check that the columns have been removed.
# Code cell 12
SF.columns
Now that the data frame has been prepared with the data, it is time to analyze the data.
a) Use the function value_counts to summarize the number of crimes committed by type, then print the result.
# Code cell 13
CountCategory = SF['Category'].value_counts()
print(CountCategory)
b) By default, the counts are ordered in descending order. The value of the optional parameter ascending can be set to True to reverse this behavior.
# Code cell 14
SF['Category'].value_counts(ascending=True)
What type of crime was committed the most?
c) By nesting the two functions into one command, you can accomplish the same result with one line of code.
# Code cell 15
print(SF['Category'].value_counts(ascending=True))
Challenge Question: Which PdDistrict had the most incidents of reported crime? Provide the Python command(s) used to support your answer.
# Code cell 16
# Possible code for the challenge question
print(SF['PdDistrict'].value_counts(ascending=True))
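Instead of reading the answer off the printed table, the top entry can be extracted directly. This is a minimal sketch on hypothetical district labels (the values are made up, not taken from the lab data), using idxmax and max on the counts.

```python
import pandas as pd

# Hypothetical district labels, one per reported incident.
districts = pd.Series(
    ["MISSION", "CENTRAL", "MISSION", "SOUTHERN", "MISSION", "CENTRAL"]
)

# value_counts sorts in descending order by default.
counts = districts.value_counts()

# idxmax returns the label with the highest count, max the count itself.
top_district = counts.idxmax()
top_count = counts.max()
print(top_district, top_count)
```

The same two calls should work on SF['PdDistrict'].value_counts() or SF['Category'].value_counts().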
a) Logical indexing can be used to select only the rows for which a given condition is satisfied. For example, the following code extracts only the crimes committed in August, and stores the result in a new DataFrame.
# Code cell 17
AugustCrimes = SF[SF['Month'] == 8]
AugustCrimes
How many crime incidents were there for the month of August?
How many burglaries were reported in the month of August?
# Code cell 18
# Possible code for the question: How many burglaries were reported in the month of August?
AugustCrimes = SF[SF['Month'] == 8]
# Filter the August subset (not the full data frame) by category
AugustCrimesB = AugustCrimes[AugustCrimes['Category'] == 'BURGLARY']
len(AugustCrimesB)
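Two conditions can also be combined into a single logical index. The following sketch uses a hypothetical miniature data frame (toy and its values are made up for illustration) and the & operator on two boolean masks.

```python
import pandas as pd

# Hypothetical miniature data frame standing in for SF.
toy = pd.DataFrame({
    "Month": [8, 8, 7, 8],
    "Category": ["BURGLARY", "ASSAULT", "BURGLARY", "BURGLARY"],
})

# Each comparison needs its own parentheses,
# because & binds more tightly than ==.
mask = (toy["Month"] == 8) & (toy["Category"] == "BURGLARY")
august_burglaries = toy[mask]

print(len(august_burglaries))
```

Only rows that satisfy both conditions are kept, so burglaries from other months are excluded from the count.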
b) To create a subset of the SF data frame for a specific day, use the query function to compare Month and Day at the same time.
# Code cell 19
Crime0704 = SF.query('Month == 7 and Day == 4')
Crime0704
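The query string can also refer to Python variables by prefixing them with @. The sketch below wraps this in a hypothetical helper, crimes_on (a name introduced here for illustration), and checks that it selects the same rows as logical indexing on a made-up data frame.

```python
import pandas as pd

# Hypothetical miniature data frame standing in for SF.
toy = pd.DataFrame({
    "Month": [7, 7, 8],
    "Day": [4, 5, 4],
    "Category": ["ASSAULT", "BURGLARY", "ASSAULT"],
})

def crimes_on(frame, month, day):
    # Inside the query string, @name refers to a Python variable
    # rather than to a column of the data frame.
    return frame.query("Month == @month and Day == @day")

july_4 = crimes_on(toy, 7, 4)

# The same selection written with logical indexing.
same_rows = toy[(toy["Month"] == 7) & (toy["Day"] == 4)]
print(july_4.equals(same_rows))
```

Passing the date as parameters avoids editing the query string each time a different day is needed.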
# Code cell 20
SF.columns
Visualization and presentation of the data provides an instant overview that might not be apparent by simply looking at the raw data. The SF data frame contains longitude and latitude coordinates that can be used to plot the data.
a) Use the plot() function to plot the SF data frame. Use the optional format parameter ro to plot the points in red (r) with a circle as the marker shape (o).
# Code cell 21
plt.plot(SF['X'], SF['Y'], 'ro')
plt.show()
b) Identify the police department districts, then build the dictionary pd_districts_levels to associate each district name with an integer.
# Code cell 22
pd_districts = np.unique(SF['PdDistrict'])
pd_districts_levels = dict(zip(pd_districts, range(len(pd_districts))))
pd_districts_levels
c) Use apply and lambda to add the police department integer ID to a new column of the DataFrame.
# Code cell 23
SF['PdDistrictCode'] = SF['PdDistrict'].apply(lambda row: pd_districts_levels[row])
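The name-to-integer encoding can be tried on a small example. This sketch uses hypothetical district names (the variables names, unique_names, and levels are illustrative) and also shows Series.map with a dictionary, which gives the same result as apply with a lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical district names, one per incident.
names = pd.Series(["MISSION", "CENTRAL", "MISSION", "BAYVIEW"])

# np.unique returns the distinct values in sorted order.
unique_names = np.unique(names)
levels = dict(zip(unique_names, range(len(unique_names))))

# Look up each name with apply and a lambda ...
codes_apply = names.apply(lambda name: levels[name])
# ... or pass the dictionary to map directly.
codes_map = names.map(levels)

print(levels)
print(codes_map.tolist())
```

Because np.unique sorts its output, the integer assigned to each district follows alphabetical order.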
d) Use the newly created PdDistrictCode column to automatically set the color of each point.
# Code cell 24
plt.scatter(SF['X'], SF['Y'], c=SF['PdDistrictCode'])
plt.show()
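The sketch below shows the same idea on a few hypothetical coordinates, with a colorbar and axis labels added so the integer codes can be read from the figure. It assumes the Agg backend so that it can run outside a notebook; inside Jupyter with %matplotlib inline that line is not needed. The coordinate values and the choice of the viridis colormap are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, outside a notebook
import matplotlib.pyplot as plt

# Hypothetical coordinates and district codes.
xs = [-122.42, -122.41, -122.40, -122.43]
ys = [37.77, 37.78, 37.76, 37.75]
codes = [0, 1, 0, 2]

fig, ax = plt.subplots()
# c maps each integer code to a color taken from the colormap.
points = ax.scatter(xs, ys, c=codes, cmap="viridis")
fig.colorbar(points, ax=ax, label="district code")
ax.set_xlabel("X (longitude)")
ax.set_ylabel("Y (latitude)")

n_points = len(points.get_offsets())
print(n_points)
```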
In Step 1, you created a simple plot that displays where crime incidents took place in SF County. folium provides additional functions that allow you to overlay this plot onto an OpenStreetMap map.
a) Folium requires the color of the marker to be specified as a hexadecimal value. For this reason, we use the colors package from matplotlib and select the necessary colors.
# Code cell 25
from matplotlib import colors
districts = np.unique(SF['PdDistrict'])
print(list(colors.cnames.values())[0:len(districts)])
b) Create a color dictionary for each police department district.
# Code cell 26
color_dict = dict(zip(districts, list(colors.cnames.values())[0:-1:len(districts)]))
color_dict
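To see what the color dictionary holds, here is a small sketch with three hypothetical district names. It relies on colors.cnames mapping color names to hexadecimal strings, and it uses colors.to_hex to convert a single named color; the district list is made up for illustration.

```python
from matplotlib import colors

# Hypothetical district names.
districts = ["BAYVIEW", "CENTRAL", "MISSION"]

# cnames maps color names to hexadecimal strings such as '#F0F8FF'.
all_hex = list(colors.cnames.values())

# Stepping through the list spreads the picks across the palette;
# zip stops as soon as the districts run out.
color_dict = dict(zip(districts, all_hex[::len(districts)]))
print(color_dict)

# to_hex converts one named color to its hexadecimal form.
red_hex = colors.to_hex("red")
print(red_hex)
```

Taking every n-th color rather than the first n tends to give colors that are easier to tell apart, since neighboring entries in the list are often similar shades.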
c) Create the map using the middle coordinates of the SF data to center the map (using mean). To reduce the computation time, plotEvery is used to limit the amount of plotted data. Set this value to 1 to plot all the rows (the map might take a long time to render).
# Code cell 27
# Create map
map_osm = folium.Map(location=[SF['Y'].mean(), SF['X'].mean()], zoom_start=12)
plotEvery = 50
obs = list(zip(SF['Y'], SF['X'], SF['PdDistrict']))
for el in obs[0:-1:plotEvery]:
    # el is (Y, X, PdDistrict); el[2] is the district name used to look up the color
    folium.CircleMarker(el[0:2], color=color_dict[el[2]], fill_color=color_dict[el[2]], radius=10).add_to(map_osm)
# Code cell 28
map_osm
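The subsampling in the map cell relies on slice notation with a step. The following sketch uses a plain list of hypothetical values (obs and plot_every here are stand-ins, not the lab variables) to show which elements a slice of the form 0:-1:step keeps, compared with ::step.

```python
# Hypothetical observations: ten items standing in for the rows of SF.
obs = list(range(10))
plot_every = 3

# Stop index -1 excludes the last element before stepping.
without_last = obs[0:-1:plot_every]

# Omitting start and stop steps through the whole list.
full_stride = obs[::plot_every]

print(without_last)  # [0, 3, 6]
print(full_stride)   # [0, 3, 6, 9]
```

With a step of 1, the first form still leaves out the final row, so the second form is the one that plots every row.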
© 2017 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.