Title

Lab - Internet Meter Data Analysis

Objectives

  • **Part 1: Collect and Store Data**
  • **Part 2: Manipulate Data**
  • Scenario/Background

    In this lab, you will acquire Internet speed statistics and store the live data in a comma separated values (csv) file. You will also load the stored data from the csv file to a Python data structure, the Pandas DataFrame, and use its functionalities to explore the data and manipulate it so that it is easily readable.

    Required Resources

    • 1 PC with Internet access
    • Raspberry Pi version 2 or higher
    • Python libraries: datetime, csv, subprocess, pandas, numpy
    • Datafiles: rpi_data_long.csv

    Part 1: Collect and Store Data

    The goal of this first part of the lab is to gather internet speed measurements through the Raspberry Pi. Three kinds of measurements will be collected:

    1. Ping speed
    2. Download speed
    3. Upload speed

    Step 1: Install Speedtest and Import Python Libraries.

    In this step, you will install Speedtest and import Python libraries.

    Speedtest-cli is a Python script that measures the upload and download speed of your Internet connection. For more information about speedtest, go to https://github.com/sivel/speedtest-cli.

    a) Install speedtest-cli.

    In [ ]:
    # Code cell 1
    !pip install speedtest-cli
    

    This cli allows the Jupyter notebook to connect to the website and store the data.

    b) Import the necessary Python libraries.

    In [ ]:
    # Code cell 2
    # Python library to manage date and time data
    import datetime
    # Python library to read and write csv files
    import csv
    # Python library to execute bash commands from the notebook.
    # If you want to know more about this, check this resource: 
    # http://www.pythonforbeginners.com/os/subprocess-for-system-administrators
    import subprocess
    

    Step 2: Generate timestamps using the datetime package.

    In this lab, measurements of Internet speed statistics will be generated. A crucial step in data acquisition for the majority of data analytics applications, is to associate a timestamp to measurements.

    a) To generate a timestamp, use the datetime.now function of the datetime package:

    In [ ]:
    # Code cell 3
    date_time = datetime.datetime.now()
    print(date_time, type(date_time))
    

    b) An instance of the class datetime cannot be directly written to in text form. The function strftime parses the date information into a string. The arguments of this function determine the format of the output sting. A description of these parameters can be found in the documentation of the strftime function at https://docs.python.org/2/library/time.html.

    In [ ]:
    # Code cell 4
    date_time.strftime('%a, %d %b %Y %H:%M:%S')
    

    After reading the documentation of the strftime function, generate a timestamp and parse it into a string with the following format: YYYY-MM-DD HH:MM:SS.

    In [ ]:
    # Code cell 5
    # enter your code
    

    Step 3: Run the process and collect the output with Python.

    The speedtest-cli command, if run from a terminal, returns a string with download and upload speeds. To run the command from this notebook, it is necessary to use the Python module subprocess, which allows running a process directly from the notebook code cell.

    a) Run a speed test using the speedtest-cli command from Python. The output will be stored in the process_output variable.

    In [ ]:
    # Code cell 6
    # This string contains the command line to interface with speedtest.net
    speedtest_cmd = "speedtest-cli --simple"
    # Execute the process
    process = subprocess.Popen(speedtest_cmd.split(), stdout=subprocess.PIPE)
    # Collect the command output
    process_output = process.communicate()[0]
    

    b) Print the process output. Notice the type for the process_output variable.

    In [ ]:
    # Code cell 7
    print(process_output, type(process_output))
    

    c) The speed test result is split, and a timestamp is appended to the results.

    In [ ]:
    # Code cell 8
    # Store the time at which the speedtest was executed
    date_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    process_output = process_output.split()
    process_output.append(date_time)
    print(process_output, type(process_output))
    

    d) The speedtest() function is created to return the results from the speedtest-cli command.

    In [ ]:
    # Code cell 9
    # function to excute the speed test
    def speedtest():
        # We need to store the time at which the speedtest was executed
        date_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        # This is a string that contains what we would write on the command line 
        #to interface with speedtest.net
        speedtest_cmd = "speedtest-cli --simple"
        # We now execute the process: 
        process = subprocess.Popen(speedtest_cmd.split(), stdout=subprocess.PIPE)
        process_output = process.communicate()[0]
        process_output = process_output.split()
        # and we add the date and time 
        process_output.append(date_time)
        return process_output
    

    What does the speedtest() function return? What is the code to view results from the speedtest() function?

    In [ ]:
    # Code cell 10 
    # Code to view the results from speedtest() function
    

    Step 4: Store the output of the speedtest() function.

    The comma separated values (csv) is the most common import and export format for spreadsheets and databases. To learn more information about working with csv in Python, navigate to https://docs.python.org/2/library/csv.html.

    a) Create a file named test.txt in the /tmp directory and write "test_msg" in the file.

    In [ ]:
    # Code cell 11
    with open("/tmp/test.txt",'w') as f:
        f.write('test_msg')
    

    b) Use the Linux command cat to verify the creation and content of the file.

    In [ ]:
    # Code cell 12
    !cat /tmp/test.txt
    

    c) To check that the file was successfully open:

    In [ ]:
    # Code cell 13
    with open("/tmp/test.txt",'r') as f:
       str = f.read()
    print(str)
    

    d) Understanding the meaning of the with statement, especially in combination with try and except is not required for the rest of this lab, but a useful resource about this is http://effbot.org/zone/python-with-statement.htm.

    To write into a csv file, it is necessary to create a csv.writer object. Check https://docs.python.org/2/library/csv.html and find out which function of the 'csv.writer' object can be used to add a row to a csv file.

    In [ ]:
    # Code cell 14
    # function to save data to csv
    def save_to_csv(data, filename):
        try:
            # If the file exists, we want to append a new line to it, with the 
            #results of the current experiment
            with open(filename + '.csv', 'a') as f:
                wr = csv.writer(f)
                wr.writerow(data)
        except:
            # If it does not exist, create the file first
            with open(filename + '.csv', 'w') as f:
                # Hint: This is similar to appending new lines to a file.
                # Create a csv writer object
                # ADD CODE HERE
                # Save (write) to file
                # ADD CODE HERE
    

    Step 5: Check the collected data.

    Write a function to open a csv file and print its content to screen. You can find an example in the 13.1.5 section of https://docs.python.org/2/library/csv.html

    In [ ]:
    # Code cell 15
    def print_from_csv(filename): 
        with open(filename + '.csv', 'r') as f:
                re = csv.reader(f)
                # 1. Loop over the rows
                # 2. print
    

    Now, all the functions needed to collect and store Internet speed data are finished.

    Step 6: Run the Speedtest multiple times and store the data.

    a) Write a for loop that calls the speedtest 5 times, prints the output of the tests, and stores the data in a csv file.

    In [ ]:
    # Code cell 16
    for i in range(5):
        speedtest_output = speedtest()
        print('Test number {}'.format(i))
        print(speedtest_output)
        save_to_csv(speedtest_output, '/tmp/rpi_data_test')
        
    

    b) Display the file to verify that the data has been saved correctly.

    In [ ]:
    # Code cell 17
    print_from_csv('/tmp/rpi_data_test')
    

    If a bigger dataset is needed, the speedtest can be running in the background for more samples.

    How would you change the code if you wanted to run the code 100 times?

    In [ ]:
    # Code cell 18
    # Code to run 100 times
    # for i in xrange(100):
    #     speedtest_output = speedtest()
    #     print 'Test number: {}'.format(i)
    #     print speedtest_output
    #     save_to_csv(speedtest_output, '/tmp/rpi_data')
    

    Part 2: Manipulate Data

    The Python library pandas is very useful for working with structured data. The full documentation can be found here: http://pandas.pydata.org/pandas-docs/version/0.14.1/</font>

    A larger dataset collected in advance will be used for this part of the lab. The filename is rpi_data_long.csv.

    Step 1: Import the Python libraries.

    Import pandas and the other libraries used for the next tasks.

    In [ ]:
    # Code cell 19
    import datetime
    import csv
    import pandas as pd
    # NumPy is a library that adds support for large, multi-dimensional arrays and matrices
    # along with high-level mathematical functions to operate on these arrays
    import numpy as np
    

    Step 2: Load the csv file into a DataFrame object using pandas.

    A pandas DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table. The pandas library function read_csv automatically converts a csv file into a DataFrame object.

    Read the read_csv documentation in http://pandas.pydata.org/pandas-docs/version/0.14.1/generated/pandas.read_csv.html. This function contains a lot of parameters. The only non-optional one is the filepath, i.e. the location of the csv file. All the other parameters are optional.

    In this step, you will import and view the content of the csv file, rpi_data_long.csv. This csv file is located in the same directory as this Jupyter notebook.

    a) Assign the file rpi_data_long.csv to the variable data_file.

    In [ ]:
    # Code cell 20
    data_file = './Data/rpi_data_long.csv'
    

    b) Use the Linux command head to view the first 10 lines of the csv file.

    In [ ]:
    # Code cell 21
    !head -n 5 ./Data/rpi_data_long.csv
    

    c) Use the names parameter of the read_csv function to specify the name of the DataFrame columns.

    In [ ]:
    # Code cell 22
    column_names = [ 'Type A', 'Measure A', 'Units A',
                     'Type B', 'Measure B', 'Units B',
                     'Type C', 'Measure C', 'Units C', 
                     'Datetime']
    

    d) Use the read_csv function to read from data_file and assign column_names as the column names in the dataframe.

    In [ ]:
    # Code cell 23
    with open(data_file, 'r') as f:
        df_redundant = pd.read_csv(f, names = column_names)
    

    e) The command head() displays the first few rows of the dataframe.

    In [ ]:
    # Code cell 24
    # You can specify the number of rows you want to print to screen: 
    # you do so passing the number as an argument to the function
    # (e.g., head(10))
    df_redundant.head()
    

    What is the code to read the first 20 lines of the csv file?

    Step 3: Create a concise representation.

    In this step, you will create a more compact representation using a copy of the data frame df_redundant.

    a) Copy df_redundant into another dataframe called df_compact using copy().

    In [ ]:
    # Code cell 25
    df_compact = df_redundant.copy()
    

    b) Rename the columns relative to the measures as shown:

    Measure A -> Ping (ms)
    Measure B -> Download (Mbit/s)
    Measure C -> Upload (Mbit/s)
    In [ ]:
    # Code cell 26
    df_compact.rename(columns={'Measure A':'Ping (ms)', 
                               'Measure B': 'Download (Mbit/s)',
                               'Measure C': 'Upload (Mbit/s)'}, inplace=True)
    df_compact.head(3)
    

    c) Because the Types and Units columns are no longer necessary, these columns can be dropped.

    In [ ]:
    # Code cell 27
    df_compact.drop(['Type A', 'Type B', 'Type C',
             'Units A', 'Units B', 'Units C'], axis=1, inplace=True)
    df_compact.head()
    

    In the table above, the Datetime field is a string. Pandas and Python offer a number of operations to work with date and time that can be very helpful.

    In the next step, the string in the Datetime column will be separated into two new columns.

    Step 4: Separate data into two columns.

    In this step, you will use Pandas to generate the columns Date and Time from the column Datetime and then drop the Datetime column.

    The lambda function is used create two anonymous functions that extract only the date and the time from a datetime object, respectively. Then, use the pandas function apply to apply this function to an entire column (in practice, apply implicitly defines a for loop and passes the rows one by one to our lambda function). Store the result of the apply functions in two new columns of the DataFrame.

    a) Apply the lambda function to iterate through the data frame to split the date from the Datetime column.

    In [ ]:
    # Code cell 28
    df_compact['Date'] = df_compact['Datetime'].apply(lambda dt_str: pd.to_datetime(dt_str).date())
    

    b) Repeat step a to split time from Datetime column.

    In [ ]:
    # Code cell 29
    # Please note, this requires an intermediate step, because of how NaT are treated by the time() function.
    # Reference: https://github.com/pandas-dev/pandas/issues/11453
    temp = df_compact['Datetime'].apply(lambda dt_str: pd.to_datetime(dt_str))
    df_compact['Time'] = temp.dt.time
    

    c) All the information for the Datetime column has now been copied to the Date and Time columnns. The Datetime column serves no purpose. The Datetime column can be dropped from the data frame.

    Enter the code to drop the Datetime column in the cell below.

    In [ ]:
    # Code cell 30
    

    Enter the code to print out the first 3 rows of the data frame to verify the changes.

    In [ ]:
    # Code cell 31
    

    d) Use the type function to print out the variable type of the values in the Date and Time columns.

    In [ ]:
    # Code cell 32
    print(df_compact['Date'][0], type(df_compact['Date'][0]) )
    print(df_compact['Time'][0], type(df_compact['Time'][0]) )
    

    Step 5: Save the new data frame.

    Save the pandas dataframe df_compact as a csv file called rpi_data_compact:

    In [ ]:
    # Code cell 33
    df_compact.to_csv('./Data/rpi_data_compact.csv')
    

    © 2017 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public.