Exercise: Transport Mode Classification (Data Preparation)

9.4. Exercise: Transport Mode Classification (Data Preparation)

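In this exercise we analyze GPS trajectory data. This notebook explores the GPS trajectories and prepares the transport mode label for each trip. The next notebook examines the characteristics of the data, and the features generated there are then used to infer the transport mode with machine learning.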

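Here we use Geolife, a publicly available GPS trajectory dataset. It consists of GPS trajectories collected from 182 users over more than five years (April 2007 to August 2012) in Beijing and other areas. Each trajectory is represented as a sequence of timestamped points, and each point carries latitude, longitude, and altitude.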

Because the dataset is relatively large, this notebook and the following ones use the trajectory data of a single user only.

If geopandas is not installed yet, open a CLI such as Terminal and install it with pip install geopandas.

# import necessary packages
import geopandas as gpd
import os
import pandas as pd
import numpy as np
import shapely

9.4.1. Loading and Preparing the Data

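Download the Geolife Zip file from the dataset's download link, extract it, and save it in the same location as this notebook. If you save it elsewhere, change the data PATH in the cell below before running it.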

Here we use the data of one user only (user code: 010).

user = '010'
# PLEASE REPLACE THE BELOW PATH IF NECESSARY 
path = f"./Geolife Trajectories 1.3/Data/{user}/Trajectory"

# List all the files in the target path
files = os.listdir(path)

# Define column names (the third field of each record is skipped via usecols)
column_name = ['latitude','longitude','height','days_total','date','time']

# Iterate through the files in the folder and read each .plt file individually.
df_lst = []
for f in files:
    if f.endswith('plt'):
        fpath = os.path.join(path, f)
        # the first 6 lines of each file are header lines, so skip them
        df = pd.read_csv(fpath, skiprows=6, usecols=[0,1,3,4,5,6], names=column_name)
        # combine date and time into one timestamp column and tag the rows with the user code
        df = df.assign(record_dt = lambda x: pd.to_datetime(x['date'] + ' ' + x['time']), user = user)
        df_lst.append(df)
# Concat all read files into one DataFrame.
traj_df = pd.concat(df_lst)

# Print out the head rows
traj_df.head(3)

	latitude	longitude	height	days_total	date	time	record_dt	user
0	39.138159	117.217108	-36	39805.961748	2008-12-23	23:04:55	2008-12-23 23:04:55	010
1	39.138196	117.217068	-72	39805.961759	2008-12-23	23:04:56	2008-12-23 23:04:56	010
2	39.138268	117.217034	-59	39805.961771	2008-12-23	23:04:57	2008-12-23 23:04:57	010
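The days_total column looks like a serial date. The sketch below checks one reading of it against the three rows shown above. The epoch of 1899-12-30 is an assumption, not something this notebook states. The rows are consistent with it, and only up to the six-decimal rounding of the printed values.

```python
import pandas as pd

# Values copied from the first three rows shown above.
days_total = pd.Series([39805.961748, 39805.961759, 39805.961771])
record_dt = pd.Series(pd.to_datetime([
    "2008-12-23 23:04:55",
    "2008-12-23 23:04:56",
    "2008-12-23 23:04:57",
]))

# ASSUMPTION: days_total counts (fractional) days since 1899-12-30.
epoch = pd.Timestamp("1899-12-30")
from_days = epoch + pd.to_timedelta(days_total, unit="D")

# Absolute difference between the reconstructed and the parsed timestamps.
diff_seconds = (from_days - record_dt).dt.total_seconds().abs()
print(diff_seconds)
```

If the assumption holds, the differences stay well below one second, so record_dt built from the date and time columns carries the same information as days_total.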

# check the number of rows and columns
traj_df.shape

(935576, 8)
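Two properties of the loading loop above are worth keeping in mind. os.listdir returns files in arbitrary order, and pd.concat without ignore_index repeats the row labels of each file. The sketch below shows one possible variant on two tiny temporary files. The file names, the placeholder header lines, and the filler third field are all made up for illustration; only the two records come from the output above.

```python
import tempfile
from pathlib import Path

import pandas as pd

column_name = ['latitude', 'longitude', 'height', 'days_total', 'date', 'time']

# PLACEHOLDER header: the six skipped lines are not shown in this notebook,
# so dummy text stands in for them.
header = "".join(f"placeholder header line {i}\n" for i in range(6))

# Hypothetical file names. The third field is dropped by usecols, so the 0 is a filler.
sample_files = {
    "track_1.plt": "39.138196,117.217068,0,-72,39805.961759,2008-12-23,23:04:56\n",
    "track_2.plt": "39.138159,117.217108,0,-36,39805.961748,2008-12-23,23:04:55\n",
    "notes.txt": "not a trajectory file\n",
}

with tempfile.TemporaryDirectory() as tmp:
    for name, body in sample_files.items():
        text = header + body if name.endswith(".plt") else body
        (Path(tmp) / name).write_text(text)

    frames = []
    # glob keeps only .plt files; sorted() makes the read order reproducible
    for fpath in sorted(Path(tmp).glob("*.plt")):
        df = pd.read_csv(fpath, skiprows=6, usecols=[0, 1, 3, 4, 5, 6], names=column_name)
        df["record_dt"] = pd.to_datetime(df["date"] + " " + df["time"])
        df["source_file"] = fpath.name
        frames.append(df)

# ignore_index gives unique row labels; sorting puts the points in time order
demo_df = (pd.concat(frames, ignore_index=True)
             .sort_values("record_dt")
             .reset_index(drop=True))
print(demo_df[["record_dt", "source_file"]])
```

Unique row labels and a time-ordered table tend to make later steps, such as computing differences between consecutive points, easier to reason about. The notebook's own results do not depend on this variant.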

# load the file containing the transport mode labels (ground truth);
# labels.txt sits in the user's folder, one level above the Trajectory folder
file_path = os.path.join(os.path.dirname(path), "labels.txt")
trip_trans = pd.read_csv(file_path, sep="\t")
# convert to pandas datetime so the trip times can be compared with record_dt
trip_trans['Start Time'] = pd.to_datetime(trip_trans['Start Time'])
trip_trans['End Time'] = pd.to_datetime(trip_trans['End Time'])

# print out the head rows
trip_trans.head(3)

	Start Time	End Time	Transportation Mode
0	2007-06-26 11:32:29	2007-06-26 11:40:29	bus
1	2008-03-28 14:52:54	2008-03-28 15:59:59	train
2	2008-03-28 16:00:00	2008-03-28 22:02:00	train

def get_trans_trip(record_dt, ref_df=trip_trans):
    """Return the index of the labeled trip in ref_df whose time span
    contains record_dt, or None if no trip matches.
    """
    # boolean mask of the trips whose [Start Time, End Time] interval contains the record
    time_fit = (record_dt >= ref_df['Start Time']) & (record_dt <= ref_df['End Time'])
    nmatch = time_fit.sum()
    if nmatch == 0:
        t_idx = None
    else:
        if nmatch > 1:
            print('More than one mode match!')
        # if several trips match, keep the first one
        t_idx = ref_df.loc[time_fit].iloc[0].name
    return t_idx
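get_trans_trip compares one timestamp against every labeled trip, so mapping it over several hundred thousand rows can be slow. The sketch below shows one possible vectorized alternative with pd.merge_asof. It is not the method used in this notebook. The label rows are copied from the output above, while the GPS timestamps are hypothetical. It assumes the labeled trips do not overlap; with overlapping trips it would keep the most recently started trip rather than the first match.

```python
import pandas as pd

# Label rows copied from the head of trip_trans shown above.
labels = pd.DataFrame({
    "Start Time": pd.to_datetime(["2007-06-26 11:32:29", "2008-03-28 14:52:54", "2008-03-28 16:00:00"]),
    "End Time": pd.to_datetime(["2007-06-26 11:40:29", "2008-03-28 15:59:59", "2008-03-28 22:02:00"]),
    "Transportation Mode": ["bus", "train", "train"],
})

# HYPOTHETICAL GPS timestamps, chosen to cover the interesting cases.
points = pd.DataFrame({"record_dt": pd.to_datetime([
    "2007-06-26 11:35:00",  # inside trip 0
    "2007-06-26 12:00:00",  # after trip 0 has ended, before trip 1 starts
    "2008-03-28 15:00:00",  # inside trip 1
    "2008-03-28 16:00:00",  # exactly at the start of trip 2
])})

# Keep the original label index as a column; merge_asof needs both sides sorted by key.
right = (labels.sort_values("Start Time")
               .reset_index()
               .rename(columns={"index": "trans_trip"}))
left = points.sort_values("record_dt")

# For each point, take the last trip that started at or before the point.
merged = pd.merge_asof(left, right, left_on="record_dt",
                       right_on="Start Time", direction="backward")

# A candidate trip only counts if the point is not past its end time.
inside = merged["record_dt"] <= merged["End Time"]
merged["trans_trip"] = merged["trans_trip"].where(inside)
merged["trans_mode"] = merged["Transportation Mode"].where(inside)

result = merged[["record_dt", "trans_trip", "trans_mode"]]
print(result)
```

merge_asof returns a fresh integer index, so the original row labels of the points table are not preserved. Carry an explicit id column through the merge if they are needed.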

# map each record to the index of its labeled trip (None if no trip matches)
traj_df['trans_trip'] = traj_df['record_dt'].map(get_trans_trip)
# flag the rows that fall inside a labeled trip
has_trip = ~(traj_df.trans_trip.isnull())
# look up the transportation mode of the matched trip;
# the column starts as object dtype (None) so that assigning strings
# does not raise the pandas incompatible-dtype FutureWarning
traj_df['trans_mode'] = None
traj_df.loc[has_trip,'trans_mode'] = traj_df.loc[has_trip,'trans_trip'].apply(lambda x: trip_trans.loc[x,'Transportation Mode'])

print("N of rows with transportation mode information:\t{:,}".format(
    traj_df[~traj_df.trans_trip.isnull()].shape[0]))
print("N of rows without transportation mode information:\t{:,}".format(
    traj_df[traj_df.trans_trip.isnull()].shape[0]))

N of rows with transportation mode information:	534,140
N of rows without transportation mode information:	401,436

# save files as csv
# PLEASE REPLACE THE BELOW PATH TO YOUR OWN PATH
traj_df.to_csv(f'./traj_{user}_labeled.csv')
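A CSV file does not store column types, so reading the saved file back needs some care. The sketch below round-trips a small stand-in table through an in-memory buffer. The table is hypothetical and only mirrors three of the columns of traj_df.

```python
import io

import pandas as pd

# Hypothetical stand-in for a few columns of traj_df.
sample = pd.DataFrame({
    "record_dt": pd.to_datetime(["2008-12-23 23:04:55"]),
    "user": ["010"],
    "trans_mode": ["train"],
})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
sample.to_csv(buf)

# Default read: the user code is parsed as an integer and loses its leading zero,
# and record_dt comes back as plain text.
buf.seek(0)
naive = pd.read_csv(buf, index_col=0)

# Explicit read: keep the user code as a string and parse the timestamps.
buf.seek(0)
restored = pd.read_csv(buf, index_col=0, dtype={"user": str}, parse_dates=["record_dt"])

print(naive["user"].iloc[0], restored["user"].iloc[0])
```

The same dtype and parse_dates arguments can be passed when loading the saved trajectory CSV in a later notebook.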
