9.6. Exercise: Transportation Mode Classification (Machine Learning)

9.6.1. Supervised Learning

9.6.1.1. Classification

We classify transportation modes based on several features.

import geopandas as gpd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn import tree
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# PLEASE REPLACE THE BELOW PATH WITH YOUR OWN
df = pd.read_csv('./traj_010_labeled_with_features.csv', index_col=0)
# create dataframe holding features per trip
# in this notebook, we focus on the features' mean scores per trip
df = df.groupby('trans_trip').agg({
    'distance': 'mean',
    'speed': 'mean',
    'accel': 'mean',
    'angle': 'mean',
    'angular_velocity': 'mean',
    # take the first label observed in each trip
    'trans_mode': lambda x: pd.unique(x)[0],
})
df.head()

	distance	speed	accel	angle	angular_velocity	trans_mode
trans_trip
1.0	0.920653	55.720618	-0.014007	68.982383	0.354393	train
2.0	1.082321	65.670647	-0.001027	116.581851	0.568180	train
3.0	1.339320	74.683147	-0.003409	215.717754	0.480407	train
4.0	3.513432	173.776717	0.233277	174.470829	0.283810	train
5.0	4.282399	68.448484	-0.025508	150.209730	0.176290	train
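The per-trip aggregation can be tried on a toy table first. The sketch below is only an illustration: the `points` DataFrame and its values are hypothetical, and it uses pandas named aggregation with the string `"mean"` rather than the dictionary form used in the cell above.

```python
import pandas as pd

# Hypothetical point-level records: two trips with a few GPS points each
points = pd.DataFrame({
    "trans_trip": [1, 1, 1, 2, 2],
    "speed": [10.0, 20.0, 30.0, 4.0, 6.0],
    "trans_mode": ["train", "train", "train", "walk", "walk"],
})

# One row per trip: mean of the numeric feature, first label seen in the trip
per_trip = points.groupby("trans_trip").agg(
    speed=("speed", "mean"),
    trans_mode=("trans_mode", "first"),
)
print(per_trip)
```

Passing the aggregation as a string lets pandas pick its own grouped implementation and avoids the deprecation message that callables such as `np.mean` can trigger.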

# drop rows holding np.nan
df.dropna(inplace = True)

df.describe()

	distance	speed	accel	angle	angular_velocity
count	428.000000	428.000000	428.000000	428.000000	428.000000
mean	0.081362	39.136603	-0.068544	171.572185	24.004009
std	0.329114	38.591674	0.115817	64.819410	24.533810
min	0.001203	1.925925	-1.116834	5.303116	0.176290
25%	0.002057	6.276831	-0.108822	128.523012	2.852289
50%	0.009352	27.671732	-0.045742	173.816135	18.066445
75%	0.020917	53.299781	-0.006693	218.588914	35.412731
max	4.282399	175.182973	0.298885	329.530934	114.471672

# Let's check the counts of the target categorical values in the dataset
df.trans_mode.value_counts()

trans_mode
walk      152
train      99
taxi       96
subway     47
bus        34
Name: count, dtype: int64

In this notebook, we merge several labels into three: vehicle (car/taxi/bus), walk, and train (train/subway).

# create a dictionary holding new labels as values and corresponding old labels as keys
trans_mode_map = {'bus':'vehicle', 'car':'vehicle','taxi':'vehicle','subway':'train', 'walk':'walk','train':'train'}
# map the above dictionary to the current labels and replace them with the three labels
df['trans_mode'] = df.trans_mode.map(trans_mode_map)
# count each label
df.trans_mode.value_counts()

trans_mode
walk       152
train      146
vehicle    130
Name: count, dtype: int64
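One thing to watch with `Series.map` and a dictionary is that any label missing from the dictionary becomes NaN. The following sketch shows a simple check for that case; the label `"bike"` is a hypothetical value that does not appear in this dataset.

```python
import pandas as pd

# "bike" is a made-up label that the mapping does not cover
labels = pd.Series(["bus", "taxi", "subway", "walk", "bike"])
mapping = {
    "bus": "vehicle", "car": "vehicle", "taxi": "vehicle",
    "subway": "train", "train": "train", "walk": "walk",
}

merged = labels.map(mapping)                       # unmapped labels become NaN
unmapped = labels[merged.isna()].unique().tolist()
print(unmapped)
```

Running such a check before training makes it easier to notice labels that would otherwise be dropped silently or passed on as missing values.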

### Implementation of some basic classification models

Create X, holding the features used by the models, and y, holding the labels to be predicted.

X = df[['speed','accel','angular_velocity']]
y = df['trans_mode']

# check correlations between features
X.corr()

	speed	accel	angular_velocity
speed	1.000000	0.159520	-0.530622
accel	0.159520	1.000000	-0.068022
angular_velocity	-0.530622	-0.068022	1.000000

# check the sizes of X and y
X.shape, y.shape

((428, 3), (428,))

# Split into training and validation data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
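The `stratify=y` argument asks `train_test_split` to keep the class proportions similar in both subsets. The sketch below illustrates this with the label counts reported above (walk 152, train 146, vehicle 130); the feature column `dummy` is a placeholder and not part of the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels with the same counts as the merged trans_mode column above
y_demo = pd.Series(["walk"] * 152 + ["train"] * 146 + ["vehicle"] * 130)
X_demo = pd.DataFrame({"dummy": range(len(y_demo))})  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

# Class shares in the full data and in the held-out part
print(y_demo.value_counts(normalize=True).sort_index())
print(y_te.value_counts(normalize=True).sort_index())
```

Without stratification, a small dataset like this one could end up with a held-out set that under-represents one of the three modes.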

Run several classification models to estimate the transportation mode (vehicle, train, or walk).

In this notebook, we run logistic regression, SVM (support vector machine), and a decision tree as examples. To understand how each model works, please check the documentation below and other introductory machine learning materials.

logistic regression: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

SVM: https://scikit-learn.org/stable/modules/svm.html#support-vector-machines

Decision Tree: https://scikit-learn.org/stable/modules/tree.html#decision-trees

# Logistic regression (with the default lbfgs solver, scikit-learn fits a multinomial model for multiclass targets)
lg_model = LogisticRegression()
lg_model.fit(X_train, y_train)  # fit the model to the training data
lg_y_predicted = lg_model.predict(X_test)  # predict labels for the validation data
# accuracy on the training data
print('accuracy score (w/ training data): {}'.format(lg_model.score(X_train, y_train)))
# accuracy on the validation data
print('accuracy score (w/ validation data): {}'.format(lg_model.score(X_test, y_test)))
print(classification_report(y_test, lg_y_predicted))
# print out confusion matrix
pd.DataFrame(confusion_matrix(y_test, lg_y_predicted), 
             columns = ['train','vehicle','walk'], 
             index=['train','vehicle','walk'])

accuracy score (w/ training data): 0.9130434782608695
accuracy score (w/ validation data): 0.9069767441860465
              precision    recall  f1-score   support

       train       0.92      0.80      0.85        44
     vehicle       0.80      0.92      0.86        39
        walk       1.00      1.00      1.00        46

    accuracy                           0.91       129
   macro avg       0.91      0.91      0.90       129
weighted avg       0.91      0.91      0.91       129

/opt/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

	train	vehicle	walk
train	35	9	0
vehicle	3	36	0
walk	0	0	46
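The ConvergenceWarning in the output above suggests either raising `max_iter` or scaling the data. One possible way to follow that advice is to put a `StandardScaler` in front of the classifier with `make_pipeline`. The sketch below runs on synthetic data that only loosely imitates the three feature columns; the class centers, spreads, and the `max_iter=1000` setting are assumptions for illustration, and the scores reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 100
# Synthetic stand-in for (speed, accel, angular_velocity) on very different scales
X_demo = np.vstack([
    rng.normal([5.0, 0.0, 40.0], [2.0, 0.05, 10.0], size=(n, 3)),   # walk-like
    rng.normal([30.0, 0.0, 15.0], [8.0, 0.05, 8.0], size=(n, 3)),   # vehicle-like
    rng.normal([70.0, 0.0, 3.0], [15.0, 0.05, 2.0], size=(n, 3)),   # train-like
])
y_demo = np.repeat(["walk", "vehicle", "train"], n)

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit()
scaled_lg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lg.fit(X_demo, y_demo)
print(scaled_lg.score(X_demo, y_demo))
```

Keeping the scaler inside the pipeline means that, in a train/validation setting, the scaling statistics come from the training data alone.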

In the above cell, we printed the accuracy score on the training data and on the validation data.

Also, we printed out the classification reports that hold more information, including precision, recall, and f1-score.

\(\texttt{accuracy}(y, \hat{y}) = \frac{1}{n_\text{samples}} \sum_{i=0}^{n_\text{samples}-1} 1(\hat{y}_i = y_i)\)
\(\texttt{precision} = tp / (tp + fp)\), where tp (true positive) is the number of samples correctly assigned to the class and fp (false positive) is the number of samples wrongly assigned to it
\(\texttt{recall} = tp / (tp + fn)\), where fn (false negative) is the number of samples of the class that were missed
\(\texttt{f1-score} = 2 \cdot (\texttt{precision} \cdot \texttt{recall}) / (\texttt{precision} + \texttt{recall})\)

With the confusion matrix, we can easily check which labels are misclassified as which.
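The definitions above can be checked by hand against the confusion matrix of the logistic regression model. The sketch below copies that matrix (rows are true labels, columns are predicted labels) and recomputes the per-class metrics with NumPy; the variable names are chosen for this illustration only.

```python
import numpy as np

# Confusion matrix of the logistic regression model (rows: true, columns: predicted)
labels = ["train", "vehicle", "walk"]
cm = np.array([
    [35, 9, 0],
    [3, 36, 0],
    [0, 0, 46],
])

tp = np.diag(cm)                   # correctly classified samples per class
precision = tp / cm.sum(axis=0)    # tp / (tp + fp): divide by column sums
recall = tp / cm.sum(axis=1)       # tp / (tp + fn): divide by row sums
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp.sum() / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The rounded values agree with the classification report printed for the logistic regression model, for example a precision of 0.92 and a recall of 0.80 for train.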

# Support Vector Machine (SVM)
svm_clf = svm.SVC()
svm_clf.fit(X_train, y_train) 
svm_y_predicted=svm_clf.predict(X_test) # Predicting labels with test data.
print("accuracy score (w/ training data): ", svm_clf.score(X_train, y_train)) 
print("accuracy score (w/ validation data): ", svm_clf.score(X_test, y_test)) 
print(classification_report(y_test, svm_y_predicted))
# print out confusion matrix
pd.DataFrame(confusion_matrix(y_test, svm_y_predicted), 
             columns = ['train','vehicle','walk'], 
             index=['train','vehicle','walk'])

accuracy score (w/ training data):  0.919732441471572
accuracy score (w/ validation data):  0.875968992248062
              precision    recall  f1-score   support

       train       0.92      0.82      0.87        44
     vehicle       0.79      0.79      0.79        39
        walk       0.90      1.00      0.95        46

    accuracy                           0.88       129
   macro avg       0.87      0.87      0.87       129
weighted avg       0.88      0.88      0.87       129

	train	vehicle	walk
train	36	8	0
vehicle	3	31	5
walk	0	0	46

# Decision Tree Classifier
tree_clf = DecisionTreeClassifier(random_state=0, max_depth=3)
tree_clf.fit(X_train, y_train) 
tree_y_predicted=tree_clf.predict(X_test)
print("accuracy score (w/ training data): ", tree_clf.score(X_train, y_train)) 
print("accuracy score (w/ validation data): ",tree_clf.score(X_test, y_test))
print(classification_report(y_test, tree_y_predicted))
# print out confusion matrix
pd.DataFrame(confusion_matrix(y_test, tree_y_predicted), 
             columns = ['train','vehicle','walk'], 
             index=['train','vehicle','walk'])

accuracy score (w/ training data):  0.9431438127090301
accuracy score (w/ validation data):  0.9302325581395349
              precision    recall  f1-score   support

       train       0.89      0.91      0.90        44
     vehicle       0.89      0.87      0.88        39
        walk       1.00      1.00      1.00        46

    accuracy                           0.93       129
   macro avg       0.93      0.93      0.93       129
weighted avg       0.93      0.93      0.93       129

	train	vehicle	walk
train	40	4	0
vehicle	5	34	0
walk	0	0	46

Let's visualize how the tree classifies our data.

fig, ax = plt.subplots(figsize=(10, 5))
tree.plot_tree(
    tree_clf, ax=ax, fontsize=8,
    feature_names=list(X.columns),
    # class names must follow the classifier's own (sorted) class order
    class_names=list(tree_clf.classes_),
    filled=True,
    rounded=True,
)
plt.show()

[Figure: plot of the fitted decision tree]
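When a figure is inconvenient, the same split rules can be printed as text with `sklearn.tree.export_text`. The sketch below fits a small tree on synthetic blobs rather than on the trip data, and the feature names passed to `export_text` are placeholders chosen for this illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-feature data with three classes (not the trip data)
X_demo, y_demo = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)

demo_tree = DecisionTreeClassifier(max_depth=3, random_state=0)
demo_tree.fit(X_demo, y_demo)

# Placeholder feature names, used only to label the printed rules
rules = export_text(demo_tree, feature_names=["speed", "angular_velocity"])
print(rules)
```

Each indented line is a threshold test on one feature, and each leaf line reports the predicted class, which makes the output easy to paste into notes or compare across runs.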

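9.6.2. Unsupervised Learning

Here we use K-means and PCA.

9.6.2.1. K-means

Here we cluster the trips with K-means using the features angular_velocity and speed. K-means requires the number of clusters to be specified in advance, so we generate three clusters, matching the three transportation mode labels.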

X_kmeans = df[['angular_velocity', 'speed']]
# n_clusters sets the number of clusters; n_init=10 states the current default explicitly
model = KMeans(n_clusters=3, n_init=10)
model.fit(X_kmeans)  # fit the data
y_km = model.predict(X_kmeans)  # predict clusters

Let's compare the clustering result with the ground-truth labels.

y = df['trans_mode']
y_colors = y.map({'train':'navy', 'vehicle':'turquoise', 'walk':'darkorange'}) 
fig, axs = plt.subplots(1, 2, figsize=(8, 3))
for i, ax in enumerate(axs):
    ax.scatter(df['angular_velocity'], df['speed'], c = [y_km, y_colors][i])
    ax.set_xlabel('angle rate')
    ax.set_ylabel('speed')
    ax.set_title(['K-means clustering', 'Ground truth class'][i])
plt.show()

[Figure: two scatter plots of angular velocity (x-axis) against speed (y-axis); left panel "K-means clustering", right panel "Ground truth class"]

Partly because only two features are used, K-means does not separate the classes very well, but it still seems to capture the broad tendencies.
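Besides comparing the plots by eye, the agreement between clusters and known labels can be summarized with a single number. One option is the adjusted Rand index from `sklearn.metrics`, which does not depend on how the cluster IDs are numbered. The sketch below uses synthetic, well-separated points; the centers, spreads, and label names are assumptions for illustration and not values from the trip data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Synthetic (angular_velocity, speed) points for three groups
X_demo = np.vstack([
    rng.normal([40.0, 5.0], [5.0, 1.0], size=(50, 2)),
    rng.normal([15.0, 30.0], [5.0, 5.0], size=(50, 2)),
    rng.normal([3.0, 70.0], [1.0, 5.0], size=(50, 2)),
])
y_true = np.repeat(["walk", "vehicle", "train"], 50)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
ari = adjusted_rand_score(y_true, km.labels_)
print(ari)
```

Because K-means assigns arbitrary integer IDs to clusters, a permutation-invariant score like this is safer than comparing cluster IDs with class labels directly.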

9.6.2.1.1. PCA (Principal Component Analysis)

PCA is often used to reduce the dimensionality of high-dimensional data. Here we use PCA to reduce the five features of the GPS trajectory data to two dimensions.

X_pca = df[['distance','speed','accel','angle','angular_velocity']]
y = df['trans_mode']

X_pca = StandardScaler().fit_transform(X_pca)

pca = PCA(n_components=2)
X_pca_res = pca.fit_transform(X_pca)

print(pca.explained_variance_ratio_)

[0.36113444 0.2130784 ]
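The two ratios printed above add up to roughly 0.57, so the first two components retain a little over half of the variance of the standardized features. To see how much each additional component would add, the cumulative sum of `explained_variance_ratio_` can be inspected. The sketch below does this on synthetic five-column data with partly correlated columns; the data are an assumption for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
# Five synthetic features: two correlated pairs plus one independent column
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=200),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

X_std = StandardScaler().fit_transform(X_demo)
pca_full = PCA().fit(X_std)  # keep all components

# Share of variance retained when keeping the first k components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)
```

A curve that flattens early indicates that few components are enough, while a slowly rising curve means a two-dimensional projection discards a substantial part of the variance.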

We plot the first and second principal components.

colors = ['navy', 'turquoise', 'darkorange']
target_names =  y.unique()
for color, target_name in zip(colors, target_names):
    plt.scatter(X_pca_res[y == target_name, 0], X_pca_res[y == target_name, 1], color=color, alpha=.8, label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1, frameon=False)
plt.title('PCA of GPS dataset')
plt.show()

[Figure: scatter plot "PCA of GPS dataset" showing the first two principal components, colored by transportation mode]