# Locating Charging Stations for Electric Vehicles (Spatial Clustering) – Part 1

This blog series covers an optimized way to locate Charging Stations for Electric Vehicles on the basis of vehicle trajectories over a given area.

I have made use of the Kaggle dataset containing the trajectories of different taxis to find the centroids, for the optimum number of clusters, that serve as the charging stations' locations on the basis of the density of taxis in the city.

**Useful links** :

Dataset used : http://archive.ics.uci.edu/ml/datasets/Taxi+Service+Trajectory+-+Prediction+Challenge%2C+ECML+PKDD+2015

As the exercise is pretty big, it has been divided into 5 different parts across three blog posts.

All the code files have been committed to the following Git repository for readers’ reference : https://github.com/palashgoyal1/Taxi_Service_Trajectory

Part 1 of this blog series covers *Data Pre-processing* and *Algorithm Selection and Challenges* faced. One can refer to Post 2 and Post 3 for an overall understanding of the approach.

(For the reader's better understanding, I have included the required code snippets wherever needed and have shared links to the relevant files in the GitHub repository. If required, feel free to comment below and I will add the relevant snippet to the post.)


### Data Pre-processing

**Data Cleaning**

(Kindly refer to the coord_parallel.py and coord_parallel_R.R files for the below steps)

- Selecting only rows with MISSING_DATA = 'False'
- Not considering ORIGIN_CALL and ORIGIN_STAND
- As TIMESTAMP does not vary much over the period, assuming time has no effect on the traffic density over regions
- Removing TAXI_ID from the TRIP_ID string (TRIP_ID contains the TAXI_ID component)
- Adding a factor variable combining CALL_TYPE and DAY_TYPE, and creating 3 different files, one for each combination

```python
train_df = train_df[train_df['MISSING_DATA'] == False]
train_df['CALL_DAY'] = train_df['CALL_TYPE'] + '_' + train_df['DAY_TYPE']
...
aa_df = pd.read_csv('/tmp/tempproject/A_A_364769.csv')  # Storing the count as well
ba_df = pd.read_csv('/tmp/tempproject/B_A_817878.csv')
ca_df = pd.read_csv('/tmp/tempproject/C_A_528013.csv')
```

| CALL_TYPE | DAY_TYPE | ROW COUNT | FileName |
|---|---|---|---|
| A | A | 364769 | A_A_364769.csv |
| B | A | 817878 | B_A_817878.csv |
| C | A | 528013 | C_A_528013.csv |

- Breaking the whole POLYLINE into the individual Lat-Lng points of each trip

```r
# adding column with the coordinates
coord_df <- coord_df[, POLYLINE := substring(coord_df[, POLYLINE], 3)]
coord_df <- coord_df[, POLYLINE := substr(coord_df[, POLYLINE], 1, nchar(coord_df[, POLYLINE]) - 2)]
coord_df <- coord_df[, coord_leng := (str_count(coord_df[, POLYLINE], "\\],\\[") + 1)]
# Now get the coord_leng using "],[" in the POLYLINE
```

- Removing rows with duplicate TRIP_ID values at the TAXI_ID level
- Removing entries where only 2 or fewer coordinate points were recorded for a TRIP_ID, assuming these must have been recorded due to some mistake/error, and also removing cases where only one of latitude or longitude was recorded
- Where the same latitude-longitude appears multiple times for the same trip, taking unique entries at the TRIP_ID, TAXI_ID level

```r
# For start points of the destination
trip_cab_start <- trip_cab_start[!duplicated(trip_cab_start, by = c("trip_id", "taxi_id", "start_point"))]
```
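The POLYLINE field stores each trip as a JSON-style list of [longitude, latitude] pairs, which is why the R code above trims the outer brackets and counts the `],[` separators. A minimal Python sketch of the same parsing and filtering (the sample POLYLINE value is illustrative):

```python
import json

# Illustrative POLYLINE value: a JSON-style list of [lon, lat] pairs.
polyline = "[[-8.618643,41.141412],[-8.618499,41.141376],[-8.620326,41.14251]]"

# Parse the string into coordinate pairs.
coords = json.loads(polyline)

# Keep the trip only if more than 2 points were recorded (fewer are
# assumed to be recording errors), dropping incomplete pairs.
if len(coords) > 2:
    points = [(lon, lat) for lon, lat in coords if lon is not None and lat is not None]
    print(len(points))  # number of coordinate points in this trip
```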

**Basic Feature creation**

(Kindly refer to the coord_parallel_R.R file for the below steps)

- Making a list of all the unique start and end points of all the trips in each of the above 3 files.

```r
trip_cab_start <- data.table(trip_cab_start %>% group_by(start_point) %>% dplyr::summarize(unique_cab = length(unique(taxi_id)), freq_val = n()))
trip_cab_start <- trip_cab_start[, lon_list := gsub("(,)(.*)", "", trip_cab_start[, start_point])]
trip_cab_start <- trip_cab_start[, lat_list := gsub("(.*)(,)", "", trip_cab_start[, start_point])]
trip_cab_start <- trip_cab_start[, start_point := NULL]
...
trip_cab_end_ <- data.table(trip_cab_end %>% group_by(end_point, taxi_id) %>% dplyr::summarize(freq_val = n()))
trip_cab_end_ <- trip_cab_end_[, lon_list := gsub("(,)(.*)", "", trip_cab_end_[, end_point])]
trip_cab_end_ <- trip_cab_end_[, lat_list := gsub("(.*)(,)", "", trip_cab_end_[, end_point])]
trip_cab_end_ <- trip_cab_end_[, end_point := NULL]
```

- Selecting the unique coordinates of the start and end points of the trips, along with the number of unique cabs passing through each point and the number of trips through that coordinate. This helps later in estimating the number of unique cabs served by each station, and the number of trips that have passed through that station.

```r
# This DF will have data for unique start point only
trip_cab_start <- data.table(trip_cab_start %>% group_by(start_point) %>% dplyr::summarize(unique_cab = length(unique(taxi_id)), freq_val = n()))
trip_cab_start <- trip_cab_start[, lon_list := gsub("(,)(.*)", "", trip_cab_start[, start_point])]
trip_cab_start <- trip_cab_start[, lat_list := gsub("(.*)(,)", "", trip_cab_start[, start_point])]
trip_cab_start <- trip_cab_start[, start_point := NULL]
...
# This DF will have data for unique end point only
trip_cab_end <- data.table(trip_cab_end %>% group_by(end_point) %>% dplyr::summarize(unique_cab = length(unique(taxi_id)), freq_val = n()))
trip_cab_end <- trip_cab_end[, lon_list := gsub("(,)(.*)", "", trip_cab_end[, end_point])]
trip_cab_end <- trip_cab_end[, lat_list := gsub("(.*)(,)", "", trip_cab_end[, end_point])]
trip_cab_end <- trip_cab_end[, end_point := NULL]
```

| CALL_TYPE | DAY_TYPE | Point Type of Trip | Unique Count | FileName |
|---|---|---|---|---|
| A | A | Start points | 296857 | A_A_start.csv |
| A | A | End points | 306657 | A_A_end.csv |
| B | A | Start points | 134582 | B_A_start.csv |
| B | A | End points | 678747 | B_A_end.csv |
| C | A | Start points | 359994 | C_A_start.csv |
| C | A | End points | 438274 | C_A_end.csv |
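The start-point aggregation done in R above (unique cab count and trip count per coordinate) can be sketched equivalently in pandas. The column names follow the R code; the data rows are illustrative:

```python
import pandas as pd

# Illustrative trip records: one row per (trip, taxi, start point).
df = pd.DataFrame({
    "trip_id":     [1, 2, 3, 4, 5],
    "taxi_id":     [10, 10, 11, 12, 12],
    "start_point": ["-8.61,41.14", "-8.61,41.14", "-8.61,41.14",
                    "-8.62,41.15", "-8.62,41.15"],
})

# For each unique start point: number of unique cabs and number of trips,
# mirroring the dplyr::summarize call.
agg = (df.groupby("start_point")
         .agg(unique_cab=("taxi_id", "nunique"), freq_val=("trip_id", "size"))
         .reset_index())

# Split "lon,lat" into separate numeric columns, as in the R gsub calls.
agg[["lon_list", "lat_list"]] = agg["start_point"].str.split(",", expand=True).astype(float)
print(agg)
```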

**Computation Optimization**

- Made use of data.table while dealing with the large datasets, as its operations are very fast.
- Used the stringr and stringi packages to clean character-type columns, instead of **apply** functions for row-wise computations.

**Packages Used**

**R** : data.table, plyr, dplyr, sp, splitstackshape, stringr, stringi

**Python** : pandas, joblib, multiprocessing

### Algorithm Selection, Advantages and Challenges

**Selected Approach**

Below is the top-level approach we opted for to find the optimal locations of the charging stations:

- Dividing the whole dataset into two types of values (Start/End coordinates, ALL coordinates), and then proceeding with the clustering :

a) Unique Lat-Lon of all the Start points of all the trips under a specific CALL_DAY type combination (Kindly refer to the code file at coord_parallel_R.R)

b) Unique Lat-Lon of all the End points of all the trips under a specific CALL_DAY type combination (Kindly refer to the code file at coord_parallel_R.R)

c) Unique Lat-Lon of the whole trip of all the trips under a specific CALL_DAY type combination (Kindly refer to the code file at coord_parallel_R1.R)

- On each of the above datasets of Lat-Lon pairs, grouping them into the optimal number of clusters
- Clustering approaches for the coordinates: K-Means, Hierarchical clustering, AP-clustering, or a HeatMap-based approach
- Segmenting the clusters on the basis of the highly dense regions, using the points in those regions for further clustering, and creating separate clusters for points outside the dense regions
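One simple way to identify the highly dense regions mentioned above is a 2D histogram (heatmap) over the coordinates. The sketch below uses synthetic points, and the density threshold is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic coordinates: a dense blob around a city centre plus scattered points.
dense = rng.normal(loc=[-8.61, 41.15], scale=0.01, size=(500, 2))
sparse = rng.uniform(low=[-8.70, 41.10], high=[-8.55, 41.20], size=(100, 2))
points = np.vstack([dense, sparse])

# Bin the points on a lon-lat grid; cells above a count threshold
# mark "dense regions" whose points go into further clustering.
counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=20)
dense_cells = counts > 20  # illustrative threshold

print(dense_cells.sum(), "dense cells out of", dense_cells.size)
```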

**Selected Algorithm 1 : K-Means Clustering**

The first algorithm selected to find the optimal locations of the cluster centroids, so as to minimize the hassle for car owners to charge their vehicles, was K-Means Clustering. Below are the top-level challenges/drawbacks faced while using it:

- The number of clusters has to be specified at the start, and there is no reliable Elbow-method criterion for stopping
- The centroids (for every cluster) converge to different points every time the algorithm is run, so there is no single optimal solution!
- The centroids are usually not the actual coordinates of any of the input locations, since each centroid is the mean position of its cluster. This yields lat-long values which may not be feasible physical locations in the real demographics.
- In a few cases, points with the same latitude were clustered together even when their longitudes were far apart, and vice versa for longitude
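The Elbow method mentioned above plots the within-cluster inertia against the number of clusters k and looks for the point where the curve flattens. A sketch with scikit-learn on synthetic coordinates (the hotspot centres and candidate k range are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic lat-lon points drawn around three hypothetical hotspots.
centers = np.array([[41.14, -8.61], [41.16, -8.63], [41.18, -8.58]])
points = np.vstack([rng.normal(c, 0.005, size=(200, 2)) for c in centers])

# Inertia (within-cluster sum of squares) for a range of k values;
# the "elbow" where the curve flattens suggests a k to use.
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = km.inertia_

for k, v in inertias.items():
    print(k, round(v, 4))
```

In practice the elbow is often ambiguous on dense city-scale data, which is one of the drawbacks listed above.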

**Selected Algorithm 2 : Affinity Propagation Clustering**

As many readers might not be aware of the AP-Clustering (Affinity Propagation Clustering) algorithm, detailed information on the topic can be obtained from the following paper, and for a better explanation one can also follow the tutorial on YouTube along with the mentioned paper:

Below are the major advantages *AP-Clustering* algorithm offers over *K-Means Clustering* algorithm :

| K-Means Clustering | Affinity Propagation Clustering |
|---|---|
| No. of clusters has to be pre-specified. Hierarchical clustering (H-clust) could be used later to cap the no. of clusters | No. of clusters may or may not be specified (depends on user needs; specify it if there are constraints on the no. of charging stations or budget). H-clust could also be used on top of the solution set. |
| Produces different optimal solutions (or cluster centroids) every time the algorithm is run | Produces the same optimal solutions (or cluster centroids) every time the algorithm is run |
| In most cases the centroids are not the actual coordinates of any of the input locations | The centroid of every cluster is always one of the data points of that cluster, not some other mean position |
| It is a non-deterministic algorithm | It is an almost deterministic algorithm |
| Sensitive to initialization, i.e. to the initial set of centroids selected | Not sensitive to initialization, i.e. the solution set of centroids does not depend on the initial centroids selected |
| Doesn't handle missing data entries | Handles missing data as well, e.g. by leaving out entries with a missing Lat or Lng value (good for a small number of missing values) |
| For large-scale datasets, it loads the whole dataset for the similarity-matrix computation, so one has to either shift to another offering such as RevoScaleR, or select another algorithm such as Markov clustering (http://micans.org/mcl/) | For large-scale datasets, it loads only a specified fraction of the data, plus its similarity matrix with the rest of the data, at a time, iterating this process for cluster formation |

Other related information :

- H-clustering could be used with both the clustering algorithms, and the depth of the H-clust could be decided by giving the constraint on the number of clusters or the cost to set up individual cluster.
- As the final exemplars (or, centroids of the cluster) are one of the actual data points given as input, this will make it easier to select a feasible location on ground for the construction of the charging station.
- Exemplars = Centroids of clusters
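A minimal sketch of AP-clustering with scikit-learn's AffinityPropagation on synthetic coordinates (the data and the damping value are assumptions). Note how the exemplars returned via `cluster_centers_indices_` are indices into the input, so each cluster centre is an actual input point, as discussed above:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(1)
# Synthetic lat-lon points around two hypothetical hotspots.
pts = np.vstack([rng.normal([41.14, -8.61], 0.002, size=(50, 2)),
                 rng.normal([41.18, -8.58], 0.002, size=(50, 2))])

# Higher damping helps convergence; the number of clusters is NOT specified.
ap = AffinityPropagation(damping=0.9, random_state=0).fit(pts)

# Exemplars are indices into the input data: each cluster centre
# is one of the original points, not a mean position.
exemplars = pts[ap.cluster_centers_indices_]
print(len(exemplars), "exemplars found")
```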

**Advantages of AP-Clustering on Large Scale Datasets**

- It starts with a small (reasonable!) sub-sample of columns (a small fraction of entries, i.e. the coordinates we consider to be potential exemplars) and iterates the AP-clustering algorithm on each sub-sample, keeping the exemplars of the previous iteration. Since the previously computed exemplars are included in every next iteration (or sub-sample), the exemplars improve over time as more data arrives. (The sub-sampling is done without replacement.)

Ex:

- Total size: 300,000 points
- Pick a sample of 100,000 data points
- Run AP-clustering; say you form 1000 clusters at this stage: 1000 centroids covering 100,000 points
- Pick another sample of 100,000 data points (without replacement) and append the above 1000 centroids to it: 101,000 points so far
- Run AP-clustering again and form, say, 1500 clusters at this stage: 1500 centroids covering 200,000 points
- Now pick another set of 100,000 data points, append the above 1500 centroids to it, and run AP-clustering on it

- It accepts both a symmetric and a non-symmetric matrix as input. It doesn't calculate the whole similarity matrix (or distance matrix) in advance, but computes the similarity matrix of only the sub-sample taken in the above step with all the input entries.

Ex: Say 10% of the data is taken as a sub-sample (i.e. as potential exemplars); then only 10% of the similarity matrix is held in memory at a time, across 10 passes.

On large-scale datasets, it can't take a pre-computed similarity matrix (otherwise we would be feeding in the whole matrix directly). In AP-clustering, messages are exchanged between the row objects (all points) and the potential exemplars/column objects (the sub-sample), and on a best mutual match one point acts as a proxy for another.
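The sub-sampling scheme described above can be sketched as follows. This is a simplified illustration using scikit-learn's AffinityPropagation on synthetic data; the batch size, damping value, and data are assumptions. The key idea it demonstrates is carrying the exemplars of each iteration into the next batch:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(7)
# Synthetic pool of coordinate points to be clustered in batches.
data = rng.normal([41.15, -8.60], 0.01, size=(300, 2))
rng.shuffle(data)

batch_size = 100
exemplars = np.empty((0, 2))

# Iterate over batches without replacement, carrying exemplars forward:
# each batch is clustered together with the exemplars found so far,
# so the exemplars improve as more data is seen.
for start in range(0, len(data), batch_size):
    batch = np.vstack([exemplars, data[start:start + batch_size]])
    ap = AffinityPropagation(damping=0.9, random_state=0).fit(batch)
    exemplars = batch[ap.cluster_centers_indices_]

print(len(exemplars), "final exemplars")
```

Only one batch's similarity matrix is in memory at a time, which is the memory advantage described next.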

**Computation Advantage**

The memory requirement equals the user-specified fraction of the data, not the overall size of the data to be clustered.

**Final Analysis**

For a better understanding of the cluster formation, one can look at the following:

- Location of each exemplar and how far they are from each other
- Number of unique vehicles served by each exemplar, i.e. falling in each cluster
- Number of unique trips in each proposed cluster
- The optimal capacity (density) of each charging station could be derived to minimize the total cost
- One can explore suggesting the shortest path from the driver to any of the charging stations (http://www.cs.umd.edu/projects/gas/gas-station.pdf)
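Counting the unique vehicles and unique trips served by each exemplar, as listed above, can be sketched with pandas. The data here is illustrative; in practice the cluster labels would come from the AP-clustering run:

```python
import pandas as pd

# Illustrative output of the clustering step: each trip point
# assigned to a cluster (exemplar) label.
df = pd.DataFrame({
    "cluster":  [0, 0, 0, 1, 1, 2],
    "taxi_id":  [10, 10, 11, 12, 13, 10],
    "trip_id":  [100, 101, 102, 103, 104, 105],
})

# Per proposed charging station: unique vehicles served and unique trips.
summary = (df.groupby("cluster")
             .agg(unique_vehicles=("taxi_id", "nunique"),
                  unique_trips=("trip_id", "nunique"))
             .reset_index())
print(summary)
```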

The above post explains in detail the initial steps we took to process the taxi trajectory data, introduces the algorithms we selected, and compares them along with the challenges faced.

In the next post of this series, we will get a bit more technical and discuss the implementation of the AP-Clustering algorithm, the problems we faced, and their proposed solutions. We will also introduce a section on the further scope of improvement for the selected algorithm. Below are the links to the next sections of the series.

### Improvements in the Selected Algorithm

### Further scope of improvement

### Improvements implementation

### References

Cheers!!

– Palash Goyal
