# Locating Charging Stations for Electric Vehicles (Spatial Clustering) – Part 2

This post is a follow-up to the previous post in this series on locating charging stations for electric vehicles.

Refer to Post 1 and Post 3 for an overall understanding of the approach.

**Useful links**:

Dataset used: http://archive.ics.uci.edu/ml/datasets/Taxi+Service+Trajectory+-+Prediction+Challenge%2C+ECML+PKDD+2015

Code files repository: https://github.com/palashgoyal1/Taxi_Service_Trajectory

Part 2 of this blog series covers the *Improvements in the selected algorithm* and the *Further scope of improvement*, both algorithm- and computation-wise.

(Links to the previous and further sections have also been appended within the post.)

Contents

### Data Pre-processing

### Algorithm Selection, Advantages and Challenges

(For the reader's convenience, I have included the required code snippets wherever needed and have shared links to the relevant files in the GitHub repository. If anything is missing, feel free to comment below and I will add the relevant snippet to the post.)

### Improvements in the Selected Algorithm

After running the APcluster algorithm, a few changes were needed to further improve the results (kindly refer to the start_end_apcluster.R file for the code below):

- Removing the outlier coordinates and their respective entries (around 5 cases).

```r
max_lat <- mean(ba_start_$lat) + 10 * sd(ba_start_$lat)
min_lat <- mean(ba_start_$lat) - 10 * sd(ba_start_$lat)
ba_start_ <- ba_start_[ba_start_$lat <= max_lat, ]  # There were 5 outliers in this dataset
ba_start_ <- ba_start_[ba_start_$lat >= min_lat, ]
```

- Removing the outlier clusters and their respective entries, i.e., clusters with too few (≤ 10) points.

```r
# All points along with their cluster labels
index_cluster <- data.frame(
  indexes = matrix(unlist(apres10@clusters), nrow = n_row, byrow = T),
  cluster_label = rep(1:length(apres10@clusters), sapply(apres10@clusters, length)))
# All points/clusters which have more than '10' points in them
index_cluster <- index_cluster[index_cluster$cluster_label %in%
  names(table(index_cluster$cluster_label))[table(index_cluster$cluster_label) > 10], ]
index_cluster <- index_cluster[with(index_cluster, order(indexes)), ]
```

- Creating a separate DF with the exemplar entries; these are shown as separate markers.

```r
# This is the DF of exemplars
exemplar_df <- data.frame(exemp_ind = apres10@exemplars,
                          exemp_label = 1:length(apres10@exemplars))
exemplar_df$exemp_label <- as.character(exemplar_df$exemp_label)
exemplar_df <- exemplar_df[exemplar_df$exemp_label %in% uni_labels, ]
```

- Passing the argument `longlat=TRUE` to the `spDists` function while creating the similarity matrix. By default it uses *Euclidean* distance, but with `longlat=TRUE` it uses *Great Circle* distance.

(The commands below make use of the following function. Its main advantage is that `spDists` can form a similarity matrix between different numbers of rows and columns as well, i.e., either for the full given matrix or for a subset.)

```r
# Function to form the similarity matrix, either for the full given matrix or a subset
spdist_func <- function(x, sel = NA) {
  if (any(is.na(sel)))
    s <- spDists(x, x, longlat = TRUE)  # specify longlat=TRUE for earth distance
  else
    s <- spDists(x, x[sel, ], longlat = TRUE)
  as(s, "matrix")
}
```
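To see why `longlat=TRUE` matters, here is a small base-R sketch (no `sp` dependency; the coordinates are made-up points near Porto, and `haversine_km` is a standalone re-implementation, not the `spDists` internals) comparing a planar "degree" distance with the great-circle distance:

```r
# Haversine great-circle distance in km
haversine_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(a))  # Earth radius ~6371 km
}

# Two points one degree of longitude apart at Porto's latitude (~41.15 N):
# the planar distance says "1 degree", while the great-circle distance is
# roughly 84 km (a degree of longitude shrinks with latitude; it is ~111 km
# at the equator), so treating degrees as planar distorts the similarities.
planar_deg <- sqrt((41.15 - 41.15)^2 + ((-7.61) - (-8.61))^2)
great_circle_km <- haversine_km(41.15, -8.61, 41.15, -7.61)
```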

Types of points on the map:

- Red : 1st cluster data points
- Blue : 2nd cluster data points
- Blue Markers : exemplars (or centroids) belonging to the clusters on which they are located

**Packages Used**

**Problems Encountered**

- Selection of the `q` value. I ran multiple iterations on a sampled dataset of B_A_start.csv. Below are different selections of the `q` value and the resultant number of clusters for each (`frac=0.1` and `sweeps=10` for all of these).

| `q` value | Number of clusters |
|---|---|
| 0.90 | 2 |
| 0.91 | 2 |
| 0.92 | 2 |
| 0.93 | 33 |
| 0.94 | 22 |
| 0.95 | 28 |
| 0.96 | 39 |
| 0.97 | 73 |
| 0.98 | 68 |
| 0.99 | 82 |
| 1.00 | 100 (= the `frac` value, i.e., 0.1 of 1000) |

(The value in the `frac` parameter denotes the fraction of samples used to leverage clustering; the similarity matrix is generated for all samples against a random fraction of the samples as specified by this parameter. The `sweeps` value denotes the number of sweeps of leveraged clustering performed with a changing, randomly selected subset of samples.)

```r
# For size of 20,000 : these were run separately and assigned to the apres10 variable every time!
apres559_2 <- apclusterL(s = spdist_func, ba_start__, frac = 0.05, sweeps = 5, q = 0.93)
apres10 <- apres559_2
apres51094 <- apclusterL(s = spdist_func, ba_start__, frac = 0.05, sweeps = 10, q = 0.94)
apres10 <- apres51094
apres51098 <- apclusterL(s = spdist_func, ba_start__, frac = 0.05, sweeps = 10, q = 0.98)
apres10 <- apres51098
```

- Outliers present in a cluster can shift its exemplar. One way to mitigate this is to iterate over multiple `sweeps` values as well.

```r
apres1 <- apclusterL(s = spdist_func, ba_start__, frac = 0.1, sweeps = 5, q = 0.9)
...
apres2 <- apclusterL(s = spdist_func, ba_start__, frac = 0.1, sweeps = 10, q = 0.9)
apres5 <- apclusterL(s = spdist_func, ba_start__, frac = 0.1, sweeps = 5, q = 0.9)
apres10 <- apclusterL(s = spdist_func, ba_start__, frac = 0.1, sweeps = 10, q = 0.9)
```
- Validation: select random samples from the plot, check their geographic distance to their respective exemplars, and cross-check against the distances to the other exemplars. Another way to assess each cluster is to plot its density region and assign only the points falling inside it, then compute the proportion of points inside versus outside the cluster.
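As a minimal base-R illustration of this validation check (the objects `pts` and `exemplars` and their coordinates are made-up toy data, not the repository's variables), each sampled point's nearest exemplar by great-circle distance should be the exemplar of its own cluster:

```r
# Great-circle distance in km (haversine), vectorized over the second point
gc_km <- function(lat1, lon1, lat2, lon2) {
  to_rad <- pi / 180
  a <- sin((lat2 - lat1) * to_rad / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin((lon2 - lon1) * to_rad / 2)^2
  2 * 6371 * asin(sqrt(a))
}

# Toy data: two well-separated exemplars and one sampled point near each
exemplars <- data.frame(label = c(1, 2), lat = c(41.15, 41.30), lon = c(-8.61, -8.40))
pts <- data.frame(lat = c(41.16, 41.29), lon = c(-8.60, -8.41), cluster_label = c(1, 2))

# For each sampled point, the nearest exemplar should match its assigned label
nearest <- sapply(seq_len(nrow(pts)), function(i) {
  d <- gc_km(pts$lat[i], pts$lon[i], exemplars$lat, exemplars$lon)
  exemplars$label[which.min(d)]
})
all(nearest == pts$cluster_label)
```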

**Solutions**

- Put a check on the number of data points falling under each cluster, or
- Constrain the number of distinct cabs served by the data points falling under each cluster.
- Constrain the number of clusters selected. This can be handled by an iterative procedure that limits the number of points per cluster: for every cluster exceeding some maximum number of points, run the APclusterK algorithm on it to further sub-cluster its points.
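A minimal sketch of that last constraint, assuming a `cluster_label` vector like the one built earlier (the helper name, threshold, and toy labels are illustrative; the actual `apclusterK` call is only indicated in a comment, since it needs the `apcluster` package and the raw coordinates):

```r
# Flag clusters whose size exceeds max_pts as candidates for sub-clustering
clusters_to_split <- function(cluster_label, max_pts) {
  sizes <- table(cluster_label)
  as.integer(names(sizes)[sizes > max_pts])
}

labels <- c(rep(1, 120), rep(2, 40), rep(3, 200))     # toy cluster labels
to_split <- clusters_to_split(labels, max_pts = 100)  # clusters 1 and 3
# for (cl in to_split)
#   apcluster::apclusterK(s = spdist_func, coords[labels == cl, ], K = 3)
```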

Many of the data points in this dataset overlap, or nearly overlap, so the number of visible points is smaller than the input size. The analysis was done on subsets of different sizes.

Map using Google Fusion Tables : (1000 data points – 2 clusters, shown without exemplars)

For a better detailed map of the 2 clusters with exemplars : (1000 data points – 2 clusters, with exemplars in orange)

```r
apres2 <- apclusterL(s = spectrumK6, ba_start__, frac = 0.1, sweeps = 5, q = 0.9)
```

Using the same data and parameters, I increased the value for sweeps from 5 to 10, and it resulted in the below map :

```r
apres2 <- apclusterL(s = spectrumK6, ba_start__, frac = 0.1, sweeps = 10, q = 0.9)
```

The position of exemplars in the above map is far from the geographic set of coordinates.

In the further graphs:

- Using the **Leaflet** package, to ease the plotting!
- Adding the `longlat=TRUE` argument to the `spDists` function for great-circle distance, not Euclidean.
- Including the count of unique TAXI_ID with each exemplar: the total number of unique cabs that have passed through any of the coordinates of that cluster. Each cluster's exemplar is shown with a bigger marker and its label.
- The coordinate of one random point is highlighted by the cursor.

```r
apres10 <- apclusterL(s = spdist_func, ba_start__, frac = 0.1, sweeps = 10, q = 0.9)
```

Doing the same for 20,000 data points from B_A_start.csv gives the following:

Repeating the above for the same 20,000 points, but with different values of `sweeps` and `q`, we get 3 clusters in these 2 cases, shown in vibrant blue, red, and green!

```r
apres51094 <- apclusterL(s = spdist_func, ba_start__, frac = 0.05, sweeps = 10, q = 0.94)
```

```r
apres51098 <- apclusterL(s = spdist_func, ba_start__, frac = 0.05, sweeps = 10, q = 0.98)
```

The code to generate the above set of maps:

```r
# Plotting the values on the map :
df <- sp::SpatialPointsDataFrame(
  cbind(ba_start__$lon, ba_start__$lat),
  data.frame(type = factor(ba_start__$cluster_label)))

# I have made colorFactor a varying-length element for colors.
# By default, if length(domain) < length(color), the 'domain' values are
# assigned the first and last entries of the colors: blue/green/white/red
pal <- colorFactor(c("blue", "green", "white", "red"),
                   domain = unique(ba_start__$cluster_label))

# Benefit of layers! Entries from another DF could also be used here
leaflet(df) %>% addTiles() %>%
  addMarkers(~exemplar_df$lon, ~exemplar_df$lat,
             popup = ~as.character(exemplar_df$exemp_marker),
             label = ~as.character(exemplar_df$exemp_marker),
             labelOptions = labelOptions(noHide = T)) %>%
  addCircleMarkers(radius = 4, color = ~pal(type), stroke = FALSE,
                   fillOpacity = 0.5,
                   label = ~as.character(paste0(ba_start__$lat, "_", ba_start__$lon)))
```

### Further scope of improvement

#### Algorithm wise

One can make use of the additional data and introduce tweaks in the algorithm implementation for a better set of results:

1. Proper selection of the `frac`, `sweeps` and `q` parameters in the apclusterL algorithm, thereby resulting in better-optimized performance for the cluster creation.

2. Running the APcluster algorithm on larger sets; if the number of coordinates (or unique cabs, or the frequency of cabs) exceeds the maximum number of vehicles a station could service, we can further divide every cluster into sub-clusters using the **APClusterK algorithm**. APClusterK lets us specify the number of clusters before running the algorithm (somewhat similar to K-means) and gives the best possible solution for that many clusters. For convenience, the selection of clusters on which sub-clusters should (or should not) be defined can be done by making a **Heatmap** of the present clusters. We can also use hierarchical clustering over the present clusters.

This approach has been implemented using aggregation of the coordinates to reduce the dataset size (at the end of the doc).

**Example** :

Say ApclusterL has divided 100,000 points into 10 clusters of 10,000 points each. Now, if we want 30 charging stations rather than 10, we can further divide each of the 10 clusters into 3 sub-clusters by running the ApclusterK algorithm with the `K=3` argument on each cluster. The `K` value for each sub-cluster creation could be chosen by putting a constraint on the minimum number of entries falling in each cluster.

3. Some information about the vehicle, like the time to charge, and how it varies with the life of the vehicle.

4. Information regarding the capacity of each of the charging station, and how many vehicles at a time, or in a day can it serve.

5. Proper use of the time parameter attached to each ride, i.e., the timestamp w.r.t. every position the vehicle has reached, to allocate vehicles belonging to the same cluster to different clusters and so reduce the rush at any single charging station.

6. Similarity among the trips could be visualized by:

a) Figuring out all the trips which start and end at the same or nearby points

b) Finding trips which start and end in almost the same time interval at a daily level and might follow the same path every day

7. Validation improvement: randomly sample a few trips/coordinates, find their distance from all the charging stations, and also identify the charging station (cluster) each point falls in.
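The trip-similarity idea in point 6(a) can be sketched in base R by bucketing trips on rounded start/end coordinates (the `trips` data frame and its column names are illustrative, not the dataset's schema):

```r
# Toy trips: the first two start and end at nearly the same places
trips <- data.frame(
  start_lat = c(41.151, 41.152, 41.300),
  start_lon = c(-8.611, -8.612, -8.400),
  end_lat   = c(41.200, 41.201, 41.500),
  end_lon   = c(-8.500, -8.501, -8.300))

# Round to 2 decimals (~1 km cells); a finer grid gives stricter matching.
# Trips sharing a key start and end in the same cells.
trip_key <- paste(round(trips$start_lat, 2), round(trips$start_lon, 2),
                  round(trips$end_lat, 2), round(trips$end_lon, 2), sep = "_")
groups <- split(seq_len(nrow(trips)), trip_key)  # trips 1 and 2 share a bucket
```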

#### Computation wise

1. Clustering on the start and end points of the trips, assuming the points where a user starts/ends a trip are the points where the user spends most of their time and is most likely to charge.

2. Introducing the following parameters into the clustering method to give more information to each coordinate:

a) A time parameter, to check how exactly the traffic (or the clusters) flows w.r.t. time

b) The frequency count of each data point

c) Clubbing multiple coordinates within some range (like 50 m) into a single coordinate, by taking their mean position or using one of the coordinates as a proxy for the others, and then running the algorithm on the aggregated coordinates.

3. While processing A_A_start.csv and A_A_end.csv (refer to the coord_parallel_R.R file for the creation steps of these files), I faced memory issues; the job ended up taking close to 12 GB. Here, too, we can use the cluster-segmentation approach with the Heatmap, which reduces the number of coordinates drastically. In the example shown in the next section of this blog series, it reduced the sample size to half.
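The coordinate-clubbing idea from point 2(c), which this memory note relies on, can be sketched with a plain grid snap in base R (the cell size and the `coords` data are illustrative; 0.0005 degrees of latitude is roughly 55 m):

```r
# Toy coordinates: the first two fall in the same ~50 m cell
coords <- data.frame(lat = c(41.15012, 41.15018, 41.30000),
                     lon = c(-8.61005, -8.61018, -8.40000))
cell <- 0.0005  # grid size in degrees
coords$cell_lat <- round(coords$lat / cell) * cell
coords$cell_lon <- round(coords$lon / cell) * cell

# One representative (mean position) per cell, plus a frequency count that
# could later be attached as a weight when clustering the reduced set
agg <- aggregate(cbind(lat, lon) ~ cell_lat + cell_lon, data = coords, FUN = mean)
agg$n <- aggregate(lat ~ cell_lat + cell_lon, data = coords, FUN = length)$lat
```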

Kindly proceed to the next post of this blog series, with the sections below, for further reading.

### Improvements implementation

### References

Cheers!!

– Palash Goyal
