We are developing a k-means clustering extension. k-means is an unsupervised learning algorithm that provides a simple way to partition a given data set into a fixed number of clusters. The standard k-means clustering algorithm is nondeterministic: running it several times on the same input data can give different results, because it randomly chooses k observations from the data set and uses these as the initial means. Here we implement a variant of k-means in which the initial cluster centers are the first k distinct values. This ensures the same output for a given input.
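To make the deterministic initialization concrete, here is a minimal sketch in Python. It is not the extension's code: it assumes one-dimensional numeric data points, and the function names `first_k_distinct` and `kmeans_1d` are made up for illustration.

```python
def first_k_distinct(values, k):
    """Return the first k distinct values, in arrival order."""
    centers = []
    for v in values:
        if v not in centers:
            centers.append(v)
            if len(centers) == k:
                break
    return centers


def kmeans_1d(values, k, iterations):
    """k-means on 1-D data, seeded with the first k distinct values.

    Assumption: data points are plain numbers and distance is |a - b|.
    """
    centers = first_k_distinct(values, k)
    if len(centers) < k:
        raise ValueError("need at least k distinct values to initialize")
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group.
        # A center with an empty group stays where it is.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers
```

Because nothing in this sketch is random, calling `kmeans_1d` twice with the same arguments returns the same centers.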
Function parameters:
- Data point to be clustered
- Number of cluster centers (k)
- Number of iterations (m)
- Number of events for which the model is trained (x)
The cluster centers are initialized from the first k distinct events in the stream, where k is the number of cluster centers. The model is trained for every x events received. After the first x events have been received, an output is generated for each event. The output consists of the value of the cluster center to which the data point belongs, the id of that cluster center, and the distance from the data point to it.
The clustering can be performed over a given window implementation, e.g. time, time batch, or length.
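As an illustration of the per-event output, the sketch below returns the three values for one data point. It assumes one-dimensional data, absolute difference as the distance, and the list index as the cluster id; the name `assign` is hypothetical.

```python
def assign(value, centers):
    """Return (center_value, center_id, distance) for the nearest center.

    Assumptions: `centers` is a non-empty list of numbers, the cluster id
    is the position in that list, and distance is the absolute difference.
    """
    center_id = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
    center_value = centers[center_id]
    return center_value, center_id, abs(value - center_value)


# Example with two already-trained centers.
print(assign(9.0, [1.5, 11.0]))  # (11.0, 1, 2.0)
```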
On Wed, Jun 7, 2017 at 3:04 PM, Malith Jayasinghe
Does this mean at any point in time, the maximum number of input points used by the training process is x? Also how is the training process carried out? I assume the training doesn't happen in real time.
Fazlan Nazeem
On Wed, Jun 7, 2017 at 3:48 PM, Fazlan Nazeem
Training is carried out on the number of data points accumulated, depending on the window used. The data is collected over a given window size, by updating an array list. Once an event is expired from the window, an element is removed from the array list.
For every x data points received, the data accumulated in the array list is sent to be clustered and new cluster centers are computed. The training is carried out in real time, on the data available in the array list at the time it is sent for clustering. The training process includes:
An option can be given to train the model on only the first x events, or to retrain it for every x data points received.
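The buffering and retraining trigger described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it models a length window with a bounded `collections.deque` in place of the array list, the class name `WindowedTrainer` and the `continuous` flag are hypothetical, and `train` is any callable that turns a list of data points into a model.

```python
from collections import deque


class WindowedTrainer:
    """Buffers events in a length window and retrains every x events."""

    def __init__(self, window_length, x, train, continuous=True):
        # When the deque is full, appending drops the oldest event,
        # which stands in for an event expiring from the window.
        self.buffer = deque(maxlen=window_length)
        self.x = x
        self.train = train            # callable: list of points -> model
        self.continuous = continuous  # False: train on the first x events only
        self.received = 0
        self.model = None

    def on_event(self, value):
        """Add one event; return the current model (None before training)."""
        self.buffer.append(value)
        self.received += 1
        due = self.received % self.x == 0
        first_batch = self.received == self.x
        if due and (self.continuous or first_batch):
            # Train on whatever is in the window right now.
            self.model = self.train(list(self.buffer))
        return self.model
```

With `x = 100` and `continuous=True`, feeding 200 events calls `train` twice, once after event 100 and once after event 200. With `continuous=False`, only the first call happens.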
Sachini Siriwardene
Adding Fazlan.
On Fri, Jun 9, 2017 at 9:54 AM, Sachini Siriwardene
Malith Jayasinghe
Hi Sachini,
Okay. I think I misread the "every x events" part previously. This means that if x is 100, then by the time 200 events have been received we would have trained 2 models in total. +1 if that is the case.
On Fri, Jun 9, 2017 at 9:56 AM, Malith Jayasinghe
Fazlan Nazeem
Hi Fazlan,
Yes, that is what happens.
On Fri, Jun 9, 2017 at 10:52 AM, Fazlan Nazeem
Sachini Siriwardene
Should we add an option to enable or disable continuous learning? If it is "on", training will happen after every x events; otherwise it will happen only after the first x events.
On Fri, Jun 9, 2017 at 11:04 AM, Sachini Siriwardene
Malith Jayasinghe
Further implementation details of the extension: the extension is implemented by extending the stream processor. The input parameters for the function:
Output received:
We can use the clustering extension with a given window. The processing details are as follows:
The training process includes:
On Fri, Jun 9, 2017 at 12:40 PM, Malith Jayasinghe
Sachini Siriwardene
