Siddhi: K-means Clustering extension

Siddhi: K-means Clustering extension

Malith Jayasinghe

Hello All,

 

We are developing a k-means clustering extension for Siddhi. k-means is an unsupervised learning algorithm that partitions a given data set into a chosen number of clusters. The standard k-means clustering algorithm is nondeterministic: running it several times on the same input data can give different results, because it randomly chooses k observations from the data set and uses these as the initial means. Here we implement a variant of k-means in which the initial cluster centers are the first k distinct values. This ensures the same output for a given input.
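As a rough illustration of the deterministic initialisation described above, the following Python sketch picks the first k distinct values in arrival order. It assumes one-dimensional numeric data points; the function name `initial_centres` is hypothetical and not part of the extension.

```python
def initial_centres(values, k):
    """Return the first k distinct values, in arrival order.

    Assumption: data points are plain numbers (one-dimensional).
    If fewer than k distinct values exist, fewer centres are returned.
    """
    centres = []
    for v in values:
        if len(centres) == k:
            break
        if v not in centres:
            centres.append(v)
    return centres


stream = [4.0, 4.0, 9.5, 4.0, 1.2, 9.5, 7.3]
print(initial_centres(stream, 3))  # [4.0, 9.5, 1.2]
```

Because no random choice is involved, the same stream always yields the same starting centres.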

 

Function parameters:

Data point to be clustered

Number of cluster centers - k

Number of iterations - m

Number of events for which the model is trained - x

 

The cluster centers are initialized from the first k distinct events in the stream, where k is the number of cluster centers.

The model is trained for every x events received.

After the first x events have been received, an output is produced for each incoming event. The output consists of the value of the cluster center to which the data point belongs, the id of that cluster center, and the distance from the data point to it.

 

The clustering can be performed over a given window implementation, e.g. time, time batch, or length.


--
Malith Jayasinghe 

WSO2, Inc. (http://wso2.com)
Email   :[hidden email]
Mobile :0770704040 
Blog     :https://medium.com/@malith.jayasinghe
Lean . Enterprise . Middleware

_______________________________________________
Architecture mailing list
[hidden email]
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

Re: Siddhi: K-means Clustering extension

Fazlan Nazeem
Hi Malith,


On Wed, Jun 7, 2017 at 3:04 PM, Malith Jayasinghe <[hidden email]> wrote:


The model is trained for every x events received.

 
Does this mean at any point in time, the maximum number of input points used by the training process is x? Also how is the training process carried out? I assume the training doesn't happen in real time. 

--
Thanks & Regards,

Fazlan Nazeem
Senior Software Engineer
WSO2 Inc
Mobile : +94772338839


Re: Siddhi: K-means Clustering extension

Sachini Siriwardene
Hi Fazlan,
Please find my replies inline.

On Wed, Jun 7, 2017 at 3:48 PM, Fazlan Nazeem <[hidden email]> wrote:
Does this mean at any point in time, the maximum number of input points used by the training process is x? Also how is the training process carried out? I assume the training doesn't happen in real time. 
 
Training is carried out on the data points accumulated so far, which depends on the window used. The data is collected over the given window by updating an array list.

Once an event expires from the window, an element is removed from the array list.

For every x data points received, the data accumulated in the array list is sent to be clustered and new cluster centers are computed. The training is carried out in real time, on the data available in the array list at the time it is sent for clustering.

The training process includes:

  1. Initializing the cluster centers from the first k distinct data points in the data set. If the number of distinct data points is less than k, the number of cluster centers is set to the number of distinct data points.

  2. Assigning the data points in the given data set to the available cluster centers.

  3. Computing the new cluster centers by taking the average value of the data assigned to each cluster center.

  4. Reassigning the values in the data set and recomputing the cluster centers until the cluster center values no longer change or the maximum number of iterations is reached.
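The four training steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: data points are one-dimensional numbers, each point is assigned to its nearest centre by absolute distance (the usual k-means rule), and a centre with no assigned points keeps its previous value. The function name `train` and its parameter names are hypothetical.

```python
def train(points, k, max_iterations):
    """Sketch of the training loop; returns the list of cluster centres."""
    # Step 1: the first k distinct values become the initial centres
    # (fewer than k if the data holds fewer distinct values).
    centres = []
    for p in points:
        if len(centres) == k:
            break
        if p not in centres:
            centres.append(p)

    for _ in range(max_iterations):
        # Step 2: assign every point to its nearest centre.
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            groups[nearest].append(p)

        # Step 3: each new centre is the average of its assigned points
        # (assumption: an empty cluster keeps its old centre).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]

        # Step 4: stop early once the centres no longer change.
        if updated == centres:
            break
        centres = updated
    return centres


print(train([1.0, 2.0, 10.0, 11.0, 1.5, 10.5], k=2, max_iterations=10))  # [1.5, 10.5]
```

In this example the initial centres are 1.0 and 2.0, and the loop settles on 1.5 and 10.5 after two updates, well inside the iteration limit.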

 

An option can be given to train the model on only the first x events, or to retrain it after every x data points received.


--
Sachini Siriwardene
Software Engineering Intern

+94774274374


Re: Siddhi: K-means Clustering extension

Malith Jayasinghe
adding Fazlan


Re: Siddhi: K-means Clustering extension

Fazlan Nazeem
Hi Sachini,

Okay. I think I misread the "every x events" part previously. This means that if x is 100, then by the time 200 events have been received we would have trained 2 models in total. +1 if that is the case.



Re: Siddhi: K-means Clustering extension

Sachini Siriwardene
Hi Fazlan,
Yes, that is what happens.


Re: Siddhi: K-means Clustering extension

Malith Jayasinghe
Should we add an option to enable/disable continuous learning? If it is "on", training will happen after every x events; otherwise, only after the first x events.
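To make the proposed switch concrete, here is a small Python sketch of when training would be triggered. The names `should_train`, `trainings_after` and the flag `continuous` are hypothetical and only illustrate the proposal.

```python
def should_train(events_received, x, continuous):
    """True if a training run is triggered by the latest event."""
    if events_received == 0 or events_received % x != 0:
        return False
    # With continuous learning on, train at every multiple of x;
    # otherwise train only once, after the first x events.
    return continuous or events_received == x


def trainings_after(n, x, continuous):
    """Count the training runs triggered during the first n events."""
    return sum(should_train(i, x, continuous) for i in range(1, n + 1))


print(trainings_after(200, 100, continuous=True))   # 2
print(trainings_after(200, 100, continuous=False))  # 1
```

With x = 100 and 200 events received, continuous learning gives two trained models, matching the example discussed earlier in the thread, while the "off" setting gives one.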


Re: Siddhi: K-means Clustering extension

Sachini Siriwardene
Further implementation details of the extension:

The extension is implemented by extending the stream processor.

The input parameters for the function:

  1. data point to be clustered

  2. no. of cluster centres

  3. no. of iterations

  4. no. of data points for which the model is trained - x

  5. continueToTrain (boolean)

 

Output received:

  1. cluster centre value to which the data point belongs

  2. id of the particular cluster centre

  3. distance from the cluster centre
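The three output values could be computed from a trained set of centres along these lines. This Python sketch assumes one-dimensional data, absolute difference as the distance, and the position of a centre in the list as its id; the name `classify` is hypothetical.

```python
def classify(point, centres):
    """Return (centre value, centre id, distance) for one data point."""
    # Assumption: the id of a centre is its index in the list.
    cluster_id = min(range(len(centres)), key=lambda i: abs(point - centres[i]))
    centre = centres[cluster_id]
    return centre, cluster_id, abs(point - centre)


print(classify(9.0, [1.5, 10.5]))  # (10.5, 1, 1.5)
```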

 

We can use the clustering extension with a given window. The processing details are as follows:

  1. For each current event in the window, the data point received is added to an arraylist.

  2. If the no. of data points received is greater than x, the cluster centre to which the data point belongs is calculated and an output is produced.

  3. If the no. of data points received is a multiple of x, the data in the arraylist is sent to be clustered.

  4. If an expired event is received, the first item in the arraylist is removed.

  5. If a reset event is received, all the data in the arraylist is removed.

  6. If the continueToTrain parameter is false, the model is not retrained after every x events received. Instead, it is trained only on the first x events, and the computed centres are used to produce the output for every event received afterwards.
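The event handling listed above might look roughly like the Python sketch below. It is not the extension's code: the class and method names are hypothetical, data points are assumed to be one-dimensional, steps 2 and 3 are applied in the order listed (output first, then retraining), and the trainer is passed in as a callable so that any clustering routine can be plugged in. The stand-in `mean_trainer` simply computes a single centre.

```python
class ClusteringProcessor:
    """Sketch of the per-event logic; names are illustrative only."""

    def __init__(self, x, continue_to_train, trainer):
        self.x = x
        self.continue_to_train = continue_to_train
        self.trainer = trainer      # callable: list of points -> list of centres
        self.buffer = []            # the arraylist mirroring the window
        self.received = 0
        self.centres = None

    def on_current(self, point):
        # Step 1: every current event is added to the buffer.
        self.buffer.append(point)
        self.received += 1
        output = None
        # Step 2: after more than x points, emit (centre, id, distance).
        if self.received > self.x and self.centres:
            cid = min(range(len(self.centres)),
                      key=lambda i: abs(point - self.centres[i]))
            output = (self.centres[cid], cid, abs(point - self.centres[cid]))
        # Steps 3 and 6: train at multiples of x, or only once at x
        # when continue_to_train is false.
        if self.received % self.x == 0 and (
                self.continue_to_train or self.received == self.x):
            self.centres = self.trainer(self.buffer)
        return output

    def on_expired(self):
        # Step 4: an expired event removes the oldest buffered point.
        if self.buffer:
            self.buffer.pop(0)

    def on_reset(self):
        # Step 5: a reset event clears the buffer.
        self.buffer.clear()


def mean_trainer(points):
    """Stand-in trainer: one centre, the average of the buffered points."""
    return [sum(points) / len(points)]


proc = ClusteringProcessor(x=3, continue_to_train=False, trainer=mean_trainer)
outputs = [proc.on_current(p) for p in [1.0, 2.0, 3.0, 5.0]]
print(outputs)  # [None, None, None, (2.0, 0, 3.0)]
```

The first three events only fill the buffer and trigger the single training run; the fourth is the first to produce an output, measured against the centre 2.0.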

The training process includes:

  1. Initializing the cluster centres from the first k distinct data points in the data set. If the number of distinct data points is less than k, the number of cluster centres is set to the number of distinct data points.

  2. Assigning the data points in the given data set to the available cluster centres.

  3. Computing the new cluster centres by taking the average value of the data assigned to each cluster centre.

  4. Reassigning the values in the data set and recomputing the cluster centres until the cluster centre values no longer change or the maximum number of iterations is reached.


On Fri, Jun 9, 2017 at 12:40 PM, Malith Jayasinghe <[hidden email]> wrote:
Should we add an option to enable/disable continuous learning? If "on" then training will happen after every x events otherwise only after first x events.  

On Fri, Jun 9, 2017 at 11:04 AM, Sachini Siriwardene <[hidden email]> wrote:
Hi Fazlan,
Yes , that is what happens. 

On Fri, Jun 9, 2017 at 10:52 AM, Fazlan Nazeem <[hidden email]> wrote:
Hi Sachini,

Okay. I think I misread the "every x events" part previously. This means if x is 100 when 200 events have been received we would have 2 models in total. +1 if that is the case.  


On Fri, Jun 9, 2017 at 9:56 AM, Malith Jayasinghe <[hidden email]> wrote:
Adding Fazlan.

On Fri, Jun 9, 2017 at 9:54 AM, Sachini Siriwardene <[hidden email]> wrote:
Hi Fazlan,
Please find my replies inline.

On Wed, Jun 7, 2017 at 3:48 PM, Fazlan Nazeem <[hidden email]> wrote:
Hi Malith,


On Wed, Jun 7, 2017 at 3:04 PM, Malith Jayasinghe <[hidden email]> wrote:

Hello All,

 

We are developing a k-means clustering extension. k-means is an unsupervised learning algorithm which provides a simple way to classify a given data set through a certain number of clusters. The standard k-means clustering algorithm is nondeterministic: we can get different results for the same input data when we run the algorithm multiple times. The reason is that the algorithm randomly chooses k observations from the data set and uses these as the initial means. Here we implement a variant of k-means in which the initial cluster centers are determined by the first k distinct values. This will ensure the same output for a given input.

 

Function Parameters: Data point to be clustered

Number of cluster centers - k

Number of iterations - m

Number of events for which the model is trained - x

 

The cluster centers are initialized from the first k distinct events in the stream, where k is the number of cluster centers.

The model is trained for every x events received.

 
Does this mean that, at any point in time, the maximum number of input points used by the training process is x? Also, how is the training process carried out? I assume the training doesn't happen in real time.
 
  Training is carried out on the data points accumulated so far, which depends on the window used. The data is collected over the given window by updating an array list.

Once an event expires from the window, the corresponding element is removed from the array list.

For every x data points received, the data accumulated in the array list is sent to be clustered and new cluster centers are computed. The training is carried out in real time, on the data available in the array list at the time it is sent for clustering.

The training process includes:

  1. Initializing the cluster centers from the first k distinct data points in the data set. If the number of distinct data points is less than k, the number of cluster centers is set to the number of distinct data points.

  2. Assigning each data point in the data set to the nearest available cluster center.

  3. Computing the new cluster centers by taking the average value of the data assigned to each center.

  4. Reassigning the data points and recomputing the cluster centers until the center values no longer change or the number of iterations is reached.

 

An option can be given to train the model only on the first x events or to retrain it after every x data points received.

    
  

 

After the first x events have been received, an output is given for each event. The output consists of the value of the cluster center to which the data point belongs, the id of that cluster center, and the distance from the data point to it.
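
As a rough illustration of that output, the sketch below picks the nearest center for a one-dimensional data point. The names ClusterOutput and classify are hypothetical, the center id is assumed to be the position of the center in the list, and the distance is assumed to be the absolute difference.

```python
from typing import List, NamedTuple


class ClusterOutput(NamedTuple):
    center: float     # value of the matched cluster center
    center_id: int    # assumed id: index of that center in the list
    distance: float   # absolute distance from the data point to the center


def classify(value: float, centers: List[float]) -> ClusterOutput:
    """Return the nearest center, its id, and the distance to it."""
    if not centers:
        raise ValueError("no cluster centers available yet")
    center_id = min(range(len(centers)),
                    key=lambda i: abs(value - centers[i]))
    center = centers[center_id]
    return ClusterOutput(center, center_id, abs(value - center))
```

For example, classify(8.0, [1.5, 10.5]) matches the second center, since 10.5 is closer to 8.0 than 1.5 is.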

 

The clustering can be performed for a given window implementation, i.e. time, time batch, or length.






--
Thanks & Regards,

Fazlan Nazeem
Senior Software Engineer
WSO2 Inc
Mobile : +94772338839





--
Sachini Siriwardene
Software Engineering Intern

+94774274374
