Data Samples

Dealer Sample Download

A. Shape - in parts - video 11

If you set the Similarity parameter to A. Shape - in parts, the SimplexDivide application generates clusters based on the similarity of the shapes of the curve parts. It therefore does not matter where in the diagram the similar curves are located; in other words, it is not crucial whether the curve values are high or low. What is decisive is the shape of the curves, that is, the decreasing and increasing slopes of their parts.

The A. Shape - in parts setting evaluates the similarity of the curves in the context of the neighboring points that form their parts. Therefore, the result depends on the order of the columns in the input file.
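The order dependence can be illustrated with a small sketch. It assumes, purely for illustration, that the shape of a part is captured by the difference between two neighboring points. The measure that SimplexDivide actually uses is not described in this manual, and the names slopes, shape_distance and reorder are hypothetical.

```python
# Illustrative sketch only; not SimplexDivide's actual similarity measure.

def slopes(curve):
    """Differences between neighboring points (assumed stand-in for 'parts')."""
    return [b - a for a, b in zip(curve, curve[1:])]


def shape_distance(c1, c2):
    """Sum of absolute differences between the corresponding slopes."""
    return sum(abs(s1 - s2) for s1, s2 in zip(slopes(c1), slopes(c2)))


def reorder(curve, order):
    """Return the curve with its columns rearranged according to 'order'."""
    return [curve[i] for i in order]


low = [10, 12, 15, 11]
high = [110, 112, 115, 111]   # same shape, 100 units higher
same_shape = shape_distance(low, high)

c1 = [0, 1, 2, 3]
c2 = [0, 2, 1, 3]
order = [1, 0, 2, 3]          # swap the first two columns in both curves
before = shape_distance(c1, c2)
after = shape_distance(reorder(c1, order), reorder(c2, order))

print(same_shape)     # 0: the level of the curves does not matter
print(before, after)  # 4 3: the same column swap changes the distance
```

Under this assumption, two curves at different levels but with identical slopes are indistinguishable, while applying the same column swap to both curves changes how far apart they are.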

fig-1

Fig 1: Input data of the dealer.csv demonstration example

The dealer.csv file is a good demonstration example. In this file, each curve represents the business performance of one dealer over a period of 24 months. A sample of the input data is shown in Fig. 1 (CLUSTER panel of the SimplexImpera application). The curve diagram of the dealer.csv file is shown in Fig. 2 (GRAPH panel of the SimplexImpera application).

fig-2

Fig 2: Graphic representation of the data on 100 dealers in the dealer.csv file

In this case, the goal is to divide the curves (dealers) in a way that shows how their business performance developed, i.e. whether it was decreasing or increasing. We are interested in the decrease or increase of business during the monitored period rather than in the business volume. Therefore, the shape of the parts of the curves is decisive, which corresponds to the A. Shape - in parts setting.

fig-3

Fig 3: Processing of the dealer.csv demonstration example

You can see the resulting division of the dealers into clusters in the SimplexImpera application if you try this demonstration example. Run the SimplexDivide application and select the dealer.csv input file (SimplexManualsample). Keep the parameter values unchanged and click the Divide button (Fig. 3). Save the result by clicking Yes. Run the SimplexImpera application and select the dealer_A_W_W_0_0spx result to view the identified clusters. The diagrams of all five clusters are shown in Fig. 4 and Fig. 5.

fig-4

Fig 4: Cluster of curves (dealers) with more-or-less well balanced performance

Fig. 4 shows the cluster of 26 dealers (GRAPH panel of the SimplexImpera application) whose business performance during the monitored period is more or less balanced (the pale curves in the background represent all 100 dealers). The figure shows that the cluster includes dealers with both high and low performance levels, because the business volume was not our concern. We were interested in the development of the performance, i.e. in the shape of the curves.

fig-5

Fig 5: Cluster of curves (dealers) with specific performance

Fig. 5 shows the remaining identified clusters. For example, the second cluster includes 23 dealers with decreasing business performance.

Remember that for the A. Shape - in parts setting, the result depends on the order of the input file columns. If we changed the order of the columns, which run from January 2001 to December 2002, the result would be different. For this demonstration example, however, it would make no sense to change the chronological order of the months.

Settings of SimplexDivide: Similarity: A - shape in parts, Modification: W - without data modification, Analysis: W - without data analysis, Strictness: 0

Patient Sample Download

B. Shape - in points - video 12

If you set the Similarity parameter to B. Shape - in points, the SimplexDivide application generates clusters based on the similarity of shape at the individual points of the curves. It therefore does not matter where in the diagram the similar curves are located; in other words, it is not crucial whether the curve values are high or low. What is decisive is the shape of the curves at their individual points, i.e. the decrease or increase of each point in comparison with all the remaining points of the curve.

The B. Shape - in points setting evaluates the similarity at the individual points of the curves regardless of the neighboring points. Therefore, the result does not depend on the order of the columns in the input file.
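A small sketch can make the order independence concrete. It assumes, purely for illustration, that each point is compared with the mean of all points of the same curve. The measure that SimplexDivide actually uses is not described in this manual, and the names deviations, point_shape_distance and reorder are hypothetical.

```python
# Illustrative sketch only; not SimplexDivide's actual similarity measure.

def deviations(curve):
    """Each point relative to the mean of all points of the same curve."""
    mean = sum(curve) / len(curve)
    return [v - mean for v in curve]


def point_shape_distance(c1, c2):
    """Sum of absolute differences between the point-wise deviations."""
    return sum(abs(a - b) for a, b in zip(deviations(c1), deviations(c2)))


def reorder(curve, order):
    """Return the curve with its columns rearranged according to 'order'."""
    return [curve[i] for i in order]


low = [0, 1, 2, 3]
high = [100, 101, 102, 103]   # same shape, 100 units higher
other = [0, 2, 1, 3]
order = [1, 0, 2, 3]          # swap the first two columns in both curves

level_shift = point_shape_distance(low, high)
before = point_shape_distance(low, other)
after = point_shape_distance(reorder(low, order), reorder(other, order))

print(level_shift)    # 0.0: the level of the curves does not matter
print(before, after)  # 2.0 2.0: the column swap leaves the distance unchanged
```

Because every point is judged only against the curve as a whole, applying the same column permutation to all curves cannot change the distances between them.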

fig-1

Fig 1: Input data of the patient.csv demonstration example

The patient.csv file, which can be found in the SimplexManualsample directory, is a good demonstration example. The file contains 100 rows, one per patient, whose health was monitored by means of a medical instrument. For each patient, the instrument measured 14 values, placed in columns with the code names FGD, RFA, NBT, ..., PVD. A sample of the input data is shown in Fig. 1 (CLUSTER panel of the SimplexImpera application). The curve diagram of the patient.csv file is shown in Fig. 2 (GRAPH panel of the SimplexImpera application).

p-fig-1

Fig 2: Graphic representation of the data on 100 patients in the patient.csv file

In this example, the goal is to divide the patients into clusters whose measured values are similar in terms of the mutual decrease or increase of the individual values. Some patients have generally higher measured values, others generally lower. The order of the individual measurements is unimportant. We thus want to find the structure of this sample of patient data in terms of the decrease or increase of each measurement relative to all the other measurements of the same patient. Therefore, the shape of the curves at their individual points is decisive, which corresponds to the B. Shape - in points setting.

fig-3

Fig 3: Processing of the patient.csv demonstration example

You can see the resulting division of the patients into clusters in the SimplexImpera application if you try this demonstration example. Run the SimplexDivide application and select the patient.csv input file (SimplexManualsample). Set the Similarity parameter to B. Shape - in points and the Strictness parameter to 30. Keep the remaining parameter values unchanged, click the Divide button (Fig. 3) and save the result. Run the SimplexImpera application and select the patient_B_W_W_0_30spx result to view the identified clusters. The diagrams of two of the 11 clusters are shown in Fig. 4.

image008-patients

Fig 4: Diagrams of two clusters of the division result (patient_B_W_W_0_30spx)

The diagrams in Fig. 4 show an apparent difference between the two clusters shown: the measured values of the patients have clearly different shapes. The resulting clusters were thus generated on the basis of the shape at the individual points of the curves, i.e. the decrease or increase of each point in comparison with all the remaining curve points. This is the essence of the B. Shape - in points setting.

Remember that we would obtain the same result after any change in the order of the input file columns (e.g. HFP, RFA, FGD, ..., NBT). Unlike the A. Shape - in parts setting, the result of the B. Shape - in points setting does not depend on the order of the columns in the input file.

Settings of SimplexDivide: Similarity: B - shape in points, Modification: W - without data modification, Analysis: W - without data analysis, Strictness: 30

Internet Sample Download

C. Proximity – in parts - video 13

If you set the Similarity parameter to C. Proximity - in parts, the SimplexDivide application generates clusters based on the proximity of the curve parts. This means that the position of the similar curves within the diagram is decisive, i.e. how close to each other they lie. The similarity of the curves depends on how similar the numeric values within their parts are: the curves within any identified cluster have approximately equal numeric values in their parts.

The C. Proximity - in parts setting evaluates the similarity of the curves in the context of the neighboring points that form their parts. Therefore, the result depends on the order of the columns in the input file.

image002

Fig 1: Input data of the internet.csv demonstration example

The internet.csv file, which can be found in the SimplexManualsample directory, is a good demonstration example. The file contains 100 rows on the activity of an Internet provider's customers. Each row contains data on the connection time of one Internet user in each of the 24 hours of the day, collected over a period of several weeks. For each user, the file thus contains 24 values: the number of minutes of connection within the given hour of the day for the monitored period. A sample of the input data is shown in Fig. 1 (CLUSTER panel of the SimplexImpera application). The curve diagram of the internet.csv input data is shown in Fig. 2 (GRAPH panel of the SimplexImpera application).

image004

Fig 2: Diagram on connection data for 100 users during the day (internet.csv)

In this case, the goal is to divide the curves (users) into clusters that provide information on their activity throughout the day. We want each cluster to contain users who are connected to the Internet for approximately the same number of minutes at approximately the same time of day. Therefore, the proximity of the parts of the curves is decisive, which corresponds to the C. Proximity - in parts setting.

image006

Fig 3: Processing of the internet.csv demonstration example

You can see the resulting division of the users into clusters in the SimplexImpera application if you try this demonstration example. Run the SimplexDivide application and select the internet.csv input file (SimplexManualsample). Set the Similarity parameter to C. Proximity - in parts, the Analysis parameter to A. Statistical analysis - in row, and the Strictness parameter to 5. Keep the remaining parameter values unchanged, click the Divide button (Fig. 3) and save the result. Run the SimplexImpera application and select the internet_C_W_A_0_5spx result. See Fig. 4 for the diagrams of the four identified clusters.

image008

Fig 4: Diagrams of four clusters of the division result (internet_C_W_A_0_5spx)

The diagrams in Fig. 4 show that the Internet users within the individual clusters are similar in terms of the duration of their daily activity. For example, the third cluster contains users with high activity during the evening and at night. The users in the first cluster are also highly active in the evening hours, but their connection time is shorter. The resulting clusters were thus generated according to similar connection times in the individual hours of the day. This is the essence of the C. Proximity - in parts setting.

Remember that for the C. Proximity - in parts setting, the result depends on the order of the input file columns. If we changed the order of the columns for the individual hours of the day, the result would be different. For this demonstration example, however, it would make no sense to change the chronological order of the hours.

Settings of SimplexDivide: Similarity: C - proximity in parts, Modification: W - without data modification, Analysis: A - statistical analysis in row, Strictness: 5

People Sample Download

D. Proximity – in points - video 14

If you set the Similarity parameter to D. Proximity - in points, the SimplexDivide application generates clusters of curves based on the proximity of their individual points in the diagram. The similarity of the curves depends on how similar their individual numeric values are: the curves within any identified cluster have approximately equal individual numeric values.

The D. Proximity - in points setting evaluates the similarity of the curves at their individual points regardless of the neighboring points. Therefore, the result does not depend on the order of the columns in the input file.
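As a rough illustration of point-wise proximity, the sketch below uses the ordinary Euclidean distance between two rows. This is an assumption made for the example, not the measure SimplexDivide is documented to use; the persons, their values and the units are hypothetical.

```python
import math

# Illustrative sketch only; Euclidean distance is an assumed stand-in for
# SimplexDivide's point-wise proximity measure.

def reorder(values, order):
    """Return the row with its columns rearranged according to 'order'."""
    return [values[i] for i in order]


# Hypothetical persons as [height in cm, mass in kg, age in years].
anna = [170, 65, 30]
ben = [172, 68, 33]
carl = [190, 95, 60]

near = math.dist(anna, ben)    # small: the individual values are close
far = math.dist(anna, carl)    # large: the individual values differ a lot

order = [2, 0, 1]              # columns rearranged to age, height, mass
near_reordered = math.dist(reorder(anna, order), reorder(ben, order))

print(near < far)                          # True
print(math.isclose(near, near_reordered))  # True: column order is irrelevant
```

A distance built only from differences between corresponding values treats every column independently, which is why rearranging the columns has no effect.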

image002_people

Fig 1: Input data of the people.csv demonstration example

The people.csv file is a good demonstration example. In this file, each curve represents one person by three values: height, mass and age. A sample of the input data is shown in Fig. 1 (CLUSTER panel of the SimplexImpera application). The people.csv file contains data on 50 persons, which are represented graphically in Fig. 2 (GRAPH panel of the SimplexImpera application).

image004_people

Fig 2: Graphic representation of the data on 50 persons in the people.csv file

In this example, the goal is to divide the persons into clusters with approximately the same height, mass and age. We thus want to find the structure of this sample of personal data in terms of height, mass and age. Therefore, the similarity of the numeric values is decisive, which corresponds to the D. Proximity - in points setting.

image006_people

Fig 3: Processing of the people.csv demonstration example

You can see the resulting division of the persons into clusters in the SimplexImpera application if you try this demonstration example. Run the SimplexDivide application and select the people.csv input file (SimplexManualsample). Set the Similarity parameter to D. Proximity - in points and the Analysis parameter to B. Statistical analysis - in column. Keep the remaining parameter values unchanged, click the Divide button and save the result. Then run the SimplexImpera application and select the people_D_W_B_0_0spx result to view the identified clusters.

image008_people

Fig 4: Division of the people.csv file into 4 clusters

The SimplexDivide application divided the 50 persons into four clusters, as listed in Fig. 4 (table of the IMPERA panel of the SimplexImpera application). The cluster number is followed by the number of persons in the cluster and the average height, mass and age of the persons in it.

image010_people

Fig 5: Diagrams of four clusters of the division result (people_D_W_B_0_0spx)

The diagrams in Fig. 5 show that the persons in the individual clusters are mutually similar in terms of height, mass and age. The average height, mass and age values in the table in Fig. 4 confirm this. For example, the second cluster includes 13 persons of medium height, mass and age. The clusters were thus generated on the basis of the proximity of the individual height, mass and age values of the persons. This is the essence of the D. Proximity - in points setting.

Remember that we would obtain the same result after any change in the order of the input file columns (e.g. age, height, mass). Unlike the C. Proximity - in parts setting, the result of the D. Proximity - in points setting does not depend on the order of the columns in the input file.

Settings of SimplexDivide: Similarity: D - proximity in points, Modification: W - without data modification, Analysis: B - statistical analysis in column, Strictness: 0

Store Sample Download

A. [min,max] >> [0%,100%] - in row - video 15

If you set the Modification parameter to A. [min,max] >> [0%,100%] - in row, the SimplexDivide application performs the following modification of the input file data in each row before generating the clusters:

  • the application finds the minimum and the maximum values in the row
  • the minimum value is taken as 0% and the maximum value as 100%
  • the application replaces every value in the row with the corresponding percentage of the range between the minimum and the maximum

For instance, in the following row with 7 values, the minimum value is 100 (0%) and the maximum one is 200 (100%).

170,190,100,120,200,160,130

After the A. [min,max] >> [0%,100%] - in row modification, this row looks as follows:

70%,90%,0%,20%,100%,60%,30%

The SimplexDivide application performs this modification in each row separately (both the minimum and the maximum can be different for each row). The application first modifies the input file data in this way and then uses the curves of the modified data to generate the clusters.
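The row modification described above can be sketched in a few lines. The function name minmax_percent is hypothetical, and the handling of a row whose values are all equal is an assumption, since the manual does not say what the application does in that case.

```python
def minmax_percent(row):
    """Rescale one row so that its minimum becomes 0 and its maximum 100."""
    lo, hi = min(row), max(row)
    if hi == lo:
        # Assumption: a constant row has no range, so every value maps to 0.
        return [0.0 for _ in row]
    return [100 * (v - lo) / (hi - lo) for v in row]


row = [170, 190, 100, 120, 200, 160, 130]
print(minmax_percent(row))
# [70.0, 90.0, 0.0, 20.0, 100.0, 60.0, 30.0]
```

The output reproduces the percentages of the example row shown above.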

image002-store

Fig 1: Unmodified store.csv input data

The store.csv file, which can be found in the SimplexManualsample directory, is a good demonstration example. The file contains 100 rows on metalware sales during 24 weeks. Each row contains 24 values for one item of goods. Each value is the cumulative total delivered from stock up to and including the given week. For instance, a value of 21 in the third column (3rd week) means that a total of 21 units (pounds, yards or gallons) was delivered from stock during the first three weeks. If 5 units are sold during the next week, the value in the fourth column (4th week) is 26. The original unmodified data are shown in Fig. 1 (CLUSTER panel of the SimplexImpera application).
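The relationship between weekly deliveries and the cumulative totals stored in the file can be sketched as follows. The first two weekly values are hypothetical; only the totals of 21 in week 3 and 26 in week 4 come from the example above.

```python
from itertools import accumulate

# Hypothetical weekly deliveries for one item; weeks 3 and 4 match the text.
weekly = [8, 7, 6, 5]

# The file stores running totals, one value per week.
cumulative = list(accumulate(weekly))
print(cumulative)  # [8, 15, 21, 26]

# Going back from the stored totals to the deliveries of each week.
recovered = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]
print(recovered)   # [8, 7, 6, 5]
```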

image004-store

Fig 2: Diagram of store.csv unmodified input data

The store manager's goal is to find out how the demand for the items increases during the monitored period. The problem is that the individual items are sold in different units (pieces, yards, pounds, gallons, etc.) and in different quantities (tens, hundreds or thousands). This is clearly visible in the input data diagram in Fig. 2 (GRAPH panel of the SimplexImpera application).

If we want to divide the items into clusters with similar demand, we should modify the input data so that the shape of the curves clearly shows how demand increased over the monitored period. It is therefore appropriate to use the A. [min,max] >> [0%,100%] - in row modification. Because the values are cumulative totals, the minimum of each row is in the first column and the maximum in the last one. As a consequence, goods with similar demand get similar curves regardless of whether they are sold in tens, hundreds or thousands. The purpose of this modification becomes obvious in the diagrams of the resulting clusters shown below.
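A short sketch shows why the quantities stop mattering after this modification. The two items and their cumulative totals are hypothetical, as is the function name minmax_percent.

```python
def minmax_percent(row):
    """Rescale one row so that its minimum becomes 0 and its maximum 100."""
    lo, hi = min(row), max(row)
    if hi == lo:
        # Assumption: a constant row has no range, so every value maps to 0.
        return [0.0 for _ in row]
    return [100 * (v - lo) / (hi - lo) for v in row]


# Hypothetical cumulative deliveries: same demand pattern, different quantities.
small_item = [10, 20, 40, 100]
large_item = [1000, 2000, 4000, 10000]

small_curve = [round(p, 2) for p in minmax_percent(small_item)]
large_curve = [round(p, 2) for p in minmax_percent(large_item)]

print(small_curve)  # [0.0, 11.11, 33.33, 100.0]
print(large_curve)  # [0.0, 11.11, 33.33, 100.0]
```

Both items end up with the same curve, so a shape-based similarity setting can group them together.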

image006-store

Fig 3: Modified store.csv input data

image008-store

Fig 4: Diagram of store.csv modified input data

How should the Similarity parameter be set? Since we want to obtain clusters of goods with approximately the same increase in demand over the monitored period, and the order of the columns is important, the A. Shape - in parts setting should be used.

image010-store

Fig 5: Processing of the store.csv demonstration example

You can view the resulting clusters of goods with similar demand in the SimplexImpera application if you try this demonstration example. Run the SimplexDivide application and select the store.csv input file (SimplexManualsample). Set the Similarity parameter to A. Shape - in parts and the Modification parameter to A. [min,max] >> [0%,100%] - in row. Set the Strictness parameter to 30, click the Divide button and save the result. Run the SimplexImpera application and select the store_A_A_W_0_30spx result to view the identified clusters. The diagrams of four of the 16 clusters are shown in Fig. 6.

image012-store

Fig 6: Diagrams of four clusters of the division result (store_A_A_W_0_30spx)

The differences in demand for the goods during the monitored 24 weeks are apparent from the diagrams of the clusters in Fig. 6. The goods in the first identified cluster sold in approximately the same quantities throughout the entire period. The goods in the second cluster sold better in the first 14 weeks, after which sales decreased. Conversely, sales of the goods in the third cluster started to rise after an initial period of inactivity, which lasted approximately the first 6 weeks. The last cluster in the figure is notable for its small or almost zero demand between the 6th and 18th weeks.

Settings of SimplexDivide: Similarity: A - shape in parts, Modification: A - [min,max] >> [0%,100%] - in row, Analysis: W - without data analysis, Strictness: 30

Employee Sample Download

B. [min,max] >> [0%,100%] - in column - video 16

If you set the Modification parameter to B. [min,max] >> [0%,100%] - in column, the SimplexDivide application performs the following modification of the input file data in each column before generating the clusters:

  • the application finds the minimum and the maximum values in the column
  • the minimum value is taken as 0% and the maximum value as 100%
  • the application replaces every value in the column with the corresponding percentage of the range between the minimum and the maximum

For example, in the following 10 rows of 3 values each, the [minimum, maximum] pairs of the individual columns are [25,59], [30000,64900] and [0,3].

28,37000,1
27,33100,0
50,63900,3
43,44100,3
33,53100,0
52,60300,1
46,64900,3
25,30100,1
25,30000,1
59,54500,1

After the B. [min,max] >> [0%,100%] - in column modification, these rows look as follows:

8.82%,20.06%,33.33%
5.88%,8.88%,0%
73.53%,97.13%,100%
52.94%,40.4%,100%
23.53%,66.19%,0%
79.41%,86.82%,33.33%
61.76%,100%,100%
0%,0.29%,33.33%
0%,0%,33.33%
100%,70.2%,33.33%

The SimplexDivide application performs this modification in each column separately (both the minimum and the maximum can be different for each column). The application first modifies the input file data in this way and then uses the curves of the modified data to generate the clusters.
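The column modification can be sketched in the same way as the row variant, this time taking the minimum and maximum of each column. The function name minmax_percent_columns is hypothetical, and the handling of a column whose values are all equal is an assumption.

```python
def minmax_percent_columns(rows):
    """Rescale every column so that its minimum becomes 0 and its maximum 100."""
    columns = list(zip(*rows))          # transpose rows into columns
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [
            # Assumption: a constant column has no range and maps to 0.
            100 * (v - lo) / (hi - lo) if hi != lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


rows = [
    [28, 37000, 1], [27, 33100, 0], [50, 63900, 3], [43, 44100, 3],
    [33, 53100, 0], [52, 60300, 1], [46, 64900, 3], [25, 30100, 1],
    [25, 30000, 1], [59, 54500, 1],
]

modified = minmax_percent_columns(rows)
for row in modified:
    # Round to two decimals and drop trailing zeros, as in the listing above.
    print(",".join(f"{round(p, 2):g}%" for p in row))
# first line printed: 8.82%,20.06%,33.33%
```

Run on the ten example rows, the sketch prints the same percentages as the modified listing above.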

image002-employee

Fig 1: Unmodified employee.csv input data

The employee.csv file, which can be found in the SimplexManualsample directory, is a good demonstration example. The file contains 100 rows on employees with the following data: age, revenue and number of children. The original unmodified data are shown in Fig. 1 (CLUSTER panel of the SimplexImpera application).

image004-employee

Fig 2: Diagram of employee.csv unmodified input data

In this case, the goal is to divide the employees into clusters with similar ages, revenues and numbers of children. This gives us information on the economic and social structure of the staff. The trouble is that the magnitudes of the individual values differ too much: the age values are in tens, the revenue in thousands, and the number of children is usually a very small number. The curves drawn from these data are therefore not distinctive enough for generating the clusters. This is clearly visible in the input data diagram in Fig. 2 (GRAPH panel of the SimplexImpera application).

If we want to divide the employees into clusters, the input data should be modified in a way that emphasizes the differences between the curves of the individual employees. It is therefore appropriate to use the B. [min,max] >> [0%,100%] - in column modification. After this modification, employees with different data get correspondingly different curves. The modified data are shown in Fig. 3 and their diagram in Fig. 4.

image006-employee

Fig 3: Modified employee.csv input data

image008-employee

Fig 4: Diagram of employee.csv modified input data

How should the Similarity parameter be set? Since we want to obtain clusters of employees with approximately the same age, revenue and number of children, and the order of the columns is unimportant, we should use the D. Proximity - in points setting.

image010-employee

Fig 5: Processing of the employee.csv demonstration example

You can try this example yourself. Run the SimplexDivide application and select the employee.csv input file (SimplexManualsample). Set the Similarity parameter to D. Proximity - in points, the Modification parameter to B. [min,max] >> [0%,100%] - in column, and the Analysis parameter to B. Statistical analysis - in column. Keep the remaining parameter values unchanged, click the Divide button and save the result. Then run the SimplexImpera application and select the employee_D_B_B_0_0spx result to view the identified clusters.

image012-employee

Fig 6: Division of the employee.csv file into 4 clusters

The SimplexDivide application divided the 100 employees into 4 clusters, as listed in the table of the IMPERA panel of the SimplexImpera application (Fig. 6). The cluster number is followed by the number of employees in the cluster and the average age, revenue and number of children for that cluster.

image014-employee

Fig 7: Diagrams of four clusters of the division result (employee_D_B_B_0_0spx)

The diagrams in Fig. 7 show that the employees in the individual clusters are mutually similar in terms of age, revenue and number of children. The average age, revenue and number of children in the table in Fig. 6 confirm this. For example, the third cluster contains 26 middle-aged employees with high revenue and either one child or none. The data of these employees in Fig. 8 are consistent with this.

image016-employee

Fig 8: Third identified cluster employees

Settings of SimplexDivide: Similarity: D - proximity in points, Modification: B - [min,max] >> [0%,100%] - in column, Analysis: B - statistical analysis in column, Strictness: 0

Customer Sample Download

E. [sum] >> [100%] - in row

If you set the Modification parameter to E. [sum] >> [100%] - in row, the SimplexDivide application performs the following modification of the input file data in each row before generating the clusters:

  • the application calculates the total sum of all values in the row
  • the total sum of the row is taken as 100%
  • the application replaces every value in the row with the corresponding percentage of the total sum

For example, the total sum of the values in the following row of 7 values is 63000, which represents 100% for the row.

0,0,0,6300,37800,12600,6300

After the E. [sum] >> [100%] - in row modification, this row looks as follows:

0%,0%,0%,10%,60%,20%,10%

The SimplexDivide application performs this modification in each row separately (the total sum can be different for each row). The application first modifies the input file data in this way and then uses the curves of the modified data to generate the clusters.
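The sum-based row modification can be sketched as follows. The function name sum_percent is hypothetical, and the handling of a row whose total is zero is an assumption, since the manual does not cover that case.

```python
def sum_percent(row):
    """Express every value of a row as a percentage of the row total."""
    total = sum(row)
    if total == 0:
        # Assumption: a row with a zero total cannot be scaled and maps to 0.
        return [0.0 for _ in row]
    return [100 * v / total for v in row]


row = [0, 0, 0, 6300, 37800, 12600, 6300]
print(sum_percent(row))
# [0.0, 0.0, 0.0, 10.0, 60.0, 20.0, 10.0]
```

The output reproduces the percentages of the example row shown above, and the values of every modified row add up to 100.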

image002-customer

Fig 1: Unmodified customer.csv input data

The customer.csv file, which can be found in the SimplexManualsample directory, is a good demonstration example. The file contains 18 rows on the long-term customers of a company trading in certain sorts of fruit. Each row contains the financial amounts that one customer spent on the individual sorts of fruit in the past period. The original unmodified data are shown in Fig. 1 (CLUSTER panel of the SimplexImpera application).

image004-customer

Fig 2: Modified customer.csv input data

The goal of the company's commercial manager is to segment the customers by product, i.e. to divide the customers into segments with a similar portfolio of purchased products. It is therefore necessary to modify the input data so that the columns show percentages expressing the customer's interest in the given product (the sort of fruit) rather than financial amounts. The E. [sum] >> [100%] - in row modification is appropriate for this. The modified data are shown in Fig. 2 and their diagram in Fig. 3 (GRAPH panel of the SimplexImpera application).

image006-customer

Fig 3: Diagram of customer.csv modified input data

How should the Similarity parameter be set? Since we want to obtain clusters of customers with approximately the same percentages in the individual columns, and the order of the columns is unimportant, we should use the D. Proximity - in points setting.

image008-customer

Fig 4: Processing of the customer.csv demonstration example

You can try this example yourself. Run the SimplexDivide application and select the customer.csv input file (SimplexManualsample). Set the Similarity parameter to D. Proximity - in points and the Modification parameter to E. [sum] >> [100%] - in row. Keep the remaining parameter values unchanged, click the Divide button and save the result. Then run the SimplexImpera application and select the customer_D_E_W_0_0spx result to view the identified clusters (Fig. 5).

image010-customer

Fig 5: Product segmentation of customers in customer.csv

The SimplexDivide application divided the 18 customers in customer.csv into four clusters (segments). The different product profiles of the customers are apparent from the diagrams of the clusters. For example, the 5 customers in the second cluster are interested mostly in grapes, bananas and oranges, and above all in bananas. The data of these 5 customers in Fig. 6 are consistent with this.

image012-customer

Fig 6: The second identified cluster customers

Settings of SimplexDivide: Similarity: D - proximity in points, Modification: E - [sum] >> [100%] - in row, Analysis: W - without data analysis, Strictness: 0