Correlation matrix clustering solution

vaenster · New member · Joined: Sep 11, 2023 · Messages: 9
Hi! I'm new to this forum, so forgive me if I chose the wrong section. I also don't have any real degree in mathematics beyond my upper-secondary (gymnasium) exams in Sweden.
I have been working on a Google Sheet for a bit, and I've ended up with a correlation matrix (I think that's the name?) that I need to cluster in a certain way.
The sheet contains members and their relations towards each other; I've posted a picture so you can see it.

Currently there are 34 different members in the sheet, and I need to divide them into 3 groups with the goal of obtaining the lowest possible total "relationship" value.
That means the totals of the groups combined need to be as low as possible. The sizes of the groups don't matter: if the combined value gets lower by stacking 20 members in one group, 10 in another and 4 in the last, that is totally fine.

I've spent countless hours working on this, and so far I've only learnt the name of my table (correlation matrix) and that I need some way of clustering it.
I've read a bit about k-means clustering, and I think it could be a reasonable way of dealing with my task? There is also something called agglomerative clustering, which I haven't gotten as deep into.
Anyway, I'm posting here to see if someone knows a piece of software or a method I could use to solve my problem. Let me know, thanks :)

William.P
 

Attachments

  • Correlation.PNG (81.8 KB)

The sklearn library in Python has both k-means clustering and agglomerative clustering algorithms.
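For a table like yours, the agglomerative variant is the one that can take the matrix directly (k-means wants coordinate data rather than a precomputed matrix). A minimal sketch, assuming your 34×34 table is loaded as a symmetric numpy array D of "how bad is the relation" values with a 0 diagonal (on scikit-learn versions older than 1.2 the parameter is called affinity instead of metric):

Code:
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical 5-member example; replace D (and names) with your real 34x34 table,
# using the "how bad is their relation" numbers as distances, with a 0 diagonal.
D = np.array([
    [0, 1, 4, 6, 3],
    [1, 0, 2, 5, 4],
    [4, 2, 0, 1, 2],
    [6, 5, 1, 0, 3],
    [3, 4, 2, 3, 0],
])
names = ['A', 'B', 'C', 'D', 'E']

# metric='precomputed' tells sklearn that D already contains pairwise distances;
# 'ward' linkage does not work with precomputed distances, so use 'average' (or 'complete').
model = AgglomerativeClustering(n_clusters=3, metric='precomputed', linkage='average')
labels = model.fit_predict(D)

for group in range(3):
    print(f'Group {group + 1}:', [n for n, lab in zip(names, labels) if lab == group])

Note that this clusters members by overall similarity of their relations; it does not directly minimize the exact "sum within each group" total you describe.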
 
I can't say that I know how to program, at least not in Python. I do, however, have some friends who know C++. I will ask them if they could possibly help me with this task. I doubt they can, so if someone has another solution, that would be preferred :D
What would be your preferred solution?
 
What would be your preferred solution?
Ideally, a piece of software that does this with an easier interface, so that I, together with some googling, could manage to get the result.
It would also help to get a confirmation that this is the correct way of approaching the problem.
 
My answers are very delayed because they're awaiting approval from a moderator, most likely because I'm new to this forum.
 
I don't think what you have there is a "correlation matrix". A correlation is a value between -1 and 1, where 1 indicates perfect correlation, which is what you should expect along the main diagonal of your matrix. Looking at your main diagonal, for example: why does Mopi correlated with Mopi have a value of 4 while Bexxi with Bexxi has a value of 5?

It might be helpful to tell us how you came up with the matrix.
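Just to illustrate the point, a genuine correlation matrix always has every entry in [-1, 1] and exactly 1 on the diagonal, e.g. with some random data:

Code:
import numpy as np

# Three made-up measurement series, just to show what a correlation matrix looks like
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))

C = np.corrcoef(data, rowvar=False)
print(C)            # every entry lies between -1 and 1
print(np.diag(C))   # the main diagonal is all 1s: each series correlates perfectly with itself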

P.S: The restriction goes away after 5 posts.
 
Thanks for your response. The relationship a member has to themselves is actually unimportant; I should have put an x or a 0 or something similar there, since they can't have a relation to themselves.

I'm not sure what it's called, and most likely it's not what I'm saying it is.
But maybe if I explain it this way you'll know what I'm after...

You can see the sheet as relations between different members. The higher the number, the worse their relation is.
The goal is to create 3 groups where the total value is as low as possible, which translates to pairing the members that have a good relation with each other as much as possible. Of course it can't be perfect and some members won't go together as well, but this is what we want to minimize.

If I were to put all the members with perfect relations together, the other groups would be chaotic. So in the end we want to add up the values within each group, and that SUM should be as low as possible.
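For example (made-up numbers, just to show how I count it): if the sheet says A–B = 1, A–C = 4 and B–C = 2, and A, B and C all end up in the same group, that group contributes 1 + 4 + 2 = 7 to the total, while a pair that ends up in two different groups contributes nothing.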
 
If I'm not mistaken, this is the well-known "graph partitioning problem" (my graph theory is rusty). It's a class of combinatorial optimization problems where the goal is to partition the vertices (nodes) of a graph into subsets (partitions) while optimizing a certain objective function. The objective function varies with the specific variant, but in your case it is the sum of the relationship rankings within each group, which you want to minimize.

Try this Python code. Update the keys 'A', 'B', etc. in relationship_rankings to the names of your members and add the missing pairs.

Code:
import itertools
import networkx as nx

# Define the relationship rankings as a dictionary
relationship_rankings = {('A', 'B'): 5, ('B', 'C'): 10, ('A', 'C'): 2, ('A', 'D'): 0, ('B', 'D'): 20}

# Create a graph and add edges with weights
G = nx.Graph()
for (person1, person2), ranking in relationship_rankings.items():
    G.add_edge(person1, person2, weight=ranking)

nodes = list(G.nodes())
num_groups = 3

# Find the partition into 3 groups (of any size) that minimizes the
# sum of relationship rankings within each group
best_partition = None
min_partition_cost = float('inf')

# Try every possible assignment of nodes to the 3 groups
# (3**n assignments, so this brute force is only feasible for small examples)
for assignment in itertools.product(range(num_groups), repeat=len(nodes)):
    current_partition = [[] for _ in range(num_groups)]
    for node, group_index in zip(nodes, assignment):
        current_partition[group_index].append(node)

    # Calculate the sum of relationship rankings within each group
    partition_cost = 0
    for group in current_partition:
        for person1, person2 in itertools.combinations(group, 2):
            if G.has_edge(person1, person2):
                partition_cost += G[person1][person2]['weight']

    # Keep the partition if it has a lower cost than the best found so far
    if partition_cost < min_partition_cost:
        best_partition = current_partition
        min_partition_cost = partition_cost

# Print the final partition and its cost
for i, group in enumerate(best_partition):
    group_members = ', '.join(group)
    print(f'Group {i + 1}: {group_members}')
print(f'Minimum Sum of Relationship Rankings: {min_partition_cost}')
 
I just wanted to give you an update. I am trying some things out and I might be getting closer to a solution; I'll post an update when I have more. It's based on the code you sent me.
 
This is my code currently, and I think it's working as intended. You can see the results in the attached picture: 180 is the total relationship value between everyone in the sheet, and 47 is the combined total within the three groups, so I have to say it's starting to look good.

Code:
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value, LpBinary, LpInteger
import csv

# Define the path to your CSV file
csv_file_path = r'C:\test2.csv'

# Create an empty dictionary to store relationship values
relationship_values = {}

# Read the CSV file and populate the relationship dictionary
with open(csv_file_path, mode='r') as csv_file:
    csv_reader = csv.reader(csv_file)
    member_names = next(csv_reader)[1:]  # Extract member names from the first row

    for row in csv_reader:
        source_member = row[0]
        for i, val in enumerate(row[1:]):
            target_member = member_names[i]
            relationship_values[(source_member, target_member)] = int(val)

# Create a binary optimization problem
prob = LpProblem("Grouping_Optimization", LpMinimize)

# Define the members and groups
members = member_names
groups = ['Group_A', 'Group_B', 'Group_C']

# Create decision variables with binary values
x = LpVariable.dicts('assign', [(m, g) for m in members for g in groups], cat=LpBinary)

# Create binary indicator variables: relation[(m1, m2, g)] should be 1 exactly when both m1 and m2 are assigned to group g
relationship_vars = LpVariable.dicts('relation', [(m1, m2, g) for m1 in members for m2 in members for g in groups], cat=LpBinary)

# Define the objective function using the provided relationships
prob += lpSum(relationship_values.get((m1, m2), 0) * relationship_vars[(m1, m2, g)] for m1 in members for m2 in members for g in groups)

# Add constraints
for m in members:
    prob += lpSum(x[(m, g)] for g in groups) == 1  # Each member is assigned to one group

# Linearization constraints tying relation to the assignment variables:
# relation[(m1, m2, g)] is forced to 1 when both x[(m1, g)] and x[(m2, g)] are 1, and to 0 otherwise
for m1 in members:
    for m2 in members:
        for g in groups:
            prob += relationship_vars[(m1, m2, g)] >= x[(m1, g)] + x[(m2, g)] - 1
            prob += relationship_vars[(m1, m2, g)] <= x[(m1, g)]
            prob += relationship_vars[(m1, m2, g)] <= x[(m2, g)]

# Solve the ILP
prob.solve()

# Create a dictionary to store the members assigned to each group and their respective group's total relationship value
group_members = {g: [] for g in groups}
group_relationship_values = {g: 0 for g in groups}

# Assign members to groups
for g in groups:
    for m in members:
        if x[(m, g)].varValue == 1:
            group_members[g].append(m)

# Calculate the total relationship value for each group
group_relationship_values = {}
for g in groups:
    group_relationship_values[g] = sum(
        relationship_values.get((m1, m2), 0) * relationship_vars[(m1, m2, g)].varValue
        for m1 in group_members[g]
        for m2 in group_members[g]
    )

# Divide the total relationship value for each group by 2
for g in groups:
    group_relationship_values[g] /= 2

# Print the results
for g in groups:
    print(f"Group {g}: {', '.join(group_members[g])}")
    print(f"Total relationship value for {g}: {group_relationship_values[g]}")
    print()

# Calculate and print the total relationship value between everyone in the CSV
total_relationship_value_all = sum(
    relationship_values.get((m1, m2), 0) * relationship_vars[(m1, m2, g)].varValue
    for m1 in members
    for m2 in members
    for g in groups
)

# Divide the total relationship value between everyone in the CSV by 2
total_relationship_value_all /= 2

# Print the total relationship value for everyone in the CSV
print(f"Total relationship value between everyone in the CSV: {total_relationship_value_all}")

# Calculate and print the total relationship value for each group
total_relationship_value_group = {}
for g in groups:
    total_relationship_value_group[g] = sum(
        relationship_values.get((m1, m2), 0) * relationship_vars[(m1, m2, g)].varValue
        for m1 in group_members[g]
        for m2 in group_members[g]
    )
    # Divide the total relationship value for each group by 2
    total_relationship_value_group[g] /= 2

# Calculate and print the sum of total relationship values for groups A, B, and C
total_relationship_value_groups_sum = sum(total_relationship_value_group[g] for g in groups)
print(f"Sum of total relationship values for groups A, B, and C: {total_relationship_value_groups_sum}")

# Create a dictionary to store the members assigned to each group
group_members = {g: [] for g in groups}

# Assign members to groups and populate group_members dictionary
for g in groups:
    for m in members:
        if x[(m, g)].varValue == 1:
            group_members[g].append(m)

# Create a text file to save group assignments
with open('group_assignments.txt', 'w') as file:
    for g, members in group_members.items():
        file.write(f'Group {g}:\n')
        for member in members:
            file.write(f'{member}\n')
        file.write('\n')
 

Attachments

  • result.PNG (14.1 KB)

This is my current code. What it does is try every possible combination to find which one gives the lowest total value when dividing the members of the CSV into 3 groups. But there is no reason for it to try every possible combination. Since it stores the best result found so far, as soon as a combination reaches a higher value than the current best, even if it has only placed a couple of names together and hasn't finished the combination, there is no reason to finish it; every combination starting with those names should be skipped. Even if just 2 names together already exceed the current best, those 2 people should never be put in the same group again. The same goes if the value only passes the current best after 5 names have been grouped together: exactly that set of members should never be put into a group again. It's important to keep the existing code such as the multiprocessing for performance, since with it the script runs at about 80,000 it/s and without it at about 5,000 it/s.

I've attached 2 files as .txt; rename them to .csv and you can run the code, just update the path.


Code:
import csv
import math
import time
from multiprocessing import Pool, freeze_support
import itertools

def calculate_relationship_value(args):
    perm, members, groups, relationship_values = args
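    # zip with itertools.cycle(groups) deals the permuted members out round-robin (Group_A, Group_B, Group_C, Group_A, ...), so the three groups always end up as equal in size as possible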
    group_assignment = {m: g for m, g in zip(perm, itertools.cycle(groups))}
    total_relationship_value = sum(
        relationship_values.get((m1, m2), 0)
        for m1 in members
        for m2 in members
        if group_assignment[m1] == group_assignment[m2]
    )
    return total_relationship_value, group_assignment

if __name__ == '__main__':
    freeze_support()  # Add this line to support running on Windows

    csv_file_path = r'C:\half.csv'

    # Load CSV data and set up variables
    with open(csv_file_path, mode='r') as csv_file:
        csv_reader = csv.reader(csv_file)
        member_names = next(csv_reader)[1:]

        members = member_names
        groups = ['Group_A', 'Group_B', 'Group_C']
        relationship_values = {}

        for row in csv_reader:
            source_member = row[0]
            for i, val in enumerate(row[1:]):
                target_member = member_names[i]
                relationship_values[(source_member, target_member)] = int(val)

    combination_count = 0
    iterations_per_second = 0

    # Set the number of permutations to process before calculating iterations per second
    permutations_before_update = 1000  # Adjust this as needed
    start_time = time.time()

    # Use multiprocessing pool to parallelize permutation calculation
    with Pool(processes=8) as pool:  # Adjust the number of processes as needed
        members_permutations = itertools.permutations(members)
       
        best_grouping = None
        lowest_total_relationship_value = float('inf')

        for perm_chunk in iter(lambda: tuple(itertools.islice(members_permutations, permutations_before_update)), ()):
            results = pool.map(calculate_relationship_value, [(perm, members, groups, relationship_values) for perm in perm_chunk])

            for total_relationship_value, group_assignment in results:
                combination_count += 1

                if total_relationship_value < lowest_total_relationship_value:
                    lowest_total_relationship_value = total_relationship_value
                    best_grouping = group_assignment

                elapsed_time = time.time() - start_time
                iterations_per_second = combination_count / elapsed_time
                remaining_permutations = math.factorial(len(members)) - combination_count
                estimated_seconds_remaining = remaining_permutations / iterations_per_second

                # Convert seconds to hours and minutes
                hours_remaining = int(estimated_seconds_remaining // 3600)
                minutes_remaining = int((estimated_seconds_remaining % 3600) // 60)

                print(f"{combination_count}/{math.factorial(len(members))} it/s: {iterations_per_second:.2f} | "
                      f"Estimated Time Remaining: {hours_remaining} hours {minutes_remaining} minutes", end='\r')

    # Calculate the total relationship value for each divided group
    group_relationship_values = {group: 0 for group in groups}
    for g in groups:
        group_members = [m for m, assigned_group in best_grouping.items() if assigned_group == g]
        group_relationship_values[g] = sum(
            relationship_values.get((m1, m2), 0)
            for m1 in group_members
            for m2 in group_members
        )

    # Display the best group and its total relationship value
    print("Best Group Assignment:")
    for g in groups:
        group_members = [m for m, assigned_group in best_grouping.items() if assigned_group == g]
        print(f"Group {g}: {', '.join(group_members)}")
        print(f"Total relationship value for {g}: {group_relationship_values[g]}")
        print()

    # Print the total number of combinations tried
    print(f"Total combinations tried: {combination_count}")
 

Attachments

  • half.txt (521 bytes)
  • test3.txt (283 bytes)