I'm confused about where the transpose comes from. Say A is in R^(m×n) and we want the gradient of f(x) = ||Ax||^2. By the chain rule we get 2(Ax) * d/dx(Ax). Is d/dx(Ax) equal to A or A^T, and why? Also, just for clarity: is something like 2(Ax+b)A equal to 2A^T(Ax+b)? In other words, if you change the order of a matrix product, do you need to replace A with A^T? I know matrix multiplication is not commutative, but beyond that I'm confused.
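
One way to test a candidate formula like this is a numerical check. The sketch below is only an illustration, assuming f(x) = ||Ax + b||^2 (set b = 0 for the ||Ax||^2 case). The helper names (`matvec`, `transpose`, `analytic_grad`, `numeric_grad`) and the values of A, x, and b are made up for the example. It compares the candidate 2A^T(Ax+b) against a central finite-difference gradient.

```python
# Numerical check of the gradient of f(x) = ||Ax + b||^2 (assumed form).
# A is m x n, x has length n, b has length m; all values are illustrative.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def transpose(M):
    """Return the transpose of M as a list of rows."""
    return [list(col) for col in zip(*M)]

def f(A, x, b):
    """Squared Euclidean norm of the residual Ax + b."""
    r = [ri + bi for ri, bi in zip(matvec(A, x), b)]
    return sum(ri * ri for ri in r)

def analytic_grad(A, x, b):
    """Candidate formula 2 * A^T (Ax + b); the result has length n."""
    r = [ri + bi for ri, bi in zip(matvec(A, x), b)]
    return [2.0 * g for g in matvec(transpose(A), r)]

def numeric_grad(A, x, b, h=1e-6):
    """Central finite differences, one coordinate of x at a time."""
    grad = []
    for j in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[j] += h
        xm[j] -= h
        grad.append((f(A, xp, b) - f(A, xm, b)) / (2 * h))
    return grad

A = [[1.0, 2.0, 0.0],
     [0.0, -1.0, 3.0]]   # m = 2, n = 3
x = [0.5, -1.0, 2.0]     # length n
b = [1.0, -2.0]          # length m

g_analytic = analytic_grad(A, x, b)
g_numeric = numeric_grad(A, x, b)
print("2 A^T (Ax + b):    ", g_analytic)
print("finite differences:", g_numeric)
```

The two printed vectors should agree up to rounding. The shapes are also worth noting: Ax + b has length m while the gradient has length n, so with column vectors the product (Ax+b)A is not defined when m ≠ n. The row-vector form (Ax+b)^T A is defined, and it is the transpose of A^T(Ax+b).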