I was also taught about someone called Harold Jeffreys who presented something which he called a prior. OK, so Wikipedia taught me (Twitter being unsuited for explaining things to me (or anyone)). His prior is proportional to the square root of the determinant of the Fisher information matrix. The Fisher information matrix is minus the expected second derivative of the log likelihood with respect to the parameter vector theta, evaluated at the true theta. It is also the variance-covariance matrix of the gradient of the log likelihood at the true theta (the two matrices are identical).
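Here is a minimal numerical sketch of that last claim, using a Bernoulli(p) model as a toy example of my own (nothing from Jeffreys or Wikipedia): minus the average second derivative of the log likelihood and the variance of its gradient should both land on the analytic Fisher information 1/(p(1-p)).

```python
# Toy check: for Bernoulli(p), minus the expected second derivative of the
# log likelihood equals the variance of its gradient, both at the true p.
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3
x = rng.binomial(1, p_true, size=1_000_000)          # simulated data

# gradient (score) and second derivative of the log likelihood, at the true p
score = x / p_true - (1 - x) / (1 - p_true)
second_deriv = -x / p_true**2 - (1 - x) / (1 - p_true)**2

print("minus expected second derivative:", -second_deriv.mean())
print("variance of the gradient        :", score.var())
print("analytic 1 / (p(1-p))           :", 1 / (p_true * (1 - p_true)))
```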
The Fisher information matrix is a function of theta, so a penalty which depends on the Fisher information matrix is also a function of theta. It can be called a prior (though I reserve that term for sincere beliefs).
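To make that concrete, here is the standard textbook case of a single Bernoulli(p) observation: the Fisher information is 1/(p(1-p)), so the Jeffreys prior is

```latex
% Jeffreys prior for the Bernoulli(p) model
\[
  I(p) \;=\; \frac{1}{p(1-p)},
  \qquad
  \pi_J(p) \;\propto\; \sqrt{I(p)} \;=\; \frac{1}{\sqrt{p(1-p)}},
\]
% which, once normalized, is a Beta(1/2, 1/2) distribution.
```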
The point of the Jeffreys prior is that it is invariant under any reparametrization of the model. If phi = g(theta) and g is one to one, then the posterior distribution of phi given the Jeffreys prior on phi implies exactly the same probabilities of any observable event as the posterior distribution of theta given the Jeffreys prior on theta.
This is true because, if theta is distributed according to the Jeffreys prior on theta and phi = g(theta), then phi is distributed according to the Jeffreys prior on phi.
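Here is a small sketch of that transformation claim, again using my Bernoulli toy example: draw p from the Jeffreys prior (a Beta(1/2, 1/2)), map it to the log odds phi = log(p/(1-p)), and compare the resulting distribution with the Jeffreys prior computed directly in the phi parametrization, where the Fisher information is p(1-p) with p = sigmoid(phi).

```python
# Sketch: the Jeffreys prior on p, pushed through phi = log(p/(1-p)),
# should match the Jeffreys prior computed directly for phi.
import numpy as np

rng = np.random.default_rng(1)
p = rng.beta(0.5, 0.5, size=1_000_000)   # Jeffreys prior for Bernoulli is Beta(1/2, 1/2)
phi = np.log(p / (1 - p))                # reparametrize to the log odds

# Density estimate of the transformed draws (normalized by the full sample size
# so that draws outside the plotted range are not thrown away).
edges = np.linspace(-6, 6, 25)
counts, _ = np.histogram(phi, bins=edges)
hist = counts / (len(phi) * (edges[1] - edges[0]))
centers = 0.5 * (edges[:-1] + edges[1:])

# Jeffreys prior computed directly in phi: the Fisher information is p(1-p)
# with p = sigmoid(phi), so the density is sqrt(p(1-p)) / pi.
p_c = 1 / (1 + np.exp(-centers))
direct = np.sqrt(p_c * (1 - p_c)) / np.pi

print(np.round(hist, 3))
print(np.round(direct, 3))   # the two rows agree up to Monte Carlo error
```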
By the chain rule, the gradient of the log likelihood with respect to theta is the gradient with respect to phi times the Jacobian of g. This means that the Jeffreys prior transforms the way probability densities do, so the Jeffreys prior on theta implies the same distribution of phi as the Jeffreys prior on phi.
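Written out, this is just chain-rule bookkeeping: with Jacobian J(theta) = dg/dtheta,

```latex
% Chain rule for the gradient of the log likelihood, and its consequence:
\[
  \nabla_\theta \log L \;=\; J(\theta)^{\top}\,\nabla_\phi \log L
  \quad\Longrightarrow\quad
  I_\theta(\theta) \;=\; J(\theta)^{\top}\, I_\phi\!\bigl(g(\theta)\bigr)\, J(\theta),
\]
\[
  \sqrt{\det I_\theta(\theta)} \;=\; \bigl|\det J(\theta)\bigr|\,\sqrt{\det I_\phi\!\bigl(g(\theta)\bigr)},
\]
% which is exactly the |det J| factor in the change-of-variables rule for densities.
```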
I am quite sure this works simply because the log likelihood is a scalar valued function of theta (and the data), so its gradient transforms by the chain rule. For any scalar valued function h of theta and the data, I think the square root of the determinant of the expected value of (the gradient of h)(the gradient of h)' would work just as well. For example, if one used the gradient of the likelihood rather than the log likelihood, I think the resulting prior would be invariant as well.
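As a toy check of that conjecture, take h to be the Bernoulli likelihood itself rather than its log. In the p parametrization the recipe gives a flat prior; in the log-odds parametrization it gives a density proportional to p(1-p); and the two are related by exactly the change-of-variables factor, so this non-Jeffreys construction is invariant too.

```python
# Toy check: build a prior from the gradient of the *likelihood* (not the log
# likelihood) of one Bernoulli(p) observation, in two parametrizations.
import numpy as np

def prior_in_p(p):
    # h = p**x * (1-p)**(1-x); dh/dp is +1 when x = 1 and -1 when x = 0,
    # so E[(dh/dp)^2] = 1 for every p: the recipe gives a flat prior on p.
    return np.ones_like(p)

def prior_in_phi(phi):
    # with phi = log(p/(1-p)), dh/dphi = (dh/dp) * p(1-p),
    # so sqrt(E[(dh/dphi)^2]) = p(1-p)
    p = 1 / (1 + np.exp(-phi))
    return p * (1 - p)

# Invariance: pushing the flat-in-p prior through the reparametrization
# multiplies it by |dp/dphi| = p(1-p), which reproduces prior_in_phi exactly.
phi = np.linspace(-4, 4, 9)
p = 1 / (1 + np.exp(-phi))
print(np.round(prior_in_phi(phi), 4))
print(np.round(prior_in_p(p) * p * (1 - p), 4))   # identical rows
```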
Now, except for the log likelihood, the expected Hessian (second derivative) is not equal to minus the expected value of the product of the gradient and the gradient prime. That implies that for every h() other than the log likelihood there are two invariant priors: the square root of the determinant of the expected value of (the gradient times the gradient prime), and the square root of the determinant of the expected value of the second derivative.
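The special property of the log likelihood that this relies on is the information matrix equality, which holds at the true theta under the usual regularity conditions:

```latex
% Information matrix equality (at the true theta, usual regularity conditions):
\[
  -\,\mathbb{E}\!\left[\frac{\partial^{2} \log L}{\partial\theta\,\partial\theta^{\top}}\right]
  \;=\;
  \mathbb{E}\!\left[\frac{\partial \log L}{\partial\theta}\,
                    \frac{\partial \log L}{\partial\theta^{\top}}\right]
  \;=\; I(\theta).
\]
% For a general h in place of log L the two sides differ, so the two
% constructions above give two different invariant priors.
```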
I think this means that the set of invariant priors is basically about as large as the set of possible probability distributions of theta. Given a prior over theta, invariance implies a prior over any one to one function of theta, but this seems to me to be a statement about how to transform priors when one reparametrizes (which is just the formula for calculating the probability density of a function of a variable with a known probability density).
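For reference, the transformation rule in question is the ordinary change-of-variables formula for densities:

```latex
% Density of phi = g(theta) when theta has density p_theta and g is one to one:
\[
  p_\phi(\phi)
  \;=\;
  p_\theta\!\bigl(g^{-1}(\phi)\bigr)\,
  \left|\det \frac{\partial g^{-1}(\phi)}{\partial\phi}\right|.
\]
```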
The log likelihood is a very popular function of parameters and data, but I see no particular reason why a distribution calculated using the log likelihood is more plausible than any other. I don't see any particular appeal in the Jeffreys prior. I think one does just as well by choosing a parametrization and assuming a flat distribution for that parametrization.
I don't think I have ever seen the Jeffreys prior in practice, that is, I don't think I have ever seen it actually used.