Notes on Information Theory

May 9, 2025

Some rough notes about information theory, thermodynamics and quantum mechanics.

Fluctuation Relations:

Great review by Jarzynski

$$\langle e^{-\beta W} \rangle = e^{-\beta \Delta F}$$

Expectation value over the work distribution of an irreversible process; valid arbitrarily far from equilibrium and in small-$N$ systems.

From Jensen's inequality:

$$\langle W \rangle \geq \Delta F$$

and for the probability of a statistical violation of the 2nd law:

$$P(W \leq \Delta F - \xi) \leq e^{-\beta \xi}$$

i.e. "What's the probability the 2nd law will be violated by at least $\xi$?" Exponentially decaying tail in the thermodynamically forbidden region. Interestingly, there is no upper bound other than $P(W < \Delta F) < 1$, so violation with more than 50% probability is possible. See e.g. a single-electron transistor with 65% probability of decreasing $S$ (only the expectation value matters!).
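A quick Monte Carlo sanity check of the statements above, assuming (purely for illustration) a Gaussian work distribution, for which $\Delta F = \mu - \beta\sigma^2/2$ is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0
mu, sigma = 2.0, 1.0                      # assumed Gaussian work distribution
W = rng.normal(mu, sigma, 1_000_000)

# Jarzynski: <exp(-beta W)> = exp(-beta dF); Gaussian case: dF = mu - beta sigma^2 / 2
dF_est = -np.log(np.mean(np.exp(-beta * W))) / beta
dF_exact = mu - beta * sigma**2 / 2

# Jensen: <W> >= dF
assert W.mean() >= dF_est

# Tail bound: P(W <= dF - xi) <= exp(-beta xi)
xi = 1.0
p_violate = np.mean(W <= dF_exact - xi)
assert p_violate <= np.exp(-beta * xi)
```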

Incorporating information gained from measurement, with feedback control by an external controller (e.g. Maxwell's demon):

$$\langle e^{-\beta W - I} \rangle = e^{-\beta \Delta F}$$

$I$ is the stochastic mutual information between the probability distributions of the feedback system before and after the measurement extracts information from the system.

$$I(x,m) = \ln\frac{P(m \mid x)}{P(m)},$$

so the ensemble average $\langle I \rangle$ is the usual mutual information between system and memory:

$$\langle I \rangle = \sum_{x,m} P(x,m)\,\ln\frac{P(m \mid x)}{P(m)}$$
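A small numerical example of $\langle I \rangle$ for an assumed joint distribution $P(x,m)$ of a binary system and a noisy memory:

```python
import numpy as np

# Illustrative joint distribution P(x, m) of system x and measurement record m
P_xm = np.array([[0.4, 0.1],     # x = 0: (m = 0, m = 1)
                 [0.1, 0.4]])    # x = 1
P_x = P_xm.sum(axis=1)
P_m = P_xm.sum(axis=0)

# Stochastic mutual information I(x, m) = ln P(m|x)/P(m), then its ensemble average
I_xm = np.log(P_xm / P_x[:, None] / P_m[None, :])
I_avg = float(np.sum(P_xm * I_xm))
print(I_avg)   # positive; zero only if x and m are independent
```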

Hypothesis test for forward vs. reverse evolution in time (direction defined by $\frac{dS}{dt}$), given the exchanged work. The likelihood is:

$$L(\mathrm{Forward} \mid W) = \frac{1}{1 + e^{-\beta (W - \Delta F)}}$$

Easy to tell the direction when the exchanged work is large; at equilibrium it is impossible to tell.

$$\frac{P_F(W)}{P_R(-W)} = e^{\beta (W - \Delta F)}$$

This symmetry between the forward and backward work distributions (the Crooks relation) forces a crossing at $W = \Delta F$.
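The crossing can be verified with Gaussian work distributions chosen to satisfy the symmetry (assumed $\beta = \sigma = 1$; the symmetry then forces $\mu_F + \mu_R = \beta\sigma^2$ and $\mu_F - \mu_R = 2\Delta F$):

```python
import numpy as np

beta, sigma, dF = 1.0, 1.0, 0.25
mu_F, mu_R = 0.75, 0.25                   # satisfy the two constraints above

def gauss(w, mu):
    return np.exp(-(w - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

W = np.linspace(-3.0, 3.0, 601)
log_ratio = np.log(gauss(W, mu_F) / gauss(-W, mu_R))   # ln P_F(W) / P_R(-W)
assert np.allclose(log_ratio, beta * (W - dF))

# the forward and (reflected) reverse distributions cross exactly at W = dF
assert np.isclose(gauss(dF, mu_F), gauss(-dF, mu_R))
```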

  • Fluctuation–Dissipation theorem

$$\chi_A''(\omega) = \tanh\left(\tfrac{\beta\omega}{2}\right) S_A(\omega)$$

Relates the imaginary (dissipative) part of the linear response of $A$ to its symmetrized fluctuation spectrum at equilibrium. Fluctuations and dissipation are connected by the factor $\tanh(\beta\omega/2)$.

From a detailed-balance point of view in a two-state system, this can be expressed as:

$$\frac{\Gamma_{\downarrow} - \Gamma_{\uparrow}}{\Gamma_{\downarrow} + \Gamma_{\uparrow}} = \frac{1 - e^{-\beta\omega}}{1 + e^{-\beta\omega}} = \tanh\left(\tfrac{\beta\omega}{2}\right)
$$

This is useful in deriving the thermal properties of Rindler horizons (Unruh and Hawking radiation).
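The detailed-balance form can be checked in a few lines, assuming rates related by $\Gamma_\uparrow/\Gamma_\downarrow = e^{-\beta\omega}$:

```python
import numpy as np

beta, omega = 0.7, 1.3                     # arbitrary illustrative values
G_down = 1.0                               # downward (emission) rate, sets the scale
G_up = G_down * np.exp(-beta * omega)      # detailed balance

ratio = (G_down - G_up) / (G_down + G_up)
assert np.isclose(ratio, np.tanh(beta * omega / 2))
```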

  • Fluctuation theorem

$$\frac{P(\Sigma_t = A)}{P(\Sigma_t = -A)} = e^{A}$$

Quantifies the exponential asymmetry between positive and negative entropy-production fluctuations over a time interval $t$.

  • Generalized Fluctuation theorem

The effective temperature becomes frequency dependent, $T(\omega)$:

$$T_{\mathrm{eff}}(\omega) \equiv \frac{\omega\,S_A(\omega)}{2\,\chi_A''(\omega)}$$

Defines a frequency-dependent “effective temperature” from the ratio of fluctuation power to dissipation.

Here, the two-point correlators are:

$$S_A(\omega) = \frac{1}{2}\int_{-\infty}^{\infty} dt\; e^{i\omega t}\,\bigl\langle\{A(t),A(0)\}\bigr\rangle = \frac{1}{2}\int_{-\infty}^{\infty} dt\; e^{i\omega t}\,\bigl\langle A(t)A(0)+A(0)A(t)\bigr\rangle$$

Symmetrized noise spectrum of the observable $A$.

$$\chi_A''(\omega) = \frac{1}{2i}\int_{-\infty}^{\infty} dt\; e^{i\omega t}\,\bigl\langle[A(t),A(0)]\bigr\rangle = \frac{1}{2i}\int_{-\infty}^{\infty} dt\; e^{i\omega t}\,\bigl\langle A(t)A(0)-A(0)A(t)\bigr\rangle$$

Imaginary (dissipative) part of the linear susceptibility of $A$.

$$[X,Y] = XY - YX$$
Commutator.

$$\{X,Y\} = XY + YX$$
Anticommutator.

  • Fluctuation–Dispersion and Dissipation-Dispersion relations

The imaginary part of the susceptibility is proportional to the fluctuation spectrum due to the fluctuation–dissipation theorem:

χA(ω)=tanh ⁣(βω2)SA(ω)\chi_A''(\omega) = \tanh\!\bigl(\tfrac{\beta\omega}{2}\bigr)\,S_A(\omega)

The full response function obeys the Kramers–Kronig relations:

$$\chi_A(\omega) = \chi_A'(\omega) + i\,\chi_A''(\omega)$$

$$\chi_A'(\omega) = \frac{1}{\pi}\,\mathcal{P}\int_{-\infty}^{\infty} \frac{\chi_A''(\omega')}{\omega' - \omega}\,d\omega', \qquad \chi_A''(\omega) = -\frac{1}{\pi}\,\mathcal{P}\int_{-\infty}^{\infty} \frac{\chi_A'(\omega')}{\omega' - \omega}\,d\omega'.$$

Substituting the fluctuation–dissipation theorem into the first of these gives a fluctuation–dispersion relation:

$$\chi_A'(\omega) = \frac{1}{\pi}\,\mathcal{P}\int_{-\infty}^{\infty} \frac{\tanh\left(\tfrac{\beta\omega'}{2}\right) S_A(\omega')}{\omega' - \omega}\,d\omega'.$$

Using Kramers–Kronig and the fluctuation-dissipation theorem together, all parts of the response can be reconstructed given one of them:

$$\text{Equilibrium fluctuations } S_A(\omega) \;\longleftrightarrow\; \text{Dissipation } \chi'' \;\longleftrightarrow\; \text{Dispersion } \chi'$$

Similarly, the dissipation-dispersion relation reads:

$$\chi_A'(\omega) = \frac{1}{\pi}\,\mathcal{P}\int_{-\infty}^{\infty}\frac{\chi_A''(\omega')}{\omega'-\omega}\,d\omega' = \frac{1}{\pi}\,\mathcal{P}\int_{-\infty}^{\infty}\frac{\tanh\left(\frac{\beta\omega'}{2}\right) S_A(\omega')}{\omega'-\omega}\,d\omega'$$

$\mathcal{P}$ denotes the Cauchy principal value of the integral, which converges for analytic functions (analyticity in the upper half-plane follows from causality):

$$\mathcal{P}\int_{-\infty}^{\infty}\frac{f(\omega')}{\omega'-\omega}\,d\omega' = \lim_{\varepsilon\to0^+} \left[ \int_{-\infty}^{\omega-\varepsilon}\frac{f(\omega')}{\omega'-\omega}\,d\omega' + \int_{\omega+\varepsilon}^{\infty}\frac{f(\omega')}{\omega'-\omega}\,d\omega' \right].$$
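A numerical sanity check of Kramers–Kronig on the damped-oscillator susceptibility $\chi(\omega) = 1/(\omega_0^2 - \omega^2 - i\gamma\omega)$ (an assumed toy model), evaluated at $\omega = 0$ where the integrand is regular since $\chi''$ is odd:

```python
import numpy as np

w0, gamma = 1.0, 0.3
w = np.linspace(-200, 200, 2_000_001)     # uniform grid including w = 0
dw = w[1] - w[0]
chi = 1.0 / (w0**2 - w**2 - 1j * gamma * w)

# chi'(0) = (1/pi) P-int chi''(w')/w' dw'; the limit of chi''(w')/w' at 0 is gamma/w0^4
integrand = np.where(w != 0, chi.imag / np.where(w != 0, w, 1.0), gamma / w0**4)
kk = integrand.sum() * dw / np.pi
assert np.isclose(kk, 1 / w0**2, rtol=1e-3)   # matches chi'(0) = 1/w0^2
```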

  • Applications to mechanics, optics, acoustics, electronics

Fluctuation-dissipation (and its corollaries, fluctuation-dispersion and dissipation-dispersion) holds in any system that is linear, time-translationally invariant, and causal within linear response theory (Kubo formalism), both in equilibrium and, with the generalized $T(\omega)$, out of equilibrium:

a. Overdamped Brownian particles (Smoluchowski dynamics)

The mobility $\mu(\omega)$ and the velocity-autocorrelation spectrum satisfy

$$\mathrm{Im}\,\mu(\omega) = \tanh\left(\tfrac{\beta\omega}{2}\right) S_v(\omega), \qquad S_v(\omega) = \coth\left(\tfrac{\beta\omega}{2}\right) \mathrm{Im}\,\mu(\omega)$$

and obey Kramers–Kronig.

b. Electromagnetic response of linear media

The complex permittivity

$$\varepsilon(\omega) = \varepsilon'(\omega) + i\,\varepsilon''(\omega)$$

obeys Kramers–Kronig, and at thermal equilibrium

$$\mathrm{Im}\,\varepsilon(\omega) = \tanh\left(\tfrac{\beta\omega}{2}\right) S_P(\omega), \qquad S_P(\omega) = \coth\left(\tfrac{\beta\omega}{2}\right) \mathrm{Im}\,\varepsilon(\omega)$$

which ties the dissipative part to the equilibrium polarization-fluctuation spectrum $S_P$ via the fluctuation-dissipation theorem.

Interestingly enough, we can estimate the amplitude and phase noise a beam of light picks up when travelling through a material with known index of refraction or absorption spectrum.

The optical power is $P=\hbar\omega\,\Phi$. The medium absorbs with coefficient $\alpha(\omega)$ over length $L$, so the excess intensity fluctuations from thermal polarization noise follow from the fluctuation-dissipation theorem:

$$S_{I,\mathrm{add}}(\Omega) \approx 2\,\hbar\omega\,P \times \frac{k_B T}{\hbar\omega}\,\alpha(\omega)\,L = 2\,P\,k_B T\,\alpha(\omega)\,L.$$

Shot noise is

$$S_{I,\mathrm{shot}} = 2\,\hbar\omega\,P.$$

Therefore

$$\frac{S_{I,\mathrm{add}}}{S_{I,\mathrm{shot}}} = \frac{k_B T}{\hbar\omega}\,\alpha(\omega)\,L.$$

In terms of SNR this becomes:

$$\mathrm{SNR} = \frac{A}{S_{I,\mathrm{shot}} + S_{I,\mathrm{add}}} = \frac{A}{S_{I,\mathrm{shot}}\bigl(1 + \frac{S_{I,\mathrm{add}}}{S_{I,\mathrm{shot}}}\bigr)} = \frac{A}{S_{I,\mathrm{shot}}\bigl(1 + \frac{k_B T}{\hbar\omega}\,\alpha L\bigr)}.$$

Thermal refractive-index fluctuations over length $L$ also introduce phase noise. With wavevector $k_0 = \frac{\omega n}{c}$, the fluctuation-dissipation theorem yields

$$S_{\phi,\mathrm{add}}(\Omega) \approx (k_0 L)^2\,\frac{k_B T}{\hbar\omega}\,\alpha(\omega)\,L.$$

Shot-noise-limited phase diffusion for a coherent beam of flux $\Phi = \frac{P}{\hbar\omega}$ is

$$S_{\phi,\mathrm{shot}} = \frac{1}{2\Phi} = \frac{\hbar\omega}{2P}.$$

Thus

$$\frac{S_{\phi,\mathrm{add}}}{S_{\phi,\mathrm{shot}}} = 2\,\frac{P}{\hbar\omega}\,(k_0 L)^2\,\frac{k_B T}{\hbar\omega}\,\alpha(\omega)\,L.$$

Both of these effects are usually minuscule fractions of the shot noise floor, but interesting to know about.
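Plugging in illustrative numbers (assumed values: 1550 nm light, 1 km of 0.2 dB/km fiber, room temperature; not taken from any reference) shows how small the added intensity noise is relative to shot noise:

```python
import numpy as np

hbar, kB, c = 1.054571817e-34, 1.380649e-23, 2.998e8
T, lam, L = 300.0, 1550e-9, 1000.0        # assumed: 300 K, 1550 nm, 1 km
omega = 2 * np.pi * c / lam
alpha = 0.2 * np.log(10) / 10 / 1000      # 0.2 dB/km converted to 1/m

intensity_ratio = kB * T / (hbar * omega) * alpha * L    # S_I,add / S_I,shot
print(f"S_I,add / S_I,shot = {intensity_ratio:.1e}")
assert intensity_ratio < 0.01             # well below the shot-noise floor
```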

c. Acoustic (sound) waves in fluids or solids

The complex bulk modulus (or sound-attenuation coefficient) has real and imaginary parts related by Kramers–Kronig, and the attenuation spectrum is set by equilibrium pressure fluctuations:

$$\mathrm{Im}\,K(\omega) = \tanh\left(\tfrac{\beta\omega}{2}\right) S_p(\omega), \qquad S_p(\omega) = \coth\left(\tfrac{\beta\omega}{2}\right) \mathrm{Im}\,K(\omega)$$

d. Electronic transport in conductors

The complex conductivity $\sigma(\omega)$ obeys Kramers–Kronig, and $\mathrm{Re}\,\sigma(\omega)$ is given by the current noise via the Johnson–Nyquist relation:

$$\mathrm{Re}\,\sigma(\omega) = \tanh\left(\tfrac{\beta\omega}{2}\right) S_J(\omega), \qquad S_J(\omega) = \coth\left(\tfrac{\beta\omega}{2}\right) \mathrm{Re}\,\sigma(\omega)$$
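In the classical limit $\beta\hbar\omega \ll 1$, $\coth(\beta\hbar\omega/2) \to 2/(\beta\hbar\omega)$ and the relation reduces to white Johnson–Nyquist noise; e.g. for an assumed 50 Ω resistor at 300 K over a 1 MHz bandwidth:

```python
import numpy as np

kB = 1.380649e-23
T, R, bandwidth = 300.0, 50.0, 1e6        # illustrative values
v_rms = np.sqrt(4 * kB * T * R * bandwidth)   # Johnson-Nyquist voltage noise
print(f"{v_rms * 1e9:.0f} nV rms")        # just under a microvolt
```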

Usually, the world is split into system and environment, where $\Delta U_{\mathrm{system}} + \Delta U_{\mathrm{environment}} = 0$.

However, at small scales interactions between system and environment become non-negligible, and it becomes necessary to include the solvation (mean-force) potential $\phi$. In this case, the Jarzynski equality becomes:

$$\bigl\langle e^{-\beta W + \beta[\phi(x_t,\lambda_t)-\phi(x_0,\lambda_0)]}\bigr\rangle = e^{-\beta\,\Delta F^*}$$

Here, $\Delta F^*$ is the free-energy change of the Hamiltonian of mean force.

  • Bochkov–Kuzovlev equality

$$\bigl\langle e^{-\beta W_{\mathrm{ex}}}\bigr\rangle = 1$$

This shows that the exponential average of the “exclusive” work done by an external force (with no change in potential) equals unity.

  • Evans–Searles fluctuation theorem

$$\frac{P(\Sigma_t = A)}{P(\Sigma_t = -A)} = e^{A}$$

It states that over a finite time $t$, the probability of observing entropy production $+A$ vs. $-A$ is exponentially biased, by a factor $e^{A}$.

  • Gallavotti–Cohen steady-state fluctuation theorem

$$\lim_{t\to\infty}\frac{1}{t}\ln\frac{P(\Sigma_t = A)}{P(\Sigma_t = -A)} = A$$

In a nonequilibrium steady state, the long-time scaled log-ratio of entropy fluctuations equals the production itself.

  • Kurchan fluctuation theorem

$$\frac{P(\Sigma_t = A)}{P(\Sigma_t = -A)} = e^{A}$$

An extension of transient FTs to thermostatted steady states, showing the same exponential symmetry for entropy production.

  • Lebowitz–Spohn fluctuation theorem

$$\frac{P(\omega)}{P(\Theta \omega)} = e^{\Sigma(\omega)}$$

The ratio of the probability of a trajectory $\omega$ to that of its time-reversal $\Theta\omega$ is the exponential of its total entropy production.

  • Hummer–Szabo relation

$$\bigl\langle \delta(x - x_t)\,e^{-\beta W}\bigr\rangle = \frac{e^{-\beta F(x)}}{Z_0}$$

Enables reconstruction of equilibrium free-energy profiles $F(x)$ from ensembles of non-equilibrium work measurements.

  • Seifert integral fluctuation theorem

$$\bigl\langle e^{-\Delta s_{\mathrm{tot}}}\bigr\rangle = 1$$

States that for any stochastic process, the exponential average of the total entropy production equals unity.

Complexity vs. Entropy

Take some system with a phase transition, like the 2d Ising model. As temperature increases, entropy monotonically increases. However, in a way both the $T=0$ and $T=\infty$ limits are equally simple: the system is homogeneous in both cases and every cell is like every other cell.

In between, however, there is nontrivial structure and more "information" contained in the lattice, which is maximized at the critical point. It seems like this quantity might in general be proportional to $\frac{\partial S}{\partial T} = \frac{C(T)}{T}$.

Some keywords to pick this up later:

  • Statistical Complexity (Crutchfield, Shalizi, et al.)
  • Excess Entropy (This is the deviation in entropy from an ideal gas, i.e. a fully decoupled system. Any correlations should show up here!)
  • There's some notion of susceptibility here, i.e. the higher the heat capacity the more energy is needed to displace the system from its current state. At the critical point the system seems to be able to "withstand" the most energy for a given change in $T$ and structure? Not sure how to express this more rigorously. Something like $\int_{\text{phase space}} \rho(T)\,\rho(T + \epsilon)\,dV$ (overlap of the original and perturbed densities of states) is maximized there? Maybe some kind of deviation metric is useful here, say Ruppeiner geometry? Is this the same as computational (ir)reducibility?

Coordinate remapping

Take a Smoluchowski equation with some $V(x)$ and stationary distribution $P(x)$, and apply a nonlinear map $x\to y$ such that $P(y)\propto e^{-V_{\mathrm{eff}}(y)/D}$ is Gaussian. After re-expressing the entropy in the new set of coordinates

$$S_y = -\int P(y)\ln P(y)\,dy$$

then defining

$$V_{\mathrm{eff}}(y) = -D\ln P(y) + \mathrm{const}$$

the new potential now takes the form

$$V_{\mathrm{eff}}(y) \propto y^2$$

a potential consistent with a Gaussian distribution.

The Jacobian term $-D\ln\left|\frac{dx}{dy}\right|$ is absorbed into the coordinates $dy(dx)$, and entropy and energy have been repartitioned. This transformation preserves the free energy and the partition function (which roughly counts the number of accessible states). Explicitly,

$$P(y) = P_x(x(y))\left|\frac{dx}{dy}\right|.$$
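The change-of-variables formula can be checked by histogramming samples under a nonlinear map; here (an arbitrary example) $y = \tanh(x)$ with $x \sim \mathcal N(0,1)$, so $|dx/dy| = 1/(1-y^2)$:

```python
import numpy as np

rng = np.random.default_rng(2)
xs = rng.normal(size=500_000)
ys = np.tanh(xs)

hist, edges = np.histogram(ys, bins=50, range=(-0.99, 0.99), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

x_of_y = np.arctanh(centers)
P_x = np.exp(-x_of_y**2 / 2) / np.sqrt(2 * np.pi)
P_y = P_x / (1 - centers**2)              # P(y) = P_x(x(y)) |dx/dy|
assert np.allclose(hist, P_y, rtol=0.1, atol=0.02)
```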

Functional optimization

Smoluchowski dynamics are a gradient flow of the free-energy functional

$$\mathcal F[P] = \int V(x)\,P(x)\,dx + k_B T\int P(x)\ln P(x)\,dx,$$

so $\dot{\mathcal F}\le 0$, and its unique stationary solution $P\propto e^{-V/(k_B T)}$ is the minimizer of $\mathcal F$ under $\int P = 1$.

In the case of the Schrödinger equation there is no dissipation (evolution is unitary). However, applying a Wick rotation to imaginary time turns the Schrödinger equation into a diffusion equation:

$$-\hbar\,\partial_\tau\psi = \widehat H\,\psi$$

which is the gradient flow of

$$\mathcal E[\psi] = \langle\psi|\widehat H|\psi\rangle = \int\psi^*\Bigl(-\tfrac{\hbar^2}{2m}\nabla^2+V(x)\Bigr)\psi\,dx.$$

Under the Wick rotation, the higher eigenmodes of $\psi$ decay in the same way as those in diffusion:

  • Smoluchowski: $P(x,t)=P_{\mathrm{eq}}+\sum c_i\phi_i(x)e^{-\lambda_i t}$.
  • Imaginary-time Schrödinger: $\psi(x,\tau)=\sum a_n\psi_n(x)e^{-E_n\tau/\hbar}$.

In each case only the lowest eigenmode (the stationary distribution) survives as $t,\tau\to\infty$, driving the system to the minimum of the corresponding functional.
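A minimal sketch of this decay: explicit-Euler imaginary-time propagation of the 1D harmonic oscillator ($\hbar = m = \omega = 1$, grid and step sizes chosen for stability), relaxing an arbitrary even start state to the Gaussian ground state with $E_0 = 1/2$:

```python
import numpy as np

x = np.linspace(-8, 8, 256)
dx = x[1] - x[0]
V = 0.5 * x**2
psi = np.exp(-np.abs(x))                  # arbitrary start with ground-state overlap
dtau = 1e-3

def H(phi):                               # H phi = -phi''/2 + V phi (periodic stencil)
    lap = (np.roll(phi, 1) - 2 * phi + np.roll(phi, -1)) / dx**2
    return -0.5 * lap + V * phi

for _ in range(20_000):
    psi -= dtau * H(psi)                  # d psi / d tau = -H psi
    psi /= np.sqrt(np.sum(psi**2) * dx)   # keep normalized

E0 = np.sum(psi * H(psi)) * dx
assert abs(E0 - 0.5) < 1e-2               # converged to the ground-state energy
```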

Schrödinger kinetic energy as diffusion term

The $-\nabla^2$ kinetic energy operator in the Schrödinger equation penalizes curvature/gradients, driving a diffusion-like spread. This can be formalized by applying the Madelung transform $\psi=\sqrt{\rho}\,e^{iS/\hbar}$ with $\rho = \Psi^* \Psi$. For real $\psi$ (as is the case for a stationary ground-state distribution, among others), the kinetic term becomes the Weizsäcker functional

$$T_W[\rho] = \frac{\hbar^2}{8m}\int\frac{|\nabla\rho|^2}{\rho}\,dx,$$

which is precisely proportional to the Fisher information of $\rho$. Thus the ground state can be seen as minimizing

$$\mathcal F[\rho] = \underbrace{\tfrac{\hbar^2}{8m}\int\frac{|\nabla\rho|^2}{\rho}}_{\text{“entropy”}} + \int V(x)\,\rho(x)\,dx.$$

The Fisher information is built from the score: $(\partial_x\ln\rho)^2=(\partial_x \rho)^{2}/\rho^{2}$, so

$$I[\rho] = \int\rho\,(\partial_x\ln\rho)^2\,dx = \int\frac{(\partial_x \rho)^2}{\rho}\,dx, \qquad \rho=\psi^2.$$

Therefore the Weizsäcker kinetic energy for $\psi=\sqrt{\rho}$ is

$$T_W[\rho] = \frac{\hbar^2}{2m}\int|\nabla\psi|^2\,dx = \frac{\hbar^2}{8m}\,I[\rho],$$

so up to a constant factor the kinetic term is the Fisher information.
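Numerically, for a Gaussian $\rho$ with variance $\sigma^2$ the Fisher information is $1/\sigma^2$, and $\int|\psi'|^2\,dx = I[\rho]/4$ for $\psi = \sqrt{\rho}$:

```python
import numpy as np

sigma = 1.7
x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]
rho = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Fisher information I[rho] = int rho'(x)^2 / rho(x) dx = 1/sigma^2 for a Gaussian
I = np.sum(np.gradient(rho, dx)**2 / rho) * dx
assert np.isclose(I, 1 / sigma**2, rtol=1e-3)

# kinetic integrand for psi = sqrt(rho): int |psi'|^2 dx = I/4
psi = np.sqrt(rho)
grad_psi_sq = np.sum(np.gradient(psi, dx)**2) * dx
assert np.isclose(grad_psi_sq, I / 4, rtol=1e-3)
```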

Some properties and intuition on Fisher information:

  1. Score-function sensitivity: $\partial_x\ln\rho$ measures how sensitive the log-likelihood is to shifts; averaging its square under $\rho$ gives the total information about location.
  2. Cramér–Rao bound: $\mathrm{Var}(\hat\theta)\ge 1/I[\rho]$.
  3. Geometric curvature: penalizes steep features in $\rho$.
  4. Infinitesimal KL shift: $\mathrm{KL}[\rho\,\|\,\rho(\cdot+\varepsilon)]\approx\tfrac12 I[\rho]\,\varepsilon^2$. Similar to the earlier question about entropy susceptibility at the critical point?
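Point 4 can be checked numerically for a non-Gaussian density, e.g. $\rho \propto e^{-x^4}$ (chosen arbitrarily):

```python
import numpy as np

x = np.linspace(-3, 3, 6001)
dx = x[1] - x[0]
rho = np.exp(-x**4)
rho /= rho.sum() * dx                     # normalize on the grid

I = np.sum(np.gradient(rho, dx)**2 / rho) * dx    # Fisher information

eps = 0.01
rho_s = np.exp(-(x + eps)**4)             # shifted density
rho_s /= rho_s.sum() * dx
kl = np.sum(rho * np.log(rho / rho_s)) * dx
assert np.isclose(kl, 0.5 * I * eps**2, rtol=1e-2)   # KL ~ (1/2) I eps^2
```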

Coordinate transform in the Schrödinger equation

Analogous to the Smoluchowski equation: if we remap $x\to y=f(x)$ and $\psi\to\phi(y)=\sqrt{\frac{dx}{dy}}\,\psi(x(y))$, the total energy expectation

$$\langle H\rangle = \int\psi^*\bigl(-\tfrac{\hbar^2}{2m}\partial_x^2+V(x)\bigr)\psi\,dx = \int\phi^*\bigl(-\tfrac{\hbar^2}{2m}\partial_y^2+\widetilde V(y)\bigr)\phi\,dy$$

is invariant, but an extra Jacobian term $Q_J(y)$ is generated in the potential. The invariants here are energy (Smoluchowski: free energy) and normalization (Smoluchowski: partition function). This is interesting because it means there is nothing special about quantum mechanics in the $x$, $p$ coordinates. The Stone–von Neumann theorem guarantees that any linear canonical change $(x,p)\to(q,P)$ leads to a unitarily equivalent representation. Specifically, a pair of self-adjoint operators $(\hat Q,\hat P)$ on a separable Hilbert space is canonical if

$$[\hat Q,\hat P] = i\hbar\,\mathbf{1}$$

(or equivalently their exponentials satisfy the Weyl relations).

In that case Stone–von Neumann tells you there is, up to unitary equivalence, exactly one irreducible representation of those relations, and you get equivalent wave functions in $Q$-space or in $P$-space.
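A small illustration of that equivalence: computing $\langle p^2\rangle$ for the same state in the position representation (via derivatives) and in the momentum representation (via FFT) gives the same number ($\hbar = 1$; the state is an arbitrary example):

```python
import numpy as np

N = 1024
x = np.linspace(-20, 20, N, endpoint=False)
dx = x[1] - x[0]
psi = (1 + 0.3 * x) * np.exp(-x**2 / 2)   # arbitrary mixed state
psi /= np.sqrt(np.sum(np.abs(psi)**2) * dx)

# <p^2> in x-space: int |dpsi/dx|^2 dx
p2_x = np.sum(np.abs(np.gradient(psi, dx))**2) * dx

# <p^2> in p-space: int k^2 |phi(k)|^2 dk, with phi the Fourier transform of psi
k = 2 * np.pi * np.fft.fftfreq(N, d=dx)
phi = np.fft.fft(psi) * dx / np.sqrt(2 * np.pi)
p2_k = np.sum(k**2 * np.abs(phi)**2) * (k[1] - k[0])
assert np.isclose(p2_x, p2_k, rtol=1e-2)
```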

I initially thought any set of variables with non-zero commutator might do the trick as long as it forms a basis for the symplectic structure of phase space, but it turns out there are more stringent conditions in QM. The commutator must be a multiple of $i\hbar$ in order to generate the canonical Heisenberg algebra. I don't understand what that means, maybe look into it later.

Nonlinear canonical transformations are more complicated but analogous to the Smoluchowski case: a nonlinear coordinate transformation introduces Jacobian terms that end up in one of two places. Assuming a complete basis set of wavefunctions, the coordinate transformation can be expressed as a unitary transform of the wavefunction: $\Phi(Q)\,dQ = U\,\Psi(x(Q))\,dx$, with a unitary transformation $U$ between the new and old forms of the wavefunction in the new and old coordinates. Due to unitarity this does not change the normalization of the wavefunction. Now the question is whether to include this change of coordinates in the operators and keep the old form of the wavefunction, or to keep the old form of the operators and include the change in the wavefunction. This difference seems to map nicely onto the QM Schrödinger vs. Heisenberg picture, as a more general, time-independent version of it (since the regular formulation talks about changes over time):

Schrödinger‐picture coordinate change:

  • Move state into the new coordinates:

    $$\phi(y)=\bigl[U\,\psi\bigr](y) = \sqrt{\frac{dx}{dy}}\;\psi\bigl(x(y)\bigr),$$

  • Carry your Hamiltonian and all other operators along via

    $$H_y = U\,H_x\,U^{-1}, \qquad A_y = U\,A_x\,U^{-1}.$$

  • Then expectation values are $\langle\psi|A_x|\psi\rangle = \langle\phi|A_y|\phi\rangle$.

Heisenberg‐picture coordinate change

  • Keep the wavefunction $\psi(x)$ unchanged in the original Hilbert space,

  • transform every operator into changed coordinates:

    $$\widetilde H = U^{-1}\,H_x\,U, \qquad \widetilde A = U^{-1}\,A_x\,U.$$

  • States live in the old $x$-space, but the operators carry all the Jacobian/ordering corrections.

  • Expectation values stay invariant: $\langle\psi|A_x|\psi\rangle = \langle\psi|\widetilde A|\psi\rangle$.

This may be a good analogy in general for these coordinate transforms?

As discussed in entropic gravity, one can find a set of coordinates in which the distribution $\rho$ becomes uniform, or any other shape, e.g. using the cumulative-distribution map

$$y = F(x)=\int_{x_{\min}}^x\rho_0(x')\,dx', \qquad \phi(y)=\frac{\psi_0(x(y))}{\sqrt{\rho_0(x(y))}} = 1.$$

The external $V$ is absorbed into a Jacobian quantum potential $Q_J(y)$, and you can choose to view all of $E_0$ as coming from the Fisher (kinetic) term in the $y$-frame.
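A sketch of the flattening map: pushing samples of a density through its own CDF $F$ yields a uniform density in $y$ (a standard normal is chosen purely as an example):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 100_000)
F = np.vectorize(lambda t: 0.5 * (1 + erf(t / sqrt(2))))   # normal CDF
y = F(x)

# the mapped samples are uniform on [0, 1] up to sampling noise
hist, _ = np.histogram(y, bins=20, range=(0.0, 1.0), density=True)
assert np.allclose(hist, 1.0, atol=0.1)
```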

However, since the energy expectation values are preserved (up to a common shift) the higher order modes do not in general transform to match the higher order modes of the new system. For instance, transforming some system into one where the

Entropy as Information Flow

There seems to be a common theme that in systems with dissipation some kind of functional is optimized for in the ground/stationary state. This is not true for unitary evolution, hence the Wick-rotation for the Schrodinger equation is necessary.

I think a good way to think about this is that from the perspective of a linear Markovian process (e.g. a random walk) there is some kernel that reallocates state density at each time step. The question is what the fixed point of repeated convolution of this kernel with itself is.

For most kernels the fixed point of auto-convolution is a Gaussian, and this is probably enough to qualify as dissipation.

If the kernel is a delta function or a permutation this is not true; the density must be split across at least two states (on average; a 50% chance of staying fully and 50% of transferring fully might also be OK). More rigorously, what's needed is an irreducible (for long times all parts of the state space communicate), aperiodic, stochastic convolution kernel $K$ (or continuous-time master equation); such a kernel has a unique stationary $P_{\mathrm{eq}}$ that maximizes entropy (or minimizes free energy), with the KL divergence $D_{KL}[P\|P_{\mathrm{eq}}]$ as a Lyapunov functional.
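The convergence of repeated auto-convolution to a Gaussian (for a generic, aperiodic kernel) is just the central limit theorem; a four-point step kernel chosen arbitrarily:

```python
import numpy as np

kernel = np.array([0.5, 0.0, 0.2, 0.3])   # arbitrary aperiodic step distribution
dist = kernel.copy()
for _ in range(200):
    dist = np.convolve(dist, kernel)      # repeated auto-convolution

# compare to a Gaussian with matched mean and variance
n = np.arange(len(dist))
mu = (n * dist).sum()
var = ((n - mu)**2 * dist).sum()
gauss = np.exp(-(n - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
assert np.abs(dist - gauss).sum() < 0.02  # small L1 distance to the Gaussian
```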

There's also a neat connection between entropy production rate and entropy. For a Smoluchowski equation:

$$\partial_t P = \partial_x\bigl(\partial_x V\, P + D\,\partial_x P\bigr), \qquad P_{\mathrm{eq}}\propto e^{-V/D},$$

the relative entropy

$$D_{KL}[P\|P_{\mathrm{eq}}]=\int P\ln\frac{P}{P_{\mathrm{eq}}}\,dx$$

satisfies

$$\frac{dD_{KL}}{dt} = -D\int P\,\Bigl(\partial_x \ln\frac{P}{P_{\mathrm{eq}}}\Bigr)^2 dx,$$

which for free diffusion ($V = 0$) reduces to $-D\,I[P]$, so $\dot D_{KL}\le 0$. Here

$$I[P]=\int\frac{(\partial_x P)^2}{P}\,dx$$

is the Fisher information.
Thus the H-theorem is the statement that the KL divergence between the current distribution and equilibrium decays monotonically, with a rate proportional to the Fisher information.
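A closed-form check in the simplest case (OU / Smoluchowski with $V = x^2/2$, $D = 1$, zero-mean Gaussian $P$ with variance $s$, so $P_{\mathrm{eq}} = \mathcal N(0,1)$): along the variance flow $\dot s = 2(1-s)$, $\dot D_{KL}$ equals minus the relative Fisher information $\int P\,(\partial_x\ln(P/P_{\mathrm{eq}}))^2\,dx$, which reduces to the plain $I[P]$ for free diffusion:

```python
import numpy as np

def D_KL(s):                  # KL( N(0,s) || N(0,1) ), closed form
    return 0.5 * (s - 1 - np.log(s))

def I_rel(s):                 # relative Fisher information for the same pair
    return (s - 1)**2 / s

s, dt = 4.0, 1e-5
s_next = s + 2 * (1 - s) * dt             # variance flow ds/dt = 2(1 - s)
dDdt = (D_KL(s_next) - D_KL(s)) / dt      # numerical derivative along the flow
assert np.isclose(dDdt, -I_rel(s), rtol=1e-3)
```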

What's also cool: given the current non-equilibrium distribution, one can calculate the minimum entropy production rate some other process must expend in order to maintain that distribution, based on the Harada–Sasa relation. Maintaining the distribution requires at least

$$\sigma_{\min} = k_B \, D \, I[P]$$

as the entropy production rate to offset the relaxation (per particle if $P$ is a probability; if $P$ is a particle density then $I[P]$ scales with $N$ and $\sigma_{\mathrm{tot}} = N \sigma_{\min}$).

Ideal gas vs. Smoluchowski

While both diffuse, there's an interesting difference in how these systems behave. Above I wrote about repartitioning of energy and entropy in the Smoluchowski equation. By changing coordinates the distribution can be flattened, so the system is potential-free and the energy expectation value is 0. In an ideal gas in a box, the distribution is also homogeneous. However, the ideal gas still has a non-zero internal energy expectation value, namely $\langle E \rangle = \frac{3}{2} k_B T$, or more generally $\langle E \rangle = \frac{3 + f}{2} k_B T$ if there are $f$ additional internal degrees of freedom, due to equipartition. So why does one of them have a "residual" internal energy, while the other doesn't?

What is going on here is that in Smoluchowski we kill off any kinetic energy contributions. The Smoluchowski equation is an overdamped limit where momentum immediately dissipates, independently of position or potential. On the other hand, in a system that behaves like an ideal gas the mean free path is very long (ballistic limit), so the distribution spreads over both $x$ and $p$ in phase space.

The Smoluchowski equation is the overdamped limit of the more general Kramers equation, which supports full phase-space dynamics:

$$\partial_t f(x,p,t) = -\frac{p}{m}\,\partial_x f(x,p,t) + \partial_x V\,\partial_p f(x,p,t) + \gamma\,\partial_p\bigl(p\, f(x,p,t) + m k_B T\,\partial_p f(x,p,t)\bigr),$$

where momentum decays exponentially with damping rate $\gamma$. Here both the kinetic energy $\tfrac{p^2}{2m}$ and the potential energy $V(x)$ appear, and the stationary distribution is a Maxwell–Boltzmann distribution over all of phase space, including the potential:

$$f_{\mathrm{eq}}(x,p)\propto e^{-\frac{p^2}{2 m k_B T} - \frac{V(x)}{k_B T}}$$

The corresponding free energy functional is

$$\mathcal F[f] = \int \Bigl(\frac{p^2}{2m}+V(x)\Bigr)\,f(x,p,t) + k_B T\, f(x,p,t)\,\ln f(x,p,t) \; dx\,dp$$

and contains a Gibbs-Shannon entropy integrated over all of phase space.

How exactly the system behaves next depends on the ensemble.

In an isolated system with fixed total energy (microcanonical ensemble), the local kinetic energy can vary as a function of $V(x)$. This is actually a cool example where temperature gradients can develop spontaneously (albeit transiently; in the end they even out) without breaking the second law, due to the entropy increase from occupying additional space. In this case the marginal distributions do not factorize in general. For large $N$, far away from phase transitions (plus some additional qualifications), the final stationary distribution approaches:

$$f_{\mathrm{eq}}(x,p) \propto e^{-\frac{H}{k_B T}} = e^{-\frac{p^2}{2 m k_B T} - \frac{V(x)}{k_B T}}.$$

At that point the momentum distribution is independent of position and only depends on the temperature of the system.

A single, uniform heat bath at temperature $T$ instantaneously rethermalizes the kinetic degrees of freedom, so the temperature is uniform and every part of the system rapidly ($t \sim 1/\gamma$) decays to the same Maxwell–Boltzmann distribution. The distribution generally factorizes because of this, and is the same as the large-$N$ limit of the microcanonical ensemble but without the qualifications:

$$f_{\mathrm{eq}}(x,p) \propto e^{-\frac{H}{k_B T}} = e^{-\frac{p^2}{2 m k_B T} - \frac{V(x)}{k_B T}}.$$
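A quick underdamped Langevin sketch ($m = k_B T = 1$, $V = x^2/2$, Euler–Maruyama with assumed parameters): after many damping times, both marginals match this Maxwell–Boltzmann form:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, dt, n = 0.5, 1e-2, 10_000          # assumed damping, step, ensemble size
x = np.zeros(n)
p = np.zeros(n)
for _ in range(5_000):                    # total time t = 50 >> 1/gamma
    x += p * dt
    p += (-x - gamma * p) * dt + np.sqrt(2 * gamma * dt) * rng.normal(size=n)

assert abs(p.var() - 1.0) < 0.05          # Maxwell-Boltzmann momentum marginal
assert abs(x.var() - 1.0) < 0.05          # Boltzmann position marginal
```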

The gradient flow for this diffusion in momentum space can be written as:

$$\frac{d\mathcal F}{dt} = -\gamma\,k_B T\int \frac{\bigl(\partial_p f(x,p,t)\bigr)^2}{f}\, dx\,dp = -\gamma\,k_B T\; I_p[f(x,p,t)] \le 0$$

This means $\mathcal F$ decays to its unique minimum, the Maxwell–Boltzmann equilibrium. The rate of dissipation of non-equilibrium modes (and the rate of entropy production) can again be expressed through the Fisher information. However, note that this is only the Fisher information of the momentum part of the distribution. There is no $\partial_x f$ term because there is no diffusion $\partial_x^2 f$ in the Kramers equation.
This fact appears strange at first, since an ideal gas clearly spreads out to a uniform distribution.

In fact, momentum carries you in $x$ via the advection term $-\tfrac{p}{m}\,\partial_x f$. In the overdamped Smoluchowski limit $\gamma\to\infty$:

$$\partial_t P = \partial_x\Bigl(\frac{1}{\gamma}\,\partial_x V\, P + \frac{k_B T}{m\gamma}\,\partial_x P\Bigr)$$

where $D = \frac{k_B T}{m \gamma}$ (Einstein relation).

This expression has the explicit $\partial_x^2 P$ diffusion term.

In the underdamped case, there is deterministic transport instead (imagine a beam of gas propagating in a vacuum):

$$-\tfrac{p}{m}\,\partial_x f(x,p,t) \quad\text{and}\quad \partial_x V\,\partial_p f(x,p,t)$$

These terms transport probability density ballistically from high to low; coarse-grained, this can look like an effective diffusion, but it is not limited to that.