Some Applications

Biostat 216

Author

Dr. Hua Zhou @ UCLA

Published

December 3, 2024

using Pkg
Pkg.activate(pwd())
Pkg.instantiate()
Pkg.status()
  Activating project at `~/Documents/github.com/ucla-biostat-216/2024fall/slides/14-app`
Status `~/Documents/github.com/ucla-biostat-216/2024fall/slides/14-app/Project.toml`
  [1e616198] COSMO v0.8.9
  [13f3f980] CairoMakie v0.12.16
  [f65535da] Convex v0.16.3
  [31c24e10] Distributions v0.25.113
  [ced4e74d] DistributionsAD v0.6.57
  [5789e2e9] FileIO v1.16.6
  [f6369f11] ForwardDiff v0.10.38
  [916415d5] Images v0.26.1
  [ee78f7c6] Makie v0.21.16
  [90014a1f] PDMats v0.11.31
  [0c5d862f] Symbolics v6.22.0
  [e88e6eb3] Zygote v0.6.73

1 Spectral clustering

  • Spectral clustering is a commonly used clustering approach in machine learning; see the ScikitLearn tutorial.

  • Given an undirected, weighted graph \(G=(V, E)\), we can define the weighted adjacency matrix \(\mathbf{W}=(w_{ij})_{i,j=1,...,n}\) with nonnegative weights \(w_{ij}\). \(w_{ij}=0\) means that the vertices \(v_i\) and \(v_j\) are not connected. The degree of a vertex \(v_i\) is \(d_i = \sum_{j=1}^n w_{ij}\).

  • The unnormalized graph Laplacian is \[ \mathbf{L} = \mathbf{D} - \mathbf{W}, \] where \(\mathbf{D} = \text{diag}(d_1,\ldots,d_n)\).

  • Exercise: For this graph, what is \(\mathbf{L}\)?

  • Properties of the unnormalized graph Laplacian (verified numerically in the code sketch after this list):

    1. For every vector \(\mathbf{f} \in \mathbb{R}^n\), we have \[ \mathbf{f}' \mathbf{L} \mathbf{f} = \frac 12 \sum_{i,j=1}^n w_{ij} (f_i - f_j)^2. \]
      Proof: BV exercises 3.21, 7.9.

    2. \(\mathbf{L}\) is symmetric and positive semidefinite.
      Proof: Part 1.

    3. The smallest eigenvalue of \(\mathbf{L}\) is 0, the corresponding eigenvector is the constant one vector \(\mathbf{1}_n\).
      Proof: Obvious.

    4. \(\mathbf{L}\) has non-negative, real-valued eigenvalues \(0 = \lambda_1 \le \lambda_2 \le \cdots \le \lambda_n\).

    5. (Number of connected components and the spectrum of \(\mathbf{L}\)) The multiplicity \(k\) of the eigenvalue 0 of \(\mathbf{L}\) equals the number of connected components \(A_1,\ldots,A_k\) in the graph. The eigenspace of eigenvalue 0 is spanned by the indicator vectors \(\mathbf{1}_{A_1}, \ldots, \mathbf{1}_{A_k}\) of those components.
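
As a quick numerical check (a minimal sketch added here, not part of the original slides; the toy graph and its weights are chosen arbitrarily), these properties can be verified directly in Julia:

using LinearAlgebra

# a small weighted graph on 4 vertices; vertex 4 is isolated,
# so the graph has two connected components
W = [0.0 1.0 0.5 0.0;
     1.0 0.0 2.0 0.0;
     0.5 2.0 0.0 0.0;
     0.0 0.0 0.0 0.0]
D = Diagonal(vec(sum(W, dims = 2)))   # degree matrix
L = D - W                             # unnormalized graph Laplacian

# Property 1: f'Lf = (1/2) ∑ᵢⱼ wᵢⱼ (fᵢ - fⱼ)²
f = randn(4)
f' * L * f ≈ 0.5 * sum(W[i, j] * (f[i] - f[j])^2 for i in 1:4, j in 1:4)

# Properties 2-4: symmetric, real nonnegative eigenvalues, smallest eigenvalue 0
issymmetric(L), eigvals(Symmetric(L))

# Property 5: eigenvalue 0 has multiplicity 2 = number of connected components
count(λ -> abs(λ) < 1e-8, eigvals(Symmetric(L)))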

  • Normalized graph Laplacians. There are two versions of normalized graph Laplacians in the literature: \[ \mathbf{L}_{\text{sym}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2} = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2} \] and \[ \mathbf{L}_{\text{rw}} = \mathbf{D}^{-1} \mathbf{L} = \mathbf{I} - \mathbf{D}^{-1} \mathbf{W}. \]

  • Unnormalized spectral clustering algorithm (a Julia sketch of this algorithm appears at the end of this section).
    Input: A similarity matrix \(\mathbf{S} \in \mathbb{R}^{n \times n}\) and number \(k\) of clusters to construct.

    1. Construct a similarity graph. Let \(\mathbf{W}\) be its weighted adjacency matrix.
    2. Compute the unnormalized Laplacian \(\mathbf{L}\).
    3. Compute the first \(k\) eigenvectors \(\mathbf{u}_1, \ldots, \mathbf{u}_k\) of \(\mathbf{L}\).
    4. Let \(\mathbf{U} = (\mathbf{u}_1 \cdots \mathbf{u}_k) \in \mathbb{R}^{n \times k}\).
    5. Treat rows of \(\mathbf{U}\) as data points in \(\mathbb{R}^k\) and cluster them using the \(k\)-means algorithm into clusters \(C_1,\ldots,C_k\).
      Output: Clusters \(A_1,\ldots,A_k\) according to \(C_1,\ldots,C_k\).
  • Normalized spectral clustering according to Shi and Malik (2000)
    Input: A similarity matrix \(\mathbf{S} \in \mathbb{R}^{n \times n}\) and number \(k\) of clusters to construct.

    1. Construct a similarity graph. Let \(\mathbf{W}\) be its weighted adjacency matrix.
    2. Compute the unnormalized Laplacian \(\mathbf{L}\).
    3. Compute the first \(k\) generalized eigenvectors \(\mathbf{u}_1, \ldots, \mathbf{u}_k\) of the generalized eigenproblem \(\mathbf{L} \mathbf{u} = \lambda \mathbf{D} \mathbf{u}\).
    4. Let \(\mathbf{U} = (\mathbf{u}_1 \cdots \mathbf{u}_k) \in \mathbb{R}^{n \times k}\).
    5. Treat rows of \(\mathbf{U}\) as data points in \(\mathbb{R}^k\) and cluster them using the \(k\)-means algorithm into clusters \(C_1,\ldots,C_k\).
      Output: Clusters \(A_1,\ldots,A_k\) according to \(C_1,\ldots,C_k\).

    Remark: The generalized eigenproblem \(\mathbf{L} \mathbf{u} = \lambda \mathbf{D} \mathbf{u}\) is the same as the eigenproblem \(\mathbf{L}_{\text{rw}} \mathbf{u} = \lambda \mathbf{u}\).

  • Normalized spectral clustering according to Ng, Jordan, and Weiss (2002)
    Input: A similarity matrix \(\mathbf{S} \in \mathbb{R}^{n \times n}\) and number \(k\) of clusters to construct.

    1. Construct a similarity graph. Let \(\mathbf{W}\) be its weighted adjacency matrix.
    2. Compute the normalized Laplacian \(\mathbf{L}_{\text{sym}}\).
    3. Compute the first \(k\) eigenvectors \(\mathbf{u}_1, \ldots, \mathbf{u}_k\) of \(\mathbf{L}_{\text{sym}}\).
    4. Let \(\mathbf{U} = (\mathbf{u}_1 \cdots \mathbf{u}_k) \in \mathbb{R}^{n \times k}\).
    5. Form the matrix \(\mathbf{T} \in \mathbb{R}^{n \times k}\) from \(\mathbf{U}\) by normalizing the rows to norm 1.
    6. Treat rows of \(\mathbf{T}\) as data points in \(\mathbb{R}^k\) and cluster them using the \(k\)-means algorithm into clusters \(C_1,\ldots,C_k\).
      Output: Clusters \(A_1,\ldots,A_k\) according to \(C_1,\ldots,C_k\).
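
To make the recipe concrete, here is a minimal Julia sketch of the unnormalized algorithm on simulated data (added for illustration; the two simulated point clouds, the Gaussian similarity with bandwidth σ = 1, and the use of the sign of the second eigenvector in place of k-means for k = 2 are all arbitrary choices):

using LinearAlgebra, Random

Random.seed!(216)
# two well-separated point clouds in R², 20 points each
X = [randn(2, 20) .- 3.0  randn(2, 20) .+ 3.0]
n = size(X, 2)

# Step 1: fully connected similarity graph with Gaussian weights
σ = 1.0
W = [exp(-norm(X[:, i] - X[:, j])^2 / (2σ^2)) for i in 1:n, j in 1:n]
W[diagind(W)] .= 0.0

# Step 2: unnormalized Laplacian
D = Diagonal(vec(sum(W, dims = 2)))
L = D - W

# Steps 3-5 with k = 2: the second eigenvector (Fiedler vector) suffices;
# thresholding its sign plays the role of k-means in this simple setting
u₂ = eigvecs(Symmetric(L))[:, 2]
clusters = Int.(u₂ .> 0) .+ 1   # cluster labels 1 or 2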

2 Matrix completion

  • Snapshot of the kind of data collected by Netflix. Only 100,480,507 ratings (1.2% of the entries of the 480K-by-18K matrix) are observed.

  • Netflix challenge: impute the unobserved ratings for personalized recommendation. http://en.wikipedia.org/wiki/Netflix_Prize

  • Matrix completion problem. We observe a very sparse matrix \(\mathbf{Y} = (y_{ij})\) and want to impute all the missing entries. This is possible only when the matrix is structured, e.g., of low rank.

  • Example: Load the 128×128 Lena picture with missing pixels.

using FileIO, Images

lena = load("lena128missing.png")
# convert to real matrices
Y = Float64.(lena)
128×128 Matrix{Float64}:
 0.0       0.0       0.635294  0.0       …  0.0       0.0       0.627451
 0.627451  0.623529  0.0       0.611765     0.0       0.0       0.388235
 0.611765  0.611765  0.0       0.0          0.403922  0.219608  0.0
 0.0       0.0       0.611765  0.0          0.223529  0.176471  0.192157
 0.611765  0.0       0.615686  0.615686     0.0       0.0       0.0
 0.0       0.0       0.0       0.619608  …  0.0       0.0       0.2
 0.607843  0.0       0.623529  0.0          0.176471  0.192157  0.0
 0.0       0.0       0.623529  0.0          0.0       0.0       0.215686
 0.619608  0.619608  0.0       0.0          0.2       0.0       0.207843
 0.0       0.0       0.635294  0.635294     0.2       0.192157  0.188235
 0.635294  0.0       0.0       0.0       …  0.192157  0.180392  0.0
 0.631373  0.0       0.0       0.0          0.0       0.0       0.0
 0.0       0.627451  0.635294  0.666667     0.172549  0.0       0.184314
 ⋮                                       ⋱  ⋮                   
 0.0       0.129412  0.0       0.541176     0.0       0.286275  0.0
 0.14902   0.129412  0.196078  0.537255     0.345098  0.0       0.0
 0.215686  0.0       0.262745  0.0          0.301961  0.0       0.207843
 0.345098  0.341176  0.356863  0.513725     0.0       0.0       0.231373
 0.0       0.0       0.0       0.0       …  0.0       0.243137  0.258824
 0.298039  0.415686  0.458824  0.0          0.0       0.0       0.258824
 0.0       0.368627  0.4       0.0          0.0       0.0       0.235294
 0.0       0.0       0.34902   0.0          0.0       0.239216  0.207843
 0.219608  0.0       0.0       0.0          0.0       0.0       0.2
 0.0       0.219608  0.235294  0.356863  …  0.0       0.0       0.0
 0.196078  0.207843  0.211765  0.0          0.0       0.270588  0.345098
 0.192157  0.0       0.196078  0.309804     0.266667  0.356863  0.0

We fill in the missing pixels using a matrix completion technique developed by Candes and Tao \[ \text{minimize } \|\mathbf{X}\|_* \] \[ \text{subject to } x_{ij} = y_{ij} \text{ for all observed entries } (i, j). \] Here \(\|\mathbf{M}\|_* = \sum_i \sigma_i(\mathbf{M})\) is the nuclear norm. In words, we seek the matrix with minimal nuclear norm that agrees with the observed entries. This is a semidefinite programming (SDP) problem readily solved by modern convex optimization software.

We use the convex optimization package Convex.jl with the COSMO.jl solver to solve this semidefinite program.

# Use COSMO solver
using Convex, COSMO
solver = COSMO.Optimizer

# Linear indices of obs. entries
obsidx = findall(Y[:] .≠ 0.0)
# Create optimization variables
X = Variable(size(Y))
# Set up optimization problem
problem = minimize(nuclearnorm(X))
problem.constraints += X[obsidx] == Y[obsidx]
# Solve the problem by calling solve!
@time solve!(problem, solver) # fast
┌ Warning: Concatenating collections of constraints together with `+` or `+=` to produce a new list of constraints is deprecated. Instead, use `vcat` to concatenate collections of constraints.
└ @ Convex ~/.julia/packages/Convex/QKz6m/src/deprecations.jl:129
[ Info: [Convex.jl] Compilation finished: 3.96 seconds, 1.063 GiB of memory allocated
------------------------------------------------------------------
          COSMO v0.8.9 - A Quadratic Objective Conic Solver
                         Michael Garstka
                University of Oxford, 2017 - 2022
------------------------------------------------------------------

Problem:  x ∈ R^{32897},
          constraints: A ∈ R^{41025x32897} (41281 nnz),
          matrix size to factor: 73922x73922,
          Floating-point precision: Float64
Sets:     DensePsdConeTriangle of dim: 32896 (256x256)
          ZeroSet of dim: 8128
          Nonnegatives of dim: 1
Settings: ϵ_abs = 1.0e-05, ϵ_rel = 1.0e-05,
          ϵ_prim_inf = 1.0e-04, ϵ_dual_inf = 1.0e-04,
          ρ = 0.1, σ = 1e-06, α = 1.6,
          max_iter = 5000,
          scaling iter = 10 (on),
          check termination every 25 iter,
          check infeasibility every 40 iter,
          KKT system solver: QDLDL
Acc:      Anderson Type2{QRDecomp},
          Memory size = 15, RestartedMemory,    
          Safeguarded: true, tol: 2.0
Setup Time: 230.69ms

Iter:   Objective:  Primal Res: Dual Res:   Rho:
1   -1.4585e+03 1.5985e+01  5.9854e-01  1.0000e-01
25   1.4525e+02 5.1105e-02  1.1356e-03  1.0000e-01
50   1.4758e+02 1.1725e-02  1.4834e-03  6.8658e-01
75   1.4797e+02 5.5160e-04  4.7489e-05  6.8658e-01
100  1.4797e+02 1.7304e-05  1.3870e-06  6.8658e-01

------------------------------------------------------------------
>>> Results
Status: Solved
Iterations: 100
Optimal objective: 148
Runtime: 2.254s (2253.88ms)

 10.270626 seconds (46.91 M allocations: 3.466 GiB, 5.00% gc time, 92.00% compilation time: <1% of which was recompilation)
Problem statistics
  problem is DCP         : true
  number of variables    : 1 (16_384 scalar elements)
  number of constraints  : 1 (8_128 scalar elements)
  number of coefficients : 8_128
  number of atoms        : 3

Solution summary
  termination status : OPTIMAL
  primal status      : FEASIBLE_POINT
  dual status        : FEASIBLE_POINT
  objective value    : 147.9711

Expression graph
  minimize
   └─ nuclearnorm (convex; positive)
      └─ 128×128 real variable (id: 946…791)
  subject to
   └─ == constraint (affine)
      └─ + (affine; real)
         ├─ index (affine; real)
         │  └─ …
         └─ 8128×1 Matrix{Float64}
colorview(Gray, X.value)

3 Compressed sensing

  • Compressed sensing (Candes and Tao, 2006; Donoho, 2006) tries to address a fundamental question: how do we compress and transmit a complex signal (e.g., musical clips, mega-pixel images) so that it can be decoded to recover the original signal?

  • Suppose a signal \(\mathbf{x} \in \mathbb{R}^n\) is sparse with \(s\) non-zeros. We under-sample the signal by multiplying by a (flat) measurement matrix, \(\mathbf{y} = \mathbf{A} \mathbf{x}\), where \(\mathbf{A} \in \mathbb{R}^{m\times n}\) has iid normal entries. Candes, Romberg and Tao (2006) show that the solution to \[ \begin{eqnarray*} &\text{minimize}& \|\mathbf{x}\|_1 \\ &\text{subject to}& \mathbf{A} \mathbf{x} = \mathbf{y} \end{eqnarray*} \] exactly recovers the true signal under certain conditions on \(\mathbf{A}\) when \(n \gg s\) and \(m \approx s \ln(n/s)\). Why is sparsity a reasonable assumption? Virtually all real-world images have low information content.

Generate a sparse signal and sub-sample it:

using CairoMakie, Makie, Random

# random seed
Random.seed!(123)
# Size of signal
n = 1024
# Sparsity (# nonzeros) in the signal
s = 10
# Number of samples (undersample by a factor of 8) 
m = 128

# Generate and display the signal
x0 = zeros(n)
x0[rand(1:n, s)] = randn(s)
# Generate the random sampling matrix
A = randn(m, n) / m
# Subsample by multiplexing
y = A * x0

# plot the true signal
f = Figure()
Makie.Axis(
    f[1, 1], 
    title = "True Signal x0",
    xlabel = "x",
    ylabel = "y"
)
lines!(1:n, x0)
f

Solve the linear programming problem.

# Use COSMO solver
solver = COSMO.Optimizer
# MOI.set(solver, MOI.RawOptimizerAttribute("max_iter"), 5000)

# Set up optimization problem
x = Variable(n)
problem = minimize(norm(x, 1))
problem.constraints += A * x == y

# Solve the problem
@time solve!(problem, solver)
[ Info: [Convex.jl] Compilation finished: 0.49 seconds, 108.214 MiB of memory allocated
------------------------------------------------------------------
          COSMO v0.8.9 - A Quadratic Objective Conic Solver
                         Michael Garstka
                University of Oxford, 2017 - 2022
------------------------------------------------------------------

Problem:  x ∈ R^{2048},
          constraints: A ∈ R^{2176x2048} (135168 nnz),
          matrix size to factor: 4224x4224,
          Floating-point precision: Float64
Sets:     Nonnegatives of dim: 2048
          ZeroSet of dim: 128
Settings: ϵ_abs = 1.0e-05, ϵ_rel = 1.0e-05,
          ϵ_prim_inf = 1.0e-04, ϵ_dual_inf = 1.0e-04,
          ρ = 0.1, σ = 1e-06, α = 1.6,
          max_iter = 5000,
          scaling iter = 10 (on),
          check termination every 25 iter,
          check infeasibility every 40 iter,
          KKT system solver: QDLDL
Acc:      Anderson Type2{QRDecomp},
          Memory size = 15, RestartedMemory,    
          Safeguarded: true, tol: 2.0
Setup Time: 24.29ms

Iter:   Objective:  Primal Res: Dual Res:   Rho:
1   -8.1920e+03 8.3337e+00  5.9999e-01  1.0000e-01
25  -4.3108e-09 2.0860e-01  3.1188e-02  1.0000e-01
50   4.7129e+00 1.7996e-01  5.3711e-03  1.0000e-01
75   6.7191e+00 7.9040e-02  1.0756e-02  1.0000e-01
100  7.4400e+00 9.2727e-03  1.5535e-01  9.0268e+00
125  7.5387e+00 3.1036e-06  1.0998e-05  9.4803e-01

------------------------------------------------------------------
>>> Results
Status: Solved
Iterations: 136 (incl. 11 safeguarding iter)
Optimal objective: 7.539
Runtime: 0.244s (244.09ms)

  1.188182 seconds (3.00 M allocations: 287.420 MiB, 3.01% gc time, 84.44% compilation time: 43% of which was recompilation)
Problem statistics
  problem is DCP         : true
  number of variables    : 1 (1_024 scalar elements)
  number of constraints  : 1 (128 scalar elements)
  number of coefficients : 131_200
  number of atoms        : 4

Solution summary
  termination status : OPTIMAL
  primal status      : FEASIBLE_POINT
  dual status        : FEASIBLE_POINT
  objective value    : 7.5387

Expression graph
  minimize
   └─ sum (convex; positive)
      └─ abs (convex; positive)
         └─ 1024-element real variable (id: 590…510)
  subject to
   └─ == constraint (affine)
      └─ + (affine; real)
         ├─ * (affine; real)
         │  ├─ …
         │  └─ …
         └─ 128×1 Matrix{Float64}
# Display the solution
f = Figure()
Makie.Axis(
    f[1, 1], 
    title = "Reconstructed signal overlayed with x0",
    xlabel = "x",
    ylabel = "y"
)
scatter!(1:n, x0, label = "truth")
lines!(1:n, vec(x.value), label = "recovery")
axislegend(position = :lt)
f

4 Automatic differentiation (Auto-Diff)

Last week we scratched the surface of matrix/vector calculus and the chain rule. The recent surge of machine learning has sparked rapid advancement of automatic differentiation, which applies the chain rule to computer code to obtain exact gradients (up to machine precision).
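
As a one-line warm-up (added here; the function h and the test point are made up for illustration), Zygote recovers the exact gradient of an ordinary Julia function:

using Zygote

# gradient of h(z) = z'z + sin(z₁); the exact gradient is 2z + [cos(z₁), 0, 0]
h(z) = sum(abs2, z) + sin(z[1])
z = [1.0, 2.0, 3.0]
Zygote.gradient(h, z)[1] ≈ 2z + [cos(z[1]), 0.0, 0.0]   # true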

4.1 Example: multivariate normal

Define the log-likelihood function of a sample \(\boldsymbol{y}_1, \ldots, \boldsymbol{y}_n\) from \(N(\boldsymbol{\mu}, \boldsymbol{\Omega})\), dropping the additive constant: \[ \ell(\boldsymbol{\mu}, \boldsymbol{\Omega}) = - \frac{n}{2} \log \det \boldsymbol{\Omega} - \frac{1}{2} \sum_{i=1}^n (\boldsymbol{y}_i - \boldsymbol{\mu})' \boldsymbol{\Omega}^{-1} (\boldsymbol{y}_i - \boldsymbol{\mu}). \]
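
For reference, the corresponding analytical gradients (the same expressions encoded in the verification code below) are \[ \nabla_{\boldsymbol{\mu}} \ell = \boldsymbol{\Omega}^{-1} \sum_{i=1}^n (\boldsymbol{y}_i - \boldsymbol{\mu}), \qquad \nabla_{\boldsymbol{\Omega}} \ell = -\frac{n}{2} \boldsymbol{\Omega}^{-1} + \frac{1}{2} \boldsymbol{\Omega}^{-1} \left[ \sum_{i=1}^n (\boldsymbol{y}_i - \boldsymbol{\mu})(\boldsymbol{y}_i - \boldsymbol{\mu})' \right] \boldsymbol{\Omega}^{-1}. \]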

using Distributions, DistributionsAD, LinearAlgebra, PDMats, Random, Zygote

Random.seed!(216)
n, p = 100, 3
Y = randn(p, n) # each column of Y is a sample
3×100 Matrix{Float64}:
  1.27293     0.45854    0.881297  …  -0.386115  -0.552413   0.154779
 -0.0716618   0.100877  -1.5708        0.866487  -0.434057   0.0726842
 -0.445158   -1.24595    0.485928     -1.25812   -0.675163  -0.882397
# log-likelihood evaluator
mnlogl = (θ) -> loglikelihood(MvNormal(θ[:μ], θ[:Ω]), Y)
#1 (generic function with 1 method)
# log-likelihood at (μ = 0, Ω = I)
θ = (μ = zeros(p), Ω = Matrix{Float64}(I(p)))
mnlogl(θ)
-442.06153771100463

Calculate the gradient by auto-diff (reverse mode).

# gradient of log-likelihood at (μ = 0, Ω = I)
θ̄ = Zygote.gradient(mnlogl, θ)[1];
θ̄[:μ]
3-element Vector{Float64}:
   0.28541706942228096
  -5.239342603555363
 -29.58708211691224
θ̄[:Ω]
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 2.54102   1.8363  -3.40248
  ⋅       10.3971   4.94345
  ⋅         ⋅       3.44186

Let us check whether auto-diff yields the same answer as our analytical derivations.

# analytical gradient dL/dμ = Ω^{-1} ∑ᵢ (yᵢ - μ)
θ̄[:μ] ≈ θ[:Ω] \ sum(Y .- θ[:μ], dims = 2)
true
# analytical gradient dL/dΩ = -n/2 Ω^{-1} + 1/2 Ω^{-1} ∑ᵢ (yᵢ - μ)(yᵢ - μ)' Ω^{-1}
θ̄[:Ω] ≈ - (n/2) * inv(θ[:Ω]) + 0.5 * inv(θ[:Ω]) * ((Y .- θ[:μ]) * (Y .- θ[:μ])') * inv(θ[:Ω])
false

Surprise! The gradients of the covariance matrix do not match. Close inspection reveals that Julia calculates the gradient with respect to the upper-triangular part of the covariance matrix, \(\text{vech}(\boldsymbol{\Omega})\): the diagonal entries agree with the \(\text{vec}(\boldsymbol{\Omega})\) gradient, while each off-diagonal entry collects the contributions of both symmetric entries.

# gradient with respect to vech(Ω)
θ̄[:Ω]
3×3 UpperTriangular{Float64, Matrix{Float64}}:
 2.54102   1.8363  -3.40248
  ⋅       10.3971   4.94345
  ⋅         ⋅       3.44186
# gradient wrt vec(Ω)
- (n/2) * inv(θ[:Ω]) + 0.5 * inv(θ[:Ω]) * ((Y .- θ[:μ]) * (Y .- θ[:μ])') * inv(θ[:Ω])
3×3 Matrix{Float64}:
  2.54102    0.918151  -1.70124
  0.918151  10.3971     2.47172
 -1.70124    2.47172    3.44186
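
A quick check (added here; it reuses θ, θ̄, Y, and n from the cells above) confirms this relationship: the diagonals agree, and each off-diagonal entry of the auto-diff gradient equals the sum of the two symmetric entries of the \(\text{vec}(\boldsymbol{\Omega})\) gradient.

# analytic gradient wrt vec(Ω)
G = - (n/2) * inv(θ[:Ω]) + 0.5 * inv(θ[:Ω]) * ((Y .- θ[:μ]) * (Y .- θ[:μ])') * inv(θ[:Ω])
# fold the symmetric off-diagonal entries into the upper triangle and compare
θ̄[:Ω] ≈ UpperTriangular(G + G' - Diagonal(diag(G)))   # true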

4.2 Example: factor analysis

Now let’s try the more complicated factor analysis, where \(\boldsymbol{\Omega} = \mathbf{F} \mathbf{F}' + \mathbf{D}\).
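
For reference, write \(\mathbf{S} = n^{-1} \mathbf{Y} \mathbf{Y}'\) (the model below has mean zero). The analytical gradients, the same expressions used in the verification code below, are \[ \nabla_{\mathbf{F}} \ell = -n \boldsymbol{\Omega}^{-1} \mathbf{F} + n \boldsymbol{\Omega}^{-1} \mathbf{S} \boldsymbol{\Omega}^{-1} \mathbf{F}, \qquad \nabla_{\mathbf{d}} \ell = -\frac{n}{2} \operatorname{diag}\left(\boldsymbol{\Omega}^{-1} - \boldsymbol{\Omega}^{-1} \mathbf{S} \boldsymbol{\Omega}^{-1}\right), \] where \(\operatorname{diag}(\cdot)\) extracts the diagonal of a matrix.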

# log-likelihood evaluator
falogl = (θ) -> loglikelihood(MvNormal(θ[:F] * θ[:F]' + diagm(θ[:d])), Y)
#3 (generic function with 1 method)
r = 2
# log-likelihood at (F = ones(p, 2), d = ones(p))
θ = (F = ones(p, r), d = ones(p))
falogl(θ)
-490.85497606101296
# gradient of log-likelihood at (F = ones(p, 2), d = ones(p)), reverse-mode
θ̄ = Zygote.gradient(falogl, θ)[1];
# auto-diff gradient wrt F
θ̄[:F]
3×2 Matrix{Float64}:
 -13.3555  -13.3555
  -9.9186   -9.9186
 -12.6542  -12.6542
# analytic gradient wrt F
Ω = θ[:F] * θ[:F]' + diagm(θ[:d])
S = Y * Y' / n
- n * inv(Ω) * θ[:F] + n * inv(Ω) * S * inv(Ω) * θ[:F]
3×2 Matrix{Float64}:
 -13.3555  -13.3555
  -9.9186   -9.9186
 -12.6542  -12.6542
# auto-diff gradient wrt d, reverse-mode
θ̄[:d]
3-element Vector{Float64}:
 1.1085074611630732
 2.0908476672040432
 0.606829360941255
# analytic gradient wrt d
- (n / 2) * diag(inv(Ω) -  inv(Ω) * S * inv(Ω))
3-element Vector{Float64}:
 1.1085074611630608
 2.090847667204021
 0.6068293609412523

The Hessian (second derivatives) is essentially the gradient of the gradient. ForwardDiff.jl (forward mode), applied to the reverse-mode gradient, is better suited for Hessian calculation.

using ForwardDiff

# HFF = ∂²L/∂F∂F'
ForwardDiff.jacobian(F -> Zygote.gradient(falogl, (F = F, d = ones(p)))[1][:F], θ[:F])
6×6 Matrix{Float64}:
 -1.90729   -1.25196   -0.728345  -4.12431    6.65049    6.94168
 -1.25196   -1.76658   -0.378075   6.65049   -5.94828    5.81976
 -0.728345  -0.378075  -2.681      6.94168    5.81976   -3.89466
 -4.12431    6.65049    6.94168   -1.90729   -1.25196   -0.728345
  6.65049   -5.94828    5.81976   -1.25196   -1.76658   -0.378075
  6.94168    5.81976   -3.89466   -0.728345  -0.378075  -2.681
# Hdd = ∂²L/∂d∂d'
ForwardDiff.jacobian(d -> Zygote.gradient(falogl, (F = θ[:F], d = d))[1][:d], θ[:d])
3×3 Matrix{Float64}:
 -27.0938    -6.33948   -6.27307
  -6.33948  -28.4971    -5.85244
  -6.27307   -5.85244  -26.3771
# Hθθ = ∂²L/∂θ∂θ'

# gradient function with vector input and vector output
falogl_grad = function(θvec)
    θ = (F = reshape(θvec[1:(p * r)], p, r), d = θvec[(p * r + 1):(p * r + p)])
    θ̄ = Zygote.gradient(falogl, θ)[1]
    return [vec(θ̄[:F]); vec(θ̄[:d])]
end

ForwardDiff.jacobian(falogl_grad, [vec(θ[:F]); vec(θ[:d])])
9×9 Matrix{Float64}:
 -1.90729   -1.25196   -0.728345  …   -0.981177    2.37667    1.56186
 -1.25196   -1.76658   -0.378075       1.39471    -3.71675    1.35155
 -0.728345  -0.378075  -2.681          1.3615      2.13315   -1.33874
 -4.12431    6.65049    6.94168       -0.981177    2.37667    1.56186
  6.65049   -5.94828    5.81976        1.39471    -3.71675    1.35155
  6.94168    5.81976   -3.89466   …    1.3615      2.13315   -1.33874
 -0.981177   1.39471    1.3615       -27.0938     -6.33948   -6.27307
  2.37667   -3.71675    2.13315       -6.33948   -28.4971    -5.85244
  1.56186    1.35155   -1.33874       -6.27307    -5.85244  -26.3771