Placeholder.
We should probably have put human knowledge on adversarially robust, cryptographically signed systems before now, but we didn’t. Now that it is cheap to fabricate history, how can we trust anything?
Obvious candidate solutions include:
Forked from superintelligence because the risk mitigation strategies are a field in themselves. Or rather, several distinct fields, which I need to map out in this notebook.
x-risk (existential risk) is a term used in, e.g., the rationalist community to discuss risks that could end humanity, such as a possible AI intelligence explosion.
FWIW: I personally think that (various kinds of) AI x-risk are plausible, and serious enough to worry about, even if they are not the most likely option. If the possibility is that everyone dies, then we should be worried about it, even if it is only a 1% chance.
I would like to write some wicked tail risk theory at some point.
There are people who think that focusing on x-risk is itself a risky distraction from more pressing problems, especially accelerationists.
e.g. what if we do not solve the climate crisis because we put effort into the AI risks instead? Or so much effort that it slowed down the AI that could have saved us? Or so much effort that we got distracted from other more pressing risks?
Here is one piece that I found rather interesting: Superintelligence: The Idea That Eats Smart People (although I thought that effective altruism meta criticism was the idea that ate smart people).
Personally, I doubt these need to be zero-sum trade-offs. Getting the human species ready to deal with catastrophes in general seems like a feasible intermediate goal.
There is a currently-viral school of X-risk-risk critique that names X-risk as a concern of TESCREALism, which might be of interest to some readers.
Singular learning theory has been pitched to me as a tool with applications to AI safety.
Sparse autoencoders have had a moment as an interpretability tool; see Sparse Autoencoders for an explanation.
Let us consider general alignment, because I have little AI-specific to say yet.
AiSafety.com’s landscape map: https://aisafety.world/
Wong and Bartlett (2022)
we hypothesize that once a planetary civilization transitions into a state that can be described as one virtually connected global city, it will face an ‘asymptotic burnout’, an ultimate crisis where the singularity-interval time scale becomes smaller than the time scale of innovation. If a civilization develops the capability to understand its own trajectory, it will have a window of time to affect a fundamental change to prioritize long-term homeostasis and well-being over unyielding growth—a consciously induced trajectory change or ‘homeostatic awakening’. We propose a new resolution to the Fermi paradox: civilizations either collapse from burnout or redirect themselves to prioritising homeostasis, a state where cosmic expansion is no longer a goal, making them difficult to detect remotely.
Ten Hard Problems in and around AI
We finally published our big 90-page intro to AI. Its likely effects, from ten perspectives, ten camps. The whole gamut: ML, scientific applications, social applications, access, safety and alignment, economics, AI ethics, governance, and classical philosophy of life.
The follow-on 2024 Survey of 2,778 AI authors: six parts in pictures
Douglas Hofstadter changes his mind on Deep Learning & AI risk
François Chollet, The implausibility of intelligence explosion
Stuart Russell on Making Artificial Intelligence Compatible with Humans, an interview on various themes in his book (Russell 2019)
Attempted Gears Analysis of AGI Intervention Discussion With Eliezer
Kevin Scott argues for trying to find a unifying notion of what knowledge work is to unify what humans and machines can do (Scott 2022).
ML people
When dealing with high-dimensional Gaussian distributions, sampling can become computationally expensive, especially when the covariance matrix is large and dense. Traditional methods like the Cholesky decomposition become impractical (cubic time and quadratic memory in the dimension). However, if we can efficiently compute the product of the covariance matrix with arbitrary vectors, we can leverage Langevin dynamics to sample from the distribution without ever forming the full covariance matrix.
I have been doing this recently in the setting where $\Sigma$ is outrageously large, but I can nonetheless calculate $\Sigma v$ for arbitrary vectors $v$. This arises, for example, when I have a kernel which I can evaluate and I need to use it to generate some samples from my random field, especially where the kernel arises as an inner product under some feature map.
TODO: evaluate actual computational complexity of this method.
Note this is really just some notes I have made to myself. I need to sanity check the procedure on a real problem.
We aim to sample from a multivariate Gaussian distribution:

$$
x \sim \mathcal{N}(\mu, \Sigma)
$$

where:

- $\mu \in \mathbb{R}^D$ is the mean vector,
- $\Sigma \in \mathbb{R}^{D \times D}$ is the (symmetric positive-definite) covariance matrix.
Langevin dynamics provide a way to sample from a target distribution by simulating a stochastic differential equation (SDE) whose stationary distribution is the desired distribution. For a Gaussian distribution, the SDE simplifies due to the properties of the normal distribution (i.e. Gaussians all the way down).
The continuous-time Langevin equation is

$$
\mathrm{d}x_t = -\nabla U(x_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t
$$

where:

- $U(x)$ is the potential function, with target density $\propto e^{-U(x)}$,
- $W_t$ is a standard $D$-dimensional Wiener process.
For our Gaussian distribution, the potential function is:

$$
U(x) = \tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu).
$$
We discretize the Langevin equation using the Euler–Maruyama method with time step $\epsilon$:

$$
x_{n+1} = x_n - \epsilon \nabla U(x_n) + \sqrt{2\epsilon}\,\xi_n
$$

where $\xi_n \sim \mathcal{N}(0, I)$.
Next, the gradient of the potential function is:

$$
\nabla U(x) = \Sigma^{-1}(x - \mu).
$$

Instead of computing $\Sigma^{-1}(x - \mu)$ directly, we can solve the linear system

$$
\Sigma v = x - \mu
$$

for $v$, which gives $v = \Sigma^{-1}(x - \mu)$.
To solve $\Sigma v = r$ efficiently without forming $\Sigma$, we use the Conjugate Gradient (CG) method. The CG method is suitable for large symmetric positive-definite systems and relies only on matrix-vector products $\Sigma p$.
Putting these together, given $\mu$, an initial point, and a routine for computing $\Sigma v$, we have the following algorithm: at each step, set $r = x_n - \mu$, solve $\Sigma v = r$ by CG, then take the Langevin step $x_{n+1} = x_n - \epsilon v + \sqrt{2\epsilon}\,\xi_n$; after a burn-in period, keep the iterates as (correlated) samples.
For my sins, I am cursed to never escape PyTorch. Here is an implementation in that language that I got an LLM to construct for me from the above algorithm.
First, we need a function to compute $\Sigma v$ efficiently.
```python
import torch

def sigma_mv_prod(v):
    # Efficient computation of Σv without forming Σ.
    # For demonstration, assume Σ = A Aᵀ + τI, where A is some known
    # matrix we can compute with; the τI jitter keeps Σ positive
    # definite even when A is low-rank, which CG requires.
    # Replace this with your specific implementation.
    A = get_matrix_A()
    return A @ (A.T @ v) + 1e-2 * v
```
Oh dang, the LLM did a really good job on this.
```python
def cg_solver(b, tol=1e-5, max_iter=100):
    # Solve Σx = b by conjugate gradients, using only Σ-vector products.
    x = torch.zeros_like(b)
    r = b.clone()
    p = r.clone()
    rs_old = torch.dot(r, r)
    for _ in range(max_iter):
        Ap = sigma_mv_prod(p)
        alpha = rs_old / torch.dot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = torch.dot(r, r)
        if torch.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```
```python
def sample_mvn_langevin(mu, num_samples=1000, epsilon=1e-3, burn_in=100):
    """
    Samples from N(mu, Σ) using Langevin dynamics.

    Parameters:
    - mu: Mean vector (torch.Tensor of shape [D])
    - num_samples: Number of samples to collect after burn-in
    - epsilon: Time step size
    - burn_in: Number of initial iterations to discard
    """
    D = mu.shape[0]
    x = mu.clone().detach()
    samples = []
    total_steps = num_samples + burn_in
    for n in range(total_steps):
        # Compute gradient direction v = Σ^{-1} (x - μ) via CG
        r = x - mu
        v = cg_solver(r, tol=1e-5, max_iter=100)
        # Langevin update
        noise = torch.randn(D)
        x = x - epsilon * v + torch.sqrt(torch.tensor(2 * epsilon)) * noise
        if n >= burn_in:
            samples.append(x.detach().clone())
    return torch.stack(samples)
```
```python
# Example mean vector mu
D = 100  # Dimensionality
mu = torch.zeros(D)

# Factor A for Σ = A Aᵀ (customize as needed).
# Build A once: re-sampling it on every call would silently change Σ
# between calls to sigma_mv_prod.
rank = 10
A_fixed = torch.randn(D, rank)

def get_matrix_A():
    return A_fixed

# Run the sampler
samples = sample_mvn_langevin(mu)
```
After sampling, it’s wise to verify that the samples approximate the target distribution.
```python
import matplotlib.pyplot as plt

empirical_mean = samples.mean(dim=0)
empirical_cov = torch.cov(samples.T)  # torch.cov wants variables as rows

print("Empirical Mean:\n", empirical_mean)
print("Empirical Covariance Matrix:\n", empirical_cov)

# Plot histogram for the first dimension
plt.hist(samples[:, 0].numpy(), bins=30, density=True)
plt.title("Histogram of First Dimension")
plt.xlabel("Value")
plt.ylabel("Density")
plt.show()
```
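As a first pass at that sanity check, here is a tiny self-contained version on an explicit $\Sigma$ (my own sketch; it uses a dense inverse as a stand-in for the CG solve, so it only exercises the Langevin part):

```python
import torch

# Tiny sanity check: run the same Euler-Maruyama Langevin scheme on a
# small, explicit SPD covariance and compare empirical moments to Σ.
torch.manual_seed(0)
D = 4
A = torch.randn(D, D)
Sigma = A @ A.T + 0.5 * torch.eye(D)   # explicit SPD covariance
Sigma_inv = torch.linalg.inv(Sigma)    # dense stand-in for the CG solve
mu = torch.zeros(D)
eps = 0.05

x = mu.clone()
kept = []
for n in range(30_000):
    v = Sigma_inv @ (x - mu)           # ∇U(x) = Σ⁻¹(x − μ)
    x = x - eps * v + (2 * eps) ** 0.5 * torch.randn(D)
    if n >= 3_000:                     # discard burn-in
        kept.append(x)
S = torch.stack(kept)
rel_err = (torch.linalg.norm(torch.cov(S.T) - Sigma)
           / torch.linalg.norm(Sigma)).item()
print("relative covariance error:", round(rel_err, 3))
```

The discretization has $O(\epsilon)$ bias and the chain mixes slowly along the largest eigendirections of $\Sigma$, so expect the empirical covariance to be close but not exact.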
Placeholder.
Jesse Hoogland, Neural networks generalize because of this one weird trick:
Statistical learning theory is lying to you: “overparametrized” models actually aren’t overparametrized, and generalization is not just a question of broad basins.
Recommended to me by Rohan Hitchcock:
Alexander Gietelink Oldenziel, Singular Learning Theory
metauni’s Singular Learning Theory seminar
Timaeus is an AI safety research organization working on applications of singular learning theory (SLT) to alignment.
Various interesting challenges in this domain.
How do we verify the authenticity of content? Can we tell if a piece of text was generated by an AI, and if so, which one? Can we do this adversarially? (Doubtful) Can we do this cooperatively, and have AI models sign their outputs with cryptographic signatures as proof that a particular model generated them, as a kind of quality assurance? (Less doubtful but still not certain)
Enter the world of watermarking and cryptographic signatures for AI outputs and AI models.
Consider scenarios like academic integrity, misinformation campaigns, or intellectual property rights. If someone uses an AI model to generate a paper and passes it off as original work, that’s a headache for educators. On a larger scale, various actors could flood social media with AI-generated propaganda. How much can we mitigate these problems by verifying the origin of content?
Overviews in (Cui et al. 2024; Zhu et al. 2024).
Not covered: Data-blind methods such as homomorphic learning, federated learning…
Keyword: Proof-of-learning, …
(Garg et al. 2023; Goldwasser et al. 2022; Jia et al. 2021)
TBD
E.g. Abbaszadeh et al. (2024):
A zero-knowledge proof of training (zkPoT) enables a party to prove that they have correctly trained a committed model based on a committed dataset without revealing any additional information about the model or the dataset. An ideal zkPoT should offer provable security and privacy guarantees, succinct proof size and verifier runtime, and practical prover efficiency. In this work, we present , a zkPoT targeted for deep neural networks (DNNs) that achieves all these goals at once. Our construction enables a prover to iteratively train their model via (mini-batch) gradient descent, where the number of iterations need not be fixed in advance; at the end of each iteration, the prover generates a commitment to the trained model parameters attached with a succinct zkPoT, attesting to the correctness of the executed iterations. The proof size and verifier time are independent of the number of iterations.
A.k.a. fingerprinting. I’m sceptical that this is of any practical use, but it is a good theoretical starting point.
Watermarking is about embedding a hidden signal within data that can later be used to verify its source. The trick is to do this without altering the human-perceptible content. For images, this might involve tweaking pixel values in a way that’s imperceptible to the human eye but detectable through analysis.
For text, it’s trickier (Huang et al. 2024; Li et al. 2024).
One approach is to modify the probabilities in the language model’s output to favour certain token patterns. Suppose we’re using a language model that predicts the next word based on some probability distribution. By slightly biasing this distribution, we can make the model more likely to choose words that fit a particular statistical pattern.
For example, we could define a hash function that maps the context and potential next tokens to a numerical space. We then adjust the probabilities so that tokens with hashes satisfying a certain condition (like being within a specific range) are more likely to be selected.
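To make that concrete, here is a minimal toy sketch of my own, in the spirit of hash-based “green list” schemes rather than any specific paper’s algorithm: hash the context to select a pseudo-random subset of the vocabulary, then bias those tokens’ logits.

```python
import hashlib
import torch

def green_list(context, vocab_size, fraction=0.5):
    # Derive a reproducible pseudo-random partition of the vocabulary
    # from a hash of the context tokens; the detector can replay this.
    digest = hashlib.sha256(repr(context).encode()).digest()
    gen = torch.Generator().manual_seed(int.from_bytes(digest[:8], "big"))
    perm = torch.randperm(vocab_size, generator=gen)
    return perm[: int(fraction * vocab_size)]

def watermarked_probs(logits, context, delta=2.0):
    # Bias the logits of "green" tokens upward by delta, then renormalise.
    biased = logits.clone()
    biased[green_list(context, logits.shape[-1])] += delta
    return torch.softmax(biased, dim=-1)
```

A detector that knows the hashing scheme replays it at each position and counts how many emitted tokens were green; unwatermarked text should hit roughly the base fraction (0.5 here), while watermarked text exceeds it by a statistically testable margin.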
TBD
Of course, any watermarking scheme must consider the possibility of adversaries trying to remove or alter the watermark. This leads us into game theory and adversarial models. We need to design watermarking methods that are robust against attempts to detect and strip them.
One way is to make the watermark indistinguishable from the natural output of the model. If the statistical patterns introduced are subtle and aligned with the model’s inherent probabilities, it becomes exceedingly difficult for an adversary to pinpoint and remove the watermark without degrading the text quality.
Suppose our language model defines a probability distribution $p(x)$ over sequences $x$. Our goal is to define a modified distribution $\tilde{p}(x)$ such that: (1) $\tilde{p}$ stays close to $p$, so text quality is preserved; and (2) samples from $\tilde{p}$ carry a statistical signal that a detector holding the key can test for.
This sets up an optimisation problem where we balance the fidelity of the text with the detectability of the watermark, and it sounds like a classic adversarial learning problem.
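One way to formalise that balance (my own notation, a sketch rather than any specific paper’s formulation): choose a keyed detection statistic $s_k$ and solve

```latex
\tilde{p} \;=\; \operatorname*{arg\,max}_{q}\;
  \mathbb{E}_{x \sim q}\!\left[ s_k(x) \right]
\quad \text{subject to} \quad
D_{\mathrm{KL}}\!\left( q \,\Vert\, p \right) \le \varepsilon,
```

i.e. push up the expected watermark score while keeping the watermarked distribution within an $\varepsilon$-ball (in KL) of the base model; the adversary’s removal problem is the mirror image, minimising the score subject to preserving text quality.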
The hothouse in which we collectively cultivate our fictions. A whole participatory universe of transformative works about pop culture properties.
AO3’s 15-year journey from blog post to fanfiction powerhouse is a core sample through the movement as it transformed in my early adulthood:
Primarily written by women, and often featuring erotica and/or queer relationships, sexism and homophobia both played a part in its denigration. Coppa, who hosted archives at the time, would sometimes have to delete stories at the request of its writers, who feared retribution in their jobs or relationships. “I would get frantic emails from people saying, ‘I’m going through a divorce and my husband is going to take my fic and tell the judge I’m an unfit mother and try to take my children. How fast can you make me disappear from the internet?’”
Coppa herself felt pressure to keep certain aspects of fandom secret. At a talk she was giving about fanfiction, an audience member pressed her: weren’t there people writing about Kirk and Spock having sex? “I remember taking a breath and saying quite consciously, ‘Yeah, there is, and it’s amazing, you should read some.’ And it was the first time I had ever done that. It was the kind of thing we all skated around.” She had nightmares about her fanfic costing her tenure track at the college she worked at and would frequently imagine how to defend herself against a hypothetical complaint about her having written sex scenes.
“But the only way through that is to lean into it,” she says. Though she still sometimes has people imply fanfic writing is a strange hobby, she responds differently. “‘Are you dead inside?’ is sort of my answer,” she laughs, comparing it to hobbyist painting, music, or knitting. “‘But isn’t some of it erotic?’ Yes. Yes, it is. It turns out women have a sexuality.”
Fanfiction is of course thirsty, as is much of fandom, and there is interesting crossover to pornography. Mind you, what part of human life does not have occasional crossover with porn?
A foundational work is Russ (1985).
Fanlore the community’s wiki about itself
‘We can continue Pratchett’s efforts’: the gamers keeping Discworld alive
The Journal of Transformative Works and Cultures:
Transformative Works and Cultures (TWC) is an international, peer-reviewed journal published by the Organization for Transformative Works. TWC publishes articles about transformative works, broadly conceived and articles about the fan community.
We invite papers in all areas, including fan fiction, fan vids, film, TV, anime, fan art, comic books, cosplay, fan community, music, video games, celebrities and machinima. We encourage a variety of critical approaches, including feminism, gender studies, queer theory, postcolonial theory, audience theory, reader-response theory, literary criticism, film studies, and posthumanism. We also encourage authors to consider writing personal essays integrated with scholarship; hyperlinked articles; or other forms that test the limits of the genre of academic writing.
Notes on the emerging futures of empathetic machines. The naming is due to Esther Perel.
I would be interested to think about whether artificial-intimacy technologies are enabling or infantilising.
On large changes in human behaviour requiring small activation energy. Is this a natural category? I’m reluctant to say so, but the idea comes up so often that I will audition it.
Intermittently made famous by think-pieces about high-impact behavioural interventions, e.g. Broken Windows (Kelling and Wilson 1982) and Nudge (Thaler and Sunstein 2009). This usage tends to get controversial, because of moral issues about transparency, agency, consent…
Kernel tricks for trajectories. That is to say, the other kernel trick for trajectories.
I am told e.g. that this generalises the Radon transform, as seen in tomography, so I guess I should know about that for my own work.
Applications include the identification of forcing fields for functions from sparsely observed trajectories, without finite-difference approximations, for system identification and functional inverse problems.
On the tension between the representation of functions in function space and in weight space in neural networks. We “see” the outputs of neural networks as functions, generated by some inscrutable parameterization in terms of weights, which is more abstruse but also more tractable to learn in practice. Why might that be?
When we can learn in function space many things work better in various senses (see, e.g. GP regression), but such methods rarely dominate in messy practice. Why might that be? When can we operate in function space? Sometimes we really want to, e.g. in operator learning.
See also low rank GPs, partially Bayes NNs, neural tangent kernels, functional regression, functional inverse problems, overparameterization, wide limits of NNs…
A hyped variant of classic NNs.
Where the classic NN (i.e. the MLP) relies on layers of linear transformations (weights) and fixed activation functions (like ReLU or tanh) at the nodes, the Kolmogorov-Arnold Networks (KANs) learn activation functions.
Interesting things about these networks, from my first impression
The Kolmogorov-Arnold theorem states that any continuous multivariate function can be decomposed into sums of univariate functions. The classic representation looks like this:

$$
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right).
$$
This means that for any complex multivariate function, you can break it down into a composition of univariate functions plus some addition.
In a KAN (Liu, Wang, et al. 2024), we learn how these univariate functions compose themselves into a multivariate structure, instead of fixing the composition in advance. The “weights” between nodes, represented by splines or other parameterized functions, are free to learn what the best local univariate relationship is.
KANs are structured by stacking KAN layers, where each layer looks something like this:

$$
x_j^{(l+1)} = \sum_{i} \phi_{j,i}^{(l)}\!\left(x_i^{(l)}\right)
$$

where $\phi_{j,i}^{(l)}$ is a learnable univariate function (parameterized as a spline, for instance), and $x_i^{(l)}$ is the activation value from the previous layer. So while each function is univariate, the overall transformation still respects the multivariate nature of the input. The final function learned by a KAN is a composition of these layers:

$$
f(x) = \left( \Phi^{(L)} \circ \Phi^{(L-1)} \circ \cdots \circ \Phi^{(1)} \right)(x).
$$
This makes KANs flexible like MLPs, but maybe the information is combined in a way that is more comprehensible to the human mind: we can visualize or probe the learned univariate functions $\phi_{j,i}$.
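As a toy illustration of the structure, here is a minimal sketch of my own: it parameterises each edge function as a linear combination of fixed Gaussian bumps with learnable coefficients, rather than the B-splines of Liu, Wang, et al. (2024).

```python
import torch
import torch.nn as nn

class KANLayer(nn.Module):
    """One KAN layer: a learnable univariate function on every edge.

    A minimal sketch: each edge function phi_{j,i} is a learnable
    linear combination of fixed Gaussian bumps on a grid.
    """
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        # Fixed basis-function centres; only the coefficients learn.
        self.register_buffer("centres", torch.linspace(-2.0, 2.0, n_basis))
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, n_basis))

    def forward(self, x):  # x: [batch, in_dim]
        # Evaluate each basis bump at each input coordinate.
        phi = torch.exp(-(x.unsqueeze(-1) - self.centres) ** 2)  # [batch, in_dim, n_basis]
        # Output j sums phi_{j,i}(x_i) over incoming edges i.
        return torch.einsum("bik,oik->bo", phi, self.coef)
```

Stacking a few of these gives the composition above, and each learned edge function can be plotted directly by evaluating `coef[j, i] @ basis` on a grid.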
Symbolic regression tries to discover closed-form expressions—think $y = \sin(x) + e^{x}$—directly from the data, i.e. a symbolic representation of a function. Symbolic regression is powerful because it gives you a human-readable formula, something interpretable, but it is not robust to noise. The search space of possible functions is huge, and small changes in data or noise can cause symbolic regression to fail completely.
On the other side of the spectrum, we have traditional NNs (MLPs), which are universal function approximators but work as “black boxes.” They don’t tell us how they approximate a function; they just do it. We get almost zero interpretability.
A KAN can, in theory, produce output that mimics symbolic regression by learning a function’s compositional structure. For example, if we’re modelling something like:

$$
y = \sin(x_1) + e^{x_2},
$$

MLPs would use layers of matrix multiplications and fixed activations (like ReLU) to approximate this. But with KANs, the model potentially actually learns the internal univariate functions (like the $\sin$ and $\exp$) and how to combine them. Once the KAN has trained, we could probe its learned activation functions and discover, for example, that it has closely approximated $\sin$ and $\exp$ as part of its learned structure.
The paper claims that KANs enjoy a neural scaling law of $\ell \propto N^{-4}$ (where $\ell$ is the test loss and $N$ is the number of parameters).
Don’t use `quarto preview` for large sites. Use caddy instead.
I constantly recommend the excellent quarto system for all manner of research tasks, including writing papers, building websites and making presentations. My recommendation is not because of the preview server, which is the worst bit of the messiest part, at least for my daily use case.
The blog you are reading right now, which is itself a quarto website, is probably much bigger than anything that the developers of quarto experience in general and as such it exposes many deficiencies in the preview server that teensy little websites might not.
I spent a long time fighting with the preview server (i.e. the one that spins up when we invoke `quarto preview`) for quarto websites, and eventually the list of information and hacks grew to deserve a page of its own.
Fear not! There are hacks to make it work better. Let me share the ones I know.
This page is dismissive of the quarto preview server, so you might think I am not a massive fan of quarto.
I am a massive fan of quarto insofar as I am a fan of anything, which is to say, grudgingly. It is an incredible bit of infrastructure that makes my life better. It is with deep love, and respect borne of my excellent experience, that I say: Quarto is so good that I will keep on using it despite how much the preview server sucks.
TBH it probably would not make sense for the developers of quarto to spend time on my use case; they will help more people by fixing other things.
In a world where I had leisure to contribute to open source, I would probably try to fix or replace the preview server, but that is not what I am being paid for and not what I wish to do in my scanty free time, so for now, here are easy, pragmatic workarounds that are pretty good and much easier than touching that giant and baffling web-app.
## `quarto preview` is too fancy

First caveat: the quarto preview server invoked by `quarto preview` is, per default, slightly too clever for my taste.
Excessive cleverness 1: It tries to do something fancy with process management. I am not sure what the nature of the fanciness is, but the upshot is that the server is a mediocre citizen of the command-line environment. If I run it in the background it magically daemonises or something, which makes it hard to kill. If I run it in the foreground, it is reluctant to die when I press `ctrl-c`. This is especially annoying because sometimes the build process will hang and cannot be quit from the CLI. One reason this seems to happen is if a template pops the EJS stack, because I am building a custom listing or something. The server process is a `deno` executable, so the following will salvage the situation:

```
killall deno
```
However, if I am running other deno processes on my computer, this will kill those too. I do not otherwise use deno, so I leave off my problem-solving there.
OTOH, if I run the preview server at the same time as a render process, it will die spontaneously sometimes.
Excessive cleverness 2: I am discombobulated when the quarto server tries to persuade my browser to switch to the “latest” updated page, since I am usually editing a few pages at once, and do not enjoy having my 7 open tabs suddenly decide to show me the same thing, instead of the 7 different things I wished to see. Infuriatingly, the back button does not work to undo this. Avoid this behaviour with
```
quarto preview --no-navigate
```
Excessive cleverness 3: Quarto chooses a new random port for the server each time, which is cute, but makes those 7 preview tabs impossible to bookmark and terrible for my browser workflow. I guess it is trying to optimise the possibility that if I run lots of servers, they will work? But I don’t have the many, many gigabytes of RAM and CPUs that would be required to run multiple copies of this app, so that is no use to me. I fix a predictable port thusly:
```
quarto preview --port 8887
```
Putting these together, my invocation for a preview is
```
quarto preview --port 8887 --no-navigate --no-browser
```
We should equivalently be able to encode that in a project setting via `_quarto.yml`:

```yaml
project:
  type: website
  output-dir: _site
  preview:
    port: 8887
    browser: false
    navigate: false
```
However, that does not work for me; I find I need to set the CLI flags instead.
## `quarto preview` server is broken, don’t use {#quarto-preview-broken}

tl;dr Efficient, reliable, convenient: choose none.
`quarto preview` uses colossal amounts of RAM for my site; I guess it is serving the site from memory?
Despite that, the quarto server itself is not that fast at serving files. It seems to block a lot and take ages to serve me a page, even if it has already rendered the page. A performance profile in the browser shows my HTML is being served at approximately 400 bytes/second, which is faster than typing, but not by much.
`quarto preview` also burns a surprising amount of compute. It tanks my battery if I try to edit my blog on the road, much more rapidly than, for example, running a full-featured real-time audio workstation with live effects processing.
`quarto preview` sometimes serves stale versions of the content I am working on. I might see an old version of some page, even if I have seen it tantalisingly serve the updated version momentarily, before it reverts to some older version. This annoyance is worse for me than it might sound, because I waste attention each time it happens trying to work out if the problem is quarto or me. For something I do hundreds of times per day, like editing my blog, it adds up to lots of time spent swearing and debugging, rather than writing.
If the preview server is not spending all that RAM on keeping the content current, and all that compute time on wondering whether the content is current, what is it actually doing?
The precise degree of slowness is particularly corrosive to my personal attention span. 30 seconds is long enough to drag productivity to a halt, but not long enough to let me go away and do something else.
A full re-render via `quarto render` while also running the preview server behaves unpredictably. Sometimes it crashes the preview server, or leaves detritus lying around. So the server is not set-and-forget, but rather a thing I need to babysit, and make sure it does not clash with a full re-render.
Sometimes, and I do not know why, the server decides a full re-render of the site is necessitated, although I have done nothing special, and then previewing that next one-line change needs a ~~12~~ 17 minute wait.
My fix for (some of) the weird bugs and misfeatures in `quarto preview` is to continually turn the preview server off and on again, which seems to keep memory use under control and keeps the pages more current. This is best done manually, by running it from the shell and doing a `^C` to kill it.
I attempted to define a helper function which kills the quarto process and restarts it automatically, but the wily `quarto` process seems to evade my attempts to kill it by spawning detached subprocesses or something, so it doesn’t work. This is how far I got.
```
quarto_preview_restart --restart-time 300 --port 9888 --no-navigate --no-browser --no-watch-input
```
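For concreteness, the definition might look something like this (a bash-flavoured sketch of my own; the name `quarto_preview_restart` and its flag handling are hypothetical, and as noted it does not reliably manage to kill quarto):

```shell
# Hypothetical sketch: run the preview for a fixed interval,
# kill it, and start it again, forever.
quarto_preview_restart() {
    local restart_time=${1:-300}
    shift
    while true; do
        quarto preview "$@" &
        local pid=$!
        sleep "$restart_time"
        # In practice quarto seems to detach and dodge this kill.
        kill "$pid" 2>/dev/null
        wait "$pid" 2>/dev/null
    done
}
```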
However, that does not work. `quarto preview` seems to detach if I run it in the background, and I do not know how to find the correct PID to kill it. Suggestions welcome.
Seeing this, I became enlightened. The best way to stop the quarto preview server is to never start it in the first place. Read on.
You know what? I don’t even know why I spent time trying to get quarto’s preview server to work. HTTP servers are solved. We don’t need to wring bonus performance out of quarto’s cockamamie home-grown solution, because we do not need to use it. I use caddy because I am familiar with it and it is small and reliable. As with probably every web server on that list, it is much faster and more reliable than `quarto preview`.

```
caddy file-server --listen 127.0.0.1:9889 --root _site
```
While that guy is running in the background, I manually render the files I need from the CLI.
```
quarto render notebook/gd_adaptive.qmd
```
The VS Code quarto extension has a keyboard shortcut to do that too, which works ok, but using that brings up a modal dialog box in which I need to confirm that I wish to render HTML every time, so the CLI is quicker.
I then manually reload my browser to see the changes.
This is not quite smooth, but it is smoother than the alternative. That friction is a small price to pay for actually seeing the changes, not some arbitrarily stale content, and moreover, seeing them in under 17 minutes, which is how long it took the quarto preview server to serve me a single page after a full re-render last time I tried.
After rendering individual pages using `quarto render my_page.qmd`, it seems to be necessary to do a full site re-render to reliably publish the site, otherwise most of it is missing.
There exists a file-watching, fancy-pants server for VS code, Live Server. I do not use that because it does not like serving files that I have hidden from the file explorer list, and I like hiding the HTML site output from the file explorer list to keep things tidy. YMMV.
Do not wish to install caddy? No problem: if you have quarto, you already have `deno`, so the built-in `deno` server probably works:

```
deno run --allow-net --allow-read jsr:@std/http/file-server
```
The deno path in `fish` shell is `(dirname (realpath (which quarto)))/tools/aarch64/deno` for ARM architectures and `(dirname (realpath (which quarto)))/tools/x86_64/deno` for x86_64 architectures.
Attention conservation notice: I think this is a recent round of culture war; you might want to skip reading about it unless you enjoy culture wars or are caught up in one of the touch-points of this one, such as AI risk or Effective Altruism.
Since I wrote this, the article about which the Torres article was written has been published (Gebru and Torres 2024). It has a clearer thesis than the Torres article, arguing that the central theme of the ‘bundle’ is not longtermism but rather eugenics. It also constructs the bundle in a rather different, IMO more coherent way, albeit still not convincing to me. It also makes an interesting argument about blame evasion which seems to me to be mostly independent of the TESCREAL definition, and which I am generally sympathetic to.
If I had to recommend one article, it would be that newer one; while I still find many things to disagree with, at least I can work out what they are.
An article by Émile P. Torres has gone viral in my circles recently, denouncing TESCREALism, i.e. Transhumanism, Extropianism, singularitarianism, cosmism, Rationalism, Effective Altruism and longtermism.
There is a lot going on in the article. In fact, there is too much going on. I am not really sure what the main thrust is, content-wise.
In terms of pragmatics, the article probably most successfully acts to found an anti-TESCREAList movement. In the course of doing so, it claims to make some supporting arguments.
I think the main thrust of criticism might be that some flavours of longtermism lead to unpalatable conclusions, including excessive worry about AI x-risk at the expense of the currently living. While making that argument, it frames several online communities which have entertained various longtermist ideas to be a “bundle”, which I assume is working to imply that these groups form a political bloc which encourages or enables accelerationist hypercapitalism. Is that definition the main thrust? Is it that longtermism is bad? Is it that accelerationist hypercapitalist movements are coordinating? It’s not clear to me. Not that I am one to criticise from a superior position; my own blog is notoriously full of half-finished thoughts awaiting structuring into cogent argument. In my defence, I don’t claim those are position-pieces. Maybe I should think of the Torres piece as a notebook rather than an article.
The article lists some philosophical stances I would also criticise.
But the main deal seems to hinge upon an argument of Torres’ which I do not buy: I am not a fan of TESCREALism as a term, in that I don’t think it is useful or credibly a meaningful category of analysis.
I’m vaguely baffled by the attempt to build this acronym. It does seem to have gotten traction. As such I wonder how much mileage I myself can get out of lumping together all the movements that have rubbed me the wrong way in the past into a single acronym. (“Let me tell you about NIMBYs, coal magnates, liberal scolds, and three-chord punk bands, and how they are all part of the same movement of patsies for malign forces, which I will call NICOLS3CP.”)
Which is to say, Torres’ linked TESCREAL article leans on genealogical arguments, mostly guilt by association. Among the movements they name that I know well, there is no single view on the topic of AI x-risk, longtermism, or the future in general; nor do they share, say, a consistent utilitarian stance, nor a consistent interpretation of utilitarianism when they are utilitarian. We could draw a murder-board upon which key individuals in each of the philosophies are connected by red string, but it doesn’t seem to be a natural category in any strong sense to me, any more than Axis Of Evil or NICOLS3CP.
That said, just because the associations do not seem meaningful to me, it does not follow that whatever arguments the movements do share (if there are any) are good ones. The article does not clearly identify which arguments in particular the author thinks are deficient in the bundle.
Like Torres, I take issue with reasoning like this:
the biggest tragedy of an AGI apocalypse wouldn’t be the 8 billion deaths of people now living. This would be bad, for sure, but much worse would be the nonbirth of trillions and trillions of future people who would have otherwise existed.
On the other hand, if the action to take to prevent an AI apocalypse is the same either way, do we care? Is taking AI risk seriously the bad bit? Or is it the moral justification that we are supposed to be worried about?
The author also seems to be exercised about some longtermist themes, e.g. how to trade off the needs of people living now against those of people yet to be born. There are for sure some weird outcomes from some of the longtermist thought experiments.
If I disagree with some school of longtermism, why not just say I disagree with it, without bringing in this bundle? Better yet, why not mention which of the many longtermisms I am worried about, and rebut a specific argument they make?
The muddier strategy of the article, disagreeing-with-longtermism-plus-feeling-bad-vibes-about-various-other-movements-and-philosophies-that-have-a-diverse-range-of-sometimes-tenuous-relationships-with-longtermism, doesn’t feel to me like it is making the other half of the article which invents TESCREALism do useful work.
I saw this guilt-by-association play out in public discourse previously with “neoliberalism”, and probably the criticisms of the “woke” “movement” are doing the same work. Since reading that article, I have become worried that I am making the same mistake myself when talking about neoreactionaries. As such, I am grateful to the author for making me interrogate my own prejudices, although I suspect that if anything, I have been shifted in the opposite direction from the one they intended.
Don’t get me wrong, it is important to note what uses are made of philosophies by movements. Further, movements are hijacked by bad actors all the time (which is to say, actors whose ends may have little to do with the stated goals of the movement), and it is important to be aware of that. But once again, what are we doing by lumping a bunch of movements together? Maybe I need to see the red string on this murder-board and I will be persuaded. Until then, gerrymandering them together seems suspect to me.
If “TESCREALists” are functioning as a bloc, then… by all means, analyse this. I think that some signatories to some components of the acronym do indeed function as a bloc from time to time (cf rationalists and effective altruists).
Broadly, however, I am not convinced there is a movement to hijack in the acronym. Cosmism and Effective Altruism are not in correspondence with each other, not least because all the Cosmists are dead AFAIK.
To be clear, I think there are terribly dangerous philosophies in the various movements name-checked. Some flavours of longtermism (e.g. the ones that apparently allow an unlimited budget of suffering now against badly quantified futures) seem to me also undercooked.
That said, if the goal is to have a banner to rally under, complaining about TESCREALists provides one, a Schelling point for the “anti-TESCREAL movement”. That might actually be a coherent thing, or at least more coherent than the thing it criticises. And why not? Uniting against a common enemy might be more important than the enemy existing. There are many implied demands that the anti-TESCREALists seem to make: about corporate accountability, about equity for workers involved in the AI industry, and against tech-accelerationism. Some of these demands are ones I make myself. I am not sure that any of them depend upon the TESCREAL bundle bundling, or shadowy cabals coordinating, or the existence of an underlying coherent opposed philosophy.
A weak spot of the contemporary social internet infrastructure: Scheduling events with friends, or potential friends.
Facebook had an effective monopoly on this in the West, which I was not a fan of. But maybe it was better than the current situation where nothing works.
Everyone — including you! — should host more events (with Nick Gray)
If one must do it online, try Priya Parker’s Together Apart Podcast.
“Do one thing and do it well.” Earnest, cute, low-feature invite service. Free.
Luma:
Slightly nerdier full-featured service that can integrate into fancy software and such.
We created Mixily in 2019 because we had a vision for a better way to host events. Mixily started as an event hosting platform that allows the sending of invitations, collection of RSVPs, and selling of tickets. […]
Imagine a world in which you can easily set up and host events, find a convenient day for all parties involved, and make the event look exactly like you want with simple and stylish design.
Meetup is for thematic, regular community discovery things. It’s OK for that, but it is annoying that it keeps trying to sell me wine vouchers.
Still works even for people who don’t use Facebook. It’s still feeding the Meta Social Graph Moloch, though, so many Facebook-deniers will refuse to use it (including, sometimes, me).
Calagator is an open-source community calendar platform written in Ruby on Rails that runs calagator.org, a Portland tech calendar. Think open-source-meetup.
GigTripper is an online gig booking platform, built specifically for the Australian Live Music Industry. We uniquely focus on a solution to improve the process of booking live music gigs for independent artists, primarily (but not limited to) those early in their career.
calndr.link creates calendar links.
I built calndr.link after I had multiple clients request the exact same thing — a simple and easy way to generate calendar links, for adding to their website or in email newsletters.
There are a few existing providers out there, but they’re extremely pricey for what they do — take some basic data (title/date/etc) and reformat it into a URL for a calendar provider (be it Google or Apple).
Nothing to do with a Gibbs sampler or a Gibbs distribution.
Syring (2018):
Bayesian inference is, by far, the most well-known statistical method for updating beliefs about a population feature of interest in light of new data. Current beliefs, characterized by a probability distribution called a prior, are updated by combining with data, which is modeled as a random draw from another probability distribution. The Bayesian framework, therefore, depends heavily on the choices of model distributions for prior and data, and it is the latter that is of particular concern in this dissertation. Often, as will be shown in various examples, it is particularly difficult to make a good choice of data model: a bad choice may lead to misspecification and inconsistency of the posterior distribution, or may introduce nuisance parameters, increasing computational burden and complicating the choice of prior. Some particular statistical problems that may give Bayesians pause are classification and quantile regression. In these two problems a mathematical function called a loss function serves as the natural connection between the data and the population feature. Statistical inference based on loss functions can avoid having to specify a probability model for the data and parameter, which may be incorrect. Bayes’ Theorem cannot reconcile a posterior update using anything other than a probability model for data, so alternative methods are needed, besides Bayes, in order to take advantage of loss functions in these types of problems.
Gibbs posteriors, like Bayes posteriors, incorporate prior information and new data via an updating formula. However, the Gibbs posterior does not require modeling the data with a probability model as in Bayes; rather, data and parameter may be linked by a more general function, like the loss functions mentioned above. The Gibbs approach offers many potential benefits including robustness when the data distribution is not known and a natural avoidance of nuisance parameters, but Gibbs posteriors are not common throughout statistics literature. In an effort to raise awareness of Gibbs posteriors, this dissertation both develops new theoretical foundations and presents numerous examples highlighting the usefulness of Gibbs posteriors in statistical applications.
Two new asymptotic results for Gibbs posteriors are contributed. The main conclusion of the first result is that Gibbs posteriors have similar asymptotic behaviour to a class of statistical estimators called M-estimators in a wide range of problems. The main advantage of the Gibbs posterior, then, is its ability to incorporate prior information.
There is a compact and clear explanation in Martin and Syring (2022).
Question: Is this the same as Bissiri, Holmes, and Walker (2016)? The use of a loss function instead of a likelihood sounds like a shared property of the two.
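A toy example may help fix ideas. The following sketch (assuming numpy; the learning-rate choice $\omega = 1$ and the grid are arbitrary) computes a Gibbs posterior for the median of some skewed data on a grid, using the absolute-error loss in place of a likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200)  # skewed data, no likelihood assumed

# Loss linking parameter and data: absolute error targets the median.
theta = np.linspace(0.0, 6.0, 1001)                      # parameter grid
risk = np.abs(x[:, None] - theta[None, :]).sum(axis=0)   # empirical risk R_n

omega = 1.0                          # learning rate (a modelling choice)
log_prior = np.zeros_like(theta)     # flat prior over the grid
log_post = log_prior - omega * risk  # Gibbs posterior: prior x exp(-omega R_n)
log_post -= log_post.max()           # stabilise before exponentiating
post = np.exp(log_post)
post /= post.sum()

post_mean = float((theta * post).sum())
print(post_mean, np.median(x))
```

The posterior mode sits at the empirical median (the minimiser of the summed absolute error), without ever committing to a data model.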
A function which maps an arbitrary real $K$-vector to the weights of a categorical distribution (i.e. to the $(K-1)$-simplex).
The $(K-1)$-simplex is defined as the set of $K$-dimensional vectors whose elements are non-negative and sum to one. Specifically,
$$\Delta^{K-1} = \left\{ p \in \mathbb{R}^{K} : p_{i} \geq 0 \text{ for all } i,\ \sum_{i=1}^{K} p_{i} = 1 \right\}.$$
This set describes all possible probability distributions over $K$ outcomes, which aligns with the purpose of the softmax function in generating probabilities from “logits” (un-normalised log-probabilities) in classification problems.
Ubiquitous in modern classification tasks, particularly in neural networks.
Why? Well for one, it turns the slightly fiddly problem of estimating a constrained quantity into an unconstrained one, in a computationally expedient way. It’s not the only such option, but it is simple and has lots of nice mathematical symmetries. It is kinda-sorta convex in its arguments. It falls out in variational inference via KL, etc.
The softmax function transforms a vector of real numbers into a probability distribution over predicted output classes for classification tasks. Given a vector $z \in \mathbb{R}^{K}$, the softmax function for the $i$-th component is
$$\sigma(z)_{i} = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}}.$$
The first derivative with respect to $z_{j}$ is
$$\frac{\partial \sigma(z)_{i}}{\partial z_{j}} = \sigma(z)_{i}\left(\delta_{ij} - \sigma(z)_{j}\right),$$
where $\delta_{ij}$ is the Kronecker delta.
The second derivative is then
$$\frac{\partial^{2} \sigma(z)_{i}}{\partial z_{j}\,\partial z_{k}} = \sigma(z)_{i}\left(\delta_{ij} - \sigma(z)_{j}\right)\left(\delta_{ik} - \sigma(z)_{k}\right) - \sigma(z)_{i}\,\sigma(z)_{j}\left(\delta_{jk} - \sigma(z)_{k}\right),$$
i.e. it is symmetric in $j$ and $k$.
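We can sanity-check the Jacobian identity numerically; in matrix form it is $\operatorname{diag}(\sigma) - \sigma\sigma^{\top}$. A small numpy sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def softmax_jacobian(z):
    # J[i, j] = sigma_i * (delta_ij - sigma_j) = diag(sigma) - sigma sigma^T
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

z = np.array([0.5, -1.0, 2.0, 0.0])
J = softmax_jacobian(z)

# Compare against central finite differences.
eps = 1e-6
J_fd = np.zeros_like(J)
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    J_fd[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.abs(J - J_fd).max())  # near machine precision
```

Note also that each row of the Jacobian sums to zero, since the outputs are constrained to the simplex.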
Suppose we do not use the $\exp$ map, but generalize the softmax to use some other invertible, differentiable, increasing function $g : \mathbb{R} \to \mathbb{R}_{>0}$. Given a vector $z$, the generalized softmax function for the $i$-th component is defined as
$$\sigma_{g}(z)_{i} = \frac{g(z_{i})}{\sum_{j=1}^{K} g(z_{j})}.$$
TBD
Sampling from the softmax distribution can be approximated differentiably using the Gumbel-softmax trick, which is useful for training neural networks with discrete outputs.
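A minimal numpy sketch of the trick (function names are mine): perturb the logits with Gumbel noise and pass them through a temperature-scaled softmax. As the temperature $\tau \to 0$ the samples approach one-hot vectors whose argmax is an exact draw from the corresponding categorical distribution (the Gumbel-max trick):

```python
import numpy as np

rng = np.random.default_rng(1)
logits = np.array([1.0, 0.0, -1.0])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gumbel_softmax(logits, tau, rng):
    # Gumbel(0, 1) noise: -log(-log(U)) for U ~ Uniform(0, 1)
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    return softmax((logits + g) / tau)

# At small temperature the samples are nearly one-hot; the index of the
# maximum is an exact categorical draw, which we verify by frequency.
n = 20000
counts = np.zeros(3)
for _ in range(n):
    y = gumbel_softmax(logits, tau=0.1, rng=rng)
    counts[np.argmax(y)] += 1
print(counts / n, softmax(logits))
```

In training one would use a moderate $\tau$ so that gradients flow through the relaxed sample.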
We consider the entropy of a categorical distribution with probabilities $p_{i}$, where the probabilities are given by the softmax function, with
$$p_{i} = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}}, \qquad \log p_{i} = z_{i} - \log \sum_{j} e^{z_{j}}.$$
The entropy is by definition
$$H = -\sum_{i} p_{i} \log p_{i}.$$
Substituting into the entropy expression, we obtain:
$$H = -\sum_{i} \sigma(z)_{i} \left(z_{i} - \log \sum_{j} e^{z_{j}}\right) = \log \sum_{j} e^{z_{j}} - \sum_{i} \sigma(z)_{i}\, z_{i}.$$
Thus, the entropy of the softmax distribution simplifies to
$$H(z) = A(z) - \sigma(z)^{\top} z, \qquad A(z) := \log \sum_{j} e^{z_{j}}.$$
If we are using softmax we probably care about derivatives, so let us compute the gradient of the entropy with respect to $z_{k}$:
$$\frac{\partial H}{\partial z_{k}} = \sigma(z)_{k} - \frac{\partial}{\partial z_{k}} \sum_{i} \sigma(z)_{i}\, z_{i} = -\sigma(z)_{k}\left(z_{k} - \sum_{i} \sigma(z)_{i}\, z_{i}\right),$$
where we used $\partial A / \partial z_{k} = \sigma(z)_{k}$ and $\sum_{i} \sigma(z)_{i} = 1$.
For compactness, we define $\bar{z} := \sigma(z)^{\top} z$ and $u_{k} := z_{k} - \bar{z}$. Thus, the gradient vector is
$$\nabla_{z} H = -\sigma(z) \odot u,$$
thence the Hessian matrix has entries
$$\left[\nabla_{z}^{2} H\right]_{k\ell} = \sigma_{k}\sigma_{\ell}\left(u_{k} + u_{\ell} + 1\right) - \delta_{k\ell}\,\sigma_{k}\left(u_{k} + 1\right).$$
Using the Taylor expansion, we approximate the entropy after a small change $\Delta z$:
$$H(z + \Delta z) \approx H(z) + (\nabla_{z} H)^{\top} \Delta z + \tfrac{1}{2}\, \Delta z^{\top} \left(\nabla_{z}^{2} H\right) \Delta z.$$
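The gradient formula, $\partial H / \partial z_{k} = -\sigma_{k}(z_{k} - \sigma^{\top} z)$, is easy to check against finite differences (numpy sketch, function names mine):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(z):
    p = softmax(z)
    return -(p * np.log(p)).sum()

def entropy_grad(z):
    # dH/dz_k = -sigma_k * (z_k - sigma . z)
    s = softmax(z)
    return -s * (z - s @ z)

z = np.array([0.3, -0.7, 1.5, 0.0])
g = entropy_grad(z)

# Central finite differences, one coordinate at a time.
eps = 1e-6
g_fd = np.array([
    (entropy(z + eps * e) - entropy(z - eps * e)) / (2 * eps)
    for e in np.eye(len(z))
])
print(np.abs(g - g_fd).max())
```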
Let’s extend the reasoning to category probabilities given by the generalized softmax function
$$p_{i} = \frac{g(z_{i})}{G}, \qquad G := \sum_{j} g(z_{j}),$$
where $g$ is an increasing, differentiable, positive function.
The entropy becomes
$$H = -\sum_{i} p_{i} \log p_{i} = \log G - \frac{1}{G} \sum_{i} g(z_{i}) \log g(z_{i}).$$
To compute the gradient $\partial H / \partial z_{k}$, we note that
$$\frac{\partial p_{i}}{\partial z_{k}} = \frac{g'(z_{k})}{G}\left(\delta_{ik} - p_{i}\right).$$
Then, the gradient is
$$\frac{\partial H}{\partial z_{k}} = -\sum_{i}\left(1 + \log p_{i}\right) \frac{\partial p_{i}}{\partial z_{k}} = -\frac{g'(z_{k})}{G}\left(\log p_{k} + H\right).$$
The score function estimator, a.k.a. log-derivative trick, a.k.a. REINFORCE (all-caps, for some reason?), is a generic method that works on various types of variables; it has notoriously high variance if done naïvely. Credited to (Williams 1992), it must be older than that.
This is the fundamental insight:
$$\nabla_{\theta}\, \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\right] = \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\, \nabla_{\theta} \log p_{\theta}(x)\right].$$
This suggests a simple and obvious Monte Carlo estimate of the gradient by choosing samples $x_{s} \sim p_{\theta}$:
$$\nabla_{\theta}\, \mathbb{E}_{x \sim p_{\theta}}\left[f(x)\right] \approx \frac{1}{S} \sum_{s=1}^{S} f(x_{s})\, \nabla_{\theta} \log p_{\theta}(x_{s}).$$
For unifying overviews, see (Mohamed et al. 2020; Schulman et al. 2015; van Krieken, Tomczak, and Teije 2021) and the Storchastic docs.
It is annoyingly hard to find a clear example of this method online, despite its simplicity; all the code examples I see wrap it up with reinforcement learning or some other unnecessarily specific complexity.
Laurence Davies and I put together this demo, in which we try to find the parameters that minimise the difference between the categorical distribution we sample from and some target distribution.
import torch
# True target distribution probabilities
true_probs = torch.tensor([0.1, 0.6, 0.3])
# Optimisation parameters
n_batch = 1000
n_iter = 3000
lr = 0.01
def loss(x):
"""
The target loss, a negative log-likelihood for a
categorical distribution with the given probabilities.
"""
return -torch.distributions.Multinomial(
total_count=1, probs=true_probs).log_prob(x)
# Set the seed for reproducibility
torch.manual_seed(42)
# Initialize the parameter estimates
theta_hat = torch.nn.Parameter(torch.tensor([0., 0., 0.]))
optimizer = torch.optim.Adam([theta_hat], lr=lr)
for epoch in range(n_iter):
optimizer.zero_grad()
# Sample from the estimated distribution
x_sample = torch.distributions.Multinomial(
1, logits=theta_hat).sample((n_batch,))
# evaluate log density at the sample points
log_p_theta_x = torch.distributions.Multinomial(
1, logits=theta_hat).log_prob(x_sample)
# Evaluate the target function at the sample points
f_hat = loss(x_sample)
# Compute the gradient of the log density wrt parameters.
# The `grad_outputs` multiply the `f_hat` by gradient directly.
grad_log_p_theta_x = torch.autograd.grad(
outputs=log_p_theta_x,
inputs=theta_hat,
grad_outputs=torch.ones_like(log_p_theta_x),
create_graph=True)[0]
# The final gradients are weighted over the sample points
final_gradients = (
f_hat.detach().unsqueeze(1)
* grad_log_p_theta_x
).mean(dim=0)
theta_hat.grad = final_gradients
optimizer.step()
if epoch % 100 == 0:
print(f"Epoch {epoch}, Estimated Probs:"
f"{torch.softmax(theta_hat, dim=0).detach().numpy()}")
# Display the final estimated probabilities
estimated_final_probs = torch.softmax(theta_hat, dim=0)
print("Final Estimated Probabilities: "
f" {estimated_final_probs.detach().numpy()}"
f" (True Probabilities: {true_probs.detach().numpy()}")
Note that the batch size there is very large. If we set it to be smaller, the variance of the estimator is too high to be useful.
Classically we might address such problems with a diminishing learning rate as per SGD, but I have lazily not done that here.
Rao-Blackwellization (Casella and Robert 1996) seems like a natural extension to reduce the variance. How would it work? Liu et al. (2019) is a contemporary example; I have a vague feeling that I saw something similar in Rubinstein and Kroese (2016). 🚧TODO🚧 clarify.
Placeholder to talk about one hyped means of explaining models, especially large language models, by using sparse autoencoders. Popular as an AI Safety technology.
As set functions, transformers look a lot like ‘generalized inference machines’. Are they? Can we make them do ‘proper’ inference, in some formal sense?
This is a scrapbook of interesting approaches; Bayesian inference over LLM outputs, understanding in-context learning as Bayesian conditioning, and so on.
Probably connected: LLM explanation via Sparse Autoencoders.
First one:
Alireza Makhzani introduces Zhao et al. (2024):
Many capability and safety techniques of LLMs—such as RLHF, automated red-teaming, prompt engineering, and infilling—can be viewed from a probabilistic inference perspective, specifically as sampling from an unnormalized target distribution defined by a given reward or potential function. Building on this perspective, we propose to use twisted Sequential Monte Carlo (SMC) as a principled probabilistic inference framework to approach these problems. Twisted SMC is a variant of SMC with additional twist functions that predict the future value of the potential at each timestep, enabling the inference to focus on promising partial sequences. We show the effectiveness of twisted SMC for sampling rare, undesirable outputs from a pretrained model (useful for harmlessness training and automated red-teaming), generating reviews with varied sentiment, and performing infilling tasks.
Our paper offers much more! We propose a novel twist learning method inspired by energy-based models; we connect the twisted SMC literature with soft RL; we propose novel bidirectional SMC bounds on log partition functions as a method for evaluating inference in LLMs; and finally we provide probabilistic perspectives for many more controlled generation methods in LLMs.
More methods in the references.
Placeholder. See Wikipedia Ratio of uniforms for now.
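Here is a minimal sketch (in numpy; function name mine) of the method for a standard normal target: draw $(u, v)$ uniformly on $(0, 1] \times [-\sqrt{2/e}, \sqrt{2/e}]$, accept whenever $u \le \sqrt{f(v/u)}$ with $f(x) = e^{-x^{2}/2}$ the unnormalised density, and return the ratio $v/u$:

```python
import numpy as np

def rou_normal(n, rng):
    """Sample n standard normals via the ratio-of-uniforms method."""
    b = np.sqrt(2.0 / np.e)  # half-width of the bounding box for N(0, 1)
    out = []
    while len(out) < n:
        u = 1.0 - rng.uniform(0.0, 1.0, size=n)  # uniform on (0, 1]
        v = rng.uniform(-b, b, size=n)
        x = v / u
        # Accept when (u, v) lies under the curve u = sqrt(f(v/u)).
        accept = u**2 <= np.exp(-x**2 / 2.0)
        out.extend(x[accept].tolist())
    return np.array(out[:n])

rng = np.random.default_rng(2)
samples = rou_normal(50_000, rng)
print(samples.mean(), samples.var())
```

The acceptance rate for the normal is around 73%, so the rejection loop terminates quickly.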
Placeholder. Are the representations we organisms have likely to correspond to things in the world, and if so, how? The TED talk version is “Does evolution favour organisms seeing truth?” I do not love this framing, for all that it makes great TED talks. I would prefer to think about the decision theory of representations in learners, but that is not catchy.
See also predictive coding and semantics.
Placeholder for this very important bit of spatial modeling.
See climate+ML.
“Where is that pollution coming from?”
Databases for proximity search over vectors. Made important especially by vector embeddings. Related: learnable indices.
Built on top of popular vector search libraries including Faiss, Annoy, HNSW, and more, Milvus was designed for similarity search on dense vector datasets containing millions, billions, or even trillions of vectors. Before proceeding, familiarize yourself with the basic principles of embedding retrieval.
Milvus also supports data sharding, data persistence, streaming data ingestion, hybrid search between vector and scalar data, time travel, and many other advanced functions. The platform offers performance on demand and can be optimized to suit any embedding retrieval scenario. We recommend deploying Milvus using Kubernetes for optimal availability and elasticity.
Milvus adopts a shared-storage architecture featuring storage and computing disaggregation and horizontal scalability for its computing nodes. Following the principle of data plane and control plane disaggregation, Milvus comprises four layers: access layer, coordinator service, worker node, and storage. These layers are mutually independent when it comes to scaling or disaster recovery.
Milvus Lite is a simplified alternative to Milvus that offers many advantages and benefits.
- You can integrate it into your Python application without adding extra weight.
- It is self-contained and does not require any other dependencies, thanks to the standalone Milvus’ ability to work with embedded Etcd and local storage.
- You can import it as a Python library and use it as a command-line interface (CLI)-based standalone server.
- It works smoothly with Google Colab and Jupyter Notebook.
- You can safely migrate your work and write code to other Milvus instances (standalone, clustered, and fully-managed versions) without any risk of losing data.
Voyager features bindings to both Python and Java, with feature parity and index compatibility between both languages. It uses the HNSW algorithm, based on the open-source hnswlib package, with numerous features added for convenience and speed. Voyager is used extensively in production at Spotify and is queried hundreds of millions of times per day to power numerous user-facing features.
Think of Voyager like Sparkey, but for vector/embedding data; or like Annoy, but with much higher recall. It got its name because it searches through (embedding) space(s), much like the Voyager interstellar probes launched by NASA in 1977.
One of the really big spatiotemporal systems: the ocean. Pairs with atmospheric science and computational fluid dynamics.
FUNWAVE–TVD is the Total Variation Diminishing (TVD) version of the fully nonlinear Boussinesq wave model (FUNWAVE) developed by Shi et al. (2012). The FUNWAVE model was initially developed by Kirby et al. (1998) based on Wei et al. (1995). The development of the present version was motivated by recent needs for modeling of surfzone–scale optical properties in a Boussinesq model framework, and modeling of Tsunami waves in both a global/coastal scale for prediction of coastal inundation and a basin scale for wave propagation.
This version features several theoretical and numerical improvements, including:
- A more complete set of fully nonlinear Boussinesq equations;
- Monotonic Upwind Scheme for Conservation Laws (MUSCL)–TVD solver with adaptive Runge–Kutta time stepping;
- Shock–capturing wave breaking scheme;
- Wetting–drying moving boundary condition with incorporation of Harten-Lax-van Leer (HLL) construction method into the scheme;
- Lagrangian tracking;
- Option for parallel computation.
The most recent developments include ship-wake generation (Shi et al. 2018), meteo-tsunami generation (Woodruff et al. 2018), and sediment transport and morphological changes (Malej, Shi, and Smith 2019).
Suppose I have two random functions $f$ and $g$, each of which is a Gaussian process, i.e. a well-behaved random function. The two functions are defined on the same domain $\mathcal{X}$, and I want to reconcile them in some way. This could be because they are noisy observations of the same underlying function, or because they are two different models of the same phenomenon. This kind of thing arises often in the context of spatiotemporal modeling. There is also a close connection to Gaussian Belief Propagation. In either case, I want to find a way to combine them into a single random function that captures the information in both. Ideally, it should behave like a standard Bayes update does, e.g. if they both agree with high certainty, then the combined function should also agree with high certainty. If they do not agree, or have low certainty, then the combined function should reflect that by having low certainty.
I am sure that this must be well-studied, but it is one of those things that is rather hard to google for and ends up being easier to work out by hand, which is what I do here.
Suppose I have two GP priors $f \sim \mathcal{GP}(\mu_{f}, K_{f})$ and $g \sim \mathcal{GP}(\mu_{g}, K_{g})$ defined over the same index set. They could be two posterior likelihood updates, or two expert priors, or whatever.
How do we reconcile these two GPs into a single GP $h$?
The standard answer for Gaussian processes is to find a new one whose density is (proportional to) the product of the densities of the two components, a.k.a. a product of experts; for Gaussians this works out to a precision-weighted combination of the two.
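Restricted to any finite set of index points, the product-of-densities rule is just the product of two multivariate Gaussians, which is again Gaussian with summed precisions and a precision-weighted mean. A minimal numpy sketch (function name mine):

```python
import numpy as np

def fuse_gaussians(m1, S1, m2, S2):
    """Product-of-densities fusion of N(m1, S1) and N(m2, S2).

    The renormalised product is Gaussian with precision equal to the sum
    of the component precisions and a precision-weighted mean.
    """
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    S = np.linalg.inv(P1 + P2)
    m = S @ (P1 @ m1 + P2 @ m2)
    return m, S

# Two beliefs about the same 2-point marginal of a random function.
m1, S1 = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 1.0]])
m2, S2 = np.array([0.5, 0.8]), np.array([[0.5, 0.1], [0.1, 0.5]])
m, S = fuse_gaussians(m1, S1, m2, S2)
print(m, np.diag(S))
```

Note the desired behaviour: the fused covariance is never larger (in the Loewner order) than either component's, so agreement between confident beliefs yields a confident combination.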
This idea bears a resemblance to Griffin-Lim iteration phase recovery, where we have two overlapping signals and we want to combine them into a single signal that is consistent with both. That case is somewhat different because it assumes
Turns out we can use GANs as time series predictors. How does this work? No idea.
Should look into this (Yang, Zhang, and Karniadakis 2020; Kidger et al. 2021).
Not sure, but mentioned in (Salvi et al. 2024, 2021).
Placeholder, for the infinite-dimensional version of SDEs.
Point of confusion: Cylindrical versus q-Wiener processes.
Placeholder, about the Inverse Gaussian distribution, which is a tractable exponential family distribution for non-negative random variables.
tl;dr
As a non-negative exponential family, it also induces a Lévy subordinator.
Banerjee and Bhattacharyya (1979) present a reasonably nice conjugate prior, albeit with an alternative parameterization of the distribution.
Write the IG pdf as
$$f(x \mid \mu, \lambda) = \left(\frac{\lambda}{2\pi x^{3}}\right)^{1/2} \exp\left\{-\frac{\lambda (x - \mu)^{2}}{2 \mu^{2} x}\right\}, \qquad x > 0;$$
the likelihood of a random sample $x_{1}, \dots, x_{n}$ from $\mathrm{IG}(\mu, \lambda)$ is then
$$L(\mu, \lambda) \propto \lambda^{n/2} \exp\left\{-\frac{n \lambda}{2}\left(\frac{\bar{x}}{\mu^{2}} - \frac{2}{\mu} + \bar{v}\right)\right\},$$
where $\bar{x}, \bar{v}$ are respectively the sample mean of the observations and that of their reciprocals.
Their major result is as follows
[…]a bivariate natural conjugate family for can be taken as where are parameters and the constant is given by […]
Hence the joint posterior pdf of and can be reduced to the form […] the marginal posterior distribution of is the modified gamma , and the marginal posterior pdf of is the truncated distribution with .
The modified gamma is derived from this guy:
This looks … somewhat tedious, but basically feasible I suppose.
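Before wading into the conjugate machinery, a quick sanity check of the $(\mu, \lambda)$ parameterisation (assuming numpy, whose `wald` sampler uses exactly this mean/shape parameterisation): the closed-form MLEs are $\hat{\mu} = \bar{x}$ and $\hat{\lambda}^{-1} = n^{-1} \sum_{i} (1/x_{i} - 1/\bar{x})$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true, lam_true = 2.0, 5.0

# numpy's "wald" distribution is IG(mean, lambda) in this parameterization.
x = rng.wald(mu_true, lam_true, size=100_000)

# Closed-form MLEs: mu_hat is the sample mean; lambda_hat comes from
# the mean of the reciprocals minus the reciprocal of the mean.
n = len(x)
mu_hat = x.mean()
lam_hat = n / np.sum(1.0 / x - 1.0 / mu_hat)
print(mu_hat, lam_hat)
```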
The most visceral case study in the history and philosophy of science. For thousands of years, physicians have been an esteemed profession, and yet, by some estimates, they have only been a net benefit to their patients for the last hundred years. What does that say about our connection to effectiveness? About prestige? About our ability to evaluate the effectiveness of our institutions? About the influence of literal skin in the metaphorical game?
The 60-Year-Old Scientific Screwup That Helped Covid Kill (Randall et al. 2021).
the constant on medicine
The Dream podcast on wellness
See medicalisation.
Ben Krauss, The strange history of osteopathic medicine
Kelsey Piper, Scientific fraud solutions: Should research misconduct be illegal?
That is, the policy which Poldermans had recommended using falsified data, adopted in Europe on the basis of his research, was actually dramatically increasing the odds people would die in surgery.
Millions of surgeries were conducted across the US and Europe during the years from 2009 to 2013 when those misguided guidelines were in place. One provocative analysis from cardiologists Graham Cole and Darrel Francis estimated that there were 800,000 deaths compared to if the best practices had been established five years sooner.
Data scientists who must pretend they can remember statistics
A conjugate prior is one that is closed under sampling given its matched likelihood function. I occasionally see people talk about this as if it usefully applies to non-exponential family likelihoods, but I am familiar with it only in the case of exponential families, so we restrict ourselves to that case here.
It seems to arise in the 60s (DeGroot 2005; Raiffa and Schlaifer 2000), and be re-interpreted in the 70s (Diaconis and Ylvisaker 1979). A pragmatic intro is Fink (1997). Robert (2007) chapter 3 is gentler.
Exponential families have tractable conjugate priors, which means that the posterior distribution is in the same family as the prior, and moreover, there is a simple formula for updating the parameters. This is deliciously easy, and also misleads one into thinking that Bayes inference is much easier than it actually is in the general case, because it is so easy in this one.
We are going to observe lots of i.i.d. realisations of some variate $x$ and would like a consistent procedure for updating our beliefs about its distribution.
Our observation is assumed to arise from an exponential family likelihood. That is, given (vector) parameter $\theta$, $x$ has a density of the following form:
$$p(x \mid \theta) = h(x) \exp\left(\eta(\theta)^{\top} T(x) - A(\theta)\right).$$
Here:
- $h(x)$ is the base measure;
- $T(x)$ is the sufficient statistic;
- $\eta(\theta)$ is the natural parameter;
- $A(\theta)$ is the log-partition function, which normalises the density.
Rewriting in natural parameters, we have
$$p(x \mid \eta) = h(x) \exp\left(\eta^{\top} T(x) - A(\eta)\right).$$
If we knew $\eta$ we would now have a distribution for $x$. In practice, we are not sure about $\eta$, so we have a prior distribution for it. Things will go well for us if we choose this prior to have a particular, and particularly convenient, form.
The conjugate prior for $\eta$ is designed to ensure that the posterior distribution remains within the same family after a realization from that likelihood we just introduced.
A conjugate prior has to look like this:
$$\pi(\eta \mid \chi, \nu) = f(\chi, \nu) \exp\left(\eta^{\top} \chi - \nu A(\eta)\right),$$
where $\chi$ means something like ‘accumulated sufficient statistics from prior knowledge’ and $\nu$ the ‘weight’ of the prior or the ‘number of prior observations’. These are effectively hyperparameters encoding how certain we are. This looks like an exponential family distribution ($f(\chi, \nu)$ is the $h$-like base measure, here just a normalising constant), except for this weird scaling of the log-partition function by $\nu$. It is in fact a tempered exponential family.
The prior predictive distribution for a new observation $x$ is obtained by integrating the product of the likelihood and the prior over the natural parameter $\eta$:
$$p(x \mid \chi, \nu) = \int p(x \mid \eta)\, \pi(\eta \mid \chi, \nu)\, \mathrm{d}\eta = h(x)\, f(\chi, \nu) \int \exp\left(\eta^{\top}(\chi + T(x)) - (\nu + 1) A(\eta)\right) \mathrm{d}\eta.$$
This integral represents the normalization constant of an updated exponential family distribution with parameters updated to $\chi + T(x)$ and $\nu + 1$. Thus, the integral simplifies to $1 / f(\chi + T(x), \nu + 1)$, where $f$ is the normalizing factor ensuring that the distribution integrates to 1. Hence, the prior predictive distribution becomes:
$$p(x \mid \chi, \nu) = h(x)\, \frac{f(\chi, \nu)}{f(\chi + T(x), \nu + 1)}.$$
The prior predictive distribution essentially provides the likelihood of observing $x$ before any actual data are observed, based solely on the prior parameters $\chi$ and $\nu$. This distribution reflects how beliefs encoded in the prior (through $\chi$ and $\nu$) influence expectations about future data points, integrated over all possible values of the natural parameter $\eta$.
Let us suppose an observation $x$ arrives. We would like a conjugate posterior update that incorporates the new information. The update to the conjugate prior’s parameters is:
$$\chi \mapsto \chi + T(x), \qquad \nu \mapsto \nu + 1.$$
The posterior distribution of $\eta$, after observing $x$, in full, is thus:
$$\pi(\eta \mid \chi + T(x), \nu + 1) = f(\chi + T(x), \nu + 1) \exp\left(\eta^{\top}(\chi + T(x)) - (\nu + 1) A(\eta)\right).$$
As with the prior predictive, we need to integrate out the natural parameters of the likelihood. The posterior predictive distribution for a new observation $x'$ given the observed data $x$ is obtained by integrating over the posterior distribution of $\eta$:
$$p(x' \mid x) = \int p(x' \mid \eta)\, \pi(\eta \mid \chi + T(x), \nu + 1)\, \mathrm{d}\eta.$$
Expanding this using the forms we derived above for $p(x' \mid \eta)$ and $\pi$, we find:
$$p(x' \mid x) = h(x')\, f(\chi + T(x), \nu + 1) \int \exp\left(\eta^{\top}(\chi + T(x) + T(x')) - (\nu + 2) A(\eta)\right) \mathrm{d}\eta.$$
This integral represents the normalizing constant of an updated exponential family distribution with parameters $\chi + T(x) + T(x')$ and $\nu + 2$. Thus, the integral simplifies to $1 / f(\chi + T(x) + T(x'), \nu + 2)$. Hence,
$$p(x' \mid x) = h(x')\, \frac{f(\chi + T(x), \nu + 1)}{f(\chi + T(x) + T(x'), \nu + 2)}.$$
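To make the abstract $(\chi, \nu)$ bookkeeping concrete, here is a minimal sketch (assuming numpy; variable names are mine) instantiating it for a Bernoulli likelihood, where the conjugate prior in natural parameters corresponds to a $\mathrm{Beta}(\chi, \nu - \chi)$ distribution on the success probability:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=500)

# Bernoulli as exponential family: T(x) = x, A(eta) = log(1 + e^eta).
# The conjugate prior pi(eta | chi, nu) corresponds, on the probability
# scale, to a Beta(chi, nu - chi) distribution.
chi, nu = 1.0, 2.0           # i.e. Beta(1, 1): a uniform prior on theta

for xi in x:                 # the generic conjugate update, one datum at a time
    chi += xi                # chi <- chi + T(x)
    nu += 1.0                # nu  <- nu + 1

alpha, beta = chi, nu - chi  # back to the familiar Beta(alpha, beta)
post_mean = alpha / (alpha + beta)  # = chi / nu
print(post_mean, x.mean())
```

The loop is of course equivalent to the familiar one-shot Beta update $\alpha \mapsto \alpha + \sum_i x_i$, $\beta \mapsto \beta + n - \sum_i x_i$.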
The under-rated bit of the conjugate prior thing is that, while the priors themselves are not that flexible, there are some very interesting priors that can be constructed from mixtures of conjugate priors.
TBC. See Dalal and Hall (1983), O’Hagan (2010),…
Farrow’s tutorial introduction:
Consider what happens when we update our beliefs using Bayes’ theorem. Suppose we have a prior density $f^{(0)}(\theta)$ for a parameter $\theta$ and suppose the likelihood is $L(\theta; x)$. Then our posterior density is
$$f^{(1)}(\theta) = \frac{L(\theta; x)\, f^{(0)}(\theta)}{C}, \quad \text{where} \quad C = \int L(\theta; x)\, f^{(0)}(\theta)\, \mathrm{d}\theta.$$
Now let our prior density for a parameter $\theta$ be a mixture
$$f^{(0)}(\theta) = \sum_i a_i\, f_i^{(0)}(\theta), \qquad \sum_i a_i = 1.$$
Our posterior density is
$$f^{(1)}(\theta) = \sum_i \tilde{a}_i\, f_i^{(1)}(\theta), \quad \text{where} \quad f_i^{(1)}(\theta) = \frac{L(\theta; x)\, f_i^{(0)}(\theta)}{C_i}.$$
Hence we require
$$\tilde{a}_i = \frac{a_i C_i}{\sum_j a_j C_j},$$
so $\sum_i \tilde{a}_i = 1$, and the posterior density is again a mixture of the component posteriors, where
$$C_i = \int L(\theta; x)\, f_i^{(0)}(\theta)\, \mathrm{d}\theta.$$
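A minimal Python sketch of this reweighting, for a mixture of two Beta priors updated on Bernoulli observations: each component updates conjugately, and the mixture weights are reweighted by each component’s marginal likelihood $C_i$. The bimodal-prior example is my own illustration.

```python
# Updating a mixture of conjugate (Beta) priors with Bernoulli data.
# Each component updates conjugately; the mixture weights are reweighted
# by each component's marginal likelihood C_i for the observation.

def marginal_likelihood(a, b, x):
    """P(x | component) for a single Bernoulli x under a Beta(a, b) prior."""
    return a / (a + b) if x == 1 else b / (a + b)

def update_mixture(weights, params, x):
    """One-observation posterior for a mixture of Beta priors."""
    new_params = [(a + x, b + 1 - x) for (a, b) in params]
    unnorm = [w * marginal_likelihood(a, b, x)
              for w, (a, b) in zip(weights, params)]
    total = sum(unnorm)
    return [u / total for u in unnorm], new_params

# A bimodal prior: one component expects mostly heads, the other mostly tails.
weights, params = [0.5, 0.5], [(8.0, 2.0), (2.0, 8.0)]
for x in [1, 1, 1]:  # three heads in a row
    weights, params = update_mixture(weights, params, x)
print(weights, params)  # the 'mostly heads' component now dominates
```

Note that the flexibility comes cheap: the update stays closed-form, only the weights need renormalizing.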
See (Broderick, Wilson, and Jordan 2018; Orbanz 2011).

Incoming
Another way to think about model selection is as traversing a space of models with different dimensionality.
Keywords: Reversible Jump MCMC, …
Placeholder.
Transformers for generic time series prediction.
Placeholder.
A surprising phenomenon whereby, through jugaad processes and who knows what else, a lot of excellent AI research is being done by the people.
cf Stability, RWKV
Democratizing the hardware side of large language models seems to be an advertisement for some new hardware, but there is interesting background in there.
HuggingFace distributes, documents, and implements a lot of Transformer/attention NLP models and seems to be the most active neural NLP project. Certainly too active to explain what they are up to in between pumping out all the code.
An example: Agora
Agora is a collective of AI Engineers and creators advancing humanity through artificial intelligence.
0.0.1 Our Guiding Principles:
- Humanity First: Our research has one definitive purpose… To advance Humanity above all else.
- Open Source: From start to finish every project we work on is completely open source even if it does not work yet, our devotion to radical openness facilitates accelerated learning.
- High-Impact: All of our research is high impact, we focus on projects that can improve the quality of life for millions of human beings through multi-modal, continual, and collective learning!
0.0.2 Research Priorities:
- 🌐 Multi-modal Approach: Utilizing AI to process and reason with multiple forms of data.
- 💧 Adaptability: Models that are as fluid as water, capable of being fine-tuned and trained in real-time.
- 🌲 Reasoning: Our work aims to achieve superintelligence as swiftly as possible. To this end, we strive to augment models’ reasoning capabilities through innovative prompting techniques like the Tree of Thoughts or Forest of Thoughts.
- Collective Intelligence: Scaling up collaboration between agents to accomplish real-world tasks.
Our ultimate goal is to foster the development of superintelligent AI that can reason across modalities, tasks, and environments and can collaborate with other AIs!
Now, how can you contribute?
Daily Paper Club: Every night at 10pm we read, analyse, and review the newest and most impactful AI papers! Join the forum chat in the discord
Contribute to our active projects below or look towards Kye’s github and Agora’s Github for projects to collaborate on! Find a project that piques your interest.
Every Saturday at 3pm NYC time, we host a community meeting to talk about the progress made so far in AI research and the remaining challenges left to solve! Sign up here:
Share your favourite papers in our papers channel in our discord and discuss it
Gain access to GPUs and resources for ML experiments!
Join our Finetuning team where we fine-tune the latest models with extreme precision and share them openly! Join the Agora Discord to learn more!
Soon, we’ll build research monasteries in various cities around the world where you can conduct research without distractions like paying rent, food, and other obstructions!
Placeholder for noting the existence of the field of continual learning, i.e. training algorithms not just once but updating in the field. As such, it is something like the predictive-loss-minimization-equivalent of predictive coding, I guess.
Notoriously tricky because of catastrophic forgetting.
How do humans avoid this problem? Possibly sleep (Golden et al. 2022).
A useful result from probability theory for, for example, reparameterization, or learning with symmetries.
Bloem-Reddy and Teh (2020):
Noise outsourcing is a standard technical tool from measure theoretic probability, where it is also known by other names such as transfer […]. For any two random variables $X$ and $Y$ taking values in nice spaces (e.g., Borel spaces), noise outsourcing says that there exists a functional representation of samples from the conditional distribution $P_{Y \mid X}$ in terms of $X$ and independent noise: $Y \overset{\text{a.s.}}{=} f(U, X)$ with $U \sim \mathrm{Unif}[0,1]$, $U \perp\!\!\!\perp X$. […] the relevant property of $U$ is its independence from $X$, and the uniform distribution could be replaced by any other random variable taking values in a Borel space, for example a standard normal on $\mathbb{R}$, and the result would still hold, albeit with a different $f$.
Basic noise outsourcing can be refined in the presence of conditional independence. Let $S(X)$ be a statistic such that $X$ and $Y$ are conditionally independent, given $S(X)$: $Y \perp\!\!\!\perp X \mid S(X)$. The following basic result […] says that if there is a statistic $S(X)$ that d-separates $X$ and $Y$, then it is possible to represent $Y$ as a noise-outsourced function of $S(X)$.
Lemma 5. Let $X$ and $Y$ be random variables with joint distribution $P_{X,Y}$. Let $\mathcal{S}$ be a standard Borel space and $S \colon \mathcal{X} \to \mathcal{S}$ a measurable map. Then $S(X)$ d-separates $X$ and $Y$ if and only if there is a measurable function $f \colon [0,1] \times \mathcal{S} \to \mathcal{Y}$ such that
$$(X, Y) \overset{\text{a.s.}}{=} \left(X, f(U, S(X))\right), \quad \text{where } U \sim \mathrm{Unif}[0,1] \text{ and } U \perp\!\!\!\perp X.$$
In particular, $f(U, S(X))$ has distribution $P_{Y \mid S(X)}$. […] Note that in general, $f$ is measurable but need not be differentiable or otherwise have desirable properties, although for modelling purposes it can be limited to functions belonging to a tractable class (e.g., differentiable, parameterized by a neural network). Note also that the identity map trivially d-separates $X$ and $Y$, so that $Y \overset{\text{a.s.}}{=} f(U, X)$, which is standard noise outsourcing (e.g., Austin (2015), Lem. 3.1).
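As a concrete toy instance (my own, not from the paper): inverse-CDF sampling is noise outsourcing where the statistic is $X$ itself. A minimal Python sketch, assuming $Y \mid X = x \sim \mathrm{Exponential}(x)$:

```python
import math
import random

# Noise outsourcing: represent Y | X as a deterministic function of X and
# outsourced uniform noise U, i.e. Y = f(U, X) with U ~ Unif[0, 1].
# Here Y | X = x ~ Exponential(rate = x), so f is the inverse CDF.

def f(u, x):
    """Inverse CDF of Exponential(rate=x), applied to uniform noise u."""
    return -math.log1p(-u) / x

random.seed(0)
x = 2.0
samples = [f(random.random(), x) for _ in range(100_000)]
print(sum(samples) / len(samples))  # ≈ 1 / x = 0.5
```

This is exactly the reparameterization trick beloved of variational autoencoders, with the uniform noise pushed through an inverse CDF instead of scaling a standard normal.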
Placeholder.
Historical depictions of note.
Connection to predictive coding, feelings, attribution in reinforcement learning…
Bandit problems where the reward is nonstationary.
Wikipedia recommends we consider Whittle (1988) as a foundational article and that hip recent results include Garivier and Moulines (2008).
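For flavour, here is a minimal sketch of the sliding-window idea from Garivier and Moulines: compute UCB statistics only over the last few plays, so the policy can track a drifting reward. The window size, exploration constant, and toy bandit are all illustrative choices of mine, not from the paper.

```python
import math
import random

# Sliding-window UCB for a nonstationary bandit: statistics are computed
# only over the last `window` plays, so stale rewards are forgotten.

def sw_ucb(arms, horizon, window=200, c=1.0):
    history = []  # (arm, reward) pairs, most recent last
    for t in range(horizon):
        recent = history[-window:]
        counts = [sum(1 for a, _ in recent if a == i) for i in range(len(arms))]
        if min(counts) == 0:
            arm = counts.index(0)  # play any arm unseen in the window
        else:
            means = [sum(r for a, r in recent if a == i) / counts[i]
                     for i in range(len(arms))]
            bonus = [c * math.sqrt(math.log(min(t + 1, window)) / counts[i])
                     for i in range(len(arms))]
            arm = max(range(len(arms)), key=lambda i: means[i] + bonus[i])
        reward = arms[arm](t)
        history.append((arm, reward))
    return history

random.seed(1)
# Arm 0 is best early; the arms swap quality halfway through.
arms = [lambda t: float(random.random() < (0.8 if t < 1000 else 0.2)),
        lambda t: float(random.random() < (0.2 if t < 1000 else 0.8))]
history = sw_ucb(arms, horizon=2000)
late = [a for a, _ in history[-300:]]
print(late.count(1) / len(late))  # mostly arm 1 after the switch
```

An ordinary UCB policy would keep averaging over the pre-switch rewards and take far longer to notice the change.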
Placeholder for a discussion of monastic traditions and their role in society, esp pseudo-monastic.
Related to generic concentration of measure, but about the expectation of functions.
This note exists simply because I had not heard about this concept before, but it ended up being really useful even to name it.
Remember the classic Jensen inequality: for some convex function $\varphi$ and random variable $X$ we have
$$\varphi(\mathbb{E}[X]) \leq \mathbb{E}[\varphi(X)].$$
The Jensen gap is the value $\mathbb{E}[\varphi(X)] - \varphi(\mathbb{E}[X])$ for a given $X$ and (not necessarily convex) $\varphi$.
Amazingly, we can sometimes say things about how big this gap is. For continuous $X$ and with $\varphi$ differentiable, the most impressive results are (Abramovich and Persson 2016; Gao, Sitharam, and Roitberg 2020).
If $X$ is discrete and $\varphi$ may or may not be differentiable, we can still say some things — see Simic (2008).
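A toy illustration (mine, not from the references): for $\varphi(x) = x^2$ the Jensen gap is exactly $\operatorname{Var}(X)$, which we can check empirically.

```python
import random

# For phi(x) = x^2, the Jensen gap E[phi(X)] - phi(E[X]) equals Var(X).
random.seed(42)
xs = [random.gauss(3.0, 2.0) for _ in range(50_000)]

mean = sum(xs) / len(xs)
gap = sum(x * x for x in xs) / len(xs) - mean ** 2  # E[X^2] - (E[X])^2
var = sum((x - mean) ** 2 for x in xs) / len(xs)    # empirical variance

print(gap, var)  # identical up to floating-point error; both ≈ 4
```

The cited bounds generalize this to arbitrary $\varphi$, where no such exact identity is available.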
Variational inference using generative models whose density cannot be evaluated. See Variational Inference using Implicit Models.
Even though it does not evaluate likelihoods, implicit VI still seems to use KL divergence as a loss function.
There seems to be a connection to adversarial learning too.
Related concepts, perhaps? Variational interpretation of adversarial losses:
Not sure yet. I should check out Conor Hassan’s implementation.
See (Cheng et al. 2024; Lim and Johansen 2024; Moens et al. 2021; Molchanov et al. 2019; Yu et al. 2024; Yu and Zhang 2023).
How does reproducible science happen for ML models? How can we responsibly communicate how the latest sexy paper is likely to work in practice?
When the model is both too large and too secretive to be interrogated. TBD
How do we know that our models generalize to the wild? See Domain adaptation.
See ML benchmarks.
REFORMS: Reporting standards for ML-based science:
The REFORMS checklist consists of 32 items across 8 sections. It is based on an extensive review of the pitfalls and best practices in adopting ML methods. We created an accompanying set of guidelines for each item in the checklist. We include expectations about what it means to address the item sufficiently. To aid researchers new to ML-based science, we identify resources and relevant past literature.
The REFORMS checklist differs from the large body of past work on checklists in two crucial ways. First, we aimed to make our reporting standards field-agnostic, so that they can be used by researchers across fields. To that end, the items in our checklist broadly apply across fields that use ML methods. Second, past checklists for ML methods research focus on reproducibility issues that arise commonly when developing ML methods. But these issues differ from the ones that arise in scientific research. Still, past work on checklists in both scientific research and ML methods research has helped inform our checklist.
Various syntheses arise from time to time: Albertoni et al. (2023); Pineau et al. (2020).
If the social standard is set by the most vocal, when does that correspond to a desirable state of affairs, and when not?
Connection: pluralistic ignorance, YIMBYism.
An axis against which we measure cultures. Collectivism is, loosely, where individuals prioritise the goals and needs of the group (such as a family, community, or nation) over their own personal goals. People in collectivist cultures tend to emphasise strong familial ties, group loyalty, and group cohesion. They often value conformity, cooperation, and interdependence, notionally.
Collectivism is often contrasted with individualism, which prioritises personal goals and independence over the goals of the group. People in individualistic cultures value autonomy, personal achievement, and individual rights. They are encouraged to express their own opinions and pursue their own interests. Success is often defined in terms of personal attainment and individual merit.
The distinction is often used to explain differences between cultures, especially between the East (more collectivist) and the West (more individualist).
Talhelm and co-workers have written at length (Liu et al. 2019; Thomas Talhelm and Oishi 2018; T. Talhelm et al. 2014; Thomas Talhelm and Dong 2024) on the definition of collectivism:
The emerging picture of collectivism is less warm and fuzzy, more nuanced and complicated. […] For example, my recent research has found that people in collectivistic cultures are more likely to agree that “We should keep our ageing parents with us at home.” […]
And although people living in collectivistic cultures report less intimacy with their friends, they are also more likely to think that they should stick together through tough times (Liu et al. 2019). When I asked people to imagine a friend advising them to break up with a new boyfriend, Americans tended to say they’d find more supportive friends. In China, people tended to think these friends were being supportive. Collectivism often values things other than warmth and feeling good.
Under Talhelm’s definition, collectivism is generated by rice agriculture, not just nation of origin (Thomas Talhelm and Oishi 2018; T. Talhelm et al. 2014; Thomas Talhelm and Dong 2024).
Interested laypeople, working statisticians, maybe ML people
We recently released a paper on a method called GEnBP (MacKinlay et al. 2024). GEnBP was an unexpected discovery: we were working on exotic methods for solving some important complicated geospatial problems (see below) and while we were doing that we discovered an unusual, but simple, method that ended up outperforming our complex initial approach and also surpassed existing solutions.
This is useful not just because we like being the best. 😉 Geospatial problems are crucial for our future on this planet. Many people, even inside the worlds of statistics, may not realize the difficulties of solving such problems. I’m excited about our findings, and here I’ll explain both the difficulties and our solution.
In statistical terms, our target problems are
A classic example of such a problem can be seen in Figure 2. That diagram is a simplified representation of what scientists want to do when trying to use lots of different kinds of data to understand something big, in the form of a graphical model. In this case, it is a graphical model of the ocean and the atmosphere, the kind of thing we need when we are modelling the planetary climate. An arrow from one node to another means (more or less) “this causes that” or “this influences that”. We can read off the diagram things like “The state of the ocean on day 3 influences the state of the ocean on day 4 and also the state of the atmosphere on day 4”. We also know that we have some information about the state of the ocean on that day because it is recorded in a satellite photo, and some information about the atmosphere that same day from radar observations.
Now the question is: how can we put all this information together? If I know some stuff about the ocean on day 3, how much do I know about it on day 4? How much can I use my satellite images to update my information about the ocean? How much does that satellite photo actually tell me?
What we really want to do is exploit our knowledge of the physics of the system. Oceans are made of water, and we know a lot about water. We are pretty good at simulating water, even. For a neat demonstration of fluid simulation, see Figure 3, showcasing work from our colleagues in Munich. You could run this on your home computer right now if you wanted.
The problem is that while these simulators are not bad at simulating the world when they have all the information about it, in practice we don’t know everything there is to know. The world is big and complicated and we are almost always missing some information. Maybe our satellite photos are blurry, or don’t have enough pixels, or a cloud wandered in front of the camera at the wrong moment, or any one of a thousand other things.
Our goal in the paper is to leverage good simulators to get high-fidelity understanding of complicated stuff even when we have partial, noisy information. There are a lot of problems we want to solve at once here: we want to trade off statistical difficulty against computational difficulty, and we would like to leverage as much as we can of the hard work of the scientists who have gone before us.
There is no silver bullet method that can solve this for every possible system; everything has trade-offs; we will need to choose something like ‘the best guess we can make given how much computer time we have’.
The reason we care about problems like this is that basically anything at large scale involves this kind of problem. Power grids! Agriculture! Floods! Climate! Weather! Most things that impact the well-being of millions of people at once end up looking like this, and yet all of these important problems are difficult to get accurate results for, and slow to calculate solutions to.
We think that with GEnBP we have found good answers, very cheaply, for precisely this punishing problem.
Our approach utilizes existing simulators, originally designed for full information, to make educated guesses about what is going on, by feeding in random noise to indicate our uncertainty. We check the outputs of this educated guess (the prior for statisticians following along at home). Then we make an update rule that uses the information we have to improve this guess, until we have squeezed all the information we can out of the observations.
How exactly we do this is not very complicated, but it is technical. Long story short: our method is in a family called belief propagation methods, which tell us how to update our guesses when new information arrives. Moreover, this family updates our guesses using only local information, e.g. we only look at “yesterday” and “today” together, rather than looking at all the days at once. We need to walk through all the data to do this, ideally many times, so it takes a while. Each pass tries to squeeze more information out of the update.
That is not a new technique; people have been doing it for decades. Our trick is that we can use simulators to do it, rather than a complicated statistical model. To gauge the uncertainty in our input data, we run the simulator with various inputs and observe the outputs.
The minimum viable example for us, and the one we tackle in the paper, is a (relatively) simple test case, Figure 4. This is what’s known as a system identification problem. Each circle represents a variable (i.e. some quantity we want to measure, like the state of the ocean), and each arrow signifies a physical process we can simulate. The goal is to identify the query parameter (red circle) from the evidence, i.e. some noisy observations we made of the system.
The data in this case comes from a fluid simulator; we are simply modelling a fluid flowing around a doughnut-shaped container. We didn’t collect actual data for this experiment, which is just as well—I’m a computer scientist, not a lab scientist 😉. Our observations are low-resolution, distorted, and noisy snapshots of the state of the fluid. That red circle indicates a hidden influence on the system; in our case it is a force that “stirs” the fluid. Our goal is: can we work out what force is being applied to this fluid by taking a series of photos of it?
In fact, many modern methods ignore this hidden influence entirely; the Ensemble Kalman Filter, for example, struggles with hidden influences. But in practice we care a lot about such hidden influences. For one thing, such hidden factors might be interesting in themselves. Sailors, for example, care not just about the ocean waves, but also about what the waves’ behaviour might indicate about underwater hazards. Also, we want to be able to distinguish between external influences and the dynamics of the system itself because, for example, we might want to work out how to influence the system. If I am careful about identifying which behaviour comes from some external factor, this lets me deduce what external pressure I need to apply to it myself. What if I want to work out not only how the water flows, but how it will flow when I pump more water in?
This challenge belongs to a broader field known as Causal inference from observational data, which concerns itself with unravelling such complexities. I feel that is probably worth a blog post on its own, but for now let us summarize it with the idea that a lot of us think it is important to learn what is actually going on underneath, rather than what only appears to be going on.
Because this is a simulated problem we can cheaply test our method against many different kinds of fluids, so we do: testing it on runny fluid, Figure 5 (a); thick, viscous fluid, Figure 5 (c); and stuff in between, Figure 5 (b). For all of these, the starting condition is the same, and our goal is to find the magenta-tinted answer.
Having outlined the problem, let’s now delve into the performance of GEnBP, our proposed solution. As a researcher, it’s essential to justify the efficacy of our methods. How good is GEnBP, relatively speaking? As far as we can tell, there is only one real competitor in this space, which is classic Gaussian Belief Propagation (GaBP) — more on GaBP below. We use that as a baseline.
It’s challenging to present multiple simulations from a 2D fluid simulator effectively, as they primarily consist of colourful squares that are not immediately interpretable. This is easier if we can plot it in 1 spatial dimension. For demonstration purposes, I set up a special 1D pseudo-fluid simulator. The question, in this one, is “how well do our guesses about the influencing variable converge on the truth?”
The classic GaBP results (Figure 6) are… OK, I guess? In the figure, the red lines represent our initial educated guesses (‘prior’), and the blue lines show the optimised final guesses (‘posterior’), compared to the dotted black line representing the truth. The blue guesses are much better than the red ones, for sure, but they are not amazing. Our fancy GEnBP guesses (Figure 7) are way better, clustering much closer to the truth. So, tl;dr, on this very made-up problem our method does way better at guessing the ‘hidden influence’ on the system.
This example is extremely contrived though! So let us quantify how much better the answers are on the more complex, more realistic, 2-dimensional problem.
At this stage, our approach manages millions of pixels, surpassing the capabilities of classical methods. GaBP tends to choke on a few thousand pixels. But our method can eat up big problems like that for breakfast.
What we are looking at in this graph is some measures of performance as we scale up the size of the problem, i.e. the number of pixels. The top graph displays the execution time, which ideally should be low, plotted against the problem size, measured in pixels. Notice that as the problem gets bigger, GaBP gets MUCH slower — for the curious, it scales roughly cubically in the problem size. GaBP is always slower for the kind of problems we care about. Eventually, GaBP runs out of memory. In the middle graph, we see something else interesting: GEnBP also has superior accuracy to the classic GaBP, in the sense that its guesses are closer to the truth. The bottom graph shows the posterior likelihood, which essentially tells us how confident our method is in the vicinity of the truth.
We have been strategic with our choice of problem here: GaBP is not amazing at fluid dynamics precisely like this, and that is kind of why we bother with this whole thing. GaBP has trouble with very nonlinear problems, which are generally just hard. GEnBP has some trouble with them too, but often much less trouble, and it can handle a much more diverse array of problems, especially ones with this geospatial flavour.
In Figure 9, we test both these methods on a lot of different fluid types and see how they go. GEnBP is best at speed for … all of them. The story about accuracy is a little more complicated. We win some, and we lose some. So, it’s not a silver bullet.
However, GEnBP can tackle problems far beyond GaBP’s capacity, and it often has superior accuracy even in scenarios manageable by GaBP.
The core insight of our work is combining the best elements of the Ensemble Kalman Filter (EnKF) and classic Gaussian belief propagation (GaBP). We’ve developed an alternative Gaussian belief propagation method that outperforms the classic GaBP. These methods are closely related, and yet their research communities don’t seem to be closely connected. GaBP is associated with the robotics community. If you have a robotic vacuum cleaner, it probably runs on something like this. Rumour holds that the Google Street View software that makes all those Street View maps uses this algorithm. There are many great academic papers on it; here are some examples: (Bickson 2009; Davison and Ortiz 2019; Murphy, Weiss, and Jordan 1999; Ortiz, Evans, and Davison 2021; Dellaert and Kaess 2017).
The Ensemble Kalman Filter is widely used in areas such as weather prediction and space telemetry. While I don’t have visually striking demos to share, I recommend Jeff Anderson’s lecture for a technical overview. Aside: Maybe if climate scientists could make their demos as sexy as the robotics folks, we would have more interest in climate modelling? Here are some neat introductory papers on that theme: (Evensen 2009a, 2009b; Fearnhead and Künsch 2018; Houtekamer and Zhang 2016; Katzfuss, Stroud, and Wikle 2016).
As far as work connecting those two fields, there does not seem to be much, even though they have a lot in common. That is where our GEnBP comes in. Our method is surprisingly simple, but it appears to be new, at least for some value of new.
Connecting the two methods is mostly a lot of “manual labour”. We use some cool tricks such as the Matheron update (Doucet 2010; Wilson et al. 2020, 2021), and a whole lot of linear algebra tricks such as Woodbury identities. Details in the paper.
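For intuition, here is a scalar sketch of the Matheron update (pathwise conditioning): a joint prior sample is corrected into a posterior sample using the observed value, without ever forming the posterior density. The toy model and all numbers here are my own illustration, not the paper’s.

```python
import random

# Matheron update for a scalar Gaussian model:
#   x ~ N(0, s2x),  y = x + e,  e ~ N(0, s2n).
# A posterior sample of x | y_obs is obtained by drawing a joint prior
# sample (x, y) and correcting it:
#   x_post = x + Cov(x, y) / Var(y) * (y_obs - y)

random.seed(0)
s2x, s2n, y_obs = 4.0, 1.0, 3.0
gain = s2x / (s2x + s2n)  # Cov(x, y) / Var(y)

def matheron_sample():
    x = random.gauss(0.0, s2x ** 0.5)
    y = x + random.gauss(0.0, s2n ** 0.5)
    return x + gain * (y_obs - y)

samples = [matheron_sample() for _ in range(200_000)]
print(sum(samples) / len(samples))  # ≈ analytic posterior mean, gain * y_obs = 2.4
```

In GEnBP-like settings the appeal is that the correction only needs (low-rank, ensemble-estimated) covariances, never an explicit posterior density.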
Code for GEnBP can be found at danmackinlay/GEnBP. And of course, read the GEnBP paper (MacKinlay et al. 2024).
Development was supported by CSIRO Machine Learning and Artificial Intelligence Future Science Platform. The paper was an effort by many people, not just myself but also Russell Tsuchida, Dan Pagendam and Petra Kuhnert.
If you’d like to cite us, here is a convenient snippet of BibTeX:
@misc{MacKinlayGaussian2024,
title = {Gaussian {{Ensemble Belief Propagation}} for {{Efficient Inference}} in {{High-Dimensional Systems}}},
author = {MacKinlay, Dan and Tsuchida, Russell and Pagendam, Dan and Kuhnert, Petra},
year = {2024},
month = feb,
number = {arXiv:2402.08193},
eprint = {2402.08193},
publisher = {arXiv},
doi = {10.48550/arXiv.2402.08193},
archiveprefix = {arxiv}
}
Curious how fast this method is? Here is the speed-comparison table for experts:
| | Operation | GaBP | GEnBP |
|---|---|---|---|
| Time complexity | Simulation | | |
| | Error propagation | — | |
| | Jacobian calculation | | — |
| Space complexity | Covariance matrix | | |
| | Precision matrix | | |
Computational costs for Gaussian Belief Propagation and Ensemble Belief Propagation, for given node dimension and ensemble size/component rank.
Tools for back-of-envelope calculations under uncertainty, but principled. Some of these are essentially probabilistic spreadsheets.
Fermi calculations are rough estimates, named after the physicist Enrico Fermi. The Wikipedia page is a portal of sorts to the whole field.
Many people have written fun guides to Fermi calculations. (Harte 1988; Mahajan 2010; Swartz 2003).
Relatively few address the problem of uncertainty in the inputs, but Hubbard (2014) does, and I recommend it highly.
Guesstimate is a probabilistic spreadsheet, which does Monte Carlo error estimation. (source, Blog post).
For a good worked example, see Elizabeth’s blog.
3.1 What Squiggle Is
- A simple programming language for doing math with probability distributions.
- An embeddable language that can be used in Javascript applications.
- A tool to encode functions as forecasts that can be embedded in other applications.
3.2 What Squiggle Is Not
- A complete replacement for enterprise Risk Analysis tools.[…]
- A probabilistic programming language. Squiggle does not support Bayesian inference.
- A tool for substantial data analysis. (See programming languages like Python or Julia)
- A programming language for anything other than estimation.
- A visually-driven tool. (See Guesstimate and Causal)
The listed strengths are:
- Simple and readable syntax, especially for dealing with probabilistic math.…
- Optimized for using some numeric and symbolic approaches, not just Monte Carlo.
- Embeddable in Javascript.
The last one sounds useful.
It looks like this:
//Here, you're expressing a 90% confidence that the value is between 8.1 and 8.4 Million.
populationOfNewYork2022 = 8.1M to 8.4M
proportionOfPopulationWithPianos = {
// This is a block.
percentage = 0.2 to 1 // Block body can declare local variables.
percentToRatio = 0.01
percentage * percentToRatio // Final expression of a block is the block’s value.
}
pianoTunersPerPiano = {
pianosPerPianoTuner = 2k to 50k
1 / pianosPerPianoTuner
}
totalTunersIn2022 = populationOfNewYork2022 * proportionOfPopulationWithPianos *
pianoTunersPerPiano
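For comparison, here is a pure-Python Monte Carlo version of the same piano-tuner estimate, interpreting each “low to high” range as the 90% interval of a lognormal distribution, which is roughly Squiggle’s convention for positive quantities. The helper name and the lognormal interpretation are my own choices.

```python
import math
import random

# Interpret "low to high" as the 5th-95th percentile of a lognormal.
Z90 = 1.6448536269514722  # 95th percentile of the standard normal

def lognormal_from_90ci(low, high):
    mu = (math.log(low) + math.log(high)) / 2
    sigma = (math.log(high) - math.log(low)) / (2 * Z90)
    return lambda: random.lognormvariate(mu, sigma)

random.seed(0)
population = lognormal_from_90ci(8.1e6, 8.4e6)
piano_share = lognormal_from_90ci(0.002, 0.01)    # 0.2% to 1% of people
pianos_per_tuner = lognormal_from_90ci(2e3, 50e3)

tuners = sorted(population() * piano_share() / pianos_per_tuner()
                for _ in range(100_000))
print(tuners[50_000])                  # median estimate
print(tuners[5_000], tuners[95_000])   # a 90% interval
```

The point of tools like Squiggle is precisely that you never have to write this boilerplate yourself.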
Enterprise versions of Guesstimate. I have not used these.
Notes on the Australian public sphere, governance, and civil service. Things I did not know before working for them and think that others might benefit from knowing.
The Australian public service has been subject, since 1987, to a perpetually ratcheting budget cut regime called the Efficiency dividend, which is an attempt to address bureaucratic inefficiency.
Incentives to reduce expenses don’t sound like a bad idea. There are many critiques of the details of this particular one:
For some agencies, for example research entities which are supposed to expand, this situation can be a particular problem.
From my systems thinking perspective, there is at least one clearly missing feedback loop here, which is the provision of public benefit. What if my agency works out a way to spend 10% more money and deliver 20% more public benefit? Or 100%?
Employment and workplace conditions at public institutions are governed by the Australian Public Service Commission, a meta-bureaucracy which manages Australian government bureaucracies.
For some, this is a deeply unpopular entity, set up, according to its critics, to discredit socialism by imposing all of its mistakes and none of its wisdom upon the organisations it manages. Others like it because it seems to provide reliable job security and good employment conditions, which it often does.
APS-constrained organisations are not permitted to price wages to market, so for the most in-demand, critical skills, or for staff in places where the cost of living is high, they must recruit from the pool of people who will accept a poor salary relative to their earning potential.
This bites hard for specialist skills such as legal advice, engineering, IT, and trades, where the market rate is high. There are such people: they are patriotic, or really need visa sponsorship, or are idealistic about public goods, or are outcomes-focused and really want access to the facilities, or are independently wealthy and don’t need to worry about money, so they treat Australian government work as a charity project. Alternatively, agencies fall back on people who are not very good at their jobs, or make do without crucial skills.
This system breaks in many entertaining ways:
Tech skills, recognition, finally hit APS pay bargaining table
Poor pay for in-house tech staff bites APS chiefs in real-time
Parliamentary Services sparkies pull plug on short APS pay deal
“The Department of Parliamentary Services’ refusal to lift wages for its full-time trades staff is senseless. Because permanent trades staff’s wages are so low, positions are left vacant, yet the department is willing to outsource the roles at a much higher rate of pay.”
Government spent at least $1.9bn in a year sourcing IT and digital skills
Professionals Australia (2021)
Public political speech by public servants may be punished by sacking, which sounds reasonable, except that the extent is remarkable.
Social media still counts as public speech. Anonymous speech is still public speech. The precise criteria for sacking are opaque and arbitrary (Gray 2021; Morris and Sorial 2023).
An apparently innocuous blog post by Josh Krook, if his testimony is correct, resulted in his sacking.
Better documented is the High Court decision about Michaela Banerji.
APSC guidance:
I feel the lack of the rich political blogosphere of the USA. But there are commentators in Australia who are worth reading.
Australian Policy and History Network
Australian Policy and History is a network of historians that provides politicians, bureaucrats, journalists and the public with historical knowledge in the pursuit of better public policy outcomes. We publish a range of material that connects historical research to current-day policy issues, and we run conferences and workshops. Australian Policy and History is run chiefly by historians at Deakin University, with support from the University of Melbourne and the Australian National University.
The public sector plays a critical and sometimes underappreciated role in this country. It needs strong and independent news coverage, and a place where its leaders can discuss the issues they face at the coalface of modern bureaucracy.
That’s where we come in. The Mandarin is made for public sector leaders and executives and reaches 1.5 million public sector readers and the many stakeholders interested in their work each year.
Australians for Science and Freedom is a local ❄️🍑 mob, affable enough. Heavy on the fringe science (e.g. “Are viruses even a thing?”). The adverse selection problem is clear. If you run a heterodox media site, you will soon find your reasonable but under-represented opinions joined by fruity-nut-job opinions which have nowhere else to go.
…commentary and analysis from a young Australian on global and Australian politics, history, and the raging social and cultural questions of our time.
[…]we need to build economic democracy.
As our industrial profile changes, we need new ways of approaching economic management that maximize the benefits to all stakeholders in the economy, not just private shareholders. The ongoing crisis associated with our once proud national air carrier is symptomatic of the disease at root in our economy.
For too long, we have allowed corporate Australia to do as they please, relying on revenue and profit to be the only yardstick against which their operations are judged. However, if the pandemic and its associated economic crisis has taught us anything, it is that we need to build a new economy that is more resilient and self-sufficient, one that is better managed and able to deliver better social outcomes for Australians.
Traditional approaches to economic development, which treat all investment agnostically, do little to keep key production and essential services grounded in communities. Instead, there should be a focus on investment that has a long-term interest in the development of the community.
Wil Anderson is a comedian, but his Wilosophy Podcast interviews pundits.
Placeholder for notes about what builds mental well-being and resilience. How do we solve for the optimum amount of challenge?
Incorporating various neural approximations to (functions of) the likelihood of an otherwise-intractable model.
Neural Posterior Estimation (amortized NPE and sequential SNPE) (Deistler, Goncalves, and Macke 2022; Glöckler, Deistler, and Macke 2022; Greenberg, Nonnenmacher, and Macke 2019; Papamakarios and Murray 2016),
Neural Likelihood Estimation ((S)NLE) (Boelts et al. 2022; Lueckmann et al. 2017; Papamakarios, Sterratt, and Murray 2019), and
Neural Ratio Estimation ((S)NRE) (Delaunoy et al. 2022; Durkan, Murray, and Papamakarios 2020; Hermans, Begy, and Louppe 2020; Miller, Weniger, and Forré 2022) (see also density ratio)
Neural point estimators:
NeuralEstimators facilitates the user-friendly development of neural point estimators, which are neural networks that transform data into parameter point estimates. They are likelihood-free, substantially faster than classical methods, and can be designed to be approximate Bayes estimators. The package caters for any model for which simulation is feasible.
Permutation-invariant neural estimators (Sainsbury-Dale, Zammit-Mangion, and Huser 2022, 2024) which lean on deep sets.
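The deep-set trick can be seen in a few lines. Below is a minimal numpy sketch (untrained, with hypothetical weights `W_phi` and `W_rho` of my own invention, not from the NeuralEstimators package) showing the architectural property these estimators rely on: applying a feature map elementwise, pooling with a symmetric operation, then reading out a point estimate makes the output invariant to the ordering of an i.i.d. sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical (untrained) weights for a tiny deep-set point estimator:
# theta_hat(x) = rho( pool_i phi(x_i) ), with phi and rho as one-layer maps.
W_phi = rng.normal(size=(1, 8))   # elementwise feature map, 1 -> 8
W_rho = rng.normal(size=(8, 1))   # readout after pooling, 8 -> 1

def deep_set_estimate(x):
    """Permutation-invariant point estimate from an i.i.d. sample x."""
    h = np.tanh(x[:, None] @ W_phi)   # phi applied to each observation
    pooled = h.mean(axis=0)           # symmetric pooling over the sample
    return float(pooled @ W_rho)      # rho maps pooled features to the estimate

x = rng.normal(size=50)
# The estimate is unchanged (up to float summation order) by reordering.
print(np.allclose(deep_set_estimate(x), deep_set_estimate(rng.permutation(x))))
```

In the real thing, `W_phi` and `W_rho` are deep networks trained on simulated (parameter, data) pairs so that the output approximates a Bayes estimator; the pooling step is what lets one network handle samples of any size.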
Connects closely to neural processes, which target the posterior predictive, and to simulation-based inference, which targets the case where we have a good but uncalibrated simulator.
A summary of some methods is in Cranmer, Brehmer, and Louppe (2020).
See the Mackelab sbi page for several implementations:
Goal: Algorithmically identify mechanistic models which are consistent with data.
Each of the methods above needs three inputs: A candidate mechanistic model, prior knowledge or constraints on model parameters, and observational data (or summary statistics thereof).
The methods then proceed by
- sampling parameters from the prior followed by simulating synthetic data from these parameters,
- learning the (probabilistic) association between data (or data features) and underlying parameters, i.e. to learn statistical inference from simulated data. The way in which this association is learned differs between the above methods, but all use deep neural networks.
- This learned neural network is then applied to empirical data to derive the full space of parameters consistent with the data and the prior, i.e. the posterior distribution. High posterior probability is assigned to parameters which are consistent with both the data and the prior, low probability to inconsistent parameters. While SNPE directly learns the posterior distribution, SNLE and SNRE need an extra MCMC sampling step to construct a posterior.
- If needed, an initial estimate of the posterior can be used to adaptively generate additional informative simulations.
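The recipe above can be sketched end-to-end in a toy problem. This is my own minimal numpy illustration, not sbi code: the model is conjugate (prior θ ~ N(0, 1), simulator x | θ ~ N(θ, 0.5²)), and a linear-Gaussian conditional fit stands in for the neural density estimator, so we can check the amortized answer against the exact posterior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an intractable simulator:
# prior theta ~ N(0, 1); simulator x | theta ~ N(theta, 0.5^2).
n_sim = 10_000
theta = rng.normal(size=n_sim)                # 1. sample parameters from prior
x = theta + 0.5 * rng.normal(size=n_sim)      #    and simulate synthetic data

# 2. Learn the association parameters <- data from the simulations.
# A linear-Gaussian conditional replaces the neural density estimator:
# fit E[theta | x] = a*x + b by least squares, plus a residual scale.
a, b = np.polyfit(x, theta, deg=1)
sigma = np.std(theta - (a * x + b))

# 3. Amortized inference: plug an observation into the learned conditional.
x_o = 1.0
post_mean, post_sd = a * x_o + b, sigma

# For this conjugate model the exact posterior is N(0.8 * x_o, sqrt(0.2)),
# so the fit should recover a slope near 0.8 and a scale near 0.447.
print(post_mean, post_sd)
```

Step 3 is the "amortized" part: having paid the simulation and fitting cost once, we get a posterior for any new `x_o` by a cheap forward evaluation, with no further simulation or MCMC (the SNLE/SNRE variants would reintroduce an MCMC step here).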
Code here: mackelab/sbi: Simulation-based inference in PyTorch
Compare to contrastive learning.
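The connection is concrete in the NRE case: we train a classifier to distinguish dependent (θ, x) pairs drawn from the joint from shuffled pairs drawn from the product of marginals, and the trained logit approximates the log likelihood ratio log p(x|θ) − log p(x), which is exactly a contrastive objective. A hedged numpy sketch on the same linear-Gaussian toy model (plain logistic regression on hand-picked quadratic features standing in for the neural classifier; all names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model again: theta ~ N(0,1), x | theta ~ N(theta, 0.5^2).
n = 5_000
theta = rng.normal(size=n)
x = theta + 0.5 * rng.normal(size=n)

# Positives: dependent (theta, x) pairs from the joint.
# Negatives: same marginals with theta shuffled -- the contrastive trick.
theta_neg = rng.permutation(theta)

def features(th, xx):
    # Quadratic features suffice for this Gaussian toy problem; a neural
    # net would learn such features itself.
    return np.stack([np.ones_like(th), th, xx, th * xx, th**2, xx**2], axis=1)

X = np.vstack([features(theta, x), features(theta_neg, x)])
y = np.concatenate([np.ones(n), np.zeros(n)])

# Plain logistic regression by gradient descent; the trained logit X @ w
# approximates log p(x | theta) - log p(x), i.e. the likelihood ratio.
w = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.1 * X.T @ (p - y) / len(y)

acc = np.mean((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y)
print(acc)  # well above chance: joint pairs are distinguishable
```

The learned ratio can then drive an MCMC sampler over θ at a fixed observation, which is how (S)NRE builds its posterior.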
Notes on South-east Asia, where I live.
Book review: A Swiss soldier on Dutch Formosa
Elie Ripon (Lausanne, ?–?) was a Swiss soldier in the service of the VOC who took part in most of the VOC's military exploits from 1618 to 1626. After his military service he wrote down his experiences in a manuscript that was found in Switzerland in 1865 and first published in 1990.
A wild itinerary: Fort Jakarta/Batavia, Banda Island, a China expedition, a fortress on Sibesi (between Java and Sumatra), the diamond trade in Borneo, and privateering in the Gulf of Siam.