OpenAstrocytes – Forecast Research

tl;dr

Astrocytes make up about a third of the cells in your brain, and form a second, distinct brain network from the more-studied neuronal network. As neuroscience has discovered in only the last few years, the astrocyte network carries out extremely important computations, integrating information from across different brain networks, and using this information to drive feedback to neurons that controls physiological state and maintains long-term homeostasis.

To help push the field’s understanding of astrocytes’ unique role forward, we’re releasing OpenAstrocytes—the largest repository of structured astrocyte calcium dynamics to date, drop-in ready for use in AI workflows. Try it out with:

You can also re-generate and play around with this pub from the latest source on GitHub.

pip install astrocytes

# or, with uv
uv add astrocytes

As a demo of the richness of the OpenAstrocytes data, we take a look at one dataset—measuring changes of astrocyte activity in response to the application of distinct applied compounds—by applying a recently-released large vision transformer model, DINOv3. We find that off-the-shelf patch embeddings have subspaces that reflect astrocyte network anatomy and evoked dynamics; that these subspaces capture generalizable, distinct profiles of astrocyte calcium activity evoked by different compounds; and, that the information in a single ~60µm patch is sufficient to accurately decode the identity of the applied compound.

Background

The human brain is about one-third neurons. But two-thirds of it is made from other, non-neuronal cells; and, in recent years, the neuroscience field has become more acutely aware of the incredible importance of the roles played by those other players.

One of those other thirds is made up of astrocytes—a crucially important subdivision of the brain’s little-understood glia¹. Astrocytes form a cellular network of their own, woven between the neurons (Figure 1); and, as recent work has demonstrated, this network encodes rich information about both local and global biochemical milieu and physiological state (1).

¹ From “glue”, referencing old theories of their supposed passivity in brain function.

Figure 1: Schematic diagram of the *tripartite synapse*. Astrocytic processes (green) wrap around most cortical synaptic connections between neurons, allowing astrocytes to integrate the myriad chemical signals sensed in the neuronal chatter, and downstream, to incorporate that information with other signals about the local brain network sensed from other parts of the extracellular space.

All biological cells must incorporate a massive number of data-points, coming both from the outside world and also from the internal processes of the cell itself; indeed, this binding-together of disparate interactions into a shared, coherent, soliton-like packet is, perhaps, what makes life life (2). While all the diverse kinds of cells across the entire tree of life have unique approaches to sensing and making sense of these various signals, because of shared evolutionary history, a few common structural threads span vast swaths of this diversity. One of those central commonalities is the use of intracellular calcium signaling: because of the calcium ion’s unique physical properties, cells very early in the evolutionary history of life developed machinery that allowed them to maintain extremely low concentrations of calcium within themselves, thereby giving bursts of calcium inside the cell a very robust signal-to-noise ratio. And, like many contingent solutions of evolution, the machinery stuck around!

But astrocytes have a unique—and incredibly rich—way of signaling with calcium; the best way to appreciate it is to look at it by eye:

Figure 2: (Left) Dynamics of astrocyte calcium, as recorded with two-photon imaging of mouse cortical slices expressing the calcium-binding fluorescent indicator GCaMP. (Right) Segmentation of this activity into discrete events using the AQuA toolbox. Reproduced from (3).

As is evident in the video above, naturalistic astrocyte calcium activity is composed of discrete calcium events—high-dimensional, complex, activity that is dynamically coupled to both neuronal activity and activity within the rest of the astrocyte network. These sparkles constantly fly across the entire connected astrocyte network, and amazingly, the full details of what lies in the essential chatter of this third of the brain is still under active investigation.

While recent work using more classical ML techniques has gotten us closer to cracking this “calcium code” in the astrocyte network, the rise of today’s AI tools raises the possibility of being able to find so many more features in this rich imaging that we, by eye, could never have even conceived of. Unfortunately, much of the data that would be useful in attacking this problem is either locked away or formatted in such a fashion to strongly obfuscate the relevant features for AI training and inference for any non-subspecialist. What’s needed to accelerate the field’s ability to dig deeper into the structure of this unique form of biological signaling is a data repository designed from the ground up to take advantage of these new tools.

Architecture

The best biological data science today is heavily leveraging the advances that have been made in AI over the last few years, and taking advantage of these tools to understand the latent structure of new biological datasets. We have previously used advances in machine learning to get at new features of our data that have led us to exciting new biology (1); but harnessing the full power of the methods that have risen to prominence over the last few years—and the new infrastructure and logistical challenges they present—has necessitated a ground-up reconceptualization of how we work with data at Forecast.

In particular, we wanted to take advantage of both:

the ease and flexibility provided by streaming data for prototyping and deploying AI workflows; and,
the richness of structured data that allows us to quickly make sense of the different metadata elements for samples or batches streamed in.

The former of these points is crucial for the nuts and bolts of scaling model training and inference, and has motivated advances in data tooling for AI like WebDatasets (4). The latter, we believe strongly, is a crucial point of departure for biomedical data; while immense progress has been made across domains by taking advantage of the incredible empirical power that lies in scaling, we believe architecting workflows at their foundation around keeping track of the richness of data’s explicit structure—and, most crucially, the structured relationships between those explicit data structures (5)—is a central part of how to leverage the power of these methods for biological applications, where the intrinsic squishy messiness makes purely tabula rasa inference of latent structure much more of an uphill battle.

Running this pub

If you’re running through the examples in this pub yourself, you’ll need additional scripts in separate modules from the snippets below with some bonus content; the full reproducible source for this pub can be found in this repo.

You’ll find each cell of code folded above the figure that cell generates, like so:

Code

## Required imports across the pub

import numpy as np

import matplotlib.pyplot as plt
from forecast_style import HouseStyle

import atdata
import astrocytes


##

# As noted later, there is still intrinsic randomness imparted by the random
# shuffling for `webdataset`'s streaming that is difficult to address w/o
# support for `wids` in `atdata`
RANDOM_SEED = 89
rng = np.random.RandomState( RANDOM_SEED )


##

# For progress bars during development
from tqdm import tqdm

DEV = False
"Set to true to generate progress bars during individual cell development"

Backbone: Distributed, typed WebDatasets with `atdata`

To realize this, we built the OpenAstrocytes dataset on atdata, a new open-source package for structured, distributed, and interoperable streaming datasets we’re developing at Forecast to aid in AI training and inference, both in our own work and for the larger investigative community.

The original motivation for atdata internally was to have fast infrastructure for speeding up our AI iteration cycle when putting together the rich, multimodal datasets we encounter in biomedical R&D; we’re excited about the features we’re working on rolling out to the community soon!

atdata is built on the open WebDataset format, designed for use as a high-performance AI I/O layer used by many other research groups, and so is able to take advantage of their streaming, caching, smart batching, and drop-in PyTorch connections. atdata adds functionality that makes it straightforward to understand structured data from remote repositories:

from oa_pub.figures import plot_micrograph

from astrocytes.schema import BathApplicationFrame


##

dataset = astrocytes.data.bath_application \
            .as_type( BathApplicationFrame )

# Pop off a single sample
exemplar = next( x for x in dataset.ordered( batch_size = None ) )

# Pop off and z-project a single batch
batch = next( x for x in dataset.ordered( batch_size = 60) )
batch_summary = np.quantile( batch.image[:, 0], 0.75, axis = 0 )


##

with HouseStyle() as s:
    fig, ax = plt.subplots( figsize = (8, 8) )

    plot_micrograph( s, batch_summary, ax,
        scale_x = exemplar.scale_x,
        scale_y = exemplar.scale_y,
        scale_bar = 100.,
        clim = (0, 140),
    )

Figure 3: Example data from the OpenAstrocytes dataset, showing mean calcium fluorescence in an *ex vivo* brain slice from a mouse over 60 frames (about one minute).

Here, the as_type method is hiding logic under the hood that takes advantage of pre-implemented bidirectional lenses that are opinionated about the way that generic microscopy images should be interpreted, if possible, as data coming from the specific experiments in the OpenAstrocytes data. These lenses have a rich compositional structure, and we are excited about the possibilities that follow downstream from starting at bedrock by keeping track of the full semantics of underlying datasets (that humans worked hard to put together!), as well as the ways they fit together.

The full documentation for atdata can be found on GitHub.

Datasets

Our initial release of OpenAstrocytes is built from two large collections of previously-published calcium imaging from mouse astrocytes, originally analyzed in (6) and (1). In total, these datasets comprise close to 1M images, making this release among the the largest open datasets of astrocyte calcium dynamics to date, to our knowledge.

The images in this first release of the OpenAstrocytes corpus were obtained using two-photon imaging of astrocytes expressing a genetically-encoded calcium sensor. These data were recorded during two different experimental paradigms:

Bath application (astrocytes.data.bath_application): in these experiments, ex vivo visual cortical slices from mice were exposed to one of two different agents (baclofen or t-ACPD) after a baseline period, with the agents added to the solution bathing the full slice after the onset time.
Two-photon photo-uncaging (astrocytes.data.uncaging): in these experiments, ex vivo visual cortical slices from mice were exposed to one of two caged neurotransmitters (RuBi-GABA or RuBi-glutamate), with a pulse (or pulses) of a second laser applied after a baseline period to “free” the GABA or glutamate from the ruthenium bipyridine cage rendering it inert initially, but only within the small, ~10µm locus directly stimulated by the second laser (7).

If you want to play around with OpenAstrocytes, you can add it to your Python project with

pip install astrocytes

# or, if you're *really* cool 😎:
uv add astrocytes

Demo

Let’s take a look at some of the fun structure that we have in the OpenAstrocytes data. Here, we’ll be digging into one of the experiments from (1), in which ex vivo slices from mouse visual cortex had two drugs—baclofen and t-ACPD²—bath-applied across the slice. Here, we’ll use the OpenAstrocytes dataset with contemporary tools to give views into the differences between these two compounds’ effects on brain networks.

² Baclofen acts as an agonist of the GABA_B receptor; t-ACPD acts as an agonist of the metabotropic glutamate receptor subtypes mGluR2 and mGluR3

At the outset, we’ll tuck away any parameters for this experimental setup that will come up across analyses.

Click any of the chevrons to expand the code that generates figure panels in this pub.

Code

# You found me, the parameter!
T_INTERVENE = 300. # s
"The applied compound begins to enter the bath at 300s after recording onset"

Approach

In previous work (1), we identified the differences in astrocyte networks’ responses to baclofen and t-ACPD by using a spatial-temporal segmentation technique known as graphical time warping (8), as implemented in the AQuA toolbox (3). However, recent advances in large, self-supervised image models like DINOv3 (9) have greatly expanded our ability to pull out salient, highly-nontrivial latent structures within image data in an unbiased way. As a simple demonstration of this paradigm as applied to the OpenAstrocytes dataset, let’s investigate the extent to which this model is able to represent physiologically-relevant information in these specialized biological images—even without any post-training.

The overall approach presented in this demo is to run inference on the OpenAstrocytes bath application data using one of the vanilla DINOv3 models (dinov3-vit7b16-pretrain-lvd1689m), and leverage the fact that this model’s internal architecture separates out its image feature representations into patch embeddings. Each microscopy image is separated by the model into a 14-by-14 grid of image patches, with each patch corresponding to a ~60µm square portion of the imaged slice. At each time point over the course of an individual recorded movie, DINOv3 provides each one of these patches with a time series of 4096-dimensional embeddings, corresponding to its decomposition of the image into representative features as learned from its self-supervision over a vast swath of (principally non-biological, but with some exceptions) images.

Here, we use a simple linear basis change (via principal component analysis) to find a 64-dimensional subspace of this 4096-dimensional space already learned within the DINOv3 self-supervision objective that captures the subset of overall image features that appear in individual astrocyte calcium patches. This subspace captures both the image feature variability across different local astrocyte network morphologies as imaged in individual patches (as in Figure 9), as well as the image feature variability induced by the specific changes of the applied pharmacology in the bath application experiment. As it turns out, the richness in these feature representations—even in the absence of any biology-specific fine-tuning!—is enough to decode the identity of the applied compound from the calcium fluorescence pattern within a single ~60µm patch at a single time-point.

DINOv3 patch embeddings capture rich, dynamic responses in astrocyte networks

First, we’ll examine the latent feature structure of local patches of astrocyte network calcium activity, and how those features unfold after application of pharmacology. We’re particularly interested in looking at local patches to start because of our prior on the biology: because of a significant and growing body of work elucidating the information content of astrocyte calcium activity on these spatial scales (10), we have a hint that these features in our observables are most likely paint a generalizable readout of physiologic changes induced by pharmacology.

Individual patch responses

Let’s download a representative movie from the OpenAstrocytes bath application dataset—the same recording we projected above in Figure 3 to take a peek at the overall anatomy of the slice—and inspect the behavior of a couple representative patches from it:

Code

from oa_pub.analysis import (
    get_movie,
    movie_times,
    #
    N_PATCHES_Y, N_PATCHES_X,
)

# Snag a convenient smoother for image preprocessing
from scipy.ndimage import gaussian_filter


## Demo example parameters

FIRST_MOVIE_ID = 'd22f6a65-b3e2-46c2-ba5c-551920af1fe3'
"Microscope-supplied UUID of the demo movie to show"

DEMO_PATCHES = [
    (10, 12),
    (3, 7),
]
"Indices (vertical, horizontal) of the demo patches to show"

IDX_FRAMES_COMPARE = (250, 350, 450, 550, 650, 750)
"Indices of time-slices to display for patch snapshot figure"

DEMO_PATCH_IMAGE_SMOOTHING_KERNEL = (3., 0.6, 0.6)
"S.D. of gaussian smoothing kernel used for visualization (t, y, x)"

#

n_patches = len( DEMO_PATCHES )
n_frame_compare = len( IDX_FRAMES_COMPARE )


## Figure parameters

PATCH_CMAX = 110
"Maximum intensity value for patch snapshot figure"


##

first_movie_frames = get_movie( FIRST_MOVIE_ID, verbose = DEV )
first_movie_times, movie_dt = movie_times( first_movie_frames )

# The current `OpenAstrocytes` data architecture has imaging channels in the
# first axis; we're refactoring the frame schemas to make channel contents
# explicit now that lenses easily handle conversion. Channel 0 here is GCaMP.
# => see https://github.com/forecast-bio/open-astrocytes/issues/12
first_frame = first_movie_frames[0]
first_image = first_frame.image[0]

#
patch_size_y = first_image.shape[0] / N_PATCHES_Y
patch_size_x = first_image.shape[1] / N_PATCHES_X

test_patch_frames = []
for cur_i, cur_j in DEMO_PATCHES:
    patch_idx_y = ( int( np.floor( patch_size_y * cur_i ) ),
                    int( np.ceil( patch_size_y * (cur_i + 1) ) ) )
    patch_idx_x = ( int( np.floor( patch_size_y * cur_j ) ),
                    int( np.ceil( patch_size_y * (cur_j + 1) ) ) )

    test_patch_frames.append(
        np.array( [ 
            frame.image[0][patch_idx_y[0]:patch_idx_y[1], :][:, patch_idx_x[0]:patch_idx_x[1]]
            for frame in first_movie_frames
        ] )
    )

test_patch_frames_filtered = [
    gaussian_filter( fs, sigma = DEMO_PATCH_IMAGE_SMOOTHING_KERNEL )
    for fs in test_patch_frames
]


##

with HouseStyle() as s:

    fig, axs = plt.subplots( n_patches, n_frame_compare,
        figsize = (10, 4),
        sharey = True,
    )

    for i_patch in range( n_patches ):
        cur_patch_frames = test_patch_frames_filtered[i_patch]

        for i_t in range( n_frame_compare ):
            ax = axs[i_patch, i_t]
            cur_image = cur_patch_frames[IDX_FRAMES_COMPARE[i_t]]

            ax.imshow( cur_image,
                clim = (0, PATCH_CMAX),
                cmap = 'afmhot',
            )

            if i_patch == 0:
                # Show timings on top row
                t_disp = first_movie_times[IDX_FRAMES_COMPARE[i_t]] - T_INTERVENE
                ax.set_title( f'{t_disp:+0.1f}s',
                    fontsize = 12,
                )
            
            if i_t == 0:
                # Show patch label on left column
                axs[i_patch, i_t].set_ylabel( f'Patch {i_patch+1}' )

            axs[i_patch, i_t].tick_params(
                axis = 'x',
                length = 0,
            )
            axs[i_patch, i_t].set_xticks( [-0.5, patch_size_x + 0.5], ['', ''] )
            axs[i_patch, i_t].set_yticks( [] )

    # Set up display ticks
    ax_xticks = axs[-1, 0]
    ax_xticks.tick_params(
        axis = 'x',
        length = 3,
    )
    ax_xticks.set_xticks(
        [-0.5, patch_size_x + 0.5],
        ['0', f'{first_frame.scale_x * patch_size_x:0.0f} µm'],
        ha = 'left',
    )

    # Plot a nice lil' highlight behind the post-onset images
    fig.patches.extend( [
        plt.Rectangle( (0.382, 0.103), 0.53, 0.818,
            figure = fig,
            transform = fig.transFigure,
            #
            fill = True,
            linewidth = 0,
            color = 'k',
            alpha = 0.05,
            zorder = -1_000,
        )
    ] )

    # Make the spacing nice and ~tight~.
    plt.subplots_adjust(
        wspace = 0.04,
        hspace = 0.,
    )

Figure 4: Two example patches sub-sampled from the same recording as above in Figure 3, showing evolution of calcium activity over time relative to bath application of baclofen at \(t = 0\). (A small Gaussian smoothing kernel in space and time has been applied here, to aid visualization.)

These two patches are good illustrations of the “median” structure of this astrocyte two-photon imaging data. Both patches do exhibit some changes after compound application; in comparing, for example, the right four columns with the left two, a close inspection reveals some parts of the patch with evoked activations. But, these changes can be subtle to the unaided or non-specialist eye—particularly here with Patch 2, shown in the bottom row of panels, which we picked to show what a very subtle case looks like in raw form.

However, the large image model is remarkably able to pull out a good deal of structure in its latent feature embeddings. We see this best if we first reduce to looking at the particular dimensions of that embedding space maximally capturing the features exhibited in the OpenAstrocytes corpus³. Specifically, we find that there are axes (PCs) within the patch-embedding space that capture robust changes induced in the astrocyte network as seen through our imaging data—even though those changes, as we saw before, were at first quite subtle to the unaided eye:

³ I.e., the top few PCs of the embedding features.

Code

from oa_pub.analysis import (
    baseline_normalize,
    is_between,
)

# => see https://github.com/forecast-bio/open-astrocytes/issues/14
from astrocytes._datasets._embeddings import (
    PatchEmbeddingTrace,
)

## Demo params
DEMO_PCS = [
    0,
    1,
    4,
]

PC_WINDOW = 84
"The smoothing window used to postprocess the PCs was 84 frames (~60s)"

## Figure params
DEMO_PC_COLORS = {
    0: 'C0',
    1: 'C2',
    4: 'C3',
}

BASELINE_WINDOW = (-90, -30)
"The baseline window (in s relative to onset) used for normalization"
PLOT_WINDOW = (-100, 240)
"The window (in s relative to onset) to focus on for plotting"

CAUSAL_OFFSET_FRACTION = 0.25
"The fraction of the full smoothing window width to use as a causal offset"

PANEL_YLIM = (-8, 13)
PANEL_XTICKS = [-90, 0, 60, 120, 180, 240]

# Migration of the PCA results to the full typed schema in `astrocytes` is
# in-progress!
# => see https://github.com/forecast-bio/open-astrocytes/issues/13
PATCH_EMBEDDING_TRACE_WDS_URL = (
    'https://data.forecastbio.cloud'
    + '/testing/patch-pc-traces/bath-application/'
    + 'bath_app-dinov3_vit7b16-pca64-smooth84.tar'
)
"URL of the atdata blob for the demo pre-computed `PatchEmbeddingTrace`s"


##

ds = atdata.Dataset[PatchEmbeddingTrace]( PATCH_EMBEDDING_TRACE_WDS_URL )

it = ds.ordered( batch_size = None )
if DEV: it = tqdm( it )


##

test_traces = [ None for _ in DEMO_PATCHES ]
for trace in it:
    try:
        # Ensure traces have the needed metadata
        assert trace.metadata is not None and 'uuid' in trace.metadata
        assert trace.metadata['uuid'] == FIRST_MOVIE_ID

        for i_patch, (cur_i, cur_j) in enumerate( DEMO_PATCHES ):
            if trace.i_patch == cur_i and trace.j_patch == cur_j:
                test_traces[i_patch] = trace
                if DEV:
                    print( f'Snagged trace for {cur_i, cur_j}')
        
        should_finish = True
        for x in test_traces:
            if x is None:
                should_finish = False
        if should_finish:
            break

    except:
        # Skip traces without needed metadata
        continue


##

causal_offset = (PC_WINDOW * movie_dt) * CAUSAL_OFFSET_FRACTION

with HouseStyle( grids = True ) as s:
    fig, axs = plt.subplots( n_patches, 1,
        figsize = (7, 7),
        sharex = True,
    )

    for i_patch in range( n_patches ):
        ax = axs[i_patch]

        cur_trace = test_traces[i_patch]
        t_rel = cur_trace.ts - T_INTERVENE + causal_offset
        filter_plot = is_between( t_rel, PLOT_WINDOW )

        #

        for i_seq, cur_pc in enumerate( DEMO_PCS ):
            cur_color = DEMO_PC_COLORS[cur_pc]
            
            pc_values_norm, _, _ = baseline_normalize(
                t_rel, cur_trace.values[cur_pc, :],
                BASELINE_WINDOW
            )

            ax.plot( t_rel[filter_plot], pc_values_norm[filter_plot],
                f'{cur_color}-',
                label = f'PC {cur_pc + 1}',
            )

        # Drug application overlay
        ax.fill_between( [0, PLOT_WINDOW[1]], PANEL_YLIM[0], PANEL_YLIM[1],
            color = 'k',
            alpha = 0.05,
            linewidth = 0,
        )

        # Baseline
        ax.plot( PLOT_WINDOW, [0, 0],
            'k-',
            linewidth = 1,
        )
        # Left y-axis
        ax.plot( [PLOT_WINDOW[0], PLOT_WINDOW[0]], PANEL_YLIM,
            'k-',
            linewidth = 1,
        )

        ax.set_xticks( PANEL_XTICKS )
        ax.set_xlim( PLOT_WINDOW )
        ax.set_ylim( PANEL_YLIM )

        ax.set_ylabel( f'Patch {i_patch + 1}' )

        # Bottom plot gets x label and legend
        if i_patch == (n_patches - 1):
            ax.set_xlabel( 'Time (s)' )
            plt.legend(
                fontsize = 12,
                loc = 'upper left',
            )

    # "+ Drug" text on one sample axis
    axs[0].text( 5, -6.5, '+ Drug',
        va = 'center',
        alpha = 0.5,
    )

    # Make the spacing nice and ~tight~.
    plt.subplots_adjust( hspace = 0.08 )

Figure 5: For each of the two astrocyte network patches shown above in Figure 4, each trace shows the time-evolution of one representative embedding PC, relative to the onset of bath application of baclofen at t = 0. Time shown as seconds relative to compound entering the bath (shaded box); PC changes shown as multiples of the standard deviation of fluctuations in the pre-application baseline period. (*n.b.*: a 60s boxcar filter in time has been applied to each PC in order to compensate for intrinsic imaging noise, leading to blurring of activation edges.)

Response heterogeneity across the imaging field

The traces above depict only three of the 64 PCs we computed as our bath application–specific subspace⁴—and, showed only two of the 196 (14-by-14) patches obtained from inference on this one experimental session alone. Already, we are getting a sense for the detail that the embedding model can pull out from these data.

⁴ These 64 PCs captured a large majority of the observed variance in the 4096-dimensional DINOv3 patch-embedding features. Hence, while our astrocyte two-photon calcium dynamics data is quite rich, it is also the case, as expected, that the statistical structure of these specific sorts of biological images occupies a very small subspace of the full space of naturalistic images.

Indeed, when we look at the feature dynamics across all patches within the imaging field, the full richness of this method in characterizing astrocyte calcium becomes apparent:

Code

# For typing assists
from matplotlib.axes import Axes
from numpy.typing import NDArray

from oa_pub.analysis import get_movie_traces
from oa_pub.figures import plot_trace_grid


## Plot params
GRIDS_XTICKS = [-90, 0, 240]

plot_pc = 0

cur_grid_ylim = (-12.5, 12.5)
cur_grid_yticks = [-10, 0, 10]


##

# Collate full dataset
first_movie_traces = get_movie_traces( FIRST_MOVIE_ID,
    verbose = DEV,
)

# Do the grid plot itself

with HouseStyle() as s:

    fig, axs = plt.subplots( N_PATCHES_Y, N_PATCHES_X,
        figsize = (16, 12),
    )

    ts_grid = first_movie_traces[0][0].ts
    ts_grid_rel = ts_grid - T_INTERVENE + causal_offset

    filter_grid = is_between( ts_grid_rel, PLOT_WINDOW )

    # For each location in the grid, the y-values to plot are created by
    # taking this specific PC's values from the structured data, normalizing
    # them (the first return value of baseline_normalize), and then filtering for the
    # time-window we want to plot
    ys_grid = [
        [
            baseline_normalize(
                t_rel, first_movie_traces[i][j].values[plot_pc, :],
                BASELINE_WINDOW
            )[0][filter_grid]
            for j in range( N_PATCHES_Y )
        ]
        for i in range( N_PATCHES_X )
    ]

    plot_trace_grid( axs, ts_grid_rel[filter_grid], ys_grid,
        ylim = cur_grid_ylim,
        xticks = GRIDS_XTICKS,
        yticks = cur_grid_yticks,
        color = f'{DEMO_PC_COLORS[plot_pc]}',
        highlight = DEMO_PATCHES,
    )

    # Make spacing nice and ~tight~.
    fig.subplots_adjust( wspace = 0.1, hspace = 0.1 )

Figure 6: Baseline-normalized changes in DINOv3 embedding PC 1 features after bath application of baclofen for each of the 196 astrocyte network patches, derived from the same slice depicted above in Figure 3. The locations of the two representative patches shown in greater detail in Figure 4 and Figure 5 are highlighted in black boxes. Note the heterogeneity of responses across the imaging field, as well as the correspondence between portions of the network exhibiting PC1 activation and anatomical features of the local astrocyte network in Figure 3. Time shown as seconds relative to compound entering the bath (shaded box); changes shown as multiples of the baseline distribution standard deviation.

Code

## Plot params
plot_pc = 1

cur_grid_ylim = (-18, 18)
cur_grid_yticks = [-15, 0, 15]


## ^ as above

with HouseStyle() as s:
    fig, axs = plt.subplots( N_PATCHES_Y, N_PATCHES_X,
        figsize = (16, 12),
    )

    ts_grid = first_movie_traces[0][0].ts
    ts_grid_rel = ts_grid - T_INTERVENE + causal_offset

    filter_grid = is_between( ts_grid_rel, PLOT_WINDOW )
    ys_grid = [
        [
            baseline_normalize(
                t_rel, first_movie_traces[i][j].values[plot_pc, :],
                BASELINE_WINDOW
            )[0][filter_grid]
            for j in range( N_PATCHES_Y )
        ]
        for i in range( N_PATCHES_X )
    ]

    plot_trace_grid( axs, ts_grid_rel[filter_grid], ys_grid,
        ylim = cur_grid_ylim,
        xticks = GRIDS_XTICKS,
        yticks = cur_grid_yticks,
        color = f'{DEMO_PC_COLORS[plot_pc]}',
        highlight = DEMO_PATCHES,
    )

    # Make spacing nice and ~tight~.
    fig.subplots_adjust( wspace = 0.1, hspace = 0.1 )

Figure 7: As in Figure 6, but for embedding PC 2 features. Note that while the distribution of loci that exhibit modulation post-application still respects the underlying anatomy of Figure 3, the spatial and temporal distribution is distinct from the changes seen for PC 1. Time shown as seconds relative to compound entering the bath (shaded box); changes shown as multiples of the baseline distribution standard deviation.

Code

## Plot params
plot_pc = 4

cur_grid_ylim = (-18, 18)
cur_grid_yticks = [-15, 0, 15]


## ^ as above

with HouseStyle() as s:
    fig, axs = plt.subplots( N_PATCHES_Y, N_PATCHES_X,
        figsize = (16, 12),
    )

    ts_grid = first_movie_traces[0][0].ts
    ts_grid_rel = ts_grid - T_INTERVENE + causal_offset

    filter_grid = is_between( ts_grid_rel, PLOT_WINDOW )
    ys_grid = [
        [
            baseline_normalize(
                t_rel, first_movie_traces[i][j].values[plot_pc, :],
                BASELINE_WINDOW
            )[0][filter_grid]
            for j in range( N_PATCHES_Y )
        ]
        for i in range( N_PATCHES_X )
    ]

    plot_trace_grid( axs, ts_grid_rel[filter_grid], ys_grid,
        ylim = cur_grid_ylim,
        xticks = GRIDS_XTICKS,
        yticks = cur_grid_yticks,
        color = f'{DEMO_PC_COLORS[plot_pc]}',
        highlight = DEMO_PATCHES,
    )

    # Make spacing nice and ~tight~.
    fig.subplots_adjust( wspace = 0.1, hspace = 0.1 )

Figure 8: As in Figure 7, but for embedding PC 5 features. Time shown as seconds relative to compound entering the bath (shaded box); changes shown as multiples of the baseline distribution standard deviation.

Overall, we see that while the activity of individual astrocyte network patches may be evident-but-subtle to the eye, the large, generalist DINOv3 image model provides a quantitative view of robust changes in calcium imaging features after drug application—a view that captures both the many-dimensional richness of local changes, as well as the heterogeneity of those responses in individual segments of the astrocyte network across an individual cortical slice.

Components of static anatomy and dynamic physiology in patch embeddings

Because the original PCA we performed to determine the bath application–specific subspace of DINOv3 patch embeddings was computed across the full dataset, the image feature variance optimized through PCA necessarily incorporated contributions from two distinct facets of the image structure:

the contribution from variations across the static astrocyte network anatomical architectures imaged within the various patches; and
the physiological dynamics that unfold within a single patch.

In order to appreciate the structure of these two distinct contributions within the PC subspace projections in the OpenAstrocytes bath application dataset, we can take a look at a handful of representative patches to visualize the way astrocyte network patches “dance” around PC-space over the course of an experimental session:

Code

from oa_pub.analysis import (
    is_between,
)
from oa_pub.figures import (
    plot_trajectory_3d,
)


## Figure parameters
DEMO_PCS_3D = [
    0,
    1,
    4
]

N_TRACES_SAMPLE_3D = 300
"Number of samples to draw from the remote dataset for 3D plots"
N_TRACES_SHOW_3D = 100
"Number of samples to pull from those downloaded for including in plots"

PLOT_WINDOW_3D = (0, 240)


##

# Preprocess shuffled subset of traces

dataset = atdata.Dataset[PatchEmbeddingTrace]( PATCH_EMBEDDING_TRACE_WDS_URL )

# TODO The shuffling mechanism in WebDataset inside this routine means that we
# can't avoid nondeterminism in the data streamed in here, leading to intrinsic
# nondeterminism in the output for this pub; this may be avoided in a future
# `atdata` release supporting `wids` for indexed streaming.
# => see https://github.com/foundation-ac/atdata/issues/32
it = dataset.shuffled( batch_size = None )
if DEV: it = tqdm( it )

i_trace_sample = 0
traces_3d_raw = []
traces_3d_norm = []

for i_trace_sample, trace in zip( range( N_TRACES_SAMPLE_3D ), it ):

    ts_rel = trace.ts - T_INTERVENE + causal_offset
    filter_3d = is_between( ts_rel, PLOT_WINDOW_3D )

    cur_trace_norm = np.array( [
        baseline_normalize(
            ts_rel, trace.values[cur_pc, :],
            BASELINE_WINDOW
        )[0]
        for cur_pc in DEMO_PCS_3D
    ] )
    
    traces_3d_raw.append( trace.values[DEMO_PCS_3D, :][:, filter_3d]  )
    traces_3d_norm.append( cur_trace_norm[:, filter_3d] )
    
traces_3d_raw = np.array( traces_3d_raw )
traces_3d_norm = np.array( traces_3d_norm )


##

idx_show = rng.permutation( N_TRACES_SAMPLE_3D )[:N_TRACES_SHOW_3D]

with HouseStyle() as s:
    fig = plt.figure( figsize = (10, 10) )

    ax = fig.add_subplot( projection = '3d' )
    ax.view_init( azim = -62 )

    #

    it = idx_show
    if DEV: it = tqdm( it )

    for i_trace in it:
        plot_trajectory_3d( ax, traces_3d_raw[i_trace, :, :],
            color = f'C{i_trace % 4}',
        )

    ax.set_xticks( [0], '' )
    ax.set_yticks( [0], '' )
    ax.set_zticks( [0], '' )

    ax.set_xlabel( f'PC {DEMO_PCS_3D[0]+1}', rotation = -17 )
    ax.set_ylabel( f'PC {DEMO_PCS_3D[1]+1}', rotation = 48 )
    ax.set_zlabel( f'PC {DEMO_PCS_3D[2]+1}', rotation = 90 )
    
    ax.set_box_aspect( (1, 1, 1.1), zoom = 0.92 )

Figure 9: Three-dimensional slice of patch embedding PCs for a random selection of 100 astrocyte network patches as they unfold in time; colors are randomly chosen for each individual patch’s dynamics. Black dots indicate the embedding of the network patch when the applied compound enters the bath at t = 0; colored dots with black circles indicate the embedding at t = 4 min after application, with the colored line between the two dots depicting the embedding-space trajectory of the patch over the course of the recording. Grey lines indicate the PC-space origin. Among this ~1% subset of patches within the OpenAstrocytes bath application dataset, trajectories are generally separable, indicating contributions of overall differences in underlying astrocytic anatomy from patch to patch. As expected, a number of the patches are *relatively* unresponsive to the applied compound, with only a small degree of motion in embedding-space across the recording (wiggly lines). However, a subset of the patches show strong, directed deviations after compound application, indicating reflections of local physiological modulation in the patch’s image statistics.

As expected, the largest contributor to the heterogeneity of individual frames’ latent embeddings appears to come from the differences in anatomical structure—that is, from the regularities of the frames’ image content arising from the difference in the shape of the cells within the specific patch—as opposed to physiologic changes within a static anatomical architecture. We see this in the distinctive separation of the trajectories of individual patches within embedding-space: while the positions of individual patches do change over time across the recording, these dynamic changes tend to be notably smaller relative to the separation between full patch trajectories.

However, quite interestingly, those dynamic deflections do also appear to possess their own distinct structure. We can emphasize these deviations by plotting the same patch trajectories again, but this time normalized to each individual patch’s baseline period:

Code

with HouseStyle() as s:
    fig = plt.figure( figsize = (10, 10) )
    ax = fig.add_subplot( projection = '3d' )
    ax.view_init( elev = 20 )

    for i_trace in idx_show:
        plot_trajectory_3d( ax, traces_3d_norm[i_trace, :, :],
            color = f'C{i_trace % 4}',
        )
    
    ax.set_xlim( -32, 32 )
    ax.set_ylim( -32, 32 )
    ax.set_zlim( -12, 52 )

    # `pyplot` 3d ticks are a bit too loose by default!
    ax.tick_params(
        axis = 'y',
        which = 'major',
        pad = -5,
    )

    ax.set_xticks( [-30, 0, 30], ['', '', ''],
        fontsize = 12,
    )
    ax.set_yticks( [-30, 0, 30], ['–30', '', '+30'],
        rotation = -55,
        ha = 'left',
        fontsize = 12,
    )
    ax.set_zticks( [-10, 0, 50], ['–10', '', '+50'],
        fontsize = 12,
    )

    ax.set_xlabel( f'PC {DEMO_PCS_3D[0]+1}',
        rotation = -14 )
    ax.set_ylabel( f'PC {DEMO_PCS_3D[1]+1}',
        rotation = 37 )
    ax.set_zlabel( f'PC {DEMO_PCS_3D[2]+1}',
        rotation = 90 )
    
    ax.set_box_aspect( (1, 1, 1.1),
        zoom = 0.87,
    )

Figure 10: Embeddings of the same patches from Figure 9, but normalized to the fluctuations of each patch within the baseline period before application of pharmacology. Scale shown as multiples of the per-patch baseline standard deviation for each PC. This view highlights that there are a number of patches that are relatively unresponsive to applied pharmacology and remain relatively close to the origin (possibly owing, for example, to heterogeneity in calcium sensor expression)—and at the same time, a number of patches that exhibit extremely large (~50\(\sigma\)!) excursions.

Put together, these results indicate that the latent structure of patch embeddings from DINOv3—a large, self-supervised image model without specific biological emphasis during training—contains subspaces rich in information about both the anatomical structure of cells within the astrocyte network (as captured through two-photon imaging of our particular intracellular calcium indicator). This is a remarkable testament to both the intricate order of the compositional structure within image data writ large, as well as to the remarkable power that the self-supervised learning objective has in yielding models able to internalize many highly-nontrivial features of this intricate structure within the statistics of naturalistic images, even within highly-specific domains.

Astrocyte patch embeddings linearly separate physiological impacts of distinct compounds

The amazing richness of these AI-generated embeddings’ portrait of physiologic changes within local network patches after the application of pharmacology—even if not the largest contributor per se to the structure of embeddings within a vanilla image model like DINOv3—raises the question of whether a small subspace of imaging features contains information about the identity of the applied compound. This would extend our prior work demonstrating that specific features of astrocyte calcium dynamics, derived from more classical ML techniques, encode this sort of information at the network level, aggregating across full imaging fields (1). The extent to which these sorts of physiologic encodings are present within individual small network patches—and the intricacies of such encodings’ structure—have not yet been fully elucidated.

Individual patch responses

To investigate the extent of these local differences, we can first look at one of the same representative patches we took a gander at in Figure 4, but this time, across multiple distinct chemical exposures:

Code

from oa_pub.analysis import get_movie

## Demo parameters
COMPARISON_MOVIE_ID = 'cb2ee929-6bf5-4742-b9a5-21675c183cf9'


## Plot parameters
COMPOUND_COLORS = {
    'tacpd': 'C0',
    'baclofen': 'C3',
}
COMPOUND_CMAPS = {
    'tacpd': 'managua',
    'baclofen': 'managua_r',
}
COMPOUND_DISPLAY = {
    'tacpd': 't-ACPD',
    'baclofen': 'baclofen',
}

PATCH_CMAX_COMPARE = 70


##

comparison_movie_frames = get_movie( COMPARISON_MOVIE_ID, verbose = DEV )
comparison_movie_times, movie_dt = movie_times( comparison_movie_frames )

demo_patch_compare = DEMO_PATCHES[0]
movies_compare = [
    FIRST_MOVIE_ID,
    COMPARISON_MOVIE_ID,
]
frame_blocks_compare = [
    first_movie_frames,
    comparison_movie_frames,
]


##

demo_patch_frames_compare = []
cur_i, cur_j = demo_patch_compare

for cur_frames in frame_blocks_compare:

    patch_idx_y = ( int( np.floor( patch_size_y * cur_i ) ),
                    int( np.ceil( patch_size_y * (cur_i + 1) ) ) )
    patch_idx_x = ( int( np.floor( patch_size_y * cur_j ) ),
                    int( np.ceil( patch_size_y * (cur_j + 1) ) ) )

    demo_patch_frames_compare.append(
        np.array( [
            # Slice calcium channel
            frame.image[0]\
                # Slice vertical patch
                [patch_idx_y[0]:patch_idx_y[1], :]\
                # Slice horizontal patch
                [:, patch_idx_x[0]:patch_idx_x[1]]
            
            for frame in cur_frames
        ] )
    )

demo_patch_frames_compare_filtered = [
    gaussian_filter( fs, sigma = DEMO_PATCH_IMAGE_SMOOTHING_KERNEL )
    for fs in demo_patch_frames_compare
]

n_series = len( demo_patch_frames_compare )
n_frame_compare = len( IDX_FRAMES_COMPARE )


#

with HouseStyle() as s:

    fig, axs = plt.subplots( n_patches, n_frame_compare,
        figsize = (10, 4),
        sharey = True,
    )

    for i_series in range( n_series ):
        cur_patch_frames = demo_patch_frames_compare_filtered[i_series]
        cur_exemplar = frame_blocks_compare[i_series][0]

        cur_compound = cur_exemplar.applied_compound

        for i_t in range( n_frame_compare ):
            ax = axs[i_series, i_t]

            ax.imshow(
                cur_patch_frames[IDX_FRAMES_COMPARE[i_t], :, :],
                #
                # Go two-sided here to enable us to use a compatible diverging
                # colormap for the two compounds
                clim = (-PATCH_CMAX_COMPARE, PATCH_CMAX_COMPARE),
                cmap = COMPOUND_CMAPS[cur_compound],
            )

            if i_series == 0:
                t_disp = comparison_movie_times[IDX_FRAMES_COMPARE[i_t]] - T_INTERVENE
                ax.set_title( f'{t_disp:+0.0f}s',
                    fontsize = 12,
                )

            if i_series == (n_patches - 1) and i_t == 0:
                ax.tick_params(
                    axis = 'x',
                    length = 3,
                )
                ax.set_xticks( [-0.5, patch_size_x+0.5], ['0', f'{cur_exemplar.scale_x * patch_size_x:0.0f} µm'],
                    ha = 'left',
                )
            else:
                ax.tick_params(
                    axis = 'x',
                    length = 0,
                )
                ax.set_xticks( [-0.5, patch_size_x+0.5], ['', ''] )
            
            ax.set_yticks( [] )

            if i_t == 0:
                ax.set_ylabel( f'{COMPOUND_DISPLAY[cur_compound]}',
                    color = COMPOUND_COLORS[cur_compound]
                )

    fig.patches.extend( [
        plt.Rectangle( (0.382, 0.103), 0.53, 0.818,
            fill = True,
            linewidth = 0,
            color = 'k',
            alpha = 0.05,
            zorder = -1_000,
            transform = fig.transFigure,
            figure = fig,
        )
    ] )

    plt.subplots_adjust( wspace = 0.04, hspace = 0. )

Figure 11: Example frames of patch 1 from Figure 4 above, now showing evolution of calcium activity over time within the *same* patch relative to bath application of either baclofen (top row) or t-ACPD (bottom row). Note that, consistent with the findings from (1), t-ACPD evokes a much larger response in this astrocyte patch. (A small Gaussian smoothing kernel in space and time has been applied here, to aid visualization.)

We see by eye that t-ACPD evokes a considerably larger response in this patch, and this is reflected in the patch embeddings: while in Figure 5 above we were able to clearly distinguish the subtle modulations induced by the introduction of baclofen, when placed on the same scale (top row) as the changes induced in this same patch by t-ACPD (bottom row), in Figure 12 below, the difference is stark:

Code

from tqdm import tqdm

import atdata

import astrocytes
# => see https://github.com/forecast-bio/open-astrocytes/issues/14
from astrocytes._datasets._embeddings import (
    PatchEmbeddingTrace,
)

#

## Plot params
PANEL_YLIM = (-6, 85)

#

ds = atdata.Dataset[PatchEmbeddingTrace]( PATCH_EMBEDDING_TRACE_WDS_URL )


it = ds.ordered( batch_size = None )
if DEV: it = tqdm( it )

demo_traces_compare = [ None for _ in movies_compare ]

for trace in it:

    try:
        assert trace.metadata is not None and 'uuid' in trace.metadata
        assert trace.metadata['uuid'] in movies_compare
        
        i_movie = movies_compare.index( trace.metadata['uuid'] )

        if (
            trace.i_patch == demo_patch_compare[0]
            and trace.j_patch == demo_patch_compare[1]
        ):
            demo_traces_compare[i_movie] = trace
        
        should_finish = True
        for x in demo_traces_compare:
            if x is None:
                should_finish = False
        if should_finish:
            break

    except:
        continue

#

with HouseStyle( grids = True ) as s:

    fig, axs = plt.subplots( n_patches, 1,
        figsize = (7, 12),
        sharex = True,
        # sharey = True,
    )

    for i_series in range( n_patches ):
        ax = axs[i_series]

        cur_trace = demo_traces_compare[i_series]
        cur_compound = frame_blocks_compare[i_series][0].applied_compound
        t_rel = cur_trace.ts - T_INTERVENE + causal_offset
        filter_plot = is_between( t_rel, PLOT_WINDOW )

        #

        for i_seq, cur_pc in enumerate( DEMO_PCS ):
            cur_color = DEMO_PC_COLORS[cur_pc]
            
            pc_values_norm, _, _ = baseline_normalize(
                t_rel, cur_trace.values[cur_pc, :],
                BASELINE_WINDOW
            )

            ax.plot( t_rel[filter_plot], pc_values_norm[filter_plot],
                f'{cur_color}-',
                label = f'PC {cur_pc + 1}',
            )

        # Drug application overlay
        ax.fill_between( [0, PLOT_WINDOW[1]], PANEL_YLIM[0], PANEL_YLIM[1],
            color = 'k',
            alpha = 0.05,
            linewidth = 0,
        )

        # Baseline
        ax.plot( PLOT_WINDOW, [0, 0],
            'k-',
            linewidth = 1,
        )
        # Left y-axis
        ax.plot( [PLOT_WINDOW[0], PLOT_WINDOW[0]], PANEL_YLIM,
            'k-',
            linewidth = 1,
        )

        ax.set_xticks( PANEL_XTICKS )
        ax.set_xlim( PLOT_WINDOW )
        ax.set_ylim( PANEL_YLIM )

        ax.set_ylabel( f'{COMPOUND_DISPLAY[cur_compound]}',
            color = COMPOUND_COLORS[cur_compound]
        )

        # Bottom plot gets x label and legend
        if i_series == (n_patches - 1):
            ax.set_xlabel( 'Time (s)' )
            plt.legend(
                fontsize = 12,
                loc = 'upper left',
            )

        # Bottom plot also gets "+ Drug" label overlay
        if i_series == (n_patches - 1):
            ax.text( 5, 50, '+ Drug',
                alpha = 0.5,
            )

    fig.subplots_adjust( hspace = 0.05 )

Figure 12: For the astrocyte network patch depicted in Figure 11, the time-evolution of three representative embedding PCs relative to the onset of bath application of baclofen (top row) or t-ACPD (bottom row) at t = 0. Time shown as seconds relative to compound entering the bath (shaded box); changes shown as multiples of the baseline distribution standard deviation. (*n.b.*: a 60s boxcar filter in time has been applied to each PC, leading to blurring of activation edges.)

Consistent pharmacology-specific profiles

The fact that these feature axes within patch embedding–space separate responses to the two compounds is not unique to this patch or this cortical slice: in fact, we can average across the responses in all patches in all recordings throughout the entire OpenAstrocytes bath application dataset, in order to view the unique profiles of dynamic responses to baclofen (red) and t-ACPD (blue), and quantify those responses’ heterogeneity:

Code

from oa_pub.analysis import boot_stat


## Data parameters

MODULATION_MIN_CONCENTRATION = 50. #µM
"""The minimum concentration (µM) of applied compound to include in PC
modulation analyses

(This excludes recordings where there may have been little engagement, for
demonstration purposes)
"""

MODULATION_DEMO_PCS = list( range( 8 ) )
n_pcs_show = len( MODULATION_DEMO_PCS )


## Collate traces by compound
from oa_pub.analysis import COMPOUND_ALIASES

ds = atdata.Dataset[PatchEmbeddingTrace]( PATCH_EMBEDDING_TRACE_WDS_URL )

compound_traces: dict[str, list[PatchEmbeddingTrace]] = {
    c: []
    for c in COMPOUND_DISPLAY.keys()
}

it = ds.ordered( batch_size = None )
if DEV: it = tqdm( it )

for trace in it:

    try:
        assert trace.metadata is not None, \
            'No metadata for trace'
        assert 'compound' in trace.metadata and 'concentration' in trace.metadata, \
            'Missing metadata keys for trace'
        
        if trace.metadata['concentration'] >= MODULATION_MIN_CONCENTRATION:
            for k, vs in COMPOUND_ALIASES.items():
                if trace.metadata['compound'].lower() in vs:
                    compound_traces[k].append( trace )

    except Exception as e:
        continue


## Plot

exemplar_trace = compound_traces['baclofen'][0]
ts_rel = exemplar_trace.ts - T_INTERVENE + causal_offset

filter_plot = is_between( ts_rel, PLOT_WINDOW )

with HouseStyle( grids = True ) as s:

    fig, axs = plt.subplots( n_pcs_show // 2, 2,
        figsize = (11, 12),
        sharex = True,
    )

    for i_seq, i_pc in enumerate( MODULATION_DEMO_PCS ):
        i_ax = i_seq // 2
        j_ax = i_seq % 2
        ax = axs[i_ax, j_ax]

        raw_values = {
            c: np.array( [
                baseline_normalize( ts_rel, trace.values[i_pc, :],
                    BASELINE_WINDOW,
                )[0]
                for trace in traces
            ] )
            for c, traces in compound_traces.items()
        }

        summary_stat = lambda xs: np.mean( xs, axis = 0 )

        middle_vals = {
            c: summary_stat( vs )
            for c, vs in raw_values.items()
        }

        middle_boots = {
            c: boot_stat( vs, summary_stat,
                n = 200,
                axis = 0,
                rng = rng,
            )
            for c, vs in raw_values.items()
        }
        middle_low = {
            c: np.quantile( vs, 0.025, axis = 0 )
            for c, vs in middle_boots.items()
        }
        middle_high = {
            c: np.quantile( vs, 0.975, axis = 0 )
            for c, vs in middle_boots.items()
        }

        #

        for cur_compound, cur_middle in middle_vals.items():

            ax.plot( ts_rel[filter_plot], cur_middle[filter_plot], '-',
                color = COMPOUND_COLORS[cur_compound],
                label = COMPOUND_DISPLAY[cur_compound],
            )

            cur_low = middle_low[cur_compound]
            cur_high = middle_high[cur_compound]
            ax.fill_between( ts_rel[filter_plot], cur_low[filter_plot], cur_high[filter_plot],
                #
                color = COMPOUND_COLORS[cur_compound],
                alpha = 0.2,
                linewidth = 0.5,
            )

        xl = PLOT_WINDOW
        yl = ax.get_ylim()

        ax.plot( xl, [0, 0], 'k-',
            linewidth = 1,
        )

        ax.fill_between( [0, xl[1]], yl[0], yl[1],
            color = 'k',
            alpha = 0.05,
            linewidth = 0,
        )

        ax.plot( [xl[0], xl[0]], yl, 'k-',
            linewidth = 1,
        )

        ax.set_xlim( xl )
        ax.set_ylim( yl )

        ax.set_ylabel( f'PC {i_pc + 1}' )
        ax.set_xticks( PANEL_XTICKS )

        if j_ax == 0 and i_ax == len( MODULATION_DEMO_PCS ) // 2:
            ax.set_xlabel( 'Time (s)' )
        
        if j_ax == 0 and i_ax == 0:
            ax.legend(
                fontsize = 12,
            )
            ax.text( 5, -1.5, '+ Drug',
                alpha = 0.5,
                va = 'center',
            )
    
    fig.subplots_adjust( wspace = 0.23, hspace = 0.09 )

Figure 13: Average changes in the first eight principal components of patch embeddings relative to application of baclofen (red) or t-ACPD (blue). Time shown as seconds relative to compound entering the bath (shaded box); changes shown as multiples of the baseline distribution standard deviation. Error corridors show population mean ± trace-wise bootstrapped 95% confidence interval of the mean.

Amazingly, we see that these feature axes capture consistent differences in the evoked imaging features between the two compounds, across a large array of different individual mice, slices, and experiments. This suggests that the image feature axes modulated by applied pharmacology are not arbitrary divisions in image-space, but are instead capturing regularities in the ways that these compounds’ impacts on astrocytes consistently change intracellular calcium computations in response.⁵

⁵ This accords with our previous work, in which we found one specific image feature of astrocyte calcium dynamics—the extent to which individual calcium events propagate in space—defined a subset of calcium excitations carrying its own separable information content about a distinct chemical exposure, in that case to extracellular glutamate (1).

Decoding chemical identity

The question then arises: Are these consistent differences robust enough to decode the compound applied to a particular patch of astrocyte network? We can test this by using a simple (generalized) linear readout⁶ of our PC-space to classify the applied compound. To understand in greater detail the way this information content about the applied compound dynamically unfolds, we’ll restrict ourselves to looking at decoding from the activity at a single time-point after compound application, and then vary across all time-points within the experiment. We’ll also restrict ourselves, for simplicity, to decoding the compound identity from one single ~60µm patch; this makes the decoding problem its most difficult flavor (rather than giving ourselves additional information by, for example, aggregating across patches within the same slice), in order to avoid ceiling effects.

⁶ Here, via support vector classification.

Evaluating this single-patch compound decoding performance across time reveals some interesting patterns:

Code

from oa_pub.analysis import is_between

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression


## Decoding parameters

COMPOUND_LABELS_DECODE = {
    'baclofen': 0,
    'tacpd': 1,
}
"Assign an integer class label to each compound"

COMPUTE_WINDOW_DECODE = (-60, 240)
"Time window across which to slide the decoding problem"

SUBSAMPLING_DECODE = 20
"Decoding at each frame would be time-prohibitive; this gives a downscaling factor"

N_PC_DECODE_VARIANTS = [
    4,
    8,
    12,
    16,
]
"Test decoding based off the top k PCs, for each k in this list"

N_CV_DECODE = 20
"Number of distinct cross-validation splits to use in estimating performance"

FRAC_TRAIN_DECODE = 0.8
"Fraction of data to use for training in each CV split"


## Plot parameters

PANEL_YLIM = (0.37, 1.03)
DECODE_XTICKS = [-60, 0, 60, 120, 180, 240]


##

exemplar_trace = compound_traces['baclofen'][0]
ts_rel = exemplar_trace.ts - T_INTERVENE + causal_offset
filter_decode = is_between( ts_rel, COMPUTE_WINDOW_DECODE )
idx_t_decode = np.where( filter_decode )[0]
# In order to speed computation, reduce the number of indices we decode on
idx_t_decode = idx_t_decode[::SUBSAMPLING_DECODE]


##

acc_cv = [ np.zeros( (idx_t_decode.shape[0], N_CV_DECODE) )
           for _ in N_PC_DECODE_VARIANTS ]
acc_cv_shuffled = np.zeros( (idx_t_decode.shape[0], N_CV_DECODE) )

for i_pcs_seq, n_pc_decode in enumerate( N_PC_DECODE_VARIANTS ):

    it = enumerate( idx_t_decode )
    if DEV: it = tqdm( it, total = idx_t_decode.shape[0] )

    for i_t_seq, i_t in it:
        features_cur = []
        labels_cur = []

        # Form current feature and label arrays
        for cur_compound, cur_traces in compound_traces.items():
            for trace in cur_traces:
                features_cur.append( trace.values[:n_pc_decode, i_t] )
                labels_cur.append( COMPOUND_LABELS_DECODE[cur_compound] )
        features_cur = np.array( features_cur )
        labels_cur = np.array( labels_cur )

        # Determine n's in train:test split
        n_train = int( FRAC_TRAIN_DECODE * labels_cur.shape[0] )
        n_test = labels_cur.shape[0] - n_train
        filter_train = np.r_[np.ones( (n_train,) ), np.zeros( (n_test,) )]

        for i_cv in range( N_CV_DECODE ):

            # Pick a random training set for this split
            filter_train = np.random.permutation( filter_train ).astype( bool )
            # The test set is whatever we don't train on
            filter_test = ~filter_train

            classifier = SVC( gamma = 'scale' )
            # Left to the reader as an exercise!
            # classifier = LogisticRegression()

            classifier.fit( features_cur[filter_train, :], labels_cur[filter_train] )
            
            labels_test = labels_cur[filter_test]
            labels_test_hat = classifier.predict( features_cur[filter_test, :] )
            acc_cur = (
                np.sum( labels_test_hat == labels_test )
                / labels_test.shape[0]
            )
            acc_cv[i_pcs_seq][i_t_seq, i_cv] = acc_cur
        
        #

        # Most favorable possible comparison for perms. is most PCs
        if n_pc_decode == max( N_PC_DECODE_VARIANTS ):
            # Do the permuted form with this number of PCs

            for i_cv in range( N_CV_DECODE ):

                # Build the permuted labels
                labels_cur_shuffled = np.random.permutation( labels_cur )

                # Build the train:test split for the permuted iteration
                n_train = int( FRAC_TRAIN_DECODE * labels_cur_shuffled.shape[0] )
                n_test = labels_cur_shuffled.shape[0] - n_train
                filter_train = np.r_[np.ones( (n_train,) ), np.zeros( (n_test,) )]
                filter_train = np.random.permutation( filter_train ).astype( bool )
                filter_test = ~filter_train

                classifier_shuffled = SVC( gamma = 'scale' )
                # classifier_shuffled = LogisticRegression()

                classifier_shuffled.fit(
                    features_cur[filter_train, :],
                    labels_cur_shuffled[filter_train]
                )

                labels_test_shuffled = labels_cur_shuffled[filter_test]
                labels_test_hat_shuffled = classifier_shuffled.predict( features_cur[filter_test, :] )
                acc_cur = (
                    np.sum( labels_test_hat_shuffled == labels_test_shuffled )
                    / labels_test_shuffled.shape[0]
                )
                acc_cv_shuffled[i_t_seq, i_cv] = acc_cur


##

summary_stat = lambda xs: np.mean( xs, axis = 0 )

with HouseStyle( grids = True ) as s:
    fig, ax = plt.subplots(
        figsize = (8, 5),
    )

    for i_seq, (n_pc_cur, acc_cv_cur) in enumerate(
                zip( N_PC_DECODE_VARIANTS, acc_cv )
            ):

        middle_trace = summary_stat( acc_cv_cur.T )

        middle_boots = boot_stat( acc_cv_cur.T, summary_stat,
            axis = 0,
            n = 1_000,
            rng = rng,
        )
        middle_low = np.quantile( middle_boots, 0.005, axis = 0 )
        middle_high = np.quantile( middle_boots, 0.995, axis = 0 )

        ax.plot( ts_rel[idx_t_decode], middle_trace,
            label = f'{n_pc_cur} PCs',
            zorder = 200 - i_seq * 10,
        )
        ax.fill_between( ts_rel[idx_t_decode], middle_low, middle_high,
            alpha = 0.2,
            zorder = 199 - i_seq * 10,
        )

    #

    middle_trace = summary_stat( acc_cv_shuffled.T )

    middle_boots = boot_stat( acc_cv_shuffled.T, summary_stat,
        axis = 0,
        n = 1_000,
        rng = rng,
    )
    middle_low = np.quantile( middle_boots, 0.005, axis = 0 )
    middle_high = np.quantile( middle_boots, 0.995, axis = 0 )

    ax.plot( ts_rel[idx_t_decode], middle_trace,
        color = 'k',
        label = 'Permuted',
        zorder = -100,
    )
    ax.fill_between( ts_rel[idx_t_decode], middle_low, middle_high,
        color = 'k',
        alpha = 0.2,
        zorder = 20,
    )

    xl = COMPUTE_WINDOW_DECODE
    yl = PANEL_YLIM

    ax.plot( xl, [0, 0], 'k-', linewidth = 1 )

    ax.fill_between( [0, xl[1]], yl[0], yl[1],
        color = 'k',
        alpha = 0.05,
        linewidth = 0,
        zorder = -1_000,
    )

    ax.text( 5, 0.75, '+ Drug',
        alpha = 0.5,
        va = 'center',
    )

    plt.legend( fontsize = 12 )

    s.label(
        data_ylim = (yl[0], 1.),
        yaxis_x = xl[0],
    )

    ax.set_xticks( DECODE_XTICKS )
    ax.set_xlabel( 'Time (s)' )
    ax.set_ylabel( 'Decoding accuracy' )

    ax.set_xlim( xl )
    ax.set_ylim( yl )

Figure 14: Accuracy of classifying applied pharmacology (baclofen *vs.* t-ACPD) from embeddings of a single astrocyte network patch using SVC, as a function of time after application of the compound. Colored bands indicate bootstrapped 99% confidence interval of accuracy across cross-validation replicates (each using a random 80:20 train:test split). Each trace color corresponds to a distinct number of PCs included in the decoder. The black trace indicates a null condition in which the compound labels were permuted at each time point, but decoding was performed otherwise in the same manner—*i.e.*, the random chance level for decoding performance.

Not only are we able to decode quite well the identity of the applied compound from a single patch of the astrocyte network, but it is clear that the particular richness of these imaging features is necessary in order to tease the compound differences apart: while decoding from only 4 PCs gives performance above chance, increasing to 12 or 16 included PCs dramatically increases peak decoding accuracy. This indicates that what is at play here in distinguishing systematically between the two compounds is not a simple matter of intensity, or some other single aspect of the imaging: rather, a sparse but rich subspace of the overall DINOv3 patch embedding–space contains the information content needed to determine applied chemical identity from the induced changes in the astrocyte network. This may indicate that these image features contain a nontrivial reflection, via our astrocyte calcium activity observable, of underlying differences in physiological processes recruited by the two distinct pathways engaged by baclofen and t-ACPD (i.e., GABA_B vs mGluR2/3 receptors).

Discussion

This brief foray into analyzing the OpenAstrocytes data demonstrates the remarkable richness present in astrocyte calcium dynamics, as well as the amazing potential that lies in making this data readily available in a format that slots directly into today’s AI-driven biological data science workflows.

We’re particularly interested in highlighting three key patterns we’ve seen here:

Generalist AI features contain specialized information

While DINOv3 (9) was trained on a very broad image corpus, we were able to see in the structure of embeddings inferred from the OpenAstrocyte bath application data in particular two major, distinct contributors: (1) anatomical features of local astrocyte networks, as well as (2) physiological features of the dynamics of intracellular astrocyte calcium. These two contributions were reflected in small subspaces of the full 4096-dimensional patch embedding space, indicating, as expected, that while there is a rich spectrum of image structure present in astrocyte network calcium, this activity is constrained to a smaller, more regular subspace.

The fact that this large model, trained in a largely-unbiased way on principally non-biological images, has subspaces that are so strongly reflective of detailed image features for this sub-field, is at this point unsurprising but still remarkable. This finding (and many others like it applying large AI models across other specialist domains of science that have emerged in recent years) has massive implications for the potential utility of these generalist large models to biological data science.

Heterogeneity of responses across individual fields

It is amazing to see consistency in the responses of particular subspaces to particular drugs, as demonstrated in Figure 13 and Figure 14; however, as we move forward in understanding the potential lying in these models to find new biology, we’re very excited to dig into what underlies the differences between responses in specific small pieces of the same slice of brain tissue. That neighboring ~60µm patches can exhibit drastically diverging calcium-feature responses to the same applied stimulus suggests a heterogeneity in the processing apparatus of constituent parts of the astrocyte network that is hinted at in the field but as-yet not fully elucidated. We think there’s a lot more to know about astrocytes—and brains—hidden in the characterization of these response differences.

Decoding performance from rich features

We saw great chemical decoding performance from a single frame of one ~60µm patch of a cortical slice—and this is a pessimistic estimate, given how many patches were unresponsive due to differences in expression of the calcium indicator! Similar to the above, there’s a lot to tease apart in the richness of the heterogeneity we saw, particularly in the structure time-evolution. Our previous work in (1) indicated that astrocyte networks integrate physiologic information across minutes, and it’s fantastic to see that similarly reflected in the time-course of our chemical identity decoding here.

While we didn’t need to include all of the DINOv3 embedding features to get good decoding, we needed a few. We find this pattern of results the most exciting possibility, as it suggests that the decoding performance isn’t arising from some trivial feature (like increased intensity), but instead from a richer combination of features that could generalize to more interesting predictive tasks. We are interested to see how this rich structure reflects the intricate machinery astrocyte networks have for integrating diverse chemical signals when probed with a larger swath of pharmacological space.

Looking ahead

The OpenAstrocytes corpus is one step in our larger thrust into providing open tools to understand the dynamics of astrocyte physiology, as well as our approach within Forecast to use this incredible science to understand how our chemical tools change brain networks.

Building on `atdata`

atdata’s addition of sample-typing functionality to WebDatasets already helps for keeping streamed AI datasets sane; but, where keeping track of sample types really shines is when we register mappings between types. As an example, in the the astrocytes library accompanying this pub, we have snippets like:

@atdata.lens
def from_generic( s: ts.Frame ) -> 'BathApplicationFrame':
    # 1. Performs extraction of features and
    # validation  from unstructured metadata ...
    return BathApplicationFrame(
        # 2. Wraps into a structured sample class
    )

Under the hood, the atdata.lens decorator keeps track of the full registry of lenses between sample schema types, and the way they compose. Because lenses can have a very nice compositional behavior (11), this design enables us, as we’re building AI models from a bunch of different sources, to easily and automatically aggregate across our diverse hive of WebDatasets, built across various modalities and experiments—and, to take advantage of the natural ways these data can be compatibly merged.

The full atdata ecosystem is architected to go one step further, placing metadata for type schemas of scientific datasets—and the code for these lenses between them!—in a public, decentralized repository on ATProto, the protocol layer that lies at the heart of the Bluesky social network. This allows AI coding agents on data science tasks to automatically leverage domain knowledge of structured semantic relationships to decide what data to pull from for a given task.

More on atdata soon, including our ATProto appview for the full distributed dataset network with MCP support—reach out to sign up for updates in order to hear first as we push more features to the public release!

Loving neurons, too

Of course, the brain also is one-third neurons—a fact we certainly wouldn’t want to forget. This is why we’re building out the OpenAstrocytes repository with newly-collated data from the Allen Brain Observatory (12), also formatted with atdata for easy use with existing AI workflows.

We’re continuing to collate this fantastic dataset’s movies of raw two-photon calcium imaging from neurons in visual cortex, in order to make this phenomenal source of neurobiological imaging data more accessible for augmenting workflows like AI fine-tuning; since neuronal and astrocytic imaging share tremendous overlap in latent image statistics, we anticipate that combining large datasets of similar dynamic imaging from the two cell types will provides immense benefits to the study of each.

A new kind of data

Here at Forecast, we’re building out a new experimental dataset centered not on brain slices extracted from mice, but instead on 3D in vitro systems built with human induced pluripotent stem cells. Human astrocyte dynamics are a near-completely unexplored frontier within neuroscience; we are extremely excited about the potential lying within this system, as we build out OpenAstrocytes in the next few months with more and more data taken from this completely blank canvas. Stay tuned ;).

Acknowledgements

We’d like to thank Dr. Michelle Cahill and Dr. Michael Reitman for driving the original experimental work and data collection for the images that went into the initial release of OpenAstrocytes presented here.

Shout out to our beta testers 🤩.

Support for the production of OpenAstrocytes at Forecast was generously provided by the Special Initiatives division of the Astera Institute.

Citation

Please reference this work in BibTeX as:

@article{levesque2025openastrocytes,
  author = {Maxine Levesque and Kira Poskanzer},
  title = {OpenAstrocytes},
  journal = {Forecast Research},
  year = {2025},
  note = {https://forecast.bio/research/open-astrocytes/},
}

AI use

No language model tools were used to produce any of the writing in this pub, or the original code architecture and initial implementations.

Claude helped with filling out some of the documentation. If they hallucinated, let us know in our GitHub issues, and we’ll give it a human touch!

Copyright

References

M. K. Cahill, M. Collard, V. Tse, M. E. Reitman, R. Etchenique, C. Kirst, K. E. Poskanzer, Network-level encoding of local neurotransmitters in cortical astrocytes. Nature 629, 146–153 (2024).

G. Pezzulo, T. Parr, K. Friston, Active inference as a theory of sentient behavior. Biological Psychology 186, 108741 (2024).

Y. Wang, N. V. DelRosso, T. V. Vaidyanathan, M. K. Cahill, M. E. Reitman, S. Pittolo, X. Mi, G. Yu, K. E. Poskanzer, Accurate quantification of astrocyte and neurotransmitter fluorescence dynamics for single-cell and population-level physiology. Nature neuroscience 22, 1936–1944 (2019).

A. Aizman, G. Maltby, T. Breuel, High performance i/o for large scale deep learning (2020). https://arxiv.org/abs/2001.01858.

nLab authors, Vertical categorification (2025).

M. E. Reitman, V. Tse, X. Mi, D. D. Willoughby, A. Peinado, A. Aivazidis, B.-E. Myagmar, P. C. Simpson, O. A. Bayraktar, G. Yu, others, Norepinephrine links astrocytic activity to regulation of cortical state. Nature Neuroscience 26, 579–593 (2023).

E. Fino, R. Araya, D. S. Peterka, M. Salierno, R. Etchenique, R. Yuste, RuBi-glutamate: Two-photon and visible-light photoactivation of neurons and dendritic spines. Frontiers in neural circuits 3, 556 (2009).

Y. Wang, D. J. Miller, K. Poskanzer, Y. Wang, L. Tian, G. Yu, “Graphical time warping for joint alignment of multiple curves” in Proceedings of the 30th International Conference on Neural Information Processing Systems (Curran Associates Inc., Red Hook, NY, USA, 2016)NIPS’16, pp. 3655–3663.

O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. Jégou, P. Labatut, P. Bojanowski, DINOv3 (2025). https://arxiv.org/abs/2508.10104.

10.

A. Lia, V. J. Henriques, M. Zonta, A. Chiavegato, G. Carmignoto, M. Gómez-Gonzalo, G. Losi, Calcium signals in astrocyte microdomains, a decade of great advances. Frontiers in cellular neuroscience 15, 673433 (2021).

11.

nLab authors, Lens (in computer science) (2025).

12.

S. E. de Vries, J. H. Siegle, C. Koch, Sharing neurophysiology data from the allen brain observatory. Elife 12, e85550 (2023).

tl;dr

Background

Architecture

Running this pub

Backbone: Distributed, typed WebDatasets with atdata

Datasets

Demo

Approach

DINOv3 patch embeddings capture rich, dynamic responses in astrocyte networks

Individual patch responses

Response heterogeneity across the imaging field

Components of static anatomy and dynamic physiology in patch embeddings

Astrocyte patch embeddings linearly separate physiological impacts of distinct compounds

Individual patch responses

Consistent pharmacology-specific profiles

Decoding chemical identity

Discussion

Generalist AI features contain specialized information

Heterogeneity of responses across individual fields

Decoding performance from rich features

Looking ahead

Building on atdata

Loving neurons, too

A new kind of data

Acknowledgements

Citation

AI use

Copyright

References

Backbone: Distributed, typed WebDatasets with `atdata`

Building on `atdata`