Say you want to build a model of biological tissue (e.g. muscle or hearth tissue) using a set of cells which all have their own position variables. The straight forward way of modelling this is by placing all variables in a continuous array. Let's create 1000 cells each with a position in 3D space.
int num_cells = 1000;
struct Position { int x; int y; int z; };
struct Position[] positions = malloc(num_cells * sizeof(struct Position));
The indices into this array are the handles to a cell in the model, assuming you don't shuffle the order of the position data. The moment you want to subdivide a tissue into two or more specific specialized tissues you could get a view to the array to obtain all positions belonging to one of the tissues. Let's say the first 500 cells belong to muscle tissue and the others are heart cells. We could use a simple View type to get handles to the cells.
typedef struct {
void* data;
size_t size; // number of elements
size_t elem_size; // size of each element
} View;
View slice(void* array, size_t elem_size, size_t start, size_t count) {
return (View){
.data = (char*)array + (start * elem_size),
.size = count,
.elem_size = elem_size
};
}
struct Position* muscle_tissue = (struct Position*)slice(positions, sizeof(struct Position), 0, 500);
struct Position* heart_tissue = (struct Position*)slice(positions, sizeof(struct Position), 500, 500);
// iterating through the muscle_tissue
for (size_t i=0; i<muscle_tissue.size; i++) {
muscle_tissue.data[i]++;
}
Now you can access a subset of the cells to perform operations on this subset. This approach to organizing your models works well if all your components operate at the same scale, there are no complex hierarchies. Model components have a one-to-one relationship and the memory layout is cache-friendly and well suited for SIMD (Single Input, Multiple Data) optimization.
But what if there is no one-to-one relation between model components? Such a situation occurs when there are components at different scales which dependent on each other. For example, when computing the mean fiber orientation from the positions in muscle_tissue, or when global tissue stress influences individual cell behavior. Ofcourse you could just create a new variable which stores the global variables representing the fiber properties:
struct Fiber { View pos; double orientation; double tension; };
But this is very cumbersome and not very elegant. It takes a lot of attention to not lose track of all relations between model components and all data arrays needs to kept up to data. Also the moment you want to extend your model to include more muscle, diversify the tissue, or add even more scales at the top or bottom of the hierarchy you run will end up creating a spagetti of dependencies between variables which quickly will become unmanagable. If you are not careful the indices become unstable handles to the data, and adding new components or scales requires extensive refactoring. Using a basic slicing method becomes unmanagable for any model with more than two levels of hierarchy. To prevent running into such problems we need a data structure which makes the hierarchical, multi-scale relationships explicit without losing the cache-friendly allocation of data components.
Inspired by component system architectures I use integer handles to represent both variables (e.g. a position, a velocity, a membrane potential) and collections of variables (e.g. a cell, a membrane, a muscle) using unique integer based IDs. Using such handles, defining a model becomes a question of moving around such IDs and writing functionality to make it this an easy process. This way we separate the actual allocation of the model data from the model design. Scientists think in terms of cells, tissues, organs, and ecosystems - not of memory adresses and SIMD registers. Humans tend to model systems as they appear in nature, with clear physical boundaries and dynamics that depend only on local neighborhoods. However, computers excel at performing similar operations in bulk, especially when data is aligned for SIMD operations. Traditional approaches often sacrifice readability and extensibility for performance, forcing researchers to re-optimize their code whenever the model changes. By abstracting the model structure, researchers can focus on implementing the dynamics of individual systems while maintaining a flexible, human-readable way to define the model and its component relationships.
Generating unique IDs could be done with a simple pattern:
static size_t gen_ID() {
static atomic_size_t uid = 0;
return atomic_fetch_add(&uid, 1);
}
uid increments atomically each time genID() is called ensuring unique IDs.
Component specific information can then be kept using arrays of enums:
typedef enum {
INACTIVE,
VARIABLE,
PARAMETER,
COLLECTION,
} ComponentType;
Now we have a way of creating model components by generating unique IDs.
test "create components" {
const size_t capacity = 10;
ComponentType components[capacity] = {};
const size_t x = gen_ID();
const size_t y = gen_ID();
components[x] = VARIABLE;
components[y] = PARAMETER;
assert(components[x] == VARIABLE);
assert(components[y] == PARAMETER);
assert(components[2] == INACTIVE);
}
The second part of the data structure is a graph which handles relationships between components (i.e. IDs). To build such associations we need a container to store these relations, an adjacency matrix (relations). For convenience we store the components and relations in a registry.
typedef struct {
size_t size;
size_t capacity;
ComponentType* components;
size_t* relations;
} Registry;
bool init(Registry* reg, size_t capacity) {
reg->components = malloc(capacity * sizeof(ComponentType));
if (!reg->components) return -1;
reg->relations = malloc(capacity * capacity * sizeof(size_t));
if (!reg->relations) return -1;
reg->size = 0;
reg->capacity = capacity;
for (size_t i = 0; i < capacity; ++i) reg->components[i] = INACTIVE;
for (size_t i = 0; i < capacity * capacity; ++i) reg->relations[i] = 0;
return 1;
}
void cleanup(Registry* reg) {
free(reg->components);
free(reg->relations);
free(reg);
}
bool add(Registry* reg, size_t* uid, ComponentType ctype) {
*uid = gen_ID(); // Local variable
if (*uid >= reg->capacity) {
fprintf(stderr, "registry is at full capacity");
return -1;
}
reg->components[*uid] = ctype;
reg->size++;
return 1;
}
bool has(Registry reg, size_t uid) {
return uid < reg.capacity && reg.components[uid] != INACTIVE;
}
Each component can parent a child component as long as the child component is not already a parent of the parent or is down stream in the ancestry-tree of the child. This ensures the acyclic property and enforces the hierarchical nature of the graph. Multi-scale models often have a hierarchical tree like structure in which components at a higher abstraction level contain many similar components at a lower abstraction level. Therefore the directed acyclic graph structure is ideal to represent such an architecture. We can provide easy to use functionality to build up the graph by providing the give and assemble functions below.
bool give(Registry* reg, size_t parent, size_t child, size_t copies) {
if (!has(*reg, parent) || !has(*reg, child) || isDescendant(reg, child, parent)) {
return -1;
}
reg->relations[IDX(parent, child, reg->capacity)] = copies;
return 1;
}
bool assemble(Registry* reg, size_t* parent, size_t* children, size_t num_children, size_t copies) {
add(reg, parent, COLLECTION);
for (size_t i=0; i < num_children; i++) {
if (!give(reg, *parent, children[i], copies)) {
return -1;
}
}
return 1;
}
As one can see we need some additional functionality to get a valid registration of relations between components. By valid I mean we would not want circular relations between components. For example, we would like to make a leaf and a root children of a tree, but then it should be prevented to make tree a child of root.
size_t x;
add(®, &x, VARIABLE);
printf("%i \n", has(reg, 3));
We write the functions isDescendent and hasParentInAncestry to find out if the requested relation is valid:
// macro for 2D indexing
#define IDX(row, col, ncol) ((row) * (ncol) + (col))
bool isDescendant(Registry* reg, size_t ancestor, size_t node) {
if (ancestor == node) return false;
if (ancestor >= reg->capacity || node >= reg->capacity) return false;
bool *visited = calloc(reg->capacity, sizeof(bool));
if (!visited) { return false; }
size_t *stack = malloc(reg->capacity * sizeof(size_t));
if (!stack) {
free(visited);
return false;
}
size_t sp = 0;
stack[sp++] = node;
visited[node] = true;
bool found = false;
while (sp > 0) {
size_t cur = stack[--sp];
for (size_t i = 0; i < reg->capacity; ++i) {
size_t rel = reg->relations[IDX(i, cur, reg->capacity)];
if (rel > 0) {
if (!visited[i]) {
if (i == ancestor) {
found = true;
break;
}
visited[i] = true;
stack[sp++] = i;
}
}
}
}
free(stack);
free(visited);
return found;
}
With minimalistic interface in place it is fairly straighforward to create complex model structures. Let's build up a system of muscle fibers in a flexible way with the above defined graph. We will give all components a position in three dimensions, a cells get a velocity component, and fibers an orientation and tension component. Each fiber consists of 10e5 cells, and we have 42 fibers in total, constituting the muscle tissue:
size_t x, y, z, position, dx, dy, dz, velocity, cell, fiber, muscle;
add(®, &x, VARIABLE);
add(®, &y, VARIABLE);
add(®, &z, VARIABLE);
assemble(®, &position, (size_t[]){x, y, z}, 3, 1);
add(®, &dx, VARIABLE);
add(®, &dy, VARIABLE);
add(®, &dz, VARIABLE);
assemble(®, &velocity, (size_t[]){dx, dy, dz}, 3, 1);
assemble(®, &cell, (size_t[]){position, velocity}, 2, 1);
assemble(®, &fiber, (size_t[]){position, cell}, 2, 10e5);
assemble(®, &muscle, (size_t[]){position, fiber}, 2, 42);
This will create the set of relations summarized in the table below. We now track all model components using unique IDs (e.g. x, cell, fiber, muscle) which function as flexible building blocks of the model. Now we have an interface for building models in the language of hierarchies. The directed acyclic graph holds all information on the relations between components, how many allocations should be made of each component and essentially where one should look to find a particular data point belonging to a specific part of the model.
| ID | Name | Type | Children ID(copies) |
| :--- | :------: | :--------: | :---------------------------------: |
| 0 | x | VARIABLE | |
| 1 | y | VARIABLE | |
| 2 | z | VARIABLE | |
| 3 | position | COLLECTION | 0(1), 1(1), 2(1) |
| 4 | dx | VARIABLE | |
| 5 | dy | VARIABLE | |
| 6 | dz | VARIABLE | |
| 7 | velocity | COLLECTION | 4(1), 5(1), 6(1) |
| 8 | cell | COLLECTION | 3(1), 7(1) |
| 9 | fiber | COLLECTION | 3(1000000), 8(1000000) |
| 10 | muscle | COLLECTION | 3(42), 9(42) |
We have not allocated any data components yet. In a next article I will describe how allocation can be done with SIMD operations in mind, and implement a view interface to expose the subsets of the data using the graph for specific computations on parts of the data.