Topological sort
I was fiddling with graphlib in the Python stdlib and found it quite nifty. It processes a Directed Acyclic Graph (DAG), where tasks (nodes) are connected by directed edges (dependencies), and returns the correct execution order. The "acyclic" part ensures no circular dependencies.
Topological sorting is useful for arranging tasks so that each one follows its dependencies. It's widely used in scheduling, build systems, dependency resolution, and database migrations.
For example, consider these tasks:
- Task A must be completed before Tasks B and C.
- Tasks B and C must be completed before Task D.
This can be represented as:
{{< mermaid >}} graph TD A --> B A --> C B --> D C --> D {{</ mermaid >}}
Here, A can start right away, B and C follow after A, and D is last, depending on both B and C.
The task order can be determined as:
- A
- B and C (in parallel, since both depend only on A)
- D (which depends on both B and C)
This method ensures tasks are executed in the right sequence while respecting all dependencies.
To resolve the above-mentioned case with graphlib, you'd do the following:
Running this will print the following:
Since Python's stdlib already has graphlib, I thought I'd write a sloppy one in Go to learn the mechanics of how it works.
Writing a topological sorter in Go
The API will be similar to what we've seen in the graphlib example.
Defining the graph structure
First, we need a graph structure to hold the tasks and their dependencies. We'll use an adjacency list to represent the graph, and a map to track the in-degree of each node (how many tasks it depends on).
Here:
- vertices: a list of tasks that each node points to (i.e., its dependents).
- inDegree: how many tasks must finish before each task can be processed.
- queue: tasks that can be processed because they have no unmet dependencies.
- active: how many tasks are currently ready for processing.
Adding dependencies
Next, we'll define how one task depends on another. The AddEdge function sets up this relationship, ensuring the source task knows it must finish before the destination task can proceed.
- The destination task is added to the list of tasks that the source task points to, marking the dependency.
- The in-degree of the destination task is increased by 1 because it depends on the source task.
- If the source task is new, we initialize its in-degree to 0.
Initializing and processing tasks in batches
Now we'll initialize the graph by identifying tasks that can be processed immediately - those with an in-degree of 0 (i.e., they have no dependencies). We then process tasks batch by batch.
- This function finds tasks with an in-degree of 0 (no dependencies) and adds them to the processing queue.
- The active count keeps track of how many tasks are ready to run.
Processing each batch of tasks
We use GetReady to retrieve the next batch of tasks that are ready for processing. These are tasks with no unmet dependencies.
- GetReady pulls the current batch of tasks from the queue and clears it for the next batch.
- Tasks are returned in the order they are ready to be processed.
Marking the processed tasks as done
Once a batch of tasks is completed, we mark them as done and reduce the in-degree of any tasks that depend on them.
- For each completed task, we reduce the in-degree of any dependent tasks.
- If a dependent task's in-degree reaches 0, it's added to the queue and is now ready to be processed in the next batch.
Running the full topological sort
Finally, we'll implement the TopologicalSortBatch function, which processes all tasks in batches until none are left.
- Prepare loads the first set of tasks that can be processed.
- IsActive checks if there are any tasks left to process.
- GetReady retrieves the next batch of tasks to process.
- Done marks tasks as finished, allowing dependent tasks to be processed next.
Using the sorter
You can use the API as follows:
This will return:
Here, A needs to run first. B and C can run in parallel after A finishes, and only then can D run.
Complete example
Here's the full implementation, heavily annotated for clarity:
You can use this to make custom task orchestrators.
Fin!
Discussion in the ATmosphere