Toolbox

A curated collection of tools, frameworks, and technologies I frequently use in my projects and recommend.

HPC Workflow Managers

Tools for creating and managing complex, scalable, and reproducible computational pipelines in high-performance computing environments. They are particularly useful when working with heterogeneous inputs that may require different processing steps, compute resources, or compute environments.

Snakemake Used to be the OG tool for HPC workflow management but has since been surpassed by Nextflow. That said, the newest version is still very close to Nextflow in terms of features and ease of use, and it remains my go-to because of my personal experience with it.
Nextflow A more modern tool with more features, including environment management, container support, and better logging. It also has more ready-made pipelines available that can be used off the shelf.
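To make the "heterogeneous inputs" point concrete, here is a toy sketch in plain Python of the routing a workflow manager does for you: each input is matched to a processing step that carries its own resource request and environment. This is not Snakemake or Nextflow syntax, and the step names, resource numbers, and environment names are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Step:
    """One processing step plus the resources and environment it needs."""
    name: str
    cores: int
    mem_gb: int
    environment: str  # hypothetical conda env or container image name


# Hypothetical routing table: file suffix -> step that handles that input type.
ROUTES = {
    ".fastq": Step("align_reads", cores=16, mem_gb=64, environment="aligner-env"),
    ".csv": Step("summarize_table", cores=1, mem_gb=4, environment="stats-env"),
    ".tif": Step("segment_image", cores=4, mem_gb=32, environment="imaging-env"),
}


def plan(inputs):
    """Pair each input file name with the step that should process it."""
    jobs = []
    for path in map(Path, inputs):
        step = ROUTES.get(path.suffix)
        if step is None:
            raise ValueError(f"no step registered for {path.name}")
        jobs.append((path.name, step))
    return jobs


if __name__ == "__main__":
    for filename, step in plan(["sample1.fastq", "visits.csv", "slide7.tif"]):
        print(f"{filename}: {step.name} "
              f"({step.cores} cores, {step.mem_gb} GB, {step.environment})")
```

A real workflow manager adds what this sketch leaves out: dependency tracking between steps, job submission to the cluster scheduler, and re-running only what changed.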

Data Collection/Entry

Platforms and tools designed for systematic and structured data gathering for research purposes. See my blog post for more details.

REDCap The only option for collecting HIPAA-regulated data and PII, though it requires your university to have a local IT department to manage it.
Directus Meant to be a headless CMS; however, it's an excellent fit for creating a Django-like interface, both to define a schema and to build forms for collecting data. It also has a lot of features for automation and visualization.
Epicollect As simple to use as Google Forms; however, I prefer it over Google Forms because of its API (which allows automation) and the ability to use it from a tablet or offline.
Animal Observer A niche tool for recording primate behavior in the field; however, it is the only tool for this purpose.

Databases

Systems for managing structured data.

PostgreSQL Unbeatable thanks to the number of plugins it has. It can have features like full-text search, temporal data types, and more. With its JSONB support, it can replace MongoDB for most use cases in research.
VictoriaMetrics Drop-in replacement for Prometheus, InfluxDB, Graphite, and OpenTSDB with better performance and compression. Its query language is also a superset of PromQL with additional features such as sorting by label.
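As an illustration of the sort-by-label feature, here is a minimal sketch that builds a request URL for the Prometheus-compatible instant query endpoint. It assumes a single-node VictoriaMetrics instance on its default port at localhost, and the metric name `node_load1` is only a placeholder for whatever you actually collect.

```python
from urllib.parse import urlencode

# Assumption: single-node VictoriaMetrics listening on its default port.
BASE_URL = "http://localhost:8428"


def instant_query_url(query, base_url=BASE_URL):
    """Build a URL for the Prometheus-compatible instant query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': query})}"


# Placeholder metric name; sort_by_label is an extension on top of PromQL
# that orders the resulting series by the value of the given label.
QUERY = 'sort_by_label(avg(node_load1) by (instance), "instance")'

if __name__ == "__main__":
    print(instant_query_url(QUERY))
```

The printed URL can be fetched with any HTTP client. The same query sent to a stock Prometheus server would be rejected, since `sort_by_label` is not part of standard PromQL.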

Data Visualization

Tools for quick data visualization and querying.

Metabase A data visualization tool that lets users create interactive, shareable visualizations from their data. It supports OAuth and self-hosting, and the latest versions also have data entry features.
Grafana The go-to tool for visualizing time series data and metrics. I also prefer its alerting system over Prometheus' Alertmanager.

Automation

Tools that I have used over the years to automate my or others' workflows.

Cronjobs Irreplaceable for running simple bash scripts at regular intervals. Add heartbeat monitoring to make them robust.
GitHub Actions Self-hosted runners allow researchers without DevOps experience to deploy updates or new versions of their tools automatically to specific servers.
Windmill I initially started using it as a central place to control automation workflows and logs; however, it is now a full-fledged platform for automating workflows and data pipelines. See my blog post for more details.
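The heartbeat idea mentioned for cron jobs can be sketched as a small wrapper: run the job, and ping a monitoring URL only when it succeeds, so the monitor can alert when pings stop arriving. This is a minimal sketch, not a specific service's client; the heartbeat URL and the script paths in the comment are hypothetical.

```python
import subprocess
import sys
import urllib.request

# Hypothetical endpoint; substitute the ping URL from your monitoring service.
HEARTBEAT_URL = "https://heartbeat.example.com/ping/nightly-backup"


def ping(url, timeout=10):
    """Send a GET request to the heartbeat URL and return the HTTP status."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def run_with_heartbeat(command, url=HEARTBEAT_URL, notify=ping):
    """Run the command; notify the heartbeat only if it exits with status 0."""
    result = subprocess.run(command)
    if result.returncode == 0:
        notify(url)
    return result.returncode


# Example crontab entry (paths are hypothetical):
#   0 2 * * * /usr/bin/python3 /opt/jobs/heartbeat.py /opt/jobs/backup.sh
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run_with_heartbeat(sys.argv[1:]))
```

Pinging only on success means both a failing job and a job that never ran look the same to the monitor: a missed heartbeat.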

Secure Networking

Tools for creating and managing secure networks.

Cloudflare Zero Trust A secure networking service that allows exposing internal services to the internet with minimal configuration. It also has an 'Access' feature that puts authentication/authorization in front of a service. It supports HTTP(S) services through the browser and most other protocols (SSH, SMB, RDP, VNC, or any arbitrary TCP) through the cloudflared client. It is the only service I use that is not self-hosted, because it is just so good at keeping everything secure. I'd ultimately like to replace it with Pangolin.