AI Toolkit

The projects listed here can help with developing, deploying, and managing AI services.

Machine Learning

Framework
Ray Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
ZenML Develop ML pipelines locally that run on any MLOps stack.
Prefect Modern workflow orchestration for data and ML engineers.
Platform
Kubeflow Machine Learning Toolkit for Kubernetes.
Weights & Biases Weights & Biases helps AI developers build better models faster. Quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, and manage your ML workflows end-to-end.
MLflow Open source platform for the machine learning lifecycle.
Library
scikit-learn Machine learning in Python.
XGBoost Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on a single machine, Hadoop, Spark, Dask, Flink and DataFlow.
Darts A Python library for user-friendly forecasting and anomaly detection on time series.
OpenCV Open Source Computer Vision Library.

Model

Format & Interface
ONNX Open standard for machine learning interoperability.
Workflow
Airflow A platform to programmatically author, schedule, and monitor workflows.
NiFi NiFi automates cybersecurity, observability, event streams, and generative AI data pipelines and distribution for thousands of companies worldwide across every industry.

Deep Learning

Framework
TensorFlow An open source machine learning framework.
PyTorch Tensors and dynamic neural networks in Python with strong GPU acceleration.
Library
Keras Deep Learning for humans.
PyTorch Lightning Deep learning framework to train, deploy, and ship AI products Lightning fast.
RAPIDS RAPIDS provides unmatched speed with familiar APIs that match the most popular PyData libraries. Built on state-of-the-art foundations like NVIDIA CUDA and Apache Arrow, it unlocks the speed of GPUs with code you already know.
OpenMMLab Covers a wide range of research topics of computer vision, e.g., classification, detection, segmentation and super-resolution.

Programming

Language
Python The Python programming language.
Library
Dask Parallel computing with task scheduling.
NumPy The fundamental package for scientific computing with Python.
Hydra Hydra is a framework for elegantly configuring complex applications.
SciPy SciPy library main repository.

Notebook Environment

Notebook Environment
Jupyter Jupyter Interactive Notebook.
Colab Python libraries for Google Colaboratory.

Distributed Computing

Computing & Management
Docker A platform for building, shipping, and running applications in containers.
Podman A tool for managing OCI containers and pods.
Kubernetes An open-source system for automating deployment, scaling, and management of containerized applications.
Spark A unified analytics engine for large-scale data processing.
Portainer Portainer is the most versatile container management software that simplifies your secure adoption of containers with remarkable speed.
OpenShift Unified platform to build, modernize, and deploy applications at scale. Work smarter and faster with a complete set of services for bringing apps to market on your choice of infrastructure.
ArgoCD Argo CD is a declarative, GitOps continuous delivery tool for Kubernetes.

Data

Relation DB
MySQL MySQL Server, the world's most popular open source database, and MySQL Cluster, a real-time, open source transactional database.
Postgres An open source object-relational database system.
Storage & Format
Delta Lake An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and with APIs for several programming languages.
InfluxDB Scalable datastore for metrics, events, and real-time analytics.
pandas Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.
Versioning
DVC ML Experiments Management with Git.
Operations
Whylogs An open-source data logging library for machine learning models and data pipelines. Provides visibility into data quality & model performance over time. Supports privacy-preserving data collection, ensuring safety & robustness.
AI system Logging & Monitor Logging and monitoring for AI systems (RECICLAI).
Hive Apache Hive is a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale and facilitates reading, writing, and managing petabytes of data residing in distributed storage using SQL.
ETL
Airbyte The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
Feature Engineering
tsfresh tsfresh is a Python package that automatically calculates a large number of time series characteristics, the so-called features. The package also contains methods to evaluate the explanatory power and importance of such characteristics for regression or classification tasks.
Stream Processing
Kafka Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Flink Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Visualization
D3 Bring data to life with SVG, Canvas and HTML.
Plotly-Dash Data Apps & Dashboards for Python. No JavaScript Required.
Grafana The open and composable observability and data visualization platform. Visualize metrics, logs, and traces from multiple sources like Prometheus, Loki, Elasticsearch, InfluxDB, Postgres and many more.
Prometheus The Prometheus monitoring system and time series database.
Streamlit A faster way to build and share data apps.
Kibana Run data analytics at speed and scale for observability, security, and search with Kibana. Powerful analysis on any data from any source, from threat intelligence to search analytics, logs to application monitoring, and much more.
Gradio Gradio is the fastest way to demo your machine learning model with a friendly web interface so that anyone can use it, anywhere!
Pipeline Management
TPOT TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
Labeling & Annotation
Label Studio Label Studio is a multi-type data labeling and annotation tool with standardized output format.
CVAT Annotate better with CVAT, the industry-leading data engine for machine learning. Used and trusted by teams at any scale, for data of any scale.
Supervisely Develop AI faster and better with on-premise, enterprise-grade end-to-end solution for every task: from labeling to building production models.

Validation

Validation
Evidently AI Evidently helps analyze and monitor the quality of machine learning models in production. It generates detailed reports on data drift and model performance, using visualizations to identify significant changes in input data or model performance.
Whylogs Whylogs is a lightweight and scalable library for logging and monitoring ML data in production. It provides statistical profiles of input and output data, facilitating the detection of data drift and anomalies in real-time or batch data.
Prometheus & Grafana Although not specific to ML, they can be adapted to monitor ML model metrics, including production accuracy. By defining custom metrics that reflect model performance, they can be used to capture and visualize data drift or model drift, though this requires manual configuration and clear metric definitions.
Alibi Detect Specialized in anomaly and data drift detection, Alibi Detect offers a series of techniques and algorithms designed specifically to identify changes in input data and model behavior, which may indicate the need for retraining.
MLPerf (and MLCommons) MLPerf is a suite of benchmarks that evaluates the performance of hardware, software, and machine learning models. It provides standardized metrics that allow comparing different implementations and configurations of ML, helping to identify best practices and optimizations in the field of machine learning.
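
As a rough illustration of the "custom metrics" idea in the Prometheus & Grafana entry above, the sketch below computes a rolling accuracy and renders it as a gauge in the Prometheus text exposition format. It uses only the Python standard library. The metric name `model_rolling_accuracy`, the `model` label, the window size, and the helper names `RollingAccuracy` and `to_exposition` are all assumptions made for this example, not part of any listed project. In a real deployment the official `prometheus_client` package would normally handle metric registration and exposure.

```python
from collections import deque


class RollingAccuracy:
    """Accuracy over the most recent `window` labelled predictions (hypothetical helper)."""

    def __init__(self, window: int = 100) -> None:
        # 1 for a correct prediction, 0 otherwise; old entries fall off automatically.
        self._hits = deque(maxlen=window)

    def observe(self, predicted, actual) -> None:
        self._hits.append(1 if predicted == actual else 0)

    def value(self) -> float:
        # Report 0.0 until at least one labelled prediction has arrived.
        return sum(self._hits) / len(self._hits) if self._hits else 0.0


def to_exposition(name: str, help_text: str, value: float, labels: dict) -> str:
    """Render one gauge sample as Prometheus exposition text.

    Label values are not escaped here, so this sketch assumes they contain
    no quotes, backslashes, or newlines.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{{{label_str}}} {value}\n"
    )


accuracy = RollingAccuracy(window=4)
# (predicted, actual) pairs; only the last four fall inside the window.
for predicted, actual in [(1, 1), (0, 1), (1, 1), (0, 0), (1, 1)]:
    accuracy.observe(predicted, actual)

payload = to_exposition(
    "model_rolling_accuracy",  # hypothetical metric name
    "Accuracy over the last 4 labelled predictions",
    accuracy.value(),
    {"model": "demo"},  # hypothetical label
)
print(payload)
```

A Prometheus server could scrape text like this from an HTTP endpoint, and a Grafana panel or alert rule could then track the gauge over time. A sustained drop would be one possible signal of model drift.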