Production data cleaning engine — from installation to every feature, with working examples.
CleanR installs as a regular Python package. Once installed, the cleanr command is available everywhere on your system — no virtual environment needed.
Run python3 --version in your terminal. Any version from 3.9 onward works.
This makes pip-installed commands like cleanr available in every terminal session.
The folder contains pyproject.toml and the cleanr/ package directory.
Run cleanr --help from any directory. You should see the full flag reference.
# Step 2 — Add ~/.local/bin to PATH (run once, then restart your terminal)
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Step 3 — Clone the repo and install
git clone https://github.com/Omensah-15/CleanR-v3.git
cd CleanR-v3
pip install -e . --break-system-packages

# Step 4 — Verify (works from any directory)
cleanr --help
pip install pyarrow --break-system-packages # adds .parquet input/output support
You can now run cleanr from any directory in any terminal — no activation or setup needed.

Point CleanR at any data file. It detects the format and encoding automatically, runs the full pipeline, and saves the result.
cleanr data.csv
This saves data_clean.csv alongside your input, plus a quality report and audit log. Here is what the output looks like:
CleanR v3.0.0
Format: CSV
Encoding: utf-8
Delimiter: ','
Loaded: 5,200 rows x 14 cols (0.03s)
Profile: quality Fair 68/100

Pipeline:
+ [normalize_columns] Normalized 11/14 column names to snake_case
+ [trim_whitespace] Trimmed and normalised whitespace in 5 string columns
+ [remove_duplicates] Removed 47 duplicate rows (5,200 -> 5,153)
+ [handle_missing] Imputed 312 nulls in 'age' via knn_k5
+ [handle_missing] Imputed 28 nulls in 'salary' via median (fill=62450.0)
+ [type_coercion] Coerced 'join_date' -> datetime64[ns]
+ [type_coercion] Coerced 'country' -> category
+ [format_validator] Flagged 23 (0.4%) invalid email values in 'email'
+ [memory_optimize] Memory: 0.44 MB -> 0.29 MB (saved 0.15 MB, 34%)

Post-clean: quality Excellent 96/100 (0.31s)
Done: 0.41s | 5,153 rows x 15 cols | Quality: Excellent (96/100)
Report: data_clean.report.json
Audit: data_clean.audit.json
You can also name your output file explicitly:
cleanr messy_data.xlsx tidy_output.csv
By default, CleanR uses KNN imputation automatically when a column is 5-40% missing. No flag needed — it happens on every run.
# Auto strategy: KNN for mid-missing numerics, mode for categoricals, ffill for dates
cleanr data.csv

# Force KNN for everything
cleanr data.csv --impute-strategy knn

# Use median/mode only (faster, no ML)
cleanr data.csv --impute-strategy median

# Fill everything with a constant
cleanr data.csv --impute-strategy constant --fill "N/A"
cleanr data.csv --drop-na # Removes every row that contains at least one empty cell — overrides imputation
cleanr data.csv --drop-col-threshold 0.8 # Drops any column that is 80% or more empty
Outlier detection uses Isolation Forest + IQR consensus — a row must be flagged by both methods before it is acted on. This minimises false positives.
cleanr data.csv --detect-outliers # Adds a '_is_outlier' boolean column — you decide what to do with flagged rows
cleanr data.csv --detect-outliers --outlier-method remove
# contamination = expected fraction of outliers (default: 0.02 = 2%)
# Use 0.05 for dirtier data, 0.01 for cleaner data
cleanr data.csv --detect-outliers --outlier-contamination 0.05
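The consensus rule itself is easy to picture in plain Python. The sketch below pairs the IQR test with a MAD-based robust z-score standing in for the Isolation Forest half (the real engine uses Isolation Forest, which needs scikit-learn); a value counts as an outlier only when both detectors agree:

```python
import statistics

def iqr_flags(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v < lo or v > hi for v in values]

def mad_flags(values, cutoff=3.5):
    """Flag points with a large robust (median/MAD) z-score."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > cutoff for v in values]

def consensus_flags(values):
    """A point is an outlier only if BOTH detectors flag it."""
    return [a and b for a, b in zip(iqr_flags(values), mad_flags(values))]

data = [12, 14, 13, 15, 14, 13, 12, 15, 14, 250]
print(consensus_flags(data)[-1])   # True: the extreme point is flagged by both
```

Requiring agreement between an isolation-style detector and a simple spread-based test is what keeps the false-positive rate low: either method alone flags borderline points the other does not.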
cleanr data.csv --keep id,name,email,signup_date,country # All other columns are discarded
cleanr data.csv --drop internal_notes,debug_flag,temp_col
cleanr data.csv --rename fname=first_name lname=last_name cust_id=id # Format: OLD=NEW — space-separated pairs
# "John Smith" -> first_name="John", last_name="Smith"
cleanr contacts.csv --split full_name "first_name,last_name" " "

# "Austin,TX,78701" -> city, state, zip
cleanr data.csv --split address "city,state,zip" ","
cleanr data.csv --add email_backup=email original_id=id
*Parquet requires pip install pyarrow --break-system-packages
# CSV input -> Excel output
cleanr data.csv output.xlsx --output-format xlsx

# Excel input -> JSON output
cleanr report.xlsx report.json --output-format json

# CSV -> TSV (tab-separated)
cleanr data.csv data.tsv --output-format tsv

# CSV -> JSONL (one object per line)
cleanr data.csv data.jsonl --output-format jsonl
cleanr old_export.csv --encoding latin1
# Options: utf-8, utf-16, latin1, cp1252, iso-8859-1
# Usually not needed — CleanR auto-detects encoding
Write a JSON rules file, pass it with --rules. Each rule targets a column and checks a condition. Violations are flagged or removed.
[
  { "column": "age", "type": "min", "value": 0 },
  { "column": "age", "type": "max", "value": 120 },
  { "column": "email", "type": "not_null" },
  { "column": "status", "type": "allowed_values", "values": ["active", "inactive", "pending"] },
  { "column": "phone", "type": "regex", "pattern": "^\\+?[\\d\\s\\-().]{7,20}$" }
]
# Flag violating rows (adds '_rule_violation' column)
cleanr data.csv --rules rules.json

# Remove violating rows
cleanr data.csv --rules rules.json --rule-action remove
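To make the rule semantics concrete, here is a minimal, hypothetical evaluator for the rule types shown above. It is not CleanR's implementation, just a sketch of the same contract (min, max, not_null, allowed_values, regex) applied to one row represented as a dict:

```python
import re

def violates(rule, row):
    """Return True if `row` (a dict of column -> value) violates `rule`."""
    value = row.get(rule["column"])
    kind = rule["type"]
    if kind == "not_null":
        return value is None or value == ""
    if value is None:
        return False                      # other rule types skip missing values
    if kind == "min":
        return value < rule["value"]
    if kind == "max":
        return value > rule["value"]
    if kind == "allowed_values":
        return value not in rule["values"]
    if kind == "regex":
        return re.fullmatch(rule["pattern"], str(value)) is None
    raise ValueError(f"unknown rule type: {kind}")

rules = [
    {"column": "age", "type": "min", "value": 0},
    {"column": "age", "type": "max", "value": 120},
    {"column": "status", "type": "allowed_values",
     "values": ["active", "inactive", "pending"]},
]
row = {"age": 150, "status": "active"}
print([violates(r, row) for r in rules])   # [False, True, False]
```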
# Read and process 50,000 rows at a time
cleanr huge_file.csv --chunk 50000
# --quick skips schema inference, KNN imputation, and memory optimisation
cleanr 10gb_log.csv clean.csv --chunk 100000 --quick --no-fingerprint
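Chunked processing in general follows the shape below: read a slice, clean it, append it to the output, discard it, so memory stays flat regardless of file size. This is a generic illustration using Python's csv module, not CleanR's internal loop:

```python
import csv
import io

def clean_chunk(rows):
    """Toy cleaning step: trim whitespace in every cell."""
    return [[cell.strip() for cell in row] for row in rows]

def clean_in_chunks(reader, writer, chunk_size=2):
    """Stream rows through a cleaning step, chunk_size rows at a time."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            writer.writerows(clean_chunk(chunk))
            chunk = []
    if chunk:                              # flush the final partial chunk
        writer.writerows(clean_chunk(chunk))

src = io.StringIO("a, b\n 1,2 \n3, 4\n5,6\n")
dst = io.StringIO()
clean_in_chunks(csv.reader(src), csv.writer(dst))
print(dst.getvalue().splitlines())   # ['a,b', '1,2', '3,4', '5,6']
```

Only `chunk_size` rows are ever held in memory at once, which is why pairing --chunk with --quick works on files far larger than RAM.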
# .gz, .bz2, .xz, and .zip are detected automatically
cleanr data.csv.gz data_clean.csv
All steps are on by default. Use --no-* flags to disable individual ones while keeping everything else running.
cleanr data.csv --no-normalize          # keep original column names
cleanr data.csv --no-dedup              # keep duplicate rows
cleanr data.csv --no-trim               # do not strip whitespace
cleanr data.csv --no-auto-types         # do not change column types
cleanr data.csv --no-validate-formats   # skip email/phone/URL checks
cleanr data.csv --no-drop-constant      # keep constant columns
With --bare, every default pipeline step is disabled. Only the flags you explicitly pass will run. Use this when you want a minimal, exact pipeline with no automatic side effects.
# Only fill missing values — nothing else touches the data
cleanr data.csv --bare --fill "nan"

# Only remove duplicates and fill missing values
cleanr data.csv --bare --dedup --fill "nan"

# Only normalize column names and strip whitespace
cleanr data.csv --bare --normalize --trim

# Only KNN imputation and outlier flagging
cleanr data.csv --bare --impute-strategy knn --detect-outliers
Use the --no-* flags to turn off individual steps while keeping everything else; use --bare when you want to start from nothing and opt in to only what you need.

cleanr data.csv --quiet
# Prints nothing. Exit code 0 = success, 1 = error.
# Safe to use in scripts and automated pipelines.
cleanr data.csv --report reports/quality.json --audit logs/audit.json
Every run automatically creates three files.
data_clean.csv — the cleaned dataset: normalized column names, corrected data types, duplicates removed, whitespace stripped, missing values imputed. Ready for any analytics tool.
data_clean.report.json — the full quality report: before/after row counts, a five-dimension quality score (Completeness, Uniqueness, Validity, Consistency, Accuracy), per-column statistics (mean, std, percentiles, skewness, kurtosis, normality p-value), detected anomalies catalogued as Critical/Warning/Info, and SHA-256 integrity fingerprints for both input and output.
data_clean.audit.json — a timestamped log of every action taken: which plugin ran, what it changed, elapsed time. Use it for compliance, debugging, or reproducing a cleaning run exactly.
{
  "quality_score": 96.0,
  "quality_label": "Excellent",
  "dimensions": [
    { "name": "Completeness", "score": 100.0, "weight": 0.30 },
    { "name": "Uniqueness", "score": 100.0, "weight": 0.20 },
    { "name": "Validity", "score": 99.2, "weight": 0.20 },
    { "name": "Consistency", "score": 94.0, "weight": 0.15 },
    { "name": "Accuracy", "score": 93.7, "weight": 0.15 }
  ],
  "fingerprints": {
    "input": { "data_hash": "2fab17210014a8c7..." },
    "output": { "data_hash": "c035153d0c1de947..." }
  }
}
Use CleanR inside your own scripts, notebooks, or data pipelines instead of the command line.
from pathlib import Path
from cleanr.engine import CleanREngine

engine = CleanREngine()
result = engine.clean(
    input_path=Path("data.csv"),
    output_path=Path("data_clean.csv"),
)

# Read the quality score
print(result["post_profile"].quality_score)   # 96.0
print(result["post_profile"].quality_label)   # "Excellent"

# Access per-column statistics
age_profile = result["post_profile"].columns["age"]
print(age_profile.mean, age_profile.std, age_profile.null_pct)

# Access schema inference results
age_schema = result["schema"]["age"]
print(age_schema.inferred_dtype)   # "int64"
print(age_schema.distribution)     # "normal" | "skewed" | ...
print(age_schema.confidence)       # 0.97
result = engine.clean(
input_path=Path("data.csv"),
output_path=Path("clean.csv"),
impute_strategy="knn",
drop_col_threshold=0.8,
detect_outliers=True,
outlier_method="flag",
keep=["id", "name", "email", "age"],
rename={"cust_id": "id"},
output_format="xlsx",
)

from pathlib import Path

import pandas as pd

from cleanr.engine import CleanREngine
from cleanr.plugins import CleanrPlugin

class StandardiseCountry(CleanrPlugin):
    name = "standardise_country"

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        if "country" in df.columns:
            df["country"] = df["country"].str.upper().str.strip()
            df["country"] = df["country"].replace({
                "UK": "GBR", "USA": "USA", "US": "USA"
            })
            self.log("Standardised country codes to ISO 3166-1 alpha-3")
        return df

engine = CleanREngine()
engine.register_plugin(StandardiseCountry())
engine.clean(Path("data.csv"), Path("clean.csv"))
cleanr file.csv

These all happen automatically — no flags required.
Magic-byte inspection, extension hints, and CSV dialect sniffing identify the format and encoding automatically — including compressed files.
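Magic-byte inspection is the standard trick: the first few bytes of a file identify its container regardless of the extension. A minimal sketch (the signatures below are the well-known ones; CleanR's own table is presumably more extensive):

```python
import gzip

MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"PK\x03\x04": "zip",
}

def sniff_container(first_bytes: bytes) -> str:
    """Return the compression container, or 'raw' for plain data."""
    for magic, name in MAGIC.items():
        if first_bytes.startswith(magic):
            return name
    return "raw"

blob = gzip.compress(b"id,name\n1,Ada\n")
print(sniff_container(blob))        # gzip
print(sniff_container(b"id,name"))  # raw
```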
"First Name" becomes first_name. All column names are lowercased, punctuation replaced with underscores, duplicates disambiguated.
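The normalization rule can be approximated in a few lines: lowercase, replace runs of non-alphanumerics with underscores, and append numeric suffixes on collisions. This is a sketch of the behaviour described above, not the engine's exact code:

```python
import re

def normalize_columns(names):
    """'First Name' -> 'first_name'; duplicates get _2, _3, ..."""
    seen = {}
    out = []
    for name in names:
        base = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        seen[base] = seen.get(base, 0) + 1
        out.append(base if seen[base] == 1 else f"{base}_{seen[base]}")
    return out

print(normalize_columns(["First Name", "first-name", "Salary ($)"]))
# ['first_name', 'first_name_2', 'salary']
```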
Strips leading/trailing whitespace, collapses double spaces, and replaces empty strings and common null placeholders (N/A, null, None, nan) with proper NaN.
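Per cell, that step amounts to collapsing whitespace and then doing a case-insensitive placeholder lookup. A sketch (the placeholder set here mirrors the examples named above; here a trimmed placeholder maps to None rather than NaN to stay stdlib-only):

```python
PLACEHOLDERS = {"", "n/a", "null", "none", "nan"}

def clean_cell(value):
    """Trim, collapse internal runs of spaces, map null placeholders to None."""
    text = " ".join(str(value).split())
    return None if text.lower() in PLACEHOLDERS else text

print(clean_cell("  Jane   Doe "))  # Jane Doe
print(clean_cell(" N/A "))          # None
```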
Finds and removes fully duplicate rows. Near-duplicate detection is tracked in the quality report.
Columns that carry zero information (single unique value across all rows) are automatically removed.
Six-pass analysis per column: bool, integer, float, datetime (18 formats), semantic type (15 patterns), cardinality. Each inference has a confidence score.
Under 5% missing: median/mode. 5-40% missing: KNN with distance weighting. Over 40%: median with a warning. Datetimes: forward-fill then backward-fill.
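That threshold logic reads as a simple dispatch on the null fraction and column type. A sketch of the selection rule described above (the strategy names are illustrative; the real dispatch lives inside the engine):

```python
def choose_strategy(null_fraction, dtype):
    """Pick an imputation strategy from a column's missingness and type."""
    if dtype == "datetime":
        return "ffill_then_bfill"
    if dtype == "categorical":
        return "mode"
    if null_fraction < 0.05:
        return "median"
    if null_fraction <= 0.40:
        return "knn"
    return "median_with_warning"

print(choose_strategy(0.20, "numeric"))   # knn
print(choose_strategy(0.60, "numeric"))   # median_with_warning
```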
Columns inferred as integers, floats, booleans, datetimes, or categories are coerced to their correct types — with currency and percentage symbols stripped.
Invalid emails, phone numbers, URLs, UUIDs, IPv4 addresses, and other semantic types are flagged in dedicated _invalid_* columns.
Integer columns are downcast to the smallest fitting type. Low-cardinality strings become categories. Typical savings: 20-40%.
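Downcasting is a range check against each candidate width. A sketch of the idea for signed integers, using the usual two's-complement bounds (the engine also handles unsigned and float cases):

```python
def smallest_int_dtype(lo, hi):
    """Pick the narrowest signed integer type that holds [lo, hi]."""
    for bits in (8, 16, 32, 64):
        bound = 1 << (bits - 1)          # int8 holds -128..127, and so on
        if -bound <= lo and hi < bound:
            return f"int{bits}"
    raise OverflowError("value range exceeds int64")

print(smallest_int_dtype(0, 120))      # int8
print(smallest_int_dtype(0, 62_450))   # int32
```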
Five-dimension weighted score: Completeness (30%), Uniqueness (20%), Validity (20%), Consistency (15%), Accuracy (15%). Issues catalogued as Critical / Warning / Info.
Cryptographic hash of both raw file bytes and DataFrame content — before and after cleaning. Use to verify data integrity at any time.
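Producing such a fingerprint is a one-liner with hashlib. A sketch hashing raw file bytes and a canonical text form of parsed rows (CleanR's exact canonicalization of DataFrame content may differ):

```python
import hashlib

def file_hash(raw: bytes) -> str:
    """SHA-256 of the raw file bytes."""
    return hashlib.sha256(raw).hexdigest()

def data_hash(rows) -> str:
    """SHA-256 of a canonical text form of the parsed rows."""
    canonical = "\n".join(",".join(map(str, row)) for row in rows)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

before = data_hash([("id", "name"), (1, "Ada")])
after = data_hash([("id", "name"), (1, "Ada ")])  # whitespace changes the hash
print(before != after)   # True
```

Because the hash covers content rather than file metadata, two runs over byte-identical data always produce the same fingerprint, which is what makes the audit trail verifiable after the fact.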