Complete Usage Guide

CleanR v3

Production data cleaning engine — from installation to every feature, with working examples.

Installation

CleanR installs as a regular Python package. Once installed, the cleanr command is available everywhere on your system — no virtual environment needed.

01

Confirm Python 3.9 or newer

Run python3 --version in your terminal. Any version from 3.9 onward works.

02

Add ~/.local/bin to your PATH (one time only)

This makes pip-installed commands like cleanr available in every terminal session.

03

Clone the repo, enter the CleanR-v3 folder, and install

The folder contains pyproject.toml and the cleanr/ package directory.

04

Verify it works

Run cleanr --help from any directory. You should see the full flag reference.

Terminal
# Step 2 — Add ~/.local/bin to PATH (run once, then restart your terminal)
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Step 3 — Clone the repo and install
git clone https://github.com/Omensah-15/CleanR-v3.git
cd CleanR-v3
pip install -e . --break-system-packages

# Step 4 — Verify (works from any directory)
cleanr --help

Optional: Parquet file support

pip install pyarrow --break-system-packages   # adds .parquet input/output support

After install, you can run cleanr from any directory in any terminal — no activation or setup needed.

Your first run

Point CleanR at any data file. It detects the format and encoding automatically, runs the full pipeline, and saves the result.

Minimal usage
cleanr data.csv

This saves data_clean.csv alongside your input, plus a quality report and audit log. Here is what the output looks like:

Example terminal output
  CleanR v3.0.0
  Format: CSV  Encoding: utf-8  Delimiter: ','
  Loaded: 5,200 rows x 14 cols  (0.03s)
  Profile: quality Fair 68/100

  Pipeline:
    + [normalize_columns]  Normalized 11/14 column names to snake_case
    + [trim_whitespace]    Trimmed and normalised whitespace in 5 string columns
    + [remove_duplicates]  Removed 47 duplicate rows (5,200 -> 5,153)
    + [handle_missing]     Imputed 312 nulls in 'age' via knn_k5
    + [handle_missing]     Imputed 28 nulls in 'salary' via median (fill=62450.0)
    + [type_coercion]      Coerced 'join_date' -> datetime64[ns]
    + [type_coercion]      Coerced 'country' -> category
    + [format_validator]   Flagged 23 (0.4%) invalid email values in 'email'
    + [memory_optimize]    Memory: 0.44 MB -> 0.29 MB (saved 0.15 MB, 34%)

  Post-clean: quality Excellent 96/100  (0.31s)

  Done: 0.41s  |  5,153 rows x 15 cols  |  Quality: Excellent (96/100)
  Report: data_clean.report.json
  Audit:  data_clean.audit.json

You can also name your output file explicitly:

cleanr messy_data.xlsx tidy_output.csv

All features with examples

KNN imputation (default for 5-40% missing)

By default, CleanR uses KNN imputation automatically when a column is 5-40% missing. No flag needed — it happens on every run.

# Auto strategy: KNN for mid-missing numerics, mode for categoricals, ffill for dates
cleanr data.csv

# Force KNN for everything
cleanr data.csv --impute-strategy knn

# Use median/mode only (faster, no ML)
cleanr data.csv --impute-strategy median

# Fill everything with a constant
cleanr data.csv --impute-strategy constant --fill "N/A"

Drop rows with any missing value

cleanr data.csv --drop-na
# Removes every row that contains at least one empty cell — overrides imputation

Drop columns that are mostly empty

cleanr data.csv --drop-col-threshold 0.8
# Drops any column that is 80% or more empty

Outlier detection uses Isolation Forest + IQR consensus — a row must be flagged by both methods before it is acted on. This minimises false positives.

Flag outlier rows (default)

cleanr data.csv --detect-outliers
# Adds a '_is_outlier' boolean column — you decide what to do with flagged rows

Remove outlier rows automatically

cleanr data.csv --detect-outliers --outlier-method remove

Adjust sensitivity

# contamination = expected fraction of outliers (default: 0.02 = 2%)
cleanr data.csv --detect-outliers --outlier-contamination 0.05
# Use 0.05 for dirtier data, 0.01 for cleaner data
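Of the two consensus methods, the IQR half is easy to sketch in plain Python. The snippet below is an illustrative sketch using classic Tukey fences, not CleanR's internal code:

```python
import statistics

def iqr_flags(values, k=1.5):
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

flags = iqr_flags([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(flags)  # only the 100 is flagged
```

A consensus scheme then ANDs these flags with the Isolation Forest predictions, so a row is only acted on when both methods agree.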

Keep only the columns you need

cleanr data.csv --keep id,name,email,signup_date,country
# All other columns are discarded

Drop specific columns

cleanr data.csv --drop internal_notes,debug_flag,temp_col

Rename columns

cleanr data.csv --rename fname=first_name lname=last_name cust_id=id
# Format: OLD=NEW — space-separated pairs

Split one column into multiple

# "John Smith"  ->  first_name="John", last_name="Smith"
cleanr contacts.csv --split full_name "first_name,last_name" " "

# "Austin,TX,78701"  ->  city, state, zip
cleanr data.csv --split address "city,state,zip" ","

Add a copy of a column

cleanr data.csv --add email_backup=email original_id=id

Supported input formats

CSV, TSV, TXT, JSON, JSONL, XLSX, XLS, Parquet*, GZ / BZ2 / XZ compressed

*Parquet requires pip install pyarrow --break-system-packages

Output to a different format

# CSV input -> Excel output
cleanr data.csv output.xlsx --output-format xlsx

# Excel input -> JSON output
cleanr report.xlsx report.json --output-format json

# CSV -> TSV (tab-separated)
cleanr data.csv data.tsv --output-format tsv

# CSV -> JSONL (one object per line)
cleanr data.csv data.jsonl --output-format jsonl

Force a specific encoding

cleanr old_export.csv --encoding latin1
# Options: utf-8, utf-16, latin1, cp1252, iso-8859-1
# Usually not needed — CleanR auto-detects encoding

Write a JSON rules file and pass it with --rules. Each rule targets a column and checks a condition. Violations are flagged or removed.

Create a rules file

rules.json
[
  { "column": "age",     "type": "min",   "value": 0   },
  { "column": "age",     "type": "max",   "value": 120 },
  { "column": "email",   "type": "not_null" },
  { "column": "status",  "type": "allowed_values",
    "values": ["active", "inactive", "pending"] },
  { "column": "phone",   "type": "regex",
    "pattern": "^\\+?[\\d\\s\\-().]{7,20}$" }
]

Apply the rules

# Flag violating rows (adds '_rule_violation' column)
cleanr data.csv --rules rules.json

# Remove violating rows
cleanr data.csv --rules rules.json --rule-action remove
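Regex rules are easiest to debug before a full run. The pattern from rules.json above can be exercised directly with Python's re module (note that "\\+" in JSON becomes "\+" in the raw string):

```python
import re

# The phone rule from rules.json
phone = re.compile(r"^\+?[\d\s\-().]{7,20}$")

print(bool(phone.match("+1 (512) 555-0134")))   # True
print(bool(phone.match("not a number")))        # False
```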

Files too large to fit in RAM

# Read and process 50,000 rows at a time
cleanr huge_file.csv --chunk 50000

Maximum throughput mode

# --quick skips schema inference, KNN imputation, and memory optimisation
cleanr 10gb_log.csv clean.csv --chunk 100000 --quick --no-fingerprint

Compressed input files

# .gz, .bz2, .xz, and .zip are detected automatically
cleanr data.csv.gz data_clean.csv

Turn off specific pipeline steps

All steps are on by default. Use --no-* flags to disable individual ones while keeping everything else running.

cleanr data.csv --no-normalize          # keep original column names
cleanr data.csv --no-dedup              # keep duplicate rows
cleanr data.csv --no-trim               # do not strip whitespace
cleanr data.csv --no-auto-types         # do not change column types
cleanr data.csv --no-validate-formats   # skip email/phone/URL checks
cleanr data.csv --no-drop-constant      # keep constant columns

Bare mode — run only what you specify

With --bare, every default pipeline step is disabled. Only the flags you explicitly pass will run. Use this when you want a minimal, exact pipeline with no automatic side effects.

# Only fill missing values — nothing else touches the data
cleanr data.csv --bare --fill "nan"

# Only remove duplicates and fill missing values
cleanr data.csv --bare --dedup --fill "nan"

# Only normalize column names and strip whitespace
cleanr data.csv --bare --normalize --trim

# Only KNN imputation and outlier flagging
cleanr data.csv --bare --impute-strategy knn --detect-outliers

--bare vs --no-*: use --no-normalize, --no-dedup, etc. to turn off individual steps while keeping the rest of the pipeline; use --bare when you want to start from nothing and opt in to only what you need.

Suppress terminal output

cleanr data.csv --quiet
# Prints nothing. Exit code 0 = success, 1 = error.
# Safe to use in scripts and automated pipelines.

Custom report and audit output paths

cleanr data.csv --report reports/quality.json --audit logs/audit.json

What CleanR produces

Every run automatically creates three files.

data_clean.csv

The cleaned dataset. Normalized column names, corrected data types, duplicates removed, whitespace stripped, missing values imputed. Ready for any analytics tool.

data_clean.report.json

Full quality report: before/after row counts, a five-dimension quality score (Completeness, Uniqueness, Validity, Consistency, Accuracy), per-column statistics (mean, std, percentiles, skewness, kurtosis, normality p-value), detected anomalies catalogued as Critical/Warning/Info, and SHA-256 integrity fingerprints for both input and output.

data_clean.audit.json

Timestamped log of every action taken — which plugin ran, what it changed, elapsed time. Use this for compliance, debugging, or reproducing a cleaning run exactly.

Quality score breakdown

report.json (excerpt)
{
  "quality_score": 96.0,
  "quality_label": "Excellent",
  "dimensions": [
    { "name": "Completeness",  "score": 100.0, "weight": 0.30 },
    { "name": "Uniqueness",     "score": 100.0, "weight": 0.20 },
    { "name": "Validity",       "score": 99.2,  "weight": 0.20 },
    { "name": "Consistency",    "score": 94.0,  "weight": 0.15 },
    { "name": "Accuracy",       "score": 93.7,  "weight": 0.15 }
  ],
  "fingerprints": {
    "input":  { "data_hash": "2fab17210014a8c7..." },
    "output": { "data_hash": "c035153d0c1de947..." }
  }
}
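Because the report is plain JSON, automation can consume it with the standard library alone. A minimal sketch follows; the excerpt is inlined here for self-containment, but in practice you would read data_clean.report.json from disk:

```python
import json

report = json.loads("""
{
  "quality_score": 96.0,
  "quality_label": "Excellent",
  "dimensions": [
    { "name": "Completeness", "score": 100.0, "weight": 0.30 },
    { "name": "Validity",     "score": 99.2,  "weight": 0.20 }
  ]
}
""")

# Gate a pipeline on overall quality, then surface the weakest dimension
assert report["quality_score"] >= 90, f"quality too low: {report['quality_label']}"
weakest = min(report["dimensions"], key=lambda d: d["score"])
print(f"weakest dimension: {weakest['name']} ({weakest['score']})")
```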

Using CleanR from Python

Use CleanR inside your own scripts, notebooks, or data pipelines instead of the command line.

Basic usage

Python
from pathlib import Path
from cleanr.engine import CleanREngine

engine = CleanREngine()
result = engine.clean(
    input_path=Path("data.csv"),
    output_path=Path("data_clean.csv"),
)

# Read the quality score
print(result["post_profile"].quality_score)   # 96.0
print(result["post_profile"].quality_label)   # "Excellent"

# Access per-column statistics
age_profile = result["post_profile"].columns["age"]
print(age_profile.mean, age_profile.std, age_profile.null_pct)

# Access schema inference results
age_schema = result["schema"]["age"]
print(age_schema.inferred_dtype)     # "int64"
print(age_schema.distribution)      # "normal" | "skewed" | ...
print(age_schema.confidence)        # 0.97

All pipeline options via Python

Python
result = engine.clean(
    input_path=Path("data.csv"),
    output_path=Path("clean.csv"),
    impute_strategy="knn",
    drop_col_threshold=0.8,
    detect_outliers=True,
    outlier_method="flag",
    keep=["id", "name", "email", "age"],
    rename={"cust_id": "id"},
    output_format="xlsx",
)

Writing a custom plugin

Python
from pathlib import Path

import pandas as pd

from cleanr.plugins import CleanrPlugin
from cleanr.engine import CleanREngine

class StandardiseCountry(CleanrPlugin):
    name = "standardise_country"

    def run(self, df: pd.DataFrame) -> pd.DataFrame:
        if "country" in df.columns:
            df["country"] = df["country"].str.upper().str.strip()
            # Map common aliases onto ISO 3166-1 alpha-3 codes
            df["country"] = df["country"].replace({"UK": "GBR", "US": "USA"})
            self.log("Standardised country codes to ISO 3166-1 alpha-3")
        return df

engine = CleanREngine()
engine.register_plugin(StandardiseCountry())
engine.clean(Path("data.csv"), Path("clean.csv"))

What runs on every cleanr file.csv

These all happen automatically — no flags required.

Format Detection

Magic-byte inspection, extension hints, and CSV dialect sniffing identify the format and encoding automatically — including compressed files.

Column Normalization

"First Name" becomes first_name. All column names are lowercased, punctuation replaced with underscores, duplicates disambiguated.

Whitespace Cleaning

Strips leading/trailing whitespace, collapses double spaces, and replaces empty strings and common null placeholders (N/A, null, None, nan) with proper NaN.
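Per string cell, the effect can be pictured like this (an illustrative sketch; CleanR's real sentinel list may be longer):

```python
import re

NULL_TOKENS = {"", "n/a", "null", "none", "nan"}

def clean_cell(value: str):
    value = re.sub(r"\s+", " ", value.strip())   # trim, collapse internal runs
    return None if value.lower() in NULL_TOKENS else value

print(clean_cell("  Jane   Doe "))   # Jane Doe
print(clean_cell("N/A"))             # None
```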

Duplicate Removal

Finds and removes fully duplicate rows. Near-duplicate detection is tracked in the quality report.

Constant Column Drop

Columns that carry zero information (single unique value across all rows) are automatically removed.

Schema Inference

Six-pass analysis per column: bool, integer, float, datetime (18 formats), semantic type (15 patterns), cardinality. Each inference has a confidence score.

Smart Imputation

Under 5% missing: median/mode. 5-40% missing: KNN with distance weighting. Over 40%: median with a warning. Datetimes: forward-fill then backward-fill.
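That dispatch can be sketched as a simple function of the null fraction and dtype. Thresholds are taken from this section; the function name and signature are illustrative, not part of CleanR's API:

```python
def pick_strategy(null_frac: float, dtype: str) -> str:
    if dtype == "datetime":
        return "ffill_then_bfill"
    if dtype == "categorical":
        return "mode"
    if null_frac < 0.05:
        return "median"
    if null_frac <= 0.40:
        return "knn"
    return "median_with_warning"

print(pick_strategy(0.12, "numeric"))   # knn
print(pick_strategy(0.60, "numeric"))   # median_with_warning
```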

Type Coercion

Columns inferred as integers, floats, booleans, datetimes, or categories are coerced to their correct types — with currency and percentage symbols stripped.

Format Validation

Invalid emails, phone numbers, URLs, UUIDs, IPv4 addresses, and other semantic types are flagged in dedicated _invalid_* columns.

Memory Optimization

Integer columns are downcast to the smallest fitting type. Low-cardinality strings become categories. Typical savings: 20-40%.
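Downcasting amounts to choosing the narrowest type whose range covers the column's min and max, sketched here for signed integers (illustrative, not CleanR's code):

```python
def smallest_int_dtype(lo: int, hi: int) -> str:
    for bits in (8, 16, 32, 64):
        if -(2 ** (bits - 1)) <= lo <= hi <= 2 ** (bits - 1) - 1:
            return f"int{bits}"
    raise OverflowError("does not fit in a 64-bit integer")

print(smallest_int_dtype(0, 120))      # int8  (e.g. an 'age' column)
print(smallest_int_dtype(0, 62_450))   # int32 (too wide for int16's 32,767 max)
```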

Quality Scoring

Five-dimension weighted score: Completeness (30%), Uniqueness (20%), Validity (20%), Consistency (15%), Accuracy (15%). Issues catalogued as Critical / Warning / Info.

SHA-256 Fingerprinting

Cryptographic hash of both raw file bytes and DataFrame content — before and after cleaning. Use to verify data integrity at any time.
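To verify the raw-bytes half independently, recompute the hash with Python's hashlib and compare it against the matching fingerprint in data_clean.report.json (the file path in the comment is illustrative):

```python
import hashlib

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):   # stream 1 MiB at a time
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the input fingerprint in data_clean.report.json
# print(file_sha256("data.csv"))
```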