Linux command
csvclean command
Text
After copying, replace file names, directories, or parameters as needed.
Common examples
Check for rows with length mismatches
csvclean --length-mismatch [data.csv]
Report length mismatches and omit the offending rows from the output
csvclean --length-mismatch --omit-error-rows [data.csv]
Report empty columns
csvclean --empty-columns [data.csv]
Enable all checks
csvclean -a [data.csv]
Fix short rows by joining
csvclean --join-short-rows [data.csv]
Fill short rows
csvclean --fill-short-rows --fillvalue "N/A" [data.csv]
Validate with a custom delimiter and input encoding
csvclean --length-mismatch -d "[;]" -e [latin1] [data.csv]
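The commands above need an input file to act on. The following is a minimal sketch, using only the Python standard library, that builds a small, deliberately messy CSV for experimenting. The function name `messy_csv_text` and the sample values are hypothetical and chosen for illustration only.

```python
import csv
import io


def messy_csv_text():
    """Return CSV text with one long row, one short row, and an all-empty column."""
    rows = [
        ["id", "name", "note"],       # header: 3 columns
        ["1", "Ada", ""],             # well-formed, but "note" is empty
        ["2", "Grace", "", "extra"],  # too long: 4 cells
        ["3", "Linus"],               # too short: 2 cells
    ]
    buffer = io.StringIO()
    # csv.writer does not enforce a fixed row width, so ragged rows are written as-is.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Writing the returned text to a file such as data.csv gives an input that contains both kinds of length mismatch and a column with no values.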
Description
csvclean is the csvkit tool for validating and cleaning CSV files. It detects common problems such as inconsistent column counts and empty columns. Since csvkit 2.0, csvclean no longer reports or fixes errors by default: you must explicitly enable checks (such as --length-mismatch or --empty-columns) or fixes (such as --join-short-rows or --fill-short-rows). Cleaned output is written to standard output and error reports to standard error. The tool accepts a range of CSV dialects, with options for the delimiter, quote character, and input encoding, which makes it a useful preprocessing step for messy data before analysis.
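To make the two checks concrete, here is a rough Python sketch of what a length-mismatch check and an empty-column check amount to. It is not csvkit's implementation: the function names, the error message wording, and the row numbering (data rows counted from 1) are all assumptions made for illustration.

```python
import csv
import io
import sys


def check_lengths(text, delimiter=","):
    """Split CSV text into well-formed rows and length-mismatch errors.

    Returns (good_rows, errors); each error is (row_number, message, row).
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    good_rows, errors = [header], []
    for row_number, row in enumerate(reader, start=1):
        if len(row) == len(header):
            good_rows.append(row)
        else:
            message = f"expected {len(header)} columns, found {len(row)}"
            errors.append((row_number, message, row))
    return good_rows, errors


def empty_columns(rows):
    """Return the header names of columns whose data cells are all empty or missing."""
    header, data = rows[0], rows[1:]
    return [
        name
        for index, name in enumerate(header)
        if all(index >= len(row) or row[index] == "" for row in data)
    ]


sample = "id,name,note\n1,Ada,\n2,Grace,,extra\n3,Linus\n"
good, bad = check_lengths(sample)

# Mirror the stdout/stderr split: clean rows on one stream, reports on the other.
csv.writer(sys.stdout, lineterminator="\n").writerows(good)
for row_number, message, row in bad:
    print(f"row {row_number}: {message}", file=sys.stderr)
```

Keeping clean rows and error reports on separate streams is what lets a pipeline redirect each to its own file.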
Parameters
- --length-mismatch
- Report rows that are shorter or longer than the header row.
- --empty-columns
- Report empty columns as errors.
- -a, --enable-all-checks
- Enable all error reporting checks.
- --join-short-rows
- Merge consecutive short rows into a single row.
- --separator _SEPARATOR_
- String used to join short rows (default: newline).
- --fill-short-rows
- Fill short rows with missing values.
- --fillvalue _VALUE_
- Value used to fill short rows (default: empty string).
- --omit-error-rows
- Exclude rows containing errors from standard output.
- --label _LABEL_
- Add a label column to error output for automated workflows.
- --header-normalize-space
- Strip leading/trailing whitespace and normalize whitespace in headers.
- -d _CHAR_, --delimiter _CHAR_
- Field delimiter (default: comma).
- -t, --tabs
- Use tabs as delimiter.
- -q _CHAR_, --quotechar _CHAR_
- Quote character (default: double quote).
- -p _CHAR_, --escapechar _CHAR_
- Escape character for the delimiter or quote character.
- -e _ENCODING_, --encoding _ENCODING_
- Input file encoding.
- -S, --skipinitialspace
- Ignore whitespace immediately following the delimiter.
- -H, --no-header-row
- Input file has no header row.
- -K _N_, --skip-lines _N_
- Skip the first N lines of the input file.
- -v, --verbose
- Print detailed tracebacks when errors occur.
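The two fixes, filling and joining short rows, can be sketched in plain Python as follows. This is an approximation of the idea under stated assumptions, not csvkit's code: the function names are hypothetical, and the joining rule used here (glue the last cell of one short row to the first cell of the next with the separator) is one plausible reading of "merge consecutive short rows" whose details may differ from the real tool.

```python
def fill_short_rows(rows, width, fillvalue=""):
    """Pad rows shorter than `width` with `fillvalue`; leave other rows untouched."""
    return [
        row + [fillvalue] * (width - len(row)) if len(row) < width else row
        for row in rows
    ]


def join_short_rows(rows, width, separator="\n"):
    """Merge runs of consecutive short rows by gluing their edge cells together."""
    result, pending = [], None
    for row in rows:
        if pending is not None and row and len(row) < width:
            # Last cell of the pending row + separator + first cell of this row.
            merged = pending[:-1] + [pending[-1] + separator + row[0]] + row[1:]
            if len(merged) <= width:
                pending = merged
                if len(pending) == width:
                    result.append(pending)  # the run now forms a complete row
                    pending = None
                continue
        if pending is not None:
            result.append(pending)  # could not be completed; pass it through
            pending = None
        if 0 < len(row) < width:
            pending = list(row)  # start of a possible run of short rows
        else:
            result.append(row)
    if pending is not None:
        result.append(pending)
    return result


# A cell containing an unquoted line break shows up as two short rows.
broken = [["1", "hello"], ["world", "3"], ["2", "ok", "4"]]
joined = join_short_rows(broken, width=3)
filled = fill_short_rows([["3", "Linus"], ["1", "Ada", "x"]], width=3, fillvalue="N/A")
```

Joining suits rows that were split by a stray line break inside a cell, while filling suits rows that are simply missing trailing values.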
FAQ
What is the csvclean command used for?
csvclean is the csvkit tool for validating and cleaning CSV files. Since csvkit 2.0 it reports and fixes nothing by default, so enable checks such as --length-mismatch or --empty-columns, or fixes such as --join-short-rows or --fill-short-rows. Cleaned rows go to standard output and error reports to standard error.
How do I run a basic csvclean example?
Run `csvclean --length-mismatch [data.csv]` in a terminal, replacing [data.csv] with the path to your own file and adding flags as needed.
What does --length-mismatch do in csvclean?
Report rows that are shorter or longer than the header row.