datatools is a rich collection of command line programs targeting data conversion, cleanup and analysis directly from your favorite POSIX shell or PowerShell. It has proven useful for data collaborations where individual members of a project may prefer different tool sets in their analysis (e.g. Julia, R, Python) but want to work from a common baseline. It has also been used intensively for internal reporting from various Caltech Library metadata sources.
The tools fall into three broad categories:
- data transformation and conversion
- shell scripting helpers
- "string", a tool providing the common string operations missing from shell
See the user manual for a complete list of the command line programs. The data transformation tools include support for formats such as Excel XML, CSV, tab delimited files, JSON, YAML, TOML and URL encoding/decoding.
Compiled versions of the datatools collection are provided for Linux (aarch64/amd64), Mac OS X (aarch64/amd64), Windows 10 (aarch64/amd64) and Raspberry Pi OS (aarch64). See https://github.com/caltechlibrary/datatools/releases.
Use the "-help" option for a full list of options for each utility (e.g. csv2json -help).
The transformation tooling includes data conversion: tools that work with CSV, tab delimited, JSON, TOML, YAML, Excel XML, and URL encoded text.
There is also tooling to change data shapes using JSON as the intermediate data format.
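To make the idea of JSON as an intermediate format concrete, here is a minimal Go sketch of a CSV to JSON conversion. It is an illustration only, not the implementation of csv2json or any other datatools program. The helper name `csvToJSON` is hypothetical, and the output shape (one JSON object per row, keyed by the header row) is an assumption.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strings"
)

// csvToJSON is a hypothetical helper (not part of datatools). It treats the
// first CSV row as a header and returns a JSON array with one object per
// data row, mapping column name to cell value.
func csvToJSON(src string) (string, error) {
	rows, err := csv.NewReader(strings.NewReader(src)).ReadAll()
	if err != nil {
		return "", err
	}
	if len(rows) == 0 {
		return "[]", nil
	}
	header := rows[0]
	records := make([]map[string]string, 0, len(rows)-1)
	for _, row := range rows[1:] {
		rec := make(map[string]string, len(header))
		for i, name := range header {
			if i < len(row) {
				rec[name] = row[i]
			}
		}
		records = append(records, rec)
	}
	// json.Marshal writes map keys in sorted order, so output is stable.
	out, err := json.Marshal(records)
	return string(out), err
}

func main() {
	out, err := csvToJSON("name,lang\nada,Julia\nbob,R\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

Once the rows are JSON objects, reshaping becomes a matter of working with ordinary JSON values, which is why JSON is a convenient intermediate format.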
Various utilities simplify work on the command line:
- mergepath - prefix, append, clip path variables
- reldocpath - calculates relative paths given two paths
- range - emit a range of integers (useful for numbered loops in Bash)
- reldate - display a relative date in YYYY-MM-DD format
- reltime - display a relative time in 24 hour notation, HH:MM:SS format
- timefmt - format a time value based on Golang's time format language
- urlparse - split a URL into parts
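Since timefmt is based on Go's time format language, a short Go sketch of that language may help. Go layouts are written as renderings of the reference time "Mon Jan 2 15:04:05 MST 2006" rather than with strftime-style codes. The helper `reformat` below is hypothetical and only illustrates the layout language; it is not timefmt's implementation, and it says nothing about timefmt's own options.

```go
package main

import (
	"fmt"
	"time"
)

// reformat is a hypothetical helper: it parses value with inLayout and
// renders the result with outLayout. Both layouts use Go's reference time.
func reformat(value, inLayout, outLayout string) (string, error) {
	t, err := time.Parse(inLayout, value)
	if err != nil {
		return "", err
	}
	return t.Format(outLayout), nil
}

func main() {
	// "2006-01-02 15:04:05" describes the input; "Jan 2, 2006 3:04 PM"
	// describes the desired output.
	out, err := reformat("2024-03-09 14:05:00", "2006-01-02 15:04:05", "Jan 2, 2006 3:04 PM")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // Mar 9, 2024 2:05 PM
}
```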
datatools provides the string command for working with
text strings (limited to available memory). This is commonly needed when
cleaning up data for analysis. The string command was created for when the
old Unix standbys (grep, awk, sed, tr) are unwieldy or inconvenient.
string provides operations common in most languages, such as trimming,
splitting, and transforming letter case. The string command also makes
it easy to join JSON string arrays into a single string using a delimiter
or split a string into a JSON array based on a delimiter. The form of the
command is

string [OPTIONS] [ACTION] [ACTION_PARAMETERS...]

For example,

string toupper "one two three"

would yield "ONE TWO THREE".
Some of the features included:
- change case (upper, lower, title, English title)
- length, position and count of substrings
- has prefix, suffix or contains
- trim prefix, suffix and cutsets
- split and join to/from JSON string arrays
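The split and join feature listed above can be pictured with a small Go sketch. This is an illustration of the idea under stated assumptions, not the string command's code: the function names `splitToJSON` and `joinFromJSON` are hypothetical, and the exact output formatting of the real command may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// splitToJSON (hypothetical) splits s on delim and encodes the parts as a
// JSON array of strings.
func splitToJSON(s, delim string) (string, error) {
	out, err := json.Marshal(strings.Split(s, delim))
	return string(out), err
}

// joinFromJSON (hypothetical) decodes a JSON array of strings and joins
// the elements with delim.
func joinFromJSON(src, delim string) (string, error) {
	var parts []string
	if err := json.Unmarshal([]byte(src), &parts); err != nil {
		return "", err
	}
	return strings.Join(parts, delim), nil
}

func main() {
	arr, _ := splitToJSON("one,two,three", ",")
	fmt.Println(arr) // ["one","two","three"]

	joined, _ := joinFromJSON(arr, " ")
	fmt.Println(joined) // one two three
}
```

Using a JSON array as the exchange format means the split result can be handed to any other JSON-aware tool without worrying about delimiters that appear inside the values.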
See string for full details.
See INSTALL.md for details on installing pre-compiled versions of the programs.