NAME
    select - pass a subset of input to output, optionally changing formats

SYNOPSIS
    sel [options] [columns]

DESCRIPTION
    Select reads lines from its STDIN, parses them into their component
    columns (or fields, or records), and sends to STDOUT the columns
    specified by the user. It may also, at the user's discretion, omit rows
    from STDOUT which fail some user-specified expression, and/or translate
    from one format to another. If one imagines STDIN as a database table,
    select is analogous to the SQL "select" operation.

    The formats understood by select are: tab-delimited text (tab),
    comma-separated values (csv), hash, JavaScript Object Notation (json),
    Extensible Markup Language (xml), and eDiscovery loadfile format (lf).
    These formats are described in the FORMATS section. Select will attempt
    to convert from any format to any other if requested, but converting
    from the "complex" formats (json, xml) to the "flat" formats (tab, csv,
    hash, lf) will often not work well.

OPTIONS
    --all
        Ignore any columns specified on the command line and instead emit
        to STDOUT all of the columns read from STDIN. (CAUTION: the output
        field order might not be preserved in all cases.) This is
        particularly useful when using select to translate between two
        formats without any filtering.

    --csv-delimiter=
        Specifies the field delimiter to use when emitting csv to STDOUT.
        Default is ",".

    --delimiter=, --delim=
        Specifies the character string used to separate fields in the
        STDIN stream. This defaults to \t for tab or hash modes, comma for
        csv mode, and "^|^" for lf mode.

    --eat-lines
        Normally, when text is being extracted from a source which
        indent-formats for easier reading, select will flatten it by
        replacing newline markers with spaces. Specifying --eat-lines
        causes consecutive lines to be replaced with a single space
        character (instead of producing one space per line).

    --eat-spaces
        Similar to --eat-lines; causes any spaces between fields in the
        output text to be reduced to a single space.
    --escape
        Emits escaped output, as per RFC 2396.

    --header[s], --noheader[s]
        Specifies that the first line on STDIN consists of header data (for
        modes that support headers: tab, csv, lf), or that the header is
        missing in the case of --noheaders. Lacking either of these
        options, select tries to guess whether the first line is header
        information.

    --in=, --mode=
        Specifies which format to expect on STDIN. MODE may be one of: tab,
        csv, hash, json, xml, lf, auto. Default is "auto", which causes
        select to autodetect (guess) the input format.

    -j, --json
        Synonym for --in=json.

    --out=
        Specifies which format to emit on STDOUT. MODE may be one of: tab,
        csv, hash, json, xml, lf. Default is the same as the input mode.

    --unescape
        Attempts to unescape encoded input before parsing, per RFC 2396.

    --where=
        Specifies a filtering expression for rows. The expression is
        evaluated in the context of each input row, and the row is emitted
        to STDOUT only if the expression resolves in the affirmative. The
        expression parser is currently somewhat limited; please read the
        ROW FILTERING section for expression syntax and caveats.

    --wheref=
        The same as the --where option, except that the expression to be
        parsed is read from the specified file.

    --xml-delimiter=
        Specifies the name of the outermost element encapsulating every
        line of xml-formatted output (xml output mode only). Default is
        "row".

FORMATS
    A format determines how data's fields or records are organized, and how
    a data field is associated with its title or name. Select supports a
    few popular (and a couple of not-so-popular) formats.

  TAB-DELIMITED TEXT (mode: tab)

    Tab-delimited text very simply consists of several values separated
    within a line by tab characters. There may or may not be a header line
    which specifies the names of each column. If this header is present,
    then well-formed tab data will always have the same number of columns
    per line.
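    The core operation (parse a header line, then emit the requested
    columns in the requested order) can be sketched in a few lines of
    Python. This is a hypothetical illustration of the behavior described
    above, not select's actual code:

```python
def select_columns(lines, wanted, delim="\t"):
    """Parse tab-delimited lines (first line is the header) and emit
    only the requested columns, in the requested order."""
    rows = [line.rstrip("\n").split(delim) for line in lines]
    header, body = rows[0], rows[1:]
    indexes = [header.index(name) for name in wanted]  # unknown name -> ValueError
    picked = [wanted] + [[row[i] for i in indexes] for row in body]
    return [delim.join(row) for row in picked]

data = ["NAME\tGENDER\tAGE", "Bob\tMale\t13", "Mary\tFemale\t15"]
for line in select_columns(data, ["AGE", "NAME"]):
    print(line)  # AGE/NAME header, then "13 Bob", "15 Mary"
```

    Note that, as with select itself, naming a column twice simply repeats
    it in the output.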
    For instance, the contents of marybob.txt:

        NAME    GENDER  AGE  MOTTO
        Bob     Male    13   haven't a clue
        Mary    Female  15   no rest for the wicked
        Joe     Female  14   what're you looking at?

  COMMA SEPARATED VALUES (mode: csv)

    CSV is a popular format which is similar to tab-delimited text, but
    uses a comma (or sometimes a tab) to separate fields, and encapsulates
    non-numeric fields inside single or double quotes. Within non-numeric
    fields, quotes are "escaped" by doubling them in sequence. For
    instance:

        "NAME","GENDER","AGE","MOTTO"
        "Bob","Male",13,"haven''t a clue"
        "Mary","Female",15,"no rest for the wicked"
        "Joe","Female",14,"what''re you looking at?"

  HASH FORMAT (mode: hash)

    Hash-formatted text consists of tab-delimited fields of [name]=[value]
    tuples. There is no header line. The main advantage of the hash format
    is that select rows can easily be annotated with additional columns
    without refactoring the entire document (not every row needs to have
    the same number of fields). For instance:

        name=Bob   gender=Male    age=13  motto=haven't a clue
        name=Mary  gender=Female  age=15  motto=no rest for the wicked
        name=Joe   gender=Female  age=14  motto=what're you looking at?  note=watch this one

  LOADFILE FORMAT (mode: lf)

    Loadfiles are used in eDiscovery to describe bodies of evidence. They
    are formatted exactly like tab-delimited text files, except that some
    misguided soul decided it would be cute if every line began and ended
    with "^", and used "^|^" to separate fields. Some loadfiles choose
    other delimiters, such as ctl-F|ctl-F. When select is told its input is
    a loadfile, and no delimiter is specified by the user, it will attempt
    to guess at the correct delimiter to use. It will always use "^|^"
    delimiters on output.
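    Stripping the loadfile framing and splitting a row is straightforward;
    a minimal Python sketch of the default "^|^" case (hypothetical helper,
    not select's code):

```python
def parse_lf_line(line, delim="^|^"):
    """Strip the leading/trailing '^' framing from a loadfile row and
    split it on the field delimiter."""
    line = line.rstrip("\n")
    if line.startswith("^") and line.endswith("^"):
        line = line[1:-1]
    return line.split(delim)

print(parse_lf_line("^Bob^|^Male^|^13^"))  # ['Bob', 'Male', '13']
```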
    For instance:

        ^NAME^|^GENDER^|^AGE^|^MOTTO^
        ^Bob^|^Male^|^13^|^haven't a clue^
        ^Mary^|^Female^|^15^|^no rest for the wicked^
        ^Joe^|^Female^|^14^|^what're you looking at?^

  EXTENSIBLE MARKUP LANGUAGE (mode: xml)

    XML is another whacked format which has mysteriously gained widespread
    use. Select's support for XML is rather poor. Please refrain from using
    it unless you must. It will cause you tremendous pain. Example:

        <row><NAME>Bob</NAME><GENDER>Male</GENDER><AGE>13</AGE><MOTTO>haven't a clue</MOTTO></row>
        <row><NAME>Mary</NAME><GENDER>Female</GENDER><AGE>15</AGE><MOTTO>no rest for the wicked</MOTTO></row>
        <row><NAME>Joe</NAME><GENDER>Female</GENDER><AGE>14</AGE><MOTTO>what're you looking at?</MOTTO><note>watch this one</note></row>

    Those unlucky readers familiar with XML will notice that these three
    rows together do not form a correctly-formed XML document. This is
    because select parses its rows one line at a time, and has no notion of
    an encapsulating element around the entire body of input. The
    following, while correct XML, will cause select to choke and die:

        <rows>
        <row><NAME>Bob</NAME><GENDER>Male</GENDER><AGE>13</AGE><MOTTO>haven't a clue</MOTTO></row>
        <row><NAME>Mary</NAME><GENDER>Female</GENDER><AGE>15</AGE><MOTTO>no rest for the wicked</MOTTO></row>
        <row><NAME>Joe</NAME><GENDER>Female</GENDER><AGE>14</AGE><MOTTO>what're you looking at?</MOTTO><note>watch this one</note></row>
        </rows>

    The real solution to this problem is to use a "flatten files" mode
    which has not yet been implemented for select. I will get to it soon.
    When in place, select will read a file's worth of XML (or JSON, etc)
    and convert it into a single row that looks like:

        <rows><row><NAME>Bob</NAME>...</row><row><NAME>Mary</NAME>...</row><row><NAME>Joe</NAME>...</row></rows>

    .. which, believe it or not, select will swallow quite happily.

  JAVASCRIPT OBJECT NOTATION (mode: json)

    JSON is the data encoding format used by JavaScript. It is more compact
    and less brittle than XML, and may be used to represent complex data
    structures. For instance:

        {"NAME":"Bob","GENDER":"Male","AGE":13,"MOTTO":"haven't a clue"}
        {"NAME":"Mary","GENDER":"Female","AGE":15,"MOTTO":"no rest for the wicked"}
        {"NAME":"Joe","GENDER":"Female","AGE":14,"MOTTO":"what're you looking at?","note":"watch this one"}

    As with XML, these three rows together do not form a valid JSON
    document, for similar reasons.
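    Until select grows an option for this, row-per-line JSON can be
    stitched into a single valid document with a short Python workaround (a
    sketch, not part of select itself):

```python
import json

def wrap_json_rows(lines):
    """Parse one JSON object per input line and re-emit the lot as a
    single JSON array."""
    return json.dumps([json.loads(line) for line in lines if line.strip()])

rows = ['{"NAME":"Bob","AGE":13}', '{"NAME":"Mary","AGE":15}']
print(wrap_json_rows(rows))
# [{"NAME": "Bob", "AGE": 13}, {"NAME": "Mary", "AGE": 15}]
```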
    To correctly convert this data into a single JSON entity, encapsulate
    it in square brackets and delimit the rows with commas, thus:

        [
        {"NAME":"Bob","GENDER":"Male","AGE":13,"MOTTO":"haven't a clue"},
        {"NAME":"Mary","GENDER":"Female","AGE":15,"MOTTO":"no rest for the wicked"},
        {"NAME":"Joe","GENDER":"Female","AGE":14,"MOTTO":"what're you looking at?","note":"watch this one"}
        ]

    An option will be added to select to do this automatically.

  FORMAT CAVEATS

    When attempting to convert complex structures from XML or JSON to one
    of the less-flexible formats, select will attempt to "linearize" the
    structures, often with tragic results. The hash format deals with this
    linearized data best.

    Automatic detection of headers in csv and tab modes is unreliable. Use
    the --header and --noheader options to avoid errors.

    When no headers are available for tab, csv, or lf input, select will
    use numbers for names instead. Given this input, for instance:

        Bob   Male    13  haven't a clue
        Mary  Female  15  no rest for the wicked

    .. select will produce the following output, with a header:

        0     1       2   3
        Bob   Male    13  haven't a clue
        Mary  Female  15  no rest for the wicked

    .. which then may be used to select specific columns:

        # sel 1 2 < marybob.txt
        1       2
        Male    13
        Female  15

COLUMN FILTERING
    The user may specify which columns should be written to STDOUT, and in
    what order, by passing them on the command line. For instance, if
    marybob.txt contains the following data:

        NAME  GENDER  AGE  MOTTO
        Bob   Male    13   haven't a clue
        Mary  Female  15   no rest for the wicked
        Joe   Female  14   what're you looking at?

    .. then passing this through "sel NAME AGE GENDER" will produce:

        NAME  AGE  GENDER
        Bob   13   Male
        Mary  15   Female
        Joe   14   Female

    Columns may be specified twice to make them repeat in the output, so
    that "sel NAME AGE NAME GENDER" produces:

        NAME  AGE  NAME  GENDER
        Bob   13   Bob   Male
        Mary  15   Mary  Female
        Joe   14   Joe   Female

    Instead of listing columns, the --all option may be used to pass all
    columns. This can be useful when changing formats, or when using row
    filtering with --where.
    When using --all with formats which lack a header specifying column
    order, the output order will be alphanumeric by name. For instance,
    passing this into "sel --all --out=tab":

        <row><name>Bob</name><age>13</age></row>
        <row><name>Mary</name><age>15</age></row>
        <row><name>Joe</name><age>14</age></row>

    .. the following output will be produced:

        13  Bob
        15  Mary
        14  Joe

ROW FILTERING
    Select offers SQL-like filtering of lines (rows) based on comparing
    column values to string or numeric literals, or by comparing them to
    each other. It offers this functionality through the --where (and the
    similar --wheref) option.

    The syntax accepted by --where is a little limited. It must consist of
    "<entity> <operator> <entity>" tuples, conjoined by "AND" or "OR" (or
    the equivalent && and ||). Entities may be column names or string or
    numeric literals. For instance, using the following input:

        NAME  GENDER  AGE  BEGINS
        Bob   Male    13   18
        Mary  Female  15   15
        Joe   Female  14   17

    One may filter thus:

        # cat marybob.txt | sel NAME GENDER AGE --where='NAME eq "Bob" OR AGE > 14'
        Bob   Male    13
        Mary  Female  15

        # cat marybob.txt | sel NAME GENDER AGE --where='GENDER eq "Female" && AGE < 16'
        Mary  Female  15
        Joe   Female  14

        # cat marybob.txt | sel NAME GENDER AGE BEGINS --where='AGE == BEGINS'
        Mary  Female  15  15

    The subexpressions (between the "AND" and "OR" keywords) are actually
    passed to perl's eval(), so most perl operators may be used here:

        # cat marybob.txt | sel NAME GENDER AGE --where='NAME =~ /o/'
        Bob  Male    13
        Joe  Female  14

    "OR" boolean operators take precedence over "AND", and as of the time
    of this writing there is no support for setting precedence explicitly
    with parentheses.

        # cat marybob.txt | sel NAME GENDER AGE BEGINS --where='NAME ne "Bob" AND AGE > 16 OR BEGINS > 16'
        Joe  Female  14  17

TODO
    BUG: When no columns are given and --all is not specified, output is
    horribly broken and useless.

    Write a usage() message (maybe just LWP-get this document).

    Header detection is unreliable. When in doubt, use the --header option.

    XML: Use a more robust parser than that offered by the libxml2 library.
    XML: Support emitting encapsulating tags around all rows, to make a
    valid XML document.

    JSON: Support emitting encapsulating []'s around all rows, to make a
    valid JSON document.

    WHERE: Support more complex expressions and terms (like: lc(NAME)).

    XML, JSON: Represent complex structures better for simpler output
    modes.

    TAB, CSV, LF: Try harder to produce valid headers. When converting from
    headerless formats to headerful formats, try harder to ascertain
    correct header order.

    META: New mode. Implement file inputs: XML files, JSON files, META
    files (flatten from files to rows).

    Implement field ranges: NAME-AGE == NAME GENDER AGE

    Implement --allbut, so that the user may specify which headers to omit.

    Implement --shuffle, so that users may move columns relative to each
    other.
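    The proposed field-range expansion is easy to picture. A hypothetical
    Python sketch of how a NAME-AGE spec could expand against a header
    (none of this is implemented in select):

```python
def expand_range(spec, header):
    """Expand a 'FIRST-LAST' column spec into the run of header names
    between (and including) the two endpoints.
    Naive: assumes column names themselves contain no '-'."""
    first, last = spec.split("-", 1)
    i, j = header.index(first), header.index(last)
    return header[i:j + 1]

header = ["NAME", "GENDER", "AGE", "MOTTO"]
print(expand_range("NAME-AGE", header))  # ['NAME', 'GENDER', 'AGE']
```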