Much of the information we expect contributors to provide can be
conveyed through file naming and placement conventions. Here is an
outline of how the required and optional elements are organized.
Also, note that all files with textual content are stored as
PDF or as plain text files (encoded using UTF-8). Of course, there
may also be binary files in a dataset, including things like movies,
audio recordings, images, and so on; we may give format recommendations
for these, but will probably take whatever format contributors
provide. We should probably also accept text document contributions in
Microsoft Word format (in addition to PDF or plain text) and convert
them to PDF automatically. For text files, we should accept
contributions with Unix, Mac, or Windows-style line ending
conventions and then normalize them to a single convention. We should
probably also provide for text files included in downloads to be
translated to a given platform's line ending conventions (through
user preferences or browser request settings).
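The line ending handling described above could be sketched roughly as
follows. This is only an illustration, not a settled design: the
function name normalize_line_endings and the choice of LF as the
stored form are assumptions.

```python
def normalize_line_endings(text: str, target: str = "\n") -> str:
    """Rewrite Windows (CRLF) and classic Mac (CR) line endings.

    The text is first unified to LF (assumed here to be the stored
    form), then converted to `target` if a different convention is
    requested, e.g. "\r\n" for a Windows download.
    """
    # Replace CRLF first so a CRLF pair is not turned into two newlines.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    if target != "\n":
        unified = unified.replace("\n", target)
    return unified
```

The same function would cover both directions: called with the default
target on upload, and with the requesting platform's convention on
download.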
A dataset upload is a zip file that includes a "dataset.properties"
file at its root, like this one. Other files written in the
primary language are also included in the root of the zip file
(or in subdirectories based on subject study codes, if needed).
All translated data files in secondary languages should be placed in
a subdirectory within the zip file named after the language in which
the file is written. That means the top-level of a multi-lingual
upload might look like this:
/dataset.properties -- the primary metadata file
/* -- data files in the primary language (i.e., the
first entry in the dataset.languages list)
/de/* -- subdir for files written in German
/en/* -- subdir for files written in English
/fr/* -- subdir for files written in French
/sv/* -- Swedish ...
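A minimal sketch of how an upload's top-level layout might be checked,
using Python's standard zipfile module. The function name
summarize_upload is hypothetical. Note that this sketch cannot tell a
language subdirectory from a subject-code subdirectory; doing so would
require reading the dataset.languages list, which is not attempted
here.

```python
import io
import zipfile


def summarize_upload(zip_bytes: bytes) -> dict:
    """Return the root-level files and top-level subdirectories of an upload.

    Raises ValueError if dataset.properties is not at the zip root.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Skip explicit directory entries, which end with "/".
        names = [n for n in zf.namelist() if not n.endswith("/")]
    if "dataset.properties" not in names:
        raise ValueError("dataset.properties missing from zip root")
    root_files = sorted(n for n in names if "/" not in n)
    # Subdirectories may hold translations or per-subject files.
    subdirs = sorted({n.split("/", 1)[0] for n in names if "/" in n})
    return {"root_files": root_files, "subdirs": subdirs}
```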
At the root (and within each language dir, for translations), there are
several file names that, if present, contain specific information:
overview.txt
This file should contain a plain text description (or a PDF
version in overview.pdf) of the purpose for which this data
was collected, including any research questions and
experimental hypotheses. It could be anything from a brief
paragraph to a long section extracted from a relevant
research paper, depending on what the contributor wants to
provide, although shorter is probably better.
subjects.txt
This file should describe the subject population involved
in the experiment, including its size, how subjects were
selected, what the background of the subjects was, etc.
This could be in subjects.pdf instead.
method.txt
This file should describe the experimental design and protocol
used in collecting the data. It may refer to supplementary
files by name, such as a text copy of a survey questionnaire,
or a text copy of a structured interview protocol, etc. We
can come up with recommended names for the common cases that we
can think of. This could be in method.pdf instead.
data.csv
This is the main tabular/numeric data file for the dataset, if
there is one. Columns, formats, etc., are completely user-defined,
except that the file should be encoded in UTF-8, and we recommend
that date/time columns use an ISO 8601 format. The first row should
contain the column names. Some examples:
-- For a survey, there might be one row per subject, with one
column per question containing that subject's response.
-- For an interview, there might be one row per subject, with
the coded demographic data for that subject in the columns
(age, gender, race, date of interview, etc.).
-- For datasets with multiple CSV files, the files can be named
data01.csv, data02.csv, etc.
-- In all cases, appropriate subject codes could be used.
For example, we could use "s01", "s02", "s03", ...
-- Generally, this file (or the other data*.csv files) can contain
one or more columns that refer to additional data files in
the dataset, such as the file holding this subject's interview
transcript, or the subdirectory holding this subject's programming
code, or the file holding the video recording of this subject's
attempt at a task, etc.
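As a rough sketch of how a consumer might read such a file, assuming
Python's standard csv module: the function name read_data_csv and the
column names in the usage note (id, interview_date, transcript) are
hypothetical, since columns are user-defined. The date parsing assumes
plain calendar dates written as YYYY-MM-DD.

```python
import csv
import io
from datetime import date


def read_data_csv(raw: bytes, date_columns=()):
    """Decode a UTF-8 data.csv and return its rows as dicts.

    The first row supplies the column names. Columns listed in
    `date_columns` are parsed as YYYY-MM-DD calendar dates.
    """
    # "utf-8-sig" decodes UTF-8 and drops a leading byte-order mark if present.
    text = raw.decode("utf-8-sig")
    rows = []
    for row in csv.DictReader(io.StringIO(text, newline="")):
        for column in date_columns:
            row[column] = date.fromisoformat(row[column])
        rows.append(row)
    return rows
```

For example, calling read_data_csv(raw, ["interview_date"]) on a file
with columns id, interview_date, and transcript would leave the
transcript column as a plain file name (such as s01.txt) that the
caller can then look up elsewhere in the dataset.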
Where possible, such additional data files related to a single
subject should be given names that include the subject code as
the file's/directory's base name or prefix (e.g., "s01.txt",
"s14.mov", "s15-1.dat", or "s75/" as a subdirectory).
datatoc.csv
This is the "table of contents" for the data*.csv files that
defines what the columns are. It is a fixed-format file containing
one row describing each column in the data*.csv files. The
columns in this fixed-format file are:
Col Content
-------------- --------
File The name of the data file containing the column
described in this row (e.g., "data.csv" or "data02.csv").
Col Name The name of the data column in the given file (e.g.,
"id", "Subject", "score", "q12", etc.). We suggest
using reasonably compact names for ease of manipulation
(see Extended Label below).
Type Excel-compatible description of the type of data in
the given column (e.g., number, text, date, time, etc.).
Meaning The meaning of the data in the given column.
Extended Label An optional field that provides a longer, more
descriptive, human-readable version of a column's name. For example,
if the column contains survey responses, this might
include the full text of the corresponding survey
question. Alternatively, if the column name is "ncsloc",
this column might say "Non-commented Source Lines of
Code".
Scale An optional field that characterizes the measurement
scale used for values entered in this column (e.g.,
nominal, ordinal, interval, or ratio).
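One use of datatoc.csv would be to check that every column in the
data*.csv files is actually described. The sketch below assumes that
datatoc.csv itself starts with a header row using the field names
listed above ("File", "Col Name", and so on); that header row and the
function name undocumented_columns are assumptions, not something
decided above.

```python
import csv
import io


def undocumented_columns(datatoc_text: str, data_files: dict) -> list:
    """List (file, column) pairs that have no row in datatoc.csv.

    `data_files` maps a file name such as "data.csv" to its text.
    """
    described = {
        (row["File"], row["Col Name"])
        for row in csv.DictReader(io.StringIO(datatoc_text))
    }
    missing = []
    for filename, text in sorted(data_files.items()):
        # The first row of each data file holds its column names.
        header = next(csv.reader(io.StringIO(text)), [])
        for column in header:
            if (filename, column) not in described:
                missing.append((filename, column))
    return missing
```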
s01.txt
s02.txt
s*.txt
These are subject-specific data files with contents that depend on
the experiment. For example, each of these can be the transcript of
an interview with the corresponding subject. If there are multiple
files, they can be named s01-1.txt, s01-2.txt, etc., or placed in
a subdirectory s01/.
Note that I've been using two-digit subject study codes here, but
we can't require that, and we don't want to limit the number of
subjects, so it is reasonable to allow study codes without
zero-padding. Further, we could probably allow contributors to use any
unique identifiers as study codes, although we could recommend something
simple like "s" + number.
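For the recommended "s" + number form, grouping a dataset's files by
subject code could look something like the sketch below. The pattern
and the function name group_by_subject are illustrative assumptions;
arbitrary contributor-chosen identifiers would need a different
approach, such as taking the codes from a column in data.csv.

```python
import re
from collections import defaultdict

# "s" plus digits (with or without zero-padding), followed by "-", ".",
# "/", or the end of the name: s01.txt, s15-1.dat, s75/..., or bare s7.
SUBJECT_PATTERN = re.compile(r"^(s\d+)(?:[-./]|$)")


def group_by_subject(paths):
    """Map each subject code to the paths whose names start with it."""
    groups = defaultdict(list)
    for path in paths:
        match = SUBJECT_PATTERN.match(path)
        if match:
            groups[match.group(1)].append(path)
    return dict(groups)
```

Names such as subjects.txt or data.csv are skipped, because the "s"
must be followed immediately by a digit.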