This is a draft document for a system in preparation
Part of the agreement for research projects working at the SAFE Project is that all project data is submitted to the central SAFE Project data repository, so that all the data collected at SAFE is available to future researchers. In order to make it easy for data to be found and used in the future, we need researchers to provide some (relatively!) simple information in their datafiles.
At the moment, the data format only applies to tabular datasets stored in Excel spreadsheets - this accounts for about 90% of the data files used by researchers. For the moment, we will typically handle other data manually.
The details described below will be used to automatically publish your data to Zenodo. You should choose titles, descriptions and keywords that you would be happy to be permanently associated with your dataset!
The basic format for a SAFE dataset submission is an Excel Workbook, which must contain the following three worksheets:
After these worksheets come your data tables. You should label these sheets with a sensible name (not 'Sheet1'!) and each data table must be described in the Summary worksheet. You can include as many data tables as you like in a single dataset: we don't want you to spend time rearranging your data and are happy just to take the data in the natural tables you already use.
Click on this link to download the spreadsheet template containing the required worksheets, labels and headers.
You can also look at existing approved datasets to see how the format is used:
Some links to examples will be added as the system gets used
We've tried to make the description below as clear as possible but in order to help you prepare your file:
If you want to check your formatting yourself before submitting it then the code used to check Excel datasets is freely available online here. The link also provides instructions on how to use the code to check your data. You will need a computer with Python installed and which is connected to the internet (although the program can be setup to allow offline use).
When your dataset is published, the metadata will be immediately publicly visible. This includes details of the data fields, the spatial scope, the date range and the like. If you set the access status as
Open, then the Excel file itself will also be immediately publicly available.
We would prefer that as much data as possible is submitted with
Open access status, but if you want to restrict access to the data while you work on papers, then you can use the
Embargo access status and set an embargo date. The metadata will still be visible, so that researchers can see that the data exists but the data itself will only become available once the embargo date has passed. You cannot embargo a dataset for more than two years.
Obviously, you can choose to provide embargoed data to other researchers within the embargo period. If researchers contact the SAFE Project for access to data during the embargo period, we will always pass the request on to you.
This worksheet contains a simple set of rows describing the dataset and identifying the spreadsheets that contain data tables. Each row is labelled on the left in the first column and then the description data should be typed in the columns to the right. The following example shows the required rows:
|SAFE Project ID||1|
|Title||Example data for the SAFE Project|
|Description||This is an example dataset.|
|Author name||Orme, David|
|Author affiliation||Imperial College London|
|Worksheet title||My shiny dataset||My incidence matrix|
|Worksheet description||This is a test dataset||A test dataset too|
|Keywords||Keyword 1||Keyword 2|
The first rows are simple:
Embargo. If you want to embargo your data, then provide a date when the embargo will end: you cannot embargo data for more than two years.
These rows provide contact details for the authors of the data. If the datasets should be credited to more than author, then provide sets of details in adjacent columns. If you have an ORCID, provide it here: this is a good way to help link all of your academic outputs to you!
Author names must be formatted as “last name, first name”: “Orme, C David L” not “C David L Orme”.
Each data worksheet must be described here - do not include
Locations worksheet in this block. As with the authors, you can describe multiple sheets in adjacent columns. The worksheet name row must contain the name of a worksheet in the workbook: that is, the exact text shown on the worksheet label tab at the bottom. The title and description summarise what data is found in a given sheet.
Provide keywords for the dataset here, with one keyword (or short phrase) per cell in the row.
Many datasets will involve data taken from organisms, whether that is a count of the number of individuals or measurement of a trait such as body length. In order to help us keep track of taxa, all datasets using taxa must contain a
Taxa spreadsheet, providing taxonomic information. The table format looks like this:
|Taxon name||Taxon type||Kingdom||Phylum||Class||Order||Family||Genus||Species||Morphospecies|
|Crematogaster borneensis||Species||Animalia||Arthropoda||Insecta||Hymenoptera||Formicidae||Crematogaster||Crematogaster borneensis|
|Morphospecies 1||Morphospecies||Animalia||Arthropoda||Insecta||Hymenoptera||Formicidae||Morphospecies 1|
The table must contain column headers in the first row of the worksheet. The first two columns must be:
Taxon name: This column must contain all of the taxon names that you are going to use to identify taxa in the rest of the dataset. You cannot have duplicated names! Note that these can be abbreviations or codes: if you want to use
Crbein your data worksheets, rather than typing out
Crematogaster borneensisevery time, then that is fine.
Taxon type: This column tells us the taxonomic type of the named taxon, which is usually the taxonomic level. For example, the taxon Pongo pygmaeus would be of type
Speciesand the taxon Formicidae would be of type
Family. However, we also recognise morphospecies and functional groups - see below for details
You must include the main seven taxonomic levels in this worksheet: Kingdom, Phylum, Class, Order, Family, Genus and Species. You must fill in each of these levels up to the taxon type of a particular taxon: for example, the table above has all the taxonomy for Dolichoderus sp. but obviously no species name is needed.
Note that if you are only using a single taxonomic type and not abbreviating your taxon names, then you will end with the same data in two columns. For example, if all your taxa are identified as species using latin binomials, then your
Taxon name column might be identical to your
Species column. This is correct - it does make your file more bulky but this system allows us to have mixed taxonomic types and still have a simple set of names for the taxa used in the dataset.
You can also include other taxonomic levels. We do not routinely check other levels but the dataset checking program has an option to check all taxonomic levels. The NCBI dataset includes the following levels: Kingdom, Subkingdom, Superphylum, Phylum, Subphylum, Superclass, Class, Subclass, Infraclass, Cohort, Superorder, Order, Suborder, Infraorder, Parvorder, Superfamily, Family, Subfamily, Tribe, Subtribe, Genus, Subgenus, Species group, Species subgroup, Species, Subspecies, Varietas, Forma.
In order to help keep the taxonomy as clean as possible and to allow us to index the taxonomic coverage of datasets, we will check all taxon names in Taxa worksheet against the NCBI taxonomy database. If you want to check your taxon names and ranks, then the NCBI search engine is here:
Some taxa may either be genuinely new or just missing from the NCBI database. If you are using a taxon name that you are sure is correct or is new, then add an asterisk to the end of the taxon name and the file checker will stop treating it as an error. For example,
You can do this for any taxonomic level but we will get suspicious if your file contains new kingdoms…
If you are using morphospecies, then provide a column
Morphospecies giving the morphospecies names (in case you want to use abbreviations in the datasheets). Please provide taxonomic information up to at least Order level for each morphospecies.
If you are using functional groups, then provide a column
Functional Group giving the functional group. Functional groups are highly context dependent, so we only insist you identify each group to Kingdom, but please do provide more taxonomic detail if possible!
Fine. You can omit the Taxa worksheet!
Like the Taxa worksheet, all locations in your data worksheets need to be listed in this worksheet. By location, we mean the common frequently used areas in which research has happened at SAFE. You might have more detail about the precise place you worked in your dataset - great! - but using these known locations allows us to get broad spatial data on sampling relatively simply.
So, we expect you'll have a relatively small set of location names in your data sheets, all of which should appear in this worksheet: the worksheet should contain a column of location names, with the column header 'Location name' in the first row.
The location names are checked against the location names known in the SAFE gazetteer. You can look at the gazetteer webpage to see the available sites and to download location data:
If you want to get a list of valid location names for use in a program or script, then we provide a web service that returns a list of valid names as a JSON object:
For example, in R:
If your data comes from genuinely new locations or uses a sampling structure (e.g. a grid or transect) that is likely to be used again in the future, then you can create new location names and include them in your locations table. We will then consider adding them to the Gazetteer.
If you include new locations then you will need to include the following columns in your
New: This should simply contain
Noto show which rows contain new locations. You cannot create a new location with a name that matches an existing location in the Gazetteer.
Longitude: these should provide GPS coordinates for the new site. These must be provided as decimal degrees (not degrees minutes and seconds) and please provide 6 decimal places in your coordinates. This level of precision is around ten centimetres and, although the GPS from the field is highly unlikely to be accurate to this level, we want to record as much sampling precision as possible.
Type: For most new locations, this will be
POINT, so the latitude and longitude are sufficient. New linear sampling features (e.g. transects) are
LINESTRINGand sampling areas are
POLYGON. In these cases, you will need to email the SAFE administrators and provide a GIS file containing the spatial information for your new locations.
You only need to provide
Type in the rows for new locations: these rows can be blank for locations that are already in the gazetteer.
You don't have to include the Locations worksheet, although it would be very unusual. Possible examples:
Finally, we get to the worksheets containing your actual data!
The top rows of the worksheet are used to provide metadata descriptors for each of the columns ('fields') in your data worksheet. Each descriptor row has a label, which must appear in Column A of the worksheet, with the value for each field appearing above that column.
The following are the mandatory field descriptors, which are needed for all fields and which cannot be blank.
field_type: This has to be one of the following values indicating the field type (see the options below).
description: a short description of the field
field_name: the name of the variable. The name format should be suitable for loading into an analysis package and should not contain spaces: use an underscore (
_) to put gaps in names. This descriptor must always be the last descriptor row, immediately above the data, so that it can be used as field headers when loading data from the file for analysis.
There are also some additional field descriptors, which are mandatory for some data types (see the descriptions of the data types below). These only have to be completed for the appropriate data types - you should leave them blank for any fields that don't require them. The options are:
levels: contains the set of level names used in a categorical variable.
method: a contain a short description of the method and equipment used to record numeric, abundance and trait data.
units: the units of numeric or trait variables.
taxon_name: the taxon (or taxon column) that trait or abundance data is recorded from.
If your data worksheets contain missing data, you must enter 'NA' in those cells, not just leave them blank. This is to make it absolutely unambiguous that a given value is actually missing. We know this is picky but it can be absolutely vital: for example, does a blank cell in an abundance matrix mean that the species wasn't seen (so the cell should be zero) or that the trap for that species fell over and you don't know if it was recorded (so it should be NA).
You must number the rows in your data worksheet. The row numbers must start at 1 in the cell directly under the
field_name descriptor, increase by 1 as you move down through the cells and must continue down to the last row containing data. The row numbers must not extend below the data: the template numbers rows down to 1000, so delete the numbers for any unused rows in your data!
This section shows the options that can appear in the
field_type descriptor, along with any further descriptors that might be needed. See the sections below for details on formatting, but the available types are:
Time: when were the data collected?
Location: where was the data collected?
Longitude: GPS data for the exact location.
Replicate: a record of replication, usually just a repeating set of numbers.
ID: a column showing any kind of identification code.
Categorical: otherwise known as a factor: a variable that puts data into a fixed set of groups.
Ordered Categorical: a factor where there is a logical order to the levels.
Numeric: all kinds of numeric data.
Taxa: what taxa was the data collected from?
Abundance: for abundance/density/presence data collected about a taxon.
Categorical Trait: for categorical data collected on a taxon.
Numeric Trait: for numeric data collected on a taxon.
We have three kinds of date and time fields!
Datetime: The data in the field includes both a time and a date (e.g. 21/05/2016 15:32), which could be a visit time and day to a site or when a camera trap was deployed or similar data.
Date: The data in the field only specifies a date (e.g. 21/05/2016).
Time: The data in the field only specifies a time (e.g. 15:32).
We don't mind how you provide the date and time information but you do need to be consistent within a field.
Excel cell formatting can make this confusing. Both date and time are stored in Excel as a single number (N: days since the beginning of January 1900). If N < 1 it represents a time and if N > 1 it is a date. However, cell formatting can mislead you as to what is actually stored in the cell.
18:00, but it could also display in Excel as
00/01/1900 18:00if formatted as a date. This is reasonably easy to spot because of the 0th of January!
12/01/1900but it could display as the time
12/01/1900 18:00but could display as the time
Note that the value 12 is ambiguous, because Excel doesn't differentiate between integer and float numbers: it could just refer to the day (the integer 12) or mean exactly midnight on the day (the float 12.0). This is one reason why we have the three data types!
Columns of this type contain location labels showing where the data in the row was recorded. All of the labels must have been included in the Locations worksheet.
Columns of this type contain taxon names showing what taxon the data in the row was recorded from. All of the values in the row must appear in the
Taxon Names column in the Taxa worksheet.
ID fields could contain almost any values. Replicates are typically just shown with repeating numbers, but researchers could use other formats. ID can represent lots of things (for example, PIT tag numbers for individual organism, fine scale spatial sampling ID, batch number for reagents) and again could have almost any format.
Replicate fields are checked for missing data (NAs are permitted) but no other validation occurs.
Both categorical fields (also known as a factors) are made up of a set of levels showing the different groups or treatment. The data in the column then shows which level applies to each row.
levels descriptor, you must provide a complete set of all the levels used in the column, which will be checked against the data. The level names must be short text labels. Do not use integer level names: they are harder to interpret in statistical analyses and there is a real risk that they are analysed as a number by mistake.
The format is that the level names are separated using semi-colons (
;). For example:
If the levels aren't obvious, we'd also like label descriptions: they come after each label, separated by colons (
:). For example:
Control:sites in reserve forest;Logged:sites in logged forest;Burned:sites in burned forest
Do not use colons or semi-colons in your level names or descriptions!
Ordered Categorical fields, the order of the entries in the
levels descriptor should be the logical order of the factor. For example, an ordered disturbance gradient could be:
Primary:primary rainforest;Once:once logged rainforest;Twice:twice logged;Salvage:salvage logged;Oil palm:plantation
This field type should be used to record numeric variables except numeric variables recorded from taxa (see Traits below). The
method descriptor should include information about how the variable is measured and the
units descriptor must provide the units used.
Not all numeric variables have methods or units: a column of replicate numbers, for example. If this is the case, enter
None rather than leaving the descriptors blank. (If you prefer to use
Dimensionless as the unit for dimensionless quantities then that is also fine!)
Both traits and abundance data tie a value (category or number) to a particular taxon. You need to format your data so that it is clear which taxon each value comes from. There are two possible formats:
taxon_namedescriptor for this column.
taxon_namedescriptor needs to contain the field name of the appropriate Taxa column.
Columns of type
Taxa contain taxon names showing what taxon the data in the row was recorded for. All
of the names must appear in the Taxa worksheet.
Abundance is used here as an umbrella term to cover a wide range of possibilities from casual observation data ('We saw two clouded leopards on Friday on the road near F100'), through presence/absence data to precise measurements of abundances or encounter rate.
method descriptor needs to provide a detailed description of the sampling method, including the area surveyed, the length of time spent sampling, the number of samplers and any equipment. This should be detailed enough to allow the sampling protocol to be replicated. If other columns provide sampling information, such as survey time or area, then make this clear.
This is just a categorical variable where the groups apply to a taxa. So, we need information on the levels used, as for a standard categorical variable, and a link to taxonomic information as described in the examples above.
units and ''method required
This is just a numeric variable where the groups apply to a taxa. So, we need the method and units for the values, as for a standard numeric variable, and a link to taxonomic information as described in the examples above.
If you have a free text field with notes or comments, then this is the field type to use. We don't really check anything in comments fields: they're not expected to be complete data and you can put anything in them.
A word of caution though: it is highly unlikely that anyone will ever read your comments column again. If there is genuinely important information that might apply across multiple rows, consider coding it as an explicit variable rather than consigning it to a comments field.