This is a draft document for a system in preparation
Part of the agreement for research projects working at the SAFE Project is that all project data is submitted to the central SAFE Project data repository, so that all the data collected at SAFE is available to future researchers. In order to make it easy for data to be found and used in the future, we need researchers to provide some (relatively!) simple information in their datafiles.
At the moment, the data format only applies to tabular datasets stored in Excel spreadsheets - this accounts for about 90% of the data files used by researchers. For the moment, we will typically handle other data manually.
The details described below will be used to automatically publish your data to Zenodo. You should choose titles, descriptions and keywords that you would be happy to be permanently associated with your dataset!
The basic format for a SAFE dataset submission is an Excel Workbook, which must contain the following three worksheets:
After these worksheets come your data tables. You should label these sheets with a sensible name (not 'Sheet1'!) and each data table must be described in the Summary worksheet. You can include as many data tables as you like in a single dataset: we don't want you to spend time rearranging your data and are happy just to take the data in the natural tables you already use.
Use a simple short name for your spreadsheet - there will be a lot of information giving more detail inside. Please do not use spaces in your file name - you can use underscores to separate words.
Click on this link to download the spreadsheet template containing the required worksheets, labels and headers.
You can also look at existing approved datasets to see how the format is used:
Some links to examples will be added as the system gets used
We've tried to make the description below as clear as possible but in order to help you prepare your file:
If you want to check your formatting yourself before submitting it then the code used to check Excel datasets is freely available online here. The link also provides instructions on how to use the code to check your data. You will need a computer with Python installed and which is connected to the internet (although the program can be setup to allow offline use).
When your dataset is published, the metadata will be immediately publicly visible. This includes details of the data fields, the spatial scope, the date range and the like. If you set the access status as
Open, then the Excel file itself will also be immediately publicly available.
We would prefer that as much data as possible is submitted with
Open access status, but if you want to restrict access to the data while you work on papers, then you can use the
Embargo access status and set an embargo date. The metadata will still be visible, so that researchers can see that the data exists but the data itself will only become available once the embargo date has passed. You cannot embargo a dataset for more than two years.
Obviously, you can choose to provide embargoed data to other researchers within the embargo period. If researchers contact the SAFE Project for access to data during the embargo period, we will always pass the request on to you.
This worksheet contains a simple set of rows describing the dataset and identifying the spreadsheets that contain data tables. Each row is labelled on the left in the first column and then the description data should be typed in the columns to the right.
The following example shows the required rows. You must include all of these rows even if you don't provide that metadata: just leave the row blank. Examples of where this might happen are embargo date for Open access datasets and Author affiliations, emails and ORCIDs.
|SAFE Project ID||1|
|Title||Example data for the SAFE Project|
|Description||This is an example dataset.|
|Author name||Orme, David|
|Author affiliation||Imperial College London|
|Worksheet title||My shiny dataset||My incidence matrix|
|Worksheet description||This is a test dataset||A test dataset too|
|Keywords||Keyword 1||Keyword 2|
The first rows are simple:
Embargo. If you want to embargo your data, then provide a date when the embargo will end: you cannot embargo data for more than two years.
These rows provide contact details for the authors of the data. If the datasets should be credited to more than author, then provide sets of details in adjacent columns. If you have an ORCID, provide it here: this is a good way to help link all of your academic outputs to you!
Affiliation and email are also optional, but we would very much prefer complete author metadata (name, affiliation, email) for all authors. However we realise that sometimes this isn't possible: if you're uploading data collected by past students who you've lost contact with, then you might not have these details for any author.
Author names must be formatted as “last name, first name”: “Orme, C David L” not “C David L Orme”.
Each data worksheet must be described here - do not include
Locations worksheet in this block. As with the authors, you can describe multiple sheets in adjacent columns. The worksheet name row must contain the name of a worksheet in the workbook: that is, the exact text shown on the worksheet label tab at the bottom. The title and description summarise what data is found in a given sheet.
Provide keywords for the dataset here, with one keyword (or short phrase) per cell in the row.
Provide DOIs for publication using the data here and you can add multiple DOIs, one per cell in the row. Please format the DOI as a URL using
https://doi.org/ before the DOI, so
Many datasets will involve data taken from organisms, whether that is a count of the number of individuals or measurement of a trait such as body length. In order to help us keep track of taxa, all datasets using taxa must contain a
Taxa spreadsheet, providing taxonomic information.
Note that you must only provide details for taxa actually used in the data worksheets. This ensures that the taxonomic index for a dataset is accurate and also double checks that it the omission of a taxon from the data worksheets is not an error.
In order to help keep the taxonomy as clean as possible and to allow us to index the taxonomic coverage of datasets, we will check all taxon names in Taxa worksheet against the GBIF backbone taxonomy database. If you want to check your taxon names and ranks, then the search engine is here:
No online taxonomy is ever going to be 100% up to date (or 100% agree with your taxonomic usage!) but the GBIF backbone has very good taxonomic coverage and is well curated.
The table format looks like this:
|Taxon name||Scientific name||Taxon type||Parent name||Parent type|
|Crematogaster borneensis||Crematogaster borneensis||Species|
The table must contain column headers in the first row of the worksheet. The headers must include:
Taxon name: This column must contain all of the taxon names that you are going to use to identify taxa in the rest of the dataset. You cannot have duplicated names! Note that these can be abbreviations or codes: if you want to use
Crbein your data worksheets, rather than typing out
Crematogaster borneensisevery time, then that is fine.
Scientific name: This column must contain the scientific name of the taxon, which will be used for taxon validation via GBIF.
Taxon type: This column must provide the taxonomic type of the named taxon, which is usually the taxonomic level. For example, the taxon Pongo pygmaeus would be of type
Speciesand the taxon Formicidae would be of type
Family. However, we also recognise morphospecies and functional groups - see below for details.
You can optionally include the following two columns.
Parent type: These columns are used when you need to provide a taxonomic parent for a taxon. This will be in a handful of cases: a new or unrecognised taxon, morphospecies, functional groups and taxa at less common taxonomic levels. These two columns then provide a taxonomic hook to allow us to place the taxon in the backbone taxonomy.
If a taxon is new or not recognized by GBIF (and you're sure you're right!) then provide a parent name and type to allow us to hook the taxon into the index. For example Pongo tapanuliensis is not currently recognised as a species, so providing Pongo as a parent name of type 'genus' allows us to place the new taxon.
For morphospecies and functional groups, the taxon name is the label to be used in the dataset. Set the Scientific name to be 'NA' - it cannot be blank - and then specify the taxon type as 'Morphospecies' or 'Functional group'.
Now you need to provide a parent taxon and type. The level of taxonomic certainty for morphospecies and functional groups is quite variable, but we'd like the finest taxonomic level you can provide. As an example, in the table above, 'Morphospecies #1' is simply identified as being an ant.
The GBIF backbone taxonomy only includes the following eight major levels: Kingdom, Phylum, Order, Class, Family, Genus, Species and Subspecies. If you need to use taxa defined at any intermediate levels, then again provide a parent taxon and type. For example, if you were counting bees and only identifying to tribe level (Bombini, Euglossini, etc.) then the parent family Apidae would allow us to hook the taxa into the backbone taxonomy. The subfamily Apinae would be more precise, but subfamily isn't one of the backbone taxonomic levels.
Fine. You can omit the Taxa worksheet!
Like the Taxa worksheet, all locations in your data worksheets need to be listed in this worksheet. By location, we mean the common frequently used areas in which research has happened at SAFE. You might have more detail about the precise place you worked in your dataset - great! - but using these known locations allows us to get broad spatial data on sampling relatively simply.
So, we expect you'll have a relatively small set of location names in your data sheets, all of which should appear in this worksheet: the worksheet should contain a column of location names, with the column header 'Location name' in the first row.
The location names are checked against the location names known in the SAFE gazetteer. You can look at the gazetteer webpage to see the available sites and to download location data:
If you want to get a list of valid location names for use in a program or script, then we provide a web service that returns a list of valid names as a JSON object:
For example, in R:
If your data comes from genuinely new locations or uses a sampling structure (e.g. a grid or transect) that is likely to be used again in the future, then you can create new location names and include them in your locations table. We will then consider adding them to the Gazetteer.
If you include new locations then you will need to include the following columns in your
New: This should simply contain
Noto show which rows contain new locations. You cannot create a new location with a name that matches an existing location in the Gazetteer.
Longitude: these should provide GPS coordinates for the new site. These must be provided as decimal degrees (not degrees minutes and seconds) and please provide 6 decimal places in your coordinates. This level of precision is around ten centimetres and, although the GPS from the field is highly unlikely to be accurate to this level, we want to record as much sampling precision as possible. If you don't have any GPS data for the new location, please explicitly enter
NAin these fields.
Type: For most new locations, this will be
POINT, so the latitude and longitude are sufficient. New linear sampling features (e.g. transects) are
LINESTRINGand sampling areas are
POLYGON. In these cases, you will need to email the SAFE administrators and provide a GIS file containing the spatial information for your new locations.
You only need to provide
Type in the rows for new locations: these rows can be blank for locations that are already in the gazetteer.
You don't have to include the Locations worksheet, although it would be very unusual. Possible examples:
Finally, we get to the worksheets containing your actual data!
The top rows of the worksheet are used to provide metadata descriptors for each of the columns ('fields') in your data worksheet. Each descriptor row has a label, which must appear in Column A of the worksheet, with the value for each field appearing above that column.
The following are the mandatory field descriptors, which are needed for all fields and which cannot be blank.
field_type: This has to be one of the following values indicating the field type (see the options below).
description: a short description of the field
field_name: the name of the variable. The name format should be suitable for loading into an analysis package and should not contain spaces: use an underscore (
_) to put gaps in names. This descriptor must always be the last descriptor row, immediately above the data, so that it can be used as field headers when loading data from the file for analysis.
There are also some additional field descriptors, which are mandatory for some data types (see the descriptions of the data types below). The options are:
levels: contains the set of level names used in a categorical variable.
method: a contain a short description of the method and equipment used to record numeric, abundance and trait data.
units: the units of numeric or trait variables.
taxon_name: the name of the taxon for which all of the trait or abundance data in a field is recorded. The taxon name must appear in the Taxa worksheet.
taxon_field: the name of a taxon field in the datasheet which shows the taxon for which trait or abundance data on that row is recorded.
interaction_name: a set of names giving the interacting taxa for interaction data .
interaction_field: a set of field names, where the rows of the field give the interacting taxa for interaction data.
These descriptors only have to be completed for the appropriate data types: leave them blank for any fields that don't require them.
If your data worksheets contain missing data, you must enter 'NA' in those cells, not just leave them blank. This is to make it absolutely unambiguous that a given value is actually missing. We know this is picky but it can be absolutely vital: for example, does a blank cell in an abundance matrix mean that the species wasn't seen (so the cell should be zero) or that the trap for that species fell over and you don't know if it was recorded (so it should be NA).
You must number the rows in your data worksheet. The row numbers must start at 1 in the cell directly under the
field_name descriptor, increase by 1 as you move down through the cells and must continue down to the last row containing data. The row numbers must not extend below the data: the template numbers rows down to 1000, so delete the numbers for any unused rows in your data!
This section shows the options that can appear in the
field_type descriptor, along with any further descriptors that might be needed. See the sections below for details on formatting, but the available types are:
Time: when were the data collected?
Location: where was the data collected?
Longitude: GPS data for the exact location.
Replicate: a record of replication, usually just a repeating set of numbers.
ID: a column showing any kind of identification code.
Categorical: otherwise known as a factor: a variable that puts data into a fixed set of groups.
Ordered Categorical: a factor where there is a logical order to the levels.
Numeric: all kinds of numeric data.
Taxa: what taxa was the data collected from?
Abundance: for abundance/density/presence data collected about a taxon.
Categorical Trait: for categorical data collected on a taxon.
Numeric Trait: for numeric data collected on a taxon.
Categorical Interaction: for categorical data on interactions between taxa.
Numeric Interaction: for numeric data on interactions between taxa.
We have three kinds of date and time fields!
Datetime: The data in the field includes both a time and a date (e.g. 21/05/2016 15:32), which could be a visit time and day to a site or when a camera trap was deployed or similar data.
Date: The data in the field only specifies a date (e.g. 21/05/2016).
Time: The data in the field only specifies a time (e.g. 15:32).
We don't mind how you provide the date and time information but you do need to be consistent within a field.
Excel cell formatting can make this confusing. Both date and time are stored in Excel as a single number (N: days since the beginning of January 1900). If N < 1 it represents a time and if N > 1 it is a date. However, cell formatting can mislead you as to what is actually stored in the cell.
18:00, but it could also display in Excel as
00/01/1900 18:00if formatted as a date. This is reasonably easy to spot because of the 0th of January!
12/01/1900but it could display as the time
12/01/1900 18:00but could display as the time
Note that the value 12 is ambiguous, because Excel doesn't differentiate between integer and float numbers: it could just refer to the day (the integer 12) or mean exactly midnight on the day (the float 12.0). This is one reason why we have the three data types!
Columns of this type contain location labels showing where the data in the row was recorded. All of the labels must have been included in the Locations worksheet.
Columns of this type contain taxon names showing the taxon from which other data in the row was recorded. All of the values in the row must appear in the
Taxon Names column in the Taxa worksheet.
ID fields could contain almost any values. Replicates are typically just shown with repeating numbers, but researchers could use other formats. ID can represent lots of things (for example, PIT tag numbers for individual organism, fine scale spatial sampling ID, batch number for reagents) and again could have almost any format.
Replicate fields are checked for missing data (NAs are permitted) but no other validation occurs.
Both categorical and ordered categorical data (also known as a factors) are made up of a set of levels showing the different groups or treatment. The data in the column then shows which level applies to each row.
levels descriptor, you must provide a complete set of all the levels used in the column, which will be checked against the data. The level names must be short text labels. Do not use integer level names: they are harder to interpret in statistical analyses and there is a real risk that they are analysed as a number by mistake.
The format is that the level names are separated using semi-colons (
;). For example:
If the levels aren't obvious, we'd also like label descriptions: they come after each label, separated by colons (
:). For example:
Control:sites in reserve forest;Logged:sites in logged forest;Burned:sites in burned forest
Do not use colons or semi-colons in your level names or descriptions!
Ordered Categorical fields, the order of the entries in the
levels descriptor should be the logical order of the factor. For example, an ordered disturbance gradient could be:
Primary:primary rainforest;Once:once logged rainforest;Twice:twice logged;Salvage:salvage logged;Oil palm:plantation
This field type should be used to record numeric variables except numeric variables recorded from taxa (see Traits below). The
method descriptor should include information about how the variable is measured and the
units descriptor must provide the units used.
Not all numeric variables have methods or units: a column of replicate numbers, for example. If this is the case, enter
None rather than leaving the descriptors blank. (If you prefer to use
Dimensionless as the unit for dimensionless quantities then that is also fine!)
Both traits and abundance data tie a value (category or number) to a single taxon. You need to format your data so that it is clear which taxon each value comes from. There are two possible formats:
taxon_namedescriptor for this column.
taxon_fielddescriptor needs to contain the field name of the appropriate Taxa column.
It is an error to provide both
taxon_field descriptors for an Abundance or Trait field.
method and one of
Abundance is used here as an umbrella term to cover a wide range of possibilities from casual observation data ('We saw two clouded leopards on Friday on the road near F100'), through presence/absence data to precise measurements of abundances or encounter rate.
method descriptor needs to provide a detailed description of the sampling method, including the area surveyed, the length of time spent sampling, the number of samplers and any equipment. This should be detailed enough to allow the sampling protocol to be replicated. If other columns provide sampling information, such as survey time or area, then make this clear.
levels and one of
This is just a categorical variable where the groups apply to a taxa. So, we need information on the levels used, as for a standard categorical variable, and a link to taxonomic information as described in the examples above.
method and one of
This is just a numeric variable where the groups apply to a taxa. So, we need the method and units for the values, as for a standard numeric variable, and a link to taxonomic information as described in the examples above.
Interaction data is essentially just a column of categorical or numeric data that you want to associate with (at least) two taxon identities, but there are lots of ways that the taxon identities could be provided.
Interaction data fields do this by using two alternative descriptors to tie the data to taxa:
interaction_field. Each descriptor can provide one or more taxon names and optionally their role - the formatting is identical to the categorical data
levels descriptors. So, for example an
interaction_name descriptor might be:
Moon rat:prey;Clouded leopard:predator;
You can use one or both of the descriptors, depending on how your data is laid out. For the most common case of two interacting taxa, the following three possibilities exist.
1. Both interacting taxa vary from row to row, so taxon names are provided in two fields
2. Alternatively, all of the data might refer to the same two taxa, so the taxon names can be provided directly.
3. Finally, one side of the interaction might vary from row to row but the other side is constant for all rows.
You must provide at least two taxon names or fields, but you can provide more if you have tritrophic interactions! Again, you can use any combination of interaction names and fields to provide your taxon identities.
If you have a free text field with notes or comments, then this is the field type to use. We don't really check anything in comments fields: they're not expected to be complete data and you can put anything in them.
A word of caution though: it is highly unlikely that anyone will ever read your comments column again. If there is genuinely important information that might apply across multiple rows, consider coding it as an explicit variable rather than consigning it to a comments field.