The SAFE Project wiki

A shared workspace for documenting information and research at the SAFE project

User Tools

Site Tools


working_at_safe:data_submission_format

This is a draft document for a system in preparation

Part of the agreement for research projects working at the SAFE Project is that all project data is submitted to the central SAFE Project data repository, so that all the data collected at SAFE is available to future researchers. In order to make it easy for data to be found and used in the future, we need researchers to provide some (relatively!) simple information in their datafiles.

At the moment, the data format only applies to tabular datasets stored in Excel spreadsheets - this accounts for about 90% of the data files used by researchers. For the moment, we will typically handle other data manually.

Format overview

The details described below will be used to automatically publish your data to Zenodo. You should choose titles, descriptions and keywords that you would be happy to be permanently associated with your dataset!

The basic format for a SAFE dataset submission is an Excel Workbook, which must contain the following three worksheets:

  • Summary: This contains some simple information about the authors of the dataset, access rights and the individual data tables in the dataset.
  • Taxa: This describes all the taxa used in the dataset.
  • Locations: This describes all the sampling locations used in the dataset.

After these worksheets come your data tables. You should label these sheets with a sensible name (not 'Sheet1'!) and each data table must be described in the Summary worksheet. You can include as many data tables as you like in a single dataset: we don't want you to spend time rearranging your data and are happy just to take the data in the natural tables you already use.

Spreadsheet Template and Examples

Click on this link to download the spreadsheet template containing the required worksheets, labels and headers.

You can also look at existing approved datasets to see how the format is used:

Some links to examples will be added as the system gets used

Format checking

We've tried to make the description below as clear as possible but in order to help you prepare your file:

  1. It is easier to follow an example than to follow a description, so please use the template and look at the examples.
  2. We use a Python program to automatically check the formatting of datasets. When you submit a file, you will get a report back from this program that will highlight any problems with your dataset. If there are problems, fix them and replace the submitted file. Once the file passes through the checker without problem, we will double check the file and then publish your dataset.

Checking your own data

If you want to check your formatting yourself before submitting it then the code used to check Excel datasets is freely available online here. The link also provides instructions on how to use the code to check your data. You will need a computer with Python installed and which is connected to the internet (although the program can be setup to allow offline use).

Data availability

When your dataset is published, the metadata will be immediately publicly visible. This includes details of the data fields, the spatial scope, the date range and the like. If you set the access status as Open, then the Excel file itself will also be immediately publicly available.

We would prefer that as much data as possible is submitted with Open access status, but if you want to restrict access to the data while you work on papers, then you can use the Embargo access status and set an embargo date. The metadata will still be visible, so that researchers can see that the data exists but the data itself will only become available once the embargo date has passed. You cannot embargo a dataset for more than two years.

Obviously, you can choose to provide embargoed data to other researchers within the embargo period. If researchers contact the SAFE Project for access to data during the embargo period, we will always pass the request on to you.

The Summary worksheet

This worksheet contains a simple set of rows describing the dataset and identifying the spreadsheets that contain data tables. Each row is labelled on the left in the first column and then the description data should be typed in the columns to the right. The following example shows the required rows:

SAFE Project ID 1
Access status Embargo
Embargo date 03/09/18
Title Example data for the SAFE Project
Description This is an example dataset.
Author name Orme, David
Author email d.orme@imperial.ac.uk
Author affiliation Imperial College London
Author ORCID 0000-0002-7005-1394
Worksheet name DF Incidence
Worksheet title My shiny dataset My incidence matrix
Worksheet description This is a test dataset A test dataset too
Keywords Keyword 1 Keyword 2

The first rows are simple:

  • SAFE Project ID: This is the project number from the SAFE project website. When you upload your dataset, you will also be asked to choose a project for your dataset: these two numbers must match. Note that you can only upload data to a project of which you are a member.
  • Access status and Embargo date: As described above, the access status of the datasets can either be Open or Embargo. If you want to embargo your data, then provide a date when the embargo will end: you cannot embargo data for more than two years.
  • Title: This should be a short informative title for the dataset: it will be used as the public title for the dataset so make sure it is clear and grammatical!
  • Description: This will be the public description of the dataset. Note that you can have paragraphs of text within a single cell in Excel, so please do provide a reasonable summary. You will need to use Alt + Enter (or Alt + Shift + Enter on a Mac) to insert a carriage return.

The author block

These rows provide contact details for the authors of the data. If the datasets should be credited to more than author, then provide sets of details in adjacent columns. If you have an ORCID, provide it here: this is a good way to help link all of your academic outputs to you!

Author names must be formatted as “last name, first name”: “Orme, C David L” not “C David L Orme”.

The worksheet block

Each data worksheet must be described here - do not include Taxa and Locations worksheet in this block. As with the authors, you can describe multiple sheets in adjacent columns. The worksheet name row must contain the name of a worksheet in the workbook: that is, the exact text shown on the worksheet label tab at the bottom. The title and description summarise what data is found in a given sheet.

Keywords

Provide keywords for the dataset here, with one keyword (or short phrase) per cell in the row.

The Taxa worksheet

Many datasets will involve data taken from organisms, whether that is a count of the number of individuals or measurement of a trait such as body length. In order to help us keep track of taxa, all datasets using taxa must contain a Taxa spreadsheet, providing taxonomic information. The table format looks like this:

Taxon name Taxon type Kingdom Phylum Class Order Family Genus Species Morphospecies
Crematogaster borneensis Species Animalia Arthropoda Insecta Hymenoptera Formicidae Crematogaster Crematogaster borneensis
Dolichoderus sp. Genus Animalia Arthropoda Insecta Hymenoptera Formicidae Dolichoderus
Formicidae #1 Family Animalia Arthropoda Insecta Hymenoptera Formicidae
Morphospecies 1 Morphospecies Animalia Arthropoda Insecta Hymenoptera Formicidae Morphospecies 1

The table must contain column headers in the first row of the worksheet. The first two columns must be:

  • Taxon name: This column must contain all of the taxon names that you are going to use to identify taxa in the rest of the dataset. You cannot have duplicated names! Note that these can be abbreviations or codes: if you want to use Crbe in your data worksheets, rather than typing out Crematogaster borneensis every time, then that is fine.
  • Taxon type: This column tells us the taxonomic type of the named taxon, which is usually the taxonomic level. For example, the taxon Pongo pygmaeus would be of type Species and the taxon Formicidae would be of type Family. However, we also recognise morphospecies and functional groups - see below for details

You must include the main seven taxonomic levels in this worksheet: Kingdom, Phylum, Class, Order, Family, Genus and Species. You must fill in each of these levels up to the taxon type of a particular taxon: for example, the table above has all the taxonomy for Dolichoderus sp. but obviously no species name is needed.

Note that if you are only using a single taxonomic type and not abbreviating your taxon names, then you will end with the same data in two columns. For example, if all your taxa are identified as species using latin binomials, then your Taxon name column might be identical to your Species column. This is correct - it does make your file more bulky but this system allows us to have mixed taxonomic types and still have a simple set of names for the taxa used in the dataset.

You can also include other taxonomic levels. We do not routinely check other levels but the dataset checking program has an option to check all taxonomic levels. The NCBI dataset includes the following levels: Kingdom, Subkingdom, Superphylum, Phylum, Subphylum, Superclass, Class, Subclass, Infraclass, Cohort, Superorder, Order, Suborder, Infraorder, Parvorder, Superfamily, Family, Subfamily, Tribe, Subtribe, Genus, Subgenus, Species group, Species subgroup, Species, Subspecies, Varietas, Forma.

Taxon validation

In order to help keep the taxonomy as clean as possible and to allow us to index the taxonomic coverage of datasets, we will check all taxon names in Taxa worksheet against the NCBI taxonomy database. If you want to check your taxon names and ranks, then the NCBI search engine is here:

https://www.ncbi.nlm.nih.gov/taxonomy

New and unrecognized taxa

Some taxa may either be genuinely new or just missing from the NCBI database. If you are using a taxon name that you are sure is correct or is new, then add an asterisk to the end of the taxon name and the file checker will stop treating it as an error. For example,

  Crematogaster ormei*

You can do this for any taxonomic level but we will get suspicious if your file contains new kingdoms…

Morphospecies

If you are using morphospecies, then provide a column Morphospecies giving the morphospecies names (in case you want to use abbreviations in the datasheets). Please provide taxonomic information up to at least Order level for each morphospecies.

Functional groups

If you are using functional groups, then provide a column Functional Group giving the functional group. Functional groups are highly context dependent, so we only insist you identify each group to Kingdom, but please do provide more taxonomic detail if possible!

My data doesn't contain taxa

Fine. You can omit the Taxa worksheet!

The Location worksheet

Like the Taxa worksheet, all locations in your data worksheets need to be listed in this worksheet. By location, we mean the common frequently used areas in which research has happened at SAFE. You might have more detail about the precise place you worked in your dataset - great! - but using these known locations allows us to get broad spatial data on sampling relatively simply.

So, we expect you'll have a relatively small set of location names in your data sheets, all of which should appear in this worksheet: the worksheet should contain a column of location names, with the column header 'Location name' in the first row.

Location verification

The location names are checked against the location names known in the SAFE gazetteer. You can look at the gazetteer webpage to see the available sites and to download location data:

https://www.safeproject.net/info/gazetteer

If you want to get a list of valid location names for use in a program or script, then we provide a web service that returns a list of valid names as a JSON object:

https://www.safeproject.net/call/json/get_locations

For example, in R:

> library(jsonlite)
> locations <- fromJSON("https://www.safeproject.net/call/json/get_locations")
> str(locations)
List of 1
 $ locations: chr [1:2691] "SAFE_camp" "Flux_tower" "A_1" "A_2" ...

New locations

If your data comes from genuinely new locations or uses a sampling structure (e.g. a grid or transect) that is likely to be used again in the future, then you can create new location names and include them in your locations table. We will then consider adding them to the Gazetteer.

If you include new locations then you will need to include the following columns in your Locations worksheet:

  • New: This should simply contain Yes or No to show which rows contain new locations. You cannot create a new location with a name that matches an existing location in the Gazetteer.
  • Latitude and Longitude: these should provide GPS coordinates for the new site. These must be provided as decimal degrees (not degrees minutes and seconds) and please provide 6 decimal places in your coordinates. This level of precision is around ten centimetres and, although the GPS from the field is highly unlikely to be accurate to this level, we want to record as much sampling precision as possible.
  • Type: For most new locations, this will be POINT, so the latitude and longitude are sufficient. New linear sampling features (e.g. transects) are LINESTRING and sampling areas are POLYGON. In these cases, you will need to email the SAFE administrators and provide a GIS file containing the spatial information for your new locations.

You only need to provide Latitude, Longitude and Type in the rows for new locations: these rows can be blank for locations that are already in the gazetteer.

My data doesn't include any locations

You don't have to include the Locations worksheet, although it would be very unusual. Possible examples:

  1. You are working with lab data (and don't need to say where specimens came from in the field)
  2. You are collecting data haphazardly from across the landscape, for example tracking animal movements, and the data isn't tied to particular sampling locations. We would then want GPS data for each observation!

Data worksheets

Finally, we get to the worksheets containing your actual data!

Field metadata

The top rows of the worksheet are used to provide metadata descriptors for each of the columns ('fields') in your data worksheet. Each descriptor row has a label, which must appear in Column A of the worksheet, with the value for each field appearing above that column.

The following are the mandatory field descriptors, which are needed for all fields and which cannot be blank.

  • field_type: This has to be one of the following values indicating the field type (see the options below).
  • description: a short description of the field
  • field_name: the name of the variable. The name format should be suitable for loading into an analysis package and should not contain spaces: use an underscore (_) to put gaps in names. This descriptor must always be the last descriptor row, immediately above the data, so that it can be used as field headers when loading data from the file for analysis.

There are also some additional field descriptors, which are mandatory for some data types (see the descriptions of the data types below). These only have to be completed for the appropriate data types - you should leave them blank for any fields that don't require them. The options are:

  • levels: contains the set of level names used in a categorical variable.
  • method: a contain a short description of the method and equipment used to record numeric, abundance and trait data.
  • units: the units of numeric or trait variables.
  • taxon_name: the taxon (or taxon column) that trait or abundance data is recorded from.

Missing data

If your data worksheets contain missing data, you must enter 'NA' in those cells, not just leave them blank. This is to make it absolutely unambiguous that a given value is actually missing. We know this is picky but it can be absolutely vital: for example, does a blank cell in an abundance matrix mean that the species wasn't seen (so the cell should be zero) or that the trap for that species fell over and you don't know if it was recorded (so it should be NA).

Row numbers

You must number the rows in your data worksheet. The row numbers must start at 1 in the cell directly under the field_name descriptor, increase by 1 as you move down through the cells and must continue down to the last row containing data. The row numbers must not extend below the data: the template numbers rows down to 1000, so delete the numbers for any unused rows in your data!

Field types

This section shows the options that can appear in the field_type descriptor, along with any further descriptors that might be needed. See the sections below for details on formatting, but the available types are:

  • Date, Datetime and Time: when were the data collected?
  • Location: where was the data collected?
  • Latitude, Longitude: GPS data for the exact location.
  • Replicate: a record of replication, usually just a repeating set of numbers.
  • ID: a column showing any kind of identification code.
  • Categorical: otherwise known as a factor: a variable that puts data into a fixed set of groups.
  • Ordered Categorical: a factor where there is a logical order to the levels.
  • Numeric: all kinds of numeric data.
  • Taxa: what taxa was the data collected from?
  • Abundance: for abundance/density/presence data collected about a taxon.
  • Categorical Trait: for categorical data collected on a taxon.
  • Numeric Trait: for numeric data collected on a taxon.

Date, Datetime and Time

We have three kinds of date and time fields!

  • Datetime: The data in the field includes both a time and a date (e.g. 21/05/2016 15:32), which could be a visit time and day to a site or when a camera trap was deployed or similar data.
  • Date: The data in the field only specifies a date (e.g. 21/05/2016).
  • Time: The data in the field only specifies a time (e.g. 15:32).

We don't mind how you provide the date and time information but you do need to be consistent within a field.

Excel cell formatting can make this confusing. Both date and time are stored in Excel as a single number (N: days since the beginning of January 1900). If N < 1 it represents a time and if N > 1 it is a date. However, cell formatting can mislead you as to what is actually stored in the cell.

  • 0.75 is the time 18:00, but it could also display in Excel as 00/01/1900 or 00/01/1900 18:00 if formatted as a date. This is reasonably easy to spot because of the 0th of January!
  • 12 is the date 12/01/1900 but it could display as the time 00:00 or as 12/01/1900 00:00.
  • 12.75 is the datetime 12/01/1900 18:00 but could display as the time 18:00 or as 12/01/1900.

Note that the value 12 is ambiguous, because Excel doesn't differentiate between integer and float numbers: it could just refer to the day (the integer 12) or mean exactly midnight on the day (the float 12.0). This is one reason why we have the three data types!

Locations

Columns of this type contain location labels showing where the data in the row was recorded. All of the labels must have been included in the Locations worksheet.

Taxon

Columns of this type contain taxon names showing what taxon the data in the row was recorded from. All of the values in the row must appear in the Taxon Names column in the Taxa worksheet.

Replicate and ID

Both Replicate and ID fields could contain almost any values. Replicates are typically just shown with repeating numbers, but researchers could use other formats. ID can represent lots of things (for example, PIT tag numbers for individual organism, fine scale spatial sampling ID, batch number for reagents) and again could have almost any format.

So, both ID and Replicate fields are checked for missing data (NAs are permitted) but no other validation occurs.

Categorical and Ordered Categorical

Field descriptor levels required

Both categorical fields (also known as a factors) are made up of a set of levels showing the different groups or treatment. The data in the column then shows which level applies to each row.

In the levels descriptor, you must provide a complete set of all the levels used in the column, which will be checked against the data. The level names must be short text labels. Do not use integer level names: they are harder to interpret in statistical analyses and there is a real risk that they are analysed as a number by mistake.

The format is that the level names are separated using semi-colons (;). For example:

  Control;Logged;Burned

If the levels aren't obvious, we'd also like label descriptions: they come after each label, separated by colons (:). For example:

  Control:sites in reserve forest;Logged:sites in logged forest;Burned:sites in burned forest

Do not use colons or semi-colons in your level names or descriptions!

For Ordered Categorical fields, the order of the entries in the levels descriptor should be the logical order of the factor. For example, an ordered disturbance gradient could be:

  Primary:primary rainforest;Once:once logged rainforest;Twice:twice logged;Salvage:salvage logged;Oil palm:plantation

Numeric

Field descriptors method and units required

This field type should be used to record numeric variables except numeric variables recorded from taxa (see Traits below). The method descriptor should include information about how the variable is measured and the units descriptor must provide the units used.

Not all numeric variables have methods or units: a column of replicate numbers, for example. If this is the case, enter None rather than leaving the descriptors blank. (If you prefer to use Dimensionless as the unit for dimensionless quantities then that is also fine!)

Trait and abundance data

Both traits and abundance data tie a value (category or number) to a particular taxon. You need to format your data so that it is clear which taxon each value comes from. There are two possible formats:

  1. All observations in a column are from a single taxon: in this case, you can put a valid taxon name (see Taxa worksheet) in the taxon_name descriptor for this column.

Example: Observation counts in separate columns for each taxon

  1. Different rows in the column refer to different taxa: in this case, you must also have a Taxa column and the taxon_name descriptor needs to contain the field name of the appropriate Taxa column.

Example: Observation counts with different taxa in rows

Taxa

Columns of type Taxa contain taxon names showing what taxon the data in the row was recorded for. All of the names must appear in the Taxa worksheet.

Abundance

Field descriptors taxon_name and method required

Abundance is used here as an umbrella term to cover a wide range of possibilities from casual observation data ('We saw two clouded leopards on Friday on the road near F100'), through presence/absence data to precise measurements of abundances or encounter rate.

The method descriptor needs to provide a detailed description of the sampling method, including the area surveyed, the length of time spent sampling, the number of samplers and any equipment. This should be detailed enough to allow the sampling protocol to be replicated. If other columns provide sampling information, such as survey time or area, then make this clear.

Categorical trait

Field descriptors taxon_name and levels required

This is just a categorical variable where the groups apply to a taxa. So, we need information on the levels used, as for a standard categorical variable, and a link to taxonomic information as described in the examples above.

Numeric trait

Field descriptors taxon_name, units and ''method required

This is just a numeric variable where the groups apply to a taxa. So, we need the method and units for the values, as for a standard numeric variable, and a link to taxonomic information as described in the examples above.

Comments

If you have a free text field with notes or comments, then this is the field type to use. We don't really check anything in comments fields: they're not expected to be complete data and you can put anything in them.

A word of caution though: it is highly unlikely that anyone will ever read your comments column again. If there is genuinely important information that might apply across multiple rows, consider coding it as an explicit variable rather than consigning it to a comments field.

working_at_safe/data_submission_format.txt · Last modified: 2017/08/18 12:57 by david_orme