# The SAFE Project wiki

A shared workspace for documenting information and research at the SAFE project


This is a draft document for a system in preparation

Part of the agreement for research projects working at the SAFE Project is that all project data is submitted to the central SAFE Project data repository, so that all the data collected at SAFE is available to future researchers. To make it easy for data to be found and used in the future, we need researchers to provide some (relatively!) simple information in their data files.

At the moment, the data format only applies to tabular datasets stored in Excel spreadsheets - this accounts for about 90% of the data files used by researchers. For the moment, we will typically handle other data manually.

## Format overview

The details described below will be used to automatically publish your data to Zenodo. You should choose titles, descriptions and keywords that you would be happy to be permanently associated with your dataset!

The basic format for a SAFE dataset submission is an Excel Workbook, which must contain the following three worksheets:

• Summary: This contains some simple information about the authors of the dataset, access rights and the individual data tables in the dataset.
• Taxa: This describes all the taxa used in the dataset.
• Locations: This describes all the sampling locations used in the dataset.

After these worksheets come your data tables. You should label these sheets with a sensible name (not 'Sheet1'!) and each data table must be described in the Summary worksheet. You can include as many data tables as you like in a single dataset: we don't want you to spend time rearranging your data and are happy just to take the data in the natural tables you already use.

### File naming

Use a simple short name for your spreadsheet - there will be a lot of information giving more detail inside. Please do not use spaces in your file name - you can use underscores to separate words.

You can also look at existing approved datasets to see how the format is used:

Some links to examples will be added as the system gets used

### Format checking

We've tried to make the description below as clear as possible but in order to help you prepare your file:

1. It is easier to follow an example than to follow a description, so please use the template and look at the examples.
2. We use a Python program to automatically check the formatting of datasets. When you submit a file, you will get a report back from this program that will highlight any problems with your dataset. If there are problems, fix them and replace the submitted file. Once the file passes through the checker without problem, we will double check the file and then publish your dataset.

If you want to check your formatting yourself before submitting it then the code used to check Excel datasets is freely available online here. The link also provides instructions on how to use the code to check your data. You will need a computer that has Python installed and is connected to the internet (although the program can be set up to allow offline use).

### Data availability

When your dataset is published, the metadata will be immediately publicly visible. This includes details of the data fields, the spatial scope, the date range and the like. If you set the access status as Open, then the Excel file itself will also be immediately publicly available.

We would prefer that as much data as possible is submitted with Open access status, but if you want to restrict access to the data while you work on papers, then you can use the Embargo access status and set an embargo date. The metadata will still be visible, so that researchers can see that the data exists but the data itself will only become available once the embargo date has passed. You cannot embargo a dataset for more than two years.

Obviously, you can choose to provide embargoed data to other researchers within the embargo period. If researchers contact the SAFE Project for access to data during the embargo period, we will always pass the request on to you.

## The Summary worksheet

This worksheet contains a simple set of rows describing the dataset and identifying the worksheets that contain data tables. Each row is labelled on the left in the first column and then the description data should be typed in the columns to the right.

The following example shows the required rows. You must include all of these rows even if you don't provide that metadata: just leave the row blank. Examples of where this might happen are embargo date for Open access datasets and Author affiliations, emails and ORCIDs.

| Row label | First value | Second value |
|---|---|---|
| SAFE Project ID | 1 | |
| Access status | Embargo | |
| Embargo date | 03/09/18 | |
| Title | Example data for the SAFE Project | |
| Description | This is an example dataset. | |
| Author name | Orme, David | |
| Author email | d.orme@imperial.ac.uk | |
| Author affiliation | Imperial College London | |
| Author ORCID | 0000-0002-7005-1394 | |
| Worksheet name | DF | Incidence |
| Worksheet title | My shiny dataset | My incidence matrix |
| Worksheet description | This is a test dataset | A test dataset too |
| Keywords | Keyword 1 | Keyword 2 |
| Publication DOI | https://doi.org/10.1098/rstb.2011.0049 | |

The first rows are simple:

• SAFE Project ID: This is the project number from the SAFE project website. When you upload your dataset, you will also be asked to choose a project for your dataset: these two numbers must match. Note that you can only upload data to a project of which you are a member.
• Access status and Embargo date: As described above, the access status of the datasets can either be Open or Embargo. If you want to embargo your data, then provide a date when the embargo will end: you cannot embargo data for more than two years.
• Title: This should be a short informative title for the dataset: it will be used as the public title for the dataset so make sure it is clear and grammatical!
• Description: This will be the public description of the dataset. Note that you can have paragraphs of text within a single cell in Excel, so please do provide a reasonable summary. You will need to use Alt + Enter (or Alt + Shift + Enter on a Mac) to insert a carriage return.
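
The two-year embargo limit is easy to check before you submit. The following is a minimal sketch, not the code used by the SAFE checker: the function name is hypothetical and it assumes that the two years are counted from the day you submit, which this page does not state explicitly.

```python
from datetime import date


def embargo_date_is_valid(embargo_date, today):
    """Return True if embargo_date is after today and at most two years ahead.

    Assumption: the two-year limit is measured from the submission date.
    """
    try:
        limit = today.replace(year=today.year + 2)
    except ValueError:
        # today is 29 February and the target year is not a leap year
        limit = today.replace(year=today.year + 2, day=28)
    return today < embargo_date <= limit
```

For example, `embargo_date_is_valid(date(2018, 9, 3), date(2017, 1, 1))` is accepted, while an embargo date of 1 January 2020 submitted on the same day would be rejected.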

#### The author block

These rows provide contact details for the authors of the data. If the dataset should be credited to more than one author, then provide sets of details in adjacent columns. If you have an ORCID, provide it here: this is a good way to help link all of your academic outputs to you!

Affiliation and email are also optional, but we would very much prefer complete author metadata (name, affiliation, email) for all authors. However we realise that sometimes this isn't possible: if you're uploading data collected by past students who you've lost contact with, then you might not have these details for every author.

Author names must be formatted as “last name, first name”: “Orme, C David L” not “C David L Orme”.
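
As a rough illustration of the name format, the sketch below accepts a name only if it contains exactly one comma with text on both sides. The function name is hypothetical and the one-comma rule is an assumption: the real checker may be more or less strict.

```python
def author_name_is_valid(name):
    """Check for 'last name, first name': exactly one comma with text either side."""
    parts = [part.strip() for part in name.split(",")]
    return len(parts) == 2 and all(parts)
```

So `"Orme, C David L"` passes and `"C David L Orme"` does not.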

#### The worksheet block

Each data worksheet must be described here - do not include the Taxa and Locations worksheets in this block. As with the authors, you can describe multiple sheets in adjacent columns. The worksheet name row must contain the name of a worksheet in the workbook: that is, the exact text shown on the worksheet label tab at the bottom. The title and description summarise what data is found in a given sheet.

#### Keywords

Provide keywords for the dataset here, with one keyword (or short phrase) per cell in the row.

#### Publication DOI

Provide DOIs for publications that use the data here. You can add multiple DOIs, one per cell in the row. Please format each DOI as a URL by putting https://doi.org/ before the DOI, so https://doi.org/10.1098/rstb.2011.0049 not DOI:10.1098/rstb.2011.0049
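
If your reference manager exports DOIs in another style, a small helper can rewrite them. This is only a sketch with a hypothetical function name; the list of prefixes it strips is an assumption about the styles you are likely to meet.

```python
def doi_as_url(doi):
    """Rewrite 'DOI:10.x/y', a doi.org URL or a bare '10.x/y' as an https://doi.org/ URL."""
    text = doi.strip()
    lowered = text.lower()
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if lowered.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    if not text.startswith("10."):
        raise ValueError(f"Not a recognisable DOI: {doi!r}")
    return "https://doi.org/" + text
```

Calling `doi_as_url("DOI:10.1098/rstb.2011.0049")` gives the URL form shown above.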

## The Taxa worksheet

Many datasets will involve data taken from organisms, whether that is a count of the number of individuals or measurement of a trait such as body length. In order to help us keep track of taxa, all datasets using taxa must contain a Taxa worksheet, providing taxonomic information.

Note that you must only provide details for taxa actually used in the data worksheets. This ensures that the taxonomic index for a dataset is accurate and also double checks that the omission of a taxon from the data worksheets is not an error.

### Taxon validation

In order to help keep the taxonomy as clean as possible and to allow us to index the taxonomic coverage of datasets, we will check all taxon names in the Taxa worksheet against the GBIF backbone taxonomy database. If you want to check your taxon names and ranks, then the search engine is here:

No online taxonomy is ever going to be 100% up to date (or 100% agree with your taxonomic usage!) but the GBIF backbone has very good taxonomic coverage and is well curated.

### Taxon table layout

The table format looks like this:

| Taxon name | Scientific name | Taxon type | Parent name | Parent type |
|---|---|---|---|---|
| Crematogaster borneensis | Crematogaster borneensis | Species | | |
| Dolichoderus sp. | Dolichoderus | Genus | | |
| Morphospecies 1 | NA | Morphospecies | Formicidae | Family |

The table must contain column headers in the first row of the worksheet. The headers must include:

• Taxon name: This column must contain all of the taxon names that you are going to use to identify taxa in the rest of the dataset. You cannot have duplicated names! Note that these can be abbreviations or codes: if you want to use Crbe in your data worksheets, rather than typing out Crematogaster borneensis every time, then that is fine.
• Scientific name: This column must contain the scientific name of the taxon, which will be used for taxon validation via GBIF.
• Taxon type: This column must provide the taxonomic type of the named taxon, which is usually the taxonomic level. For example, the taxon Pongo pygmaeus would be of type Species and the taxon Formicidae would be of type Family. However, we also recognise morphospecies and functional groups - see below for details.

You can optionally include the following two columns.

• Parent name and Parent type: These columns are used when you need to provide a taxonomic parent for a taxon. This will be in a handful of cases: a new or unrecognised taxon, morphospecies, functional groups and taxa at less common taxonomic levels. These two columns then provide a taxonomic hook to allow us to place the taxon in the backbone taxonomy.

### New and unrecognized taxa

If a taxon is new or not recognised by GBIF (and you're sure you're right!) then provide a parent name and type to allow us to hook the taxon into the index. For example, Pongo tapanuliensis is not currently recognised as a species, so providing Pongo as a parent name of type 'genus' allows us to place the new taxon.

### Morphospecies and Functional groups

For morphospecies and functional groups, the taxon name is the label to be used in the dataset. Set the Scientific name to be 'NA' - it cannot be blank - and then specify the taxon type as 'Morphospecies' or 'Functional group'.

Now you need to provide a parent taxon and type. The level of taxonomic certainty for morphospecies and functional groups is quite variable, but we'd like the finest taxonomic level you can provide. As an example, in the table above, 'Morphospecies 1' is simply identified as being an ant.

### Less common taxonomic levels

The GBIF backbone taxonomy only includes the following eight major levels: Kingdom, Phylum, Class, Order, Family, Genus, Species and Subspecies. If you need to use taxa defined at any intermediate levels, then again provide a parent taxon and type. For example, if you were counting bees and only identifying to tribe level (Bombini, Euglossini, etc.) then the parent family Apidae would allow us to hook the taxa into the backbone taxonomy. The subfamily Apinae would be more precise, but subfamily isn't one of the backbone taxonomic levels.
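
The rules for the Taxa worksheet described above can be sketched as a simple check on the rows. This is an illustration rather than the checker's own code: the function and constant names are hypothetical, each row is assumed to be a dictionary keyed by the column headers with blank cells as `None`, and it cannot tell whether GBIF recognises a name, so new or unrecognised taxa are not covered.

```python
BACKBONE_LEVELS = {
    "kingdom", "phylum", "class", "order",
    "family", "genus", "species", "subspecies",
}
NON_TAXONOMIC_TYPES = {"morphospecies", "functional group"}


def check_taxa_rows(rows):
    """Return a list of problems found in Taxa worksheet rows."""
    problems = []
    seen = set()
    for row in rows:
        name = row["Taxon name"]
        taxon_type = (row["Taxon type"] or "").lower()
        has_parent = bool(row.get("Parent name")) and bool(row.get("Parent type"))
        # Taxon names are the labels used in the data, so they must be unique
        if name in seen:
            problems.append(f"{name}: duplicated taxon name")
        seen.add(name)
        # Morphospecies and functional groups use an explicit NA scientific name
        if taxon_type in NON_TAXONOMIC_TYPES and row["Scientific name"] != "NA":
            problems.append(f"{name}: scientific name must be NA")
        # Anything that is not a backbone level needs a parent to hook onto
        if taxon_type not in BACKBONE_LEVELS and not has_parent:
            problems.append(f"{name}: parent name and type required")
    return problems
```

A row for the tribe Bombini with no parent would be reported, but the same row with Apidae as a Family parent would pass.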

### My data doesn't contain taxa

Fine. You can omit the Taxa worksheet!

## The Locations worksheet

Like the Taxa worksheet, all locations in your data worksheets need to be listed in this worksheet. By location, we mean the common frequently used areas in which research has happened at SAFE. You might have more detail about the precise place you worked in your dataset - great! - but using these known locations allows us to get broad spatial data on sampling relatively simply.

So, we expect you'll have a relatively small set of location names in your data sheets, all of which should appear in this worksheet: the worksheet should contain a column of location names, with the column header 'Location name' in the first row.

### Location verification

The location names are checked against the location names known in the SAFE gazetteer. You can look at the gazetteer webpage to see the available sites and to download location data:

If you want to get a list of valid location names for use in a program or script, then we provide a web service that returns a list of valid names as a JSON object:

For example, in R:

    > library(jsonlite)
    > locations <- fromJSON("https://www.safeproject.net/call/json/get_locations")
    > str(locations)
    List of 1
     $ locations: chr [1:2691] "SAFE_camp" "Flux_tower" "A_1" "A_2" ...
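
The same lookup can be done in Python using only the standard library. This is a minimal sketch: the function names are hypothetical, and it assumes the service returns a JSON object with a single `locations` list of names, as the R output above suggests.

```python
import json
from urllib.request import urlopen

LOCATIONS_URL = "https://www.safeproject.net/call/json/get_locations"


def parse_location_names(payload):
    """Extract the set of valid names from the JSON text returned by the service."""
    return set(json.loads(payload)["locations"])


def fetch_location_names(url=LOCATIONS_URL):
    """Download the current list of valid location names (needs internet access)."""
    with urlopen(url) as response:
        return parse_location_names(response.read().decode("utf-8"))


def unknown_locations(names_in_data, valid_names):
    """Return the names used in the data that are not valid location names."""
    return sorted(set(names_in_data) - set(valid_names))
```

You could then call `unknown_locations(my_names, fetch_location_names())` to list any names in your data that the gazetteer does not know about.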

### New locations

If your data comes from genuinely new locations or uses a sampling structure (e.g. a grid or transect) that is likely to be used again in the future, then you can create new location names and include them in your locations table. We will then consider adding them to the Gazetteer.

If you include new locations then you will need to include the following columns in your Locations worksheet:

• New: This should simply contain Yes or No to show which rows contain new locations. You cannot create a new location with a name that matches an existing location in the Gazetteer.
• Latitude and Longitude: these should provide GPS coordinates for the new site. These must be provided as decimal degrees (not degrees minutes and seconds) and please provide 6 decimal places in your coordinates. This level of precision is around ten centimetres and, although the GPS from the field is highly unlikely to be accurate to this level, we want to record as much sampling precision as possible. If you don't have any GPS data for the new location, please explicitly enter NA in these fields.
• Type: For most new locations, this will be POINT, so the latitude and longitude are sufficient. New linear sampling features (e.g. transects) are LINESTRING and sampling areas are POLYGON. In these cases, you will need to email the SAFE administrators and provide a GIS file containing the spatial information for your new locations.

You only need to provide Latitude, Longitude and Type in the rows for new locations: these columns can be left blank for locations that are already in the gazetteer.
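
The requirements for a new location row can be sketched as follows. The function name is hypothetical, the exact upper-case spelling of the Type values is an assumption, and the sketch does not try to check the number of decimal places because trailing zeros are lost once Excel values are read as numbers.

```python
VALID_TYPES = {"POINT", "LINESTRING", "POLYGON"}


def check_new_location(name, latitude, longitude, loc_type, known_names):
    """Return a list of problems with one row flagged as a new location."""
    problems = []
    if name in known_names:
        problems.append("name already exists in the gazetteer")
    for label, value, bound in (("Latitude", latitude, 90), ("Longitude", longitude, 180)):
        if value == "NA":
            continue  # an explicit NA is allowed when there is no GPS data
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or abs(value) > bound:
            problems.append(f"{label} must be decimal degrees or NA")
    if loc_type not in VALID_TYPES:
        problems.append("Type must be POINT, LINESTRING or POLYGON")
    return problems
```

Coordinates typed as degrees, minutes and seconds arrive as text rather than numbers, so they are reported as a problem.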

### My data doesn't include any locations

You don't have to include the Locations worksheet, although it would be very unusual. Possible examples:

1. You are working with lab data (and don't need to say where specimens came from in the field)
2. You are collecting data haphazardly from across the landscape, for example tracking animal movements, and the data isn't tied to particular sampling locations. We would then want GPS data for each observation!

## Data worksheets

Finally, we get to the worksheets containing your actual data!

The top rows of the worksheet are used to provide metadata descriptors for each of the columns ('fields') in your data worksheet. Each descriptor row has a label, which must appear in Column A of the worksheet, with the value for each field appearing in that row, above the field's column of data.

The following are the mandatory field descriptors, which are needed for all fields and which cannot be blank.

• field_type: This must be one of the field type values listed under Field types below.
• description: a short description of the field
• field_name: the name of the variable. The name format should be suitable for loading into an analysis package and should not contain spaces: use an underscore (_) to put gaps in names. This descriptor must always be the last descriptor row, immediately above the data, so that it can be used as field headers when loading data from the file for analysis.

There are also some additional field descriptors, which are mandatory for some data types (see the descriptions of the data types below). The options are:

• levels: contains the set of level names used in a categorical variable.
• method: a short description of the method and equipment used to record numeric, abundance and trait data.
• units: the units of numeric or trait variables.
• taxon_name: the name of the taxon for which all of the trait or abundance data in a field is recorded. The taxon name must appear in the Taxa worksheet.
• taxon_field: the name of a taxon field in the datasheet which shows the taxon for which trait or abundance data on that row is recorded.
• interaction_name: a set of names giving the interacting taxa for interaction data.
• interaction_field: a set of field names, where the rows of the field give the interacting taxa for interaction data.

These descriptors only have to be completed for the appropriate data types: leave them blank for any fields that don't require them.
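
To make the layout concrete, the sketch below splits a worksheet into its descriptor rows and data rows. It assumes the worksheet has already been read into a list of rows, each a list of cell values with blank cells as `None`; the function names are hypothetical and the small example worksheet in the usage note is made up.

```python
MANDATORY = ("field_type", "description", "field_name")


def split_descriptors(rows):
    """Split worksheet rows into a dict of descriptor rows and the data rows.

    Descriptor labels are in the first column (Column A) and field_name is
    the last descriptor row, immediately above the data.
    """
    descriptors = {}
    for index, row in enumerate(rows):
        label = row[0]
        descriptors[label] = row[1:]
        if label == "field_name":
            return descriptors, rows[index + 1:]
    raise ValueError("No field_name descriptor row found in Column A")


def missing_mandatory(descriptors):
    """Report mandatory descriptors that are absent or have blank values."""
    problems = []
    for label in MANDATORY:
        values = descriptors.get(label)
        if values is None:
            problems.append(f"{label}: row missing")
        elif any(value in (None, "") for value in values):
            problems.append(f"{label}: blank value")
    return problems
```

With rows such as `["field_type", "Location", "Numeric"]`, `["description", ...]`, `["units", None, "degrees C"]` and `["field_name", "site", "temp"]` followed by numbered data rows, `split_descriptors` returns the four descriptor rows as a dictionary and everything below `field_name` as data.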

### Missing data

If your data worksheets contain missing data, you must enter 'NA' in those cells, not just leave them blank. This is to make it absolutely unambiguous that a given value is actually missing. We know this is picky but it can be absolutely vital: for example, does a blank cell in an abundance matrix mean that the species wasn't seen (so the cell should be zero) or that the trap for that species fell over and you don't know if it was recorded (so it should be NA)?
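
A quick way to find cells that need an explicit 'NA' is sketched below. The function name is hypothetical, and it assumes data rows are lists with the row number in the first position and blank cells read as `None` or as empty text.

```python
def blank_cells(data_rows):
    """Return (row number, field index) for every empty cell; 'NA' and 0 are not empty."""
    found = []
    for row in data_rows:
        row_number, values = row[0], row[1:]
        for column, value in enumerate(values, start=1):
            if value is None or (isinstance(value, str) and value.strip() == ""):
                found.append((row_number, column))
    return found
```

Note that a zero count is a real value and is not flagged: only genuinely empty cells are.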

### Row numbers

You must number the rows in your data worksheet. The row numbers must start at 1 in the cell directly under the field_name descriptor, increase by 1 as you move down through the cells and must continue down to the last row containing data. The row numbers must not extend below the data: the template numbers rows down to 1000, so delete the numbers for any unused rows in your data!
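
The row numbering rule can be sketched as a short check. The function name is hypothetical, and it assumes the data rows have been read as lists with the row number first and blank cells as `None`.

```python
def row_numbers_are_valid(data_rows):
    """True if the first column counts 1, 2, 3, ... and every numbered row holds data."""
    for expected, row in enumerate(data_rows, start=1):
        if row[0] != expected:
            return False
        if all(value is None for value in row[1:]):
            # a numbered row with no data, such as one left over from the template
            return False
    return True
```

Excel often returns whole numbers as floats, which is fine here because `1.0 == 1` in Python.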

## Field types

This section shows the options that can appear in the field_type descriptor, along with any further descriptors that might be needed. See the sections below for details on formatting, but the available types are:

• Date, Datetime and Time: when were the data collected?
• Location: where was the data collected?
• Latitude, Longitude: GPS data for the exact location.
• Replicate: a record of replication, usually just a repeating set of numbers.
• ID: a column showing any kind of identification code.
• Categorical: otherwise known as a factor: a variable that puts data into a fixed set of groups.
• Ordered Categorical: a factor where there is a logical order to the levels.
• Numeric: all kinds of numeric data.
• Taxa: which taxa were the data collected from?
• Abundance: for abundance/density/presence data collected about a taxon.
• Categorical Trait: for categorical data collected on a taxon.
• Numeric Trait: for numeric data collected on a taxon.
• Categorical Interaction: for categorical data on interactions between taxa.
• Numeric Interaction: for numeric data on interactions between taxa.
• Comments: for free text notes or comments.

### Date, Datetime and Time

We have three kinds of date and time fields!

• Datetime: The data in the field includes both a time and a date (e.g. 21/05/2016 15:32), which could be a visit time and day to a site or when a camera trap was deployed or similar data.
• Date: The data in the field only specifies a date (e.g. 21/05/2016).
• Time: The data in the field only specifies a time (e.g. 15:32).

We don't mind how you provide the date and time information but you do need to be consistent within a field.

Excel cell formatting can make this confusing. Both date and time are stored in Excel as a single number (N: days since the beginning of January 1900). If N < 1 it represents a time and if N ≥ 1 it is a date, with any fractional part giving the time of day. However, cell formatting can mislead you as to what is actually stored in the cell.

• 0.75 is the time 18:00, but it could also display in Excel as 00/01/1900 or 00/01/1900 18:00 if formatted as a date. This is reasonably easy to spot because of the 0th of January!
• 12 is the date 12/01/1900 but it could display as the time 00:00 or as 12/01/1900 00:00.
• 12.75 is the datetime 12/01/1900 18:00 but could display as the time 18:00 or as 12/01/1900.

Note that the value 12 is ambiguous, because Excel doesn't differentiate between integer and float numbers: it could just refer to the day (the integer 12) or mean exactly midnight on the day (the float 12.0). This is one reason why we have the three data types!
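
If you want to see what is really stored in a cell, the sketch below classifies a raw Excel number and converts it to a Python datetime. The function names are hypothetical. The conversion assumes the default 1900 date system and allows for Excel's well-known quirk of treating 1900 as a leap year, which shifts the offset by one day for serial numbers above 60.

```python
from datetime import datetime, timedelta


def classify_serial(n):
    """Classify an Excel serial number as 'time', 'date' or 'datetime'."""
    if n < 1:
        return "time"
    return "date" if float(n).is_integer() else "datetime"


def serial_to_datetime(n):
    """Convert an Excel (1900 date system) serial number to a datetime.

    For serials below 1 only the time of day in the result is meaningful.
    """
    days = int(n)
    if days == 60:
        raise ValueError("Serial 60 is Excel's non-existent 29 February 1900")
    # Excel counts a 29 February 1900, so later serials are one day ahead
    base = datetime(1899, 12, 31) if days < 60 else datetime(1899, 12, 30)
    seconds = round((n - days) * 86400)
    return base + timedelta(days=days, seconds=seconds)
```

With these, 0.75 is classified as a time, 12 as a date and 12.75 as a datetime, and `serial_to_datetime(12.75)` gives 18:00 on 12 January 1900, matching the examples above.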

### Locations

Columns of this type contain location labels showing where the data in the row was recorded. All of the labels must have been included in the Locations worksheet.

### Taxa

Columns of this type contain taxon names showing the taxon from which other data in the row was recorded. All of the values in the column must appear in the Taxon name column in the Taxa worksheet.

### Replicate and ID

Both Replicate and ID fields could contain almost any values. Replicates are typically just shown with repeating numbers, but researchers could use other formats. ID can represent lots of things (for example, PIT tag numbers for individual organisms, fine scale spatial sampling ID, batch number for reagents) and again could have almost any format.

So, both ID and Replicate fields are checked for missing data (NAs are permitted) but no other validation occurs.

### Categorical data

Field descriptor levels required

Both categorical and ordered categorical data (also known as factors) are made up of a set of levels showing the different groups or treatments. The data in the column then shows which level applies to each row.

In the levels descriptor, you must provide a complete set of all the levels used in the column, which will be checked against the data. The level names must be short text labels. Do not use integer level names: they are harder to interpret in statistical analyses and there is a real risk that they are analysed as a number by mistake.

The format is that the level names are separated using semi-colons (;). For example:

  Control;Logged;Burned

If the levels aren't obvious, we'd also like label descriptions: they come after each label, separated by colons (:). For example:

  Control:sites in reserve forest;Logged:sites in logged forest;Burned:sites in burned forest

Do not use colons or semi-colons in your level names or descriptions!

For Ordered Categorical fields, the order of the entries in the levels descriptor should be the logical order of the factor. For example, an ordered disturbance gradient could be:

  Primary:primary rainforest;Once:once logged rainforest;Twice:twice logged;Salvage:salvage logged;Oil palm:plantation
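
The levels format is simple to parse, which is why colons and semi-colons must not appear in names or descriptions. The sketch below is an illustration with hypothetical function names; treating 'NA' as an acceptable value in the data column is an assumption.

```python
def parse_levels(descriptor):
    """Parse 'label:description;label:description' into ordered (label, description) pairs.

    The description is None when only a label is given.
    """
    levels = []
    for entry in descriptor.split(";"):
        label, _, description = entry.partition(":")
        label = label.strip()
        if label.isdigit():
            raise ValueError(f"Integer level name not allowed: {label}")
        levels.append((label, description.strip() or None))
    return levels


def undeclared_levels(descriptor, values):
    """Return values used in the data that are not declared in the levels descriptor."""
    declared = {label for label, _ in parse_levels(descriptor)}
    return sorted(set(values) - declared - {"NA"})
```

Because the parsed list keeps the order of the descriptor, the same function works for Ordered Categorical fields: the disturbance gradient above comes back as Primary, Once, Twice, Salvage, Oil palm.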

### Numeric data

Field descriptors method and units required

This field type should be used to record numeric variables except numeric variables recorded from taxa (see Traits below). The method descriptor should include information about how the variable is measured and the units descriptor must provide the units used.

Not all numeric variables have methods or units: a column of replicate numbers, for example. If this is the case, enter None rather than leaving the descriptors blank. (If you prefer to use Dimensionless as the unit for dimensionless quantities then that is also fine!)

### Abundance and trait data

Both traits and abundance data tie a value (category or number) to a single taxon. You need to format your data so that it is clear which taxon each value comes from. There are two possible formats:

1. All observations in a column are from a single taxon: in this case, you can put a valid taxon name (see Taxa worksheet) in the taxon_name descriptor for this column.

Example: Observation counts in separate columns for each taxon

2. Different rows in the column refer to different taxa: in this case, you must also have a Taxa column and the taxon_field descriptor needs to contain the field name of the appropriate Taxa column.

Example: Observation counts with different taxa in rows

It is an error to provide both taxon_name and taxon_field descriptors for an Abundance or Trait field.
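
The either-or rule for the two descriptors can be written as a small check. This is a sketch with a hypothetical function name, assuming blank descriptor cells are read as `None` or empty text.

```python
def taxon_link(taxon_name, taxon_field):
    """Return which descriptor ties an Abundance or Trait field to its taxa.

    Exactly one of the two descriptors must be filled in.
    """
    has_name = bool(taxon_name)
    has_field = bool(taxon_field)
    if has_name and has_field:
        raise ValueError("Provide taxon_name or taxon_field, not both")
    if not (has_name or has_field):
        raise ValueError("One of taxon_name or taxon_field is required")
    return "taxon_name" if has_name else "taxon_field"
```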

#### Abundance

Field descriptors method and one of taxon_name or taxon_field required

Abundance is used here as an umbrella term to cover a wide range of possibilities from casual observation data ('We saw two clouded leopards on Friday on the road near F100'), through presence/absence data to precise measurements of abundances or encounter rate.

The method descriptor needs to provide a detailed description of the sampling method, including the area surveyed, the length of time spent sampling, the number of samplers and any equipment. This should be detailed enough to allow the sampling protocol to be replicated. If other columns provide sampling information, such as survey time or area, then make this clear.

#### Categorical trait

Field descriptors levels and one of taxon_name or taxon_field required

This is just a categorical variable where the groups apply to a taxon. So, we need information on the levels used, as for a standard categorical variable, and a link to taxonomic information as described in the examples above.

#### Numeric trait

Field descriptors units, method and one of taxon_name or taxon_field required

This is just a numeric variable where the values apply to a taxon. So, we need the method and units for the values, as for a standard numeric variable, and a link to taxonomic information as described in the examples above.

### Interaction data

Interaction data is essentially just a column of categorical or numeric data that you want to associate with (at least) two taxon identities, but there are lots of ways that the taxon identities could be provided.

Interaction data fields do this by using two alternative descriptors to tie the data to taxa: interaction_name and interaction_field. Each descriptor can provide one or more taxon names and optionally their role - the formatting is identical to the categorical data levels descriptors. So, for example an interaction_name descriptor might be:

   Moon rat:prey;Clouded leopard:predator

You can use one or both of the descriptors, depending on how your data is laid out. For the most common case of two interacting taxa, the following three possibilities exist.

1. Both interacting taxa vary from row to row, so taxon names are provided in two fields

Example: Interacting taxa identified in separate columns

2. Alternatively, all of the data might refer to the same two taxa, so the taxon names can be provided directly.

Example: Interacting taxa identified in separate columns

3. Finally, one side of the interaction might vary from row to row but the other side is constant for all rows.

Example: Interacting taxa identified by name and by column

You must provide at least two taxon names or fields, but you can provide more if you have tritrophic interactions! Again, you can use any combination of interaction names and fields to provide your taxon identities.
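
As a sketch of how the two descriptors combine, the code below parses each one and checks that at least two taxa are identified in total. The function names are hypothetical, blank descriptors are assumed to be `None` or empty text, and empty entries (for example after a trailing semi-colon) are skipped.

```python
def parse_interaction(descriptor):
    """Parse 'name:role;name:role' into (name, role) pairs; role is None if omitted."""
    if not descriptor:
        return []
    pairs = []
    for entry in descriptor.split(";"):
        if entry.strip():
            name, _, role = entry.partition(":")
            pairs.append((name.strip(), role.strip() or None))
    return pairs


def interaction_taxa(interaction_name, interaction_field):
    """Return the parsed names and fields, requiring at least two taxa between them."""
    names = parse_interaction(interaction_name)
    fields = parse_interaction(interaction_field)
    if len(names) + len(fields) < 2:
        raise ValueError("At least two interacting taxa must be identified")
    return names, fields
```

This covers all three layouts above: two fields, two names, or one of each. In the call `interaction_taxa("Clouded leopard:predator", "prey_species:prey")`, the field name `prey_species` is a made-up example.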

#### Categorical interactions

Field descriptors levels and interaction_name and/or interaction_field required

#### Numeric interactions

Field descriptors units, method and interaction_name and/or interaction_field required
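
Pulling together the "Field descriptors ... required" notes from the sections above, the requirements can be summarised as a lookup table. This is a sketch only: the names are hypothetical and the table is transcribed from this page, not taken from the checker's code.

```python
REQUIRED = {
    "Categorical": {"levels"},
    "Ordered Categorical": {"levels"},
    "Numeric": {"method", "units"},
    "Abundance": {"method"},
    "Categorical Trait": {"levels"},
    "Numeric Trait": {"units", "method"},
    "Categorical Interaction": {"levels"},
    "Numeric Interaction": {"units", "method"},
}
TAXON_TYPES = {"Abundance", "Categorical Trait", "Numeric Trait"}
INTERACTION_TYPES = {"Categorical Interaction", "Numeric Interaction"}


def missing_descriptors(field_type, provided):
    """List the descriptors a field still needs, given a dict of descriptor values."""
    filled = {key for key, value in provided.items() if value not in (None, "")}
    missing = sorted(REQUIRED.get(field_type, set()) - filled)
    if field_type in TAXON_TYPES and not filled & {"taxon_name", "taxon_field"}:
        missing.append("taxon_name or taxon_field")
    if field_type in INTERACTION_TYPES and not filled & {"interaction_name", "interaction_field"}:
        missing.append("interaction_name or interaction_field")
    return missing
```

For example, a Numeric Trait field with only `units` filled in is reported as missing `method` and a taxon link.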

### Comments

If you have a free text field with notes or comments, then Comments is the field type to use. We don't really check anything in comments fields: they're not expected to be complete data and you can put anything in them.

A word of caution though: it is highly unlikely that anyone will ever read your comments column again. If there is genuinely important information that might apply across multiple rows, consider coding it as an explicit variable rather than consigning it to a comments field.