Coded Data & Specimens

OHRP Definition of Coded:

Coded means that

  • identifying information (such as name or social security number) that would enable the investigator to readily ascertain the identity of the individual to whom the private information or specimens pertain has been replaced with a number, letter, symbol, and/or combination thereof (i.e., the code); and

  • a key to decipher the code exists, enabling linkage of the identifying information to the private information or specimens.

To minimize risk to subjects, data sets and biospecimens are frequently coded. Coding a data set means replacing the name, medical record number (MR#) and other readily identifiable fields with a unique identifier code number. Coding does not require removing every element of PHI, but data sets should contain only the minimum PHI necessary for the research.

Coding Schemes

A wide variety of methods can be used to code data/specimens. The codes can be sequential, hashed from a sequential number, randomly assigned, or built from a site identifier plus a sequential number. The method does not matter as long as each code is unique and not otherwise associated with the individual.
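The schemes described above can be sketched in a few lines of Python. This is only an illustration: the function names, prefixes and code widths are assumptions chosen to resemble the examples on this page, not a required format.

```python
import secrets

def sequential_codes(n, prefix="A", width=5):
    """Return n sequential codes such as A00001, A00002, ..."""
    return [f"{prefix}{i:0{width}d}" for i in range(1, n + 1)]

def site_codes(site, n, width=4):
    """Return n codes made of a site identifier plus a sequential number."""
    return [f"{site}-{i:0{width}d}" for i in range(1, n + 1)]

def random_codes(n, prefix="B", width=4):
    """Return n distinct random codes that are not derived from subject data."""
    if n > 10 ** width:
        raise ValueError("not enough distinct codes for this width")
    codes, seen = [], set()
    while len(codes) < n:
        code = f"{prefix}{secrets.randbelow(10 ** width):0{width}d}"
        if code not in seen:  # redraw on collision so every code stays unique
            seen.add(code)
            codes.append(code)
    return codes

seq = sequential_codes(3)         # ['A00001', 'A00002', 'A00003']
by_site = site_codes("CHOP", 2)   # ['CHOP-0001', 'CHOP-0002'] (site label is hypothetical)
rand = random_codes(50)
```

None of these functions takes names, initials, dates or the MR# as input, which is what keeps the resulting codes from being associated with the individual.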

Under HIPAA, code numbers derived from any element of PHI are themselves considered PHI. For example, a code derived from a subject's initials plus a unique number would still be PHI. Commonly, the medical record number (MR#) is hashed or encrypted to create a unique code number. Using the MR# is desirable because it helps prevent enrolling the same subject more than once and helps ensure that data updates are merged with the correct subject's data.

Even though the code number is unique and the subject is not readily identifiable, because the code was derived from PHI the data set cannot be claimed to be de-identified, and HIPAA protections will apply when sharing the data. One solution is to generate new ID numbers that replace the hashed or encrypted MR# when exporting the data/specimens.
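A minimal sketch of that export step follows, assuming the internal code is a salted SHA-256 hash of the MR#. The salt, field names and function names are hypothetical, and the sketch illustrates only the ID replacement, not a complete de-identification procedure.

```python
import hashlib
import secrets

def mrn_derived_code(mrn, salt):
    """Internal code derived from the MR#; because of its source it is still PHI."""
    return hashlib.sha256((salt + mrn).encode("utf-8")).hexdigest()[:12]

def assign_export_ids(internal_codes, prefix="B", width=4):
    """Map each internal code to a fresh random ID unrelated to the MR#."""
    crosswalk, used = {}, set()
    for code in internal_codes:
        if code in crosswalk:
            continue
        while True:
            new_id = f"{prefix}{secrets.randbelow(10 ** width):0{width}d}"
            if new_id not in used:
                break
        used.add(new_id)
        crosswalk[code] = new_id
    return crosswalk

SALT = "example-salt"  # hypothetical value; a real salt would be kept secret
records = [
    {"code": mrn_derived_code("10972390", SALT), "Diagnosis": "VSD"},
    {"code": mrn_derived_code("09890580", SALT), "Diagnosis": "ASD"},
]

crosswalk = assign_export_ids(r["code"] for r in records)
# The exported rows carry only the new ID; the MR#-derived code stays behind.
export = [{"Random ID#": crosswalk[r["code"]], "Diagnosis": r["Diagnosis"]}
          for r in records]
```

The same MR# always yields the same internal code, which is what allows duplicate enrollments to be detected. The crosswalk itself links the new IDs back to the MR#-derived codes, so it would need to be kept apart from the exported data.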

More information about sharing data/biospecimens with other investigators at CHOP and external to CHOP can be found on the IRB's page devoted to Sharing Data.

Complete Data Set with Identifiers

This data set has complete identifiers, including names, MR#, date of birth and date of service (here, the date of surgery).

Subject ID#  Last Name  First Name  MR#       Birth Date  Date of Surgery  Age (yrs)  Diagnosis  Surgical Procedure
A00001       Fine       John        10972390  01/01/00    01/03/08         8.01       VSD        VSD Repair
A00002       Smith      Sally       09890580  03/02/85    02/05/99         13.93      ASD        Patch closure ASD
A00003       Jones      Bobby       98098908  04/04/96    05/06/97         1.09       TGA        Arterial Switch
A00004       Chen       Allison     83838300  02/01/02    12/12/03         1.86       HLHS       Stage 2

Master List + Coded Data Set

The risk of a breach of confidentiality can be decreased by moving as much of the PHI as possible into a separate Master List. The data set should contain only the minimum necessary PHI for the purposes of the research. In this example, dates are retained to allow calculation of age at a later point in time.

Master List:
Subject ID#  Last Name  First Name  MR#       Birth Date  Date of Surgery
A00001       Fine       John        10972390  01/01/00    01/03/08
A00002       Smith      Sally       09890580  03/02/85    02/05/99
A00003       Jones      Bobby       98098908  04/04/96    05/06/97
A00004       Chen       Allison     83838300  02/01/02    12/12/03

Coded Data Set with Minimum Necessary PHI:
Subject ID#  Birth Date  Date of Surgery  Diagnosis  Surgical Procedure
A00001       01/01/00    01/03/08         VSD        VSD Repair
A00002       03/02/85    02/05/99         ASD        Patch closure ASD
A00003       04/04/96    05/06/97         TGA        Arterial Switch
A00004       02/01/02    12/12/03         HLHS       Stage 2
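The split shown above can be sketched as a simple column selection. The field lists mirror the two tables in this example; the function name and the use of plain dictionaries are assumptions made for illustration.

```python
MASTER_FIELDS = ("Subject ID#", "Last Name", "First Name", "MR#",
                 "Birth Date", "Date of Surgery")
CODED_FIELDS = ("Subject ID#", "Birth Date", "Date of Surgery",
                "Diagnosis", "Surgical Procedure")

def split_master_and_coded(rows):
    """Return (master list, coded data set) built from complete records."""
    master = [{f: row[f] for f in MASTER_FIELDS} for row in rows]
    coded = [{f: row[f] for f in CODED_FIELDS} for row in rows]
    return master, coded

complete = [
    {"Subject ID#": "A00001", "Last Name": "Fine", "First Name": "John",
     "MR#": "10972390", "Birth Date": "01/01/00", "Date of Surgery": "01/03/08",
     "Age (yrs)": 8.01, "Diagnosis": "VSD", "Surgical Procedure": "VSD Repair"},
    {"Subject ID#": "A00002", "Last Name": "Smith", "First Name": "Sally",
     "MR#": "09890580", "Birth Date": "03/02/85", "Date of Surgery": "02/05/99",
     "Age (yrs)": 13.93, "Diagnosis": "ASD", "Surgical Procedure": "Patch closure ASD"},
]

master_list, coded_set = split_master_and_coded(complete)
```

The Subject ID# appears in both outputs because it is the key that links them, so the Master List would be stored separately from the coded data set, with access restricted.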

Coded Data Sets That Could Still Be Used to Re-Identify Individuals

Both data sets below are coded and could be relinked to subjects. The first data set retains the original subject ID code, which is the key that links to the Master List. In the limited data set that follows it, a new ID code has replaced the original code. Even though the new ID number can no longer serve as a key to relink the subjects to PHI, the data set still has some PHI (dates are retained). Subjects could potentially be re-identified from the data in either data set.

Coded Data Set without PHI (with a Key to Re-Link):
Subject ID#  Age (yrs)  Diagnosis  Surgical Procedure
A00001       8.01       VSD        VSD Repair
A00002       13.93      ASD        Patch closure ASD
A00003       1.09       TGA        Arterial Switch
A00004       1.86       HLHS       Stage 2

Limited Data Set (with Dates that Can be Used to Re-Link):
Random ID#  Birth Date  Date of Surgery  Diagnosis  Surgical Procedure
B0045       01/01/00    01/03/08         VSD        VSD Repair
B0065       03/02/85    02/05/99         ASD        Patch closure ASD
B0023       04/04/96    05/06/97         TGA        Arterial Switch
B0087       02/01/02    12/12/03         HLHS       Stage 2
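To illustrate why retained dates still permit re-linking, the sketch below matches rows of a limited data set against a second source that holds the same dates. The Master List stands in for that source here, and the function name is hypothetical.

```python
def relink_by_dates(limited_rows, reference_rows,
                    date_fields=("Birth Date", "Date of Surgery")):
    """Return {Random ID#: Subject ID#} for rows whose dates match exactly one subject."""
    index = {}
    for ref in reference_rows:
        key = tuple(ref[f] for f in date_fields)
        index.setdefault(key, []).append(ref)
    matches = {}
    for row in limited_rows:
        candidates = index.get(tuple(row[f] for f in date_fields), [])
        if len(candidates) == 1:  # a unique date combination identifies the subject
            matches[row["Random ID#"]] = candidates[0]["Subject ID#"]
    return matches

master = [
    {"Subject ID#": "A00001", "Birth Date": "01/01/00", "Date of Surgery": "01/03/08"},
    {"Subject ID#": "A00002", "Birth Date": "03/02/85", "Date of Surgery": "02/05/99"},
]
limited = [
    {"Random ID#": "B0045", "Birth Date": "01/01/00", "Date of Surgery": "01/03/08"},
    {"Random ID#": "B0065", "Birth Date": "03/02/85", "Date of Surgery": "02/05/99"},
]

relinked = relink_by_dates(limited, master)
```

In this small example every date pair is unique, so each random ID maps back to a single subject even though the original code was removed.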

Deidentified or Anonymized Dataset

All identifiers have been stripped from this data set and the original ID number has been replaced with a new unique ID number.

  • If the new code is randomly generated in a reproducible way and a key is retained that allows the process to "go backwards" and regenerate the original code, the data set is encrypted.
  • If it is not possible to regenerate the original code, so that subjects cannot be re-identified from this data set, then the data set has been anonymized or deidentified.
  • If the data were originally collected in this format, without a link to subjects, then the data set would be anonymous.
Deidentified Data Set:
Random ID#  Age (yrs)  Diagnosis  Surgical Procedure
B0045       8.01       VSD        VSD Repair
B0065       13.93      ASD        Patch closure ASD
B0023       1.09       TGA        Arterial Switch
B0087       1.86       HLHS       Stage 2
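One way to derive a data set like the one above from the coded data set is sketched below. It assumes age is computed as elapsed days divided by 365.25 and rounded to two decimals, which happens to reproduce the ages in this example; the function names are hypothetical, and no mapping between old and new IDs is kept.

```python
import secrets
from datetime import datetime

def age_in_years(birth, event, fmt="%m/%d/%y"):
    """Age at the event in years (365.25-day year assumed), rounded to 2 decimals.

    Note: two-digit years are expanded by strptime (00-68 -> 2000-2068,
    69-99 -> 1969-1999), which suits the sample dates used here.
    """
    days = (datetime.strptime(event, fmt) - datetime.strptime(birth, fmt)).days
    return round(days / 365.25, 2)

def deidentify(rows):
    """Replace IDs with new random ones and dates with age; keep no key."""
    used, out = set(), []
    for row in rows:
        while True:
            new_id = f"B{secrets.randbelow(10_000):04d}"
            if new_id not in used:
                break
        used.add(new_id)
        out.append({
            "Random ID#": new_id,
            "Age (yrs)": age_in_years(row["Birth Date"], row["Date of Surgery"]),
            "Diagnosis": row["Diagnosis"],
            "Surgical Procedure": row["Surgical Procedure"],
        })
    secrets.SystemRandom().shuffle(out)  # row order should not mirror the source
    return out

coded = [
    {"Subject ID#": "A00001", "Birth Date": "01/01/00", "Date of Surgery": "01/03/08",
     "Diagnosis": "VSD", "Surgical Procedure": "VSD Repair"},
    {"Subject ID#": "A00002", "Birth Date": "03/02/85", "Date of Surgery": "02/05/99",
     "Diagnosis": "ASD", "Surgical Procedure": "Patch closure ASD"},
    {"Subject ID#": "A00003", "Birth Date": "04/04/96", "Date of Surgery": "05/06/97",
     "Diagnosis": "TGA", "Surgical Procedure": "Arterial Switch"},
    {"Subject ID#": "A00004", "Birth Date": "02/01/02", "Date of Surgery": "12/12/03",
     "Diagnosis": "HLHS", "Surgical Procedure": "Stage 2"},
]

deidentified = deidentify(coded)
```

Because the old-to-new ID mapping is discarded inside the function, the original code cannot be regenerated from the output, which corresponds to the second case in the list above.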

For more information see OHRP: Guidance on Research Involving Coded Private Information or Biological Specimens
