6.2 Data Cleaning: Missing Values

The 2000 and 2001 datasets initially contained 37,386 and 38,930 observations respectively; after cleaning, 22,822 and 24,060 observations remained for analysis. The same cleaning techniques were applied to both datasets. Because the decomposition methodology is multivariate, an observation must have information on every variable in the model to be included in the analysis. Rather than relying on the statistical procedure to drop observations with missing values, or imputing the missing values, observations with missing values were removed before the analyses. Table 1 below shows the effect of each cleaning step on the number of observations.

Table 1: Data Cleaning Steps and Numbers of Observations Affected

Cleaning Step                            2000 Dataset    2001 Dataset
Initial number of observations                 37,386          38,930
  • Non-current employees [14]                 -7,462          -7,549
  • Missing date of birth                        -362            -453
  • Missing occupation                           -230            -163
  • Missing ethnicity                          -4,037          -3,996
  • Part-time work status [15]                 -2,181          -2,359
  • Age < 16 years or > 64 years                  -59            -103
  • Single gender occupations [16]               -233            -247
Observations included in analysis              22,822          24,060

14 Employees who terminated their employment between 1 July and 30 June of the analysis year, and employees on secondment, leave without pay, or parental leave as at 30 June. It is assumed that the salary information provided for current employees reflects the actual salary as at 30 June, whereas the salary information for non-current employees is influenced by additional factors (e.g., no incremental pay increase in the current year).

15 Because there were few part-time employees, they were excluded from the analysis rather than adding a full-time status indicator variable to the model.

16 These are occupations that comprise solely women or solely men. Refer to Appendix A.
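
The cleaning steps in Table 1 can be read as a sequence of exclusion rules, with each observation counted against the first rule it fails. The following is a minimal sketch of that reading, not the procedure actually used for the report. The record field names (`current`, `dob`, `occupation`, `ethnicity`, `part_time`), the reference date used to compute age, and the assumption that the rules were applied in table order are all illustrative. The final lines check that the removal counts reported in Table 1 reconcile with the initial and final numbers of observations.

```python
from datetime import date


def age_on(dob, ref):
    """Age in whole years on the reference date."""
    return ref.year - dob.year - ((ref.month, ref.day) < (dob.month, dob.day))


def make_steps(ref_date, single_gender_occupations):
    """Ordered (label, predicate) pairs mirroring the rows of Table 1.

    Field names are hypothetical. A record is dropped at the first
    predicate that returns True, so the age rule is only evaluated
    for records that already have a date of birth.
    """
    return [
        ("Non-current employees", lambda r: not r["current"]),
        ("Missing date of birth", lambda r: r["dob"] is None),
        ("Missing occupation", lambda r: r["occupation"] is None),
        ("Missing ethnicity", lambda r: r["ethnicity"] is None),
        ("Part-time work status", lambda r: r["part_time"]),
        ("Age < 16 years or > 64 years",
         lambda r: not 16 <= age_on(r["dob"], ref_date) <= 64),
        ("Single gender occupations",
         lambda r: r["occupation"] in single_gender_occupations),
    ]


def clean(records, steps):
    """Return (kept_records, tally of removals per step)."""
    tally = {label: 0 for label, _ in steps}
    kept = []
    for record in records:
        for label, should_drop in steps:
            if should_drop(record):
                tally[label] += 1
                break
        else:
            # No exclusion rule matched: the record is complete and in scope.
            kept.append(record)
    return kept, tally


# Figures from Table 1: (initial, removals in table order, included in analysis).
TABLE_1 = {
    2000: (37386, [7462, 362, 230, 4037, 2181, 59, 233], 22822),
    2001: (38930, [7549, 453, 163, 3996, 2359, 103, 247], 24060),
}

for year, (initial, removed, included) in TABLE_1.items():
    # The removals account exactly for the drop from initial to included.
    assert initial - sum(removed) == included
```

Counting each observation against only the first rule it fails keeps the step counts additive, which matches the way the figures in Table 1 sum to the final totals. If the rules had been tallied independently, an observation missing both date of birth and ethnicity would be counted twice.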
