Engagement, when the toddler is watching the movie stimuli, is defined by frames in which the toddler exhibits yaw poses with magnitudes less than 20°.

Figure 3.2: Example of a head turn using the automatic method. To differentiate a head turn from a face occlusion, we determine whether the child is performing a head-turning motion before and after the face is lost or when it is exhibiting a yaw pose with large magnitude. The red bars represent the half-second windows used to determine whether the child is exhibiting a head-turning motion before and after the face is lost (by the camera) or when it is exhibiting a yaw pose with large magnitude.
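The per-frame engagement rule above (|yaw| < 20°) can be sketched as follows; the function name, the NumPy usage, and the example yaw values are illustrative assumptions, with only the 20° threshold taken from the text:

```python
import numpy as np

def engagement_frames(yaw_deg, threshold=20.0):
    """Boolean mask of frames where the yaw magnitude is below the
    threshold (degrees), i.e. the toddler is counted as watching the
    movie stimuli. Names and interface are hypothetical."""
    yaw = np.asarray(yaw_deg, dtype=float)
    return np.abs(yaw) < threshold

# Example: five frames of estimated yaw poses (degrees)
mask = engagement_frames([-5.0, 12.0, 25.0, -40.0, 3.0])
# Fraction of engaged frames over the clip
ratio = mask.mean()
```

Aggregating the mask over a clip then gives a simple engagement ratio per stimulus.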
Figure 3.3: Audio is analyzed to determine the exact time point at which the practitioner said the child's name during a name-call. The power spectral density (PSD) of the recorded audio signal (3.3(a)) contains audio from the movie stimuli (predominantly music) and instances of vocalizations. Root mean squared (RMS) values of the audio signal (3.3(b)) quantify the audio signal at each time point, and are used to detect a name-call prompt. Knowing that the practitioner was asked to prompt a name-call 15 seconds into the stimuli, in this example we are able to focus on speech around that time point (green box) and detect the exact time point when maximum speech occurred.

Head movement and turn detection

We estimate the child's head movement by tracking the distances and pixel-wise displacements of central facial landmarks. We record the frame-by-frame displacements of landmarks around the nose, namely the two outer eye landmarks and the lowest nose landmark shown in Figure 3.1. The magnitudes of these displacements depend heavily on the child's distance from the camera, so they must be normalized with respect to that distance. If depth information were available, this would be a trivial task; since it is not, we normalize the displacements with respect to the distance between the child's eyes, keeping in line with the use of only available and ubiquitous hardware. At any given time point, the displacements of the nose landmark are normalized by the Euclidean distance between the eyes, averaged over a ±1 second window.

Since the practitioner and caregiver are located behind the child, the child must turn his/her face from looking at the screen to looking behind him/her in order to perform a head turn (in response to name calling or social referencing, for example).
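The displacement normalization described above can be sketched as a short routine; the function name, the 30 fps default, and the use of per-frame landmark arrays are assumptions for illustration, while the ±1 second inter-eye averaging window comes from the text:

```python
import numpy as np

def normalized_displacement(nose_xy, eye_l_xy, eye_r_xy, fps=30):
    """Frame-by-frame nose-landmark displacement (pixels), normalized by
    the inter-eye distance averaged over a +/-1 second window, to
    compensate for the child's distance from the camera."""
    nose = np.asarray(nose_xy, dtype=float)
    # Per-frame pixel displacement of the nose landmark
    disp = np.linalg.norm(np.diff(nose, axis=0), axis=1)
    # Inter-eye pixel distance at every frame
    eye_dist = np.linalg.norm(np.asarray(eye_l_xy, dtype=float) -
                              np.asarray(eye_r_xy, dtype=float), axis=1)
    w = fps  # one second of frames on each side
    scale = np.array([eye_dist[max(0, i - w):i + w + 1].mean()
                      for i in range(1, len(eye_dist))])
    return disp / scale

# Toy example: eyes fixed 100 px apart, nose moving 10 px per frame
out = normalized_displacement([[0, 0], [10, 0], [20, 0], [30, 0], [40, 0]],
                              [[0, 0]] * 5, [[100, 0]] * 5, fps=30)
```

Because the scale is a windowed average rather than a single-frame value, momentary landmark jitter in the eye positions does not spike the normalized displacement.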
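The RMS-based name-call localization from the Figure 3.3 caption can be sketched as follows; the function name, the frame and search-window lengths, and the synthetic signal are assumptions for illustration, with only the idea (maximum RMS energy near the expected 15-second prompt) taken from the text:

```python
import numpy as np

def detect_name_call(signal, sr, expected_s=15.0, search_s=2.0, frame_s=0.05):
    """Locate a name-call as the frame of maximum RMS energy inside a
    search window centred on the expected prompt time (seconds)."""
    signal = np.asarray(signal, dtype=float)
    frame = int(sr * frame_s)
    n = len(signal) // frame
    # RMS energy of each non-overlapping frame
    rms = np.sqrt(np.mean(signal[:n * frame].reshape(n, frame) ** 2, axis=1))
    times = (np.arange(n) + 0.5) * frame_s  # frame-centre times
    in_window = np.abs(times - expected_s) <= search_s
    # Ignore frames outside the search window when taking the maximum
    idx = np.argmax(np.where(in_window, rms, -np.inf))
    return times[idx]

# Synthetic example: quiet noise with a loud burst near 15.2 s
sr = 1000
rng = np.random.default_rng(0)
t = np.arange(0, 20, 1 / sr)
sig = 0.01 * rng.standard_normal(len(t))
sig[(t > 15.15) & (t < 15.3)] += 0.5
t_hat = detect_name_call(sig, sr)
```

Restricting the search to a window around the scripted prompt time is what lets a simple energy peak stand in for full speech detection here.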
To detect head turns, and to distinguish a head turn from a mere occlusion of the face, we tracked yaw pose changes and defined two rules: to initiate a head turn, the pose had to go from a frontal position to one extreme head pose position (left or right); to complete a head turn, the pose then had to come back from the same extreme position to a frontal position. More formally, to initiate a head turn the yaw pose had to change from a frontal position θyaw ∈ [−20°, +20°] to an extreme position |θyaw| > 35° within a half-second window. Then, to complete a head turn, the yaw pose had to