ABSTRACT
What is the relationship between object segmentation and recognition? First, we develop a feature segmentation method that parses faces into features and, in doing so, attempts to approximate human performance. This segmentation-based approach allows us to build featural representations that make explicit the part-whole structure of faces and removes a priori assumptions from the equation of how objects come to be divided into features. Second, we examine the utility and the psychological plausibility of this representation by applying it to the task of facial gender recognition. Featural information from the segmentation process is shown to support relatively high accuracy levels with automatic gender categorization.
The diagnosticity of featural information, in particular color information as encoded by the three perceptual color channels, is traced to the different patterns of feature contrast across Caucasian male and female faces. Results with human recognition suggest the visual system can exploit this information; however, there are open questions regarding the contribution of color information independent of luminance. More generally, our approach allows us to clarify and extend the notion of “configural” representations to multiple cues (i.e., not only shape) by considering relations between features independent of cue domain.
INTRODUCTION
Gender categorization has received considerable attention in the study of human face processing as well as in machine (automatic) face recognition. These two approaches to the study of visual face processing, human and automatic, share common ground in the exploitation of image structure which is typically analyzed into different information types or, alternatively, into different features. By the former, we mean sources of information along a given dimension such as luminance, shape-from-shading, or texture; by the latter, we mean localized object constituent parts, such as the nose on a face or the reddish/greenish region of the upper cheeks.
In the context of generic human object recognition, the most widely held hypothesis is that shape cues, such as shape-from-shading, are weighted more heavily than surface properties and thus form the basis for extracting or delineating constituent features or parts. However, surface cues do seem to play a role in at least some forms of object recognition. In particular, there is evidence suggesting that pigmentation cues such as hue and texture are important in face processing. At the same time, face recognition is often construed as a feature-based process that emphasizes the relative use and diagnosticity of distinct local features such as the mouth or the nose. With respect to the problem at hand, gender categorization, studies on the relative contribution of different facial features have produced mixed results, often not easily comparable because of the different ways faces can be parsed into said features.
We contend that a clearer understanding of the role of image structure in face processing can arise from combining the cue- and feature-based approaches. Another level of complexity is added by considering the claim that faces are processed by the visual system as configural structures, that is, by relating local features/parts to one another, rather than as unordered sets of features. A configural style of processing has been advocated as characteristic of face and expert object recognition. Configuration is naturally taken to refer in this context to the geometrical positioning of the different features with respect to each other, that is, geometrical configuration. However, we note that relations can be meaningfully constructed between pigmentation cues as well. For instance, evidence has been presented in favor of a face recognition scheme based on luminance differences between separate face regions. Such an approach can be generalized and applied to virtually every cue involved in face recognition.
A critical issue faced by feature-based approaches, whether configural or not, concerns the identification of valid and stable features. If we think of features as corresponding to distinct non-overlapping regions, the question becomes how does the visual system “carve up” or segment an object such as a face into constituent features?
The role of segmentation in recognition
Features are typically selected using an experimenter's a priori intuitions, that is, without any assessment of feature diagnosticity. Features are usually identified by manually marking or “cutting and pasting” a limited number of face images. Such methods have a number of drawbacks that often tend to pass unnoticed, hidden in the Methods section.
First, only a small number of features are considered, generally features with high contrast such as the eyes, the mouth, and the nose. Second, different intuitions for parsing faces may lead to different and potentially incommensurate results across studies. For instance, the central brow of a face may be grouped with the eyes, with the nose, or with the forehead; all options are plausible. Third, manual feature marking is impractical for large databases and large sets of features. One might think that this final concern may be easily addressed by appealing to automatic segmentation algorithms. Indeed, in computer vision, this task has been accomplished by methods for facial feature segmentation. Unfortunately, our first two concerns apply to these methods as well, making them equally problematic. More specifically, most automatic feature segmentation algorithms extract only a limited number of features, such as the eyes and the mouth, and feature selection depends on the concrete goal of the algorithm, for example, lip segmentation for automatic lip reading.
In contrast, when addressing segmentation as the foundation for human face recognition, there are important theoretical advantages to an a posteriori method for segmenting objects into features, that is, making no assumptions about the nature of the features up front but grounding feature identification in human performance. At least two criteria need to be considered in this respect. First, feature identification should mirror the way humans accomplish face segmentation, and second, the utility and plausibility of a segmentation scheme for recognition need to be assessed. This twofold approach is illustrated by the research reported here. First, we develop a feature segmentation method that exhaustively parses faces into features and, in doing so, attempts to approximate human segmentation performance. Second, we examine the utility and the psychological plausibility of the segmental representation obtained in facial gender recognition. As emphasized below, this latter analysis has rarely been used in evaluating segmentation algorithms.
Interestingly, our investigation of segmental structure in gender categorization enabled us to examine more thoroughly one type of cue relatively under-researched in face recognition: color.
Any particular pattern of variation with regard to color or any other cue is critically dependent on the feature segmentation schema deployed. Different ways of segmenting faces can lead to different patterns of variation. Consequently, as mentioned earlier, our approach uses segmentation to study recognition and, conversely, recognition results to assess the utility of feature segmentation. This method allows us to gain a broader perspective on how mechanisms of low-level and high-level face processing interact, as well as providing a tool for examining the role of different cues through multiple processing stages.
Front-view (face-on) face images were drawn from the original MPI face database. This database contains 200 faces, half males, half females, with one frontal, color image per individual. The stimuli were collected under controlled, consistent lighting conditions. All subjects have a neutral expression and none of them wears makeup, glasses, or other accessories. The faces have no hair or facial hair other than stubble. In addition, we removed the visible part of the neck from the images.
Figure 1. Manual segmentations of the same face stimulus (leftmost image) by three different participants.
Facial feature segmentation
For the purpose of manual segmentation by human observers, we developed an application with the aid of the Psychophysics Toolbox for MATLAB. This application allowed the user to draw contours on top of color images with a computer mouse and to mark regions bounded by closed contours. Participants were instructed to identify and mark distinct parts of the face, trying to be as exhaustive as possible, that is, to cover as much of the face surface as possible and to avoid overlapping regions. Two sets of forty faces randomly selected from the MPI database were segmented by six participants over the course of two sessions.
Manual segmentations were examined for self-consistency using the precision-recall framework as applied to segmentation. Every segmented image was treated as a signal while the remainder of the segmentations of the same image by other observers provided the benchmark against which it was evaluated. Precision (P) was measured as the probability that two pixels located in the same segment in the test image were also located in the same segment in the other segmentations. Conversely, recall (R) measured the probability that two pixels located in the same segment in the ground-truth data were also located within the same segment in the test image. Precision and recall were combined into a single measure, the F-measure, using their harmonic mean:
F = 2PR / (P + R)
where P = true positives / (true positives + false positives) and R = true positives / (true positives + false negatives).
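The pair-counting definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the study: it assumes each segmentation is given as a flat list with one segment label per pixel and that every pixel carries a label. The function names are hypothetical, and how scores against several observers are pooled (for example, by averaging) is left open because the text does not specify it.

```python
from collections import Counter


def pairs(n):
    """Number of unordered pixel pairs that can be formed from n pixels."""
    return n * (n - 1) // 2


def pairwise_prf(test_labels, truth_labels):
    """Pair-counting precision, recall and F-measure for two segmentations.

    Both arguments are flat sequences with one segment label per pixel
    (an assumed input format, not one described in the paper).
    """
    if len(test_labels) != len(truth_labels):
        raise ValueError("segmentations must cover the same pixels")

    joint = Counter(zip(test_labels, truth_labels))
    test_sizes = Counter(test_labels)
    truth_sizes = Counter(truth_labels)

    # Pairs sharing a segment in both segmentations: true positives.
    tp = sum(pairs(n) for n in joint.values())
    # Pairs sharing a segment in the test image: TP + FP.
    same_in_test = sum(pairs(n) for n in test_sizes.values())
    # Pairs sharing a segment in the ground truth: TP + FN.
    same_in_truth = sum(pairs(n) for n in truth_sizes.values())

    precision = tp / same_in_test if same_in_test else 0.0
    recall = tp / same_in_truth if same_in_truth else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Counting pairs through segment sizes avoids enumerating every pixel pair, which matters for full-resolution images.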
For the purpose of automatic feature segmentation, we designed a multiple-cue, patch-based method based on ideas and techniques borrowed from general image segmentation, facial feature segmentation, and top-down category-specific segmentation.
First, a fine 2 × 2 pixel grid was superimposed on top of the stimuli, and at each node of the grid, we constructed histogram descriptors for the pigmentation cues considered. For color, we employed the CIE L*a*b* color space, whose components correspond to the three perceptual color channels in the human visual system: brightness (L*), red–green (a*), and yellow–blue (b*). Histograms were computed over L*a*b* values of pixels within a circular area centered on each node of the grid. The radius of the patch was a parameter of the algorithm. In addition to pigmentation cues, we also considered proximity, which was combined with symmetry by measuring the Euclidean distance from each node to the vertical symmetry axis.
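As a rough illustration of the descriptor step just described, the sketch below builds a normalized histogram for one color channel inside a circular patch at every node of a 2 × 2 pixel grid. It assumes the image has already been converted to L*a*b* and that a single channel is passed in as a list of rows. The bin count, value range, default radius, and all function names are illustrative assumptions; the text only states that the radius is a parameter.

```python
def patch_histogram(channel, cx, cy, radius, n_bins, lo, hi):
    """Normalized histogram of one channel inside a circular patch.

    channel is a list of rows of floats; (cx, cy) is the patch center in
    (column, row) coordinates. Pixels falling outside the image are skipped.
    """
    height, width = len(channel), len(channel[0])
    counts = [0] * n_bins
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            if (x - cx) ** 2 + (y - cy) ** 2 > radius ** 2:
                continue  # outside the circular area
            frac = (channel[y][x] - lo) / (hi - lo)
            b = min(n_bins - 1, max(0, int(frac * n_bins)))
            counts[b] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def grid_descriptors(channel, step=2, radius=3, n_bins=8, lo=0.0, hi=100.0):
    """Histogram descriptor at every node of a step x step pixel grid."""
    height, width = len(channel), len(channel[0])
    return {
        (x, y): patch_histogram(channel, x, y, radius, n_bins, lo, hi)
        for y in range(0, height, step)
        for x in range(0, width, step)
    }


def distance_to_axis(x, axis_x):
    """Distance from a node in column x to a vertical symmetry axis at axis_x."""
    return abs(x - axis_x)
```

The default range of 0 to 100 suits L*; the a* and b* channels would need a signed range chosen for the data at hand.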
Where (i) = [f(i; I) + f(i; J)]/2 denotes the joint estimate. Cue-specific distances were next normalized by their variance and combined linearly using cue weights that maximized the F-measure fit of automatic segmentations with the manual ones.
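The distance equation that introduces the joint estimate did not survive in this copy of the text, so the following is only a sketch under a stated assumption: that histograms are compared with a χ2 distance in which each bin is measured against the joint estimate [f(i; I) + f(i; J)]/2. The second function illustrates normalizing each cue's distances by their variance and combining them linearly. The weights are taken as given here; fitting them to maximize the F-measure is not shown, and all names are hypothetical.

```python
from statistics import pvariance


def chi_square_distance(f_i, f_j):
    """Chi-square distance between two normalized histograms.

    Assumed form: sum over bins of (f(i; I) - joint)^2 / joint, where
    joint = [f(i; I) + f(i; J)] / 2. Empty bins contribute nothing.
    """
    total = 0.0
    for a, b in zip(f_i, f_j):
        joint = (a + b) / 2.0
        if joint > 0:
            total += (a - joint) ** 2 / joint
    return total


def combine_cues(distances_by_cue, weights):
    """Weighted sum of variance-normalized distances.

    distances_by_cue maps a cue name to a list of distances, one per pair
    of grid nodes, in the same order for every cue.
    """
    n = len(next(iter(distances_by_cue.values())))
    combined = [0.0] * n
    for cue, dists in distances_by_cue.items():
        var = pvariance(dists)
        scale = var if var > 0 else 1.0  # guard against constant cues
        for k, d in enumerate(dists):
            combined[k] += weights[cue] * d / scale
    return combined
```

With this form, two identical histograms are at distance 0 and two normalized histograms with no shared bins are at distance 1.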
Feature segmentation was applied to the entire MPI data set, yielding a total of 200 segmented faces. For each facial feature resulting from the segmentation process, we recorded its average color properties represented as a triplet in L*a*b* space. The textural property of a feature was obtained by appealing to the texton information used by the segmentation algorithm: we computed the χ2 similarity between the texton distribution for a given feature and the texton distribution for the entire face. In addition, we recorded simple geometrical information consisting of the position of feature centers within a face, normalized by intraocular distance, as well as feature size, normalized by total face area. Configural information was obtained by taking all pairs of features and comparing their values. Position information was computed as the Euclidean distance between pairs of feature centers. For all other cues, we used a simple subtraction operator.
The values thus computed were input to a single-layer perceptron that classified each face as male or female. The diagnosticity of different cues and features for gender recognition was evaluated by a “leave-one-out” cross-validation method whereby the perceptron was trained on all stimuli but one and tested on the remaining one. This procedure was repeated for all 200 stimuli.
For automatic face recognition, the classifier was trained to recognize objective facial gender. For human face recognition, the target responses were provided by experimental data from the study of Tarr et al., in which human observers were asked to identify the gender of a series of faces from degraded images.
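The pairwise construction of configural information can be sketched as follows. The input layout (a dictionary per feature holding a center and one value per cue) and the cue names are assumptions made for illustration. Centers are taken to be already normalized by intraocular distance.

```python
from itertools import combinations
from math import dist


def configural_relations(features, cues=("L", "a", "b")):
    """Pairwise relations between facial features.

    features maps a feature name to a dict with a "center" (x, y) entry and
    one numeric entry per cue. Position is compared by Euclidean distance;
    every other cue is compared by simple subtraction.
    """
    relations = {}
    for (name1, f1), (name2, f2) in combinations(sorted(features.items()), 2):
        relation = {"position": dist(f1["center"], f2["center"])}
        for cue in cues:
            relation[cue] = f1[cue] - f2[cue]
        relations[(name1, name2)] = relation
    return relations
```

With the eight features used in this study, this yields 28 feature pairs per face.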
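The leave-one-out procedure can be illustrated with a small pure-Python sketch. The perceptron below uses the classic error-driven update rule with a fixed learning rate and number of epochs. These training details, like the function names, are assumptions, since the text specifies only that a single-layer perceptron was used.

```python
def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Train a single-layer perceptron; labels are 0 or 1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = y - (1 if activation > 0 else 0)
            if error:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
    return weights, bias


def predict(model, x):
    """Classify one sample with a trained (weights, bias) pair."""
    weights, bias = model
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def leave_one_out_accuracy(samples, labels, **train_args):
    """Train on all samples but one, test on the held-out one, repeat for all."""
    correct = 0
    for k in range(len(samples)):
        train_x = samples[:k] + samples[k + 1:]
        train_y = labels[:k] + labels[k + 1:]
        model = train_perceptron(train_x, train_y, **train_args)
        correct += predict(model, samples[k]) == labels[k]
    return correct / len(samples)
```

For the 200 MPI faces, each fold would train on 199 faces and test on the remaining one, so the reported accuracy is the fraction of the 200 held-out faces classified correctly.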
Results
Facial feature segmentation
After pairing symmetrical regions for each manually segmented image, for example, grouping the two cheeks into a single feature, the average number of features for the two sets of faces was 8.05 (σ = 0.57) and 8.56 (σ = 1.60), respectively. Consequently, both versions of the segmentation algorithm were then trained to decompose the image into 8 distinct features.
A baseline for the automatic segmentations was provided by proximity-based segmentation, that is, the outcome of the method when pixels are clustered based only on their position within a face. It should be noted, however, that this is not the equivalent of random image segmentation: proximity, as measured here, is sensitive both to the overall shape of the face and to symmetry, factors we expect to constrain feature segmentation in humans as well.
Figure 2. Human and automatic segmentations of four MPI faces: manual segmentations (first row), algorithm segmentations (middle row), and contours of automatically extracted features superimposed on the stimuli (lower row).
Figure 3. Consistency of automatic segmentations with the human data and inter-consistency of human segmentations with each other. From left to right: proximity-based segmentations, bottom-up eight-feature segmentations, final top-down segmentations, and human data. Error bars represent a single standard error.
Automatic multiple-cue segmentations are visibly superior to proximity-based segmentations and closer to manual ones. Still, the self-consistency of the human data was higher than their consistency with our best set of segmentations, those combining bottom-up and top-down information. One consistent departure from manual segmentations which is partly responsible for the difference is the fact that the lower part of the nose containing the tip of the nose, the nostrils, and the area above the upper lip were grouped together while the rest of the nose included some of the upper cheeks.
Automatic gender recognition
The results obtained using color information are depicted in Figure 4. Table 1 displays the results obtained for all cues.
Figure 4. Accuracy of automatic gender categorization with different color cues.
The first set of analyses concerns the relationship between global, featural, and configural information. All accuracy levels were significantly above chance as indicated by χ2 tests (p < 0.01) with one exception, the global use of yellow–blue information (p > 0.25). Color cues gave significantly better performance in the featural condition by the same test. The cue for which configuration provided an advantage was texture (χ2(1) = 4.12, p < 0.05). Pooling together all three types of cue usage increased accuracy as compared to the featural condition for the three-color-cue combination. The effect was, however, significant for texture (χ2(1) = 5.39, p < 0.05) and for the all-cue combination (χ2(1) = 12.7, p < 0.01).
Next, we considered the relationship between different cues. Not surprisingly, comparison across cues showed a benefit of combining color cues over using any one of them independently. The highest level of accuracy, 94%, was obtained by combining all types of information and was significantly superior to all other performance levels.
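For readers who want to check the reported significance levels, the p-value of a χ2 statistic with one degree of freedom can be computed with the standard library alone. The second function is a goodness-of-fit test of an accuracy count against chance. This is only one possible reading of the tests used here, since the text does not state which χ2 variant was applied, and the function names are hypothetical.

```python
from math import erfc, sqrt


def chi2_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(statistic / 2.0))


def chi2_against_chance(n_correct, n_total, chance=0.5):
    """Goodness-of-fit test of a correct/incorrect count against chance.

    Returns the chi-square statistic (1 degree of freedom) and its p-value.
    """
    n_wrong = n_total - n_correct
    expected_correct = n_total * chance
    expected_wrong = n_total * (1.0 - chance)
    statistic = ((n_correct - expected_correct) ** 2 / expected_correct
                 + (n_wrong - expected_wrong) ** 2 / expected_wrong)
    return statistic, chi2_p_value_df1(statistic)
```

For example, `chi2_p_value_df1(4.12)` falls below 0.05 and `chi2_p_value_df1(12.7)` falls below 0.01, in line with the thresholds reported above. An accuracy of 94% on 200 faces corresponds to 188 correct classifications, which can be passed to `chi2_against_chance(188, 200)`.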
To establish the diagnosticity of each feature, we measured categorization accuracy by considering only a single feature at a time.
The categorization results above draw on global and local differences between male and female faces. To examine the nature of these differences, we compared color properties across genders. First, for global properties, we found that male faces were darker (t(198) = 5.97, p < 0.01) and redder (t(198) = 6.59, p < 0.01) than female faces, both results consistent with earlier findings.
Two comparisons that deserve special attention are the brightness contrasts for the eyes and the mouth. We found that, in males compared to females, the contrast was larger for the eyes (t(198) = 2.77, p < 0.01) and smaller for the mouth, but not significantly so (t(198) = 0.26, p > 0.76).
Human gender recognition
A first set of analyses regressed average human responses on the facial properties of the MPI stimuli. The results indicate the proportion of variance explained by different cues and features using multiple linear regressions. An examination of the values suggests that featural information does a better job of explaining human performance than global information. As far as individual features are concerned, the cheeks, the eyes, and the chin are the most diagnostic with respect to human judgments of gender.
Table 3. Proportion of variance and accuracy of predicting human performance obtained with various types of cues in gender categorization.
Table 4. Proportion of variance and accuracy of predicting human performance based on the color properties of eight features in gender categorization.
Such conclusions need to be qualified by the remark that we performed our regression in a high-dimensional space. The high values we obtained and the differences between them might reflect a dimensionality advantage rather than diagnosticity. Accuracy levels obtained confirm the reliance of the responses on color properties. All of them were significantly above chance by χ2 tests. Geometrical information, on the other hand, produced a smaller accuracy level and did not provide an advantage over color information. The performance of luminance particularly stands out. Featural information was superior to global information for the yellow–blue channel and when pooling together the three color cues. Their performance was clearly superior to the next best feature, the eyes (χ2(1) = 12.58, p < 0.01).
Figure 5. Accuracy of predicting human gender categorization with different color cues.
Figure 6. Accuracy of automatic gender categorization and prediction of human responses based on color information from different features. From left to right: forehead, eyes, ears, upper nose, cheeks, lower nose, mouth, and chin.
The results presented above support a role for color (L*a*b*) in gender recognition for human faces. The accuracy of gender categorization based on region properties is comparable to that of other methods applied to the task. While geometrical and textural information make a significant contribution, color was shown to support a relatively high level of performance by itself. It is important to note, however, that the separation of pigmentation and shape cues was not perfect. Luminance includes both albedo and shape-from-shading information. Therefore, performance obtained with color cannot be ascribed entirely to pigmentation or assumed to exploit pigmentation information exhaustively. The results point, however, to the diagnosticity of color information beyond luminance and show the advantage of considering featural information.
The current study examines the role of stimulus structure, and spectral structure in particular, for face processing across two different types of tasks. Specifically, we present evidence for the role of surface cues in face segmentation and recognition.
To this goal, we developed a method for automatic feature segmentation grounded in human performance.
Finally, further research should explore how well our current results generalize to other types of stimuli, for example, faces of different ethnicities or faces displaying more variability due to lighting or expression.