AI and Risks of Hiring Bias Due to Gender Imbalances in Historical Data
1. Introduction
Bias persists whenever it is ignored (Korteling et al., 2021; Payne & Hannay, 2021). Automating biased organizational processes makes the dynamics of bias identification and mitigation more challenging rather than easier as it further obfuscates reality (Hagendorff et al., 2023; Rastogi et al., 2022; Waters & Honenberger, 2025). This is problematic as Artificial Intelligence (AI) is expanding and integrating into work (Agrahari, 2024; Daugherty & Wilson, 2024; Paul et al., 2022). One area in which AI is being used more frequently is in organizational hiring (Agarwal, 2023; Jatobá et al., 2023; Panda et al., 2023). Companies utilize AI in hiring to make that process more efficient (Abasaheb & Subashini, 2024; Hashemi, 2024; Madanchian & Taherdoost, 2025). Although AI brings the promise of greater efficiency in the hiring process, it carries with it potential inherent risks that will shape and constrain the organization, the workforce, and society (Chander & Kumar, 2024; Howard & Schulte, 2024; Jackson, 2023; Patil, 2024). Within organizations, work, and by extension its business practices, will either be revolutionary or reactionary, and the adoption of AI resides at the crux of this dilemma (Jackson & Heath, 2025a). As is generally accepted and incontrovertible, AI is trained with existing data. If the existing data AI is being trained on are biased, those biases can ripple through the organizational processes to which AI is being applied. In the context of the focus of this study, this means that biased AI-training data would potentially result in biased hiring practices.
The focus of this study was on an examination of the potential biases resident within publicly available data that are generally reflective of the type of historical data that could be reasonably expected to be used to train AI systems and to identify potential impacts of such bias on organizational hiring processes. Although the world is shifting toward increased adoption of AI-driven tools in a variety of areas including healthcare (Bouderhem, 2024; Li, et al., 2022; Saraswat et al., 2022), education (Holmes & Tuomi, 2022; Jackson, 2023, 2025; Selwyn, 2022), and finance (Bahoo et al., 2024; Cao, 2022; Weber et al., 2023), it is important to recognize that AI, like the data used to produce and train it, is a product created by an imperfect humanity (Paleri, 2022; Taylor et al., 2022), and is susceptible to a host of cybersecurity concerns (Aslam, 2024; Heath et al., 2025; Sontan & Samuel, 2024). Consequently, whereas it is useful, AI is not perfect and has room for improvement.
The data for this study were collected from Google Books Ngram Viewer (Google, n.d.). Google Books Ngram Viewer has been used as a source of data for numerous studies including social evolution (Clark et al., 2022; Solovyev, 2024; Tyrkkö & Mäkinen, 2022), linguistics (Richter & Böhm, 2024; Richter et al., 2025; Solovyev & Ivleva, 2024), and topical analysis (Kestler, 2025; Olimid et al., 2023). In general, when one enters a given word or phrase into Google Books Ngram Viewer, the system displays a graph showing how the particular words or phrases have occurred within the corpus of books over the selected period. Given the purpose of this study, the data utilized from Google Books Ngram Viewer focused on the prevalence of words and phrases that have gender associations. The specific words and phrases used are presented in the method (Section 3). The hypotheses tested in this study (H1-H3) were:
H0: The average percentage of occurrence of aggregated masculine pronouns from 1922 to 2022 in Google Books Ngram Viewer ≤ The average percentage of occurrence of aggregated feminine pronouns from the same period and source.
H1: The average percentage of occurrence of aggregated masculine pronouns from 1922 to 2022 in Google Books Ngram Viewer > The average percentage of occurrence of aggregated feminine pronouns from the same period and source.
H0: The observed attribution of masculine traits to men and women respectively is equal to the expected attribution based on the underlying gender proportions resident in the analyzed corpus.
H2: The observed attribution of masculine traits to men and women respectively is not equal to the expected attribution based on the underlying gender proportions resident in the analyzed corpus.
H0: The observed attribution of feminine traits to men and women respectively is equal to the expected attribution based on the underlying gender proportions resident in the analyzed corpus.
H3: The observed attribution of feminine traits to men and women respectively is not equal to the expected attribution based on the underlying gender proportions resident in the analyzed corpus.
H1 was tested using a one-tailed, two-sample t-test assuming unequal variance, whereas H2 and H3 were tested using a chi-square test. Key aspects of the analysis are developed more fully in the overview of the research method (Section 3). The results are presented in Section 4. The results section contains the following subsections: an assessment of both the prevalence of masculine and feminine pronouns (Section 4.1), and the prevalence of gender to stereotypical masculine and feminine traits (Section 4.2), along with an analysis of the frequency of common gendered phrases (Section 4.3). Based on the results, a discussion of implications and limitations was developed (Section 5), followed by the conclusion (Section 6). Background literature associated with this research topic is presented next (Section 2).
2. Background
Biases have long existed in a variety of forms and with a multiplicity of foci (Boas & Jones, 1964; Carmichael, 2020; Levinson, 1995). Mukherjee (2015) indicated that historically, “women have always struggled to gain equality, respect and the same rights as men”, and that this situation has been especially challenging to resolve due to patriarchy, which was defined as “an ideology in which men are superior to women and have the right to rule women” (p. 76). Adoption of a patriarchal ideology can result in biased employment decisions (Mukherjee & Sarkhel, 2025; Nayak & Tabassum, 2024; Njoto et al., 2025). More generally, Yanamala (2022) indicated that in recruitment, bias “can manifest in various forms, such as gender, ethnicity, age, and socioeconomic status, potentially leading to discriminatory practices that disadvantage certain groups of candidates” (p. 51). Common gender biases in hiring included implicit bias and in-group bias (Friedmann & Efrat-Treister, 2023; Hassan, 2019; Trifilo, 2025). Hassan indicated that one unconsciously allows biased “associations to influence our perceptions and evaluations”, and that one tends “to confirm the stereotypes by being on the lookout for information which is consistent with the stereotypes” (p. 15). This can be considered a manifestation of confirmation bias (Kappes et al., 2020; Mercier, 2022; Peters, 2022). Further, Hassan found that under implicit bias, there is a tendency for people “to give preference to their own kind as a way of sustaining a positive social identity” (p. 15). Given that such biases impact people differently, it is not surprising that workers’ perceptions of power in organizations would vary (Cini, 2023; Jackson et al., 2022; Livingstone, 2021). This situation becomes more complex as employment decisions are informed and executed through applications of AI.
The application of AI in hiring decisions is ambiguous. Arduini & Beck (2025) indicated that AI in recruitment is “viewed as a promising step toward mitigating implicit biases and enhancing transparency” (p. 247). Other reported benefits of the adoption of AI in HR include increased efficiency, capacity, and objectivity (Awad et al., 2023; Emami, 2023). Such views focus on the (latent) potential of AI to correct for longstanding human biases and produce an objective, repeatable decision logic with improved results in terms of equity. Even with the reported positive potential of AI in the recruitment process, Arduini and Beck suggested that “the success of AI in enhancing gender diversity hinges on the quality of the training data. If the data harbors gender biases, AI might inadvertently perpetuate these biases instead of alleviating them” (Arduini & Beck, 2025: p. 250). AI training data reside at the crux of AI’s potential to correct for human bias. Research by Manasi et al. (2022) found that data bias can occur due to “subjective choices made when selecting, collecting and preparing data” (p. 296). Further, as Nadeem et al. (2020) concluded, “the most discussed contributing factor of gender bias includes lack of diversity in data and developers, the bias in society, and bias in data due to programmer conscious or unconscious bias” (p. 9). The confluence of data and development homogeneity is amplified by a finding reported in a UNESCO report (UNESCO, 2024) in which it was stated that, “evidence shows, there is a significant white male predominance in the AI and machine learning field perpetuating gender bias” (p. 12). This is especially problematic because systems “often operate in ways that are not fully transparent or understandable to their human users, making it difficult to identify and correct biases. This lack of transparency not only raises ethical concerns but also poses significant legal risks, particularly in jurisdictions with strict anti-discrimination laws” (Yanamala, 2023: p. 52). Such obfuscation could be another manifestation of how corporate leaders and managers mislead workers (Boddy, 2021; Jackson & Heath, 2025b; Ng & vanDuinkerken, 2021). Whereas the concern for bias in organizational hiring is pronounced, persistent, and consequential, there are means available for its redress. One proposed way of addressing bias is by examining the creation of the AI tools themselves and the data used to train them.
Linkages between AI training data and gender bias are more than theoretical, as there is already substantial evidence suggesting the transfer of social biases into AI model behavior (Liang et al., 2021; Liao & Naghizadeh, 2023; Wojciechowski & Korjonen-Kuusipuro, 2025). The degree to which gender bias has already been embedded in Large Language Models (LLMs) has been established by Caliskan et al. (2022), Bolukbasi et al. (2016), and Bender et al. (2021). Specifically, as Bolukbasi et al. explained, “by reducing the bias in today’s computer systems (or at least not amplifying the bias), which is increasingly reliant on word embeddings, in a small way debiased word embeddings can hopefully contribute to reducing gender bias in society” (Bolukbasi et al., 2016: p. 15). Given the proxy nature of the Google Books Ngram Viewer data used in this research study, such findings help bound these results as risk indicators for subsequent AI systems development and applications in human resources.
Further, proposed actions for addressing potential bias have included employing bias detection tools to identify and correct biases present in the data (Arduini & Beck, 2025; Bhavaneetharan, 2025; Donald et al., 2023). Additional recommendations have included addressing governance issues, which are the key to ensuring trustworthiness and fairness in the hiring process (Emami, 2023; Lahusen et al., 2024; Raza et al., 2023); ensuring transparency when AI is used in the hiring process (Emami, 2023; Musrifah & Hasanah, 2025; Yanamala, 2023); having discussions around accountability and ethical responsibility in AI technology (De Cremer & Kasparov, 2022; O’Connor & Liu, 2024; Orr & Davis, 2020); and continuing society-wide efforts to minimize systemic gender bias (Armutat et al., 2024; Misa, 2022; Nadeem et al., 2020). Sun et al. (2019) summarized the concern of removing bias from AI well when they noted that “to completely debias effectively, it is important to understand how machine learning methods encode biases and how humans perceive biases” (p. 8). These concerns are beneficially contextualized by viewing them through a feminist lens.
Whereas both AI and hiring bias are real, material things, one’s understanding of them and ability to address concerns at the confluence of the two benefits from a grounding in theory. Drage & Mackereth (2022) stated that “feminist and anti-racist theory can help inform recruitment AI design and deployment through its persuasive critique of normative universal modes of representation, taxonomization, and classification” (p. 5). This concern is not simply with the outcome of organizational hiring decisions, but also with who is employed in the development, training, and deployment of AI systems. As reported in a UNESCO (2024) report, “women must be an active part of developing the digital economy to eliminate gender biases and stereotypes being reproduced through digital platforms, software and programs generated by AI. One place to start with this effort is by increasing the role of women engaging with and working in the field of AI” (p. 12). In that UNESCO report, it was further explained that to understand if and where progress is being made for female participation in the field of AI, “there is a need for a more focused and standardized measure of gender equality in educational attainment and training in AI” (p. 15). These issues have a long-established history as feminist concerns. However, it is important to keep the boundaries of these concerns flexible enough to account for emergent understandings of systemic oppression, subjugation, and bias. As indicated by O’Connor & Liu (2024), “with a growing debate on the concept of gender itself and the proliferation of research into variations in gender identity, perhaps consideration should be applied to the binary concept of gender often applied in related research, which can again hide or ignore the experience of those who do not conform to traditional gendered expectations or labels” (p. 2055).
These concerns are important to consider, and the data used to develop and train AI systems used in organizational hiring are susceptible to each. The focus here on gender differences in a narrower sense is not due to a lack of appreciation for these broader concerns, but simply the result of the efficiency of examining traditional gender differences and biases as a foundation for subsequent research. Prior to discussing the method used for this research (Section 3), it is useful to provide a brief summary of the relevant research established in this background survey of literature.
As previously indicated, humans are susceptible to a variety of cognitive and socio-cultural biases. Women, minorities, and those operating outside the favored strata or clique can face barriers to employment and advancement. AI holds the potential to improve the situation, but only if those engaged in its development and application are focused on correcting for the biases resident in the historical data used to develop and train AI systems and tools. Automation without such modifications will likely simply replicate human bias more efficiently. As indicated, correcting for biased data in AI hiring tools first requires identifying the bias, after which governance issues and transparency concerns need to be addressed. Above all, AI systems require ethical applications and human accountability for their use and employment. Feminist theory can be applied beneficially to understand the risks of hiring bias due to gender imbalances in the historical data.
3. Method
AI is trained with existing data. If those data are biased, they can cause prejudice to ripple through hiring processes within businesses and organizations (Albaroudi et al., 2024; Chen, 2023; Cruz, 2024; Amin Metwally Hussien et al., 2024). This study used a standard quantitative analysis approach, applying the parametric t-test and the nonparametric chi-square test for hypothesis testing (Black, 2020). The focus of this study was to examine potential biases within the publicly available data that could be used to train AI and to identify latent impacts on organizational hiring processes. The data for this study were obtained from Google Books Ngram Viewer, for the years 1922 through 2022 inclusive, and were collected from June through July 2025. The data available through Google Books Ngram Viewer are the percentages of how often a given word or phrase has occurred in its corpus of books over the selected period (Google, n.d.).
Google Books Ngram Viewer was used to obtain the data used for this study. The date range setting was changed from its default values to start at 1922, and all remaining settings were left unchanged. That meant that the search examined works in English, was case-insensitive, and was calibrated to a smoothing value of 3. A smoothing value of 3 creates a 7-year rolling average, comprising the 3 years before the target year, the target year, and the 3 years after the target year. The window length can be expressed as 2n + 1, with n being the selected smoothing value (Google, n.d.). Smoothing is used to reduce noise and to help identify underlying trends. There was no compelling reason to deviate from the default settings for this analysis. These selection parameters were consistent for all three analyses conducted in this study. For the prevalence of masculine and feminine pronouns, the terms he, him, his, she, her, and hers were entered into the search bar. This analysis is developed more fully later in this section.
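The 2n + 1 smoothing window can be illustrated with a minimal sketch. The function name smooth and the sample values are hypothetical, and the handling of boundary years (averaging over whichever years fall inside the window) is an assumption made for illustration rather than a description of Google's implementation.

```python
def smooth(values, n=3):
    """Centered moving average using a window of up to 2n + 1 points.

    Assumption: near the start and end of the series, the average is taken
    over the years that are actually available inside the window.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - n)                 # first index inside the window
        hi = min(len(values), i + n + 1)   # one past the last index
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Hypothetical yearly frequencies, not values from the study.
raw = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
result = smooth(raw, n=3)
window_size = 2 * 3 + 1  # 7 years for a smoothing value of 3
```

With n = 3, the middle element of the hypothetical series is averaged over seven consecutive values, which mirrors the 7-year rolling average described above.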
For the prevalence of gender to stereotypical masculine and feminine traits, the initial pool of traits for both men and women was entered. The initial pool of traits for men was: active, adventurous, ambitious, analytical, assertive, brave, competitive, confident, dominant, energetic, independent, industrious, insensitive, intelligent, rational, reasonable, responsible, stable, strong, and wise. The initial pool of traits for women was: attractive, compassion, confused, content, cooperative, dependent, emotional, expressive, fickle, foolish, inhibited, intuitive, moody, nervous, passive, receptive, sensitive, snobbish, submissive, support, temperamental, timid, unambitious, unintelligent, unstable, warmth, and weak. Traits considered relevant to the study were those traits a recruiter would reasonably look for in a candidate and that would likely influence their hiring decision. Traits unlikely to influence a hiring outcome were not considered relevant and were excluded (e.g., active for men, attractive for women). The same relevance criterion was applied to phrases. After determining the top ten traits for each gender, the following phrases were entered in Google Books Ngram Viewer: men are strong, women are strong, men are independent, women are independent, men are responsible, women are responsible, men are reasonable, women are reasonable, men are stable, women are stable, men are wise, women are wise, men are dominant, women are dominant, men are intelligent, women are intelligent, men are competitive, women are competitive, men are rational, women are rational, men are content, women are content, men are dependent, women are dependent, men are weak, women are weak, men are emotional, women are emotional, men are nervous, women are nervous, men are sensitive, women are sensitive, men are confused, women are confused, men are cooperative, women are cooperative, men are passive, women are passive, men are foolish, women are foolish.
The first focus area of this study (Section 4.1) compared the prevalence of gender-coded pronouns within the dataset. More specifically, data for the prevalence of masculine pronouns (i.e., he, him, his) and feminine pronouns (i.e., she, her, hers) were collected for the period 1922 through 2022. The three masculine and three feminine pronouns were aggregated to single masculine and feminine values for analysis. Doing so provided a more normalized unit of analysis that would mute any idiosyncratic anomalies within the data (Alarcon Falconi et al., 2020; Masselot et al., 2018). The analysis of these data started with an examination of descriptive statistics. The minimum, maximum, mean, and standard deviation values for the aggregated masculine and feminine pronouns were assessed and the aggregated values were compared through the use of a data visualization (Knaflic, 2015). After the preliminary, descriptive analysis was conducted, H1 was tested using the parametric t-test (Black, 2020). As previously indicated (Section 1), the null hypothesis for H1 is that the average percentage of occurrence of aggregated masculine pronouns from 1922 to 2022 in Google Books Ngram Viewer ≤ the average percentage of occurrence of aggregated feminine pronouns from the same period and source. If the null hypothesis is rejected, one will have grounds to infer that the corpus contains fewer references to females, or that it is biased toward men.
The second focus area of this study (Section 4.2) examined the prevalence of gender to stereotypical masculine and feminine traits. The selected traits (both masculine and feminine) were drawn from traits commonly associated with men and with women as reported in research conducted by Langford & Mackinnon (2000) and Taylor (2003). To determine the traits used in this analysis, frequency percentages for 1922 through 2022 were added together for each potential trait, and the ten traits with the highest sums for each gender were kept for the comparison conducted in this study. The ten stereotypical, male-gendered terms were: a) strong; b) independent; c) responsible; d) reasonable; e) stable; f) wise; g) dominant; h) intelligent; i) competitive; and j) rational. The ten stereotypical, female-gendered terms were: a) content; b) dependent; c) weak; d) emotional; e) nervous; f) sensitive; g) confused; h) cooperative; i) passive; and j) foolish. A determination was made for each year 1922 through 2022 if the respective term (e.g., strong, foolish, etc.) was more prevalent in attribution for men or women. The search was conducted using the phrases “men are…” and “women are…” followed by the specific trait of interest (e.g., men are strong, women are strong). The gender that was found to occur more frequently, for a given trait in a given year, was assigned a value of 1, and the alternative gender was assigned a value of 0 for that trait and year. These values were then aggregated and reported graphically.
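The 1/0 coding described above can be sketched as follows. The function name code_dominance, the trait series, and the frequency values are hypothetical and serve only as illustration; the treatment of ties (crediting neither gender) is an assumption, since the source does not state how ties were handled.

```python
def code_dominance(men_freq, women_freq):
    """Count the years in which each gender's phrase was more frequent.

    Each year contributes a 1 to the more frequent gender and a 0 to the other.
    Assumption: a tie credits neither gender.
    """
    men_wins = sum(1 for m, w in zip(men_freq, women_freq) if m > w)
    women_wins = sum(1 for m, w in zip(men_freq, women_freq) if w > m)
    return men_wins, women_wins


# Hypothetical yearly frequencies for "men are <trait>" and "women are <trait>".
series = {
    "strong": ([3e-7, 2e-7, 1e-7], [1e-7, 2.5e-7, 0.5e-7]),
    "wise": ([1e-7, 1e-7, 1e-7], [2e-7, 0.5e-7, 3e-7]),
}

# Aggregate the yearly 1/0 values across all traits.
totals = {"men": 0, "women": 0}
for trait, (men, women) in series.items():
    men_wins, women_wins = code_dominance(men, women)
    totals["men"] += men_wins
    totals["women"] += women_wins
```

Applied to ten traits over 101 years, the same aggregation would yield the 1010 trait-year opportunities per trait group that the study reports.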
The aggregate masculine and feminine pronoun values determined in the first section of analysis were used as the baseline to determine the expected values for trait allocation. A chi-square test was used to test the statistical significance of the difference between the expected and observed values (Black, 2020). As previously indicated, the null hypothesis for H2 was that the observed attribution of masculine traits to men and women respectively is equal to the expected attribution based on the underlying gender proportions resident in the analyzed corpus. The null hypothesis for H3 was that the observed attribution of feminine traits to men and women respectively is equal to the expected attribution based on the underlying gender proportions resident in the analyzed corpus. With 101 years of data (i.e., 1922 through 2022), and 10 traits (for each gender), the number of observations for each chi-square test is 1010, and with two categories (male and female) k = 2, resulting in k − 1 = 1 degree of freedom (df) for the test. The results of these tests will suggest if the data exhibit any trait bias beyond the potential bias associated with initial gender prevalence within the dataset.
The final analysis for the second section focused on a sentiment analysis of the masculine and feminine traits. The sentiment analysis was conducted using the AFINN sentiment lexicon (Nielsen, 2011; Silge & Robinson, 2016), and each trait term was assessed in terms of its assigned sentiment value. The AFINN sentiment score ranges from −5 (extremely negative) to +5 (extremely positive), with 0 indicating a neutral sentiment. An average sentiment value was calculated for the masculine and feminine traits, and then an average sentiment score was calculated based on observed frequencies and average sentiment values for men and women.
The last section of analysis (Section 4.3) was focused on an analysis of the frequency of common gendered phrases. The common gendered phrases used in this study were based on the research of Drew (2023) and Warner (2024). These phrases were used by inserting women and men sequentially in the phrases. As an example, the phrase “women are caregivers” was entered into Ngram and then the phrase “men are caregivers” was entered. Ratios of women to men and men to women were calculated based on average values for each phrase. Only those phrases that were determined to be relevant for organizational hiring were included in the study. If a phrase failed to generate results, an attempt was made to reword it slightly to generate results. If the slight modification of the phrase still did not produce results, that phrase was excluded from the study. The ratio frequency values for these phrases were presented in tabular form.
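The ratio calculation described above can be sketched briefly. The function name phrase_ratios and the sample series are hypothetical; the guard against zero averages reflects the exclusion rule stated above for phrases that produced no results, though how the study implemented that exclusion is an assumption.

```python
from statistics import mean


def phrase_ratios(women_series, men_series):
    """Return (women-to-men, men-to-women) ratios of average phrase frequency."""
    women_avg = mean(women_series)
    men_avg = mean(men_series)
    if women_avg == 0 or men_avg == 0:
        # Assumption: a phrase with no results for either gender is excluded.
        raise ValueError("phrase produced no results for at least one gender")
    return women_avg / men_avg, men_avg / women_avg


# Hypothetical yearly frequencies for one phrase pair, not values from the study.
women_to_men, men_to_women = phrase_ratios([4.0, 6.0], [2.0, 3.0])
```

In this hypothetical case the women-version of the phrase is on average twice as frequent as the men-version, so the two ratios are reciprocals of one another.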
The goal of this research was to demonstrate the potential gender bias in the hiring processes of businesses and organizations when using AI hiring tools, based on the prevalence of gender-coded words and phrases in the publicly available data that will most likely be used to train them. The methodology selected for this study seeks to answer: a) How prevalent are male-coded words and phrases compared to female-coded words and phrases in potential AI training data? b) Based on the results of the data, do the data used to train AI hiring tools carry potential gender bias? c) Can training data impact AI’s likelihood to exhibit gender bias during organizational hiring processes? And d) How can gender bias in AI hiring tools impact hiring outcomes? Given that AI informs decisions based on the information it receives, it is expected that if male-coded words and phrases are more prevalent in the training data used to develop AI hiring tools, then those tools will be more likely to exhibit gender bias in candidate evaluation. Such an outcome would unduly favor male applicants or traits stereotypically associated with men. With the methodology of the study established, it is possible to turn attention to the results (Section 4).
4. Results
This results section contains an assessment of the prevalence of masculine and feminine pronouns (Section 4.1), prevalence of gender to stereotypical masculine and feminine traits (Section 4.2), and an analysis of the frequency of common gendered phrases (Section 4.3). The results of H1 are in Section 4.1, whereas the results of H2 and H3 are in Section 4.2. The results of the prevalence of masculine and feminine pronouns, along with those of H1, are presented in the following section.
4.1. Prevalence of Masculine & Feminine Pronouns
As indicated earlier, Google Books Ngram Viewer reports the occurrences of selected terms over a selected period (Google, n.d.). This study compared the prevalence of masculine pronouns (i.e., he, him, his) with feminine pronouns (i.e., she, her, hers) for the period 1922 through 2022. To reiterate this important point, a given reported value represents a percentage of the total word count, not the absolute number of occurrences, such that a value of 0.05 for a given word means that word constitutes approximately 0.05% of all words in that specific year’s English corpus. The results of this study are presented graphically as Figure 1.
Figure 1. Prevalence of masculine and feminine pronouns from 1922 to 2022.
As indicated in Figure 1, the aggregation of the three masculine pronouns analyzed is more prevalent than the aggregation of the three feminine pronouns analyzed for every year from 1922 to 2022. The aggregate masculine pronouns had a minimum occurrence of 0.295 and a maximum occurrence of 1.116 (M = 0.480, SD = 0.223). The aggregate feminine pronouns had a minimum occurrence of 0.068 and a maximum occurrence of 0.741 (M = 0.179, SD = 0.189). In terms of the average occurrence, masculine pronouns are approximately 2.7 times more likely to occur than feminine pronouns in the Google Books Ngram Viewer corpus (0.480 / 0.179 ≈ 2.7). This relative value will be used subsequently (Section 4.2) to baseline the expected values for the chi-square test examining the prevalence of gender to stereotypical traits.
The null hypothesis for H1 was that the average percentage of occurrence of aggregate masculine pronouns from 1922 to 2022 in Google Books Ngram Viewer (M = 0.480, SD = 0.223) was less than or equal to the average percentage of occurrence of aggregate feminine pronouns from the same period and source (M = 0.179, SD = 0.189). As indicated in the methodology (Section 3), a one-tailed t-test assuming unequal variances was used. The null hypothesis was rejected (t(195) = 10.32, p < 0.001), indicating that the average percentage of occurrence of masculine pronouns was significantly greater than that of feminine pronouns.
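As a rough check on the reported test, the unequal-variance (Welch) t statistic and its degrees of freedom can be recomputed from the summary statistics alone. The function name welch_t is hypothetical, and the group size of 101 (one aggregated value per year, 1922 through 2022) is an assumption inferred from the stated period. Because the inputs are rounded to three decimals, the recomputed t is expected to differ slightly from the reported value.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Summary statistics as reported in the text; n = 101 per group is an assumption.
t_stat, df = welch_t(0.480, 0.223, 101, 0.179, 0.189, 101)
```

Under these assumptions the sketch gives a t statistic of roughly 10.3 with about 195 degrees of freedom, in line with the reported result.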
Based on this finding one can conclude that data obtained from Google Books to train AI would be significantly biased toward male representation unless normalized as part of the algorithm training. Whereas this finding is, in itself, significant, the consequence of this finding is more pronounced when the prevalence of gender to stereotypical traits is considered.
4.2. Prevalence of Gender to Stereotypical Masculine & Feminine Traits
As previously indicated (Section 4.1), masculine pronouns were found to be approximately 2.7 times more likely to occur than feminine pronouns in the Google Books Ngram Viewer corpus for the years 1922 through 2022. This finding baselined the expectation for the gender prevalence of men and women with gendered phrases. This analysis started with searching for traits commonly associated with men and with women and listing them out (Langford & Mackinnon, 2000; Taylor, 2003). The frequency percentages for 1922 through 2022 were added together for each trait; the ten traits with the highest sums for each gender were kept for the comparison.
As this project examined the hiring process, traits that fell into the top ten but were considered not to be an applicable trait to describe a worker and therefore not meaningful to the comparison were removed; the trait with the next highest sum was then brought into the study. For example, the term attractive was the sixth highest feminine trait, but was removed as it was not considered an applicable trait to describe a worker. Notably, masculine traits tend to be more positively framed, while feminine traits often carry more negative connotations. This point will be elaborated more fully when the results of AFINN sentiment are reported.
For this focus area, as indicated in the methodology (Section 3), ten stereotypically male-gendered terms and ten stereotypically female-gendered terms were used to assess gender prevalence to stereotype. The ten stereotypical, male-gendered terms were: strong, independent, responsible, reasonable, stable, wise, dominant, intelligent, competitive, and rational. The ten stereotypical, female-gendered terms were: content, dependent, weak, emotional, nervous, sensitive, confused, cooperative, passive, and foolish. For each year, 1922 through 2022, a determination was made if the respective term was more prevalent in attribution for men or women.
To compare how often masculine and feminine traits were associated with men versus women, the phrases “men are…” and “women are…” followed by the trait were entered into Google Books Ngram Viewer (e.g., men are strong, women are strong). Although the Ngram data were drawn from literature, they serve as a reasonable proxy of language patterns that may shape the data ultimately fed into AI systems, particularly open-source models. Whichever gender was found to dominate the occurrences for a given trait in a given year was given a value of 1, and the alternative gender received a value of 0 for that trait and year. With a data set containing 101 years and 10 elements for each gender, there were 1010 total opportunities for the masculine traits and 1010 total opportunities for the feminine traits. The results of this analysis are presented in Figure 2.
Figure 2. Prevalence of gender to stereotypical masculine and feminine traits.
As indicated in Figure 2, 75% of the observed occurrences of masculine traits were attributed to men (n = 760), and 25% of the observed occurrences of masculine traits were attributed to women (n = 250). Alternatively, 56% of the observed occurrences of feminine traits were attributed to men (n = 563), and 44% of the observed occurrences of feminine traits were attributed to women (n = 447). These observations serve as a useful starting point for analysis. In short, across all terms and traits, men accounted for 65% of attributions, whereas women accounted for 35%. These results were potentially skewed by the underlying dominance of male pronouns over female pronouns; male pronouns were found to be approximately 2.7 times more frequent than female pronouns within the analyzed dataset. Chi-square tests (H2 and H3) were used to compare observed values to expected values based on this imbalance.
H2 tested for a significant difference between the expected and observed attribution of masculine traits to men and women respectively. Based on 1010 potential occurrences, and an underlying 2.7 male to female ratio within the dataset, the expected value was 737 for men and 273 for women. The observed values of 760 for men and 250 for women were not significantly different from the expected values, χ2 (1, N = 1010) = 2.66, p = 0.10. As a consequence, whereas there is a substantial difference in the attribution of stereotypical male traits to men as opposed to women, the occurrence of attribution is not statistically different from what one would generally expect given the disproportion of male to female pronouns resident in the data.
The results of H3 differ from those of H2. As indicated previously, H3 tested for a significant difference between the expected and observed attribution of feminine traits to men and women respectively (similar in construction to H2). Again, based on 1010 potential occurrences, and an underlying 2.7 male to female ratio, the expected values for H3 are the same as those of H2: men (n = 737) and women (n = 273). In this case, the observed values for men (n = 563) and women (n = 447) were found to be statistically different from the expected values, χ2 (1, N = 1010) = 151.97, p < 0.001. With the rejection of the null hypothesis, it can be concluded that there is a statistically significant difference between the observed and expected counts of stereotypical feminine traits. More specifically, stereotypically feminine traits were attributed to women more frequently than would be expected given the underlying distribution of pronouns within the corpus. These findings are further contextualized when one considers the sentiment associated with the selected traits.
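The goodness-of-fit comparisons for H2 and H3 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the software used in the study; the function names are hypothetical, and the expected counts are assumed to be rounded to whole numbers (737 and 273) as reported above. For one degree of freedom, the chi-square p-value can be obtained from the complementary error function.

```python
import math


def chi_square_gof(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


N = 1010       # opportunities per trait category (101 years x 10 traits)
RATIO = 2.7    # male-to-female pronoun ratio reported for the dataset

# Expected counts under the pronoun imbalance, rounded as in the text.
expected_men = round(N * RATIO / (RATIO + 1))
expected_women = N - expected_men
expected = [expected_men, expected_women]

h2_stat = chi_square_gof([760, 250], expected)   # masculine traits
h3_stat = chi_square_gof([563, 447], expected)   # feminine traits

print(expected)
print(round(h2_stat, 2), round(p_value_df1(h2_stat), 2))
print(round(h3_stat, 2), p_value_df1(h3_stat) < 0.001)
```

The statistics produced this way are close to the reported values; any difference in the second decimal place would reflect how the expected counts were rounded.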
Of the ten masculine and feminine traits, only five terms from each category had corresponding AFINN sentiment scores. Traits without AFINN sentiment scores were excluded from the final comparison between masculine and feminine traits. The terms (sentiment score) are provided here. For the masculine terms, there were AFINN sentiment scores for: strong (2), responsible (2), stable (2), intelligent (2), and competitive (2). For the feminine terms, there were AFINN sentiment scores for: weak (−2), nervous (−2), confused (−2), passive (−1), and foolish (−2). There is almost perfect symmetry here between the positive sentiment for masculine traits and negative sentiment for feminine traits, with the average AFINN sentiment score being 2 for the masculine traits, and −1.8 for the feminine traits.
Taken together, these results suggest that an AI algorithm trained on these unadjusted data would assume that 2.7 males for every female is “normal” (i.e., consistent with the data), and that men would have an average sentiment of 0.4 based on the calculation (760 × 2 + 563 × (−1.8))/1323, whereas women would have an average sentiment of −0.4 based on the calculation (250 × 2 + 447 × (−1.8))/697. These results suggest the potential for AI algorithms trained on historical data of this nature to replicate not only an unequal dominance in the number of men to women, but also an inflated attribution of positive sentiment to men over women. These findings are further substantiated through an analysis of relative frequencies of common gendered phrases. Those results are presented in the following section (Section 4.3).
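The weighted-average sentiment calculation above can be reproduced with a short sketch. The variable and function names are hypothetical; the AFINN scores and attribution counts are those reported in this section.

```python
# AFINN scores for the five scored traits in each category (as reported).
masculine_scores = [2, 2, 2, 2, 2]       # strong, responsible, stable, intelligent, competitive
feminine_scores = [-2, -2, -2, -1, -2]   # weak, nervous, confused, passive, foolish

masc_mean = sum(masculine_scores) / len(masculine_scores)
fem_mean = sum(feminine_scores) / len(feminine_scores)


def weighted_sentiment(masc_count, fem_count):
    """Average sentiment for one gender, weighting each category mean
    by the number of attributions that gender received in that category."""
    total = masc_count + fem_count
    return (masc_count * masc_mean + fem_count * fem_mean) / total


men_sentiment = weighted_sentiment(760, 563)     # men: masculine, feminine counts
women_sentiment = weighted_sentiment(250, 447)   # women: masculine, feminine counts

print(masc_mean, fem_mean)
print(round(men_sentiment, 1), round(women_sentiment, 1))
```

Rounded to one decimal place, the two averages agree with the 0.4 and −0.4 reported above.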
4.3. Analysis of the Frequency of Common Gendered Phrases
The results reviewed so far suggest that men are more associated with both sets of traits than women because men are more prevalent than women in the dataset, but that women are observed at a rate higher than statistically expected for the traits that are stereotypically regarded as feminine. This is consequential as the feminine traits are more negative than the stereotypical masculine traits. The AFINN sentiment was based on the occurrence of single-word traits (e.g., stable, intelligent, weak, nervous, etc.). It is possible to extend this analysis by examining common gendered phrases rather than single words.
As previously noted, common gendered phrases were found through online research (Drew, 2023; Warner, 2024) and compared when women or men were used in the phrase. For example, the phrase “women are caregivers” was entered into Ngram along with “men are caregivers”. Consistent with other analyses conducted in this study, data from 1922 through 2022 were analyzed for each version of a phrase, and an average value was reported (see Table 1). The “Women to Men” column is the average when “women” is used in the respective phrase divided by the average when “men” is used in the respective phrase; the “Men to Women” column is the reverse. As this project examined the potential for historical data to impact AI during the hiring process, phrases that were considered not to be meaningful in examining that scenario were excluded (e.g., “boys don’t read books”). Phrases that did not give results were reworded slightly in an attempt to generate results (e.g., “men provide for their family” was rephrased to “men are providers”, which yielded results). If data did not show up for a phrase after rewording it, the phrase was not used. There was also interest in whether the phrases “women are good workers” and “men are good workers” produced any data for comparison, as this phrase was considered meaningful; data were available and are provided at the bottom of the table. A summary of the results associated with the eight common gendered phrases analyzed in this study is presented in Table 1.
Table 1. Frequency of common gendered phrases.
Phrase | Frequency: Women to Men | Frequency: Men to Women
…are caregivers | 3.201 | 0.312
…are passive | 5.889 | 0.170
…are too emotional | 7.651 | 0.131
…are weak | 1.105 | 0.905
…are natural leaders | 0.211 | 4.732
…are providers | 0.401 | 2.492
…solve problems | 0.366 | 2.731
…are good workers | 0.389 | 2.572
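The two ratios reported for each phrase in Table 1 can be sketched as follows. This is an illustrative reading of the description in the text (the average frequency for one version of a phrase divided by the average for the other); the function names and the yearly frequencies are hypothetical and are not actual Ngram values.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence of yearly frequencies."""
    return sum(values) / len(values)


def phrase_ratios(women_freqs, men_freqs):
    """Return (women-to-men, men-to-women) ratios of average frequency.

    Assumes both averages are non-zero; phrases that returned no data
    were excluded from the study.
    """
    women_avg = mean(women_freqs)
    men_avg = mean(men_freqs)
    return women_avg / men_avg, men_avg / women_avg


# Hypothetical yearly frequencies for one phrase (illustration only).
women_series = [6.0, 8.0, 10.0]
men_series = [2.0, 2.0, 2.0]

women_to_men, men_to_women = phrase_ratios(women_series, men_series)
print(women_to_men, men_to_women)
```

Because each ratio is the reciprocal of the other, a value above 1 in one column always corresponds to a value below 1 in the other.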
These results (Table 1) suggest that men are associated more with phrases about being leaders, providers, problem solvers, and good workers, whereas women are more frequently associated with phrases about being caregivers, passive, too emotional, and weak. To amplify a few of the results, the phrase “too emotional” appears with women over 7 times as often as with men, and “passive” appears with women over 5 times as often. Conversely, “natural leaders” appears with men over 4 times as often as with women, and “solve problems” appears with men nearly 3 times as often. The results and analysis of these common gendered phrases are consistent with the previous results, which suggest the potential for significant gender bias in the historical data that could be used to train AI algorithms.
Based on the findings from the publicly available data and literature, men are more prevalent in historical data, and historical data refer to men more positively. Women are less prevalent and more likely to be referred to negatively. The more widespread use of words and phrases that are masculine or are associated with men could lead one to reasonably conclude that the historical data used to train AI contain potential gender bias. These findings suggest that AI tools trained on biased data may reinforce gender inequality in hiring by favoring male candidates and reflecting patriarchal assumptions embedded in historical language.
5. Discussion
There are a few limitations of this dataset and this study. A limitation associated with the data is that they come solely from books and cover only the years 1922 through 2022. Different insights are possible from the inclusion of various print media and more expansive periods. Subsequent studies would benefit from an expansion of both source type and timeframe. Another limitation of this study is that the examination of the gendered traits was based on only two sources (Langford & Mackinnon, 2000; Taylor, 2003). Utilizing other sources may have identified different traits that could have been included in the analysis and potentially produced different results. Additionally, in terms of the gendered traits, only 10 traits were included for each gender (twenty traits in total). Expanding each set to include more traits may have enabled the generation of even more robust results.
Due to constraints of time, specific data values for the gendered traits were not examined; only which gender’s data value was higher for each year was recorded. In other words, more granular gender trait data were reduced to binary values for assessment. Examining the specific values would allow for a comparison of how much more one gender is associated with a trait than the other and would have produced more nuanced results. For the gendered phrases (Table 1), phrases came from only two sources (Drew, 2023; Warner, 2024). Utilizing other sources may have identified additional phrases and different wordings of phrases, which could provide additional insights. Future studies would benefit from addressing these limitations and contributing further to our emerging understanding of the risks associated with basing AI algorithms on unnormalized, male-dominant, and masculine-trait-biased data.
6. Conclusion
This analysis reveals a clear pattern in historical language: men appear more frequently and are associated with more positive traits, while women are referenced less often and in terms that carry negative sentiment. If such biases are embedded in the data used to train AI hiring tools, such tools risk reinforcing systemic gender inequality rather than correcting it. While AI holds promise for improving efficiency and fairness in the hiring process, that potential cannot be realized without critically examining the data that shape its decisions. To move toward that potential, efforts must focus on specific actions to address potential bias in AI. Such actions include employing tools to detect and correct bias, strengthening governance, increasing transparency in terms of AI algorithms and training data, and continuing society-wide efforts to minimize systematic gender bias. As AI integrates into the hiring process and everyday life, it is imperative that we identify its limitations, impacts, and areas for improvement. Without awareness, engagement, and enhancement, too much of our flawed past will be algorithmically replicated into our future.