Abstract
Scholars increasingly warn that commercial AI products reproduce narrow, stereotypical gender identities, but far less is known about how those identities are made in practice. This article addresses that gap through the case of AI streamers in China's expanding live-commerce sector, where generative-AI streamers are built to appear hyper-feminine and sell products as stand-ins for human streamers. Drawing on 120 h of behind-the-scenes ethnography in two Chinese AI startups and 48 interviews with engineers, designers, and brand marketers, I show that an AI streamer's gender is produced, not merely reflected, through an optimization loop: a recursive, metric-driven cycle in which developers generate, refine, and scale the variant that meets commercial goals. Pre-launch, teams translate abstract brand ideals into parameters across voice, face, gaze, gesture, and script. Post-launch, continuous A/B tests link these parameters to performance metrics (retention, click-through rate, sales per minute). Exposure is reallocated to higher-performing variants, and the winners are written back into the product as defaults. Across cycles, data do not simply register a pre-given persona. They select and lock in a gendered one, yielding a soft-spoken femininity optimized for sales. This article extends bias-reproduction accounts by unpacking the production pipeline of AI products and showing that user feedback is not a mirror of preexisting bias but a design lever teams use to reverse-engineer persona traits. This reframing shifts accountability from “bad data” to human choices and makes clear that identities in AI are engineered, traceable, and therefore contestable.
This article is part of the special theme on AI Encoding Identities. For a full list of the articles in this special theme, see: https://journals.sagepub.com/page/bds/collections/how_is_ai_encoding_and_decoding_racial_gender_and_sexual_identities_and_experiences
Introduction
One evening in early 2024, in a small control room at a Chinese AI startup, I watched the debut of a generative-AI-powered female avatar on a major Chinese e-commerce platform as she promoted a luxury skincare line (see Figure 1). She was not modeled on any single person. Instead, developers assembled her from traits that market data suggested would sell: balanced facial symmetry, a warm vocal timbre, and micro-expressions calibrated to appear inviting. Yet this final product was not the outcome of a single design decision, but of continual adjustments. At launch, she spoke in a calm, authoritative tone, explaining the science behind the ingredients. Within minutes, viewers began clicking away, the chat went silent, and sales flat-lined. Behind the monitors, developers tracked performance dashboards in real time, then softened her voice, inserted gentle compliments, and animated her into a warmer smile. By midnight, watch time had doubled and sales rebounded. “We found her parameters,” one engineer said. What I witnessed captures the dynamic of this study: femininity was treated as an instrument, rendered as measurable, adjustable variables, and re-assembled until it aligned with commercial goals.

Figure 1. Screenshot of an AI streamer in a livestreaming event.
AI products have been widely critiqued for reproducing narrow, feminized defaults, from voice assistants to recommendation systems (Buolamwini and Gebru, 2018; Woods, 2018). Much of this critique follows a bias-reproduction account: biased inputs yield biased outputs. This framework has been valuable for diagnosing gender bias, but it treats the production process as a black box. We know what goes in and what comes out, but not what happens in between. Inside that black box, contemporary AI systems involve complex production pipelines where bias can enter at many points: corpus curation, alignment with human feedback, decoding settings, and online A/B experimentation (Ho et al., 2025; Ouyang et al., 2022). We know a lot about why biased inputs produce biased outputs, but far less about how gender is actively manufactured through this iterative process.
This article opens that black box by examining how gender is produced through what I call the optimization loop: a recursive, metric-driven cycle in which user feedback is used as a design lever to generate, refine, and scale the variant that best meets commercial goals. Drawing on sociological accounts of gender as something “done” rather than possessed (Butler, 1990; West and Zimmerman, 1987), I describe this process as “doing data, doing gender.” Where traditional “doing gender” depends on interpersonal recognition to validate performances, “doing data” shifts validation to commercial metrics that select which performances survive. Developers begin with culturally familiar stereotypes and translate them into quantifiable parameters. Those parameters are then tested against commercial metrics, which select and intensify configurations based on what converts. The resulting avatars do not reflect what real women are, nor pre-existing audience ideals. They reflect what the optimization loop learned would make audiences stay, click, and buy. Audiences participate in this loop not as judges but as generators of behavioral data: their watch time, clicks, and purchases become the feedback that shapes the personas engineered to capture their attention.
Drawing on 120 h of ethnographic observation inside two Chinese AI startups and 48 interviews with developers, designers, and brand marketers, my study opens the black box of AI production to trace how femininity is created and stabilized through two pivotal phases of the optimization loop. In the pre-launch stage, developers translate vague brand ideals such as “spicy” or “sweet” into measurable parameters like voice pitch, facial softness, gaze patterns, and gesture timing. Although presented as neutral technical features, these translations encode culturally normative assumptions about femininity, laying the groundwork for commercial appeal. Once the avatar is launched, each parameter enters a recursive A/B testing process governed by platform-specific metrics such as watch time, click-through rate, and sales per minute. Traits that consistently boost these metrics are retained and often codified into “reference avatars,” while underperforming variants disappear. Across repeated cycles, the “winning” style emerges as a hegemonic femininity (Schippers, 2007): soothing, admiring, deferential. It appears natural only because it reliably converts attention into profit.
By shifting the analytic lens from bias reproduction to commercial-metric production, this article advances scholarship on gendered AI in three ways. First, it reframes AI systems not as static products that mirror culture, but as dynamic production sites where gender is constructed, stabilized, and scaled through iterative optimization. Second, it shows how developers use feedback as a design lever, reverse-engineering “winning” variants and sidelining others under platform metrics and commercial goals. Third, it shows how big-data infrastructures (A/B testing stacks, metrics dashboards, traffic-allocation systems, and write-backs) obscure human choices in the production pipeline, so the narrowing of gendered personae appears as objective, data-driven discovery. Together, these insights move AI-ethics debates beyond bias detection toward the infrastructural and commercial processes through which identities are produced, commodified, and standardized.
Gender in AI: From bias reproduction to productive optimization
Research on gender bias in AI converges on a bias-reproduction account: biased inputs yield biased outputs. When texts link “man” with programmer and “woman” with homemaker, models reproduce those links (Bolukbasi et al., 2016; Caliskan et al., 2017). In vision, when training sets and labeling schemes underrepresent darker-skinned women or collapse gender into a binary, systems make more errors on darker-skinned women and misrecognize or erase non-normative genders (Buolamwini and Gebru, 2018; Keyes, 2018; Scheuerman et al., 2019). Complementing data-centric critiques, research on designer positionality shows that who builds systems, and in what organizational contexts, shapes problem framing, data choices, and evaluation criteria (Cambo and Gergle, 2022; Leavy, 2018; Miceli et al., 2020; Scheuerman and Brubaker, 2024). At deployment, feedback loops in recommendation systems amplify what already draws clicks, shaping what users see (Abdollahpouri et al., 2019; Chaney et al., 2018; Singh and Joachims, 2018). Across these strands, bias is understood as something that enters the system and is reproduced or amplified at the output.
At the interface level, a growing body of work examines how gender is designed into AI products. Popular voice assistants such as Siri, Alexa, and Cortana are given feminized names, voices, and personas, positioning them in service roles that reinforce patriarchal gender norms (Dobrosovestnova et al., 2022; Leavy, 2018; Woods, 2018). Strengers and Kennedy (2021) show how “smart wives” are designed to be helpful, compliant, and emotionally available, embedding gendered assumptions about domestic labor into the interface. These interactions often follow sexualized scripts, with users issuing commands and assistants responding with deference (Costa and Ribas, 2019; Hester, 2017; Schiller and McMahon, 2019). This work reveals how designers pre-script gender into AI products. Yet it focuses primarily on design intentions and interface features, treating gender as something coded once and deployed. What remains underexplored is the iterative process through which gendered personas are tested, refined, and recalibrated after launch.
This article shifts the focus from bias reproduction to gender production. Existing literature diagnoses bias by examining inputs (skewed data, designer assumptions) and outputs (unequal error rates, stereotypical personas). But the production process itself remains a black box: we know what goes in and what comes out, but not how gender is actively manufactured in between. My study opens this black box ethnographically by tracing what I call the optimization loop: the recursive, metric-driven process through which developers produce gendered AI. In this process, audience feedback is not just observed but operationalized as a control signal that shapes design. Developers, designers, and marketers use behavioral metrics to select, refine, and scale gendered traits. Crucially, developers do not invent femininity from scratch; they begin with culturally familiar stereotypes such as youthfulness, softness, and deference. What the optimization loop adds is a mechanism for filtering and intensifying particular configurations based on metric performance. Traits survive not because they are culturally dominant but because they convert. By tracing this pipeline, I locate human agency throughout: in the choice of what to measure, what thresholds to set, what counts as success, and how results are written back into production.
The optimization loop works as follows. Developers translate brand adjectives into tunable traits such as voice pitch, gaze, and gesture (Scheuerman et al., 2021), link them to dashboards, and run A/B tests. Metrics such as watch time, click-through, and sales-per-minute are treated as ground truth, and traits are reverse-engineered from these numbers until they maximize performance. High-performing variants are filtered through ranking and recommendation systems (Bucher, 2018; Gillespie, 2014; Rieder and Hofmann, 2020) and scaled as defaults, while weak ones are dropped. Over repeated cycles, platforms allocate traffic to the “winning” traits, narrowing the field toward a single profitable version of femininity. Gender is not simply carried forward by existing bias but actively assembled through a metric-driven workflow that turns femininity into a design object optimized for commercial goals (Wajcman, 2010).
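For readers who think in code, one cycle of the loop described above can be sketched in a few lines of Python. This is an illustration only, not the startups' software: the `Variant` class, the `reallocate` and `run_cycle` functions, the proportional traffic rule, the drop threshold, and every number in the usage example are assumptions invented for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Variant:
    name: str
    pitch_hz: float                # one tunable trait, standing in for the full parameter set
    traffic_share: float           # fraction of exposure currently allocated
    sales_per_minute: float = 0.0  # metric observed during the A/B test


def reallocate(variants: List[Variant]) -> None:
    """Shift exposure toward higher-performing variants (assumed rule: proportional to the metric)."""
    total = sum(v.sales_per_minute for v in variants)
    if total <= 0:
        return  # no signal yet, so exposure stays as it was
    for v in variants:
        v.traffic_share = v.sales_per_minute / total


def run_cycle(
    variants: List[Variant],
    observe: Callable[[Variant], float],
    drop_below: float = 0.2,  # hypothetical cut-off for dropping weak variants
) -> Tuple[List[Variant], Variant]:
    """One pass of the loop: test, reallocate traffic, drop weak variants, write back a default."""
    for v in variants:
        v.sales_per_minute = observe(v)  # the metric is treated as ground truth
    reallocate(variants)
    default = max(variants, key=lambda v: v.sales_per_minute)  # the "winner" becomes the default
    survivors = [v for v in variants if v is default or v.traffic_share >= drop_below]
    return survivors, default


# Usage with invented placeholder metrics (not fieldwork data).
toy_metric = {"calm": 1.0, "warm": 3.0, "soft": 6.0}
variants = [
    Variant("calm", 190.0, 1 / 3),
    Variant("warm", 205.0, 1 / 3),
    Variant("soft", 230.0, 1 / 3),
]
survivors, default = run_cycle(variants, lambda v: toy_metric[v.name])
```

The human choices the article emphasizes appear here as ordinary arguments and lines of code: which metric `observe` returns, where `drop_below` is set, and which rule picks the default. Feeding `survivors` back into `run_cycle` repeats the loop and narrows the field further.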
This productive perspective builds on theories of gender as an ongoing accomplishment rather than a fixed essence. West and Zimmerman's (1987) notion of “doing gender” and Butler's (1990) account of performativity highlight how gender is stabilized through the repetition of stylized acts such as tones, expressions, and gestures that others recognize as coherent. Traditionally, this recognition is interpersonal: social audiences validate performances as appropriately masculine or feminine. I argue that commercial AI introduces a new mode of doing gender, which I call “doing data.” Here, gender is still accomplished through repeated performances (e.g. scripted voices, facial animations, gestural cues). What shifts is how performances are validated. Rather than social audiences judging appropriateness, commercial metrics such as watch time, click-through rates, and sales conversions determine which performances survive. In an optimization loop, gender is first rendered into measurable parameters, yet the choice of what to parameterize encodes cultural assumptions into ostensibly objective data (Kitchin, 2014). Those parameters are then tested against metrics, which select configurations based on what converts. “Doing data” is thus a commercially mediated way of doing gender, where measurement encodes norms and metrics select which configurations are standardized and scaled.
Audiences participate in this process not as judges of appropriateness but as generators of behavioral data. Their watch time, clicks, and purchases become the signals that validate which performances survive. Critical algorithm studies have shown how platforms construct audience identities by correlating their behavioral traces into categories (Bucher, 2018; Chun, 2021; Gillespie, 2016; Wang and Spronk, 2023). My study shows that the same behavioral traces also flow in the other direction: they become design inputs that shape the personas audiences encounter. Audience data not only classifies who is watching but also engineers what is being watched. In Cheney-Lippold's (2017: 5) terms, the avatar becomes what it is “computationally calculated to be.” Thus, audience feedback is not merely a downstream readout but a design lever that developers use to iteratively select, refine, and scale gendered traits. Understood this way, gender in commercial AI is not a fixed attribute encoded once and deployed, but a moving target continuously recalibrated through the optimization loop.
The Chinese case: Gendered AI streamers in live-commerce
Live-commerce refers to real-time video shopping with in-stream checkout, where viewers watch a streamer demonstrate products and purchase without leaving the stream. What distinguishes live-commerce from other live retail formats, such as television shopping or traditional e-commerce like Amazon, is its metrics-driven architecture. Streamers do not perform for an imagined audience; they perform for the metrics displayed on dashboards in front of them. Platform algorithms allocate traffic based on real-time performance indicators such as watch time, click-through rates, and sales per minute. Streamers who generate strong metrics receive more traffic; those who underperform see their streams deprioritized. This algorithmic distribution of visibility creates intense pressure to optimize every aspect of the performance: voice, gesture, pacing, and emotional tone are all calibrated to move the numbers. It is this metrics-driven logic that has shaped the rapid adoption of AI streamers, whose parameters can be adjusted in real time to maximize commercial performance.
In China, live-commerce is not a niche experiment but a mainstream platform economy operating at a national scale. Major platforms, including Taobao (Alibaba), Douyin (ByteDance), JD, and WeChat Channels, have integrated livestream shopping into their core business models. By the end of 2023, 816 million people—about three-quarters of Chinese internet users—had participated in livestream shopping, and industry revenues surpassed US $700 billion, with projections to exceed US $1 trillion by 2026 (Statista, 2024). Live-commerce thus stands as one of the largest big-data-mediated consumer economies in the world, producing massive, continuous streams of behavioral data that shape not only consumer experiences but also the organization of labor.
Operating livestreams at this scale requires considerable resources. Human streamers command high labor costs, tire over long sessions, and differ in their ability to deliver the affective style preferred by brands. These constraints have fueled the rapid adoption of AI-generated “digital humans” (shuzi ren 数字人): photorealistic avatars created using generative AI, deepfake technologies, and lip-sync models. Major technology companies such as Baidu, Tencent, and ByteDance have developed the underlying AI infrastructure, while a growing ecosystem of smaller production companies specializes in creating customized digital humans for brands. By the end of 2024, approximately 1.36 million companies in China had registered in the “digital human” category, a 36.9% increase from the previous year (China University of Communications, 2025). The digital-human market was valued at US $26 billion in 2022, with the majority of applications concentrated in live-commerce (iiMedia Research, 2023). My fieldwork takes place inside two production companies that create and optimize AI streamers for brand clients.
The rapid rise of AI-generated “digital humans” in Chinese live-commerce offers a powerful site for studying gendered AI. Live-commerce in China is highly gendered: beauty and fashion, categories targeted primarily at women, account for over 70% of sales, and roughly 80% of human streamers are young women trained to project warmth, charm, and relational engagement. The Chinese term for streamers, zhubo (主播), carries gendered connotations in this context. Female streamers in China's digital economy often face sexualized stigma, and fashion and beauty influencers (wanghong 网红) navigate similar gendered expectations (Craig et al., 2021). These existing cultural scripts provide the templates that brands and AI companies draw upon when designing AI avatars. AI streamers overwhelmingly reproduce this female-coded persona: they are typically designed as youthful, attractive women whose appearance and demeanor draw on deepfake models, lip-sync animation, and generative AI dialogue systems. Both startups I studied produced almost exclusively female avatars. Male avatars exist primarily in categories such as electronics and tech. However, developers explained that female personas “convert better” across most product categories. Even campaigns targeting female audiences used female avatars. As one product manager put it, “Women trust women for beauty advice, and men like watching women. Female just works.” Male avatars occasionally came up in discussions but were dismissed as “niche” without proven metrics.
Yet these personas are not fixed. Traits such as voice pitch, gaze angle, facial softness, and micro-expressions are parameterized and adjusted in collaboration between vendors and brands, often during livestreaming sessions. Viewer metrics such as watch time, emoji bursts, and add-to-cart rates serve as yardsticks: underperforming variants are quickly tweaked or discarded, while successful versions are cloned and scaled across storefronts. This data-intensive process can unfold in just days, with enormous commercial impact: AI avatars in China have sold millions of dollars’ worth of goods in single livestream sessions (Cheng, 2025). This case provides a fertile ground for analyzing how gender in AI is not merely represented but actively produced through iterative cycles of optimization and metric-based feedback within big data infrastructures.
Data and methods
This study draws on fieldwork at two AI startups in Hangzhou, a city often described as China's “digital capital” for its dense cluster of AI startups specializing in live-commerce technologies. The two “digital human” production companies I worked with—pseudonymized here as startup A and startup B—supply avatars to brands livestreaming across China's major live-commerce platforms (Taobao, Douyin, JD, and WeChat) and collectively hold a significant share of the national AI-streamer market. I conducted approximately 120 h of participant observation inside these two companies, following full AI-streamer production cycles from initial client brief to post-launch analytics. This involved shadowing engineering teams on the sales floor, sitting in design meetings where voice, face, and gesture parameters were debated, and observing real-time “war-room” sessions during 12 live broadcasts. During these sessions, I recorded how developers altered avatar traits mid-stream in response to shifts in performance metrics such as watch time, emoji bursts, and add-to-cart rates. Fieldnotes were written within 24 h of each visit and supplemented with screenshots of metrics dashboards, photographs of whiteboard sketches, and internal presentation slides shared by staff.
To triangulate my findings, I also conducted systematic online observation of 30 AI-streamed sessions across major platforms. These observations extended my participant observation beyond the production sites, allowing me to compare the internal design decisions I witnessed with their public enactment. They provided a comparative lens on how gendered performances were staged, evaluated, and monetized in live streams, and offered a check on whether the practices I observed inside companies aligned with broader market patterns.
Alongside observation, I conducted 48 semi-structured interviews lasting between 60 and 90 min with engineers (n = 9), photographers (n = 3), avatar designers (n = 8), product and project managers (n = 15), and brand-side marketers (n = 13). The design teams at both startups were predominantly male, particularly in engineering and product management roles, while marketing and brand liaison positions included more women. The workplace culture reflected this composition. Developers rarely used gendered language to describe their products, emphasizing instead technical sophistication, smoothness of animation, realism, and customization options. Metrics were treated as ground truth, and engineers followed standardized workflows organized around performance dashboards. Gender, when it surfaced, was framed as a product variable to be optimized rather than a social category to be discussed. Interviews were conducted in Mandarin, audio-recorded with informed consent, transcribed verbatim, and translated where necessary.
Whereas participant observation allowed me to witness the micro-level design practices, interviews provided access to the rationales participants themselves used to justify or contest these design choices. Questions focused on the design workflow, the role of performance metrics, client-vendor negotiations, the perceived effectiveness of gender cues, and the challenges of maintaining “authenticity” in AI-mediated sales. Comparing how practitioners narrated their decision-making with what I observed in practice revealed both alignments and tensions between discourse and action. These interviews were further complemented by the analysis of 73 documents, including internal materials (KPI reports, marketing decks, technical proposals) and industry reports. These documents provided a check on interview accounts and revealed the formal evaluative criteria used to assess avatar performance.
Data analysis followed an abductive coding strategy (Timmermans and Tavory, 2012), combining open coding of fieldnotes, interview transcripts, and documents in MAXQDA with iterative memo-writing to refine emerging conceptual categories. Codes such as “parameter translation,” “metric spike,” and “variant cloning” were progressively clustered into higher-order themes that mapped onto the optimization loop framework. This abductive approach allowed me to move between empirical detail and theoretical insight, sharpening the account of how gendered traits were produced, evaluated, and stabilized in practice.
My position as a Mandarin-speaking researcher with prior industry familiarity facilitated access to design spaces and informal conversations. Although the research centers on two large Hangzhou-based firms representative of mainstream, revenue-driven production practices, it provides a rare, fine-grained account of how gender is operationalized, tested, and monetized within China's live-commerce AI industry.
Findings
This section traces how gender is actively produced and stabilized through an optimization loop spanning the entire development cycle of AI streamers. I organize the findings into two stages that correspond to distinct phases of production. Stage 1 (pre-launch translation) follows a sportswear brand collaboration at startup A to show how abstract brand aesthetics are systematically translated into contractually fixed parameters before any audience interaction occurs. Stage 2 (live optimization) draws on multiple campaigns at startup B to examine how these parameters enter recursive testing cycles driven by real-time platform metrics. Across repeated cycles, high-performing variants are scaled as defaults while underperforming ones are dropped, progressively narrowing the design space toward a single profitable femininity.
Stage 1: Pre-launch translation: From vague vibes to metric-ready parameters
This stage unfolds in three substages. First, developers translate vague brand ideals into measurable parameters. Second, they assemble the avatar by compositing the face and voice from source materials. Third, iterative negotiation with clients locks the parameters into a final persona ready for launch.
Translating client aesthetics
Gender production begins through interactions with brand representatives. On a February morning in 2024, I joined startup A's showroom to observe a delegation from Brand X, a Hangzhou-based sportswear company seeking to elevate its market image. The brand's team, led by a marketing vice-president with two assistants, arrived with just one PowerPoint slide and an abstract aesthetic directive. Four male staff members from startup A awaited them. The slide featured four pink characters, “甜辣高级” (tian la gao ji: sweet-spicy yet upscale),¹ superimposed on an image of a Lululemon ambassador sprinting against sunset hues. It was less a precise product brief than a mood board, reinforced by the VP's evocative description, “someone sweet but spicy, athletic but not aggressive, premium but accessible, someone men trust and women will not dislike.”
Producers at startup A did not question the vagueness of the request. Instead, they activated an established process of translation, moving quickly to render the brief in measurable form. Designers opened a shared screen, displaying reference materials: Nike's “City Ready” campaigns, a Shanghai-based fitness influencer's landing page, and an outdoor yoga shoot with pink overlays. The VP gestured approval or dismissal at selected images, articulating preferences through this curated gallery. This reference-finding practice operated like a form of ethnographic prompting, drawing implicit tastes into legible choices. A designer then dragged the selected images into a Midjourney interface, iteratively composing a detailed prompt: “18–25-year-old Chinese female, oval face, European double eyelids” (a marker of transnational beauty standards), “outdoor crossfit vibe, sweet-spicy style, premium athleisure, waist-to-hip ratio 0.68, pastel-neon crop-top, warm key light, 205 Hz voice.” Voice pitch, measured in Hertz (Hz), was a key parameter: adult women's voices typically range from 165 to 255 Hz, with higher pitches culturally coded as more youthful and feminine. Within minutes, a grid of avatars appeared, each calibrating sweetness and spiciness in subtle combinations of makeup, muscle definition, and glow. The VP pointed decisively at one: “柔中带刚” (rou zhong dai gang: gentleness with inner strength), a phrase for the harmony of strength and softness that is often applied to idealized femininity. The designer acknowledged the choice and saved three variations.
For the team, this was routine translation work. As one producer later explained, “Clients are not technical. Their requests usually come as vague vibes like ‘sweet but spicy.’ We give them controlled choices, so the process seems objective. Our job is translating these squishy adjectives into hard parameters.” The word “control” recurred across interviews, as designers spoke of “controlling timbre” or “controlling expression” the way one might adjust an audio equalizer.
Even at this pre-launch stage, optimization thinking was already at work. Designers did not choose traits based on personal taste or cultural judgment; they evaluated them against imagined dashboards before any audience had watched. When an intern suggested including a lower-pitched voice variant, the project manager rejected it outright, “Too risky. No metrics show a lower pitch will get clicks.” He cited past tests where retention dropped nearly 30% for deeper voices. The dismissal was not about aesthetics but anticipated metrics. In this way, past optimization cycles constrained which traits were considered viable in the present. What began as open-ended adjectives (sweet, spicy, premium) was progressively locked into numbers.
Assembling the avatar
Once initial parameters were set, the team turned to assembling the avatar from source materials. Brands often provided their own in-house streamers as raw material—a practice that was both cost-effective and strategically advantageous. By using likenesses already under company contracts, brands avoided complicated licensing negotiations while maintaining control over image and messaging. Designers selectively extracted key features—eyes from one streamer, a nose from another, vocal tone from a third—constructing what clients openly described as an “idealized composite.” Mr Wang, a digital marketing lead for a domestic perfume brand, explained, “We want someone who feels like our vibe. The company helps us merge multiple sources into one. It's not anyone real, but it is definitely ours.”
Voice was assembled in the same way. Brands sampled the timbre and pacing of their highest-converting human streamers, feeding those recordings into text-to-speech and lip-sync models that replicated every pitch slide and micro-pause on demand. The goal was uniformity rather than verisimilitude. “Our AI streamer speaks with our star human streamer's voice, only it never has an off day,” explained Xin, a senior brand marketer. Managers treated tone height, attack speed, and even a “stuttering index” as adjustable parameters across more than a hundred emotion categories, from “assertive” to “soft.”
Once likeness and audio were secured, the process shifted to formalizing aesthetic cues as technical parameters. Each prompt line mapped directly onto a measurable input: 205 Hz voice, waist-to-hip ratio 0.68, valence score 0.8. Engineer Li described this as a scientific process, “Sweet-spicy corresponds directly to two metrics—arousal at 0.7, valence at 0.8. We incorporate these into a multimodal prompt to let the GAN converge.” His phrase “let the GAN converge” captures the hybrid logic of production: aesthetic intuitions and affective ideals were funneled into hard-coded metrics designed for algorithmic manipulation. Femininity did not arise from a single model but from iterative narrowing guided by client taste, yet constrained by parameterization.
Negotiation continued through brand feedback, but once parameters were metricized, debates shifted from aesthetics to numbers. In one review session, Brand X's VP rejected a render as “too seductive—like a club girl (太妖艳, tai yaoyan—excessively alluring in a way that feels cheap or overly sexual).” Producer Zhang countered by asking, “Who benchmarks your sweet-spicy ideal?” The VP pointed to a Douyin fitness influencer with 1.8 million followers, known for cheerful giggles and neon crop-tops. Zhang recorded this “reference face” and instructed Li to tag 15 video clips with specific emotion scores. When the VP later vetoed another variant as “too playful for premium positioning,” the rejection was still resolved through measurable adjustments—lowering arousal by 0.1, raising valence by 0.2. Once parameters were set, aesthetic disagreements could only be resolved through numerical recalibration.
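The numerical recalibration reported above can be made concrete with a minimal sketch. It assumes, beyond what the fieldwork records, that arousal and valence are scored on a 0 to 1 scale, that adjustments are clamped to that scale, and that the reported deltas were applied to the 0.7/0.8 “sweet-spicy” baseline Engineer Li cited. The `Affect` class and its `recalibrate` method are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


def _clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Keep a score inside the assumed 0-1 range."""
    return max(lo, min(hi, x))


@dataclass(frozen=True)
class Affect:
    arousal: float
    valence: float

    def recalibrate(self, d_arousal: float = 0.0, d_valence: float = 0.0) -> "Affect":
        """Return a new setting shifted by the given deltas, clamped and rounded to two decimals."""
        return Affect(
            arousal=round(_clamp(self.arousal + d_arousal), 2),
            valence=round(_clamp(self.valence + d_valence), 2),
        )


# Baseline quoted by Engineer Li for "sweet-spicy".
sweet_spicy = Affect(arousal=0.7, valence=0.8)

# "Too playful for premium positioning": lower arousal by 0.1, raise valence by 0.2.
less_playful = sweet_spicy.recalibrate(d_arousal=-0.1, d_valence=0.2)
```

Under these assumptions the veto becomes the setting `Affect(arousal=0.6, valence=1.0)`, with valence at the top of the assumed scale. An objection voiced in aesthetic terms survives in the record only as two deltas.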
Locking the persona
As iterations continued, the parameters accumulated into a locked design. Within 15 days of the first client meeting, I could trace eight sequential builds. No user metrics existed during these pre-launch weeks, yet by V7, the avatar's voice had climbed to 230 Hz, and her waist had slimmed by 4%. A Git log reveals the path dependence this created: V1 added a pastel hoodie and a 205 Hz voice; V3 raised pitch to 218 Hz and introduced an off-shoulder bra; V7 locked a 230 Hz voice and trimmed the avatar's waistline. The voice moved steadily toward the higher, more youthful end of the feminine range while the body moved toward a slimmer silhouette, even though no adjustment was ever labeled as “more feminine.” Each adjustment was logged as “clarity,” “brand fit,” or “UX delight,” giving technical cover to a steady drift toward hyper-femininity.
By day 17, the team circulated a prompt-of-record that read like a parameter sheet: “18–25yo Chinese female, oval face, European double eyelids; cross-fit body; WHR 0.65; pastel-neon crop top; warm key light; 230 Hz voice; tone friendly/confident.” No one called this “feminine” or “sexy,” yet the numbers—pitch, waist-to-hip ratio, affective valence—encoded a narrow construction of femininity. What looked like brand compliance was, in fact, the translation of vague adjectives (“sweet but spicy”) into technical specifications that hardened into contractual assets. The project manager made the stakes explicit, “Once the face is locked, every pixel counts money.” At that point, the brand signed a contract, paid a 30% deposit, and agreed to “No Major Changes After T + 14 (two weeks after launch),” since downstream rigging and lip-sync work made reversal prohibitively costly. Once traits were codified in contract, objections no longer sounded like design disagreements but like violations of technical necessity.
Stage 1 reveals how gender enters the production pipeline before any audience interaction. Vague adjectives like “sweet but spicy” are translated into quantifiable parameters, negotiated through numerical recalibration, and locked into contracts. Once codified, these gendered choices no longer appear as choices. They become technical necessities, ready for the optimization cycles that follow. What makes this stage consequential is not that designers consciously stereotyped but that they systematically converted affective ambiguity into metrics-ready inputs, governed by the logic of what would later perform commercially. Profitability, not cultural tradition or audience preference, determined which configurations of femininity entered the optimization pipeline.
Stage 2: Post-launch optimization: From engineered parameters to hegemonic femininity
With parameters locked into contracts, the avatar was ready for launch. This stage unfolds in three substages. First, developers tune avatar parameters based on real-time performance data. Second, the repetition of winning traits trains audiences to expect and reward them. Third, winning configurations are codified and scaled across the industry, hardening into hegemonic femininity. This section draws primarily on fieldwork at startup B. I analyze a skincare campaign that illustrates avatar tuning through A/B testing and a whisky brand collaboration that reveals how audiences are trained to respond to stabilized affective cues.
Tuning the avatar
At startup B, developers referred half-jokingly to their launch cycles as “talking-point sprints”: short, intense iterations in which dialogue, gesture frequency, and lip-sync timing were continuously adjusted based on dashboard metrics. During a product launch in May 2024, I accompanied two engineers to a livestream studio affiliated with a high-end skincare brand. The AI streamer had been carefully designed in advance: a synthetic woman with soft features, glossy side-parted hair, and a soothing voice tuned to just above 200 Hz. Her design brief was to embody a “gentle-intellectual big sister”—“someone who could read a Lancôme label but also sing you to sleep,” as one product manager described. The target demographic—urban women aged 25–35—was translated into affective traits, aesthetic cues, and tonal adjustments calibrated to resonate with imagined consumer desires.
The implementation began in a spreadsheet. The team fed GPT a structured prompt: serum ingredients, dermatological buzzwords, and tonal cues like “nurturing” and “informed.” The model returned a 50-min sales script, each line mapped to a preset gesture from the company's motion library. During rehearsals, engineers flagged any lip-sync latency above 1 s. “If her mouth lags, no one believes the real-girl act,” a technician explained. The concern was not technical accuracy but believability, making the affective performance cohere with viewers’ expectations of a woman who was not only informative but also emotionally attuned and effortlessly sincere.
Despite meticulous preparation, the AI streamer's launch fell flat. Within minutes of going live, viewer dwell time plateaued at 4 s, well below the platform's KPI of 12 s, and add-to-cart rates stalled near zero. “She reads like a robot. Nothing is converting,” a brand-side analyst muttered. This was not interpreted as a failure of representation but as a signal to recalibrate. Engineers immediately launched an A/B test on smile intensity, deploying three avatars with identical scripts but different expressions: open-mouth smile, neutral face, and subtle smirk. “All other variables constant,” one developer said, underscoring the experimental logic. The closed-mouth smile won decisively. “That is her smile now,” the developer concluded. Adjustments cascaded. Script lines were softened, and clinical monologues were replaced with second-person phrasing. When a viewer asked, “What is good for thirty-year-old skin?” the AI responded, “Let me help you find the glow that belongs to your skin.” Watch time climbed to 11 s, and sales per minute hit the target. “We have found her lullaby,” the analyst declared, describing not human emotion but a statistically optimized rhythm of gesture, tone, and phrasing that converted.
Over the following days, the changes accumulated into a stabilized persona: attentive, soft-spoken, emotionally attuned. Chat logs shifted. Comments began to praise the affect rather than the product: “soothing voice,” “she looks so real,” “she smiles so naturally.” These sentiments appeared spontaneous but were artifacts of iterative tuning. While the design brief began with vague cues like “gentle” or “nurturing,” what counted as “gentle” was not known in advance. It was discovered retroactively, defined by whatever produced a statistical lift. And once discovered, the result was reattributed to audience preference: the metrics now proved what viewers “really” wanted. As one analyst concluded: “We made her gentle because the numbers liked her gentle. Now the numbers tell us she is gentle.”
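The selection logic behind a test like the smile comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the startup's actual tooling: the class `VariantStats`, the function `pick_winner`, the variant labels, and all numbers are hypothetical and invented for the example. It shows only the general shape of the procedure, in which variants share a script, differ in one trait, and are ranked by whichever metric the team has chosen to track.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantStats:
    """Aggregated results for one avatar variant (all fields hypothetical)."""
    name: str
    viewers: int          # viewers exposed to this variant
    watch_seconds: float  # total watch time across those viewers
    add_to_cart: int      # add-to-cart events from those viewers

    @property
    def mean_watch(self) -> float:
        # Average watch time per viewer, in seconds.
        return self.watch_seconds / self.viewers if self.viewers else 0.0

    @property
    def cart_rate(self) -> float:
        # Share of viewers who added the product to their cart.
        return self.add_to_cart / self.viewers if self.viewers else 0.0


def pick_winner(variants, metric="mean_watch"):
    """Return the variant with the highest value of the chosen metric."""
    return max(variants, key=lambda v: getattr(v, metric))


# Invented numbers for illustration only; they are not fieldwork data.
results = [
    VariantStats("variant_a", viewers=1000, watch_seconds=4000.0, add_to_cart=30),
    VariantStats("variant_b", viewers=1000, watch_seconds=11000.0, add_to_cart=20),
    VariantStats("variant_c", viewers=1000, watch_seconds=6000.0, add_to_cart=10),
]

winner_by_watch = pick_winner(results, metric="mean_watch")
winner_by_cart = pick_winner(results, metric="cart_rate")
print(winner_by_watch.name, winner_by_cart.name)
```

In this invented data set the two metrics crown different variants, which illustrates why the choice of metric is itself a design decision: the "winner" depends on what the team decides to count.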
Training the audience
The optimization loop does more than identify traits that convert; it also conditions audiences to expect and reward them. Once smile curvature, pitch, and phrasing produce sales, these traits are not simply retained but systematically repeated. Over time, repetition shapes what viewers anticipate, respond to, and come to prefer. The following case from startup B's whisky brand collaboration illustrates this process.
The campaign targeted men in their 40s and 50s, many of whom were assumed to appreciate refinement and exclusivity. The avatar was carefully designed to match: a polished woman in a satin blouse and blazer, seated against a warm-lit bar with shelves of single malts behind her. She spoke with clipped precision, enunciating every tasting note—air-dried oak, hints of vanilla, peppery finish—as if delivering a private tutorial. “We wanted her to feel like someone who could walk you through a cask tasting in a private club,” a brand marketer explained. By internal standards, the performance was flawless. But the numbers disagreed. Viewer retention sagged, click-throughs hovered at baseline, and sales stalled. On the live feed, a lone comment summed it up: “Feels like a lecture.” Inside the team, the verdict was swift: professionalism conveyed expertise but failed to generate lift. “It had polish, but no spark,” one analyst concluded.
Within 24 h, the team rolled out a redesign. The blouse was swapped for a crimson qipao—a traditional Chinese dress associated with refined feminine elegance and graceful hospitality. “Qipao means poised, attentive, host-like,” a marketer explained. The new look was meant to signal elegance while leaving male authority intact. An agency manager who oversaw human streamers spelled it out, “When my streamers sell alcohol, I want men to feel like kings. They wear their best qipao, perfect their look, and give compliments that feel earned. It's elegance that doesn’t compete with masculinity; it supports it.” The changes went beyond costume. The script was rewritten to layer tasting notes with flattery: “Excellent taste,” “You know how to pick a classic,” “This is a drink for men with real presence.” Her vocal pitch dropped half a register, pacing slowed, and affirmations were precisely aligned with product mentions. “We’re giving them validation,” one developer explained. “Making them feel like men of taste.” The effects were immediate. Chat filled with whisky-glass emojis and fire symbols. Comments shifted from silence to engagement: “elegant vibe,” “good taste,” “feels like she's talking to me.” Viewer retention climbed, and conversion rates rose by 14%. Internally, the team celebrated what they called “the gentleman loop”: a persona that validated male taste without seeming deferential.
What followed was more consequential than a one-off fix. In subsequent streams, the team repeated the formula: the qipao, the slow pacing, the affirming compliments. Emotional cues were delivered at predictable beats—“a classic for men with refined taste,” “a drink that pairs with confidence”—and the audience began to anticipate them. Emojis appeared on schedule, comments echoed lines verbatim, and reactions aligned with the avatar's delivery. Over time, the repetition took on a ritual quality. One night, when the avatar intoned, “This is a drink for men with presence,” chat lit up with synchronized replies of “that's me” and “cheers, Madam Qipao.” A week later, the nickname stuck: users greeted her at the start of streams with, “Our whisky lady is back.”
This ritual quality reveals how the optimization loop trains audiences. What kept viewers engaged was the predictability of affective reward: flattery arrived on cue, validation felt personal, and the ritual offered belonging. Viewers who found the interaction uncomfortable likely exited; those who stayed had opted into the ritual. I observed limited critique in chat logs. When viewers did comment critically, their remarks focused on believability, not gender. Comments like “still robotic” questioned whether the avatar seemed human-like, not why she was deferential, admiring, or emotionally available.
To be clear, repetition and ritualized interaction are common in human live-commerce as well. Streamers develop catchphrases, audiences respond with synchronized reactions, and parasocial bonds form through familiarity. What distinguishes the AI optimization loop is not repetition itself but systematic testing and replication at scale. Human streamers also repeat what works, but they cannot isolate variables with the same precision or replicate performances with mechanical consistency. The optimization loop does not merely reflect audience preference; it manufactures preference, encoding a profitable template of femininity and conditioning viewers to find it desirable.
From iteration to hegemony
The optimization loop does not end with trained audiences. This substage traces the final step: how the narrowing process produces a hegemonic femininity that scales across the industry. Not all styles of femininity survive the optimization loop. The system quietly penalizes traits that introduce friction: styles that are too fast, too assertive, too weird. Over time, small adjustments accumulate into a consistent logic of what “works.” In one stream for eye cream, phrases linking the product to age-related anxieties (“skincare is a woman's lifelong commitment”) drew more clicks and buys than technical descriptions like “reduces fine lines.” Once the lift was detected, the phrase became a template for future scripts. As one designer summarized, “Once we see a lift, even small, we keep it in rotation.” The loop selects for language that reinforces gendered anxieties, not because developers intend this, but because such language converts.
Alternatives are not rejected outright but quietly filtered through metrics. A developer recounted an experiment with a cell phone avatar modeled after tech-savvy vloggers: confident, witty, fast-paced. The experiment failed within 48 h. Sales dropped, and viewers described her as “too much” or “trying too hard.” As a senior analyst explained, “It's not that she was not credible. It's that she did not feel good to watch.” The avatar's competence was not in question. Her style was. Confidence and wit, when not softened by warmth or deference, registered as friction. Viewers did not reject her knowledge; they rejected her tone. Competence, it seems, only converts when wrapped in softness. Across interviews, designers noted that tones introducing tension often “made it weird” or “broke the vibe.” The loop filtered these styles out, not because they were inaccurate, but because they did not convert. Over time, the loop distilled femininity into a narrow affective bandwidth: warm, admiring, and slow enough to linger. The avatars did not reflect a pre-existing feminine ideal drawn from real women or audience preferences. They reflected what converted: what made viewers stay, click, and buy.
What makes this process hegemonic is not ideology but infrastructure. Winning configurations are codified as “reference avatars” and carried forward into subsequent projects. Throughout my fieldwork at the two companies, I observed project managers routinely directing new clients to templates that had already proved to “work.” “Why start from scratch?” one manager at startup A explained. “We know these work.” Many brands, especially lower-end ones, adopt these existing models directly because of limited budgets for customization. As one designer put it, “Big brands can afford to experiment. Small brands just ask us, ‘give me something like the skincare girl.’” Through this process, commercially validated femininity becomes the industry default. New projects begin where previous optimizations ended, inheriting the narrowed parameters as starting points. Deviation becomes costly because experimenting with alternative styles risks lower conversion without proven templates to fall back on. Designers are pragmatic. They do not intend to enforce a gender hierarchy. They simply ask what converts. But what converts is hierarchical. The traits that consistently win, such as softness, deference, and emotional availability, align with existing gender inequalities. Traits associated with female authority, such as confidence and directness, are filtered out as “friction.” Power here is not intentional but structural. It operates through which traits are rewarded, which are eliminated, and who bears the cost of deviation.
This structural power is reinforced by how the loop masks its own politics. Because every design choice is routed through performance metrics, gender is rarely named as a deliberate variable. Engineers do not say they are adjusting gender; they say they are “tuning for engagement.” As one analyst explained, “We did not set out to make her soft. The numbers made her soft.” Yet data is never simply given (Kitchin, 2014). The “numbers” that “made her soft” were themselves products of human choices: what to parameterize (voice pitch, smile curvature), which metrics to track (watch time), and what to filter out (confidence, wit, speed). The displacement of agency onto metrics obscures these choices and makes gendered outcomes difficult to challenge.
Stage 2 reveals how pre-configured parameters enter live testing and are validated, filtered, and locked in through metrics. The optimization loop does not discover what audiences want; it produces behavioral compliance through repeated exposure. Alternatives are filtered out; winning traits are scaled as industry defaults; and the process is masked as neutral performance tuning. What emerges is a hegemonic femininity that feels natural because both the avatars and the audiences have been conditioned to align with it.
Discussion and conclusions
This study examines how commercial AI systems manufacture gendered personae through a recursive, metric-driven optimization loop. Drawing on ethnographic fieldwork at two Chinese AI startups and 48 interviews with engineers, designers, and marketers, I show that AI streamers become “feminine” not through a one-time design choice but through continuous calibration to real-time performance data. Brand descriptors such as “sweet-spicy” or “approachable” are first translated into tunable traits such as voice pitch, gaze curvature, waist-to-hip ratio, and scripted emotional tone, then formalized as production assets (prompts-of-record, reference faces, design tokens, contracts). After launch, these traits are A/B-tested in livestreams. Platform and business metrics such as watch time, emoji reactions, add-to-cart rates, and sales per minute determine which combinations are retained. High-performing configurations are codified as standardized “reference avatars” and carried forward in backlogs and budgets, while underperforming variants are dropped. Across cycles, metrics function as a control signal for design, narrowing a wide space of possible styles to a single, profitable, platform-legible femininity: soft-spoken, admiring, emotionally available. Femininity is not merely reflected; it is assembled, selected, and stabilized to meet commercial goals.
This production focus extends the bias-reproduction account that dominates scholarship on gendered AI. Existing work shows how skewed data, labels, developer demography, and ranking dynamics yield gendered outputs (Abdollahpouri et al., 2019; Buolamwini and Gebru, 2018; Chaney et al., 2018; Leavy, 2018; Singh and Joachims, 2018). These accounts treat bias as something that enters the system and is reproduced or amplified at the output. The optimization-loop view shifts the focus from reproduction to production. Feedback is not merely a downstream readout of existing bias but a design tool that actively shapes what the AI becomes. The loop links four moves into a single production system: translating gendered brand adjectives into controllable parameters; linking those parameters to performance metrics; reallocating exposure toward high-performing variants; and writing winners back into production artifacts. In this sequence, data do not merely register a pre-given persona; they select and stabilize a gendered one.
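The four moves can be made concrete with a schematic sketch. The Python below is an analytical illustration under stated assumptions, not a reconstruction of either startup's code: the names `Variant` and `run_cycle`, the proportional allocation rule, the `keep` cutoff, and every number are hypothetical. It assumes non-negative scores and at least one variant.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    params: dict           # Move 1: brand adjectives already translated into parameters
    exposure: float = 0.0  # share of traffic routed to this variant
    score: float = 0.0     # latest value of the tracked metric


def run_cycle(variants, measure, keep=2):
    """Run one pass of the loop; return (surviving variants, new default params)."""
    # Move 2: link each parameter set to a performance metric.
    for v in variants:
        v.score = measure(v)

    # Move 3: reallocate exposure in proportion to score (one possible rule).
    total = sum(v.score for v in variants)
    for v in variants:
        v.exposure = v.score / total if total > 0 else 1 / len(variants)

    # Low performers are dropped; remaining exposure is renormalized to sum to 1.
    survivors = sorted(variants, key=lambda v: v.score, reverse=True)[:keep]
    share = sum(v.exposure for v in survivors)
    for v in survivors:
        v.exposure /= share

    # Move 4: write the top performer back into the product as the default.
    default_params = dict(survivors[0].params)
    return survivors, default_params


# Invented parameter values and scores, for illustration only.
pool = [
    Variant("a", {"arousal": 0.7, "valence": 0.8}),
    Variant("b", {"arousal": 0.6, "valence": 1.0}),
    Variant("c", {"arousal": 0.5, "valence": 0.9}),
    Variant("d", {"arousal": 0.9, "valence": 0.4}),
]
invented_scores = {"a": 2.0, "b": 5.0, "c": 3.0, "d": 0.0}

survivors, default_params = run_cycle(pool, measure=lambda v: invented_scores[v.name])
print([v.name for v in survivors], default_params)
```

Each human-set element in the sketch (the metric passed as `measure`, the allocation rule, the `keep` threshold, and the write-back step) is a point where a different choice would yield a different surviving persona.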
Because each step is framed as “what performs,” the politics of gender are naturalized as neutral performance engineering. Developers of AI streamers seldom aim to “add femininity”; they chase performance metrics and adjust traits that fail to lift the numbers. Yet this purported neutrality reliably rewards a narrow cluster of affects (softness, emotional availability, an admiring tone) while penalizing frictional styles. Across thousands of small, data-driven tweaks, rewarded traits harden into a template that later reads as natural and “user-driven.” The engineered identity is then attributed to “what the data wants,” producing infrastructural plausible deniability: hegemonic gender identity becomes hard to contest precisely because it appears as the market's objective will rather than the cumulative effect of metric-guided selection by human teams. This diagnosis locates agency not in data alone but in the human choices that define metrics, thresholds, allocation rules, and write-backs.
The optimization loop contributes to theories of identity in AI by treating identity as actively produced rather than merely reflected. Critical data studies show how platforms manufacture user identities by correlating behavioral traces and sorting audiences (Bucher, 2018; Cheney-Lippold, 2017; Chun, 2021; Gillespie, 2014). I extend this claim to the supplier side: the same infrastructures of correlation, ranking, and reward also construct the AI persona itself. Building on “doing gender” (West and Zimmerman, 1987) and performativity (Butler, 1990), I argue that optimization infrastructures function as an engine of metric recognition. It is performance against metrics, not interpersonal recognition, that decides which stylized acts “count.” Repetition under reward sediments high-performing acts into a stable persona, which is then written back into production plans. Gender is thus not a fixed attribute traveling through a pipeline; it is the moving result of parameterization, measurement, selection, and codification.
This understanding of AI identity as produced rather than reflected has political-economic implications. The optimization loop is the routine product-work through which surveillance capitalism acts on identity (Zuboff, 2019). In this economic model, platforms do not simply predict what people will do; they intervene to shape those actions. Every pause, scroll, click, or purchase becomes behavioral data, which feeds back into algorithms that determine which design features will maximize engagement (Fourcade and Johns, 2020). In this model, AI avatars are not merely outputs; they are instruments for generating behavioral data. Commercial AI avatars are engineered not just to represent gender but to elicit the attention, clicks, and purchases that platforms extract and monetize. Gender itself becomes a variable that can be tuned, tested, and monetized, continuously recalibrated to increase “behavioral surplus”—the excess attention, engagement, and purchasing extracted once the system identifies what works. When a particular look, tone, or movement improves the metrics, it is reinforced in the next design cycle until audiences are trained to respond to that version. The optimization loop applies the extractive logic of surveillance capitalism to gender, narrowing the range of possible performances and naturalizing the most profitable form as inevitable.
These dynamics extend beyond commercial AI to carry implications for gender inequality in the wider social context. The traits that consistently “work” (softness, deference, emotional availability) are precisely those that align with patriarchal expectations: femininity that legitimates male authority and provides emotional labor without reciprocity (Connell, 1987; Schippers, 2007). What makes this process powerful is that commercial viability and gender hierarchy converge on the same traits. Because selection occurs through metrics rather than explicit ideology, this convergence is masked as neutral market logic. As these avatars proliferate across platforms, they normalize this narrowed femininity as “what works,” potentially shaping expectations for human women in similar roles. The optimization loop does not just carry existing gender inequalities forward; it amplifies them, scales them, and presents them as what the market naturally rewards.
Making the optimization loop visible re-politicizes gender in AI by relocating accountability from “upstream hygiene” to the optimization infrastructures that decide which personae get scaled. Instead of treating bias as a residue of skewed inputs, this framework shows how identity is actively produced through metric-driven optimization, what I call doing data, doing gender. The accountability question turns on authorship of metrics and thresholds, rationales for exposure decisions, and the organizational routines that construct and stabilize winning traits. Gendered outcomes are not the impersonal verdict of markets but the cumulative record of situated choices by developers, product managers, and platform operators. This opens new terrain for analysis and governance: assessing decision rights over metrics, tracing selection histories, and evaluating how “metric recognition” substitutes for interpersonal recognition in making identities count.
This study situates its analysis in China's hyper-commercial live-commerce sector, a magnified site where the optimization loop is unusually exposed at platform scale. Here, AI avatars are valued only insofar as they move measurable outcomes, and performances are continuously scored by retention, reactions, and add-to-cart behavior. Under these high-frequency feedback conditions, the commercial calibration of gender becomes visible: femininity is not coded once and for all but continually recomputed at the pace of analytics. Focusing on this site expands gender-and-AI research beyond Euro-American voice assistants such as Siri and Alexa (Noble, 2018; West et al., 2019) and surfaces dynamics that remain harder to see in less commercial or less transparent settings.
The mechanisms traced here are not unique to live-commerce. Iterative A/B testing, metric-guided elimination of alternative styles, and the survival of the most commercially performant persona occur wherever AI products test affective cues against performance metrics. Voice assistants, customer-service bots, social companions, or educational avatars are all subject to similar optimization pressures. Early evidence points to domain-specific affective norms: efficiency-oriented styles in customer-service chatbots (Følstad and Skjuve, 2019); relational attachment and empathy cues in social robots (Darling, 2016); and supportive, empathic behaviors in educational robotics (Belpaeme et al., 2018). Comparative work can trace optimization loops across domains and countries to identify which performance indicators crystallize which hegemonic styles, and how those styles intersect with race, age, or class. Metric-driven femininity in Chinese live-commerce is thus not an outlier but a diagnostic lens on how big-data optimization infrastructures worldwide assemble normative personae in AI systems.
Footnotes
Acknowledgements
I would like to thank the anonymous peer reviewers for their constructive feedback, the participants at the Encoding Realities, Decoding Power virtual conference, the audience at the Society for Social Studies of Science conference, and the editors of this Special Issue. I also thank my anonymous ethnographic interlocutors for their time and trust.
Funding
The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Mellon/ACLS Dissertation Innovation Fellowships.
Declaration of conflicting interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
