Location and set-up
Our experiment took place at Trip.com in Shanghai, China. In July 2021, Trip.com decided to evaluate hybrid WFH after seeing its popularity amongst US tech firms. The first step took place on 27 July 2021, when the firm surveyed 1,612 eligible engineering, marketing and finance employees in the Airfare and IT divisions about the option of hybrid WFH. The firm excluded interns and new recruits still in their probation periods, because on-site learning and mentoring are particularly important for those individuals. Trip.com chose these two divisions as representative of the firm, with a mix of employee types to assess any potentially heterogeneous impacts. About half of the employees in these divisions are technical employees, writing software code for the website and for front-end or back-end operating systems. The remainder work in business development, with tasks such as talking to airlines, travel agents or vendors to develop new services and products; in marketing, planning and executing advertising and marketing campaigns; and in business services, dealing with a range of financial, regulatory and strategy issues. Across these groups, 395 individuals were managers and 1,217 were non-managers, providing a large enough sample of both groups to evaluate their responses to hybrid WFH.
Randomization
The employees were sent an email outlining how the six-month experiment offered them the option (but not the obligation) to WFH on Wednesday and Friday. After the initial email and two follow-up reminders, a group of 518 employees volunteered. The firm randomized employees with odd birthdays—those born on the first, third, fifth and so on of the month—into eligibility for the hybrid WFH scheme starting in the week of 9 August. Those with even birthdays—born on the second, fourth, sixth and so on of the month—were not eligible, and so formed the control group.
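For illustration, this assignment rule reduces to a parity check on the day-of-month of birth. The following is a minimal sketch in Python; the roster and column names are hypothetical and not Trip.com's actual data schema.

```python
import pandas as pd

# Hypothetical roster; only the day-of-month of birth matters for assignment.
roster = pd.DataFrame({
    "employee_id": [101, 102, 103, 104],
    "birth_day_of_month": [3, 14, 21, 8],
})

# Odd day-of-month -> eligible for hybrid WFH (treatment);
# even day-of-month -> not eligible (control).
roster["treated"] = roster["birth_day_of_month"] % 2 == 1
print(roster)
```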
The top management at the firm was surprised at the low volunteer rate for the optional hybrid WFH scheme. They suspected that many employees were hesitating out of concern that volunteering would be seen as a negative signal of ambition and productivity. This concern is not unreasonable: a previous study, for example, found that WFH employees at the US firm it evaluated were negatively selected on productivity. So, on 6 September, all of the remaining 1,094 non-volunteer employees were told that they were also included in the scheme. The odd-birthday employees were again randomized into the hybrid WFH treatment and began the experiment in the week of 13 September. In this paper we analyze the two groups together, but examining the volunteer and non-volunteer groups individually yields similar findings of reduced quit rates and no impact on performance.
Employee characteristics and balancing tests
Figure 1 (left) shows employees working in the office. Employees all worked in modern open-plan offices, in desk groupings of four or six colleagues from the same team. By contrast, when WFH, they usually worked alone in their apartments, typically in the living room or kitchen.
The individuals in the experimental sample are typically in their mid-30s. About two-thirds are male, all of them have a university undergraduate degree and almost one-third have a graduate degree (typically a master’s degree). In addition, nearly half of the employees have children.
In Extended Data Table 7 we confirm that this sample is also balanced across the treatment and control groups, using two-sample t-tests. The exceptions arise from random variation, given that assignment was by even or odd day-of-month of birthday: the control sample is 0.5 years older (P = 0.06), which is presumably linked to why this group has 0.06 more children (P = 0.02) and 0.4 years more tenure (P = 0.09).
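For concreteness, each row of the balance table amounts to a two-sample t-test of a pre-treatment characteristic across the two groups. Below is a minimal sketch with simulated ages; the data are illustrative, not the experimental sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated pre-treatment characteristic (age in years) for the two groups.
age_treatment = rng.normal(35.0, 5.0, size=800)
age_control = rng.normal(35.5, 5.0, size=800)

# Two-sample t-test: the null is zero difference in means between groups.
t_stat, p_value = stats.ttest_ind(age_treatment, age_control)
print(f"t = {t_stat:.2f}, P = {p_value:.3f}")
```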
In Extended Data Table 3, we examine the decision to volunteer for the WFH experiment. We see that volunteers were significantly less likely to be managers; beyond this, we find no evidence, at least in this case, of any negative (or positive) selection effects around WFH.
Extended Data Fig. 3 plots the take-up rates of WFH on Wednesday and Friday for the volunteer and non-volunteer groups. We see a few notable facts. First, overall take-up was about 55% for volunteers and 40% for non-volunteers, indicating that both groups tended to WFH only one day, typically Friday, each week. At Trip.com, large meetings and product launches often happen mid-week, so Friday is seen as a better day to WFH. Second, the fact that even non-volunteers had a 40% take-up rate indicates that Trip.com's suspicion that many employees did not volunteer out of fear of negative signaling was well founded, and highlights that amenities like WFH, holiday, and maternity or paternity leave might need to be mandatory to ensure reasonable take-up rates. Third, take-up surged on Fridays before major holidays: many employees returned to their home towns, using their WFH day to travel on the quieter Thursday evening or Friday morning. Finally, take-up rates jumped for both treatment-group and control-group employees in late January 2022, after a case of COVID-19 at the Shanghai headquarters. Trip.com allowed all employees to WFH at that point, so the experiment effectively ended early, on Friday 21 January. The measure of an employee's daily WFH take-up excludes annual leave, sick leave and occasions when they could not come to the office owing to extreme weather (typhoons) or to the COVID-19 outbreak in the company.
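A minimal sketch of how such a take-up measure could be computed is shown below, assuming a daily attendance log; the field names and exclusion flag are hypothetical.

```python
import pandas as pd

# Hypothetical attendance log for one employee.
log = pd.DataFrame({
    "date": pd.to_datetime(["2021-09-15", "2021-09-17",
                            "2021-09-22", "2021-09-24"]),
    "wfh": [False, True, False, True],
    # Leave, sick leave, typhoon or company COVID-19 closure.
    "excluded": [False, False, True, False],
})

# Keep eligible WFH days (Wednesday=2, Friday=4) that are not excluded.
eligible = log[log["date"].dt.dayofweek.isin([2, 4]) & ~log["excluded"]]
take_up_rate = eligible["wfh"].mean()
print(f"Take-up rate: {take_up_rate:.0%}")
```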
Null results
To interpret the main null results, we conducted null equivalence tests using the two one-sided tests (TOST) procedure in R. This procedure requires specifying the smallest effect size of interest (SESOI). For the results pertaining to the performance review measures, we used 0.5 as the SESOI. This corresponds to half of a one-letter-grade increase or decrease, because we assigned numeric values to performance letter grades in increments of 1, with the lowest letter grade, D, being 1 and the highest letter grade, A, being 5. We performed equivalence tests based on a two-sample Welch's t-test using equivalence bounds of ±0.5. The TOST procedure yielded significant results at the default alpha of 0.05 for the tests against both the upper and the lower equivalence bounds for the performance measures for July–December 2021.
We conducted null equivalence tests for the effect of the treatment on promotions using 2 as the SESOI, corresponding to a ±2 percentage point (pp) difference in promotion rates. Although we can reject the null hypothesis that the true effect of treatment on promotion is larger than 2 pp or smaller than −2 pp in January–June 2022 and July–December 2022, we fail to reject the null equivalence hypothesis in July–December 2021 and January–June 2023. Thus, we interpret the results on promotion as showing no evidence of a difference in promotion rates between treatment and control employees.
We also conducted the equivalence test for lines of code, using 29 lines of code per day as the SESOI. We can reject the TOST null hypothesis of non-equivalence for lines of code, so we interpret the treatment effect on lines of code as a null effect.
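Although the paper's tests were run in R, the same TOST logic is available in Python through statsmodels; the following is an illustrative sketch with simulated grades, not the actual analysis code.

```python
import numpy as np
from statsmodels.stats.weightstats import ttost_ind

rng = np.random.default_rng(1)

# Simulated performance grades on the 1 (D) to 5 (A) numeric scale.
perf_treatment = rng.integers(1, 6, size=700).astype(float)
perf_control = rng.integers(1, 6, size=700).astype(float)

# TOST against the +/-0.5 SESOI bounds with Welch's t-test
# (usevar="unequal"); returns the overall p-value and the two
# one-sided tests against the lower and upper bounds.
p_overall, lower_test, upper_test = ttost_ind(
    perf_treatment, perf_control, low=-0.5, upp=0.5, usevar="unequal"
)
print(f"TOST P = {p_overall:.4f}")  # P < 0.05 rejects non-equivalence
```

The promotion and lines-of-code tests follow the same pattern, with the bounds replaced by the ±2 pp and ±29 lines-per-day SESOIs, respectively.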
Volunteer versus non-volunteer groups
In the main paper we pool the volunteer and non-volunteer groups. In Extended Data Table 5 we examine the impacts on performance and promotions separately for the volunteer and non-volunteer groups, and we see no evidence that the treatment effects differ between them.
Performance subcategories
The company has a rigorous performance review process every six months that determines employees' pay and promotion, so it is carefully conducted. The review process for each employee is built on formal reviews provided by their managers, project leaders and sometimes co-workers (peer review). Managers here are an employee's direct managers for organizational purposes, but for a particular project the project leader could be another higher-level employee; in such a case, the employee's manager would ask that project leader for an opinion on the employee's contribution to the project. An individual's overall score is a weighted sum of scores from various subcategories that managers have broad flexibility in defining, because tasks differ across employees, and managers give a score for each task. For example, an employee running a team themselves will have subcategories around developing their direct reports (leadership and communication), whereas an employee running a server network will have subcategories around efficiency and execution. The performance subcategory data come from the text of the performance reviews. We first used Jieba, the most popular Chinese word segmentation package in Python, to identify the most frequent Chinese words in task titles across the four performance reviews. We also removed meaningless words and incorporated common expressions such as key performance indicators ('KPI'), objectives and key results ('OKR'), 'rate' and '%'. This process resulted in a total of 236 unique words and expressions. We then manually categorized these most frequent keywords into nine major subcategories by meaning and relevance. Finally, on the basis of the presence of keywords in the task title, tasks were grouped into the following subcategories (a minimal sketch of this keyword-matching step appears after the list):
– Communication tasks
– Development tasks
– Efficiency tasks
– Execution tasks
– Innovation tasks
– Leadership tasks
– Learning tasks
– Project tasks
– Risk tasks
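To make the keyword-matching step concrete, here is a minimal Python sketch; the keyword dictionary below is illustrative only and not the actual 236-term mapping used in the paper.

```python
import jieba

# Illustrative keyword dictionary; the actual mapping covers 236 unique
# words and expressions across nine subcategories.
SUBCATEGORY_KEYWORDS = {
    "Communication": {"沟通"},     # communication
    "Execution": {"执行", "KPI"},  # execution, key performance indicators
    "Innovation": {"创新"},        # innovation
}

def classify_task(title: str) -> list[str]:
    """Segment a task title with Jieba and return matching subcategories."""
    tokens = set(jieba.lcut(title))
    return [name for name, keywords in SUBCATEGORY_KEYWORDS.items()
            if tokens & keywords]

print(classify_task("提升团队沟通效率"))  # -> ['Communication']
```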
Data sources
Data were provided by a combination of Trip.com sources, including human resources records, performance reviews and two surveys. All data were anonymized and coded using a scrambled individual ID code, so no personally identifiable information was shared with the Stanford team. The data were drawn directly from the Trip.com administrative data systems on a monthly basis. Gender is collected by Trip.com from employees when they join the company.
Subsamples
The full sample has 1,612 experiment participants, but we have 1,507, 1,355, 1,301 and 1,254 employees, respectively, in the subsamples for the four performance reviews covering July–December 2021, January–June 2022, July–December 2022 and January–June 2023. These smaller samples are due to attrition. In addition, for the first performance review, in July–December 2021, 105 employees did not have sufficient pre-experiment tenure to support a performance review (they had joined the firm less than three months before the experimental draw). The review text data cover 1,507, 1,339, 1,290 and 1,246 people, as some employees have an overall score and review text but lack the additional task-specific scores; these employees did not perform the full range of tasks, so their managers did not write the full review script. For the two surveys, Trip.com used Starbucks vouchers to incentivize responses and collected responses from 1,315 employees (314 managers, 1,001 non-managers) at baseline and from 1,345 employees (324 managers, 1,021 non-managers) at the endline.
Testing
All tests used two-sided Student's t-tests unless otherwise stated. Analyses were run in Stata v17 and v18 and in R v4.2.2. Unless stated otherwise, no additional covariates were included in the tests. The null hypothesis for all of the tests, excluding the null equivalence tests, is a coefficient of zero (for example, zero difference between treatment and control).
Inclusion and ethics statement
The experiment was designed, initiated and run by Trip.com. No participants were forced to WFH owing to the experiment (the entire firm was, however, required to WFH during the pandemic lockdown). The treatment sample had the option, but not the obligation, to WFH on Wednesday or Friday. N.B. and R.H. were invited to analyze the data from the experiment, with consent for data collection coming from Trip.com internally. The experiment was exempt under institutional review board (IRB) approval guidelines because it was designed and initiated by Trip.com before N.B. and R.H. were invited to analyze the data. Only anonymized data were shared with the Stanford team. Trip.com based the experimental design and execution on its previous experience with WFH randomized controlled trials.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.