02 Data Preprocessing

Author

gitSAM

Published

March 31, 2025

This section describes the data used to construct and evaluate the TBTF portfolio strategy. We combine several sources to analyze market structure, return distributions, and benchmark comparisons. The primary dataset is drawn from CRSP; supplemental reference portfolios come from the Fama-French data library, and index and ETF returns come from Yahoo Finance.

1 Primary Data: CRSP Monthly Stock File

We use the Center for Research in Security Prices (CRSP) monthly stock files from January 1996 to December 2023. The CRSP universe includes all listed common stocks on the NYSE, Nasdaq, and AMEX exchanges. Key variables include:

PERMNO: Unique stock identifier
Date: Monthly observation date
Market Capitalization: Shares outstanding × price
primaryexch: Primary exchange on which the stock is listed
SIC Code: Optional industry classification
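
The market-capitalization variable can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `market_cap` and its argument names are hypothetical, and it assumes the legacy CRSP conventions in which a negative price marks a bid/ask midpoint and shares outstanding are reported in thousands. A file with a different layout would need these adjustments changed or dropped.

```python
from typing import Optional


def market_cap(prc: Optional[float], shrout_thousands: Optional[float]) -> Optional[float]:
    """Return market capitalization in dollars, or None if an input is missing.

    Assumptions (not confirmed by the text): a negative price flags a
    bid/ask midpoint, and shares outstanding are quoted in thousands.
    """
    if prc is None or shrout_thousands is None:
        return None
    # abs() removes the midpoint flag; * 1_000 converts thousands of shares to shares.
    return abs(prc) * shrout_thousands * 1_000
```

For example, `market_cap(-25.0, 2000.0)` treats the price as 25 dollars and the share count as two million.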

1.1 Market-Cap Based Valuation Perspective

Unlike fixed-income instruments, public equities derive their economic value primarily from their tradeability. While dividends contribute to realized returns, they are secondary to a stock’s market capitalization path, which reflects its ability to attract and retain capital under uncertainty.

We therefore adopt a market cap–based valuation approach, using capitalization time series as the core proxy for equity value and flow dynamics.

1.2 Delisting Treatment: Terminal Value = Zero

When a stock is delisted, its market capitalization is set to zero and its final gross return is set to 0, implying a net return of −1.

\[ R_{\text{gross}} = 0, \quad R_{\text{net}} = -1 \]

This treatment reflects the fact that delisted equities become practically untradeable and lose their resale value for investors. Unlike fixed-income instruments, which may leave creditors with a recovery claim, equity is assumed here to carry no residual recovery value.
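
The terminal-value rule can be sketched as below. This is an illustrative sketch, not a definitive implementation: the helpers `next_month`, `add_delisting_row`, and `gross_returns` are hypothetical names, months are represented as `(year, month)` tuples, and the gross return is taken to be the ratio of consecutive market caps, which is our reading of the market cap-based approach in Section 1.1. The series is assumed to have no gaps between months.

```python
def next_month(ym):
    """Return the (year, month) tuple following ym."""
    year, month = ym
    return (year + 1, 1) if month == 12 else (year, month + 1)


def add_delisting_row(caps, sample_end):
    """Append a zero-cap terminal month if the stock stops before sample_end.

    caps: dict {(year, month): market_cap} for a single stock.
    """
    last = max(caps)
    out = dict(caps)
    if last < sample_end:
        out[next_month(last)] = 0.0  # terminal value = zero
    return out


def gross_returns(caps):
    """Gross return per month as cap_t / cap_{t-1} (assumes consecutive months)."""
    months = sorted(caps)
    return {m: caps[m] / caps[p] for p, m in zip(months, months[1:])}
```

A stock last observed in December 2020 receives a January 2021 row with market cap 0, so its final gross return is 0 and its net return is -1; a stock observed through the end of the sample is left unchanged.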

1.3 IPO Treatment: External Entry State

Newly listed stocks (IPOs) are initially placed in an external state (state 0).
From their first full month after IPO:
- Their market capitalization is ranked within the universe.
- They are assigned to a decile group (state 1–10) based on market cap percentile.
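
The entry rule can be expressed as a simple predicate. The sketch below is illustrative only: the function `is_external_entry` and its arguments are hypothetical, months are `(year, month)` tuples, and we assume that a stock whose first record coincides with the start of the sample is treated as already listed rather than as an IPO.

```python
def is_external_entry(first_month, month, sample_start):
    """True if the stock is in state 0 in `month` because it has just listed.

    first_month: first (year, month) in which the stock appears in the data.
    sample_start: first (year, month) of the sample; stocks present at the
    start are assumed to be existing listings, not IPOs.
    """
    return month == first_month and first_month > sample_start
```

Under this rule a stock first observed in March 2005 is external in March 2005 and joins the ranked universe in April 2005.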

1.4 Survivorship Bias Adjustment

To avoid survivorship bias, we retain all stocks, including those that were delisted during the sample period. This prevents the overestimation of returns and the misrepresentation of capital mobility that arise from survivor-only datasets.
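
A toy calculation shows the direction of the bias. The three returns below are invented for illustration and are not taken from the data; the third stock is delisted and therefore carries a net return of -1 under the rule in Section 1.2.

```python
def mean_return(net_returns):
    """Equal-weighted mean of a list of net returns."""
    return sum(net_returns) / len(net_returns)


# Hypothetical one-month net returns; the last stock is delisted.
full_sample = [0.10, 0.05, -1.0]

# A survivor-only dataset silently drops the delisted stock.
survivors_only = [r for r in full_sample if r > -1.0]

bias = mean_return(survivors_only) - mean_return(full_sample)
```

Here the survivor-only mean is positive while the full-sample mean is negative, so dropping the delisted stock overstates the average return.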

1.5 State Classification Framework

For transition matrix estimation and mixture modeling (Section 4), we define the following percentile-based state mapping:

State	Market Cap Percentile Range	Description
10	90%–100%	Top 10% (largest-cap stocks)
9	80%–90%
…	…
1	0%–10%	Bottom 10% (smallest-cap stocks)
0	External	Delisted stocks (absorbing state) or IPOs before full inclusion
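
The percentile-to-state mapping for a single month can be sketched as follows. This is a minimal sketch under assumptions: `assign_states` is a hypothetical name, breakpoints are computed over all ranked stocks in the cross-section, and ties in market cap are broken by the stock identifier. Stocks in the external state are assumed to have been excluded before the call.

```python
def assign_states(caps):
    """Map each stock to a state in 1..10 by market-cap percentile.

    caps: dict {stock_id: market_cap} for one month, ranked stocks only.
    A stock with rank i (0-based, ascending) among n stocks has percentile
    i / n and is assigned state floor(10 * i / n) + 1.
    """
    ranked = sorted(caps, key=lambda s: (caps[s], s))  # ties broken by identifier
    n = len(ranked)
    return {s: (10 * i) // n + 1 for i, s in enumerate(ranked)}
```

With 20 stocks, the two smallest fall in state 1 and the two largest in state 10. With fewer than 10 stocks some states remain empty.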

This unconventional but analytically convenient mapping allows for intuitive modeling of:

Upward mobility as a transition to higher state numbers
Capital lock-in at the top (persistence in State 10)
Exit fragility for lower states (transition to State 0)

1.6 Structural Implication

This mapping enables a semi-absorbing, non-reversible Markov transition model, in which state 0 is absorbing for delisted stocks but only a transient entry point for IPOs:

Delisted stocks enter the absorbing state (0) permanently.
Stocks that remain listed transition among states 1–10, based on their market cap percentile ranks at each time \(t\).
New entrants (IPOs) begin in state 0 and join the ranked universe from their second month onward.

This design forms the empirical backbone for transition matrix analysis, mixture decomposition, and long-run mobility asymmetry in Section 4.
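
As a rough illustration of how state paths feed a transition matrix, the sketch below counts one-step transitions and normalizes each row. It is not the estimator of Section 4: the function `transition_matrix` is hypothetical, and the row-normalized count estimator assumes a first-order, time-homogeneous chain.

```python
def transition_matrix(paths, n_states=11):
    """Estimate an 11x11 transition matrix over states 0..10 from state paths.

    paths: list of per-stock state sequences, one state per month.
    Entry [a][b] is the share of observed moves out of state a that go to b.
    Rows with no observed moves are left as all zeros.
    """
    counts = [[0] * n_states for _ in range(n_states)]
    for path in paths:
        for a, b in zip(path, path[1:]):
            counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

An IPO path such as `[0, 3, 3]` contributes an exit from state 0, while a delisting path such as `[3, 0]` ends there. To make state 0 absorbing for delisted stocks, their paths would need to be padded with zeros through the end of the sample, and IPO entries counted separately.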

2 Secondary Data Sources

2.1 Fama-French Research Portfolios

From the Ken French Data Library, we collect the following datasets:

ME10 Decile Portfolios: Used to compare return characteristics between small-cap (s_10) and large-cap (b_10) portfolios.
ME × PRIOR Portfolios: Cross-sorted on size (ME) and prior 12-month return momentum. We focus on:
- ME5 PRIOR5: Large size, high momentum
- ME5 PRIOR3: Large size, moderate momentum
- ME5 PRIOR1: Large size, low momentum

We use the value-weighted monthly return series, which are formed with NYSE breakpoints. The ME deciles are reconstituted annually at the end of June, whereas the ME × PRIOR portfolios are re-formed monthly.

2.2 Index and ETF Benchmarks

To compare TBTF performance with market-traded investment vehicles, we incorporate the following index and ETF data:

Pre-2010 Indices (from Yahoo Finance):
- ^NDX (Nasdaq 100 Index)
- ^DJI (Dow Jones Industrial Average)
ETFs, 2010 onward (from Yahoo Finance):
- VTI: Vanguard Total Stock Market ETF (inception: 2001)
- SPY: SPDR S&P 500 ETF (inception: 1993)
- QQQ: Invesco Nasdaq-100 ETF (inception: 1999)
- DIA: SPDR Dow Jones Industrial Average ETF (inception: 1998)

The inclusion of these instruments enables direct comparison with passive investment strategies commonly available to retail and institutional investors.

3 Appendix

3.1 Data Availability and Frequency

Dataset	Raw Data Frequency	Period Covered	Source
Fama-French ME Decile	Monthly	1963–2023	Ken French Data Library
CRSP Monthly Stock File	Monthly	1996–2023	WRDS/CRSP
Fama-French ME × PRIOR	Monthly	2010–2023	Ken French Data Library
DIA, QQQ, SPY, VTI (ETFs)	Daily	2010–2023	Yahoo Finance
DJIA, NDX (Price Indices)	Daily	1996–2009	Yahoo Finance
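
Because the Yahoo Finance series are daily while the other datasets are monthly, the daily data must be converted to monthly returns before comparison. The sketch below shows one way to do this from month-end closes. It is an assumption-laden illustration: `monthly_returns` is a hypothetical helper, dates are ISO strings, and the input is assumed to be sorted by date. Price indices such as ^NDX and ^DJI exclude dividends, so for ETFs a dividend-adjusted close would be the natural input if total returns are wanted.

```python
def monthly_returns(daily_closes):
    """Convert daily closes to monthly net returns using month-end closes.

    daily_closes: list of (iso_date, close) pairs sorted by date,
    e.g. ("2010-01-29", 110.0). Returns {"YYYY-MM": net_return}; the first
    month has no return because it has no prior month-end close.
    """
    month_end = {}
    for date, close in daily_closes:
        month_end[date[:7]] = close  # later dates in a month overwrite earlier ones
    months = sorted(month_end)
    return {m: month_end[m] / month_end[p] - 1.0 for p, m in zip(months, months[1:])}
```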

Pooled panel data:
- Fama-French data library
- FRED database
Panel data:
- CRSP from WRDS
- Yahoo Finance