Bloated Disclosures: Can GPTs Help Investors Process Information?
Jun 8, 2023
This study investigates the use of GPT-3.5 for summarizing financial documents, finding that these reports are often unnecessarily lengthy, leading to market inefficiencies. Summaries produced by GPT-3.5 are much shorter and tend to exaggerate the original sentiment, correlating more strongly with market reactions than the original documents. The paper points out the deliberate complexity in corporate filings to mask negative information. It introduces a 'bloat' measure, indicating that more bloated documents usually contain negative sentiments and are linked to poorer market outcomes.
Paper Review - Link
Key Points
Paper focused on MD&A (Item 7 of the 10-K) and conference calls. A statistically relevant dataset of financial documents is analysed; figures, tables, HTML markup, etc. are removed.
Financial reports are often bloated. A correlation between bloating, sentiment and stock market reactions is found.
Bloated reports create information asymmetry and lower price efficiency. LLMs can distil bloated information by summarizing relevant facts.
GPT-3.5 is used to produce unconstrained summaries (no limit on the maximum number of words). Summaries are, on average, less than 20% of the original document's length.
Strong focus on overall sentiment analysis; no real assessment of the quality or factuality of the summarized information.
Overall, summaries amplify the sentiment of the reports: negative documents become more negative and positive ones more positive once summarized.
A correlation is found between the sentiment of summarized documents and cumulative abnormal returns; this correlation is absent when using the sentiment of the original documents.
Correlation between values of “Bloat” and market performance.
More in-depth details
Information overload in corporate filings is a real issue, often incentivized by management to obfuscate negative information behind irrelevant details. So far, calls to awareness such as the "Plain English" initiative have failed to enforce sufficiently good practices and frameworks.
Sentiment is assessed by counting positive/negative financial keywords, using the dictionaries provided by Loughran and McDonald. A higher proportion of positive words corresponds to a higher sentiment score.
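A minimal sketch of this dictionary-based scoring. The tiny word lists below are illustrative placeholders, not the actual Loughran-McDonald dictionaries, and the exact scaling (here, net positive words over total words) is an assumption about the paper's measure.

```python
# Illustrative stand-ins for the Loughran-McDonald word lists (assumption:
# the real dictionaries contain thousands of entries each).
POSITIVE = {"gain", "growth", "improve", "profit", "strong"}
NEGATIVE = {"loss", "decline", "impairment", "litigation", "weak"}

def lm_sentiment(text: str) -> float:
    """Return (positive - negative) keyword counts scaled by total word count."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)
```

With this scaling, a document with more positive than negative keywords scores above zero, matching the intuition that a higher proportion of positive words yields a higher sentiment score.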
Informativeness of summarized disclosures is measured by comparing raw-document sentiment vs. summarized-document sentiment in their ability to explain cumulative abnormal returns (CAR). CAR is computed by subtracting the expected return of the firm's stock, typically based on a market model or historical average return, from the actual return observed on each day of the considered event window, and summing these differences. Results are statistically relevant, suggesting the usefulness of LLMs for extracting insights: a one standard deviation increase in the sentiment of summaries is associated with a 0.087 standard deviation increase in abnormal returns. This correlation is not seen with the sentiment of the original documents, because of their noisiness.
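The CAR calculation described above can be sketched as follows; the daily expected returns are taken as given here (in practice they would come from a market model or a historical average, as noted in the text).

```python
def cumulative_abnormal_return(actual, expected):
    """CAR: sum of (actual - expected) daily returns over the event window.

    `actual` and `expected` are equal-length sequences of daily returns,
    one entry per day of the event window.
    """
    if len(actual) != len(expected):
        raise ValueError("event-window lengths must match")
    return sum(a - e for a, e in zip(actual, expected))
```

For a two-day window with actual returns of 1.2% and -0.4% against expected returns of 0.3% and 0.1%, the abnormal returns are 0.9% and -0.5%, giving a CAR of 0.4%.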
The relative amount by which a document's length is reduced is used as a measure of the degree of irrelevant information, referred to as "Bloat". "Bloat" is mostly attributed to firm-level factors and changes significantly from period to period. Higher bloat is correlated with negative sentiment; when a firm reports losses, bloat is higher. Bloated disclosures are associated with adverse capital market consequences.
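A sketch of the Bloat measure as defined above: the relative length reduction from original to summary. Measuring length in whitespace-delimited words is an assumption; the paper could equally use characters or tokens.

```python
def bloat(original: str, summary: str) -> float:
    """Bloat = 1 - len(summary)/len(original), i.e. the relative length
    reduction achieved by summarization (lengths counted in words)."""
    n_orig = len(original.split())
    n_sum = len(summary.split())
    if n_orig == 0:
        return 0.0
    return 1.0 - n_sum / n_orig
```

A 100-word document reduced to a 20-word summary (the ~20% average noted above) would score a bloat of 0.8.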
Proxies to capture degree of price informativeness:
Intraperiod timeliness (IPT) over a five-day window used to measure speed of price discovery.
Probability of informed trade (PIN).
Daily bid-ask spread measured on the announcement day (BAS).
Bloat has a negative association with IPT and a positive one with PIN and BAS. A one standard deviation increase in Bloat is associated with a 0.16% increase in the probability of informed trading, an 8.8 percentage-point decrease in the speed of price discovery, and a 17.6 percentage-point increase in bid-ask spread. Disclosure bloat thus hinders effective information transfer between companies and information users.
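The "one standard deviation increase in Bloat is associated with..." effect sizes above are standardized regression slopes. A minimal sketch of how such a per-one-SD association can be computed (univariate OLS on a z-scored regressor; a simplification of the paper's presumably multivariate specification):

```python
def per_sd_association(x, y):
    """Univariate OLS slope of y on z-scored x: the change in y
    associated with a one-standard-deviation increase in x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sx = (sum((v - mx) ** 2 for v in x) / n) ** 0.5  # population SD
    z = [(v - mx) / sx for v in x]
    # Slope = cov(z, y) / var(z); var(z) == 1 by construction.
    return sum(zi * (yi - my) for zi, yi in zip(z, y)) / n
```

Applied to (Bloat, BAS) pairs, for example, this would yield the per-SD change in bid-ask spread reported by the paper, up to control variables.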
Team conclusions
The paper is generally solid and validates the use of large language models (LLMs), particularly those from OpenAI, for tasks such as document summarization.
However, it lacks in-depth analysis on the semantic quality of the summaries generated by the LLM. The focus is on metrics related to sentiment analysis and the length of the output produced by GPT-3.5.
Another critique concerns the very nature of the bloat measure: it may be volatile, and in the long run the industry might adapt to prevent its use.
The end-to-end approach to summarization used in the paper is susceptible to 'hallucinations'. Our current 'extract-then-summarize' method may offer better control over the output.
The way they determine the sentiment of raw and summarized documents is easy to implement and provides good insights. We have to check whether our kqueries retrieve sentiment. It would also be good to study the literature on sentiment analysis in financial settings.
Their "bloat" metric seems to offer a useful measure (one that also correlates with sentiment) and shows statistical relevance. I see two ways we could implement it: (1) by comparing the length of the summary from our current kqueries to the original text, or (2) by using a parallel pipeline with GPT-3.5 or GPT-4 to generate a summary and comparing its length to the original text.
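Both implementation options reduce to the same computation once the summarizer is abstracted as a callable, so a single helper could serve either pipeline. The summarizers passed in below are hypothetical stand-ins for our kquery pipeline or a GPT-based one.

```python
from typing import Callable

def bloat_from(summarize: Callable[[str], str], document: str) -> float:
    """Bloat of `document` under any summarization function.

    `summarize` is a stand-in for either pipeline option:
    (1) our kquery-based summarizer, or (2) a GPT-3.5/GPT-4 summarizer.
    Lengths are counted in words (an assumption; tokens would also work).
    """
    n_doc = len(document.split())
    if n_doc == 0:
        return 0.0
    n_sum = len(summarize(document).split())
    return 1.0 - n_sum / n_doc
```

Swapping option (1) for option (2) is then just a matter of passing a different callable, which would also let us compare the two bloat estimates against each other.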