diff --git a/.gitignore b/.gitignore
index de658d5..78a934b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,3 +6,4 @@ vendor
*.sw*
.idea/
.bundle/
+.DS_Store
\ No newline at end of file
diff --git a/Gemfile.lock b/Gemfile.lock
index 900bcb9..1ee0fbd 100644
--- a/Gemfile.lock
+++ b/Gemfile.lock
@@ -18,11 +18,15 @@ GEM
logger
faraday-net_http (3.4.2)
net-http (~> 0.5)
+ ffi (1.17.3-arm64-darwin)
ffi (1.17.3-x86_64-linux-gnu)
forwardable-extended (2.6.0)
google-protobuf (4.33.4)
bigdecimal
rake (>= 13)
+ google-protobuf (4.33.4-arm64-darwin)
+ bigdecimal
+ rake (>= 13)
http_parser.rb (0.8.1)
i18n (1.14.8)
concurrent-ruby (~> 1.0)
@@ -84,6 +88,8 @@ GEM
rexml (3.4.4)
rouge (4.7.0)
safe_yaml (1.0.5)
+ sass-embedded (1.97.2-arm64-darwin)
+ google-protobuf (~> 4.31)
sass-embedded (1.97.2-x86_64-linux-gnu)
google-protobuf (~> 4.31)
sawyer (0.9.3)
@@ -102,6 +108,7 @@ GEM
webrick (1.9.2)
PLATFORMS
+ arm64-darwin-25
x86_64-linux
DEPENDENCIES
diff --git a/_config.yml b/_config.yml
index 862eaa4..8cfb62a 100644
--- a/_config.yml
+++ b/_config.yml
@@ -7,7 +7,7 @@ description: >- # this means to ignore newlines until "baseurl:"
baseurl: "" # the subpath of your site, e.g. /blog
url: "https://tech.scribd.com" # the base hostname & protocol for your site, e.g. http://example.com
google_analytics: 'UA-443684-30'
-featured_series: 'kyc-series'
+featured_series: 'content-trust-series'
# GitHub Metadata
# Used for "improve this page" link
diff --git a/_data/authors.yml b/_data/authors.yml
index 466af78..89580c3 100644
--- a/_data/authors.yml
+++ b/_data/authors.yml
@@ -183,3 +183,10 @@ anishk123:
github: anishk123
about: |
Anish is an engineer on the Machine Learning Data Engineering team building LLM workflows and data pipelines to enrich our content and improve our trust and safety systems.
+
+ericc:
+ name: Eric Chang
+ github: EricCHChang
+ about: |
+ Eric is a data scientist on the Applied Research team building machine learning models to understand and connect our content.
+
diff --git a/_posts/2026-01-20-photodna-csam-detection.md b/_posts/2026-01-20-photodna-csam-detection.md
index 5b8a412..bcca299 100644
--- a/_posts/2026-01-20-photodna-csam-detection.md
+++ b/_posts/2026-01-20-photodna-csam-detection.md
@@ -6,6 +6,7 @@ tags:
- aws
- lambda
- databricks
+- content-trust-series
team: ML Data Engineering
author: anishk123
---
diff --git a/_posts/2026-02-25-content-trust-score.md b/_posts/2026-02-25-content-trust-score.md
new file mode 100644
index 0000000..45eaee1
--- /dev/null
+++ b/_posts/2026-02-25-content-trust-score.md
@@ -0,0 +1,223 @@
+---
+layout: post
+title: "Dual-Embedding Trust Scoring"
+tags:
+- machinelearning
+- scribd
+- featured
+- content-trust-series
+team: Applied Research
+author: ericc
+---
+
+Scribd is a digital library serving academics and lifelong learners, offering hundreds of millions of documents. The scale and openness of such a library raise a significant concern: **content trust and safety**. Protecting our library from undesirable and unsafe content is a top priority, but the **multilingual and multimodal** (text and images) nature of our platform makes this mission challenging. And while third-party tools exist, they often fall short, lacking the nuance to handle our specific trust and safety categories.
+
+To this end, we capitalized on **Generative AI (GenAI)** signals and our **proprietary multilingual embeddings**, in conjunction with classical machine learning methods, to develop our **Content Trust Score**. This metric reflects the severity with which a document violates a specific trust pillar, enabling us to identify high-risk content and take appropriate action. Ultimately, the score allows us to build a more robust and scalable moderation system, ensuring a safer and more reliable experience for all users while preserving the rich diversity of our user-generated content (UGC).
+
+The data and methodologies presented here are for research purposes and do not represent Scribd's overall moderation or policy implementation.
+
+## Content Trust Pillars
+According to our internal Trust & Safety framework, we defined four top-level concern pillars and focused our current efforts on them:
+
+* **Illegal:** Documents that contain or promote illegal materials or activities
+* **Explicit:** Sexual or shocking content
+* **Privacy/PII:** Documents that violate privacy or contain Personally Identifiable Information (PII)
+* **Low Quality:** Junk, gibberish, low information, or non-semantic documents
+
+To maintain a clear project scope, we focused our research on these four semantic-heavy pillars where our embedding-based approach offers the greatest impact. The remaining violation types are out of scope and are addressed by other specialized detection algorithms.
+
+## From Embeddings to Trust Score
+### Datasets & Features
+We leveraged annotated data at Scribd, which includes human-assigned trust labels, to craft our **core modeling dataset** of roughly **100,000 documents**. This dataset was split 90-to-10 into training and testing data, distributed across the four trust pillars. The **training set** was used exclusively to derive the Content Trust Pillar embeddings, while the **testing set** provided the initial basis for comparison between content- and description-based scores. In addition to the four primary Trust Pillars, we also **included documents not violating any trust & safety pillars**. These *clean* documents serve as the **“baseline” in our analyses**. It is important to note that the **data presented here are for discussion purposes and do not reflect the actual category distributions within the Scribd corpus**.
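The 90-to-10 split can be sketched as a simple stratified partition that preserves each pillar's share of documents. This is a toy illustration; `stratified_split`, the document ids, and the label values are hypothetical, and the real pipeline's tooling differs.

```python
import random
from collections import defaultdict

def stratified_split(docs, labels, test_frac=0.10, seed=7):
    """Split document ids 90/10 while preserving per-label ratios."""
    by_label = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_label[label].append(doc)

    rng = random.Random(seed)
    train, test = [], []
    for members in by_label.values():
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_frac))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy corpus: 90 clean documents and 10 PII-labelled ones.
docs = [f"doc{i}" for i in range(100)]
labels = ["clean"] * 90 + ["pii"] * 10
train, test = stratified_split(docs, labels)
```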
+
+| Trust Pillar | Training Dataset | Testing Dataset |
+|--------------|------------------|-----------------|
+| Explicit     | 0.39%            | 0.41%           |
+| Illegal      | 1.49%            | 1.56%           |
+| PII/Privacy  | 5.43%            | 5.48%           |
+| Low Quality  | 2.18%            | 2.17%           |
+| Clean        | 90.51%           | 90.38%          |
+
+*Table 1. Document distribution across Trust Pillars. This table details the percentage of labelled documents within the training and testing datasets. Note that the Clean documents are included separately as the baseline.*
+
+The core feature of our project is the **128-dimensional semantic embeddings for every document**, which were generated using the [LaBSE model](https://huggingface.co/sentence-transformers/LaBSE), fine-tuned on our in-house dataset. Specifically, semantic embeddings are **dense, numerical vector representations of text** in a high-dimensional space. The goal of the embeddings is to map linguistic meaning into this vector space such that pieces of text with similar semantics are positioned mathematically closer together. Moreover, the degree of similarity between texts can be quantified by the distance between their respective vectors. For instance, in Figure 1, the words “circle” and “square” are closer to each other since they are semantically more similar, compared to words like “crocodiles” or “alligators”. This allows us to **represent all the text in our document using a vector of numbers and accurately quantify their semantic relationships**.
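The distance intuition above can be sketched with cosine similarity over toy vectors. The 3-dimensional vectors here are illustrative stand-ins, not real LaBSE embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 means
    the same direction (similar meaning), ~0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (our real vectors are 128-dimensional).
circle = np.array([0.9, 0.1, 0.0])
square = np.array([0.8, 0.2, 0.1])
crocodile = np.array([0.0, 0.2, 0.9])
```

Here `cosine_similarity(circle, square)` comes out far higher than `cosine_similarity(circle, crocodile)`, mirroring the semantic proximity described above.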
+
+![Conceptual visualization of semantic embeddings](/post-images/2026-content-trust/content-trust-score-Figure-1.png)
+
+*Figure 1. Conceptual visualization of semantic embeddings.*
+
+### Content Trust Score
+The first step in generating the Trust Score was creating the representative vectors for each trust pillar. Using the semantic embeddings, we generated the **Content Trust Pillar embeddings** for each trust pillar by **averaging the embeddings of all documents** with that pillar's label in the **training dataset**. The large size of the training dataset helps ensure the representativeness of these Pillar embeddings.
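The averaging step can be sketched in a few lines; the randomly generated stand-in embeddings and the shapes below are illustrative only:

```python
import numpy as np

def pillar_embedding(doc_embeddings):
    """Average the embeddings of all training documents carrying a
    pillar's label to obtain that pillar's representative vector."""
    return np.mean(doc_embeddings, axis=0)

# Hypothetical shape: N training documents x 128 dimensions.
rng = np.random.default_rng(0)
explicit_docs = rng.normal(size=(350, 128))
centroid = pillar_embedding(explicit_docs)  # shape (128,)
```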
+
+The content trust score for a **Trust Pillar** was then computed as the **cosine similarity** between the document’s embedding and the corresponding Trust Pillar’s embedding. Crucially, **all scores are generated and evaluated exclusively using the testing dataset** to strictly **avoid data leakage and circularity** in our analysis. Our hypothesis is that documents closely matching a specific trust pillar will yield a high similarity score against that Pillar's embedding, while non-matching documents will yield a low score.
+
+This concept is visualized in Figure 2, where each "Pillar" represents a distinct trust pillar centroid. Individual documents are clustered around their respective pillar, illustrating that the closer a document's embedding is to a specific Trust Pillar embedding, the higher its calculated similarity score, which confirms a stronger thematic match to that pillar.
+
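A minimal sketch of the scoring step, assuming pre-computed pillar centroids (the pillar names and the tiny 3-d vectors are illustrative, not our production representations):

```python
import numpy as np

def trust_scores(doc_embedding, pillar_embeddings):
    """Cosine similarity of one document against every pillar centroid."""
    scores = {}
    for pillar, centroid in pillar_embeddings.items():
        denom = np.linalg.norm(doc_embedding) * np.linalg.norm(centroid)
        scores[pillar] = float(np.dot(doc_embedding, centroid) / denom)
    return scores

pillars = {
    "illegal": np.array([1.0, 0.0, 0.0]),
    "pii": np.array([0.0, 1.0, 0.0]),
}
doc = np.array([0.9, 0.1, 0.0])  # lies close to the "illegal" centroid
scores = trust_scores(doc, pillars)
```

As hypothesized, the document's score against the pillar it resembles dominates its scores against the others.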
+![Trust Pillar embeddings and document similarity](/post-images/2026-content-trust/content-trust-score-Figure-2.png)
+
+*Figure 2. Conceptual visualization of Trust Pillar embeddings and document similarity in a high-dimensional space. Each coloured dot represents a single document.*
+
+### Enhancing Semantics with Description Embeddings
+While the **content-based semantic embeddings** are generally effective, they struggle in certain cases where the raw text is not fully informative. Specifically, these embeddings may fail when documents are **extremely long, image-heavy, or contain meaningless repetitive text**.
+
+In these scenarios, a **brief content summary can provide a superior document representation**. For example, Figure 3 illustrates a document containing presentation slides where the raw text is minimal, yet the user-provided description is quite informative.
+
+![Long document with an informative user-provided description](/post-images/2026-content-trust/content-trust-score-Figure-3.png)
+
+*Figure 3. Example of an extremely long document with good descriptive metadata. This example demonstrates how a concise, user-provided description (bottom box) provides more focused, informative text for embedding than the raw content of an extremely long document.*
+
+However, since users often do not provide adequate descriptions upon document upload, we rely on **large language models (LLMs)** to generate descriptive summaries based on the content. Figure 4 demonstrates this necessity, showing a document with lengthy and repetitive text where the LLM-generated descriptions (**GenAI descriptions**) summarize the core topic effectively.
+
+Consequently, we generated a **second set of document semantic embeddings** and the corresponding **Content Trust Pillar embeddings** based on the **LLM-generated descriptions**. This dual approach allowed us to compute the content trust score using this alternative, enhanced representation.
+
+![Repetitive document summarized by the LLM](/post-images/2026-content-trust/content-trust-score-Figure-4.png)
+
+*Figure 4. Example of a document with meaningless, repetitive content (top). The LLM successfully analyzes and summarizes the document, providing a usable description for embedding generation (bottom).*
+
+### Content- vs. Description-Based Trust Scores
+For each trust pillar, we compared the distribution of the content trust scores derived from the document’s content to their GenAI-description-based counterparts, using the approximately 10,000-document testing dataset. To ensure a fair comparison, we included only documents for which both sets of scores were available. Our results reveal that the **content-based trust scores outperformed the scores generated from GenAI descriptions** for **all Trust Pillars** (Figure 5a-c) **except the Low Quality pillar** (Figure 5d).
+
+For the majority of Trust Pillars, the **content-based scores demonstrated strong discrimination:** they were higher for documents truly violating a given pillar (True Positives) than for documents violating other trust pillars or clean documents. Conversely, for these same pillars, the GenAI-description-based scores were indistinguishable from those of other documents, or showed significantly less separation compared to the content-based counterparts. This suggests that while **content-based embeddings offer a superior representation for general trust identification**, the descriptive embeddings provided little added value for these pillars.
+
+This performance pattern is **reversed for Low Quality documents**. Specifically, the content-based scores for Low Quality documents were ineffective, proving to be indistinguishable from those violating other trust pillars or those labelled as clean. The GenAI-based approach, however, showed a distinct advantage: the **GenAI-description-based scores were significantly higher for Low Quality documents** compared to all others. This result indicates that the **descriptive summary is crucial for accurately identifying this specific type of document**.
+
+![Content vs. GenAI-description trust score distributions](/post-images/2026-content-trust/content-trust-score-Figure-5.jpg)
+
+*Figure 5. Trust Score Distribution Comparison of Content vs. GenAI-Description Trust Scores. Violin plots showing the distribution of trust scores for documents belonging to a specific violation pillar (blue) compared to all other documents (red; other pillars in scope or clean documents).*
+
+For completeness and to verify that our results were not skewed by the presence of other violating documents, we conducted a final comparative analysis by isolating the scores of labelled documents **against only the clean, non-violating documents**. As evident in Figure 6, the core patterns persist: The **content-based scores** consistently yield **superior separation** between violating content (blue) and clean content (green) for the **Illegal, Explicit, and PII/Privacy** pillars (Figure 6a-c). In sharp contrast, the GenAI-description-based scores for these same three pillars exhibit significantly greater distribution overlap. Conversely, for the **Low Quality pillar** (Figure 6d), the **GenAI-description method** again established a **much clearer boundary** from the clean documents than the content-based method, further validating our hybrid scoring approach.
+
+![Pillar scores compared only to clean documents](/post-images/2026-content-trust/content-trust-score-Figure-6.jpg)
+
+*Figure 6. Trust Score Distribution Comparing Pillars Exclusively to Clean Documents. Violin plots showing the distribution of scores for documents belonging to a specific violation pillar (blue) compared only to Clean documents (green).*
+
+### Score Generation for All Documents
+Based on these differentiating findings, we adopted a **hybrid scoring approach:** we use the **content-based trust scores** for the **Illegal, Explicit, and PII/Privacy** pillars, and the **GenAI-description-based trust scores** for the **Low Quality** pillar. This decision enabled the computation of the most effective Content Trust Scores for all documents in our library across every trust pillar.
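The hybrid routing can be captured in a small lookup. The pillar keys and function names here are hypothetical, not our production identifiers:

```python
# Which score variant feeds each pillar's classifier, per the findings above.
SCORE_SOURCE = {
    "illegal": "content",
    "explicit": "content",
    "pii_privacy": "content",
    "low_quality": "genai_description",
}

def hybrid_score(pillar, content_score, description_score):
    """Pick the score variant that discriminates best for this pillar."""
    source = SCORE_SOURCE[pillar]
    return content_score if source == "content" else description_score
```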
+
+## Classification Through Threshold Setting
+The content trust score reflects the extent to which a document violates a specific pillar: a high score indicates that the document closely resembles the defined trust violation type. **To build a classification system that flags violations, we must determine an optimal score threshold**.
+
+### Strategic Thresholding: Prioritizing Precision
+In this work, we chose to **prioritize precision** to build a high-confidence classification system. Our goal is to maintain a very low mislabeling rate, specifically aiming for a **false positive rate (FPR) close to 1%**. This decision is driven by the need to **minimize user friction**: incorrectly flagging documents as violating trust pillars would create an undesirable user experience, making the avoidance of a high FPR our primary concern.
+
+### Building the Evaluation Dataset
+The inherently low document count for certain violation types (e.g., Explicit) prevented us from performing reliable analyses to determine classification thresholds. To address this methodological challenge, we developed an **expanded evaluation dataset**. This was built by taking the original modeling data (both training and testing sets) and augmenting it with a high volume of additional human-annotated documents from our existing corpus. By incorporating this high-volume, high-quality labeled data, we established a more comprehensive baseline for threshold analysis. To ensure fair comparisons between the content-based and GenAI-description-based scores, we filtered the data to only include documents with both scores available. This refinement resulted in a **final working total of approximately 109,000 documents in the evaluation dataset**.
+
+### Final Classification Thresholds
+For each of the four in-scope trust pillars, we calculated classification metrics, specifically **recall** and **false positive rate (FPR)**, across a range of thresholds (0.5 to 0.95). Following the precision-first strategy described above, we chose the thresholds that keep the **FPR close to 1%** while maximizing recall. The final score thresholds for the classification systems of the four Trust Pillars are summarized in Table 2.
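The threshold sweep can be sketched as follows: compute recall and FPR at each candidate threshold, then keep the highest-recall threshold whose FPR stays at or below the ~1% target. This is a toy sketch with illustrative function names, not our evaluation code:

```python
import numpy as np

def sweep_thresholds(scores, is_violation, thresholds):
    """Recall and false positive rate at each candidate threshold."""
    scores = np.asarray(scores)
    is_violation = np.asarray(is_violation, dtype=bool)
    rows = []
    for t in thresholds:
        flagged = scores >= t
        recall = (flagged & is_violation).sum() / max(is_violation.sum(), 1)
        fpr = (flagged & ~is_violation).sum() / max((~is_violation).sum(), 1)
        rows.append((t, recall, fpr))
    return rows

def pick_threshold(rows, max_fpr=0.01):
    """Highest-recall threshold whose FPR stays at or below the target."""
    ok = [r for r in rows if r[2] <= max_fpr]
    return max(ok, key=lambda r: r[1]) if ok else None
```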
+
+
+| Trust Pillar | Score Threshold | Recall | False Positive Rate |
+|--------------|-----------------|--------|---------------------|
+| Explicit     | 0.80            | 10.22% | 1.07%               |
+| Illegal      | 0.80            | 71.83% | 0.79%               |
+| PII/Privacy  | 0.75            | 3.82%  | 0.62%               |
+| Low Quality  | 0.60            | 27.20% | 0.52%               |
+
+*Table 2. Classification metrics at the chosen thresholds for the Trust Pillars.*
+
+The analysis revealed that the **Illegal** pillar achieved the optimal balance of metrics, securing a **high recall of 72%** while maintaining an excellent **FPR of 0.79%**. The **Low Quality pillar**, which relies on the GenAI-description-based scores, achieved a respectable **recall of 27.2%** with a very low **FPR of 0.52%**. This outcome validates our decision to utilize the descriptive embeddings for this challenging content type.
+
+However, this high-performance scenario was not replicated across all Trust Pillars. Specifically, the strict **FPR** target limited the system's ability to capture certain violations, with **Explicit and PII/Privacy** achieving only a recall of **10% and 4%, respectively**. This disparity highlights the inherent challenges in identifying documents violating these two pillars, as their topical language is much broader and less defined compared to the other classes.
+
+These results serve as an initial performance baseline. We are actively exploring internal refinements to our **embedding representations and scoring logic**, as well as integrating **complementary models**, to **progressively enhance detection sensitivity**. Our goal is to expand coverage across these more complex pillars while strictly upholding our commitment to a low false-positive environment.
+
+## Discussion
+Our work demonstrates a straightforward and flexible content moderation system by effectively leveraging **classical machine learning** principles (cosine similarity, thresholding) alongside **modern Large Language Models (LLMs)** for superior document representation. This hybrid approach offers several key operational and technical advantages:
+
+### Technical and Operational Advantages
+* **Scalability and Efficiency:** The final content trust score calculation relies on simple vector mathematics (cosine similarity) against pre-computed pillar embeddings. This allows the system to **run efficiently at scale** with a **low computational cost** for real-time inference.
+* **Customizable Representations:** The system is easy to fine-tune, allowing us to quickly update the trust category representations (the Pillar Embeddings) using new data. This flexibility is critical for adapting the system to the unique data and specific violation nuances present in our library.
+* **Enhanced Contextual Understanding:** Incorporating LLM-generated summaries provides a level of **contextual understanding** that helps handle the nuance and ambiguity often present in challenging document types (e.g., extremely long documents or those with minimal text).
+* **Resilience to Emerging Threats:** The use of semantic embeddings, which capture underlying meaning rather than just keywords, allows the system to **adapt well to new or evolving types of harmful content** without requiring constant manual rule updates.
+
+### Potential Applications
+The Content Trust Score and the underlying classification system created in this project open the door to various critical applications at Scribd:
+* **Content Safety in Discovery:** Serving as a primary filter to ensure safe content appears prominently in search results and recommendation feeds. Our N-way testing experiments revealed that filtering unsafe content from search results **significantly increases core business metrics** (e.g., signup) and user engagement (e.g., read time).
\ No newline at end of file
diff --git a/post-images/2026-content-trust/content-trust-score-Figure-1.png b/post-images/2026-content-trust/content-trust-score-Figure-1.png
new file mode 100644
index 0000000..a0c5a8c
Binary files /dev/null and b/post-images/2026-content-trust/content-trust-score-Figure-1.png differ
diff --git a/post-images/2026-content-trust/content-trust-score-Figure-2.png b/post-images/2026-content-trust/content-trust-score-Figure-2.png
new file mode 100644
index 0000000..724dedf
Binary files /dev/null and b/post-images/2026-content-trust/content-trust-score-Figure-2.png differ
diff --git a/post-images/2026-content-trust/content-trust-score-Figure-3.png b/post-images/2026-content-trust/content-trust-score-Figure-3.png
new file mode 100644
index 0000000..6ae9920
Binary files /dev/null and b/post-images/2026-content-trust/content-trust-score-Figure-3.png differ
diff --git a/post-images/2026-content-trust/content-trust-score-Figure-4.png b/post-images/2026-content-trust/content-trust-score-Figure-4.png
new file mode 100644
index 0000000..8267e26
Binary files /dev/null and b/post-images/2026-content-trust/content-trust-score-Figure-4.png differ
diff --git a/post-images/2026-content-trust/content-trust-score-Figure-5.jpg b/post-images/2026-content-trust/content-trust-score-Figure-5.jpg
new file mode 100644
index 0000000..4de5b4c
Binary files /dev/null and b/post-images/2026-content-trust/content-trust-score-Figure-5.jpg differ
diff --git a/post-images/2026-content-trust/content-trust-score-Figure-6.jpg b/post-images/2026-content-trust/content-trust-score-Figure-6.jpg
new file mode 100644
index 0000000..f4679f9
Binary files /dev/null and b/post-images/2026-content-trust/content-trust-score-Figure-6.jpg differ
diff --git a/tag/Oxbow/index.md b/tag/Oxbow/index.md
new file mode 100644
index 0000000..5fadeaa
--- /dev/null
+++ b/tag/Oxbow/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: Oxbow"
+tag: Oxbow
+robots: noindex
+---
diff --git a/tag/android/index.md b/tag/android/index.md
new file mode 100644
index 0000000..28a7e44
--- /dev/null
+++ b/tag/android/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: android"
+tag: android
+robots: noindex
+---
diff --git a/tag/archived/index.md b/tag/archived/index.md
new file mode 100644
index 0000000..3c7bfab
--- /dev/null
+++ b/tag/archived/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: archived"
+tag: archived
+robots: noindex
+---
diff --git a/tag/armadillo/index.md b/tag/armadillo/index.md
new file mode 100644
index 0000000..608135f
--- /dev/null
+++ b/tag/armadillo/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: armadillo"
+tag: armadillo
+robots: noindex
+---
diff --git a/tag/backup/index.md b/tag/backup/index.md
new file mode 100644
index 0000000..215d6fe
--- /dev/null
+++ b/tag/backup/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: backup"
+tag: backup
+robots: noindex
+---
diff --git a/tag/content-trust-series/index.md b/tag/content-trust-series/index.md
new file mode 100644
index 0000000..e52fc97
--- /dev/null
+++ b/tag/content-trust-series/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: content-trust-series"
+tag: content-trust-series
+robots: noindex
+---
diff --git a/tag/data-warehouse/index.md b/tag/data-warehouse/index.md
new file mode 100644
index 0000000..94b3e2f
--- /dev/null
+++ b/tag/data-warehouse/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: data-warehouse"
+tag: data-warehouse
+robots: noindex
+---
diff --git a/tag/github/index.md b/tag/github/index.md
new file mode 100644
index 0000000..789e13f
--- /dev/null
+++ b/tag/github/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: github"
+tag: github
+robots: noindex
+---
diff --git a/tag/kotlin/index.md b/tag/kotlin/index.md
new file mode 100644
index 0000000..73921b7
--- /dev/null
+++ b/tag/kotlin/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: kotlin"
+tag: kotlin
+robots: noindex
+---
diff --git a/tag/kyc-series/index.md b/tag/kyc-series/index.md
new file mode 100644
index 0000000..7f0b8b1
--- /dev/null
+++ b/tag/kyc-series/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: kyc-series"
+tag: kyc-series
+robots: noindex
+---
diff --git a/tag/looker/index.md b/tag/looker/index.md
new file mode 100644
index 0000000..0ef156d
--- /dev/null
+++ b/tag/looker/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: looker"
+tag: looker
+robots: noindex
+---
diff --git a/tag/oidc/index.md b/tag/oidc/index.md
new file mode 100644
index 0000000..7eaf341
--- /dev/null
+++ b/tag/oidc/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: oidc"
+tag: oidc
+robots: noindex
+---
diff --git a/tag/okta/index.md b/tag/okta/index.md
new file mode 100644
index 0000000..d9efd59
--- /dev/null
+++ b/tag/okta/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: okta"
+tag: okta
+robots: noindex
+---
diff --git a/tag/python/index.md b/tag/python/index.md
new file mode 100644
index 0000000..cf7bf08
--- /dev/null
+++ b/tag/python/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: python"
+tag: python
+robots: noindex
+---
diff --git a/tag/s3/index.md b/tag/s3/index.md
new file mode 100644
index 0000000..1cc87ad
--- /dev/null
+++ b/tag/s3/index.md
@@ -0,0 +1,6 @@
+---
+layout: tag_page
+title: "Tag: s3"
+tag: s3
+robots: noindex
+---
diff --git a/tag/terraform/index.md b/tag/terraform/index.md
index 8aae77f..cf41b51 100644
--- a/tag/terraform/index.md
+++ b/tag/terraform/index.md
@@ -1,6 +1,6 @@
---
layout: tag_page
-title: "Tag: terraform"
-tag: terraform
+title: "Tag: Terraform"
+tag: Terraform
robots: noindex
---