Image of a person's hand writing on a document to indicate the use of OCR to detect the document content once scanned.

Microsoft Purview OCR and SharePoint Premium OCR

There’s a lot of unstructured content in a Microsoft 365 environment that isn’t found in a text-only Office or PDF document. In most organizations, content also includes images that may be embedded in Office docs or PDFs, scanned (image-based) PDFs, and regular image files (.png, .jpg, etc.). Just like text-based content, some of this image-based content is important and sensitive.

Optical Character Recognition (OCR) is a technology that can help. It extracts printed or handwritten text from images or scanned documents and converts it into machine-readable digital text.

This is important for 2 practical reasons:

  • allows the extracted text to be searchable, thereby increasing its value (think of downstream capabilities that could use the extracted text to drive other actions such as workflows, notifications, content assembly, etc.)
  • allows (some) Purview controls to be applied based on the extracted text

In Microsoft 365, there are 2 places where OCR can be enabled:

  • Purview – can target Exchange, SPO, OD, Teams, Windows, macOS endpoints
  • SharePoint Premium (SPP) – can target SPO, OD

You can also extract text from images using other related Microsoft OCR services such as Azure AI Document Intelligence and Azure Computer Vision OCR. This post isn’t about those. 🙂

Since both Purview and SPP OCR services can target SPO and OD, this left me wondering… if I were to use each OCR service in SharePoint, what’s the difference between them… is there a difference? Do I need both?

My TL;DR Takeaways

  • It doesn’t matter which OCR service extracts the data from images; you can use the extracted text from either service in supported Purview policies (assuming the policy is scoped to the SharePoint location of course).
  • Both OCR services appear to behave the same as far as text extraction and the SharePoint metadata properties they expose:
    • For image file types only (png, jpg, jpeg, bmp, etc.), the OCR-extracted text is visible in the ‘Extracted Text’ column in a SharePoint library and is accessible through the SharePoint crawled property, ows_MediaServiceOCR. You must map the crawled property to a RefinableString managed property to use it in a search query. Rest assured, even if the ‘Extracted Text’ column isn’t populated in SharePoint for non-image file types (image-based PDFs, embedded images, etc.), the value is still stored and searchable in the search index (Purview can see it).
    • You can see which OCR service was used on a file in the managed property, MediaServiceMetadata. If you have both Purview OCR and SPP OCR enabled for the same site, you won’t be charged twice
    • Office docs and text-based PDFs are both searchable without requiring OCR; however, if they include embedded images, OCR will scan the images within them to extract text it finds
    • If images are the same, there is some de-duplication that occurs to prevent the image from being scanned (billed for) again. I saw this during testing when I re-uploaded the same image multiple times.
    • By default, both OCR services will apply to new/updated content after the service was enabled.
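
To make the crawled-property mapping above a little more concrete, here is a minimal Python sketch of how a search query against the mapped property might be built. It assumes ows_MediaServiceOCR was mapped to RefinableString00 (whichever RefinableString you picked would go there), and the site URL and function name are made up for illustration. It only builds the URL for the SharePoint Search REST endpoint (/_api/search/query); authentication and the actual HTTP call are left out.

```python
from urllib.parse import quote


def build_ocr_search_url(site_url: str, term: str,
                         managed_property: str = "RefinableString00") -> str:
    """Build a SharePoint Search REST URL that looks for `term` in OCR-extracted text.

    `managed_property` is an assumption: use whichever RefinableString
    you mapped the ows_MediaServiceOCR crawled property to.
    """
    # KQL property restriction: property:"phrase". Drop double quotes from the
    # term so it cannot break out of the phrase.
    safe_term = term.replace('"', "")
    kql = f'{managed_property}:"{safe_term}"'
    # Inside the REST string literal, a single quote is escaped by doubling it.
    kql_literal = kql.replace("'", "''")
    select = f"Title,Path,{managed_property}"
    return (
        f"{site_url.rstrip('/')}/_api/search/query"
        f"?querytext='{quote(kql_literal)}'"
        f"&selectproperties='{quote(select)}'"
    )


# Hypothetical site and search term.
url = build_ocr_search_url("https://contoso.sharepoint.com/sites/accounts",
                           "Contoso account")
print(url)
```

Keeping the property name as a parameter means the same helper works no matter which RefinableString was used for the mapping.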

Conclusion: these are 2 ways of arriving at the same destination. Text will be extracted from images and made available to both SharePoint and Purview for downstream processing.

Read on for more details.


Use-case Scenario for OCR

You’re an account rep having an on-site meeting with a customer. After the meeting, you take a picture of some handwritten notes you made during the meeting, including some potentially sensitive information such as an account number, customer name, etc. You then upload the picture to your team’s SharePoint site where it is stored as a .jpg file.

From a business usability perspective, the account rep will likely need to find some information from their handwritten notes at some later time. The expectation would be that a simple search for the customer in SharePoint would quickly surface the image file that held the meeting notes. However, without OCR enabled on the SharePoint site, the text in the image file is not searchable… the account rep is left to click around and track it down, a potentially frustrating and time-consuming experience.

From a security and compliance perspective, some of the content in the image is sensitive information and should be protected with the same controls as everything else in the environment with that same sensitive information (a blocking DLP policy for example). However, without OCR enabled on that SharePoint site, the sensitive information found in the image file goes undetected… a potentially significant data risk.


My Site Setup for Testing

To help showcase the capabilities and highlight the differences, if any, between Purview OCR and SPP OCR, I set up 3 sites with the following configuration:

  • 1 site with BOTH Purview OCR and SPP OCR enabled
  • 1 site with ONLY Purview OCR enabled
  • 1 site with ONLY SPP OCR enabled
  • DLP policy targeting ALL 3 sites with a block-with-override action when sensitive information (from a custom Sensitive Information Type) is found

Venn diagram showing: 1 site with BOTH Purview OCR and SPP OCR enabled, 1 site with ONLY Purview OCR enabled, and 1 site with ONLY SPP OCR enabled.


Setup and Billing for each OCR service

To enable each of these OCR services:

  • Purview OCR: Purview… Settings… Optical Character Recognition
  • SPP OCR: M365 Admin Center… Settings… Org settings… Services tab… Pay-as-you-go services… Settings tab… Syntex services for Documents & images… Optical character recognition

Both Purview and SPP OCR are pay-as-you-go services in Azure that will track usage and cost with a meter. For both OCR services, Syntex billing setup is required. If you’ve already configured Syntex billing to enable OCR for SPP, there is no extra setup required for billing Purview OCR.

Link: Set up pay-as-you-go billing

To track usage for OCR billing, check the MediaServiceMetadata managed property value of each file. It shows which OCR service was used on the site when the image was scanned:

  • Both Purview and SPP OCR enabled on site:
    • MediaServiceMetadata: "sendBilling":true,"additionalData":{"source":"Compliance; M365"}
  • Only Purview OCR enabled on site:
    • MediaServiceMetadata: "sendBilling":true,"additionalData":{"source":"Compliance"}
  • Only SPP OCR enabled on site:
    • MediaServiceMetadata: "sendBilling":true,"additionalData":{"source":"M365"}
  • Either OCR enabled on the site, but not required for the file:
    • MediaServiceMetadata: "billedEvents":[]
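
If you want to check this across many files, the fragments above can be read with a few lines of Python. This is only a rough sketch: the full shape of the MediaServiceMetadata value isn’t shown in this post, so the code just looks for the "source" fragment rather than parsing the whole value. The function name and the friendly service names are my own. It assumes "Compliance" means Purview OCR and "M365" means SPP OCR, which is what I observed on my test sites.

```python
import re

# Source labels as observed in my tests; the friendly names are my own mapping.
SERVICE_NAMES = {"Compliance": "Purview OCR", "M365": "SPP OCR"}

# Copy/paste from a web page often turns straight quotes into curly ones.
_STRAIGHTEN = str.maketrans({"\u201c": '"', "\u201d": '"'})

# Matches the "source":"..." fragment inside the metadata value.
_SOURCE_RE = re.compile(r'"source"\s*:\s*"([^"]*)"')


def ocr_services(metadata: str) -> list[str]:
    """Return the OCR services named in a MediaServiceMetadata value.

    An empty list means no billed OCR source was found, for example
    when the value only contains "billedEvents":[].
    """
    match = _SOURCE_RE.search(metadata.translate(_STRAIGHTEN))
    if match is None:
        return []
    labels = [part.strip() for part in match.group(1).split(";")]
    # Unknown labels are passed through unchanged rather than guessed at.
    return [SERVICE_NAMES.get(label, label) for label in labels if label]


both = ocr_services('"sendBilling":true,"additionalData":{"source":"Compliance; M365"}')
none = ocr_services('"billedEvents":[]')
print(both)  # ['Purview OCR', 'SPP OCR']
print(none)  # []
```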

My File Setup for Testing

Although I didn’t test every supported file type, I did test some common ones based on what I see in my customers’ tenants. I added some sensitive information (based on a custom Sensitive Information Type) into each of the file types below (to test the Purview controls):

  • image files (png)
  • image-based PDF
  • text-based PDF with embedded images
  • Office doc with embedded images
  • text-based PDF and Office files containing no images

Where can you see the Extracted Text in SharePoint?

It depends entirely on the file type. For the file types I tested with:

  • png image files (I assume this is the same for all supported image file types)
    • the extracted text is visible in the SharePoint column ‘Extracted Text’ and it is searchable
    • the extracted text is in crawled property ows_MediaServiceOCR and in the worddump property of the MediaServiceMetadata managed property
  • image-based PDF, text-based PDF with embedded images, Office doc with embedded images
    • the extracted text is not visible in the SharePoint column ‘Extracted Text’, but the text is still stored (somewhere??) in the search index which makes it searchable 🙂
    • sometimes I see the HitHighlightedSummary managed property populated with snippets of the extracted text, but that seems to be hit and miss and it wouldn’t contain ALL of the extracted text anyway. This managed property is a portion of text around a matched search term and is only filled in when a search query happens.
  • text-based PDF and Office files containing no images
    • no OCR is required; the SharePoint column ‘Extracted Text’ is not populated, and all text is searchable without OCR.

The takeaway from this? Don’t assume that OCR didn’t scan and extract text from the images simply because you don’t see the Extracted Text SharePoint column populated with a value.
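
The observations above boil down to a small lookup table. Here it is as a Python sketch, purely to summarize what I saw for the file types I tested. The file-kind keys and the function name are made up for illustration, and it assumes OCR is enabled on the site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    needs_ocr: bool          # does the file contain image-based text?
    column_populated: bool   # is the 'Extracted Text' column filled in?


# What I observed for the file types I tested (OCR enabled on the site).
OBSERVED = {
    "image": Observation(needs_ocr=True, column_populated=True),
    "image_based_pdf": Observation(needs_ocr=True, column_populated=False),
    "text_pdf_with_images": Observation(needs_ocr=True, column_populated=False),
    "office_with_images": Observation(needs_ocr=True, column_populated=False),
    "text_only": Observation(needs_ocr=False, column_populated=False),
}


def empty_column_means_no_ocr(file_kind: str) -> bool:
    """True only when an empty 'Extracted Text' column really suggests OCR did not run.

    That is only the case for file kinds where the column is normally
    populated after OCR, which in my tests was image files alone.
    """
    return OBSERVED[file_kind].column_populated


for kind in OBSERVED:
    print(kind, empty_column_means_no_ocr(kind))
```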


Which Purview Solutions can use the Extracted Text in SharePoint?

The following Purview solutions can act upon an image’s extracted text if they are targeting the SharePoint location where OCR is being used:

  • Data Loss Prevention policy (for SITs)
  • Records Management auto-apply retention label policy (for keywords and SITs)
  • Insider Risk Management policy (SITs and Trainable classifiers for risk scoring)

Link: Supported locations and solutions

Note: SIT confidence levels are detected and honored, as shown in the image below from one of the sites in my test (the SPP OCR site), where a few DLP policies are looking for high confidence CCNs and a high confidence custom Customer # SIT. The DLP icon indicates that it sees the high confidence SIT in the image content. The same would apply for an auto-apply retention label policy.

Image of a SharePoint document library with different file types, with a DLP policy icon beside each file that was a match.

The takeaways from this?

  • You don’t have to change existing Purview policies to pick up matches on the extracted text. They will automatically include extracted text when looking for matches defined in the policy.
  • As long as either Purview OCR or SPP OCR is enabled on a site, Purview policies can see the extracted text.

Disclaimer: I enabled Purview OCR in my tenant, and then explicitly excluded the SPP OCR Only site from it. From my observations while testing, Purview was able to see the extracted text regardless of which OCR service was used to extract it.

Sidebar: Other Purview solutions that have their own built-in OCR functionality:

  • Communication Compliance has its own built-in OCR functionality and as of August 2025 does NOT support Purview OCR. (Comm Compliance and OCR)
  • eDiscovery cases have their own built-in OCR functionality that you can enable to OCR the contents of data sources for a review set. This may be a requirement if data sources haven’t been previously OCR’d in your tenant and you want to apply OCR for review (this is done in the advanced indexing step for ‘partially indexed’ items). From my testing, though, a search for extracted text or a sensitive information type will return images that have previously been OCR’d with either the Purview or SPP OCR service.

Parting thoughts

That’s a lot of detail. I hope this has provided some additional guidance around the 2 services and why you may want to consider enabling OCR on some of your sites for the downstream value-add it provides.

Reach out if you have questions! I’d love to hear your feedback on the OCR capabilities.

Thanks for reading.

-JCK
