Purview Data Classification and the Downstream Purview Effects

Reading Time: 7 minutes

Disclaimer: this was a challenging post to write. It’s been in the hopper for several months, but the topic is important so I thought it was worth sharing.

Purview data classification (in particular for custom classifiers) and its impact on Purview service-side features/solutions is really, really, really important to understand.

Do I have your attention? 😉

What do I mean by a service-side Purview feature/solution? A few key ones…

  • An auto-labeling policy to apply a sensitivity label to at-rest SharePoint/OneDrive content
  • An auto-apply retention label policy to apply a retention label to at-rest SharePoint/OneDrive content
  • An eDiscovery search that is querying for a classifier in at-rest SharePoint/OneDrive content

Notice a theme there? I’m focused on at-rest content in SharePoint and OneDrive in this post and there’s a lot of it.

Obvious by its absence is any mention of at-rest Exchange Online content. As of September 2025, at-rest content in Exchange Online is not classified, so you won’t be able to find classifier matches there (either pre-built or custom). Exchange content is only classified while in transit (as emails are sent/received).

Check out a recent post from my friend, Matthew Silcox, where he provides a PowerShell script you may find useful for finding classifier matches in Exchange Online mailboxes.

I regularly hear these 2 concepts being used interchangeably: “classifying your data” and “applying sensitivity labels”. They’re not the same. In fact, data can be classified without ever applying a sensitivity label. Data classification can certainly inform the application of a sensitivity label, but it can do so many other things.

Why does this distinction matter? It matters because data classification is a separate, prerequisite process that must complete before many Purview features (not just sensitivity labels) can take action based on the classification details.

Many client-side and service-side Purview solutions can consume classification details. Some of the main ones:

  • Information Protection (the application of sensitivity labels)
  • Records Management (the application of retention labels)
  • Data Loss Prevention (DLP) policies (icons, policy tips, restrict/block actions)
  • Insider Risk Management (IRM) policies (prioritize classifier content)
  • Communication Compliance (CC) policies (detect classifier content in communications)
  • eDiscovery search (find classifier matches)

Fittingly, this process harks back to the Purview diagram below, which Microsoft has used for many years and which, as it turns out, still holds true today. You’ll see this diagram in the official Microsoft guidance for data classification (link).

Central to the KNOW YOUR DATA phase (the first step) is the Purview data classification service:

[Image: Microsoft’s data classification diagram]

A practical example to set the stage: if the data classification service scans a file’s content (in the KNOW YOUR DATA phase) and determines there are 3 high confidence credit card numbers (CCNs) in the file, it stores this information in the classification details for the file (in a location available to Purview). Only then can downstream Purview solutions consume this information as a condition to act upon. Some examples of downstream actions:

  • A DLP policy could block sharing of the file outside of your organization
  • A sensitivity label could be automatically applied
  • A retention label could be automatically applied
  • IRM could use this as a risk indicator to feed into a user’s risk score
  • A CC policy could generate an alert if users are exfiltrating this content
  • An eDiscovery search could return the file when searching for CCNs

Custom Classifiers

As you can imagine, central to the data classification service are classifiers. 🙂 Microsoft Purview provides many pre-built classifiers as well as capabilities to create your own custom ones. A custom classifier can be built by either modifying one of the pre-built classifiers to improve its accuracy or creating a new one specific to your organization that has no pre-built equivalent.

Examples:

  • copy the pre-built Canada SIN or U.S. SSN classifier and update it to include more keywords for improved accuracy
  • create an Exact Data Match classifier to identify your unique customer account numbers

A custom classifier helps to ensure comprehensive coverage of sensitive assets across your environment. (E.g., internal project names/codes, proprietary formulas, proprietary forms, unique identifiers such as customer #, patient #, student #, etc.)

Once classified, an item’s classification details include metadata such as the classifier name and GUID, the confidence levels detected, and the number of classifier occurrences found. A few examples where you’ll see this information exposed in Purview:

  • Data Explorer detailed item view
  • DLP alert event metadata
  • On-demand classification detailed item view

Below is an example of the classification details for a file containing 3 counts of a medium confidence custom classifier called “ABC Corp Customer Number”. In this example, no high confidence match was found for the classifier. This is from the on-demand classification service targeting a SharePoint site’s content:


ClassificationInfo: {
  "SensitiveInformation": [
    {
      "ClassifierType": "Content",
      "Confidence": 65,
      "Count": 3,
      "PrivacyPrimaryMatches": [],
      "SensitiveInformationDetailedClassificationAttributes": [
        { "Confidence": 65, "Count": 3 },
        { "Confidence": 75, "Count": 3 },
        { "Confidence": 85, "Count": 0 }
      ],
      "SensitiveInformationDetections": {
        "DetectedOffsetsAndLengths": "9#9,0#8,19#9",
        "ResultsTruncated": false
      },
      "SensitiveType": "83180073-5b57-4e2a-9ed0-602818b6dbda",
      "SensitiveTypeSource": "Tenant",
      "SensitiveInfoTypeName": "ABC Corp Customer Number"
    }
  ]
}
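If you need to work with this raw ClassificationInfo payload outside of the portal, it parses with a few lines of code. Here’s a minimal Python sketch against the example above. Two assumptions to call out: the 65/75/85 values are mapped to Purview’s standard low/medium/high confidence bands, and DetectedOffsetsAndLengths is interpreted as comma-separated "offset#length" pairs (both are my reading of the payload, not official documentation):

```python
import json

# Sample ClassificationInfo payload, same shape as the blob shown above.
raw = '''{"SensitiveInformation":[{"ClassifierType":"Content","Confidence":65,"Count":3,
"SensitiveInformationDetailedClassificationAttributes":[{"Confidence":65,"Count":3},
{"Confidence":75,"Count":3},{"Confidence":85,"Count":0}],
"SensitiveInformationDetections":{"DetectedOffsetsAndLengths":"9#9,0#8,19#9","ResultsTruncated":false},
"SensitiveType":"83180073-5b57-4e2a-9ed0-602818b6dbda",
"SensitiveTypeSource":"Tenant","SensitiveInfoTypeName":"ABC Corp Customer Number"}]}'''

# Assumption: Purview's standard confidence thresholds (low/medium/high).
BANDS = {65: "low", 75: "medium", 85: "high"}

def summarize(classification_info: str) -> list[dict]:
    """Return one summary dict per sensitive info type detected in the payload."""
    results = []
    for sit in json.loads(classification_info)["SensitiveInformation"]:
        # Match counts per confidence band.
        counts = {
            BANDS.get(attr["Confidence"], str(attr["Confidence"])): attr["Count"]
            for attr in sit["SensitiveInformationDetailedClassificationAttributes"]
        }
        # Assumption: "offset#length" pairs locating each match in the inspected text.
        offsets = [
            tuple(int(n) for n in pair.split("#"))
            for pair in sit["SensitiveInformationDetections"]["DetectedOffsetsAndLengths"].split(",")
        ]
        results.append({
            "name": sit["SensitiveInfoTypeName"],
            "guid": sit["SensitiveType"],
            "source": sit["SensitiveTypeSource"],  # "Tenant" indicates a custom classifier
            "counts": counts,
            "offsets": offsets,
        })
    return results

summary = summarize(raw)[0]
print(summary["name"], summary["counts"])
# ABC Corp Customer Number {'low': 3, 'medium': 3, 'high': 0}
```

This gives the same answer as the Match summary tab: 3 medium-confidence matches and 0 high-confidence matches for the custom classifier.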


A more user-friendly view of classification information is shown in numerous locations across Purview. Here’s an example of the Match summary tab from on-demand classification for the same file described above:

[Image: Match summary tab from on-demand classification]

Custom Classifier Timeline for Classification

This timeline is important…

Custom classifiers may be created and changed over time as business/regulatory requirements evolve and as you refine the classifier definition to improve its accuracy. Due to this, it becomes important to understand the timeline of when a custom classifier was built/last changed in your tenant and the potential downstream effects.

In a perfect world, all active and historical/at-rest content would be automatically (re)classified with the latest custom classifier definitions, allowing the Purview service-side solutions referenced earlier to act on up-to-date classification details. But since this isn’t “automatic” for all at-rest content, it leaves a gap in coverage. How can you ensure all data is reclassified?

Data classification is triggered in these ways…

  1. Automatically via continuous classification. This happens when content is created, accessed, or modified: for example, when you create/modify/upload documents in SharePoint/OneDrive, open a document to share it, or compose/send an email. The content is inspected and classified in the moment. Although this covers your active content, which is great, it does NOT classify existing, at-rest content, which leads us to the next two ways classification can be triggered…
  2. Initiated via a full re-index of the SharePoint site. Here you initiate a re-index of the site (Site settings… Search and offline availability… Reindex site), which updates the search index as well as the classification details. However, you have no insight into the re-indexing status, it can take significantly longer, and, perhaps more importantly, it can negatively impact end-user performance on the site while it is being re-indexed. This leaves the next method, which is my recommendation…
  3. Initiated via on-demand classification. Here you explicitly request classification of at-rest content in a location. Note that this is a pay-as-you-go service in Purview (learn about on-demand classification). Because of that, thoughtful planning must go into which locations require classification and when you should run it.

Why would on-demand classification be required? Imagine a scenario where a custom classifier is created several years after you’ve migrated content into SharePoint/OneDrive, and you haven’t touched the majority of the migrated files since. To ensure these historical files carry up-to-date classification details for your new/changed custom classifier, so the appropriate downstream protection and retention controls can be applied, on-demand classification is required.
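The timeline logic above can be sketched as a simple check: any file untouched since before the classifier was built (or last changed) still carries stale classification details and should be scoped into an on-demand classification run. A minimal Python sketch, assuming you can export a file inventory with last-modified dates (all paths and dates below are illustrative, not from a real tenant):

```python
from datetime import datetime, timezone

# Hypothetical inventory: when each file was last created/modified, i.e. the
# last time continuous classification would have inspected it. In practice
# you would export this from a SharePoint/OneDrive usage or content report.
files = {
    "HR/sin-archive-2019.xlsx": datetime(2019, 3, 1, tzinfo=timezone.utc),
    "HR/new-hires-2025.xlsx":   datetime(2025, 8, 15, tzinfo=timezone.utc),
}

# Date the custom classifier was created or last changed in the tenant.
classifier_updated = datetime(2025, 6, 1, tzinfo=timezone.utc)

def needs_reclassification(last_touched: datetime, classifier_changed: datetime) -> bool:
    """A file untouched since before the classifier changed still carries stale
    classification details, so service-side policies won't match it."""
    return last_touched < classifier_changed

stale = [path for path, touched in files.items()
         if needs_reclassification(touched, classifier_updated)]
print(stale)  # the at-rest files to target with on-demand classification
# ['HR/sin-archive-2019.xlsx']
```

The 2025 file was modified after the classifier change, so continuous classification has already caught it; only the untouched 2019 archive needs the on-demand pass.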

A practical example

This is based on a recent customer’s scenario… You have built a custom classifier to identify Canadian Social Insurance Numbers (SINs) across your tenant because the pre-built classifier for Canadian SIN was not accurate enough. You have been in Microsoft 365 for years and have many historical sites and OneDrive accounts with many legacy, untouched, at-rest files you suspect contain a SIN according to your custom classifier definition.

It is important for you to identify any documents containing your custom SIN definition (even historical) because the desired outcome for all files containing a SIN is to do these 2 things:

  1. Automatically apply an encrypting sensitivity label to both protect against data exfiltration risks and to ensure the usage rights granted via the label prevent Copilot usage of the file (EXTRACT and VIEW)
  2. Automatically apply a retention label to delete them from all SharePoint/OneDrive sites 1 month after their created date due to a regulatory compliance audit requirement

You may be thinking you could leverage the 2 service-side Purview capabilities below to accomplish this using the high confidence custom SIN as a condition:

  • automatically apply the sensitivity label using an auto-labeling policy
  • automatically apply the retention label using an auto-apply label policy

… but they won’t find the SINs. Both of those service-side policies work against the current classification of the data, which for many of your historical files will be out of date; since the custom SIN classifier won’t be detected, neither label will be applied.

As a pragmatic approach, I prefer to assume a location will require reclassification to update the classification details anytime I create or update a custom SIT whose format, keywords, or confidence levels have changed.


Mitigating Controls

There are several controls that can mitigate the risk of data exfiltration of at-rest content. Exfiltration puts the content in motion, so features that work against data-in-motion will trigger continuous classification, which allows the updated classification details to be consumed.

Example: a simple control is a DLP policy that detects the custom classifier and blocks/restricts access. This works because continuous classification is automatically triggered for an Exchange email while it’s in transit and for a SharePoint/OneDrive file when it’s accessed.

In my testing, the DLP (warn/block) icons will not appear on historical SharePoint/OneDrive content matching your custom classifier if their classification details are out-of-date and the DLP policy has a condition targeting the custom classifier. They will only appear once the content is reclassified.

Why this timeline is important

Building custom classifiers is a common practice for many of my customers. The iterative process of refining a classifier over time also means a reclassification may be required for some locations once you have the classifier fine-tuned for accuracy. However, once content is classified, you can confidently leverage other Purview solutions to act upon the classification, particularly these service-side features:

  1. Auto-labeling policy to apply a sensitivity label
  2. Auto-apply label policy to apply a retention label
  3. An eDiscovery search to find classifier matches across all classified SharePoint content

My key takeaways

  • At-rest content must be classified before service-side Purview controls can consume the classification details as a condition to take action
  • The earlier you can build your custom classifiers in your tenant timeline, the more content you can cover without requiring on-demand classification since new/changed content is automatically classified (doing as much of this work prior to migration may be worth the effort)
  • If you’re building new or changing custom classifiers (changing format and/or confidence levels that downstream solutions are using as a condition), use on-demand classification to (re)classify historical content
  • Target only new/changed classifiers in on-demand classification (a configuration option) to reduce classification time (and cost?)
  • If Microsoft were to change one of the pre-built classifiers (to my knowledge, this doesn’t happen often), it would be subject to the same downstream effects as a custom classifier

Thanks for reading.

-JCK

 

3 comments

  1. Hello Joanne,

    Normally I re-index all the SharePoint sites to trigger SPO indexing.
    Unfortunately, it is not possible to detect at-rest data in Exchange Online for the new custom Sensitive Info Type.

    I tried to use on-demand classification; however, it wasn’t accurate at all. The on-demand classification “viewer” showed a charge of 38k for the new scan; however, when I checked the discovered files, none of them contained the custom sensitive information.

    How did you re-classify content before on-demand classification?

    1. Hi Sergio,
      I added in some verbiage around Exchange online at-rest content. The post is focusing on SPO/OD at-rest content. Thank you for pointing that out!!
      [Update] Re-indexing SPO sites to refresh the search index does in fact update the classification details for historical content; however, it can take much longer, has end-user impacts, and does not offer the granular controls of on-demand classification, such as targeting specific classifiers, file types, and date ranges. On larger historical sites, that granularity may be a significant benefit.
      -Joanne
