The post OpenAI Releases Double-Checking Tool For AI Safeguards That Handily Allows Customizations appeared on BitcoinEthereumNews.com. AI developers need to double-check their proposed AI safeguards and a new tool is helping to accomplish that vital goal. getty In today’s column, I examine a recently released online tool by OpenAI that enables the double-checking of potential AI safeguards and can be used for ChatGPT purposes and likewise for other generative AI and large language models (LLMs). This is a handy capability and worthy of due consideration. The idea underlying the tool is straightforward. We want LLMs and chatbots to make use of AI safeguards such as detecting when a user conversation is going afield of safety criteria. For example, a person might be asking the AI how to make a toxic chemical that could be used to harm people. If a proper AI safeguard has been instituted, the AI will refuse the unsafe request. OpenAI’s new tool allows AI makers to specify their AI safeguard policies and then test the policies to ascertain that the results will be on target to catch safety violations. Let’s talk about it. This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here). The Importance Of AI Safeguards One of the most disconcerting aspects about modern-day AI is that there is a solid chance that AI will say things that society would prefer not to be said. Let’s broadly agree that generative AI can emit safe messages and also produce unsafe messages. Safe messages are good to go. Unsafe messages ought to be prevented so that the AI doesn’t emit them. AI makers are under a great deal of pressure to implement AI safeguards that will allow safe messaging and mitigate or hopefully prevent unsafe messaging by their LLMs. There is a… The post OpenAI Releases Double-Checking Tool For AI Safeguards That Handily Allows Customizations appeared on BitcoinEthereumNews.com. AI developers need to double-check their proposed AI safeguards and a new tool is helping to accomplish that vital goal. getty In today’s column, I examine a recently released online tool by OpenAI that enables the double-checking of potential AI safeguards and can be used for ChatGPT purposes and likewise for other generative AI and large language models (LLMs). This is a handy capability and worthy of due consideration. The idea underlying the tool is straightforward. We want LLMs and chatbots to make use of AI safeguards such as detecting when a user conversation is going afield of safety criteria. For example, a person might be asking the AI how to make a toxic chemical that could be used to harm people. If a proper AI safeguard has been instituted, the AI will refuse the unsafe request. OpenAI’s new tool allows AI makers to specify their AI safeguard policies and then test the policies to ascertain that the results will be on target to catch safety violations. Let’s talk about it. This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here). The Importance Of AI Safeguards One of the most disconcerting aspects about modern-day AI is that there is a solid chance that AI will say things that society would prefer not to be said. Let’s broadly agree that generative AI can emit safe messages and also produce unsafe messages. Safe messages are good to go. Unsafe messages ought to be prevented so that the AI doesn’t emit them. AI makers are under a great deal of pressure to implement AI safeguards that will allow safe messaging and mitigate or hopefully prevent unsafe messaging by their LLMs. There is a…

OpenAI Releases Double-Checking Tool For AI Safeguards That Handily Allows Customizations

2025/11/04 17:25

AI developers need to double-check their proposed AI safeguards and a new tool is helping to accomplish that vital goal.

getty

In today’s column, I examine a recently released online tool by OpenAI that enables the double-checking of potential AI safeguards and can be used for ChatGPT purposes and likewise for other generative AI and large language models (LLMs). This is a handy capability and worthy of due consideration.

The idea underlying the tool is straightforward. We want LLMs and chatbots to make use of AI safeguards such as detecting when a user conversation is going afield of safety criteria. For example, a person might be asking the AI how to make a toxic chemical that could be used to harm people. If a proper AI safeguard has been instituted, the AI will refuse the unsafe request.

OpenAI’s new tool allows AI makers to specify their AI safeguard policies and then test the policies to ascertain that the results will be on target to catch safety violations.

Let’s talk about it.

This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

The Importance Of AI Safeguards

One of the most disconcerting aspects about modern-day AI is that there is a solid chance that AI will say things that society would prefer not to be said. Let’s broadly agree that generative AI can emit safe messages and also produce unsafe messages. Safe messages are good to go. Unsafe messages ought to be prevented so that the AI doesn’t emit them.

AI makers are under a great deal of pressure to implement AI safeguards that will allow safe messaging and mitigate or hopefully prevent unsafe messaging by their LLMs.

There is a wide range of ways that unsafe messages can arise. Generative AI can produce so-called AI hallucinations or confabulations that tell a user to do something untoward, but the person assumes that the AI is being honest and apt in what has been generated. That’s unsafe. Another way that AI can be unsafe is if an evildoer asks the AI to explain how to make a bomb or produce a toxic chemical. Society doesn’t want that type of easy-peasy means of figuring out dastardly tasks.

Another unsafe angle is for AI to aid people in concocting delusions and delusional thinking, see my coverage at the link here. The AI will either prod a person into conceiving of a delusion or might detect that a delusion is already on their mind and aid in embellishing the delusion. The preference is that AI provides upside mental health advice over downside mental health guidance.

Devising And Testing AI Safeguards

I’m sure you’ve heard the famous line that you ought to try it before you buy it, meaning that sometimes being able to try out an item is highly valuable before making a full commitment to the item. The same wisdom applies to AI safeguards.

Rather than simply tossing AI safeguards into an LLM that is actively being used by perhaps millions upon millions of people (sidenote: ChatGPT is being used by 800 million weekly active users), we’d be smarter to try out the AI safeguards and see if they do what they are supposed to do.

An AI safeguard should catch or prevent whatever unsafe messages we believe need to be stopped. There is a tradeoff involved since an AI safeguard can become an overreach. Imagine that we decide to adopt an AI safeguard that prevents anyone from ever making use of the word “chemicals” because we hope to avoid allowing a user to find out about toxic chemicals.

Well, denying the use of the word “chemicals” is an exceedingly bad way to devise an AI safeguard. Imagine all the useful and fair uses of the word “chemicals” that can arise. Here’s an example of an innocent request. People might be worried that their household products might contain adverse chemicals, so they ask the AI about this. An AI safeguard that blindly stopped any mention of chemicals would summarily turn down that legitimate request.

The crux is that AI safeguards can be very tricky when it comes to writing them and ensuring that they do the right things (see my discussion on this, at the link here). The preference is that an AI safeguard stops the things we want to stop, but doesn’t go overboard and stop things that we are fine with having proceed. A poorly devised AI safeguard will indubitably produce a vast number of false positives, meaning that it will stop an otherwise upside and allowable action.

If possible, we should try out any proposed AI safeguards before putting them into active action.

Using Classifiers To Help Out

There are online tools that can be used by AI developers to assist in classifying whether a given snippet of text is considered safe versus unsafe. Usually, these classifiers have been pretrained on what constitutes safety and what constitutes being unsafe. The beauty of these classifiers is that an AI developer can simply feed various textual content into the tool and see which, if any, of the AI safeguards embedded into the tool will react.

One difficulty is that those kinds of online tools don’t necessarily allow you to plug in your own proposed AI safeguards. Instead, the AI safeguards are essentially baked into the tool. You can then decide whether those are the same AI safeguards you’d like to implement in your LLM.

A more accommodating approach would be to allow an AI developer to feed in their proposed AI safeguards. We shall refer to those AI safeguards as policies. An AI developer would work with other stakeholders and come up with a slate of policies about what AI safeguards are desired. Those policies then could be entered into a tool that would readily try out those policies on behalf of the AI developer and their stakeholders.

To test the proposed policies, an AI developer would need to craft text to be used during the testing or perhaps grab relevant text from here or there. The aim is to have a sufficient variety and volume of text that the desired AI safeguards all ultimately get a chance to shine in the spotlight. If we have an AI safeguard that is proposed to catch references to toxic chemicals, the text that is being used for testing ought to contain some semblance of references to toxic chemicals; otherwise, the testing process won’t be suitably engaged and revealing about the AI safeguards.

OpenAI’s New Tool For AI Safeguard Testing

In a blog posting by OpenAI on October 29, 2025, entitled “Introducing gpt-oss-safeguard”, the well-known AI maker announced the availability of an AI safeguard testing tool:

  • “Safety classifiers, which distinguish safe from unsafe content in a particular risk area, have long been a primary layer of defense for our own and other large language models.”
  • “Today, we’re releasing a research preview of gpt-oss-safeguard, our open-weight reasoning models for safety classification tasks, available in two sizes: gpt-oss-safeguard-120b and gpt-oss-safeguard-20b.”
  • “The gpt-oss-safeguard models use reasoning to directly interpret a developer-provided policy at inference time — classifying user messages, completions, and full chats according to the developer’s needs.”
  • “The model uses chain-of-thought, which the developer can review to understand how the model is reaching its decisions. Additionally, the policy is provided during inference, rather than being trained into the model, so it is easy for developers to iteratively revise policies to increase performance.”

As per the cited indications, you can use the new tool to try out your proposed AI safeguards. You provide a set of policies that represent the proposed AI safeguards, and also provide whatever text is to be used during the testing. The tool attempts to apply the proposed AI safeguards to the given text. An AI developer then receives a report analyzing how the policies performed with respect to the provided text.

Iteratively Using Such A Tool

An AI developer would likely use such a tool on an iterative basis.

Here’s how that goes. You draft policies of interest. You devise or collect suitable text for testing purposes. Those policies and text get fed into the tool. You inspect the reports that provide an analysis of what transpired. The odds are that some of the text that should have triggered an AI safeguard did not do so. Also, there is a chance that some AI safeguards were triggered even though the text per se should not have set them off.

Why can that happen?

In the case of this particular tool, a chain-of-thought (CoT) explanation is being provided to help ferret out the culprit. The AI developer could review the CoT to discern what went wrong, namely, whether the policy was insufficiently worded or the text wasn’t sufficient to trigger the AI safeguard. For more about the usefulness of chain-of-thought in contemporary AI, see my discussion at the link here.

A series of iterations would undoubtedly take place. Change the policies or AI safeguards and make another round of runs. Adjust the text or add more text, and make another round of runs. Keep doing this until there is a reasonable belief that enough testing has taken place.

Rinse and repeat is the mantra at hand.

Hard Questions Need To Be Asked

There is a slew of tough questions that need to be addressed during this testing and review process.

First, how many tests or how many iterations are enough to believe that the AI safeguards are good to go? If you try too small a number, you are likely deluding yourself into believing that the AI safeguards have been “proven” as ready for use. It is important to perform somewhat extensive and exhaustive testing. One means of approaching this is by using rigorous validation techniques, as I’ve explained at the link here.

Second, make sure to include trickery in the text that is being used for the testing process.

Here’s why. People who use AI are often devious in trying to circumvent AI safeguards. Some people do so for evil purposes. Others like to fool AI just to see if they can do so. Another perspective is that a person tricking AI is doing so on behalf of society, hoping to reveal otherwise hidden gotchas and loopholes. In any case, the text that you feed into the tool ought to be as tricky as you can make it. Put yourself into the shoes of the tricksters.

Third, keep in mind that the policies and AI safeguards are based on human-devised natural language. I point this out because a natural language such as English is difficult to pin down due to inherent semantic ambiguities. Think of the number of laws and regulations that have loopholes due to a word here or there that is interpreted in a multitude of ways. The testing of AI safeguards is slippery because you are testing on the merits of human language interpretability.

Fourth, even if you do a bang-up job of testing your AI safeguards, they might need to be revised or enhanced. Do not assume that just because you tested them a week ago, a month ago, or a year ago, they are still going to stand up today. The odds are that you will need to continue to undergo a cat-and-mouse gambit, whereby AI users are finding insidious ways to circumvent the AI safeguards that you thought had been tested sufficiently.

Keep your nose to the grind.

Thinking Thoughtfully

An AI developer could use a tool like this as a standalone mechanism. They proceed to test their proposed AI safeguards and then subsequently apply the AI safeguards to their targeted LLM.

An additional approach would be to incorporate this capability into the AI stack that you are developing. You could place this tool as an embedded component within a mixture of LLM and other AI elements. A key aspect will be the proficiency in running, since you are now putting the tool into the stream of what is presumably going to be a production system. Make sure that you appropriately gauge the performance of the tool.

Going even further outside the box, you might have other valuable uses for a classifier that allows you to provide policies and text to be tested against. In other words, this isn’t solely about AI safeguards. Any other task that entails doing a natural language head-to-head between stated policies and whether the text activates or triggers those policies can be equally undertaken with this kind of tool.

I want to emphasize that this isn’t the only such tool in the AI community. There are others. Make sure to closely examine whichever one you might find relevant and useful to you. In the case of this particular tool, since it is brought to the market by OpenAI, you can bet it will garner a great deal of attention. More fellow AI developers will likely know about it than would a similar tool provided by a lesser-known firm.

AI Safeguards Need To Do Their Job

I noted at the start of this discussion that we need to figure out what kinds of AI safeguards will keep society relatively safe when it comes to the widespread use of AI. This is a monumental task. It requires technological savviness and societal acumen since it has to deal with both AI and human behaviors.

OpenAI has opined that their new tool provides a “bring your own policies and definitions of harm” design, which is a welcome recognition that we need to keep pushing forward on wrangling with AI safeguards. Up until recently, AI safeguards generally seemed to be a low priority overall and given scant attention by AI makers and society at large. The realization now is that for the good and safety of all of us, we must stridently pursue AI safeguards, else we endanger ourselves on a massive scale.

As the famed Brigadier General Thomas Francis Meagher once remarked: “Great interests demand great safeguards.”

Source: https://www.forbes.com/sites/lanceeliot/2025/11/04/openai-releases-double-checking-tool-for-ai-safeguards-that-handily-allows-customizations/

Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact service@support.mexc.com for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.
Share Insights

You May Also Like

Franklin Templeton updates XRP ETF filing for imminent launch

Franklin Templeton updates XRP ETF filing for imminent launch

Franklin Templeton, one of the world’s largest asset management firms, has taken a significant step in introducing the Spot XRP Exchange-Traded Fund (ETF). The company submitted an updated S-1 registration statement to the U.S. Securities and Exchange Commission (SEC) last week, removing language that likely stood in the way of approval. The change is indicative of a strong commitment to completing the fund sale in short order — as soon as this month. The amendment is primarily designed to eliminate the “8(a)” delay clause, a technological artifact of ETF filings under which the SEC can prevent the effectiveness of a registration statement from taking effect automatically until it affirmatively approves it. By deleting this provision, Franklin Templeton secures the right to render effective the filing of the Registration Statement automatically upon fulfillment of all other conditions. This development positions Franklin Templeton as one of the most ambitious asset managers to file for a crypto ETF amid the current market flow. It replicates an approach that Bitcoin and Ethereum ETF issuers previously adopted, expediting approvals and listings when the 8(a) clause was removed. The timing of this change is crucial. Analysts say it betrays a confidence that the SEC will not register additional complaints against XRP-related products — especially as the market continues to mature and regulatory infrastructures around crypto ETFs take clearer shape. For Franklin Templeton, which manages assets worth more than $1 trillion globally, an XRP ETF would be a significant addition to its cryptocurrency investment offerings. The firm already offers exposure to Bitcoin and Ethereum through similar products, indicating an increasing confidence in digital assets as an emerging investment asset class. Other asset managers race to launch XRP ETFs Franklin Templeton isn’t the only one seeking to launch an XRP ETF. Other asset managers, such as Canary Funds and Bitwise, have also revised their S-1 filings in recent weeks. Canary Funds has withdrawn its operating company’s delaying amendment and is seeking to go live in mid-November, subject to exchange approval. Bitwise, another major player in digital asset management, announced that it would list an XRP ETF on a prominent U.S. exchange. The company has already made public fees and custodial arrangements — the last steps generally completed when an ETF is on the verge of a launch. The surge in amended filings indicates growing industry optimism that the SEC may approve several XRP ETFs for marketing around the same time. For investors, this would provide new, regulated access to one of the world’s most widely traded cryptocurrencies, without the need to hold a token directly. Investors prepare for ripple effect on markets The competition to offer an XRP ETF demonstrates the next step toward institutional involvement in digital assets. If approved, these funds would provide investors with a straightforward, regulated way to gain token access to XRP price movements through traditional brokerages. An XRP ETF could also onboard new retail investors and boost the liquidity and trust of the asset, similarly to what spot Bitcoin ETFs achieved earlier this year. Those funds attracted billions of dollars in inflows within a matter of weeks, a subtle indication of the pent-up demand among institutional and retail investors. The SEC, which has become more receptive to digital-asset ETFs after approving products including Bitcoin and Ethereum, is still carefully weighing every filing. Final approval will be based on full disclosure, custody, and transparency of how pricing is happening through the base market. Still, market participants view the update in Franklin Templeton’s filing as their strongest sign yet that they are poised. With a swift response from the firm and news of other competing funds, this should mean that we don’t have long to wait for the first XRP ETF — marking another key turning point in crypto’s journey into traditional finance. If you're reading this, you’re already ahead. Stay there with our newsletter.
Share
Coinstats2025/11/05 09:16