Format strings – good / easy / elegant way to do contextual capitalization of placeholders?

Attention: Topic was automatically imported from the old Question2Answer platform.
Asked By: twi

Hello, I’m writing a script that will allow strings to refer properly to a custom player character or something of that nature (chiefly, implementing pronouns and the grammatical rules to go along with them). Right now the issue I’m having trouble with is how I would go about dynamically capitalizing {placeholder}s in accordance with their context within a string (the most obvious example being when they appear at the beginning of a sentence or string).

Oversimplified example:

var testcase = "Susie is a {species}. {species}s have soft fur and four legs."
print(testcase.format({"species" : "cat"}))

naïve output: Susie is a cat. cats have soft fur and four legs.

By now I’ve designed a function which in its current state looks like this:

assert(typeof(a) == TYPE_STRING)
var i : int = -1 # Character pointer; starts at -1 so a lone placeholder at index 0 still enters the loop

var s = ""
var t = ""
var skip = []
var res = a 
while i < res.find_last("{"):
	var cap = false

	# Locate and store the placeholder
	i = res.find("{",0)
	while i in skip:
		i = res.find("{",i+1)
	s = res.substr(i, (res.find("}",i+1) - i) + 1)
	
	match s:
		"{NAME}": t = pcname # yes this is against style; i'm not bothered
		"{THEY}" : t = set[prns.SUB] # "set" is the selected array containing the pronouns themselves
		"{THEM}" : t = set[prns.OBJ] #  and "prns" is an enum defining the indices of each class of pronoun.
		"{THEIR}" : t = set[prns.POS] # This structure can present its own problems but it's what I have right now
		"{THEIRS}" : t = set[prns.IPOS] # Placeholder name differs from abbreviated technical name of pronoun
		"{THEYRE}" : t = set[prns.PRES] # so as to be more natural to type in a sentence
		"{THEYVE}" : t = set[prns.HAVE]
		"{IS}" : t = set[prns.IS] # Singular "they" requires this
		"{S}" : t = set[prns.S] # For grammatical reasons
		"{ES}" : t = set[prns.ES] # Also for grammatical reasons
		"{N}" : t = set[prns.N] # Ditto
		_ : # placeholder not found
			skip.append(i)
			t = s 
	
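	if i == 0:
		cap = true
	elif i >= 2 and res[i-2] == "." and res[i-1] == " ": # guard keeps i-2 from going negative
		cap = true
	# Uppercase only the first letter; capitalize() would also lowercase the rest and split camelCase names
	t = t.left(1).to_upper() + t.substr(1, t.length() - 1) if cap else t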
	res.erase(i, (res.find("}",i+1) - i) + 1)
	res = res.insert(i,t)
return(res)

Forgive my beginner programming skills but I can’t help but feel like there’s a better or more efficient way to do this. This doesn’t seem like that out-there of a use case so maybe I’m missing something obvious? Yet I can’t find any questions or instruction relating to this online.

EDIT: I’ve made a few fixes which should prevent improper capitalization or hanging given the wrong string but my question still stands.

Reply From: jgodfrey

Just a thought, but assuming the tokenized strings are all canned, why not just record the appropriate case within the token representation itself? So, for your example above, maybe something like this:

var testcase = "Susie is a {species}. {Species}s have soft fur and four legs."

Note that I’ve used two different character cases for the species token as required for the sentence structure.

If something like that would work, you’d then just need to find the tokens in a case-insensitive way, but then transfer the token’s actual case to the substituted string. That might make the substitution a bit trickier, but at least then, you’d have absolute control over the use of case in the sentences, rather than relying on an algorithm to “figure it out”.
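As a rough illustration of that idea, here is a minimal GDScript sketch, assuming Godot 3.x. The function name `resolve_token` and the `values` dictionary are hypothetical and not part of the original script. The token is looked up by its lowercase name, and an uppercase first letter in the token is copied onto the replacement text.

```gdscript
extends Node

# Hypothetical helper (not from the original post): resolve one token such as
# "{Species}" case-insensitively and transfer its leading case to the value.
func resolve_token(token: String, values: Dictionary) -> String:
	# Strip the surrounding braces: "{Species}" -> "Species".
	var token_name := token.substr(1, token.length() - 2)
	var key := token_name.to_lower()
	if not values.has(key):
		return token  # Unknown token: leave it exactly as written.
	var value: String = values[key]
	# An uppercase first letter in the token requests a capitalized value.
	var first := token_name.substr(0, 1)
	if first != first.to_lower():
		value = value.substr(0, 1).to_upper() + value.substr(1, value.length() - 1)
	return value

func _ready() -> void:
	var values := {"species": "cat"}  # Assumed lookup table, keys in lowercase.
	print(resolve_token("{species}", values))  # cat
	print(resolve_token("{Species}", values))  # Cat
```

Only the first letter is touched here, so a replacement that already contains capitals (a name, for instance) keeps them.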

Regarding your general method of parsing an input string, finding the tokens, and substituting them in some output string…

I’d assume a more elegant solution could be devised leveraging regular expressions to locate the tokens and basic string replace calls to substitute them, with some logic in between to determine appropriate substitution strings.

That’s probably a path I’d investigate if I were doing this. That said, I don’t have any code I can point to that’d be particularly helpful here ATM…

RegEx: Godot Engine (stable) documentation in English

String: Godot Engine (stable) documentation in English
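For what it's worth, the regular-expression route might look something like the sketch below. It assumes Godot 3.x and its built-in `RegEx` class; `fill_tokens` and the `values` table are hypothetical names made up for this example, and tokens are assumed to consist of word characters only. It is one possible shape for the idea, not a tested drop-in replacement for the function in the question.

```gdscript
extends Node

# Hypothetical lookup table; keys are lowercase token names.
var values := {"species": "cat", "name": "Susie"}

func fill_tokens(text: String) -> String:
	var regex := RegEx.new()
	# Match {word} tokens; group 1 captures the name without the braces.
	regex.compile("\\{(\\w+)\\}")
	var result := ""
	var last_end := 0
	for m in regex.search_all(text):
		var token_name: String = m.get_string(1)
		# Copy the literal text between the previous token and this one.
		result += text.substr(last_end, m.get_start() - last_end)
		last_end = m.get_end()
		var key := token_name.to_lower()
		if not values.has(key):
			result += m.get_string()  # Unknown token: keep it as written.
			continue
		var value: String = values[key]
		# The token's first letter differs from its lowercase form only when
		# it is uppercase, which is the cue to capitalize the value.
		if token_name.substr(0, 1) != key.substr(0, 1):
			value = value.substr(0, 1).to_upper() + value.substr(1, value.length() - 1)
		result += value
	# Append whatever follows the last token.
	result += text.substr(last_end, text.length() - last_end)
	return result

func _ready() -> void:
	# Prints: Susie is a cat. Cats have soft fur and four legs.
	print(fill_tokens("{Name} is a {species}. {Species}s have soft fur and four legs."))
```

Building the output in a single pass over the matches avoids editing the string in place, so there is no need to track skipped indices or re-search from the start after each substitution.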

jgodfrey | 2020-12-16 17:49

I was hoping there’d be some way to avoid essentially reimplementing format(), but honestly, now that it’s all said and done, the function doesn’t really seem all that bad. When first considering how to approach this problem, I rejected the idea of encoding the case in the token because the only way I could think to do it at the time was to create a duplicate for every token, which is obviously absurd. Instead, I’ve used a token[1].casecmp_to("a") check to implement your suggested approach, and it works perfectly.
This solution doesn’t work with diacritics, but I wasn’t really planning to use them in tokens anyhow.

twi | 2020-12-16 22:22

Yes, exactly. You don’t want to duplicate the tokens, just compare them in a case-insensitive manner. Anyway, glad you have it working.

jgodfrey | 2020-12-16 22:31