To start, I took the differential expression data from Fallon et al. (2018) and applied two layers of filtering (expression enrichment and enzyme annotation) to reduce the Photinus pyralis genome from 15,773 genes to a manageable set of luciferin biosynthesis candidates. After cross-species BLAST analysis, phylogenetics, and manual review, three novel candidates and one known candidate emerged.

Data sources

  • PPYR_OGS1.1.enzyme.ids.txt: Fallon’s enzyme annotation list. Essential for the second filtering step, where I narrow candidates to genes that are likely to encode enzymes.
  • PPYR_OGS1.1_fatbody-vs-lantern…_test.txt: Differential expression results, including q-values (adjusted p-values). Used to determine which genes are significantly enriched in the lantern.
  • HMMER hmmscan: Used to look for known protein domains in candidate sequences.

Why compare lantern vs. fat body?

The fat body in insects is analogous to a liver. It handles general metabolic tasks: fat storage, detoxification, immune activity, and protein synthesis.

Fallon’s experiment compared gene expression in the lantern (the abdominal light organ) to gene expression in the fat body. The reasoning is straightforward:

  • If a gene is strongly enriched in the lantern, it may be involved in bioluminescence.
  • If it’s expressed equally in both tissues, it’s probably doing general metabolic work.

What I did:

I took the three files from Fallon et al. (2018) and applied two layers of filtering.

Layer A: Expression filter

Starting from all 15,773 genes, I kept only those that met all three criteria:

  • Lantern TPM ≥ 50: At least moderate expression in the lantern. For reference, luciferase sits at ~66,743 TPM.
  • Sleuth b ≥ 3: Estimated log₂ fold change. b = 3 means ~8× higher expression in lantern; b = 5 means ~32× higher.
  • qval ≤ 1e-10: p-value adjusted for testing ~15,773 genes at once (a false discovery rate). At this cutoff, fewer than about 1 in 10 billion of the genes called significant are expected to be false positives.

This reduced the list to 41 genes, or what I’m calling the “lantern module.”
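The three criteria above can be sketched as a single predicate. This is a minimal illustration, not the script I ran: the record layout (`GeneStats`) and the IDs `GENE_A` to `GENE_C` with their numbers are hypothetical, while the luciferase values are the ones quoted in this post.

```python
from typing import NamedTuple


class GeneStats(NamedTuple):
    gene_id: str
    lantern_tpm: float  # expression in the lantern, in TPM
    b: float            # sleuth effect size (treated as log2 fold change, as in the text)
    qval: float         # multiple-testing-adjusted p-value


def passes_layer_a(g, min_tpm=50.0, min_b=3.0, max_qval=1e-10):
    """A gene must satisfy all three criteria to enter the lantern module."""
    return g.lantern_tpm >= min_tpm and g.b >= min_b and g.qval <= max_qval


genes = [
    GeneStats("PPYR_00001", 66743.0, 5.27, 9e-5),  # luciferase, values from the text
    GeneStats("GENE_A", 1200.0, 4.1, 1e-30),       # hypothetical: passes everything
    GeneStats("GENE_B", 35.0, 6.0, 1e-40),         # hypothetical: fails the TPM floor
    GeneStats("GENE_C", 800.0, 1.2, 1e-15),        # hypothetical: fails the b cutoff
]

lantern_module = [g.gene_id for g in genes if passes_layer_a(g)]
```

With these example records only `GENE_A` survives. Luciferase is rejected purely on its q-value, which is the edge case discussed in the side note further down.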

Layer B: Enzyme filter

From the 41 genes that passed Layer A, I kept only those that appear in Fallon’s enzyme annotation file. That brought the list down to 18 genes.

Fallon’s team used InterProScan to analyze every predicted protein in the firefly genome. InterProScan compares protein sequences to known domain databases and assigns Gene Ontology (GO) terms. I filtered for genes tagged with GO:0003824, “catalytic activity.”

Out of ~15,773 genes total, about 3,890 were labeled this way. Intersecting that list with the 41 lantern-enriched genes left 18 strong enzyme candidates.
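In code, Layer B is just a set intersection. The sketch below assumes the enzyme file holds one gene ID per line, which I have not verified against the actual file layout. An in-memory stand-in replaces the real file, and `HYPOTHETICAL_X` and `HYPOTHETICAL_Y` are made-up IDs.

```python
import io


def read_gene_ids(handle):
    """Collect one gene ID per line, skipping blank lines and '#' comments."""
    ids = set()
    for line in handle:
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


# Stand-in for the enzyme annotation file (assumed format: one ID per line).
enzyme_file = io.StringIO("PPYR_09240\nPPYR_06980\n\nPPYR_07361\nHYPOTHETICAL_X\n")
enzyme_ids = read_gene_ids(enzyme_file)

# Stand-in for the Layer A output; HYPOTHETICAL_Y plays a non-enzyme lantern gene.
lantern_module = ["PPYR_09240", "HYPOTHETICAL_Y", "PPYR_06980", "PPYR_07361"]

# Keep Layer A order while intersecting with the enzyme set.
enzyme_candidates = [g for g in lantern_module if g in enzyme_ids]
```

Filtering the list against a set, rather than intersecting two sets, preserves the ranking that came out of Layer A.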

Interpreting the candidates

After filtering, I reviewed each remaining gene using its GO annotations and HMMER domain hits to see whether the predicted chemistry made sense. I focused on activities consistent with luciferin biosynthesis (things like small-molecule metabolism, oxidation, or cyclization) and deprioritized genes involved in clearly unrelated processes.

Side note: why the filters aren’t perfect

The qval ≤ 1e-10 criterion actually excluded luciferase itself.

Luciferase (PPYR_00001) has b = 5.27 (very strong enrichment) but qval ≈ 9e-5, which is above my cutoff. This happened because one fat-body replicate had unexpectedly high luciferase expression (~3,693 TPM, compared to ~61 and ~162 in the other two). That inflated the variance, so sleuth still flagged the gene as significant, but with less statistical confidence.

This is a reminder that the pipeline is conservative. Even known bioluminescence genes can fall outside strict thresholds because of replicate noise. I will need to expand my search going forward.
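To get a feel for how much that single replicate matters, here is a rough back-of-the-envelope check using the three fat-body TPM values quoted above. It is not sleuth's model (sleuth works on transformed counts with its own variance estimates); it only compares the spread of log2 TPM with and without the outlier, and the 1,000 TPM cutoff used to drop it is an arbitrary choice for the illustration.

```python
import math
import statistics

# Fat-body luciferase TPM per replicate, as quoted in the text.
fat_body_tpm = [3693.0, 61.0, 162.0]

# Spread on the log2 scale, with all three replicates.
log_all = [math.log2(x) for x in fat_body_tpm]
sd_all = statistics.stdev(log_all)

# Same calculation after dropping the outlier replicate (arbitrary 1,000 TPM cutoff).
log_without = [math.log2(x) for x in fat_body_tpm if x < 1000.0]
sd_without = statistics.stdev(log_without)

# Ratio of the two standard deviations: how much the outlier widens the spread.
inflation = sd_all / sd_without
```

The standard deviation on the log2 scale is roughly three times larger with the outlier included, which is the kind of variance inflation that weakens a test statistic even when the fold change itself is large.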

Knocked-out candidates

The following genes passed both filters but were discarded after manual review and cross-species comparison.

PPYR_09240 Discarded

A CoA-transferase family III enzyme (PF02515) with a very strong domain match (E-value 2.5e-70) and high lantern expression (TPM ~1,210). Annotated as alpha-methylacyl-CoA racemase (AMACR), and in one Photinus annotation called “lantern racemase.” AMACR enzymes convert stereoisomers of branched-chain acyl-CoA molecules and typically function in peroxisomal metabolism.

At first glance this seemed promising: luciferase itself operates in a similar biochemical space (acyl-CoA-like chemistry). However, sequence comparisons show that this protein is broadly conserved across insects, including many non-luminous species at ~60–65% identity. This indicates it’s an ancient, general metabolic enzyme rather than a firefly-specific innovation.

PPYR_06980 Discarded

Cytochrome P450 4g15 (CYP4G15). Present across essentially all beetles, not just fireflies: ~74–97% identity in fireflies, ~71–73% in non-luminous beetles, and ~73–74% even in other luminous beetles (e.g., click beetles).

CYP4G15 is a well-characterized insect enzyme involved in cuticular hydrocarbon biosynthesis: it produces the waxy waterproof coating on the insect’s surface. Since every beetle needs this function, the gene is highly conserved and not specific to bioluminescent species. Although it is strongly expressed in the lantern, the most likely explanation is a structural role rather than involvement in luciferin biosynthesis.

PPYR_07361 Discarded

A UDP-glucuronosyl/glucosyl transferase (UGT) (PF00201, E-value 9.6e-83), with an additional C-terminal glycosyltransferase domain (PF06722). UGTs attach sugar groups to small molecules and are widespread in insects for detoxification and metabolite processing.

In theory, a UGT could modify luciferin or a precursor for storage or transport. But sequence comparisons show it’s a standard, highly conserved insect enzyme: Tribolium ~71%, Tenebrio ~70–72%, Leptinotarsa ~71.6%, Rhyzopertha ~73.3%. Annotated as UDP-glucuronosyltransferase 2C1, it’s clearly present across many non-luminous beetles.

Relaxing the filter

My original (“strict”) filter was:

Strict filter

  • Lantern TPM ≥ 50
  • sleuth b ≥ 3 (≈ 8×)
  • qval ≤ 1e-10

15,773 → 41 genes → 18 enzymes

Relaxed filter

  • Lantern TPM ≥ 20
  • sleuth b ≥ 2 (≈ 4×)
  • qval ≤ 0.001

15,773 → 117 genes → 94 enzymes (76 new)

I lowered the minimum lantern expression (50 → 20 TPM) and the fold-change requirement (8× → 4×), and loosened the significance cutoff (1e-10 → 1e-3). The goal was to catch genes that are less dramatically lantern-enriched but still clearly biased toward the lantern.
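The two threshold sets can be written as presets, which makes the fold-change arithmetic and the effect on luciferase explicit. This is a sketch under the same assumption the post makes, namely that b is a log2 fold change; the `Thresholds` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    min_tpm: float
    min_b: float
    max_qval: float

    def fold_change(self):
        """Minimum fold change implied by min_b, treating b as log2."""
        return 2 ** self.min_b

    def accepts(self, tpm, b, qval):
        return tpm >= self.min_tpm and b >= self.min_b and qval <= self.max_qval


STRICT = Thresholds(min_tpm=50, min_b=3, max_qval=1e-10)
RELAXED = Thresholds(min_tpm=20, min_b=2, max_qval=1e-3)

# Luciferase (PPYR_00001): lantern TPM, b, qval as quoted in this post.
luciferase = (66743.0, 5.27, 9e-5)

strict_keeps_luciferase = STRICT.accepts(*luciferase)    # False: qval too large
relaxed_keeps_luciferase = RELAXED.accepts(*luciferase)  # True

# Newly admitted enzymes: 94 under the relaxed filter minus the 18 strict ones.
new_enzymes = 94 - 18
```

With the numbers quoted here, the relaxed preset recovers luciferase, and the difference between the two enzyme counts gives the 76 new genes.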

Flagging interesting candidates

From the 76 new enzyme-annotated genes, I manually scanned GO terms and pulled out anything with chemistry plausibly relevant to luciferin biosynthesis:

  • Oxidoreductases (GO:0016491, GO:0055114) — luciferin pathways typically involve oxidation chemistry
  • FAD-binding enzymes (GO:0050660) — flavoenzymes are common small-molecule oxidizers
  • P450 / heme oxidases (GO:0016705, GO:0020037) — classic aromatic oxidation machinery
  • Monooxygenases (GO:0004497) — potential hydroxylation steps
  • Transferases that move acyl / sulfur / sugar groups — plausible modification of luciferin precursors
  • Anything tagged GO:0008218 (bioluminescence)
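I did this scan by hand, but the same triage could be automated roughly as follows. The GO IDs and labels are the ones listed above; the transferase bullet is left out because it is not tied to a specific GO ID in this post. The gene IDs and their annotations are hypothetical.

```python
# GO terms of interest, labeled the way they are grouped in the list above.
GO_FLAGS = {
    "GO:0016491": "oxidoreductase",
    "GO:0055114": "oxidoreductase",
    "GO:0050660": "FAD-binding",
    "GO:0016705": "P450 / heme oxidase",
    "GO:0020037": "P450 / heme oxidase",
    "GO:0004497": "monooxygenase",
    "GO:0008218": "bioluminescence",
}


def flag_reasons(go_terms):
    """Return the sorted, de-duplicated reasons a gene is worth a closer look."""
    return sorted({GO_FLAGS[t] for t in go_terms if t in GO_FLAGS})


# Hypothetical annotations: gene ID -> GO terms assigned to it.
annotations = {
    "HYPO_1": ["GO:0003824", "GO:0016705", "GO:0020037"],
    "HYPO_2": ["GO:0003824"],
    "HYPO_3": ["GO:0050660", "GO:0016491"],
}

flagged = {}
for gene, terms in annotations.items():
    reasons = flag_reasons(terms)
    if reasons:  # genes with only generic "catalytic activity" are not flagged
        flagged[gene] = reasons
```

Recording the reason alongside each gene keeps the later manual review focused on why a candidate was pulled out in the first place.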

PPYR_10049 Deprioritized

Annotated as “protein henna,” the Drosophila name for phenylalanine hydroxylase. A conserved insect enzyme in pterin/melanin-related metabolism. Appears across many beetles at ~76–80% identity (Tribolium ~79.8%, Diorhabda ~80.3%, Diabrotica ~77–80%). Not firefly-specific and unlikely to represent a luciferin pathway innovation.

PPYR_03580 Deprioritized

A glucose dehydrogenase (FAD, quinone)-like enzyme. Homologs show up across beetles (Tenebrio ~57.7%, Tribolium ~54.3%, Zophobas ~54.2%). A conserved FAD-dependent housekeeping dehydrogenase, not a specialized bioluminescence-pathway enzyme.


Final candidates

After filtering, manual review, and cross-species BLAST, these are the genes that survived. They share a key trait: their BLAST hit distributions are skewed toward luminous beetles, with much lower identity to non-luminous species, a pattern consistent with lineage-specific evolution.

PPYR_14756 Candidate

The BLAST hit list is all luminous beetles (multiple fireflies + Ignelater luminosus), with no obvious non-luminous beetles showing up. Sequence identity drops hard outside Photinus: ~95% to self, then down to ~62% even in other fireflies, and into the ~45–63% range more broadly. That pattern is consistent with a lineage-restricted enzyme that’s evolving fast.

It’s annotated as UGT 3A2-like, which is a different UGT subfamily from PPYR_07361 (UGT 2C1). UGTs attach sugars to small molecules. In a luciferin pathway, glycosylation is a very plausible move for storage, transport, detox control, or keeping a reactive intermediate stable.

HMMER domain hits for PPYR_14756 showing UDP-glucoronosyl and UDP-glucosyl transferase (PF00201), Erythromycin biosynthesis protein CIII-like C-terminal domain (PF06722), and Glycosyltransferase family 28 C-terminal domain (PF04101)
Figure 1 — HMMER domain architecture of PPYR_14756. Three domains detected: the core UDPGT domain (PF00201, E-value 1.7e-65), an erythromycin biosynthesis C-terminal domain (PF06722), and a glycosyltransferase family 28 C-terminal domain (PF04101). All belong to clan CL0113.
BLAST results for PPYR_14756 showing hits exclusively in luminous beetle species: Photinus pyralis, Pyrocoelia pectoralis, Aquatica leii, Lamprigera yunnana, Abscondita terminalis, and Ignelater luminosus
Figure 2 — BLAST hit list for PPYR_14756. Every visible hit is a luminous beetle species. Identity drops from 95% (self) to ~62% in other fireflies and ~45–48% more broadly. No non-luminous beetles appear in the top results.

PPYR_02911 Candidate

A CYP4C-type cytochrome P450 whose BLAST pattern is skewed toward luminous beetles: Photinus has multiple close paralogs (~74–96% identity, including the nearby duplicate PPYR_02910). Other fireflies: Pyrocoelia ~71.6%, Aquatica ~70.5%, Abscondita ~66.3%. Other luminous beetles: Ignelater luminosus ~58–64%, Lamprigera ~56–64%. The first clearly non-luminous beetle is Leptinotarsa decemlineata at ~49%, far down the list.

Two extra signals make this candidate worth keeping:

  1. The top hits are overwhelmingly luminous species. Even if it isn’t perfectly exclusive, the distribution is strongly biased.
  2. There are multiple Photinus paralogs plus a nearby neighbor (PPYR_02910), consistent with recent duplication, a common signature of genes that get recruited and specialized into pathway roles.
HMMER domain hit for PPYR_02911 showing a single Cytochrome P450 domain (PF00067) spanning positions 49 to 542
Figure 3 — HMMER domain architecture of PPYR_02911. A single Cytochrome P450 domain (PF00067) spans nearly the entire protein (positions 49–542), with strong E-values (independent: 1.4e-73, conditional: 5.8e-78).
BLAST results for PPYR_02911 showing top hits dominated by luminous beetle species, with the first non-luminous beetle (Leptinotarsa decemlineata) appearing far down the list at ~49% identity
Figure 4 — BLAST hit list for PPYR_02911. Top hits are dominated by Photinus CYP4C paralogs (74–96%), followed by other luminous beetles (58–72%). The first non-luminous beetle (Leptinotarsa decemlineata) appears near the bottom at ~49% identity.
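One way to make "skewed toward luminous beetles" less subjective is to count how many hits rank above the first non-luminous one. The sketch below uses only the single-valued identities quoted for PPYR_02911 (the ranges are omitted). The genus set is taken from the luminous genera named in this post, and classifying hits by genus name is a simplification I am assuming for illustration.

```python
# Luminous genera mentioned in this post.
LUMINOUS_GENERA = {
    "Photinus", "Pyrocoelia", "Aquatica", "Abscondita", "Lamprigera", "Ignelater",
}


def summarize_skew(hits):
    """Rank hits by identity; return (luminous hits above the first
    non-luminous hit, that first non-luminous hit or None)."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    for rank, (genus, identity) in enumerate(ranked):
        if genus not in LUMINOUS_GENERA:
            return rank, (genus, identity)
    return len(ranked), None


# (genus, percent identity) pairs quoted in the text for PPYR_02911.
hits_02911 = [
    ("Pyrocoelia", 71.6),
    ("Aquatica", 70.5),
    ("Abscondita", 66.3),
    ("Leptinotarsa", 49.0),
]

n_luminous_first, first_non_luminous = summarize_skew(hits_02911)

# Identity gap between the weakest luminous hit above it and the first non-luminous hit.
gap = min(i for g, i in hits_02911 if g in LUMINOUS_GENERA) - first_non_luminous[1]
```

A large gap combined with many luminous hits ahead of the first non-luminous one is the pattern I treated as a signal, although a summary like this cannot separate lineage-specific divergence from uneven database sampling.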

PPYR_02910 Candidate

PPYR_02910 shows the same overall pattern as PPYR_02911: the top BLAST hits are almost entirely luminous beetles, with the first clearly non-luminous species (Leptinotarsa) appearing much further down at ~51.9% identity. PPYR_02911 also appears as a top hit to PPYR_02910 (~87.1%), consistent with the two genes being closely related tandem paralogs; they sit next to each other on chromosome LG10.

Together with the presence of multiple CYP4C1-like copies in Photinus (74–95% identity), this points to a locally expanded CYP4C P450 gene cluster that has duplicated and diversified within fireflies, with homologs present but more diverged in non-luminous beetles.

Because PPYR_02910 shares ~87% sequence identity with PPYR_02911, I did not run a separate HMMER analysis. At that level of similarity, the two proteins almost certainly share the same domain architecture (a single Cytochrome P450 domain, PF00067), so a second scan would be expected to give essentially identical results.

BLAST results for PPYR_02910 showing top hits dominated by luminous beetle species including Photinus pyralis CYP4C paralogs and PPYR_02911 at 87.11% identity, with the first non-luminous beetle Leptinotarsa decemlineata at ~51.9% identity
Figure 5 — BLAST hit list for PPYR_02910. The tandem paralog PPYR_02911 appears at 87.1% identity, confirming their close relationship. Top hits are overwhelmingly luminous beetles (64–95%), with Leptinotarsa decemlineata as the first non-luminous species at ~51.9%.

PPYR_14056 Candidate

Annotated as 4-coumarate–CoA ligase 1-like. The BLAST distribution is striking: the visible hits are all luminous beetles (Photinus, Aquatica, Abscondita, Pyrocoelia, Lamprigera, Ignelater), with no obvious non-luminous beetles showing up. On top of that, the family looks massively expanded: Photinus alone has ~8–10 paralogs. That combination (luminous-only skew + big local expansion) is a classic signature of a lineage-specialized pathway module.

Mechanistically, the annotation makes sense. 4CL enzymes are adenylate-forming enzymes: they activate aromatic acids by forming an AMP intermediate, then often proceed to CoA thioester formation. That’s the same core chemistry luciferase uses, and luciferase is known to have evolved from this broader adenylate-forming enzyme superfamily. The domain confirmation shows:

  • AMP-binding enzyme domain (PF00501)
  • AMP-binding C-terminal domain (PF13193)

That is the same AMP-binding “luciferase / 4CL / acyl-CoA synthetase” architecture. With this many paralogs in luminous beetles, it’s very plausible the cluster contains a mix of functions: some true luciferases, and others that activate aromatic precursors upstream and feed them into luciferin biosynthesis rather than emitting light directly.

HMMER domain hits for PPYR_14056 showing AMP-binding enzyme domain (PF00501) spanning positions 24 to 397 and AMP-binding enzyme C-terminal domain (PF13193) spanning positions 446 to 524
Figure 6 — HMMER domain architecture of PPYR_14056. Two domains detected: AMP-binding enzyme (PF00501, E-value 1.1e-46) and AMP-binding enzyme C-terminal domain (PF13193, E-value 1.4e-13)—the same architecture shared by luciferase, 4-coumarate–CoA ligases, and acyl-CoA synthetases.
BLAST results for PPYR_14056 showing hits exclusively in luminous beetle species including Photinus pyralis, Lamprigera yunnana, Abscondita terminalis, Aquatica leii, and Ignelater luminosus, with massive paralog expansion in Photinus
Figure 7 — BLAST hit list for PPYR_14056. Every visible hit is a luminous beetle species. Photinus alone has ~8–10 4-coumarate–CoA ligase paralogs (39–97% identity), and the family is also expanded in Aquatica, Lamprigera, and Ignelater. No non-luminous beetles appear in the top results.

Sanity-checking the BLAST pattern

BLAST hit lists can be misleading with big gene families like CYP4. A “top hit” might just reflect generic CYP4 domain similarity rather than a true one-to-one ortholog.

So I used reciprocal best hit (RBH) logic as a reality check:

  1. Forward BLAST: start with a Photinus gene (e.g., PPYR_02910), BLAST it broadly, and pick a top hit in another species.
  2. Reverse BLAST: take that other-species hit and BLAST it back against the Photinus pyralis proteome.
  3. If it comes back to the same Photinus gene (or the same local cluster), that supports shared orthology rather than a random CYP4 match.
  4. If it comes back to a different CYP4 subfamily, the “luminous-only” interpretation may just be mixing subfamilies.
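The four steps above reduce to a small decision function once the best hits are tabulated. In this sketch the best-hit tables are plain dictionaries filled in by hand; `PYRO_HIT_1` is a hypothetical ID for a Pyrocoelia protein and `CYP4C_LG10` is a made-up cluster label, and the example values are for illustration rather than a record of my actual BLAST output.

```python
def reciprocal_check(query, forward_best, reverse_best, cluster_of):
    """Classify a forward/reverse best-hit pair for one query gene."""
    hit = forward_best.get(query)
    if hit is None:
        return "no forward hit"
    back = reverse_best.get(hit)
    if back is None:
        return "no reverse hit"
    if back == query:
        return "reciprocal best hit"
    # Tandem paralogs can steal the reverse hit; treat the local cluster as one unit.
    query_cluster = cluster_of.get(query)
    if query_cluster is not None and cluster_of.get(back) == query_cluster:
        return "same local cluster"
    return "different gene"


# Hypothetical best-hit tables.
forward_best = {"PPYR_02910": "PYRO_HIT_1"}   # Photinus gene -> best hit elsewhere
reverse_best = {"PYRO_HIT_1": "PPYR_02911"}   # that hit -> best hit back in Photinus
cluster_of = {"PPYR_02910": "CYP4C_LG10", "PPYR_02911": "CYP4C_LG10"}

verdict = reciprocal_check("PPYR_02910", forward_best, reverse_best, cluster_of)
```

Accepting "same local cluster" as a pass is a deliberate relaxation of strict RBH: with recent tandem duplicates, the reverse search can land on either copy without that meaning the subfamilies are mixed.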

Reciprocal BLAST suggests that PPYR_02910 and PPYR_02911 are not unique to fireflies, but belong to a conserved CYP4C ortholog group present across beetles. When I took top hits from luminous species like Pyrocoelia and blasted them back against the Photinus proteome, they didn’t point to a single gene but to the same local CYP4C cluster. Doing the same with the first clearly non-luminous beetle hit (Leptinotarsa) gave the same outcome.

However, there’s still an interesting pattern in the identity values. Homologs from luminous beetles are much more similar to the Photinus CYP4C cluster (~73–77%) than is the homolog from the first non-luminous beetle tested (~51–54%), and Photinus itself contains multiple tandem duplicates in this region. Taken together, this suggests the CYP4C family is broadly conserved, but has undergone lineage-specific expansion and divergence in luminous beetles, which could indicate functional specialization in the lantern.

Phylogenetic analysis (IQ-TREE)

A BLAST search shows pairwise similarity, but it doesn’t reveal evolutionary relationships. To understand how these CYP4C genes evolved, I built a small phylogenetic tree using 12 protein sequences: four from Photinus (including my two candidates and an outgroup), five from other luminous beetles, and three from non-luminous beetles. MAFFT was used to align the sequences, and IQ-TREE inferred the most likely evolutionary relationships.
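For anyone wanting to reproduce this kind of tree, the two steps can be driven from a short script. The options shown (`--auto` for MAFFT; `-m MFP` for automatic model selection and `-B 1000` for ultrafast bootstrap in IQ-TREE 2) are common choices that I am assuming for the sketch, not a record of the exact settings behind the tree described here. The file names are hypothetical, and `-o` assumes the outgroup sequence is named `PPYR_06980` in the FASTA file.

```python
import subprocess


def build_commands(fasta, alignment, outgroup=None, model="MFP", bootstraps=1000):
    """Return (mafft_argv, iqtree_argv) without running anything."""
    # MAFFT writes the alignment to stdout, so the caller redirects it to a file.
    mafft_cmd = ["mafft", "--auto", fasta]
    iqtree_cmd = ["iqtree2", "-s", alignment, "-m", model, "-B", str(bootstraps)]
    if outgroup is not None:
        iqtree_cmd += ["-o", outgroup]  # root the tree on this sequence
    return mafft_cmd, iqtree_cmd


def run_pipeline(fasta, alignment, **kwargs):
    """Align, then infer the tree. Requires mafft and iqtree2 on PATH."""
    mafft_cmd, iqtree_cmd = build_commands(fasta, alignment, **kwargs)
    with open(alignment, "w") as out:
        subprocess.run(mafft_cmd, stdout=out, check=True)
    subprocess.run(iqtree_cmd, check=True)


# Build (but do not run) the commands for a hypothetical 12-sequence input.
mafft_cmd, iqtree_cmd = build_commands(
    "cyp4c_12seqs.fasta", "cyp4c_12seqs.aln", outgroup="PPYR_06980"
)
```

Keeping command construction separate from execution makes it easy to inspect or log the exact arguments before committing to a run.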

The results were consistent and informative:

  • PPYR_02910 and PPYR_02911 cluster tightly together, confirming they are recent paralogs, likely from a tandem duplication in Photinus.
  • The CYP4C genes from other luminous beetles (Pyrocoelia, Aquatica, Abscondita, Lamprigera, Ignelater) form a distinct subcluster whose members sit closer to one another than to the non-luminous beetle sequences.
  • Non-luminous beetles branch earlier and sit outside this group, matching the identity gap seen in BLAST (~70%+ in luminous vs. ~50% in non-luminous).
  • The outgroup (PPYR_06980, a CYP4G15) falls clearly outside the CYP4C cluster, confirming the analysis isn’t mixing different P450 families.

How my candidates fit into the proposed pathway

The current working model from Zhang et al. (2020) proposes that luciferin is derived from tyrosine through a series of oxidation and activation steps, eventually forming p-benzoquinone, which then reacts with cysteine to produce luciferin.

Zhang et al. 2020 proposed luciferin biosynthesis pathway showing the steps from Tyrosine through TAT, HPPD, PO, 4CL, ABC-D transporter, ScpX, cysteine additions, ACOT1, and finally luciferase producing light
Figure 8 — Zhang et al. (2020) proposed luciferin biosynthesis pathway. PPYR_14056 sits at the 4CL (4-coumarate–CoA ligase) step. The two cysteine additions may occur spontaneously without dedicated enzymes.

Most of the early steps (TAT, HPPD) are standard metabolism and well understood. The uncertainty begins in the middle of the pathway, especially around how benzoquinone is generated, how it’s activated and transported, and how reactive intermediates are handled safely. That’s where my candidate genes sit.

The key insight from newer work

The biggest shift in thinking is this:

The benzothiazole ring formation (the cysteine additions) may not need a dedicated enzyme at all. It can happen spontaneously if p-benzoquinone is present, cysteine is present, and the environment is right.

If that’s true, then the real biological problem isn’t “how do you close the ring?” It’s: How does the lantern reliably produce, control, and localize the quinone precursor?

That reframing makes enzymes like CYP4C P450s (oxidation of aromatic precursors), UGTs (stabilizing or transporting reactive intermediates), and 4CL-like enzymes (activating aromatic acids) look very relevant to the missing steps.