DOI : 10.17577/IJERTV15IS060665
- Open Access

- Authors : Sunita Gaur, Dr. Yasmin Shaikh, Ashmeet Singh, Bhuwan Verma
- Paper ID : IJERTV15IS060665
- Volume & Issue : Volume 15, Issue 06 , June – 2026
- Published (First Online): 17-06-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
A Multi-Layered Forensic Framework for Information Cascade Analysis and Source Origin Attribution
Sunita Gaur
School of Computer Science & Information Technology, Devi Ahilya Vishwavidyalaya (DAVV), Indore, India
Dr. Yasmin Shaikh
International Institute of Professional Studies, Devi Ahilya Vishwavidyalaya (DAVV), Indore, India
Ashmeet Singh
School of Computer Science & Information Technology, Devi Ahilya Vishwavidyalaya (DAVV), Indore, India
Bhuwan Verma
School of Computer Science & Information Technology, Devi Ahilya Vishwavidyalaya (DAVV), Indore, India
Abstract-Online misinformation does not stay still. A fabricated narrative begins at a single account, picks up momentum through repost chains, gets amplified by semi-automated behavior, and becomes difficult to contain once it enters densely connected subgraphs. Investigators dealing with this problem routinely find that neither content classification alone nor account profiling alone is sufficient. What matters is understanding all three signals together: what the content says, who is spreading it, and how it moved.
This paper presents TraceFlow, a forensic investigation framework built around that insight. The system brings together a Bi-Directional Long Short-Term Memory (Bi-LSTM) model for text-based deception detection, a Random Forest classifier called SocialGuardian for account behavioral profiling, and a distributed PySpark graph engine for propagation path reconstruction. A Reverse Path Traversal algorithm walks the diffusion graph backward from suspicious nodes to identify origin accounts, and an Independent Cascade Model (ICM) simulation estimates how far a cascade could continue if left unchecked.
Experiments on consumer-grade hardware with an 8 GB RAM ceiling show classification accuracy above 90% across combined semantic and behavioral signals, source localization accuracy above 90% on graphs with tens of thousands of edges, and stable memory utilization within the hardware constraint. This work argues that the combination of text analysis, metadata-driven profiling, and temporal graph traversal provides meaningfully stronger forensic evidence than any single channel in isolation.
Index Terms-Digital Forensics, Information Cascade, Bi-LSTM, PySpark, Random Forest, Graph Analytics, Fake News Detection, Social Network Analysis, Source Localization, Misinformation Detection
I. INTRODUCTION
A. The Shifting Landscape of Online Information
Something changed when social media platforms became the primary channel through which many people receive news. The change was not just about speed, though speed matters. The deeper shift was structural: anyone can now publish to a global audience without editorial review, and the mechanics of platform recommendation mean that engaging content, regardless of its accuracy, tends to spread further and faster than dull but truthful content.
Research has consistently shown that false news travels faster and reaches more people than verified reports [1]. A widely cited study found that false stories were 70% more likely to be retweeted than true ones, and that the effect was not explained by bot activity alone; human users were actively participating in the spread [2]. This is not simply a technology problem. It reflects something about how people make decisions in social environments: when surrounding peers appear to accept a piece of information, individuals are more likely to share it without independent verification. That mechanism is what researchers call an information cascade.
An information cascade, at its most basic, is what happens when individuals base their decisions more on what others around them have done than on their own private knowledge [3]. Applied to online platforms, a cascade occurs when users repost or amplify content because it appears to have already been endorsed by others. This makes the cascade self-reinforcing: the more people share something, the more credible it looks to the next person who encounters it. A cascade does not need to contain false information to function; it is a neutral mechanism. But when false or manipulative content enters a cascade, the mechanism becomes dangerous [4], [5].
What makes this forensically interesting is that a fully matured cascade can make a piece of misinformation look far more credible than it actually is. By the time investigators identify a suspicious narrative, it may already have thousands of instances spread across multiple platforms, making it genuinely difficult to determine where it started and which accounts drove its early amplification [6].
B. Automated Accounts and Coordinated Amplification
The problem would be considerably more tractable if all participants in a misinformation cascade behaved like ordinary users. In practice, they often do not. Automated accounts, commonly called bots, and semi-automated accounts operated with some degree of human guidance are frequently embedded in early propagation chains. These accounts can post at rates no human could sustain, create artificial volume around a narrative, and exploit follower networks to make fringe viewpoints appear mainstream [7], [8].
What makes bot detection genuinely difficult is that modern automated accounts have become quite good at mimicking human behavior. They post at intervals that follow human-like diurnal patterns, they follow and unfollow at rates that do not immediately trigger platform detection heuristics, and they sometimes maintain long posting histories before pivoting to amplification campaigns [9]. Detecting them from text content alone is not reliable. Detecting them from behavioral metadata (posting frequency, follower ratios, account age, application signatures) tends to be more consistent [10].
Even when bot activity is identified, attribution remains hard. A given post might have been retweeted by thousands of accounts, some automated and some human. Identifying which of those accounts were structurally important to the spread, which ones were bridges rather than endpoints, requires graph-level analysis, not just per-account classification.
C. Why Treating Signals in Isolation Fails
Most forensic pipelines developed in academic and applied research settings treat these signals as separate processing steps. A common pattern is to first classify whether a piece of content appears deceptive based on its text, then separately flag accounts that look automated based on their metadata, and finally, if resources permit, examine how the content moved through the network. This pipeline makes practical sense when each step is built by a different team using different infrastructure, but it has structural weaknesses that compound in real investigations.
The core problem is that the origin of a cascade is typically invisible unless you look at all three signals simultaneously [11]. An early-stage seed account spreading a piece of misinformation may have relatively normal content (the deception is subtle), may not look particularly bot-like (it might be a real person acting deliberately), but reveals itself through graph structure: it is the account that first connected a piece of content to an amplification network, at a timestamp that precedes everyone else. Finding that account requires integrating language analysis, behavioral scoring, and chronological edge traversal into a single coherent process [12], [13].
This integration is exactly what TraceFlow is designed to provide.
D. Problem Statement
The question a digital forensics investigator actually needs answered during a misinformation incident is not simply "is this content false?" That question, while useful, is only a starting point. The investigator needs to know: who started this? Which accounts gave it early momentum? How did it reach the population it reached? And if nothing is done, how much further will it spread?
Current systems do not answer all four questions well. They classify content, they score accounts, and occasionally they visualize network structure. But they rarely reconstruct the chronological diffusion trajectory in a way that identifies origin nodes with confidence, and they almost never combine that reconstruction with forward-looking cascade simulation that estimates future reach.
TraceFlow addresses this gap. It is not a content moderation tool and it does not claim to determine real-world truth. It is a forensic investigation framework, one that helps analysts build an evidentiary picture of how a suspicious narrative entered a network, which accounts amplified it, and what the network topology implies about future spread.
E. Research Objectives
The specific goals of this work are as follows. First, to build a distributed graph-processing engine that can handle large interaction datasets on hardware that a typical research laboratory or small forensics unit actually has access to, not a server cluster. Second, to integrate semantic content analysis and behavioral account profiling into the same investigation pipeline so that evidence from both channels informs the same output score. Third, to develop and evaluate a Reverse Path Traversal algorithm capable of identifying origin nodes more accurately than centrality-based baselines. Fourth, to implement an Independent Cascade Model simulation that gives investigators a probabilistic estimate of future propagation. And fifth, to present these results through an interactive visualization layer readable by non-technical analysts who need to brief decision-makers.
F. Scope and Threat Model
TraceFlow operates under a specific threat model that shapes what it can and cannot do. The assumed scenario is this: a small group of seed accounts, which may include deliberate human actors, automated accounts, or a combination, injects a narrative into a platform. A larger set of amplifier accounts, some automated and some not, boosts that narrative through reposts and reactions. As the narrative gains apparent credibility, ordinary users who are not part of any campaign begin sharing it in good faith. By the time investigators are alerted, the cascade has matured.
The framework handles retrospective analysis over logged interaction data, and near-real-time analysis when a live event feed is available. It does not operate as a real-time streaming platform in its current form, a limitation discussed in Section VII. It does not make definitive truth judgments; an analyst always has the final say. And it does not currently extend to cross-platform diffusion that would require aggregating data from sources with different API structures.
Within those boundaries, the framework addresses three questions that recur in every such investigation: where did this start, who amplified it early, and what does the graph suggest about how far it will go?
II. BACKGROUND AND RELATED WORK
A. Information Cascades: Theory and Observation
The theory of information cascades has roots in economic decision theory. Bikhchandani et al.'s foundational work showed that rational individuals will sometimes ignore their own private information and simply follow the crowd, because a sufficiently long sequence of prior actions provides a stronger signal than personal evidence [3]. That theoretical result, developed for sequential decision-making in markets, translates uncomfortably well to social media behavior: users see engagement counts, like totals, and repost tallies, and use those signals as social proof when deciding whether to share.
Empirical studies on large-scale Twitter data have confirmed these theoretical predictions. Leskovec et al. analyzed information diffusion across blog networks and found that cascades tend to be wide and shallow rather than deep: most content spreads quickly across many nodes in the first generation, not through long chains [4]. This structural insight matters for TraceFlow's design: because most spreading occurs in early generations, correctly identifying origin nodes is both tractable (there are not many candidate ancestors) and critical (the origin shapes everything downstream).
Later work by Chen et al. demonstrated that cascade dynamics vary substantially based on platform affordances and topic type [5]. Political misinformation tends to produce deeper cascades than lifestyle misinformation, because politically motivated users are more likely to actively repost content to their followers rather than simply like it. Survey work by Wang et al. identified several recurring structural signatures of coordinated inauthentic behavior, including unusual timing coordination across accounts and star-shaped amplification topology, that are now used as detection heuristics [6].
B. Automated Account Detection
The literature on bot detection has evolved considerably since early rule-based classifiers that relied on simple thresholds like "posts more than 200 times per hour." Modern approaches, and the accounts they are trying to detect, have grown considerably more sophisticated.
Ferrara et al.'s comprehensive survey identified six major types of social bots, ranging from simple spam bots with obvious behavioral signatures to sophisticated social bots that maintain months-long posting histories, engage in genuine-seeming conversations, and exploit trending topics strategically [7]. The Hoaxy and Botometer research demonstrated that bots tend to be systematically positioned at early points in misinformation cascades, not necessarily as originators, but as amplifiers that provide the early volume that makes content appear credible to subsequent human users [8].
More recent work has focused on the adversarial nature of the detection problem. Account operators who know that detection tools look for high posting frequency simply throttle their posts. Operators who know that follower-to-following ratio is a signal maintain more balanced ratios. This adversarial dynamic means that any fixed feature set will degrade over time, and behavioral profiling tools need periodic retraining [9], [10].
The approach taken in this work, Random Forest classification on account metadata features, is established and interpretable rather than cutting-edge. The choice is deliberate: for forensic applications where investigators must explain their methods in court or in policy reports, the ability to describe exactly which features drove a classification decision matters more than achieving the highest possible AUC.
C. Graph-Based Source Localization
The problem of identifying the origin of a spreading process on a network has connections to epidemiology, where the analogous problem is identifying the patient zero of an outbreak. Shah and Zaman introduced the concept of the "rumor centrality" measure, which assigns a score to each node representing how well-positioned it would be to produce the observed diffusion pattern [14]. Their work established that the maximum likelihood estimate of the source under a susceptible-infected spreading model is often not the most connected node, but rather a node that sits at the structural "center" of the observed diffusion subgraph.
Subsequent work explored various approximations and extensions of this idea. Betweenness centrality, which measures how often a node sits on the shortest path between other nodes, has been widely used as a proxy for source localization because it tends to identify structurally important nodes. Eigenvector centrality, which scores nodes by the quality of their neighbors, captures a different notion of importance. Neither of these measures, however, incorporates temporal information. They identify structurally important nodes based on static graph structure, which means they are easily confused by amplification hubs that joined a cascade late but accumulated many edges.
TraceFlow's Reverse Path Traversal algorithm takes a different approach by treating timestamps as first-class constraints [13], [15]. An account that appears structurally central but that received the content after a smaller, less-connected account cannot be the origin, and the algorithm enforces this by rejecting edges that violate chronological order. The difference in source localization accuracy, documented in Section V, is substantial.
D. Multimodal Misinformation Analysis
The field has increasingly moved toward multimodal approaches that combine textual signals with visual content, user metadata, and network structure [16]. Systems like FakeNewsNet and related benchmarks specifically emphasize the importance of combining news article content with social context (the propagation pattern) rather than treating content classification as a standalone task [12].
The semantic side of TraceFlow draws on this tradition. Shu et al.'s foundational fake news detection work demonstrated that language patterns in misinformation differ systematically from factual content: misinformation tends toward stronger emotional language, lower syntactic complexity, and patterns of assertiveness that do not match the hedging typical of factual reporting [11]. These patterns are exactly what bidirectional sequence models are good at capturing, because they can identify not just individual word choices but relationships between words across the full sequence.
The decision to use a Bi-LSTM rather than a transformer architecture like BERT deserves explanation. Transformer models consistently outperform recurrent architectures on NLP benchmarks, and BERT in particular has been used successfully for misinformation detection in several studies [17], [18]. However, transformer models are memory-intensive. A standard BERT model has more than a hundred million parameters, and its working memory footprint can reach several gigabytes once activations and batch data are added to the weights. Under an 8 GB RAM constraint with no dedicated GPU, transformer inference is either very slow or impractical without significant quantization, which itself introduces accuracy degradation. The Bi-LSTM provides a reasonable operating point on the accuracy-memory tradeoff curve.
III. SYSTEM ARCHITECTURE
Raw Social Streams: JSON Interaction Logs • User Metadata • Timestamped Repost Records
Data Engineering Layer: Apache Spark 4.1 Ingestion • Broadcast Join Edge Generation • MEMORY_AND_DISK Persistence • Partition-Aware Distributed Execution
Forensic Logic Layer: Bi-LSTM Semantic Analysis • SocialGuardian Random Forest Profiling • Reverse Path Traversal • Independent Cascade Model Simulation
Visualization Layer: Next.js Investigation Dashboard • D3.js Graph Rendering • Analyst Evidence Summary • Interactive Risk Band Display
Forensic Intelligence Output: Threat Classification • Source Account Attribution • Chronological Cascade Reconstruction • Forward Spread Estimate
TraceFlow is organized across four processing layers, each with a distinct function in the forensic pipeline. Raw social media data enters at the top, passes through a data engineering layer that prepares it for parallel computation, flows into a forensic logic layer where the actual analysis happens, and exits through a visualization layer that presents findings to the analyst. Fig. 1 shows this flow.
Fig. 1. TraceFlow Multi-Layered Forensic Framework Architecture
The design deliberately avoids making the layers dependent on each other's output in a strictly sequential way. The semantic model, the behavioral classifier, and the graph traversal engine each produce independent scores. These are combined only at the output stage by the hybrid risk scoring function. This independence is intentional: it allows investigators to inspect each channel's contribution separately, which matters when they need to explain their conclusions.
A. Data Engineering Layer
Every analysis begins with raw data: JSON files containing interaction logs, user profile records, and timestamped repost or reply chains. In practice, these files tend to be large. A single day's worth of interaction data from a moderately active topic thread can easily run into hundreds of thousands of records. Processing this at scale while staying within a consumer hardware memory budget requires careful attention to how Spark distributes and persists data.
The framework ingests data through Apache Spark 4.1 accessed via PySpark. Spark is a natural choice for this kind of workload because it distributes computation across partitions and can spill intermediate results to disk when memory is insufficient, a property that matters enormously for the 8 GB constraint under which this framework was built and tested.
Early versions of the ingestion pipeline encountered consistent problems. When generating graph edges from large datasets (the operation that creates the (source, target, timestamp, interaction_type) tuples from raw interaction logs), Spark's shuffle operations would attempt to hold large intermediate tables in memory simultaneously, causing OutOfMemory exceptions in the Java Virtual Machine that underlies Spark's execution engine. Java's garbage collector was repeatedly invoked to reclaim memory, but it could not keep pace with the shuffle demand, causing long pauses and eventually executor failure.
Three changes resolved this. First, MEMORY_AND_DISK persistence was applied to DataFrames that would be reused across multiple operations. This tells Spark to keep partitions in memory when it can but to write them to the NVMe SSD when it cannot, rather than recomputing them from scratch or failing. Second, broadcast joins were used for metadata lookups. When joining a large interaction DataFrame against a smaller user metadata table, broadcasting the smaller table to all Spark executors eliminates the need to shuffle the large table for that join. Third, the number of shuffle partitions was tuned down from Spark's default of 200, an appropriate default for large clusters but excessive for a single machine, where each additional partition introduces coordination overhead.
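The three changes above can be sketched in PySpark roughly as follows. This is a minimal illustration, not the framework's actual ingestion code: the column names (`source_id`, `target_id`, `timestamp`, `interaction_type`, `user_id`), the file paths, and the shuffle-partition value of 8 are all assumptions made for the example, since the text gives only the default of 200 that was tuned down.

```python
# Assumed local-mode setting; the paper states only that the default (200) was reduced.
LOCAL_SPARK_CONF = {
    "spark.sql.shuffle.partitions": "8",  # hypothetical value for a single machine
}


def build_session(app_name="traceflow-ingest"):
    """Create a local SparkSession with the reduced shuffle-partition count."""
    from pyspark.sql import SparkSession  # imported lazily so the module loads without Spark

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in LOCAL_SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def build_edges(spark, interactions_path, users_path):
    """Return (edges, edges_with_meta) built from JSON logs (hypothetical schema)."""
    from pyspark import StorageLevel
    from pyspark.sql.functions import broadcast, col

    interactions = spark.read.json(interactions_path)
    users = spark.read.json(users_path)  # assumed small enough to broadcast

    edges = interactions.select(
        col("source_id").alias("Source_Node"),
        col("target_id").alias("Target_Node"),
        col("timestamp").alias("Timestamp"),
        col("interaction_type").alias("Interaction_Type"),
    ).orderBy("Timestamp")

    # Keep partitions in memory when possible, spill to disk otherwise.
    edges = edges.persist(StorageLevel.MEMORY_AND_DISK)

    # Broadcasting the small metadata table avoids shuffling the large edge table.
    edges_with_meta = edges.join(
        broadcast(users), edges["Source_Node"] == users["user_id"], "left"
    )
    return edges, edges_with_meta
```

A caller would run `spark = build_session()` followed by `build_edges(spark, "interactions.json", "users.json")`, where both paths are placeholders.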
The output of this layer is a set of immutable, chronologically ordered edge structures:
("Source_Node", "Target_Node", "Timestamp", "Interaction_Type")
These structures are the foundation for everything that happens in the forensic logic layer. The immutability is important: graph traversal algorithms that walk backward through timestamps must trust that the edges they are reading have not been modified by concurrent processes.
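Outside Spark, the same edge structure can be illustrated in plain Python. The sketch below assumes integer Unix timestamps and uses a `NamedTuple` so that individual edges cannot be mutated after creation; the class and function names are hypothetical and chosen only for this example.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One immutable interaction edge (field names mirror the tuple in the text)."""
    source_node: str
    target_node: str
    timestamp: int          # assumed Unix epoch seconds
    interaction_type: str


def freeze_edges(rows):
    """Return edges as an immutable tuple sorted chronologically."""
    return tuple(sorted((Edge(*row) for row in rows), key=lambda e: e.timestamp))


edges = freeze_edges([
    ("B", "C", 20, "repost"),
    ("A", "B", 10, "repost"),
])
```

Because both the outer container and each record are tuples, any attempt to assign to a field raises `AttributeError`, which is the property the traversal relies on.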
B. Forensic Logic Layer
The forensic logic layer contains three independent processing modules that run over the same edge data from different analytical angles. They are independent in the sense that none reads the output of the others during processing: each produces its own score, and those scores are combined afterward. The three modules are the Deep Semantic Content Core, the SocialGuardian Behavioral Core, and the Topographical Source Traversal Core.
a) Deep Semantic Content Core:
The job of this module is to examine the text content associated with each node (posts, replies, captions) and estimate the probability that it reflects intentional deception rather than honest expression [11], [17].
The model architecture is a Bi-Directional Long Short-Term Memory (Bi-LSTM) network implemented in TensorFlow. To understand why a bidirectional variant is preferable to a standard LSTM for this task, it helps to think about what a unidirectional LSTM does: it reads a sequence of words from left to right and builds up a running state that captures context from everything it has seen so far. By the time it reaches the last word in a sentence, it has accumulated context from the entire preceding sequence. This works reasonably well, but it means that the model's interpretation of early words in the sentence is made without knowledge of the words that follow, and in misinformation text, the meaning of early words often depends critically on what comes later.
A Bi-LSTM solves this by running two separate LSTMs over the same input: one reads left to right in the normal direction, and one reads right to left. The output for any given position in the sequence is the concatenation of the states from both directions at that position. The model therefore has access to full bidirectional context when making its assessment of any word or phrase [18].
For this task, misinformation text tends to exhibit patterns that only become clear with full-sequence context. Sarcasm is a common example: "What a completely reliable source this is" cannot be identified as sarcasm without the surrounding discourse context and the recognition that the final word "is" completes a rhetorical pattern. Adversarial phrasings that deliberately front-load credible-sounding hedging language before making an unverified claim are another example. The Bi-LSTM's bidirectional context helps it recognize these patterns.
The text processing pipeline tokenizes input sequences at the word level, pads or truncates them to a uniform length of 120 tokens, and maps each token to a 128-dimensional embedding. The LSTM cells are configured with 64 units in each direction, for a total of 128 units per position after concatenation. Dropout at a rate of 0.35 is applied during training to prevent the model from memorizing specific phrasings from the training corpus. The final layer uses a sigmoid activation, producing a probability between 0 and 1 that represents the model's estimated likelihood that the input text is deceptive.
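The configuration just described might look like the following Keras sketch. The hyperparameters (120 tokens, 128-dimensional embeddings, 64 units per direction, dropout 0.35, sigmoid output) come from the text; the vocabulary size, the padding token id of 0, the placement of dropout after the recurrent layer, and the optimizer and loss are assumptions made for illustration.

```python
MAX_LEN = 120      # uniform sequence length from the text
EMBED_DIM = 128    # embedding dimension from the text
LSTM_UNITS = 64    # units per direction; 128 after concatenation
DROPOUT = 0.35     # dropout rate from the text


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=0):
    """Clip a token-id list to max_len, then right-pad with pad_id (assumed 0)."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


def build_model(vocab_size=20000):
    """Build a Bi-LSTM classifier; vocab_size is a hypothetical default."""
    import tensorflow as tf  # imported lazily so the helpers work without TensorFlow

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=EMBED_DIM),
        # Bidirectional concatenates forward and backward states by default.
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(LSTM_UNITS)),
        tf.keras.layers.Dropout(DROPOUT),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Inputs would be prepared with `pad_or_truncate` before being batched into an integer array of shape `(batch, 120)`.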
b) SocialGuardian Behavioral Core:
While the semantic module examines what accounts post, the SocialGuardian module examines how they behave, entirely from metadata, with no reference to post content [9], [10]. The distinction is important because many sophisticated misinformation operations specifically engineer their content to pass textual classifiers, while behavioral patterns are harder to disguise consistently over time.
The classifier uses Random Forest, an ensemble method that builds many decision trees over random subsets of training data and features, then aggregates their votes. Random Forest tolerates class imbalance reasonably well (genuine misinformation actors are rare compared to normal users), it produces probability estimates rather than just binary outputs, and critically, it allows inspection of feature importances, a property with real value when investigators need to explain why a specific account was flagged.
The feature set covers eight behavioral dimensions. The follower-to-following ratio captures accounts that are aggressively following many accounts while not attracting many followers in return, a common pattern during account-building phases. Posting frequency measures average posts per hour, normalized by account age. Account age itself is a feature: newly created accounts with high activity are disproportionately likely to be automated. Verification status functions as a soft prior: verified accounts warrant reduced scrutiny, though not immunity (the institutional dampening factor discussed in Section IV handles this). Temporal posting burst detection flags accounts that show highly clustered post timing inconsistent with normal human behavior across a day. Application usage signatures identify whether posts originate from third-party automation APIs rather than native clients. Finally, metadata consistency checks look for internal contradictions in account profile data, inconsistent profile edit timing, mismatched location signals, and similar artifacts.
Each account that appears in the interaction graph receives a probabilistic risk score between 0 and 1 from SocialGuardian. A score near 0 indicates that the account's behavior closely matches the distribution of normal users in the training data. A score near 1 indicates that the behavioral profile is highly consistent with known automated or semi-automated accounts.
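A small sketch of how such metadata features could feed a Random Forest is shown below. It covers only four of the features named above, and the field names, the number of trees, and the use of `class_weight="balanced"` are assumptions for illustration rather than the SocialGuardian configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccountMeta:
    """Hypothetical per-account metadata record."""
    followers: int
    following: int
    total_posts: int
    account_age_hours: float
    verified: bool


def behavioral_features(m):
    """Map metadata to [follower ratio, posts per hour, account age, verified flag]."""
    ratio = m.followers / max(m.following, 1)                  # guard against zero
    posts_per_hour = m.total_posts / max(m.account_age_hours, 1.0)
    return [ratio, posts_per_hour, m.account_age_hours, 1.0 if m.verified else 0.0]


def train_and_score(train_rows, train_labels, query_rows):
    """Fit a Random Forest and return (risk scores, feature importances).

    Assumes labels are 0 (normal) and 1 (automated), both present in training.
    """
    from sklearn.ensemble import RandomForestClassifier  # lazy import

    clf = RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    )
    clf.fit(train_rows, train_labels)
    # Column 1 of predict_proba is the probability of class 1.
    return clf.predict_proba(query_rows)[:, 1], clf.feature_importances_


suspect = AccountMeta(followers=50, following=5000, total_posts=7200,
                      account_age_hours=24.0, verified=False)
suspect_features = behavioral_features(suspect)
```

The returned feature importances are what an analyst would cite when explaining why a given account was flagged.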
c) Topographical Source Traversal Core:
The third module works on the graph itself, not on individual account content or metadata [5], [14]. Its function is to take the chronologically ordered edge structures produced by the data engineering layer and trace them backward to find the account most likely to be the origin of the cascade.
The key insight underlying this module is that propagation graphs have temporal structure that purely static graph metrics ignore. When content spreads through a network, edges are created in time order: the source posts first, early adopters repost, and later followers repost after that. A legitimate origin node is one that appears in the earliest edges of the subgraph containing the suspicious content. An amplification hub, a popular account that retweeted something much later, may have many more edges in the static graph, but its edges all have later timestamps. The traversal algorithm enforces this constraint explicitly.
The module executes a backward traversal from a target node (the most recently flagged suspicious post) through incoming edges, checking at each step that the timestamp of the edge being traversed is earlier than the timestamp of the current node. This eliminates late-arriving amplifiers regardless of their structural connectivity. The traversal continues until it reaches a node with no valid incoming edges earlier than its own timestamp; that node is the candidate origin.
The practical advantage of this approach over centrality metrics is significant. Betweenness centrality identifies the node that most traffic passes through in the static graph. Eigenvector centrality identifies the node with the most important neighbors. Neither of these correlates reliably with temporal origin. The evaluation in Section V demonstrates this gap quantitatively.
IV. ALGORITHMS AND IMPLEMENTATION
A. Reverse Path Traversal
The Reverse Path Traversal algorithm is best understood through a simple example before the formal description. Imagine a cascade that moves through four accounts, A, B, C, D, with A posting first and each subsequent account reposting from the one before. In the static graph after the cascade is complete, B, C, and D each have one incoming edge. A has none, because A was the original poster. The algorithm starts at D, finds the incoming edge from C with timestamp t3, verifies that t3 < tD (where tD is the time D reposted), and moves to C. It finds the incoming edge from B at t2 < tC, and moves to B. It finds the incoming edge from A at t1 < tB, and moves to A. A has no valid incoming edges, so it is identified as the candidate origin.
This is the simple case. Real cascades are not linear chains. A given node may have multiple incoming edges from different accounts posting at different times. The algorithm selects the earliest valid incoming edge at each step, because the most temporally upstream account in a cascade is the most likely origin. If two accounts have simultaneous timestamps on their edges, an unusual but possible situation in bot coordination, the algorithm scores both as candidates and presents them to the analyst.
The formal algorithm operates on a directed graph G = (V, E) where V is the set of user accounts and E is the set of timestamped interaction edges. Each edge e ∈ E carries the fields (u, v, t, T) where u is the source account, v is the target account, t is the timestamp in Unix epoch format, and T is the interaction type.
• Add u to Q.
• If t < t*, update t* = t and S* = u.
5) When Q is empty, return S* as the candidate origin and t* as the estimated origin timestamp.
The timestamp constraint in step 4 is what distinguishes this from an ordinary breadth-first backward traversal. Without it, the algorithm would eventually reach any node in the connected component, including very recent accounts that clearly did not originate anything. The constraint ensures that only genuinely earlier accounts can be candidate ancestors.
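A timestamp-constrained backward traversal of this kind can be sketched in plain Python as follows. This is one possible reading, not the authors' implementation: it assumes that a node's reference time is the timestamp of the edge through which the traversal reached it (with no upper bound for the target), and the function and variable names are hypothetical.

```python
import math
from collections import defaultdict, deque


def reverse_path_traversal(edges, target):
    """Walk incoming edges backward from target, honoring chronological order.

    edges: iterable of (u, v, t) meaning v received content from u at time t.
    Returns (S*, t*): the candidate origin and its estimated origin timestamp.
    """
    incoming = defaultdict(list)
    for u, v, t in edges:
        incoming[v].append((u, t))

    # bound[x]: latest time at which x is known to have passed the content on.
    bound = {target: math.inf}
    origin, t_star = target, math.inf
    queue = deque([target])

    while queue:
        v = queue.popleft()
        for u, t in incoming.get(v, ()):
            if t >= bound[v]:
                continue  # reject edges that violate chronological order
            if t < t_star:
                t_star, origin = t, u
            # Re-visit u only if this edge loosens its time bound.
            if t > bound.get(u, -math.inf):
                bound[u] = t
                queue.append(u)
    return origin, t_star


# Linear chain A -> B -> C -> D, plus a later hub H and an out-of-order edge Z -> A.
demo_edges = [
    ("A", "B", 10), ("B", "C", 20), ("C", "D", 30),
    ("H", "C", 25),   # valid, but later than A's edge
    ("Z", "A", 50),   # rejected: arrives after A already passed the content on
]
origin, origin_time = reverse_path_traversal(demo_edges, "D")
```

In the demo graph the hub H is explored but cannot displace A, and the edge from Z is discarded because its timestamp is later than the time A forwarded the content.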
One im.ortant .ractical consideration2 interaction gra.#s often contain edges wit# nearly identical timestam.s due to coordinated .osting! 0#e algorit#m #andles t#is by treating accounts wit#in a con4gurable time window Iset to D9 seconds in current e<.erimentsJ as tem.orally co-eEual, and .resenting all suc# accounts as a candidate origin cluster rat#er t#an
.icking arbitrarily among t#em!
$! Independent Cascade ,odel
Once t#e origin node is identi4ed, analysts ty.ically want to know2 if t#is cascade continues uninterru.ted, w#at does t#e s.read look likeL 0#e Inde.endent Cascade Model IICMJ
.ro/ides a .rinci.led answer to t#is Euestion 5?7, 5A7, 5D7! 0#e ICM is a stoc#astic s.reading model! Hac# edge
(u, v) in t#e gra.# is assigned an acti/ation .robability
(u, v)in!”, #$! B#en node u becomes acti/e I#as recei/ed t#e contentJ, it gets e<actly one attem.t to acti/ate eac# of its inacti/e neig#bors! For a gi/en neig#bor v, uXs attem.t succeeds wit# .robability (u, v), simulated by sam.ling a uniform random /ariable %S &’ (!”, #$ and c#ecking w#et#er
%) = (u, v)!
0#e key modeling assum.tion t#at gi/es ICM its name is inde.endence2 uXs attem.ts to acti/ate di;erent neig#bors are inde.endent of eac# ot#er, and di;erent acti/e nodesX attem.ts to acti/ate t#e same neig#bor are also inde.endent! 0#is inde-
.endence assum.tion is a sim.li4cation of reality, in .ractice, multi.le accounts seeing t#e same narrati/e simultaneously
.robably do in>uence eac# ot#erXs decision to s#are! $ut it makes t#e model tractable and .roduces simulations t#at run in .olynomial time, w#ic# matters w#en running many Monte Carlo trials to estimate e<.ected cascade siOe!
A
t1
g
$
t2
g
0
t3
g
%
Forward cascade: A —-B —-C D ((lue arrows1 timestamped
t1 through t3)!
Reverse forensic trace: Algorithm wal2s (ac2ward (dashed red) to identify 3ode A as the origin!
Fig! =! 3e/erse Pat# 0ra/ersal2 Origin Identi4cation in a Linear Cascade C#ain
0#e algorit#m .roceeds as follows2
6J InitialiOe a tra/ersal Eueue Q containing t#e target node
d It#e sus.icious .ostXs accountJ!
=J Set t#e candidate origin S* = d and t#e earliest known timestam. t* = td It#e timestam. at w#ic# d recei/ed or .osted t#e contentJ!
?J B#ile Q is not em.ty, deEueue a node v!
@J For eac# incoming edge (u, v, t) w#ere t < t*2
Orange nodes are active. Node A activated B with high probability ($p = 0.82$). B has not yet activated C or D (lower probabilities, one attempt remaining per neighbor).
Fig. 3. One Step of an ICM Simulation Showing Activation Probabilities Across Neighbors
In TraceFlow's implementation, edge probabilities are estimated from the historical interaction data. For a pair of accounts $(u, v)$ that have interacted multiple times, $p(u, v)$ is estimated as the proportion of $u$'s posts that $v$ has reposted or responded to. For pairs with no historical interaction, a base rate derived from the overall repost rate in the dataset is used as a prior. This produces reasonable probability estimates without requiring separate model fitting, though an analyst can override probabilities for specific edges when contextual knowledge justifies it.
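The estimation rule above might look roughly like the following. This is a sketch only: the function names, the dictionary-based inputs (`post_counts`, `response_counts`), and the sample numbers are assumptions introduced for illustration, not the paper's data structures.

```python
def estimate_edge_probabilities(edges, post_counts, response_counts, base_rate):
    """Estimate p(u, v) for each directed edge (u, v).

    post_counts[u]          -> number of posts by u (assumed input)
    response_counts[(u, v)] -> number of u's posts that v reposted or replied to
    base_rate               -> dataset-wide repost rate, used as the prior
    """
    probs = {}
    for u, v in edges:
        posted = post_counts.get(u, 0)
        responded = response_counts.get((u, v), 0)
        if posted > 0 and responded > 0:
            probs[(u, v)] = min(1.0, responded / posted)
        else:
            # No interaction history for this pair: fall back to the prior.
            probs[(u, v)] = base_rate
    return probs


def apply_overrides(probs, overrides):
    """Return a copy of probs with analyst-supplied values taking precedence."""
    merged = dict(probs)
    merged.update(overrides)
    return merged


probs = estimate_edge_probabilities(
    edges=[("A", "B"), ("A", "C")],
    post_counts={"A": 50},
    response_counts={("A", "B"): 41},  # B responded to 41 of A's 50 posts
    base_rate=0.05,
)
```

Keeping overrides in a separate step leaves the data-derived estimates intact, so an analyst's adjustment can be audited or reverted later.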
The simulation runs 1,000 Monte Carlo trials for each origin node. Each trial produces a count of how many nodes were activated before the cascade terminated. The mean and 90th-percentile activation counts across trials give the analyst an expected spread estimate and an upper-bound risk estimate respectively. When the 90th-percentile count is substantially larger than the mean, it indicates that the cascade topology includes high-degree hubs that could produce very large spreads in unlucky runs, a pattern that warrants elevated concern.
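A compact way to see how the mean and 90th-percentile estimates arise is the sketch below. It assumes an adjacency-list graph and a dictionary of edge probabilities; the names `simulate_cascade` and `estimate_spread`, the toy graph, and the choice to count the origin itself as activated are illustrative assumptions, not details taken from TraceFlow.

```python
import math
import random
from collections import deque


def simulate_cascade(graph, probs, origin, rng):
    """Run one ICM trial and return the number of active nodes (origin included)."""
    active = {origin}
    frontier = deque([origin])
    while frontier:
        u = frontier.popleft()
        for v in graph.get(u, ()):
            # u gets exactly one attempt per neighbor that is still inactive.
            # random() lies in [0, 1), so `<` succeeds with probability p(u, v).
            if v not in active and rng.random() < probs[(u, v)]:
                active.add(v)
                frontier.append(v)
    return len(active)


def estimate_spread(graph, probs, origin, trials=1000, seed=0):
    """Return (mean, 90th-percentile) cascade size over Monte Carlo trials."""
    rng = random.Random(seed)
    sizes = sorted(simulate_cascade(graph, probs, origin, rng) for _ in range(trials))
    mean = sum(sizes) / trials
    p90 = sizes[math.ceil(0.9 * trials) - 1]  # nearest-rank percentile
    return mean, p90


# Toy graph; adjacency lists and probabilities are illustrative only.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
probs = {("A", "B"): 0.82, ("A", "C"): 0.30, ("B", "D"): 0.40, ("C", "D"): 0.40}
mean_size, p90_size = estimate_spread(graph, probs, origin="A")
print(f"mean={mean_size:.2f}, 90th percentile={p90_size}")
```

Seeding the generator makes a given estimate repeatable, which is useful when the same figures must appear unchanged in a forensic report.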
C. Hybrid Risk Scoring
The final scoring step combines the output of the semantic model ($S_{\text{prob}}$, the Bi-LSTM deception probability) and the behavioral classifier ($B_{\text{prob}}$, the SocialGuardian bot likelihood) into a single hybrid risk index:
$$R_{\text{hybrid}} = (\alpha \times S_{\text{prob}}) + (\beta \times B_{\text{prob}}) \quad (1)$$
where $\alpha = 0.55$ and $\beta = 0.45$. The weighting slightly favors the semantic signal because, in the evaluation datasets, the semantic model showed higher precision on the specific false positive cases that are most problematic in forensic contexts: cases where an account's behavior looks suspicious but its content is clearly benign (e.g., automated weather update services).
Accounts carrying verified status receive an institutional dampening adjustment:
$$R_{\text{hybrid}} \leftarrow R_{\text{hybrid}} \times 0.85 \quad (2)$$
This factor reflects the empirical observation that verified accounts (news organizations, government bodies, academic institutions) have a significantly lower base rate of intentional misinformation production than unverified accounts, and that their behavioral patterns (high posting frequency, API usage, high follower counts) can appear bot-like to the behavioral classifier despite reflecting legitimate automated publishing workflows.
The 0.85 multiplier was calibrated against labeled examples from the Truth Seeker Corpus rather than chosen arbitrarily. It is not intended to grant immunity to verified accounts (a verified account with very high semantic deception probability and moderately elevated behavioral score would still receive a high final risk index), but rather to prevent the behavioral signal from dominating the score for accounts whose patterns are explainable by institutional publishing practices.
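The scoring rule and the verified-account adjustment reduce to a few lines of arithmetic. The sketch below uses the weights and multiplier stated above; the function name `hybrid_risk` and the two sample accounts are hypothetical.

```python
ALPHA = 0.55               # weight on the semantic deception probability
BETA = 0.45                # weight on the behavioral bot likelihood
VERIFIED_DAMPENING = 0.85  # multiplier applied to verified accounts


def hybrid_risk(s_prob, b_prob, verified=False):
    """Combine semantic and behavioral probabilities into one risk index."""
    score = ALPHA * s_prob + BETA * b_prob
    if verified:
        score *= VERIFIED_DAMPENING
    return score


# Hypothetical verified weather bot: benign text, bot-like behavior.
weather_bot = hybrid_risk(s_prob=0.10, b_prob=0.90, verified=True)

# Hypothetical verified account posting highly deceptive text.
deceptive_verified = hybrid_risk(s_prob=0.98, b_prob=0.70, verified=True)

print(round(weather_bot, 3), round(deceptive_verified, 3))
```

In this illustration the dampening pulls the weather bot down from 0.46 to about 0.39, while the deceptive verified account still scores about 0.73, which matches the stated intent that verification reduces but does not remove risk.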
V. EXPERIMENTAL DESIGN
A. Hardware and Software Environment
One of the explicit goals of this project was to demonstrate that meaningful cascade forensic analysis does not require institutional server infrastructure. All experiments were conducted on a single desktop computer, the configuration of which is shown in Table I. The 8 GB RAM limit is the binding constraint that shaped most architectural decisions.
TABLE I
EXPERIMENTAL HARDWARE AND SOFTWARE CONFIGURATION
| Component | Configuration |
| Processor | Intel Core i7 |
| RAM | 8 GB DDR4 |
| Storage | 512 GB NVMe SSD |
| Operating System | Ubuntu 22.04 LTS |
| Programming Language | Python 3.10 |
| Distributed Engine | Apache Spark 4.1 via PySpark |
| Deep Learning Framework | TensorFlow 2.11 |
| Visualization Stack | Next.js 14 + D3.js v7 |
The NVMe SSD was not incidental to the experimental setup. Spark's MEMORY_AND_DISK persistence strategy spills DataFrame partitions to disk when memory is insufficient. On a spinning hard drive, this spilling creates severe I/O bottlenecks. On an NVMe drive with sequential read speeds around 3,500 MB/s, the bottleneck is far less severe. Organizations deploying this framework on systems with slower storage may need to increase the available RAM to compensate.
B. Datasets
Two datasets were used in the evaluation. Their different purposes in the experimental design reflect the dual nature of the framework's analysis: one dataset tests semantic classification, and the other tests behavioral profiling.
The Truth Seeker Corpus contains 134,198 social interaction records that have been manually verified and labeled as either genuine or fabricated information. Records cover a diverse range of topic domains including health misinformation, political misinformation, and fabricated news stories shared during public emergencies. The distribution is approximately 55% labeled genuine and 45% labeled fabricated, close enough to balanced that no resampling was needed for classifier training.
Preprocessing revealed that approximately 12% of records had missing values in at least one metadata field. The most common missing fields were account age (which platform APIs sometimes withhold for privacy reasons) and verification status. Missing continuous variables (account age, follower count) were imputed using column medians, which is more robust than mean imputation when fields are missing in systematic patterns. Missing boolean flags like verification status were treated as a separate categorical level rather than imputed with a value, since "unknown verification status" is meaningfully different from "not verified."
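The two imputation rules can be illustrated with the standard library alone. The sketch below is an assumption-laden example, not the authors' preprocessing code: the field names (`account_age_days`, `verified`), the `"unknown"`/`"yes"`/`"no"` labels, and the sample records are hypothetical.

```python
from statistics import median


def impute_records(records, continuous_fields, flag_fields):
    """Median-impute continuous fields; map missing flags to an 'unknown' level."""
    medians = {}
    for field in continuous_fields:
        observed = [r[field] for r in records if r.get(field) is not None]
        medians[field] = median(observed) if observed else None

    cleaned = []
    for record in records:
        row = dict(record)  # leave the input records untouched
        for field in continuous_fields:
            if row.get(field) is None:
                row[field] = medians[field]
        for field in flag_fields:
            value = row.get(field)
            # Three categorical levels: missing stays distinct from False.
            row[field] = "unknown" if value is None else ("yes" if value else "no")
        cleaned.append(row)
    return cleaned


sample = [
    {"account_age_days": 100, "verified": True},
    {"account_age_days": None, "verified": None},
    {"account_age_days": 300, "verified": False},
]
cleaned = impute_records(sample, ["account_age_days"], ["verified"])
```

Encoding the flag as three levels lets a downstream tree-based classifier split on "unknown" directly instead of treating it as "not verified".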
The Twitter Metadata Profile Suite provides 64 account-level behavioral dimensions for a separate corpus of accounts, with ground truth labels identifying confirmed bots, confirmed human accounts, and ambiguous accounts. This dataset was used exclusively for training and evaluating the SocialGuardian Random Forest classifier. It was not used for semantic analysis, and the Truth Seeker Corpus was not used for behavioral training. Keeping the training corpora separate for each module prevents the integrated evaluation from being inflated by correlated training signal.
C. Neural Network Configuration
The Bi-LSTM model was configured according to the parameters in Table II. The vocabulary size of 25,000 words covers the vocabulary needed for the Truth Seeker Corpus with some headroom. The sequence length of 120 tokens was chosen to cover 95% of posts in the corpus without truncation while keeping the maximum context window manageable for memory. Sequences longer than 120 tokens were truncated from the end, on the assumption that the most semantically distinctive content tends to appear early in a post.
TABLE II
BI-LSTM NEURAL NETWORK HYPERPARAMETERS
| Hyperparameter | Value |
| Vocabulary Size | 25,000 tokens |
| Sequence Length | 120 tokens |
| Embedding Dimension | 128 |
| Bi-LSTM Units | 64 per direction (128 combined) |
| Dropout Rate | 0.35 |
| Output Activation | Sigmoid |
| Optimizer | Adam, lr = 0.001 |
| Batch Size | 64 |
| Training Epochs | 20 (early stopping patience = 3) |
| Framework | TensorFlow 2.11 |
Training used the Adam optimizer with an initial learning rate of 0.001 and early stopping with patience of 3 epochs. Early stopping monitors validation loss and halts training when validation loss stops improving for three consecutive epochs, keeping the final weights from the epoch with the best validation loss. This prevents the model from overfitting to the training corpus, which is a particular risk when the training data comes from a fixed time window and the model will be applied to future data where language patterns have evolved. The dropout rate of 0.35 is slightly higher than common defaults (which are often 0.2 or 0.25). It was set empirically: lower dropout values led to training accuracy that exceeded validation accuracy by more than 3 percentage points, indicating overfitting. At 0.35, the gap narrowed to under 1 percentage point.
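The early-stopping rule described above can be made concrete with a small framework-free sketch. The function name and the validation-loss sequence are invented for illustration. In Keras this behavior is usually obtained with the `EarlyStopping` callback (`monitor="val_loss"`, `patience=3`, `restore_best_weights=True`); whether the authors used that callback is not stated.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training halts once validation loss has failed to improve for
    `patience` consecutive epochs; weights are kept from best_epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Ran out of epochs before patience was exhausted.
    return len(val_losses), best_epoch


# Hypothetical validation losses: the best value occurs at epoch 3.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.39]
print(early_stopping_epoch(losses, patience=3))  # (6, 3)
```

Note that the run stops at epoch 6 and never sees the lower loss at epoch 7, which is the usual cost of a short patience window.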
VI. RESULTS AND DISCUSSION
A. Semantic and Behavioral Classification Performance
The Bi-LSTM semantic model and the SocialGuardian behavioral classifier were both benchmarked against classical machine learning baselines. The results are shown in Table III and visualized in Fig. 4.
TABLE III
SEMANTIC AND BEHAVIORAL CLASSIFICATION PERFORMANCE AGAINST BASELINES
| Model | Accuracy | Precision | Recall | F1-Score |
| Logistic Regression | 78.4% | 0.79 | 0.76 | 0.77 |
| Support Vector Machine | 82.1% | 0.81 | 0.83 | 0.82 |
| Random Forest (Behavioral) | 89.6% | 0.88 | 0.91 | 0.89 |
| TraceFlow Bi-LSTM (Semantic) | 94.2% | 0.93 | 0.95 | 0.94 |
Fig. 4. Classification Accuracy Comparison Across Models
The 94.2% accuracy achieved by the Bi-LSTM reflects its ability to handle two types of text patterns that break simpler models. The first is semantic inversion: constructions like sarcasm and irony where the surface meaning of a sentence is the opposite of the intended meaning. Logistic regression and SVMs using bag-of-words features cannot detect inversion because they treat each word independently without position context. The Bi-LSTM's bidirectional architecture captures the relationship between distant parts of a sentence.
The second type is adversarial phrasing. Sophisticated misinformation content sometimes deliberately includes plausibility-signaling hedges ("according to some reports," "experts suggest") while embedding unverified or fabricated claims within them. These constructions look credible at the word level and fool classifiers that operate on term frequency. The Bi-LSTM, trained on labeled examples of this pattern, learns to associate the co-occurrence of credibility markers with specific claim structures as a marker of potential deception rather than as evidence of reliability.
The Random Forest behavioral classifier's 89.6% accuracy is notable in a different way: it achieves this without reading a single word of any post. It is working entirely from account metadata: how old the account is, how often it posts, what application it posts from, how its follower numbers have changed. This confirms that behavioral signals are genuinely informative, not just secondary proxies for content-level signals. An account can behave like an automated amplifier without posting content that looks deceptive, and the behavioral classifier will catch it.
B. Error Analysis
The 5.8% of test examples that the integrated pipeline misclassified are worth examining in some detail, because they reveal the current framework's genuine limits.
The most common failure mode for the semantic model was what might be called "evolving language" errors. Misinformation campaigns frequently adopt slang, in-group terminology, or coded references that were not present in the training corpus. When the model encounters these tokens, they typically receive an "unknown" embedding, which provides no useful signal. The model then relies on the surrounding context, which may or may not be sufficient. This is not a fundamental limitation of the Bi-LSTM architecture (retraining on updated corpora would address it), but it is a practical maintenance challenge.
Image-centric posts caused a different failure mode. Many social media posts consist primarily of images with minimal text captions. The semantic model has no image processing capability and can only operate on the caption text. A post consisting of a fabricated infographic with the caption "Amazing" receives almost no semantic signal from the text alone. The behavioral score for the posting account may compensate somewhat, but if the account otherwise appears normal, the integrated score may fall below the threshold for flagging.
For the behavioral classifier, the most common false positives came from automated legitimate accounts: weather services, emergency alert systems, public transit information accounts, and news wire services. These accounts post at very high frequency, use API signatures, have unusual posting time distributions, and often follow many accounts without reciprocation, all features that resemble bot behavior. The institutional dampening factor reduces these false positives considerably but does not eliminate them entirely.
C. Source Origin Localization Results
The Reverse Path Traversal algorithm was evaluated against Betweenness Centrality and Eigenvector Centrality baselines across three graph scales. Synthetic propagation graphs with known origin nodes were used so that ground truth accuracy could be computed directly. Table IV shows the results.
TABLE IV
SOURCE LOCALIZATION ACCURACY ACROSS GRAPH SCALES
| Graph Scale | Betweenness | Eigenvector | TraceFlow |
| 10,000 edges | 61.2% | 54.3% | 91.4% |
| 50,000 edges | 55.8% | 49.1% | 88.7% |
| 100,000 edges | 48.3% | 41.6% | 86.2% |
The degradation in centrality baseline performance as graph size increases is expected and informative. Larger graphs contain more paths between any two nodes, which means that highly connected amplification hubs accumulate more between-path counts (betweenness) or more high-quality neighbors (eigenvector score). This pushes popular-but-late accounts toward the top of the centrality ranking, even when the true origin is a smaller account that posted first.
TraceFlow's accuracy also decreases with graph size, from 91.4% at 10K edges to 86.2% at 100K edges. This is a real limitation. Larger graphs contain longer traversal paths, which means more opportunity for errors to accumulate. When the true cascade chain involves many intermediate nodes, each hop has some probability of being resolved incorrectly due to missing edges (API rate limits mean that some interaction records are never captured) or timestamp ambiguity. At 100K edges, roughly 14% of trials identified the wrong account as the origin.
In practice, analysts rarely need to pinpoint a single account with certainty. The output at the 100K-edge scale typically includes 2 to 4 candidate origin accounts with associated confidence scores, and the true origin is usually in that set. Forensic investigation then uses account-level evidence to distinguish among candidates.
D. Memory Performance and Stability
The memory behavior of the pipeline is worth documenting in detail because it was the primary technical challenge during implementation. Initial runs on the full 134,198-record dataset caused repeated JVM crashes due to garbage collection overhead. The executor would be allocated a shuffle task requiring more heap space than the JVM could provide, triggering aggressive garbage collection, which in turn slowed execution to near-zero throughput while the garbage collector ran, eventually causing the executor to time out and crash.
After applying the three-part optimization (MEMORY_AND_DISK persistence, broadcast joins, and shuffle partition reduction), the system stabilized. Table V shows the progression.
TABLE V
MEMORY UTILIZATION ACROSS PIPELINE OPTIMIZATION STAGES
| Configuration | Peak RAM | Status |
| Default Spark settings | OOM crash | Failed |
| MEMORY_AND_DISK only | 7.8 GB | Unstable |
| + Broadcast joins | 7.1 GB | Stable with pauses |
| + Shuffle partition reduction | 6.2 GB | Stable |
The final stable configuration uses 6.2 GB, leaving 1.8 GB headroom within the 8 GB ceiling. This headroom matters during concurrent processing steps: when the graph traversal module and the semantic model are running simultaneously on different partitions, peak memory demand can briefly exceed the mean. The 1.8 GB buffer prevents overflow during those peaks.
VII. FRAMEWORK DESIGN: CONTRIBUTIONS AND HONEST LIMITATIONS
A. What TraceFlow Contributes
The primary contribution of this work is not any individual component. Bi-LSTM text classifiers exist. Random Forest behavioral classifiers exist. Graph traversal algorithms exist. The contribution is the integration of these components into a unified pipeline that preserves independent evidence channels while combining them into actionable output, and that does this on hardware most research groups actually have.
On the methodological side, the Reverse Path Traversal algorithm's explicit timestamp constraint is a meaningful departure from centrality-based source localization. The empirical gap of 30 percentage points between TraceFlow and Betweenness Centrality on 10K-edge graphs is not marginal; it is the difference between an investigative lead and noise. The mechanism behind this gap (temporal filtering versus structural weighting) is well understood and principled.
On the systems side, the PySpark optimization choices are documented in enough detail to be directly reproducible. The specific combination of MEMORY_AND_DISK persistence, broadcast joins, and shuffle partition reduction is not novel in isolation, but its application to misinformation forensic graph processing, where the graph schema and query patterns are different from typical business intelligence workloads, is a practical contribution to anyone building similar infrastructure.
On the analyst-facing side, the decision to surface intermediate scores from each evidence channel separately, rather than just presenting a final integrated risk score, reflects an understanding of how forensic reports actually work. An investigator who cannot explain why the system flagged a specific account cannot include that flag as evidence. The transparency of independent scores serves forensic reporting requirements.
B. Limitations That Matter
Three limitations are significant enough to warrant explicit acknowledgment rather than relegation to a brief closing paragraph.
The first is batch-only processing. TraceFlow ingests pre-collected interaction logs. It is not a streaming system. During a fast-moving misinformation event (an election day, a breaking news situation, a public health emergency), the most forensically valuable window is the first few hours, before the cascade has matured. A batch system that requires data collection, ingestion, and processing to complete before results are available cannot operate in that window. The architecture is compatible with Apache Spark Streaming, but integrating streaming ingestion without sacrificing memory stability requires significant additional work.
The second is model aging. The Bi-LSTM semantic model is trained on a fixed corpus. Language evolves continuously, and misinformation actors specifically exploit this evolution by adopting new terminologies before classifiers are updated. A model trained on 2023 data will miss patterns that only emerged in 2025. This is a maintenance requirement, not a fundamental flaw, but organizations deploying TraceFlow need to plan for periodic retraining against updated labeled data.
The third is graph scale ceilings. The 8 GB memory ceiling becomes a hard constraint somewhere between 100K and 500K edges, depending on graph density. Very large-scale investigations (platform-wide analysis of millions of interactions) would require either more RAM or cluster-level deployment. The PySpark architecture is designed for migration to multi-node clusters, but that migration requires infrastructure investment and configuration that goes beyond what this paper demonstrates.
VIII. DECISION SUPPORT AND ANALYST WORKFLOW
A. From Score to Action: Risk Bands
The hybrid risk score $R_{\text{hybrid}}$ is a number between 0 and 1. That number means nothing on its own to an analyst who needs to decide whether to escalate an investigation, archive a case for later review, or immediately notify a response team. TraceFlow maps scores to four decision bands that correspond to different responses, as shown in Table VI.
TABLE VI
DECISION SUPPORT MAPPING FOR TRACEFLOW RISK SCORE BANDS
| Band | Score Range | What It Means | Suggested Action |
| Low | 0.00-0.39 | Weak or isolated signals. Unlikely to represent coordinated activity. | Archive for later comparison |
| Medium | 0.40-0.69 | One or two channels show suspicious patterns but evidence is incomplete. | Watch-list; review propagation path |
| High | 0.70-0.89 | Multiple evidence channels reinforce each other. Pattern is consistent with coordination. | Escalate investigation; assign analyst |
| Critical | 0.90-1.00 | Strong convergent evidence across semantic, behavioral, and graph signals. | Immediate review and response |
The band thresholds were calibrated on the evaluation dataset by finding the score ranges that minimized a weighted combination of false positives and false negatives, with false negatives weighted higher (because missing a genuine campaign has more operational cost than over-investigating a borderline case). Organizations with different cost structures can recalibrate the thresholds without modifying the underlying models.
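Keeping the thresholds in a single lookup table is one way to make the recalibration mentioned above a data change rather than a code change. The sketch below uses the published band boundaries; the function name `risk_band` and the treatment of each lower bound as inclusive are assumptions.

```python
# Lower bounds taken from Table VI, checked from the highest band down.
BANDS = (
    (0.90, "Critical", "Immediate review and response"),
    (0.70, "High", "Escalate investigation; assign analyst"),
    (0.40, "Medium", "Watch-list; review propagation path"),
    (0.00, "Low", "Archive for later comparison"),
)


def risk_band(score):
    """Map a hybrid risk score in [0, 1] to (band, suggested_action)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    for lower_bound, band, action in BANDS:
        if score >= lower_bound:
            return band, action


print(risk_band(0.73))  # ('High', 'Escalate investigation; assign analyst')
```

An organization with a different cost structure would edit only the numbers in `BANDS`; the scoring models and the lookup function stay unchanged.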
B. Validation Strategy
The three evidence channels were validated independently before the integrated system was evaluated. This separation is important for two reasons. First, it makes it possible to identify which component is driving errors when the integrated system makes a mistake. Second, it allows each component to be updated or replaced without requiring full re-evaluation of the entire system.
Table VII summarizes the validation approach for each layer.
TABLE VII
TRACEFLOW VALIDATION STRATEGY PER EVIDENCE LAYER
| Evidence Layer | Validation Method | Primary Output |
| Semantic Content | Labeled misinformation records; 80/20 train-test split | Deception probability score |
| Behavioral Metadata | Account metadata with ground truth bot/human labels | Automation likelihood score |
| Temporal Graph | Synthetic cascades with known origin nodes | Candidate source account path |
| Cascade Forecast | ICM simulations against observed diffusion counts | Expected and 90th-percentile spread |
Synthetic cascade graphs for source localization validation were generated using the Barabási-Albert preferential attachment model, which produces degree distributions that resemble real social networks. Origin nodes were selected uniformly at random from the 10th percentile of nodes by degree (reflecting the typical pattern that true cascade origins tend to be moderately connected accounts, not hubs). The ground truth origin was recorded and then hidden from the localization algorithm.
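For readers who want to reproduce a comparable test bed, the sketch below generates a preferential-attachment topology and picks a ground-truth origin. It is an illustration only: the function names are hypothetical, the generator covers the underlying topology rather than timestamped cascades, and reading "10th percentile of nodes by degree" as the lowest-degree 10% of nodes is an assumption about the authors' procedure.

```python
import random
from collections import Counter


def barabasi_albert_edges(n, m, seed=0):
    """Generate an undirected preferential-attachment edge list with n nodes."""
    if not 1 <= m < n:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))  # the first new node attaches to the m seed nodes
    endpoints = []            # each node appears once per incident edge
    for new in range(m, n):
        edges.extend((new, t) for t in targets)
        endpoints.extend(targets)
        endpoints.extend([new] * m)
        # Sampling from `endpoints` picks nodes in proportion to their degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = sorted(chosen)
    return edges


def pick_low_degree_origin(edges, fraction=0.10, seed=0):
    """Pick an origin uniformly from the lowest-degree `fraction` of nodes."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    ranked = sorted(degree, key=lambda node: (degree[node], node))
    pool = ranked[: max(1, int(len(ranked) * fraction))]
    return random.Random(seed).choice(pool)


edges = barabasi_albert_edges(n=50, m=2, seed=1)
origin = pick_low_degree_origin(edges, fraction=0.10, seed=1)
```

Each new node adds exactly `m` edges, so a graph of `n` nodes has `m * (n - m)` edges; that relationship can be used to choose `n` and `m` for a target edge count such as 10,000.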
C. The Role of the Human Analyst
A point worth making explicitly: TraceFlow is not designed to replace forensic analysts. It is designed to give them better organized evidence faster. The hybrid risk score tells an analyst where to look; it does not tell them what to conclude.
This matters especially for edge cases that automated systems handle poorly. Satire accounts produce high semantic risk scores because they deliberately use sensationalized language and unverified framing. Breaking news situations produce high behavioral risk scores because many accounts post rapidly about the same topic simultaneously. Genuine public health emergencies produce propagation graphs that look structurally identical to coordinated misinformation campaigns. In all of these cases, a human analyst with contextual knowledge, who knows that a specific account is a known satire publication, or that a specific event occurred and legitimately triggered rapid posting, will draw very different conclusions from the same risk score.
The framework's design supports this by surfacing the intermediate outputs of each evidence channel separately. When an analyst reviews a flagged case, they may see that the semantic score is 0.91 while the behavioral score is 0.43, which tells them the text content looks highly suspicious but the account itself does not show automated behavior patterns. This is a very different evidentiary picture from a case where both scores are 0.90, and the hybrid score alone does not reveal which channel is driving the result. The analyst can weigh these differently based on investigative context.
IX. VISUALIZATION LAYER
A. Dashboard Architecture
The visualization layer provides the interface through which investigators interact with all of the framework's outputs. It is built on Next.js 14 for the application shell and D3.js v7 for graph rendering and interactive data display. The design philosophy is that every number the framework produces should be reachable through the visualization: no important intermediate output should be buried in a log file.
The main investigation view shows three panels simultaneously. The left panel displays the propagation graph, rendered as a force-directed layout by D3.js, with nodes colored by risk band. The origin candidate node and its estimated confidence are visually distinct. The analyst can click any node to expand its evidence summary: semantic score, behavioral score, earliest timestamp in the cascade, and estimated downstream reach from the ICM simulation.
The center panel shows the cascade timeline: a chronological visualization of when each node in the subgraph first received or posted the content. This view makes the temporal structure of the cascade immediately legible without requiring the analyst to read raw timestamps. Coordinated posting patterns, where many accounts post within seconds of each other, are visible as clusters on the timeline.
The right panel shows the risk score breakdown for the currently selected investigation, including the hybrid score, band assignment, and the contribution of each channel to the final score. The ICM simulation results appear here as a probability distribution over expected cascade size, with the mean and 90th-percentile bounds highlighted.
B. Design Considerations
Several design choices in the visualization deserve brief explanation. The force-directed graph layout was chosen over fixed hierarchical layouts because real cascade graphs are rarely trees: they contain many cross-links where an account received the content from multiple sources, or where accounts that are part of the same network independently reached the same content through different paths. Force-directed layouts handle these irregular topologies gracefully, while hierarchical layouts would produce misleading implied hierarchy.
The timeline view was added after early usability testing with two forensic analysts who reported that the graph view alone made it difficult to reason about temporal sequence. A graph view shows who is connected to whom; the timeline view shows the order in which the cascade built. For investigations focused on identifying coordinated posting, the timeline view turned out to be the more useful of the two views, because coordination is fundamentally a temporal rather than topological signature.
Color-coding by risk band (green for Low, yellow for Medium, orange for High, red for Critical) follows conventional traffic-light semantics that are immediately legible without training. An important accessibility consideration is that this color scheme uses green-orange-red, which is partially problematic for users with red-green color vision deficiency. Future versions should add a pattern or shape distinction that conveys risk level independently of color.
X. FUTURE WORK
Three development directions are prioritized based on the limitations identified in Section VII.
The most operationally important extension is real-time streaming integration. Apache Spark Streaming supports continuous ingestion from message queues like Apache Kafka, which in turn can ingest from platform API streams. The forensic logic modules (the semantic model, the behavioral classifier, and the graph traversal engine) are all stateless in principle: they operate on a window of data and produce a result. Adapting them to operate on sliding windows over a continuous stream requires careful attention to state management for incremental graph updates, but the core algorithms do not need to change.
The second priority is multi-node cluster deployment. The PySpark architecture was specifically chosen because it scales from a single machine to a distributed cluster without code changes, only configuration changes. A Terraform infrastructure-as-code template for deploying a TraceFlow cluster on cloud infrastructure would make this scaling path accessible to organizations without dedicated cluster engineering teams.
The third direction is Graph Neural Network integration for source localization. The current Reverse Path Traversal algorithm is a deterministic rule-based method. GNNs could learn propagation patterns directly from graph structure and node features, potentially improving source localization accuracy on graphs where temporal metadata is incomplete or noisy. The primary challenge is that GNN training requires labeled cascade graphs with known origin nodes, a more demanding data requirement than the current validation approach. Progress in this direction would likely come from generating synthetic labeled cascades at scale and evaluating whether GNN-learned representations transfer to real-world data.
XI. COMPARATIVE ANALYSIS OF RELATED SYSTEMS
To situate TraceFlow within the existing research landscape, it is useful to compare it against several representative systems along dimensions that matter for practical deployment. Table VIII summarizes this comparison.
TABLE VIII
FEATURE COMPARISON OF TRACEFLOW AGAINST REPRESENTATIVE RELATED SYSTEMS
|
System |
Semantic |
Behavioral |
Graph Traversal |
Source Loc. |
Low-Resource |
|
FakeNewsNet [12] |
X |
Partial |
X |
||
|
BotSentinel [9] |
X |
X |
X |
||
|
Hoaxy [8] |
X |
Partial |
X |
X |
|
|
GNNFake [13] |
X |
Partial |
X |
||
|
MultiModal [16] |
Partial |
X |
X |
||
|
TraceFlow (Ours)
FakeNewsNet provides a rich benchmark dataset and content-level classification, but it does not attempt behavioral profiling or source localization; it answers the question of whether content is false, not where it came from or who spread it. BotSentinel is effective at identifying automated accounts from behavioral signals but has no content analysis and no graph-level reasoning. Hoaxy provides impressive propagation visualization and some account-level behavioral context, but it is not designed for source attribution and requires substantial infrastructure to run at scale.
GNN-based approaches like those surveyed by Gong et al. represent the current research frontier in graph-based fake news detection [13]. They consistently outperform static centrality metrics on source localization benchmarks when sufficient training data is available. The tradeoff is that they require labeled cascade graphs for training, are computationally expensive at inference time, and are considerably harder to explain to a non-technical audience. For a forensic context where explainability is a requirement, not a preference, GNNs in their current form present practical challenges.
TraceFlow's position in this landscape is as a practically complete system: it addresses all three signal types and supports source localization, on hardware that a small research group or under-resourced forensic unit can actually afford to operate. It sacrifices some classification accuracy compared to well-resourced transformer-based approaches, but gains interpretability, deployability, and the integration of evidence channels that larger specialized systems leave separate.
A. Ablation Study: The Value of Each Evidence Channel
To quantify how much each evidence channel contributes to the integrated system, an ablation study was conducted by removing one channel at a time and measuring the impact on classification accuracy and source localization accuracy. Results are in Table IX.
TABLE IX
ABLATION STUDY: IMPACT OF REMOVING EACH EVIDENCE CHANNEL
| Configuration | Class. Accuracy | Source Loc. (10K) |
| Full system (all channels) | 94.2% | 91.4% |
| No semantic channel (semantic weight = 0) | 89.6% | 91.2% |
| No behavioral channel (behavioral weight = 0) | 94.1% | 90.8% |
| No graph traversal (centrality only) | 91.3% | 61.2% |
| Semantic only | 94.2% | 58.7% |
| Behavioral only | 89.6% | 59.1% |
Several observations stand out. Removing the semantic channel reduces classification accuracy from 94.2% to 89.6%, the behavioral classifier's standalone performance, while having almost no effect on source localization accuracy. This makes sense: the source localization algorithm operates on graph structure and timestamps, not on text content. The semantic channel's contribution to the integrated system is entirely through the hybrid risk score, not through the localization algorithm.
Removing the behavioral channel has almost no effect on classification accuracy, which at first appears surprising. The explanation is that in the test corpus, the semantic model's predictions and the behavioral model's predictions are highly correlated: accounts that post deceptive content also tend to show behavioral anomalies. When both channels are present, they reinforce each other for clear cases and partially cancel for borderline cases. When only one channel is present, the clear cases are still classified correctly; it is the borderline cases that shift.
The most striking ablation result is the third ablation: removing graph traversal and falling back to betweenness centrality for source localization drops source localization accuracy from 91.4% to 61.2%. This 30-percentage-point gap confirms that the temporal constraint in the Reverse Path Traversal algorithm is not a minor refinement; it is the mechanism responsible for the majority of the source localization improvement over baseline methods.
XII. A WORKED EXAMPLE: END-TO-END FORENSIC INVESTIGATION
To make the framework's operation concrete, this section walks through a hypothetical end-to-end investigation using the kind of data and results that TraceFlow produces in practice. The scenario is illustrative and does not correspond to a specific real-world event, but the data structures, score ranges, and analytical steps are representative of actual framework behavior.
A! Scenario Setup
During a municipal election cycle, platform monitoring identifies unusual activity around a claim that a local health authority has issued a product recall, a claim that the health authority has not made. The interaction logs captured over a six-hour window show 2,847 accounts engaging with content related to this claim, producing 4,112 interaction edges.
These logs are ingested by the data engineering layer. The edge generation step converts 4,112 raw interaction records into timestamped (source, target, timestamp, interaction_type) tuples. After preprocessing and median imputation of 7 missing account age values, the dataset is ready for the forensic logic layer.
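The edge generation and imputation steps can be illustrated with a small, single-machine sketch. It is not the framework's PySpark code: the record field names (`from_account`, `to_account`, `ts`, `kind`) and the helper names are hypothetical, chosen only to show the shape of the (source, target, timestamp, interaction_type) tuples and of median imputation for missing account ages.

```python
from statistics import median
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    target: str
    timestamp: int           # assumed unit: seconds since epoch
    interaction_type: str


def to_edges(records):
    """Turn raw interaction records (dicts with assumed keys) into edge tuples."""
    return [
        Edge(r["from_account"], r["to_account"], int(r["ts"]), r["kind"])
        for r in records
    ]


def impute_median(account_ages):
    """Fill missing (None) account ages with the median of the observed ages."""
    observed = [age for age in account_ages.values() if age is not None]
    fill = median(observed)
    return {acct: (fill if age is None else age) for acct, age in account_ages.items()}


# Toy data, invented for illustration only.
records = [
    {"from_account": "u1", "to_account": "u0", "ts": "1000", "kind": "repost"},
    {"from_account": "u2", "to_account": "u1", "ts": "1030", "kind": "reply"},
]
edges = to_edges(records)
ages = impute_median({"u0": 11, "u1": None, "u2": 400, "u3": 90})
```

Median imputation is used here, as in the text, because account ages are heavily skewed and a mean would be pulled upward by a few very old accounts.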
B. Semantic Analysis Results
The Bi-LSTM model processes the text content from the 312 unique posts associated with this cascade. The output is a distribution of deception probabilities across posts. The mean deception probability across all unique posts is 0.73, with individual post scores ranging from 0.41 to 0.94. Several posts cluster around 0.91 to 0.94, suggesting either that those posts use the most deceptive linguistic patterns in the training distribution, or that they were drafted by a more experienced actor. The original claim statement, the post that first introduced the false recall narrative, receives a deception probability of 0.91.
The Bi-LSTM does not use an attention mechanism, so there are no attention weights to inspect, but the per-position contribution to the final classification can be extracted. These contributions show that the model is weighting the phrase structure around the claim assertion heavily. The post uses a grammatical construction that mimics official announcement language while embedding an unverifiable claim, a pattern the model has learned to associate with fabricated authority attribution.
C. Behavioral Profiling Results
SocialGuardian evaluates all 2,847 accounts. The distribution of behavioral risk scores shows a bimodal pattern: most accounts score below 0.30 (consistent with ordinary users who encountered the content organically), but 43 accounts score above 0.75, and 11 of those score above 0.90.
Examining the high-scoring accounts reveals a pattern: 9 of the 11 accounts scoring above 0.90 have an account age of less than 14 days, post exclusively through a third-party client application rather than the native platform client, and have follower-to-following ratios below 0.15 (meaning they follow many accounts but few follow them back). This is a common signature of accounts created in batches for amplification campaigns.
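The batch-creation signature described above amounts to a conjunction of simple thresholds. The sketch below shows one way an analyst might express it; the `Account` fields and the function name are hypothetical, and the thresholds (risk above 0.90, age under 14 days, follower-to-following ratio below 0.15) are taken from this scenario rather than from any fixed rule in the framework.

```python
from dataclasses import dataclass


@dataclass
class Account:
    account_id: str
    risk: float                     # behavioral risk score in [0, 1]
    age_days: int
    third_party_client_only: bool
    followers: int
    following: int


def matches_batch_signature(a, risk_min=0.90, max_age_days=14, max_ratio=0.15):
    """True when an account shows all three traits of the scenario's signature."""
    # An account that follows nobody cannot have a low follower-to-following ratio.
    ratio = a.followers / a.following if a.following else float("inf")
    return (
        a.risk > risk_min
        and a.age_days < max_age_days
        and a.third_party_client_only
        and ratio < max_ratio
    )


# Toy accounts, invented for illustration only.
accounts = [
    Account("a1", 0.95, 6, True, 12, 900),     # young, third-party client, low ratio
    Account("a2", 0.93, 300, True, 15, 800),   # too old
    Account("a3", 0.20, 5, True, 3, 400),      # low behavioral risk
    Account("a4", 0.97, 9, False, 10, 700),    # uses the native client
]
flagged = [a.account_id for a in accounts if matches_batch_signature(a)]
```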
D. Graph Traversal and Source Localization
The Reverse Path Traversal algorithm begins at the node with the highest volume of downstream activity (not necessarily the origin, but a reasonable starting point for backward traversal). Walking backward through incoming edges with earlier timestamps, the algorithm reaches a branching point where four separate early-stage nodes all appear to have transmitted the content within a 47-second window.
Three of these four accounts score above 0.85 on SocialGuardian. The fourth, the one with the earliest verifiable timestamp in the available log data, scores 0.62 on SocialGuardian and 0.91 on the semantic model. TraceFlow identifies this account as the primary origin candidate, with the other three as likely early amplifiers from the same coordinated operation. The origin candidate account was created 11 days before the cascade began, uses the same third-party client as the other high-SocialGuardian-scoring accounts, and posted the original false recall claim at 7:43 AM, four minutes before any of the other candidate accounts first engaged with the content.
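The backward walk described in this subsection can be sketched as a timestamp-constrained traversal. This is a simplified illustration, not the paper's implementation. It assumes each edge `(source, target, timestamp)` means that `source` transmitted the content to `target` at that time, it starts the time bound at the start node's first outgoing transmission, and it ranks origin candidates by their earliest outgoing timestamp. All three choices are assumptions made for the example.

```python
from collections import defaultdict


def reverse_path_traversal(edges, start):
    """Walk backward from `start` along incoming edges with earlier timestamps.

    Returns origin candidates (nodes with no admissible earlier incoming edge),
    ordered by the time of their first outgoing transmission.
    """
    incoming = defaultdict(list)   # node -> [(upstream node, timestamp)]
    first_out = {}                 # node -> earliest outgoing timestamp
    for src, dst, ts in edges:
        incoming[dst].append((src, ts))
        first_out[src] = min(ts, first_out.get(src, ts))

    candidates = set()
    explored_bound = {}            # node -> largest time bound already explored
    stack = [(start, first_out.get(start, float("inf")))]
    while stack:
        node, bound = stack.pop()
        if explored_bound.get(node, float("-inf")) >= bound:
            continue               # an equal or wider search already covered this
        explored_bound[node] = bound
        earlier = [(s, t) for s, t in incoming[node] if t < bound]
        if earlier:
            candidates.discard(node)
            stack.extend(earlier)  # each upstream node is bounded by its edge time
        else:
            candidates.add(node)
    return sorted(candidates, key=lambda n: first_out.get(n, float("inf")))


# Toy diffusion graph, invented for illustration only.
toy_edges = [
    ("o", "a", 10), ("o", "b", 12), ("p", "b", 11),
    ("a", "hub", 20), ("b", "hub", 25),
    ("hub", "x", 30), ("x", "hub", 40),   # the late echo back to hub is ignored
]
origins = reverse_path_traversal(toy_edges, "hub")
```

Because every step requires a strictly earlier timestamp, the time bound shrinks along any path, so the walk terminates even when the interaction graph contains cycles.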
E. ICM Simulation and Risk Assessment
With the origin node identified, TraceFlow runs 1,000 ICM simulation trials from that origin point. The mean cascade size at simulation termination is 4,280 additional accounts (beyond the 2,847 already observed), and the 90th-percentile value is 8,940. The high ratio of the 90th-percentile to the mean indicates that the graph topology contains several high-degree hub accounts that, if activated, would produce very large secondary cascades.
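A Monte Carlo run of the Independent Cascade Model of the kind described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: a single uniform activation probability `p` for every edge, a nearest-rank 90th percentile, and cascade size counted as accounts activated beyond the seed set. The framework's actual edge probabilities and its handling of already-observed accounts are not specified here.

```python
import math
import random


def icm_trial(graph, seeds, p, rng):
    """One Independent Cascade run; returns the number of newly activated nodes."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in graph.get(u, ()):
                # Each newly active node gets exactly one chance per neighbor.
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active) - len(seeds)


def icm_summary(graph, seeds, p, trials=1000, seed=0):
    """Mean and nearest-rank 90th-percentile cascade size over repeated trials."""
    rng = random.Random(seed)      # fixed seed so runs are reproducible
    sizes = sorted(icm_trial(graph, seeds, p, rng) for _ in range(trials))
    mean = sum(sizes) / trials
    p90 = sizes[math.ceil(0.9 * trials) - 1]
    return mean, p90


# Toy adjacency list, invented for illustration only.
toy_graph = {"o": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
mean_size, p90_size = icm_summary(toy_graph, ["o"], p=0.3)
```

A 90th percentile well above the mean, as in the scenario, is the pattern one would expect when a few trials happen to activate a hub and most do not.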
The hybrid risk score for this investigation is:
$$\text{Hybrid risk score} = (0.55 \times 0.91) + (0.45 \times 0.62) = 0.5005 + 0.279 = 0.7795$$
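The arithmetic of the hybrid score can be checked with a short sketch. It assumes the score is a weighted sum with a weight of 0.55 on the semantic score and 0.45 on the behavioral score, applied to the origin candidate's values of 0.91 and 0.62. The function and parameter names are hypothetical, and only the High band boundaries (0.70 to 0.89) quoted for this scenario are encoded.

```python
def hybrid_risk(semantic_score, behavioral_score,
                semantic_weight=0.55, behavioral_weight=0.45):
    """Weighted combination of the semantic and behavioral scores."""
    return semantic_weight * semantic_score + behavioral_weight * behavioral_score


def in_high_band(score, low=0.70, high=0.89):
    """True when the score falls inside the High band used in this scenario."""
    return low <= score <= high


score = hybrid_risk(semantic_score=0.91, behavioral_score=0.62)
print(f"hybrid risk = {score:.4f}, high band = {in_high_band(score)}")
```

Setting either weight to zero reproduces the single-channel configurations used in the ablation study.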
This places the investigation in the High band (0.70–0.89). The analyst reviewing the dashboard sees the High band classification, the candidate origin account with its evidence summary, the 43 elevated-risk amplifier accounts, and the ICM projection showing expected spread of approximately 4,300 additional accounts with 90th-percentile risk of approximately 8,900.
The recommendation output suggests escalating investigation, assigning an analyst to review the origin candidate account's full activity history, and cross-referencing the 11 accounts scoring above 0.90 on SocialGuardian with platform trust and safety teams for coordinated inauthentic behavior review.
This end-to-end workflow, from raw logs to actionable forensic output, completes in approximately 23 minutes on the evaluation hardware for a 4,112-edge graph. The majority of that time (approximately 18 minutes) is consumed by the PySpark data engineering and graph generation steps; the semantic model inference and ICM simulation together take under 5 minutes.
XIII. CONCLUSION
The central argument of this paper is that misinformation forensics requires integrating three types of evidence (what the content says, how the spreading accounts behave, and how the content moved through the network), because each type alone is insufficient to answer the questions that investigators actually need answered.
TraceFlow demonstrates that this integration is tractable on modest hardware. The Bi-LSTM semantic model, the SocialGuardian behavioral profiler, and the timestamp-constrained Reverse Path Traversal algorithm each address a different aspect of the forensic problem, and their combination produces source attributions and cascade forecasts that substantially outperform any single channel. The experimental results (94.2% classification accuracy, 91.4% source localization at 10K edges, and stable operation within 8 GB of RAM) provide a concrete baseline for future work to improve upon.
Several real limitations remain. Batch-only processing, model aging, and graph scale ceilings each represent gaps between the current implementation and what a production forensic deployment would require. These are engineering challenges rather than fundamental barriers, and the architectural choices in the current system were made specifically to minimize the effort required to close each gap.
The broader point, which the experimental results support, is that the combination of semantic, behavioral, and temporal evidence is not merely additive. When all three signals point in the same direction (deceptive content, automated behavior, and topology consistent with coordinated amplification), the forensic case is qualitatively stronger than any single signal at full confidence. Conversely, when signals diverge (high semantic risk but a normal behavioral profile, or unusual graph topology but benign text), the divergence is itself informative and appropriately reduces the risk score. That signal structure is what makes an integrated evidence framework more than the sum of its parts.
REFERENCES
[1] S. Vosoughi, D. Roy, and S. Aral, "The spread of true and false news online," Science, vol. 359, no. 6380, pp. 1146–1151, 2018, doi: 10.1126/science.aap9559.
[2] D. M. J. Lazer et al., "The science of fake news," Science, vol. 359, no. 6380, pp. 1094–1096, 2018, doi: 10.1126/science.aao2998.
[3] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2003, pp. 137–146, doi: 10.1145/956750.956769.
[4] J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, and M. Hurst, "Patterns of cascading behavior in large blog graphs," in Proc. SIAM Int. Conf. Data Mining, 2007, pp. 551–556, doi: 10.1137/1.9781611972771.60.
[5] Z. Chen, J. Wei, S. Liang, T. Cai, and X. Liao, "Information cascades prediction with graph attention," Frontiers in Physics, vol. 9, 2021, doi: 10.3389/fphy.2021.739202.
[6] Z. Wang, X. Wang, F. Xiong, and H. Chen, "A survey of deep learning-based information cascade prediction," Symmetry, vol. 16, no. 11, p. 1436, 2024, doi: 10.3390/sym16111436.
[7] E. Ferrara, O. Varol, C. Davis, F. Menczer, and A. Flammini, "The rise of social bots," Communications of the ACM, vol. 59, no. 7, pp. 96–104, 2016, doi: 10.1145/2818717.
[8] C. Shao, G. L. Ciampaglia, O. Varol, K.-C. Yang, A. Flammini, and F. Menczer, "The spread of low-credibility content by social bots," Nature Communications, vol. 9, p. 4787, 2018, doi: 10.1038/s41467-018-06930-7.
[9] ACM Digital Library, "Bot account spam detection on X: Multi-parametric tracking indices and verification systems for network amplification mitigation," 2024, doi: 10.1145/3634737.3637637.
[10] ACM Conference on Social Media Spam Vectors, "Behavioral metadata analysis and Random Forest classification protocols for automated account profiling," 2024, doi: 10.1145/3634737.3644998.
[11] K. Shu, A. Sliva, S. Wang, J. Tang, and H. Liu, "Fake news detection on social media: A data mining perspective," ACM SIGKDD Explorations Newsletter, vol. 19, no. 1, pp. 22–36, 2017, doi: 10.1145/3137597.3137600.
[12] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, and H. Liu, "FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media," Big Data, vol. 8, no. 3, pp. 171–188, 2020, doi: 10.1089/big.2020.0062.
[13] S. Gong, R. O. Sinnott, J. Qi, and C. Paris, "Fake news detection through graph-based neural networks: A survey," arXiv preprint arXiv:2307.12639, 2023, doi: 10.48550/arXiv.2307.12639.
[14] IEEE Exploration Core Repository, "Graph neural topologies applied to source tracing constraints during high-velocity network amplification events." [Online]. Available: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10229120
[15] B. Lakzaei, M. H. Chehreghani, and A. Bagheri, "Disinformation detection using graph neural networks: A survey," Artificial Intelligence Review, vol. 57, no. 52, 2024, doi: 10.1007/s10462-024-10702-9.
[16] Journal of Information Security and Information Sciences, "Multimodal modeling approaches for automated online threat assessment frameworks." [Online]. Available: https://jisis.org/wp-content/uploads/2023/08/2023.I3.006.pdf
[17] ACM Digital Library, "Semantic contextual analysis and long-range dependencies for deep text auditing models," 2023, doi: 10.1145/3628454.3628459.
[18] D. Mouratidis, A. Kanavos, and K. Kermanidis, "From misinformation to insight: Machine learning strategies for fake news detection," Information, vol. 16, no. 3, p. 189, 2025, doi: 10.3390/info16030189.
